Improving Multilingual Frame Identification by Estimating Frame Transferability

A recent research direction in computational linguistics involves efforts to make the field, which used to focus primarily on English, more multilingual and inclusive. However, resource creation often remains a bottleneck for many languages, in particular at the semantic level. In this article, we consider the case of frame-semantic annotation. We investigate how to perform frame selection for annotation in a target language by taking advantage of existing annotations in different, supplementary languages, with the goal of reducing the required annotation effort in the target language. We measure success by training and testing frame identification models for the target language. We base our selection methods on measuring frame transferability in the supplementary language, where we estimate which frames will transfer poorly, and therefore should receive more annotation, in the target language. We apply our approach to English, German, and French – three languages which have annotations that are similar in size as well as frames with overlapping lexicographic definitions. We find that transferability is indeed a useful indicator and supports a setup where a limited amount of target language data is sufficient to train frame identification systems.


Introduction
Semantic frames are structured representations of everyday scenarios or scenes that can be evoked by several predicates (Fillmore, 1982); for example, predicates such as beat, trounce, demolish, or prevail all evoke a frame about a victor winning over a competitor (BEAT_OPPONENT). Linguistically, frames are scenes that might be realized in different ways. Because of this, semantic frames can be used to account for paraphrase relations among sentences that refer to a shared scenario (He prevailed over the reigning champ ≈ He beat the reigning champ, Ellsworth and Janin, 2007) or to draw inferences (Ben Aharon et al., 2010). The Berkeley FrameNet resource for English (Fillmore and Baker, 2001) provides a dictionary of frames where the main components of a frame, including its predicates and semantic roles, are defined. Along with its dictionary, FrameNet provides annotations of frames in text which demonstrate how the frame is used in language.
Frame semantics is also an appealing framework for cross-lingual research, as many frames are thought to be applicable across languages (Boas, 2005). This premise has fueled linguistic research into the applicability of frame semantics to other languages, which have been as varied as German, Spanish, Latvian, Chinese, and Japanese (Gilardi and Baker, 2018). Unfortunately, a recurring bottleneck in these efforts is the need to create frame-semantic annotation. Experiences from existing FrameNet projects show that the timeline for the development of such resources is most likely on the order of years rather than months. This is particularly true for applications of frame semantics in NLP, which involve training frame-semantic parsers (e.g., Das et al., 2014, Roth and Lapata, 2015) that require substantial amounts of annotation for each frame.
In this article, we focus on a subproblem of frame-semantic parsing, namely models of frame identification. Frame identification is a disambiguation task where each occurrence of a predicate in context has to be assigned its correct frame given several possible frame candidates. For example, the predicate cover can refer to a physical covering (FILLING: The lid covers the pot) or to the topic of a communication act (TOPIC: The textbook covered modality in detail). The goal of a frame identification system is to take a new instance of a predicate (cover) in context (The article covered the coronavirus vaccine) and automatically identify the frame it evokes (TOPIC). Though frame-semantic parsing efforts have focused largely on the identification of semantic roles, frame identification is still an important task; it has been shown that a majority of errors in a complete frame-semantic parsing system can be traced back to errors in frame identification (Hartmann et al., 2017).
In order to avoid the need for large-scale annotation, we ask whether existing annotation from languages that are already well-covered (supplementary languages) can be reused to train frame identification models in new languages (target languages). Recent multilingual embeddings now provide a relatively simple technical means to seamlessly integrate training data from multiple languages (see Section 2.3 for details). However, it is much less clear whether the linguistic properties of the annotated datasets support this procedure. Often, FrameNet frames are found to be broadly applicable to other languages (Gilardi and Baker, 2018, Torrent et al., 2018); at the same time, some amount of 'tuning' may be required regarding their definition. To our knowledge, there are no studies that attempt to quantify these effects in models of cross-lingual frame identification. In linguistics, however, a recent preliminary study quantifies frame transferability from English to Brazilian Portuguese on a set of parallel, frame-annotated sentences (Torrent et al., 2018).
We operationalize the idea of quantifying transferability by training frame identification models on monolingual data (target language) and multilingual (target + supplementary languages), adopting a fixed annotation budget for the target language, where only a modest number of datapoints can be labeled. We fill the annotation budget by performing informed frame selection based on frame transferability. Our notion of frame transferability builds on work that estimates the difficulty of Word Sense Disambiguation (WSD) by measuring the coherence of the word senses in the data (McCarthy et al., 2016). Similarly, we assume that frames that are coherent in the supplementary languages will be better candidates for transfer to an unseen language, requiring less target language annotation than incoherent frames (see Section 2.4 for details).
Clearly, another prominent indicator of frame transferability would be a direct measurement of cross-lingual frame applicability (Boas, 2020, Sikos and Padó, 2018). Unfortunately, such methods already assume the existence of annotation in the target language. Therefore, we choose to exclude explicit measures of cross-lingual applicability from our models, since we crucially want our methodology to generalize to target languages for which we assume that no annotation is yet available. We later discuss cross-lingual frame comparability in our post-hoc analysis. We select target annotations at the frame level (instead of selecting by predicates) for a few reasons. First, the frame level matches our goal of creating data to train a frame identification system.

JULY 2022
Second, in terms of data analysis, we are interested primarily in generalizable properties of frames rather than more fine-grained units.
In our empirical evaluation, we study frame identification over three target languages: English, German, and French, where the languages have frame definitions that are similar (taken directly from English) as well as different (adapted for the language of interest). Our selection method is based on latent properties of frame annotations, which reflect how the frame is used in context over each language. Therefore, we can evaluate which frames our selection models are more likely to choose for target language training: frames with similar or different definitions across languages.
Plan of the Paper. Section 2 sketches relevant related work. Section 3 contains the core method contribution of our study: a method for informed frame selection based on performance prediction using features for cross-lingual frame transferability. Section 4 describes the experimental setup, and Section 5 reports our results. Section 6 closes with a discussion.

Frame-Semantic Analysis Across Languages
As sketched in the introduction, a prominent research question from a cross-lingual perspective is to what extent semantic frames can be considered to be 'universal' (Boas, 2020). Many FrameNet frames are found to be broadly applicable to other languages (Gilardi and Baker, 2018), and most projects considering other languages use some frames that are essentially unchanged from the English definition, alongside others that have been modified to suit the language of interest. Reasons that call for frame modifications include typological shifts or subtle differences in the frame's interpretation which cause divergences in the core semantic roles and frame-evoking predicates (Ohara, 2014, Boas, 2005). Figure 1 shows the JUSTIFYING frame in English (Baker, 2008), German (SALSA) (Burchardt et al., 2006), and French (Candito et al., 2014), where the definition has been modified for each language.
[Figure 1: The JUSTIFYING frame, whose definition differs across all languages. The differences can be seen in terms of the frame-evoking predicates (above) and the core semantic roles (below).]
Differences in annotation strategies are another factor that affects the versatility and frequency of frame coverage in different frame-semantic resources. Annotations typically proceed by a frame-by-frame approach, where the goal is decent coverage of each frame in the lexicon; lemma-by-lemma, where all senses of the annotated lemmas are covered; or full-text annotation, where frames are identified over running text. The English Berkeley FrameNet adopted both frame-by-frame and full-text annotations, the French FrameNet used a frame-by-frame approach, and the German SALSA corpus took a lemma-by-lemma annotation approach.

Frame Semantics and Natural Language Processing
Frame semantics has been shown to benefit a number of downstream NLP tasks, including information extraction and question answering (Shen and Lapata, 2007, Burchardt et al., 2009b, Christensen et al., 2010, Taniguchi et al., 2018, Si and Roberts, 2018). Most recently, frames have been proposed as one of the frameworks that could be a basis for studying meaning construal, where the same conceptual background can be expressed with different emphasis or perspective (Trott et al., 2020).
To be useful at scale, though, all of these applications require accurate automatic models of frame-semantic parsing, or at least frame identification. For the most part, state-of-the-art models are based on word embeddings, high-dimensional representations of word meaning that are created from large collections of unstructured text. While previously such representations were directly based on counts, the current generation of word embeddings is based on neural network architectures such as Word2Vec (Mikolov et al., 2013), FastText (Bojanowski et al., 2017), or BERT (Devlin et al., 2019). Word embeddings can serve as input for supervised classification or regression models for specific tasks, whose training of course requires task-specific annotation ("fine tuning"). For frame identification, relatively straightforward embedding-based classification was quickly able to match and outperform traditional feature-based models (Hermann et al., 2014). Much of the recent work in frame identification focuses predominantly on English, although resources have been developed in a handful of other languages; the largest and most well-covered include German (Burchardt et al., 2006), French (Candito et al., 2014), Dutch (Vossen et al., 2018), and Swedish (Borin et al., 2010). Following the release of these resources, frame-semantic parsers were developed for most of these target languages, where classifiers predict frames with lexical and syntactic features (Johansson et al., 2012, Michalon et al., 2016).

Modeling Multilingual Frame Identification
The latest generation of embedding architectures are the so-called transformers, which can learn contextual dependencies in an unsupervised fashion and construct context-dependent meaning representations: tree will receive one embedding in the phrase the tree in the forest and another one in the phrase dependency tree. Not surprisingly, one of the best-known transformer models, BERT (Devlin et al., 2019), is the basis of state-of-the-art frame identification models for English (Sikos and Padó, 2019, Tan and Na, 2019).
The simplest way to set up the BERT model for frame identification is to predict one frame (including a 'None' option) for each token in a sentence. In this setup, each training datapoint is a single annotated instance of a predicate and its context words, where the label that is predicted for the predicate is the correct frame. Such datapoints can be created straightforwardly from existing frame-semantic annotations.
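The token-level setup described above can be illustrated with a minimal sketch of how one training datapoint might be built from a single annotated predicate instance. The function name `make_datapoint` and the label string `"None"` are hypothetical choices for illustration; subword tokenization and the BERT classifier itself are deliberately left out.

```python
from typing import List, Tuple

NONE_LABEL = "None"  # assumed label for tokens that are not the annotated predicate


def make_datapoint(tokens: List[str], predicate_index: int,
                   frame: str) -> Tuple[List[str], List[str]]:
    """Turn one annotated predicate instance into (tokens, per-token labels)."""
    if not 0 <= predicate_index < len(tokens):
        raise ValueError("predicate_index is outside the sentence")
    # Every token gets the 'None' label except the frame-evoking predicate.
    labels = [NONE_LABEL] * len(tokens)
    labels[predicate_index] = frame
    return tokens, labels


# Example sentence taken from the introduction: 'covered' evokes TOPIC.
tokens, labels = make_datapoint(
    ["The", "textbook", "covered", "modality", "in", "detail"], 2, "TOPIC")
```

In this sketch each annotated predicate yields its own datapoint, so a sentence with two annotated predicates would contribute two datapoints with different label sequences.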
An important recent development in word embeddings is multilingual embeddings (Upadhyay et al., 2016, Lample et al., 2018, Artetxe et al., 2020). Certain approaches to constructing multilingual embeddings involve adversarial training for refining embeddings cross-lingually (Lample et al., 2018), or bilingual dictionaries for transforming embeddings from a source to a target language (Artetxe et al., 2017). While BERT embeddings were initially trained on corpora in individual languages, researchers realized quickly that embeddings could be trained on multiple corpora simultaneously, or existing embedding spaces aligned with one another. In either case, the result is a space in which words from multiple languages are represented 'on par'. This enables the exploration of different scenarios, including experiments where a model is trained with annotations from one language and applied 'as-is' to another, so-called zero-shot learning (Wu and Dredze, 2019, Pires et al., 2019). For frame identification, this means that not even comparable corpora, such as those used in previous approaches to cross-lingual frame identification (Johannsen et al., 2015, Kozhevnikov, 2016), are necessary.
Recently, multilingual embeddings have been used to compute the alignment of lexical unit embeddings across languages in the Multilingual FrameNet alignment package. These embeddings are based on large-scale, multilingual language models which we describe in our approach below, and the translation of a frame's lexical units across languages can be visualized by this method.

Frame Transferability
Since in frame identification, predicates can evoke multiple frames, this task bears a strong resemblance to the well-researched paradigm of WSD. This is why we use a study from WSD on the impact of semantic coherence on disambiguation difficulty (McCarthy et al., 2016) as our basis for estimating a frame's cross-lingual transferability in our multilingual frame identification models.
McCarthy et al. (2016) start from the observation that some words are much easier to disambiguate with regard to word sense than others. While factors like part-of-speech, frequency, or type of ambiguity (homonymy vs. polysemy) play a substantial role, a lot of variance remains unaccounted for. In response, they carried out a study in which they analyzed the difficulty of WSD for various lemmas in terms of the semantic coherence of the senses of these lemmas. They measured two aspects of coherence, representing senses as sets of embeddings for individual senses: (1) lemmas with senses whose instances form tight clusters should pose simpler WSD problems than lemmas whose senses are 'spread out' and (2) lemmas whose senses are well separated from one another are presumably simpler to disambiguate. McCarthy et al. found very good empirical support for these hypotheses.

Cross-Lingual Frame Selection
Recall from Section 2.3 that our goal is to build a frame identification system for a target language T, while we assume that we have access to frame annotations for a set of supplementary language(s) S. The simplest way to do this would be to build a model using only the available frame-labeled data from S. However, given the imperfect comparability of frames across languages (see Figure 1), such a classifier will presumably not do well. Thus, our research question is: Given a fixed annotation budget for T, how can we select frames for annotation to maximally improve a system that has only learned about frames from S?
We pose our frame selection process as a performance prediction task (Bojar et al., 2017, Elloumi et al., 2018) where we are estimating a frame's cross-lingual transferability. We do this by estimating how much the annotations of a frame from T will improve frame identification given the availability of frame data from S. As such, frame selection is based on properties of the frames in S (which is the only data we assume we have), which we use to estimate how useful a T frame annotation will be towards improving the existing, multilingual frame identification system.
An overview of our approach is shown in Figure 2. It consists of three steps: building a baseline (we use the multilingual frame identification system from Section 2.3), learning the frame selection model where we estimate the transferability of frames and select frames from its estimations (Section 3.2), and using the selected T frames plus S frames to build a final frame identification model for T.
To learn the frame selection model, we need to use data from one language pair ⟨S, T⟩ for which we assume annotations already exist. We can then build multilingual frame identification systems trained (a) only on S, and (b) on S plus all available training data from T. We compare frame performance of these (a) and (b) systems to obtain ΔF, the change in performance by adding T frame annotations. A high ΔF indicates that the frame identification system benefits from the T annotations for that frame, whereas a low score indicates that the S annotations are already sufficient.
Specifically, a high ΔF score suggests that the frame has a lower cross-lingual transferability, as more language-specific annotations are required to improve performance, and S annotations were not suitable for learning the frame.
In the general case, however, our goal is to define a frame selection process that generalizes to various target languages, including those for which no annotation is available at all. As we argued in the introduction, this means that we only use properties in the frame selection process that are based on data in the supplementary language S.
Finally, we can apply the frame selection model to rank the T frames by their estimated ΔF score and select the T frames with the highest scores for annotation. In our experiments in this article, we do not perform actual annotation; instead, we simulate annotation by simply sampling the respective frame annotations from the existing dataset. We then re-train a multilingual frame identifier on the S annotations, plus the annotation instances of the selected frames from T.

Estimation of Frame Transferability
Using the frame identification architecture described in Section 2.3, we train two models: one trained on all of the S data, henceforth M_S, and one trained on all of S plus the training set of T, henceforth M_{S+T}. We define ΔF for each frame f as the difference between the frame's F1 scores under the two models:
\[ \Delta F(f) = F_1^{M_{S+T}}(f) - F_1^{M_S}(f) \]
We compute ΔF of each frame over the development set in T. A high ΔF indicates that a frame profits substantially from annotation in T and therefore has lower cross-lingual transferability.
Our frame selection is a linear regression model, which is a well-established architecture for data analysis in NLP (Baayen, 2008). Estimating frame transferability with linear regression also has the benefit of introspection into how frame properties are related to their performance.
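As an illustration of how ΔF could be computed from the predictions of the two models, here is a minimal sketch. It assumes that per-frame F1 is computed one-vs-rest over the gold frames of the development set; the function names and the toy predictions are hypothetical and not taken from the article.

```python
from typing import Dict, List


def per_frame_f1(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """F1 for each gold frame, treating the frame as a one-vs-rest class."""
    scores = {}
    for frame in set(gold):
        tp = sum(1 for g, p in zip(gold, pred) if g == frame and p == frame)
        fp = sum(1 for g, p in zip(gold, pred) if g != frame and p == frame)
        fn = sum(1 for g, p in zip(gold, pred) if g == frame and p != frame)
        # F1 = 2TP / (2TP + FP + FN); the denominator is > 0 for gold frames.
        scores[frame] = 2 * tp / (2 * tp + fp + fn)
    return scores


def delta_f(gold: List[str], pred_s: List[str],
            pred_s_plus_t: List[str]) -> Dict[str, float]:
    """Per-frame F1 of the S+T model minus per-frame F1 of the S-only model."""
    f1_s = per_frame_f1(gold, pred_s)
    f1_st = per_frame_f1(gold, pred_s_plus_t)
    return {frame: f1_st[frame] - f1_s[frame] for frame in f1_s}


# Toy development set: the S-only model confuses one TOPIC instance.
gold = ["TOPIC", "TOPIC", "FILLING", "FILLING"]
pred_s = ["TOPIC", "FILLING", "FILLING", "FILLING"]
pred_s_plus_t = ["TOPIC", "TOPIC", "FILLING", "FILLING"]
deltas = delta_f(gold, pred_s, pred_s_plus_t)
```

In this toy case TOPIC gains more than FILLING from the added target data, so under the reading above it would count as the less transferable of the two frames.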
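To make the regression step concrete, the following is a small self-contained sketch of ordinary least squares fitted via the normal equations. It is not the authors' implementation: the two toy predictors, their values, and the function names are assumptions for illustration only, and a statistics package would normally be used instead.

```python
from typing import List, Sequence


def fit_linear_regression(X: Sequence[Sequence[float]],
                          y: Sequence[float]) -> List[float]:
    """Ordinary least squares; returns [intercept, w_1, ..., w_k]."""
    rows = [[1.0] + list(x) for x in X]  # prepend a constant intercept column
    k = len(rows[0])
    # Augmented normal equations [A^T A | A^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Predicted target value for one feature vector."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


# Toy frames described by two hypothetical predictors; the targets are
# generated from a known linear rule so the fit can be checked exactly.
X = [[0.2, 0.9], [0.8, 0.3], [0.5, 0.5], [0.9, 0.1]]
y = [0.1 + 0.5 * v - 0.2 * d for v, d in X]
weights = fit_linear_regression(X, y)
```

The fitted coefficients are what makes the model introspectable: their sign and size indicate how each frame property relates to the predicted gain.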

Frame Transferability via Semantic Coherence
As introduced in Section 2.4, the properties that we consider are measures of semantic coherence following McCarthy et al. (2016). We replace the notion of 'sense' by the notion of 'frame', but use an analogous setup where each instance is represented by one (contextualized) embedding.
Recall from Section 2.4 that McCarthy et al.'s first indicator was how tightly the instances of a word sense cluster together. Applied to frames, we have our first hypothesis concerning the variance of a supplementary language frame. Hypothesis #1: the larger the variance of a frame (i.e., the more dissimilar its instances to one another in the supplementary language), the more it profits from target language annotation. We make this idea concrete as follows. Let centroid(F) be the average of all annotated instances f of a frame F:
\[ \mathrm{centroid}(F) = \frac{1}{|F|} \sum_{f \in F} f \]
Then, we define Var(F) as the variance of the frame, computed from the difference between each individual frame instance f and its frame centroid.
The second indicator McCarthy et al. (2016) consider is the average of all between-cluster (i.e., between-sense) distances. We believe that for frame identification, where typically a small number of senses are realistic candidates, it is sufficient to consider the separability between the current frame and its nearest neighbor. Therefore, we next hypothesize that distance affects frame performance. Hypothesis #2: the smaller the distance between a frame and its nearest neighbor, the more it profits from target language annotation. Formally, we define Dist(F) as the distance between the centroid of a frame F and the centroid of its nearest neighbor F′, calculated by cosine similarity.
As a third indicator, we compute the coherence of a frame as the ratio of the Dist and Var scores:
\[ \mathrm{Co}(F) = \frac{\mathrm{Dist}(F)}{\mathrm{Var}(F)} \]
We include Co to account for interactions between Dist and Var and again assume a larger benefit of target language annotation for lower values of Co.
Concretely, if we assume in Hypothesis 1 that the variance of supplementary language frames should be high, and in Hypothesis 2 that the distance between frames and their nearest neighbors should be low, we would predict that frames with lower Co values are better candidates for frame selection. Conversely, a higher value of Co would indicate that the frame already has good clusterability, with low variance across the frame's instances and a high distance from other frames, and therefore would likely be learnable from the supplementary annotations without additional target language data.
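The three indicators can be sketched in a few lines of code. The sketch rests on two assumptions that the text does not pin down: Var is taken as the mean squared Euclidean distance of instances to the centroid, and Dist is taken as cosine distance (one minus cosine similarity) between centroids. The two-dimensional vectors stand in for contextualized embeddings, and all function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def centroid(instances: List[Vector]) -> Vector:
    """Component-wise average of all instance embeddings of a frame."""
    n = len(instances)
    return [sum(dim) / n for dim in zip(*instances)]


def variance(instances: List[Vector]) -> float:
    """ASSUMPTION: mean squared Euclidean distance to the frame centroid."""
    c = centroid(instances)
    return sum(sum((a - b) ** 2 for a, b in zip(v, c))
               for v in instances) / len(instances)


def cosine_distance(u: Vector, v: Vector) -> float:
    """ASSUMPTION: distance is 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def nearest_neighbor_distance(name: str, frames: Dict[str, List[Vector]]) -> float:
    """Distance from a frame's centroid to the closest other frame centroid."""
    c = centroid(frames[name])
    return min(cosine_distance(c, centroid(vectors))
               for other, vectors in frames.items() if other != name)


def coherence(name: str, frames: Dict[str, List[Vector]]) -> float:
    """Ratio Dist / Var; undefined (division by zero) if the variance is 0."""
    return nearest_neighbor_distance(name, frames) / variance(frames[name])


# Toy embeddings: TEXT lies close to TOPIC, FILLING lies far away.
frames = {
    "TOPIC": [[1.0, 0.0], [0.8, 0.2]],
    "FILLING": [[0.0, 1.0], [0.2, 0.8]],
    "TEXT": [[1.0, 0.2], [0.9, 0.1]],
}
co_topic = coherence("TOPIC", frames)
```

Under these assumptions, a frame with widely scattered instances or a very close neighbor receives a low coherence score, which matches the intuition that it would be a stronger candidate for target language annotation.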

Experimental Rationale
As we described in Section 3.1, we start with only frame annotations from S and subsequently add a moderate budget of annotations from T (we consider budget sizes of 5k and 10k instances). We simulate target language "annotation" by taking randomly sampled annotated instances of each selected frame from T. In certain cases, there can be a high number of annotations for a single frame; in fact, some resources have frames with a very high (> 1000) number of annotations. If we take all the training instances from these frames, we reduce the diversity of frames that are seen by the classifier and the added frame data would be dominated by these few, highly annotated frames. To prevent this problem, we restrict the number of instances of each frame to 200 random instances, motivated by a desire to cover a substantial number of frames. The number 200 was selected to balance the goals of adding a substantial number of frames and a substantial number of instances per frame.
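The budget-filling procedure can be sketched as follows. The cap of 200 instances per frame and the budget sizes come from the text; the function name, the fixed random seed, and the choice to truncate the last frame when the remaining budget is smaller than the cap are assumptions made for illustration.

```python
import random
from typing import Dict, List


def fill_budget(ranked_frames: List[str], pool: Dict[str, List[str]],
                budget: int, cap: int = 200, seed: int = 0) -> Dict[str, List[str]]:
    """Sample up to `cap` instances per frame, in rank order, until the budget is used."""
    rng = random.Random(seed)  # fixed seed only to keep the sketch reproducible
    selected: Dict[str, List[str]] = {}
    remaining = budget
    for frame in ranked_frames:
        if remaining <= 0:
            break
        instances = pool.get(frame, [])
        # ASSUMPTION: the last frame is truncated to fit the remaining budget.
        k = min(cap, len(instances), remaining)
        if k > 0:
            selected[frame] = rng.sample(instances, k)
            remaining -= k
    return selected


# Toy pool: frame A is heavily annotated, B is small, C is medium-sized.
pool = {
    "A": [f"a{i}" for i in range(1500)],
    "B": [f"b{i}" for i in range(50)],
    "C": [f"c{i}" for i in range(300)],
}
selection = fill_budget(["A", "B", "C"], pool, budget=400)
```

With the cap in place, the heavily annotated frame A contributes only 200 of its 1500 instances, leaving room in the budget for B and C.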
Since our experiment uses informed frame selection, the question remains how we train the frame selection model. As we noted in Section 3.1, frame identification training requires annotated data both for S (to provide the features) and T (to evaluate the predictions). We therefore train the frame selection model on our language pair ⟨S, T⟩ with the largest number of overlapping frames, namely ⟨German, English⟩. We use the development set of T (in this case, English) to learn the frame selection model so that there is no information leakage to either frame identification model training or frame identification model evaluation. The frame selection model is then applied as-is to all other language pairs ⟨S′, T′⟩ for frame selection, thus demonstrating its generalization capabilities to unseen languages. Models are then trained with this modified ⟨S′ + selected T′⟩ data and evaluated over unseen T′ test data.
Below, we present results for all combinations of supplementary and target languages. Due to our use of multilingual embeddings, we can also construct models based on multiple supplementary languages for a single target language. For these models, we combine the ranked lists of frames from the individual ⟨S′, T′⟩ pairs and take the top predicted frames from this combined set as our selected frames.
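One possible way to combine rankings from several supplementary languages is sketched below. The text does not say how a frame that appears in more than one list is scored; this sketch assumes that the highest predicted ΔF is kept, and the function name and toy scores are hypothetical.

```python
from typing import Dict, List


def combine_rankings(*predictions: Dict[str, float]) -> List[str]:
    """Merge per-pair predicted gains into one ranked list of frames."""
    best: Dict[str, float] = {}
    for scores in predictions:
        for frame, score in scores.items():
            # ASSUMPTION: keep the highest predicted gain for duplicate frames.
            if frame not in best or score > best[frame]:
                best[frame] = score
    # Highest predicted gain first; ties broken alphabetically for determinism.
    return sorted(best, key=lambda frame: (-best[frame], frame))


# Toy predicted gains from two supplementary languages for one target language.
from_german = {"TOPIC": 0.30, "FILLING": 0.05}
from_english = {"TOPIC": 0.10, "JUSTIFYING": 0.20}
ranking = combine_rankings(from_german, from_english)
```

Other merging rules, such as averaging the scores of duplicate frames, would be equally easy to substitute in the inner loop.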

Datasets
Despite growing efforts to create frame-semantic resources for different languages (Torrent et al., 2018), the number of languages with sufficient amounts of publicly available frame-semantic annotations suitable for NLP models is still limited. For this reason, our experiments cannot rival the massively multilingual setups that have been explored for word embeddings (Ammar et al., 2016) or parsing (Agić et al., 2016). Another practical limitation we encountered was the familiarity of the authors with the languages under consideration to qualitatively assess and analyze the output of the models. Therefore, we focus on frame-semantic annotations for three languages: the Berkeley FrameNet 1.5 annotations for English, the French FrameNet corpus (Candito et al., 2014), and the German SALSA corpus (Burchardt et al., 2006). Table 1 provides descriptive statistics for the three resources, including numbers for the frame overlap with English.

Berkeley FrameNet 1.5
The FrameNet 1.5 full-text annotations form the standard corpus for frame identification systems in English and cover a bit more than 1000 frames. In our training, we use a single frame-evoking element, its sentential context, and its frame as one instance for the classifier. We adopt the widely used test/train/dev splits defined by Das et al. (2014).

French FrameNet
The French FrameNet project (Djemaa et al., 2016) adapted their frame inventory from the English FrameNet 1.5. Frame annotations were added to the French Treebank and Sequoia treebank (Abeillé and Barrier, 2004, Candito and Seddah, 2012), which covered four domains (commercial transactions, cognitive stances, causality, and verbal communication). French FrameNet provides its own test/train/dev splits. The French data covers only about 100 annotated frames, compared to roughly 1000 frames for the two other languages, resulting from a different sampling strategy. Many French frames were adopted as-is from the Berkeley FrameNet, but about half of them were systematically restructured to yield a better fit with the corpus. This includes cases where multiple English frames have been combined into a new frame. For example, French has the CHATTING_DISCUSSION frame, which combines the CHATTING and DISCUSSION frames from the English lexicon; such cases count as 'unaligned' in the table. Since the number of annotated instances for French and German is on the same order of magnitude (within a factor of 2), but the number of French frames is substantially lower, the average number of annotated instances per frame is highest for French. We believe that this combination of properties (close to English but many changed frames) makes French an interesting target language in our experiments.

The German SALSA Corpus
The SALSA corpus provides frame-semantic annotations over the German TIGER news corpus (Brants et al., 2002). We use the train/test/dev splits defined by Botschen et al. (2018) for our experiments.
SALSA initially adopted frames from the English FrameNet 1.2 inventory. A comparatively small number of frames was modified; in contrast, a large number of frame approximations, called "proto-frames", was added (these count as 'unaligned'). These are lemma-specific frame structures developed to cover instances for which FrameNet did not provide an adequate frame (see Burchardt et al. 2009a for details).

Multilingual Embeddings
As embeddings, we use mBERT, a multilingual BERT model which represents words of over 100 languages in a shared semantic space. This model was trained on Wikipedia dumps available for the various languages (Karthikeyan et al., 2019).

Evaluation
Classifier accuracy is the percentage of correct predictions of the classifier when the full set of classes is used, and is a standard metric of evaluation for computational systems. For frame identification, we use the full set of frame classes, meaning there is no assumption about which specific frame candidates a single predicate might evoke.

Baselines
We report several different baselines for our experiment. The S only baseline only uses data from the supplementary S language(s) and tests on a target T language without any T training data. Frames that are used from the S language for training in T are the frames in S data that are shared with the target language. These include the "same" and "modified" frames in Table 1, where we do not include frames that are "unaligned" in French and German. For German, the "unaligned" cases include language-specific "proto-frames", and for French, these include frames whose definition is a blend between two frames where the frame is essentially language-specific and not readily alignable to an English or German frame. In other words, frames that can be readily mapped back to a T frame through the frame's naming and semantic roles are used for S only training.
The Random baseline adds 5k or 10k instances of randomly selected T frames to the S only data. Identical to the Embedding model, a maximum of 200 random instances per frame are chosen.

Results
The starting point of our experiment is the baseline which used only training data of the supplementary language (S only) and evaluated on the test data of the target language (T). Results for this setting are given in the all column in  Table 1). Conversely, EN to DE has the largest frame intersection, and DE is the best model for English as the target language, presumably due to this higher number of shared frames. In sum, leveraging annotations from different, supplementary languages alone (that is, assuming that no annotations for the target language are available) shows reasonable performance but arguably does not yield models that are practically usable. We therefore proceed with adding target language annotations (+5k and +10k) back to the multilingual training. We first consider random frame selection to disentangle the effect of added T data in general from the effect of a deliberate selection of frames (cells 1e-9f). Without comparing our selection to a random frame selection, it would remain an open question whether any selection of frames was necessary in the first place, or whether any target language data of a certain size would yield comparable improvements. Compared to the S only training, results in Random show that even with a random selection of 5k instances from T the performance achieves significant gains. However, all language pairs benefit from a more informed frame selection (cells 1g-9h). Regarding the effect of dataset size, we unsurprisingly find that adding more data (+10k) is always better than adding less data (+5k) within each selection strategy, although the improvement is smaller in cases where data from multiple languages is combined (DE+EN to FR and EN+FR to DE). However, in those cases, selecting +5k instances ranked by the embedding-based predictors actually yields a higher accuracy than +10k instances from random frames.
In terms of language pairs, we observe that results for EN to DE and DE to EN are consistently higher than for FR to DE and FR to EN, respectively. When using French as the target language, neither of the two supplementary languages performs consistently better than the other. In combination, however, we observe the highest improvements for French. In general, the best results for each language T combine both S languages. This suggests that, when only a few annotations are available, a new language would likely benefit the most from a simple concatenation of available frame annotations from various source languages. In fact, a modest +5k instances of data from multiple languages could achieve similar results as +10k instances depending on the source and target languages.

Benefit of Supplementary Language
Our approach uses the supplementary annotations in two capacities: 1) as part of the multilingual training data, and 2) for the selection of informative target frames. One unanswered question from the results presented in Table 2 is whether the supplementary data is actually benefiting the system at all; more specifically, we need to ask whether we would have achieved the same results with a selection of T frames alone. To answer this question, we train T only models, which train the classifiers only on the same 5k/10k instances used for the Random baseline, without using any supplementary data. The results of these tests are given in Table 3 below and are directly comparable to the Random baseline results in Table 2. For the S+T setting from Table 2, we show the performance of the combination of both supplementary languages for each target language (cells 3a,f/6a,f/9a,f). Table 3 shows that performance in the T only training is consistently, and in most cases substantially, lower than performance for the target languages when S data is added (S+T). This demonstrates again the benefits of multilingual training and confirms that it is worth using multilingual data for training frame identification models when it is available.

Analysis of the Frame Selection Model
Finally, we ask whether we can analyze the performance prediction model in order to better understand how the embedding properties of a frame are related to the improvement for this frame when adding target language annotation, ΔF. Unfortunately, it turns out that the three properties that we have defined (coherence, nearest neighbor distance, and within-frame variance) show a high degree of collinearity, which is not surprising, given that coherence is defined as a ratio of the other two properties. As a consequence, the coefficients of the performance prediction model lose their interpretability (e.g., McNamee 2005).
For this reason, we excluded the coherence of a frame (Co(F)) from this analysis and estimated a simpler model including only two normalized predictors, namely nearest neighbor distance (Dist) and within-frame variance (Var). The results are shown in Table 4. We initially hypothesized (cf. Section 3.2.1) that 1) the more dissimilar the instances of the frame are to one another, the more it will profit from target language annotation, and 2) the smaller the distance between a frame and its nearest neighbor, the more it will profit from target language annotation. The coefficients confirm only Hypothesis #1, where a high within-frame variance is very significant in predicting a higher ΔF. The other property (Dist) does not significantly contribute to the prediction of ΔF, indicating that the separation from the nearest neighbor frame is possibly an oversimplification as a measure of the difficulty to model a frame.

Table 4
Predictor                          Coeff   Std. Error   p-value
Nearest neighbor distance (Dist)   0.005   0.07         > 0.10
Within-frame variance (Var)        0.21    0.07         < 0.01
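
A two-predictor model of this kind can be sketched as an ordinary least squares fit of ΔF on the normalized Dist and Var values. The code below is an illustration only: the function names, the z-score normalization, and the synthetic data are assumptions, and the fitting procedure actually used for the reported coefficients may differ.

```python
import math
import random
from statistics import fmean, pstdev


def zscore(xs):
    """Normalize a predictor to zero mean and unit variance."""
    m, s = fmean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def fit_two_predictor_ols(dist, var, delta_f):
    """Least squares fit of delta_f ~ b0 + b1 * Dist + b2 * Var.

    Predictors are z-scored first, so the two slopes are comparable in size.
    Returns ((b1, b2), (se1, se2)) for the Dist and Var slopes.
    """
    x1, x2 = zscore(dist), zscore(var)
    n = len(delta_f)
    y_mean = fmean(delta_f)
    yc = [y - y_mean for y in delta_f]  # centering absorbs the intercept
    # Entries of X'X and X'y for the centered predictors.
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * y for a, y in zip(x1, yc))
    s2y = sum(b * y for b, y in zip(x2, yc))
    det = s11 * s22 - s12 * s12  # close to zero if predictors are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    # Residual variance with n - 3 degrees of freedom (intercept + 2 slopes).
    rss = sum((y - b1 * a - b2 * b) ** 2 for a, b, y in zip(x1, x2, yc))
    sigma2 = rss / (n - 3)
    se1 = math.sqrt(sigma2 * s22 / det)
    se2 = math.sqrt(sigma2 * s11 / det)
    return (b1, b2), (se1, se2)


# Synthetic data (invented): delta_f depends on Var but not on Dist.
rng = random.Random(0)
dist = [rng.random() for _ in range(200)]
var = [rng.random() for _ in range(200)]
delta_f = [0.2 * v + rng.gauss(0, 0.05) for v in var]

coefs, ses = fit_two_predictor_ols(dist, var, delta_f)
for name, b, se in zip(("Dist", "Var"), coefs, ses):
    print(f"{name}: coeff={b:.3f}, se={se:.3f}, t={b / se:.2f}")
```

The determinant `det` makes the collinearity problem concrete: when two predictors are nearly linear functions of each other it approaches zero, the standard errors grow, and the individual coefficients stop being interpretable. A p-value for each slope would be obtained by comparing its t statistic to a t distribution with n - 3 degrees of freedom.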

Analysis of Frame-Level Performance
We now proceed with an analysis of the frame transfer method and the comparability of frames at the lexicographic level, that is, how well frame definitions are aligned across languages. While the transfer method relied solely on available annotations in the supplementary language, our analysis below looks at the lexicon in both languages, where we compare the performance of frames with high cross-lingual comparability in terms of their lexicographic entries versus frames that are thought to have low lexicographic comparability.

Similarity in Frame Definitions
The German and French FrameNets distinguish between frames that have been modified from the original English FrameNet definition and those that are consistent with English. We take the frames that were selected for annotation in the target language and ask whether there is a difference in the performance gains across these two frame types ("same" and "modified" in their cross-lingual definition). In Table 1, there are 22 cases of German frames that are listed in both categories; for example, the COGITATION frame has two entries in SALSA, one with modified semantic roles and the other which has retained the English definition. We exclude these cases from our analyses. Figure 3 shows that, for all language pairs except one (EN-DE), the selection of modified frames led to higher improvement. The JUSTIFYING frame, where the definition diverges across all three languages (shown in Figure 1), is one of the frames consistently selected by our model for all language pairs. One possible reason for this is that the frames which are described as the same across the resources are already learned sufficiently by S, leading to lower gains in multilingual training; for instance, the CAUSATION frame was not modified across any language pair, and was never selected as a target for further, language-specific annotations. When we compare absolute F1 scores of the S only model, the results are mixed: only two of the language pairs support this hypothesis (FR-DE, EN-DE), while other language pairs (FR-EN, DE-EN) show similar F1 scores for both frame types. However, modified frames predominantly benefit from the target language annotations, suggesting that researchers building frame-semantic resources for different languages should focus more on these modified frames.

Figure 3: Frame type ("Frame Type") and model performance ("F1 Scores") for frames selected for annotation. Frames are either modified across languages and therefore diverge lexicographically ("modified") or they have the same definition across language pairs ("same"). Language pairs are in the form <T-S> (e.g., DE-EN is DE as T and EN as S) where results are tested over T test data. For the "Supplementary only" condition (dark bars), we report absolute F1 scores for performance, while "Improvement w/+10k" shows the average increase in F1 score (light bars) after the frame type was added.

If it is the case that researchers should target modified frames for annotation, the question might then arise: how would they know whether a frame should be modified? Evidence from previous studies suggests that typological differences between languages can be expected to affect the frame lexicon in a target language (Boas, 2005, 2020), but those typological differences can be predictable to a certain extent. Hasegawa et al. (2011) identify cases of frames in English that are primarily composed of transitive verbs and tend to translate poorly in Japanese because Japanese typically prefers to describe events as stative (Ikegami, 1991). These frames would be expected to require modification if one were to build a frame lexicon in Japanese. Beyond typological differences, analysis of parallel corpora has indicated substantial freedom for translators regarding the linguistic realization of the same event: Torrent et al. (2018) find shifts in the part of speech of certain frame-evoking lemmas to cause different frame assignments across translations; Padó and Erk (2005) investigate cases where the contribution of a single frame-evoking element is split among multiple frame-evoking elements in translation. Systematic mining of parallel and comparable corpora could make it feasible for researchers working on a target language to get an idea of specific frames that could require modification, and therefore would warrant annotation.
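
The comparison between "same" and "modified" frames discussed in this subsection amounts to averaging per-frame F1 gains by frame type, after dropping frames that are listed under both types. The sketch below illustrates that bookkeeping only; the function name, the record layout, the placeholder frame names, and all scores are invented for illustration and are not results from the study.

```python
from collections import defaultdict
from statistics import fmean


def mean_gain_by_type(records):
    """Average F1 improvement per frame type.

    records: list of (frame, frame_type, f1_s_only, f1_with_target) tuples.
    Frames that appear under more than one type are dropped, mirroring the
    exclusion of doubly listed frames in the analysis.
    """
    types_seen = defaultdict(set)
    for frame, ftype, _, _ in records:
        types_seen[frame].add(ftype)
    gains = defaultdict(list)
    for frame, ftype, before, after in records:
        if len(types_seen[frame]) == 1:
            gains[ftype].append(after - before)
    return {ftype: fmean(vals) for ftype, vals in gains.items()}


# Placeholder frames with invented scores; FRAME_C is listed under both types.
toy_records = [
    ("FRAME_A", "same", 0.60, 0.70),
    ("FRAME_B", "same", 0.50, 0.55),
    ("FRAME_C", "same", 0.40, 0.60),
    ("FRAME_C", "modified", 0.40, 0.80),
    ("FRAME_D", "modified", 0.30, 0.60),
    ("FRAME_E", "modified", 0.20, 0.40),
]
gain_by_type = mean_gain_by_type(toy_records)
for ftype, gain in sorted(gain_by_type.items()):
    print(f"{ftype}: mean F1 gain = {gain:.3f}")
```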

Frames with High/Low Performance in S Only Training
We take results from the S only model to see which frames performed best across different language pairs. In this condition, no T annotations were used in training, but frame performance is measured over T. As shown in Table 5, many of the frames with the highest F1 scores across the EN-DE pair are those whose predicates form a tight semantic cluster; for example, the KINSHIP frame whose predicates are all familial relationships (brother, sister, grandfather, etc.) or the PEOPLE frame which consists of terms relating to humans (man, woman, child). Frames that perform well with only supplementary data are those with low variance within a frame (tight clustering of its instances, in this case predicates), indicating that they are easier to learn when they form a tight cluster. This is the converse of the result from the performance prediction model, where frames with high variance are predicted to need more target language data to learn. Other explanations of these results include the fact that the lexical units in these frames are largely nominal, and their valency patterns are less likely to differ significantly across languages.
Performance for French frames is harder to interpret. Recall from Section 4.2.2 that the set of annotated frames in French was limited to four specific domains. Many of the high-performing French frames (COMMERCIAL_TRANSACTION, COMMERCE_BUY, COMMERCE_SELL, IMPORTING) are in the commerce domain, while frames from cognitive stances or communication (QUESTIONING, REGARD, COMMUNICATION_RESPONSE, JUDGMENT_DIRECT_ADDRESS, CONTACTING) tend to appear as low-performing cross-lingually. However, the change in domain covaries with other properties: The majority of lexical units (60%) from the commerce domain are nominal predicates, while predicates from the cognitive stance and communication domains are largely verbal (only 28% and 23% nominal, respectively) (Djemaa et al., 2016). This aligns with observations from EN-DE, where the part of speech of the lexical units across languages has a strong impact on cross-lingual performance. It is also possible that the predicates are clustered more tightly in the commerce domain than in the other three domains. Ultimately, however, the small number of French frames does not admit a strong interpretation of these findings.
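
The notion of a "tight cluster" used in this section can be made concrete with a small sketch of the two embedding properties. The definitions below are assumptions made for illustration (within-frame variance as the mean squared Euclidean distance to the frame centroid, nearest neighbor distance as the distance to the closest other frame centroid); the definitions in Section 3.2.1 may differ in detail, and the two-dimensional vectors are invented.

```python
import math
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(dim) for dim in zip(*vectors)]


def within_frame_variance(vectors):
    """Mean squared distance of a frame's instances to their centroid."""
    c = centroid(vectors)
    return fmean(math.dist(v, c) ** 2 for v in vectors)


def nearest_neighbor_distance(frame, embeddings):
    """Distance from a frame's centroid to the closest other frame centroid."""
    c = centroid(embeddings[frame])
    return min(math.dist(c, centroid(vs))
               for other, vs in embeddings.items() if other != frame)


# Invented 2-d vectors: two compact frames and one spread-out frame.
toy_embeddings = {
    "KINSHIP": [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]],
    "PEOPLE": [[1.0, 1.0], [1.2, 1.0], [1.0, 1.2], [1.2, 1.2]],
    "QUESTIONING": [[4.0, 0.0], [6.0, 0.0], [4.0, 2.0], [6.0, 2.0]],
}
for name, vectors in toy_embeddings.items():
    print(name,
          round(within_frame_variance(vectors), 3),
          round(nearest_neighbor_distance(name, toy_embeddings), 3))
```

In this toy setup the spread-out frame has a variance one hundred times larger than the compact ones, which is the kind of contrast the variance predictor is meant to capture.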

Conclusion
The question of the universality of frames has been posed since the beginning of the theory of frame semantics (Fillmore, 1982; Boas, 2005, 2020). In fact, comparable frames have been found across even typologically unrelated languages such as English and Japanese, presumably due to the fact that frames allow a certain degree of variation in how they can be expressed (Hasegawa et al., 2014). At the same time, frame identification and, more broadly, frame-semantic parsing require annotated data. Many languages do not have the resources to invest in a full-scale frame annotation project that would lead to a practically usable automatic frame identification system. As computational linguists, we can ask whether we can supplement some of the annotation needs for a target language by existing annotations in other languages. This study was, to our knowledge, the first one to investigate this question of learning frame identification models based on multilingual embeddings. We defined a method that selects frames for annotation in the target language based on estimates of a frame's transferability. To make this estimate, we use features of semantic coherence. Compared to a setting in which we do not use any target language annotation (which yields promising but still ultimately low performance), we found that informed frame selection can construct usable frame identification models within a manageable annotation budget. The most important factor in frame selection, according to our model, is frame-internal variance: Frames which have a more compact cluster in the supplementary language, meaning their predicates all form a relatively coherent group, require less target language annotation than frames that are spread out more. We find this is the case even when the number of instances per frame (200) that we randomly select is relatively modest, and the number of frames (25 frames maximum in 5k, 50 frames maximum in 10k) is also modest. This validates our approach that one can still see improvement in target language frame identification with only a modest, fixed number of frame instances.
In a post-hoc analysis, we established that, overwhelmingly, the frames that were selected yield better results when they have lexicographic definitions which diverge across languages. One plausible explanation for this result is that these lexicographic modifications were motivated by typological differences across the language pairs such as lexicalization or syntactic valence, which emerge as divergences in the semantic representations of the frames in the computational model. Therefore, these modified frames are more useful for selection, as they help refine a supplementary-based language model to learn the specific properties of frames for the target language.

It cannot be overlooked that the makeup of the frame annotations themselves could have played a large role in the utility of cross-lingual data for frame identification. While much prior work in computational linguistics has shown that datasets with sometimes significant divergences in certain semantic role labeling schema (a subtask of frame-semantic parsing) can still be combined for improved results (Akbik and Li, 2016; Feizabadi and Padó, 2015), we find that the combination of different frame annotations alone does not lead to the greatest possible gains. In fact, there are significant differences in the numbers of instances of each frame that have been annotated, as well as the variety of predicates that evoke those frames. For instance, the German SALSA resource (Burchardt et al., 2006) has one frame (POLITICAL_LOCALES) with nearly 1k annotations for a single predicate (Land.n), while each predicate is annotated exactly 100 times in the French FrameNet (Candito et al., 2014). While we controlled for these differences in our selection method by only taking a random sample of 200 instances per frame, it is possible that these differences have an effect when only using supplementary language annotations. Future work could involve controlling for these effects by taking only a fixed number of frame instances from supplementary data in training for a target language.
Our study considered three languages that are among those with the largest frame-semantic resources (English, German, and French). It is clear that generalization of our results must consider that these languages are typologically close to one another (although see Burchardt et al., 2009a), and many potential target languages are more dissimilar to these supplementary languages. Naturally, an important avenue of future research is the generalization of our frame selection to a broader range of target languages. The Multilingual FrameNet alignment tool (described in Section 4.4.1) could be another promising way to gauge frames that would require more annotation for target language frame identification, as these frames would have poorer cross-lingual alignment of their lexical units. At the same time, it would be straightforward to extend our framework to other languages, as we observe that a target language model already sees impressive gains with 5k instances of annotated data, which is a small requirement for frame annotation.