Adding linguistic information to parsed corpora

No matter how comprehensively corpus builders design their annotation schemes, users frequently find that information is missing that they need for their research. In this methodological paper I describe and illustrate five methods of adding linguistic information to corpora that have been morphosyntactically annotated (=parsed) in the style of Penn treebanks. Some of these methods involve manual operations; some are executed by CorpusSearch functions; some require a combination of manual and automated procedures. Which method is used depends almost entirely on the type of information to be added and the goals of the user. Of course, the main goal, regardless of method, is to record within the corpus additional information that can be used for analysis and also retained through further searches and data processing.


Introduction
No matter how comprehensively corpus builders design their annotation schemes, users frequently find that information is missing that they need for their research, and so they must add it on their own. In this methodological paper I discuss and illustrate five methods of adding linguistic information of all types (lexical, phonological, morphological, syntactic, semantic, discourse) to corpora that have been morphosyntactically annotated (=parsed) in the style of Penn treebanks, and the advantages and disadvantages of each method. These five methods are the following: 1) adding information to the ur-text; 2) inserting CODE nodes into the token structure; 3) embedding information in coding strings; 4) modifying node labels and structure; and 5) importing token information and other corpus data into spreadsheets. Method 1 is necessarily manual, while methods 2 through 5 may involve a combination of manual and automated procedures, functions and tools. Of course, the main goal, regardless of method, is to record within the corpus additional information that can be used for analysis and also retained through further searches and data processing. The search engine used for many treebanks, and the one used for the searches and the automated annotation described in this paper, is CorpusSearch (CS).

The manual addition of information may be the simplest procedure but, being manual, it is the most prone to error. Information can be added to the two areas of CS output that are reproduced each time CS is run under default conditions: 1) the token ur-text, which contains the token text and ID without any annotation, and 2) the token structure, including the lexical items. The main difference between the two locations is that material internal to the ur-text is not searchable by CS, while the token structure is the object that is searched and modified by CS queries.
A word of warning is appropriate here: annotation that is added manually cannot be reproduced except by repeating the same manual procedure. Annotation added by CS (i.e. coding strings, structure changes, label changes) can easily be reproduced, unless it is based on annotation that was previously added manually. The availability of automated reproduction is important for three reasons: 1) Files can be lost or damaged. Automated reproduction of annotation is relatively simple; manual reproduction is painful and time-consuming. 2) Whenever we look at the output of a new CS query, most of us find problems, either in the query or else in the corpus; we must then find and fix the source of the problem and run CS again. One way to facilitate this repetition is to use annotated batch files so that the same processes can be documented and repeated. The use of batch files permits the effortless repetition of what may be a long and complex string of searches. An example of a batch file is given in the Appendix. 3) We want other scholars to be able to reproduce our research. With this end in mind, it is encouraging to see that many researchers are making their CS queries available, either in an appendix or on the web, along with their search results.
In the remainder of this paper, I describe and evaluate the five methods listed above, presenting case studies for each method from my own recent collaborative research. For readers who are not familiar with CS, some details of the search methodology will be given where space permits; interested readers are referred to the online CS manual. Because of space limitations, the background information and results for each case study are necessarily brief; interested readers are referred to the publications themselves for details and clarifications.
Method 1: Adding information manually to the ur-text
The ur-text consists of the words of the token and the token ID without morphosyntactic annotation; CS outputs the ur-text above the structure for each token in the output file. As mentioned above, adding information manually to the ur-text is arguably the simplest procedure, at least in concept, but it has (at least) three major drawbacks: 1) because it is manual, it is prone to error; 2) it must be applied to CS output, not to the original corpus, because the original corpus does not contain ur-text to accompany the token structure; and 3) the ur-text is not searchable by CS, and therefore any added information can be used only by looking directly at the individual tokens in the data file, one by one. This method was used for some of the tokens in the database for Haeberli et al. 2017, described in Case study 1 below.
Case study 1: Haeberli et al. 2017, investigating verb second (V2) in Old English, looked at fronted pronominal objects to determine whether they can be analyzed as the result of Formal Movement (Frey 2006a,b; Light 2012). CS was used to retrieve all clauses with fronted pronominal objects, but the preceding context was needed to determine the topic type (familiar, aboutness, contrastive, as in Frascarelli and Hinterhölzl 2007). Examples (1) and (2) below show text manually inserted in the ur-text (the text in the area between '/*' and '*/'). In (1) below, the ur-text is enclosed in a box, and the information added manually is in red. The original Old English token, including the token ID, is in black.
In (1), the added information is the preceding context and its gloss and the gloss of the token itself. In (2), the added information is the gloss of the token and a comment about structure and word order in the token.
Step 3: Count NP types

Step 3 Query:
node: IP*
query: (NP-OB* idoms CODE) AND (CODE idoms <NPTYPE:*GNR*>)

The texts in Table 3 are arranged in chronological order. The columns are arranged left to right in order of increasing saliency of presupposition of existence: there is no presupposition of existence with GNR, NPE, and EXS-SCOPE-nrw; there is a clear presupposition of existence with EXS-SCOPE-wd and EXS-SPC; and finally there is a 'gray' area in the middle: EXS-SCOPE-amb, EXS, AMB.
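Once CODE nodes of this kind are in place, tallies like those in Table 3 can also be produced mechanically. The Python sketch below counts NPTYPE codes in a corpus file; the node format follows the query above, but the sample tokens and the exact spelling of the codes are invented for illustration, and the counting in the study itself was done with CS queries.

```python
import re
from collections import Counter

def tally_nptypes(corpus_text):
    """Count occurrences of each NPTYPE code embedded in CODE nodes.

    Assumes codes appear as (CODE <NPTYPE:XXX>) inside the token
    structure, as in the Step 3 query; the exact node format may
    differ in a given corpus.
    """
    codes = re.findall(r"\(CODE\s+<NPTYPE:([^>]+)>\)", corpus_text)
    return Counter(codes)

# Invented sample tokens in Penn-treebank bracketed style:
sample = """
( (IP-MAT (NP-OB1 (CODE <NPTYPE:GNR>) (N man))))
( (IP-MAT (NP-OB1 (CODE <NPTYPE:EXS-SPC>) (N cyning))))
( (IP-MAT (NP-OB2 (CODE <NPTYPE:GNR>) (N wif))))
"""
print(tally_nptypes(sample))  # Counter({'GNR': 2, 'EXS-SPC': 1})
```

Run per text file, such counts give the cell values from which a table like Table 3 can be assembled.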
According to Crisma 2015, an develops in three stages in the history of English. In Stage 1, an is the numeral 'one'; in Stage 2, an is an overt existential operator used when an indefinite noun phrase is interpreted as specific or when it takes wide scope over another operator; in Stage 3, an is an expletive used with all singular noun phrases. Crisma notes that in Stages 1 and 2, an is never used with generics.
The numbers are quite small in most of the cells in Table 3; presenting frequencies would be misleading. Nevertheless, clear patterns emerge. We can see that in the M1 period, an acts as an overt existential operator in the following types of nominals: 1) indefinite nominals that are interpreted as specific (EXS-SPC: 0 BSG, 14 AN); 2) nominals that take wide scope over some other operator (EXS-SCOPE-wd: 0 BSG, 1 AN). For nominals in the absence of other logical operators, an is favoured by about 2 to 1 over BSG (EXS: 17 BSG, 35 AN). For NPE nominals, either generic or narrow scope existential, as well as for existential nominals taking narrow scope, BSG is favoured by about 2 to 1 over an (EXS-SCOPE-nrw: 19 BSG, 11 AN; NPE: 26 BSG, 11 AN). In addition, we see the first sign of change: in two texts, Ancrene Riwle and Hali Meidhad, there are two examples each of an used with generics (GNR).
In the M3 period, we see a number of changes: 1) for generics (GNR), a sharp reversal in the distribution of BSG (6 BSG, 47 AN); 2) for nominals with no presupposition of existence (NPE), there is also a reversal, with only 3 BSG and 55 AN; 3) similarly for existential nominals with narrow scope (EXS-SCOPE-nrw) and existential nominals in the absence of other logical operators (EXS), with all 11 and 17 tokens, respectively, using AN. Our conclusion is that in this period, the use of an with singular nouns has generalised to all contexts, with very few exceptions.

Method 3: Embedding information in coding strings
Coding strings are strings of characters, each character representing a linguistic or extralinguistic variable, which are inserted as nodes in the tokens of a corpus file. Method 3, the construction of coding strings, is the traditional and perhaps most widely used method of adding information to corpus data. Coding strings had their origin in quantitative sociolinguistic research and were used decades before the creation of parsed corpora. The CODING function of CS is used to construct coding strings based on the morphosyntactic annotation and the lexical content of the token; once created, coding strings may be manually extended to encode information that is not represented in the corpus. Since coding strings are part of the token structure, they may be searched and manipulated by CS. Coding strings may also be used as input to software for statistical analysis, like R; this is perhaps their most important function.

Case study 3: Taylor and Pintzuk 2015 (T&P 2015) examine the position of objects in Old English, looking at the effect of verb order and of the length and information structure of the object to support their conclusion that there are two sources for post-verbal objects in Old English: object postposition and base-generation. The relevant data are shown in (4). T&P 2015 present the following analysis of these data. They assume that in the Old English period, there was variation in underlying structure: head-initial/final IPs (AuxV/VAux) and VPs (VO/OV). V Aux O can be derived only from head-final IP/VP structure by postposition of O from preverbal position, as shown in (5). If all post-verbal objects were derived by postposition in both V Aux and Aux V clauses, i.e. if structure (5)e didn't exist, we would expect the factors influencing post-verbal position to be the same in both clause types. To test this null hypothesis, T&P 2015 looked at the influence of weight (as measured by the length of the object in words) and informational status (given vs. new) on the position of objects in AuxV and VAux clauses.

This was a four-step process. As a first step, CS was used to code each token for three factors: the order of finite auxiliary and non-finite main verb (auxv vs. vaux); the position of the object with respect to the non-finite main verb (ov vs. vo); and the length of the object in words (1 . . . 11). The coding query file is given below in (6); an example of a token coded for the first three factors is given in (7). The second step was to manually code the informational status of the object (given vs. new). Examples of tokens coded for all four factors are given in (8) through (11); (8) is the token in (7), now coded for informational status as well. The third step was to use the print_only function of CS to create an output file containing only the coding strings of the data file. The file is shown in (12) below. CS separates the factors by ':', and the user must manually insert a header naming the factors for input to statistical processing, the last step.

The results for this study are shown in Table 5. As Table 5 shows, the effect of weight is significant in both clause types, but slightly weaker in AuxV clauses: each additional word in VAux clauses increases the likelihood of VO order by 2.68, in AuxV clauses by 2.43. Informational status is significant only in VAux clauses: the distance between given and new is .9 in VAux clauses, but only .08 in AuxV clauses. T&P 2015 interpret these results as follows: VAuxO clauses are derived only by postposition of the object, and postposition is strongly influenced by weight and informational status: heavy objects and new objects are much more likely to postpose than light objects and given objects. Since AuxVO clauses are derived by two different processes, postposition and base-generation, the effects of weight and informational status are weakened; this is why the effect of weight is weaker in AuxV clauses and the effect of informational status is reduced to non-significance.
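The third and fourth steps lend themselves to scripting. As a minimal illustration, the Python sketch below turns colon-separated coding strings like those in (12) into a CSV file with a user-supplied header for statistical processing; the factor names and coding values shown are hypothetical stand-ins for those used in the study.

```python
import csv
import io

def coding_strings_to_csv(coding_lines, factor_names):
    """Convert CS coding strings (colon-separated factors) into CSV
    text with a header row, ready for import into R or a spreadsheet.

    The factor names are supplied by the user, just as the header
    must be added by hand in the workflow described above.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(factor_names)
    for line in coding_lines:
        writer.writerow(line.strip().split(":"))
    return out.getvalue()

# Hypothetical coding strings for the four factors in the case study:
lines = ["auxv:vo:3:new", "vaux:ov:1:given"]
print(coding_strings_to_csv(lines, ["verborder", "objpos", "length", "infstatus"]))
```

The resulting text can be saved as a .csv file and read directly by R's read.csv.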

Method 4: Modifying corpus annotation (node labels and structure)
Method 4, the modification of corpus annotation, may be done manually, but it is much more efficient (and safer) to use the corpus-revision tool of CS. This tool enables the addition, deletion, and modification of annotation in the corpus, including not only node labels but also structure. Any search that can be made using CS can act as the basis for corpus revision; the output of corpus revision is a new version of the corpus, i.e. the original version of the corpus is not deleted, in case of catastrophic errors. Corpus revision can be used to build an annotated corpus starting from a straight text corpus with only part-of-speech annotation. I frequently find it useful to mark particular structures so that they are easy to identify, and also to evade some CS restrictions, as will be seen in Case study 4 below.

Case study 4: Haeberli and Pintzuk 2017 (H&P) look at verb placement in 'true V2' contexts in Old English. H&P analyse in detail one particular clause type: clauses with an initial gif/þa/þonne 'if/when/when' subordinate clause, followed by a resumptive adverb (e.g. þa/þonne 'then') and the rest of the main clause; an example is given in (13). Note that initial þa/þonne in Old English main clauses is considered a 'true V2' context: 97.4% (6546/6719) of these clauses exhibit strict V2 order, with the verb in second position followed by the subject. In order to simplify the searches for and coding of these clauses, I wanted to flag the relevant IP-MATs and CP-ADVs by modifying the label. In addition, I wanted to 'remove' the subtrees of all subordinate clauses other than the IP-SUB dominated by the relevant CP-ADV. Three steps were necessary, as shown below; a red font is used for highlighting.
Step 1: Flag the relevant IP-MAT, CP-ADV, and IP-SUB using the query file below. The IP-MAT and CP-ADV are flagged by appending '-z' to the label; the IP-SUB is flagged by prepending 'x-' to the label. Notice that the first token (coeust . . . 26) contains an IP-SUB that is not dominated by the clause-initial CP-ADV; the second token (cobede . . . 3530) contains an IP-MAT that does not dominate a CP-ADV as the first constituent; and the third token (cocanedgX . . . 82) contains both a CP-ADV that is not the first constituent of the IP-MAT and an IP-SUB that is not dominated by the relevant CP-ADV. These are all nodes that are irrelevant to the investigation.
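Outside CS, the label flagging in Step 1 can be sketched with simple pattern replacement. The Python fragment below appends '-z' and prepends 'x-' as described, but, unlike the CS revision queries, it flags every matching label and cannot check the structural conditions (e.g. that the CP-ADV is the first constituent of the IP-MAT); it is an illustration only, and the sample tree is invented.

```python
import re

def flag_labels(tree_text):
    """Illustrative label rewriting on Penn-style bracketed text:
    append '-z' to IP-MAT and CP-ADV labels and prepend 'x-' to
    IP-SUB labels. This flags *every* matching label; the real
    revision queries apply only to nodes meeting the structural
    conditions of the case study.
    """
    text = re.sub(r"\((IP-MAT|CP-ADV)\b", r"(\1-z", tree_text)
    text = re.sub(r"\(IP-SUB\b", "(x-IP-SUB", text)
    return text

print(flag_labels("( (IP-MAT (CP-ADV (C gif) (IP-SUB ...)) (NP-NOM ...)))"))
# ( (IP-MAT-z (CP-ADV-z (C gif) (x-IP-SUB ...)) (NP-NOM ...)))
```

This is one reason the CS revision tool is preferable: its queries can restrict the rewriting to exactly the structurally relevant nodes.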

Method 5: Copying coding strings from the corpus into a spreadsheet
Finally, Method 5 copies coding strings from the corpus into a spreadsheet, the content of which may be ordered, manipulated, and displayed in ways that corpus data cannot be. For example, the data in the cells of a spreadsheet can be interpreted as numbers and used for simple calculations like totals, means, and frequencies; in contrast, the contents of coding strings within a corpus are characters, not numerical values, and cannot be used as numbers. From the spreadsheet users can create output, e.g. a csv file, that is formatted for statistical analysis. Method 5 provides perhaps the most flexible way of working with and analyzing corpus data, but it should be used with caution, for at least two obvious reasons: it involves manual manipulation of the data, and therefore is prone to error; in addition, it is not always possible to go from a spreadsheet back to a corpus format.

Case study 5: Taylor and Pintzuk 2017 (T&P 2017) look at the effect of weight, among other variables, on split coordination in Old English. Almost all coordinated constituents in early stages of English can be split, as illustrated in (14). T&P 2017 focus on subjects, aiming to measure the effect of length (as measured in number of words) on splitting. They need the length of the first conjunct, the length of the second conjunct (which includes both the conjunction and the nominal), and the length of the entire coordinated nominal in order to determine which of the three, if any, has an effect on splitting. But because of the way coordination is annotated in the corpus, these measurements are not at all straightforward.
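As a small illustration of the kind of calculation a spreadsheet or script makes possible once the characters of the coding strings are read as numbers, the Python sketch below computes the proportion of split coordinations at each first-conjunct length from a csv export. The column names ('split', 'len1') are hypothetical abbreviations of the coding factors used in the study, and the data rows are invented.

```python
import csv
import io
from collections import defaultdict

def split_rate_by_length(csv_text):
    """Compute the proportion of split coordinations for each
    first-conjunct length, treating spreadsheet cells as numbers.
    Rows whose length is coded '/' (unmeasurable) are skipped.
    """
    counts = defaultdict(lambda: [0, 0])   # length -> [splits, total]
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["len1"] == "/":             # length could not be measured
            continue
        length = int(row["len1"])
        counts[length][1] += 1
        if row["split"] == "split":
            counts[length][0] += 1
    return {k: s / t for k, (s, t) in sorted(counts.items())}

# Invented data: split status and first-conjunct length per token.
data = """split,len1
split,1
non.split,1
split,3
split,3
"""
print(split_rate_by_length(data))  # {1: 0.5, 3: 1.0}
```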
If the nominal is not split and the coordinated nouns are bare, with no modification, then the nominal is annotated as a flat structure, with the two nouns and the conjunction immediately dominated by the NP, as shown in (15). In these cases the length of the entire subject can be measured, since it is a constituent (NP-NOM). Although the lengths of the two conjuncts can't be measured individually, since they are not constituents, it can be assumed that the length of the first conjunct is 1.

The tokens in this section are coded as follows: file name : token number : flat vs. non-flat : split vs. non.split : final vs. non.final (position within the clause) : length of 1st conjunct : length of 2nd conjunct : length of entire conjoined phrase. '/' is used when it is not possible to measure or assume the length.

In split flat structures, we can measure the length of the 1st conjunct and the length of the 2nd conjunct, since each of these is a constituent; but we cannot measure the length of the entire subject. An example is given in (16).

(The reader might think that the length of the 2nd conjunct could be estimated as 2 (conjunction + noun). However, some conjoined constituents have more than two conjuncts, as shown below in (i); in these cases, the lengths of the 2nd and following conjuncts cannot be assumed or measured.)
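Where a conjunct or a whole coordinated phrase is a constituent, its length in words can be measured automatically. The Python sketch below counts (POS word) leaves in a Penn-style bracketed string; the sample labels and words are invented, and a real script would also need to filter out empty categories, traces, and CODE nodes before counting.

```python
import re

def constituent_length(subtree_text):
    """Count the words in a Penn-style bracketed constituent by
    counting its (LABEL word) leaves. A rough sketch: it assumes
    every leaf is a (LABEL word) pair and does not skip empty
    categories or CODE nodes, which a real script must exclude.
    """
    leaves = re.findall(r"\([^()\s]+ ([^()\s]+)\)", subtree_text)
    return len(leaves)

# A flat coordinated subject like (15): two bare nouns and a conjunction.
np = "(NP-NOM (N^N Adam) (CONJ and) (N^N Eua))"
print(constituent_length(np))  # 3
```

Applied to the NP-NOM node, this gives the length of the entire subject; applied to the individual conjunct constituents in split structures, it gives the conjunct lengths, exactly the measurements the coding scheme above records.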