Detecting Ad Hoc Rules for Treebank Development

Authors

  • Markus Dickinson, Indiana University

DOI:

https://doi.org/10.33011/lilt.v4i.1225

Keywords:

treebanks, grammar, TIGER, WSJ treebank

Abstract

We outline a method for detecting ad hoc, or anomalous, rules in treebank grammars by exploiting the fact that such rules do not fit with the rest of the grammar. Ad hoc rules are rules used for specific constructions in one data set and unlikely to be used again. They include ungeneralizable rules, erroneous rules, rules for ungrammatical text, and rules that are inconsistent with the rest of the annotation scheme. Based on the idea that valid rules should receive support from other rules in the grammar, we develop two methods for detecting ad hoc rules in flat treebanks and show that they succeed in detecting such rules. Although some linguistic knowledge can be used in determining rule similarity and dissimilarity, the methods work best with a simple, modified Levenshtein distance. We illustrate this on the English Wall Street Journal treebank and the German TIGER treebank. For the latter, we extend the method to formalisms incorporating discontinuous constituents, employing CFG-like rules for the comparisons.
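
The idea that a valid rule should receive support from similar rules can be illustrated with a minimal sketch. It is not the article's implementation: the abstract does not say how the Levenshtein distance is modified, so the sketch uses the plain edit distance over the category labels on a rule's right-hand side. The function names, the `max_distance` threshold, and the toy rules are all assumptions made for illustration.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Plain (unmodified) edit distance between two sequences of labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def unsupported_rules(rules, max_distance=1):
    """Flag rules with no other rule of the same left-hand side
    within max_distance edits (hypothetical support criterion)."""
    by_lhs = defaultdict(list)
    for lhs, rhs in rules:
        by_lhs[lhs].append(tuple(rhs))
    flagged = []
    for lhs, rhs_list in by_lhs.items():
        for i, rhs in enumerate(rhs_list):
            supported = any(
                levenshtein(rhs, other) <= max_distance
                for j, other in enumerate(rhs_list)
                if j != i
            )
            if not supported:
                flagged.append((lhs, rhs))
    return flagged


# Toy rules, invented for illustration; not taken from either treebank.
rules = [
    ("NP", ["DT", "NN"]),
    ("NP", ["DT", "JJ", "NN"]),
    ("NP", ["DT", "JJ", "JJ", "NN"]),
    ("NP", ["VBD", "IN", "CC", "RB", "UH"]),
]
flagged = unsupported_rules(rules)
```

Under these assumptions, the first three rules support one another because each is one edit away from a neighbor, while the last rule has no close neighbor and is the only one flagged as a candidate ad hoc rule.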

Published

2011-04-01

How to Cite

Dickinson, M. (2011). Detecting Ad Hoc Rules for Treebank Development. Linguistic Issues in Language Technology, 4. https://doi.org/10.33011/lilt.v4i.1225

Section

Articles