Proceedings of the Workshop on Computational Methods for Endangered Languages
https://journals.colorado.edu/index.php/computel

Expanding the JHU Bible Corpus for Machine Translation of the Indigenous Languages of North America
https://journals.colorado.edu/index.php/computel/article/view/949
We present an extension to the JHU Bible corpus, collecting and normalizing more than thirty Bible translations in thirty Indigenous languages of North America. These exhibit a wide variety of interesting syntactic and morphological phenomena that are understudied in the computational community. Neural translation experiments demonstrate significant gains obtained through cross-lingual, many-to-many translation, with improvements of up to 8.4 BLEU over monolingual models for extremely low-resource languages.
Garrett Nicolai, Edith Coates, Ming Zhang, Miikka Silfverberg
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 1-5. DOI: 10.33011/computel.v1i.949
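The abstract above turns on cross-lingual, many-to-many translation trained over verse-aligned Bible text. A common way to realise such a setup, sketched below under stated assumptions, is to prepend a target-language tag to every source verse so that a single model learns all translation directions at once; the language codes, verses, and tag format here are illustrative placeholders, not the authors' actual data or pipeline.

```python
# Illustrative sketch (not the authors' pipeline): building many-to-many
# training pairs from verse-aligned Bible translations by prepending a
# target-language tag to each source verse, as in multilingual NMT setups.
from itertools import permutations

# Hypothetical verse-aligned data: language code -> list of verses,
# where index i is the same verse in every language.
corpus = {
    "eng": ["In the beginning ...", "And the earth ..."],
    "crk": ["<Plains Cree verse 1>", "<Plains Cree verse 2>"],
    "nav": ["<Navajo verse 1>", "<Navajo verse 2>"],
}

def make_pairs(corpus):
    """Yield (tagged_source, target) pairs for every ordered language pair."""
    for src_lang, tgt_lang in permutations(corpus, 2):
        for src_verse, tgt_verse in zip(corpus[src_lang], corpus[tgt_lang]):
            # The target-language tag tells the shared model which language
            # to produce, e.g. "<2nav> In the beginning ..."
            yield f"<2{tgt_lang}> {src_verse}", tgt_verse

if __name__ == "__main__":
    for source, target in make_pairs(corpus):
        print(source, "\t", target)
```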
The Language Documentation Quartet
https://journals.colorado.edu/index.php/computel/article/view/951
As we noted in an earlier paper (Musgrave & Thieberger 2012), the written description of a language is an essentially hypertextual exercise, linking various kinds of material in a dense network. An aim based on that insight is to provide a model that can be implemented in tools for language documentation, allowing instantiation of the links always followed in writing a grammar or a dictionary, tracking backwards and forwards to the texts and media as the source of authority for claims made in an analysis. Our earlier paper described our initial efforts to encode Heath’s (1984) grammar, texts (1980), and dictionary (1982) of Nunggubuyu, an Australian language from eastern Arnhem Land. We chose this body of work because it was written with many internal links between the three volumes. The links are all encoded with textual indexes that looked ready to be instantiated as automated hyperlinks once the technology was available. In this paper, we discuss our progress in identifying how the four component parts of a description (grammar, text, dictionary, media; henceforth the quartet) can be interlinked, what the logical points are at which to join them, and whether there are practical limits to how far this linking should be carried. We suggest that the problems exposed in this process can inform the development of an abstract or theoretical data structure for each of the components, and that these in turn can provide models for language documentation work which can feed into hypertext presentations of the type we are developing.
Simon Musgrave, Nicholas Thieberger
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 6-12. DOI: 10.33011/computel.v1i.951

LARA in the Service of Revivalistics and Documentary Linguistics: Community Engagement and Endangered Languages
https://journals.colorado.edu/index.php/computel/article/view/953
We argue that LARA, a new web platform that supports easy conversion of text into an online multimedia form designed to support non-native readers, is a good match to the task of creating high-quality resources useful for languages in the revivalistics spectrum. We illustrate with initial case studies in three widely different endangered/revival languages: Irish (Gaelic); Icelandic Sign Language (ÍTM); and Barngarla, a reclaimed Australian Aboriginal language. The exposition is presented from a language community perspective. Links are given to examples of LARA resources constructed for each language.
Ghil‘ad Zuckermann, Sigurður Vigfússon, Manny Rayner, Neasa Ní Chiaráin, Nedelina Ivanova, Hanieh Habibi, Branislav Bédi
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 13-23. DOI: 10.33011/computel.v1i.953

Fossicking in Dominant Language Teaching: Javanese and Indonesian ‘Low’ Varieties in Language Teaching Resources
https://journals.colorado.edu/index.php/computel/article/view/955
‘Low’ and ‘high’ varieties of Indonesian and other languages of Indonesia are poorly resourced for developing human language technologies. Many languages spoken in Indonesia, even those with very large speaker populations, such as Javanese (over 80 million), are thought to be threatened languages. The teaching of the Indonesian language focuses on the prestige variety, which forms part of the unusual diglossia found in many parts of Indonesia. We developed a publicly available pipeline to scrape and clean text from the PDFs of a classic Indonesian textbook, The Indonesian Way, creating a corpus. Using the corpus and curated wordlists from a number of lexicons, we searched for instances of non-prestige varieties of Indonesian, finding that they play a limited, secondary role to formal Indonesian in this textbook. References to other languages used in Indonesia are usually made as a passing comment. These methods help to determine how text teaching resources relate to and influence the language politics of diglossia and the many languages of Indonesia.
Zara Maxwell-Smith
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 24-32. DOI: 10.33011/computel.v1i.955

Computational Analysis versus Human Intuition: A Critical Comparison of Vector Semantics with Manual Semantic Classification in the Context of Plains Cree
https://journals.colorado.edu/index.php/computel/article/view/971
A persistent challenge in the creation of semantically classified dictionaries and lexical resources is the lengthy and expensive process of manual semantic classification, a hindrance which can make adequate semantic resources unattainable for under-resourced language communities. We explore here an alternative to manual classification using a vector semantic method which, although not yet at the level of human sophistication, can provide usable first-pass semantic classifications in a fraction of the time. As a case example, we use a dictionary in Plains Cree (ISO: crk; Algonquian; Western Canada and the United States).
Daniel Dacanay, Antti Arppe, Atticus Harrigan
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 33-43. DOI: 10.33011/computel.v1i.971
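As a rough illustration of how a vector semantic method can produce the kind of first-pass classifications the abstract above describes, the sketch below assigns each word to the semantic class whose manually classified seed words have the nearest centroid in embedding space. The toy vectors, seed lists, and class labels are invented for illustration and do not reflect the authors' actual classification scheme or the Plains Cree dictionary they used.

```python
# Illustrative sketch (not the authors' exact method): first-pass semantic
# classification by cosine similarity between a word vector and the centroid
# of a few manually classified seed words per semantic class.
import numpy as np

# Toy word vectors; in practice these would come from a trained embedding model.
vectors = {
    "atim":      np.array([0.9, 0.1, 0.0]),
    "minos":     np.array([0.8, 0.2, 0.1]),
    "sipiy":     np.array([0.1, 0.9, 0.2]),
    "sakahikan": np.array([0.2, 0.8, 0.1]),
}

# Hypothetical seed words for each semantic class.
seed_classes = {
    "ANIMAL": ["atim"],
    "WATER":  ["sipiy"],
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(word):
    """Assign a word to the class whose seed-word centroid is most similar."""
    vec = vectors[word]
    centroids = {
        label: np.mean([vectors[w] for w in seeds], axis=0)
        for label, seeds in seed_classes.items()
    }
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

if __name__ == "__main__":
    for w in ["minos", "sakahikan"]:
        print(w, "->", classify(w))
```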
The Usefulness of Bibles in Low-Resource Machine Translation
https://journals.colorado.edu/index.php/computel/article/view/957
Bibles are available in a wide range of languages, which provides valuable parallel text, since verses can be aligned accurately across all the different translations. How well can such data be utilized to train good neural machine translation (NMT) models? We are particularly interested in low-resource languages of high morphological complexity, and attempt to answer this question in the current work by training and evaluating Basque-English and Navajo-English MT models with the Transformer architecture. Different tokenization methods are applied, among which syllabification turns out to be the most effective for Navajo and also performs well for Basque. An additional data resource that can be available for endangered languages is a dictionary of word or phrase translations, thanks to linguists’ work on language documentation. Could this data be leveraged to augment Bible data for better performance? We experiment with different ways to utilize dictionary data, and find that word-to-word mapping translation with a word-pair dictionary is more effective than low-resource techniques such as backtranslation or adding dictionary data directly into the training set, though neither backtranslation nor word-to-word mapping translation produces improvements over using Bible data alone in our experiments.
Ling Liu, Zach Ryan, Mans Hulden
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 44-50. DOI: 10.33011/computel.v1i.957
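The abstract above mentions word-to-word mapping translation with a word-pair dictionary as one way of making use of dictionary data alongside Bible text. A minimal sketch of that general idea, assuming a simple token-for-token substitution, is given below; the dictionary entries and sentences are invented examples, not the authors' data or their exact augmentation procedure.

```python
# Illustrative sketch (not the authors' exact setup): generating synthetic
# parallel sentences by word-to-word mapping with a bilingual word-pair
# dictionary, one way to supplement a small Bible-only training set.

# Hypothetical dictionary entries: source word -> target word.
word_pairs = {
    "ur": "water",
    "etxe": "house",
    "handi": "big",
}

def word_to_word(sentence, dictionary):
    """Map each token through the dictionary, keeping unknown tokens as-is."""
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())

def augment(monolingual_sentences, dictionary):
    """Pair each source sentence with its word-mapped pseudo-translation."""
    return [(s, word_to_word(s, dictionary)) for s in monolingual_sentences]

if __name__ == "__main__":
    for src, pseudo_tgt in augment(["etxe handi", "ur"], word_pairs):
        print(src, "->", pseudo_tgt)
```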
User-Friendly Automatic Transcription of Low-Resource Languages: Plugging ESPnet into Elpis
https://journals.colorado.edu/index.php/computel/article/view/969
This paper reports on progress integrating the speech recognition toolkit ESPnet into Elpis, a web front-end originally designed to provide access to the Kaldi automatic speech recognition toolkit. The goal of this work is to make end-to-end speech recognition models available to language workers via a user-friendly graphical interface. Encouraging results are reported on (i) development of an ESPnet recipe for use in Elpis, with preliminary results on data sets previously used for training acoustic models with the Persephone toolkit along with a new data set that had not previously been used in speech recognition, and (ii) incorporating ESPnet into Elpis along with UI enhancements and a CUDA-supported Dockerfile.
Oliver Adams, Benjamin Galliot, Guillaume Wisniewski, Nicholas Lambourne, Ben Foley, Rahasya Sanders-Dwyer, Janet Wiles, Alexis Michaud, Séverine Guillaume, Laurent Besacier, Christopher Cox, Katya Aplonova, Guillaume Jacques, Nathan Hill
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 51-62. DOI: 10.33011/computel.v1i.969

The Relevance of the Source Language in Transfer Learning for ASR
https://journals.colorado.edu/index.php/computel/article/view/959
This study presents new experiments on Zyrian Komi speech recognition. We use DeepSpeech to train ASR models from a language documentation corpus that contains both contemporary and archival recordings. Earlier studies have shown that transfer learning from English and using a domain-matching Komi language model both improve the CER and WER. In this study we experiment with transfer learning from a more relevant source language, Russian, and with including Russian text in the language model construction. The motivation for this is that Russian and Komi are contemporary contact languages, and Russian is regularly present in the corpus. We found that despite the close contact of Russian and Komi, the larger English speech corpus yielded better performance when used as the source language. Additionally, we can report that an update to the DeepSpeech version alone improved the CER by 3.9% over the earlier studies, which is an important step in the development of Komi ASR.
Nils Hjortnæs, Niko Partanen, Michael Rießler, Francis M. Tyers
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 63-69. DOI: 10.33011/computel.v1i.959
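For readers unfamiliar with transfer learning for ASR, the sketch below illustrates the general idea discussed in the abstract above: reuse the lower layers of an acoustic model trained on a large source-language corpus, replace the output layer for the target language's character set, and fine-tune on the small target corpus. It is a generic PyTorch toy model, not the DeepSpeech recipe or the hyperparameters used in the paper; the layer sizes and character-set sizes are assumptions.

```python
# Generic illustration of transfer learning for ASR (not the DeepSpeech recipe
# from the paper): transfer the encoder from a source-language model,
# reinitialise the output layer, and fine-tune only the unfrozen parameters.
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    def __init__(self, n_features=80, n_chars=30):
        super().__init__()
        self.encoder = nn.LSTM(n_features, 256, num_layers=2, batch_first=True)
        self.output = nn.Linear(256, n_chars)  # per-frame character logits

    def forward(self, x):
        hidden, _ = self.encoder(x)
        return self.output(hidden)

# Pretend this was trained on a large source-language corpus (e.g. English).
source_model = TinyAcousticModel(n_chars=29)

# Transfer: copy the encoder weights, use a fresh output layer sized for the
# target language's alphabet (the sizes here are illustrative only).
target_model = TinyAcousticModel(n_chars=38)
target_model.encoder.load_state_dict(source_model.encoder.state_dict())

# Optionally freeze the transferred encoder during the first fine-tuning epochs.
for p in target_model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    [p for p in target_model.parameters() if p.requires_grad], lr=1e-4
)
```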
Shared Digital Resource Application within Insular Scandinavian
https://journals.colorado.edu/index.php/computel/article/view/961
We describe the application of language technology methods and resources devised for Icelandic, a North Germanic language with about 300,000 speakers, in digital language resource creation for Faroese, a North Germanic language with about 50,000 speakers. The current project encompassed the development of a dedicated, high-accuracy part-of-speech (PoS) tagging solution for Faroese. To achieve this, a state-of-the-art neural PoS tagger for Icelandic, ABLTagger, was trained on a 100,000-word PoS-tagged corpus for Faroese, standardised with methods previously applied to Icelandic corpora. This tagger was supplemented with a novel Experimental Database of Faroese Inflection (EDFM), a lexicon containing morphological information on 67,488 Faroese words with about one million inflectional forms. This approach produced a PoS-tagging model for Faroese which achieves 91.40% overall accuracy when evaluated with 10-fold cross-validation, currently the highest reported accuracy for a dedicated Faroese PoS tagger. The products of this project are made available for use in further research in Faroese language technology.
Hinrik Hafsteinsson, Anton Karl Ingason
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 70-79. DOI: 10.33011/computel.v1i.961

Towards an Open Source Finite-State Morphological Analyzer for Zacatlán-Ahuacatlán-Tepetzintla Nahuatl
https://journals.colorado.edu/index.php/computel/article/view/963
In this paper, we describe an in-progress, free and open-source finite-state transducer morphological analyzer for an understudied Nahuatl variant. We discuss our general approach, some of the technical implementation details, the challenges that accompany building such a system for a low-resource language variant, the current status and performance of the system, and directions for future work.
Robert Pugh, Francis Tyers, Marivel Huerta Mendez
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 80-85. DOI: 10.33011/computel.v1i.963

Integrating Automated Segmentation and Glossing into Documentary and Descriptive Linguistics
https://journals.colorado.edu/index.php/computel/article/view/965
Any attempt to integrate NLP systems into the study of endangered languages must take into consideration traditional approaches in both NLP and linguistics. This paper tests different strategies and workflows for morpheme segmentation and glossing that may affect the potential to integrate machine learning. Two experiments train Transformer models on documentary corpora from five under-documented languages. In one experiment, one model learns segmentation and glossing as a joint step and another model learns the tasks as two sequential steps. We find that the sequential approach yields somewhat better results. In a second experiment, one model is trained on surface-segmented data, where strings of text have simply been divided at morpheme boundaries. Another model is trained on canonically segmented data, the approach preferred by linguists, where abstract, underlying forms are represented. We find no clear advantage to either segmentation strategy and note that the difference between them disappears as training data increases. On average the models achieve more than a 0.5 F1-score, with the best models scoring 0.6 or above. An analysis of errors leads us to conclude that consistency during manual segmentation and glossing may facilitate higher scores from automatic evaluation, but that the scores may be lowered when evaluated against the original data, because instances of annotator error in that data are “corrected” by the model.
Sarah Moeller, Mans Hulden
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 86-95. DOI: 10.33011/computel.v1i.965

Developing a Shared Task for Speech Processing on Endangered Languages
https://journals.colorado.edu/index.php/computel/article/view/967
Advances in speech and language processing have enabled the creation of applications that could, in principle, accelerate the process of language documentation, as speech communities and linguists work on urgent language documentation and reclamation projects. However, such systems have yet to make a significant impact on language documentation, as resource requirements limit the broad applicability of these new techniques. We aim to exploit the framework of shared tasks to focus the technology research community on tasks which address key pain points in language documentation. Here we present initial steps in the implementation of these new shared tasks, through the creation of data sets drawn from endangered language repositories and of baseline systems that perform segmentation and speaker labeling of these audio recordings, important enabling steps in the documentation process. This paper motivates these tasks with a use case, describes data set curation and baseline systems, and presents results on this data. We then highlight the challenges and ethical considerations in developing these speech processing tools and tasks to support endangered language documentation.
Gina-Anne Levow, Emily P. Ahn, Emily M. Bender
Copyright (c) 2021. Published 2021-03-02. Vol. 1, pp. 96-106. DOI: 10.33011/computel.v1i.967
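The baseline systems described in the last abstract perform segmentation and speaker labeling of field recordings. As a minimal sketch of the segmentation step only, the function below implements a simple energy-threshold voice activity detector; it is not the baseline system from the paper, and the frame length and threshold ratio are arbitrary illustrative values.

```python
# Minimal sketch of a segmentation baseline idea (not the paper's baseline):
# an energy-threshold voice activity detector that splits a recording into
# speech segments, a first enabling step before speaker labeling.
import numpy as np

def energy_vad(samples, sample_rate, frame_ms=25, threshold_ratio=0.1):
    """Return (start_sec, end_sec) spans whose frame energy exceeds a threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    threshold = threshold_ratio * energy.max()

    segments, start = [], None
    for i, e in enumerate(energy):
        if e >= threshold and start is None:
            start = i
        elif e < threshold and start is not None:
            segments.append((start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            start = None
    if start is not None:
        segments.append((start * frame_len / sample_rate,
                         len(samples) / sample_rate))
    return segments

if __name__ == "__main__":
    # Synthetic test signal: low-level noise with a tone between 1 s and 2 s.
    sr = 16000
    t = np.arange(sr * 3) / sr
    signal = np.where((t > 1.0) & (t < 2.0),
                      np.sin(2 * np.pi * 220 * t),
                      0.01 * np.random.randn(len(t)))
    print(energy_vad(signal, sr))
```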