Lexeme: the Concept of System and the Creation of Speech Corpora for Two Endangered Languages

In this paper, we present the concept of the Lexeme system, a new application for managing speech corpora of endangered languages. The Lexeme system is currently under development. We also present the first results of creating speech corpora for two endangered languages, Siberian Ingrian Finnish and Siberian Tatar. The speech data for these languages have been published, are accessible to the public, and are licensed under a Creative Commons Attribution 4.0 license.


Introduction
At present, many software solutions support work with speech corpora. Stand-alone applications include IrcamCorpusTools (Veaux and Beller, 2008), EXMARaLDA (Schmidt and Wörner, 2009), LaBB-CAT (Fromont and Hay, 2012) and newer systems such as SPPAS (Bigi, 2015). Modern solutions based on a client-server model include the EMU Speech Database Management System (Winkelmann et al., 2017) and ISCAN (McAuliffe et al., 2019). Specialized linguistic tools, such as FieldWorks (https://software.sil.org/fieldworks/) from SIL International, allow users to document endangered languages. However, we see a need for tools with features designed specifically for working with endangered languages. There are relatively few such tools, for example LingSync and the Online Linguistic Database (Dunham, 2014). Developing new solutions in this area therefore remains an important challenge. We briefly review our solution in Section 2. In Section 3, we describe the current status of the first two corpora of endangered languages created for the Lexeme system.

The Principles of the Lexeme System
Lexeme is a new system that provides the following features: storage of audio data, data processing, and presentation of speech information to users. The system will have special features for the documentation and revitalization of endangered languages. Lexeme is based on the following key principles: (1) openness and transparency: all source code and data, including primary audio data, will be accessible on GitHub and licensed under a free license; (2) universality: the system will consist of independent levels, and users can use the artifacts of any level in their own projects irrespective of the other levels; (3) orientation toward different users: linguists, computational linguists, speakers of endangered languages, and language activists.
At present, the Lexeme system is under development.

The Concept of the Lexeme System
The lower level of the Lexeme system: The lower level will bring together all primary data collected from speakers of endangered languages, except speech data that violate ethical principles. These primary speech data will be accessible to the public under a Creative Commons Attribution 4.0 license (CC BY 4.0).

Ivan Ubaleht Omsk State Technical University, Russia

ivan@ubaleht.com
Examples of the lower level of the Lexeme system are the speech data of the Siberian Ingrian Finnish and Siberian Tatar corpora (see Section 3).
The middle level of the Lexeme system: Speech data are documented and annotated on this level. We use ELAN (Wittenburg et al., 2006) for annotating speech data, together with special methods for bridging the "annotation bottleneck". This level contains annotations (for example, the annotations in our repository of the Siberian Ingrian Finnish corpus), database schemas, structured data for databases, and scripts for conversion to different formats. Users can freely use the annotated speech data in their own projects, not only within the Lexeme system.
We plan to use crowdsourcing for the annotation of speech data, working collaboratively with linguists, speakers of endangered languages and language activists. All data on this level will also be licensed under a free license.
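As an illustration of the kind of conversion script mentioned for the middle level, the following minimal sketch reads time-aligned annotations from an ELAN annotation file (EAF, an XML format) and writes them as CSV. It is not the Lexeme implementation: the function names, the tier name and the sample annotation text are assumptions made for this example, and only time-aligned annotations are handled.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A tiny hand-written EAF fragment. The tier name and the annotation
# values are placeholders, not data from the published corpora.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first placeholder utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second placeholder utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""

FIELDS = ["tier", "start_ms", "end_ms", "text"]


def read_aligned_annotations(eaf_text):
    """Return the time-aligned annotations of an EAF document as dicts."""
    root = ET.fromstring(eaf_text)
    # Map time slot identifiers to their values in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        for annotation in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(annotation.get("TIME_SLOT_REF1"))
            end = slots.get(annotation.get("TIME_SLOT_REF2"))
            if start is None or end is None:
                continue  # skip annotations without explicit time values
            text = annotation.findtext("ANNOTATION_VALUE", default="") or ""
            rows.append({
                "tier": tier.get("TIER_ID"),
                "start_ms": start,
                "end_ms": end,
                "text": text.strip(),
            })
    rows.sort(key=lambda row: (row["start_ms"], row["tier"]))
    return rows


def to_csv(rows):
    """Serialize annotation rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(read_aligned_annotations(SAMPLE_EAF)))
```

A flat table of this kind could be loaded into a database or reused in other projects independently of the Lexeme system, in line with the universality principle.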
The upper level of the Lexeme system: The upper level is the level of applications and services. All language resources will be accessible through user-friendly applications and services on this level. A powerful system for querying the data and a user-friendly interface for presenting the data will be implemented on this level. The Lexeme application will be available via the Internet under the lexeme.net domain name. The source code of these applications and services will be accessible on GitHub under a free license.
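To make the idea of querying structured data more concrete, the sketch below stores a few annotated utterances in an in-memory SQLite database and searches them by language and text fragment. Everything in it is an assumption for illustration: the paper does not specify the database schema, so the table names, column names, speaker codes, years of birth and transcriptions are hypothetical placeholders. Only the language code "sty" (ISO 639-3 for Siberian Tatar) comes from the text.

```python
import sqlite3

# Hypothetical schema: the actual Lexeme database schema is not described
# in the paper, so these tables and columns are illustrative only.
SCHEMA = """
CREATE TABLE speaker (
    code TEXT PRIMARY KEY,
    year_of_birth INTEGER
);
CREATE TABLE utterance (
    id INTEGER PRIMARY KEY,
    speaker_code TEXT NOT NULL REFERENCES speaker(code),
    language TEXT NOT NULL,
    start_ms INTEGER NOT NULL,
    end_ms INTEGER NOT NULL,
    transcription TEXT NOT NULL
);
"""


def build_demo_database():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO speaker (code, year_of_birth) VALUES (?, ?)",
        [("S01", 1940), ("S02", 1952)],  # invented values
    )
    conn.executemany(
        "INSERT INTO utterance "
        "(speaker_code, language, start_ms, end_ms, transcription) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            ("S01", "sty", 0, 1500, "placeholder alpha"),
            ("S01", "sty", 1500, 3200, "placeholder beta"),
            ("S02", "sty", 0, 900, "placeholder alpha gamma"),
        ],
    )
    conn.commit()
    return conn


def find_utterances(conn, language, fragment):
    """Return utterances in a language whose transcription contains fragment.

    The fragment is passed as a bound parameter; note that '%' and '_'
    inside it still act as LIKE wildcards in this simple sketch.
    """
    query = """
        SELECT u.speaker_code, s.year_of_birth,
               u.start_ms, u.end_ms, u.transcription
        FROM utterance AS u
        JOIN speaker AS s ON s.code = u.speaker_code
        WHERE u.language = ? AND u.transcription LIKE '%' || ? || '%'
        ORDER BY u.speaker_code, u.start_ms
    """
    return conn.execute(query, (language, fragment)).fetchall()


if __name__ == "__main__":
    connection = build_demo_database()
    for row in find_utterances(connection, "sty", "alpha"):
        print(row)
```

A web application on the upper level could expose queries of this kind behind a search form, so that users do not need to write SQL themselves.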

The Speech Corpus of Siberian Ingrian Finnish
Language context: Siberian Ingrian Finnish is a language (dialect) based on the Lower Luga Ingrian Finnish and Lower Luga Ingrian (Izhorian) varieties (Kuznetsova et al., 2015). It is used by the descendants of settlers from the Lower Luga area (Sidorkevich, 2011; Sidorkevich, 2014; Kuznetsova, 2016). Several expeditions to Omsk Oblast (the Ryzhkovo and Mikhailovka settlements) were undertaken in 2008-2011. D. V. Sidorkevich wrote a Ph.D. thesis in 2013 (Sidorkevich, 2014) that describes the phonology, morphology and certain other aspects of Siberian Ingrian Finnish in detail.
In 2020, there is still a group of people of the older generation who use Siberian Ingrian Finnish.
The current status of the creation of the Siberian Ingrian Finnish Speech Corpus: For the first time, speech data of the Siberian Ingrian Finnish language have been published and are accessible to the public. These speech data are available on GitHub and licensed under a Creative Commons Attribution 4.0 license (CC BY 4.0). Currently, the larger part of the audio data from our expeditions has been published (https://github.com/ubaleht/SiberianIngrianFinnish). We recorded 10 hours of audio from 8 speakers during our four expeditions to the Ryzhkovo and Mikhailovka settlements and in interviews via phone in 2019-2020. Approximately 5 hours of the audio data have been published on GitHub. The structure of the primary audio data is shown in Table 1. At present, we are annotating the speech data from our expeditions and creating a database for storing structured data. The database and structured data are essential to the work of the upper level of the Lexeme system (web application and services).
4 These villages are located in Omsk Oblast.
5 The village used to exist in Omsk Oblast.
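Figures such as the total number of recorded and published hours can be checked directly against the primary audio files. The sketch below sums the durations of WAV files in a directory tree using only the Python standard library. It assumes that the recordings are stored as uncompressed WAV files, which the paper does not state; the function names and the directory layout are likewise hypothetical.

```python
import tempfile
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of one WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def total_duration_hours(directory):
    """Sum the durations of all .wav files below directory, in hours."""
    seconds = sum(
        wav_duration_seconds(path)
        for path in sorted(Path(directory).rglob("*.wav"))
    )
    return seconds / 3600


def write_silence(path, seconds, sample_rate=16000):
    """Write a mono 16-bit WAV file of silence (used only for the demo)."""
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


if __name__ == "__main__":
    # Demo on generated files; point total_duration_hours at a local
    # copy of a corpus directory to measure real recordings.
    with tempfile.TemporaryDirectory() as tmp:
        write_silence(Path(tmp) / "speaker_a.wav", 2)
        write_silence(Path(tmp) / "speaker_b.wav", 1.5)
        print(f"{total_duration_hours(tmp) * 3600:.1f} seconds in total")
```

Recordings in other formats would need a different reader, since the standard `wave` module handles only WAV files.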

The Speech Corpus of Siberian Tatar
Language context: The language of the Siberian Tatars is a Turkic language. It is relatively well studied and is spoken by around 100,000 people, but it is nonetheless endangered. ISO assigned the Siberian Tatar language the ISO 639-3 code "sty" in 2012. The language of the Siberian Tatars has three dialects: Tobol-Irtysh, Tom and Baraba. The Tobol-Irtysh dialect consists of the following subdialects: Tyumen, Tobol, Zabolotny, Tevriz and Tara. The speech data of our first expedition were recorded in the Tevriz subdialect area.
The current status of the creation of the Siberian Tatar Speech Corpus: Our first expedition, to the Siberian Tatar village of Ilchebaga (Ust'-Ishimsky District, Omsk Oblast, Siberia, Russia), was undertaken in 2020. We recorded the speech of 10 speakers during this expedition. These primary speech data have already been published and are accessible to the public (https://github.com/ubaleht/SiberianTatar). They are available on GitHub and licensed under a Creative Commons Attribution 4.0 license (CC BY 4.0). The amount of the primary audio data and the characteristics of the speakers are shown in Table 2. We have started creating the Siberian Tatar speech corpus based on these data. We plan to collect speech material from all the dialects and accents of the Siberian Tatars for this speech corpus. In 2020, we could not record more speech data because of the COVID-19 pandemic.

Table column headers: Code of Speaker and Gender; Year of Birth.

Conclusion
In this paper, we have presented the current results of creating speech corpora for two endangered languages, Siberian Ingrian Finnish and Siberian Tatar. For the first time, speech data of these languages have been published and are accessible to the public. Furthermore, we have briefly reviewed the key principles and the concept of the Lexeme system, a new application for managing speech corpora of endangered languages.