Shared Digital Resource Application within Insular Scandinavian

Authors

  • Hinrik Hafsteinsson, University of Iceland
  • Anton Karl Ingason, University of Iceland

DOI:

https://doi.org/10.33011/computel.v1i.961

Abstract

We describe the application of language technology methods and resources devised for Icelandic, a North Germanic language with about 300,000 speakers, to the creation of digital language resources for Faroese, a North Germanic language with about 50,000 speakers. The project developed a dedicated, high-accuracy part-of-speech (PoS) tagger for Faroese. To this end, ABLTagger, a state-of-the-art neural PoS tagger for Icelandic, was trained on a 100,000-word PoS-tagged Faroese corpus that was standardised with methods previously applied to Icelandic corpora. The tagger was supplemented with a novel Experimental Database of Faroese Inflection (EDFM), a lexicon containing morphological information on 67,488 Faroese words with about one million inflectional forms. The resulting PoS-tagging model for Faroese achieves an overall accuracy of 91.40% when evaluated with 10-fold cross-validation, which is currently the highest accuracy for a dedicated Faroese PoS tagger. The products of this project are made available for further research in Faroese language technology.

Published

2021-03-02