Seeing More Than Whitespace — Tokenisation and Disambiguation in a North Sámi Grammar Checker

Linda Wiechetek; Kevin B. Unhammer; Sjur N. Moshagen

doi:10.33011/computel.v1i.403

Authors

Linda Wiechetek, UiT The Arctic University of Norway
Kevin B. Unhammer, Trigram AS
Sjur N. Moshagen, UiT The Arctic University of Norway

DOI:

https://doi.org/10.33011/computel.v1i.403

Abstract

Communities of lesser-resourced languages like North Sámi benefit from language tools such as spell checkers and grammar checkers to improve literacy. Accurate error feedback depends on well-tokenised input, but traditional tokenisation as a shallow preprocessing step is inadequate for the challenges of real-world language usage. We present an alternative in which tokenisation remains ambiguous until linguistic context information is available. This lets us accurately detect sentence boundaries, multiwords and compound errors. We describe a North Sámi grammar checker with such a tokenisation system, and show the results of its evaluation.
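
The idea of keeping tokenisation ambiguous until context is available can be illustrated with a small toy sketch. It is not the system described in the paper: the English example, the lexicon MULTIWORDS, the word list NOUNS, and the functions ambiguous_tokenisations, vetoed and disambiguate are all hypothetical names chosen for illustration. The sketch first enumerates every possible segmentation instead of committing to one, then lets a context rule discard readings that do not fit.

```python
from typing import List, Set, Tuple

# Hypothetical toy lexicon of multiword expressions (an assumption for this sketch).
MULTIWORDS: Set[Tuple[str, ...]] = {("in", "general")}
# Hypothetical word list used by the toy context rule below.
NOUNS: Set[str] = {"relativity"}


def ambiguous_tokenisations(words: List[str]) -> List[List[str]]:
    """Return every segmentation of `words` into single words and known multiwords."""
    if not words:
        return [[]]
    results: List[List[str]] = []
    # Reading 1: the first word is a token on its own.
    for rest in ambiguous_tokenisations(words[1:]):
        results.append([words[0]] + rest)
    # Reading 2: the first k words together form a known multiword.
    for mw in sorted(MULTIWORDS):
        k = len(mw)
        if tuple(words[:k]) == mw:
            for rest in ambiguous_tokenisations(words[k:]):
                results.append([" ".join(mw)] + rest)
    return results


def vetoed(path: List[str]) -> bool:
    """Toy context rule: 'in general' is not a multiword when a noun follows it."""
    for i, token in enumerate(path):
        if token == "in general" and i + 1 < len(path) and path[i + 1] in NOUNS:
            return True
    return False


def disambiguate(paths: List[List[str]]) -> List[str]:
    """Drop vetoed readings, then prefer the reading with the fewest tokens."""
    allowed = [p for p in paths if not vetoed(p)]
    # The all-single-words reading is never vetoed, so `allowed` is non-empty.
    return min(allowed, key=len)


if __name__ == "__main__":
    for sentence in ("this holds in general", "this holds in general relativity"):
        paths = ambiguous_tokenisations(sentence.split())
        print(sentence, "->", disambiguate(paths))
```

In this sketch the multiword reading survives in the first sentence and is rejected in the second, where the following noun shows that "in" and "general" are separate tokens. A whitespace-only tokeniser would have had to decide before that context was visible.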
