Building a Common Voice Corpus for Laiholh (Hakha Chin)


  • Kelly Berkson, Indiana University
  • Samson Lotven, Indiana University
  • Peng Hlei Thang, Indiana University
  • Thomas Thawngza, Indiana University
  • Zai Sung, Indiana University
  • James C. Wamsley, Indiana University
  • Francis Tyers, Indiana University
  • Kenneth Van Bik, California State University Fullerton
  • Sandra Kübler, Indiana University
  • Donald Williamson, Indiana University
  • Matthew Anderson, Indiana University



In this paper, we discuss our efforts to build a corpus for Laiholh, also called Hakha Chin. Laiholh is spoken in Chin State in Western Myanmar, in parts of India and Bangladesh, and in several Burmese refugee communities in the US. Indiana, for example, is home to about 25,000 Burmese refugees. Our team's ultimate goal is to contribute to the development of speech translation technology that will be of benefit both in general and to the local community in Indianapolis, where translation tools would be of great use in emergency rooms, schools, and businesses. In pursuing these (admittedly lofty) goals, we are building a growing community of speakers, field linguists, computational linguists, and computer scientists. As a team, we have worked to share our different skill sets and to mobilize the wider community around collecting data via Mozilla’s Common Voice platform. We present here a reflection on the project thus far: the kind of description we wish had existed when we were first building this collaboration and determining preliminary project goals. We hope that other communities and language activists who are thinking about developing speech technology may benefit from hearing about our motivations, concerns, experiences, and successes.