A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi
DOI:
https://doi.org/10.33011/lilt.v2i.1203
Keywords:
named entity, named entity recognition
Abstract
This paper describes the development of Named Entity Recognition (NER) systems for two leading Indian languages, Bengali and Hindi, using the Conditional Random Field (CRF) framework. The system uses different types of contextual information along with a variety of features that help predict the different named entity (NE) classes. This feature set includes both language-independent and language-dependent components. We used annotated corpora of 122,467 tokens for Bengali and 502,974 tokens for Hindi, tagged with a tag set of twelve NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). We considered only the tags that denote person names, location names, organization names, number expressions, time expressions and measurement expressions. A number of experiments were carried out to find the most suitable features for NER in Bengali and Hindi. The system was tested on gold standard test sets of 35K tokens for Bengali and 50K tokens for Hindi. Evaluation on these test sets yields overall F-score values of 81.15% for Bengali and 78.29% for Hindi. 10-fold cross-validation tests yield F-score values of 83.89% for Bengali and 80.93% for Hindi. An ANOVA test shows that the performance improvement due to the use of language-dependent features is statistically significant.
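As a rough illustration of what "contextual information along with a variety of features" can look like in practice, the following minimal sketch builds a per-token feature dictionary from the token itself and a window of neighbouring words. Everything here is an assumption made for illustration: the function names `token_features` and `sentence_features`, the window size, the `<PAD>` marker, and the specific features (affixes, length, digit check) are hypothetical and are not taken from the paper, whose actual feature set is described in the full text.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Build a feature dict for tokens[i] (hypothetical, language-independent features)."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "word": word,                                   # the surface form itself
        "length": len(word),                            # word length
        "has_digit": any(ch.isdigit() for ch in word),  # hint for number/time/measure expressions
        "prefix3": word[:3],                            # assumed fixed-length prefix
        "suffix3": word[-3:],                           # assumed fixed-length suffix
        "is_first": i == 0,                             # sentence-initial position
    }
    # Contextual information: surrounding words within the window,
    # padded at sentence boundaries.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"word[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A sequence of such dictionaries, paired with one NE tag per token, is the general shape of input that a linear-chain CRF trainer consumes. Language-dependent features would presumably be added as further keys in the same dictionary.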
License
This work is licensed under CC BY 4.0, which permits you to use, share, adapt, distribute, and reproduce it in any medium or format, provided you credit the original author(s) and source.