Learning to Classify Documents According to Formal and Informal Style

Fadi Abu Sheikha; Diana Inkpen

doi:10.33011/lilt.v8i.1305

Authors

Fadi Abu Sheikha, School of Electrical Engineering and Computer Science, University of Ottawa
Diana Inkpen, School of Electrical Engineering and Computer Science, University of Ottawa

Keywords:

classification, formal style, informal style

Abstract

This paper addresses an important problem in computational linguistics: classifying texts as formal or informal in style. We describe a genre-independent methodology for building classifiers for formal and informal texts. We used machine learning techniques to perform the classification automatically, and ran the classification experiments at both the document level and the sentence level. First, we studied the main characteristics of each style in order to train a system that can distinguish between them. We then built two datasets by collecting documents of both styles from different sources: the first contains general-domain documents of formal and informal style, and the second contains medical texts. We tested on the second dataset at the document level to determine whether our model is sufficiently general to work on any type of text. After collecting the data, we extracted from each text a set of features designed to represent the main characteristics of both styles. Finally, we tested several classification algorithms, namely Decision Trees, Naïve Bayes, and Support Vector Machines, in order to choose the classifier that produces the best classification results.
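
The pipeline described in the abstract (extract style features from each text, then train a classifier on them) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three features (contraction rate, first- and second-person pronoun rate, mean word length), the four toy training sentences, and the small Gaussian Naïve Bayes implementation are hypothetical and are not the feature set, data, or code used in the paper.

```python
import math
import re
from collections import defaultdict

# Hypothetical feature set; the abstract does not list the paper's actual features.
# Matches apostrophe forms such as "don't", "I'm", "you'll" (and possessive 's).
CONTRACTION = re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)
PRONOUNS = {"i", "you", "we", "me", "my", "your", "us", "our"}


def extract_features(text):
    """Map a text to [contraction rate, pronoun rate, mean word length]."""
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)  # avoid division by zero on empty input
    contractions = len(CONTRACTION.findall(text))
    pronouns = sum(w.lower() in PRONOUNS for w in words)
    mean_len = sum(len(w) for w in words) / n
    return [contractions / n, pronouns / n, mean_len]


class GaussianNaiveBayes:
    """Tiny Gaussian Naive Bayes: one normal distribution per class and feature."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.stats = {}
        for label, rows in by_class.items():
            cols = list(zip(*rows))
            means = [sum(c) / len(c) for c in cols]
            # A small constant is added so that constant features keep a finite density.
            variances = [
                sum((v - m) ** 2 for v in c) / len(c) + 1e-3
                for c, m in zip(cols, means)
            ]
            self.stats[label] = (math.log(len(rows) / len(X)), means, variances)
        return self

    def predict(self, row):
        def log_posterior(label):
            prior, means, variances = self.stats[label]
            return prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                for x, m, var in zip(row, means, variances)
            )
        return max(self.stats, key=log_posterior)


# Invented toy examples, only to make the sketch runnable.
train = [
    ("We would like to inform the committee that the results are presented in the attached report.", "formal"),
    ("The participants were required to complete the questionnaire prior to the examination.", "formal"),
    ("Hey, I can't make it tonight, you're gonna have to go without me!", "informal"),
    ("I'm sure you'll love it, it's so cool.", "informal"),
]
model = GaussianNaiveBayes().fit(
    [extract_features(text) for text, _ in train],
    [label for _, label in train],
)

print(model.predict(extract_features("I don't think you'll like it.")))
```

The same feature vectors could be passed to a decision tree or a support vector machine in order to compare classifiers, which is the selection step the abstract describes.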
