Zipf’s Law and L’Arbitraire du Signe

Authors

  • Martin Kay

DOI:

https://doi.org/10.33011/lilt.v6i.1251

Keywords:

digital systems

Abstract

In every field of scientific enquiry, there is much data and therefore frequent cause to turn to the computer to help process it. This is certainly true of linguists. They use computers to search for examples of grammatical phenomena in large corpora and to collect statistics on their occurrence. They can use them to compile lexica, and to compare them with a view to assessing the relatedness of pairs of languages. Activities like these are collectively referred to as Natural Language Processing (NLP). Generally speaking, however, NLP is an engineering, rather than a scientific, enterprise, much of it devoted to developing technologies like machine translation, information retrieval, and speech recognition. It would be natural to expect these technological developments to be informed by the results of scientific enquiry carried out by linguists. In other words, it would be natural that they should have a foundation in computational linguistics. But this is rarely the case. Technological development in NLP is based almost entirely on machine-learning models, most of which are wild and fantastical from a linguist’s perspective. This, of course, is an aberration which, fortunately, may be in the course of correction.

In a tightly argued and largely convincing essay elsewhere in this volume, Steven Abney expresses a different view. “Computational linguistics”, he writes, “is not a specialization of linguistics at all, at least not if we take ‘linguistics’ and ‘computational linguistics’ as academic communities defined by their membership.” An academic community is a set of people and a set is surely defined by its membership, but sets do not confer on their members the right to appropriate names already long since claimed by the members of other sets. In this paper, I shall continue to use the term “Computational Linguistics” to refer to an approach to the subject of linguistics that is informed and inspired by computing. With Abney, I shall argue that “Language is a computational system, and there is a depth of understanding that is simply unachievable without a thorough knowledge of computation.” There is a natural affinity between linguistics and computer science, and it is one that has very little to do with NLP. It arises because human language is one of very few naturally occurring phenomena that are fundamentally digital. Linguists and lay people alike tacitly acknowledge this affinity when they discuss such questions as whether a spider is an insect, whether the vowel in “marry” is the same as the one in “merry”, or whether I can claim simultaneously that “I heard about the argument in the library” while denying the truth of both “I was in the library” and “The argument was in the library”. Notice that, while a spider may be more or less like an insect, it cannot be more or less an insect. Either it is, or it is not. Likewise with the vowels in “marry” and “merry”. They may sound more or less different in the speech of different people, but the vowels of a particular English speaker’s language constitute a small, fixed set and, in a given dialect, the vowels in these words are instances either of the same, or of different, members of that set. The sentence about the argument and the library has (at least) two syntactic structures, one of which puts me, and one of which puts the argument, in the library. Language places the phenomena in its purview into absolutely discrete classes, and this is what makes it a digital system.

Published

2011-10-01

How to Cite

Kay, M. (2011). Zipf’s Law and L’Arbitraire du Signe. Linguistic Issues in Language Technology, 6. https://doi.org/10.33011/lilt.v6i.1251