"I'm Sorry Dave, I'm Afraid I Can't Do That": Linguistics, Statistics, and Natural Language Processing Circa 2001
"I'm Sorry Dave, I'm Afraid I Can't Do That": Linguistics, Statistics, and Natural Language Processing Circa 2001
"I'm Sorry Dave, I'm Afraid I Can't Do That": Linguistics, Statistics, and Natural Language Processing Circa 2001
It's the year 2000, but where are the flying cars? I was promised flying cars.
Avery Brooks, IBM commercial
According to many pop-culture visions of the future, technology will eventually produce the Machine that Can Speak to Us. Examples range from the False Maria in Fritz Lang's 1926 film Metropolis to Knight Rider's KITT (a talking car) to Star Wars' C-3PO (said to have been modeled on the False Maria). And, of course, there is the HAL 9000 computer from 2001: A Space Odyssey; in one of the film's most famous scenes, the astronaut Dave asks HAL to open a pod bay door on the spacecraft, to which HAL responds, "I'm sorry Dave, I'm afraid I can't do that."
Natural language processing, or NLP, is the field of computer science devoted to creating such machines; that is, to enabling computers to use human languages both as input and as output. The area is quite
broad, encompassing problems ranging from simultaneous multi-language translation to advanced search
engine development to the design of computer interfaces capable of combining speech, diagrams, and other
modalities simultaneously. A natural consequence of this wide range of inquiry is the integration of ideas
from computer science with work from many other fields, including linguistics, which provides models of
language; psychology, which provides models of cognitive processes; information theory, which provides
models of communication; and mathematics and statistics, which provide tools for analyzing and acquiring
such models.
The interaction of these ideas, together with advances in machine learning (see [other chapter]), has resulted in concerted research activity in statistical natural language processing: making computers language-enabled by having them acquire linguistic information directly from samples of language itself. In this essay,
we describe the history of statistical NLP; the twists and turns of the story serve to highlight the sometimes
complex interplay between computer science and other fields.
Although currently a major focus of research, the data-driven, computational approach to language processing was for some time held in deep disregard because it directly conflicts with another commonly held viewpoint: human language is so complex that language samples alone seemingly cannot yield enough
information to understand it. Indeed, it is often said that NLP is "AI-complete" (a pun on NP-completeness;
see [other chapter]), meaning that the most difficult problems in artificial intelligence manifest themselves
in human language phenomena. This belief in language use as the touchstone of intelligent behavior dates
back at least to the 1950 proposal of the Turing Test as a way to gauge whether machine intelligence has been achieved (roughly speaking, a computer passes the test if it can engage in conversations indistinguishable from those of a human); as Turing wrote, "The question and answer method seems to be suitable for introducing almost any one of the fields of human endeavour that we wish to include."
The reader might be somewhat surprised to hear that language understanding is so hard. After all, human
children get the hang of it in a few years, word processing software now corrects (some of) our grammatical
errors, and TV ads show us phones capable of effortless translation. One might therefore be led to believe
that HAL is just around the corner.
Such is not the case, however. In order to appreciate this point, we temporarily divert from describing statistical NLP's history (which touches upon Hamilton versus Madison, the sleeping habits of colorless green ideas, and what happens when one fires a linguist) to examine a few examples illustrating why understanding human language is such a difficult problem.
Since Japanese doesn't have spaces between words, one is faced with the initial task of deciding what the component words are; in particular, a single character sequence can correspond to two or more possible word sequences. (To take an analogous example in English, consider the non-word-delimited sequence of letters "theyouthevent": it corresponds to the word sequences "the youth event", "they out he vent", and "the you the vent".)

[Footnote: Or, perhaps, the files themselves are patient? But our knowledge about the world rules this possibility out.]
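To make the segmentation ambiguity concrete, here is a minimal sketch, with a toy vocabulary invented purely for illustration, that enumerates every way an unsegmented string can be split into known words; applied to "theyouthevent", it recovers exactly the three readings given above.

    # Enumerate all ways to split an unsegmented string into known words.
    # The vocabulary is a toy list chosen for this example; a real system
    # would use a large lexicon plus a statistical model to rank the
    # competing segmentations.
    VOCAB = {"the", "they", "you", "youth", "he", "event", "vent", "out"}

    def segmentations(s):
        """Return every sequence of vocabulary words whose concatenation is s."""
        if not s:
            return [[]]
        results = []
        for i in range(1, len(s) + 1):
            word = s[:i]
            if word in VOCAB:
                for rest in segmentations(s[i:]):
                    results.append([word] + rest)
        return results

    for seg in segmentations("theyouthevent"):
        print(" ".join(seg))
    # Prints "the youth event", "they out he vent", and "the you the vent".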
A "C" change
However, data-driven approaches fell out of favor in the late 1950s. One of the commonly cited factors is
a 1957 argument by linguist (and student of Harris) Noam Chomsky, who believed that language behavior
should be analyzed at a much deeper level than its surface statistics. He claimed,
"It is fair to assume that neither sentence (1) [Colorless green ideas sleep furiously] nor (2) [Furiously sleep ideas green colorless] ... has ever occurred .... Hence, in any [computed] statistical model ... these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
That is, we humans know that sentence (1), which at least obeys (some) rules of grammar, is indeed more probable than (2), which is just word salad; but (the claim goes), since both sentences are so rare, they will have identical statistics, i.e., a frequency of zero in any sample of English. Chomsky's criticism is essentially that data-driven approaches will always suffer from a lack of data, and hence are doomed to failure.
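To see the force of this argument, consider the following sketch; the toy corpus is invented for illustration. An unsmoothed bigram model assigns probability zero to any sentence containing a word pair it has never seen, and so it scores the grammatical sentence (1) and the ungrammatical sentence (2) identically.

    from collections import Counter

    # A toy training corpus; real systems train on vastly more text, but
    # the zero-probability problem persists at any scale.
    corpus = [
        "green ideas are controversial".split(),
        "people sleep furiously at deadlines".split(),
        "colorless liquids are common".split(),
    ]

    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        padded = ["<s>"] + sentence
        for prev, cur in zip(padded, padded[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def unsmoothed_probability(sentence):
        """Product of relative bigram frequencies; zero if any bigram is unseen."""
        p = 1.0
        padded = ["<s>"] + sentence.split()
        for prev, cur in zip(padded, padded[1:]):
            if context_counts[prev] == 0 or bigram_counts[(prev, cur)] == 0:
                return 0.0
            p *= bigram_counts[(prev, cur)] / context_counts[prev]
        return p

    print(unsmoothed_probability("colorless green ideas sleep furiously"))  # 0.0
    print(unsmoothed_probability("furiously sleep ideas green colorless"))  # 0.0
    # Both come out identically zero, even though only the first is grammatical.
    # Smoothing methods (e.g., Good-Turing) reserve probability mass for unseen
    # events, but the underlying sparse-data problem never fully disappears.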
This observation turned out to be remarkably prescient: even now, when billions of words of text are available on-line, perfectly reasonable phrases are simply not present in them; the so-called sparse data problem thus continues to be a serious challenge for statistical NLP. And so, the effect of Chomsky's claim,
together with some negative results for machine learning and a general lack of computing power at the time,
was to cause researchers to turn away from empirical approaches and toward knowledge-based approaches
where human experts encoded relevant information in computer-usable form.
This change in perspective led to several new lines of fundamental, interdisciplinary research. For example, Chomsky's work viewing language as a formal, mathematically describable object has had lasting impact on both linguistics and computer science; indeed, the Chomsky hierarchy, a sequence of increasingly more powerful classes of grammars, is a staple of the undergraduate computer science curriculum. Conversely, the highly influential work of, among others, Kazimierz Ajdukiewicz, Joachim Lambek, David K. Lewis, and Richard Montague adopted the lambda calculus, a fundamental concept in the study of programming languages, to model the semantics of natural languages.
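To give a flavor of the lambda-calculus approach to semantics, the following sketch, whose tiny domain and predicates are invented for illustration rather than taken from the works just cited, treats word meanings as functions that combine by application into the meaning of a whole sentence.

    # Montague-style "semantics as functions": each word denotes a function,
    # and function application mirrors how the words combine syntactically.
    domain = ["alice", "bob", "rex"]               # a toy universe of individuals
    student = lambda x: x in {"alice", "bob"}      # the predicate "student"
    sleeps = lambda x: x in {"alice"}              # the predicate "sleeps"

    # A determiner denotes a function from two predicates to a truth value:
    # every(P)(Q) is true iff every P in the domain is also a Q.
    every = lambda P: lambda Q: all(Q(x) for x in domain if P(x))
    some = lambda P: lambda Q: any(Q(x) for x in domain if P(x))

    print(every(student)(sleeps))  # "Every student sleeps" -> False (bob does not)
    print(some(student)(sleeps))   # "Some student sleeps"  -> True  (alice does)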
datasets, and powerful learning algorithms (both for [...] and [...]) has led to our achieving the milestone of commercial-grade speech recognition products capable of handling continuous speech ranging over a large vocabulary.
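In the standard statistical formulation of speech recognition (very likely what the two elided quantities above refer to), the recognizer seeks the word sequence

    w* = argmax_w Pr(w | a) = argmax_w Pr(a | w) Pr(w),

where a is the observed acoustic signal; Pr(a | w), the acoustic model, describes how the signal could arise from a candidate word sequence w, while Pr(w), the language model, describes how plausible w is as a sentence of the language. Both components are learned from data, which is why larger datasets and better learning algorithms translate so directly into better recognition.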
But let us return to our story. Buoyed by the successes in speech recognition in the 1970s and 1980s (substantial performance gains over knowledge-based systems were posted), researchers began applying data-driven approaches to many problems in natural language processing, in a turn-around so extreme that it has been deemed a "revolution". Indeed, empirical methods are now used at all levels of language analysis.
This is not just due to increased resources: a succession of breakthroughs in machine learning algorithms
has allowed us to leverage existing resources much more effectively. At the same time, evidence from psychology shows that human learning may be more statistically-based than previously thought; for instance,
work by Jenny Saffran, Richard Aslin, and Elissa Newport reveals that 8-month-old infants can learn to divide continuous speech into word segments based simply on the statistics of sounds following one another.
Hence, it seems that the revolution is here to stay.
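A small sketch conveys the idea behind those infant experiments; the three-syllable nonsense words below are merely in the spirit of the 1996 stimuli, and the code estimates syllable-to-syllable transitional probabilities from an artificial speech stream.

    import random
    from collections import Counter

    # Saffran-style idea: within a word, each syllable strongly predicts the
    # next; across a word boundary it does not. A learner can therefore posit
    # boundaries wherever the transitional probability dips.
    words = ["bi da ku", "pa do ti", "go la bu"]   # made-up nonsense words
    random.seed(0)
    stream = []
    for _ in range(200):                           # an unbroken "speech" stream
        stream.extend(random.choice(words).split())

    pair_counts = Counter(zip(stream, stream[1:]))
    context_counts = Counter(stream[:-1])

    def transitional_probability(a, b):
        """Estimated P(next syllable is b | current syllable is a)."""
        return pair_counts[(a, b)] / context_counts[a] if context_counts[a] else 0.0

    print(transitional_probability("bi", "da"))  # within a word: 1.0
    print(transitional_probability("ku", "pa"))  # across a boundary: about 1/3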
Of course, we must not go overboard and mistakenly conclude that the successes of statistical NLP render linguistics irrelevant (rash statements to this effect have been made in the past, e.g., the notorious remark, "Every time I fire a linguist, my performance goes up"). The information and insight that linguists, psychologists, and others have gathered about language is invaluable in creating high-performance broad-domain language understanding systems; for instance, in the speech recognition setting described above, a better understanding of language structure can lead to better language models. Moreover, truly interdisciplinary research has furthered our understanding of the human language faculty. One important example of this is the development of the head-driven phrase structure grammar (HPSG) formalism, a way of analyzing natural language utterances that truly marries deep linguistic information with computer science mechanisms, such as unification and recursive data types (sketched briefly below), for representing and propagating this information throughout the utterance's structure. In sum, computational techniques and data-driven methods are now an
integral part both of building systems capable of handling language in a domain-independent, flexible, and
graceful way, and of improving our understanding of language itself.
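As a hint of what unification over recursive data types looks like, the sketch below, which uses invented grammatical features and none of HPSG's actual machinery, merges two feature structures and fails when they assign conflicting values to the same feature.

    # Feature structures as nested dictionaries; unification merges two
    # structures, failing if they disagree on the value of any feature.
    FAIL = None

    def unify(fs1, fs2):
        """Recursively merge two feature structures, or return FAIL on conflict."""
        if fs1 == fs2:
            return fs1
        if isinstance(fs1, dict) and isinstance(fs2, dict):
            merged = dict(fs1)
            for key, value in fs2.items():
                if key in merged:
                    sub = unify(merged[key], value)
                    if sub is FAIL:
                        return FAIL
                    merged[key] = sub
                else:
                    merged[key] = value
            return merged
        return FAIL  # two distinct atomic values cannot be unified

    verb = {"cat": "V", "subj": {"num": "sg", "person": 3}}
    compatible_subject = {"subj": {"num": "sg"}}
    clashing_subject = {"subj": {"num": "pl"}}

    print(unify(verb, compatible_subject))  # {'cat': 'V', 'subj': {'num': 'sg', 'person': 3}}
    print(unify(verb, clashing_subject))    # None: singular and plural cannot unify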
Acknowledgments
Thanks to the members of the CSTB Fundamentals of Computer Science study and especially Alan
Biermann for their helpful feedback. Also, thanks to Alex Acero, Takako Aikawa, Mike Bailey, Regina
Barzilay, Eric Brill, Chris Brockett, Claire Cardie, Joshua Goodman, Ed Hovy, Rebecca Hwa, John Lafferty,
Bob Moore, Greg Morrisett, Fernando Pereira, Hisami Suzuki, and many others for stimulating discussions
and very useful comments. Rie Kubota Ando provided the Japanese example. The use of the term "revolution" to describe the re-ascendance of statistical methods comes from Julia Hirschberg's 1998 invited address to the American Association for Artificial Intelligence. I learned of the McDonnell-Douglas ad and
some of its analyses from a class run by Stuart Shieber. All errors are mine alone. This paper is based upon
work supported in part by the National Science Foundation under ITR/IM grant IIS-0081334 and a Sloan
Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed above are
those of the author and do not necessarily reflect the views of the National Science Foundation or the Sloan
Foundation.
References
Ajdukiewicz, Kazimierz. 1935. Die syntaktische Konnexität. Studia Philosophica, 1:1–27. English translation available in Storrs McCall, editor, Polish Logic 1920–1939, Clarendon Press (1967).
Chomsky, Noam. 1957. Syntactic Structures. Number IV in Janua Linguarum. Mouton, The Hague, The
Netherlands.
Firth, John Rupert. 1957. A synopsis of linguistic theory 1930–1955. In the Philological Society's Studies in Linguistic Analysis. Blackwell, Oxford, pages 1–32. Reprinted in Selected Papers of J. R. Firth,
edited by F. Palmer. Longman, 1968.
Good, Irving J. 1953. The population frequencies of species and the estimation of population parameters.
Biometrika, 40(3,4):237–264.
Harris, Zellig. 1951. Methods in Structural Linguistics. University of Chicago Press. Reprinted by Phoenix
Books in 1960 under the title Structural Linguistics.
Lambek, Joachim. 1958. The mathematics of sentence structure. American Mathematical Monthly,
65:154–169.
Lewis, David K. 1970. General semantics. Synthèse, 22:18–67.
Montague, Richard. 1974. Formal Philosophy: Selected Papers of Richard Montague. Yale University
Press. Edited by Richmond H. Thomason.
Mosteller, Frederick and David L. Wallace. 1984. Applied Bayesian and Classical Inference: The Case
of the Federalist Papers. Springer-Verlag. First edition published in 1964 under the title Inference and
Disputed Authorship: The Federalist.
Pollard, Carl and Ivan Sag. 1994. Head-Driven Phrase Structure Grammar. University of Chicago Press and
CSLI Publications.
Saffran, Jenny R., Richard N. Aslin, and Elissa L. Newport. 1996. Statistical learning by 8-month-old
infants. Science, 274(5294):1926–1928, December.
Shannon, Claude E. 1948. A mathematical theory of communication. Bell System Technical Journal,
27:379–423 and 623–656.
Turing, Alan M. 1950. Computing machinery and intelligence. Mind, LIX:433–460.
Weaver, Warren. 1949. Translation. Memorandum. Reprinted in W.N. Locke and A.D. Booth, eds., Machine Translation of Languages: Fourteen Essays, MIT Press, 1955.