Eisenstein NLP Notes
Jacob Eisenstein
June 1, 2018
Contents
1 Introduction
1.1 Natural language processing and its neighbors
1.2 Three themes in natural language processing
1.2.1 Learning and knowledge
1.2.2 Search and learning
1.2.3 Relational, compositional, and distributional perspectives
1.3 Learning to do natural language processing
1.3.1 Background
1.3.2 How to use this book
I Learning
2.4.2 Gradients
2.5 Optimization
2.5.1 Batch optimization
2.5.2 Online optimization
2.6 *Additional topics in classification
2.6.1 Feature selection by regularization
2.6.2 Other views of logistic regression
2.7 Summary of learning algorithms
3 Nonlinear classification
3.1 Feedforward neural networks
3.2 Designing neural networks
3.2.1 Activation functions
3.2.2 Network structure
3.2.3 Outputs and loss functions
3.2.4 Inputs and lookup layers
3.3 Learning neural networks
3.3.1 Backpropagation
3.3.2 Regularization and dropout
3.3.3 *Learning theory
3.3.4 Tricks
3.4 Convolutional neural networks
As a general rule, words, word counts, and other types of observations are indicated with Roman letters (a, b, c); parameters are indicated with Greek letters (α, β, θ). Vectors are indicated with bold script for both random variables x and parameters θ. Other useful notations are indicated in the table below.
Basics
exp x               the base-2 exponent, 2^x
log x               the base-2 logarithm, log_2 x
{x_n}_{n=1}^N       the set {x_1, x_2, . . . , x_N}
x_i^j               x_i raised to the power j
x_i^{(j)}           indexing by both i and j
Linear algebra
x^{(i)}             a column vector of feature counts for instance i, often word counts
x_{j:k}             elements j through k (inclusive) of a vector x
[x; y]              vertical concatenation of two column vectors
[x, y]              horizontal concatenation of two column vectors
e_n                 a “one-hot” vector with a value of 1 at position n, and zero everywhere else
θ^⊤                 the transpose of a column vector θ
θ · x^{(i)}         the dot product Σ_{j=1}^N θ_j × x_j^{(i)}
X                   a matrix
x_{i,j}             row i, column j of matrix X
Diag(x)             a matrix with x on the diagonal, e.g.,
                        x_1   0    0
                         0   x_2   0
                         0    0   x_3
X^{-1}              the inverse of matrix X
Text datasets
w_m                 word token at position m
N                   number of training instances
M                   length of a sequence (of words or tags)
V                   number of words in vocabulary
y^{(i)}             the true label for instance i
ŷ                   a predicted label
Y                   the set of all possible labels
K                   number of possible labels, K = |Y|
                    the start token
                    the stop token
y^{(i)}             a structured label for instance i, such as a tag sequence
Y(w)                the set of possible labelings for the word sequence w
♦                   the start tag
                    the stop tag
Probabilities
Pr(A)               probability of event A
Pr(A | B)           probability of event A, conditioned on event B
p_B(b)              the marginal probability of random variable B taking value b; written p(b) when the choice of random variable is clear from context
p_{B|A}(b | a)      the probability of random variable B taking value b, conditioned on A taking value a; written p(b | a) when clear from context
A ∼ p               the random variable A is distributed according to distribution p. For example, X ∼ N(0, 1) states that the random variable X is drawn from a normal distribution with zero mean and unit variance.
A | B ∼ p           conditioned on the random variable B, A is distributed according to p.
Machine learning
Ψ(x^{(i)}, y)       the score for assigning label y to instance i
f(x^{(i)}, y)       the feature vector for instance i with label y
θ                   a (column) vector of weights
ℓ^{(i)}             loss on an individual instance i
L                   objective function for an entire dataset
ℒ                   log-likelihood of a dataset
λ                   the amount of regularization
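To make the linear algebra notation concrete, the following is a minimal sketch in plain Python rather than anything prescribed by the text. The vocabulary, the example sentence, and the weights are invented for illustration, and positions are 0-indexed here as is usual in Python.

```python
from collections import Counter

def one_hot(n, size):
    """e_n: a vector with 1 at position n (0-indexed here) and 0 elsewhere."""
    return [1 if j == n else 0 for j in range(size)]

def dot(theta, x):
    """theta . x = sum over j of theta_j * x_j."""
    assert len(theta) == len(x)
    return sum(t * v for t, v in zip(theta, x))

def diag(x):
    """Diag(x): a square matrix with x on the diagonal and zeros elsewhere."""
    return [[x[i] if i == j else 0 for j in range(len(x))] for i in range(len(x))]

def count_vector(tokens, vocab):
    """x^(i): a vector of word counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Hypothetical vocabulary (V = 3), instance, and weights.
vocab = ["whale", "ship", "sea"]
tokens = "the whale and the ship and the whale".split()
x = count_vector(tokens, vocab)      # counts for whale, ship, sea
theta = [1.0, -0.5, 2.0]
score = dot(theta, x)                # 1.0*2 + (-0.5)*1 + 2.0*0
```

Words outside the vocabulary (here the and and) are simply ignored by `count_vector`; a real system would need an explicit policy for them.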
1 Introduction
Natural language processing is the set of methods for making human language accessible to computers. In the past decade, natural language processing has become embedded in our daily lives: automatic machine translation is ubiquitous on the web and in social media; text classification keeps email inboxes from collapsing under a deluge of spam; search engines have moved beyond string matching and network analysis to a high degree of linguistic sophistication; dialog systems provide an increasingly common and effective way to get and share information.
These diverse applications are based on a common set of ideas, drawing on algorithms, linguistics, logic, statistics, and more. The goal of this text is to provide a survey of these foundations. The technical fun starts in the next chapter; the rest of this current chapter situates natural language processing with respect to other intellectual disciplines, identifies some high-level themes in contemporary natural language processing, and advises the reader on how best to approach the subject.
Computational Linguistics. Most of the meetings and journals that host natural language processing research bear the name “computational linguistics”, and the terms may be thought of as essentially synonymous. But while there is substantial overlap, there is an important difference in focus. In linguistics, language is the object of study. Computational methods may be brought to bear, just as in scientific disciplines like computational biology and computational astronomy, but they play only a supporting role. In contrast, natural language processing is focused on the design and analysis of computational algorithms and representations for processing natural human language. The goal of natural language processing is to provide new computational capabilities around human language: for example, extracting information from texts, translating between languages, answering questions, holding a conversation, taking instructions, and so on. Fundamental linguistic insights may be crucial for accomplishing these tasks, but success is ultimately measured by whether and how well the job gets done.
Machine Learning. Contemporary approaches to natural language processing rely heavily on machine learning, which makes it possible to build complex computer programs from examples. Machine learning provides an array of general techniques for tasks like converting a sequence of discrete tokens in one vocabulary to a sequence of discrete tokens in another vocabulary, a generalization of what normal people might call “translation.” Much of today’s natural language processing research can be thought of as applied machine learning. However, natural language processing has characteristics that distinguish it from many of machine learning’s other application domains.
• Unlike images or audio, text data is fundamentally discrete, with meaning created by combinatorial arrangements of symbolic units. This is particularly consequential for applications in which text is the output, such as translation and summarization, because it is not possible to gradually approach an optimal solution.
• Although the set of words is discrete, new words are always being created. Furthermore, the distribution over words (and other linguistic elements) resembles that of a power law (Zipf, 1949): there will be a few words that are very frequent, and a long tail of words that are rare. A consequence is that natural language processing algorithms must be especially robust to observations that do not occur in the training data.
• Language is recursive: units such as words can combine to create phrases, which can combine by the very same principles to create larger phrases. For example, a noun phrase can be created by combining a smaller noun phrase with a prepositional phrase, as in the whiteness of the whale. The prepositional phrase is created by combining a preposition (in this case, of) with another noun phrase (the whale). In this way, it is possible to create arbitrarily long phrases, such as,
(1.1) . . . huge globular pieces of the whale of the bigness of a human head.¹
The meaning of such a phrase must be analyzed in accord with the underlying hierarchical structure. In this case, huge globular pieces of the whale acts as a single noun phrase, which is conjoined with the prepositional phrase of the bigness of a human head. The interpretation would be different if instead, huge globular pieces were conjoined with the prepositional phrase of the whale of the bigness of a human head, implying a disappointingly small whale. Even though text appears as a sequence, machine learning methods must account for its implicit recursive structure.
¹ Throughout the text, this notation will be used to introduce linguistic examples.
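The point about frequent words, the long tail, and unseen observations can be illustrated with a short sketch. The training and test strings below are invented toy data, far too small to exhibit a real power law; the code only shows how one might tabulate rank-frequency counts and measure how often test tokens were never seen in training.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()

def unseen_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose word type never occurs in training."""
    seen = set(train_tokens)
    unseen = [w for w in test_tokens if w not in seen]
    return len(unseen) / len(test_tokens)

# Toy data, invented for illustration only.
train = "the whale and the ship and the sea and the sky".split()
test = "the whale and the harpoon".split()

ranked = rank_frequency(train)    # a few frequent words, then a tail of singletons
rate = unseen_rate(train, test)   # harpoon never appears in the training tokens
```

On realistic corpora the tail is much longer, so the unseen rate rarely falls to zero no matter how much training text is collected.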
Artificial Intelligence. The goal of artificial intelligence is to build software and robots with the same range of abilities as humans (Russell and Norvig, 2009). Natural language processing is relevant to this goal in several ways. The capacity for language is one of the central features of human intelligence, and no artificial intelligence program could be said to be complete without the ability to communicate in words.²
Much of artificial intelligence research is dedicated to the development of systems that can reason from premises to a conclusion, but such algorithms are only as good as what they know (Dreyfus, 1992). Natural language processing is a potential solution to the “knowledge bottleneck”, by acquiring knowledge from natural language texts, and perhaps also from conversations. This idea goes all the way back to Turing’s 1950 paper Computing Machinery and Intelligence, which proposed the Turing test and helped to launch the field of artificial intelligence (Turing, 2009).
Conversely, reasoning is sometimes essential for basic tasks of language processing, such as determining who a pronoun refers to. Winograd schemas are examples in which a single word changes the likely referent of a pronoun, in a way that seems to require knowledge and reasoning to decode (Levesque et al., 2011). For example,
(1.2) The trophy doesn’t fit into the brown suitcase because it is too [small/large].
When the final word is small, then the pronoun it refers to the suitcase; when the final word is large, then it refers to the trophy. Solving this example requires spatial reasoning; other schemas require reasoning about actions and their effects, emotions and intentions, and social conventions.
The Winograd schemas demonstrate that natural language understanding cannot be achieved in isolation from knowledge and reasoning. Yet the history of artificial intelligence has been one of increasing specialization: with the growing volume of research in subdisciplines such as natural language processing, machine learning, and computer vision, it is difficult for anyone to maintain expertise across the entire field. Still, recent work has demonstrated interesting connections between natural language processing and other areas of AI, including computer vision (e.g., Antol et al., 2015) and game playing (e.g., Branavan et al., 2009). The dominance of machine learning throughout artificial intelligence has led to a broad consensus on representations such as graphical models and computation graphs, and on algorithms such as backpropagation and combinatorial optimization. Many of the algorithms and representations covered in this text are part of this consensus.
² This view seems to be shared by some, but not all, prominent researchers in artificial intelligence. Michael Jordan, a specialist in machine learning, has said that if he had a billion dollars to spend on any large research project, he would spend it on natural language processing (https://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan/). On the other hand, in a public discussion about the future of artificial intelligence in February 2018, computer vision researcher Yann LeCun argued that language was perhaps the “50th most important” thing to work on, and that it would be a great achievement if AI could attain the capabilities of an orangutan, which presumably do not include language (http://www.abigailsee.com/2018/02/21/deep-learning-structure-and-innate-priors.html).
Computer Science. The discrete and recursive nature of natural language invites the application of theoretical ideas from computer science. Linguists such as Chomsky and Montague have shown how formal language theory can help to explain the syntax and semantics of natural language. Theoretical models such as finite-state and pushdown automata are the basis for many practical natural language processing systems. Algorithms for searching the combinatorial space of analyses of natural language utterances can be analyzed in terms of their computational complexity, and theoretically motivated approximations can sometimes be applied.
The study of computer systems is also relevant to natural language processing. Processing large datasets of unlabeled text is a natural application for parallelization techniques like MapReduce (Dean and Ghemawat, 2008; Lin and Dyer, 2010); handling high-volume streaming data sources such as social media is a natural application for approximate streaming and sketching techniques (Goyal et al., 2009). When deep neural networks are implemented in production systems, it is possible to eke out speed gains using techniques such as reduced-precision arithmetic (Wu et al., 2016). Many classical natural language processing algorithms are not naturally suited to graphics processing unit (GPU) parallelization, suggesting directions for further research at the intersection of natural language processing and computing hardware (Yi et al., 2011).
Speech Processing. Natural language is often communicated in spoken form, and speech recognition is the task of converting an audio signal to text. From one perspective, this is a signal processing problem, which might be viewed as a preprocessing step before natural language processing can be applied. However, context plays a critical role in speech recognition by human listeners: knowledge of the surrounding words influences perception and helps to correct for noise (Miller et al., 1951). For this reason, speech recognition is often integrated with text analysis, particularly with statistical language models, which quantify the probability of a sequence of text (see chapter 6). Beyond speech recognition, the broader field of speech processing includes the study of speech-based dialogue systems, which are briefly discussed in chapter 19. Historically, speech processing has often been pursued in electrical engineering departments, while natural language processing has been the purview of computer scientists. For this reason, the extent of interaction between these two disciplines is less than it might otherwise be.
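As a rough illustration of what it means for a language model to quantify the probability of a sequence of text, here is a minimal sketch of a bigram model with add-alpha smoothing. This is one standard textbook technique, chosen here as an assumption; it is not necessarily how chapter 6 presents the topic. The token strings `<s>` and `</s>`, the smoothing constant, and the toy corpus are all hypothetical. Logarithms are base 2, following the notation table.

```python
from collections import Counter
import math

START, STOP = "<s>", "</s>"   # hypothetical start and stop token strings

def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with start/stop tokens."""
    context, bigram = Counter(), Counter()
    for sent in sentences:
        tokens = [START] + sent.split() + [STOP]
        for prev, cur in zip(tokens, tokens[1:]):
            context[prev] += 1
            bigram[(prev, cur)] += 1
    return context, bigram

def log2_prob(sentence, context, bigram, vocab_size, alpha=1.0):
    """Base-2 log probability of a sentence under an add-alpha bigram model."""
    tokens = [START] + sentence.split() + [STOP]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigram[(prev, cur)] + alpha) / (context[prev] + alpha * vocab_size)
        total += math.log2(p)
    return total

# Toy corpus, invented for illustration.
corpus = ["the whale swims", "the ship sails"]
context, bigram = train_bigram(corpus)
vocab_size = len({w for s in corpus for w in s.split()} | {STOP})

fluent = log2_prob("the whale swims", context, bigram, vocab_size)
scrambled = log2_prob("swims whale the", context, bigram, vocab_size)
```

A speech recognizer could use such scores to prefer a word sequence like the first over an acoustically similar but implausible alternative.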
Others. Natural language processing plays a significant role in emerging interdisciplinary fields like computational social science and the digital humanities. Text classification (chapter 4), clustering (chapter 5), and information extraction (chapter 17) are particularly useful tools; another is probabilistic topic models (Blei, 2012), which are not covered in this text. Information retrieval (Manning et al., 2008) makes use of similar tools, and conversely, techniques such as latent semantic analysis (§ 14.3) have roots in information retrieval. Text mining is sometimes used to refer to the application of data mining techniques, especially classification and clustering, to text. While there is no clear distinction between text mining and natural language processing (nor between data mining and machine learning), text mining is typically less concerned with linguistic structure, and more interested in fast, scalable algorithms.
representations such as syntax trees have not yet gone the way of the visual edge detector or the auditory triphone. Linguists have argued for the existence of a “language faculty” in all human beings, which encodes a set of abstractions specially designed to facilitate the understanding and production of language. The argument for the existence of such a language faculty is based on the observation that children learn language faster and from fewer examples than would be reasonably possible if language were learned from experience alone.³ Regardless of the cognitive validity of these arguments, it seems that linguistic structures are particularly important in scenarios where training data is limited.
Moving away from the extreme ends of the continuum, there are a number of ways in which knowledge and learning can be combined in natural language processing. Many supervised learning systems make use of carefully engineered features, which transform the data into a representation that can facilitate learning. For example, in a task like document classification, it may be useful to identify each word’s stem, so that a learning system can more easily generalize across related terms such as whale, whales, whalers, and whaling. This is particularly important in the many languages that exceed English in the complexity of the system of affixes that can attach to words. Such features could be obtained from a hand-crafted resource, like a dictionary that maps each word to a single root form. Alternatively, features can be obtained from the output of a general-purpose language processing system, such as a parser or part-of-speech tagger, which may itself be built on supervised machine learning.
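The dictionary-based option can be sketched in a few lines. The mapping `ROOT_FORMS` below is a hypothetical stand-in for a hand-crafted resource, covering only the four word forms mentioned above, and the fallback to the lowercased surface form for unlisted words is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical hand-crafted resource: each word form maps to a single root form.
ROOT_FORMS = {
    "whale": "whale",
    "whales": "whale",
    "whalers": "whale",
    "whaling": "whale",
}

def stem_features(tokens, root_forms=ROOT_FORMS):
    """Count root forms; words missing from the resource keep their lowercased form."""
    return Counter(root_forms.get(t.lower(), t.lower()) for t in tokens)

features = stem_features("Whalers went whaling for whales".split())
```

With these features, a classifier that has learned a weight for whale can apply it to all three related forms, instead of learning three separate weights from scarcer evidence.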
Another synthesis of learning and knowledge is in model structure: building machine learning models whose architectures are inspired by linguistic theories. For example, the organization of sentences is often described as compositional, with the meaning of larger units gradually constructed from the meaning of their smaller constituents. This idea can be built into the architecture of a deep neural network, which is then trained using contemporary deep learning techniques (Dyer et al., 2016).
The debate about the relative importance of machine learning and linguistic knowledge sometimes becomes heated. No machine learning specialist likes to be told that their engineering methodology is unscientific alchemy;⁴ nor does a linguist want to hear that the search for general linguistic principles and structures has been made irrelevant by big data. Yet there is clearly room for both types of research: we need to know how far we can go with end-to-end learning alone, while at the same time, we continue the search for linguistic representations that generalize across applications, scenarios, and languages. For more on the history of this debate, see Church (2011); for an optimistic view of the potential symbiosis between computational linguistics and deep learning, see Manning (2015).
³ The Language Instinct (Pinker, 2003) articulates these arguments in an engaging and popular style. For arguments against the innateness of language, see Elman et al. (1998).
⁴ Ali Rahimi argued that much of deep learning research was similar to “alchemy” in a presentation at the 2017 conference on Neural Information Processing Systems. He was advocating for more learning theory, not more linguistics.
where,
This basic structure can be used across a huge range of problems. For example, the input x might be a social media post, and the output y might be a labeling of the emotional sentiment expressed by the author (chapter 4); or x could be a sentence in French, and the output y could be a sentence in Tamil (chapter 18); or x might be a sentence in English, and y might be a representation of the syntactic structure of the sentence (chapter 10); or x might be a news article and y might be a structured record of the events that the article describes (chapter 17).
By adopting this formulation, we make an implicit decision that language processing algorithms will have two distinct modules:
Search. The search module is responsible for computing the argmax of the function Ψ. In other words, it finds the output ŷ that gets the best score with respect to the input x. This is easy when the search space Y(x) is small enough to enumerate, or when the scoring function Ψ has a convenient decomposition into parts. In many cases, we will want to work with scoring functions that do not have these properties, motivating the use of more sophisticated search algorithms. Because the outputs are usually discrete in language processing problems, search often relies on the machinery of combinatorial optimization.
⁵ Throughout this text, equations will be numbered by square brackets, and linguistic examples will be numbered by parentheses.
Learning. The learning module is responsible for finding the parameters θ. This is typically (but not always) done by processing a large dataset of labeled examples, {(x^{(i)}, y^{(i)})}_{i=1}^N. Like search, learning is also approached through the framework of optimization, as we will see in chapter 2. Because the parameters are usually continuous, learning algorithms generally rely on numerical optimization, searching over vectors of real numbers for parameters that optimize some function of the model and the labeled data. Some basic principles of numerical optimization are reviewed in Appendix B.
The division of natural language processing into separate modules for search and learning makes it possible to reuse generic algorithms across a range of different tasks and models. This means that the work of natural language processing can be focused on the design of the model Ψ, while reaping the benefits of decades of progress in search, optimization, and learning. Much of this textbook will focus on specific classes of scoring functions, and on the algorithms that make it possible to search and learn efficiently with them.
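The modular division can be sketched in code. The example below assumes a linear scoring function Ψ(x, y; θ) = θ · f(x, y) over (word, label) count features, a label set small enough to enumerate, and perceptron updates for learning. These are illustrative choices, not the only ones the text has in mind; the label names and the four training examples are invented.

```python
from collections import defaultdict

LABELS = ["positive", "negative"]   # hypothetical label set Y

def features(x, y):
    """f(x, y): counts of (word, label) pairs for a bag-of-words input."""
    f = defaultdict(float)
    for word in x.split():
        f[(word, y)] += 1.0
    return f

def score(x, y, theta):
    """Psi(x, y; theta) = theta . f(x, y)."""
    return sum(theta[k] * v for k, v in features(x, y).items())

def search(x, theta, labels=LABELS):
    """Search module: argmax by enumerating a small label set."""
    return max(labels, key=lambda y: score(x, y, theta))

def learn(data, epochs=5, labels=LABELS):
    """Learning module: perceptron updates, calling the search module as a black box."""
    theta = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            y_hat = search(x, theta, labels)
            if y_hat != y:
                for k, v in features(x, y).items():
                    theta[k] += v          # reward features of the true label
                for k, v in features(x, y_hat).items():
                    theta[k] -= v          # penalize features of the wrong prediction
    return theta

# Toy labeled examples, invented for illustration.
data = [("great fun", "positive"), ("dull plot", "negative"),
        ("great plot", "positive"), ("dull fun", "negative")]
theta = learn(data)
```

Note that `learn` never inspects how `search` works: swapping in a different search procedure, or a different feature function, leaves the other module unchanged, which is the reuse the paragraph above describes.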
When a model is capable of making subtle linguistic distinctions, it is said to be expressive. Expressiveness is often traded off against the efficiency of search and learning. For example, a word-to-word translation model makes search and learning easy, but it is not expressive enough to distinguish good translations from bad ones. Unfortunately, many of the most important problems in natural language processing seem to require expressive models, in which the complexity of search grows exponentially with the size of the input. In these models, exact search is usually impossible. Intractability threatens the neat modular decomposition between search and learning: if search requires a set of heuristic approximations, then it may be advantageous to learn a model that performs well under these specific heuristics. This has motivated some researchers to take a more integrated approach to search and learning, as briefly mentioned in chapters 11 and 15.
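To see why exhaustive search becomes infeasible, consider labeling M words with one of K tags each: there are K^M candidate sequences. The sketch below compares exhaustive enumeration with beam search, one of the heuristic approximations listed later in the chapter. The tag set and the weights in `EMIT` and `TRANS` are invented. This toy scoring function happens to decompose into local parts, so exact dynamic programming would also apply; it is used here only because it keeps the example small.

```python
from itertools import product

TAGS = ["D", "A", "N", "V"]   # hypothetical tag set, K = 4

# Hypothetical weights: word-tag scores and tag-to-tag transition scores.
EMIT = {("the", "D"): 2.0, ("old", "A"): 1.0, ("old", "N"): 0.8,
        ("man", "N"): 1.0, ("man", "V"): 0.9}
TRANS = {("D", "A"): 0.5, ("D", "N"): 0.5, ("A", "N"): 0.5, ("N", "V"): 1.5}

def score(words, tags):
    """Score a full or partial tag sequence aligned with the start of words."""
    total = 0.0
    for m, tag in enumerate(tags):
        total += EMIT.get((words[m], tag), 0.0)
        if m > 0:
            total += TRANS.get((tags[m - 1], tag), 0.0)
    return total

def exhaustive_search(words, tags=TAGS):
    """Exact search: enumerate all K**M tag sequences."""
    return max(product(tags, repeat=len(words)), key=lambda seq: score(words, seq))

def beam_search(words, tags=TAGS, width=2):
    """Approximate search: keep only the `width` best prefixes at each position."""
    beam = [()]
    for _ in range(len(words)):
        candidates = [prefix + (t,) for prefix in beam for t in tags]
        candidates.sort(key=lambda seq: score(words, seq), reverse=True)
        beam = candidates[:width]
    return beam[0]

words = ["the", "old", "man"]
exact = exhaustive_search(words)            # examines 4**3 = 64 sequences
greedy = beam_search(words, width=1)        # commits to the locally best tag
wider = beam_search(words, width=2)         # keeps one alternative alive
```

With these weights the greedy search commits early to tagging old as an adjective and misses the higher-scoring analysis in which man is a verb, while a beam of width 2 recovers it. A wider beam costs more time but is still not guaranteed to find the best sequence in general.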
(1.3) Umashanthi interviewed Ana. She works for the college newspaper.
Who works for the college newspaper? The word journalist, while not stated in the example, implicitly links the interview to the newspaper, making Umashanthi the most likely referent for the pronoun. (A general discussion of how to resolve pronouns is found in chapter 15.)
Yet despite the inferential power of the relational perspective, it is not easy to formalize computationally. Exactly which elements are to be related? Are journalists and reporters distinct, or should we group them into a single unit? Is the kind of interview performed by a journalist the same as the kind that one undergoes when applying for a job? Ontology designers face many such thorny questions, and the project of ontology design hearkens back to Borges’ (1993) Celestial Emporium of Benevolent Knowledge, which divides animals into:
    (a) belonging to the emperor; (b) embalmed; (c) tame; (d) suckling pigs; (e) sirens; (f) fabulous; (g) stray dogs; (h) included in the present classification; (i) frenzied; (j) innumerable; (k) drawn with a very fine camelhair brush; (l) et cetera; (m) having just broken the water pitcher; (n) that from a long way off resemble flies.
Difficulties in ontology construction have led some linguists to argue that there is no task-independent way to partition up word meanings (Kilgarriff, 1997).
Some problems are easier. Each member in a group of journalists is a journalist: the -s suffix distinguishes the plural meaning from the singular in most of the nouns in English. Similarly, a journalist can be thought of, perhaps colloquially, as someone who produces or works on a journal. (Taking this approach even further, the word journal derives from the French jour+nal, or day+ly = daily.) In this way, the meaning of a word is constructed from the constituent parts: the principle of compositionality. This principle can be applied to larger units: phrases, sentences, and beyond. Indeed, one of the great strengths of the compositional view of meaning is that it provides a roadmap for understanding entire texts and dialogues through a single analytic lens, grounding out in the smallest parts of individual words.
But alongside journalists and anti-parliamentarians, there are many words that seem to be linguistic atoms: think, for example, of whale, blubber, and Nantucket. Furthermore, idiomatic phrases like kick the bucket and shoot the breeze have meanings that are quite different from the sum of their parts (Sag et al., 2002). Composition is of little help for such words and expressions, but their meanings can be ascertained (or at least approximated) from the contexts in which they appear. Take, for example, blubber, which appears in such contexts as:
(1.6) Amongst oily substances, blubber has been employed as a manure.
These contexts form the distributional properties of the word blubber, and they link it to words which can appear in similar constructions: fat, pelts, and barnacles. This distributional perspective makes it possible to learn about meaning from unlabeled data alone; unlike relational and compositional semantics, no manual annotation or expert knowledge is required. Distributional semantics is thus capable of covering a huge range of linguistic phenomena. However, it lacks precision: blubber is similar to fat in one sense, to pelts in another sense, and to barnacles in still another. The question of why all these words tend to appear in the same contexts is left unanswered.
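A minimal sketch of the distributional idea: represent each word by counts of the words that appear near it, then compare words by the cosine of the angle between their count vectors. The three sentences are invented for illustration and are not drawn from any corpus; the window size of 2 and the use of raw counts are likewise assumptions.

```python
from collections import Counter, defaultdict
import math

def context_vectors(sentences, window=2):
    """Map each word to counts of the words within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        tokens = sent.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy sentences, invented for illustration; no labels or annotation are used.
corpus = [
    "the sailors boiled the blubber for oil",
    "the sailors boiled the fat for oil",
    "the ship sailed across the sea",
]
vecs = context_vectors(corpus)
sim_fat = cosine(vecs["blubber"], vecs["fat"])
sim_ship = cosine(vecs["blubber"], vecs["ship"])
```

The similarity score says that blubber and fat occur in alike contexts, but, as noted above, it does not say in what sense the two words are related.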
The relational, compositional, and distributional perspectives all contribute to our understanding of linguistic meaning, and all three appear to be critical to natural language processing. Yet they are uneasy collaborators, requiring seemingly incompatible representations and algorithmic approaches. This text presents some of the best known and most successful methods for working with each of these representations, but it is hoped that future research will reveal new ways to combine them.
Search. Viterbi, CKY, minimum spanning tree, shift-reduce, integer linear programming, beam search.
Learning. Naïve Bayes, logistic regression, perceptron, expectation-maximization, matrix factorization, backpropagation, recurrent neural networks.
This text explains how these methods work, and how they can be applied to problems that arise in the computer processing of natural language: document classification, word sense disambiguation, sequence labeling (part-of-speech tagging and named entity recognition), parsing, coreference resolution, relation extraction, discourse analysis, language modeling, and machine translation.
summary of what is expected, and where you can learn more:
Mathematics and machine learning. The text assumes a background in multivariate calculus and linear algebra: vectors, matrices, derivatives, and partial derivatives. You should also be familiar with probability and statistics. A review of basic probability is found in Appendix A, and a minimal review of numerical optimization is found in Appendix B. For linear algebra, the online course and textbook from Strang (2016) are excellent sources of review material. Deisenroth et al. (2018) are currently preparing a textbook on Mathematics for Machine Learning, and several chapters can be found online.⁶ For an introduction to probabilistic modeling and estimation, see James et al. (2013); for a more advanced and comprehensive discussion of the same material, the classic reference is Hastie et al. (2009).
Linguistics. This book assumes no formal training in linguistics, aside from elementary concepts like nouns and verbs, which you have probably encountered in the study of English grammar. Ideas from linguistics are introduced throughout the text as needed, including discussions of morphology and syntax (chapter 9), semantics (chapters 12 and 13), and discourse (chapter 16). Linguistic issues also arise in the application-focused chapters 4, 8, and 18. A short guide to linguistics for students of natural language processing is offered by Bender (2013); you are encouraged to start there, and then pick up a more comprehensive introductory textbook (e.g., Akmajian et al., 2010; Fromkin et al., 2013).
Computer science. The book is targeted at computer scientists, who are assumed to have taken introductory courses on the analysis of algorithms and complexity theory. In particular, you should be familiar with asymptotic analysis of the time and memory costs of algorithms, and should have seen dynamic programming. The classic text on algorithms is offered by Cormen et al. (2009); for an introduction to the theory of computation, see Arora and Barak (2009) and Sipser (2012).
686 Learning. This section builds up a set of machine learning tools that will be used through-
687 out the rest of the textbook. Because the focus is on machine learning, the text
688 representations and linguistic phenomena are mostly simple: “bag-of-words” text
689 classification is treated as a model example. Chapter 4 describes some of the more
690 linguistically interesting applications of word-based text analysis.
6 https://mml-book.github.io/
691 Sequences and trees. This section introduces the treatment of language as a structured
phenomenon. It describes sequence and tree representations and the algorithms that
693 they facilitate, as well as the limitations that these representations impose. Chap-
694 ter 9 introduces finite state automata and briefly overviews a context-free account of
695 English syntax.
696 Meaning. This section takes a broad view of efforts to represent and compute meaning
from text, ranging from formal logic to neural word embeddings. It also includes
698 two topics that are closely related to semantics: resolution of ambiguous references,
699 and analysis of multi-sentence discourse structure.
700 Applications. The final section offers chapter-length treatments on three of the most promi-
701 nent applications of natural language processing: information extraction, machine
translation, and text generation. Each of these applications merits a textbook-length
treatment of its own (Koehn, 2009; Grishman, 2012; Reiter and Dale, 2000); the chapters
here explain some of the best-known systems using the formalisms and
705 methods built up earlier in the book, while introducing methods such as neural at-
706 tention.
707 Each chapter contains some advanced material, which is marked with an asterisk.
708 This material can be safely omitted without causing misunderstandings later on. But
709 even without these advanced sections, the text is too long for a single semester course, so
710 instructors will have to pick and choose among the chapters.
711 Chapters 2 and 3 provide building blocks that will be used throughout the book, and
chapter 4 describes some critical aspects of the practice of language technology. Language
models (chapter 6), sequence labeling (chapter 7), and parsing (chapters 10 and 11)
714 are canonical topics in natural language processing, and distributed word embeddings
715 (chapter 14) are so ubiquitous that students will complain if you leave them out. Of the
716 applications, machine translation (chapter 18) is the best choice: it is more cohesive than
717 information extraction, and more mature than text generation. In my experience, nearly
718 all students benefit from the review of probability in Appendix A.
719 • A course focusing on machine learning should add the chapter on unsupervised
720 learning (chapter 5). The chapters on predicate-argument semantics (chapter 13),
721 reference resolution (chapter 15), and text generation (chapter 19) are particularly
722 influenced by recent machine learning innovations, including deep neural networks
723 and learning to search.
724 • A course with a more linguistic orientation should add the chapters on applica-
725 tions of sequence labeling (chapter 8), formal language theory (chapter 9), semantics
(chapters 12 and 13), and discourse (chapter 16).
727 • For a course with a more applied focus — for example, a course targeting under-
728 graduates — I recommend the chapters on applications of sequence labeling (chap-
729 ter 8), predicate-argument semantics (chapter 13), information extraction (chapter 17),
730 and text generation (chapter 19).
731 Acknowledgments
732 Several of my colleagues and students read early drafts of chapters in their areas of exper-
733 tise, including Yoav Artzi, Kevin Duh, Heng Ji, Jessy Li, Brendan O’Connor, Yuval Pinter,
734 Nathan Schneider, Pamela Shapiro, Noah A. Smith, Sandeep Soni, and Luke Zettlemoyer.
735 I would also like to thank the following people for helpful discussions of the material:
736 Kevin Murphy, Shawn Ling Ramirez, William Yang Wang, and Bonnie Webber. Several
737 students, colleagues, friends, and family found mistakes in early drafts: Parminder Bha-
738 tia, Kimberly Caras, Barbara Eisenstein, Chris Gu, Joshua Killingsworth, Jonathan May,
739 Taha Merghani, Gus Monod, Raghavendra Murali, Nidish Nair, Brendan O’Connor, Yuval
740 Pinter, Nathan Schneider, Zhewei Sun, Ashwin Cunnapakkam Vinjimur, Clay Washing-
741 ton, Ishan Waykul, and Yuyu Zhang. Special thanks to the many students in Georgia
742 Tech’s CS 4650 and 7650 who suffered through early versions of the text.
Part I: Learning
Chapter 2: Linear text classification
747 We’ll start with the problem of text classification: given a text document, assign it a dis-
748 crete label y ∈ Y, where Y is the set of possible labels. This problem has many appli-
749 cations, from spam filtering to analysis of electronic health records. Text classification is
750 also a building block that is used throughout more complex natural language processing
751 tasks.
To perform this task, the first question is how to represent each document. A common
approach is to use a vector of word counts, e.g., $x = [0, 1, 1, 0, 0, 2, 0, 1, 13, 0, \ldots]^\top$, where
$x_j$ is the count of word j. The length of x is $V \triangleq |\mathcal{V}|$, where $\mathcal{V}$ is the set of possible words
in the vocabulary.
756 The object x is a vector, but colloquially we call it a bag of words, because it includes
757 only information about the count of each word, and not the order in which the words
758 appear. We have thrown out grammar, sentence boundaries, paragraphs — everything
759 but the words. Yet the bag of words model is surprisingly effective for text classification.
760 If you see the word freeee in an email, is it a spam email? What if you see the word
761 Bayesian? For many labeling problems, individual words can be strong predictors.
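As a minimal sketch of the bag-of-words representation described above, the following Python snippet counts words against a fixed vocabulary. The function name `bag_of_words`, the whitespace tokenization, and the toy vocabulary are illustrative assumptions, not part of the text.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return a list of word counts, one entry per vocabulary word."""
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    counts = Counter(text.lower().split())
    # Words outside the vocabulary are silently dropped.
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary; a real one would hold thousands of words.
vocabulary = ["bayesian", "best", "freeee", "it", "of", "the", "times", "was", "worst"]

x = bag_of_words("it was the best of times it was the worst of times", vocabulary)
shuffled = bag_of_words("times of worst the was it times of best the was it", vocabulary)

print(x)              # [0, 1, 0, 2, 2, 2, 2, 2, 1]
print(x == shuffled)  # True: word order is discarded
```

The second call shows the defining property of the representation: any reordering of the same words yields the same vector.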
762 To predict a label from a bag-of-words, we can assign a score to each word in the
763 vocabulary, measuring the compatibility with the label. In the spam filtering case, we
764 might assign a positive score to the word freeee for the label SPAM, and a negative score
765 to the word Bayesian. These scores are called weights, and they are arranged in a column
766 vector θ.
Suppose that you want a multiclass classifier, where $K \triangleq |\mathcal{Y}| > 2$. For example, we
768 might want to classify news stories about sports, celebrities, music, and business. The goal
769 is to predict a label ŷ, given the bag of words x, using the weights θ. For each label y ∈ Y,
770 we compute a score Ψ(x, y), which is a scalar measure of the compatibility between the
771 bag-of-words x and the label y. In a linear bag-of-words classifier, this score is the vector
30 CHAPTER 2. LINEAR TEXT CLASSIFICATION
inner product between the weights θ and the output of a feature function f(x, y),
$$\Psi(x, y) = \theta \cdot f(x, y) = \sum_j \theta_j f_j(x, y). \tag{2.1}$$
773 As the notation suggests, f is a function of two arguments, the word counts x and the
774 label y, and it returns a vector output. For example, given arguments x and y, element j
775 of this feature vector might be,
$$f_j(x, y) = \begin{cases} x_{\textit{freeee}}, & \text{if } y = \text{SPAM} \\ 0, & \text{otherwise} \end{cases} \tag{2.2}$$
This function returns the count of the word freeee if the label is SPAM, and it returns zero
otherwise. The corresponding weight $\theta_j$ then scores the compatibility of the word freeee
with the label SPAM. A positive score means that this word makes the label more likely.
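To formalize this feature function, we define f(x, y) as a column vector,
$$f(x, y = 1) = [x; \underbrace{0; 0; \ldots; 0}_{(K-1) \times V}] \tag{2.3}$$
$$f(x, y = 2) = [\underbrace{0; 0; \ldots; 0}_{V}; x; \underbrace{0; 0; \ldots; 0}_{(K-2) \times V}] \tag{2.4}$$
$$f(x, y = K) = [\underbrace{0; 0; \ldots; 0}_{(K-1) \times V}; x], \tag{2.5}$$
where $\underbrace{[0; 0; \ldots; 0]}_{(K-1) \times V}$ is a column vector of (K − 1) × V zeros, and the semicolon indicates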
780 vertical concatenation. This arrangement is shown in Figure 2.1; the notation may seem
781 awkward at first, but it generalizes to an impressive range of learning settings.
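The block structure of f(x, y) can be sketched in a few lines of NumPy. This is only an illustration: the name `feature_function` and the use of 0-indexed labels are assumptions made for the example.

```python
import numpy as np

def feature_function(x, y, num_labels):
    """Place the count vector x in block y of a length K*V vector of zeros."""
    V = len(x)
    f = np.zeros(num_labels * V)
    # Labels are 0-indexed here, so block y spans positions y*V .. (y+1)*V - 1.
    f[y * V:(y + 1) * V] = x
    return f

x = np.array([1, 0, 2])                       # toy word counts, V = 3
f = feature_function(x, y=1, num_labels=4)    # K = 4 labels
print(f)  # [0. 0. 0. 1. 0. 2. 0. 0. 0. 0. 0. 0.]
```

Only the block belonging to the chosen label is nonzero, so an inner product with a weight vector of the same length picks out the weights for that label alone.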
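Given a vector of weights, $\theta \in \mathbb{R}^{V \times K}$, we can now compute the score Ψ(x, y). This
inner product gives a scalar measure of the compatibility of the observation x with label
y.1 For any document x, we predict the label ŷ,
$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} \Psi(x, y) \tag{2.6}$$
$$\Psi(x, y) = \theta \cdot f(x, y). \tag{2.7}$$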
782 This inner product notation gives a clean separation between the data (x and y) and the
783 parameters (θ). This notation also generalizes nicely to structured prediction, in which
1
Only V × (K − 1) features and weights are necessary. By stipulating that Ψ(x, y = K) = 0 regardless of
x, it is possible to implement any classification rule that can be achieved with V × K features and weights.
This is the approach taken in binary classification rules like y = Sign(β ·x+a), where β is a vector of weights,
a is an offset, and the label set is Y = {−1, 1}. However, for multiclass classification, it is more concise to
write θ · f (x, y) for all y ∈ Y.
[Figure: the text "It was the best of times, it was the worst of times..." is mapped to its word-count vector x (entries from aardvark to zyxt, plus an <OFFSET> entry equal to 1); in f(x, y=News), x fills the block for y=News, and the blocks for y=Fiction, y=Gossip, and y=Sports are zero.]
Figure 2.1: The bag-of-words and feature vector representations, for a hypothetical text
classification task.
784 the space of labels Y is very large, and we want to model shared substructures between
785 labels.
786 It is common to add an offset feature at the end of the vector of word counts x, which
787 is always 1. We then have to also add an extra zero to each of the zero vectors, to make the
788 vector lengths match. This gives the entire feature vector f (x, y) a length of (V + 1) × K.
789 The weight associated with this offset feature can be thought of as a bias for or against
790 each label. For example, if we expect most documents to be spam, then the weight for
the offset feature for y = SPAM should be larger than the weight for the offset feature for
y = HAM.
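The prediction rule with an offset feature might be sketched as follows. Because f(x, y) is zero outside the block for label y, reshaping θ into one row per label gives all K scores in a single matrix-vector product. The weights, the two-word vocabulary, and the label indices are hypothetical values chosen for the example.

```python
import numpy as np

def predict(x, theta, num_labels):
    """Return (argmax_y theta . f(x, y), all scores), with an offset appended to x."""
    x_off = np.append(x, 1.0)                 # offset feature, always 1
    # One row of (V + 1) weights per label; row y dotted with x_off equals
    # theta . f(x, y) under the block layout of f.
    scores = theta.reshape(num_labels, -1) @ x_off
    return int(np.argmax(scores)), scores

# Hypothetical weights for labels 0 = HAM, 1 = SPAM over the words [freeee, bayesian].
theta = np.array([-1.0,  2.0, 0.0,    # HAM:  freeee, bayesian, offset
                   3.0, -2.0, 0.5])   # SPAM: freeee, bayesian, offset

label, scores = predict(np.array([2, 0]), theta, num_labels=2)
print(label, scores)  # 1 [-2.   6.5]
```

The offset weight of 0.5 for SPAM acts as a bias toward that label regardless of the words in the document.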
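Returning to the weights θ, where do they come from? One possibility is to set them
by hand. If we want to distinguish, say, English from Spanish, we can use English
and Spanish dictionaries, and set the weight to one for each word that appears in the
corresponding dictionary. For example,2
$$\begin{aligned}
\theta_{(E,\textit{bicycle})} &= 1 & \theta_{(S,\textit{bicycle})} &= 0 \\
\theta_{(E,\textit{bicicleta})} &= 0 & \theta_{(S,\textit{bicicleta})} &= 1 \\
\theta_{(E,\textit{con})} &= 1 & \theta_{(S,\textit{con})} &= 1 \\
\theta_{(E,\textit{ordinateur})} &= 0 & \theta_{(S,\textit{ordinateur})} &= 0.
\end{aligned}$$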
793 Similarly, if we want to distinguish positive and negative sentiment, we could use posi-
794 tive and negative sentiment lexicons (see § 4.1.2), which are defined by social psycholo-
795 gists (Tausczik and Pennebaker, 2010).
796 But it is usually not easy to set classification weights by hand, due to the large number
797 of words and the difficulty of selecting exact numerical weights. Instead, we will learn the
798 weights from data. Email users manually label messages as SPAM; newspapers label their
799 own articles as BUSINESS or STYLE. Using such instance labels, we can automatically
800 acquire weights using supervised machine learning. This chapter will discuss several
801 machine learning approaches for classification. The first is based on probability. For a
802 review of probability, consult Appendix A.
2 In this notation, each tuple (language, word) indexes an element in θ, which remains a vector.
3 The notation $p_{X,Y}(x^{(i)}, y^{(i)})$ indicates the joint probability that random variables X and Y take the
specific values $x^{(i)}$ and $y^{(i)}$ respectively. The subscript will often be omitted when it is clear from context.
For a review of random variables, see Appendix A.
The notation $p(x^{(i)}, y^{(i)}; \theta)$ indicates that θ is a parameter of the probability function. The
product of probabilities can be replaced by a sum of log-probabilities because the log function
is monotonically increasing over positive arguments, and so the same θ will maximize
both the probability and its logarithm. Working with logarithms is desirable because
of numerical stability: on a large dataset, multiplying many probabilities can underflow
to zero.4
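The underflow problem is easy to reproduce. The sketch below multiplies one hundred hypothetical instance probabilities of 10^-5 each; the true product, 10^-500, is far below the smallest positive double-precision number, while the sum of base-2 logarithms is an ordinary float.

```python
import math

probs = [1e-5] * 100          # hypothetical per-instance probabilities

product = 1.0
for p in probs:
    product *= p              # the true value 1e-500 cannot be represented

log_sum = sum(math.log2(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # roughly -1660.96
```

Since the logarithm is monotonic, comparing `log_sum` across parameter settings gives the same ranking that the (unrepresentable) products would.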
The probability $p(x^{(i)}, y^{(i)}; \theta)$ is defined through a generative model: an idealized
random process that has generated the observed data.5 Algorithm 1 describes the generative
model for the Naïve Bayes classifier, with parameters θ = {µ, φ}.
817 • The first line of this generative model encodes the assumption that the instances are
818 mutually independent: neither the label nor the text of document i affects the label
819 or text of document j.6 Furthermore, the instances are identically distributed: the
820 distributions over the label y (i) and the text x(i) (conditioned on y (i) ) are the same
821 for all instances i.
822 • The second line of the generative model states that the random variable y (i) is drawn
823 from a categorical distribution with parameter µ. Categorical distributions are like
weighted dice: the vector $\mu = [\mu_1, \mu_2, \ldots, \mu_K]^\top$ gives the probabilities of each label,
so that the probability of drawing label y is equal to $\mu_y$. For example, if $\mathcal{Y}$ =
{POSITIVE, NEGATIVE, NEUTRAL}, we might have $\mu = [0.1, 0.7, 0.2]^\top$. We require
$\sum_{y \in \mathcal{Y}} \mu_y = 1$ and $\mu_y \geq 0, \forall y \in \mathcal{Y}$.7
828 • The third line describes how the bag-of-words counts x(i) are generated. By writing
829 x(i) | y (i) , this line indicates that the word counts are conditioned on the label, so
4
Throughout this text, you may assume all logarithms and exponents are base 2, unless otherwise indi-
cated. Any reasonable base will yield an identical classifier, and base 2 is most convenient for working out
examples by hand.
5
Generative models will be used throughout this text. They explicitly define the assumptions underlying
the form of a probability distribution over observed and latent variables. For a readable introduction to
generative models in statistics, see Blei (2014).
6
Can you think of any cases in which this assumption is too strong?
7 Formally, we require $\mu \in \Delta^{K-1}$, where $\Delta^{K-1}$ is the K − 1 probability simplex, the set of all vectors of
K nonnegative numbers that sum to one. Because of the sum-to-one constraint, there are K − 1 degrees of
freedom for a vector of size K.
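that the joint probability is factored using the chain rule,
$$p_{X,Y}(x^{(i)}, y^{(i)}) = p_{X \mid Y}(x^{(i)} \mid y^{(i)}) \times p_Y(y^{(i)}).$$
The Naïve Bayes prediction rule is to choose the label y which maximizes log p(x, y; µ, φ):
$$\hat{y} = \operatorname*{argmax}_y \log p(x, y; \mu, \phi) = \operatorname*{argmax}_y \log p(x \mid y; \phi) + \log p(y; \mu).$$
Now we can plug in the probability distributions from the generative story.
$$\log p(x \mid y; \phi) + \log p(y; \mu) = \log \Big[ B(x) \prod_{j=1}^{V} \phi_{y,j}^{x_j} \Big] + \log \mu_y \tag{2.16}$$
$$= \log B(x) + \sum_{j=1}^{V} x_j \log \phi_{y,j} + \log \mu_y \tag{2.17}$$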
where
869 The feature function f (x, y) is a vector of V word counts and an offset, padded by
870 zeros for the labels not equal to y (see Equations 2.3-2.5, and Figure 2.1). This construction
871 ensures that the inner product θ · f (x, y) only activates the features whose weights are
in $\theta^{(y)}$. These features and weights are all we need to compute the joint log-probability
873 log p(x, y) for each y. This is a key point: through this notation, we have converted the
874 problem of computing the log-likelihood for a document-label pair (x, y) into the compu-
875 tation of a vector inner product.
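To make the equivalence concrete, the sketch below builds a weight vector from log word probabilities and log priors, as suggested by Equation 2.17, and checks that θ · f(x, y) matches the directly computed log-probability. The parameter values are hypothetical, labels are 0-indexed, and the term log B(x) is left out because it does not depend on y.

```python
import numpy as np

# Hypothetical parameters for K = 2 labels and V = 3 words.
phi = np.array([[0.5,   0.25,  0.25],
                [0.125, 0.125, 0.75]])
mu = np.array([0.5, 0.5])
x = np.array([2, 0, 1])
K, V = phi.shape

# For each label, stack the log word probabilities and then the log prior.
theta = np.concatenate(
    [np.append(np.log2(phi[y]), np.log2(mu[y])) for y in range(K)])

def f(x, y):
    """Counts plus an offset in block y, zeros elsewhere."""
    out = np.zeros(K * (V + 1))
    out[y * (V + 1):(y + 1) * (V + 1)] = np.append(x, 1)
    return out

inner = [float(theta @ f(x, y)) for y in range(K)]
direct = [float(x @ np.log2(phi[y]) + np.log2(mu[y])) for y in range(K)]
print(inner[0])                    # -5.0
print(np.allclose(inner, direct))  # True
```

For label 0 the score is 2·log2(0.5) + log2(0.25) + log2(0.5) = −5, the same value obtained from the inner product.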
880 where count(y, j) refers to the count of word j in documents with label y.
Equation 2.21 defines the relative frequency estimate for φ. It can be justified as a
maximum likelihood estimate: the estimate that maximizes the probability $p(x^{(1:N)}, y^{(1:N)}; \theta)$.
Based on the generative model in Algorithm 1, the log-likelihood is,
$$\mathcal{L}(\phi, \mu) = \sum_{i=1}^{N} \log p_{\text{mult}}(x^{(i)}; \phi_{y^{(i)}}) + \log p_{\text{cat}}(y^{(i)}; \mu), \tag{2.22}$$
which is now written as a function $\mathcal{L}$ of the parameters φ and µ. Let's continue to focus
on the parameters φ. Since p(y) is constant with respect to φ, we can drop it:
$$\mathcal{L}(\phi) = \sum_{i=1}^{N} \log p_{\text{mult}}(x^{(i)}; \phi_{y^{(i)}}) = \sum_{i=1}^{N} \log B(x^{(i)}) + \sum_{j=1}^{V} x_j^{(i)} \log \phi_{y^{(i)},j}, \tag{2.23}$$
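These constraints can be incorporated by adding a set of Lagrange multipliers (see Appendix B
for more details). Solving separately for each label y, we obtain the Lagrangian,
$$\ell(\phi_y) = \sum_{i: y^{(i)} = y} \sum_{j=1}^{V} x_j^{(i)} \log \phi_{y,j} - \lambda \Big( \sum_{j=1}^{V} \phi_{y,j} - 1 \Big). \tag{2.25}$$
Differentiating with respect to the parameter $\phi_{y,j}$ gives,
$$\frac{\partial \ell(\phi_y)}{\partial \phi_{y,j}} = \sum_{i: y^{(i)} = y} \frac{x_j^{(i)}}{\phi_{y,j}} - \lambda. \tag{2.26}$$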
The solution is obtained by setting each element in this vector of derivatives equal to zero,
$$\lambda \phi_{y,j} = \sum_{i: y^{(i)} = y} x_j^{(i)} \tag{2.27}$$
$$\phi_{y,j} \propto \sum_{i: y^{(i)} = y} x_j^{(i)} = \sum_{i=1}^{N} \delta(y^{(i)} = y)\, x_j^{(i)} = \text{count}(y, j), \tag{2.28}$$
where $\delta(y^{(i)} = y)$ is a delta function, also sometimes called an indicator function, which
returns one if $y^{(i)} = y$, and zero otherwise. Equation 2.28 shows three different notations
for the same thing: a sum over the word counts for all documents i such that the label
$y^{(i)} = y$. This gives a solution for each $\phi_y$ up to a constant of proportionality. Now recall
the constraint $\sum_{j=1}^{V} \phi_{y,j} = 1$, which arises because $\phi_y$ represents a vector of probabilities
for each word in the vocabulary. This constraint leads to an exact solution,
$$\phi_{y,j} = \frac{\text{count}(y, j)}{\sum_{j'=1}^{V} \text{count}(y, j')}. \tag{2.29}$$
This is equal to the relative frequency estimator from Equation 2.21. A similar derivation
gives $\mu_y \propto \sum_{i=1}^{N} \delta(y^{(i)} = y)$.
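The relative frequency estimates can be computed directly from a matrix of word counts. The following is a minimal sketch with a three-document toy corpus; the function name `estimate_naive_bayes` and the 0-indexed integer labels are assumptions of the example.

```python
import numpy as np

def estimate_naive_bayes(X, y, num_labels):
    """Relative-frequency estimates of phi (K x V) and mu (length K)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # counts[k, j] = count(k, j): total count of word j in documents labeled k.
    counts = np.stack([X[y == k].sum(axis=0) for k in range(num_labels)])
    # Normalize each row so it sums to one (assumes every label has some words).
    phi = counts / counts.sum(axis=1, keepdims=True)
    mu = np.bincount(y, minlength=num_labels) / len(y)
    return phi, mu

X = [[2, 0, 1],
     [1, 1, 0],
     [0, 3, 1]]
y = [0, 0, 1]
phi, mu = estimate_naive_bayes(X, y, num_labels=2)
print(phi)  # rows: [0.6, 0.2, 0.2] and [0.0, 0.75, 0.25]
print(mu)   # about [0.667, 0.333]
```

Note that the first word never occurs with label 1, so its estimated probability is exactly zero and its log-probability is negative infinity.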
907 • Unbiased classifiers may overfit the training data, yielding poor performance on
908 unseen data.
909 • But if the smoothing is too large, the resulting classifier can underfit instead. In the
910 limit of α → ∞, there is zero variance: you get the same classifier, regardless of the
911 data. However, the bias is likely to be large.
919 set of values (e.g., α ∈ {0.001, 0.01, 0.1, 1, 10}), compute the accuracy for each value, and
920 choose the setting that maximizes the accuracy.
921 The goal is to tune α so that the classifier performs well on unseen data. For this reason,
922 the data used for hyperparameter tuning should not overlap the training set, where very
923 small values of α will be preferred. Instead, we hold out a development set (also called
924 a tuning set) for hyperparameter selection. This development set may consist of a small
925 fraction of the labeled data, such as 10%.
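A grid search over α on a development set might look like the sketch below. It assumes the usual additive smoothing, in which α is added to every word count before normalizing; the helper names, the toy data, and the 0-indexed labels are all illustrative.

```python
import numpy as np

def train_nb(X, y, K, alpha):
    """Naive Bayes log-parameters with additive smoothing alpha (assumed form)."""
    counts = np.stack([X[y == k].sum(axis=0) for k in range(K)]) + alpha
    log_phi = np.log(counts / counts.sum(axis=1, keepdims=True))
    log_mu = np.log(np.bincount(y, minlength=K) / len(y))
    return log_phi, log_mu

def accuracy(model, X, y):
    log_phi, log_mu = model
    predictions = np.argmax(X @ log_phi.T + log_mu, axis=1)
    return float(np.mean(predictions == y))

def grid_search(X_train, y_train, X_dev, y_dev, K, alphas):
    """Train once per alpha; score each model on the development set only."""
    scores = {a: accuracy(train_nb(X_train, y_train, K, a), X_dev, y_dev)
              for a in alphas}
    return max(scores, key=scores.get), scores

# Toy data: rows are documents, columns are word counts.
X_train = np.array([[3., 0., 0.], [2., 1., 0.], [0., 2., 1.], [0., 3., 0.]])
y_train = np.array([0, 0, 1, 1])
X_dev = np.array([[1., 0., 1.], [0., 1., 0.]])
y_dev = np.array([0, 1])

alphas = [0.001, 0.01, 0.1, 1, 10]
best_alpha, dev_scores = grid_search(X_train, y_train, X_dev, y_dev, 2, alphas)
```

The training data is never used to score a setting of α; when several values tie on the development set, this sketch simply keeps the first one.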
926 We also want to predict the performance of our classifier on unseen data. To do this,
927 we must hold out a separate subset of data, called the test set. It is critical that the test set
928 not overlap with either the training or development sets, or else we will overestimate the
929 performance that the classifier will achieve on unlabeled data in the future. The test set
930 should also not be used when making modeling decisions, such as the form of the feature
function, the size of the vocabulary, and so on (these decisions are reviewed in chapter 4).
932 The ideal practice is to use the test set only once — otherwise, the test set is used to guide
933 the classifier design, and test set accuracy will diverge from accuracy on truly unseen
934 data. Because annotated data is expensive, this ideal can be hard to follow in practice,
935 and many test sets have been used for decades. But in some high-impact applications like
936 machine translation and information extraction, new test sets are released every year.
937 When only a small amount of labeled data is available, the test set accuracy can be
938 unreliable. K-fold cross-validation is one way to cope with this scenario: the labeled
939 data is divided into K folds, and each fold acts as the test set, while training on the other
940 folds. The test set accuracies are then aggregated. In the extreme, each fold is a single data
941 point; this is called leave-one-out cross-validation. To perform hyperparameter tuning in
942 the context of cross-validation, another fold can be used for grid search. It is important
943 not to repeatedly evaluate the cross-validated accuracy while making design decisions
944 about the classifier, or you will overstate the accuracy on truly unseen data.
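The fold construction can be sketched as a small generator of index lists. The function name `k_fold_splits` is hypothetical, and the sketch assumes the instances have already been shuffled, since it assigns every K-th instance to the same fold.

```python
def k_fold_splits(num_instances, num_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(num_instances))
    for k in range(num_folds):
        test = indices[k::num_folds]          # every num_folds-th instance
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

splits = list(k_fold_splits(num_instances=10, num_folds=5))
print(splits[0][1])  # [0, 5]

# Leave-one-out cross-validation: as many folds as instances.
loo = list(k_fold_splits(num_instances=4, num_folds=4))
print([test for _, test in loo])  # [[0], [1], [2], [3]]
```

Each instance appears in exactly one test fold, so the aggregated accuracy is computed over the whole labeled set.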
956 • To better handle out-of-vocabulary terms, we want features that apply to multiple
957 words, such as prefixes and suffixes (e.g., anti-, un-, -ing) and capitalization.
958 • We also want n-gram features that apply to multi-word units: bigrams (e.g., not
959 good, not bad), trigrams (e.g., not so bad, lacking any decency, never before imagined), and
960 beyond.
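These features flagrantly violate the Naïve Bayes independence assumption. Consider
what happens if we add a prefix feature. Under the Naïve Bayes assumption, we make
the following approximation:11
$$\Pr(\text{word} = \textit{unfit}, \text{prefix} = \textit{un-} \mid y) \approx \Pr(\text{prefix} = \textit{un-} \mid y) \times \Pr(\text{word} = \textit{unfit} \mid y).$$
To test the quality of the approximation, we can manipulate the left-hand side by applying
the chain rule,
$$\Pr(\text{word} = \textit{unfit}, \text{prefix} = \textit{un-} \mid y) = \Pr(\text{prefix} = \textit{un-} \mid \text{word} = \textit{unfit}, y) \times \Pr(\text{word} = \textit{unfit} \mid y).$$
But Pr(prefix = un- | word = unfit, y) = 1, since un- is guaranteed to be the prefix for the
word unfit. Therefore,
$$\Pr(\text{word} = \textit{unfit}, \text{prefix} = \textit{un-} \mid y) = 1 \times \Pr(\text{word} = \textit{unfit} \mid y) \gg \Pr(\text{prefix} = \textit{un-} \mid y) \times \Pr(\text{word} = \textit{unfit} \mid y),$$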
961 because the probability of any given word starting with the prefix un- is much less than
962 one. Naı̈ve Bayes will systematically underestimate the true probabilities of conjunctions
963 of positively correlated features. To use such features, we need learning algorithms that
964 do not rely on an independence assumption.
965 The origin of the Naı̈ve Bayes independence assumption is the learning objective,
966 p(x(1:N ) , y (1:N ) ), which requires modeling the probability of the observed text. In clas-
967 sification problems, we are always given x, and are only interested in predicting the label
968 y, so it seems unnecessary to model the probability of x. Discriminative learning algo-
969 rithms focus on the problem of predicting y, and do not attempt to model the probability
970 of the text x.
974 our choice of features. Why not forget about probability and learn the weights in an error-
975 driven way? The perceptron algorithm, shown in Algorithm 3, is one way to do this.
976 Here’s what the algorithm says: if you make a mistake, increase the weights for fea-
977 tures that are active with the correct label y (i) , and decrease the weights for features that
978 are active with the guessed label ŷ. This is an online learning algorithm, since the clas-
979 sifier weights change after every example. This is different from Naı̈ve Bayes, which
980 computes corpus statistics and then sets the weights in a single operation — Naı̈ve Bayes
981 is a batch learning algorithm. Algorithm 3 is vague about when this online learning pro-
982 cedure terminates. We will return to this issue shortly.
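The update described above can be sketched as follows. The weights are stored as a K × V matrix whose row k holds the block of θ for label k, so that `theta @ x` gives θ · f(x, y) for every label at once. The fixed number of passes over the data is an assumption made here to give the loop a stopping point; labels are 0-indexed and the last column of the toy data is an offset feature.

```python
import numpy as np

def perceptron(X, y, num_labels, num_epochs=10):
    """Multiclass perceptron; row k of theta holds the weights for label k."""
    theta = np.zeros((num_labels, X.shape[1]))
    for _ in range(num_epochs):              # fixed epoch count is an assumption
        for x_i, y_i in zip(X, y):
            y_hat = int(np.argmax(theta @ x_i))
            if y_hat != y_i:
                theta[y_i] += x_i            # raise weights for the true label
                theta[y_hat] -= x_i          # lower weights for the guessed label
    return theta

# Toy, linearly separable data; the last column is an offset feature.
X = np.array([[2., 0., 1.], [1., 0., 1.], [0., 2., 1.], [0., 1., 1.]])
y = np.array([0, 0, 1, 1])

theta = perceptron(X, y, num_labels=2)
predictions = np.argmax(X @ theta.T, axis=1)
print(predictions)  # [0 0 1 1]
```

On this toy set the weights stop changing after two mistakes, because every later prediction is correct and correct predictions trigger no update.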
983 The perceptron algorithm may seem like a cheap heuristic: Naı̈ve Bayes has a solid
984 foundation in probability, but the perceptron is just adding and subtracting constants from
985 the weights every time there is a mistake. Will this really work? In fact, there is some nice
986 theory for the perceptron, based on the concept of linear separability:
Definition 1 (Linear separability). The dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ is linearly separable iff
there exists some weight vector θ and some margin ρ such that for every instance $(x^{(i)}, y^{(i)})$, the
inner product of θ and the feature function for the true label, $\theta \cdot f(x^{(i)}, y^{(i)})$, is at least ρ greater
than the inner product of θ and the feature function for every other possible label, $\theta \cdot f(x^{(i)}, y')$.
991 Linear separability is important because of the following guarantee: if your data is
992 linearly separable, then the perceptron algorithm will find a separator (Novikoff, 1962).12
993 So while the perceptron may seem heuristic, it is guaranteed to succeed, if the learning
994 problem is easy enough.
995 How useful is this proof? Minsky and Papert (1969) famously proved that the simple
996 logical function of exclusive-or is not separable, and that a perceptron is therefore inca-
997 pable of learning this function. But this is not just an issue for the perceptron: any linear
classification algorithm, including Naïve Bayes, will fail on this task. Classification problems
in natural language usually involve high-dimensional feature spaces, with thousands
1000 or millions of features. For these problems, it is very likely that the training data is indeed
1001 separable. And even if the data is not separable, it is still possible to place an upper bound
1002 on the number of errors that the perceptron algorithm will make (Freund and Schapire,
1003 1999).
1028 set. At this point, it is probably best to stop; this stopping criterion is known as early
1029 stopping.
1030 Generalization is the ability to make good predictions on instances that are not in
1031 the training data. Averaging can be proven to improve generalization, by computing an
1032 upper bound on the generalization error (Freund and Schapire, 1999; Collins, 2002).
Naïve Bayes chooses the weights θ by maximizing the joint log-likelihood $\log p(x^{(1:N)}, y^{(1:N)})$.
By convention, optimization problems are generally formulated as minimization of a loss
function. The input to a loss function is the vector of weights θ, and the output is a
nonnegative scalar, measuring the performance of the classifier on a training instance. The
loss $\ell(\theta; x^{(i)}, y^{(i)})$ is then a measure of the performance of the weights θ on the instance
$(x^{(i)}, y^{(i)})$. The goal of learning is to minimize the sum of the losses across all instances in
the training set.
We can trivially reformulate maximum likelihood as a loss function, by defining the
loss to be the negative log-likelihood,
$$\ell_{\text{NB}}(\theta; x^{(i)}, y^{(i)}) = -\log p(x^{(i)}, y^{(i)}; \theta).$$
The problem of minimizing $\ell_{\text{NB}}$ is thus identical to the problem of maximum-likelihood
estimation.
1043 Loss functions provide a general framework for comparing machine learning objec-
1044 tives. For example, an alternative loss function is the zero-one loss,
$$\ell_{0\text{-}1}(\theta; x^{(i)}, y^{(i)}) = \begin{cases} 0, & y^{(i)} = \operatorname*{argmax}_y \theta \cdot f(x^{(i)}, y) \\ 1, & \text{otherwise} \end{cases} \tag{2.40}$$
1045 The zero-one loss is zero if the instance is correctly classified, and one otherwise. The
1046 sum of zero-one losses is proportional to the error rate of the classifier on the training
1047 data. Since a low error rate is often the ultimate goal of classification, this may seem
1048 ideal. But the zero-one loss has several problems. One is that it is non-convex,13 which
1049 means that there is no guarantee that gradient-based optimization will be effective. A
1050 more serious problem is that the derivatives are useless: the partial derivative with respect
1051 to any parameter is zero everywhere, except at the points where θ·f (x(i) , y) = θ·f (x(i) , ŷ)
1052 for some ŷ. At those points, the loss is discontinuous, and the derivative is undefined.
The perceptron optimizes the following loss function:
$$\ell_{\text{PERCEPTRON}}(\theta; x^{(i)}, y^{(i)}) = \max_{y \in \mathcal{Y}} \theta \cdot f(x^{(i)}, y) - \theta \cdot f(x^{(i)}, y^{(i)}). \tag{2.41}$$
When $\hat{y} = y^{(i)}$, the loss is zero; otherwise, it increases linearly with the gap between the
score for the predicted label ŷ and the score for the true label $y^{(i)}$. Plotting this loss against
the input $\max_{y \in \mathcal{Y}} \theta \cdot f(x^{(i)}, y) - \theta \cdot f(x^{(i)}, y^{(i)})$ gives a hinge shape, motivating the name
hinge loss.
13 A function f is convex iff $\alpha f(x_i) + (1 - \alpha) f(x_j) \geq f(\alpha x_i + (1 - \alpha) x_j)$, for all α ∈ [0, 1] and for all $x_i$ and $x_j$
on the domain of the function. In words, any weighted average of the output of f applied to any two points is
at least as large as the output of f when applied to the weighted average of the same two points. Convexity implies
that any local minimum is also a global minimum, and there are many effective techniques for optimizing
convex functions (Boyd and Vandenberghe, 2004). See Appendix B for a brief review.
To see why this is the loss function optimized by the perceptron, take the derivative
with respect to θ,
$$\frac{\partial}{\partial \theta} \ell_{\text{PERCEPTRON}}(\theta; x^{(i)}, y^{(i)}) = f(x^{(i)}, \hat{y}) - f(x^{(i)}, y^{(i)}). \tag{2.42}$$
At each instance, the perceptron algorithm takes a step of magnitude one in the opposite direction
of this gradient, $\nabla_\theta \ell_{\text{PERCEPTRON}} = \frac{\partial}{\partial \theta} \ell_{\text{PERCEPTRON}}(\theta; x^{(i)}, y^{(i)})$. As we will see in § 2.5,
this is an example of the optimization algorithm stochastic gradient descent, applied to
the objective in Equation 2.41.
1064 Breaking ties with subgradient descent Careful readers will notice the tacit assumption
1065 that there is a unique ŷ that maximizes θ · f (x(i) , y). What if there are two or more labels
1066 that maximize this function? Consider binary classification: if the maximizer is y (i) , then
the gradient is zero, and so is the perceptron update; if the maximizer is $\hat{y} \neq y^{(i)}$, then the
update is the difference $f(x^{(i)}, y^{(i)}) - f(x^{(i)}, \hat{y})$. The underlying issue is that the perceptron
1069 loss is not smooth, because the first derivative has a discontinuity at the hinge point,
1070 where the score for the true label y (i) is equal to the score for some other label ŷ. At this
1071 point, there is no unique gradient; rather, there is a set of subgradients. A vector v is a
1072 subgradient of the function g at u0 iff g(u) − g(u0 ) ≥ v · (u − u0 ) for all u. Graphically,
1073 this defines the set of hyperplanes that include g(u0 ) and do not intersect g at any other
1074 point. As we approach the hinge point from the left, the gradient is f (x, ŷ)−f (x, y); as we
1075 approach from the right, the gradient is 0. At the hinge point, the subgradients include all
1076 vectors that are bounded by these two extremes. In subgradient descent, any subgradient
1077 can be used (Bertsekas, 2012). Since both 0 and f (x, ŷ) − f (x, y) are subgradients at the
1078 hinge point, either one can be used in the perceptron update.
Perceptron versus Naïve Bayes The perceptron loss function has some pros and cons
with respect to the negative log-likelihood loss implied by Naïve Bayes.
• Both $\ell_{\text{NB}}$ and $\ell_{\text{PERCEPTRON}}$ are convex, making them relatively easy to optimize. However,
$\ell_{\text{NB}}$ can be optimized in closed form, while $\ell_{\text{PERCEPTRON}}$ requires iterating over
the dataset multiple times.
• $\ell_{\text{NB}}$ can suffer infinite loss on a single example, since the logarithm of zero probability
is negative infinity. Naïve Bayes will therefore overemphasize some examples,
and underemphasize others.
• $\ell_{\text{PERCEPTRON}}$ treats all correct answers equally. Even if θ only gives the correct answer
by a tiny margin, the loss is still zero.
[Figure 2.2: the 0/1 loss, margin loss, and logistic loss, plotted against the margin $\theta^\top f(x, y) - \max_{y' \neq y} \theta^\top f(x, y')$ on the horizontal axis (from −2 to 2), with the loss on the vertical axis.]
The margin represents the difference between the score for the correct label $y^{(i)}$, and
the score for the highest-scoring incorrect label. The intuition behind large margin classification is
that it is not enough just to label the training data correctly: the correct label should be
separated from other labels by a comfortable margin. This idea can be encoded into a loss
function,
$$\ell_{\text{MARGIN}}(\theta; x^{(i)}, y^{(i)}) = \begin{cases} 0, & \gamma(\theta; x^{(i)}, y^{(i)}) \geq 1, \\ 1 - \gamma(\theta; x^{(i)}, y^{(i)}), & \text{otherwise} \end{cases} \tag{2.44}$$
$$= \left(1 - \gamma(\theta; x^{(i)}, y^{(i)})\right)_+, \tag{2.45}$$
where $(x)_+ = \max(0, x)$. The loss is zero if there is a margin of at least 1 between the
1095 score for the true label and the best-scoring alternative ŷ. This is almost identical to the
1096 perceptron loss, but the hinge point is shifted to the right, as shown in Figure 2.2. The
1097 margin loss is a convex upper bound on the zero-one loss.
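The three losses discussed so far can be compared on a single vector of label scores. This is a small sketch, assuming the scores θ · f(x, y) have already been computed for each label; the function names and the example scores are hypothetical.

```python
import numpy as np

def margin(scores, y):
    """Score of the true label minus the best score among the other labels."""
    return float(scores[y] - np.max(np.delete(scores, y)))

def zero_one_loss(scores, y):
    # Ties are broken in favor of the lowest label index by np.argmax.
    return 0.0 if int(np.argmax(scores)) == y else 1.0

def perceptron_loss(scores, y):
    return float(np.max(scores) - scores[y])

def margin_loss(scores, y):
    return max(0.0, 1.0 - margin(scores, y))

scores = np.array([2.0, 1.5, -1.0])   # hypothetical scores for 3 labels

for y in range(2):
    print(y, zero_one_loss(scores, y), perceptron_loss(scores, y), margin_loss(scores, y))
# 0 0.0 0.0 0.5
# 1 1.0 0.5 1.5
```

When the true label is 0, the prediction is correct but the margin is only 0.5, so the margin loss is positive even though the zero-one and perceptron losses are both zero. In both cases the margin loss is at least as large as the zero-one loss.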
1099 This is a constrained optimization problem, where the second line describes constraints
1100 on the space of possible solutions θ. In this case, the constraint is that the functional
1101 margin always be at least one, and the objective is that the minimum geometric margin
1102 be as large as possible.
Any scaling factor on θ will cancel in the numerator and denominator of the geometric
margin. This means that if the data is linearly separable at ρ, we can increase this margin
to 1 by rescaling θ. We therefore need only minimize the denominator $\|\theta\|_2$, subject to
the constraint on the functional margin. The minimizer of $\|\theta\|_2$ is also the minimizer of
$\frac{1}{2}\|\theta\|_2^2 = \frac{1}{2}\sum_{j=1}^{V} \theta_j^2$, which is easier to work with. This gives the optimization problem,
$$\begin{aligned}
\min_{\theta} \quad & \frac{1}{2}\|\theta\|_2^2 \\
\text{s.t.} \quad & \gamma(\theta; x^{(i)}, y^{(i)}) \geq 1, \quad \forall i.
\end{aligned} \tag{2.47}$$
1103 This optimization problem is a quadratic program: the objective is a quadratic func-
1104 tion of the parameters, and the constraints are all linear inequalities. The resulting clas-
1105 sifier is better known as the support vector machine. The name derives from one of the
1106 solutions, which is to incorporate the constraints through Lagrange multipliers αi ≥ 0, i =
1107 1, 2, . . . , N . The instances for which αi > 0 are the support vectors; other instances are
1108 irrelevant to the classification boundary.
[Figure: separators for a binary classification problem, with the geometric margin and the functional margin labeled.]
Figure 2.3: Functional and geometric margins for a binary classification problem. All
separators that satisfy the margin constraint are shown. The separator with the largest
geometric margin is shown in bold.
\min_{\theta, \xi} \; \frac{1}{2} ||\theta||_2^2 + C \sum_{i=1}^{N} \xi_i
\text{s.t.} \; \gamma(\theta; x^{(i)}, y^{(i)}) + \xi_i \geq 1, \; \forall i
\xi_i \geq 0, \; \forall i. [2.48]
The hyperparameter C controls the tradeoff between violations of the margin constraint
and the preference for a low norm of θ. As C → ∞, slack is infinitely expensive,
and there is only a solution if the data is separable. As C → 0, slack becomes free, and
there is a trivial solution at θ = 0. Thus, C plays a similar role to the smoothing parameter
in Naïve Bayes (§ 2.1.4), trading off between a close fit to the training data and better
generalization. Like the smoothing parameter of Naïve Bayes, C must be set by the user,
typically by maximizing performance on a heldout development set.
To solve the constrained optimization problem defined in Equation 2.48, we can first
The inequality is tight, because the slack variables are penalized in the objective, and there
is no advantage to increasing them beyond the minimum value (Ratliff et al., 2007; Smith,
2011). The problem can therefore be transformed into the unconstrained optimization,
\min_{\theta} \; \frac{\lambda}{2} ||\theta||_2^2 + \sum_{i=1}^{N} \left(1 - \gamma(\theta; x^{(i)}, y^{(i)})\right)_+ , [2.50]
where each ξ_i has been substituted by the right-hand side of Equation 2.49, and the factor
of C on the slack variables has been replaced by an equivalent factor of λ = 1/C on the
norm of the weights.
Now define the cost of a classification error as,¹⁴
c(y^{(i)}, \hat{y}) = \begin{cases} 1, & y^{(i)} \neq \hat{y} \\ 0, & \text{otherwise.} \end{cases} [2.51]
This objective maximizes over all y ∈ Y, in search of labels that are both strong, as
measured by θ · f(x^(i), y), and wrong, as measured by c(y^(i), y). This maximization is known
as cost-augmented decoding, because it augments the maximization objective to favor
high-cost predictions. If the highest-scoring label is y = y^(i), then the margin constraint is
satisfied, and the loss for this instance is zero. Cost-augmentation is only for learning: it
is not applied when making predictions on unseen data.
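The difference between ordinary prediction and cost-augmented decoding can be sketched in a few lines of Python. The sketch assumes the zero-one cost of Equation 2.51 and precomputed scores θ · f(x, y); the function names and example values are hypothetical.

```python
def cost(true_label, label):
    # Equation 2.51: unit cost for any incorrect label.
    return 0.0 if label == true_label else 1.0


def predict(scores):
    # Ordinary decoding, used on unseen data: pick the highest-scoring label.
    return max(scores, key=scores.get)


def cost_augmented_decode(scores, true_label):
    # Training-time decoding: favor labels that are both strong and wrong.
    return max(scores, key=lambda y: scores[y] + cost(true_label, y))


scores = {"pos": 2.0, "neg": 1.5}  # hypothetical scores for one instance
print(predict(scores))                        # highest-scoring label
print(cost_augmented_decode(scores, "pos"))   # label chosen during learning
```

Here the true label "pos" has the highest score, but its margin over "neg" is less than 1, so cost-augmented decoding selects "neg" and the instance still contributes to learning.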
Differentiating Equation 2.52 with respect to the weights gives,
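\nabla_{\theta} L_{\text{SVM}} = \lambda\theta + \sum_{i=1}^{N} \left( f(x^{(i)}, \hat{y}) - f(x^{(i)}, y^{(i)}) \right) [2.53]
\hat{y} = \operatorname{argmax}_{y \in \mathcal{Y}} \; \theta \cdot f(x^{(i)}, y) + c(y^{(i)}, y), [2.54]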
where L_SVM refers to the minimization objective in Equation 2.52. This gradient is very similar
to the perceptron update. One difference is the additional term λθ, which regularizes the
weights towards 0. The other difference is the cost c(y^(i), y), which is added to θ · f(x, y)
when choosing ŷ during training. This term derives from the margin constraint: large
margin classifiers learn not only from instances that are incorrectly classified, but also
from instances for which the correct classification decision was not sufficiently confident.
¹⁴ We can also define specialized cost functions that heavily penalize especially undesirable
errors (Tsochantaridis et al., 2004). This idea is revisited in chapter 7.
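A minimal sketch of the per-instance contribution to the gradient in Equation 2.53 is given below. It assumes a hypothetical feature function that pairs each word with the candidate label, a sparse dictionary representation of θ, and that the regularization term λθ is added at every instance; none of these choices is prescribed by the text.

```python
def features(x, y):
    # Hypothetical feature function: one feature per (word, label) pair.
    return {(word, y): count for word, count in x.items()}


def dot(theta, feats):
    return sum(theta.get(k, 0.0) * v for k, v in feats.items())


def svm_subgradient(theta, x, y_true, labels, lam):
    """Return (gradient, y_hat) for one instance, following Equations 2.53-2.54."""
    # Cost-augmented decoding with the zero-one cost.
    y_hat = max(labels,
                key=lambda y: dot(theta, features(x, y)) + (0.0 if y == y_true else 1.0))
    grad = {k: lam * v for k, v in theta.items()}   # regularizer: lambda * theta
    for k, v in features(x, y_hat).items():         # + f(x, y_hat)
        grad[k] = grad.get(k, 0.0) + v
    for k, v in features(x, y_true).items():        # - f(x, y_true)
        grad[k] = grad.get(k, 0.0) - v
    return grad, y_hat


theta = {}
grad, y_hat = svm_subgradient(theta, {"good": 1}, "pos", ["pos", "neg"], lam=0.0)
print(y_hat, grad)
```

When ŷ equals the true label, the two feature vectors cancel and only the regularization term remains, which matches the observation that the loss for such an instance is zero.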
The final line is obtained by plugging in Equation 2.55 and taking the logarithm.¹⁵ Inside
the sum, we have the (additive inverse of the) logistic loss,
\ell_{\text{LOGREG}}(\theta; x^{(i)}, y^{(i)}) = -\theta \cdot f(x^{(i)}, y^{(i)}) + \log \sum_{y' \in \mathcal{Y}} \exp(\theta \cdot f(x^{(i)}, y')) [2.58]
The logistic loss is shown in Figure 2.2. A key difference from the zero-one and hinge
losses is that logistic loss is never zero. This means that the objective function can always
be improved by assigning higher confidence to the correct label.
¹⁵ The log-sum-exp term is a common pattern in machine learning. It is numerically unstable, because it
will underflow if the inner product is small, and overflow if the inner product is large. Scientific computing
libraries usually contain special functions for computing logsumexp, but with some thought, you should be
able to see how to create an implementation that is numerically stable.
As with the support vector machine, better generalization can be obtained by penalizing
the norm of θ. This is done by adding a term of (λ/2) ||θ||_2^2 to the minimization objective.
This is called L2 regularization, because ||θ||_2^2 is the squared L2 norm of the vector θ.
Regularization forces the estimator to trade off performance on the training data against
the norm of the weights, and this can help to prevent overfitting. Consider what would
happen to the unregularized weight for a base feature j that is active in only one instance
x^(i): the conditional log-likelihood could always be improved by increasing the weight
for this feature, so that θ_(j, y^(i)) → ∞ and θ_(j, ỹ ≠ y^(i)) → −∞, where (j, y) is the index of
the feature associated with x_j^(i) and label y in f(x^(i), y).
In § 2.1.4, we saw that smoothing the probabilities of a Naïve Bayes classifier can be
justified in a hierarchical probabilistic model, in which the parameters of the classifier
are themselves random variables, drawn from a prior distribution. The same justification
applies to L2 regularization. In this case, the prior is a zero-mean Gaussian on each term
of θ. The log-likelihood under a zero-mean Gaussian is,
\log N(\theta_j; 0, \sigma^2) \propto -\frac{1}{2\sigma^2} \theta_j^2, [2.59]
so that the regularization weight λ is equal to the inverse variance of the prior, λ = 1/σ^2.
The final step employs the definition of a conditional expectation (§ A.5). The gradient of
the logistic loss is equal to the difference between the expected counts under the current
model, E_{Y|X}[f(x^(i), y)], and the observed feature counts f(x^(i), y^(i)). When these two
vectors are equal for a single instance, there is nothing more to learn from it; when they
are equal in sum over the entire dataset, there is nothing more to learn from the dataset as
a whole. The gradient of the hinge loss is nearly identical, but it involves the features of
the predicted label under the current model, f(x^(i), ŷ), rather than the expected features
E_{Y|X}[f(x^(i), y)] under the conditional distribution p(y | x; θ).
The regularizer contributes λθ to the overall gradient:
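L_{\text{LOGREG}} = \frac{\lambda}{2} ||\theta||_2^2 - \sum_{i=1}^{N} \left( \theta \cdot f(x^{(i)}, y^{(i)}) - \log \sum_{y' \in \mathcal{Y}} \exp \theta \cdot f(x^{(i)}, y') \right) [2.65]
\nabla_{\theta} L_{\text{LOGREG}} = \lambda\theta - \sum_{i=1}^{N} \left( f(x^{(i)}, y^{(i)}) - E_{y|x}[f(x^{(i)}, y)] \right). [2.66]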
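The gradient in Equation 2.66 can be sketched as "expected minus observed" feature counts. The code below assumes a hypothetical feature function that pairs each word with a label and a sparse dictionary for θ; these representations are illustrative choices, not the text's.

```python
import math


def features(x, y):
    # Hypothetical feature function: one feature per (word, label) pair.
    return {(word, y): count for word, count in x.items()}


def dot(theta, feats):
    return sum(theta.get(k, 0.0) * v for k, v in feats.items())


def conditional_probs(theta, x, labels):
    scores = [dot(theta, features(x, y)) for y in labels]
    m = max(scores)                               # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return {y: e / total for y, e in zip(labels, exps)}


def logreg_gradient(theta, data, labels, lam):
    grad = {k: lam * v for k, v in theta.items()}  # regularizer: lambda * theta
    for x, y_true in data:
        probs = conditional_probs(theta, x, labels)
        for y in labels:                           # expected counts under the model
            for k, v in features(x, y).items():
                grad[k] = grad.get(k, 0.0) + probs[y] * v
        for k, v in features(x, y_true).items():   # observed counts
            grad[k] = grad.get(k, 0.0) - v
    return grad


print(logreg_gradient({}, [({"good": 1}, "pos")], ["pos", "neg"], lam=0.0))
```

With all weights at zero, both labels are equally probable, so half of the feature mass is expected on each label while all of it is observed on the true label.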
• In Naïve Bayes, the objective is the joint likelihood log p(x^(1:N), y^(1:N)). Maximum
likelihood estimation yields a closed-form solution for θ.
• In the support vector machine, the objective is the regularized margin loss,
L_{\text{SVM}} = \frac{\lambda}{2} ||\theta||_2^2 + \sum_{i=1}^{N} \left( \max_{y \in \mathcal{Y}} \left( \theta \cdot f(x^{(i)}, y) + c(y^{(i)}, y) \right) - \theta \cdot f(x^{(i)}, y^{(i)}) \right)_+ , [2.67]
There is no closed-form solution, but the objective is convex. The perceptron algorithm
minimizes a similar objective.
• In logistic regression, the objective is the regularized negative log-likelihood,
L_{\text{LOGREG}} = \frac{\lambda}{2} ||\theta||_2^2 - \sum_{i=1}^{N} \left( \theta \cdot f(x^{(i)}, y^{(i)}) - \log \sum_{y \in \mathcal{Y}} \exp \theta \cdot f(x^{(i)}, y) \right) [2.68]
These learning algorithms are distinguished by what is being optimized, rather than
how the optimal weights are found. This decomposition is an essential feature of
contemporary machine learning. The domain expert’s job is to design an objective function
or, more generally, a model of the problem. If the model has certain characteristics,
then generic optimization algorithms can be used to find the solution. In particular, if an
objective function is differentiable, then gradient-based optimization can be employed;
if it is also convex, then gradient-based optimization is guaranteed to find the globally
optimal solution. The support vector machine and logistic regression have both of these
properties, and so are amenable to generic convex optimization techniques (Boyd and
Vandenberghe, 2004).
where ∇_θ L is the gradient computed over the entire training set, and η^(t) is the step size
at iteration t. If the objective L is a convex function of θ, then this procedure is guaranteed
to terminate at the global optimum, for an appropriate schedule of learning rates, η^(t).¹⁶
¹⁶ Specifically, the learning rate must have the following properties (Bottou et al., 2016):
\sum_{t=1}^{\infty} \eta^{(t)} = \infty [2.70]
\sum_{t=1}^{\infty} (\eta^{(t)})^2 < \infty. [2.71]
In practice, gradient descent can be slow to converge, as the gradient can become
infinitesimally small. Faster convergence can be obtained by second-order Newton
optimization, which incorporates the inverse of the Hessian matrix,
H_{i,j} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j} [2.72]
The size of the Hessian matrix is quadratic in the number of features. In the bag-of-words
representation, this is usually too big to store, let alone invert. Quasi-Newton optimization
techniques maintain a low-rank approximation to the inverse of the Hessian matrix.
Such techniques usually converge more quickly than gradient descent, while remaining
computationally tractable even for large feature sets. A popular quasi-Newton algorithm
is L-BFGS (Liu and Nocedal, 1989), which is implemented in many scientific computing
environments, such as scipy and Matlab.
For any gradient-based technique, the user must set the learning rates η^(t). While
convergence proofs usually employ a decreasing learning rate, in practice, it is common to fix
η^(t) to a small constant, like 10^{-3}. The specific constant can be chosen by experimentation,
although there is research on determining the learning rate automatically (Schaul et al.,
2013; Wu et al., 2018).
\sum_{i=1}^{N} \ell(\theta; x^{(i)}, y^{(i)}) \approx N \times \ell(\theta; x^{(j)}, y^{(j)}), \quad (x^{(j)}, y^{(j)}) \sim \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}, [2.73]
where the instance (x^(j), y^(j)) is sampled at random from the full dataset.
In stochastic gradient descent, the approximate gradient is computed by randomly
sampling a single instance, and an update is made immediately. This is similar to the
perceptron algorithm, which also updates the weights one instance at a time. In minibatch
stochastic gradient descent, the gradient is computed over a small set of instances.
A typical approach is to set the minibatch size so that the entire batch fits in memory on a
graphics processing unit (GPU; Neubig et al., 2017). It is then possible to speed up learning
by parallelizing the computation of the gradient over each instance in the minibatch.
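The learning rate properties in Equations 2.70 and 2.71 can be obtained by the schedule
η^(t) = η^(0) t^{−α} for α ∈ (1/2, 1]: the sum of the rates diverges only if α ≤ 1, and the sum of
the squared rates converges only if α > 1/2.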
Algorithm 5 Generalized gradient descent. The function BATCHER partitions the training
set into B batches such that each instance appears in exactly one batch. In gradient
descent, B = 1; in stochastic gradient descent, B = N; in minibatch stochastic gradient
descent, 1 < B < N.
procedure GRADIENT-DESCENT(x^(1:N), y^(1:N), L, η^(1...∞), BATCHER, T_max)
    θ ← 0
    t ← 0
    repeat
        (b^(1), b^(2), . . . , b^(B)) ← BATCHER(N)
        for n ∈ {1, 2, . . . , B} do
            t ← t + 1
            θ^(t) ← θ^(t−1) − η^(t) ∇_θ L(θ^(t−1); x^(b_1^(n), b_2^(n), ...), y^(b_1^(n), b_2^(n), ...))
            if Converged(θ^(1,2,...,t)) then
                return θ^(t)
    until t ≥ T_max
    return θ^(t)
Algorithm 5 offers a generalized view of gradient descent. In standard gradient descent,
the batcher returns a single batch with all the instances. In stochastic gradient descent,
it returns N batches with one instance each. In mini-batch settings, the batcher
returns B minibatches, 1 < B < N.
There are many other techniques for online learning, and the field is currently quite
active (Bottou et al., 2016). Some algorithms use an adaptive step size, which can be dif-
ferent for every feature (Duchi et al., 2011). Features that occur frequently are likely to be
updated frequently, so it is best to use a small step size; rare features will be updated in-
frequently, so it is better to take larger steps. The AdaGrad (adaptive gradient) algorithm
achieves this behavior by storing the sum of the squares of the gradients for each feature,
and rescaling the learning rate by its inverse:
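A minimal sketch of the AdaGrad update is shown below. It follows the standard formulation of Duchi et al. (2011), which divides the step size by the square root of the accumulated squared gradients; the small constant `eps`, the default step size, and the sparse dictionary representation are assumptions added for illustration.

```python
import math


def adagrad_step(theta, grad, sum_sq, eta=0.1, eps=1e-8):
    """Apply one AdaGrad update in place; only features in grad are touched."""
    for j, g in grad.items():
        # Accumulate the squared gradient for feature j.
        sum_sq[j] = sum_sq.get(j, 0.0) + g * g
        # Frequently updated features accumulate a large sum and take smaller steps.
        theta[j] = theta.get(j, 0.0) - eta * g / (math.sqrt(sum_sq[j]) + eps)
    return theta, sum_sq


theta, sum_sq = {}, {}
for _ in range(3):
    adagrad_step(theta, {"frequent": 2.0}, sum_sq)
adagrad_step(theta, {"rare": 2.0}, sum_sq)
print(theta)
```

Both features receive the same gradient on their final update, but the frequent feature moves less on that update because its accumulated sum of squares is larger.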
regularization updates can be performed all at once. This strategy is described in detail
by Kummerfeld et al. (2015).
This is the origin of the name logistic regression. Logistic regression can be viewed as
part of a larger family of generalized linear models (GLMs), in which various other “link
functions” convert between the inner product θ · x and the parameter of a conditional
probability distribution.
In the early NLP literature, logistic regression is frequently called maximum entropy
classification (Berger et al., 1996). This name refers to an alternative formulation, in
which the goal is to find the maximum entropy probability function that satisfies
moment-matching constraints. These constraints specify that the empirical counts of each feature
should match the expected counts under the induced probability distribution p_{Y|X;θ},
\sum_{i=1}^{N} f_j(x^{(i)}, y^{(i)}) = \sum_{i=1}^{N} \sum_{y \in \mathcal{Y}} p(y \mid x^{(i)}; \theta) f_j(x^{(i)}, y), \quad \forall j [2.78]
The moment-matching constraint is satisfied exactly when the derivative of the conditional
log-likelihood function (Equation 2.64) is equal to zero. However, the constraint
can be met by many values of θ, so which should we choose?
The entropy of the conditional probability distribution p_{Y|X} is,
H(p_{Y|X}) = -\sum_{x \in \mathcal{X}} p_X(x) \sum_{y \in \mathcal{Y}} p_{Y|X}(y \mid x) \log p_{Y|X}(y \mid x), [2.79]
where X is the set of all possible feature vectors, and p_X(x) is the probability of observing
the base features x. The distribution p_X is unknown, but it can be estimated by summing
over all the instances in the training set,
\tilde{H}(p_{Y|X}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{y \in \mathcal{Y}} p_{Y|X}(y \mid x^{(i)}) \log p_{Y|X}(y \mid x^{(i)}). [2.80]
If the entropy is large, the likelihood function is smooth across possible values of y;
if it is small, the likelihood function is sharply peaked at some preferred value; in the
limiting case, the entropy is zero if p(y | x) = 1 for some y. The maximum-entropy criterion
chooses to make the weakest commitments possible, while satisfying the moment-matching
constraints from Equation 2.78. The solution to this constrained optimization
problem is identical to the maximum conditional likelihood (logistic-loss) formulation
that was presented in § 2.4.
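The empirical entropy estimate in Equation 2.80 is straightforward to compute once the conditional probabilities for each training instance are available. The sketch below assumes they are given as a list of probability vectors, one per instance; the function name and the convention that 0 log 0 = 0 are assumptions.

```python
import math


def empirical_conditional_entropy(prob_rows):
    """Equation 2.80: average entropy of p(y | x^(i)) over the training instances."""
    total = 0.0
    for probs in prob_rows:
        # Terms with zero probability contribute nothing (0 log 0 is taken as 0).
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prob_rows)


print(empirical_conditional_entropy([[0.5, 0.5], [0.5, 0.5]]))   # smooth distribution
print(empirical_conditional_entropy([[1.0, 0.0], [0.0, 1.0]]))   # sharply peaked
```

The uniform distribution over two labels gives the largest possible value, log 2, while a distribution that places all of its mass on one label gives zero.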
Naïve Bayes. Pros: easy to implement; estimation is fast, requiring only a single pass over
the data; assigns probabilities to predicted labels; controls overfitting with smoothing
parameter. Cons: often has poor accuracy, especially with correlated features.
Perceptron. Pros: easy to implement; online; error-driven learning means that accuracy
is typically high, especially after averaging. Cons: not probabilistic; hard to know
when to stop learning; lack of margin can lead to overfitting.
Support vector machine. Pros: optimizes an error-based metric, usually resulting in high
accuracy; overfitting is controlled by a regularization parameter. Cons: not probabilistic.
Logistic regression. Pros: error-driven and probabilistic; overfitting is controlled by a
regularization parameter. Cons: batch learning requires black-box optimization; logistic
loss can “overtrain” on correctly labeled examples.
One of the main distinctions is whether the learning algorithm offers a probability
over labels. This is useful in modular architectures, where the output of one classifier
is the input for some other system. In cases where probability is not necessary, the support
vector machine is usually the right choice, since it is no more difficult to implement
than the perceptron, and is often more accurate. When probability is necessary, logistic
regression is usually more accurate than Naïve Bayes.
Exercises
1. Let x be a bag-of-words vector such that Σ_{j=1}^V x_j = 1. Verify that the multinomial
probability p_mult(x; φ), as defined in Equation 2.12, is identical to the probability of
the same document under a categorical distribution, p_cat(w; φ).
2. Derive the maximum-likelihood estimate for the parameter µ in Naïve Bayes.
3. As noted in the discussion of averaged perceptron in § 2.2.2, the computation of the
running sum m ← m + θ is unnecessarily expensive, requiring K × V operations.
Give an alternative way to compute the averaged weights θ, with complexity that is
independent of V and linear in the sum of feature sizes Σ_{i=1}^N |f(x^(i), y^(i))|.
4. Consider a dataset that consists of two identical instances x^(1) = x^(2) with
distinct labels y^(1) ≠ y^(2). Assume all features are binary, x_j ∈ {0, 1} for all j.
Now suppose that the averaged perceptron always chooses i = 1 when t is even,
and i = 2 when t is odd, and that it will terminate under the following condition:
\epsilon \geq \max_j \left| \frac{1}{t} \sum_t \theta_j^{(t)} - \frac{1}{t-1} \sum_t \theta_j^{(t-1)} \right|. [2.81]
In words, the algorithm stops when the largest change in the averaged weights is
less than or equal to ε. Compute the number of iterations before the averaged
perceptron terminates.
5. Suppose you have two labeled datasets D1 and D2, with the same features and labels.
• Let θ^(1) be the unregularized logistic regression (LR) coefficients from training
on dataset D1.
• Let θ^(2) be the unregularized LR coefficients (same model) from training on
dataset D2.
• Let θ* be the unregularized LR coefficients from training on the combined
dataset D1 ∪ D2.
Linear classification may seem like all we need for natural language processing. The
bag-of-words representation is inherently high dimensional, and the number of features is
often larger than the number of training instances. This means that it is usually possible
to find a linear classifier that perfectly fits the training data. Moving to nonlinear
classification may therefore only increase the risk of overfitting. For many tasks, lexical features
(words) are meaningful in isolation, and can offer independent evidence about the
instance label, unlike computer vision, where individual pixels are rarely informative,
and must be evaluated holistically to make sense of an image. For these reasons, natural
language processing has historically focused on linear classification to a greater extent
than other machine learning application domains.
But in recent years, nonlinear classifiers have swept through natural language processing,
and are now the default approach for many tasks (Manning, 2016). There are at
least three reasons for this change.
• There have been rapid advances in deep learning, a family of nonlinear methods
that learn complex functions of the input through multiple layers of computation
(Goodfellow et al., 2016).
• Deep learning facilitates the incorporation of word embeddings, which are dense
vector representations of words. Word embeddings can be learned from large amounts
of unlabeled data, and enable generalization to words that do not appear in the
annotated training data (word embeddings are discussed in detail in chapter 14).
• A third reason for the rise of deep nonlinear learning algorithms is hardware. Many
deep learning models can be implemented efficiently on graphics processing units
(GPUs), offering substantial performance improvements over CPU-based computing.
This chapter focuses on neural networks, which are the dominant approach for nonlinear
classification in natural language processing today.¹ Historically, a few other nonlinear
learning methods have been applied to language data:
• Kernel methods are generalizations of the nearest-neighbor classification rule, which
classifies each instance by the label of the most similar example in the training
set (Hastie et al., 2009). The application of the kernel support vector machine to
information extraction is described in chapter 17.
• Decision trees classify instances by checking a set of conditions. Scaling decision
trees to bag-of-words inputs is difficult, but decision trees have been successful in
problems such as coreference resolution (chapter 15), where more compact feature
sets can be constructed (Soon et al., 2001).
• Boosting and related ensemble methods work by combining the predictions of several
“weak” classifiers, each of which may consider only a small subset of features.
Boosting has been successfully applied to text classification (Schapire and Singer,
2000) and syntactic analysis (Abney et al., 1999), and remains one of the most successful
methods on machine learning competition sites such as Kaggle (Chen and
Guestrin, 2016).
1. Use the text x to predict the features z. Specifically, train a logistic regression classifier
to compute p(z_k | x), for each k ∈ {1, 2, . . . , K_z}.
2. Use the features z to predict the label y. Again, train a logistic regression classifier
to compute p(y | z). On test data, z is unknown, so we use the probabilities p(z | x)
from the first layer as the features.
This setup is shown in Figure 3.1, which describes the proposed classifier in a computation
graph: the text features x are connected to the middle layer z, which in turn is
connected to the label y.
Since each z_k ∈ {0, 1}, we can treat p(z_k | x) as a binary classification problem, using
binary logistic regression:
\Pr(z_k = 1 \mid x; \Theta^{(x \to z)}) = \sigma(\theta_k^{(x \to z)} \cdot x) = (1 + \exp(-\theta_k^{(x \to z)} \cdot x))^{-1}, [3.1]
where σ(·) is the sigmoid function (shown in Figure 3.2), and the matrix Θ^(x→z) ∈ R^{K_z × V}
is constructed by stacking the weight vectors for each z_k,
\Theta^{(x \to z)} = [\theta_1^{(x \to z)}, \theta_2^{(x \to z)}, \ldots, \theta_{K_z}^{(x \to z)}]^{\top}. [3.2]
We will assume that x contains a term with a constant value of 1, so that a corresponding
offset parameter is included in each θ_k^(x→z).
¹ I will use “deep learning” and “neural networks” interchangeably.
Figure 3.1: A feedforward neural network. Shaded circles indicate observed features,
usually words; squares indicate nodes in the computation graph, which are computed
from the information carried over the incoming arrows.
The output layer is computed by the multi-class logistic regression probability,
\Pr(y = j \mid z; \Theta^{(z \to y)}, b) = \frac{\exp(\theta_j^{(z \to y)} \cdot z + b_j)}{\sum_{j' \in \mathcal{Y}} \exp(\theta_{j'}^{(z \to y)} \cdot z + b_{j'})}, [3.3]
where b_j is an offset for label j, and the output weight matrix Θ^(z→y) ∈ R^{K_y × K_z} is again
constructed by concatenation,
\Theta^{(z \to y)} = [\theta_1^{(z \to y)}, \theta_2^{(z \to y)}, \ldots, \theta_{K_y}^{(z \to y)}]^{\top}. [3.4]
where element j in the output of the SoftMax function is computed as in Equation 3.3.
We have now defined a multilayer classifier, which can be summarized as,
where σ(·) is now applied elementwise to the vector of inner products,
\sigma(\Theta^{(x \to z)} x) = [\sigma(\theta_1^{(x \to z)} \cdot x), \sigma(\theta_2^{(x \to z)} \cdot x), \ldots, \sigma(\theta_{K_z}^{(x \to z)} \cdot x)]^{\top}. [3.8]
[Figure 3.2: the sigmoid, tanh, and ReLU activation functions. Left panel: values; right panel: derivatives.]
Now suppose that the hidden features z are never observed, even in the training data.
We can still construct the architecture in Figure 3.1. Instead of predicting y from a discrete
vector of predicted values z, we use the probabilities σ(θ_k · x). The resulting classifier is
barely changed:
z = \sigma(\Theta^{(x \to z)} x) [3.9]
p(y \mid x; \Theta^{(z \to y)}, b) = \text{SoftMax}(\Theta^{(z \to y)} z + b). [3.10]
This defines a classification model that predicts the label y ∈ Y from the base features x,
through a “hidden layer” z. This is a feedforward neural network.²
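The forward computation in Equations 3.9 and 3.10 can be sketched in plain Python. The sketch assumes that weight matrices are stored as lists of rows, with one row per hidden unit in Θ^(x→z) and one row per label in Θ^(z→y); the helper names are illustrative.

```python
import math


def sigmoid(a):
    # Written in two branches so that math.exp never receives a large positive value.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def softmax(scores):
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def forward(theta_xz, theta_zy, b, x):
    z = [sigmoid(a) for a in matvec(theta_xz, x)]                    # Equation 3.9
    scores = [s + b_j for s, b_j in zip(matvec(theta_zy, z), b)]
    return z, softmax(scores)                                        # Equation 3.10


theta_xz = [[0.5, -0.5, 0.0], [0.0, 1.0, -1.0]]   # K_z = 2 hidden units, V = 3
theta_zy = [[1.0, -1.0], [-1.0, 1.0]]             # K_y = 2 labels
z, p = forward(theta_xz, theta_zy, [0.0, 0.0], [1.0, 0.0, 1.0])
print(z, p)
```

The returned vector `p` is a probability distribution over labels, so its entries are non-negative and sum to one.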
² The architecture is sometimes called a multilayer perceptron, but this is misleading, because each layer
is not a perceptron as defined in Algorithm 3.
• The range of the sigmoid function is (0, 1). The bounded range ensures that a cascade
of sigmoid functions will not “blow up” to a huge output, and this is important
for deep networks with several hidden layers. The derivative of the sigmoid is
∂/∂a σ(a) = σ(a)(1 − σ(a)). This derivative becomes small at the extremes, which can
make learning slow; this is called the vanishing gradient problem.
• The range of the tanh activation function is (−1, 1): like the sigmoid, the range
is bounded, but unlike the sigmoid, it includes negative values. The derivative is
∂/∂a tanh(a) = 1 − tanh(a)^2, which is steeper than the logistic function near the origin
(LeCun et al., 1998). The tanh function can also suffer from vanishing gradients
at extreme values.
• The rectified linear unit (ReLU) is zero for negative inputs, and linear for positive
inputs (Glorot et al., 2011),
\text{ReLU}(a) = \begin{cases} a, & a \geq 0 \\ 0, & \text{otherwise.} \end{cases} [3.11]
The derivative is a step function, which is 1 if the input is positive, and zero otherwise.
Once the activation is zero, the gradient is also zero. This can lead to the
problem of dead neurons, where some ReLU nodes are zero for all inputs, throughout
learning. A solution is the leaky ReLU, which has a small positive slope for
negative inputs (Maas et al., 2013),
\text{Leaky-ReLU}(a) = \begin{cases} a, & a \geq 0 \\ .0001a, & \text{otherwise.} \end{cases} [3.12]
Sigmoid and tanh are sometimes described as squashing functions, because they squash
an unbounded input into a bounded range. Glorot and Bengio (2010) recommend against
the use of the sigmoid activation in deep networks, because its mean value of 1/2 can cause
the next layer of the network to be saturated, with very small gradients on its own
parameters. Several other activation functions are reviewed by Goodfellow et al. (2016),
who recommend ReLU as the “default option.”
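The activation functions discussed above, together with their derivatives, can be written down directly. In this sketch the leaky ReLU slope follows Equation 3.12, and the derivative of ReLU and leaky ReLU at exactly zero (where it is undefined) is set by convention; the function names are illustrative.

```python
import math


def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def d_sigmoid(a):
    s = sigmoid(a)
    return s * (1.0 - s)


def d_tanh(a):
    return 1.0 - math.tanh(a) ** 2


def relu(a):
    return a if a >= 0 else 0.0          # Equation 3.11


def d_relu(a):
    return 1.0 if a > 0 else 0.0         # step function; 0 at a = 0 by convention


def leaky_relu(a, slope=0.0001):
    return a if a >= 0 else slope * a    # Equation 3.12


def d_leaky_relu(a, slope=0.0001):
    return 1.0 if a > 0 else slope       # small positive slope for negative inputs


for a in (-10.0, 0.0, 10.0):
    print(a, d_sigmoid(a), d_tanh(a), d_relu(a), d_leaky_relu(a))
```

Evaluating the derivatives at large positive and negative inputs shows the vanishing gradients of sigmoid and tanh, the zero gradient of ReLU for negative inputs, and the small nonzero gradient that the leaky ReLU retains there.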
It is also possible to “short circuit” a hidden layer, by propagating information directly
from the input to the next higher level of the network. This is the idea behind residual
networks, which propagate information directly from the input to the subsequent layer (He
et al., 2016),
z = f(\Theta^{(x \to z)} x) + x, [3.13]
where f is any nonlinearity, such as sigmoid or ReLU. A more complex architecture is
the highway network (Srivastava et al., 2015; Kim et al., 2016), in which an additional gate
controls an interpolation between f(Θ^(x→z) x) and x:
where ⊙ refers to an elementwise vector product, and 1 is a column vector of ones. The
sigmoid function is applied elementwise to its input; recall that the output of this function
is restricted to the range (0, 1). Gating is also used in the long short-term memory (LSTM),
which is discussed in chapter 6. Residual and highway connections address a problem
with deep architectures: repeated application of a nonlinear activation function can make
it difficult to learn the parameters of the lower levels of the network, which are too distant
from the supervision signal.
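Residual and highway connections can be sketched as follows. The residual layer follows Equation 3.13 and therefore assumes that the hidden layer has the same dimension as the input. The highway layer assumes the usual gated form, in which a sigmoid gate t interpolates elementwise between f(Θx) and x; the gate parameters `theta_gate` and `b_gate` are hypothetical names, since the highway equations are not reproduced in this passage.

```python
import math


def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def relu(a):
    return a if a >= 0 else 0.0


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def residual_layer(theta, x, f=relu):
    # Equation 3.13: z = f(Theta x) + x
    h = [f(a) for a in matvec(theta, x)]
    return [h_k + x_k for h_k, x_k in zip(h, x)]


def highway_layer(theta, theta_gate, b_gate, x, f=relu):
    # Assumed form: z = t * f(Theta x) + (1 - t) * x, with t = sigmoid(Theta_gate x + b_gate)
    h = [f(a) for a in matvec(theta, x)]
    t = [sigmoid(a + b) for a, b in zip(matvec(theta_gate, x), b_gate)]
    return [t_k * h_k + (1.0 - t_k) * x_k for t_k, h_k, x_k in zip(t, h, x)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, -2.0]
print(residual_layer(identity, x))
print(highway_layer(identity, zeros, [0.0, 0.0], x))
```

When the gate is close to zero, the highway layer passes the input through unchanged; when it is close to one, it behaves like an ordinary hidden layer.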
where e_{y^(i)} is a one-hot vector of zeros with a value of 1 at position y^(i). The inner product
between e_{y^(i)} and log ỹ is also called the multinomial cross-entropy, and this terminology
is preferred in many neural networks papers and software packages.
of the hidden layer may need to be arbitrarily large. Furthermore, the fact that a network has the capacity to
approximate any given function does not say anything about whether it is possible to learn the function using
gradient-based optimization.
It is also possible to train neural networks from other objectives, such as a margin loss.
In this case, it is not necessary to use softmax at the output layer: an affine transformation
of the hidden layer is enough:
In regression problems, the output is a scalar or vector (see § 4.1.2). For these problems, a
typical loss function is the squared error (y − ŷ)^2 or squared norm ||y − ŷ||_2^2.
Let us now consider how to estimate the parameters Θ(x→z) , Θ(z→y) and b, using on-
line gradient-based optimization. The simplest such algorithm is stochastic gradient de-
scent (Algorithm 5). The relevant updates are,
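where η^(t) is the learning rate on iteration t, ℓ^(i) is the loss at instance (or minibatch) i, and
θ_k^(x→z) is column k of the matrix Θ^(x→z), and θ_k^(z→y) is column k of Θ^(z→y).
The gradients of the negative log-likelihood on b and θ_k^(z→y) are very similar to the
gradients in logistic regression,
\nabla_{\theta_k^{(z \to y)}} \ell^{(i)} = \left[ \frac{\partial \ell^{(i)}}{\partial \theta_{k,1}^{(z \to y)}}, \frac{\partial \ell^{(i)}}{\partial \theta_{k,2}^{(z \to y)}}, \ldots, \frac{\partial \ell^{(i)}}{\partial \theta_{k,K_y}^{(z \to y)}} \right]^{\top} [3.27]
\frac{\partial \ell^{(i)}}{\partial \theta_{k,j}^{(z \to y)}} = -\frac{\partial}{\partial \theta_{k,j}^{(z \to y)}} \left( \theta_{y^{(i)}}^{(z \to y)} \cdot z - \log \sum_{y \in \mathcal{Y}} \exp \theta_y^{(z \to y)} \cdot z \right) [3.28]
= \left( \Pr(y = j \mid z; \Theta^{(z \to y)}, b) - \delta(j = y^{(i)}) \right) z_k, [3.29]
where δ(j = y^(i)) is a function that returns one when j = y^(i), and zero otherwise. The
gradient ∇_b ℓ^(i) is similar to Equation 3.29.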
The gradients on the input layer weights Θ^(x→z) can be obtained by applying the chain
rule of differentiation:
where f′(θ_k^(x→z) · x) is the derivative of the activation function f, applied at the input
θ_k^(x→z) · x. For example, if f is the sigmoid function, then the derivative is,
\frac{\partial \ell^{(i)}}{\partial \theta_{n,k}^{(x \to z)}} = \frac{\partial \ell^{(i)}}{\partial z_k} \times z_k \times (1 - z_k) \times x_n. [3.34]
• If the negative log-likelihood ℓ^(i) does not depend much on z_k, ∂ℓ^(i)/∂z_k → 0, then it
doesn’t matter how z_k is computed, and so ∂ℓ^(i)/∂θ_{n,k}^(x→z) → 0.
• If z_k is near 1 or 0, then the curve of the sigmoid function (Figure 3.2) is nearly flat,
and changing the inputs will make little local difference. The term z_k × (1 − z_k) is
maximized at z_k = 1/2, where the slope of the sigmoid function is steepest.
• If x_n = 0, then it does not matter how we set the weights θ_{n,k}^(x→z), so ∂ℓ^(i)/∂θ_{n,k}^(x→z) = 0.
Variables. The variables include the inputs x, the hidden nodes z, the outputs y, and the
loss function. Inputs are variables that do not have parents. Backpropagation computes
the gradients with respect to all variables except the inputs, but does not update
the variables during learning.
Parameters. In a feedforward network, the parameters include the weights and offsets.
Parameter nodes do not have parents, and they are updated during learning.
Objective. The objective node is not the parent of any other node. Backpropagation begins
by computing the gradient with respect to this node.
If the computation graph is a directed acyclic graph, then it is possible to order the
nodes with a topological sort, so that if node t is a parent of node t′, then t < t′. This
means that the values {v_t}_{t=1}^T can be computed in a single forward pass. The topological
sort is reversed when computing gradients: each gradient g_t is computed from the
gradients of the children of t, implementing the chain rule of differentiation. The general
backpropagation algorithm for computation graphs is shown in Algorithm 6, and illustrated
in Figure 3.3.
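The forward and backward passes over a topologically sorted graph can be sketched with a tiny hand-rolled implementation. This is not Algorithm 6 itself: the node class, the three supported operations, and the choice to push each node's gradient to its parents (rather than pull from its children, which accumulates the same sums) are all simplifying assumptions.

```python
import math


class Node:
    def __init__(self, op, parents=(), value=None):
        self.op = op                  # "input", "add", "mul", or "tanh"
        self.parents = list(parents)
        self.value = value            # preset for inputs, computed otherwise
        self.grad = 0.0


def forward_backward(nodes):
    """nodes must be in topological order, with the objective node last."""
    # Forward pass: every parent is evaluated before its children.
    for node in nodes:
        vals = [p.value for p in node.parents]
        if node.op == "add":
            node.value = vals[0] + vals[1]
        elif node.op == "mul":
            node.value = vals[0] * vals[1]
        elif node.op == "tanh":
            node.value = math.tanh(vals[0])
        node.grad = 0.0
    # Backward pass: reverse topological order, applying the chain rule.
    nodes[-1].grad = 1.0
    for node in reversed(nodes):
        vals = [p.value for p in node.parents]
        if node.op == "add":
            local = [1.0, 1.0]
        elif node.op == "mul":
            local = [vals[1], vals[0]]
        elif node.op == "tanh":
            local = [1.0 - node.value ** 2]
        else:
            local = []
        for parent, d in zip(node.parents, local):
            parent.grad += node.grad * d
    return nodes[-1].value


x = Node("input", value=2.0)
w = Node("input", value=0.3)
b = Node("input", value=0.1)
a = Node("mul", [w, x])
s = Node("add", [a, b])
out = Node("tanh", [s])
print(forward_backward([x, w, b, a, s, out]), w.grad, b.grad)
```

Because gradients are accumulated with `+=`, a node that feeds into several children, or into the same child twice, receives the sum of the contributions from all of them.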
While the gradients with respect to each parameter may be complex, they are composed
of products of simple parts. For many networks, all gradients can be computed
through automatic differentiation. This means that end users need only specify the
feedforward computation, and the gradients necessary for learning can be obtained
automatically. There are many software libraries that perform automatic differentiation on
computation graphs, such as Torch (Collobert et al., 2011), TensorFlow (Abadi et al., 2016), and
DyNet (Neubig et al., 2017). One important distinction between these libraries is whether
they support dynamic computation graphs, in which the structure of the computation
graph varies across instances. Static computation graphs are compiled in advance, and
can be applied to fixed-dimensional data, such as bag-of-words vectors. In many natural
language processing problems, each input has a distinct structure, requiring a unique
computation graph.
Figure 3.3: Backpropagation at a single node x in the computation graph. The values of
the predecessors va , vb , vc are the inputs to x, which computes vx , and passes it on to the
successors d and e. The gradients at the successors gd and ge are passed back to x, where
they are incorporated into the gradient gx , which is then passed back to the predecessors
a, b, and c.
$$L = \sum_{i=1}^{N} \ell^{(i)} + \lambda_{z \to y} \|\Theta^{(z \to y)}\|_F^2 + \lambda_{x \to z} \|\Theta^{(x \to z)}\|_F^2,$$ [3.35]
where $\|\Theta\|_F^2 = \sum_{i,j} \theta_{i,j}^2$ is the squared Frobenius norm, which generalizes the $L_2$ norm
to matrices. The bias parameters b are not regularized, as they do not contribute to the
1543 sensitivity of the classifier to the inputs. In gradient-based optimization, the practical
1544 effect of Frobenius norm regularization is that the weights “decay” towards zero at each
1545 update, motivating the alternative name weight decay.
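The "decay" interpretation can be checked numerically. The sketch below assumes plain gradient descent with a hypothetical learning rate `eta` and a single penalty weight `lam`; it shows that a gradient step on the penalized objective equals shrinking the weights by a factor of (1 − 2·eta·lam) and then taking an ordinary step on the unregularized loss.

```python
import numpy as np

def sgd_step_with_l2(theta, grad_loss, lam, eta):
    """One gradient step on loss + lam * ||theta||_F^2."""
    grad_penalty = 2.0 * lam * theta  # gradient of lam * sum(theta ** 2)
    return theta - eta * (grad_loss + grad_penalty)

def sgd_step_with_decay(theta, grad_loss, lam, eta):
    """The same update, rewritten as shrink-then-step."""
    return (1.0 - 2.0 * eta * lam) * theta - eta * grad_loss

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))   # stand-in weight matrix
grad = rng.normal(size=(3, 4))    # stand-in gradient of the unregularized loss
step_a = sgd_step_with_l2(theta, grad, lam=0.1, eta=0.5)
step_b = sgd_step_with_decay(theta, grad, lam=0.1, eta=0.5)
```

The two results agree up to floating-point error, which is why the penalty is often implemented as a multiplicative shrinkage of the weights.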
1546 Another approach to controlling model complexity is dropout, which involves ran-
1547 domly setting some computation nodes to zero during training (Srivastava et al., 2014).
1548 For example, in the feedforward network, on each training instance, with probability ρ we
1549 set each input xn and each hidden layer node zk to zero. Srivastava et al. (2014) recom-
1550 mend ρ = 0.5 for hidden units, and ρ = 0.2 for input units. Dropout is also incorporated
in the gradient computation, so if node $z_k$ is dropped, then none of the weights $\theta_k^{(x \to z)}$ will
1552 be updated for this instance. Dropout prevents the network from learning to depend too
1553 much on any one feature or hidden node, and prevents feature co-adaptation, in which a
1554 hidden unit is only useful in combination with one or more other hidden units. Dropout is
1555 a special case of feature noising, which can also involve adding Gaussian noise to inputs
1556 or hidden units (Holmstrom and Koistinen, 1992). Wager et al. (2013) show that dropout is
1557 approximately equivalent to “adaptive” L2 regularization, with a separate regularization
1558 penalty for each feature.
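A minimal sketch of the masking step, assuming NumPy arrays for the inputs and hidden units. It implements only what the text describes (zeroing nodes with probability ρ and blocking their gradients); the rescaling of surviving units that many libraries apply is deliberately left out, and the function name `dropout` is a placeholder.

```python
import numpy as np

def dropout(values, rho, rng):
    """Zero each entry independently with probability rho.

    Returns the masked values and the boolean keep-mask, so the same mask
    can be reused when computing gradients.
    """
    keep = rng.random(values.shape) >= rho  # True for nodes that survive
    return values * keep, keep

rng = np.random.default_rng(0)

# Input units: roughly 20% are dropped with rho = 0.2.
x = np.ones(10000)
x_dropped, keep_x = dropout(x, rho=0.2, rng=rng)

# Hidden units: a dropped unit receives no gradient, so the weights
# feeding into it are not updated on this instance.
z = np.array([0.3, -1.2, 0.8, 0.5])
z_dropped, keep_z = dropout(z, rho=0.5, rng=rng)
upstream_grad = np.ones_like(z)   # hypothetical gradient from the layer above
grad_z = upstream_grad * keep_z   # zero wherever the unit was dropped
```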
1590 stochastic gradient descent, and by feature noising techniques such as dropout, can help
1591 online optimization to escape saddle points and find high-quality optima (Ge et al., 2015).
1592 Other techniques address saddle points directly, using local reconstructions of the Hessian
1593 matrix (Dauphin et al., 2014) or higher-order derivatives (Anandkumar and Ge, 2016).
Initialization Initialization is not especially important for linear classifiers, since con-
vexity ensures that the global optimum can usually be found quickly. But for multilayer
neural networks, it is helpful to have a good starting point. One reason is that if the mag-
nitude of the initial weights is too large, a sigmoid or tanh nonlinearity will be saturated,
leading to a small gradient, and slow learning. Large gradients are also problematic. Ini-
tialization can help avoid these problems, by ensuring that the variance over the initial
gradients is constant and bounded throughout the network. For networks with tanh acti-
vation functions, this can be achieved by sampling the initial weights from the following
uniform distribution (Glorot and Bengio, 2010),
$$\theta_{i,j} \sim U\left[-\frac{\sqrt{6}}{\sqrt{d_{\text{in}}(n) + d_{\text{out}}(n)}}, \frac{\sqrt{6}}{\sqrt{d_{\text{in}}(n) + d_{\text{out}}(n)}}\right].$$ [3.36]
1598 For the weights leading to a ReLU activation function, He et al. (2015) use similar argu-
1599 mentation to justify sampling from a zero-mean Gaussian distribution,
$$\theta_{i,j} \sim \mathcal{N}\left(0, \sqrt{2 / d_{\text{in}}(n)}\right).$$ [3.38]
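Both initializers are short enough to sketch with NumPy. The function names and layer sizes below are illustrative assumptions, and the second argument of the Gaussian in Equation 3.38 is read here as a standard deviation, which is an interpretation rather than something the text states.

```python
import numpy as np

def glorot_uniform(d_in, d_out, rng):
    """Sample a (d_out, d_in) weight matrix from the uniform range in Equation 3.36."""
    limit = np.sqrt(6.0) / np.sqrt(d_in + d_out)
    return rng.uniform(-limit, limit, size=(d_out, d_in))

def he_normal(d_in, d_out, rng):
    """Sample a (d_out, d_in) weight matrix as in Equation 3.38.

    Assumption: sqrt(2 / d_in) is treated as the standard deviation.
    """
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))

rng = np.random.default_rng(0)
W_tanh = glorot_uniform(d_in=300, d_out=100, rng=rng)  # for a tanh layer
W_relu = he_normal(d_in=300, d_out=100, rng=rng)       # for a ReLU layer
```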
Rather than initializing the weights independently, it can be beneficial to initialize each
layer jointly as an orthonormal matrix, ensuring that $\Theta^{\top} \Theta = I$ (Saxe et al., 2014). Orthonormal
matrices preserve the norm of the input, so that $\|\Theta x\| = \|x\|$, which prevents
the gradients from exploding or vanishing. Orthogonality ensures that the hidden units
are uncorrelated, so that they correspond to different features of the input. Orthonormal
initialization can be performed by applying singular value decomposition to a matrix of
1600 The matrix U contains the singular vectors of A, and is guaranteed to be orthonormal.
1601 For more on singular value decomposition, see chapter 14.
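As a rough sketch of orthonormal initialization with NumPy: the matrix A is filled with standard normal samples here, which is an assumption made for the example, and the function name is hypothetical. The left singular vectors U returned by the decomposition serve as the initial weights.

```python
import numpy as np

def orthonormal_init(d, rng):
    """Return a d-by-d orthonormal matrix taken from the SVD of a random matrix."""
    A = rng.normal(size=(d, d))    # assumption: standard normal entries
    U, S, Vt = np.linalg.svd(A)    # A = U @ diag(S) @ Vt
    return U                       # columns of U are orthonormal

rng = np.random.default_rng(0)
Theta = orthonormal_init(50, rng)
x = rng.normal(size=50)
norm_before = np.linalg.norm(x)
norm_after = np.linalg.norm(Theta @ x)   # unchanged, up to rounding
```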
1602 Even with careful initialization, there can still be significant variance in the final re-
1603 sults. It can be useful to make multiple training runs, and select the one with the best
1604 performance on a heldout development set.
1605 Clipping and normalizing the gradients As already discussed, the magnitude of the
1606 gradient can pose problems for learning: too large, and learning can diverge, with succes-
1607 sive updates thrashing between increasingly extreme values; too small, and learning can
1608 grind to a halt. Several heuristics have been proposed to address this issue.
1609 • In gradient clipping (Pascanu et al., 2013), an upper limit is placed on the norm of
1610 the gradient, and the gradient is rescaled when this limit is exceeded,
$$\text{CLIP}(g) = \begin{cases} g & \|g\| < \tau \\ \frac{\tau}{\|g\|} g & \text{otherwise.} \end{cases}$$ [3.43]
• In batch normalization (Ioffe and Szegedy, 2015), the inputs to each computation
node are recentered by their mean and variance across all of the instances in the
minibatch B (see § 2.5.2). For example, in a feedforward network with one hidden
layer, batch normalization would transform the inputs to the hidden layer as follows:
$$\mu^{(B)} = \frac{1}{|B|} \sum_{i \in B} x^{(i)}$$ [3.44]
$$s^{(B)} = \frac{1}{|B|} \sum_{i \in B} (x^{(i)} - \mu^{(B)})^2$$ [3.45]
$$\bar{x}^{(i)} = (x^{(i)} - \mu^{(B)}) / \sqrt{s^{(B)}}.$$ [3.46]
1611 Empirically, this speeds convergence of deep architectures. One explanation is that
1612 it helps to correct for changes in the distribution of activations during training.
• In layer normalization (Ba et al., 2016), the inputs to each nonlinear activation func-
tion are recentered across the layer:
$$a = \Theta^{(x \to z)} x$$ [3.47]
$$\mu = \frac{1}{K_z} \sum_{k=1}^{K_z} a_k$$ [3.48]
$$s = \frac{1}{K_z} \sum_{k=1}^{K_z} (a_k - \mu)^2$$ [3.49]
$$z = (a - \mu) / \sqrt{s}.$$ [3.50]
1613 Layer normalization has similar motivations to batch normalization, but it can be
1614 applied across a wider range of architectures and training conditions.
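The three heuristics in this list can be sketched in a few lines of NumPy. The function names are placeholders, and the small constant `eps` added inside the square roots is an assumption (it guards against division by zero and does not appear in the equations).

```python
import numpy as np

def clip_gradient(g, tau):
    """Rescale g to have norm tau whenever its norm is tau or larger."""
    norm = np.linalg.norm(g)
    return g if norm < tau else (tau / norm) * g

def batch_normalize(X, eps=1e-8):
    """X has shape (batch_size, num_features).

    Each feature is recentered and rescaled using statistics computed
    across the instances in the minibatch.
    """
    mu = X.mean(axis=0)
    s = ((X - mu) ** 2).mean(axis=0)
    return (X - mu) / np.sqrt(s + eps)

def layer_normalize(a, eps=1e-8):
    """a is the vector of pre-activations for a single instance.

    Statistics are computed across the units of the layer, so no
    minibatch is needed.
    """
    mu = a.mean()
    s = ((a - mu) ** 2).mean()
    return (a - mu) / np.sqrt(s + eps)

g = np.array([3.0, 4.0])                  # norm 5
clipped = clip_gradient(g, tau=1.0)       # rescaled to norm 1

rng = np.random.default_rng(0)
X = rng.normal(5.0, 3.0, size=(64, 4))    # a hypothetical minibatch
X_bn = batch_normalize(X)
a_ln = layer_normalize(np.array([1.0, 2.0, 3.0, 4.0]))
```

The only difference between the two normalizers is the axis over which the mean and variance are taken, which is why layer normalization still works with a minibatch of size one.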
Online optimization The trend towards deep learning has spawned a cottage industry
of online optimization algorithms, which attempt to improve on stochastic gradient de-
scent. AdaGrad was reviewed in § 2.5.2; its main innovation is to set adaptive learning
rates for each parameter by storing the sum of squared gradients. Rather than using the
sum over the entire training history, we can keep a running estimate,
$$v_j^{(t)} = \beta v_j^{(t-1)} + (1 - \beta) g_{t,j}^2,$$ [3.51]
1615 where gt,j is the gradient with respect to parameter j at time t, and β ∈ [0, 1]. This term
1616 places more emphasis on recent gradients, and is employed in the AdaDelta (Zeiler, 2012)
1617 and Adam (Kingma and Ba, 2014) optimizers. Online optimization and its theoretical
1618 background are reviewed by Bottou et al. (2016). Early stopping, mentioned in § 2.2.2,
1619 can help to avoid overfitting, by terminating training after reaching a plateau in the per-
1620 formance on a heldout validation set.
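To make the running estimate in Equation 3.51 concrete, here is a sketch of one way it might be used to scale updates. The update rule is a simplified RMSProp-style step, chosen for brevity; it is not the full AdaDelta or Adam algorithm, and the function name, learning rate, and `eps` constant are assumptions.

```python
import numpy as np

def rms_step(theta, grad, v, eta=0.01, beta=0.9, eps=1e-8):
    """One update that divides the step by a running root-mean-square gradient."""
    v = beta * v + (1.0 - beta) * grad ** 2           # Equation 3.51
    theta = theta - eta * grad / (np.sqrt(v) + eps)   # per-parameter step size
    return theta, v

# Toy objective: f(theta) = sum(theta ** 2), whose gradient is 2 * theta.
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for _ in range(2000):
    theta, v = rms_step(theta, 2.0 * theta, v)
```

Because each step is normalized by the recent gradient magnitude, both coordinates move at a similar rate even though their initial gradients differ in size.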
[Figure 3.4: diagram not recoverable from extraction. Surviving labels: X(0) (Ke by M), convolution with filters C, X(1), pooling, z, prediction.]
1631 then act as base features in a linear classifier. But rather than designing these feature ex-
1632 tractors by hand, a better approach is to learn them, using the magic of backpropagation.
1633 This is the idea behind convolutional neural networks.
1634 Following § 3.2.4, define the base layer of a neural network as,
where $e_{w_m}$ is a column vector of zeros, with a 1 at position $w_m$. The base layer has dimension
$X^{(0)} \in \mathbb{R}^{K_e \times M}$, where $K_e$ is the size of the word embeddings. To merge information
across adjacent words, we convolve $X^{(0)}$ with a set of filter matrices $C^{(k)} \in \mathbb{R}^{K_e \times h}$. Convolution
is indicated by the symbol ∗, and is defined,
$$X^{(1)} = f(b + C * X^{(0)}) \implies x^{(1)}_{k,m} = f\left(b_k + \sum_{k'=1}^{K_e} \sum_{n=1}^{h} c^{(k)}_{k',n} \times x^{(0)}_{k',m+n-1}\right),$$ [3.53]
1635 where f is an activation function such as tanh or ReLU, and b is a vector of offsets. The
convolution operation slides the matrix $C^{(k)}$ across the columns of $X^{(0)}$; at each position
$m$, compute the elementwise product $C^{(k)} \odot X^{(0)}_{m:m+h-1}$, and take the sum.
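Equation 3.53 translates directly into a pair of loops. The sketch below assumes no padding (so the output has M − h + 1 columns) and stacks the filters into a single array of shape (K_f, K_e, h), where K_f is the number of filters; the function name and that layout are illustrative choices.

```python
import numpy as np

def narrow_convolution(X0, C, b, f=np.tanh):
    """Apply each filter at every position where it fits entirely.

    X0: (K_e, M) matrix of word embeddings, one column per word.
    C:  (K_f, K_e, h) stack of filter matrices.
    b:  (K_f,) vector of offsets.
    Returns X1 with shape (K_f, M - h + 1).
    """
    K_f, K_e, h = C.shape
    M = X0.shape[1]
    X1 = np.zeros((K_f, M - h + 1))
    for k in range(K_f):
        for m in range(M - h + 1):
            window = X0[:, m:m + h]                      # h adjacent columns
            X1[k, m] = f(b[k] + np.sum(C[k] * window))   # elementwise product, then sum
    return X1

# One filter whose rows are all [0.5, 1, 0.5], applied to an all-ones input
# with the identity as the activation function.
X0 = np.ones((4, 7))
C = np.tile(np.array([0.5, 1.0, 0.5]), (1, 4, 1))   # shape (1, 4, 3)
b = np.zeros(1)
X1 = narrow_convolution(X0, C, b, f=lambda a: a)
```

With four embedding dimensions and filter weights summing to 2 per row, every output entry is 8, and the output is h − 1 = 2 columns shorter than the input.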
1638 A simple filter might compute a weighted average over nearby words,
$$C^{(k)} = \begin{bmatrix} 0.5 & 1 & 0.5 \\ 0.5 & 1 & 0.5 \\ \vdots & \vdots & \vdots \\ 0.5 & 1 & 0.5 \end{bmatrix},$$ [3.54]
1639 thereby representing trigram units like not so unpleasant. In one-dimensional convolu-
1640 tion, each filter matrix C(k) is constrained to have non-zero values only at row k (Kalch-
1641 brenner et al., 2014).
1642 To deal with the beginning and end of the input, the base matrix X(0) may be padded
1643 with h column vectors of zeros at the beginning and end; this is known as wide convolu-
1644 tion. If padding is not applied, then the output from each layer will be h − 1 units smaller
1645 than the input; this is known as narrow convolution. The filter matrices need not have
identical filter widths, so more generally we could write $h_k$ to indicate the width of filter
1647 C(k) . As suggested by the notation X(0) , multiple layers of convolution may be applied,
1648 so that X(d) is the input to X(d+1) .
After D convolutional layers, we obtain a matrix representation of the document $X^{(D)} \in \mathbb{R}^{K_z \times M}$.
If the instances have variable lengths, it is necessary to aggregate over all M word
positions to obtain a fixed-length representation. This can be done by a pooling operation,
such as max-pooling (Collobert et al., 2011) or average-pooling,
$$z = \text{MaxPool}(X^{(D)}) \implies z_k = \max\left(x^{(D)}_{k,1}, x^{(D)}_{k,2}, \ldots, x^{(D)}_{k,M}\right)$$ [3.55]
$$z = \text{AvgPool}(X^{(D)}) \implies z_k = \frac{1}{M} \sum_{m=1}^{M} x^{(D)}_{k,m}.$$ [3.56]
1649 The vector z can now act as a layer in a feedforward network, culminating in a prediction
$\hat{y}$ and a loss $\ell^{(i)}$. The setup is shown in Figure 3.4.
Just as in feedforward networks, the parameters ($C^{(k)}$, b, Θ) can be learned by backpropagating
from the classification loss. This requires backpropagating through the max-pooling
operation, which is not differentiable everywhere. But because we need
only a local gradient, backpropagation flows only through the argmax m:
$$\frac{\partial z_k}{\partial x^{(D)}_{k,m}} = \begin{cases} 1, & x^{(D)}_{k,m} = \max\left(x^{(D)}_{k,1}, x^{(D)}_{k,2}, \ldots, x^{(D)}_{k,M}\right) \\ 0, & \text{otherwise.} \end{cases}$$ [3.57]
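The pooling operations and the gradient in Equation 3.57 can be sketched as follows. The function names are placeholders, and ties are broken by taking the first maximal position, which is a simplification: the equation as written assigns a gradient of 1 to every tied position.

```python
import numpy as np

def max_pool(X):
    """X has shape (K, M). Returns the row-wise maxima and their positions."""
    idx = X.argmax(axis=1)                        # first maximal position per row
    return X[np.arange(X.shape[0]), idx], idx

def avg_pool(X):
    """Row-wise mean over the M word positions."""
    return X.mean(axis=1)

def max_pool_backward(grad_z, idx, shape):
    """Route each grad_z[k] to the position that achieved the max; zero elsewhere."""
    grad_X = np.zeros(shape)
    grad_X[np.arange(shape[0]), idx] = grad_z
    return grad_X

X = np.array([[0.1, 0.9, 0.3],
              [0.7, 0.2, 0.4]])
z, idx = max_pool(X)
z_avg = avg_pool(X)
grad_X = max_pool_backward(np.array([1.0, 2.0]), idx, X.shape)
```

Only one entry per row of `grad_X` is non-zero, so each filter receives a learning signal from a single word position per instance.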
1651 The computer vision literature has produced a huge variety of convolutional architec-
1652 tures, and many of these bells and whistles can be applied to text data. One avenue for
1653 improvement is more complex pooling operations, such as k-max pooling (Kalchbrenner
1654 et al., 2014), which returns a matrix of the k largest values for each filter. Another innova-
1655 tion is the use of dilated convolution to build multiscale representations (Yu and Koltun,
2016). At each layer, the convolutional operator is applied in strides, skipping ahead by s
1657 steps after each feature. As we move up the hierarchy, each layer is s times smaller than
1658 the layer below it, effectively summarizing the input. This idea is shown in Figure 3.5.
1659 Multi-layer convolutional networks can also be augmented with “shortcut” connections,
1660 as in the ResNet model from § 3.2.2 (Johnson and Zhang, 2017).
Figure 3.5: A dilated convolutional neural network captures progressively larger context
through recursive application of the convolutional operator (Strubell et al., 2017).
1677 Exercises
1678 1. Prove that the softmax and sigmoid functions are equivalent when the number of
1679 possible labels is two. Specifically, for any Θ(z→y) (omitting the offset b for simplic-
1680 ity), show how to construct a vector of weights θ such that,
1682 Your network should have a single output node which uses the Sign activation func-
1683 tion. Use a single hidden layer, with ReLU activation functions. Describe all weights
1684 and offsets.
1685 3. Consider the same network as above (with ReLU activations for the hidden layer),
with an arbitrary differentiable loss function $\ell(y^{(i)}, \tilde{y})$, where $\tilde{y}$ is the activation of
1687 the output node. Suppose all weights and offsets are initialized to zero. Prove that
1688 gradient-based optimization cannot learn the desired function from this initializa-
1689 tion.
1690 4. The simplest solution to the previous problem relies on the use of the ReLU activa-
1691 tion function at the hidden layer. Now consider a network with arbitrary activations
1692 on the hidden layer. Show that if the initial weights are any uniform constant, then
1693 it is not possible to learn the desired function.
5. Consider a network in which: the base features are all binary, $x \in \{0, 1\}^M$; the
hidden layer activation function is sigmoid, $z_k = \sigma(\theta_k \cdot x)$; and the initial weights
are sampled independently from a standard normal distribution, $\theta_{j,k} \sim \mathcal{N}(0, 1)$.
• Show how the probability of a small initial gradient on any weight, $\frac{\partial z_k}{\partial \theta_{j,k}} < \alpha$,
depends on the size of the input M. Hint: use the lower bound,
$$\Pr(\sigma(\theta_k \cdot x) \times (1 - \sigma(\theta_k \cdot x)) < \alpha) \geq 2 \Pr(\sigma(\theta_k \cdot x) < \alpha),$$ [3.60]
and relate this probability to the variance $V[\theta_k \cdot x]$.
1700 • Design an alternative initialization that removes this dependence.
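6. Suppose that the parameters $\Theta = \{\Theta^{(x \to z)}, \Theta^{(z \to y)}, b\}$ are a local optimum of a
feedforward network in the following sense: there exists some $\epsilon > 0$ such that,
$$\|\tilde{\Theta}^{(x \to z)} - \Theta^{(x \to z)}\|_F^2 + \|\tilde{\Theta}^{(z \to y)} - \Theta^{(z \to y)}\|_F^2 + \|\tilde{b} - b\|_2^2 < \epsilon \implies L(\tilde{\Theta}) > L(\Theta)$$ [3.61]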
1701 Define the function π as a permutation on the hidden units, as described in § 3.3.3,
1702 so that for any Θ, L(Θ) = L(Θπ ). Prove that if a feedforward network has a local
1703 optimum in the sense of Equation 3.61, then its loss is not a convex function of the
parameters Θ, using the definition of convexity from § 2.3.
1708 Having learned some techniques for classification, this chapter shifts the focus from math-
1709 ematics to linguistic applications. Later in the chapter, we will consider the design deci-
1710 sions involved in text classification, as well as evaluation practices.
1725 • Tweets containing happy emoticons can be marked as positive, sad emoticons as
1726 negative (Read, 2005; Pak and Paroubek, 2010).
1
Comprehensive surveys on sentiment analysis and related problems are offered by Pang and Lee (2008)
and Liu (2015).
82 CHAPTER 4. LINGUISTIC APPLICATIONS OF CLASSIFICATION
1727 • Reviews with four or more stars can be marked as positive, two or fewer stars as
1728 negative (Pang et al., 2002).
1729 • Statements from politicians who are voting for a given bill are marked as positive
1730 (towards that bill); statements from politicians voting against the bill are marked as
1731 negative (Thomas et al., 2006).
1732 The bag-of-words model is a good fit for sentiment analysis at the document level: if
1733 the document is long enough, we would expect the words associated with its true senti-
1734 ment to overwhelm the others. Indeed, lexicon-based sentiment analysis avoids machine
1735 learning altogether, and classifies documents by counting words against positive and neg-
1736 ative sentiment word lists (Taboada et al., 2011).
1737 Lexicon-based classification is less effective for short documents, such as single-sentence
1738 reviews or social media posts. In these documents, linguistic issues like negation and ir-
1739 realis (Polanyi and Zaenen, 2006) — events that are hypothetical or otherwise non-factual
1740 — can make bag-of-words classification ineffective. Consider the following examples:
1752 Bigrams can handle relatively straightforward cases, such as when an adjective is immedi-
1753 ately negated; trigrams would be required to extend to larger contexts (e.g., not the worst).
1754 But this approach will not scale to more complex examples like (4.4) and (4.5). More
1755 sophisticated solutions try to account for the syntactic structure of the sentence (Wilson
1756 et al., 2005; Socher et al., 2013), or apply more complex classifiers such as convolutional
1757 neural networks (Kim, 2014), which are described in chapter 3.
1768 Stance classification In debates, each participant takes a side: for example, advocating
1769 for or against proposals like adopting a vegetarian lifestyle or mandating free college ed-
1770 ucation. The problem of stance classification is to identify the author’s position from the
1771 text of the argument. In some cases, there is training data available for each position,
1772 so that standard document classification techniques can be employed. In other cases, it
suffices to classify whether each document supports or opposes the argument
advanced by a previous document (Anand et al., 2011). In the most challenging
case, there is no labeled data for any of the stances, so the only possibility is to group
documents that advocate the same position (Somasundaran and Wiebe, 2009). This is a form
1777 of unsupervised learning, discussed in chapter 5.
1778 Targeted sentiment analysis The expression of sentiment is often more nuanced than a
1779 simple binary label. Consider the following examples:
1780 (4.6) The vodka was good, but the meat was rotten.
1781 (4.7) Go to Heaven for the climate, Hell for the company. –Mark Twain
1782 These statements display a mixed overall sentiment: positive towards some entities (e.g.,
1783 the vodka), negative towards others (e.g., the meat). Targeted sentiment analysis seeks to
1784 identify the writer’s sentiment towards specific entities (Jiang et al., 2011). This requires
1785 identifying the entities in the text and linking them to specific sentiment words — much
1786 more than we can do with the classification-based approaches discussed thus far. For
1787 example, Kim and Hovy (2006) analyze sentence-internal structure to determine the topic
1788 of each sentiment expression.
1789 Aspect-based opinion mining seeks to identify the sentiment of the author of a review
1790 towards predefined aspects such as PRICE and SERVICE, or, in the case of (4.7), CLIMATE
1791 and COMPANY (Hu and Liu, 2004). If the aspects are not defined in advance, it may again
1792 be necessary to employ unsupervised learning methods to identify them (e.g., Branavan
1793 et al., 2009).
1794 Emotion classification While sentiment analysis is framed in terms of positive and neg-
1795 ative categories, psychologists generally regard emotion as more multifaceted. For ex-
1796 ample, Ekman (1992) argues that there are six basic emotions — happiness, surprise, fear,
1797 sadness, anger, and contempt — and that they are universal across human cultures. Alm
1798 et al. (2005) build a linear classifier for recognizing the emotions expressed in children’s
1799 stories. The ultimate goal of this work was to improve text-to-speech synthesis, so that
1800 stories could be read with intonation that reflected the emotional content. They used bag-
1801 of-words features, as well as features capturing the story type (e.g., jokes, folktales), and
1802 structural features that reflect the position of each sentence in the story. The task is diffi-
1803 cult: even human annotators frequently disagreed with each other, and the best classifiers
1804 achieved accuracy between 60-70%.
1815 Ordinal ranking In many problems, the labels are ordered but discrete: for example,
product reviews are often scored with integers on a scale of 1 to 5, and grades are on a scale of A to F.
Such problems can be solved by discretizing the score $\theta \cdot x$ into “ranks”,
$$\hat{y} = \operatorname*{argmin}_{r:\ \theta \cdot x \geq b_r} r,$$ [4.2]
1821 Lexicon-based classification Sentiment analysis is one of the only NLP tasks where
1822 hand-crafted feature weights are still widely employed. In lexicon-based classification (Taboada
1823 et al., 2011), the user creates a list of words for each label, and then classifies each docu-
1824 ment based on how many of the words from each list are present. In our linear classifica-
1825 tion framework, this is equivalent to choosing the following weights:
$$\theta_{y,j} = \begin{cases} 1, & j \in L_y \\ 0, & \text{otherwise,} \end{cases}$$ [4.3]
1826 where Ly is the lexicon for label y. Compared to the machine learning classifiers discussed
1827 in the previous chapters, lexicon-based classification may seem primitive. However, su-
1828 pervised machine learning relies on large annotated datasets, which are time-consuming
1829 and expensive to produce. If the goal is to distinguish two or more categories in a new
1830 domain, it may be simpler to start by writing down a list of words for each category.
An early lexicon was the General Inquirer (Stone, 1966). Today, popular sentiment lexicons
include SentiWordNet (Esuli and Sebastiani, 2006) and an evolving set of lexicons
1833 from Liu (2015). For emotions and more fine-grained analysis, Linguistic Inquiry and Word
1834 Count (LIWC) provides a set of lexicons (Tausczik and Pennebaker, 2010). The MPQA lex-
1835 icon indicates the polarity (positive or negative) of 8221 terms, as well as whether they are
1836 strongly or weakly subjective (Wiebe et al., 2005). A comprehensive comparison of senti-
1837 ment lexicons is offered by Ribeiro et al. (2016). Given an initial seed lexicon, it is possible
1838 to automatically expand the lexicon by looking for words that frequently co-occur with
1839 words in the seed set (Hatzivassiloglou and McKeown, 1997; Qiu et al., 2011).
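The weights in Equation 4.3 amount to counting lexicon hits, which takes only a few lines of Python. The two word lists below are toy examples invented for illustration and are not drawn from any of the lexicons cited above; the function name is likewise a placeholder.

```python
from collections import Counter

# Toy lexicons, made up for this example. Real lexicons are far larger.
LEXICONS = {
    "positive": {"good", "great", "delicious", "love"},
    "negative": {"bad", "rotten", "awful", "hate"},
}

def lexicon_classify(tokens, lexicons):
    """Score each label by the number of tokens found in its lexicon.

    This is a linear classifier with weight 1 for lexicon words and 0
    otherwise. Ties go to whichever label is listed first.
    """
    counts = Counter(tokens)
    scores = {y: sum(counts[w] for w in words) for y, words in lexicons.items()}
    return max(scores, key=scores.get), scores

tokens = "the vodka was good but the meat was rotten and awful".split()
label, scores = lexicon_classify(tokens, LEXICONS)
```

On this sentence the negative list matches twice and the positive list once, so the document is labeled negative even though its sentiment is mixed, which illustrates the limits of document-level counting.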
1845 These headlines are ambiguous because they contain words that have multiple mean-
1846 ings, or senses. Word sense disambiguation is the problem of identifying the intended
1847 sense of each word token in a document. Word sense disambiguation is part of a larger
field of research called lexical semantics, which is concerned with the meanings of words.
1849 At a basic level, the problem of word sense disambiguation is to identify the correct
1850 sense for each word token in a document. Part-of-speech ambiguity (e.g., noun versus
1851 verb) is usually considered to be a different problem, to be solved at an earlier stage.
1852 From a linguistic perspective, senses are not properties of words, but of lemmas, which
1853 are canonical forms that stand in for a set of inflected words. For example, arm/N is a
lemma that includes the inflected form arms/N; the /N indicates that we are referring
to the noun, and not its homonym arm/V, which is another lemma that includes
1856 the inflected verbs (arm/V, arms/V, armed/V, arming/V). Therefore, word sense disam-
1857 biguation requires first identifying the correct part-of-speech and lemma for each token,
2
These examples, and many more, can be found at http://www.ling.upenn.edu/~beatrice/humor/headlines.html
1858 and then choosing the correct sense from the inventory associated with the corresponding
1859 lemma.3 (Part-of-speech tagging is discussed in § 8.1.)
1885 *Other lexical semantic relations Besides synonymy, WordNet also describes many
1886 other lexical semantic relationships, including:
Classification of these relations can be performed by searching for characteristic
patterns between pairs of words, e.g., X, such as Y, which signals hyponymy (Hearst,
1893 1992), or X but Y, which signals antonymy (Hatzivassiloglou and McKeown, 1997). An-
1894 other approach is to analyze each term’s distributional statistics (the frequency of its
1895 neighboring words). Such approaches are described in detail in chapter 14.
1898 (4.11) Town officials are hoping to attract new manufacturing plants through weakened
1899 environmental regulations.
1900 (4.12) The endangered plants play an important role in the local ecosystem.
1901 As in document classification, many of these features are irrelevant, but a few are very
1902 strong predictors. In this example, the context word endangered is a strong signal that
1903 the intended sense is biology rather than manufacturing. We would therefore expect a
1904 learning algorithm to assign high weight to (endangered, BIOLOGY), and low weight to
1905 (endangered, MANUFACTURING).5
It may also be helpful to go beyond the bag-of-words: for example, one might encode
the position of each context word with respect to the target, e.g.,
1906 These are called collocation features, and they give more information about the specific
1907 role played by each context word. This idea can be taken further by incorporating addi-
1908 tional syntactic information about the grammatical role played by each context feature,
1909 such as the dependency path (see chapter 11).
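A sketch of how bag-of-words and collocation features might be extracted for one target token. The window size, the function name, and the (offset, word) encoding of collocation features are assumptions made for this example.

```python
def context_features(tokens, target_index, window=2):
    """Collect context words within `window` positions of the target.

    Returns a bag of context words (position ignored) and a list of
    collocation features that record each word's offset from the target.
    """
    bag, collocations = set(), []
    for offset in range(-window, window + 1):
        i = target_index + offset
        if offset == 0 or not 0 <= i < len(tokens):
            continue  # skip the target itself and out-of-range positions
        bag.add(tokens[i])
        collocations.append((offset, tokens[i]))
    return bag, collocations

tokens = "the endangered plants play an important role".split()
bag, collocations = context_features(tokens, target_index=2, window=2)
```

Here the feature `(-1, "endangered")` records that the word appears immediately before the target, information that the unordered bag discards.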
5
The context bag-of-words can also be used to perform word-sense disambiguation without
machine learning: the Lesk (1986) algorithm selects the word sense whose dictionary definition best overlaps
the local context.
1910 Using such features, a classifier can be trained from labeled data. A semantic concor-
1911 dance is a corpus in which each open-class word (nouns, verbs, adjectives, and adverbs)
1912 is tagged with its word sense from the target dictionary or thesaurus. SemCor is a seman-
1913 tic concordance built from 234K tokens of the Brown corpus (Francis and Kucera, 1982),
1914 annotated as part of the WordNet project (Fellbaum, 2010). SemCor annotations look like
1915 this:
1917 with the superscripts indicating the annotated sense of each polysemous word, and the
1918 subscripts indicating the part-of-speech.
1919 As always, supervised classification is only possible if enough labeled examples can
1920 be accumulated. This is difficult in word sense disambiguation, because each polysemous
1921 lemma requires its own training set: having a good classifier for the senses of serve is no
1922 help towards disambiguating plant. For this reason, unsupervised and semisupervised
1923 methods are particularly important for word sense disambiguation (e.g., Yarowsky, 1995).
1924 These methods will be discussed in chapter 5. Unsupervised methods typically lean on
1925 the heuristic of “one sense per discourse”, which means that a lemma will usually have
1926 a single, consistent sense throughout any given document (Gale et al., 1992). Based on
1927 this heuristic, we can propagate information from high-confidence instances to lower-
1928 confidence instances in the same document (Yarowsky, 1995).
Figure 4.1: The output of four nltk tokenizers, applied to the string Isn’t Ahab, Ahab? ;)
1943 to define a subset of characters as whitespace, and then split the text on these tokens.
However, whitespace-based tokenization is not ideal: we may want to split contractions
1945 like isn’t and hyphenated phrases like prize-winning and half-asleep, and we likely want
1946 to separate words from commas and periods that immediately follow them. At the same
1947 time, it would be better not to split abbreviations like U.S. and Ph.D. In languages with
1948 Roman scripts, tokenization is typically performed using regular expressions, with mod-
1949 ules designed to handle each of these cases. For example, the nltk package includes a
1950 number of tokenizers (Loper and Bird, 2002); the outputs of four of the better-known tok-
1951 enizers are shown in Figure 4.1. Social media researchers have found that emoticons and
1952 other forms of orthographic variation pose new challenges for tokenization, leading to the
1953 development of special purpose tokenizers to handle these phenomena (O’Connor et al.,
1954 2010).
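A small regular-expression tokenizer illustrates the cases mentioned above. This is a sketch using Python's standard `re` module, not a reproduction of any nltk tokenizer; the pattern covers only a handful of cases, and its rules (for example, which emoticons it recognizes) are arbitrary choices.

```python
import re

# Alternatives are tried left to right, so more specific patterns come first.
TOKEN_PATTERN = re.compile(r"""
    (?:[A-Za-z]\.){2,}     # abbreviations such as U.S.
  | [:;]-?[()DP]           # a few simple emoticons
  | \w+(?=n't)             # the stem of a negative contraction
  | n't                    # the contraction suffix itself
  | \w+                    # ordinary words (hyphenated phrases are split)
  | [^\w\s]                # any other single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Return the list of non-overlapping matches, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Isn't Ahab, Ahab? ;)")
more_tokens = tokenize("U.S. prize-winning")
```

The first call separates the contraction, the punctuation, and the emoticon; the second keeps the abbreviation intact while splitting the hyphenated phrase into three tokens.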
1955 Tokenization is a language-specific problem, and each language poses unique chal-
1956 lenges. For example, Chinese does not include spaces between words, nor any other
1957 consistent orthographic markers of word boundaries. A “greedy” approach is to scan the
1958 input for character substrings that are in a predefined lexicon. However, Xue et al. (2003)
note that this can be ambiguous, since many character sequences could be segmented in
multiple ways. Instead, they train a classifier to determine whether each Chinese character,
1961 or hanzi, is a word boundary. More advanced sequence labeling methods for word seg-
1962 mentation are discussed in § 8.4. Similar problems can occur in languages with alphabetic
1963 scripts, such as German, which does not include whitespace in compound nouns, yield-
1964 ing examples such as Freundschaftsbezeigungen (demonstration of friendship) and Dilet-
1965 tantenaufdringlichkeiten (the importunities of dilettantes). As Twain (1997) argues, “These
1966 things are not words, they are alphabetic processions.” Social media raises similar problems
1967 for English and other languages, with hashtags such as #TrueLoveInFourWords requiring
1968 decomposition for analysis (Brun and Roux, 2014).
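The greedy, lexicon-based approach can be sketched as a longest-match scan. The lexicons below are toy examples, and the policy of emitting unknown characters as single tokens is an assumption; the second example shows the kind of ambiguity that motivates learned segmenters.

```python
def greedy_segment(text, lexicon):
    """Left-to-right longest-match segmentation.

    At each position, take the longest substring found in the lexicon.
    If nothing matches, emit a single character and move on.
    """
    max_len = max(len(w) for w in lexicon)
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no lexicon entry starts at position i
        tokens.append(text[i:j])
        i = j
    return tokens

# Decomposing a lowercased hashtag with a toy lexicon.
hashtag_lexicon = {"true", "love", "in", "four", "words", "for"}
segments = greedy_segment("trueloveinfourwords", hashtag_lexicon)

# An ambiguous string: "a nice" and "an ice" are both valid segmentations,
# but the greedy scan can return only one of them.
ambiguous = greedy_segment("anice", {"a", "an", "nice", "ice"})
```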
Figure 4.2: Sample outputs of the Porter (1980) and Lancaster (Paice, 1990) stemmers, and
the WordNet lemmatizer
1974 case distinctions might be relevant in some situations: for example, apple is a delicious
1975 pie filling, while Apple is a company that specializes in proprietary dongles and power
1976 adapters.
1977 For Roman script, case conversion can be performed using unicode string libraries.
1978 Many scripts do not have case distinctions (e.g., the Devanagari script used for South
1979 Asian languages, the Thai alphabet, and Japanese kana), and case conversion for all scripts
may not be available in every programming environment. (Unicode support is an important
distinction between Python’s versions 2 and 3, and is a good reason for migrating
to Python 3 if you have not already done so. Compare the output of the code
"à l'hôtel".upper() in the two language versions.)
1984 Case conversion is a type of normalization, which refers to string transformations that
1985 remove distinctions that are irrelevant to downstream applications (Sproat et al., 2001).
1986 Other normalizations include the standardization of numbers (e.g., 1,000 to 1000) and
dates (e.g., August 11, 2015 to 2015/08/11). Depending on the application, it may even be
1988 worthwhile to convert all numbers and dates to special tokens, !NUM and !DATE. In social
1989 media, there are additional orthographic phenomena that may be normalized, such as ex-
1990 pressive lengthening, e.g., cooooool (Aw et al., 2006; Yang and Eisenstein, 2013). Similarly,
1991 historical texts feature spelling variations that may need to be normalized to a contempo-
1992 rary standard form (Baron and Rayson, 2008).
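Several of these normalizations can be sketched with regular expressions. The rules below are illustrative assumptions rather than a standard: in particular, collapsing any run of three or more identical letters down to two is a heuristic that handles cooooool but would leave sooo as soo.

```python
import re

def normalize(text):
    """Lowercase, strip thousands separators, and shorten lengthened words."""
    text = text.lower()
    # Remove commas that sit between a digit and a group of exactly three digits.
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    # Collapse runs of three or more identical letters to two (letters only,
    # so that digit runs such as 1000 are left alone).
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)
    return text

def mask_numbers(text):
    """Replace every run of digits with the special token !NUM."""
    return re.sub(r"\d+", "!NUM", text)

normalized = normalize("Cooooool, it cost 1,000 dollars")
masked = mask_numbers(normalized)
```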
1993 A more extreme form of normalization is to eliminate inflectional affixes, such as the
1994 -ed and -s suffixes in English. On this view, bike, bikes, biking, and biked all refer to the
1995 same underlying concept, so they should be grouped into a single feature. A stemmer is
1996 a program for eliminating affixes, usually by applying a series of regular expression sub-
1997 stitutions. Character-based stemming algorithms are necessarily approximate, as shown
1998 in Figure 4.2: the Lancaster stemmer incorrectly identifies -ers as an inflectional suffix of
1999 sisters (by analogy to fix/fixers), and both stemmers incorrectly identify -s as a suffix of this
2000 and Williams. Fortunately, even inaccurate stemming can improve bag-of-words classifi-
2001 cation models, by merging related strings and thereby reducing the vocabulary size.
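A toy stemmer makes the approximation errors easy to reproduce. The three substitution rules and the minimum stem length below are invented for this sketch; they are much cruder than the Porter or Lancaster rule sets.

```python
import re

# Ordered suffix-stripping rules: (pattern, replacement).
RULES = [(r"ing$", ""), (r"ed$", ""), (r"s$", "")]

def toy_stem(word):
    """Apply the first rule that fires and leaves at least three characters."""
    for pattern, replacement in RULES:
        stemmed = re.sub(pattern, replacement, word)
        if stemmed != word and len(stemmed) >= 3:
            return stemmed
    return word

stems = [toy_stem(w) for w in ["bikes", "biking", "biked", "this", "is"]]
```

The rules merge biking and biked but fail to merge them with bikes, and they wrongly strip the final letter of this, mirroring the kinds of errors described above.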
2002 Accurately handling irregular orthography requires word-specific rules. Lemmatizers
[Figure 4.3 panels: (a) Movie review data in English (Pang and Lee Movie Reviews); (b) News articles in Brazilian Portuguese (MAC-Morpho Corpus). Axes: vocabulary size (x) against token coverage (y).]
Figure 4.3: Tradeoff between token coverage (y-axis) and vocabulary size, on the nltk
movie review dataset, after sorting the vocabulary by decreasing frequency. The red
dashed lines indicate 80%, 90%, and 95% coverage.
2003 are systems that identify the underlying lemma of a given wordform. They must avoid the
2004 over-generalization errors of the stemmers in Figure 4.2, and also handle more complex
2005 transformations, such as geese→goose. The output of the WordNet lemmatizer is shown in
2006 the final line of Figure 4.2. Both stemming and lemmatization are language-specific: an
2007 English stemmer or lemmatizer is of little use on a text written in another language. The
2008 discipline of morphology relates to the study of word-internal structure, and is described
2009 in more detail in § 9.1.2.
2010 The value of normalization depends on the data and the task. Normalization re-
2011 duces the size of the feature space, which can help in generalization. However, there
2012 is always the risk of merging away linguistically meaningful distinctions. In supervised
2013 machine learning, regularization and smoothing can play a similar role to normalization
2014 — preventing the learner from overfitting to rare features — while avoiding the language-
2015 specific engineering required for accurate normalization. In unsupervised scenarios, such
2016 as content-based information retrieval (Manning et al., 2008) and topic modeling (Blei
2017 et al., 2003), normalization is more critical.
2028 morphological complexity of Portuguese, which includes many more inflectional suffixes
2029 than English.
2030 Eliminating rare words is not always advantageous for classification performance: for
2031 example, names, which are typically rare, play a large role in distinguishing topics of news
2032 articles. Another way to reduce the size of the feature space is to eliminate stopwords such
2033 as the, to, and and, which may seem to play little role in expressing the topic, sentiment,
2034 or stance. This is typically done by creating a stoplist (e.g., nltk.corpus.stopwords),
2035 and then ignoring all terms that match the list. However, corpus linguists and social psy-
2036 chologists have shown that seemingly inconsequential words can offer surprising insights
2037 about the author or nature of the text (Biber, 1991; Chung and Pennebaker, 2007). Further-
2038 more, high-frequency words are unlikely to cause overfitting in discriminative classifiers.
2039 As with normalization, stopword filtering is more important for unsupervised problems,
2040 such as term-based document retrieval.
2041 Another alternative for controlling model size is feature hashing (Weinberger et al.,
2042 2009). Each feature is assigned an index using a hash function. If a hash function that
2043 permits collisions is chosen (typically by taking the hash output modulo some integer),
2044 then the model can be made arbitrarily small, as multiple features share a single weight.
2045 Because most features are rare, accuracy is surprisingly robust to such collisions (Ganchev
2046 and Dredze, 2008).
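A minimal sketch of feature hashing follows, assuming CRC32 as the hash function and eight buckets; both choices are illustrative and are not taken from Weinberger et al. (2009). Each word is mapped to an index by taking the hash output modulo the number of buckets, so that colliding words share a single count (and would share a single weight).

```python
import zlib
from collections import Counter

def hashed_features(tokens, num_buckets):
    """Map a bag of words to counts over num_buckets hashed feature indices."""
    counts = Counter()
    for token in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash() on strings.
        index = zlib.crc32(token.encode("utf-8")) % num_buckets
        counts[index] += 1
    return counts

tokens = "the river bank and the savings bank".split()
features = hashed_features(tokens, num_buckets=8)
```

The model size is fixed by num_buckets regardless of how many distinct words appear, which is the tradeoff described above.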
2063 perform feature selection, so you may need to construct a tuning or development set for
2064 this purpose, as discussed in § 2.1.5.
2065 There are a number of ways to evaluate classifier performance. The simplest is accu-
2066 racy: the number of correct predictions, divided by the total number of instances,
$$\text{acc}(y, \hat{y}) = \frac{1}{N} \sum_{i}^{N} \delta(y^{(i)} = \hat{y}). \qquad [4.4]$$
Exams are usually graded by accuracy. Why are other metrics necessary? The main
reason is class imbalance. Suppose you are building a classifier to detect whether an
electronic health record (EHR) describes symptoms of a rare disease, which appears in
only 1% of all documents in the dataset. A classifier that reports $\hat{y} = \text{NEGATIVE}$ for
all documents would achieve 99% accuracy, but would be practically useless. We need
metrics that are capable of detecting the classifier’s ability to discriminate between classes,
even when the distribution is skewed.
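The class imbalance argument can be checked with a few lines of arithmetic. The sketch below assumes a made-up dataset of 1000 documents, of which 1% are positive, and a classifier that always predicts NEGATIVE.

```python
def accuracy(y_true, y_pred):
    """Number of correct predictions divided by the total number of instances."""
    correct = sum(1 for y, y_hat in zip(y_true, y_pred) if y == y_hat)
    return correct / len(y_true)

# Hypothetical data: 10 positive documents out of 1000.
y_true = ["POSITIVE"] * 10 + ["NEGATIVE"] * 990
y_pred = ["NEGATIVE"] * 1000  # a classifier that ignores its input

acc = accuracy(y_true, y_pred)
positives_found = sum(
    1 for y, y_hat in zip(y_true, y_pred) if y == y_hat == "POSITIVE"
)
```

The accuracy is 0.99 even though none of the positive documents is detected.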
One solution is to build a balanced test set, in which each possible label is equally
represented. But in the EHR example, this would mean throwing away 98% of the original
dataset! Furthermore, the detection threshold itself might be a design consideration: in
health-related applications, we might prefer a very sensitive classifier, which returns a
positive prediction if there is even a small chance that $y^{(i)} = \text{POSITIVE}$. In other
applications, a positive result might trigger a costly action, so we would prefer a classifier that
only makes positive predictions when absolutely certain. We need additional metrics to
capture these characteristics.
2087 Similarly, for any label, there are two ways to be correct:
Classifiers that make a lot of false positives are too sensitive; classifiers that make a
lot of false negatives are not sensitive enough. These two conditions are captured by the
Evaluating multi-class classification. Recall, precision, and F-MEASURE are defined with
respect to a specific label k. When there are multiple labels of interest (e.g., in word sense
disambiguation or emotion classification), it is necessary to combine the F-MEASURE
across each class. Macro F-MEASURE is the average F-MEASURE across several classes,
$$\text{Macro-}F(y, \hat{y}) = \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} F\text{-MEASURE}(y, \hat{y}, k). \qquad [4.8]$$
In multi-class problems with unbalanced class distributions, the macro F-MEASURE is a
balanced measure of how well the classifier recognizes each class. In micro F-MEASURE,
we compute true positives, false positives, and false negatives for each class, and then add
them up to compute a single recall, precision, and F-MEASURE. This metric is balanced
across instances rather than classes, so it weights each class in proportion to its frequency,
unlike macro F-MEASURE, which weights each class equally.
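The difference between the two averages can be seen on a small example. The sketch below assumes the usual definitions of recall, TP/(TP+FN), and precision, TP/(TP+FP), with F-MEASURE as their harmonic mean; the labels and predictions are made up for illustration.

```python
from collections import Counter

def f_measure(tp, fp, fn):
    """Harmonic mean of recall and precision; defined as 0 when there are no true positives."""
    if tp == 0:
        return 0.0
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)

def macro_micro_f(y_true, y_pred, labels):
    tp, fp, fn = Counter(), Counter(), Counter()
    for y, y_hat in zip(y_true, y_pred):
        if y == y_hat:
            tp[y] += 1
        else:
            fp[y_hat] += 1  # the predicted label gains a false positive
            fn[y] += 1      # the true label gains a false negative
    # Macro: average the per-class F-measures, weighting each class equally.
    macro = sum(f_measure(tp[k], fp[k], fn[k]) for k in labels) / len(labels)
    # Micro: pool the counts over classes, then compute a single F-measure.
    micro = f_measure(
        sum(tp[k] for k in labels), sum(fp[k] for k in labels), sum(fn[k] for k in labels)
    )
    return macro, micro

# Hypothetical unbalanced data: 8 instances of "a", 2 of "b", one "b" mislabeled as "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
macro_f, micro_f = macro_micro_f(y_true, y_pred, labels=["a", "b"])
```

Here the rare class has F-MEASURE 2/3 and the frequent class has 16/17, so the macro average (about 0.80) is lower than the micro average (0.90), which is dominated by the frequent class.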
7
F-MEASURE is sometimes called $F_1$, and generalizes to $F_\beta = \frac{(1+\beta^2) r p}{\beta^2 p + r}$. The $\beta$ parameter can be tuned to
emphasize recall or precision.
Figure 4.4: ROC curves for three classifiers of varying discriminative power, measured by
AUC (area under the curve)
In binary classification problems, it is possible to trade off between recall and precision by
adding a constant “threshold” to the output of the scoring function. This makes it possible
to trace out a curve, where each point indicates the performance at a single threshold. In
the receiver operating characteristic (ROC) curve,⁸ the x-axis indicates the false positive
rate, $\frac{\text{FP}}{\text{FP}+\text{TN}}$, and the y-axis indicates the recall, or true positive rate. A perfect classifier
attains perfect recall without any false positives, tracing a “curve” from the origin (0,0) to
the upper left corner (0,1), and then to (1,1). In expectation, a non-discriminative classifier
traces a diagonal line from the origin (0,0) to the upper right corner (1,1). Real classifiers
tend to fall between these two extremes. Examples are shown in Figure 4.4.
The ROC curve can be summarized in a single number by taking its integral, the area
under the curve (AUC). The AUC can be interpreted as the probability that a randomly-
selected positive example will be assigned a higher score by the classifier than a randomly-
selected negative example. A perfect classifier has AUC = 1 (all positive examples score
higher than all negative examples); a non-discriminative classifier has AUC = 0.5 (given
a randomly selected positive and negative example, either could score higher with equal
probability); a perfectly wrong classifier would have AUC = 0 (all negative examples score
higher than all positive examples). One advantage of AUC in comparison to F-MEASURE
is that the baseline rate of 0.5 does not depend on the label distribution.
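The probabilistic interpretation of AUC suggests a direct, if inefficient, way to compute it: compare every positive example with every negative example. The sketch below does this in pure Python; counting ties as half a win is a common convention that is assumed here rather than taken from the text.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive example scores higher."""
    wins = 0.0
    for s_pos in scores_pos:
        for s_neg in scores_neg:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5  # assumed tie-breaking convention
    return wins / (len(scores_pos) * len(scores_neg))

perfect = auc([0.9, 0.8], [0.2, 0.1])  # every positive outscores every negative
wrong = auc([0.1], [0.9])              # every negative outscores every positive
mixed = auc([0.9, 0.4], [0.5, 0.1])    # three of the four pairs are ordered correctly
```

The quadratic loop is fine for small test sets; sorting the scores gives the same quantity more efficiently.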
8
The name “receiver operating characteristic” comes from the metric’s origin in signal processing
applications (Peterson et al., 1954). Other threshold-free metrics include precision-recall curves, precision-at-k, and
balanced F-MEASURE; see Manning et al. (2008) for more details.
Figure 4.5: Probability mass function for the binomial distribution. The pink highlighted
areas represent the cumulative probability for a significance test on an observation of
k = 10 and N = 30.
2158 of binary random variables. We write k ∼ Binom(θ, N ) to indicate that k is drawn from
2159 a binomial distribution, with parameter N indicating the number of random “draws”,
2160 and θ indicating the probability of “success” on each draw. Each draw is an example on
2161 which the two classifiers disagree, and a “success” is a case in which c1 is right and c2 is
2162 wrong. (The label space is assumed to be binary, so if the classifiers disagree, exactly one
2163 of them is correct. The test can be generalized to multi-class classification by focusing on
2164 the examples in which exactly one classifier is correct.)
The probability mass function (PMF) of the binomial distribution is,
$$p_{\text{Binom}}(k; N, \theta) = \binom{N}{k} \theta^k (1 - \theta)^{N-k}, \qquad [4.9]$$
with $\theta^k$ representing the probability of the k successes, $(1 - \theta)^{N-k}$ representing the
probability of the N − k unsuccessful draws. The expression $\binom{N}{k} = \frac{N!}{k!(N-k)!}$ is a binomial
coefficient, representing the number of possible orderings of events; this ensures that the
distribution sums to one over all k ∈ {0, 1, 2, . . . , N }.
Under the null hypothesis, when the classifiers disagree, each classifier is equally
likely to be right, so $\theta = \frac{1}{2}$. Now suppose that among N disagreements, $c_1$ is correct $k < \frac{N}{2}$
times. The probability of $c_1$ being correct k or fewer times is the one-tailed p-value,
because it is computed from the area under the binomial probability mass function from 0
to k, as shown in the left tail of Figure 4.5. This cumulative probability is computed as a
sum over all values i ≤ k,
$$\Pr_{\text{Binom}}\left( \text{count}(\hat{y}_1^{(i)} = y^{(i)} \neq \hat{y}_2^{(i)}) \leq k; N, \theta = \tfrac{1}{2} \right) = \sum_{i=0}^{k} p_{\text{Binom}}\left( i; N, \theta = \tfrac{1}{2} \right). \qquad [4.10]$$
The one-tailed p-value applies only to the asymmetric null hypothesis that $c_1$ is at least
as accurate as $c_2$. To test the two-tailed null hypothesis that $c_1$ and $c_2$ are equally
accurate, we would take the sum of one-tailed p-values, where the second term is computed
from the right tail of Figure 4.5. With $\theta = \frac{1}{2}$, the binomial distribution is symmetric, so this can be
computed by simply doubling the one-tailed p-value.
Algorithm 7 Bootstrap sampling for classifier evaluation. The original test set is
$\{x^{(1:N)}, y^{(1:N)}\}$, the metric is $\delta(\cdot)$, and the number of samples is M.
procedure BOOTSTRAP-SAMPLE($x^{(1:N)}$, $y^{(1:N)}$, $\delta(\cdot)$, M)
    for t ∈ {1, 2, . . . , M } do
        for i ∈ {1, 2, . . . , N } do
            j ∼ UniformInteger(1, N)
            $\tilde{x}^{(i)} \leftarrow x^{(j)}$
            $\tilde{y}^{(i)} \leftarrow y^{(j)}$
        $d^{(t)} \leftarrow \delta(\tilde{x}^{(1:N)}, \tilde{y}^{(1:N)})$
    return $\{d^{(t)}\}_{t=1}^{M}$
2175 Two-tailed tests are more stringent, but they are necessary in cases in which there is
2176 no prior intuition about whether c1 or c2 is better. For example, in comparing logistic
2177 regression versus averaged perceptron, a two-tailed test is appropriate. In an ablation
2178 test, c2 may contain a superset of the features available to c1 . If the additional features are
2179 thought to be likely to improve performance, then a one-tailed test would be appropriate,
2180 if chosen in advance. However, such a test can only prove that c2 is more accurate than
2181 c1 , and not the reverse.
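The one-tailed and two-tailed p-values of the binomial test can be computed directly from the probability mass function. The sketch below uses made-up values, N = 20 disagreements of which the first classifier is correct on k = 5, and assumes the null hypothesis θ = 1/2, under which doubling the one-tailed value is valid.

```python
from math import comb

def binom_pmf(k, n, theta):
    """Binomial probability mass function."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def one_tailed_p(k, n, theta=0.5):
    """Probability of k or fewer successes in n draws (left tail)."""
    return sum(binom_pmf(i, n, theta) for i in range(k + 1))

k, n = 5, 20                    # hypothetical counts, not taken from the text
p_one = one_tailed_p(k, n)
p_two = min(1.0, 2 * p_one)     # valid because the PMF is symmetric when theta = 1/2
```

For these counts the one-tailed p-value is roughly 0.021 and the two-tailed p-value roughly 0.041.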
2198 Algorithm 7, and then setting the top and bottom of the 95% confidence interval to the
2199 values at the 2.5% and 97.5% percentiles of the sorted outputs. Alternatively, you can fit
2200 a normal distribution to the set of differences across bootstrap samples, and compute a
2201 Gaussian confidence interval from the mean and variance.
As the number of bootstrap samples goes to infinity, M → ∞, the bootstrap estimate
is increasingly accurate. A typical choice for M is $10^4$ or $10^5$; larger numbers of samples
are necessary for smaller p-values. One way to validate your choice of M is to run the test
multiple times, and ensure that the p-values are similar; if not, increase M by an order of
magnitude. This is a heuristic measure of the variance of the test, which decreases
with the square root $\sqrt{M}$ (Robert and Casella, 2013).
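The bootstrap procedure and the percentile confidence interval can be sketched as follows. This is a minimal illustration under stated assumptions: the resampled items are pairs of predictions and gold labels, the metric is accuracy, and the data (80 correct predictions out of 100) and the fixed random seed are made up.

```python
import random

def bootstrap_samples(predictions, labels, metric, num_samples, seed=0):
    """Resample the test set with replacement and evaluate the metric on each resample."""
    rng = random.Random(seed)
    n = len(labels)
    outputs = []
    for _ in range(num_samples):
        indices = [rng.randrange(n) for _ in range(n)]  # j ~ UniformInteger over instances
        outputs.append(
            metric([predictions[j] for j in indices], [labels[j] for j in indices])
        )
    return outputs

def percentile_interval(values, lower=0.025, upper=0.975):
    """Read the interval endpoints off the sorted bootstrap outputs."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

labels = [1] * 100
predictions = [1] * 80 + [0] * 20  # hypothetical classifier with 80% accuracy
samples = bootstrap_samples(predictions, labels, accuracy, num_samples=1000)
ci_low, ci_high = percentile_interval(samples)
```

Rerunning with different seeds and comparing the intervals is one way to apply the stability check on M described above.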
2234 notations for five “domains” of web documents: question answers, emails, newsgroups,
2235 reviews, and blogs (Petrov and McDonald, 2012).
2264 1. Determine what the annotations are to include. This is usually based on some
2265 theory of the underlying phenomenon: for example, if the goal is to produce an-
2266 notations about the emotional state of a document’s author, one should start with a
2267 theoretical account of the types or dimensions of emotion (e.g., Mohammad and Tur-
2268 ney, 2013). At this stage, the tradeoff between expressiveness and scalability should
2269 be considered: a full instantiation of the underlying theory might be too costly to
2270 annotate at scale, so reasonable approximations should be considered.
2271 2. Optionally, one may design or select a software tool to support the annotation
2272 effort. Existing general-purpose annotation tools include BRAT (Stenetorp et al.,
2273 2012) and MMAX2 (Müller and Strube, 2006).
2274 3. Formalize the instructions for the annotation task. To the extent that the instruc-
2275 tions are not explicit, the resulting annotations will depend on the intuitions of the
2276 annotators. These intuitions may not be shared by other annotators, or by the users
2277 of the annotated data. Therefore explicit instructions are critical to ensuring the an-
2278 notations are replicable and usable by other researchers.
2279 4. Perform a pilot annotation of a small subset of data, with multiple annotators for
2280 each instance. This will give a preliminary assessment of both the replicability and
2281 scalability of the current annotation instructions. Metrics for computing the rate of
2282 agreement are described below. Manual analysis of specific disagreements should
2283 help to clarify the instructions, and may lead to modifications of the annotation task
2284 itself. For example, if two labels are commonly conflated by annotators, it may be
2285 best to merge them.
2286 5. Annotate the data. After finalizing the annotation protocol and instructions, the
2287 main annotation effort can begin. Some, if not all, of the instances should receive
2288 multiple annotations, so that inter-annotator agreement can be computed. In some
2289 annotation projects, instances receive many annotations, which are then aggregated
2290 into a “consensus” label (e.g., Danescu-Niculescu-Mizil et al., 2013). However, if the
2291 annotations are time-consuming or require significant expertise, it may be preferable
2292 to maximize scalability by obtaining multiple annotations for only a small subset of
2293 examples.
2294 6. Compute and report inter-annotator agreement, and release the data. In some
2295 cases, the raw text data cannot be released, due to concerns related to copyright or
2296 privacy. In these cases, one solution is to publicly release stand-off annotations,
2297 which contain links to document identifiers. The documents themselves can be re-
2298 leased under the terms of a licensing agreement, which can impose conditions on
2299 how the data is used. It is important to think through the potential consequences of
2300 releasing data: people may make personal data publicly available without realizing
2301 that it could be redistributed in a dataset and publicized far beyond their expecta-
2302 tions (boyd and Crawford, 2012).
$$\kappa = \frac{\text{agreement} - E[\text{agreement}]}{1 - E[\text{agreement}]}. \qquad [4.11]$$
The numerator is the difference between the observed agreement and the chance
agreement, and the denominator is the difference between perfect agreement and chance
agreement. Thus, κ = 1 when the annotators agree in every case, and κ = 0 when the
annotators agree only as often as would happen by chance. Various heuristic scales have been
proposed for determining when κ indicates “moderate”, “good”, or “substantial”
agreement; for reference, Lee and Narayanan (2005) report κ between 0.45 and 0.47 for annotations
of emotions in spoken dialogues, which they describe as “moderate agreement”; Stolcke
et al. (2000) report κ = 0.8 for annotations of dialogue acts, which are labels for the
purpose of each turn in a conversation.
When there are two annotators, the expected chance agreement is computed as,
$$E[\text{agreement}] = \sum_{k} \hat{\Pr}(Y = k)^2, \qquad [4.12]$$
where the sum is over labels k, and $\hat{\Pr}(Y = k)$ is the empirical probability of label k across
all annotations. The formula is derived from the expected number of agreements if the
annotations were randomly shuffled. Thus, in a binary labeling task, if one label is applied
to 90% of instances, chance agreement is $0.9^2 + 0.1^2 = 0.82$.
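The agreement statistic can be computed in a few lines. The sketch below follows Equations 4.11 and 4.12 as written, estimating the label probabilities from the pooled annotations of both annotators; the annotations themselves are made up, and the function does not handle the degenerate case in which chance agreement equals one.

```python
from collections import Counter

def kappa(annotations_1, annotations_2):
    """Agreement corrected for chance, with chance agreement from pooled label frequencies."""
    n = len(annotations_1)
    observed = sum(a == b for a, b in zip(annotations_1, annotations_2)) / n
    pooled = Counter(annotations_1) + Counter(annotations_2)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (observed - expected) / (1 - expected)

# Hypothetical annotations: the annotators agree on 9 of 10 instances.
ann_1 = ["a"] * 9 + ["b"]
ann_2 = ["a"] * 10
k_skewed = kappa(ann_1, ann_2)
k_perfect = kappa(ann_1, ann_1)
```

Although the raw agreement rate is 0.9, the label distribution is so skewed that chance agreement is 0.905, and the corrected statistic is slightly below zero.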
2332 satisfaction rate on previous tasks. The use of relatively untrained “crowdworkers” con-
2333 trasts with earlier annotation efforts, which relied on professional linguists (Marcus et al.,
2334 1993). However, crowdsourcing has been found to produce reliable annotations for many
2335 language-related tasks (Snow et al., 2008). Crowdsourcing is part of the broader field of
2336 human computation (Law and Ahn, 2011).
2341 Exercises
1. As noted in § 4.3.3, words tend to appear in clumps, with subsequent occurrences
of a word being more probable. More concretely, if word j has probability $\phi_{y,j}$
of appearing in a document with label y, then the probability of two appearances
($x_j^{(i)} = 2$) is greater than $\phi_{y,j}^2$.
Suppose you are applying Naïve Bayes to a binary classification. Focus on a word j
which is more probable under label y = 1, so that,
2. Prove that F-measure is never greater than the arithmetic mean of recall and
precision, $\frac{r+p}{2}$. Your solution should also show that F-measure is equal to $\frac{r+p}{2}$ iff $r = p$.
3. Given a binary classification problem in which the probability of the “positive” label
is equal to α, what is the expected F-MEASURE of a random classifier which ignores
the data, and selects $\hat{y} = +1$ with probability $\frac{1}{2}$? (Assume that $p(\hat{y}) \perp p(y)$.) What is
the expected F-MEASURE of a classifier that selects $\hat{y} = +1$ with probability α (also
independent of $y^{(i)}$)? Depending on α, which random classifier will score better?
4. Suppose that binary classifiers $c_1$ and $c_2$ disagree on N = 30 cases, and that $c_1$ is
correct in k = 10 of those cases.
• Write a program that uses primitive functions such as exp and factorial to
compute the two-tailed p-value; you may use an implementation of the “choose”
function if one is available. Verify your code against the output of a library for
computing the binomial test or the binomial CDF, such as scipy.stats.binom
in Python.
• Then use a randomized test to try to obtain the same p-value. In each sample,
draw from a binomial distribution with N = 30 and $\theta = \frac{1}{2}$. Count the fraction
of samples in which k ≤ 10. This is the one-tailed p-value; double this to
compute the two-tailed p-value.
• Try this with varying numbers of bootstrap samples: M ∈ {100, 1000, 5000, 10000}.
For M = 100 and M = 1000, run the test 10 times, and plot the resulting
p-values.
• Finally, perform the same tests for N = 70 and k = 25.
5. SemCor 3.0 is a labeled dataset for word sense disambiguation. You can download
it,¹³ or access it in nltk.corpus.semcor.
Choose a word that appears at least ten times in SemCor (find), and annotate its
WordNet senses across ten randomly-selected examples, without looking at the ground
truth. Use online WordNet to understand the definition of each of the senses.¹⁴ Have
a partner do the same annotations, and compute the raw rate of agreement, expected
chance rate of agreement, and Cohen’s kappa.
6. Download the Pang and Lee movie review data, currently available from
http://www.cs.cornell.edu/people/pabo/movie-review-data/. Hold out a
randomly-selected 400 reviews as a test set.
Download a sentiment lexicon, such as the one currently available from Bing Liu,
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html. Tokenize
the data, and classify each document as positive iff it has more positive sentiment
words than negative sentiment words. Compute the accuracy and F-MEASURE on
detecting positive reviews on the test set, using this lexicon-based classifier.
Then train a discriminative classifier (averaged perceptron or logistic regression) on
the training set, and compute its accuracy and F-MEASURE on the test set.
Determine whether the differences are statistically significant, using two-tailed
hypothesis tests: Binomial for the difference in accuracy, and bootstrap for the
difference in macro-F-MEASURE.
The remaining problems will require you to build a classifier and test its properties. Pick
a multi-class text classification dataset, such as RCV1¹⁵. Divide your data into training
(60%), development (20%), and test sets (20%), if no such division already exists. [todo:
this dataset is already tokenized, find something else]
13
e.g., https://github.com/google-research-datasets/word_sense_disambigation_corpora
or http://globalwordnet.org/wordnet-annotated-corpora/
14
http://wordnetweb.princeton.edu/perl/webwn
15
http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm
7. Compare various vocabulary sizes of $10^2$, $10^3$, $10^4$, $10^5$, using the most frequent words
in each case (you may use any reasonable tokenizer). Train logistic regression
classifiers for each vocabulary size, and apply them to the development set. Plot the
accuracy and Macro-F-MEASURE with the increasing vocabulary size. For each
vocabulary size, tune the regularizer to maximize accuracy on a subset of data that is
held out from the training set.
Compute the token/type ratio for each tokenizer on the training data, and explain
what you find. Train your classifier on each tokenized dataset, tuning the regularizer
on a subset of data that is held out from the training data. Tokenize the development
set, and report accuracy and Macro-F-MEASURE.
9. Apply the Porter and Lancaster stemmers to the training set, using any reasonable
tokenizer, and compute the token/type ratios. Train your classifier on the stemmed
data, and compute the accuracy and Macro-F-MEASURE on stemmed development
data, again using a held-out portion of the training data to tune the regularizer.
10. Identify the best combination of vocabulary filtering, tokenization, and stemming
from the previous three problems. Apply this preprocessing to the test set, and
compute the test set accuracy and Macro-F-MEASURE. Compare against a baseline
system that applies no vocabulary filtering, whitespace tokenization, and no
stemming.
Use the binomial test to determine whether your best-performing system is
significantly more accurate than the baseline.
Use the bootstrap test with $M = 10^4$ to determine whether your best-performing
system achieves significantly higher macro-F-MEASURE.
2428 Without labeled data, is it possible to learn anything? This scenario is known as unsu-
2429 pervised learning, and we will see that indeed it is possible to learn about the underlying
2430 structure of unlabeled observations. This chapter will also explore some related scenarios:
2431 semi-supervised learning, in which only some instances are labeled, and domain adap-
2432 tation, in which the training data differs from the data on which the trained system will
2433 be deployed.
2439 It is difficult to obtain sufficient training data for word sense disambiguation, because
2440 even a large corpus will contain only a few instances of all but the most common words.
2441 Is it possible to learn anything about these different senses without labeled data?
CHAPTER 5. LEARNING WITHOUT SUPERVISION
Figure 5.1 (x-axis: density of word group 1)
Word sense disambiguation is usually performed using feature vectors constructed
from the local context of the word to be disambiguated. For example, for the word
bank, the immediate context might typically include words from one of the following two
groups:
2446 1. financial, deposits, credit, lending, capital, markets, regulated, reserve, liquid, assets
2447 2. land, water, geography, stream, river, flow, deposits, discharge, channel, ecology
2448 Now consider a scatterplot, in which each point is a document containing the word bank.
2449 The location of the document on the x-axis is the count of words in group 1, and the
2450 location on the y-axis is the count for group 2. In such a plot, shown in Figure 5.1, two
2451 “blobs” might emerge, and these blobs correspond to the different senses of bank.
Here’s a related scenario, from a different problem. Suppose you download thousands
of news articles, and make a scatterplot, where each point corresponds to a document:
the x-axis is the frequency of the group of words (hurricane, winds, storm); the y-axis is the
frequency of the group (election, voters, vote). This time, three blobs might emerge: one
for documents that are largely about a hurricane, another for documents largely about an
election, and a third for documents about neither topic.
2458 These clumps represent the underlying structure of the data. But the two-dimensional
2459 scatter plots are based on groupings of context words, and in real scenarios these word
2460 lists are unknown. Unsupervised learning applies the same basic idea, but in a high-
2461 dimensional space with one dimension for every context word. This space can’t be di-
2462 rectly visualized, but the idea is the same: try to identify the underlying structure of the
2463 observed data, such that there are a few clusters of points, each of which is internally
2464 coherent. Clustering algorithms are capable of finding such structure automatically.
2468 a cluster assignment for each instance, and a central (“mean”) location for each cluster.
2469 K-means iterates between updates to the assignments and the centers:
2470 1. each instance is placed in the cluster with the closest center;
2471 2. each center is recomputed as the average over points in the cluster.
This is formalized in Algorithm 8. The term $||x^{(i)} - \nu||^2$ refers to the squared Euclidean
norm, $\sum_{j=1}^{V} (x_j^{(i)} - \nu_j)^2$.
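The two alternating updates can be written compactly in pure Python. This is a sketch rather than a transcription of Algorithm 8: the initial centers are supplied by the caller, the loop runs for a fixed number of iterations instead of testing for convergence, and an empty cluster simply keeps its previous center. All of these are assumptions made for illustration.

```python
def squared_distance(x, nu):
    """Squared Euclidean distance between an instance x and a center nu."""
    return sum((x_j - nu_j) ** 2 for x_j, nu_j in zip(x, nu))

def k_means(points, initial_centers, num_iterations=10):
    centers = [list(c) for c in initial_centers]
    assignments = []
    for _ in range(num_iterations):
        # Step 1: place each instance in the cluster with the closest center.
        assignments = [
            min(range(len(centers)), key=lambda k: squared_distance(x, centers[k]))
            for x in points
        ]
        # Step 2: recompute each center as the average over points in the cluster.
        for k in range(len(centers)):
            members = [x for x, z in zip(points, assignments) if z == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(column) / len(members) for column in zip(*members)]
    return assignments, centers

# Hypothetical two-dimensional data with two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignments, centers = k_means(points, initial_centers=[[0.0, 0.0], [10.0, 10.0]])
```

On this data the assignments stabilize after the first iteration, with one center at the mean of each group.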
Soft K-means is a particularly relevant variant. Instead of directly assigning each
point to a specific cluster, soft K-means assigns each point a distribution over clusters
$q^{(i)}$, so that $\sum_{k=1}^{K} q^{(i)}(k) = 1$, and $\forall k, q^{(i)}(k) \geq 0$. The soft weight $q^{(i)}(k)$ is computed from
the distance of $x^{(i)}$ to the cluster center $\nu_k$. In turn, the center of each cluster is computed
from a weighted average of the points in the cluster,
$$\nu_k = \frac{1}{\sum_{i=1}^{N} q^{(i)}(k)} \sum_{i=1}^{N} q^{(i)}(k) x^{(i)}. \qquad [5.1]$$
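One iteration of soft K-means can be sketched as follows. The text says only that the soft weight is computed from the distance to each center; the specific choice below, a softmax over negative squared distances, is an assumption made for illustration. The center update follows Equation 5.1.

```python
import math

def soft_assignments(points, centers):
    """Assumed weighting: q(k) proportional to exp(-squared distance to center k)."""
    q = []
    for x in points:
        neg_dist = [-sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
        shift = max(neg_dist)  # subtract the maximum for numerical stability
        weights = [math.exp(d - shift) for d in neg_dist]
        total = sum(weights)
        q.append([w / total for w in weights])
    return q

def update_centers(points, q, num_clusters):
    """Weighted average of the points, as in Equation 5.1."""
    dim = len(points[0])
    centers = []
    for k in range(num_clusters):
        total = sum(q_i[k] for q_i in q)  # assumed nonzero for every cluster
        centers.append(
            [sum(q_i[k] * x[j] for q_i, x in zip(q, points)) / total for j in range(dim)]
        )
    return centers

# Hypothetical one-dimensional data with two groups.
points = [[0.0], [1.0], [4.0], [5.0]]
q = soft_assignments(points, centers=[[0.0], [5.0]])
new_centers = update_centers(points, q, num_clusters=2)
```

Every point contributes to every center, but distant points receive weights close to zero, so the updated centers land near 0.5 and 4.5.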
We will now explore a probabilistic version of soft K-means clustering, based on
expectation maximization (EM). Because EM clustering can be derived as an approximation to
maximum-likelihood estimation, it can be extended in a number of useful ways.
Expectation maximization combines the idea of soft K-means with Naïve Bayes
classification. To review, Naïve Bayes defines a probability distribution over the data,
$$\log p(x, y; \phi, \mu) = \sum_{i=1}^{N} \log \left( p(x^{(i)} \mid y^{(i)}; \phi) \times p(y^{(i)}; \mu) \right) \qquad [5.2]$$
Now suppose that you never observe the labels. To indicate this, we’ll refer to the label
of each instance as $z^{(i)}$, rather than $y^{(i)}$, which is usually reserved for observed variables.
By marginalizing over the latent variables z, we compute the marginal probability of the
observed instances x:
$$\log p(x; \phi, \mu) = \sum_{i=1}^{N} \log p(x^{(i)}; \phi, \mu) \qquad [5.3]$$
$$= \sum_{i=1}^{N} \log \sum_{z=1}^{K} p(x^{(i)}, z; \phi, \mu) \qquad [5.4]$$
$$= \sum_{i=1}^{N} \log \sum_{z=1}^{K} p(x^{(i)} \mid z; \phi) \times p(z; \mu). \qquad [5.5]$$
To estimate the parameters φ and µ, we can maximize the marginal likelihood in
Equation 5.5. Why is this the right thing to maximize? Without labels, discriminative learning
is impossible, because there is nothing to discriminate. So maximum likelihood is all we have.
When the labels are observed, we can estimate the parameters of the Naïve Bayes
probability model separately for each label. But marginalizing over the labels couples
these parameters, making direct optimization of log p(x) intractable. We will approximate
the log-likelihood by introducing an auxiliary variable $q^{(i)}$, which is a distribution over the
label set Z = {1, 2, . . . , K}. The optimization procedure will alternate between updates to
q and updates to the parameters (φ, µ). Thus, $q^{(i)}$ plays the same role here as in soft K-means.
To derive the updates for this optimization, multiply the right side of Equation 5.5 by
the ratio $\frac{q^{(i)}(z)}{q^{(i)}(z)} = 1$,
$$\log p(x; \phi, \mu) = \sum_{i=1}^{N} \log \sum_{z=1}^{K} p(x^{(i)} \mid z; \phi) \times p(z; \mu) \times \frac{q^{(i)}(z)}{q^{(i)}(z)} \qquad [5.6]$$
$$= \sum_{i=1}^{N} \log \sum_{z=1}^{K} q^{(i)}(z) \times p(x^{(i)} \mid z; \phi) \times p(z; \mu) \times \frac{1}{q^{(i)}(z)} \qquad [5.7]$$
$$= \sum_{i=1}^{N} \log E_{q^{(i)}} \left[ \frac{p(x^{(i)} \mid z; \phi) p(z; \mu)}{q^{(i)}(z)} \right], \qquad [5.8]$$
where $E_{q^{(i)}}[f(z)] = \sum_{z=1}^{K} q^{(i)}(z) \times f(z)$ refers to the expectation of the function f under
the distribution $z \sim q^{(i)}$.
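Jensen’s inequality says that because log is a concave function, we can push it inside
the expectation, and obtain a lower bound.
$$\log p(x; \phi, \mu) \geq \sum_{i=1}^{N} E_{q^{(i)}} \left[ \log \frac{p(x^{(i)} \mid z; \phi) p(z; \mu)}{q^{(i)}(z)} \right] \qquad [5.9]$$
$$J \triangleq \sum_{i=1}^{N} E_{q^{(i)}} \left[ \log p(x^{(i)} \mid z; \phi) + \log p(z; \mu) - \log q^{(i)}(z) \right] \qquad [5.10]$$
$$= \sum_{i=1}^{N} \left( E_{q^{(i)}} \left[ \log p(x^{(i)}, z; \phi, \mu) \right] + H(q^{(i)}) \right) \qquad [5.11]$$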
We will focus on Equation 5.10, which is the lower bound on the marginal log-likelihood
of the observed data, log p(x). Equation 5.11 shows the connection to the information
theoretic concept of entropy, $H(q^{(i)}) = -\sum_{z=1}^{K} q^{(i)}(z) \log q^{(i)}(z)$, which measures the
average amount of information produced by a draw from the distribution $q^{(i)}$. The lower
bound J is a function of two groups of arguments:
2501 The expectation-maximization (EM) algorithm maximizes the bound with respect to each
2502 of these arguments in turn, while holding the other fixed.
$$J = \sum_{i=1}^{N} \sum_{z=1}^{K} q^{(i)}(z) \left[ \log p(x^{(i)} \mid z; \phi) + \log p(z; \mu) - \log q^{(i)}(z) \right]. \qquad [5.12]$$
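When optimizing this bound, we must also respect a set of “sum-to-one” constraints,
$\sum_{z=1}^{K} q^{(i)}(z) = 1$ for all i. Just as in Naïve Bayes, this constraint can be incorporated into a
Lagrangian:
$$J_q = \sum_{i=1}^{N} \left[ \sum_{z=1}^{K} q^{(i)}(z) \left( \log p(x^{(i)} \mid z; \phi) + \log p(z; \mu) - \log q^{(i)}(z) \right) + \lambda^{(i)} \left( 1 - \sum_{z=1}^{K} q^{(i)}(z) \right) \right], \qquad [5.13]$$
where $\lambda^{(i)}$ is the Lagrange multiplier for instance i. Taking the derivative with respect to
$q^{(i)}(z)$, setting it to zero, and solving for $q^{(i)}(z)$ gives,
$$\frac{\partial J_q}{\partial q^{(i)}(z)} = \log p(x^{(i)} \mid z; \phi) + \log p(z; \mu) - \log q^{(i)}(z) - 1 - \lambda^{(i)} \qquad [5.14]$$
$$\log q^{(i)}(z) = \log p(x^{(i)} \mid z; \phi) + \log p(z; \mu) - 1 - \lambda^{(i)} \qquad [5.15]$$
$$q^{(i)}(z) \propto p(x^{(i)} \mid z; \phi) \times p(z; \mu). \qquad [5.16]$$
$$q^{(i)}(z) = \frac{p(x^{(i)} \mid z; \phi) \times p(z; \mu)}{\sum_{z'=1}^{K} p(x^{(i)} \mid z'; \phi) \times p(z'; \mu)} \qquad [5.17]$$
$$= p(z \mid x^{(i)}; \phi, \mu). \qquad [5.18]$$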
After normalizing, each $q^{(i)}$ (the soft distribution over clusters for data $x^{(i)}$)
is set to the posterior probability $p(z \mid x^{(i)}; \phi, \mu)$ under the current parameters. Although
the Lagrange multipliers $\lambda^{(i)}$ were introduced as additional parameters, they drop out
during normalization.
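The E-step for a single instance, and its effect on the lower bound, can be checked numerically. The sketch below assumes made-up joint probabilities p(x, z) for K = 3 clusters. It normalizes them in log space to obtain the posterior, and then compares the per-instance bound from Equation 5.12 under the posterior and under a uniform distribution.

```python
import math

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values))

def e_step(log_joint):
    """log_joint[z] = log p(x | z) + log p(z); returns q(z) = p(z | x)."""
    log_marginal = log_sum_exp(log_joint)
    return [math.exp(v - log_marginal) for v in log_joint]

def lower_bound(log_joint, q):
    """Per-instance bound: sum over z of q(z) * (log p(x, z) - log q(z))."""
    return sum(q_z * (lj - math.log(q_z)) for q_z, lj in zip(q, log_joint) if q_z > 0)

# Hypothetical joint probabilities p(x, z) for z = 1, 2, 3; they sum to p(x) = 0.2.
log_joint = [math.log(0.02), math.log(0.06), math.log(0.12)]
q_posterior = e_step(log_joint)
bound_posterior = lower_bound(log_joint, q_posterior)
bound_uniform = lower_bound(log_joint, [1 / 3, 1 / 3, 1 / 3])
log_marginal = math.log(0.2)
```

With q set to the posterior, the bound equals log p(x) for this instance; any other choice, such as the uniform distribution, gives a strictly smaller value.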
$$J_\phi = \sum_{i=1}^{N} \sum_{z=1}^{K} q^{(i)}(z) \left( \log p(x^{(i)} \mid z; \phi) + \log p(z; \mu) - \log q^{(i)}(z) \right) + \sum_{z=1}^{K} \lambda_z \left( 1 - \sum_{j=1}^{V} \phi_{z,j} \right). \qquad [5.19]$$
The term $\log p(x^{(i)} \mid z; \phi)$ is the conditional log-likelihood for the multinomial, which
expands to,
$$\log p(x^{(i)} \mid z, \phi) = C + \sum_{j=1}^{V} x_j^{(i)} \log \phi_{z,j}, \qquad [5.20]$$
where C is a constant with respect to φ; see Equation 2.12 in § 2.1 for more discussion
of this probability function.
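Setting the derivative of $J_\phi$ equal to zero,
$$\frac{\partial J_\phi}{\partial \phi_{z,j}} = \sum_{i=1}^{N} q^{(i)}(z) \times \frac{x_j^{(i)}}{\phi_{z,j}} - \lambda_z \qquad [5.21]$$
$$\phi_{z,j} \propto \sum_{i=1}^{N} q^{(i)}(z) \times x_j^{(i)}, \qquad [5.22]$$
where the counter j ∈ {1, 2, . . . , V } indexes over base features, such as words.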
This update sets $\phi_z$ equal to the relative frequency estimate of the expected counts under
the distribution $q$. As in supervised Naïve Bayes, we can smooth these counts by adding
a constant $\alpha$. The update for $\mu$ is similar: $\mu_z \propto \sum_{i=1}^{N} q^{(i)}(z) = E_q[\text{count}(z)]$,
which is the expected frequency of cluster $z$. These probabilities can also be smoothed. In sum,
the M-step is just like Naïve Bayes, but with expected counts rather than observed counts.
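The M-step described above can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the function name `m_step`, the use of a single smoothing constant `alpha` for both $\phi$ and $\mu$, and the toy arrays are assumptions made for the example.

```python
import numpy as np

def m_step(X, q, alpha=0.0):
    """Re-estimate phi and mu from expected counts.

    X: (N, V) word counts; q: (N, K) soft assignments from the E-step;
    alpha: smoothing constant added to every expected count (an assumption
    of this sketch: the same alpha is used for phi and mu).
    """
    # Expected count of word j in cluster z: sum_i q_i(z) * x_ij
    word_counts = q.T @ X + alpha                     # shape (K, V)
    phi = word_counts / word_counts.sum(axis=1, keepdims=True)
    # Expected count of cluster z: sum_i q_i(z)
    cluster_counts = q.sum(axis=0) + alpha            # shape (K,)
    mu = cluster_counts / cluster_counts.sum()
    return phi, mu

X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
q = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
phi, mu = m_step(X, q)
```

If every row of `q` is a one-hot vector, the same function reduces to the supervised Naïve Bayes estimate from observed counts.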
The multinomial likelihood $p(x \mid z)$ can be replaced with other probability distributions:
for example, for continuous observations, a Gaussian distribution can be used. In
some cases, there is no closed-form update to the parameters of the likelihood. One approach
is to run gradient-based optimization at each M-step; another is to simply take a
single step along the gradient and then return to the E-step (Berg-Kirkpatrick et al.,
2010).
Figure 5.2: Sensitivity of expectation maximization to initialization. Each line shows the
progress of optimization from a different random initialization. [Plot omitted; horizontal
axis: iteration.]
optimum for the likelihood $p(x \mid z)$, and in online settings where new data is constantly
streamed in (see Liang and Klein, 2009, for a comparison of online EM variants).
Figure 5.3: The negative log-likelihood and AIC for several runs of expectation maximization,
on synthetic data. Although the data was generated from a model with $K = 10$, the
optimal number of clusters is $\hat{K} = 15$, according to AIC and the held-out log-likelihood.
The training set log-likelihood continues to improve as $K$ increases. [Plots omitted. Left
panel: negative log-likelihood bound and AIC against the number of clusters. Right panel:
out-of-sample negative log likelihood against the number of clusters.]
• Introduce latent variables $z$, such that it is easy to write the probability $P(x, z)$. It
  should also be easy to estimate the associated parameters, given knowledge of $z$.
• Derive the E-step updates for $q(z)$, which is typically factored as
  $q(z) = \prod_{i=1}^{N} q_{z^{(i)}}(z^{(i)})$, where $i$ is an index over instances.
• The M-step updates typically correspond to the soft version of a probabilistic supervised
  learning algorithm, like Naïve Bayes.
This section discusses a few of the many applications of this general framework.
count of word $j$ in the context of instance $i$. Truncated singular value decomposition
approximates the matrix $\mathbf{C}$ as a product of three matrices, $\mathbf{U}$, $\mathbf{S}$, $\mathbf{V}$, under the constraint that
$\mathbf{U}$ and $\mathbf{V}$ are orthonormal, and $\mathbf{S}$ is diagonal:
\begin{align}
\min_{\mathbf{U}, \mathbf{S}, \mathbf{V}} \quad & \|\mathbf{C} - \mathbf{U}\mathbf{S}\mathbf{V}^\top\|_F \tag{5.25} \\
\text{s.t.} \quad & \mathbf{U} \in \mathbb{R}^{V \times K}, \quad \mathbf{U}\mathbf{U}^\top = \mathbf{I} \\
& \mathbf{S} = \text{Diag}(s_1, s_2, \ldots, s_K) \\
& \mathbf{V}^\top \in \mathbb{R}^{N_p \times K}, \quad \mathbf{V}\mathbf{V}^\top = \mathbf{I},
\end{align}
where $\|\cdot\|_F$ is the Frobenius norm, $\|\mathbf{X}\|_F = \sqrt{\sum_{i,j} X_{i,j}^2}$. The matrix $\mathbf{U}$ contains the
left singular vectors of $\mathbf{C}$, and the rows of this matrix can be used as low-dimensional
representations of the count vectors $c_i$. EM clustering can be made more robust by setting
the instance descriptions $x^{(i)}$ equal to these rows, rather than using raw counts (Schütze,
1998). However, because the instances are now dense vectors of continuous numbers, the
probability $p(x^{(i)} \mid z)$ must be defined as a multivariate Gaussian distribution.
In truncated singular value decomposition, the hyperparameter $K$ is the truncation
limit: when $K$ is equal to the rank of $\mathbf{C}$, the norm of the difference between the original
matrix $\mathbf{C}$ and its reconstruction $\mathbf{U}\mathbf{S}\mathbf{V}^\top$ will be zero. Lower values of $K$ increase the
reconstruction error, but yield vector representations that are smaller and easier to learn from.
Singular value decomposition is discussed in more detail in chapter 14.
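The effect of the truncation limit on reconstruction error can be checked numerically. The sketch below uses `numpy.linalg.svd` and keeps the top $K$ singular values; the helper name `truncated_svd` and the random count matrix are assumptions introduced only for illustration.

```python
import numpy as np

def truncated_svd(C, K):
    """Return the rank-K factors (U, s, Vt) of C, keeping the K largest singular values."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :K], s[:K], Vt[:K, :]

def reconstruction_error(C, K):
    """Frobenius norm of the difference between C and its rank-K reconstruction."""
    U, s, Vt = truncated_svd(C, K)
    return np.linalg.norm(C - U @ np.diag(s) @ Vt)

# A small synthetic count matrix (hypothetical data)
rng = np.random.default_rng(0)
C = rng.poisson(1.0, size=(6, 8)).astype(float)
rank = int(np.linalg.matrix_rank(C))
errors = [reconstruction_error(C, K) for K in range(1, rank + 1)]

# Rows of U serve as K-dimensional representations of the rows of C
U, s, Vt = truncated_svd(C, 2)
```

The list `errors` is non-increasing in $K$ and reaches (numerically) zero when $K$ equals the rank of the matrix.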
$N_\ell$ is the number of labeled instances and $N_u$ is the number of unlabeled instances. We can
learn from the combined data by maximizing a lower bound on the joint log-likelihood,
\begin{align}
L &= \sum_{i=1}^{N_\ell} \log p(x^{(i)}, y^{(i)}; \mu, \phi) + \sum_{j=N_\ell+1}^{N_\ell+N_u} \log p(x^{(j)}; \mu, \phi) \tag{5.26} \\
&= \sum_{i=1}^{N_\ell} \left[ \log p(x^{(i)} \mid y^{(i)}; \phi) + \log p(y^{(i)}; \mu) \right] + \sum_{j=N_\ell+1}^{N_\ell+N_u} \log \sum_{y=1}^{K} p(x^{(j)}, y; \mu, \phi). \tag{5.27}
\end{align}
Algorithm 9 Generative process for the Naïve Bayes classifier with hidden components
  for Document $i \in \{1, 2, \ldots, N\}$ do:
    Draw the label $y^{(i)} \sim \text{Categorical}(\mu)$;
    Draw the component $z^{(i)} \sim \text{Categorical}(\beta_{y^{(i)}})$;
    Draw the word counts $x^{(i)} \mid y^{(i)}, z^{(i)} \sim \text{Multinomial}(\phi_{z^{(i)}})$.
The left sum is identical to the objective in Naïve Bayes; the right sum is the marginal
log-likelihood for expectation-maximization clustering, from Equation 5.5. We can construct a
lower bound on this log-likelihood by introducing distributions $q^{(j)}$ for all
$j \in \{N_\ell + 1, \ldots, N_\ell + N_u\}$. The E-step updates these distributions; the M-step updates
the parameters $\phi$ and $\mu$, using the expected counts from the unlabeled data and the observed
counts from the labeled data.
A critical issue in semi-supervised learning is how to balance the impact of the labeled
and unlabeled data on the classifier weights, especially when the unlabeled data is much
larger than the labeled dataset. The risk is that the unlabeled data will dominate, causing
the parameters to drift towards a “natural clustering” of the instances, which may
not correspond to a good classifier for the labeled data. One solution is to heuristically
reweight the two components of Equation 5.26, tuning the weight of the two components
on a held-out development set (Nigam et al., 2000).
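One way to write the reweighting is with a scalar weight on the unlabeled term of Equation 5.26. The symbol $\lambda_u$ is introduced here for illustration and is not part of the original notation; $\lambda_u = 1$ recovers Equation 5.26, and $\lambda_u = 0$ ignores the unlabeled data entirely.

```latex
\[
L_{\lambda_u} = \sum_{i=1}^{N_\ell} \log p(x^{(i)}, y^{(i)}; \mu, \phi)
  + \lambda_u \sum_{j=N_\ell+1}^{N_\ell+N_u} \log p(x^{(j)}; \mu, \phi),
  \qquad \lambda_u \in [0, 1].
\]
```

Under this form, the M-step would scale the expected counts from the unlabeled data by $\lambda_u$ before adding them to the observed counts from the labeled data.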
(5.1) ☺ Villeneuve a bel et bien réussi son pari de changer de perspectives tout en assurant
      une cohérence à la franchise.²
(5.2) ☹ Il est également trop long et bancal dans sa narration, tiède dans ses intentions, et
      tiraillé entre deux personnages et directions qui ne parviennent pas à coexister en
      harmonie.³
(5.3) Denis Villeneuve a réussi une suite parfaitement maitrisée⁴
(5.4) Long, bavard, hyper design, à peine agité (le comble de l’action : une bagarre dans la
      flotte), métaphysique et, surtout, ennuyeux jusqu’à la catalepsie.⁵
(5.5) Une suite d’une écrasante puissance, mêlant parfaitement le contemplatif au narratif.⁶
(5.6) Le film impitoyablement bavard finit quand même par se taire quand se lève l’espèce
      de bouquet final où semble se déchaîner, comme en libre parcours de poulets décapités,
      l’armée des graphistes numériques griffant nerveusement la palette graphique entre
      agonie et orgasme.⁷
Table 5.1: Labeled and unlabeled reviews of the films Blade Runner 2049 and Transformers:
The Last Knight.
The labeled data includes $(x^{(i)}, y^{(i)})$, but not $z^{(i)}$, so this is another case of missing
data. Again, we sum over the missing data, applying Jensen’s inequality so as to obtain a
lower bound on the log-likelihood,
\begin{align}
\log p(x^{(i)}, y^{(i)}) &= \log \sum_{z=1}^{K_z} p(x^{(i)}, y^{(i)}, z; \mu, \phi, \beta) \tag{5.28} \\
&\geq \log p(y^{(i)}; \mu) + E_{q^{(i)}_{Z \mid Y}} \left[ \log p(x^{(i)} \mid z; \phi) + \log p(z \mid y^{(i)}; \beta) - \log q^{(i)}(z) \right]. \tag{5.29}
\end{align}
We are now ready to apply expectation maximization. As usual, the E-step updates
the distribution over the missing data, $q^{(i)}_{Z \mid Y}$. The M-step updates the parameters,
\begin{align}
\beta_{y,z} &= \frac{E_q[\text{count}(y, z)]}{\sum_{z'=1}^{K_z} E_q[\text{count}(y, z')]} \tag{5.30} \\
\phi_{z,j} &= \frac{E_q[\text{count}(z, j)]}{\sum_{j'=1}^{V} E_q[\text{count}(z, j')]}. \tag{5.31}
\end{align}
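The updates for the hidden-component model can be sketched in NumPy. This is a hedged illustration: the function names, the array shapes, and the toy data are assumptions, and the E-step uses the form $q^{(i)}(z) \propto p(x^{(i)} \mid z; \phi) \times p(z \mid y^{(i)}; \beta)$, which is what maximizing the bound in Equation 5.29 with respect to $q^{(i)}$ yields.

```python
import numpy as np

def e_step_components(X, y, log_phi, log_beta):
    """Posterior over hidden components z for labeled instances.

    X: (N, V) counts; y: (N,) integer labels;
    log_phi: (K_z, V); log_beta: (K_y, K_z). Returns q with shape (N, K_z).
    """
    # log p(x_i | z) + log p(z | y_i), up to a constant that does not depend on z
    log_q = X @ log_phi.T + log_beta[y]
    log_q = log_q - log_q.max(axis=1, keepdims=True)
    q = np.exp(log_q)
    return q / q.sum(axis=1, keepdims=True)

def m_step_beta(q, y, n_labels):
    """Equation 5.30: normalize the expected (label, component) counts."""
    counts = np.zeros((n_labels, q.shape[1]))
    np.add.at(counts, y, q)  # counts[y_i] += q_i for every instance i
    return counts / counts.sum(axis=1, keepdims=True)

# Toy data: three labeled documents, two labels, two components
X = np.array([[3.0, 0.0], [0.0, 3.0], [2.0, 1.0]])
y = np.array([0, 0, 1])
phi = np.array([[0.8, 0.2], [0.2, 0.8]])
beta = np.array([[0.5, 0.5], [0.5, 0.5]])
q = e_step_components(X, y, np.log(phi), np.log(beta))
beta_new = m_step_beta(q, y, n_labels=2)
```

The sketch assumes every label occurs at least once in `y`; otherwise a row of expected counts would be all zeros and would need smoothing before normalization.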
Table 5.1, there are two labeled examples, one positive and one negative. From this data, a
learner could conclude that réussi is positive and long is negative. This isn’t much! However,
we can propagate this information to the unlabeled data, and potentially learn more.
• If we are confident that réussi is positive, then we might guess that (5.3) is also positive.
• That suggests that parfaitement is also positive.
• We can then propagate this information to (5.5), and learn from the words in this
  example.
• Similarly, we can propagate from the labeled data to (5.4), which we guess to be
  negative because it shares the word long. This suggests that bavard is also negative,
  which we propagate to (5.6).
Instances (5.3) and (5.4) were “similar” to the labeled examples for positivity and negativity,
respectively. By using these instances to expand the models for each class, it became
possible to correctly label instances (5.5) and (5.6), which didn’t share any important features
with the original labeled data. This requires a key assumption: that similar instances
will have similar labels.
In § 5.2.2, we discussed how expectation maximization can be applied to semi-supervised
learning. Using the labeled data, the initial parameters $\phi$ would assign a high weight for
réussi in the positive class, and a high weight for long in the negative class. These weights
helped to shape the distributions $q$ for instances (5.3) and (5.4) in the E-step. In the next
iteration of the M-step, the parameters $\phi$ are updated with counts from these instances,
making it possible to correctly label the instances (5.5) and (5.6).
However, expectation-maximization has an important disadvantage: it requires using
a generative classification model, which restricts the features that can be used for
classification. In this section, we explore non-probabilistic approaches, which impose fewer
restrictions on the classification model.
      x(1)                x(2)           y
1.    Peachtree Street    located on     LOC
2.    Dr. Walker          said           PER
3.    Zanzibar            located in     ? → LOC
4.    Zanzibar            flew to        ? → LOC
5.    Dr. Robert          recommended    ? → PER
6.    Oprah               recommended    ? → PER
Co-training is an iterative multi-view learning algorithm, in which there are separate
classifiers for each view (Blum and Mitchell, 1998). At each iteration of the algorithm, each
classifier predicts labels for a subset of the unlabeled instances, using only the features
available in its view. These predictions are then used as ground truth to train the classifiers
associated with the other views. In the example shown in Table 5.2, the classifier on $x^{(1)}$
might correctly label instance #5 as a person, because of the feature Dr; this instance would
then serve as training data for the classifier on $x^{(2)}$, which would then be able to correctly
label instance #6, thanks to the feature recommended. If the views are truly independent,
this procedure is robust to drift. Furthermore, it imposes no restrictions on the classifiers
that can be used for each view.
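The co-training loop can be illustrated on the six instances of Table 5.2. The sketch below is deliberately simplified and is not the algorithm of Blum and Mitchell (1998) in full: each view uses a hypothetical token-voting classifier (`train` and `predict` are names invented here), and a view labels every instance for which it has any evidence, rather than selecting a confident subset.

```python
from collections import Counter, defaultdict

def train(examples):
    """Map each token to a Counter of the labels it has co-occurred with."""
    model = defaultdict(Counter)
    for text, label in examples:
        for token in text.split():
            model[token][label] += 1
    return model

def predict(model, text):
    """Return the label with the most token votes, or None if no token is known."""
    votes = Counter()
    for token in text.split():
        votes.update(model.get(token, {}))
    return votes.most_common(1)[0][0] if votes else None

def co_train(labeled, unlabeled, n_iters=5):
    """labeled: list of (x1, x2, y); unlabeled: list of (x1, x2)."""
    train_sets = [[(x1, y) for x1, _, y in labeled],
                  [(x2, y) for _, x2, y in labeled]]
    guesses = {}
    for _ in range(n_iters):
        models = [train(examples) for examples in train_sets]
        newly_labeled = {}
        for idx, views in enumerate(unlabeled):
            if idx in guesses:
                continue
            for v in (0, 1):
                label = predict(models[v], views[v])
                if label is not None:
                    newly_labeled[idx] = (v, label)
                    break
        if not newly_labeled:
            break
        for idx, (v, label) in newly_labeled.items():
            guesses[idx] = label
            # A prediction made from view v becomes training data for the other view
            train_sets[1 - v].append((unlabeled[idx][1 - v], label))
    return guesses

labeled = [("Peachtree Street", "located on", "LOC"), ("Dr. Walker", "said", "PER")]
unlabeled = [("Zanzibar", "located in"), ("Zanzibar", "flew to"),
             ("Dr. Robert", "recommended"), ("Oprah", "recommended")]
guesses = co_train(labeled, unlabeled)
```

Here instance #6 (index 3) can only be labeled in the second round, after the second view has learned from the first view's prediction on instance #5.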
Word-sense disambiguation is particularly suited to multi-view learning, thanks to the
heuristic of “one sense per discourse”: if a polysemous word is used more than once in
a given text or conversation, all usages refer to the same sense (Gale et al., 1992). This
motivates a multi-view learning approach, in which one view corresponds to the local
context (the surrounding words), and another view corresponds to the global context at
the document level (Yarowsky, 1995). The local context view is first trained on a small
seed dataset. We then identify its most confident predictions on unlabeled instances. The
global context view is then used to extend these confident predictions to other instances
within the same documents. These new instances are added to the training data for the
local context classifier, which is retrained and then applied to the remaining unlabeled
data.
In label propagation, this is done through a series of matrix operations (Zhu et al.,
2003). Let $\mathbf{Q}$ be a matrix of size $N \times K$, in which each row $q^{(i)}$ describes the labeling
of instance $i$. When ground truth labels are available, then $q^{(i)}$ is an indicator vector,
with $q^{(i)}_{y^{(i)}} = 1$ and $q^{(i)}_{y' \neq y^{(i)}} = 0$. Let us refer to the submatrix of rows containing labeled
instances as $\mathbf{Q}_L$, and the remaining rows as $\mathbf{Q}_U$. The rows of $\mathbf{Q}_U$ are initialized to assign
equal probabilities to all labels, $q_{i,k} = \frac{1}{K}$.
Now, let $T_{i,j}$ represent the “transition” probability of moving from node $j$ to node $i$,
\[
T_{i,j} \triangleq \Pr(j \to i) = \frac{\omega_{i,j}}{\sum_{k=1}^{N} \omega_{k,j}}. \tag{5.33}
\]
We compute values of $T_{i,j}$ for all instances $j$ and all unlabeled instances $i$, forming a matrix
of size $N_U \times N$. If the dataset is large, this matrix may be expensive to store and manipulate;
a solution is to sparsify it, by keeping only the $\kappa$ largest values in each row, and
setting all other values to zero. We can then “propagate” the label distributions to the
unlabeled instances,
The expression $\tilde{\mathbf{Q}}_U \mathbf{1}$ indicates multiplication of $\tilde{\mathbf{Q}}_U$ by a column vector of ones, which is
equivalent to computing the sum of each row of $\tilde{\mathbf{Q}}_U$. The matrix $\text{Diag}(s)$ is a diagonal
matrix with the elements of $s$ on the diagonal. The product $\text{Diag}(s)^{-1} \tilde{\mathbf{Q}}_U$ has the effect
of normalizing the rows of $\tilde{\mathbf{Q}}_U$, so that each row of $\mathbf{Q}_U$ is a probability distribution over
labels.
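A small NumPy sketch of label propagation follows. It assumes that the propagation step takes the form $\tilde{\mathbf{Q}}_U = \mathbf{T}\mathbf{Q}$, followed by the row normalization described above; the function name, the use of `-1` to mark unlabeled nodes, and the four-node chain graph are assumptions made for the example, and no sparsification is applied.

```python
import numpy as np

def label_propagation(W, labels, n_classes, n_iters=50):
    """W: (N, N) non-negative edge weights; labels: length-N ints, -1 if unlabeled.

    Assumes every node has at least one edge, so no column or row sum is zero.
    """
    labels = np.asarray(labels)
    unlabeled = np.flatnonzero(labels < 0)
    labeled = np.flatnonzero(labels >= 0)
    Q = np.full((len(labels), n_classes), 1.0 / n_classes)
    Q[labeled] = np.eye(n_classes)[labels[labeled]]   # indicator rows for labeled nodes
    T = W / W.sum(axis=0, keepdims=True)              # T[i, j] = w_ij / sum_k w_kj
    T_U = T[unlabeled]                                # rows for unlabeled nodes only
    for _ in range(n_iters):
        Q_tilde = T_U @ Q                             # propagate label distributions
        s = Q_tilde.sum(axis=1)                       # row sums
        Q[unlabeled] = Q_tilde / s[:, None]           # normalize each row
    return Q

# Chain graph: node 0 (label 0) - node 1 (?) - node 2 (?) - node 3 (label 1)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Q = label_propagation(W, [0, -1, -1, 1], n_classes=2)
```

The rows for labeled nodes are never overwritten, so the ground truth labels act as fixed sources from which probability mass flows to their neighbors.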
and disappointing will apply across both movies and appliances; but others, like terrifying,
may have meanings that are domain-specific. Domain adaptation algorithms attempt
to do better than direct transfer, by learning from data in both domains. There are two
main families of domain adaptation algorithms, depending on whether any labeled data
is available in the target domain.
Interpolation. Train a classifier for each domain, and combine their predictions. For
example,
\[
\hat{y} = \operatorname*{argmax}_{y} \; \lambda_s \Psi_s(x, y) + (1 - \lambda_s) \Psi_t(x, y), \tag{5.37}
\]
where $\Psi_s$ and $\Psi_t$ are the scoring functions from the source and target domain
classifiers respectively, and $\lambda_s$ is the interpolation weight.
Prediction. Train a classifier on the source domain data, and use its prediction as an
additional feature in a classifier trained on the target domain data.
Priors. Train a classifier on the source domain data, and use its weights as a prior
distribution on the weights of the classifier for the target domain data. This is equivalent
to regularizing the target domain weights towards the weights of the source domain
classifier (Chelba and Acero, 2006),
\[
\ell(\theta_t) = \sum_{i=1}^{N} \ell^{(i)}(x^{(i)}, y^{(i)}; \theta_t) + \lambda \|\theta_t - \theta_s\|_2^2, \tag{5.38}
\]
where $\ell^{(i)}$ is the prediction loss on instance $i$, and $\lambda$ is the regularization weight.
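Two of the baselines above reduce to one-line computations once the classifiers are trained. The sketch below shows the interpolated prediction of Equation 5.37 and the regularized objective of Equation 5.38; the function names and the toy scores and weights are hypothetical, and the per-instance losses are assumed to have been computed elsewhere.

```python
import numpy as np

def interpolate_predict(scores_source, scores_target, lambda_s):
    """Equation 5.37: argmax over labels of the interpolated scores."""
    combined = lambda_s * scores_source + (1.0 - lambda_s) * scores_target
    return int(np.argmax(combined))

def prior_regularized_loss(instance_losses, theta_t, theta_s, lam):
    """Equation 5.38: total loss plus a squared-L2 penalty towards the source weights."""
    penalty = lam * np.sum((theta_t - theta_s) ** 2)
    return float(np.sum(instance_losses) + penalty)

# Hypothetical scores for two labels from the source and target classifiers
scores_source = np.array([2.0, 0.0])
scores_target = np.array([0.0, 1.0])
label_even = interpolate_predict(scores_source, scores_target, lambda_s=0.5)
label_target_heavy = interpolate_predict(scores_source, scores_target, lambda_s=0.1)

loss = prior_regularized_loss(np.array([1.0, 2.0]),
                              theta_t=np.array([1.0, 1.0]),
                              theta_s=np.array([0.0, 1.0]),
                              lam=0.5)
```

Lowering `lambda_s` shifts the decision towards the target-domain classifier, which is the behavior one would tune on held-out target data.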
with (boring, −, MOVIE) indicating the word boring appearing in a negative labeled
document in the MOVIE domain, and (boring, −, ∗) indicating the same word in a negative
labeled document in any domain. It is up to the learner to allocate weight between the
domain-specific and cross-domain features: for words that facilitate prediction in both
domains, the learner will use the cross-domain features; for words that are relevant only
to a single domain, the domain-specific features will be used. Any discriminative
classifier can be used with these augmented features.⁹
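The feature augmentation can be sketched as a simple dictionary transformation. This is an illustrative reading of the scheme, with assumed names: `augment` is hypothetical, features are keyed by `(word, domain)` pairs, and the label component of the feature tuples in the text is left to the classifier's own feature function.

```python
def augment(features, domain):
    """Copy each base feature into a cross-domain version and a domain-specific version.

    features: dict mapping a word to its count; domain: a string such as "MOVIE".
    The key (word, "*") stands for the cross-domain copy.
    """
    augmented = {}
    for word, value in features.items():
        augmented[(word, "*")] = value       # shared across all domains
        augmented[(word, domain)] = value    # active only in this domain
    return augmented

movie_doc = augment({"boring": 1, "terrifying": 2}, "MOVIE")
appliance_doc = augment({"boring": 1}, "APPLIANCE")
shared_keys = set(movie_doc) & set(appliance_doc)
```

Only the cross-domain copy of boring is shared between the two documents, so a learner can place weight there for words that behave the same way in both domains.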
present in each example, using the remaining base features. Let $\phi_j$ denote the weights of
this classifier, and let us horizontally concatenate the weights for each of the $N_p$ pivot features
into a matrix $\Phi = [\phi_1, \phi_2, \ldots, \phi_{N_p}]$.
We then perform truncated singular value decomposition on $\Phi$, as described in § 5.2.1,
obtaining $\Phi \approx \mathbf{U}\mathbf{S}\mathbf{V}^\top$. The rows of the matrix $\mathbf{U}$ summarize information about each base
feature: indeed, the truncated singular value decomposition identifies a low-dimensional
basis for the weight matrix $\Phi$, which in turn links base features to pivot features.
Suppose that a base feature reliable occurs only in the target domain of appliance reviews.
Nonetheless, it will have a positive weight towards some pivot features (e.g., outstanding,
recommended), and a negative weight towards others (e.g., worthless, unpleasant). A base
feature such as watchable might have the same associations with the pivot features, and
therefore, $u_{\text{reliable}} \approx u_{\text{watchable}}$. The matrix $\mathbf{U}$ can thus project the base features into a
space in which this information is shared.
Adversarial objectives. The ultimate goal is for the transformed representations $g(x^{(i)})$
to be domain-general. This can be made an explicit optimization criterion by computing
the similarity of transformed instances both within and between domains (Tzeng
et al., 2015), or by formulating an auxiliary classification task, in which the domain
itself is treated as a label (Ganin et al., 2016). This setting is adversarial, because we want
to learn a representation that makes this classifier perform poorly. At the same time, we
want $g(x^{(i)})$ to enable accurate predictions of the labels $y^{(i)}$.
Figure 5.4: A schematic view of adversarial domain adaptation. The loss $\ell_y$ is computed
only for instances from the source domain, where labels $y^{(i)}$ are available. [Diagram omitted:
$x$ is mapped to $g(x)$, which feeds both the label loss $\ell_y$ (with $y^{(i)}$) and the domain loss
$\ell_d$ (with $d^{(i)}$).]
To formalize this idea, let $d^{(i)}$ represent the domain of instance $i$, and let $\ell_d(g(x^{(i)}), d^{(i)}; \theta_d)$
represent the loss of a classifier (typically a deep neural network) trained to predict $d^{(i)}$
from the transformed representation $g(x^{(i)})$, using parameters $\theta_d$. Analogously, let
$\ell_y(g(x^{(i)}), y^{(i)}; \theta_y)$ represent the loss of a classifier trained to predict the label $y^{(i)}$
from $g(x^{(i)})$, using parameters $\theta_y$. The transformation $g$ can then be trained from two
criteria: it should yield accurate predictions of the labels $y^{(i)}$, while making inaccurate
predictions of the domains $d^{(i)}$. This can be formulated as a joint optimization problem,
\[
\min_{\theta_g, \theta_y, \theta_d} \; \sum_{i=1}^{N_\ell + N_u} \ell_d(g(x^{(i)}; \theta_g), d^{(i)}; \theta_d) - \sum_{i=1}^{N_\ell} \ell_y(g(x^{(i)}; \theta_g), y^{(i)}; \theta_y), \tag{5.40}
\]
where $N_\ell$ is the number of labeled instances and $N_u$ is the number of unlabeled instances,
with the labeled instances appearing first in the dataset. This setup is shown in Figure 5.4.
The loss can be optimized by stochastic gradient descent, jointly training the parameters
of the non-linear transformation $\theta_g$, and the parameters of the prediction models $\theta_d$ and
$\theta_y$.
data $z^{(1:N_z)}$ is the set of cluster memberships, $y^{(1:N)}$, so we draw samples from the
posterior distribution over clusterings of the data. If a single clustering is required, we can
select the one with the highest conditional likelihood, $\hat{z} = \operatorname{argmax}_{z} \, p(z^{(1:N_z)} \mid x^{(1:N_x)})$.
This general family of algorithms is called Markov Chain Monte Carlo (MCMC):
“Monte Carlo” because it is based on a series of random draws; “Markov Chain” because
the sampling procedure must be designed such that each sample depends only on the
previous sample, and not on the entire sampling history. Gibbs sampling is an MCMC
algorithm in which each latent variable is sampled from its posterior distribution,
where $z^{(-n)}$ indicates $\{z \setminus z^{(n)}\}$, the set of all latent variables except for $z^{(n)}$. Repeatedly
drawing samples over all latent variables constructs a Markov chain, which is guaranteed
to converge to a sequence of samples from $p(z^{(1:N_z)} \mid x^{(1:N_x)})$. In probabilistic
clustering, the sampling distribution has the following form,
In this case, the sampling distribution does not depend on the other instances $x^{(-i)}, z^{(-i)}$:
given the parameters $\phi$ and $\mu$, the posterior distribution over each $z^{(i)}$ can be computed
from $x^{(i)}$ alone.
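When the parameters are held fixed, one sweep of Gibbs sampling over the cluster memberships amounts to drawing each $z^{(i)}$ from its posterior. The sketch below assumes that this posterior has the form $p(z^{(i)} \mid x^{(i)}; \phi, \mu) \propto p(x^{(i)} \mid z^{(i)}; \phi) \times p(z^{(i)}; \mu)$, matching Equation 5.16; the function name and the toy data are hypothetical.

```python
import numpy as np

def sample_memberships(X, log_phi, log_mu, rng):
    """Draw one cluster membership per instance from its posterior given phi and mu.

    X: (N, V) counts; log_phi: (K, V); log_mu: (K,); rng: numpy random Generator.
    """
    log_post = X @ log_phi.T + log_mu
    log_post = log_post - log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post = post / post.sum(axis=1, keepdims=True)
    # Unlike the E-step, which keeps the full distribution, we sample a single value
    return np.array([rng.choice(len(p), p=p) for p in post])

rng = np.random.default_rng(0)
X = np.array([[30.0, 0.0], [0.0, 30.0]])
phi = np.array([[0.9, 0.1], [0.1, 0.9]])
mu = np.array([0.5, 0.5])
z = sample_memberships(X, np.log(phi), np.log(mu), rng)
```

Because each draw depends only on $x^{(i)}$ and the parameters, the instances can be sampled in any order, or in parallel.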
In sampling algorithms, there are several choices for how to deal with the parameters.
One possibility is to sample them too. To do this, we must add them to the generative
story, by introducing a prior distribution. For the multinomial and categorical parameters
in the EM clustering model, the Dirichlet distribution is a typical choice, since it defines
a probability on exactly the set of vectors that can be parameters: vectors that sum to one
and include only non-negative numbers.¹⁰
To incorporate this prior, the generative model must be augmented to indicate that each
$\phi_z \sim \text{Dirichlet}(\alpha_\phi)$, and $\mu \sim \text{Dirichlet}(\alpha_\mu)$. The hyperparameters $\alpha$ are typically set to
a constant vector $\alpha = [\alpha, \alpha, \ldots, \alpha]$. When $\alpha$ is large, the Dirichlet distribution tends to
generate vectors that are nearly uniform; when $\alpha$ is small, it tends to generate vectors that
assign most of their probability mass to a few entries. Given prior distributions over $\phi$
and $\mu$, we can now include them in Gibbs sampling, drawing values for these parameters
from posterior distributions that are conditioned on the other variables in the model.
¹⁰ If $\sum_{i}^{K} \theta_i = 1$ and $\theta_i \geq 0$ for all $i$, then $\theta$ is said to be on the $K - 1$ simplex. A Dirichlet
distribution with parameter $\alpha \in \mathbb{R}^{K}_{+}$ has support over the $K - 1$ simplex,
\begin{align}
p_{\text{Dirichlet}}(\theta \mid \alpha) &= \frac{1}{B(\alpha)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1} \tag{5.44} \\
B(\alpha) &= \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{K} \alpha_i)}, \tag{5.45}
\end{align}
with $\Gamma(\cdot)$ indicating the gamma function, a generalization of the factorial function to
non-negative reals.
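The effect of the concentration hyperparameter is easy to see by sampling. The sketch below draws from symmetric Dirichlet distributions with NumPy's `Generator.dirichlet` and reports the average size of the largest entry; the helper name `mean_max_entry` and the particular values of $\alpha$, $K$, and the sample size are choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_max_entry(alpha, K=10, n_samples=2000):
    """Average of the largest coordinate over draws from Dirichlet([alpha] * K)."""
    samples = rng.dirichlet(np.full(K, alpha), size=n_samples)
    return float(samples.max(axis=1).mean())

# Large alpha: draws are close to uniform, so the largest entry stays near 1/K.
near_uniform = mean_max_entry(alpha=100.0)
# Small alpha: draws put most of their mass on very few entries.
sparse = mean_max_entry(alpha=0.01)
```

With $K = 10$, a nearly uniform draw has a largest entry close to $0.1$, while a sparse draw has a largest entry close to $1$.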
Unfortunately, sampling $\phi$ and $\mu$ usually leads to slow convergence, meaning that a
large number of samples is required before the Markov chain breaks free from the initial
conditions. The reason is that the sampling distributions for these parameters are tightly
constrained by the cluster memberships $y^{(i)}$, which in turn are tightly constrained by the
parameters. There are two solutions that are frequently employed:
• Empirical Bayesian methods maintain $\phi$ and $\mu$ as parameters rather than latent
  variables. They still employ sampling in the E-step of the EM algorithm, but they
  update the parameters using expected counts that are computed from the samples
  rather than from parametric distributions. This EM-MCMC hybrid is also known
  as Monte Carlo Expectation Maximization (MCEM; Wei and Tanner, 1990), and is
  well-suited for cases in which it is difficult to compute $q^{(i)}$ directly.
• In collapsed Gibbs sampling, we analytically integrate $\phi$ and $\mu$ out of the model.
  The cluster memberships $y^{(i)}$ are the only remaining latent variable; we sample them
  from the compound distribution,
  \[
  p(y^{(i)} \mid x^{(1:N)}, y^{(-i)}; \alpha_\phi, \alpha_\mu) = \int_{\phi, \mu} p(\phi, \mu \mid y^{(-i)}, x^{(1:N)}; \alpha_\phi, \alpha_\mu) \, p(y^{(i)} \mid x^{(1:N)}, y^{(-i)}, \phi, \mu) \, d\phi \, d\mu. \tag{5.46}
  \]
  For multinomial and Dirichlet distributions, the sampling distribution can be
  computed in closed form.
MCMC algorithms are guaranteed to converge to the true posterior distribution over
the latent variables, but there is no way to know how long this will take. In practice, the
rate of convergence depends on initialization, just as expectation-maximization depends
on initialization to avoid local optima. Thus, while Gibbs Sampling and other MCMC
algorithms provide a powerful and flexible array of techniques for statistical inference in
latent variable models, they are not a panacea for the problems experienced by EM.
which is similar to the eigendecomposition $\mathbf{C} = \mathbf{Q}\Lambda\mathbf{Q}^\top$. This suggests that simply by
finding the eigenvectors and eigenvalues of $\mathbf{C}$, we could obtain the parameters $\phi$ and $\mu$,
and this is what motivates the name spectral learning.
While moment-matching and eigendecomposition are similar in form, they impose
different constraints on the solutions: eigendecomposition requires orthonormality, so
that $\mathbf{Q}\mathbf{Q}^\top = \mathbf{I}$; in estimating the parameters of a text clustering model, we require that $\mu$
and the columns of $\Phi$ are probability vectors. Spectral learning algorithms must therefore
include a procedure for converting the solution into vectors that are non-negative and
sum to one. One approach is to replace eigendecomposition (or the related singular value
decomposition) with non-negative matrix factorization (Xu et al., 2003), which guarantees
that the solutions are non-negative (Arora et al., 2013).
After obtaining the parameters $\phi$ and $\mu$, the distribution over clusters can be
computed from Bayes’ rule:
Spectral learning yields provably good solutions without regard to initialization, and can
be quite fast in practice. However, it is more difficult to apply to a broad family of
generative models than more generic techniques like EM and Gibbs Sampling. For more on
applying spectral learning across a range of latent variable models, see Anandkumar et al.
(2014).
• Active learning: the learner selects unlabeled instances and requests annotations
  (Settles, 2012).
• Multiple instance learning: labels are applied to bags of instances, with a positive
  label applied if at least one instance in the bag meets the criterion (Dietterich et al.,
  1997; Maron and Lozano-Pérez, 1998).
• Constraint-driven learning: supervision is provided in the form of explicit
  constraints on the learner (Chang et al., 2007; Ganchev et al., 2010).
• Distant supervision: noisy labels are generated from an external resource (Mintz
  et al., 2009; see also § 17.2.3).
• Multitask learning: the learner induces a representation that can be used to solve
  multiple classification tasks (Collobert et al., 2011).
• Transfer learning: the learner must solve a classification task that differs from the
  labeled data (Pan and Yang, 2010).
Expectation maximization was introduced by Dempster et al. (1977), and is discussed
in more detail by Murphy (2012). Like most machine learning treatments, Murphy focuses
on continuous observations and Gaussian likelihoods, rather than the discrete observations
typically encountered in natural language processing. Murphy (2012) also includes
an excellent chapter on MCMC; for a textbook-length treatment, see Robert and Casella
(2013). For still more on Bayesian latent variable models, see Barber (2012), and for
applications of Bayesian models to natural language processing, see Cohen (2016). Surveys
are available for semi-supervised learning (Zhu and Goldberg, 2009) and domain adaptation
(Søgaard, 2013), although both pre-date the current wave of interest in deep learning.
Exercises
1. Derive the expectation maximization update for the parameter $\mu$ in the EM clustering
   model.
2. The expectation maximization lower bound $J$ is defined in Equation 5.10. Prove
   that the negation $-J$ is convex in $q$. You can use the following facts about convexity:
   • $f(x)$ is convex in $x$ iff $\alpha f(x_1) + (1 - \alpha) f(x_2) \geq f(\alpha x_1 + (1 - \alpha) x_2)$ for all
     $\alpha \in [0, 1]$.
   • If $f(x)$ and $g(x)$ are both convex in $x$, then $f(x) + g(x)$ is also convex in $x$.
3. Derive the E-step and M-step updates for the following generative model. You may
   assume that the labels $y^{(i)}$ are observed, but $z_m^{(i)}$ is not.
   • Import nltk, run nltk.download() and select semcor. Import semcor
     from nltk.corpus.
   • The command semcor.tagged_sents(tag='sem') returns an iterator
     over sense-tagged sentences in the corpus. Each sentence can be viewed as
     an iterator over tree objects. For tree objects that are sense-annotated words,
     you can access the annotation as tree.label(), and the word itself with
     tree.leaves(). So semcor.tagged_sents(tag='sem')[0][2].label()
     would return the sense annotation of the third word in the first sentence.
   • Extract all sentences containing the senses say.v.01 and say.v.02.
   • Build bag-of-words vectors $x^{(i)}$, containing the counts of other words in those
     sentences, including all words that occur in at least two sentences.
   • Implement and run expectation-maximization clustering on the merged data.
   • Compute the frequency with which each cluster includes instances of say.v.01
     and say.v.02.
5. Using the iterative updates in Equations 5.34-5.36, compute the outcome of the label
   propagation algorithm for the following examples.
   [Figure omitted: example graphs whose nodes are marked 0, 1, or ?.]
   The value inside the node indicates the label, $y^{(i)} \in \{0, 1\}$, with $y^{(i)} = ?$ for unlabeled
   nodes. The presence of an edge between two nodes indicates $w_{i,j} = 1$, and the
   absence of an edge indicates $w_{i,j} = 0$. For the third example, you need only compute
   the first three iterations, and then you can guess at the solution in the limit.
In the remaining exercises, you will try out some approaches for semi-supervised learning
and domain adaptation. You will need datasets in multiple domains. You can obtain
product reviews in multiple domains here:
https://www.cs.jhu.edu/~mdredze/datasets/sentiment/processed_acl.tar.gz. Choose a source
and target domain, e.g. dvds and books, and divide the data for the target domain into
training and test sets of equal size.
8. Using only 5% of the target domain training data (and all of the source domain training
   data), implement one of the supervised domain adaptation baselines in § 5.4.1.
   See if this improves on the “direct transfer” baseline from the previous problem.
9. Implement EasyAdapt (§ 5.4.1), again using 5% of the target domain training data
   and all of the source domain data.
10. Now try unsupervised domain adaptation, using the “linear projection” method
    described in § 5.4.2. Specifically:
    • Identify 500 pivot features as the words with the highest frequency in the (complete)
      training data for the source and target domains. Specifically, let $x_i^d$ be the
      count of the word $i$ in domain $d$: choose the 500 words with the largest values
      of $\min(x_i^{\text{source}}, x_i^{\text{target}})$.
    • Train a classifier to predict each pivot feature from the remaining words in the
      document.
    • Arrange the weights of these classifiers into a matrix $\Phi$, and perform truncated
      singular value decomposition, with $k = 20$.
    • Train a classifier from the source domain data, using the combined features
      $x^{(i)} \oplus \mathbf{U}^\top x^{(i)}$; these include the original bag-of-words features, plus the
      projected features.
    • Apply this classifier to the target domain test set, and compute the accuracy.
Chapter 6
Language models
In probabilistic classification, the problem is to compute the probability of a label,
conditioned on the text. Let’s now consider the inverse problem: computing the probability of
text itself. Specifically, we will consider models that assign probability to a sequence of
word tokens, $p(w_1, w_2, \ldots, w_M)$, with $w_m \in \mathcal{V}$. The set $\mathcal{V}$ is a discrete vocabulary,
Why would you want to compute the probability of a word sequence? In many applications,
the goal is to produce word sequences as output:
• In machine translation (chapter 18), we convert from text in a source language to
  text in a target language.
• In speech recognition, we convert from audio signal to text.
• In summarization (§ 16.3.4.1; § 19.2), we convert from long texts into short texts.
• In dialogue systems (§ 19.3), we convert from the user’s input (and perhaps an
  external knowledge base) into a text response.
In many of the systems for performing these tasks, there is a subcomponent that computes
the probability of the output text. The purpose of this component is to generate
texts that are more fluent. For example, suppose we want to translate a sentence from
Spanish to English.
138 CHAPTER 6. LANGUAGE MODELS
A good language model of English will tell us that the probability of this translation is
low, in comparison with more grammatical alternatives,
\[
p(\text{The coffee black me pleases much}) < p(\text{I love dark coffee}). \tag{6.2}
\]
How can we use this fact? Warren Weaver, one of the early leaders in machine translation,
viewed it as a problem of breaking a secret code (Weaver, 1955):
    When I look at an article in Russian, I say: ‘This is really written in English,
    but it has been coded in some strange symbols. I will now proceed to decode.’
• The English sentence $w^{(e)}$ is generated from a language model, $p_e(w^{(e)})$.
• The Spanish sentence $w^{(s)}$ is then generated from a translation model, $p_{s|e}(w^{(s)} \mid w^{(e)})$.
Given these two distributions, we can then perform translation by Bayes’ rule:
This is sometimes called the noisy channel model, because it envisions English text
turning into Spanish by passing through a noisy channel, $p_{s|e}$. What is the advantage of
modeling translation this way, as opposed to modeling $p_{e|s}$ directly? The crucial point is
that the two distributions $p_{s|e}$ (the translation model) and $p_e$ (the language model) can be
estimated from separate data. The translation model requires examples of correct
translations, but the language model requires only text in English. Such monolingual data is
much more widely available. Furthermore, once estimated, the language model $p_e$ can be
reused in any application that involves generating English text, from summarization to
speech recognition.
This estimator is unbiased: in the theoretical limit of infinite data, the estimate will be correct. But in practice, we are asking for accurate counts over an infinite number of events, since sequences of words can be arbitrarily long. Even with an aggressive upper bound of, say, $M = 20$ tokens in the sequence, the number of possible sequences is $V^{20}$. A small vocabulary for English would have $V = 10^4$, so there are $10^{80}$ possible sequences. Clearly, this estimator is very data-hungry, and suffers from high variance: even grammatical sentences will have probability zero if they have not occurred in the training data.¹ We therefore need to introduce bias to have a chance of making reliable estimates from finite training data. The language models that follow in this chapter introduce bias in various ways.
We begin with n-gram language models, which compute the probability of a sequence as the product of probabilities of subsequences. The probability of a sequence $p(w) = p(w_1, w_2, \ldots, w_M)$ can be refactored using the chain rule (see § A.2):
$$p(w) = p(w_1) \times p(w_2 \mid w_1) \times \cdots \times p(w_M \mid w_{M-1}, \ldots, w_1).$$
Each element in the product is the probability of a word given all its predecessors. We can think of this as a word prediction task: given the context Computers are, we want to compute a probability over the next token. The relative frequency estimate of the probability of the word useless in this context is,
$$p(\text{useless} \mid \text{computers are}) = \frac{\text{count}(\text{computers are useless})}{\sum_{x \in \mathcal{V}} \text{count}(\text{computers are } x)}.$$
We haven’t made any approximations yet, and we could have just as well applied the chain rule in reverse order,
$$p(w) = p(w_M) \times p(w_{M-1} \mid w_M) \times \cdots \times p(w_1 \mid w_2, \ldots, w_M),$$
or in any other order. But this means that we also haven’t really made any progress: to compute the conditional probability $p(w_M \mid w_{M-1}, w_{M-2}, \ldots, w_1)$, we would need to model $V^{M-1}$ contexts. Such a distribution cannot be estimated from any realistic sample of text.
1
Chomsky has famously argued that this is evidence against the very concept of probabilistic language
models: no such model could distinguish the grammatical sentence colorless green ideas sleep furiously from
the ungrammatical permutation furiously sleep ideas green colorless. Indeed, even the bigrams in these two
examples are unlikely to occur — at least, not in texts written before Chomsky proposed this example.
To solve this problem, n-gram models make a crucial simplifying approximation: condition on only the past $n - 1$ words.
$$p(w_m \mid w_{m-1}, \ldots, w_1) \approx p(w_m \mid w_{m-1}, \ldots, w_{m-n+1}) \qquad [6.9]$$
This means that the probability of a sentence w can be approximated as
$$p(w_1, \ldots, w_M) \approx \prod_{m=1}^{M} p(w_m \mid w_{m-1}, \ldots, w_{m-n+1}) \qquad [6.10]$$
This model requires estimating and storing the probability of only $V^n$ events, which is exponential in the order of the n-gram, and not $V^M$, which is exponential in the length of the sentence. The n-gram probabilities can be computed by relative frequency estimation,
$$p(w_m \mid w_{m-1}, w_{m-2}) = \frac{\text{count}(w_{m-2}, w_{m-1}, w_m)}{\sum_{w'} \text{count}(w_{m-2}, w_{m-1}, w')} \qquad [6.12]$$
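As a rough illustration of relative frequency estimation for trigrams, the following Python sketch counts trigrams in a toy token list and divides by the count of the two-word context. The function name `trigram_mle` and the toy corpus are assumptions made for this example; no smoothing is applied, so unseen events in a seen context receive probability zero.

```python
from collections import Counter

def trigram_mle(tokens):
    """Return p(w | u, v), estimated by relative frequency (no smoothing)."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Summing trigram counts over the final word gives the context count,
    # which is the denominator of the relative frequency estimate.
    context_counts = Counter()
    for (u, v, _w), c in trigram_counts.items():
        context_counts[(u, v)] += c

    def prob(w, u, v):
        if context_counts[(u, v)] == 0:
            raise ValueError("context never observed; the estimate is undefined")
        return trigram_counts[(u, v, w)] / context_counts[(u, v)]

    return prob

# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat the cat ate the fish".split()
p = trigram_mle(corpus)
p_sat = p("sat", "the", "cat")   # "the cat" is followed by "sat" once and "ate" once
p_ran = p("ran", "the", "cat")   # never observed after "the cat"
```

The second query shows the sparsity problem directly: a perfectly plausible continuation gets probability zero simply because it is absent from the training data.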
3083 The hyperparameter n controls the size of the context used in each conditional proba-
3084 bility. If this is misspecified, the language model will perform poorly. Let’s consider the
3085 potential problems concretely.
3089 In each example, the bolded words depend on each other: the likelihood of their
3090 depends on knowing that gorillas is plural, and the likelihood of crashed depends on
3091 knowing that the subject is a computer. If the n-grams are not big enough to capture
3092 this context, then the resulting language model would offer probabilities that are too
3093 low for these sentences, and too high for sentences that fail basic linguistic tests like
3094 number agreement.
When n is too big. In this case, it is hard to obtain good estimates of the n-gram parameters from our dataset, because of data sparsity. To handle the gorilla example, it is necessary to model 6-grams, which means accounting for $V^6$ events. Under a very small vocabulary of $V = 10^4$, this means estimating the probability of $10^{24}$ distinct events.
3099 These two problems point to another bias-variance tradeoff (see § 2.1.4). A small n-
3100 gram size introduces high bias, and a large n-gram size introduces high variance. But
3101 in reality we often have both problems at the same time! Language is full of long-range
3102 dependencies that we cannot capture because n is too small; at the same time, language
3103 datasets are full of rare phenomena, whose probabilities we fail to estimate accurately
3104 because n is too large. One solution is to try to keep n large, while still making low-
3105 variance estimates of the underlying parameters. To do this, we will introduce a different
3106 sort of bias: smoothing.
$$p_{\text{smooth}}(w_m \mid w_{m-1}) = \frac{\text{count}(w_{m-1}, w_m) + \alpha}{\sum_{w' \in \mathcal{V}} \text{count}(w_{m-1}, w') + V\alpha}. \qquad [6.13]$$
3120 This basic framework is called Lidstone smoothing, but special cases have other names:
3124 To maintain normalization, anything that we add to the numerator (α) must also ap-
3125 pear in the denominator (V α). This idea is reflected in the concept of effective counts:
$$c_i^* = (c_i + \alpha) \frac{M}{M + V\alpha}, \qquad [6.14]$$
Table 6.1: Example of Lidstone smoothing and absolute discounting in a bigram language
model, for the context (alleged, ), for a toy corpus with a total of twenty counts over the
seven words shown. Note that discounting decreases the probability for all but the un-
seen words, while Lidstone smoothing increases the effective counts and probabilities for
deficiencies and outbreak.
where $c_i$ is the count of event i, $c_i^*$ is the effective count, and $M = \sum_{i=1}^{V} c_i$ is the total number of tokens in the dataset $(w_1, w_2, \ldots, w_M)$. This term ensures that $\sum_{i=1}^{V} c_i^* = \sum_{i=1}^{V} c_i = M$. The discount for each n-gram is then computed as,
$$d_i = \frac{c_i^*}{c_i} = \frac{(c_i + \alpha)}{c_i} \frac{M}{(M + V\alpha)}.$$
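The relationship between smoothed probabilities, effective counts, and discounts can be checked numerically. The sketch below applies Lidstone smoothing to the counts observed in a single context; the function name `lidstone` and the toy counts are assumptions for illustration, and the dictionary is assumed to list every word in the vocabulary, including those with zero count.

```python
def lidstone(counts, alpha):
    """Lidstone smoothing for one context.

    counts maps every vocabulary word to its count in the context,
    including words whose count is zero.
    """
    V = len(counts)
    M = sum(counts.values())
    denom = M + V * alpha
    probs = {w: (c + alpha) / denom for w, c in counts.items()}
    # Effective counts rescale the smoothed counts so that they still sum to M.
    effective = {w: (c + alpha) * M / denom for w, c in counts.items()}
    # The discount is the ratio of effective to observed count (observed events only).
    discounts = {w: effective[w] / c for w, c in counts.items() if c > 0}
    return probs, effective, discounts

# Hypothetical counts: ten tokens over a five-word vocabulary.
toy_counts = {"a": 6, "b": 3, "c": 1, "d": 0, "e": 0}
probs, effective, discounts = lidstone(toy_counts, alpha=0.5)
```

In this toy case the frequent word "a" has a discount below one, while the rare word "c" has a ratio above one, since adding α matters more for small counts.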
3127 Discounting “borrows” probability mass from observed n-grams and redistributes it. In
3128 Lidstone smoothing, the borrowing is done by increasing the denominator of the relative
3129 frequency estimates. The borrowed probability mass is then redistributed by increasing
3130 the numerator for all n-grams. Another approach would be to borrow the same amount
3131 of probability mass from all observed n-grams, and redistribute it among only the unob-
3132 served n-grams. This is called absolute discounting. For example, suppose we set an
3133 absolute discount d = 0.1 in a bigram model, and then redistribute this probability mass
3134 equally over the unseen words. The resulting probabilities are shown in Table 6.1.
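A minimal sketch of absolute discounting with equal redistribution follows. The function name and the counts are hypothetical (they are not the counts of Table 6.1), and the discount d is assumed to be subtracted from each observed count before normalizing.

```python
def absolute_discount(counts, d):
    """Absolute discounting for one context, sharing the reserved mass equally.

    counts maps every vocabulary word to its count (zero for unseen words).
    """
    total = sum(counts.values())
    seen = [w for w, c in counts.items() if c > 0]
    unseen = [w for w, c in counts.items() if c == 0]
    if not unseen:
        # Nothing to redistribute to, so fall back to relative frequency.
        return {w: c / total for w, c in counts.items()}
    reserved = d * len(seen) / total          # mass borrowed from observed words
    probs = {w: (counts[w] - d) / total for w in seen}
    for w in unseen:
        probs[w] = reserved / len(unseen)     # equal share for each unseen word
    return probs

toy_counts = {"a": 6, "b": 3, "c": 1, "d": 0, "e": 0}
discounted = absolute_discount(toy_counts, d=0.1)
```

Unlike Lidstone smoothing, every observed word loses the same absolute amount, so the relative effect is largest for words with small counts.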
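Discounting reserves some probability mass from the observed data, and we need not redistribute this probability mass equally. Instead, we can back off to a lower-order language model: if you have trigrams, use trigrams; if you don’t have trigrams, use bigrams; if you don’t even have bigrams, use unigrams. This is called Katz backoff. In the simple case of backing off from bigrams to unigrams, the bigram probabilities are computed as,
$$c^*(i, j) = c(i, j) - d$$
$$p_{\text{Katz}}(i \mid j) = \begin{cases} \dfrac{c^*(i, j)}{c(j)} & \text{if } c(i, j) > 0 \\[1ex] \alpha(j) \times \dfrac{p_{\text{unigram}}(i)}{\sum_{i' : c(i', j) = 0} p_{\text{unigram}}(i')} & \text{if } c(i, j) = 0. \end{cases}$$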
The term $\alpha(j)$ indicates the amount of probability mass that has been discounted for context j. This probability mass is then divided across all the unseen events, $\{i' : c(i', j) = 0\}$, proportional to the unigram probability of each word $i'$. The discount parameter d can be optimized to maximize performance (typically held-out log-likelihood) on a development set.
In this equation, $p_n^*$ is the unsmoothed empirical probability given by an n-gram language model, and $\lambda_n$ is the weight assigned to this model. To ensure that the interpolated $p(w)$ is still a valid probability distribution, the values of $\lambda$ must obey the constraint, $\sum_{n=1}^{n_{\max}} \lambda_n = 1$. But how to find the specific values?
An elegant solution is expectation maximization. Recall from chapter 5 that we can think about EM as learning with missing data: we just need to choose missing data such that learning would be easy if it weren’t missing. What’s missing in this case? Think of each word $w_m$ as drawn from an n-gram of unknown size, $z_m \in \{1 \ldots n_{\max}\}$. This $z_m$ is the missing data that we are looking for. Therefore, the application of EM to this problem involves the following generative process:
for each token $w_m$, $m = 1, 2, \ldots, M$ do:
    draw the n-gram size $z_m \sim \text{Categorical}(\lambda)$;
    draw $w_m \sim p^*_{z_m}(w_m \mid w_{m-1}, \ldots, w_{m-z_m})$.
If the missing data $\{Z_m\}$ were known, then $\lambda$ could be estimated as the relative frequency,
$$\lambda_z = \frac{\text{count}(Z_m = z)}{M} \qquad [6.17]$$
$$\propto \sum_{m=1}^{M} \delta(Z_m = z). \qquad [6.18]$$
But since we do not know the values of the latent variables $Z_m$, we impute a distribution $q_m$ in the E-step, which represents the degree of belief that word token $w_m$ was generated from an n-gram of order $z_m$,
3158 A solution is obtained by iterating between updates to q and λ. The complete algorithm
3159 is shown in Algorithm 10.
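The alternation between q and λ can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not a transcription of Algorithm 10: the E-step is taken to be q_m(z) ∝ λ_z × p*_z(w_m | context), the M-step sets λ_z to the average of q_m(z), and the input `component_probs` (a hypothetical name) holds the precomputed probability of each token under each n-gram order. At least one component is assumed to give every token nonzero probability.

```python
import math

def log_likelihood(component_probs, lam):
    """Log-likelihood of the tokens under the interpolated model."""
    return sum(math.log(sum(l * p for l, p in zip(lam, p_m)))
               for p_m in component_probs)

def em_interpolation(component_probs, n_iters=50):
    """Estimate interpolation weights by EM.

    component_probs[m][z] is the probability of token m under the
    n-gram model of order z + 1.
    """
    n_max = len(component_probs[0])
    lam = [1.0 / n_max] * n_max               # start from uniform weights
    for _ in range(n_iters):
        expected = [0.0] * n_max
        for p_m in component_probs:
            # E-step: belief that token m was generated by each n-gram order.
            joint = [l * p for l, p in zip(lam, p_m)]
            norm = sum(joint)
            for z in range(n_max):
                expected[z] += joint[z] / norm
        # M-step: relative frequency of the expected assignments.
        lam = [e / len(component_probs) for e in expected]
    return lam

# Hypothetical probabilities for three tokens under unigram and bigram models.
toy_probs = [[0.1, 0.4], [0.2, 0.0], [0.05, 0.5]]
weights = em_interpolation(toy_probs)
```

In practice the weights are usually fit on data that was not used to estimate the component n-gram models; otherwise the highest-order model tends to receive almost all of the weight.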
3166 • Francisco
3167 • Duluth
Now suppose that both bigrams visited Duluth and visited Francisco are unobserved in the training data, and furthermore, the unigram probability $p_1^*(\text{Francisco})$ is greater than $p_1^*(\text{Duluth})$. Nonetheless we would still guess that $p(\text{visited Duluth}) > p(\text{visited Francisco})$, because Duluth is a more “versatile” word: it can occur in many contexts, while Francisco usually occurs in a single context, following the word San. This notion of versatility is the key to Kneser-Ney smoothing.
Writing u for a context of undefined length, and count(w, u) as the count of word w in context u, we define the Kneser-Ney bigram probability as
$$p_{KN}(w \mid u) = \begin{cases} \dfrac{\text{count}(w, u) - d}{\text{count}(u)}, & \text{count}(w, u) > 0 \\[1ex] \alpha(u) \times p_{\text{continuation}}(w), & \text{otherwise} \end{cases} \qquad [6.23]$$
$$p_{\text{continuation}}(w) = \frac{|u : \text{count}(w, u) > 0|}{\sum_{w' \in \mathcal{V}} |u' : \text{count}(w', u') > 0|}. \qquad [6.24]$$
First, note that we reserve probability mass using absolute discounting d, which is taken from all observed n-grams. The total amount of discounting in context u is $d \times |w : \text{count}(w, u) > 0|$, and dividing by $\text{count}(u)$ gives the probability mass that is reserved for the unseen n-grams,
$$\alpha(u) = |w : \text{count}(w, u) > 0| \times \frac{d}{\text{count}(u)}. \qquad [6.25]$$
This is the amount of probability mass left to account for versatility, which we define via the continuation probability $p_{\text{continuation}}(w)$ as proportional to the number of observed contexts in which w appears. The numerator of the continuation probability is the number of contexts u in which w appears; the denominator normalizes the probability by summing the same quantity over all words $w'$.
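To make the role of versatility concrete, here is a small Python sketch that follows Equations 6.23 to 6.25 literally for bigrams. The function names, the discount value, and the toy bigram list are assumptions for illustration; the sketch does not renormalize the backed-off probabilities, and it assumes the queried context has been observed at least once.

```python
from collections import Counter, defaultdict

def kneser_ney(bigrams, d=0.75):
    """Build p_KN(w | u) from a list of (context, word) pairs."""
    pair_counts = Counter(bigrams)
    context_counts = Counter(u for u, _ in bigrams)
    contexts_of = defaultdict(set)    # word -> distinct contexts it follows
    words_after = defaultdict(set)    # context -> distinct words that follow it
    for u, w in pair_counts:
        contexts_of[w].add(u)
        words_after[u].add(w)
    n_bigram_types = len(pair_counts)  # denominator of the continuation probability

    def p_continuation(w):
        return len(contexts_of.get(w, ())) / n_bigram_types

    def p_kn(w, u):
        if pair_counts[(u, w)] > 0:
            return (pair_counts[(u, w)] - d) / context_counts[u]
        # Mass reserved in context u, shared according to versatility.
        alpha = len(words_after.get(u, ())) * d / context_counts[u]
        return alpha * p_continuation(w)

    return p_kn, p_continuation

# Hypothetical data: "francisco" is frequent but follows only "san";
# "duluth" is rarer but follows three different words.
toy_bigrams = ([("san", "francisco")] * 5
               + [("in", "duluth"), ("near", "duluth"), ("from", "duluth")]
               + [("visited", "paris")] * 2 + [("visited", "rome")])
p_kn, p_cont = kneser_ney(toy_bigrams)
p_duluth = p_kn("duluth", "visited")
p_francisco = p_kn("francisco", "visited")
```

Even though "francisco" has the higher unigram count in this toy data, the unseen bigram "visited duluth" receives more probability, because "duluth" appears after more distinct contexts.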
3179 The idea of modeling versatility by counting contexts may seem heuristic, but there is
3180 an elegant theoretical justification from Bayesian nonparametrics (Teh, 2006). Kneser-Ney
3181 smoothing on n-grams was the dominant language modeling technique before the arrival
3182 of neural language models.
[Figure: hidden states $h_0, h_1, h_2, h_3, \ldots$; inputs $x_1, x_2, x_3, \ldots$; words $w_1, w_2, w_3, \ldots$]
Figure 6.1: The recurrent neural network language model, viewed as an “unrolled” computation graph. Solid lines indicate direct computation, dotted blue lines indicate probabilistic dependencies, circles indicate random variables, and squares indicate computation nodes.
$$p(w \mid u) = \frac{\exp(\beta_w \cdot v_u)}{\sum_{w' \in \mathcal{V}} \exp(\beta_{w'} \cdot v_u)}, \qquad [6.26]$$
where $\beta_w \cdot v_u$ represents a dot product. As usual, the denominator ensures that the probability distribution is properly normalized. This vector of probabilities is equivalent to applying the softmax transformation (see § 3.1) to the vector of dot-products,
$$p(\cdot \mid u) = \text{SoftMax}([\beta_1 \cdot v_u, \beta_2 \cdot v_u, \ldots, \beta_V \cdot v_u]).$$
² This idea predates neural language models (e.g., Rosenfeld, 1996; Roark et al., 2007).
The word vectors $\beta_w$ are parameters of the model, and are estimated directly. The context vectors $v_u$ can be computed in various ways, depending on the model. A simple but effective neural language model can be built from a recurrent neural network (RNN; Mikolov et al., 2010). The basic idea is to recurrently update the context vectors while moving through the sequence. Let $h_m$ represent the contextual information at position m in the sequence. RNN language models are defined,
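$$x_m \triangleq \phi_{w_m} \qquad [6.28]$$
$$h_m = \text{RNN}(x_m, h_{m-1}) \qquad [6.29]$$
$$p(w_{m+1} \mid w_1, w_2, \ldots, w_m) = \frac{\exp(\beta_{w_{m+1}} \cdot h_m)}{\sum_{w' \in \mathcal{V}} \exp(\beta_{w'} \cdot h_m)}, \qquad [6.30]$$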
where $\phi$ is a matrix of input word embeddings, and $x_m$ denotes the embedding for word $w_m$. The conversion of $w_m$ to $x_m$ is sometimes known as a lookup layer, because we simply look up the embeddings for each word in a table; see § 3.2.4.
The Elman unit defines a simple recurrent operation (Elman, 1990),
$$\text{RNN}(x_m, h_{m-1}) \triangleq g(\Theta h_{m-1} + x_m), \qquad [6.31]$$
where $\Theta \in \mathbb{R}^{K \times K}$ is the recurrence matrix and g is a non-linear transformation function, often defined as the elementwise hyperbolic tangent tanh (see § 3.1).³ The tanh acts as a squashing function, ensuring that each element of $h_m$ is constrained to the range $[-1, 1]$.
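A single step of an Elman RNN language model can be sketched with NumPy as follows. The function names, dimensions, and randomly initialized parameters are assumptions for illustration only; a trained model would learn `phi`, `theta`, and `beta` from data.

```python
import numpy as np

def softmax(scores):
    """Normalize a vector of scores into a probability distribution."""
    scores = scores - scores.max()      # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def rnn_lm_step(word_id, h_prev, phi, theta, beta):
    """Read one word; return the new state and the next-word distribution."""
    x = phi[word_id]                    # lookup layer: embedding of the current word
    h = np.tanh(theta @ h_prev + x)     # Elman recurrence with tanh as g
    p_next = softmax(beta @ h)          # softmax over dot products with output vectors
    return h, p_next

# Hypothetical sizes: vocabulary of 5 words, 3-dimensional state.
rng = np.random.default_rng(0)
V, K = 5, 3
phi = rng.normal(size=(V, K))           # input word embeddings
beta = rng.normal(size=(V, K))          # output word vectors
theta = rng.normal(size=(K, K))         # recurrence matrix
h = np.zeros(K)                         # initial state h_0
for w in [0, 3, 1]:                     # a toy sequence of word indices
    h, p_next = rnn_lm_step(w, h, phi, theta, beta)
```

After the loop, `p_next` is the model's distribution over the word following the three-word prefix, and it depends on all three words through the state `h`.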
3206 Although each wm depends on only the context vector hm−1 , this vector is in turn
3207 influenced by all previous tokens, w1 , w2 , . . . wm−1 , through the recurrence operation: w1
3208 affects h1 , which affects h2 , and so on, until the information is propagated all the way to
3209 hm−1 , and then on to wm (see Figure 6.1). This is an important distinction from n-gram
3210 language models, where any information outside the n-word window is ignored. In prin-
3211 ciple, the RNN language model can handle long-range dependencies, such as number
3212 agreement over long spans of text — although it would be difficult to know where exactly
3213 in the vector hm this information is represented. The main limitation is that informa-
3214 tion is attenuated by repeated application of the squashing function g. Long short-term
3215 memories (LSTMs), described below, are a variant of RNNs that address this issue, us-
3216 ing memory cells to propagate information through the sequence without applying non-
3217 linearities (Hochreiter and Schmidhuber, 1997).
3218 The denominator in Equation 6.30 is a computational bottleneck, because it involves
3219 a sum over the entire vocabulary. One solution is to use a hierarchical softmax function,
3220 which computes the sum more efficiently by organizing the vocabulary into a tree (Mikolov
3221 et al., 2011). Another strategy is to optimize an alternative metric, such as noise-contrastive
3222 estimation (Gutmann and Hyvärinen, 2012), which learns by distinguishing observed in-
3223 stances from artificial instances generated from a noise distribution (Mnih and Teh, 2012).
3224 Both of these strategies are described in § 14.5.3.
3
In the original Elman network, the sigmoid function was used in place of tanh. For an illuminating
mathematical discussion of the advantages and disadvantages of various nonlinearities in recurrent neural
networks, see the lecture notes from Cho (2015).
• $\phi_i \in \mathbb{R}^K$, the “input” word vectors (these are sometimes called word embeddings, since each word is embedded in a K-dimensional space);
• $\beta_i \in \mathbb{R}^K$, the “output” word vectors;
• $\Theta \in \mathbb{R}^{K \times K}$, the recurrence operator;
• $h_0$, the initial state.
3232 Each of these parameters can be estimated by formulating an objective function over the
3233 training corpus, L(w), and then applying backpropagation to obtain gradients on the
3234 parameters from a minibatch of training examples (see § 3.3.1). Gradient-based updates
3235 can be computed from an online learning algorithm such as stochastic gradient descent
3236 (see § 2.5.2).
The application of backpropagation to recurrent neural networks is known as backpropagation through time, because the gradients on units at time m depend in turn on the gradients of units at earlier times $n < m$. Let $\ell_{m+1}$ represent the negative log-likelihood of word $m + 1$,
$$\ell_{m+1} = -\log p(w_{m+1} \mid w_1, w_2, \ldots, w_m). \qquad [6.32]$$
We require the gradient of this loss with respect to each parameter, such as $\theta_{k,k'}$, an individual element in the recurrence matrix $\Theta$. Since the loss depends on the parameters only through $h_m$, we can apply the chain rule of differentiation,
$$\frac{\partial \ell_{m+1}}{\partial \theta_{k,k'}} = \frac{\partial \ell_{m+1}}{\partial h_m} \frac{\partial h_m}{\partial \theta_{k,k'}}. \qquad [6.33]$$
where $g'$ is the local derivative of the nonlinear function g. The key point in this equation is that the derivative $\frac{\partial h_m}{\partial \theta_{k,k'}}$ depends on $\frac{\partial h_{m-1}}{\partial \theta_{k,k'}}$, which will depend in turn on $\frac{\partial h_{m-2}}{\partial \theta_{k,k'}}$, and so on, until reaching the initial state $h_0$.
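One way to make this recurrence explicit is to differentiate the Elman unit of Equation 6.31 elementwise. The derivation below is a sketch: the pre-activation symbol $a_m$ is introduced here for convenience and is not part of the original notation, and $h_0$ is assumed not to depend on $\Theta$, so that its derivative is zero.

```latex
\begin{align*}
a_m &= \Theta h_{m-1} + x_m, \qquad h_m = g(a_m), \\
\frac{\partial h_{m,j}}{\partial \theta_{k,k'}}
  &= g'(a_{m,j}) \left( \delta(j = k)\, h_{m-1,k'}
     + \sum_{i=1}^{K} \theta_{j,i}\, \frac{\partial h_{m-1,i}}{\partial \theta_{k,k'}} \right),
\qquad \frac{\partial h_{0,i}}{\partial \theta_{k,k'}} = 0.
\end{align*}
```

The first term inside the parentheses is the direct effect of $\theta_{k,k'}$ on unit k at time m; the second term carries the indirect effect through every unit of the previous state, which is what makes the derivative at time m depend on the derivatives at all earlier times.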
Each derivative $\frac{\partial h_m}{\partial \theta_{k,k'}}$ will be reused many times: it appears in backpropagation from the loss $\ell_m$, but also in all subsequent losses $\ell_{n>m}$. Neural network toolkits such as Torch (Collobert et al., 2011) and DyNet (Neubig et al., 2017) compute the necessary derivatives automatically, and cache them for future use. An important distinction from the feedforward neural networks considered in chapter 3 is that the size of the computation graph is not fixed, but varies with the length of the input. This poses difficulties for toolkits that are designed around static computation graphs, such as TensorFlow (Abadi et al., 2016).⁴
4
See https://www.tensorflow.org/tutorials/recurrent (retrieved Feb 8, 2018).
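[Figure: two LSTM cells, at positions m and m+1, with nodes $h_m, o_m, c_m, i_m, \tilde{c}_m, x_m$ and $h_{m+1}, o_{m+1}, f_{m+1}, c_{m+1}, i_{m+1}, \tilde{c}_{m+1}, x_{m+1}$]
Figure 6.2: The long short-term memory (LSTM) architecture. Gates are shown in boxes with dotted edges. In an LSTM language model, each $h_m$ would be used to predict the next word $w_{m+1}$.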
The gates are functions of the input and previous hidden state. They are computed from elementwise sigmoid activations, $\sigma(x) = (1 + \exp(-x))^{-1}$, ensuring that their values will be in the range [0, 1]. They can therefore be viewed as soft, differentiable logic gates. The LSTM architecture is shown in Figure 6.2, and the complete update equations are:
The operator ⊙ is an elementwise (Hadamard) product. Each gate is controlled by a vector of weights, which parametrize the previous hidden state (e.g., $\Theta^{(h \to f)}$) and the current input (e.g., $\Theta^{(x \to f)}$), plus a vector offset (e.g., $b_f$). The overall operation can be informally summarized as $(h_m, c_m) = \text{LSTM}(x_m, (h_{m-1}, c_{m-1}))$, with $(h_m, c_m)$ representing the LSTM state after reading token m.
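The summary operation LSTM(x, (h, c)) can be sketched in NumPy using the widely used formulation with forget, input, and output gates. This is an assumption-laden illustration rather than a transcription of the update equations: the parameter layout in `params`, the inclusion of an offset for the candidate update, and the random initialization are all choices made here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params[name] = (W_h, W_x, b) for name in 'f', 'i', 'o', 'c'."""
    def affine(name):
        W_h, W_x, b = params[name]
        return W_h @ h_prev + W_x @ x + b

    f = sigmoid(affine("f"))          # forget gate: how much of c_prev to keep
    i = sigmoid(affine("i"))          # input gate: how much of the candidate to write
    o = sigmoid(affine("o"))          # output gate: how much of the cell to expose
    c_tilde = np.tanh(affine("c"))    # candidate update to the memory cell
    c = f * c_prev + i * c_tilde      # gated sum; no squashing is applied to c itself
    h = o * np.tanh(c)                # new hidden state
    return h, c

# Hypothetical dimensions and random parameters, for illustration only.
rng = np.random.default_rng(0)
K = 4
params = {name: (rng.normal(size=(K, K)), rng.normal(size=(K, K)), np.zeros(K))
          for name in ("f", "i", "o", "c")}
h, c = np.zeros(K), np.zeros(K)
for x in rng.normal(size=(3, K)):     # three toy input embeddings
    h, c = lstm_step(x, h, c, params)
```

Because the memory cell is updated by a gated sum, information can be carried across many steps when the forget gate stays close to one, without passing repeatedly through a squashing function.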
3286 The LSTM outperforms standard recurrent neural networks across a wide range of
3287 problems. It was first used for language modeling by Sundermeyer et al. (2012), but can
3288 be applied more generally: the vector hm can be treated as a complete representation of
3289 the input sequence up to position m, and can be used for any labeling task on a sequence
3290 of tokens, as we will see in the next chapter.
3291 There are several LSTM variants, of which the Gated Recurrent Unit (Cho et al., 2014)
3292 is one of the more well known. Many software packages implement a variety of RNN
3293 architectures, so choosing between them is simple from a user’s perspective. Jozefowicz
3294 et al. (2015) provide an empirical comparison of various modeling choices circa 2015.
$$\ell(w) = \sum_{m=1}^{M} \log p(w_m \mid w_{m-1}, \ldots, w_1), \qquad [6.42]$$
• In the limit of a perfect language model, probability 1 is assigned to the held-out corpus, with $\text{Perplex}(w) = 2^{-\frac{1}{M} \log_2 1} = 2^0 = 1$.
• In the opposite limit, probability zero is assigned to the held-out corpus, which corresponds to an infinite perplexity, $\text{Perplex}(w) = 2^{-\frac{1}{M} \log_2 0} = 2^{\infty} = \infty$.
• Assume a uniform, unigram model in which $p(w_i) = \frac{1}{V}$ for all words in the vocabulary. Then,
$$\log_2 p(w) = \sum_{m=1}^{M} \log_2 \frac{1}{V} = -\sum_{m=1}^{M} \log_2 V = -M \log_2 V$$
$$\text{Perplex}(w) = 2^{\frac{1}{M} M \log_2 V} = 2^{\log_2 V} = V.$$
This is the “worst reasonable case” scenario, since you could build such a language model without even looking at the data.
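The three cases above can be checked with a short Python sketch. It assumes perplexity is defined as 2 raised to the negative average base-2 log-probability per token, which is the form used in the bullet points; the function name and the example inputs are hypothetical.

```python
import math

def perplexity(token_probs):
    """Perplexity of a held-out corpus.

    token_probs[m] is the model's probability of token m given its predecessors.
    """
    M = len(token_probs)
    if any(p == 0.0 for p in token_probs):
        return math.inf                  # a single zero makes the corpus impossible
    log2_likelihood = sum(math.log2(p) for p in token_probs)
    return 2.0 ** (-log2_likelihood / M)

perfect = perplexity([1.0] * 4)            # every token predicted with certainty
uniform = perplexity([1.0 / 1000] * 7)     # uniform model over 1000 words
impossible = perplexity([0.5, 0.0])        # one token assigned probability zero
```

The uniform case returns the vocabulary size regardless of corpus length, which is why V is a natural reference point when reading reported perplexities.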
In practice, language models tend to give perplexities in the range between 1 and V. A small benchmark dataset is the Penn Treebank, which contains roughly a million tokens; its vocabulary is limited to 10,000 words, with all other tokens mapped to a special ⟨UNK⟩ symbol. On this dataset, a well-smoothed 5-gram model achieves a perplexity of 141 (Mikolov and Zweig), and an LSTM language model achieves perplexity of roughly 80 (Zaremba et al.). Various enhancements to the LSTM architecture can bring the perplexity below 60 (Merity et al., 2018). A larger-scale language modeling dataset is the 1B Word Benchmark (Chelba et al., 2013), which contains text from news articles. On this dataset, perplexities of around 25 can be obtained by averaging together multiple LSTM language models (Jozefowicz et al., 2016).
The report said U.S. intelligence agencies believe Russian military intelligence, the GRU, used intermediaries such as WikiLeaks, DCLeaks.com and the Guccifer 2.0 “persona” to release emails...
Suppose that you trained a language model on the Gigaword corpus,⁶ which was released in 2003. The bolded terms either did not exist at this date, or were not widely known; they are unlikely to be in the vocabulary. The same problem can occur for a variety of other terms: new technologies, previously unknown individuals, new words (e.g., hashtag), and numbers.
One solution is to simply mark all such terms with a special token, ⟨UNK⟩. While training the language model, we decide in advance on the vocabulary (often the K most common terms), and mark all other terms in the training data as ⟨UNK⟩. If we do not want to determine the vocabulary size in advance, an alternative approach is to simply mark the first occurrence of each word type as ⟨UNK⟩.
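The fixed-vocabulary strategy can be sketched as follows. The helper names, the literal string used for the unknown token, and the toy data are assumptions for illustration.

```python
from collections import Counter

UNK = "<UNK>"   # stand-in string for the special unknown-word token

def build_vocab(tokens, k):
    """Keep the k most common word types in the training data."""
    return {w for w, _ in Counter(tokens).most_common(k)}

def replace_unknown(tokens, vocab):
    """Map every out-of-vocabulary token to the unknown-word token."""
    return [w if w in vocab else UNK for w in tokens]

train = "a a a b b c".split()
vocab = build_vocab(train, k=2)
marked_train = replace_unknown(train, vocab)
marked_test = replace_unknown(["a", "c", "d"], vocab)
```

The same mapping is applied to training and test data, so the language model learns a probability for the unknown-word token from the rare training words and can then score previously unseen test words.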
But it is often better to make distinctions about the likelihood of various unknown words. This is particularly important in languages that have rich morphological systems, with many inflections for each word. For example, Portuguese is only moderately complex from a morphological perspective, yet each verb has dozens of inflected forms (see Figure 4.3b). In such languages, there will be many word types that we do not encounter in a corpus, which are nonetheless predictable from the morphological rules of the language. To use a somewhat contrived English example, if transfenestrate is in the vocabulary, our language model should assign a non-zero probability to the past tense transfenestrated, even if it does not appear in the training data.
One way to accomplish this is to supplement word-level language models with character-level language models. Such models can use n-grams or RNNs, but with a fixed vocabulary equal to the set of ASCII or Unicode characters. For example, Ling et al. (2015) propose an LSTM model over characters, and Kim (2014) employ a convolutional neural network (LeCun and Bengio, 1995). A more linguistically motivated approach is to segment words into meaningful subword units, known as morphemes (see chapter 9). For example, Botha and Blunsom (2014) induce vector representations for morphemes, which they build into a log-bilinear language model; Bhatia et al. (2016) incorporate morpheme vectors into an LSTM.
⁵ Bayoumy, Y. and Strobel, W. (2017, January 6). U.S. intel report: Putin directed cyber campaign to help Trump. Reuters. Retrieved from http://www.reuters.com/article/us-usa-russia-cyber-idUSKBN14Q1T8 on January 7, 2017.
⁶ https://catalog.ldc.upenn.edu/LDC2003T05
3375 Exercises
3376 1. exercises tk
The goal of sequence labeling is to assign tags to words, or more generally, to assign discrete labels to discrete elements in a sequence. There are many applications of sequence labeling in natural language processing, and chapter 8 presents an overview. A classic application is part-of-speech tagging, which involves tagging each word by its grammatical category. Coarse-grained grammatical categories include NOUNs, which describe things, properties, or ideas, and VERBs, which describe actions and events. Consider a simple input:
(7.1) They can fish.
A dictionary of coarse-grained part-of-speech tags might include NOUN as the only valid tag for they, but both NOUN and VERB as potential tags for can and fish. An accurate sequence labeling algorithm should select the verb tag for both can and fish in (7.1), but it should select the noun tags for the same two words in the phrase can of fish.
3392 Here the feature function takes three arguments as input: the sentence to be tagged (e.g.,
3393 they can fish), the proposed tag (e.g., N or V), and the index of the token to which this tag
156 CHAPTER 7. SEQUENCE LABELING
is applied. This simple feature function then returns a single feature: a tuple including the word to be tagged and the tag that has been proposed. If the vocabulary size is V and the number of tags is K, then there are $V \times K$ features. Each of these features must be assigned a weight. These weights can be learned from a labeled dataset using a classification algorithm such as perceptron, but this isn’t necessary in this case: it would be equivalent to define the classification weights directly, with $\theta_{w,y} = 1$ for the tag y most frequently associated with word w, and $\theta_{w,y} = 0$ for all other tags.
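The directly defined weights amount to a most-frequent-tag baseline, which can be sketched as below. The function names and the tiny tagged corpus are hypothetical; ties and unseen words fall back to the first tag in the tag set, which is an arbitrary choice made for this example.

```python
from collections import Counter, defaultdict

def most_frequent_tag_weights(tagged_corpus):
    """Set the weight to 1 for each word's most frequent tag (all others stay 0)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {(w, c.most_common(1)[0][0]): 1 for w, c in counts.items()}

def tag_independently(words, weights, tagset):
    """Tag each token on its own by picking the highest-scoring tag."""
    return [max(tagset, key=lambda y: weights.get((w, y), 0)) for w in words]

# Hypothetical labeled data: "can" is mostly a verb, "fish" mostly a noun.
toy_corpus = [("they", "N"), ("can", "V"), ("fish", "V"), ("can", "N"),
              ("can", "V"), ("fish", "N"), ("fish", "N")]
weights = most_frequent_tag_weights(toy_corpus)
tags_a = tag_independently("they can fish".split(), weights, ["N", "V"])
tags_b = tag_independently("can of fish".split(), weights, ["N", "V"])
```

Because each decision ignores context, a word receives the same tag in every sentence, so this baseline cannot tag an ambiguous word correctly in both of its uses.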
However, it is easy to see that this simple classification approach cannot correctly tag
both they can fish and can of fish, because can and fish are grammatically ambiguous. To han-
dle both of these cases, the tagger must rely on context, such as the surrounding words.
We can build context into the feature set by incorporating the surrounding words as ad-
ditional features:
These features contain enough information that a tagger should be able to choose the right tag for the word fish: words that come after can are likely to be verbs, so the feature $(w_{m-1} = \text{can}, y_m = \text{V})$ should have a large positive weight.
However, even with this enhanced feature set, it may be difficult to tag some sequences correctly. One reason is that there are often relationships between the tags themselves. For example, in English it is relatively rare for a verb to follow another verb, particularly if we differentiate MODAL verbs like can and should from more typical verbs, like give, transcend, and befuddle. We would like to incorporate preferences against tag sequences like VERB-VERB, and in favor of tag sequences like NOUN-VERB. The need for such preferences is best illustrated by a garden path sentence:
(7.2) The old man the boat.
Grammatically, the word the is a DETERMINER. When you read the sentence, what part of speech did you first assign to old? Typically, this word is an ADJECTIVE (abbreviated as J), which is a class of words that modify nouns. Similarly, man is usually a noun. The resulting sequence of tags is D J N D N. But this is a mistaken “garden path” interpretation, which ends up leading nowhere. It is unlikely that a determiner would directly follow a noun,¹ and it is particularly unlikely that the entire sentence would lack a verb. The only possible verb in (7.2) is the word man, which can refer to the act of maintaining and piloting something, often boats. But if man is tagged as a verb, then old is seated between a determiner and a verb, and must be a noun. And indeed, adjectives often have a second interpretation as nouns when used in this way (e.g., the young, the restless). This reasoning, in which the labeling decisions are intertwined, cannot be applied in a setting where each tag is produced by an independent classification decision.
where each $\psi(\cdot)$ scores a local part of the tag sequence. Note that the sum goes up to $M + 1$, so that we can include a score for a special end-of-sequence tag, $\psi(w_{1:M}, ⧫, y_M, M + 1)$. We also define a special tag to begin the sequence, $y_0 \triangleq ♦$.
¹ The main exception occurs with ditransitive verbs, such as They gave the winner a trophy.
In a linear model, the local scoring function can be defined as a dot product of weights and features,
$$\psi(w_{1:M}, y_m, y_{m-1}, m) = \theta \cdot f(w_{1:M}, y_m, y_{m-1}, m).$$
3448 The feature vector f can consider the entire input w, and can look at pairs of adjacent
3449 tags. This is a step up from per-token classification: the weights can assign low scores
3450 to infelicitous tag pairs, such as noun-determiner, and high scores for frequent tag pairs,
3451 such as determiner-noun and noun-verb.
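In the example they can fish, a minimal feature function would include features for word-tag pairs (sometimes called emission features) and tag-tag pairs (sometimes called transition features):
$$\begin{aligned}
f(w = \text{they can fish}, y = \text{N V V}) &= \sum_{m=1}^{M+1} f(w, y_m, y_{m-1}, m) && [7.10] \\
&= f(w, \text{N}, ♦, 1) + f(w, \text{V}, \text{N}, 2) + f(w, \text{V}, \text{V}, 3) + f(w, ⧫, \text{V}, 4) && [7.11] \\
&= (w_m = \text{they}, y_m = \text{N}) + (y_m = \text{N}, y_{m-1} = ♦) \\
&\quad + (w_m = \text{can}, y_m = \text{V}) + (y_m = \text{V}, y_{m-1} = \text{N}) \\
&\quad + (w_m = \text{fish}, y_m = \text{V}) + (y_m = \text{V}, y_{m-1} = \text{V}) \\
&\quad + (y_m = ⧫, y_{m-1} = \text{V}). && [7.12]
\end{aligned}$$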
There are seven active features for this example: one for each word-tag pair, and one for each tag-tag pair, including a final tag $y_{M+1} = ⧫$. These features capture the two main sources of information for part-of-speech tagging in English: which tags are appropriate for each word, and which tags tend to follow each other in sequence. Given appropriate weights for these features, taggers can achieve high accuracy, even for difficult cases like the old man the boat. We will now discuss how this restricted scoring function enables efficient inference, through the Viterbi algorithm (Viterbi, 1967).
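The decomposition of the global feature vector into local emission and transition features can be sketched in Python. The tuple encoding of features, the function names, and the strings used for the start and end tags are assumptions made for this example.

```python
from collections import Counter

START, END = "<start>", "<end>"   # stand-ins for the special start and end tags

def local_features(words, tag, prev_tag, m):
    """Features at position m (1-indexed); m == len(words) + 1 covers the end tag."""
    feats = Counter()
    if m <= len(words):
        feats[("emit", words[m - 1], tag)] += 1    # word-tag (emission) feature
    feats[("trans", prev_tag, tag)] += 1           # tag-tag (transition) feature
    return feats

def global_features(words, tags):
    """Sum the local feature vectors over positions 1 to M + 1."""
    padded = [START] + list(tags) + [END]
    total = Counter()
    for m in range(1, len(words) + 2):
        total += local_features(words, padded[m], padded[m - 1], m)
    return total

feats = global_features(["they", "can", "fish"], ["N", "V", "V"])
n_active = sum(feats.values())
```

For this sentence and tag sequence the sum contains three emission features and four transition features, matching the count of seven active features.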
where the final line simplifies the notation with the shorthand,
$$s_m(y_m, y_{m-1}) \triangleq \psi(w_{1:M}, y_m, y_{m-1}, m).$$
This inference problem can be solved efficiently using dynamic programming, an algorithmic technique for reusing work in recurrent computations. As is often the case in dynamic programming, we begin by solving an auxiliary problem: rather than finding the best tag sequence, we simply compute the score of the best tag sequence,
$$\max_{y_{1:M}} \Psi(w, y_{1:M}) = \max_{y_{1:M}} \sum_{m=1}^{M+1} s_m(y_m, y_{m-1}). \qquad [7.17]$$
This score involves a maximization over all tag sequences of length M, written $\max_{y_{1:M}}$. This maximization can be broken into two pieces,
$$\max_{y_{1:M}} \Psi(w, y_{1:M}) = \max_{y_M} \max_{y_{1:M-1}} \sum_{m=1}^{M+1} s_m(y_m, y_{m-1}), \qquad [7.18]$$
which simply says that we maximize over the final tag $y_M$, and we maximize over all “prefixes”, $y_{1:M-1}$. But within the sum of scores, only the final term $s_{M+1}(⧫, y_M)$ depends on $y_M$. We can pull this term out of the second maximization,
$$\max_{y_{1:M}} \Psi(w, y_{1:M}) = \max_{y_M} s_{M+1}(⧫, y_M) + \max_{y_{1:M-1}} \sum_{m=1}^{M} s_m(y_m, y_{m-1}). \qquad [7.19]$$
This same reasoning can be applied recursively to the second term of Equation 7.19, pulling out $s_M(y_M, y_{M-1})$, and so on. We can formalize this idea by defining an auxiliary Viterbi variable,
$$\begin{aligned}
v_m(y_m) &\triangleq \max_{y_{1:m-1}} \sum_{n=1}^{m} s_n(y_n, y_{n-1}) && [7.20] \\
&= \max_{y_{m-1}} s_m(y_m, y_{m-1}) + \max_{y_{1:m-2}} \sum_{n=1}^{m-1} s_n(y_n, y_{n-1}) && [7.21] \\
&= \max_{y_{m-1}} s_m(y_m, y_{m-1}) + v_{m-1}(y_{m-1}). && [7.22]
\end{aligned}$$
Algorithm 11 The Viterbi algorithm. Each $s_m(k, k')$ is a local score for tag $y_m = k$ and $y_{m-1} = k'$.
for k ∈ {0, . . . , K} do
    v_1(k) = s_1(k, ♦)
for m ∈ {2, . . . , M} do
    for k ∈ {0, . . . , K} do
        v_m(k) = max_{k'} s_m(k, k') + v_{m−1}(k')
        b_m(k) = argmax_{k'} s_m(k, k') + v_{m−1}(k')
y_M = argmax_k s_{M+1}(⧫, k) + v_M(k)
for m ∈ {M − 1, . . . , 1} do
    y_m = b_{m+1}(y_{m+1})
return y_{1:M}
The variable $v_m(k)$ represents the score of the best sequence of length m ending in tag k.
Each set of Viterbi variables is computed from the local score $s_m(y_m, y_{m-1})$, and from the previous set of Viterbi variables. The initial condition of the recurrence is simply the first score,
$$v_1(y_1) \triangleq s_1(y_1, ♦). \qquad [7.23]$$
The maximum overall score for the sequence is then the final Viterbi variable,
$$\max_{y_{1:M}} \Psi(w, y_{1:M}) = v_{M+1}(⧫).$$
Thus, the score of the best labeling for the sequence can be computed in a single forward sweep: first compute all variables $v_1(\cdot)$ from Equation 7.23, and then compute all variables $v_2(\cdot)$ from the recurrence Equation 7.22, and continue until reaching the final variable $v_{M+1}(⧫)$.
Graphically, it is customary to arrange these variables in a structure known as a trellis, shown in Figure 7.1. Each column indexes a token m in the sequence, and each row indexes a tag in Y; every v_{m-1}(k) is connected to every v_m(k'), indicating that v_m(k') is computed from v_{m-1}(k). Special nodes are set aside for the start and end states.

             they    can    fish
    N         -3      -9     -9
    V        -12      -5    -11
Figure 7.1: The trellis representation of the Viterbi variables, for the example they can fish, using the weights shown in Table 7.1.
Our real goal is to find the best scoring sequence, not simply to compute its score. But solving the auxiliary problem gets us almost all the way there. Recall that each v_m(k) represents the score of the best tag sequence ending in that tag k in position m. To compute this, we maximize over possible values of y_{m-1}. If we keep track of the “argmax” tag that maximizes this choice at each step, then we can walk backwards from the final tag, and recover the optimal tag sequence. This is indicated in Figure 7.1 by the solid blue lines, which we trace back from the final position. These “back-pointers” are written b_m(k), indicating the optimal tag y_{m-1} on the path to y_m = k.
The complete Viterbi algorithm is shown in Algorithm 11. When computing the initial Viterbi variables v_1(·), we use a special tag, ♦, to indicate the start of the sequence. When computing the final tag y_M, we use another special tag, ■, to indicate the end of the sequence. Linguistically, these special tags enable the use of transition features for the tags that begin and end the sequence: for example, conjunctions are unlikely to end sentences in English, so we would like a low score for s_{M+1}(■, CC); nouns are relatively likely to appear at the beginning of sentences, so we would like a high score for s_1(N, ♦), assuming the noun tag is compatible with the first word token w_1.
Complexity If there are K tags and M positions in the sequence, then there are M × K Viterbi variables to compute. Computing each variable requires finding a maximum over K possible predecessor tags. The total time complexity of populating the trellis is therefore O(MK^2), with an additional factor for the number of active features at each position. After completing the trellis, we simply trace the backwards pointers to the beginning of the sequence, which takes O(M) operations.
(a) Weights for emission features.
             they    can    fish
    N         −2      −3     −3
    V        −10      −1     −3

(b) Weights for transition features. The “from” tags are on the rows, and the “to” tags are on the columns; this orientation reproduces the trellis values in Figure 7.1.
              N       V      ■
    ♦        −1      −2     −∞
    N        −3      −1     −1
    V        −1      −3     −1

Table 7.1: Feature weights for the example trellis shown in Figure 7.1. Emission weights from ♦ and ■ are implicitly set to −∞.
so b_4(■) = N. As there is no emission w_4, the emission features have scores of zero.
2 The tagging they/N can/V fish/N corresponds to the scenario of putting fish into cans, or perhaps of firing them.
To compute the optimal tag sequence, we walk backwards from here, next checking b_3(N) = V, and then b_2(V) = N, and finally b_1(N) = ♦. This yields y = (N, V, N), which corresponds to the linguistic interpretation of the fishes being put into cans.
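The worked example can be checked with a short sketch of Algorithm 11 in Python. The weights are copied from Table 7.1, with rows of the transition table read as "from" tags; the strings "<s>" and "</s>" are stand-ins for the start and end tags, and the function names are illustrative rather than taken from the text.

```python
import math

START, END = "<s>", "</s>"      # stand-ins for the start and end tags
TAGS = ["N", "V"]

# Table 7.1(a): emission weights, indexed [tag][word].
EMIT = {"N": {"they": -2, "can": -3, "fish": -3},
        "V": {"they": -10, "can": -1, "fish": -3}}
# Table 7.1(b): transition weights, indexed [previous tag][next tag].
TRANS = {START: {"N": -1, "V": -2, END: -math.inf},
         "N": {"N": -3, "V": -1, END: -1},
         "V": {"N": -1, "V": -3, END: -1}}


def score(words, m, tag, prev):
    """Local score s_m(tag, prev); m is 1-based and m = M + 1 scores the end tag."""
    emission = 0 if tag == END else EMIT[tag][words[m - 1]]
    return TRANS[prev][tag] + emission


def viterbi(words):
    M = len(words)
    v = [{k: score(words, 1, k, START) for k in TAGS}]    # v[m - 1] holds v_m
    back = [{k: START for k in TAGS}]                      # back[m - 1] holds b_m
    for m in range(2, M + 1):
        v_m, b_m = {}, {}
        for k in TAGS:
            best_prev = max(TAGS, key=lambda kp: score(words, m, k, kp) + v[-1][kp])
            b_m[k] = best_prev
            v_m[k] = score(words, m, k, best_prev) + v[-1][best_prev]
        v.append(v_m)
        back.append(b_m)
    last = max(TAGS, key=lambda k: score(words, M + 1, END, k) + v[-1][k])
    best_score = score(words, M + 1, END, last) + v[-1][last]
    tags = [last]
    for m in range(M - 1, 0, -1):        # y_m = b_{m+1}(y_{m+1})
        tags.append(back[m][tags[-1]])
    tags.reverse()
    return tags, best_score, v


tags, best_score, trellis = viterbi(["they", "can", "fish"])
```

Under these assumptions the call returns the tagging N V N with a total score of −10, and the trellis columns match the values shown in Figure 7.1.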
    \Psi(w, y) = \sum_{m=1}^{M+2} f(w, y_m, y_{m-1}, y_{m-2}, m), \qquad [7.31]
Naïve Bayes was introduced as a generative model: a probabilistic story that explains the observed data as well as the hidden label. A similar story can be constructed for probabilistic sequence labeling: first, the tags are drawn from a prior distribution; next, the tokens are drawn from a conditional likelihood. However, for inference to be tractable, additional independence assumptions are required. First, the probability of each token depends only on its tag, and not on any other element in the sequence:
    p(w | y) = \prod_{m=1}^{M} p(w_m | y_m). \qquad [7.32]
Second, each tag y_m depends only on its predecessor,
    p(y) = \prod_{m=1}^{M} p(y_m | y_{m-1}),
where y_0 = ♦ in all cases. Due to this Markov assumption, probabilistic sequence labeling models are known as hidden Markov models (HMMs).
The generative process for the hidden Markov model is shown in Algorithm 12. Given the parameters λ and φ, we can compute p(w, y) for any token sequence w and tag sequence y. The HMM is often represented as a graphical model (Wainwright and Jordan, 2008), as shown in Figure 7.2. This representation makes the independence assumptions explicit: if a variable v_1 is probabilistically conditioned on another variable v_2, then there is an arrow v_2 → v_1 in the diagram. If there are no arrows between v_1 and v_2, they are conditionally independent, given each variable’s Markov blanket. In the hidden Markov model, the Markov blanket for each tag y_m includes the “parent” y_{m-1}, and the “children” y_{m+1} and w_m.3
It is important to reflect on the implications of the HMM independence assumptions. A non-adjacent pair of tags y_m and y_n are conditionally independent; if m < n and we are given y_{n-1}, then y_m offers no additional information about y_n. However, if we are not given any information about the tags in a sequence, then all tags are probabilistically coupled.
3 In general graphical models, a variable’s Markov blanket includes its parents, children, and its children’s other parents (Murphy, 2012).

[Figure 7.2: the tag chain y_1 → y_2 → ··· → y_M, with an arrow from each tag y_m to its word w_m.]
Figure 7.2: Graphical representation of the hidden Markov model. Arrows indicate probabilistic dependencies.
Emission probabilities. The probability p_e(w_m | y_m; φ) is the emission probability, since the words are treated as probabilistically “emitted”, conditioned on the tags.
Transition probabilities. The probability p_t(y_m | y_{m-1}; λ) is the transition probability, since it assigns probability to each possible tag-to-tag transition.
Both of these groups of parameters are typically computed from smoothed relative frequency estimation on a labeled corpus (see § 6.2 for a review of smoothing). The unsmoothed probabilities are,
    \phi_{k,i} \triangleq \Pr(W_m = i | Y_m = k) = \frac{\text{count}(W_m = i, Y_m = k)}{\text{count}(Y_m = k)}
    \lambda_{k,k'} \triangleq \Pr(Y_m = k' | Y_{m-1} = k) = \frac{\text{count}(Y_m = k', Y_{m-1} = k)}{\text{count}(Y_{m-1} = k)}.
Smoothing is more important for the emission probability than the transition probability, because the vocabulary is much larger than the number of tags.
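The unsmoothed estimates can be sketched as a counting routine. This is a minimal illustration under stated assumptions: the function name, the "<s>" start symbol, and the three-sentence toy corpus are all invented here, no smoothing is applied, and transitions into the end tag are not counted.

```python
from collections import Counter


def estimate_hmm(tagged_sentences, start="<s>"):
    """Unsmoothed relative-frequency estimates of emission and transition probabilities."""
    emit_counts, trans_counts = Counter(), Counter()
    tag_counts, prev_counts = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            emit_counts[(tag, word)] += 1      # count(W_m = i, Y_m = k)
            tag_counts[tag] += 1               # count(Y_m = k)
            trans_counts[(prev, tag)] += 1     # count(Y_m = k', Y_{m-1} = k)
            prev_counts[prev] += 1             # count(Y_{m-1} = k)
            prev = tag
    phi = {(k, i): c / tag_counts[k] for (k, i), c in emit_counts.items()}
    lam = {(k, k2): c / prev_counts[k] for (k, k2), c in trans_counts.items()}
    return phi, lam


# Toy corpus, invented for illustration only.
corpus = [[("they", "N"), ("can", "V"), ("fish", "V")],
          [("they", "N"), ("fish", "V")],
          [("fish", "N"), ("can", "V"), ("fish", "N")]]
phi, lam = estimate_hmm(corpus)
```

Here phi is keyed by (tag, word) and lam by (previous tag, tag), matching the index order of φ_{k,i} and λ_{k,k'}. Any pair that never occurs in the corpus is simply absent, which is the zero-probability problem that smoothing addresses.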
As in Naïve Bayes, it is equivalent to find the tag sequence with the highest log-probability, since the logarithm is a monotonically increasing function. It is furthermore equivalent to maximize the joint probability p(y, w) = p(y | w) × p(w) ∝ p(y | w), which is proportional to the conditional probability. Putting these observations together, the inference
where,
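    s_m(y_m, y_{m-1}) \triangleq \log \lambda_{y_{m-1}, y_m} + \log \phi_{y_m, w_m}, \qquad [7.40]
and,
    \phi_{\blacksquare, w} = \begin{cases} 1, & w = \blacksquare \\ 0, & \text{otherwise,} \end{cases} \qquad [7.41]
which ensures that the stop tag ■ can only be applied to the final token ■.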
This derivation shows that HMM inference can be viewed as an application of the Viterbi decoding algorithm, given an appropriately defined scoring function. The local score s_m(y_m, y_{m-1}) can be interpreted probabilistically,
    s_m(y_m, y_{m-1}) = \log p_y(y_m | y_{m-1}) + \log p_{w|y}(w_m | y_m) \qquad [7.42]
                      = \log p(y_m, w_m | y_{m-1}). \qquad [7.43]
Now recall the definition of the Viterbi variables,
    v_m(y_m) = \max_{y_{m-1}} s_m(y_m, y_{m-1}) + v_{m-1}(y_{m-1}). \qquad [7.44]
By setting v_{m-1}(y_{m-1}) = \max_{y_{1:m-2}} \log p(y_{1:m-1}, w_{1:m-1}), we obtain the recurrence,
    v_m(y_m) = \max_{y_{m-1}} \log p(y_m, w_m | y_{m-1}) + \max_{y_{1:m-2}} \log p(y_{1:m-1}, w_{1:m-1}). \qquad [7.46]
In words, the Viterbi variable v_m(y_m) is the log probability of the best tag sequence ending in y_m, joint with the word sequence w_{1:m}. The log probability of the best complete tag sequence is therefore,
*Viterbi as an example of the max-product algorithm The Viterbi algorithm can also be implemented using probabilities, rather than log-probabilities. In this case, each v_m(y_m) is equal to,
Each Viterbi variable is computed by maximizing over a set of products. Thus, the Viterbi algorithm is a special case of the max-product algorithm for inference in graphical models (Wainwright and Jordan, 2008). However, the product of probabilities tends towards zero over long sequences, so the log-probability version of Viterbi is recommended in practical implementations.
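The numerical concern can be illustrated with a few lines of Python. The probabilities are made-up values chosen only to trigger underflow in double-precision arithmetic; they are not drawn from any model in the text.

```python
import math

# 200 local probabilities of 0.01 each (made-up values for illustration).
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p          # the true value, 10^-400, is below the smallest positive float

log_sum = sum(math.log(p) for p in probs)   # stays finite, roughly -921
```

The running product collapses to exactly zero, so every candidate sequence would tie, while the sum of log-probabilities remains well within range.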
The Viterbi algorithm permits the inclusion of richer information in the local scoring function ψ(w_{1:M}, y_m, y_{m-1}, m), which can be defined as a weighted sum of arbitrary local features,
    \psi(w, y_m, y_{m-1}, m) = \theta \cdot f(w, y_m, y_{m-1}, m), \qquad [7.54]
where f^{(global)}(w, y) is a global feature vector, which is a sum of local feature vectors,
    f^{(\text{global})}(w, y) = \sum_{m=1}^{M+1} f(w_{1:M}, y_m, y_{m-1}, m), \qquad [7.59]
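Word affix features. Consider the problem of part-of-speech tagging on the first four lines of the poem Jabberwocky (Carroll, 1917):
    ’Twas brillig, and the slithy toves
    Did gyre and gimble in the wabe:
    All mimsy were the borogoves,
    And the mome raths outgrabe.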
Many of these words were made up by the author of the poem, so a corpus would offer no information about their probabilities of being associated with any particular part of speech. Yet it is not so hard to see what their grammatical roles might be in this passage. Context helps: for example, the word slithy follows the determiner the, so it is probably a noun or adjective. Which do you think is more likely? The suffix -thy is found in a number of adjectives, like frothy, healthy, pithy, worthy. It is also found in a handful of nouns (e.g., apathy, sympathy), but nearly all of these have the longer coda -pathy, unlike slithy. So the suffix gives some evidence that slithy is an adjective, and indeed it is: later in the text we find that it is a combination of the adjectives lithe and slimy.4
4 Morphology is the study of how words are formed from smaller linguistic units. Computational approaches to morphological analysis are touched on in chapter 9; Bender (2013) provides a good overview of the underlying linguistic principles.
Fine-grained context. The hidden Markov model captures contextual information in the form of part-of-speech tag bigrams. But sometimes, the necessary contextual information is more specific. Consider the noun phrases this fish and these fish. Many part-of-speech tagsets distinguish between singular and plural nouns, but do not distinguish between singular and plural determiners.5 A hidden Markov model would be unable to correctly label fish as singular or plural in both of these cases, because it only has access to two features: the preceding tag (determiner in both cases) and the word (fish in both cases). The classification-based tagger discussed in § 7.1 had the ability to use preceding and succeeding words as features, and the same information can be incorporated into a Viterbi-based sequence labeler as a local feature.
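Example Consider the tagging D J N (determiner, adjective, noun) for the sequence the slithy toves, so that w = (the, slithy, toves) and y = (D, J, N). Let’s create the feature vector for this example, assuming that we have word-tag features (indicated by W), tag-tag features (indicated by T), and suffix features (indicated by M). You can assume that you have access to a method for extracting the suffix -thy from slithy, -es from toves, and ∅ from the, indicating that this word has no suffix.6 The resulting feature vector is the sum of the local feature vectors at positions m = 1, ..., 4,
    f(w, y) = f(w, D, \Diamond, 1) + f(w, J, D, 2) + f(w, N, J, 3) + f(w, \blacksquare, N, 4),
which contains one T feature for each of the four tag transitions (♦ → D, D → J, J → N, N → ■), one W feature for each word-tag pair, and one M feature for each suffix-tag pair.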
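The feature vector for this example can be assembled mechanically, as in the following sketch. The tuple encoding of features, the function names, the "<s>" and "</s>" stand-ins for the start and end tags, and the lookup-table suffix extractor are all assumptions made for illustration; a real system would use a morphological segmenter or orthographic suffixes.

```python
from collections import Counter


def suffix(word):
    """Toy suffix lookup for this example only (hypothetical stand-in for a segmenter)."""
    return {"slithy": "-thy", "toves": "-es"}.get(word, "∅")


def local_features(words, tag, prev, m, end="</s>"):
    """Local feature vector f(w, y_m, y_{m-1}, m) as a sparse Counter."""
    feats = Counter({("T", prev, tag): 1})           # tag-tag feature
    if tag != end:                                   # the end tag emits no word
        word = words[m - 1]
        feats[("W", word, tag)] += 1                 # word-tag feature
        feats[("M", suffix(word), tag)] += 1         # suffix-tag feature
    return feats


def global_features(words, tags, start="<s>", end="</s>"):
    """Sum of local feature vectors over positions m = 1, ..., M + 1."""
    padded = [start] + list(tags) + [end]
    total = Counter()
    for m in range(1, len(padded)):
        total += local_features(words, padded[m], padded[m - 1], m, end)
    return total


f = global_features(["the", "slithy", "toves"], ["D", "J", "N"])
```

The result has ten active features: four tag-tag features (including the transitions from the start tag and into the end tag), three word-tag features, and three suffix-tag features.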
These examples show that local features can incorporate information that lies beyond the scope of a hidden Markov model. Because the features are local, it is possible to apply the Viterbi algorithm to identify the optimal sequence of tags. The remaining question
5 For example, the Penn Treebank tagset follows these conventions.
6 Such a system is called a morphological segmenter. The task of morphological segmentation is briefly described in § 9.1.4.4; a well known segmenter is Morfessor (Creutz and Lagus, 2007). In real applications, a typical approach is to include features for all orthographic suffixes up to some maximum number of characters: for slithy, we would have suffix features for -y, -hy, and -thy.
is how to estimate the weights on these features. § 2.2 presented three main types of discriminative classifiers: perceptron, support vector machine, and logistic regression. Each of these classifiers has a structured equivalent, enabling it to be trained from labeled sequences rather than individual tokens.
We can apply exactly the same update in the case of structure prediction,
This learning algorithm is called structured perceptron, because it learns to predict the structured output y. The only difference is that instead of computing ŷ by enumerating the entire set Y, the Viterbi algorithm is used to efficiently search the set of possible taggings, Y^M. Structured perceptron can be applied to other structured outputs as long as efficient inference is possible. As in perceptron classification, weight averaging is crucial to get good performance (see § 2.2.2).
Example For the example they can fish, suppose that the reference tag sequence is y^{(i)} = N V V, but the tagger incorrectly returns the tag sequence ŷ = N V N. Assuming a model with features for emissions (w_m, y_m) and transitions (y_{m-1}, y_m), the corresponding structured perceptron update is:
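The update for this example can be computed with a short sketch. It assumes the standard perceptron rule, adding the features of the reference and subtracting the features of the prediction; the feature encoding, function names, and "<s>" and "</s>" boundary symbols are illustrative choices rather than notation from the text.

```python
from collections import Counter


def features(words, tags, start="<s>", end="</s>"):
    """Global feature counts: emission (word, tag) and transition (prev, tag) features."""
    f = Counter()
    padded = [start] + list(tags) + [end]
    for m, word in enumerate(words, start=1):
        f[("emit", word, padded[m])] += 1
    for prev, tag in zip(padded, padded[1:]):
        f[("trans", prev, tag)] += 1
    return f


def perceptron_update(theta, words, gold, predicted):
    """theta <- theta + f(w, gold) - f(w, predicted), applied in place."""
    delta = Counter(features(words, gold))
    delta.subtract(features(words, predicted))
    for name, value in delta.items():
        if value != 0:                      # shared features cancel and are skipped
            theta[name] = theta.get(name, 0.0) + value
    return theta


theta = perceptron_update({}, ["they", "can", "fish"],
                          gold=["N", "V", "V"], predicted=["N", "V", "N"])
```

Starting from empty weights, six features change: the weights for (fish, V), (V, V), and (V, end) increase by one, and the weights for (fish, N), (V, N), and (N, end) decrease by one. Features shared by both taggings, such as (they, N), are untouched.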
be applied to sequence labeling. A support vector machine in which the output is a structured object, such as a sequence, is called a structured support vector machine (Tsochantaridis et al., 2004).7
In classification, we formalized the large-margin constraint as,
    \forall y \neq y^{(i)}, \quad \theta \cdot f(w^{(i)}, y^{(i)}) - \theta \cdot f(w^{(i)}, y) \geq 1, \qquad [7.67]
requiring a margin of at least 1 between the scores for all labels y that are not equal to the correct label y^{(i)}. The weights θ are then learned by constrained optimization (see § 2.3.2).
This idea can be applied to sequence labeling by formulating an equivalent set of constraints for all possible labelings Y(w) for an input w. However, there are two problems. First, in sequence labeling, some predictions are more wrong than others: we may miss only one tag out of fifty, or we may get all fifty wrong. We would like our learning algorithm to be sensitive to this difference. Second, the number of constraints is equal to the number of possible labelings, which is exponentially large in the length of the sequence.
The first problem can be addressed by adjusting the constraint to require larger margins for more serious errors. Let c(y^{(i)}, ŷ) ≥ 0 represent the cost of predicting label ŷ when the true label is y^{(i)}. We can then generalize the margin constraint,
    \forall y \neq y^{(i)}, \quad \theta \cdot f(w^{(i)}, y^{(i)}) - \theta \cdot f(w^{(i)}, y) \geq c(y^{(i)}, y). \qquad [7.68]
This cost-augmented margin constraint specializes to the constraint in Equation 7.67 if we choose the delta function c(y^{(i)}, y) = δ(y^{(i)} ≠ y). A more expressive cost function is the Hamming cost,
    c(y^{(i)}, y) = \sum_{m=1}^{M} \delta(y_m^{(i)} \neq y_m), \qquad [7.69]
which computes the number of errors in y. By incorporating the cost function into the margin constraint, we require that the true labeling be separated from the alternatives by a margin that is proportional to the number of incorrect tags in each alternative labeling.
The second problem is that the number of constraints is exponential in the length
of the sequence. This can be addressed by focusing on the prediction ŷ that maximally
violates the margin constraint. This prediction can be identified by solving the following
cost-augmented decoding problem:
7 This model is also known as a max-margin Markov network (Taskar et al., 2003), emphasizing that the scoring function is constructed from a sum of components, which are Markov independent.
where in the second line we drop the term θ · f(w^{(i)}, y^{(i)}), which is constant in y.
We can now reformulate the margin constraint for sequence labeling,
    \theta \cdot f(w^{(i)}, y^{(i)}) - \max_{y \in \mathcal{Y}(w)} \left( \theta \cdot f(w^{(i)}, y) + c(y^{(i)}, y) \right) \geq 0. \qquad [7.72]
If the score for θ · f(w^{(i)}, y^{(i)}) is greater than the cost-augmented score for all alternatives, then the constraint will be met. The name “cost-augmented decoding” is due to the fact that the objective includes the standard decoding problem, \max_{ŷ ∈ Y(w)} θ · f(w, ŷ), plus an additional term for the cost. Essentially, we want to train against predictions that are strong and wrong: they should score highly according to the model, yet incur a large loss with respect to the ground truth. Training adjusts the weights to reduce the score of these predictions.
For cost-augmented decoding to be tractable, the cost function must decompose into local parts, just as the feature function f(·) does. The Hamming cost, defined above, obeys this property. To perform cost-augmented decoding using the Hamming cost, we need only to add features f_m(y_m) = δ(y_m ≠ y_m^{(i)}), and assign a constant weight of 1 to these features. Decoding can then be performed using the Viterbi algorithm.8
As with large-margin classifiers, it is possible to formulate the learning problem in an unconstrained form, by combining a regularization term on the weights and a Lagrangian for the constraints:
    \min_{\theta} \frac{1}{2} \|\theta\|_2^2 - C \left( \sum_i \left[ \theta \cdot f(w^{(i)}, y^{(i)}) - \max_{y \in \mathcal{Y}(w^{(i)})} \left( \theta \cdot f(w^{(i)}, y) + c(y^{(i)}, y) \right) \right] \right), \qquad [7.73]
In this formulation, C is a parameter that controls the tradeoff between the regularization term and the margin constraints. A number of optimization algorithms have been proposed for structured support vector machines, some of which are discussed in § 2.3.2. An empirical comparison by Kummerfeld et al. (2015) shows that stochastic subgradient descent, which is essentially a cost-augmented version of the structured perceptron, is highly competitive.
This is almost identical to logistic regression, but because the label space is now tag sequences, we require efficient algorithms for both decoding (searching for the best tag sequence given a sequence of words w and a model θ) and for normalizing (summing over all tag sequences). These algorithms will be based on the usual locality assumption on the scoring function, Ψ(w, y) = \sum_{m=1}^{M+1} ψ(w, y_m, y_{m-1}, m).
This is identical to the decoding problem for structured perceptron, so the same Viterbi recurrence as defined in Equation 7.22 can be used.
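    \ell = \frac{\lambda}{2} \|\theta\|^2 - \sum_{i=1}^{N} \log p(y^{(i)} | w^{(i)}; \theta) \qquad [7.75]
         = \frac{\lambda}{2} \|\theta\|^2 - \sum_{i=1}^{N} \theta \cdot f(w^{(i)}, y^{(i)}) + \log \sum_{y' \in \mathcal{Y}(w^{(i)})} \exp\left( \theta \cdot f(w^{(i)}, y') \right), \qquad [7.76]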
more generally, cliques) of variables in a factor graph. In sequence labeling, the pairs of variables include
all adjacent tags (ym , ym−1 ). The probability is conditioned on the words w, which are always observed,
motivating the term “conditional” in the name.
where λ controls the amount of regularization. The final term in Equation 7.76 is a sum over all possible labelings. This term is the log of the denominator in Equation 7.74, sometimes known as the partition function.10 There are |Y|^M possible labelings of an input of size M, so we must again exploit the decomposition of the scoring function to compute this sum efficiently.
The sum \sum_{y ∈ Y(w^{(i)})} \exp Ψ(y, w) can be computed efficiently using the forward recurrence, which is closely related to the Viterbi recurrence. We first define a set of forward variables, α_m(y_m), which is equal to the sum of the scores of all paths leading to tag y_m at position m:
    \alpha_m(y_m) \triangleq \sum_{y_{1:m-1}} \exp \sum_{n=1}^{m} s_n(y_n, y_{n-1}) \qquad [7.77]
                  = \sum_{y_{1:m-1}} \prod_{n=1}^{m} \exp s_n(y_n, y_{n-1}). \qquad [7.78]
Note the similarity to the definition of the Viterbi variable, v_m(y_m) = \max_{y_{1:m-1}} \sum_{n=1}^{m} s_n(y_n, y_{n-1}). In the hidden Markov model, the Viterbi recurrence had an alternative interpretation as the max-product algorithm (see Equation 7.53); analogously, the forward recurrence is known as the sum-product algorithm, because of the form of [7.78]. The forward variable can also be computed through a recurrence:
    \alpha_m(y_m) = \sum_{y_{1:m-1}} \prod_{n=1}^{m} \exp s_n(y_n, y_{n-1}) \qquad [7.79]
                  = \sum_{y_{m-1}} (\exp s_m(y_m, y_{m-1})) \sum_{y_{1:m-2}} \prod_{n=1}^{m-1} \exp s_n(y_n, y_{n-1}) \qquad [7.80]
                  = \sum_{y_{m-1}} (\exp s_m(y_m, y_{m-1})) \times \alpha_{m-1}(y_{m-1}). \qquad [7.81]
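Using the forward recurrence, it is possible to compute the denominator of the conditional probability,
    \sum_{y \in \mathcal{Y}(w)} \exp \Psi(w, y) = \sum_{y_{1:M}} (\exp s_{M+1}(\blacksquare, y_M)) \prod_{m=1}^{M} \exp s_m(y_m, y_{m-1}) \qquad [7.82]
                                                = \alpha_{M+1}(\blacksquare). \qquad [7.83]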
10 The terminology of “potentials” and “partition functions” comes from statistical mechanics (Bishop, 2006).
    \ell = \frac{\lambda}{2} \|\theta\|^2 - \sum_{i=1}^{N} \theta \cdot f(w^{(i)}, y^{(i)}) + \log \alpha_{M+1}(\blacksquare). \qquad [7.84]
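The term log α_{M+1}(■) can be computed with the forward recurrence carried out in log space. The sketch below reuses the weights of Table 7.1 as local scores purely for illustration and checks the result against brute-force enumeration of all tag sequences. The function names are assumptions, and "<s>" and "</s>" stand in for the start and end tags.

```python
import itertools
import math

START, END, TAGS = "<s>", "</s>", ["N", "V"]
EMIT = {"N": {"they": -2, "can": -3, "fish": -3},
        "V": {"they": -10, "can": -1, "fish": -3}}
TRANS = {START: {"N": -1, "V": -2, END: -math.inf},   # indexed [previous tag][next tag]
         "N": {"N": -3, "V": -1, END: -1},
         "V": {"N": -1, "V": -3, END: -1}}


def local_score(words, m, tag, prev):
    emission = 0 if tag == END else EMIT[tag][words[m - 1]]
    return TRANS[prev][tag] + emission


def logsumexp(values):
    """Stable log of a sum of exponentials."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_partition(words):
    """log alpha_{M+1}(END), using the forward recurrence in log space."""
    M = len(words)
    log_alpha = {k: local_score(words, 1, k, START) for k in TAGS}
    for m in range(2, M + 1):
        log_alpha = {k: logsumexp([local_score(words, m, k, kp) + log_alpha[kp]
                                   for kp in TAGS])
                     for k in TAGS}
    return logsumexp([local_score(words, M + 1, END, k) + log_alpha[k] for k in TAGS])


def brute_force_log_partition(words):
    """Same quantity by enumerating all |Y|^M tag sequences (feasible only for tiny inputs)."""
    totals = []
    for tags in itertools.product(TAGS, repeat=len(words)):
        padded = [START, *tags, END]
        totals.append(sum(local_score(words, m, padded[m], padded[m - 1])
                          for m in range(1, len(padded))))
    return logsumexp(totals)


words = ["they", "can", "fish"]
log_z = log_partition(words)
```

The forward computation visits each trellis edge once, whereas the brute-force check grows exponentially with the sequence length. The two agree on this example, and the log-partition value is slightly above the best path score of −10, as expected when one sequence dominates the sum.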
Probabilistic programming environments, such as Torch (Collobert et al., 2011) and dynet (Neubig et al., 2017), can compute the gradient of this objective using automatic differentiation. The programmer need only implement the forward algorithm as a computation graph.
As in logistic regression, the gradient of the likelihood with respect to the parameters is a difference between observed and expected feature counts:
    \frac{d\ell}{d\theta_j} = \lambda \theta_j + \sum_{i=1}^{N} E[f_j(w^{(i)}, y)] - f_j(w^{(i)}, y^{(i)}), \qquad [7.85]
where f_j(w^{(i)}, y^{(i)}) refers to the count of feature j for token sequence w^{(i)} and tag sequence y^{(i)}. The expected feature counts are computed “under the hood” when automatic differentiation is applied to Equation 7.84 (Eisner, 2016).
Before the widespread use of automatic differentiation, it was common to compute the feature expectations from marginal tag probabilities p(y_m | w). These marginal probabilities are sometimes useful on their own, and can be computed using the forward-backward algorithm. This algorithm combines the forward recurrence with an equivalent backward recurrence, which traverses the input from w_M back to w_1.
    \Pr(Y_{m-1} = k', Y_m = k | w) = \frac{\sum_{y : Y_m = k, Y_{m-1} = k'} \prod_{n=1}^{M} \exp s_n(y_n, y_{n-1})}{\sum_{y'} \prod_{n=1}^{M} \exp s_n(y'_n, y'_{n-1})}. \qquad [7.86]
The numerator sums over all tag sequences that include the transition (Y_{m-1} = k') → (Y_m = k). Because we are only interested in sequences that include the tag bigram, this sum can be decomposed into three parts: the prefixes y_{1:m-1}, terminating in Y_{m-1} = k'; the transition (Y_{m-1} = k') → (Y_m = k); and the suffixes y_{m:M}, beginning with the tag Y_m = k:
    \sum_{y : Y_m = k, Y_{m-1} = k'} \prod_{n=1}^{M} \exp s_n(y_n, y_{n-1}) = \sum_{y_{1:m-1} : Y_{m-1} = k'} \prod_{n=1}^{m-1} \exp s_n(y_n, y_{n-1})
        \times \exp s_m(k, k')
        \times \sum_{y_{m:M} : Y_m = k} \prod_{n=m+1}^{M+1} \exp s_n(y_n, y_{n-1}). \qquad [7.87]
11 Recall the notational convention of upper-case letters for random variables, e.g. Y_m, and lower case letters for specific values, e.g., y_m, so that Y_m = k is interpreted as the event of random variable Y_m taking the value k.
[Figure: trellis nodes Y_{m-1} = k' and Y_m = k.]
The result is a product of three terms: a score that sums over all the ways to get to the position (Y_{m-1} = k'), a score for the transition from k' to k, and a score that sums over all the ways of finishing the sequence from (Y_m = k). The first term of Equation 7.87 is equal to the forward variable, α_{m-1}(k'). The third term (the sum over ways to finish the sequence) can also be defined recursively, this time moving over the trellis from right to left, which is known as the backward recurrence:
    \beta_m(k) \triangleq \sum_{y_{m:M} : Y_m = k} \prod_{n=m+1}^{M+1} \exp s_n(y_n, y_{n-1}) \qquad [7.88]
               = \sum_{k' \in \mathcal{Y}} \exp s_{m+1}(k', k) \sum_{y_{m+1:M} : Y_{m+1} = k'} \prod_{n=m+2}^{M+1} \exp s_n(y_n, y_{n-1}) \qquad [7.89]
               = \sum_{k' \in \mathcal{Y}} \exp s_{m+1}(k', k) \times \beta_{m+1}(k'). \qquad [7.90]
To understand this computation, compare with the forward recurrence in Equation 7.81. The application of the forward and backward probabilities is shown in Figure 7.3. Both the forward and backward recurrences operate on the trellis, which implies a space complexity O(MK). Because both recurrences require computing a sum over K terms at each node in the trellis, their time complexity is O(MK^2).
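The two recurrences can be combined to compute tag bigram marginals, as in the sketch below. It works with exponentiated scores directly, which is safe only for short inputs; the weights of Table 7.1 are reused as local scores for illustration, β_M(k) is initialized to exp s_{M+1}(■, k), and the names are assumptions.

```python
import math

START, END, TAGS = "<s>", "</s>", ["N", "V"]
EMIT = {"N": {"they": -2, "can": -3, "fish": -3},
        "V": {"they": -10, "can": -1, "fish": -3}}
TRANS = {START: {"N": -1, "V": -2, END: -math.inf},   # indexed [previous tag][next tag]
         "N": {"N": -3, "V": -1, END: -1},
         "V": {"N": -1, "V": -3, END: -1}}


def local_score(words, m, tag, prev):
    emission = 0 if tag == END else EMIT[tag][words[m - 1]]
    return TRANS[prev][tag] + emission


def forward_backward(words):
    M = len(words)
    # alpha[m][k] and beta[m][k] for m = 1..M; index 0 is unused padding.
    alpha = [None, {k: math.exp(local_score(words, 1, k, START)) for k in TAGS}]
    for m in range(2, M + 1):
        alpha.append({k: sum(math.exp(local_score(words, m, k, kp)) * alpha[m - 1][kp]
                             for kp in TAGS)
                      for k in TAGS})
    beta = [None] * (M + 1)
    beta[M] = {k: math.exp(local_score(words, M + 1, END, k)) for k in TAGS}
    for m in range(M - 1, 0, -1):
        beta[m] = {k: sum(math.exp(local_score(words, m + 1, kn, k)) * beta[m + 1][kn]
                          for kn in TAGS)
                   for k in TAGS}
    Z = sum(alpha[M][k] * beta[M][k] for k in TAGS)    # equals alpha_{M+1}(END)
    return alpha, beta, Z


def bigram_marginal(words, m, prev, tag):
    """Pr(Y_{m-1} = prev, Y_m = tag | w) for 2 <= m <= M."""
    alpha, beta, Z = forward_backward(words)
    return alpha[m - 1][prev] * math.exp(local_score(words, m, tag, prev)) * beta[m][tag] / Z


words = ["they", "can", "fish"]
p_vn = bigram_marginal(words, 3, "V", "N")
```

A useful sanity check is that the marginals at any position sum to one over all tag pairs, and that the sum of α_m(k) β_m(k) over k gives the same normalizer at every position m. A practical implementation would work in log space for longer sequences.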
The score ψ_m(y) can also be converted into a probability distribution using the usual softmax operation,
    p(y | w_{1:m}) = \frac{\exp \psi_m(y)}{\sum_{y' \in \mathcal{Y}} \exp \psi_m(y')}. \qquad [7.95]
Using this transformation, it is possible to train the tagger from the negative log-likelihood of the tags, as in a conditional random field. Alternatively, a hinge loss or margin loss objective can be constructed from the raw scores ψ_m(y).
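The softmax transformation and the resulting per-token loss can be sketched as follows. The tag scores are made-up numbers, and subtracting the maximum score before exponentiating is a standard stability trick that is not part of Equation 7.95 itself.

```python
import math


def softmax(scores):
    """Map tag scores psi_m(y) to probabilities p(y | w_{1:m})."""
    top = max(scores.values())                       # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


scores = {"N": 2.0, "V": 1.0, "D": -1.0}             # made-up scores for one position
probs = softmax(scores)
nll = -math.log(probs["N"])                          # negative log-likelihood if the true tag is N
```

Shifting all scores by a constant leaves the probabilities unchanged, which is why the subtraction is harmless.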
The hidden state h_m accounts for information in the input leading up to position m, but it ignores the subsequent tokens, which may also be relevant to the tag y_m. This can be addressed by adding a second RNN, in which the input is reversed, running the recurrence from w_M to w_1. This is known as a bidirectional recurrent neural network (Graves and Schmidhuber, 2005), and is specified as:
    \overleftarrow{h}_m = g(x_m, \overleftarrow{h}_{m+1}), \quad m = 1, 2, \ldots, M. \qquad [7.96]
The hidden states of the left-to-right RNN are denoted \overrightarrow{h}_m. The left-to-right and right-to-left vectors are concatenated, h_m = [\overleftarrow{h}_m; \overrightarrow{h}_m]. The scoring function in Equation 7.93 is applied to this concatenated vector.
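The bookkeeping of the two passes can be sketched without any neural network library. The recurrence g below is a toy placeholder with no learned parameters, chosen only so the example runs; a real tagger would use an LSTM or similar unit, and the input vectors are invented.

```python
import math


def g(x, h):
    """Toy recurrence (an assumption for illustration): elementwise tanh(x + h), no weights."""
    return [math.tanh(xi + hi) for xi, hi in zip(x, h)]


def bidirectional_states(xs):
    dim = len(xs[0])
    forward, h = [], [0.0] * dim
    for x in xs:                      # left-to-right pass: state m depends on x_1 .. x_m
        h = g(x, h)
        forward.append(h)
    backward, h = [], [0.0] * dim
    for x in reversed(xs):            # right-to-left pass: state m depends on x_m .. x_M
        h = g(x, h)
        backward.append(h)
    backward.reverse()                # re-align so backward[m] belongs to position m
    # h_m = [right-to-left state; left-to-right state]
    return [b + f for b, f in zip(backward, forward)]


states = bidirectional_states([[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]])
```

Each concatenated state has twice the dimension of a single pass. The left-to-right half at the first position sees only x_1, and the right-to-left half at the last position sees only x_M, while every other position combines context from both sides.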
Bidirectional RNN tagging has several attractive properties. Ideally, the representation h_m summarizes the useful information from the surrounding context, so that it is not necessary to design explicit features to capture this information. If the vector h_m is an adequate summary of this context, then it may not even be necessary to perform the tagging jointly: in general, the gains offered by joint tagging of the entire sequence are diminished as the individual tagging model becomes more powerful. Using backpropagation, the word vectors x can be trained “end-to-end”, so that they capture word properties that are useful for the tagging task. Alternatively, if limited labeled data is available, we can use word embeddings that are “pre-trained” from unlabeled data, using a language modeling objective (as in § 6.3) or a related word embedding technique (see chapter 14). It is even possible to combine both fine-tuned and pre-trained embeddings in a single model.
Neural structure prediction The bidirectional recurrent neural network incorporates information from throughout the input, but each tagging decision is made independently. In some sequence labeling applications, there are very strong dependencies between tags: it may even be impossible for one tag to follow another. In such scenarios, the tagging decision must be made jointly across the entire sequence.
Neural sequence labeling can be combined with the Viterbi algorithm by defining the local scores as:
    s_m(y_m, y_{m-1}) = \beta_{y_m} \cdot h_m + \eta_{y_{m-1}, y_m}, \qquad [7.97]
where h_m is the RNN hidden state, β_{y_m} is a vector associated with tag y_m, and η_{y_{m-1}, y_m} is a scalar parameter for the tag transition (y_{m-1}, y_m). These local scores can then be incorporated into the Viterbi algorithm for inference, and into the forward algorithm for training. This model is shown in Figure 7.4. It can be trained from the conditional log-likelihood objective defined in Equation 7.76, backpropagating to the tagging parameters β and η, as well as the parameters of the RNN. This model is called the LSTM-CRF, due to its combination of aspects of the long short-term memory and conditional random field models (Huang et al., 2015).

[Figure 7.4: for positions m−1, m, and m+1, the inputs x, the right-to-left and left-to-right hidden states, and the tags y.]
Figure 7.4: Bidirectional LSTM for sequence labeling. The solid lines indicate computation, the dashed lines indicate probabilistic dependency, and the dotted lines indicate the optional additional probabilistic dependencies between labels in the biLSTM-CRF.

The LSTM-CRF is especially effective on the task of named entity recognition (Lample et al., 2016), a sequence labeling task that is described in detail in § 8.3. This task has strong dependencies between adjacent tags, so structure prediction is especially important.
[\overrightarrow{h}^{(w)}_{N_w}; \overleftarrow{h}^{(w)}_{0}], where \overrightarrow{h}^{(w)}_{N_w} is the final state of the right-facing pass for word w, and N_w is the number of characters in the word. The character RNN model is trained by backpropagation from the tagging objective. On the test data, the trained RNN is applied to out-of-vocabulary words (or all words), yielding inputs to the word-level tagging RNN. Other approaches to compositional word embeddings are described in § 14.7.1.
    \Pr(W = i | Y = k) = \phi_{k,i} = \frac{E[\text{count}(W = i, Y = k)]}{E[\text{count}(Y = k)]}
    \Pr(Y_m = k | Y_{m-1} = k') = \lambda_{k',k} = \frac{E[\text{count}(Y_m = k, Y_{m-1} = k')]}{E[\text{count}(Y_{m-1} = k')]}
The expected counts are computed in the E-step, using the forward and backward recurrences. The local scores follow the usual definition for hidden Markov models,
As described in § 7.5.3.3, these marginal probabilities can be computed from the forward-backward recurrence,
    \Pr(Y_{m-1} = k', Y_m = k | w) = \frac{\alpha_{m-1}(k') \times s_m(k, k') \times \beta_m(k)}{\alpha_{M+1}(\blacksquare)}. \qquad [7.101]
Applying the conditional independence assumptions of the hidden Markov model (defined in Algorithm 12), the product is equal to the joint probability of the tag bigram and the entire input,
The expected emission counts can be computed in a similar manner, using the product α_m(k) × β_m(k).
Gibbs sampling has been applied to unsupervised part-of-speech tagging by Goldwater and Griffiths (2007). Beam sampling is a more sophisticated sampling algorithm, which randomly draws entire sequences y_{1:M}, rather than individual tags y_m; this algorithm was applied to unsupervised part-of-speech tagging by Van Gael et al. (2009). Spectral learning (see § 5.5.2) can also be applied to sequence labeling. By factoring matrices of co-occurrence counts of word bigrams and trigrams (Song et al., 2010; Hsu et al., 2012), it is possible to obtain globally optimal estimates of the transition and emission parameters, under mild assumptions.
Each recurrence that we have seen so far is a special case of this generalized Viterbi recurrence:
• In the max-product Viterbi recurrence over probabilities, the ⊕ operation corresponds to maximization, and the ⊗ operation corresponds to multiplication.
• In the forward recurrence over probabilities, the ⊕ operation corresponds to addition, and the ⊗ operation corresponds to multiplication.
• In the max-product Viterbi recurrence over log-probabilities, the ⊕ operation corresponds to maximization, and the ⊗ operation corresponds to addition.13
• In the forward recurrence over log-probabilities, the ⊕ operation corresponds to log-addition, a ⊕ b = log(e^a + e^b). The ⊗ operation corresponds to addition.
The mathematical abstraction offered by semiring notation can be applied to the software implementations of these algorithms, yielding concise and modular implementations. The OpenFst library (Allauzen et al., 2007) is an example of a software package in which the algorithms are parametrized by the choice of semiring.
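The four cases above can be covered by one routine that takes the ⊕ and ⊗ operations as arguments. This is a minimal sketch of the idea, not the OpenFst interface: the function name, the layout of the score tables, the "<s>" and "</s>" boundary symbols, and the toy scores are all assumptions made for illustration.

```python
import math
import operator
from functools import reduce


def generalized_viterbi(scores, plus, times, tags, start="<s>", end="</s>"):
    """Semiring recurrence; scores[m - 1][(tag, prev)] holds the local score s_m(tag, prev)."""
    M = len(scores) - 1                                   # the final table scores the end tag
    v = {k: scores[0][(k, start)] for k in tags}
    for m in range(1, M):
        v = {k: reduce(plus, [times(scores[m][(k, kp)], v[kp]) for kp in tags])
             for k in tags}
    return reduce(plus, [times(scores[M][(end, k)], v[k]) for k in tags])


def log_add(a, b):
    """a (+) b = log(e^a + e^b), computed stably for finite inputs."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


TAGS = ["N", "V"]
# Toy log-space scores for a two-token input (invented values).
log_scores = [
    {("N", "<s>"): -1.0, ("V", "<s>"): -2.0},
    {("N", "N"): -3.0, ("N", "V"): -1.0, ("V", "N"): -1.0, ("V", "V"): -3.0},
    {("</s>", "N"): -1.0, ("</s>", "V"): -1.0},
]
prob_scores = [{key: math.exp(s) for key, s in table.items()} for table in log_scores]

best_log = generalized_viterbi(log_scores, max, operator.add, TAGS)          # Viterbi, log space
best_prob = generalized_viterbi(prob_scores, max, operator.mul, TAGS)        # max-product
log_z = generalized_viterbi(log_scores, log_add, operator.add, TAGS)         # forward, log space
z = generalized_viterbi(prob_scores, operator.add, operator.mul, TAGS)       # sum-product
```

Only the two operator arguments change between the four calls. The log-space and probability-space results agree after exponentiation, which is the correspondence that the semiring view makes explicit.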
Exercises
1. Consider the garden path sentence, The old man the boat. Given word-tag and tag-tag features, what inequality in the weights must hold for the correct tag sequence to outscore the garden path tag sequence for this example?
2. Sketch out an algorithm for a variant of Viterbi that returns the top-n label sequences. What is the time and space complexity of this algorithm?
3845 4. Suppose you receive a stream of text, where some of tokens have been replaced at
3846 random with NOISE. For example:
3849 Assume you have access to a pre-trained bigram language model, which gives prob-
3850 abilities p(wm | wm−1 ). These probabilities can be assumed to be non-zero for all
3851 bigrams.
3852 a) Show how to use the Viterbi algorithm to try to recover the source by maxi-
3853 mizing the bigram language model log-probability. Specifically, set the scores
3854 sm (ym , ym−1 ) so that the Viterbi algorithm selects a sequence of words that
3855 maximizes the bigram language model log-probability, while leaving the non-
3856 noise tokens intact. Your solution should not modify the logic of the Viterbi
3857 algorithm, it should only set the scores sm (ym , ym−1 ).
3858 b) An alternative solution is to iterate through the text from m ∈ {1, 2, . . . , M },
replacing each noise token with the word that maximizes $p(w_m \mid w_{m-1})$ according
to the bigram language model. Give an upper bound on the expected
fraction of tokens for which the two approaches will disagree.
3862 5. Consider an RNN tagging model with a tanh activation function on the hidden
3863 layer, and a hinge loss on the output. (The problem also works for the margin loss
3864 and negative log-likelihood.) Suppose you initialize all parameters to zero: this
3865 includes the word embeddings that make up x, the transition matrix Θ, the out-
3866 put weights β, and the initial hidden state h0 . Prove that for any data and for any
3867 gradient-based learning algorithm, all parameters will be stuck at zero.
3868 Extra credit: would a sigmoid activation function avoid this problem?
3871 Sequence labeling has applications throughout natural language processing. This chap-
3872 ter focuses on part-of-speech tagging, morpho-syntactic attribute tagging, named entity
3873 recognition, and tokenization. It also touches briefly on two applications to interactive
3874 settings: dialogue act recognition and the detection of code-switching points between
3875 languages.
One difference between these two examples is that the first contains part-of-speech transitions
3891 that are typical in English: adjective to adjective, adjective to noun, noun to verb, and verb
3892 to adverb. The second example contains transitions that are unusual: noun to adjective
3893 and adjective to verb. The ambiguity in a headline like,
186 CHAPTER 8. APPLICATIONS OF SEQUENCE LABELING
3895 can also be explained in terms of parts of speech: in the interpretation that was likely
3896 intended, strikes is a noun and idle is a verb; in the alternative explanation, strikes is a verb
3897 and idle is an adjective.
Part-of-speech tagging is often taken as an early step in a natural language processing
3899 pipeline. Indeed, parts-of-speech provide features that can be useful for many of the
3900 tasks that we will encounter later, such as parsing (chapter 10), coreference resolution
3901 (chapter 15), and relation extraction (chapter 17).
3914 Here howling and shrieking are events, but grammatically they act as a noun and adjective
3915 respectively.
3919 Open class tags Nearly all languages contain nouns, verbs, adjectives, and adverbs.2
3920 These are all open word classes, because new words can easily be added to them. The
3921 UD tagset includes two other tags that are open classes: proper nouns and interjections.
3922 • Nouns (UD tag: NOUN) tend to describe entities and concepts, e.g.,
1
The UD tagset builds on earlier work from Petrov et al. (2012), in which a set of twelve universal tags
was identified by creating mappings from tagsets for individual languages.
2
One prominent exception is Korean, which some linguists argue does not have adjectives (Kim, 2002).
3924 In English, nouns tend to follow determiners and adjectives, and can play the subject
3925 role in the sentence. They can be marked for the plural number by an -s suffix.
3926 • Proper nouns (PROPN) are tokens in names, which uniquely specify a given entity,
3928 • Verbs (VERB), according to the UD guidelines, “typically signal events and ac-
3929 tions.” But they are also defined grammatically: they “can constitute a minimal
3930 predicate in a clause, and govern the number and types of other constituents which
3931 may occur in a clause.”3
3934 English verbs tend to come in between the subject and some number of direct ob-
3935 jects, depending on the verb. They can be marked for tense and aspect using suffixes
3936 such as -ed and -ing. (These suffixes are an example of inflectional morphology,
3937 which is discussed in more detail in § 9.1.4.)
3938 • Adjectives (ADJ) describe properties of entities,
In the second example, scarce is a predicative adjective, linked to the subject by the
copula verb are. In contrast, murderous and veteran are attributive
adjectives, modifying the noun phrase in which they are embedded.
3944 • Adverbs (ADV) describe properties of events, and may also modify adjectives or
3945 other adverbs:
3946 (8.11) It is not down on any map; true places never are.
3947 (8.12) . . . treacherously hidden beneath the loveliest tints of azure
3948 (8.13) Not drowned entirely, though.
3950 (8.14) Aye aye! it was that accursed white whale that razed me.
3
http://universaldependencies.org/u/pos/VERB.html
3951 Closed class tags Closed word classes rarely receive new members. They are sometimes
3952 referred to as function words — as opposed to content words — as they have little lexical
3953 meaning of their own, but rather, help to organize the components of the sentence.
3954 • Adpositions (ADP) describe the relationship between a complement (usually a noun
3955 phrase) and another unit in the sentence, typically a noun or verb phrase.
3959 As the examples show, English generally uses prepositions, which are adpositions
3960 that appear before their complement. (An exception is ago, as in, we met three days
3961 ago). Postpositions are used in other languages, such as Japanese and Turkish.
3962 • Auxiliary verbs (AUX) are a closed class of verbs that add information such as
3963 tense, aspect, person, and number.
3969 The final example is a copula verb, which is also tagged as an auxiliary in the UD
3970 corpus.
3971 • Coordinating conjunctions (CCONJ) express relationships between two words or
3972 phrases, which play a parallel role:
3974 • Subordinating conjunctions (SCONJ) link two elements, making one syntactically
3975 subordinate to the other:
3977 • Pronouns (PRON) are words that substitute for nouns or noun phrases.
3980 The example includes the personal pronouns I and it, as well as the relative pronoun
3981 what. Other pronouns include myself, somebody, and nothing.
3982 • Determiners (DET) provide additional information about the nouns or noun phrases
3983 that they modify:
3984 (8.27) What the white whale was to Ahab, has been hinted.
3985 (8.28) It is not down on any map.
3986 (8.29) I try all things . . .
3987 (8.30) Shall we keep chasing this murderous fish?
3992 (8.31) How then can this one small heart beat.
3993 (8.32) I am going to put him down for the three hundredth.
3994 • Particles (PART) are a catch-all of function words that combine with other words or
3995 phrases, but do not meet the conditions of the other tags. In English, this includes
3996 the infinitival to, the possessive marker, and negation.
3997 (8.33) Better to sleep with a sober cannibal than a drunk Christian.
3998 (8.34) So man’s insanity is heaven’s sense
3999 (8.35) It is not down on any map
4000 As the second example shows, the possessive marker is not considered part of the
4001 same token as the word that it modifies, so that man’s is split into two tokens. (Tok-
4002 enization is described in more detail in § 8.4.) A non-English example of a particle
4003 is the Japanese question marker ka, as in,4
Other The remaining UD tags include punctuation (PUNCT) and symbols (SYM). Punctuation
is purely structural (e.g., commas, periods, colons), while symbols can carry
4008 content of their own. Examples of symbols include dollar and percentage symbols, math-
4009 ematical operators, emoticons, emojis, and internet addresses. A final catch-all tag is X,
4010 which is used for words that cannot be assigned another part-of-speech category. The X
4011 tag is also used in cases of code switching (between languages), described in § 8.5.
4066 Similar results for the PTB data have been achieved using conditional random fields (CRFs;
4067 Toutanova et al., 2003).
4068 More recent work has demonstrated the power of neural sequence models, such as the
4069 long short-term memory (LSTM) (§ 7.6). Plank et al. (2016) apply a CRF and a bidirec-
4070 tional LSTM to twenty-two languages in the UD corpus, achieving an average accuracy
4071 of 94.3% for the CRF, and 96.5% with the bi-LSTM. Their neural model employs three
4072 types of embeddings: fine-tuned word embeddings, which are updated during training;
4073 pre-trained word embeddings, which are never updated, but which help to tag out-of-
4074 vocabulary words; and character-based embeddings. The character-based embeddings
4075 are computed by running an LSTM on the individual characters in each word, thereby
4076 capturing common orthographic patterns such as prefixes, suffixes, and capitalization.
4077 Extensive evaluations show that these additional embeddings are crucial to their model’s
4078 success.
. . PUNCT
Figure 8.1: UD and PTB part-of-speech tags, and UD morphosyntactic attributes. Example
selected from the UD 1.4 English corpus.
miner or pronominal modifier), and DEFINITE=DEF, which indicates that it is a definite
article (referring to a specific, known entity). The verbs are each marked with several
attributes. The auxiliary verb was is third-person, singular, past tense, finite (conjugated),
and indicative (describing an event that has happened or is currently happening); the
4094 main verb destroyed is in participle form (so there is no additional person and number
4095 information), past tense, and passive voice. Some, but not all, of these distinctions are
4096 reflected in the PTB tags VBD (past-tense verb) and VBN (past participle).
4097 While there are thousands of papers on part-of-speech tagging, there is comparatively
4098 little work on automatically labeling morphosyntactic attributes. Faruqui et al. (2016)
4099 train a support vector machine classification model, using a minimal feature set that in-
4100 cludes the word itself, its prefixes and suffixes, and type-level information listing all pos-
4101 sible morphosyntactic attributes for each word and its neighbors. Mueller et al. (2013) use
4102 a conditional random field (CRF), in which the tag space consists of all observed com-
4103 binations of morphosyntactic attributes (e.g., the tag would be DEF+ART for the word
4104 the in Figure 8.1). This massive tag space is managed by decomposing the feature space
over individual attributes, and pruning paths through the trellis. More recent work has
employed bidirectional LSTM sequence models. For example, in the model of Pinter et al. (2017),
the input layer and hidden vectors in the LSTM
are shared across attributes, but each attribute has its own output layer, culminating in
a softmax over all attribute values, e.g. $y_t^{\text{NUMBER}} \in \{\text{SING}, \text{PLURAL}, \ldots\}$. They find that
character-level information is crucial, especially when the amount of labeled data is limited.
Evaluation is performed by first computing recall and precision for each attribute.
These scores can then be averaged at either the token or type level to obtain micro- or
macro-F-MEASURE, respectively. Pinter et al. (2017) evaluate on 23 languages in the UD treebank,
reporting a median micro-F-MEASURE of 0.95. Performance is strongly correlated with the
size of the labeled dataset for each language, with a few outliers: for example, Chinese is
particularly difficult, because although the dataset is relatively large ($10^5$ tokens in the UD
1.4 corpus), only 6% of tokens have any attributes, offering few useful labeled instances.
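The difference between the two averages can be made concrete with a small sketch. The attribute names and counts below are hypothetical, chosen only to show how a rare, poorly predicted attribute pulls the macro average down much more than the micro average; they are not figures from Pinter et al. (2017).

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure from counts; equal to 2PR / (P + R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical (true positive, false positive, false negative) counts per attribute.
COUNTS = {
    "NUMBER": (90, 10, 10),  # frequent attribute, predicted well
    "TENSE": (5, 5, 15),     # rare attribute, predicted poorly
}

def macro_f(counts) -> float:
    """Average the per-attribute F-measures (each attribute weighted equally)."""
    return sum(f_measure(*c) for c in counts.values()) / len(counts)

def micro_f(counts) -> float:
    """Pool the counts over all attributes, then compute one F-measure."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f_measure(tp, fp, fn)

macro = macro_f(COUNTS)
micro = micro_f(COUNTS)
```

With these counts the pooled (micro) score is dominated by the frequent attribute, while the macro score gives the rare attribute equal weight.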
Figure 8.2: BIO notation for named entity recognition. Example (8.38) is drawn from the
GENIA corpus of biomedical documents (Ohta et al., 2002).
4128 stracts.
4129 A standard approach to tagging named entity spans is to use discriminative sequence
4130 labeling methods such as conditional random fields. However, the named entity recogni-
4131 tion (NER) task would seem to be fundamentally different from sequence labeling tasks
like part-of-speech tagging: rather than tagging each token, the goal is to recover spans
4133 of tokens, such as The United States Army.
4134 This is accomplished by the BIO notation, shown in Figure 8.2. Each token at the
beginning of a name span is labeled with a B- prefix; each token within a name span is labeled
with an I- prefix. These prefixes are followed by a tag for the entity type, e.g. B-LOC
for the beginning of a location, and I-PROTEIN for the inside of a protein name. Tokens
4138 that are not parts of name spans are labeled as O. From this representation, the entity
4139 name spans can be recovered unambiguously. This tagging scheme is also advantageous
4140 for learning: tokens at the beginning of name spans may have different properties than
4141 tokens within the name, and the learner can exploit this. This insight can be taken even
4142 further, with special labels for the last tokens of a name span, and for unique tokens in
4143 name spans, such as Atlanta in the example in Figure 8.2. This is called BILOU notation,
4144 and it can yield improvements in supervised named entity recognition (Ratinov and Roth,
4145 2009).
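The claim that spans can be recovered unambiguously from BIO tags can be illustrated with a short decoder. This is a sketch under stated assumptions: the function name bio_to_spans is introduced here, the example tags are illustrative rather than copied from Figure 8.2, and an I- tag that does not continue a span of the same type is treated as starting a new span, which is one of several possible conventions.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (start, end, type) spans, end exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # An I- tag of the current type simply extends the open span.
        if tag.startswith("I-") and etype == tag[2:]:
            continue
        # Anything else closes the open span, if there is one.
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        # B- opens a new span; a stray I- is treated the same way (assumption).
        if tag.startswith("B-") or tag.startswith("I-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans

tokens = ["The", "U.S.", "Army", "captured", "Atlanta"]
tags = ["B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
spans = bio_to_spans(tags)
names = [" ".join(tokens[s:e]) for s, e, _ in spans]
```

Because B- marks every span start, two adjacent entities of the same type (B-ORG followed by B-ORG) decode to two spans rather than one, which a plain inside/outside scheme could not distinguish.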
Feature-based sequence labeling Named entity recognition was one of the first applications
of conditional random fields (McCallum and Li, 2003). The use of Viterbi decoding
restricts the feature function $f(w, y)$ to be a sum of local features, $\sum_m f(w, y_m, y_{m-1}, m)$,
so that each feature can consider only adjacent tags. Typical features include tag transitions,
word features for $w_m$ and its neighbors, character-level features for prefixes and
suffixes, and “word shape” features for capitalization and other orthographic properties.
As an example, base features for the word Army in the example in (8.37) include:
(CURR-WORD:Army, PREV-WORD:U.S., NEXT-WORD:captured, PREFIX-1:A-,
PREFIX-2:Ar-, SUFFIX-1:-y, SUFFIX-2:-my, SHAPE:Xxxx)
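A feature extractor that produces base features of this form might look as follows. This is a minimal sketch: the function names, the boundary symbols "<s>" and "</s>", and the mapping of digits to "d" in the word shape are assumptions made for illustration.

```python
import re

def word_shape(word: str) -> str:
    """Map uppercase letters to X, lowercase letters to x, and digits to d."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def base_features(words, m):
    """Base features for the token at position m (tag features are added elsewhere)."""
    w = words[m]
    prev_w = words[m - 1] if m > 0 else "<s>"                 # hypothetical boundary symbol
    next_w = words[m + 1] if m + 1 < len(words) else "</s>"   # hypothetical boundary symbol
    return [
        f"CURR-WORD:{w}",
        f"PREV-WORD:{prev_w}",
        f"NEXT-WORD:{next_w}",
        f"PREFIX-1:{w[:1]}-",
        f"PREFIX-2:{w[:2]}-",
        f"SUFFIX-1:-{w[-1:]}",
        f"SUFFIX-2:-{w[-2:]}",
        f"SHAPE:{word_shape(w)}",
    ]

features = base_features(["The", "U.S.", "Army", "captured", "Atlanta"], 2)
```

In a conditional random field, each of these base features would typically be conjoined with the current tag, and tag-transition features would be added separately.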
Another source of features is to use gazetteers: lists of known entity names. For example,
the U.S. Social Security Administration provides a list of tens of thousands of given names,
more than could be observed in any annotated corpus. Tokens or spans that match an
entry in a gazetteer can receive special features; this provides a way to incorporate
hand-crafted resources such as name lists in a learning-driven framework.
(1) 日文 章魚 怎麼 說?
Japanese octopus how say
How to say octopus in Japanese?
(2) 日 文章 魚 怎麼 說?
Japan essay fish how say
4151 Neural sequence labeling for NER Current research has emphasized neural sequence
4152 labeling, using similar LSTM models to those employed in part-of-speech tagging (Ham-
4153 merton, 2003; Huang et al., 2015; Lample et al., 2016). The bidirectional LSTM-CRF (Fig-
4154 ure 7.4 in § 7.6) does particularly well on this task, due to its ability to model tag-to-tag
4155 dependencies. However, Strubell et al. (2017) show that convolutional neural networks
4156 can be equally accurate, with significant improvement in speed due to the efficiency of
4157 implementing ConvNets on graphics processing units (GPUs). The key innovation in
4158 this work was the use of dilated convolution, which is described in more detail in § 3.4.
field to predict labels of START or NONSTART on each character. More recent work has
4177 employed neural network architectures. For example, Chen et al. (2015) use an LSTM-
4178 CRF architecture, as described in § 7.6: they construct a trellis, in which each tag is scored
4179 according to the hidden state of an LSTM, and tag-tag transitions are scored according
4180 to learned transition weights. The best-scoring segmentation is then computed by the
4181 Viterbi algorithm.
(8.39) Although everything written on this site est disponible en anglais [is available in English]
and in French, my personal videos seront bilingues [will be bilingual]
4189 Accurately analyzing such texts requires first determining which languages are being
4190 used. Furthermore, quantitative analysis of code switching can provide insights on the
4191 languages themselves and their relative social positions.
4192 Code switching can be viewed as a sequence labeling problem, where the goal is to la-
4193 bel each token as a candidate switch point. In the example above, the words est, and, and
4194 seront would be labeled as switch points. Solorio and Liu (2008) detect English-Spanish
4195 switch points using a supervised classifier, with features that include the word, its part-of-
4196 speech in each language (according to a supervised part-of-speech tagger), and the prob-
4197 abilities of the word and part-of-speech in each language. Nguyen and Dogruöz (2013)
4198 apply a conditional random field to the problem of detecting code switching between
4199 Turkish and Dutch.
Code switching is a special case of the more general problem of word-level language
4201 identification, which Barman et al. (2014) address in the context of trilingual code switch-
4202 ing between Bengali, English, and Hindi. They further observe an even more challenging
4203 phenomenon: intra-word code switching, such as the use of English suffixes with Bengali
4204 roots. They therefore mark each token as either (1) belonging to one of the three languages;
4205 (2) a mix of multiple languages; (3) “universal” (e.g., symbols, numbers, emoticons); or
4206 (4) undefined.
7
As quoted in http://blogues.lapresse.ca/lagace/2008/09/08/
justin-trudeau-really-parfait-bilingue/, accessed August 21, 2017.
4230 as questions and answers. These features are handled with an additional emission distri-
4231 bution, p(am | ym ), which is modeled with a probabilistic decision tree (Murphy, 2012).
4232 While acoustic features yield small improvements overall, they play an important role in
distinguishing questions from statements, and agreements from backchannels.
4234 Recurrent neural architectures for dialogue act labeling have been proposed by Kalch-
4235 brenner and Blunsom (2013) and Ji et al. (2016), with strong empirical results. Both models
4236 are recurrent at the utterance level, so that each complete utterance updates a hidden state.
4237 The recurrent-convolutional network of Kalchbrenner and Blunsom (2013) uses convolu-
4238 tion to obtain a representation of each individual utterance, while Ji et al. (2016) use a
4239 second level of recurrence, over individual words. This enables their method to also func-
4240 tion as a language model, giving probabilities over sequences of words in a document.
4241 Exercises
4245 We have now seen methods for learning to label individual words, vectors of word counts,
4246 and sequences of words; we will soon proceed to more complex structural transforma-
4247 tions. Most of these techniques could apply to counts or sequences from any discrete vo-
4248 cabulary; there is nothing fundamentally linguistic about, say, a hidden Markov model.
4249 This raises a basic question that this text has not yet considered: what is a language?
4250 This chapter will take the perspective of formal language theory, in which a language
4251 is defined as a set of strings, each of which is a sequence of elements from a finite alphabet.
4252 For interesting languages, there are an infinite number of strings that are in the language,
4253 and an infinite number of strings that are not. For example:
4254 • the set of all even-length sequences from the alphabet {a, b}, e.g., {∅, aa, ab, ba, bb, aaaa, aaab, . . .};
4255 • the set of all sequences from the alphabet {a, b} that contain aaa as a substring, e.g.,
4256 {aaa, aaaa, baaa, aaab, . . .};
4257 • the set of all sequences of English words (drawn from a finite dictionary) that con-
4258 tain at least one verb (a finite subset of the dictionary);
• the Python programming language.
4260 Formal language theory defines classes of languages and their computational prop-
4261 erties. Of particular interest is the computational complexity of solving the membership
4262 problem — determining whether a string is in a language. The chapter will focus on
4263 three classes of formal languages: regular, context-free, and “mildly” context-sensitive
4264 languages.
4265 A key insight of 20th century linguistics is that formal language theory can be usefully
4266 applied to natural languages such as English, by designing formal languages that cap-
4267 ture as many properties of the natural language as possible. For many such formalisms, a
4268 useful linguistic analysis comes as a byproduct of solving the membership problem. The
200 CHAPTER 9. FORMAL LANGUAGE THEORY
4269 membership problem can be generalized to the problems of scoring strings for their ac-
4270 ceptability (as in language modeling), and of transducing one string into another (as in
4271 translation).
4288 • The set of all even length strings on the alphabet {a, b}: ((aa)|(ab)|(ba)|(bb))∗
4289 • The set of all sequences of the alphabet {a, b} that contain aaa as a substring: (a|b)∗ aaa(a|b)∗
4290 • The set of all sequences of English words that contain at least one verb: W ∗ V W ∗ ,
4291 where W is an alternation between all words in the dictionary, and V is an alterna-
4292 tion between all verbs (V ⊆ W ).
4293 This list does not include a regular expression for the Python programming language,
4294 because this language is not regular — there is no regular expression that can capture its
4295 syntax. We will discuss why towards the end of this section.
Regular languages are closed under union, intersection, and concatenation. This means,
for example, that if two languages $L_1$ and $L_2$ are regular, then so are the languages $L_1 \cup L_2$,
$L_1 \cap L_2$, and the language of strings that can be decomposed as $s = tu$, with $t \in L_1$ and
$u \in L_2$. Regular languages are also closed under negation: if $L$ is regular, then so is the
language $\overline{L} = \{s \notin L\}$.
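[Figure 9.1 state diagram: start state q0 with a self-loop labeled a, an arc labeled b from q0 to q1, and a self-loop labeled b on q1.]
$$\Sigma = \{a, b\}$$ [9.1]
$$Q = \{q_0, q_1\}$$ [9.2]
$$F = \{q_1\}$$ [9.3]
$$\delta = \{(q_0, a) \to q_0,\ (q_0, b) \to q_1,\ (q_1, b) \to q_1\}.$$ [9.4]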
4320 This FSA defines a language over an alphabet of two symbols, a and b. The transition
4321 function δ is written as a set of arcs: (q0 , a) → q0 says that if the machine is in state
4322 q0 and reads symbol a, it stays in q0 . Figure 9.1 provides a graphical representation of
4323 M1 . Because each pair of initial state and symbol has at most one resulting state, M1 is
4324 deterministic: each string ω induces at most one accepting path. Note that there are no
4325 transitions for the symbol a in state q1 ; if a is encountered in q1 , then the acceptor is stuck,
4326 and the input string is rejected.
4327 What strings does M1 accept? The start state is q0 , and we have to get to q1 , since this
4328 is the only final state. Any number of a symbols can be consumed in q0 , but a b symbol is
4329 required to transition to q1 . Once there, any number of b symbols can be consumed, but
4330 an a symbol cannot. So the regular expression corresponding to the language defined by
$M_1$ is $a^* b b^*$.
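The equivalence between M1 and this regular expression can be checked mechanically. The sketch below encodes the transition function from [9.4] as a dictionary and compares the acceptor against Python's re module on every string over {a, b} up to length six; the function name accepts and the length bound are choices made for this illustration.

```python
import itertools
import re

# Transition function of M1, copied from [9.4]; missing entries mean "stuck".
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q1", "b"): "q1"}
START = "q0"
FINAL = {"q1"}

def accepts(string: str) -> bool:
    """Run the deterministic acceptor; reject if no transition is defined."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False
        state = DELTA[(state, symbol)]
    return state in FINAL

PATTERN = re.compile(r"a*bb*")

# Collect any string on which the acceptor and the regular expression disagree.
mismatches = [
    "".join(chars)
    for n in range(7)
    for chars in itertools.product("ab", repeat=n)
    if accepts("".join(chars)) != bool(PATTERN.fullmatch("".join(chars)))
]
```

An exhaustive check up to a fixed length is not a proof of equivalence, but it is a quick sanity test when writing acceptors by hand.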
4345 • Derivational morphology describes the use of affixes to convert a word from one
4346 grammatical category to another (e.g., from the noun grace to the adjective graceful),
4347 or to change the meaning of the word (e.g., from grace to disgrace).
4348 • Inflectional morphology describes the addition of details such as gender, number,
4349 person, and tense (e.g., the -ed suffix for past tense in English).
4350 Morphology is a rich topic in linguistics, deserving of a course in its own right.1 The
4351 focus here will be on the use of finite state automata for morphological analysis. The
1
A good starting point would be a chapter from a linguistics textbook (e.g., Akmajian et al., 2010; Bender,
2013). A key simplification in this chapter is the focus on affixes as the sole method of derivation and
inflection. English makes use of affixes, but also incorporates apophony, such as the inflection of foot to feet. Semitic
languages like Arabic and Hebrew feature a template-based system of morphology, in which roots are triples
of consonants (e.g., ktb), and words are created by adding vowels: kataba (Arabic: he wrote), kutub (books),
maktab (desk). For more detail on morphology, see texts from Haspelmath and Sims (2013) and Lieber (2015).
4352 current section deals with derivational morphology; inflectional morphology is discussed
4353 in § 9.1.4.3.
4354 Suppose that we want to write a program that accepts only those words that are con-
4355 structed in accordance with the rules of English derivational morphology:
4360 (Recall that the asterisk indicates that a linguistic example is judged unacceptable by flu-
4361 ent speakers of a language.) These examples cover only a tiny corner of English deriva-
4362 tional morphology, but a number of things stand out. The suffix -ful converts the nouns
4363 grace and disgrace into adjectives, and the suffix -ly converts adjectives into adverbs. These
4364 suffixes must be applied in the correct order, as shown by the unacceptability of *grace-
4365 lyful. The -ful suffix works for only some words, as shown by the use of alluring as the
4366 adjectival form of allure. Other changes are made with prefixes, such as the derivation
4367 of disgrace from grace, which roughly corresponds to a negation; however, fair is negated
4368 with the un- prefix instead. Finally, while the first three examples suggest that the direc-
4369 tion of derivation is noun → adjective → adverb, the example of fair suggests that the
4370 adjective can also be the base form, with the -ness suffix performing the conversion to a
4371 noun.
4372 Can we build a computer program that accepts only well-formed English words, and
4373 rejects all others? This might at first seem trivial to solve with a brute-force attack: simply
4374 make a dictionary of all valid English words. But such an approach fails to account for
4375 morphological productivity — the applicability of existing morphological rules to new
4376 words and names, such as Trump to Trumpy and Trumpkin, and Clinton to Clintonian and
4377 Clintonite. We need an approach that represents morphological rules explicitly, and for
4378 this we will try a finite state acceptor.
4379 The dictionary approach can be implemented as a finite state acceptor, with the vo-
4380 cabulary Σ equal to the vocabulary of English, and a transition from the start state to the
4381 accepting state for each word. But this would of course fail to generalize beyond the origi-
4382 nal vocabulary, and would not capture anything about the morphotactic rules that govern
4383 derivations from new words. The first step towards a more general approach is shown in
4384 Figure 9.2, which is the state diagram for a finite state acceptor in which the vocabulary
4385 consists of morphemes, which include stems (e.g., grace, allure) and affixes (e.g., dis-, -ing,
4386 -ly). This finite state acceptor consists of a set of paths leading away from the start state,
4387 with derivational affixes added along the path. Except for qneg , the states on these paths
4388 are all final, so the FSA will accept disgrace, disgraceful, and disgracefully, but not dis-.
[Figure 9.2 state diagram. Recoverable from the extraction: states q0 (start), qneg, qN1, qJ1, qA1, qN2, qJ2, qA2; arc labels grace, dis-, -ful, -ly.]
Figure 9.2: A finite state acceptor for a fragment of English derivational morphology. Each
path represents possible derivations from a single root form.
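A morpheme-level acceptor of this kind can be sketched in a few lines. The arcs below cover only the two paths through grace, and they are reconstructed from the surrounding description rather than copied from the figure, so the exact state names on each arc should be read as assumptions. The start state q0 is also assumed to be non-final.

```python
# Arcs over morphemes (assumed reconstruction of the grace paths only).
ARCS = {
    ("q0", "grace"): "qN1",
    ("qN1", "-ful"): "qJ1",
    ("qJ1", "-ly"): "qA1",
    ("q0", "dis-"): "qneg",
    ("qneg", "grace"): "qN2",
    ("qN2", "-ful"): "qJ2",
    ("qJ2", "-ly"): "qA2",
}
# Every state on a path is final except qneg (and, by assumption, q0).
FINAL = {"qN1", "qJ1", "qA1", "qN2", "qJ2", "qA2"}

def accepts_morphemes(morphemes, start="q0"):
    """Return True if the morpheme sequence ends in a final state."""
    state = start
    for morpheme in morphemes:
        state = ARCS.get((state, morpheme))
        if state is None:  # no arc for this morpheme: the acceptor is stuck
            return False
    return state in FINAL
```

For example, accepts_morphemes(["dis-", "grace", "-ful", "-ly"]) returns True, while ["dis-"] alone and the misordered ["grace", "-ly", "-ful"] are both rejected.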
4389 This FSA can be minimized to the form shown in Figure 9.3, which makes the gen-
4390 erality of the finite state approach more apparent. For example, the transition from q0 to
4391 qJ2 can be made to accept not only fair but any single-morpheme (monomorphemic) ad-
4392 jective that takes -ness and -ly as suffixes. In this way, the finite state acceptor can easily
4393 be extended: as new word stems are added to the vocabulary, their derived forms will be
4394 accepted automatically. Of course, this FSA would still need to be extended considerably
4395 to cover even this small fragment of English morphology. As shown by cases like music
4396 → musical, athlete → athletic, English includes several classes of nouns, each with its own
4397 rules for derivation.
4398 The FSAs shown in Figure 9.2 and 9.3 accept allureing, not alluring. This reflects a dis-
4399 tinction between morphology — the question of which morphemes to use, and in what
4400 order — and orthography — the question of how the morphemes are rendered in written
4401 language. Just as orthography requires dropping the e preceding the -ing suffix, phonol-
4402 ogy imposes a related set of constraints on how words are rendered in speech. As we will
4403 see soon, these issues are handled through finite state transducers, which are finite state
4404 automata that take inputs and produce outputs.
[Figure 9.3 state diagram. Recoverable from the extraction: states q0 (start), qN2, qJ2, qN3; arc labels dis-, grace, allure, fair, -ing, -ly, -ness.]
Figure 9.3: Minimization of the finite state acceptor shown in Figure 9.2.
4411 “Though many things are possible in morphology, some are more possible than others.”
4412 But finite state acceptors give no way to express preferences among technically valid
4413 choices.
4414 Weighted finite state acceptors (WFSAs) are generalizations of FSAs, in which each
4415 accepting path is assigned a score, computed from the transitions, the initial state, and the
4416 final state. Formally, a weighted finite state acceptor M = (Q, Σ, λ, ρ, δ) consists of:
4422 WFSAs depart from the FSA formalism in three ways: every state can be an initial
4423 state, with score λ(q); every state can be an accepting state, with score ρ(q); transitions are
4424 possible between any pair of states on any input, with a score δ(qi , ω, qj ). Nonetheless,
4425 FSAs can be viewed as a special case: for any FSA M we can build an equivalent WFSA
by setting $\lambda(q) = \infty$ for all $q \neq q_0$, $\rho(q) = \infty$ for all $q \notin F$, and $\delta(q_i, \omega, q_j) = \infty$ for all
transitions $\{(q_i, \omega) \to q_j\}$ that are not permitted by the transition function of $M$.
The total score for any path $\pi = t_1, t_2, \ldots, t_N$ is equal to the sum of these scores,
$$d(\pi) = \lambda(\text{from-state}(t_1)) + \sum_{n=1}^{N} \delta(t_n) + \rho(\text{to-state}(t_N)).$$ [9.5]
4429 A shortest-path algorithm is used to find the minimum-cost path through a WFSA for
4430 string ω, with time complexity O(E + V log V ), where E is the number of edges and V is
4431 the number of vertices (Cormen et al., 2009).2
2
Shortest-path algorithms find the path with the minimum cost. In many cases, the path weights are log
The log probability under an n-gram language model can be modeled in a WFSA. First
consider a unigram language model. We need only a single state q0 , with transition scores
δ(q0 , ω, q0 ) = log p1 (ω). The initial and final scores can be set to zero. Then the path score
for w1 , w2 , . . . , wM is equal to,
$$0 + \sum_{m=1}^{M} \delta(q_0, w_m, q_0) + 0 = \sum_{m=1}^{M} \log p_1(w_m).$$ [9.7]
For an n-gram language model with n > 1, we need probabilities that condition on
the past history. For example, in a bigram language model, the transition weights must
represent log p2 (wm | wm−1 ). The transition scoring function must somehow “remember”
the previous word or words. This can be done by adding more states: to model the bigram
probability p2 (wm | wm−1 ), we need a state for every possible wm−1 — a total of V states.
The construction indexes each state qi by a context event wm−1 = i. The weights are then
assigned as follows:
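$$\delta(q_i, \omega, q_j) = \begin{cases} \log \Pr(w_m = j \mid w_{m-1} = i), & \omega = j \\ -\infty, & \omega \neq j \end{cases}$$
$$\lambda(q_i) = \log \Pr(w_1 = i \mid w_0 = \square)$$
$$\rho(q_i) = \log \Pr(w_{M+1} = \blacksquare \mid w_M = i).$$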
4435 The transition function is designed to ensure that the context is recorded accurately:
4436 we can move to state j on input ω only if ω = j; otherwise, transitioning to state j is
4437 forbidden by the weight of −∞. The initial weight function λ(qi ) is the log probability of
4438 receiving i as the first token, and the final weight function ρ(qi ) is the log probability of
4439 receiving an “end-of-string” token after observing wM = i.
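The construction can be sketched directly from these definitions. The bigram probabilities below are toy values, and the strings "<s>" and "</s>" are hypothetical names for the start and end padding symbols; states are identified with the previous word, as in the text.

```python
import math

START, END = "<s>", "</s>"  # hypothetical names for the padding symbols

# Toy bigram probabilities p2(next | previous); each row sums to one.
P2 = {
    START: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.6, END: 0.3},
    "b": {"a": 0.5, "b": 0.2, END: 0.3},
}

def initial(i):
    """lambda(q_i): log probability of i as the first token."""
    return math.log(P2[START][i])

def final(i):
    """rho(q_i): log probability of ending the string after i."""
    return math.log(P2[i][END])

def delta(i, omega, j):
    """Transition score: moving to state j is allowed only on input omega = j."""
    return math.log(P2[i][j]) if omega == j else -math.inf

def path_score(words):
    """Score of the single finite-score path that consumes the word sequence."""
    score = initial(words[0])
    for prev, cur in zip(words, words[1:]):
        score += delta(prev, cur, cur)
    return score + final(words[-1])

score = path_score(["a", "b", "b"])
```

Here the first word is consumed by the initial weight, so the transitions cover positions 2 through M; the path score equals the bigram log-probability of the padded string.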
may have multiple accepting paths. In some applications, the score for the input is aggregated across all such paths. Such aggregate scores can be computed by generalizing WFSAs with semiring notation, first introduced in § 7.7.3.
Let d(π) represent the total score for path π = t1, t2, ..., tN, which is computed as,

\[
d(\pi) = \lambda(\text{from-state}(t_1)) \otimes \delta(t_1) \otimes \delta(t_2) \otimes \cdots \otimes \delta(t_N) \otimes \rho(\text{to-state}(t_N)). \tag{9.8}
\]

This is a generalization of Equation 9.5 to semiring notation, using the semiring multiplication operator ⊗ in place of addition.
Now let s(ω) represent the total score for all paths Π(ω) that consume input ω,

\[
s(\omega) = \bigoplus_{\pi \in \Pi(\omega)} d(\pi). \tag{9.9}
\]

Here, semiring addition (⊕) is used to combine the scores of multiple paths.
The generalization to semirings covers a number of useful special cases. In the log-probability semiring, multiplication is defined as log p(x) ⊗ log p(y) = log p(x) + log p(y), and addition is defined as log p(x) ⊕ log p(y) = log(p(x) + p(y)). Thus, s(ω) represents the log-probability of accepting input ω, marginalizing over all paths π ∈ Π(ω). In the boolean semiring, the ⊗ operator is logical conjunction, and the ⊕ operator is logical disjunction. This reduces to the special case of unweighted finite state acceptors, where the score s(ω) is a boolean indicating whether there exists any accepting path for ω. In the tropical semiring, the ⊕ operator is a maximum, so the resulting score is the score of the best-scoring path through the WFSA. The OpenFST toolkit uses semirings and polymorphism to implement general algorithms for weighted finite state automata (Allauzen et al., 2007).
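To make the special cases concrete, the following sketch computes path scores and their aggregate with the two semiring operators swapped in as ordinary functions. The `Semiring` class and all names are illustrative assumptions, not the OpenFST API; the maximum-based semiring follows the definition given in the text.

```python
import math
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # ⊕: combines scores of alternative paths
    times: Callable[[Any, Any], Any]  # ⊗: combines scores along a single path
    zero: Any                         # identity element for ⊕
    one: Any                          # identity element for ⊗

def log_add(x, y):
    """log(exp(x) + exp(y)), computed stably."""
    if x == float("-inf"):
        return y
    if y == float("-inf"):
        return x
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

LOG = Semiring(log_add, lambda x, y: x + y, float("-inf"), 0.0)
BOOLEAN = Semiring(lambda x, y: x or y, lambda x, y: x and y, False, True)
MAX_PLUS = Semiring(max, lambda x, y: x + y, float("-inf"), 0.0)

def path_score(sr, weights):
    """d(pi): semiring product of the weights along one path."""
    return reduce(sr.times, weights, sr.one)

def total_score(sr, paths):
    """s(omega): semiring sum of d(pi) over all accepting paths."""
    return reduce(sr.plus, (path_score(sr, p) for p in paths), sr.zero)
```

The same two functions serve all three cases; only the semiring object changes, which is the design idea behind the polymorphic implementation mentioned above.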
with p̂ indicating the interpolated probability, p2 indicating the bigram probability, and p1 indicating the unigram probability. We set λ2 = (1 − λ1) so that the probabilities sum to one.
Interpolated bigram language models can be implemented using a non-deterministic WFSA (Knight and May, 2009). The basic idea is shown in Figure 9.4. In an interpolated bigram language model, there is one state for each element in the vocabulary (in this case, the states qA and qB), and these states capture the contextual conditioning in the bigram probabilities. To model unigram probabilities, there is an additional state qU, which “forgets” the context. Transitions out of qU involve unigram probabilities, p1(a) and p1(b); transitions into qU emit the empty symbol ε, and have probability λ1, reflecting the interpolation weight for the unigram model. The interpolation weight for the bigram model is included in the weight of the transition qA → qB.

[Figure 9.4: state diagram with states qA, qU (start), and qB. Arc labels: a : λ2 p2(a | a) and b : λ2 p2(b | a) leaving qA; a : λ2 p2(a | b) and b : λ2 p2(b | b) leaving qB; ε : λ1 from qA and from qB into qU; a : p1(a) from qU to qA; b : p1(b) from qU to qB.]
The epsilon transitions into qU make this WFSA non-deterministic. Consider the score for the sequence (a, b, b). The initial state is qU, so the symbol a is generated with score p1(a).³ Next, we can generate b from the bigram model by taking the transition qA → qB, with score λ2 p2(b | a). Alternatively, we can take a transition back to qU with score λ1, and then emit b from the unigram model with score p1(b). To generate the final b token, we face the same choice: emit it directly from the self-transition to qB, or transition to qU first.
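The total score for the sequence (a, b, b) is the semiring sum over all accepting paths,

\[
\begin{aligned}
s(a, b, b) = {} & \big(p_1(a) \otimes \lambda_2 p_2(b \mid a) \otimes \lambda_2 p_2(b \mid b)\big) \\
\oplus\ & \big(p_1(a) \otimes \lambda_1 \otimes p_1(b) \otimes \lambda_2 p_2(b \mid b)\big) \\
\oplus\ & \big(p_1(a) \otimes \lambda_2 p_2(b \mid a) \otimes \lambda_1 \otimes p_1(b)\big) \\
\oplus\ & \big(p_1(a) \otimes \lambda_1 \otimes p_1(b) \otimes \lambda_1 \otimes p_1(b)\big).
\end{aligned} \tag{9.11}
\]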
Each line in Equation 9.11 represents the probability of a specific path through the WFSA. In the probability semiring, ⊗ is multiplication, so that the score of each path is the product of its transition weights, which are themselves probabilities. The ⊕ operator is addition, so that the total score is the sum of the scores (probabilities) for each path. This corresponds to the probability under the interpolated bigram language model.

³ We could model the sequence-initial bigram probability p2(a | □), but for simplicity the WFSA does not admit this possibility, which would require another state.
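As a sanity check on this claim, the sketch below enumerates every accepting path in the probability semiring and compares the sum with the interpolated model computed directly. The probability tables and interpolation weights are made-up values, and the function names are hypothetical.

```python
from itertools import product

LAMBDA1, LAMBDA2 = 0.3, 0.7           # interpolation weights; LAMBDA2 = 1 - LAMBDA1
P1 = {"a": 0.4, "b": 0.6}             # toy unigram probabilities
P2 = {("a", "a"): 0.2, ("a", "b"): 0.8,   # toy p2(next | prev), keyed by (prev, next)
      ("b", "a"): 0.5, ("b", "b"): 0.5}

def path_sum(seq):
    """Sum, over all accepting paths, of the product of arc weights."""
    total = 0.0
    # For each token after the first, either take the bigram arc directly,
    # or take the epsilon arc into qU and then a unigram arc.
    for choices in product(("bigram", "unigram"), repeat=len(seq) - 1):
        score = P1[seq[0]]  # start in qU and emit the first token
        for prev, cur, choice in zip(seq, seq[1:], choices):
            if choice == "bigram":
                score *= LAMBDA2 * P2[(prev, cur)]
            else:
                score *= LAMBDA1 * P1[cur]
        total += score
    return total

def interpolated(seq):
    """Probability under the interpolated model, computed token by token."""
    prob = P1[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= LAMBDA1 * P1[cur] + LAMBDA2 * P2[(prev, cur)]
    return prob
```

For the sequence (a, b, b), `path_sum` visits four paths, and the two functions agree because the product of the per-token sums expands into exactly those four terms.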
The edit distance between two strings s and t is a measure of how many operations are required to transform one string into another. There are several ways to compute edit distance, but one of the most popular is the Levenshtein edit distance, which counts the minimum number of insertions, deletions, and substitutions. This can be computed by a one-state weighted finite state transducer, in which the input and output alphabets are identical. For simplicity, consider the alphabet Σ = Ω = {a, b}. The transducer has the following transitions,

\[
\begin{aligned}
\delta(q, a, a, q) &= \delta(q, b, b, q) = 0 \\
\delta(q, a, b, q) &= \delta(q, b, a, q) = 1 \\
\delta(q, a, \epsilon, q) &= \delta(q, b, \epsilon, q) = 1 \\
\delta(q, \epsilon, a, q) &= \delta(q, \epsilon, b, q) = 1.
\end{aligned}
\]
The Porter (1980) stemming algorithm is a “lexicon-free” algorithm for stripping suffixes
from English words, using a sequence of character-level rules. Each rule can be described
[Figure 9.5 state diagram omitted. A single state q (start) with self-loop arcs a/a, b/b : 0; a/ε, b/ε : 1; ε/a, ε/b : 1; a/b, b/a : 1.]
Figure 9.5: State diagram for the Levenshtein edit distance finite state transducer. The label x/y : c indicates a cost of c for a transition with input x and output y.
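Finding the minimum-cost path through this one-state transducer amounts to the familiar dynamic program for Levenshtein distance. The sketch below is one way to write it, with each arc type from Figure 9.5 marked in a comment; the function name and table layout are illustrative choices.

```python
def edit_distance(s, t):
    """Minimum total arc cost over paths that read s and write t."""
    # cost[i][j]: cheapest way to consume s[:i] while emitting t[:j]
    cost = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append(cost[i - 1][j] + 1)      # x/ε arc (deletion)
            if j > 0:
                candidates.append(cost[i][j - 1] + 1)      # ε/y arc (insertion)
            if i > 0 and j > 0:
                match = 0 if s[i - 1] == t[j - 1] else 1   # x/x : 0 or x/y : 1
                candidates.append(cost[i - 1][j - 1] + match)
            cost[i][j] = min(candidates)
    return cost[len(s)][len(t)]
```

Here the minimum plays the role of semiring addition over paths, and ordinary addition accumulates the cost along a path.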
The final two lines appear to conflict; they are meant to be interpreted as an instruction to remove a terminal -s unless it is part of an -ss ending. A state diagram to handle just these final two lines is shown in Figure 9.6. Make sure you understand how this finite state transducer handles cats, steps, bass, and basses.

[Figure 9.6 state diagram omitted. States: q1 (start), q2, q3, q4, ...; arc labels: ¬s/¬s, s/ε, a/s, ε/a, b/s, ε/b.]
Figure 9.6: State diagram for final two lines of step 1a of the Porter stemming algorithm. States q3 and q4 “remember” the observations a and b respectively; the ellipsis ... represents additional states for each symbol in the input alphabet. The notation ¬s/¬s is not part of the FST formalism; it is a shorthand to indicate a set of self-transition arcs for every input/output symbol except s.
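The instruction to remove a terminal -s unless it is part of an -ss ending can be written directly as a string function, which is useful for checking what the transducer should output. This is a plain-Python stand-in rather than an FST, and the function name is hypothetical.

```python
def strip_final_s(word: str) -> str:
    """Apply only the final two rules of step 1a: -ss -> -ss, -s -> (empty)."""
    if word.endswith("ss"):
        return word        # -ss is left unchanged
    if word.endswith("s"):
        return word[:-1]   # a lone terminal -s is removed
    return word
```

Note that with only these two rules, basses becomes basse; in the full Porter algorithm the earlier -sses rule of step 1a would apply first.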
infinitive                         cantar (to sing)   comer (to eat)   vivir (to live)
yo (1st singular)                  canto              como             vivo
tú (2nd singular)                  cantas             comes            vives
él, ella, usted (3rd singular)     canta              come             vive
nosotros (1st plural)              cantamos           comemos          vivimos
vosotros (2nd plural, informal)    cantáis            coméis           vivís
ellos, ellas (3rd plural);
ustedes (2nd plural)               cantan             comen            viven

Table 9.1: Spanish verb inflections for the present indicative tense. Each row represents a person and number, and each column is a regular example from a class of verbs, as indicated by the ending of the infinitive form.
input vocabulary Σ corresponds to the set of letters used in Spanish spelling, and the output vocabulary Ω corresponds to these same letters, plus the vocabulary of morphological features (e.g., +SING, +VERB). In Figure 9.7, there are two paths that take canto as input, corresponding to the verb and noun meanings; the choice between these paths could be guided by a part-of-speech tagger. By inversion, the inputs and outputs for each transition are switched, resulting in a finite state generator, capable of producing the correct surface form for any morphological analysis.
[Figure 9.7 state diagram omitted. Arc labels along the verb paths: c/c, a/a, n/n, t/t, ε/a, ε/r, ε/+Verb, followed by either o/+PresInd, ε/+1p, ε/+Sing or a/+PresInd, ε/+3p, ε/+Sing; the noun path includes the arc o/o.]
Figure 9.7: Fragment of a finite state transducer for Spanish morphology. There are two accepting paths for the input canto: canto+NOUN+MASC+SING (masculine singular noun, meaning a song), and cantar+VERB+PRESIND+1P+SING (I sing). There is also an accepting path for canta, with output cantar+VERB+PRESIND+3P+SING (he/she sings).

Finite state morphological analyzers and other unweighted transducers can be designed by hand. The designer’s goal is to avoid overgeneration (accepting strings or making transductions that are not valid in the language) as well as undergeneration (failing to accept strings or transductions that are valid). For example, a pluralization transducer that does not accept foot/feet would undergenerate. Suppose we “fix” the transducer to accept this example, but as a side effect, it now accepts boot/beet; the transducer would then be said to overgenerate. A transducer that accepts foot/foots but not foot/feet would both overgenerate and undergenerate.
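The two failure modes are simply the two set differences between what a transducer accepts and what is valid. The sketch below treats a transducer as a finite set of input/output pairs, which is a simplifying assumption made only for illustration; the function name is hypothetical.

```python
def diagnose(accepted, valid):
    """Return (overgenerated, undergenerated) transduction pairs."""
    overgenerated = accepted - valid    # accepted, but not valid in the language
    undergenerated = valid - accepted   # valid, but not accepted
    return overgenerated, undergenerated

# Toy pluralization example built from the pairs discussed above.
valid_pairs = {("foot", "feet"), ("boot", "boots")}
naive_transducer = {("foot", "foots"), ("boot", "boots")}
over, under = diagnose(naive_transducer, valid_pairs)
```

In this example `over` contains foot/foots and `under` contains foot/feet, so the naive transducer both overgenerates and undergenerates.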
Example: Morphology and orthography. In English morphology, the suffix -ed is added to signal the past tense for many verbs: cook→cooked, want→wanted, etc. However, English orthography dictates that this process cannot produce a spelling with consecutive e’s, so that bake→baked, not bakeed. A modular solution is to build separate transducers for morphology and orthography. The morphological transducer TM transduces from bake+PAST to bake+ed, with the + symbol indicating a segment boundary. The input alphabet of TM includes the lexicon of words and the set of morphological features; the output alphabet includes the characters a-z and the + boundary marker. Next, an orthographic transducer TO is responsible for the transductions cook+ed → cooked, and bake+ed → baked. The input alphabet of TO must be the same as the output alphabet for TM, and the output alphabet is simply the characters a-z. The composed transducer (TO ◦ TM) then transduces from bake+PAST to the spelling baked. The design of TO is left as an exercise.
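The modular design can be mimicked with ordinary string functions, where composition is just function composition. The sketch below is not a finite state transducer and does not answer the exercise; `t_m`, `t_o`, and `compose` are hypothetical stand-ins that cover only the examples in the text.

```python
import re

def t_m(analysis: str) -> str:
    """Toy stand-in for T_M: rewrite the +PAST feature as the suffix +ed."""
    return analysis.replace("+PAST", "+ed")

def t_o(segmented: str) -> str:
    """Toy stand-in for T_O: merge e+e into a single e, then drop boundaries."""
    return re.sub(r"e\+e", "e", segmented).replace("+", "")

def compose(second, first):
    """Return the function that applies `first`, then `second`."""
    return lambda x: second(first(x))

spell = compose(t_o, t_m)  # plays the role of (T_O ∘ T_M)
```

The point of the design is that `t_o` knows nothing about morphological features and `t_m` knows nothing about spelling, so either can be replaced independently.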
Example: Hidden Markov models Hidden Markov models (chapter 7) can be viewed as
weighted finite state transducers, and they can be constructed by transduction. Recall that
a hidden Markov model defines a joint probability over words and tags, p(w, y), which
can be computed as a path through a trellis structure. This trellis is itself a weighted finite
state acceptor, with edges between all adjacent nodes qm−1,i → qm,j on input Ym = j. The
edge weights are log-probabilities,

\begin{align}
\delta(q_{m-1,i}, Y_m = j, q_{m,j}) &= \log p(w_m, Y_m = j \mid Y_{m-1} = i) \tag{9.20} \\
&= \log p(w_m \mid Y_m = j) + \log \Pr(Y_m = j \mid Y_{m-1} = i). \tag{9.21}
\end{align}
Because there is only one possible transition for each tag Ym, this WFSA is deterministic. The score for any tag sequence $\{y_m\}_{m=1}^{M}$ is the sum of these log-probabilities, corresponding to the total log probability log p(w, y). Furthermore, the trellis can be constructed by the composition of simpler FSTs.
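The trellis edge weights and the resulting path score can be sketched as follows. The transition and emission tables hold made-up values, the start symbol `<s>` is an assumption, and the sketch omits any transition into an end state.

```python
import math

# Toy HMM parameters; all values are illustrative assumptions.
TRANS = {("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
         ("N", "N"): 0.4, ("N", "V"): 0.6,
         ("V", "N"): 0.8, ("V", "V"): 0.2}
EMIT = {("N", "they"): 0.5, ("N", "fish"): 0.5,
        ("V", "can"): 0.6, ("V", "fish"): 0.4}

def edge_weight(word, prev_tag, tag):
    """log p(w_m | Y_m = tag) + log Pr(Y_m = tag | Y_{m-1} = prev_tag)."""
    return math.log(EMIT[(tag, word)]) + math.log(TRANS[(prev_tag, tag)])

def log_joint(words, tags, start="<s>"):
    """Sum of edge weights along the path that the tag sequence selects."""
    score, prev = 0.0, start
    for word, tag in zip(words, tags):
        score += edge_weight(word, prev, tag)
        prev = tag
    return score
```

Each tag sequence selects exactly one path through the trellis, so the sum in `log_joint` is the path score described above.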
• The trellis is a structure with M × K nodes, for each of the M words to be tagged and each of the K tags in the tagset. It can be built by composition of (TE ◦ TT) against an unweighted chain FSA MA(w) that is specially constructed to accept only a given input w1, w2, ..., wM, shown in Figure 9.9. The trellis for input w is built from the composition MA(w) ◦ (TE ◦ TT). Composing with the unweighted MA(w) does not affect the edge weights from (TE ◦ TT), but it selects the subset of paths that generate the word sequence w.
[Figure 9.8 state diagram omitted. States N and V; arcs from the start state are labeled N/aardvark, N/abacus, N/..., V/aardvark, V/abacus, V/....]
Figure 9.8: Finite state transducer for hidden Markov models, with a small tagset of nouns and verbs. For each pair of tags (including self-loops), there is an edge for every word in the vocabulary. For simplicity, input and output are only shown for the edges from the start state. Weights are also omitted from the diagram; for each edge from qi to qj, the weight is equal to log p(wm, Ym = j | Ym−1 = i), except for edges to the end state, which are equal to log Pr(Ym = ♦ | Ym−1 = i).
Figure 9.9: Chain finite state acceptor for the input They can fish.
model, but this is a special case of the more general problem for finite state automata. Eisner (2002) describes an expectation semiring, which enables the expected number of transitions across each arc to be computed through a semiring shortest-path algorithm. Alternative approaches for generative models include Markov Chain Monte Carlo (Chiang et al., 2010) and spectral learning (Balle et al., 2011).
Further afield, we can take a perceptron-style approach, with each arc corresponding to a feature. The classic perceptron update would update the weights by subtracting the difference between the feature vector corresponding to the predicted path and the feature vector corresponding to the correct path. Since the path is not observed, we resort to a hidden variable perceptron. The model is described formally in § 12.4, but the basic idea is to compute an update from the difference between the features from the predicted path and the features for the best-scoring path that generates the correct output.
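With one feature per arc, the update described above reduces to adjusting arc weights by count differences. The sketch below assumes both paths have already been found (the search itself is not shown); the arc encoding as (from-state, symbol, to-state) tuples, the `step` parameter, and the function name are illustrative assumptions.

```python
from collections import Counter

def perceptron_update(weights, predicted_arcs, target_arcs, step=1.0):
    """Subtract (features of predicted path - features of target path).

    `target_arcs` stands for the best-scoring path that generates the
    correct output, as in the hidden variable perceptron.
    """
    delta = Counter(predicted_arcs)
    delta.subtract(Counter(target_arcs))
    for arc, diff in delta.items():
        weights[arc] = weights.get(arc, 0.0) - step * diff
    return weights

w = perceptron_update(
    {},
    predicted_arcs=[("q0", "a", "q1"), ("q1", "b", "q1")],
    target_arcs=[("q0", "a", "q1"), ("q1", "b", "q2")],
)
```

Arcs shared by both paths cancel out, so only the arcs on which the two paths disagree change weight.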
⁴ Details of the proof can be found in an introductory computer science theory textbook (e.g., Sipser, 2012).
the dog
the cat the dog chased
the goat the cat the dog chased kissed
...
glish includes center embedding, and so the argument goes, English grammar as a whole cannot be regular.⁵
A more practical argument for moving beyond regular languages is modularity. Many linguistic phenomena, especially in syntax, involve constraints that apply at long distance. Consider the problem of determiner-noun number agreement in English: we can say the coffee and these coffees, but not *these coffee. By itself, this is easy enough to model in an FSA. However, fairly complex modifying expressions can be inserted between the determiner and the noun:
Again, an FSA can be designed to accept modifying expressions such as burnt and badly-ground Italian. Let’s call this FSA FM. To reject the final example, a finite state acceptor must somehow “remember” that the determiner was plural when it reaches the noun coffee at the end of the expression. The only way to do this is to make two identical copies of FM: one for singular determiners, and one for plurals. While this is possible in the finite state framework, it is inconvenient, especially in languages where more than one attribute of the noun is marked by the determiner. Context-free languages facilitate modularity across such long-range dependencies.
S → S OP S | NUM
OP → + | − | × | ÷
NUM → NUM DIGIT | DIGIT
DIGIT → 0 | 1 | 2 | ... | 9
In the production rule A → β, the left-hand side (LHS) A must be a non-terminal; the right-hand side (RHS) can be a sequence of terminals or non-terminals, {n, σ}*, n ∈ N, σ ∈ Σ. A non-terminal can appear on the left-hand side of many production rules. A non-terminal can appear on both the left-hand side and the right-hand side; this is a recursive production, and is analogous to self-loops in finite state automata. The name “context-free” is based on the property that the production rule depends only on the LHS, and not on its ancestors or neighbors; this is analogous to the Markov property of finite state automata, in which the behavior at each step depends only on the current state, and not on the path by which that state was reached.
A derivation τ is a sequence of steps from the start symbol S to a surface string w ∈ Σ*, which is the yield of the derivation. A string w is in a context-free language if there is some derivation from S yielding w. Parsing is the problem of finding a derivation for a string in a grammar. Algorithms for parsing are described in chapter 10.
Like regular expressions, context-free grammars define the language but not the computation necessary to recognize it. The context-free analogues to finite state acceptors are pushdown automata, a theoretical model of computation in which input symbols can be pushed onto a stack with potentially infinite depth. For more details, see Sipser (2012).
[Figure 9.12 tree diagrams omitted: one derivation of the string 4, and two derivations of the string 1 + 2 − 3.]
Figure 9.12: Some example derivations from the arithmetic grammar in Figure 9.11
defines two productions, A → x and A → y. This grammar is recursive: the non-terminals S and NUM can produce themselves.
Derivations are typically shown as trees, with production rules applied from the top to the bottom. The tree on the left in Figure 9.12 describes the derivation of a single digit, through the sequence of productions S → NUM → DIGIT → 4 (these are all unary productions, because the right-hand side contains a single element). The other two trees in Figure 9.12 show alternative derivations of the string 1 + 2 − 3. The existence of multiple derivations for a string indicates that the grammar is ambiguous.
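The ambiguity of the arithmetic grammar can be checked by counting derivation trees. The sketch below is a counting routine written for this one grammar, not a general parser; it assumes the input is already split into tokens, that each number is a token of digits, and that the function name is hypothetical.

```python
from functools import lru_cache

OPS = {"+", "-", "−", "×", "÷"}  # accepts both ASCII and typographic minus

def count_derivations(tokens):
    """Number of distinct S-rooted trees for S -> S OP S | NUM."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_s(i, j):
        total = 0
        # S -> NUM: a digit string has exactly one NUM derivation,
        # because NUM -> NUM DIGIT | DIGIT is left-branching.
        if j - i == 1 and tokens[i].isdigit():
            total += 1
        # S -> S OP S: try every operator position as the top-level split.
        for k in range(i + 1, j - 1):
            if tokens[k] in OPS:
                total += count_s(i, k) * count_s(k + 1, j)
        return total

    return count_s(0, len(tokens))
```

For 1 + 2 − 3 the count is two, matching the two trees in Figure 9.12; the count grows quickly as more operators are added.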
Context-free derivations can also be written out according to the pre-order tree traversal.⁶ For the two derivations of 1 + 2 − 3 in Figure 9.12, the notation is:
(S (S (S (Num (Digit 1))) (Op +) (S (Num (Digit 2)))) (Op - ) (S (Num (Digit 3)))) [9.23]
(S (S (Num (Digit 1))) (Op +) (S (Num (Digit 2)) (Op - ) (S (Num (Digit 3))))). [9.24]
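The bracketed notation is produced by a pre-order traversal that prints each node label before its children. A minimal sketch follows, assuming trees are stored as nested tuples of the form (label, child, ...) with strings as leaves; this encoding and the function name are illustrative choices.

```python
def bracket(tree):
    """Render a tree in bracketed form by pre-order traversal."""
    if isinstance(tree, str):
        return tree  # a terminal symbol
    label, *children = tree
    return "(" + " ".join([label] + [bracket(child) for child in children]) + ")"

one_plus_two = ("S",
                ("S", ("Num", ("Digit", "1"))),
                ("Op", "+"),
                ("S", ("Num", ("Digit", "2"))))
rendered = bracket(one_plus_two)
```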
S →aSb | ab
S →aSb | aabb | ab
Two grammars are weakly equivalent if they generate the same strings. Two grammars are strongly equivalent if they generate the same strings via the same derivations. The grammars above are only weakly equivalent.
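Weak equivalence can be spot-checked, though not proved, by enumerating all strings up to a length bound. The sketch below assumes that upper-case letters are non-terminals, lower-case letters are terminals, and every production contains at least one terminal so that the search terminates; the function name is hypothetical.

```python
def generate(productions, max_len, start="S"):
    """All terminal strings of length <= max_len derivable from `start`."""
    results, seen, frontier = set(), {start}, [start]
    while frontier:
        form = frontier.pop()
        # Find the leftmost non-terminal, if any.
        idx = next((i for i, c in enumerate(form) if c.isupper()), None)
        if idx is None:
            results.add(form)  # no non-terminals left: a complete string
            continue
        for rhs in productions[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            n_terminals = sum(1 for c in new if not c.isupper())
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                frontier.append(new)
    return results

G1 = {"S": ["aSb", "ab"]}
G2 = {"S": ["aSb", "aabb", "ab"]}
```

Both grammars yield the same strings up to the bound, but the second grammar can derive aabb in two ways (directly, or via aSb), which is why the equivalence is only weak.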
⁶ This is a depth-first left-to-right search that prints each node the first time it is encountered (Cormen et al., 2009, chapter 12).
In Chomsky Normal Form (CNF), the right-hand side of every production includes
either two non-terminals, or a single terminal symbol:
A →BC
A →a
All CFGs can be converted into a CNF grammar that is weakly equivalent. To convert a grammar into CNF, we first address productions that have more than two non-terminals on the RHS by creating new “dummy” non-terminals. For example, if we have the production,
W → X Y Z, [9.25]
it is replaced with two productions,
W →X W\X [9.26]
W\X →Y Z. [9.27]
In these productions, W\X is a new dummy non-terminal. This transformation binarizes the grammar, which is critical for efficient bottom-up parsing, as we will see in chapter 10. Productions whose right-hand side contains a mix of terminal and non-terminal symbols can be replaced in a similar fashion.
Unary non-terminal productions A → B are replaced as follows: identify all productions B → α, and add A → α to the grammar. For example, in the grammar described in Figure 9.11, we would replace NUM → DIGIT with NUM → 0 | 1 | 2 | ... | 9. However, we keep the production NUM → NUM DIGIT, which is a valid binary production.
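The binarization step can be sketched as a loop that peels off the first right-hand-side symbol and introduces a dummy non-terminal for the remainder. The representation of a production as an (LHS, RHS-tuple) pair and the naming of dummies for right-hand sides longer than three symbols are assumptions made for this illustration.

```python
def binarize(lhs, rhs):
    """Split lhs -> rhs into productions with at most two RHS symbols."""
    rules = []
    rhs = tuple(rhs)
    while len(rhs) > 2:
        dummy = lhs + "\\" + rhs[0]           # e.g. W\X, the new dummy non-terminal
        rules.append((lhs, (rhs[0], dummy)))  # W -> X W\X
        lhs, rhs = dummy, rhs[1:]             # continue with W\X -> Y Z ...
    rules.append((lhs, rhs))
    return rules
```

Applied to W → X Y Z, this yields W → X W\X and W\X → Y Z; handling unary productions and mixed terminal/non-terminal right-hand sides would need separate steps that are not shown here.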
the conventions of the Penn Treebank (PTB; Marcus et al., 1993), a large-scale annotation of English language syntax. The generalization to “mildly” context-sensitive languages is discussed in § 9.3.
The Penn Treebank annotation is a phrase-structure grammar of English. This means that sentences are broken down into constituents, which are contiguous sequences of words that function as coherent units for the purpose of linguistic analysis. Constituents generally have a few key properties:
4729 In contrast, gave her and brother a cannot easily be moved while preserving gram-
4730 maticality.
4731 Substitution. Constituents can be substituted by other phrases of the same type.
4734 In contrast, substitution is not possible for other contiguous units like Max thanked
4735 and thanked his.
4740 Units like brother bought and bought a cannot easily be coordinated.
These examples argue for units such as her brother and bought a fish to be treated as constituents. Other sequences of words in these examples, such as Abigail gave and brother a fish, cannot be moved, substituted, and coordinated in these ways. In phrase-structure grammar, constituents are nested, so that the senator from New Jersey contains the constituent from New Jersey, which in turn contains New Jersey. The sentence itself is the maximal constituent; each word is a minimal constituent, derived from a unary production from a part-of-speech tag. Between part-of-speech tags and sentences are phrases. In phrase-structure grammar, phrases have a type that is usually determined by their head word: for example, a noun phrase corresponds to a noun and the group of words that modify it, such as her younger brother; a verb phrase includes the verb and its modifiers, such as bought a fish and greedily ate it.
In context-free grammars, each phrase type is a non-terminal, and each constituent is the substring that the non-terminal yields. Grammar design involves choosing the right set of non-terminals. Fine-grained non-terminals make it possible to represent more fine-grained linguistic phenomena. For example, by distinguishing singular and plural noun phrases, it is possible to have a grammar of English that generates only sentences that obey subject-verb agreement. However, enforcing subject-verb agreement is considerably more complicated in languages like Spanish, where the verb must agree in both person and number with the subject. In general, grammar designers must trade off between overgeneration (a grammar that permits ungrammatical sentences) and undergeneration (a grammar that fails to generate grammatical sentences). Furthermore, if the grammar is to support manual annotation of syntactic structure, it must be simple enough to annotate efficiently.
To better understand how phrase-structure grammar works, let’s consider the specific case of the Penn Treebank grammar of English. The main phrase categories in the Penn Treebank (PTB) are based on the main part-of-speech classes: noun phrase (NP), verb phrase (VP), prepositional phrase (PP), adjectival phrase (ADJP), and adverbial phrase (ADVP). The top-level category is S, which conveniently stands in for both “sentence” and the “start” symbol. Complement clauses (e.g., I take the good old fashioned ground that the whale is a fish) are represented by the non-terminal SBAR. The terminal symbols in the grammar are individual words, which are generated from unary productions from part-of-speech tags (the PTB tagset is described in § 8.1).
This section explores the productions from the major phrase-level categories, explaining how to generate individual tag sequences. The production rules are approached in a “theory-driven” manner: first the syntactic properties of each phrase type are described, and then some of the necessary production rules are listed. But it is important to keep in mind that the Penn Treebank was produced in a “data-driven” manner. After the set of non-terminals was specified, annotators were free to analyze each sentence in whatever way seemed most linguistically accurate, subject to some high-level guidelines. The grammar of the Penn Treebank is simply the set of productions that were required to analyze the several million words of the corpus. By design, the grammar overgenerates: it does not exclude ungrammatical sentences.
S →NP VP [9.28]
which accounts for simple sentences like Abigail ate the kimchi — as we will see, the direct
object the kimchi is part of the verb phrase. But there are more complex forms of sentences
as well:
where ADVP is an adverbial phrase (e.g., unfortunately, very unfortunately) and CC is a coordinating conjunction (e.g., and, but).⁸
NP → NN | NNS | NNP | PRP [9.32]
NP → DET NN | DET NNS | DET NNP [9.33]
The tags NN, NNS, and NNP refer to singular, plural, and proper nouns; PRP refers to personal pronouns, and DET refers to determiners. The grammar also contains terminal productions from each of these tags, e.g., PRP → I | you | we | ... .
Noun phrases may be modified by adjectival phrases (ADJP; e.g., the small Russian dog) and numbers (CD; e.g., the five pastries), each of which may optionally follow a determiner:
Some noun phrases include multiple nouns, such as the liberation movement and an antelope horn, necessitating additional productions:
NP → NN NN | NN NNS | DET NN NN | ... [9.36]
These multiple noun constructions can be combined with adjectival phrases and cardinal numbers, leading to a large number of additional productions.

⁸ Notice that the grammar does not include the recursive production S → ADVP S. It may be helpful to think about why this production would cause the grammar to overgenerate.
Recursive noun phrase productions include coordination, prepositional phrase attach-
ment, subordinate clauses, and verb phrase adjuncts:
These recursive productions are a major source of ambiguity, because the VP and PP non-terminals can also generate NP children. Thus, the President of the Georgia Institute of Technology can be derived in two ways, as can a whale taken near Shetland in October.
But aside from these few recursive productions, the noun phrase fragment of the Penn Treebank grammar is relatively flat, containing a large number of productions that go from NP directly to a sequence of parts-of-speech. If noun phrases had more internal structure, the grammar would need fewer rules, which, as we will see, would make parsing faster and machine learning easier. Vadas and Curran (2011) propose to add additional structure in the form of a new non-terminal called a nominal modifier (NML), e.g.,
(9.17) (NP (NN crude) (NN oil) (NNS prices)) (PTB analysis)
       (NP (NML (NN crude) (NN oil)) (NNS prices)) (NML-style analysis)
Another proposal is to treat the determiner as the head of a determiner phrase (DP; Abney, 1987). There are linguistic arguments for and against determiner phrases (e.g., Van Eynde, 2006). From the perspective of context-free grammar, DPs enable more structured analyses of some constituents, e.g.,
(9.18) (NP (DT the) (JJ white) (NN whale)) (PTB analysis)
       (DP (DT the) (NP (JJ white) (NN whale))) (DP-style analysis).
own:
VP → VB | VBZ | VBD | VBN | VBG | VBP [9.41]
Each of these productions uses recursion, with the VP non-terminal appearing in both the LHS and RHS. This enables the creation of complex verb phrases, such as She will have wanted to have been snacking.
Transitive verbs take noun phrases as direct objects, and ditransitive verbs take two
direct objects:
These productions are not recursive, so a unique production is required for each verb
part-of-speech. They also do not distinguish transitive from intransitive verbs, so the
resulting grammar overgenerates examples like *She sleeps sushi and *She learns Boyang
algebra. Sentences can also be direct objects:
The first production overgenerates, licensing sentences like *Asha sees Boyang eats the kimchi. This problem could be addressed by designing a more specific set of sentence non-terminals, indicating whether the main verb can be conjugated.
Verbs can also be modified by prepositional phrases and adverbial phrases:
Again, because these productions are not recursive, the grammar must include productions for every verb part-of-speech.
A special set of verbs, known as copula, can take predicative adjectives as direct objects:
The PTB does not have a special non-terminal for copular verbs, so this production generates non-grammatical examples such as *She eats tall.
Particles (PRT as a phrase; RP as a part-of-speech) work to create phrasal verbs:
As the second production shows, particle productions are required for all configurations of verb parts-of-speech and direct objects.
Adverbial phrases are usually bare adverbs (ADVP → RB), with a few exceptions:
The tags JJR and JJS refer to comparative and superlative adjectives respectively.
All of these phrase types can be coordinated:
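[Figure: two parse trees for We eat sushi with chopsticks, shown here in bracketed form.]
(S (NP We) (VP (V eat) (NP (NP sushi) (PP (IN with) (NP chopsticks)))))
(S (NP We) (VP (VP (V eat) (NP sushi)) (PP (IN with) (NP chopsticks))))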
machine), PSPACE-complete problems cannot be solved efficiently if P ≠ NP. Thus, designing an efficient parsing algorithm for the full class of context-sensitive languages is probably hopeless.¹⁰
However, Joshi (1985) identifies a set of properties that define mildly context-sensitive languages, which are a strict subset of context-sensitive languages. Like context-free languages, mildly context-sensitive languages are efficiently parseable. However, the mildly context-sensitive languages include non-context-free languages, such as the “copy language” {ww | w ∈ Σ*} and the language a^m b^n c^m d^n. Both are characterized by cross-serial dependencies, linking symbols at long distance across the string.¹¹ For example, in the language a^m b^n c^m d^n, each a symbol is linked to exactly one c symbol, regardless of the number of intervening b symbols.
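Membership in these two languages is easy to test with a general-purpose program, even though no context-free grammar generates them. The sketch below is a direct check rather than a parser for a mildly context-sensitive formalism; it assumes that m and n may be zero, and the function names are hypothetical.

```python
import re

def in_copy_language(s: str) -> bool:
    """True if s = ww for some string w (possibly empty)."""
    half, remainder = divmod(len(s), 2)
    return remainder == 0 and s[:half] == s[half:]

def in_cross_serial_language(s: str) -> bool:
    """True if s has the form a^m b^n c^m d^n."""
    match = re.fullmatch(r"(a*)(b*)(c*)(d*)", s)
    if match is None:
        return False
    a, b, c, d = (len(group) for group in match.groups())
    return a == c and b == d  # a's pair with c's, b's pair with d's
```

The two equality tests in `in_cross_serial_language` are the crossing dependencies: the a–c link spans the b block, and the b–d link spans the c block.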
Figure 9.14: A syntactic analysis in CCG involving forward and backward function appli-
cation
As with the move from regular to context-free languages, mildly context-sensitive languages can be motivated by expedience. While infinite sequences of cross-serial dependencies cannot be handled by context-free grammars, even finite sequences of cross-serial dependencies are more convenient to handle using a mildly context-sensitive formalism like tree-adjoining grammar (TAG) and combinatory categorial grammar (CCG). Furthermore, TAG-inspired parsers have been shown to be particularly effective in parsing the Penn Treebank (Collins, 1997; Carreras et al., 2008), and CCG plays a leading role in current research on semantic parsing (Zettlemoyer and Collins, 2005). In addition, these two formalisms are weakly equivalent: any language that can be specified in TAG can also be specified in CCG, and vice versa (Joshi et al., 1991). The remainder of the chapter gives a brief overview of CCG, but you are encouraged to consult Joshi and Schabes (1997) and Steedman and Baldridge (2011) for more detail on TAG and CCG respectively.
Figure 9.15: A syntactic analysis in CCG involving function composition (example modi-
fied from Steedman and Baldridge, 2011)
its right results in a noun phrase, so determiners have the type NP/N. Figure 9.14 provides an example involving a transitive verb and a determiner. A key point from this example is that it can be trivially transformed into a phrase-structure tree, by treating each function application as a constituent phrase. Indeed, when CCG’s only combinatory operators are forward and backward function application, it is equivalent to context-free grammar. However, the location of the “effort” has changed. Rather than designing good productions, the grammar designer must focus on the lexicon, choosing the right categories for each word. This makes it possible to parse a wide range of sentences using only a few generic combinatory operators.
Things become more interesting with the introduction of two additional operators: composition and type-raising. Function composition enables the combination of complex types: X/Y ◦ Y/Z ⇒B X/Z (forward composition) and Y\Z ◦ X\Y ⇒B X\Z (backward composition).¹² Composition makes it possible to “look inside” complex types, and combine two adjacent units if the “input” for one is the “output” for the other. Figure 9.15 shows how function composition can be used to handle modal verbs. While this sentence can be parsed using only function application, the composition-based analysis is preferable because the unit might learn functions just like a transitive verb, as in the example Abigail studies Swahili. This in turn makes it possible to analyze conjunctions such as Abigail studies and might learn Swahili, attaching the direct object Swahili to the entire conjoined verb phrase studies and might learn. The Penn Treebank grammar fragment from § 9.2.3 would be unable to handle this case correctly: the direct object Swahili could attach only to the second verb learn.
4914 Type raising converts an element of type X to a more complex type: X ⇒T T /(T \X)
4915 (forward type-raising to type T ), and X ⇒T T \(T /X) (backward type-raising to type
4916 T ). Type-raising makes it possible to reverse the relationship between a function and its
4917 argument — by transforming the argument into a function over functions over arguments!
An example may help. Figure 9.16 shows how to analyze an object relative clause, a story
4919 that Abigail tells. The problem is that tells is a transitive verb, expecting a direct object to
4920 its right. As a result, Abigail tells is not a valid constituent. The issue is resolved by raising
12 The subscript B follows notation from Curry and Feys (1958).
Figure 9.16: A syntactic analysis in CCG involving an object relative clause (based on
slides from Alex Clark)
Abigail from NP to the complex type S/(S\NP). This function can then be combined with the transitive verb tells by forward composition, resulting in the type (S/NP), which is a sentence lacking a direct object to its right.13 From here, we need only design the
4924 lexical entry for the complementizer that to expect a right neighbor of type (S/NP), and
4925 the remainder of the derivation can proceed by function application.
4926 Composition and type-raising give CCG considerable power and flexibility, but at a
4927 price. The simple sentence Abigail tells Max can be parsed in two different ways: by func-
4928 tion application (first forming the verb phrase tells Max), and by type-raising and compo-
4929 sition (first forming the non-constituent Abigail tells). This derivational ambiguity does
4930 not affect the resulting linguistic analysis, so it is sometimes known as spurious ambi-
4931 guity. Hockenmaier and Steedman (2007) present a translation algorithm for converting
4932 the Penn Treebank into CCG derivations, using composition and type-raising only when
4933 necessary.
4934 Exercises
4935 1. Sketch out the state diagram for finite-state acceptors for the following languages
4936 on the alphabet {a, b}.
4944 a) Define a finite-state acceptor that accepts all strings with edit distance 1 from
4945 the target string, target.
4946 b) Now think about how to generalize your design to accept all strings with edit
distance from the target string equal to d. If the target string has length ℓ, what
4948 is the minimal number of states required?
4949 3. Construct an FSA in the style of Figure 9.3, which handles the following examples:
4952 Be sure that your FSA does not accept any further derivations, such as *nationalizeral
4953 and *Americanizern.
4954 4. Show how to construct a trigram language model in a weighted finite-state acceptor.
4955 Make sure that you handle the edge cases at the beginning and end of the sequence
4956 accurately.
4957 5. Extend the FST in Figure 9.6 to handle the other two parts of rule 1a of the Porter
4958 stemmer: -sses → ss, and -ies → -i.
4964 7. Add parenthesization to the grammar in Figure 9.11 so that it is no longer ambigu-
4965 ous.
4966 8. Construct three examples — a noun phrase, a verb phrase, and a sentence — which
4967 can be derived from the Penn Treebank grammar fragment in § 9.2.3, yet are not
4968 grammatical. Avoid reusing examples from the text. Optionally, propose corrections
4969 to the grammar to avoid generating these cases.
4970 9. Produce parses for the following sentences, using the Penn Treebank grammar frag-
4971 ment from § 9.2.3.
4975 Then produce parses for three short sentences from a news article from this week.
4979 where X can refer to any type. Using this lexical entry, show how to parse the two
4980 examples above. In the second example, Swahili should be combined with the coor-
4981 dination Abigail speaks and Max understands, and not just with the verb understands.
4984 Parsing is the task of determining whether a string can be derived from a given context-
4985 free grammar, and if so, how. The parse structure can answer basic questions of who-did-
4986 what-to-whom, and is useful for various downstream tasks, such as semantic analysis
(chapters 12 and 13) and information extraction (chapter 17).
For a given input and grammar, how many parse trees are there? Consider a minimal
context-free grammar with only one non-terminal, X, and the following productions:
X → X X
X → aardvark | abacus | . . . | zyther
The second line indicates unary productions to every terminal symbol in Σ. In this grammar, the number of possible derivations for a string w is equal to the number of binary
bracketings, e.g.,
The number of such bracketings is a Catalan number, which grows exponentially in the length of the sentence, $C_n = \frac{(2n)!}{(n+1)!\,n!}$. As with sequence labeling, it is only possible to
4990 exhaustively search the space of parses by resorting to locality assumptions, which make it
4991 possible to search efficiently by reusing shared substructures with dynamic programming.
4992 This chapter focuses on a bottom-up dynamic programming algorithm, which enables
4993 exhaustive search of the space of possible parses, but imposes strict limitations on the
4994 form of scoring function. These limitations can be relaxed by abandoning exhaustive
4995 search. Non-exact search methods will be briefly discussed at the end of this chapter, and
4996 one of them — transition-based parsing — will be the focus of chapter 11.
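To get a feel for this growth, the closed form can be evaluated directly. The sketch below is illustrative only: the function name `catalan` is an assumption of this example, and it relies on the standard fact that a string of n words has C_{n-1} distinct binary bracketings.

```python
import math

def catalan(n: int) -> int:
    """Return the n-th Catalan number, C_n = (2n)! / ((n+1)! n!)."""
    return math.comb(2 * n, n) // (n + 1)

# A string of n words has C_{n-1} binary bracketings.
for n_words in (2, 5, 10, 20):
    print(n_words, catalan(n_words - 1))
```

Even for sentences of modest length, the count is far too large to enumerate, which is the motivation for the dynamic programming approach that follows.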
CHAPTER 10. CONTEXT-FREE PARSING
S → NP VP
NP → NP PP | we | sushi | chopsticks
PP → IN NP
IN → with
VP → V NP | VP PP
V → eat
• Begin by filling in the diagonal: the cells t[m − 1, m] for all m ∈ {1, 2, . . . , M}. These cells are filled with terminal productions that yield the individual tokens; for the word w3 = sushi, we fill in t[2, 3] = {NP}, and so on.
• Then fill in the next diagonal, in which each cell corresponds to a subsequence of length two: t[0, 2], t[1, 3], . . . , t[M − 2, M]. These cells are filled in by looking for binary productions capable of producing at least one entry in each of the cells corresponding to left and right children. For example, the cell t[1, 3] includes VP because the grammar includes the production VP → V NP, and the chart contains V ∈ t[1, 2] and NP ∈ t[2, 3].
1 The name is for Cocke-Kasami-Younger, the inventors of the algorithm. It is a special case of chart parsing, because it stores reusable computations in a chart-like data structure.
5026 • At the next diagonal, the entries correspond to spans of length three. At this level,
5027 there is an additional decision at each cell: where to split the left and right children.
5028 The cell t[i, j] corresponds to the subsequence wi+1:j , and we must choose some
5029 split point i < k < j, so that wi+1:k is the left child and wk+1:j is the right child. We
5030 consider all possible k, looking for productions that generate elements in t[i, k] and
5031 t[k, j]; the left-hand side of all such productions can be added to t[i, j]. When it is
5032 time to compute t[i, j], the cells t[i, k] and t[k, j] are guaranteed to be complete, since
5033 these cells correspond to shorter sub-strings of the input.
5034 • The process continues until we reach t[0, M ].
5035 Figure 10.1 shows the chart that arises from parsing the sentence We eat sushi with chop-
5036 sticks using the grammar defined above.
We NP ∅ S ∅ S
eat V VP ∅ VP
sushi NP ∅ NP
with P PP
chopsticks NP
Figure 10.1: An example completed CKY chart. The solid and dashed lines show the back
pointers resulting from the two different derivations of VP in position t[1, 5].
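The diagonal-by-diagonal procedure described above can be sketched in a few lines. This is a minimal recognizer, not the book's Algorithm 13: the names `BINARY`, `LEXICON`, and `cky_recognize` are assumptions of this example, and the grammar is transcribed from Table 10.1 (using the tag IN for the preposition).

```python
from collections import defaultdict

# Grammar from Table 10.1, indexed by right-hand side.
BINARY = {
    ("NP", "VP"): {"S"},
    ("NP", "PP"): {"NP"},
    ("IN", "NP"): {"PP"},
    ("V", "NP"): {"VP"},
    ("VP", "PP"): {"VP"},
}
LEXICON = {
    "we": {"NP"}, "sushi": {"NP"}, "chopsticks": {"NP"},
    "with": {"IN"}, "eat": {"V"},
}

def cky_recognize(words):
    """Fill the chart t[i, j] with the non-terminals that derive words[i:j]."""
    M = len(words)
    t = defaultdict(set)
    # First diagonal: terminal productions.
    for m in range(1, M + 1):
        t[m - 1, m] = set(LEXICON.get(words[m - 1], ()))
    # Remaining diagonals: spans of increasing length.
    for length in range(2, M + 1):
        for i in range(M - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point
                for Y in t[i, k]:
                    for Z in t[k, j]:
                        t[i, j] |= BINARY.get((Y, Z), set())
    return t

chart = cky_recognize("we eat sushi with chopsticks".split())
print(chart[0, 5])  # non-terminals deriving the whole sentence
```

The sentence is accepted if the start symbol S appears in the cell `chart[0, M]`. Recovering the parse itself would additionally require storing back pointers in each cell.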
• Productions with more than two elements on the right-hand side can be binarized by creating additional non-terminals, as described in § 9.2.1.2. For example, given the production VP → V NP NP (for ditransitive verbs), we can convert to VP → VP_ditrans/NP NP, and then add the production VP_ditrans/NP → V NP.
5060 • What about unary productions like VP → V? In practice, this is handled by mak-
5061 ing a second pass on each diagonal, in which each cell t[i, j] is augmented with all
5062 possible unary productions capable of generating each item already in the cell —
5063 formally, t[i, j] is extended to its unary closure. Suppose the example grammar in
5064 Table 10.1 were extended to include the production VP → V, enabling sentences
5065 with intransitive verb phrases, like we eat. Then the cell t[1, 2] — corresponding to
5066 the word eat — would first include the set {V}, and would be augmented to the set
5067 {V, VP} during this second pass.
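The second pass over each cell can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name `unary_closure` and the dictionary `unary`, which maps each child non-terminal to the set of parents that can rewrite to it, are hypothetical and not taken from the text.

```python
def unary_closure(cell, unary):
    """Extend a chart cell with every non-terminal reachable by unary productions.

    cell:  set of non-terminals already in t[i, j]
    unary: dict mapping a child Y to the set of parents X with X -> Y
    """
    closure = set(cell)
    agenda = list(cell)
    while agenda:
        child = agenda.pop()
        for parent in unary.get(child, ()):
            if parent not in closure:
                closure.add(parent)
                agenda.append(parent)  # the parent may itself have unary parents
    return closure

# The example from the text: adding VP -> V lets the cell for "eat" grow.
print(unary_closure({"V"}, {"V": {"VP"}}))
```

Tracking which items have already been added keeps the loop from running forever when the grammar contains unary cycles.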
5077 • Attachment ambiguity: e.g., We eat sushi with chopsticks, I shot an elephant in my
5078 pajamas. In these examples, the prepositions (with, in) can attach to either the verb
5079 or the direct object.
5080 • Modifier scope: e.g., southern food store, plastic cup holder. In these examples, the first
5081 word could be modifying the subsequent adjective, or the final noun.
5082 • Particle versus preposition: e.g., The puppy tore up the staircase. Phrasal verbs like
5083 tore up often include particles which could also act as prepositions. This has struc-
5084 tural implications: if up is a preposition, then up the staircase is a prepositional
5085 phrase; if up is a particle, then the staircase is the direct object to the verb.
5086 • Complement structure: e.g., The students complained to the professor that they didn’t
5087 understand. This is another form of attachment ambiguity, where the complement
5088 that they didn’t understand could attach to the main verb (complained), or to the indi-
5089 rect object (the professor).
5090 • Coordination scope: e.g., “I see,” said the blind man, as he picked up the hammer and
5091 saw. In this example, the lexical ambiguity for saw enables it to be coordinated either
5092 with the noun hammer or the verb picked up.
5093 These forms of ambiguity can combine, so that seemingly simple headlines like Fed
5094 raises interest rates have dozens of possible analyses even in a minimal grammar. In a
broad-coverage grammar, typical sentences can have millions of parses. While careful grammar design can chip away at this ambiguity, a better strategy is to combine broad-coverage parsers with data-driven strategies for identifying the correct analysis.
5107 Precision: the fraction of constituents in the system parse that match a constituent in the
5108 reference parse.
5109 Recall: the fraction of constituents in the reference parse that match a constituent in the
5110 system parse.
5111 In labeled precision and recall, the system must also match the phrase type for each
5112 constituent; in unlabeled precision and recall, it is only required to match the constituent
structure. As in chapter 4, the precision and recall can be combined into an F-MEASURE, $F = \frac{2 \times P \times R}{P + R}$.
5115 In Figure 10.2, suppose that the left tree is the system parse and the right tree is the
5116 reference parse. We have the following spans:
Figure 10.2: Two possible analyses from the grammar in Table 10.1: (a) left, with the prepositional phrase attached to the noun phrase; (b) right, with the prepositional phrase attached to the verb phrase
The labeled and unlabeled precision of this parse is 3/4 = 0.75, and the recall is 3/4 = 0.75, for an F-measure of 0.75. For an example in which precision and recall are not equal, suppose the reference parse instead included the production VP → V NP PP. In this parse, the reference does not contain the constituent w2:3, so the recall would be 1.³
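These numbers can be checked with a short calculation. The sketch below is hedged: the span sets are an assumption of this example, read off the two attachment analyses for We eat sushi with chopsticks by listing each multi-word constituent as (label, first word, last word), and the function name `prf` is hypothetical.

```python
def prf(system, reference):
    """Labeled precision, recall, and F-measure over sets of constituents."""
    matched = len(system & reference)
    precision = matched / len(system)
    recall = matched / len(reference)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# System parse: PP attached to the noun phrase.
system = {("S", 1, 5), ("VP", 2, 5), ("NP", 3, 5), ("PP", 4, 5)}
# Reference parse: PP attached to the verb phrase.
reference = {("S", 1, 5), ("VP", 2, 5), ("VP", 2, 3), ("PP", 4, 5)}
print(prf(system, reference))

# Alternative reference using the flat production VP -> V NP PP.
flat_reference = {("S", 1, 5), ("VP", 2, 5), ("PP", 4, 5)}
print(prf(system, flat_reference))
```

With the flat reference, every reference constituent is found by the system parse, so recall rises to 1 while precision stays at 3/4.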
Each case ends with a preposition, which can be attached to the verb met or the noun
phrase the president. This ambiguity can be resolved by using a labeled corpus to compare
the likelihood of observing the preposition alongside each candidate attachment point,
5130 A comparison of these probabilities would successfully resolve this case (Hindle and
5131 Rooth, 1993). Other cases, such as the example . . . eat sushi with chopsticks, require consider-
5132 ing the object of the preposition — consider the alternative . . . eat sushi with soy sauce. With
5133 sufficient labeled data, the problem of prepositional phrase attachment can be treated as
5134 a classification task (Ratnaparkhi et al., 1994).
3 While the grammar must be binarized before applying the CKY algorithm, evaluation is performed on the original parses. It is therefore necessary to “unbinarize” the output of a CKY-based parser, converting it back to the original grammar.
5135 However, there are inherent limitations to local solutions. While toy examples may
5136 have just a few ambiguities to resolve, realistic sentences have thousands or millions of
5137 possible parses. Furthermore, attachment decisions are interdependent, as shown in the
5138 garden path example:
We may want to attach with claws to scratch, as would be correct in the shorter sentence cats scratch people with claws. But this leaves nowhere to attach with knives. The correct interpretation can be identified only by considering the attachment decisions jointly.
5143 The huge number of potential parses may seem to make exhaustive search impossible.
5144 But as with sequence labeling, locality assumptions make it possible to search this space
5145 efficiently.
with X corresponding to the left-hand side non-terminal and α corresponding to the right-hand side. For grammars in Chomsky normal form, α is either a pair of non-terminals or
5150 a terminal symbol. The indices i, j, k anchor the production in the input, with X deriving
5151 the span wi+1:j . For binary productions, wi+1:k indicates the span of the left child, and
5152 wk+1:j indicates the span of the right child; for unary productions, k is ignored. For an
5153 input w, the optimal parse is then,
5154 where T (w) is the set of derivations that yield the input w.
5155 The scoring function Ψ decomposes across anchored productions,
$$\Psi(\tau) = \sum_{(X \to \alpha, (i, j, k)) \in \tau} \psi(X \to \alpha, (i, j, k)). \qquad [10.5]$$
5156 This is a locality assumption, akin to the assumption in Viterbi sequence labeling. In this
5157 case, the assumption states that the overall score is a sum over scores of productions,
5158 which are computed independently. In a weighted context-free grammar (WCFG), the
5159 score of each anchored production X → (α, i, j, k) is simply ψ(X → α), ignoring the
5160 anchors (i, j, k). In other parsing models, the anchors can be used to access features of the
5161 input, while still permitting efficient bottom-up parsing.
Table 10.2: An example weighted context-free grammar (WCFG). The weights are chosen
so that exp ψ(·) sums to one over right-hand sides for each non-terminal; this is required
by probabilistic context-free grammars, but not by WCFGs in general.
Example Consider the weighted grammar shown in Table 10.2, and the analysis in Fig-
ure 10.2b.
5162 In the alternative parse in Figure 10.2a, the production VP → VP PP (with score −2) is
5163 replaced with the production NP → NP PP (with score −1); all other productions are the
5164 same. As a result, the score for this parse is −10.
5165 This example hints at a big problem with WCFG parsing on non-terminals such as
5166 NP, VP, and PP: a WCFG will always prefer either VP or NP attachment, without regard
5167 to what is being attached! This problem is addressed in § 10.5.
5177 These scores are combined by addition. As in the unscored CKY algorithm, the table
5178 is constructed by considering spans of increasing length, so the scores for spans t[i, k, Y ]
5179 and t[k, j, Z] are guaranteed to be available at the time we compute the score t[i, j, X]. The
5180 value t[0, M, S] is the score of the best derivation of w from the grammar. Algorithm 14
5181 formalizes this procedure.
5182 As in unweighted CKY, the parse is recovered from the table of back pointers b, where
5183 each b[i, j, X] stores the argmax split point k and production X → Y Z in the derivation of
5184 wi+1:j from X. The best parse can be obtained by tracing these pointers backwards from
5185 b[0, M, S], all the way to the terminal symbols. This is analogous to the computation of the
5186 best sequence of labels in the Viterbi algorithm by tracing pointers backwards from the
5187 end of the trellis. Note that we need only store back-pointers for the best path to t[i, j, X];
5188 this follows from the locality assumption that the global score for a parse is a combination
5189 of the local scores of each production in the parse.
Example Let’s revisit the parsing table in Figure 10.1. In a weighted CFG, each cell
would include a score for each non-terminal; non-terminals that cannot be generated are
assumed to have a score of −∞. The first diagonal contains the scores of unary produc-
tions: t[0, 1, NP] = −2, t[1, 2, V] = 0, and so on. At the next diagonal, we compute the
scores for spans of length 2: t[1, 3, VP] = −1 + 0 − 3 = −4, t[3, 5, PP] = 0 + 0 − 3 = −3, and
so on. Things get interesting when we reach the cell t[1, 5, VP], which contains the score
for the derivation of the span w2:5 from the non-terminal VP. This score is computed as a
max over two alternatives,
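$$\begin{aligned}
t[1, 5, \text{VP}] &= \max\big(\psi(\text{VP} \to \text{VP PP}, (1, 3, 5)) + t[1, 3, \text{VP}] + t[3, 5, \text{PP}], \\
&\qquad\quad\;\; \psi(\text{VP} \to \text{V NP}, (1, 2, 5)) + t[1, 2, \text{V}] + t[2, 5, \text{NP}]\big) && [10.8] \\
&= \max(-2 - 4 - 3,\; -1 + 0 - 7) = -8. && [10.9]
\end{aligned}$$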
5190 Since the second case is the argmax, we set the back-pointer b[1, 5, VP] = (V, NP, 2), en-
5191 abling the optimal derivation to be recovered.
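The worked example can be reproduced with a small weighted CKY sketch. This is not the book's Algorithm 14: the names `BINARY`, `LEXICAL`, and `weighted_cky` are assumptions of this example, and the production scores are the ones that can be read off the arithmetic in the example above (for instance ψ(VP → V NP) = −1 and ψ(VP → VP PP) = −2), with the tag IN used for the preposition.

```python
import math

# Scores inferred from the worked example; keys are (parent, left, right).
BINARY = {
    ("S", "NP", "VP"): 0.0,
    ("NP", "NP", "PP"): -1.0,
    ("PP", "IN", "NP"): 0.0,
    ("VP", "V", "NP"): -1.0,
    ("VP", "VP", "PP"): -2.0,
}
# Keys are (tag, word).
LEXICAL = {
    ("NP", "we"): -2.0, ("NP", "sushi"): -3.0, ("NP", "chopsticks"): -3.0,
    ("IN", "with"): 0.0, ("V", "eat"): 0.0,
}

def weighted_cky(words):
    """Return the score table t[i, j, X] and the back pointers b[i, j, X]."""
    M = len(words)
    t, b = {}, {}
    for m, w in enumerate(words):
        for (X, word), score in LEXICAL.items():
            if word == w:
                t[m, m + 1, X] = score
    for length in range(2, M + 1):
        for i in range(M - length + 1):
            j = i + length
            for (X, Y, Z), psi in BINARY.items():
                for k in range(i + 1, j):
                    if (i, k, Y) in t and (k, j, Z) in t:
                        score = psi + t[i, k, Y] + t[k, j, Z]
                        if score > t.get((i, j, X), -math.inf):
                            t[i, j, X] = score
                            b[i, j, X] = (Y, Z, k)  # argmax children and split
    return t, b

t, b = weighted_cky("we eat sushi with chopsticks".split())
print(t[1, 5, "VP"], b[1, 5, "VP"])
print(t[0, 5, "S"])
```

Missing table entries play the role of −∞, so a non-terminal that cannot derive a span is simply absent from `t`.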
Example The WCFG in Table 10.2 is designed so that the weights are log-probabilities, satisfying the constraint $\sum_{\alpha} \exp \psi(X \to \alpha) = 1$. As noted earlier, there are two parses in
T (we eat sushi with chopsticks), with scores Ψ(τ1 ) = log p(τ1 ) = −10 and Ψ(τ2 ) = log p(τ2 ) =
−11. Therefore, the conditional probability p(τ1 | w) is equal to,
5206 The inside recurrence The denominator of Equation 10.10 can be viewed as a language
5207 model, summing over all valid derivations of the string w,
$$p(w) = \sum_{\tau' : \text{yield}(\tau') = w} p(\tau'). \qquad [10.12]$$
Just as the CKY algorithm makes it possible to maximize over all such analyses, with
a few modifications it can also compute their sum. Each cell t[i, j, X] must store the log
probability of deriving wi+1:j from non-terminal X. To compute this, we replace the max-
imization over split points k and productions X → Y Z with a “log-sum-exp” operation,
which exponentiates the log probabilities of the production and the children, sums them
in probability space, and then converts back to the log domain:
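$$\begin{aligned}
t[i, j, X] &= \log \sum_{k, Y, Z} \exp\big(\psi(X \to Y\,Z) + t[i, k, Y] + t[k, j, Z]\big) && [10.13] \\
&= \log \sum_{k, Y, Z} \exp\big(\log p(Y\,Z \mid X) + \log p(Y \to w_{i+1:k}) + \log p(Z \to w_{k+1:j})\big) && [10.14] \\
&= \log \sum_{k, Y, Z} p(Y\,Z \mid X) \times p(Y \to w_{i+1:k}) \times p(Z \to w_{k+1:j}) && [10.15] \\
&= \log \sum_{k, Y, Z} p(Y\,Z, w_{i+1:k}, w_{k+1:j} \mid X) && [10.16]
\end{aligned}$$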
5208 This is called the inside recurrence, because it computes the probability of each subtree
5209 as a combination of the probabilities of the smaller subtrees that are inside of it. The
5210 name implies a corresponding outside recurrence, which computes the probability of
5211 a non-terminal X spanning wi+1:j , joint with the outside context (w1:i , wj+1:M ). This
5212 recurrence is described in § 10.4.3. The inside and outside recurrences are analogous to the
5213 forward and backward recurrences in probabilistic sequence labeling (see § 7.5.3.3). They
5214 can be used to compute the marginal probabilities of individual anchored productions,
5215 p(X → α, (i, j, k) | w), summing over all possible derivations of w.
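The inside recurrence can be sketched by swapping the max in weighted CKY for a log-sum-exp. The example below is a minimal illustration under stated assumptions: the function names and the toy grammar (X → X X and X → a, each with probability 0.5) are hypothetical and chosen so that the result is easy to check by hand, since every parse of a string of three tokens uses five productions and there are two such parses.

```python
import math
from collections import defaultdict

def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def inside(words, binary, lexical, root):
    """Return log p(w) under a PCFG in Chomsky normal form.

    binary:  dict mapping (X, Y, Z) to log p(Y Z | X)
    lexical: dict mapping (X, word) to log p(word | X)
    """
    M = len(words)
    t = defaultdict(lambda: -math.inf)
    for m, w in enumerate(words):
        for (X, word), logp in lexical.items():
            if word == w:
                t[m, m + 1, X] = logp
    for length in range(2, M + 1):
        for i in range(M - length + 1):
            j = i + length
            terms = defaultdict(list)  # one list of summands per parent X
            for (X, Y, Z), logp in binary.items():
                for k in range(i + 1, j):
                    left, right = t[i, k, Y], t[k, j, Z]
                    if left > -math.inf and right > -math.inf:
                        terms[X].append(logp + left + right)
            for X, values in terms.items():
                t[i, j, X] = logsumexp(values)
    return t[0, M, root]

toy_binary = {("X", "X", "X"): math.log(0.5)}
toy_lexical = {("X", "a"): math.log(0.5)}
print(math.exp(inside(["a", "a", "a"], toy_binary, toy_lexical, "X")))
```

Subtracting the maximum before exponentiating keeps the computation stable when the log probabilities are very negative, which is the underflow risk that motivates working in the log domain.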
5217 This recurrence subsumes all of the algorithms that we have encountered in this chapter.
Unweighted CKY. When ψ(X → α, (i, j, k)) is a Boolean truth value {⊤, ⊥}, ⊗ is logical conjunction, and ⊕ is logical disjunction, then we derive the CKY recurrence for unweighted context-free grammars, discussed in § 10.1 and Algorithm 13.
Weighted CKY. When ψ(X → α, (i, j, k)) is a scalar score, ⊗ is addition, and ⊕ is maximization, then we derive the CKY recurrence for weighted context-free grammars, discussed in § 10.3 and Algorithm 14. When ψ(X → α, (i, j, k)) = log p(α | X), this same setting derives the CKY recurrence for finding the maximum likelihood derivation in a probabilistic context-free grammar.
Inside recurrence. When ψ(X → α, (i, j, k)) is a log probability, ⊗ is addition, and ⊕ = log Σ exp, then we derive the inside recurrence for probabilistic context-free grammars, discussed in § 10.3.2. It is also possible to set ψ(X → α, (i, j, k)) directly equal to the probability p(α | X). In this case, ⊗ is multiplication, and ⊕ is addition. While this may seem more intuitive than working with log probabilities, there is the risk of underflow on long inputs.
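The three settings can be illustrated with a single chart routine that takes ⊕ and ⊗ as arguments. This is a sketch under stated assumptions: the function `semiring_cky`, its parameter names, and the toy grammar (X → X X and X → a) are hypothetical, and scores here ignore the anchors (i, j, k).

```python
import math
import operator

def semiring_cky(words, binary, lexical, root, plus, times, zero):
    """Generic CKY: `plus` plays the role of ⊕, `times` of ⊗, `zero` is the ⊕ identity."""
    M = len(words)
    t = {}
    for m, w in enumerate(words):
        for (X, word), psi in lexical.items():
            if word == w:
                t[m, m + 1, X] = plus(t.get((m, m + 1, X), zero), psi)
    for length in range(2, M + 1):
        for i in range(M - length + 1):
            j = i + length
            for (X, Y, Z), psi in binary.items():
                for k in range(i + 1, j):
                    if (i, k, Y) in t and (k, j, Z) in t:
                        term = times(times(psi, t[i, k, Y]), t[k, j, Z])
                        t[i, j, X] = plus(t.get((i, j, X), zero), term)
    return t.get((0, M, root), zero)

words = ["a", "a", "a"]
rules = [("X", "X", "X")]
lex = [("X", "a")]

# Unweighted CKY: Boolean values, ⊗ = and, ⊕ = or.
accepted = semiring_cky(words, {r: True for r in rules}, {l: True for l in lex},
                        "X", operator.or_, operator.and_, False)
# Weighted CKY: scalar scores, ⊗ = +, ⊕ = max.
best = semiring_cky(words, {r: -1.0 for r in rules}, {l: -1.0 for l in lex},
                    "X", max, operator.add, -math.inf)
# Inside recurrence in probability space: ⊗ = ×, ⊕ = +.
total = semiring_cky(words, {r: 0.5 for r in rules}, {l: 0.5 for l in lex},
                     "X", operator.add, operator.mul, 0.0)
print(accepted, best, total)
```

Only the operators and the identity element change between the three calls; the chart traversal is shared, which is the sense in which the recurrence subsumes the earlier algorithms.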
5232 Regardless of how the scores are combined, the key point is the locality assumption:
5233 the score for a derivation is the combination of the independent scores for each anchored
5234 production, and these scores do not depend on any other part of the derivation. For exam-
5235 ple, if two non-terminals are siblings, the scores of productions from these non-terminals
5236 are computed independently. This locality assumption is analogous to the first-order
5237 Markov assumption in sequence labeling, where the score for transitions between tags
5238 depends only on the previous tag and current tag, and not on the history. As with se-
5239 quence labeling, this assumption makes it possible to find the optimal parse efficiently; its
5240 linguistic limitations are discussed in § 10.5.
5245 In all cases, learning requires a treebank, which is a dataset of sentences labeled with
5246 context-free parses. Parsing research was catalyzed by the Penn Treebank (Marcus et al.,
5247 1993), the first large-scale dataset of this type (see § 9.2.2). Phrase structure treebanks exist
5248 for roughly two dozen other languages, with coverage mainly restricted to European and
5249 East Asian languages, plus Arabic and Urdu.
For example, the probability of the production NP → DET NN is the corpus count of this production, divided by the count of the non-terminal NP. This estimator applies to terminal productions as well: the probability of NN → whale is the count of how often whale appears in the corpus as generated from an NN tag, divided by the total count of the NN tag. Even with the largest treebanks, currently on the order of one million tokens, it is difficult to accurately compute probabilities of even moderately rare events, such as NN → whale. Therefore, smoothing is critical for making PCFGs effective.
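The relative frequency estimator amounts to two counting passes. The sketch below is illustrative: the function name `estimate_pcfg` and the tiny list of observed productions are hypothetical, and no smoothing is applied, so unseen productions receive probability zero.

```python
from collections import Counter

def estimate_pcfg(productions):
    """Relative frequency estimate: count(X -> alpha) / count(X).

    productions: list of (lhs, rhs) pairs read off the trees in a treebank,
                 where rhs is a tuple of symbols.
    """
    rule_counts = Counter(productions)
    lhs_counts = Counter(lhs for lhs, _ in productions)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

# Hypothetical counts, for illustration only.
observed = (
    [("NP", ("DET", "NN"))] * 3
    + [("NP", ("NN",))]
    + [("NN", ("whale",))]
    + [("NN", ("cat",))] * 3
)
probs = estimate_pcfg(observed)
print(probs[("NP", ("DET", "NN"))], probs[("NN", ("whale",))])
```

A smoothed estimator would add pseudo-counts or back off to coarser statistics before normalizing; that step is omitted here.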
where the feature vector f(X, α, (i, j, k), w) is a function of the left-hand side X, the right-hand side α, the anchor indices (i, j, k), and the input w.
The basic feature f(X, α, (i, j, k)) = {(X, α)} encodes only the identity of the production itself, yielding a discriminatively-trained model with the same expressiveness as a PCFG. Features on anchored productions can include the words that border the span
5266 wi , wj+1 , the word at the split point wk+1 , the presence of a verb or noun in the left child
5267 span wi+1:k , and so on (Durrett and Klein, 2015). Scores on anchored productions can be
5268 incorporated into CKY parsing without any modification to the algorithm, because it is
5269 still possible to compute each element of the table t[i, j, X] recursively from its immediate
5270 children.
5271 Other features can be obtained by grouping elements on either the left-hand or right-
5272 hand side: for example it can be particularly beneficial to compute additional features
5273 by clustering terminal symbols, with features corresponding to groups of words with
5274 similar syntactic properties. The clustering can be obtained from unlabeled datasets that
5275 are much larger than any treebank, improving coverage. Such methods are described in
5276 chapter 14.
Feature-based parsing models can be estimated using the usual array of discriminative learning techniques. For example, a structured perceptron update can be computed as (Carreras et al., 2008),
$$f(\tau, w^{(i)}) = \sum_{(X \to \alpha, (i, j, k)) \in \tau} f(X, \alpha, (i, j, k), w^{(i)}) \qquad [10.22]$$
5285 Using this probability, a WCFG can be trained by maximizing the conditional log-likelihood
5286 of a labeled corpus.
5287 Just as in logistic regression and the conditional random field over sequences, the
gradient of the conditional log-likelihood is the difference between the observed and expected counts of each feature. The expectation $E_{\tau \mid w}[f(\tau, w^{(i)}); \theta]$ requires summing over
5290 all possible parses, and computing the marginal probabilities of anchored productions,
5291 p(X → α, (i, j, k) | w). In CRF sequence labeling, marginal probabilities over tag bigrams
5292 are computed by the two-pass forward-backward algorithm (§ 7.5.3.3). The analogue for
5293 context-free grammars is the inside-outside algorithm, in which marginal probabilities
5294 are computed from terms generated by an upward and downward pass over the parsing
5295 chart:
Figure 10.3: The two cases faced by the outside recurrence in the computation of β(i, j, X)
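$$\alpha(i, j, X) \triangleq \log \sum_{(X \to Y\,Z)} \sum_{k=i+1}^{j} \exp\big(\psi(X \to Y\,Z, (i, j, k)) + \alpha(i, k, Y) + \alpha(k, j, Z)\big). \qquad [10.26]$$

$$\begin{aligned}
\exp \beta(i, j, X) \triangleq{} & \sum_{(Y \to X\,Z)} \sum_{k=j+1}^{M} \exp\big[\psi(Y \to X\,Z, (i, k, j)) + \alpha(j, k, Z) + \beta(i, k, Y)\big] && [10.27] \\
& + \sum_{(Y \to Z\,X)} \sum_{k=0}^{i-1} \exp\big[\psi(Y \to Z\,X, (k, i, j)) + \alpha(k, i, Z) + \beta(k, j, Y)\big]. && [10.28]
\end{aligned}$$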
The first line of the outside recurrence (Equation 10.27) is the score under the condition that X is a left child of its parent Y, which spans wi+1:k, with k > j; the second line (Equation 10.28) is the score under the condition that X is a right child of its parent Y, which spans wk+1:j, with k < i.
5301 The two cases are shown in Figure 10.3. In each case, we sum over all possible
5302 productions with X on the right-hand side. The parent Y is bounded on one side
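$$\begin{aligned}
p(X \leadsto w_{i+1:j} \mid w) &= \frac{p(X \leadsto w_{i+1:j}, w)}{p(w)} && [10.29] \\
&= \frac{p(w_{i+1:j} \mid X) \times p(X, w_{1:i}, w_{j+1:M})}{p(w)} && [10.30] \\
&= \frac{\exp\big(\alpha(i, j, X) + \beta(i, j, X)\big)}{\exp \alpha(0, M, S)}. && [10.31]
\end{aligned}$$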
5306 Marginal probabilities of individual productions can be computed similarly (see exercise
5307 2). These marginal probabilities can be used for training a conditional random field parser,
5308 and also for the task of unsupervised grammar induction, in which a PCFG is estimated
5309 from a dataset of unlabeled text (Lari and Young, 1990; Pereira and Schabes, 1992).
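The upward and downward passes can be sketched together to obtain span marginals. This is an illustration under several assumptions, not the formulation in the text: it works directly in probability space rather than the log domain (so it is exposed to underflow on long inputs), it fills the outside table by pushing each parent's score down to its children rather than summing over parents, and the function name and the toy grammar (X → X X and X → a, each with probability 0.5) are hypothetical.

```python
from collections import defaultdict

def inside_outside(words, binary, lexical, root):
    """Return marginals p(X spans words[i:j] | w) for a PCFG in Chomsky normal form.

    binary:  dict mapping (parent, left, right) to a probability
    lexical: dict mapping (tag, word) to a probability
    """
    M = len(words)
    alpha = defaultdict(float)  # inside probabilities
    beta = defaultdict(float)   # outside probabilities

    # Upward pass.
    for m, w in enumerate(words):
        for (X, word), p in lexical.items():
            if word == w:
                alpha[m, m + 1, X] += p
    for length in range(2, M + 1):
        for i in range(M - length + 1):
            j = i + length
            for (X, Y, Z), p in binary.items():
                for k in range(i + 1, j):
                    alpha[i, j, X] += p * alpha[i, k, Y] * alpha[k, j, Z]

    # Downward pass: longer (parent) spans are finished before shorter ones.
    beta[0, M, root] = 1.0
    for length in range(M, 1, -1):
        for i in range(M - length + 1):
            j = i + length
            for (Y, left, right), p in binary.items():
                out = beta[i, j, Y]
                if out == 0.0:
                    continue
                for k in range(i + 1, j):
                    beta[i, k, left] += p * alpha[k, j, right] * out
                    beta[k, j, right] += p * alpha[i, k, left] * out

    total = alpha[0, M, root]
    return {key: alpha[key] * beta[key] / total
            for key in list(alpha) if alpha[key] > 0.0}

marginals = inside_outside(["a", "a", "a"],
                           {("X", "X", "X"): 0.5}, {("X", "a"): 0.5}, "X")
print(marginals[0, 2, "X"], marginals[1, 3, "X"])
```

In the toy example the two bracketings are equally likely, so each two-word span is a constituent in exactly half of the probability mass.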
where uwi is a word embedding associated with the word wi . The vector vi,j,k can then be
passed through a feedforward neural network, and used to compute the score of the an-
chored production. For example, this score can be computed as a bilinear product (Durrett
and Klein, 2015),
Figure 10.4: The left parse is preferable because of the conjunction of phrases headed by France and Italy, but these parses cannot be distinguished by a WCFG.
5323 • The context-free assumption is too strict: for example, the probability of the produc-
5324 tion NP → NP PP is much higher (in the PTB) if the parent of the noun phrase is a
5325 verb phrase (indicating that the NP is a direct object) than if the parent is a sentence
5326 (indicating that the NP is the subject of the sentence).
5327 • The Penn Treebank non-terminals are too coarse: there are many kinds of noun
5328 phrases and verb phrases, and accurate parsing sometimes requires knowing the
5329 difference. As we have already seen, when faced with prepositional phrase at-
5330 tachment ambiguity, a weighted CFG will either always choose NP attachment (if
5331 ψ(NP → NP PP) > ψ(VP → VP PP)), or it will always choose VP attachment. To
5332 get more nuanced behavior, more fine-grained non-terminals are needed.
5333 • More generally, accurate parsing requires some amount of semantics — understand-
5334 ing the meaning of the text to be parsed. Consider the example cats scratch people with
claws: knowledge about cats, claws, and scratching is necessary to correctly resolve
5336 the attachment ambiguity.
5337 An extreme example is shown in Figure 10.4. The analysis on the left is preferred
5338 because of the conjunction of similar entities France and Italy. But given the non-terminals
5339 shown in the analyses, there is no way to differentiate these two parses, since they include
5340 exactly the same productions. What is needed seems to be more precise non-terminals.
One possibility would be to rethink the linguistics behind the Penn Treebank, and ask the annotators to try again. But the original annotation effort took five years, and there is little appetite for another annotation effort of this scope. Researchers have therefore turned to automated techniques.
[Figure 10.5: two analyses of she heard the bear. Left: (S (NP she) (VP (V heard) (NP (DT the) (NN bear)))). Right, with parent annotation: (S (NP-S she) (VP-S (VP-VP heard) (NP-VP (DT-NP the) (NN-NP bear)))).]
This phenomenon can be captured by parent annotation (Johnson, 1998), in which each non-terminal is augmented with the identity of its parent, as shown in Figure 10.5. This is
5348 sometimes called vertical Markovization, since a Markov dependency is introduced be-
5349 tween each node and its parent (Klein and Manning, 2003). It is analogous to moving from
5350 a bigram to a trigram context in a hidden Markov model. In principle, parent annotation
5351 squares the size of the set of non-terminals, which could make parsing considerably less
5352 efficient. But in practice, the increase in the number of non-terminals that actually appear
5353 in the data is relatively modest (Johnson, 1998).
5354 Parent annotation weakens the WCFG locality assumptions. This improves accuracy
5355 by enabling the parser to make more fine-grained distinctions, which better capture real
5356 linguistic phenomena. However, each production is more rare, and so careful smoothing
5357 or regularization is required to control the variance over production scores.
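Parent annotation is a simple tree transformation. The sketch below is illustrative: the representation of trees as nested tuples `(label, child, ...)` with words as plain strings, and the function name `parent_annotate`, are assumptions of this example. The root keeps its original label, and each annotation uses the parent's original (unannotated) label.

```python
def parent_annotate(tree, parent=None):
    """Append the parent's label to every non-terminal below the root."""
    if isinstance(tree, str):  # a word: leave unchanged
        return tree
    label, *children = tree
    new_label = f"{label}-{parent}" if parent else label
    return (new_label, *(parent_annotate(child, label) for child in children))

tree = ("S",
        ("NP", "she"),
        ("VP", ("V", "heard"),
               ("NP", ("DT", "the"), ("NN", "bear"))))
print(parent_annotate(tree))
```

After annotation, a subject noun phrase (NP-S) and an object noun phrase (NP-VP) are distinct non-terminals, so their productions can receive different scores.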
5366 The underlined words are the head words of their respective phrases: met heads the verb
5367 phrase, and President heads the direct object noun phrase. These heads provide useful
5368 semantic information. But they break the context-free assumption, which states that the
5369 score for a production depends only on the parent and its immediate children, and not
5370 the substructure under each child.
The incorporation of head words into context-free parsing is known as lexicalization,
and is implemented in rules of the form,
5371 Lexicalization was a major step towards accurate PCFG parsing. It requires solving three
5372 problems: identifying the heads of all constituents in a treebank; parsing efficiently while
5373 keeping track of the heads; and estimating the scores for lexicalized productions.
[Figure 10.6: lexicalized parse trees for two of the running examples, each rooted in VP(meet): one ending in the President on Monday, the other in the President of Mexico]
5381 of the noun; the head of a sentence in a S → NP VP production is the head of the verb
5382 phrase.
5383 Table 10.3 shows a fragment of the head percolation rules used in many English pars-
5384 ing systems. The meaning of the first rule is that to find the head of an S constituent, first
5385 look for the rightmost VP child; if you don’t find one, then look for the rightmost SBAR
5386 child, and so on down the list. Verb phrases are headed by left verbs (the head of can plan
5387 on walking is planned, since the modal verb can is tagged M D); noun phrases are headed by
5388 the rightmost noun-like non-terminal (so the head of the red cat is cat),6 and prepositional
5389 phrases are headed by the preposition (the head of at Georgia Tech is at). Some of these
5390 rules are somewhat arbitrary — there’s no particular reason why the head of cats and dogs
5391 should be dogs — but the point here is just to get some lexical information that can support
5392 parsing, not to make deep claims about syntax. Figure 10.6 shows the application of these
5393 rules to two of the running examples.
6
The noun phrase non-terminal is sometimes treated as a special case. Collins (1997) uses a heuristic that
looks for the rightmost child which is a noun-like part-of-speech (e.g., NN, NNP), a possessive marker, or a
superlative adjective (e.g., the greatest). If no such child is found, the heuristic then looks for the leftmost NP.
If there is no child with tag NP, the heuristic then applies another priority list, this time from right to left.
Table 10.3: A fragment of head percolation rules for English, from
http://www.cs.columbia.edu/~mcollins/papers/heads
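The priority-list lookup described above can be sketched in a few lines of Python. The rule table below is a small hypothetical fragment written for illustration; it is not copied from Table 10.3, and the names HEAD_RULES and find_head are assumptions of this sketch.

```python
# Hypothetical fragment of head percolation rules: for each parent label,
# a scan direction and a priority list of child labels.
HEAD_RULES = {
    "S": ("right", ["VP", "SBAR", "ADJP", "NP"]),
    "VP": ("left", ["VBD", "VBN", "VBZ", "VB", "VBG", "VBP", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NNPS", "NP"]),
    "PP": ("left", ["IN", "TO"]),
}


def find_head(parent, children):
    """Return the index of the head child among `children` (a list of labels)."""
    direction, priorities = HEAD_RULES[parent]
    order = list(range(len(children)))
    if direction == "right":
        order.reverse()                  # scan from the rightmost child
    for label in priorities:             # walk down the priority list
        for idx in order:                # scan children in the given direction
            if children[idx] == label:
                return idx
    return order[0]                      # fallback: first child in scan order


# "the red cat": the rightmost noun-like child heads the noun phrase.
np_head = find_head("NP", ["DT", "JJ", "NN"])
# "can plan on walking": the modal MD is skipped in favor of the verb.
vp_head = find_head("VP", ["MD", "VB", "PP"])
```

With these assumed rules, a coordinated noun phrase such as cats and dogs (NNS CC NNS) is headed by its last noun, which mirrors the somewhat arbitrary choice discussed in the text.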
To compute $t_\ell$, we maximize over all split points $k > h$, since the head word must be in
the left child. We then maximize again over possible head words $h'$ for the right child. An
analogous computation is performed for $t_r$. The size of the table is now $O(M^3 N)$, where
$M$ is the length of the input and $N$ is the number of non-terminals. Furthermore, each
cell is computed by performing $O(M^2)$ operations, since we maximize over both the split
point $k$ and the head $h'$. The time complexity of the algorithm is therefore $O(R M^5 N)$,
where $R$ is the number of rules in the grammar. Fortunately, more efficient solutions are
possible. In general, the complexity of parsing can be reduced to $O(M^4)$ in the length of
the input; for a broad class of lexicalized CFGs, the complexity can be made cubic in the
length of the input, just as in unlexicalized CFGs (Eisner, 2000).
5424 The first feature scores the unlexicalized production NP → NP PP; the next two features
5425 lexicalize only one element of the production, thereby scoring the appropriateness of NP
5426 attachment for the individual words President and of ; the final feature scores the specific
5427 bilexical affinity of President and of. For bilexical pairs that are encountered frequently in
5428 the treebank, this bilexical feature can play an important role in parsing; for pairs that are
5429 absent or rare, regularization will drive its weight to zero, forcing the parser to rely on the
5430 more coarse-grained features.
5431 In chapter 14, we will encounter techniques for clustering words based on their distri-
5432 butional properties — the contexts in which they appear. Such a clustering would group
5433 rare and common words, such as whale, shark, beluga, Leviathan. Word clusters can be used
7
The real situation is even more difficult, because non-binary context-free grammars can involve trilexical
or higher-order dependencies, between the head of the constituent and multiple of its children (Carreras et al.,
2008).
5434 as features in discriminative lexicalized parsing, striking a middle ground between full
5435 lexicalization and non-terminals (Finkel et al., 2008). In this way, labeled examples con-
5436 taining relatively common words like whale can help to improve parsing for rare words
5437 like beluga, as long as those two words are clustered together.
5456 • In the E-step, estimate a marginal distribution q over the refinement type of each
5457 non-terminal in each derivation. These marginals are constrained by the original
5458 annotation: an NP can be reannotated as NP4, but not as VP3. Marginal probabil-
5459 ities over refined productions can be computed from the inside-outside algorithm,
5460 as described in § 10.4.3, where the E-step enforces the constraints imposed by the
5461 original annotations.
5462 • In the M-step, recompute the parameters of the grammar, by summing over the
5463 probabilities of anchored productions that were computed in the E-step:
$$E[\text{count}(X \to Y\ Z)] = \sum_{i=0}^{M} \sum_{j=i}^{M} \sum_{k=i}^{j} p(X \to Y\ Z, (i, j, k) \mid w). \tag{10.43}$$
5464 As usual, this process can be iterated to convergence. To determine the number of re-
5465 finement types for each tag, Petrov et al. (2006) apply a split-merge heuristic; Liang et al.
5466 (2007) and Finkel et al. (2007) apply Bayesian nonparametrics (Cohen, 2016).
Proper nouns
NNP-14 Oct. Nov. Sept.
NNP-12 John Robert James
NNP-2 J. E. L.
NNP-1 Bush Noriega Peters
NNP-15 New San Wall
NNP-3 York Francisco Street
Personal Pronouns
PRP-0 It He I
PRP-1 it he they
PRP-2 it them him
Table 10.4: Examples of automatically refined non-terminals and some of the words that
they generate (Petrov et al., 2006).
5467 Some examples of refined non-terminals are shown in Table 10.4. The proper nouns
5468 differentiate months, first names, middle initials, last names, first names of places, and
5469 second names of places; each of these will tend to appear in different parts of grammatical
5470 productions. The personal pronouns differentiate grammatical role, with PRP-0 appear-
5471 ing in subject position at the beginning of the sentence (note the capitalization), PRP-1
5472 appearing in subject position but not at the beginning of the sentence, and PRP-2 appear-
5473 ing in object position.
5488 the parse on the k-best list that has the lowest error. In either case, the reranker need only
5489 evaluate the K best parses, and so no context-free assumptions are necessary. This opens
5490 the door to more expressive scoring functions:
5491 • It is possible to incorporate arbitrary non-local features, such as the structural par-
5492 allelism and right-branching orientation of the parse (Charniak and Johnson, 2005).
• Reranking enables the use of recursive neural networks, in which each constituent
span $w_{i+1:j}$ receives a vector $u_{i,j}$ which is computed from the vector representations
of its children, using a composition function that is linked to the production
rule (Socher et al., 2013), e.g.,
$$u_{i,j} = f\left(\Theta_{X \to Y\ Z} \begin{bmatrix} u_{i,k} \\ u_{k,j} \end{bmatrix}\right) \tag{10.44}$$
The overall score of the parse can then be computed from the final vector, $\Psi(\tau) = \theta u_{0,M}$.
5499 Reranking can yield substantial improvements in accuracy. The main limitation is that it
5500 can only find the best parse among the K-best offered by the generator, so it is inherently
5501 limited by the ability of the bottom-up parser to find high-quality candidates.
5520 The set of available actions is constrained by the situation: the parser can only shift if
5521 there are remaining terminal symbols in the input, and it can only reduce if an applicable
5522 production rule exists in the grammar. If the parser arrives at a state where the input
5523 has been completely consumed, and the stack contains only the element S, then the input
5524 is accepted. If the parser arrives at a non-accepting state where there are no possible
5525 actions, the input is rejected. A parse error occurs if there is some action sequence that
5526 would accept an input, but the parser does not find it.
5527 Example Consider the input we eat sushi and the grammar in Table 10.1. The input can
5528 be parsed through the following sequence of actions:
5539 One thing to notice from this example is that the number of shift actions is equal to the
5540 length of the input. The number of reduce actions is equal to the number of non-terminals
5541 in the analysis, which grows linearly in the length of the input. Thus, the overall time
5542 complexity of shift-reduce parsing is linear in the length of the input (assuming the com-
5543 plexity of each individual classification decision is constant in the length of the input).
5544 This is far better than the cubic time complexity required by CKY parsing.
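The action counts in this argument can be checked with a small simulator. This is a minimal sketch: the toy grammar below is a hypothetical stand-in for Table 10.1 (which is not reproduced here), and the names GRAMMAR, run, and ACTIONS are assumptions. The simulator only executes a given action sequence; it does not choose actions.

```python
# Hypothetical toy grammar: maps a right-hand side to its left-hand side.
GRAMMAR = {
    ("we",): "NP",
    ("eat",): "V",
    ("sushi",): "NP",
    ("V", "NP"): "VP",
    ("NP", "VP"): "S",
}


def run(words, actions):
    """Execute shift/reduce actions; return (accepted, n_shift, n_reduce)."""
    stack, buffer = [], list(words)
    n_shift = n_reduce = 0
    for action in actions:
        if action == "shift":
            if not buffer:
                raise ValueError("cannot shift: input is consumed")
            stack.append(buffer.pop(0))
            n_shift += 1
        else:
            _, arity = action                # ("reduce", number of stack items)
            rhs = tuple(stack[-arity:])
            if rhs not in GRAMMAR:
                raise ValueError(f"no production for {rhs}")
            del stack[-arity:]
            stack.append(GRAMMAR[rhs])
            n_reduce += 1
    accepted = not buffer and stack == ["S"]
    return accepted, n_shift, n_reduce


ACTIONS = [
    "shift", ("reduce", 1),      # we -> NP
    "shift", ("reduce", 1),      # eat -> V
    "shift", ("reduce", 1),      # sushi -> NP
    ("reduce", 2),               # V NP -> VP
    ("reduce", 2),               # NP VP -> S
]
result = run(["we", "eat", "sushi"], ACTIONS)
```

Under this assumed grammar the input is accepted with three shifts (one per word) and five reductions (one per non-terminal in the analysis), so the total number of actions is linear in the input length.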
5555 independence assumptions required by chart parsing; search errors are the price that must
5556 be paid for this flexibility.
5557 Learning transition-based parsing Transition-based parsing can be combined with ma-
5558 chine learning by training a classifier to select the correct action in each situation. This
5559 classifier is free to choose any feature of the input, the state of the parser, and the parse
5560 history. However, there is no optimality guarantee: the parser may choose a suboptimal
5561 parse, due to a mistake at the beginning of the analysis. Nonetheless, some of the strongest
5562 CFG parsers are based on the shift-reduce architecture, rather than CKY. A recent gener-
5563 ation of models links shift-reduce parsing with recurrent neural networks, updating a
5564 hidden state vector while consuming the input (e.g., Cross and Huang, 2016; Dyer et al.,
5565 2016). Learning algorithms for transition-based parsing are discussed in more detail in
5566 § 11.3.
5567 Exercises
$$p(X \to X\ X) = \frac{1}{2} \tag{10.45}$$
$$p(X \to Y) = \frac{1}{2} \tag{10.46}$$
$$p(Y \to \sigma) = \frac{1}{|\Sigma|}, \quad \forall \sigma \in \Sigma \tag{10.47}$$
a) Compute the probability $p(\hat{\tau})$ of the maximum probability parse for a string
$w \in \Sigma^M$.
b) Compute the marginal probability $p(w) = \sum_{\tau : \text{yield}(\tau) = w} p(\tau)$.
c) Compute the conditional probability $p(\hat{\tau} \mid w)$.
2. Use the inside and outside scores to compute the marginal probability
$p(X_{i:j} \to Y_{i:k-1}\ Z_{k:j} \mid w)$, indicating that Y spans $w_{i:k-1}$, Z spans $w_{k:j}$, and X is
the parent of Y and Z, spanning $w_{i:j}$.
3. Suppose that the potentials $\Psi(X \to \alpha)$ are log-probabilities, so that
$\sum_{\alpha} \exp \Psi(X \to \alpha) = 1$ for all X. Verify that the semiring inside recurrence from
Equation 10.26 generates the log-probability $\log p(w) = \log \sum_{\tau : \text{yield}(\tau) = w} p(\tau)$.
5581 The previous chapter discussed algorithms for analyzing sentences in terms of nested con-
5582 stituents, such as noun phrases and verb phrases. However, many of the key sources of
5583 ambiguity in phrase-structure analysis relate to questions of attachment: where to attach a
5584 prepositional phrase or complement clause, how to scope a coordinating conjunction, and
5585 so on. These attachment decisions can be represented with a more lightweight structure:
5586 a directed graph over the words in the sentence, known as a dependency parse. Syn-
5587 tactic annotation has shifted its focus to such dependency structures: at the time of this
5588 writing, the Universal Dependencies project offers more than 100 dependency treebanks
5589 for more than 60 languages.1 This chapter will describe the linguistic ideas underlying
5590 dependency grammar, and then discuss exact and transition-based parsing algorithms.
5591 The chapter will also discuss recent research on learning to search in transition-based
5592 structure prediction.
1
universaldependencies.org
262 CHAPTER 11. DEPENDENCY PARSING
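[Figure 11.1: lexicalized constituent tree with nodes S(scratch), NP(cats), and VP(scratch), shown with the corresponding dependency graph]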
Figure 11.1: Dependency grammar is closely linked to lexicalized context free grammars:
each lexical head has a dependency path to every other word in the constituent. (This
example is based on the lexicalization rules from § 10.5.2, which make the preposition
the head of a prepositional phrase. In the more contemporary Universal Dependencies
annotations, the head of with claws would be claws, so there would be an edge scratch →
claws.)
5603 occupies the central position for the noun phrase, with the word the playing a supporting
5604 role.
The relationships between words in a sentence can be formalized in a directed graph,
based on the lexicalized phrase-structure parse: create an edge (i, j) iff word i is the head
of a phrase whose child is a phrase headed by word j. Thus, in our example, we would
have scratch → cats and cats → the. We would not have the edge scratch → the, because
although S(scratch) dominates DET(the) in the phrase-structure parse tree, it is not its
immediate parent. These edges describe syntactic dependencies, a bilexical relationship
between a head and a dependent, which is at the heart of dependency grammar.
5612 Continuing to build out this dependency graph, we will eventually reach every word
5613 in the sentence, as shown in Figure 11.1b. In this graph — and in all graphs constructed
5614 in this way — every word has exactly one incoming edge, except for the root word, which
5615 is indicated by a special incoming arrow from above. Furthermore, the graph is weakly
5616 connected: if the directed edges were replaced with undirected edges, there would be a
5617 path between all pairs of nodes. From these properties, it can be shown that there are no
5618 cycles in the graph (or else at least one node would have to have more than one incoming
5619 edge), and therefore, the graph is a tree. Because the graph includes all vertices, it is a
5620 spanning tree.
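These properties can be checked mechanically. The sketch below assumes a head-array encoding that is not used in the text: heads[j-1] holds the position of the head of word j, with 0 standing for the special incoming arrow to the root word. Under that encoding every word has exactly one incoming edge by construction, so what remains is to check for a single root and for the absence of cycles. The function name is_dependency_tree is an assumption.

```python
def is_dependency_tree(heads):
    """heads[j-1] is the head of word j (1-based); 0 marks the root word."""
    n = len(heads)
    if sum(1 for h in heads if h == 0) != 1:     # exactly one root word
        return False
    for j in range(1, n + 1):
        seen = set()
        node = j
        while node != 0:                          # follow head links upward
            if node in seen or not (0 <= heads[node - 1] <= n):
                return False                      # cycle or out-of-range head
            seen.add(node)
            node = heads[node - 1]
    return True


# cats scratch people with claws, using the heads from the lexicalized parse:
# scratch -> cats, scratch -> people, scratch -> with, with -> claws.
tree_ok = is_dependency_tree([2, 0, 2, 2, 4])
# Words 1 and 2 point at each other, so they never reach the root.
tree_bad = is_dependency_tree([2, 1, 0])
```

Because every word reaches the root by following head links, the graph is weakly connected and acyclic, which is the spanning tree property described above.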
5624 do we decide which is the head? Here are some possible criteria:
5625 • The head sets the syntactic category of the construction: for example, nouns are the
5626 heads of noun phrases, and verbs are the heads of verb phrases.
5627 • The modifier may be optional while the head is mandatory: for example, in the
5628 sentence cats scratch people with claws, the subtrees cats scratch and cats scratch people
5629 are grammatical sentences, but with claws is not.
5630 • The head determines the morphological form of the modifier: for example, in lan-
5631 guages that require gender agreement, the gender of the noun determines the gen-
5632 der of the adjectives and determiners.
5633 • Edges should first connect content words, and then connect function words.
5634 As always, these guidelines sometimes conflict. The Universal Dependencies (UD)
5635 project has attempted to identify a set of principles that can be applied to dozens of dif-
5636 ferent languages (Nivre et al., 2016).2 These guidelines are based on the universal part-
5637 of-speech tags from chapter 8. They differ somewhat from the head rules described in
5638 § 10.5.2: for example, on the principle that dependencies should relate content words, the
5639 prepositional phrase with claws would be headed by claws, resulting in an edge scratch →
5640 claws, and another edge claws → with.
One objection to dependency grammar is that not all syntactic relations are asymmetric.
Coordination is one of the most obvious examples (Popel et al., 2013): in the sentence,
Abigail and Max like kimchi (Figure 11.2), which word is the head of the coordinated noun
phrase Abigail and Max? Choosing either Abigail or Max seems arbitrary; fairness argues
for making and the head, but this seems like the least important word in the noun phrase,
and selecting it would violate the principle of linking content words first. The Universal
Dependencies annotation system arbitrarily chooses the left-most item as the head (in
this case, Abigail) and includes edges from this head to both Max and the coordinating
conjunction and. These edges are distinguished by the labels CONJ (for the thing being
conjoined) and CC (for the coordinating conjunction). The labeling system is discussed
next.
[Figures 11.2 and 11.3: dependency arc diagrams with edge labels root, nsubj, conj, cc, obj, advmod, punct, cop, and compound]
Figure 11.3: A labeled dependency parse from the English UD Treebank (reviews-361348-
0006)
5657 kimchi is the object.3 The negation not is treated as an adverbial modifier (ADVMOD) on
5658 the noun jook.
A slightly more complex example is shown in Figure 11.3. The multiword expression
New York pizza is treated as a “flat” unit of text, with the elements linked by the COMPOUND
relation. The sentence includes two clauses that are conjoined in the same way
that noun phrases are conjoined in Figure 11.2. The second clause contains a copula verb
(see § 8.1.1). For such clauses, we treat the “object” of the verb as the root (in this case,
it) and label the verb as a dependent, with the COP relation. This example also shows
how punctuation is treated, with the label PUNCT.
[Figure 11.4: CFG trees (nodes VP, V, NP, PP) and a dependency parse for a verb phrase ending in with a fork]
Figure 11.4: The three different CFG analyses of this verb phrase all correspond to a single
dependency structure.
5669 representing prepositional phrase adjuncts to the verb ate. Because there is apparently no
5670 meaningful difference between these analyses, the Penn Treebank decides by convention
5671 to use the two-level representation (see Johnson, 1998, for a discussion). As shown in
5672 Figure 11.4d, these three cases all look the same in a dependency parse.
5673 But dependency grammar imposes its own set of annotation decisions, such as the
5674 identification of the head of a coordination (§ 11.1.1); without lexicalization, context-free
5675 grammar does not require either element in a coordination to be privileged in this way.
5676 Dependency parses can be disappointingly flat: for example, in the sentence Yesterday,
5677 Abigail was reluctantly giving Max kimchi, the root giving is the head of every dependency!
5678 The constituent parse arguably offers a more useful structural analysis for such cases.
Projectivity Thus far, we have defined dependency trees as spanning trees over a graph
in which each word is a vertex. As we have seen, one way to construct such trees is by
connecting the heads in a lexicalized constituent parse. However, there are spanning trees
that cannot be constructed in this way. Syntactic constituents are contiguous spans. In a
spanning tree constructed from a lexicalized constituent parse, the head h of any constituent
that spans the nodes from i to j must have a path to every node in this span. This
property is known as projectivity, and projective dependency parses are a restricted
class of spanning trees. Informally, projectivity means that “crossing edges” are prohibited.
The formal definition follows:
[Figure 11.5: dependency arc diagram with edge labels root, nsubj, obj, det, cop, obl:tmod, and acl:relcl]
Figure 11.5: An example of a non-projective dependency parse. The “crossing edge” arises
from the relative clause which was vegetarian and the oblique temporal modifier yesterday.
5688 Definition 2 (Projectivity). An edge from i to j is projective iff all k between i and j are descen-
5689 dants of i. A dependency parse is projective iff all its edges are projective.
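Definition 2 translates directly into a check over a head array. The sketch below assumes the encoding heads[j-1] = head of word j, with 0 for ROOT, and assumes the input is already a tree (otherwise the upward walk might not terminate). The example sentence and its heads are an illustrative assumption, chosen to contain a relative clause separated from its noun by a temporal modifier.

```python
def is_projective(heads):
    """heads[j-1] is the head of word j (1-based); 0 is ROOT. Assumes a tree."""

    def is_descendant(k, i):
        # Walk upward from k; k descends from i if the walk passes through i.
        while k != 0:
            k = heads[k - 1]
            if k == i:
                return True
        return False

    for j, i in enumerate(heads, start=1):       # each edge i -> j
        lo, hi = min(i, j), max(i, j)
        for k in range(lo + 1, hi):              # every word strictly between
            if not is_descendant(k, i):
                return False
    return True


# cats scratch people with claws: no crossing edges.
projective = is_projective([2, 0, 2, 2, 4])
# Lucia ate a pizza yesterday which was vegetarian (assumed analysis):
# the edge pizza -> vegetarian spans yesterday, which depends on ate.
non_projective = is_projective([2, 0, 4, 2, 2, 8, 8, 4])
```

This direct check costs more than linear time because of the nested scan over intervening words; it is meant to mirror the definition, not to be efficient.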
5690 Figure 11.5 gives an example of a non-projective dependency graph in English. This
5691 dependency graph does not correspond to any constituent parse. As shown in Table 11.1,
5692 non-projectivity is more common in languages such as Czech and German. Even though
5693 relatively few dependencies are non-projective in these languages, many sentences have
5694 at least one such dependency. As we will soon see, projectivity has important algorithmic
5695 consequences.
5701 where Y(w) is the set of valid dependency parses on the input w. As usual, the number
5702 of possible labels |Y(w)| is exponential in the length of the input (Wu and Chao, 2004).
Figure 11.6: Feature templates for higher-order dependency parsing (Koo and Collins,
2010)
5703 Algorithms that search over this space of possible graphs are known as graph-based de-
5704 pendency parsers.
In sequence labeling and constituent parsing, it was possible to search efficiently over
an exponential space by choosing a feature function that decomposes into a sum of local
feature vectors. A similar approach is possible for dependency parsing, by requiring the
scoring function to decompose across dependency arcs i → j:
$$\Psi(y, w; \theta) = \sum_{i \xrightarrow{r} j \in y} \psi(i \xrightarrow{r} j, w; \theta). \tag{11.2}$$
Dependency parsers that operate under this assumption are known as arc-factored, since
the overall score is a sum of scores over all arcs (equivalently, a product of exponentiated arc scores).
The top line computes a scoring function that includes the grandparent k; the
bottom line computes a scoring function for each sibling s. For projective dependency
graphs, there are efficient algorithms for second-order and third-order dependency parsing
(Eisner, 1996; McDonald and Pereira, 2006; Koo and Collins, 2010); for non-projective
dependency graphs, second-order dependency parsing is NP-hard (McDonald and Pereira,
2006). The specific algorithms are discussed in the next section.
Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) computes this maximum
spanning tree efficiently. It does this by first identifying the best incoming edge $i \xrightarrow{r} j$ for
each vertex j. If the resulting graph does not contain cycles, it is the maximum spanning
tree. If there is a cycle, it is collapsed into a super-vertex, whose incoming and outgoing
edges are based on the edges to the vertices in the cycle. The algorithm is then applied
recursively to the resulting graph, and the process repeats until a graph without cycles is
obtained.
The time complexity of identifying the best incoming edge for each vertex is $O(M^2 R)$,
where M is the length of the input and R is the number of relations; in the worst case, the
number of cycles is $O(M)$. Therefore, the complexity of the Chu-Liu-Edmonds algorithm
is $O(M^3 R)$. This complexity can be reduced to $O(M^2 N)$ by storing the edge scores in a
Fibonacci heap (Gabow et al., 1986). For more detail on graph-based parsing algorithms,
see Eisner (1997) and Kübler et al. (2009).
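The first stage of the procedure, greedy selection of incoming edges followed by cycle detection, can be sketched as follows. This is only that first stage for unlabeled arcs; the cycle contraction and recursion are omitted. The score matrix layout (scores[i][j] for the arc from i to j, with vertex 0 as ROOT) and the function names are assumptions of this sketch.

```python
def best_incoming(scores):
    """scores[i][j] is the score of arc i -> j; vertex 0 is ROOT.
    Returns head[j] for each j >= 1 (head[0] is None)."""
    n = len(scores)
    head = [None] * n
    for j in range(1, n):
        candidates = [i for i in range(n) if i != j]
        head[j] = max(candidates, key=lambda i: scores[i][j])
    return head


def find_cycle(head):
    """Return a list of vertices forming a cycle, or None if there is none."""
    for start in range(1, len(head)):
        path, node = [], start
        while node is not None and node not in path:
            path.append(node)
            node = head[node]
        if node is not None:                  # revisited a vertex: cycle found
            return path[path.index(node):]
    return None


# Two words plus ROOT. Here the greedy choice is already a tree.
acyclic = best_incoming([[0, 5, 1], [0, 0, 4], [0, 3, 0]])
# Here words 1 and 2 each prefer the other as head, forming a cycle.
cyclic = best_incoming([[0, 1, 1], [0, 0, 4], [0, 3, 0]])
cycle = find_cycle(cyclic)
```

When find_cycle returns None, the greedy selection is the maximum spanning tree; otherwise the returned vertices are the ones that would be collapsed into a super-vertex.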
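$$\text{Linear:}\quad \psi(i \xrightarrow{r} j, w; \theta) = \theta \cdot f(i \xrightarrow{r} j, w) \tag{11.4}$$
$$\text{Neural:}\quad \psi(i \xrightarrow{r} j, w; \theta) = \text{Feedforward}([u_{w_i}; u_{w_j}]; \theta) \tag{11.5}$$
$$\text{Generative:}\quad \psi(i \xrightarrow{r} j, w; \theta) = \log p(w_j, r \mid w_i). \tag{11.6}$$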
5775 Each of these features can be conjoined with the dependency edge label r. Note that
5776 features in an arc-factored parser can refer to words other than wi and wj . The restriction
5777 is that the features consider only a single arc.
Bilexical features (e.g., sushi → chopsticks) are powerful but rare, so it is useful to aug-
ment them with coarse-grained alternatives, by “backing off” to the part-of-speech or
affix. For example, the following features are created by backing off to part-of-speech tags
in an unlabeled dependency parser:
$$f(3 \to 5, \text{we eat sushi with chopsticks}) = \langle \text{sushi} \to \text{chopsticks},\ \text{sushi} \to \text{NNS},\ \text{NN} \to \text{chopsticks},\ \text{NN} \to \text{NNS} \rangle.$$
Regularized discriminative learning algorithms can then trade off between features at
varying levels of detail. McDonald et al. (2005) take this approach as far as tetralexical
features (e.g., $(w_i, w_{i+1}, w_{j-1}, w_j)$). Such features help to avoid choosing arcs that are
unlikely due to the intervening words: for example, there is unlikely to be an edge between
two nouns if the intervening span contains a verb. A large list of first and second-order
features is provided by Bohnet (2010), who uses a hashing function to store these features
efficiently.
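The backoff scheme can be illustrated with a small feature extractor and a linear arc score. This is a sketch under assumptions: the part-of-speech tags for the example sentence, the string format of the features, and the weights are all hypothetical, and the function names are not from the text.

```python
def arc_features(i, j, words, tags):
    """Backoff features for an unlabeled arc i -> j (1-based positions)."""
    wi, wj = words[i - 1], words[j - 1]
    ti, tj = tags[i - 1], tags[j - 1]
    return [
        f"{wi}->{wj}",      # bilexical: powerful but rare
        f"{wi}->{tj}",      # head word, dependent tag
        f"{ti}->{wj}",      # head tag, dependent word
        f"{ti}->{tj}",      # fully backed off to tags
    ]


def arc_score(features, weights):
    """Linear arc score: the dot product of weights with binary features."""
    return sum(weights.get(f, 0.0) for f in features)


words = ["we", "eat", "sushi", "with", "chopsticks"]
tags = ["PRP", "VBP", "NN", "IN", "NNS"]          # assumed tagging
feats = arc_features(3, 5, words, tags)

# Hypothetical weights: the bilexical pair was never seen in training,
# so only the coarse tag-to-tag feature contributes to the score.
weights = {"NN->NNS": 0.5, "VBP->NN": 1.25}
score = arc_score(feats, weights)
```

The design point is that a regularized learner can leave the rare bilexical weight at zero while the tag-level features still provide a usable score for the arc.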
y ={(ROOT, 2), (2, 1), (2, 3), (3, 5), (5, 4)} [11.10]
$$\log p(w \mid y) = \sum_{(i \to j) \in y} \log p(w_j \mid w_i) \tag{11.11}$$
5797 In this case, the argmax requires a maximization over all dependency trees for the sen-
5798 tence, which can be computed using the algorithms described in § 11.2.1. We can apply
5799 all the usual tricks from § 2.2: weight averaging, a large margin objective, and regular-
5800 ization. McDonald et al. (2005) were the first to treat dependency parsing as a structure
5801 prediction problem, using MIRA, an online margin-based learning algorithm. Neural arc
5802 scores can be learned in the same way, backpropagating from a margin loss to updates on
5803 the feedforward network that computes the score for each edge.
A conditional random field for arc-factored dependency parsing is built on the proba-
bility model,
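$$p(y \mid w) = \frac{\exp \sum_{i \xrightarrow{r} j \in y} \psi(i \xrightarrow{r} j, w; \theta)}{\sum_{y' \in \mathcal{Y}(w)} \exp \sum_{i \xrightarrow{r} j \in y'} \psi(i \xrightarrow{r} j, w; \theta)} \tag{11.15}$$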
Such a model is trained to minimize the negative log conditional-likelihood. Just as in
CRF sequence models (§ 7.5.3) and the logistic regression classifier (§ 2.4), the gradients
involve marginal probabilities $p(i \xrightarrow{r} j \mid w; \theta)$, which in this case are probabilities over
individual dependencies. In arc-factored models, these probabilities can be computed
in polynomial time. For projective dependency trees, the marginal probabilities can be
computed in cubic time, using a variant of the inside-outside algorithm (Lari and Young,
1990). For non-projective dependency parsing, marginals can also be computed in cubic
time, using the matrix-tree theorem (Koo et al., 2007; McDonald et al., 2007; Smith and
Smith, 2007). Details of these methods are described by Kübler et al. (2009).
5814 Graph-based dependency parsing offers exact inference, meaning that it is possible to re-
5815 cover the best-scoring parse for any given model. But this comes at a price: the scoring
5816 function is required to decompose into local parts — in the case of non-projective parsing,
5817 these parts are restricted to individual arcs. These limitations are felt more keenly in de-
5818 pendency parsing than in sequence labeling, because second-order dependency features
5819 are critical to correctly identify some types of attachments. For example, prepositional
5820 phrase attachment depends on the attachment point, the object of the preposition, and
5821 the preposition itself; arc-factored scores cannot account for all three of these features si-
5822 multaneously. Graph-based dependency parsing may also be criticized on the basis of
5823 intuitions about human language processing: people read and listen to sentences sequen-
5824 tially, incrementally building mental models of the sentence structure and meaning before
5825 getting to the end (Jurafsky, 1996). This seems hard to reconcile with graph-based algo-
5826 rithms, which perform bottom-up operations on the entire sentence, requiring the parser
5827 to keep every word in memory. Finally, from a practical perspective, graph-based depen-
5828 dency parsing is relatively slow, running in cubic time in the length of the input.
5829 Transition-based algorithms address all three of these objections. They work by mov-
5830 ing through the sentence sequentially, while performing actions that incrementally up-
5831 date a stored representation of what has been read thus far. As with the shift-reduce
5832 parser from § 10.6.2, this representation consists of a stack, onto which parsing substruc-
5833 tures can be pushed and popped. In shift-reduce, these substructures were constituents;
5834 in the transition systems that follow, they will be projective dependency trees over partial
5835 spans of the input.4 Parsing is complete when the input is consumed and there is only
5836 a single structure on the stack. The sequence of actions that led to the parse is known as
5837 the derivation. One problem with transition-based systems is that there may be multiple
5838 derivations for a single parse structure — a phenomenon known as spurious ambiguity.
4
Transition systems also exist for non-projective dependency parsing (e.g., Nivre, 2008).
indicating that the stack contains only the special node ROOT, the entire input is on the
buffer, and the set of arcs is empty. An accepting configuration is,
where the stack contains only ROOT, the buffer is empty, and the arcs A define a spanning
tree over the input. The arc-standard and arc-eager systems define a set of transitions
between configurations, which are capable of transforming an initial configuration into
an accepting configuration. In both of these systems, the number of actions required to
parse an input grows linearly in the length of the input, making transition-based parsing
considerably more efficient than graph-based methods.
• SHIFT: move the first item from the input buffer on to the top of the stack,
$$(\sigma, i|\beta, A) \Rightarrow (\sigma|i, \beta, A),$$
where we write $i|\beta$ to indicate that i is the leftmost item in the input buffer, and $\sigma|i$
to indicate the result of pushing i on to stack σ.
• ARC-LEFT: create a new left-facing arc of type r between the item on the top of the
stack and the first item in the input buffer. The head of this arc is j, which remains
at the front of the input buffer. The arc $j \xrightarrow{r} i$ is added to A. Formally,
$$(\sigma|i, j|\beta, A) \Rightarrow (\sigma, j|\beta, A \oplus j \xrightarrow{r} i), \tag{11.19}$$
where r is the label of the dependency arc, and ⊕ concatenates the new arc $j \xrightarrow{r} i$ to
the list A.
Table 11.2: Arc-standard derivation of the unlabeled dependency parse for the input they
like bagels with lox.
• ARC-RIGHT: creates a new right-facing arc of type r between the item on the top of
the stack and the first item in the input buffer. The head of this arc is i, which is
“popped” from the stack and pushed to the front of the input buffer. The arc $i \xrightarrow{r} j$
is added to A. Formally,
$$(\sigma|i, j|\beta, A) \Rightarrow (\sigma, i|\beta, A \oplus i \xrightarrow{r} j), \tag{11.20}$$
Each action has preconditions. The SHIFT action can be performed only when the buffer
has at least one element. The ARC-LEFT action cannot be performed when the root node
ROOT is on top of the stack, since this node must be the root of the entire tree. The
ARC-LEFT and ARC-RIGHT actions remove the modifier words from the stack (in the case of
ARC-LEFT) and from the buffer (in the case of ARC-RIGHT), so it is impossible for any word
to have more than one parent. Furthermore, the end state can only be reached when every
word is removed from the buffer and stack, so the set of arcs is guaranteed to constitute a
spanning tree. An example arc-standard derivation is shown in Table 11.2.
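The three arc-standard transitions can be written as functions on (stack, buffer, arcs) configurations. The sketch below is unlabeled, represents words by their 1-based positions with 0 for ROOT, and stores arcs as (head, dependent) pairs; these encodings, the function names, and the particular derivation are assumptions chosen to be consistent with the transitions as defined above.

```python
ROOT = 0


def shift(stack, buffer, arcs):
    if not buffer:
        raise ValueError("SHIFT requires a non-empty buffer")
    return stack + [buffer[0]], buffer[1:], arcs


def arc_left(stack, buffer, arcs):
    i, j = stack[-1], buffer[0]
    if i == ROOT:
        raise ValueError("ARC-LEFT is not allowed when ROOT is on top of the stack")
    return stack[:-1], buffer, arcs + [(j, i)]        # head j stays on the buffer


def arc_right(stack, buffer, arcs):
    i, j = stack[-1], buffer[0]
    return stack[:-1], [i] + buffer[1:], arcs + [(i, j)]   # head i returns to the buffer


TRANSITIONS = {"SHIFT": shift, "ARC-LEFT": arc_left, "ARC-RIGHT": arc_right}


def parse(n_words, actions):
    """Apply a sequence of action names starting from the initial configuration."""
    config = ([ROOT], list(range(1, n_words + 1)), [])
    for name in actions:
        config = TRANSITIONS[name](*config)
    return config


# they(1) like(2) bagels(3) with(4) lox(5), with noun attachment of "with lox".
DERIVATION = ["SHIFT", "ARC-LEFT", "SHIFT", "SHIFT", "SHIFT",
              "ARC-LEFT", "ARC-RIGHT", "ARC-RIGHT", "ARC-RIGHT", "SHIFT"]
stack, buffer, arcs = parse(5, DERIVATION)
```

The derivation uses ten actions for five words, and it ends with ROOT alone on the stack and an empty buffer. Note that bagels is shifted at step 4 and the arc from like to bagels is created only at step 8.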
5887 arcs, as occurs in Table 11.2. Note that the decision to shift bagels onto the stack guarantees
5888 that the prepositional phrase with lox will attach to the noun phrase, and that this decision
5889 must be made before the prepositional phrase is itself parsed. This has been argued to be
5890 cognitively implausible (Abney and Johnson, 1991); from a computational perspective, it
5891 means that a parser may need to look several steps ahead to make the correct decision.
Arc-eager dependency parsing changes the ARC-RIGHT action so that right dependents
can be attached before all of their dependents have been found. Rather than removing
the modifier from both the buffer and stack, the ARC-RIGHT action pushes the
modifier on to the stack, on top of the head. Because the stack can now contain elements
that already have parents in the partial dependency graph, two additional changes are
necessary:
• A precondition is required to ensure that the ARC-LEFT action cannot be applied
when the top element on the stack already has a parent in A.
• A new REDUCE action is introduced, which can remove elements from the stack if
they already have a parent in A:
As a result of these changes, it is now possible to create the arc like → bagels before parsing
the prepositional phrase with lox. Furthermore, this action does not imply a decision about
whether the prepositional phrase will attach to the noun or verb. Noun attachment is
chosen in the parse in Table 11.3, but verb attachment could be achieved by applying the
REDUCE action at step 5 or 7.
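The arc-eager changes can be sketched in the same style. The encoding is an assumption: words are 1-based positions with 0 for ROOT, arcs are unlabeled (head, dependent) pairs, and REDUCE is implemented from the prose description as popping a stack item that already has a parent. The two derivations are illustrative and are not copied from Table 11.3.

```python
ROOT = 0


def has_parent(node, arcs):
    return any(dep == node for _, dep in arcs)


def step(action, stack, buffer, arcs):
    if action == "SHIFT":
        return stack + [buffer[0]], buffer[1:], arcs
    if action == "ARC-LEFT":             # arc j -> i, then pop i
        i, j = stack[-1], buffer[0]
        if i == ROOT or has_parent(i, arcs):
            raise ValueError("ARC-LEFT precondition violated")
        return stack[:-1], buffer, arcs + [(j, i)]
    if action == "ARC-RIGHT":            # arc i -> j, then push j on top of i
        i, j = stack[-1], buffer[0]
        return stack + [j], buffer[1:], arcs + [(i, j)]
    if action == "REDUCE":               # pop an item that already has a parent
        if not has_parent(stack[-1], arcs):
            raise ValueError("REDUCE precondition violated")
        return stack[:-1], buffer, arcs
    raise ValueError(f"unknown action {action}")


def parse(n_words, actions):
    config = ([ROOT], list(range(1, n_words + 1)), [])
    for action in actions:
        config = step(action, *config)
    return config


# they(1) like(2) bagels(3) with(4) lox(5)
NOUN_ATTACH = ["SHIFT", "ARC-LEFT", "ARC-RIGHT", "ARC-RIGHT", "SHIFT",
               "ARC-LEFT", "ARC-RIGHT", "REDUCE", "REDUCE", "REDUCE"]
# Applying REDUCE at step 7 pops bagels, so lox attaches to like instead.
VERB_ATTACH = ["SHIFT", "ARC-LEFT", "ARC-RIGHT", "ARC-RIGHT", "SHIFT",
               "ARC-LEFT", "REDUCE", "ARC-RIGHT", "REDUCE", "REDUCE"]
noun = parse(5, NOUN_ATTACH)
verb = parse(5, VERB_ATTACH)
```

In both derivations the arc from like to bagels is created at step 4, before the prepositional phrase is touched, and the attachment decision is deferred until step 7.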
Table 11.3: Arc-eager derivation of the unlabeled dependency parse for the input they like
bagels with lox.
[Figure 11.7 diagram: beam of stack and buffer configurations for the input they can fish, connected by SHIFT, ARC-LEFT, and ARC-RIGHT transitions]
Figure 11.7: Beam search for unlabeled dependency parsing, with beam size K = 2. The
arc lists for each configuration are not shown, but can be computed from the transitions.
a poor derivation. For example, in Table 11.2, if ARC-RIGHT were chosen at step 4, then
the parser would later be forced to attach the prepositional phrase with lox to the verb
like. Note that the like → bagels arc is indeed part of the correct dependency parse, but
the arc-standard transition system requires it to be created later in the derivation.
Beam search addresses this issue by maintaining a set of hypothetical derivations,
called a beam. At step t of the derivation, there is a set of k hypotheses, each of which is a
tuple of a score and a sequence of actions,
Each hypothesis is then “expanded” by considering the set of all valid actions from the
current configuration $c_t^{(k)}$, written $\mathcal{A}(c_t^{(k)})$. This yields a large set of new hypotheses. For
each action $a \in \mathcal{A}(c_t^{(k)})$, we score the new hypothesis $A_t^{(k)} \oplus a$. The top k hypotheses by
this scoring metric are kept, and parsing proceeds to the next step (Zhang and Clark,
2008). Note that beam search requires a scoring function for action sequences, rather than
individual actions. This issue will be revisited in the next section.
An example of beam search is shown in Figure 11.7, with a beam size of K = 2. For the
first transition, the only valid action is SHIFT, so there is only one possible configuration
at t = 2. From this configuration, there are three possible actions. The top two are ARC-RIGHT
and ARC-LEFT, and so the resulting hypotheses from these actions are on the beam
at t = 3. From these configurations, there are three possible actions each, but the best
two are expansions of the bottom hypothesis at t = 3. Parsing continues until t = 5, at
which point both hypotheses reach an accepting state. The best-scoring hypothesis is then
selected as the parse.
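The expand-score-prune loop described above can be sketched in a few lines. This is a minimal illustration rather than the parser from the literature: it assumes a common stack-based arc-standard variant (which may differ in detail from the transition system defined earlier in the chapter), and the arc scores in `ARC_SCORES` are hypothetical numbers chosen so that greedy search and beam search disagree.

```python
# Toy transition system: a configuration is (stack, buffer, arcs); index 0 is ROOT.
ARC_SCORES = {(2, 1): 2.0, (2, 3): 2.0, (0, 2): 1.0}  # hypothetical head -> dependent scores

def valid_actions(config):
    stack, buffer, _ = config
    actions = []
    if buffer:
        actions.append("SHIFT")
    if len(stack) >= 2:
        actions.append("ARC-RIGHT")
        if stack[-2] != 0:  # ROOT may not receive a head
            actions.append("ARC-LEFT")
    return actions

def apply_action(action, config):
    stack, buffer, arcs = config
    if action == "SHIFT":
        return stack + buffer[:1], buffer[1:], arcs
    second, top = stack[-2], stack[-1]
    if action == "ARC-LEFT":  # arc top -> second; pop second
        return stack[:-2] + (top,), buffer, arcs | {(top, second)}
    return stack[:-1], buffer, arcs | {(second, top)}  # ARC-RIGHT: arc second -> top; pop top

def score_action(action, config):
    stack = config[0]
    if action == "SHIFT":
        return 0.0
    arc = (stack[-1], stack[-2]) if action == "ARC-LEFT" else (stack[-2], stack[-1])
    return ARC_SCORES.get(arc, -1.0)  # unlisted arcs are penalized

def is_final(config):
    stack, buffer, _ = config
    return not buffer and stack == (0,)

def beam_search(n_words, K=2):
    init = ((0,), tuple(range(1, n_words + 1)), frozenset())
    beam = [(0.0, [], init)]  # each hypothesis: (score, action sequence, configuration)
    while not all(is_final(c) for _, _, c in beam):
        candidates = []
        for score, actions, config in beam:
            if is_final(config):  # carry finished hypotheses forward unchanged
                candidates.append((score, actions, config))
                continue
            for a in valid_actions(config):
                candidates.append((score + score_action(a, config),
                                   actions + [a],
                                   apply_action(a, config)))
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:K]  # keep the top K hypotheses
    return beam[0]

best_score, best_actions, (_, _, best_arcs) = beam_search(3, K=2)
greedy_score = beam_search(3, K=1)[0]
```

With these toy scores, greedy search (K = 1) commits early to the arc ROOT → 2 and finishes with a total score of 2.0, while the beam of size 2 keeps the alternative alive and reaches the derivation with score 5.0.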
5939 where A(c) is the set of admissible actions in the current configuration c, w is the input,
5940 and Ψ is a scoring function with parameters θ (Yamada and Matsumoto, 2003).
5941 A feature-based score can be computed, Ψ(a, c, w) = θ · f (a, c, w), using features that
5942 may consider any aspect of the current configuration and input sequence. Typical features
5943 for transition-based dependency parsing include: the word and part-of-speech of the top
5944 element on the stack; the word and part-of-speech of the first, second, and third elements
5945 on the input buffer; pairs and triples of words and parts-of-speech from the top of the
5946 stack and the front of the buffer; the distance (in tokens) between the element on the top
5947 of the stack and the element in the front of the input buffer; the number of modifiers of
5948 each of these elements; and higher-order dependency features as described above in the
5949 section on graph-based dependency parsing (see, e.g., Zhang and Nivre, 2011).
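As a rough sketch of the feature-based score Ψ(a, c, w) = θ · f(a, c, w), the code below extracts a handful of the feature types listed above and takes a sparse dot product. The feature names, the tag set, and the weights in `theta` are illustrative assumptions, not a published feature template.

```python
from collections import Counter

def features(action, stack, buffer, words, tags):
    """Sparse feature vector f(a, c, w), keyed by (action, template, values...)."""
    f = Counter()
    s0 = stack[-1] if stack else None    # top of the stack
    b0 = buffer[0] if buffer else None   # front of the buffer
    s0_word, s0_tag = (words[s0], tags[s0]) if s0 is not None else ("<none>", "<none>")
    b0_word, b0_tag = (words[b0], tags[b0]) if b0 is not None else ("<none>", "<none>")
    f[(action, "s0.word", s0_word)] += 1
    f[(action, "s0.tag", s0_tag)] += 1
    f[(action, "b0.word", b0_word)] += 1
    f[(action, "b0.tag", b0_tag)] += 1
    f[(action, "s0.tag+b0.tag", s0_tag, b0_tag)] += 1   # pair feature
    if s0 is not None and b0 is not None:
        f[(action, "distance", b0 - s0)] += 1           # distance in tokens
    return f

def score(theta, f):
    """Dot product between a weight dictionary and a sparse feature vector."""
    return sum(theta.get(name, 0.0) * value for name, value in f.items())

words = ["ROOT", "they", "like", "bagels"]
tags = ["ROOT", "PRON", "VERB", "NOUN"]
stack, buffer = [0, 2], [3]   # positions into `words`
theta = {("ARC-RIGHT", "s0.tag+b0.tag", "VERB", "NOUN"): 1.5,   # hypothetical weights
         ("ARC-RIGHT", "distance", 1): 0.5,
         ("SHIFT", "s0.word", "like"): 0.25}
scores = {a: score(theta, features(a, stack, buffer, words, tags))
          for a in ["SHIFT", "ARC-LEFT", "ARC-RIGHT"]}
best = max(scores, key=scores.get)
```

Conjoining every feature with the action, as done here, is one simple way to let a single weight vector score all actions.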
5950 Parse actions can also be scored by neural networks. For example, Chen and Manning
5951 (2014) build a feedforward network in which the input layer consists of the concatenation
5952 of embeddings of several words and tags:
5953 • the top three words on the stack, and the first three words on the buffer;
5954 • the first and second leftmost and rightmost children (dependents) of the top two
5955 words on the stack;
• the leftmost and rightmost grandchildren of the top two words on the stack;
5957 • embeddings of the part-of-speech tags of these words.
$$c = (\sigma, \beta, A)$$
$$x(c, w) = [v_{w_{\sigma_1}}, v_{t_{\sigma_1}}, v_{w_{\sigma_2}}, v_{t_{\sigma_2}}, v_{w_{\sigma_3}}, v_{t_{\sigma_3}}, v_{w_{\beta_1}}, v_{t_{\beta_1}}, v_{w_{\beta_2}}, v_{t_{\beta_2}}, \ldots],$$
where $v_{w_{\sigma_1}}$ is the embedding of the first word on the stack, $v_{t_{\beta_2}}$ is the embedding of the
part-of-speech tag of the second word on the buffer, and so on. Given this base encoding
of the parser state, the score for the set of possible actions is computed through a feedforward
network,
where the vector z plays the same role as the features f(a, c, w), but is a learned representation.
Chen and Manning (2014) use a cubic elementwise activation function, $g(x) = x^3$,
so that the hidden layer models products across all triples of input features. The learning
algorithm updates the embeddings as well as the parameters of the feedforward network.
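The following sketch shows the shape of such a scorer: embeddings are concatenated into x, passed through one hidden layer with the cubic activation, and mapped to one score per action. The two-layer form, the weight matrices `W_hidden` and `W_out`, and the one-dimensional embeddings are all assumptions made for illustration; a real parser would use many more inputs and learned parameters.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def encode(word_ids, tag_ids, word_emb, tag_emb):
    """Concatenate word and tag embeddings: [v_w1, v_t1, v_w2, v_t2, ...]."""
    x = []
    for w, t in zip(word_ids, tag_ids):
        x.extend(word_emb[w])
        x.extend(tag_emb[t])
    return x

def score_actions(x, W_hidden, W_out):
    z = [h ** 3 for h in matvec(W_hidden, x)]  # cubic elementwise activation g(x) = x^3
    return matvec(W_out, z)                    # one score per action

word_emb = {"like": [1.0], "bagels": [2.0]}    # toy 1-dimensional embeddings
tag_emb = {"VERB": [0.0], "NOUN": [1.0]}
x = encode(["like", "bagels"], ["VERB", "NOUN"], word_emb, tag_emb)
W_hidden = [[1, 0, 1, 0],
            [0, 1, 0, -1]]
W_out = [[1, 0],   # SHIFT
         [0, 1],   # ARC-LEFT
         [1, 1]]   # ARC-RIGHT
scores = score_actions(x, W_hidden, W_out)
```

Because each hidden unit cubes a weighted sum of inputs, expanding the cube yields terms in every product of three input features, which is the sense in which the hidden layer models triples.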
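$$p(a \mid c, w) = \frac{\exp \Psi(a, c, w; \theta)}{\sum_{a' \in \mathcal{A}(c)} \exp \Psi(a', c, w; \theta)} \qquad [11.26]$$
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \sum_{t=1}^{|A^{(i)}|} \log p(a_t^{(i)} \mid c_t^{(i)}, w), \qquad [11.27]$$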
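As a small numerical illustration of Equations 11.26 and 11.27, the sketch below normalizes scores over the admissible actions at each step and sums the log-probabilities of the gold actions. The action scores are made-up values, and the representation of a derivation as a list of (gold action, score dictionary) pairs is an assumption of this example.

```python
import math

def action_log_probs(scores):
    """Log of the softmax in Equation 11.26, over the admissible actions only."""
    m = max(scores.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {a: s - log_z for a, s in scores.items()}

def log_likelihood(derivation):
    """Inner sum of Equation 11.27 for one sentence."""
    return sum(action_log_probs(scores)[gold] for gold, scores in derivation)

# Each step pairs the gold action with hypothetical scores for the admissible actions.
derivation = [("SHIFT", {"SHIFT": 1.0}),
              ("ARC-LEFT", {"SHIFT": 0.0, "ARC-LEFT": 0.0})]
ll = log_likelihood(derivation)
```

The first step contributes nothing, because only one action is admissible; the second contributes log 0.5, because two admissible actions are tied.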
where the denominator sums over the set of all possible action sequences, A(w).5
5
Andor et al. (2016) prove that the set of globally-normalized conditional distributions is a strict superset
of the set of locally-normalized conditional distributions, and that globally-normalized conditional models
are therefore strictly more expressive.
In the conditional random field model for sequence labeling (§ 7.5.3), it was possible to compute
this sum explicitly, using dynamic programming. In transition-based parsing, this is not
possible. However, the sum can be approximated using beam search,
$$\sum_{A' \in \mathbb{A}(w)} \exp \sum_{t=1}^{|A'|} \Psi(a'_t, c'_t, w) \approx \sum_{k=1}^{K} \exp \sum_{t=1}^{|A^{(k)}|} \Psi(a_t^{(k)}, c_t^{(k)}, w), \qquad [11.29]$$
where $A^{(k)}$ is an action sequence on a beam of size K. This gives rise to the following loss
function,
$$L(\theta) = -\sum_{t=1}^{|A^{(i)}|} \Psi(a_t^{(i)}, c_t^{(i)}, w) + \log \sum_{k=1}^{K} \exp \sum_{t=1}^{|A^{(k)}|} \Psi(a_t^{(k)}, c_t^{(k)}, w). \qquad [11.30]$$
5998 The derivatives of this loss involve expectations with respect to a probability distribution
5999 over action sequences on the beam.
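The loss in Equation 11.30 is easy to compute once the per-step scores are available. The sketch below is a direct transcription under the assumption that each action sequence is given as a list of its step scores Ψ; the numbers are invented, and the question of what to do when the gold sequence falls off the beam is not addressed here.

```python
import math

def beam_loss(gold_step_scores, beam_step_scores):
    """Negative gold score plus the log of the beam-approximated normalizer."""
    gold_total = sum(gold_step_scores)
    totals = [sum(seq) for seq in beam_step_scores]   # one total score per beam item
    m = max(totals)                                   # stabilize the log-sum-exp
    log_z = m + math.log(sum(math.exp(t - m) for t in totals))
    return -gold_total + log_z

gold = [1.0, 2.0]
loss_alone = beam_loss(gold, [gold])                 # the beam contains only the gold sequence
loss_tied = beam_loss(gold, [gold, [3.0, 0.0]])      # a competitor with the same total score
```

When the gold sequence is the only item on the beam the loss is zero, and a tied competitor raises it to log 2.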
6020 • As noted earlier, the gold dependency parse can be derived by multiple action se-
6021 quences. Rather than checking for the presence of a single oracle action sequence on
6022 the beam, we can check if the gold dependency parse is reachable from the current
6023 beam, using a dynamic oracle (Goldberg and Nivre, 2012).
Figure 11.8: Google n-grams results for the bigram write code and the dependency arc write
=> code (and their morphological variants)
6024 • By maximizing the score of the gold action sequence, we are training a decision
6025 function to find the correct action given the gold context. But in reality, the parser
6026 will make errors, and the parser is not trained to find the best action given a context
6027 that may not itself be optimal. This issue is addressed by various generalizations of
6028 incremental perceptron, known as learning to search (Daumé III et al., 2009). Some
6029 of these methods are discussed in chapter 15.
6031 Dependency parsing is used in many real-world applications: any time you want to know
6032 about pairs of words which might not be adjacent, you can use dependency arcs instead
6033 of regular expression search patterns. For example, you may want to match strings like
6034 delicious pastries, delicious French pastries, and the pastries are delicious.
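To make the contrast with regular expressions concrete, the sketch below searches hand-written dependency arcs for a link between delicious and pastries, in either direction. The parses in `PARSES` are assumptions written by hand in an approximately UD style, not the output of a parser.

```python
import re

# Each parse is a set of (head, dependent) arcs, written by hand for illustration.
PARSES = {
    "delicious pastries": {("pastries", "delicious")},
    "delicious French pastries": {("pastries", "delicious"), ("pastries", "French")},
    "the pastries are delicious": {("delicious", "pastries"), ("delicious", "are"),
                                   ("pastries", "the")},
    "delicious coffee and stale pastries": {("coffee", "delicious"), ("coffee", "pastries"),
                                            ("pastries", "stale"), ("pastries", "and")},
}

def has_arc_between(arcs, w1, w2):
    """True if the two words are directly linked, regardless of arc direction."""
    return (w1, w2) in arcs or (w2, w1) in arcs

dep_matches = {s: has_arc_between(arcs, "delicious", "pastries") for s, arcs in PARSES.items()}

# A loose surface pattern matches all four strings, including the false positive.
loose = re.compile(r"delicious.*pastries|pastries.*delicious")
regex_matches = {s: bool(loose.search(s)) for s in PARSES}
```

The dependency test accepts the first three strings and rejects the fourth, where delicious modifies coffee; the surface pattern cannot make that distinction.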
It is possible to search the Google n-grams corpus by dependency edges, finding the
trend in how often a dependency edge appears over time. For example, we might be interested
in knowing when people started talking about writing code, but we also want to match write
some code, write good code, write all the code, etc. The result of a search on the dependency
edge write → code is shown in Figure 11.8. This capability has been applied to research
in digital humanities, such as the analysis of gender in Shakespeare (Muralidharan and
Hearst, 2013).
A classic application of dependency parsing is relation extraction, which is described
in chapter 17. The goal of relation extraction is to identify entity pairs, such as
6042 which stand in some relation to each other (in this case, the relation is authorship). Such
6043 entity pairs are often referenced via consistent chains of dependency relations. Therefore,
6044 dependency paths are often a useful feature in supervised systems which learn to detect
6045 new instances of a relation, based on labeled examples of other instances of the same
6046 relation type (Culotta and Sorensen, 2004; Fundel et al., 2007; Mintz et al., 2009).
6047 Cui et al. (2005) show how dependency parsing can improve automated question an-
6048 swering. Suppose you receive the following query:
6049 (11.1) What percentage of the nation’s cheese does Wisconsin produce?
6051 (11.2) In Wisconsin, where farmers produce 28% of the nation’s cheese, . . .
6052 The location of Wisconsin in the surface form of this string makes it a poor match for the
6053 query. However, in the dependency graph, there is an edge from produce to Wisconsin in
6054 both the question and the potential answer, raising the likelihood that this span of text is
6055 relevant to the question.
6056 A final example comes from sentiment analysis. As discussed in chapter 4, the polarity
6057 of a sentence can be reversed by negation, e.g.
6058 (11.3) There is no reason at all to believe the polluters will suddenly become reasonable.
6059 By tracking the sentiment polarity through the dependency parse, we can better iden-
6060 tify the overall polarity of the sentence, determining when key sentiment words are re-
6061 versed (Wilson et al., 2005; Nakagawa et al., 2010).
6068 item currently on the beam. Another search algorithm for transition-based parsing is
6069 easy-first, which abandons the left-to-right traversal order, and adds the highest-scoring
6070 edges first, regardless of where they appear (Goldberg and Elhadad, 2010). Goldberg et al.
6071 (2013) note that although transition-based methods can be implemented in linear time in
the length of the input, naïve implementations of beam search will require quadratic time,
6073 due to the cost of copying each hypothesis when it is expanded on the beam. This issue
6074 can be addressed by using a more efficient data structure for the stack.
6075 Exercises
6076 1. The dependency structure 1 ← 2 → 3, with 2 as the root, can be obtained from more
6077 than one set of actions in arc-standard parsing. List both sets of actions that can
6078 obtain this parse.
2. Suppose you have a set of unlabeled arc scores ψ(i → j), where the score depends
only on the identity of the two words. The scores include ψ(ROOT → j).
• Assuming each word occurs only once in the sentence ((i ≠ j) ⇒ (w_i ≠ w_j)),
how would you construct a weighted lexicalized context-free grammar so that
the score of any projective dependency tree is equal to the score of some equivalent
derivation in the lexicalized context-free grammar?
• Verify that your method works for a simple example like they eat fish.
• How would you adapt your method to handle the case in which an individual word
may appear multiple times in the sentence?
6088 3. Provide the UD-style dependency parse for the sentence Xi-Lan eats shoots and leaves,
6089 assuming leaves is a verb. Provide arc-standard and arc-eager derivations for this
6090 dependency parse.
6092 Meaning
6093 Chapter 12
6095 The previous few chapters have focused on building systems that reconstruct the syntax
6096 of natural language — its structural organization — through tagging and parsing. But
6097 some of the most exciting and promising potential applications of language technology
6098 involve going beyond syntax to semantics — the underlying meaning of the text:
6099 • Answering questions, such as where is the nearest coffeeshop? or what is the middle name
6100 of the mother of the 44th President of the United States?.
6101 • Building a robot that can follow natural language instructions to execute tasks.
6102 • Translating a sentence from one language into another, while preserving the under-
6103 lying meaning.
6104 • Fact-checking an article by searching the web for contradictory evidence.
6105 • Logic-checking an argument by identifying contradictions, ambiguity, and unsup-
6106 ported assertions.
6107 Semantic analysis involves converting natural language into a meaning representa-
6108 tion. To be useful, a meaning representation must meet several criteria:
6109 • c1: it should be unambiguous: unlike natural language, there should be exactly one
6110 meaning per statement;
6111 • c2: it should provide a way to link language to external knowledge, observations,
6112 and actions;
6113 • c3: it should support computational inference, so that meanings can be combined
6114 to derive additional knowledge;
6115 • c4: it should be expressive enough to cover the full range of things that people talk
6116 about in natural language.
288 CHAPTER 12. LOGICAL SEMANTICS
6117 Much more than this can be said about the question of how best to represent knowledge
6118 for computation (e.g., Sowa, 2000), but this chapter will focus on these four criteria.
6120 The first criterion for a meaning representation is that statements in the representation
6121 should be unambiguous — they should have only one possible interpretation. Natural
6122 language does not have this property: as we saw in chapter 10, sentences like cats scratch
6123 people with claws have multiple interpretations.
6124 But what does it mean for a statement to be unambiguous? Programming languages
6125 provide a useful example: the output of a program is completely specified by the rules of
the language and the properties of the environment in which the program is run. For example,
the Python code 5 + 3 will have the output 8, as will the expressions (4*4)-(3*3)+1
and ((8)). This output is known as the denotation of the program, and can be written
as,
The denotations of these arithmetic expressions are determined by the meaning of the
constants (e.g., 5, 3) and the relations (e.g., +, *, (, )). Now let’s consider another snippet
of Python code, double(4). The denotation of this code could be ⟦double(4)⟧ = 8, or
it could be ⟦double(4)⟧ = 44; it depends on the meaning of double. This meaning
is defined in a world model M as an infinite set of pairs. We write the denotation with
respect to model M as ⟦·⟧_M, e.g., ⟦double⟧_M = {(0,0), (1,2), (2,4), . . .}. The world
model would also define the (infinite) list of constants, e.g., {0,1,2,...}. As long as the
denotation of string φ in model M can be computed unambiguously, the language can be
said to be unambiguous.
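The dependence of a denotation on the world model can be made concrete with a short sketch. Here a model is represented as a dictionary from names to Python objects, and `denotation` evaluates an expression with respect to it; both the representation and the two models `M1` and `M2` are assumptions of this example.

```python
def denotation(expression, model):
    """Evaluate an expression with respect to a world model (a dict of names)."""
    # eval is used only for illustration; never evaluate untrusted input this way.
    return eval(expression, {"__builtins__": {}}, dict(model))

M1 = {"double": lambda n: 2 * n}             # double as arithmetic doubling
M2 = {"double": lambda n: int(str(n) * 2)}   # double as repeating the digits

arithmetic = [denotation(e, {}) for e in ["5 + 3", "(4*4)-(3*3)+1", "((8))"]]
in_m1 = denotation("double(4)", M1)
in_m2 = denotation("double(4)", M2)
```

The three arithmetic expressions share one denotation in any model, while double(4) denotes 8 in `M1` and 44 in `M2`.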
This approach to meaning is known as model-theoretic semantics, and it addresses
not only criterion c1 (no ambiguity), but also c2 (connecting language to external knowledge,
observations, and actions). For example, we can connect a representation of the
meaning of a statement like the capital of Georgia with a world model that includes a knowledge
base of geographical facts, obtaining the denotation Atlanta. We might populate
a world model by applying an image analysis algorithm to Figure 12.1, and then use this
world model to evaluate propositions like a man is riding a moose. Another desirable property
of model-theoretic semantics is that when the facts change, the denotations change
too: the meaning representation of President of the USA would have a different denotation
in the model M_2014 than it would in M_2022.
Figure 12.1: A (doctored) image, which could be the basis for a world model
6156 Propositional symbols. Greek symbols like φ and ψ will be used to represent proposi-
6157 tions, which are statements that are either true or false. For example, φ may corre-
6158 spond to the proposition, bagels are delicious.
6159 Boolean operators. We can build up more complex propositional formulas from Boolean
6160 operators. These include:
6168 It is not strictly necessary to have all five Boolean operators: readers familiar with
6169 Boolean logic will know that it is possible to construct all other operators from either the
6170 NAND (not-and) or NOR (not-or) operators. Nonetheless, it is clearest to use all five
6171 operators. From the truth conditions for these operators, it is possible to define a number
6172 of “laws” for these Boolean operators, such as,
These laws can be combined to derive further equivalences, which can support logical
inferences. For example, suppose φ = The music is loud and ψ = Max can’t sleep. Then if
we are given,
6177 we can derive ψ (Max can’t sleep) by application of modus ponens, which is one of a
6178 set of inference rules that can be derived from more basic laws and used to manipulate
6179 propositional formulas. Automated theorem provers are capable of applying inference
6180 rules to a set of premises to derive desired propositions (Loveland, 2016).
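For propositional logic, the soundness of an inference rule can be checked by brute force over truth assignments. The sketch below does this for modus ponens and, for contrast, for the invalid pattern of affirming the consequent. The function names are our own; this is a truth-table check, not a theorem prover.

```python
from itertools import product

def implies(p, q):
    """Truth condition for the implication p => q."""
    return (not p) or q

def entails(premises, conclusion, n_vars):
    """True if every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False
    return True

# Modus ponens: from (phi => psi) and phi, derive psi.
modus_ponens = entails([lambda phi, psi: implies(phi, psi),
                        lambda phi, psi: phi],
                       lambda phi, psi: psi, n_vars=2)

# Affirming the consequent: from (phi => psi) and psi, phi does not follow.
affirming_consequent = entails([lambda phi, psi: implies(phi, psi),
                                lambda phi, psi: psi],
                               lambda phi, psi: phi, n_vars=2)
```

The second check fails on the assignment in which φ is false and ψ is true: Max may be unable to sleep for some reason other than loud music.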
6187 People are capable of making inferences from this sentence pair, but such inferences re-
6188 quire formal tools that are beyond propositional logic. To understand the relationship
6189 between the statement anyone is making noise and the statement Abigail is making noise, our
6190 meaning representation requires the additional machinery of first-order logic (FOL).
6191 In FOL, logical propositions can be constructed from relationships between entities.
6192 Specifically, FOL extends propositional logic with the following classes of terms:
Constants. These are elements that name individual entities in the model, such as MAX
and ABIGAIL. The denotation of each constant in a model M is an element in the
model, e.g., ⟦MAX⟧ = m and ⟦ABIGAIL⟧ = a.
Relations. Relations can be thought of as sets of entities, or sets of tuples. For example,
the relation CAN-SLEEP is defined as the set of entities who can sleep, and has the
denotation ⟦CAN-SLEEP⟧ = {a, m, . . .}. To test the truth value of the proposition
CAN-SLEEP(MAX), we ask whether ⟦MAX⟧ ∈ ⟦CAN-SLEEP⟧. Logical relations that are
defined over sets of entities are sometimes called properties.
Relations may also be defined over ordered tuples of entities. For example BROTHER(MAX, ABIGAIL)
expresses the proposition that MAX is the brother of ABIGAIL. The denotation of
such relations is a set of tuples, ⟦BROTHER⟧ = {(m,a), (x,y), . . .}. To test the
truth value of the proposition BROTHER(MAX, ABIGAIL), we ask whether the tuple
(⟦MAX⟧, ⟦ABIGAIL⟧) is in the denotation ⟦BROTHER⟧.
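These definitions translate directly into set membership tests. In the sketch below, a toy model maps constants to entities and relations to sets of tuples; properties are stored as sets of 1-tuples so that both kinds of relation can be tested by the same function. The model contents are illustrative assumptions.

```python
CONSTANTS = {"MAX": "m", "ABIGAIL": "a"}     # denotations of the constants
RELATIONS = {
    "CAN-SLEEP": {("a",), ("m",)},           # a property: a set of entities
    "BROTHER": {("m", "a")},                 # a binary relation: a set of ordered pairs
}

def holds(relation, *constants):
    """Test whether the tuple of denotations is in the denotation of the relation."""
    return tuple(CONSTANTS[c] for c in constants) in RELATIONS[relation]

max_can_sleep = holds("CAN-SLEEP", "MAX")
max_brother_of_abigail = holds("BROTHER", "MAX", "ABIGAIL")
abigail_brother_of_max = holds("BROTHER", "ABIGAIL", "MAX")
```

Because the tuples are ordered, BROTHER(MAX, ABIGAIL) holds in this model while BROTHER(ABIGAIL, MAX) does not.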
Using constants and relations, it is possible to express statements like Max can’t sleep
and Max is Abigail’s brother:
These statements can also be combined using Boolean operators, such as,
6206 This fragment of first-order logic permits only statements about specific entities. To
6207 support inferences about statements like If anyone is making noise, then Max can’t sleep,
6208 two more elements must be added to the meaning representation:
Variables. Variables are mechanisms for referring to entities that are not locally specified.
We can then write CAN-SLEEP(x) or BROTHER(x, ABIGAIL). In these cases, x is a free
variable, meaning that we have not committed to any particular assignment.
Quantifiers. Variables are bound by quantifiers. There are two quantifiers in first-order
logic.2
• The existential quantifier ∃, which indicates that there must be at least one entity
to which the variable can bind. For example, the statement ∃x MAKES-NOISE(x)
indicates that there is at least one entity for which MAKES-NOISE is true.
• The universal quantifier ∀, which indicates that the variable must be able to
bind to any entity in the model. For example, the statement,
The expressions ∃x and ∀x make x into a bound variable. A formula that contains
no free variables is a sentence.
Functions. Functions map from entities to entities, e.g., ⟦CAPITAL-OF(GEORGIA)⟧ = ⟦ATLANTA⟧.
With functions, it is convenient to add an equality operator, supporting statements
like,
∀x∃y MOTHER-OF(x) = DAUGHTER-OF(y). [12.4]
Note that MOTHER-OF is a functional analogue of the relation MOTHER, so that
MOTHER-OF(x) = y if MOTHER(x, y). Any logical formula that uses functions can be
rewritten using only relations and quantification. For example,
An important property of quantifiers is that the order can matter. Unfortunately, natu-
ral language is rarely clear about this! The issue is demonstrated by examples like everyone
speaks a language, which has the following interpretations:
6229 In the first case, y may refer to several different languages, while in the second case, there
6230 is a single y that is spoken by everyone.
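The effect of quantifier order can be checked in a small model. The sketch below evaluates the two scopings, written here as ∀x∃y SPEAKS(x, y) and ∃y∀x SPEAKS(x, y), over a toy domain; the people, languages, and SPEAKS relation are invented, and the formulas are an assumption about the general shape of the two readings.

```python
PEOPLE = {"alex", "brit"}
LANGUAGES = {"english", "french"}
SPEAKS = {("alex", "english"), ("brit", "french")}

# For every person there is some language that they speak (possibly different ones).
forall_exists = all(any((x, y) in SPEAKS for y in LANGUAGES) for x in PEOPLE)

# There is a single language that every person speaks.
exists_forall = any(all((x, y) in SPEAKS for x in PEOPLE) for y in LANGUAGES)
```

In this model the first reading is true and the second is false, so the two scopings are not equivalent.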
2
In first-order logic, it is possible to quantify only over entities. In second-order logic, it is possible to
quantify over properties, supporting statements like Butch has every property that a good boxer has (example
from Blackburn and Bos, 2005),
6238 The Boolean operators ∧, ∨, . . . provide ways to construct more complicated sentences,
6239 and the truth of such statements can be assessed based on the truth tables associated with
6240 these operators. The statement ∃xφ is true if there is some assignment of the variable x
6241 to an entity in the model such that φ is true; the statement ∀xφ is true if φ is true under
6242 all possible assignments of x. More formally, we would say that φ is satisfied under M,
6243 written as M |= φ.
6244 Truth conditional semantics allows us to define several other properties of sentences
6245 and pairs of sentences. Suppose that in every M under which φ is satisfied, another
6246 formula ψ is also satisfied; then φ entails ψ, which is also written as φ |= ψ. For example,
CAPITAL(GEORGIA, ATLANTA) |= ∃x CAPITAL(GEORGIA, x). [12.9]
6247 A statement that is satisfied under any model, such as φ ∨ ¬φ, is valid, written |= (φ ∨
6248 ¬φ). A statement that is not satisfied under any model, such as φ ∧ ¬φ, is unsatisfiable,
6249 or inconsistent. A model checker is a program that determines whether a sentence φ
6250 is satisfied in M. A model builder is a program that constructs a model in which φ
6251 is satisfied. The problems of checking for consistency and validity in first-order logic
6252 are undecidable, meaning that there is no algorithm that can automatically determine
6253 whether an FOL formula is valid or inconsistent.
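Although validity is undecidable in general, checking whether a sentence is satisfied in one given finite model is straightforward. The sketch below is a minimal model checker of this kind. The tuple encoding of formulas and the dictionary layout of the model are assumptions of this example.

```python
def satisfies(model, phi, g=None):
    """Return True if formula phi is satisfied in the finite model under assignment g."""
    g = g or {}
    op = phi[0]
    if op == "not":
        return not satisfies(model, phi[1], g)
    if op == "and":
        return satisfies(model, phi[1], g) and satisfies(model, phi[2], g)
    if op == "or":
        return satisfies(model, phi[1], g) or satisfies(model, phi[2], g)
    if op in ("exists", "forall"):
        # Try every assignment of the bound variable phi[1] to an entity in the domain.
        results = (satisfies(model, phi[2], {**g, phi[1]: e}) for e in model["domain"])
        return any(results) if op == "exists" else all(results)
    # Atomic formula: (RELATION, term, ...), where each term is a variable or a constant.
    args = tuple(g[t] if t in g else model["constants"][t] for t in phi[1:])
    return args in model["relations"][op]

M = {"domain": {"georgia", "atlanta"},
     "constants": {"GEORGIA": "georgia", "ATLANTA": "atlanta"},
     "relations": {"CAPITAL": {("georgia", "atlanta")}}}

phi = ("CAPITAL", "GEORGIA", "ATLANTA")
psi = ("exists", "x", ("CAPITAL", "GEORGIA", "x"))
```

Here both `phi` and `psi` are satisfied in `M`, in line with Equation 12.9. This confirms the entailment only for this one model; entailment proper quantifies over all models.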
Figure 12.2: The principle of compositionality requires that we identify meanings for the
constituents likes and likes Brit that will make it possible to compute the meaning for the
entire sentence.
6266 The previous section laid out a lot of formal machinery; the remainder of this chapter
6267 links these formalisms back to natural language. Given an English sentence like Alex likes
6268 Brit, how can we obtain the desired first-order logical representation, LIKES ( ALEX , BRIT )?
6269 This is the task of semantic parsing. Just as a syntactic parser is a function from a natu-
6270 ral language sentence to a syntactic structure such as a phrase structure tree, a semantic
6271 parser is a function from natural language to logical formulas.
6272 As in syntactic analysis, semantic parsing is difficult because the space of inputs and
6273 outputs is very large, and their interaction is complex. Our best hope is that, like syntactic
6274 parsing, semantic parsing can somehow be decomposed into simpler sub-problems. This
6275 idea, usually attributed to the German philosopher Gottlob Frege, is called the principle
6276 of compositionality: the meaning of a complex expression is a function of the meanings of
6277 that expression’s constituent parts. We will define these “constituent parts” as syntactic
6278 constituents: noun phrases and verb phrases. These constituents are combined using
6279 function application: if the syntactic parse contains the production x → y z, then the
6280 semantics of x, written x.sem, will be computed as a function of the semantics of the
6281 constituents, y.sem and z.sem.3 4
3
§ 9.3.2 briefly discusses Combinatory Categorial Grammar (CCG) as an alternative to a phrase-structure
analysis of syntax. CCG is argued to be particularly well-suited to semantic parsing (Hockenmaier and
Steedman, 2007), and is used in much of the contemporary work on machine learning for semantic parsing,
summarized in § 12.4.
4
The approach of algorithmically building up meaning representations from a series of operations on the
syntactic structure of a sentence is generally attributed to the philosopher Richard Montague, who published
a series of influential papers on the topic in the early 1970s (e.g., Montague, 1973).
This functional expression is the meaning of the verb phrase likes Brit; it takes a single
argument, and returns the result of substituting that argument for x in the expression
LIKES(x, BRIT). We write this substitution as,
6297 with the symbol “@” indicating function application. Function application in the lambda
6298 calculus is sometimes called β-reduction or β-conversion. The expression φ@ψ indicates
6299 a function application to be performed by β-reduction, and φ(ψ) indicates a function or
6300 predicate in the final logical form.
Equation 12.11 shows how to obtain the desired semantics for the sentence Alex likes
Brit: by applying the lambda expression λx.LIKES(x, BRIT) to the logical constant ALEX.
This rule of composition can be specified in a syntactic-semantic grammar, in which
syntactic productions are paired with semantic operations. For the syntactic production
S → NP VP, we have the semantic rule VP.sem@NP.sem.
The meaning of the transitive verb phrase likes Brit can also be obtained by function
application on its syntactic constituents. For the syntactic production VP → V NP, we
apply the semantic rule,
5
Formally, all first-order logic formulas are lambda expressions; in addition, if φ is a lambda expression,
then λx.φ is also a lambda expression. Readers who are familiar with functional programming will recognize
lambda expressions from their use in programming languages such as Lisp and Python.
Figure 12.3: Derivation of the semantic representation for Alex likes Brit in the grammar
G1.
Production        Semantics
S → NP VP         VP.sem@NP.sem
VP → Vt NP        Vt.sem@NP.sem
VP → Vi           Vi.sem
Vt → likes        λy.λx.LIKES(x, y)
Vi → sleeps       λx.SLEEPS(x)
NP → Alex         ALEX
NP → Brit         BRIT
6306 Thus, the meaning of the transitive verb likes is a lambda expression whose output is
6307 another lambda expression: it takes y as an argument to fill in one of the slots in the LIKES
6308 relation, and returns a lambda expression that is ready to take an argument to fill in the
6309 other slot.6
6310 Table 12.1 shows a minimal syntactic-semantic grammar fragment, G1 . The complete
6311 derivation of Alex likes Brit in G1 is shown in Figure 12.3. In addition to the transitive
6312 verb likes, the grammar also includes the intransitive verb sleeps; it should be clear how
6313 to derive the meaning of sentences like Alex sleeps. For verbs that can be either transitive
6314 or intransitive, such as eats, we would have two terminal productions, one for each sense
6315 (terminal productions are also called the lexical entries). Indeed, most of the grammar is
6316 in the lexicon (the terminal productions), since these productions select the basic units of
6317 the semantic interpretation.
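Since Python has lambda expressions, the derivation of Alex likes Brit in G1 can be mimicked directly. In this sketch, logical forms are represented as tuples, which is a choice made for this example; function application in Python plays the role of “@”.

```python
# Lexical entries from G1, with logical forms represented as tuples.
ALEX, BRIT = "ALEX", "BRIT"
likes = lambda y: lambda x: ("LIKES", x, y)   # Vt -> likes : λy.λx.LIKES(x, y)
sleeps = lambda x: ("SLEEPS", x)              # Vi -> sleeps : λx.SLEEPS(x)

# VP -> Vt NP : Vt.sem@NP.sem
vp = likes(BRIT)      # behaves like λx.LIKES(x, BRIT)

# S -> NP VP : VP.sem@NP.sem
s = vp(ALEX)

# The intransitive case: VP -> Vi, then S -> NP VP.
s_sleeps = sleeps(ALEX)
```

The transitive verb is curried: applying it to the object returns a function that is still waiting for the subject.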
6
This can be written in a few different ways. The notation λy, x.LIKES(x, y) is a somewhat informal way to
indicate a lambda expression that takes two arguments; this would be acceptable in functional programming.
Logicians (e.g., Carpenter, 1997) often prefer the more formal notation λy.λx.LIKES(x)(y), indicating that each
lambda expression takes exactly one argument.
Figure 12.4: Derivation of the semantic representation for A dog sleeps, in grammar G2
7
Conversely, the sentence Every dog sleeps would involve a universal quantifier, ∀x DOG(x) ⇒ SLEEPS(x).
The definite article the requires more consideration, since the dog must refer to some dog which is uniquely
identifiable, perhaps from contextual information external to the sentence. Carpenter (1997, pp. 96-100)
summarizes recent approaches to handling definite descriptions.
8
Carpenter (1997) offers an alternative treatment based on combinatory categorial grammar.
This is a lambda expression that takes two relations as arguments, P and Q. The relation
P is scoped to the outer lambda expression, so it will be provided by the immediately
adjacent noun, which in this case is DOG. Thus, the noun phrase a dog has the semantics,
6334 This is a lambda expression that is expecting another relation, Q, which will be provided
6335 by the verb phrase, SLEEPS. This gives the desired analysis, ∃xDOG(x) ∧ SLEEPS(x).9
6336 If noun phrases like a dog are interpreted as lambda expressions, then proper nouns
6337 like Alex must be treated in the same way. This is achieved by type-raising from con-
6338 stants to lambda expressions, x ⇒ λP.P (x). After type-raising, the semantics of Alex is
6339 λP.P (ALEX) — a lambda expression that expects a relation to tell us something about
6340 ALEX .10 Again, make sure you see how the analysis in Figure 12.4 can be applied to the
6341 sentence Alex sleeps.
6342 Direct objects are handled by applying the same type-raising operation to transitive
6343 verbs: the meaning of verbs such as likes is raised to,
As a result, we can keep the verb phrase production VP.sem = V.sem@NP.sem, knowing
that the direct object will provide the function P in Equation 12.19. To see how this works,
let’s analyze the verb phrase likes a dog. After uniquely relabeling each lambda variable,
we have,
VP.sem = V.sem@NP.sem
= (λP.λx.P(λy.LIKES(x, y)))@(λQ.∃z DOG(z) ∧ Q(z))
= λx.(λQ.∃z DOG(z) ∧ Q(z))@(λy.LIKES(x, y))
= λx.∃z DOG(z) ∧ (λy.LIKES(x, y))@z
= λx.∃z DOG(z) ∧ LIKES(x, z).
6344 These changes are summarized in the revised grammar G2 , shown in Table 12.2. Fig-
6345 ure 12.5 shows a derivation that involves a transitive verb, an indefinite noun phrase, and
6346 a proper noun.
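The type-raised analysis can also be traced in code. The sketch below builds logical forms as strings, using Python closures for the lambda expressions of G2; the fresh-variable counter stands in for α-conversion, and the helper names (`det_a`, `det_every`, `type_raise`) are our own.

```python
import itertools

_fresh = itertools.count()   # supplies unique variable names (α-conversion)

def det_a(P):                # DET -> a : λP.λQ.∃x P(x) ∧ Q(x)
    def noun_phrase(Q):
        v = f"x{next(_fresh)}"
        return f"∃{v} {P(v)} ∧ {Q(v)}"
    return noun_phrase

def det_every(P):            # DET -> every : λP.λQ.∀x (P(x) ⇒ Q(x))
    def noun_phrase(Q):
        v = f"x{next(_fresh)}"
        return f"∀{v} ({P(v)} ⇒ {Q(v)})"
    return noun_phrase

dog = lambda x: f"DOG({x})"                                   # NN -> dog
sleeps = lambda x: f"SLEEPS({x})"                             # Vi -> sleeps
likes = lambda P: lambda x: P(lambda y: f"LIKES({x}, {y})")   # Vt -> likes (type-raised)
type_raise = lambda c: lambda P: P(c)                         # NP -> NNP : λP.P(NNP.sem)

# S -> NP VP : NP.sem@VP.sem, and VP -> Vt NP : Vt.sem@NP.sem
s1 = det_a(dog)(likes(type_raise("ALEX")))    # A dog likes Alex
s2 = type_raise("ALEX")(likes(det_a(dog)))    # Alex likes a dog
s3 = det_every(dog)(sleeps)                   # Every dog sleeps
```

In `s2`, the object noun phrase receives λy.LIKES(ALEX, y) as its argument Q, which is the step shown in the derivation of likes a dog above.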
9
When applying β-reduction to arguments that are themselves lambda expressions, be sure to use unique
variable names to avoid confusion. For example, it is important to distinguish the x in the semantics for a
from the x in the semantics for likes. Variable names are abstractions, and can always be changed — this is
known as α-conversion. For example, λx.P (x) can be converted to λy.P (y), etc.
10
Compositional semantic analysis is often supported by type systems, which make it possible to check
whether a given function application is valid. The base types are entities e and truth values t. A property,
such as DOG, is a function from entities to truth values, so its type is written ⟨e, t⟩. A transitive verb has type
⟨e, ⟨e, t⟩⟩: after receiving the first entity (the direct object), it returns a function from entities to truth values,
which will be applied to the subject of the sentence. The type-raising operation x ⇒ λP.P(x) corresponds
to a change in type from e to ⟨⟨e, t⟩, t⟩: it expects a function from entities to truth values, and returns a truth
value.
Figure 12.5: Derivation of the semantic representation for A dog likes Alex.
Production        Semantics
S → NP VP         NP.sem@VP.sem
VP → Vt NP        Vt.sem@NP.sem
VP → Vi           Vi.sem
NP → DET NN       DET.sem@NN.sem
NP → NNP          λP.P(NNP.sem)
DET → a           λP.λQ.∃x P(x) ∧ Q(x)
DET → every       λP.λQ.∀x (P(x) ⇒ Q(x))
Vt → likes        λP.λx.P(λy.LIKES(x, y))
Vi → sleeps       λx.SLEEPS(x)
NN → dog          DOG
NNP → Alex        ALEX
NNP → Brit        BRIT
6355 which can range from complete derivations to much more limited training signals. We
6356 will begin with the case of complete supervision, and then consider how learning is still
6357 possible even when seemingly key information is missing.
6358 Datasets Early work on semantic parsing focused on natural language expressions of
6359 geographical database queries, such as What states border Texas. The GeoQuery dataset
of Zelle and Mooney (1996) was originally coded in Prolog, but has subsequently been
6361 expanded and converted into the SQL database query language by Popescu et al. (2003)
6362 and into first-order logic with lambda calculus by Zettlemoyer and Collins (2005), pro-
6363 viding logical forms like λx.STATE(x) ∧ BORDERS(x, TEXAS). Another early dataset con-
6364 sists of instructions for RoboCup robot soccer teams (Kate et al., 2005). More recent work
6365 has focused on broader domains, such as the Freebase database (Bollacker et al., 2008),
6366 for which queries have been annotated by Krishnamurthy and Mitchell (2012) and Cai
6367 and Yates (2013). Other recent datasets include child-directed speech (Kwiatkowski et al.,
6368 2012) and elementary school science exams (Krishnamurthy, 2016).
Figure 12.6: Derivation for gold semantic analysis of Alex eats shoots and leaves
Figure 12.7: Derivation for incorrect semantic analysis of Alex eats shoots and leaves
composes across the productions in the derivation, $f(w, z, y) = \sum_{t=1}^{T} f(w, z_t, y)$, where
$z_t$ indicates a single syntactic-semantic production. For example, we might have a feature
for the production S → NP VP : NP.sem@VP.sem, as well as for terminal productions
like NNP → Alex : ALEX. Under this decomposition, it is possible to compute scores
for each semantically-annotated subtree in the analysis of w, so that bottom-up parsing
algorithms like CKY (§ 10.1) can be applied to find the best-scoring semantic analysis.
6393 Figure 12.6 shows a derivation of the correct semantic analysis of the sentence Alex
6394 eats shoots and leaves, in a simplified grammar in which the plural noun phrases shoots
and leaves are interpreted as logical constants SHOOTS and LEAVES_n. Figure 12.7 shows a
6396 derivation of an incorrect analysis. Assuming one feature per production, the perceptron
6397 update is shown in Table 12.3. From this update, the parser would learn to prefer the
6398 noun interpretation of leaves over the verb interpretation. It would also learn to prefer
6399 noun phrase coordination over verb phrase coordination.
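The update described above can be sketched in a few lines of Python. This is a minimal illustration, assuming one count-valued feature per production. The production strings below are hypothetical stand-ins rather than the actual productions of Figures 12.6 and 12.7, and the function names are likewise assumptions.

```python
from collections import Counter

def production_features(derivation):
    """One count-valued feature per syntactic-semantic production."""
    return Counter(derivation)

def perceptron_update(theta, gold, predicted, lr=1.0):
    """theta += f(gold) - f(predicted); productions shared by both cancel out."""
    diff = production_features(gold)
    diff.subtract(production_features(predicted))
    for prod, delta in diff.items():
        if delta != 0:
            theta[prod] = theta.get(prod, 0.0) + lr * delta
    return theta

# Hypothetical productions, for illustration only.
gold = ["NNP -> Alex : ALEX",
        "NP -> leaves : LEAVES_n",
        "NP -> NP CC NP : coordination"]
predicted = ["NNP -> Alex : ALEX",
             "Vi -> leaves : lambda x.LEAVES(x)",
             "VP -> VP CC VP : coordination"]

theta = perceptron_update({}, gold, predicted)
for prod, weight in sorted(theta.items()):
    print(f"{weight:+.1f}  {prod}")
```

Productions that appear in both derivations receive no update, so only the weights of the disputed productions change: the noun reading and noun phrase coordination go up, the verb reading and verb phrase coordination go down.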
6400 While the update is explained in terms of the perceptron, it would be easy to replace
6401 the perceptron with a conditional random field. In this case, the online updates would be
6402 based on feature expectations, which can be computed using the inside-outside algorithm
6403 (§ 10.6).
Table 12.3: Perceptron update for analysis in Figure 12.6 (gold) and Figure 12.7 (predicted)
6405 which is the standard log-linear model, applied to the logical form y and the derivation
6406 z.
Since the derivation z unambiguously determines the logical form y, it may seem silly
to model the joint probability over y and z. However, since z is unknown, it can be
marginalized out,
$$p(y \mid w) = \sum_{z} p(y, z \mid w). \qquad [12.21]$$
The semantic parser can then select the logical form with the maximum log marginal
probability,
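$$\log \sum_{z} p(y, z \mid w) = \log \sum_{z} \frac{\exp(\theta \cdot f(w, z, y))}{\sum_{y', z'} \exp(\theta \cdot f(w, z', y'))} \qquad [12.22]$$
$$\propto \log \sum_{z} \exp(\theta \cdot f(w, z, y)) \qquad [12.23]$$
$$\geq \max_{z} \theta \cdot f(w, z, y). \qquad [12.24]$$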
6407 It is impossible to push the log term inside the sum over z, so our usual linear scoring
6408 function does not apply. We can recover this scoring function only in approximation, by
6409 taking the max (rather than the sum) over derivations z, which provides a lower bound.
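The lower bound can be checked numerically. The sketch below uses made-up derivation scores for a single logical form. It compares the log-sum-exp over derivations with the max, which differ by at most the log of the number of derivations.

```python
import math

def log_sum_exp(scores):
    """Numerically stable log of the sum of exponentiated scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

# Hypothetical scores theta . f(w, z, y) for three derivations z of one logical form y.
derivation_scores = [2.0, 1.5, -0.3]

marginal = log_sum_exp(derivation_scores)  # log sum_z exp(score)
best = max(derivation_scores)              # max_z score
gap = marginal - best                      # lies in [0, log(number of derivations)]

print(f"log-sum-exp = {marginal:.4f}, max = {best:.4f}, gap = {gap:.4f}")
```

The gap is small when one derivation dominates, which is when the max approximation is most accurate.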
11 An exception is the work of Ge and Mooney (2005), who annotate the meaning of each syntactic
constituent for several hundred sentences.
6410 This log-likelihood is not convex in θ, unlike the log-likelihood of a fully-observed condi-
6411 tional random field. This means that learning can give different results depending on the
6412 initialization.
The derivative of Equation 12.26 is,
$$\frac{\partial \ell_i}{\partial \theta} = \sum_{z} p(z \mid y, w; \theta) f(w, z, y) - \sum_{y', z'} p(y', z' \mid w; \theta) f(w, z', y') \qquad [12.27]$$
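As a sanity check on Equation 12.27, the following sketch computes the gradient over an explicitly enumerated candidate set and compares it with finite differences. It assumes that $\ell_i$ is the log marginal likelihood $\log \sum_z p(y, z \mid w; \theta)$, which is what the sign of the equation suggests. The candidates and feature values are invented for illustration.

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def score(theta, f):
    return sum(t * x for t, x in zip(theta, f))

def log_marginal(theta, candidates, y_gold):
    """log sum_z p(y_gold, z | w), with derivations listed explicitly."""
    num = log_sum_exp([score(theta, f) for y, f in candidates if y == y_gold])
    den = log_sum_exp([score(theta, f) for _, f in candidates])
    return num - den

def expected_features(theta, pairs):
    """Feature expectation under the log-linear distribution restricted to `pairs`."""
    log_z = log_sum_exp([score(theta, f) for _, f in pairs])
    exp_f = [0.0] * len(theta)
    for _, f in pairs:
        p = math.exp(score(theta, f) - log_z)
        for k, x in enumerate(f):
            exp_f[k] += p * x
    return exp_f

def gradient(theta, candidates, y_gold):
    """E[f | y_gold, w] - E[f | w], the two terms of Equation 12.27."""
    pos = expected_features(theta, [c for c in candidates if c[0] == y_gold])
    neg = expected_features(theta, candidates)
    return [a - b for a, b in zip(pos, neg)]

# Each candidate pairs a logical form with the feature vector of one derivation.
candidates = [("y1", [1.0, 0.0, 1.0]), ("y1", [0.0, 1.0, 1.0]),
              ("y2", [1.0, 1.0, 0.0]), ("y2", [0.0, 0.0, 2.0])]
theta = [0.3, -0.2, 0.5]
grad = gradient(theta, candidates, "y1")
print(grad)
```

In a real parser the two expectations would be computed with dynamic programming over the chart rather than by enumeration.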
tations for many natural language sentences. For example, in the geography domain, the
denotation of a question would be its answer (Clarke et al., 2010; Liang et al., 2013):
6434 Similarly, in a robotic control setting, the denotation of a command would be an action or
6435 sequence of actions (Artzi and Zettlemoyer, 2013). In both cases, the idea is to reward the
6436 semantic parser for choosing an analysis whose denotation is correct: the right answer to
6437 the question, or the right action.
Learning from logical forms was made possible by summing or maxing over deriva-
tions. This idea can be carried one step further, summing or maxing over all logical forms
with the correct denotation. Let $v_i(y) \in \{0, 1\}$ be a validation function, which assigns a
binary score indicating whether the denotation ⟦y⟧ for the text $w^{(i)}$ is correct. We can then
learn by maximizing a conditional-likelihood objective,
$$\ell^{(i)}(\theta) = \log \sum_{y} v_i(y) \times p(y \mid w; \theta) \qquad [12.29]$$
$$= \log \sum_{y} v_i(y) \times \sum_{z} p(y, z \mid w; \theta), \qquad [12.30]$$
which sums over all derivations z of all valid logical forms, $\{y : v_i(y) = 1\}$. This corresponds
to the log-probability that the semantic parser produces a logical form with a
valid denotation.
$$\frac{\partial \ell^{(i)}}{\partial \theta} = \sum_{y, z : v_i(y) = 1} p(y, z \mid w) f(w, z, y) - \sum_{y', z'} p(y', z' \mid w) f(w, z', y'), \qquad [12.31]$$
which is the usual difference in feature expectations. The positive term computes the
expected feature counts conditioned on the denotation being valid, while the second
term computes the expected feature counts according to the current model, without
regard to the ground truth. Large-margin learning formulations are also possible for this
6445 problem. For example, Artzi and Zettlemoyer (2013) generate a set of valid and invalid
6446 derivations, and then impose a constraint that all valid derivations should score higher
6447 than all invalid derivations. This constraint drives a perceptron-like learning rule.
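The denotation-based objective of Equations 12.29 and 12.30 can be illustrated with a small enumeration. In this sketch the candidate scores, logical form names, and denotations are all hypothetical. The `denotations` dictionary stands in for whatever executor maps a logical form to its answer.

```python
import math

def denotation_log_likelihood(scores, denotations, gold_denotation):
    """log sum_{y,z} v(y) p(y, z | w), where v(y) = 1 iff y has the gold denotation.

    scores: dict mapping (y, z) -> theta . f(w, z, y)
    denotations: dict mapping y -> denotation of y (hypothetical executor output)
    """
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    valid = [s for (y, _), s in scores.items() if denotations[y] == gold_denotation]
    if not valid:
        return float("-inf")  # no candidate produces the right answer
    return math.log(sum(math.exp(s) for s in valid)) - log_z

# Two distinct logical forms (A and B) share the correct denotation; C does not.
scores = {("A", "z1"): 1.0, ("A", "z2"): 0.5, ("B", "z1"): 0.2, ("C", "z1"): 1.2}
denotations = {"A": "answer-1", "B": "answer-1", "C": "answer-2"}

ll = denotation_log_likelihood(scores, denotations, "answer-1")
print(f"log-probability of a valid denotation: {ll:.4f}")
```

Because both A and B are rewarded, the learner has no signal to distinguish a correct logical form from a spurious one that happens to yield the same answer.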
6464 Exercises
6465 1. Derive the modus ponens inference rule, which states that if we know φ ⇒ ψ and
6466 φ, then ψ must be true. The derivation can be performed using the definition of the
6467 ⇒ operator and some of the laws provided in § 12.2.1, plus one additional identity:
6468 ⊥ ∨ φ = φ.
12 Videos are currently available at http://yoavartzi.com/tutorial/
13 http://yoavartzi.com/spf
14 https://github.com/percyliang/sempre
2. Convert the following examples into first-order logic, using the relations CAN-SLEEP,
MAKES-NOISE, and BROTHER.
6475 3. Extend the grammar fragment G1 to include the ditransitive verb teaches and the
6476 proper noun Swahili. Show how to derive the interpretation for the sentence Alex
teaches Brit Swahili, which should be TEACHES(ALEX, BRIT, SWAHILI). The grammar
6478 need not be in Chomsky Normal Form. For the ditransitive verb, use NP1 and NP2
6479 to indicate the two direct objects.
6480 4. Derive the semantic interpretation for the sentence Alex likes every dog, using gram-
6481 mar fragment G2 .
5. Extend the grammar fragment G2 to handle adjectives, so that the meaning of an
angry dog is λP.∃x DOG(x) ∧ ANGRY(x) ∧ P(x). Specifically, you should supply the
lexical entry for the adjective angry, and you should specify the syntactic-semantic
productions NP → DET NOM, NOM → JJ NOM, and NOM → NN.
6. Extend your answer to the previous question to cover copula constructions with
predicative adjectives, such as Alex is angry. The interpretation should be ANGRY(ALEX).
You should add a verb phrase production VP → V_cop JJ, and a terminal production
V_cop → is. Show why your grammar extensions result in the correct interpretation.
6490 7. In Figure 12.6 and Figure 12.7, we treat the plurals shoots and leaves as entities. Revise
6491 G2 so that the interpretation of Alex eats leaves is ∀x.(LEAF(x) ⇒ EATS(ALEX, x)), and
6492 show the resulting perceptron update.
8. Statements like every student eats a pizza have two possible interpretations, depend-
ing on quantifier scope:
∀x∃y PIZZA(y) ∧ (STUDENT(x) ⇒ EATS(x, y)) [12.32]
∃y∀xPIZZA(y) ∧ (STUDENT(x) ⇒ EATS(x, y)) [12.33]
6493 Explain why these interpretations really are different, and modify the grammar G2
6494 so that it can produce both interpretations.
6495 9. Derive Equation 12.27.
6496 10. In the GeoQuery domain, give a natural language query that has multiple plausible
semantic interpretations with the same denotation. List both interpretations and the
6498 denotation.
6499 Hint: There are many ways to do this, but one approach involves using toponyms
6500 (place names) that could plausibly map to several different entities in the model.
6503 This chapter considers more “lightweight” semantic representations, which discard some
6504 aspects of first-order logic, but focus on predicate-argument structures. Let’s begin by
6505 thinking about the semantics of events, with a simple example:
6508 In this representation, we define variable x for the book, and we link the strings Asha and
Boyang to entities ASHA and BOYANG. Because the action of giving involves a giver, a
6510 recipient, and a gift, the predicate GIVE must take three arguments.
6511 Now suppose we have additional information about the event:
One possible solution is to extend the predicate GIVE to take additional arguments,
But this is clearly unsatisfactory: yesterday and reluctantly are optional arguments,
and we would need a different version of the GIVE predicate for every possible combination
of arguments. Event semantics solves this problem by reifying the event as an
existentially quantified variable e,
310 CHAPTER 13. PREDICATE-ARGUMENT SEMANTICS
6514 In this way, each argument of the event — the giver, the recipient, the gift — can be rep-
6515 resented with a relation of its own, linking the argument to the event e. The expression
GIVER(e, ASHA) says that ASHA plays the role of GIVER in the event. This reformulation
6517 handles the problem of optional information such as the time or manner of the event,
6518 which are called adjuncts. Unlike arguments, adjuncts are not a mandatory part of the
6519 relation, but under this representation, they can be expressed with additional logical rela-
6520 tions that are conjoined to the semantic interpretation of the sentence. 1
6521 The event semantic representation can be applied to nested clauses, e.g.,
6523 As with first-order logic, the goal of event semantics is to provide a representation that
6524 generalizes over many surface forms. Consider the following paraphrases of (13.1):
6529 All have the same event semantic meaning as Equation 13.1, but the ways in which the
6530 meaning can be expressed are diverse. The final example does not even include a verb:
6531 events are often introduced by verbs, but as shown by (13.7), the noun gift can introduce
6532 the same predicate, with the same accompanying arguments.
6533 Semantic role labeling (SRL) is a relaxed form of semantic parsing, in which each
6534 semantic role is filled by a set of tokens from the text itself. This is sometimes called
6535 “shallow semantics” because, unlike model-theoretic semantic parsing, role fillers need
6536 not be symbolic expressions with denotations in some world model. A semantic role
6537 labeling system is required to identify all predicates, and then specify the spans of text
6538 that fill each role. To give a sense of the task, here is a more complicated example:
6540 In this example, there are two predicates, expressed by the verbs want and give. Thus, a
6541 semantic role labeler might return the following output:
• (PREDICATE: wants, WANTER: Boyang, DESIRE: Asha to give him a linguistics book)
• (PREDICATE: give, GIVER: Asha, RECIPIENT: him, GIFT: a linguistics book)
6544 Boyang and him may refer to the same person, but the semantic role labeling is not re-
6545 quired to resolve this reference. Other predicate-argument representations, such as Ab-
6546 stract Meaning Representation (AMR), do require reference resolution. We will return to
6547 AMR in § 13.3, but first, let us further consider the definition of semantic roles.
6560 The respective roles of Asha, Boyang, and the book are nearly identical across the first
6561 two examples. The third example is slightly different, but the fourth example shows that
6562 the roles of GIVER and TEACHER can be viewed as related.
6563 One way to think about the relationship between roles such as GIVER and TEACHER is
6564 by enumerating the set of properties that an entity typically possesses when it fulfills these
6565 roles: givers and teachers are usually animate (they are alive and sentient) and volitional
6566 (they choose to enter into the action).2 In contrast, the thing that gets loaned or taught is
6567 usually not animate or volitional; furthermore, it is unchanged by the event.
6568 Building on these ideas, thematic roles generalize across predicates by leveraging the
6569 shared semantic properties of typical role fillers (Fillmore, 1968). For example, in exam-
6570 ples (13.9-13.12), Asha plays a similar role in all four sentences, which we will call the
2 There are always exceptions. For example, in the sentence The C programming language has taught me a
lot about perseverance, the “teacher” is the C programming language, which is presumably not animate or
volitional.
6571 agent. This reflects several shared semantic properties: she is the one who is actively and
6572 intentionally performing the action, while Boyang is a more passive participant; the book
6573 and the lesson would play a different role, as non-animate participants in the event.
6574 Example annotations from three well known systems are shown in Figure 13.1. We
6575 will now discuss these systems in more detail.
• AGENT: “ACTOR in an event who initiates and carries out the event intentionally or
consciously, and who exists independently of the event.”
• PATIENT: “UNDERGOER in an event that experiences a change of state, location or
condition, that is causally involved or directly affected by other participants, and
exists independently of the event.”
• RECIPIENT: “DESTINATION that is animate”
• THEME: “UNDERGOER that is central to an event or state that does not have control
over the way the event occurs, is not structurally changed by the event, and/or is
characterized as being in a certain position or condition throughout the state.”
• TOPIC: “THEME characterized by information content transferred to another participant.”
3
http://verbs.colorado.edu/verb-index/VerbNet_Guidelines.pdf
6591 VerbNet roles are organized in a hierarchy, so that a TOPIC is a type of THEME, which in
6592 turn is a type of UNDERGOER, which is a type of PARTICIPANT, the top-level category.
6593 In addition, VerbNet organizes verb senses into a class hierarchy, in which verb senses
6594 that have similar meanings are grouped together. Recall from § 4.2 that multiple meanings
6595 of the same word are called senses, and that WordNet identifies senses for many English
6596 words. VerbNet builds on WordNet, so that verb classes are identified by the WordNet
6597 senses of the verbs that they contain. For example, the verb class give-13.1 includes
6598 the first WordNet sense of loan and the second WordNet sense of lend.
Each VerbNet class or subclass takes a set of thematic roles. For example, give-13.1
takes arguments with the thematic roles of AGENT, THEME, and RECIPIENT;4 the predicate
TEACH takes arguments with the thematic roles AGENT, TOPIC, RECIPIENT, and
SOURCE.5 So according to VerbNet, Asha and Boyang play the roles of AGENT and RECIPIENT
in the sentences,
The book and algebra are both THEMES, but algebra is a subcategory of THEME, namely a TOPIC,
because it consists of information content that is given to the receiver.
6616 Proto-Patient. Undergoes change of state; causally affected by another participant; sta-
6617 tionary relative to the movement of another participant; does not exist indepen-
6618 dently of the event.6
4
https://verbs.colorado.edu/verb-index/vn/give-13.1.php
5
https://verbs.colorado.edu/verb-index/vn/transfer_mesg-37.1.1.php
6
Reisinger et al. (2015) ask crowd workers to annotate these properties directly, finding that annotators
tend to agree on the properties of each argument. They also find that in English, arguments having more
proto-agent properties tend to appear in subject position, while arguments with more proto-patient proper-
ties appear in object position.
6619 In the examples in Figure 13.1, Asha has most of the proto-agent properties: in giving
6620 the book to Boyang, she is acting volitionally (as opposed to Boyang got a book from Asha, in
6621 which it is not clear whether Asha gave up the book willingly); she is sentient; she causes
6622 a change of state in Boyang; she exists independently of the event. Boyang has some
6623 proto-agent properties: he is sentient and exists independently of the event. But he also
6624 some proto-patient properties: he is the one who is causally affected and who undergoes
6625 change of state. The book that Asha gives Boyang has even fewer of the proto-agent
6626 properties: it is not volitional or sentient, and it has no causal role. But it also lacks many
6627 of the proto-patient properties: it does not undergo change of state, exists independently
6628 of the event, and is not stationary.
6629 The Proposition Bank, or PropBank (Palmer et al., 2005), builds on this basic agent-
6630 patient distinction, as a middle ground between generic thematic roles and roles that are
specific to each predicate. Each verb is linked to a list of numbered arguments, with ARG0
as the proto-agent and ARG1 as the proto-patient. Additional numbered arguments are
verb-specific. For example, for the predicate TEACH,7 the arguments are:
Verbs may have any number of arguments: for example, WANT and GET have five, while
EAT has only ARG0 and ARG1. In addition to the semantic arguments found in the frame
6639 files, roughly a dozen general-purpose adjuncts may be used in combination with any
6640 verb. These are shown in Table 13.1.
6641 PropBank-style semantic role labeling is annotated over the entire Penn Treebank. This
6642 annotation includes the sense of each verbal predicate, as well as the argument spans.
Table 13.1: PropBank adjuncts (Palmer et al., 2005), sorted by frequency in the corpus
6652 of roughly 1000 frames, and a corpus of more than 200,000 “exemplar sentences,” in which
6653 the frames and their elements are annotated.8
6654 Rather than seeking to link semantic roles such as TEACHER and GIVER into the-
6655 matic roles such as AGENT, FrameNet aggressively groups verbs into frames, and links
6656 semantically-related roles across frames. For example, the following two sentences would
6657 be annotated identically in FrameNet:
This is because teach and learn are both lexical units in the EDUCATION_TEACHING frame.
6661 Furthermore, roles can be shared even when the frames are distinct, as in the following
6662 two examples:
The GIVING and GETTING frames both have RECIPIENT and THEME elements, so Boyang
and the book would play the same role. Asha’s role is different: she is the DONOR in the
GIVING frame, and the SOURCE in the GETTING frame. FrameNet makes extensive use of
multiple inheritance to share information across frames and frame elements: for example,
the COMMERCE_SELL and LENDING frames inherit from the GIVING frame.
8
Current details and data can be found at https://framenet.icsi.berkeley.edu/
(13.19) [ARG0 Asha] [GIVE.01 gave] [ARG2 Boyang’s mom] [ARG1 a book] [AM-TMP yesterday].
6675 Note that a single sentence may have multiple verbs, and therefore a given word may be
6676 part of multiple role-fillers:
6685 where,
• (i, j) indicates the span of a phrasal constituent $(w_{i+1}, w_{i+2}, \ldots, w_j)$;9
6687 • w represents the sentence as a sequence of tokens;
6688 • ρ is the index of the predicate verb in w;
6689 • τ is the structure of the phrasal constituent parse of w.
6690 Early work on semantic role labeling focused on discriminative feature-based models,
6691 where ψ(w, y, i, j, ρ, τ ) = θ · f (w, y, i, j, ρ, τ ). Table 13.2 shows the features used in a sem-
6692 inal paper on FrameNet semantic role labeling (Gildea and Jurafsky, 2002). By 2005 there
9 PropBank roles can also be filled by split constituents, which are discontinuous spans of text. This
situation arises most frequently in reported speech, e.g. [ARG1 By addressing these problems], Mr. Maxwell said,
[ARG1 the new funds have become extremely attractive.] (example adapted from Palmer et al., 2005). This issue
is typically addressed by defining “continuation arguments”, e.g. C-ARG1, which refers to the continuation
of ARG1 after the split.
Feature                       Description
Predicate lemma and POS tag   The lemma of the predicate verb and its part-of-speech tag
Voice                         Whether the predicate is in active or passive voice, as determined by a set of syntactic patterns for identifying passive voice constructions
Phrase type                   The constituent phrase type for the proposed argument in the parse tree, e.g. NP, PP
Headword and POS tag          The head word of the proposed argument and its POS tag, identified using the Collins (1997) rules
Position                      Whether the proposed argument comes before or after the predicate in the sentence
Syntactic path                The set of steps on the parse tree from the proposed argument to the predicate (described in detail in the text)
Subcategorization             The syntactic production from the first branching node above the predicate. For example, in Figure 13.2, the subcategorization feature around taught would be VP → VBD NP PP.
Table 13.2: Features used in semantic role labeling by Gildea and Jurafsky (2002).
6693 were several systems for PropBank semantic role labeling, and their approaches and fea-
6694 ture sets are summarized by Carreras and Màrquez (2005). Typical features include: the
6695 phrase type, head word, part-of-speech, boundaries, and neighbors of the proposed argu-
6696 ment wi+1:j ; the word, lemma, part-of-speech, and voice of the verb wρ (active or passive),
6697 as well as features relating to its frameset; the distance and path between the verb and
6698 the proposed argument. In this way, semantic role labeling systems are high-level “con-
6699 sumers” in the NLP stack, using features produced from lower-level components such as
6700 part-of-speech taggers and parsers. More comprehensive feature sets are enumerated by
6701 Das et al. (2014) and Täckström et al. (2015).
6702 A particularly powerful class of features relate to the syntactic path between the ar-
6703 gument and the predicate. These features capture the sequence of moves required to get
6704 from the argument to the verb by traversing the phrasal constituent parse of the sentence.
6705 The idea of these features is to capture syntactic regularities in how various arguments
6706 are realized. Syntactic path features are best illustrated by example, using the parse tree
6707 in Figure 13.2:
• The path from Asha to the verb taught is NNP↑NP↑S↓VP↓VBD. The first part of
the path, NNP↑NP↑S, means that we must travel up the parse tree from the NNP
tag (proper noun) to the S (sentence) constituent. The second part of the path,
S↓VP↓VBD, means that we reach the verb by producing a VP (verb phrase) from
the S constituent, and then by producing a VBD (past tense verb). This feature is
consistent with Asha being in subject position, since the path includes the sentence
root S.
• The path from the class to taught is NP↑VP↓VBD. This is consistent with the class
being in object position, since the path passes through the VP node that dominates
the verb taught.
Figure 13.2: Semantic role labeling on the phrase-structure parse tree for a sentence. The
dashed line indicates the syntactic path from Asha to the predicate verb taught.
6718 Because there are many possible path features, it can also be helpful to look at smaller
6719 parts: for example, the upward and downward parts can be treated as separate features;
6720 another feature might consider whether S appears anywhere in the path.
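The path feature can be computed by walking from the argument up to the lowest common ancestor and then down to the predicate. The sketch below is one way to do this, using nested tuples as a stand-in for a parse tree. The tree is an assumed analysis of Asha taught the class about algebra, chosen to be consistent with the paths and the subcategorization feature quoted above, and the function names are illustrative.

```python
def find_path_to(tree, target, path=()):
    """Return the nodes from the root down to `target`, matched by object identity."""
    path = path + (tree,)
    if tree is target:
        return path
    for child in tree[1:]:
        if isinstance(child, tuple):  # skip word strings at the leaves
            found = find_path_to(child, target, path)
            if found:
                return found
    return None

def syntactic_path(root, argument, predicate):
    up = find_path_to(root, argument)
    down = find_path_to(root, predicate)
    # Advance while the two root-to-node paths share the same ancestors.
    k = 0
    while k < min(len(up), len(down)) and up[k] is down[k]:
        k += 1
    lca = k - 1  # index of the lowest common ancestor
    up_labels = [node[0] for node in reversed(up[lca:])]   # argument up to the ancestor
    down_labels = [node[0] for node in down[lca + 1:]]      # below the ancestor to the verb
    return "↑".join(up_labels) + "".join("↓" + label for label in down_labels)

# Assumed parse of "Asha taught the class about algebra": (label, child, child, ...)
nnp = ("NNP", "Asha")
vbd = ("VBD", "taught")
obj = ("NP", ("DT", "the"), ("NN", "class"))
tree = ("S",
        ("NP", nnp),
        ("VP", vbd, obj, ("PP", ("IN", "about"), ("NP", ("NN", "algebra")))))

print(syntactic_path(tree, nnp, vbd))
print(syntactic_path(tree, obj, vbd))
```

Splitting the returned string at the first ↓ gives the separate upward and downward features mentioned above, and testing for "S" in the string gives the coarser indicator feature.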
6721 Rather than using the constituent parse, it is also possible to build features from the
6722 dependency path between the head word of each argument and the verb (Pradhan et al.,
6723 2005). Using the Universal Dependency part-of-speech tagset and dependency relations (Nivre
6724 et al., 2016), the dependency path from Asha to taught is P ROPN ← V ERB, because taught
N SUBJ
6725 is the head of a relation of type ← with Asha. Similarly, the dependency path from class
N SUBJ
6726 to taught is N OUN ← V ERB, because class heads the noun phrase that is a direct object of
DOBJ
6727 taught. A more interesting example is Asha wanted to teach the class, where the path from
6728 Asha to teach is P ROPN ← V ERB → V ERB. The right-facing arrow in second relation
N SUBJ X COMP
6729 indicates that wanted is the head of its X COMP relation with teach.
• For a given verb, there can be only one argument of each type (ARG0, ARG1, etc.)
6734 • Arguments cannot overlap. This problem arises when we are labeling the phrases
6735 in a constituent parse tree, as shown in Figure 13.2: if we label the PP about algebra
6736 as an argument or adjunct, then its children about and algebra must be labeled as ∅.
6737 The same constraint also applies to the syntactic ancestors of this phrase.
These constraints introduce dependencies across labeling decisions. In structure prediction
problems such as sequence labeling and parsing, such dependencies are usually
handled by defining a scoring function over the entire structure, y. Efficient inference requires
6741 that the global score decomposes into local parts: for example, in sequence labeling, the
6742 scoring function decomposes into scores of pairs of adjacent tags, permitting the applica-
6743 tion of the Viterbi algorithm for inference. But the constraints that arise in semantic role
6744 labeling are less amenable to local decomposition.10 We therefore consider constrained
6745 optimization as an alternative solution.
Let the set C(τ ) refer to all labelings that obey the constraints introduced by the parse
τ . The semantic role labeling problem can be reformulated as a constrained optimization
over y ∈ C(τ ),
$$\max_{y} \sum_{(i,j) \in \tau} \psi(w, y_{i,j}, i, j, \rho, \tau)$$
where $r \in R$ is a label in the set {ARG0, ARG1, ..., AM-LOC, ..., ∅}. Thus, the variables
z are a binarized version of the semantic role labeling y.
The objective can then be formulated as a linear function of z.
$$\sum_{(i,j) \in \tau} \psi(w, y_{i,j}, i, j, \rho, \tau) = \sum_{i,j,r} \psi(w, r, i, j, \rho, \tau) \times z_{i,j,r}, \qquad [13.7]$$
which is the sum of the scores of all relations, as indicated by $z_{i,j,r}$.
Recall that $z_{i,j,r} = 1$ iff the span $(i, j)$ has label $r$; this constraint says that for each possible
label $r \neq \varnothing$, there can be at most one $(i, j)$ such that $z_{i,j,r} = 1$. This constraint
can be written in the form $Az \leq b$, as you will find if you complete the exercises at the
end of the chapter.
Now consider the constraint that labels cannot overlap. Let’s define the convenience
function $o((i, j), (i', j')) = 1$ iff $(i, j)$ overlaps $(i', j')$, and zero otherwise. Thus, $o$ will
indicate if a constituent $(i', j')$ is either an ancestor or descendant of $(i, j)$. The constraint
is that if two constituents overlap, only one can have a non-null label:
$$\forall (i, j) \in \tau, \quad \sum_{(i', j') \in \tau} \sum_{r \neq \varnothing} o((i, j), (i', j')) \times z_{i', j', r} \leq 1, \qquad [13.9]$$
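To make the two constraints concrete, the sketch below finds the best feasible labeling by exhaustive enumeration rather than by calling an integer linear programming solver, which is practical only for a handful of spans. The spans, scores, and function names are hypothetical. The overlap test is a pairwise check in the spirit of the constraint above, not a literal transcription of it.

```python
from itertools import product

NULL = None  # stands in for the empty label

def overlaps(a, b):
    """True if distinct spans a = (i, j) and b = (i', j') share at least one token."""
    return a != b and a[0] < b[1] and b[0] < a[1]

def feasible(labeling):
    labels = [r for r in labeling.values() if r is not NULL]
    if len(labels) != len(set(labels)):  # at most one span per non-null label
        return False
    labeled = [s for s, r in labeling.items() if r is not NULL]
    return not any(overlaps(a, b) for a in labeled for b in labeled)

def constrained_argmax(scores, labels):
    """scores[(span, label)] holds psi for that pair; search all label assignments."""
    spans = sorted({span for span, _ in scores})
    best, best_score = None, float("-inf")
    for assignment in product(labels, repeat=len(spans)):
        labeling = dict(zip(spans, assignment))
        if not feasible(labeling):
            continue
        total = sum(scores[(s, r)] for s, r in labeling.items())
        if total > best_score:
            best, best_score = labeling, total
    return best, best_score

# Spans over "Asha taught the class about algebra"; (i, j) covers tokens i+1..j.
spans = [(0, 1), (2, 4), (4, 6), (5, 6)]
labels = [NULL, "ARG0", "ARG1", "ARG2"]
scores = {(s, r): (0.0 if r is NULL else -1.0) for s in spans for r in labels}
scores.update({((0, 1), "ARG0"): 2.0, ((2, 4), "ARG2"): 1.5,
               ((4, 6), "ARG1"): 1.0, ((5, 6), "ARG1"): 1.2})

best, total = constrained_argmax(scores, labels)
print(best, total)
```

Without the constraints, both (4, 6) and (5, 6) would be labeled ARG1, since each has a positive score on its own. With them, only the higher-scoring of the two overlapping spans keeps the label.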
6764 Learning with constraints Learning can be performed in the context of constrained op-
6765 timization using the usual perceptron or large-margin classification updates. Because
6766 constrained inference is generally more time-consuming, a key question is whether it is
6767 necessary to apply the constraints during learning. Chang et al. (2008) find that better per-
6768 formance can be obtained by learning without constraints, and then applying constraints
6769 only when using the trained model to predict semantic roles for unseen data.
6770 How important are the constraints? Das et al. (2014) find that an unconstrained, classification-
6771 based method performs nearly as well as constrained optimization for FrameNet parsing:
6772 while it commits many violations of the “no-overlap” constraint, the overall F1 score is
6773 less than one point worse than the score at the constrained optimum. Similar results
6774 were obtained for PropBank semantic role labeling by Punyakanok et al. (2008). He et al.
6775 (2017) find that constrained inference makes a bigger impact if the constraints are based
6776 on manually-labeled “gold” syntactic parses. This implies that errors from the syntac-
6777 tic parser may limit the effectiveness of the constraints. Punyakanok et al. (2008) hedge
6778 against parser error by including constituents from several different parsers; any con-
6779 stituent can be selected from any parse, and additional constraints ensure that overlap-
6780 ping constituents are not selected.
6781 Implementation Integer linear programming solvers such as glpk,11 cplex,12 and Gurobi13
6782 allow inequality constraints to be expressed directly in the problem definition, rather than
6783 in the matrix form Az ≤ b. The time complexity of integer linear programming is theoret-
6784 ically exponential in the number of variables |z|, but in practice these off-the-shelf solvers
6785 obtain good solutions efficiently. Das et al. (2014) report that the cplex solver requires 43
6786 seconds to perform inference on the FrameNet test set, which contains 4,458 predicates.
6787 Recent work has shown that many constrained optimization problems in natural lan-
6788 guage processing can be solved in a highly parallelized fashion, using optimization tech-
6789 niques such as dual decomposition, which are capable of exploiting the underlying prob-
6790 lem structure (Rush et al., 2010). Das et al. (2014) apply this technique to FrameNet se-
6791 mantic role labeling, obtaining an order-of-magnitude speedup over cplex.
B-ARG1; all remaining tokens in the span are inside, and are therefore labeled I-ARG1.
6797 Tokens outside any argument are labeled O. For example:
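As a sketch of this encoding, the function below converts labeled argument spans into one tag per token. The sentence and its role spans are illustrative assumptions, and the predicate itself is simply left as O here.

```python
def spans_to_bio(num_tokens, spans):
    """spans: list of (start, end, label), end exclusive, assumed non-overlapping."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the span
        for m in range(start + 1, end):
            tags[m] = "I-" + label          # remaining tokens inside the span
    return tags

tokens = ["Asha", "taught", "the", "class", "about", "algebra"]
spans = [(0, 1, "ARG0"), (2, 4, "ARG2"), (4, 6, "ARG1")]  # hypothetical annotation

tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because every token receives exactly one tag, the no-overlap constraint is satisfied by construction for a single predicate.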
Recurrent neural networks are a natural approach to this tagging task. For example,
Zhou and Xu (2015) apply a deep bidirectional multilayer LSTM (see § 7.6) to PropBank
semantic role labeling. In this model, each bidirectional LSTM serves as input for an-
other, higher-level bidirectional LSTM, allowing complex non-linear transformations of
the original input embeddings, $X = [x_1, x_2, \ldots, x_M]$. The hidden state of the final LSTM
is $Z^{(K)} = [z_1^{(K)}, z_2^{(K)}, \ldots, z_M^{(K)}]$. The “emission” score for each tag $Y_m = y$ is equal to the
inner product $\theta_y \cdot z_m^{(K)}$, and there is also a transition score for each pair of adjacent tags.
The complete model can be written,
6799 Note that the final step maximizes over the entire labeling y, and includes a score for
each tag transition $\psi_{y_{m-1}, y_m}$. This combination of LSTM and pairwise potentials on tags
6801 is an example of an LSTM-CRF. The maximization over y is performed by the Viterbi
6802 algorithm.
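The decoding step can be sketched independently of the LSTM. In the code below, the emission scores are made-up numbers standing in for the inner products $\theta_y \cdot z_m^{(K)}$, and the transition table forbids only the O to I-ARG1 transition. This is a minimal Viterbi implementation under those assumptions, not the configuration of any published system.

```python
def viterbi(emissions, transitions, tags):
    """emissions[m][y]: score of tag y at position m.
    transitions[(y_prev, y)]: score of moving from tag y_prev to tag y.
    Returns the highest-scoring tag sequence and its score."""
    best = [{y: emissions[0][y] for y in tags}]
    back = []
    for m in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in tags:
            prev = max(tags, key=lambda yp: best[-1][yp] + transitions[(yp, y)])
            scores[y] = best[-1][prev] + transitions[(prev, y)] + emissions[m][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    last = max(tags, key=lambda y: best[-1][y])
    path = [last]
    for pointers in reversed(back):  # follow back-pointers from the final tag
        path.append(pointers[path[-1]])
    return path[::-1], best[-1][last]

tags = ["O", "B-ARG1", "I-ARG1"]
transitions = {(a, b): 0.0 for a in tags for b in tags}
transitions[("O", "I-ARG1")] = float("-inf")  # an inside tag cannot follow O

emissions = [{"O": 1.0, "B-ARG1": 0.2, "I-ARG1": 0.0},
             {"O": 0.1, "B-ARG1": 0.5, "I-ARG1": 1.0},
             {"O": 0.3, "B-ARG1": 0.0, "I-ARG1": 0.9}]

path, score = viterbi(emissions, transitions, tags)
print(path, score)
```

Choosing the best tag at each position independently would give O, I-ARG1, I-ARG1, which violates the tagging scheme. The transition scores steer the joint maximization toward a well-formed sequence.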
6803 This model strongly outperformed alternative approaches at the time, including con-
6804 strained decoding and convolutional neural networks.14 More recent work has combined
6805 recurrent neural network models with constrained decoding, using the A∗ search algo-
6806 rithm to search over labelings that are feasible with respect to the constraints (He et al.,
6807 2017). This yields small improvements over the method of Zhou and Xu (2015). He et al.
6808 (2017) obtain larger improvements by creating an ensemble of SRL systems, each trained
6809 on an 80% subsample of the corpus. The average prediction across this ensemble is more
6810 robust than any individual model.
Left (PENMAN notation):
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-02
      :ARG0 b))
Right (graph view): nodes w / want-01, b / boy, and g / go-02, connected by edges labeled Arg0 and Arg1.
Figure 13.3: Two views of the AMR representation for the sentence The boy wants to go.
6818 The Abstract Meaning Representation (AMR) unifies this analysis into a graph struc-
6819 ture, in which each node is a variable, and each edge indicates a concept (Banarescu
6820 et al., 2013). This can be written in two ways, as shown in Figure 13.3. On the left is the
6821 PENMAN notation (Matthiessen and Bateman, 1991), in which each set of parentheses
6822 introduces a variable. Each variable is an instance of a concept, which is indicated with
6823 the slash notation: for example, w / want-01 indicates that the variable w is an instance
6824 of the concept want-01, which in turn refers to the PropBank frame for the first sense
6825 of the verb want. Relations are introduced with colons: for example, :ARG0 (b / boy)
6826 indicates a relation of type ARG 0 with the newly-introduced variable b. Variables can be
6827 reused, so that when the variable b appears again as an argument to g, it is understood to
6828 refer to the same boy in both cases. This arrangement is indicated compactly in the graph
6829 structure on the right, with edges indicating concepts.
6830 One way in which AMR differs from PropBank-style semantic role labeling is that it
reifies each entity as a variable: for example, the boy in (13.22) is reified in the variable
b, which is reused as ARG0 in its relationship with w / want-01, and as ARG1 in its
6833 relationship with g / go-02. Reifying entities as variables also makes it possible to
6834 represent the substructure of noun phrases more explicitly. For example, Asha borrowed
6835 the algebra book would be represented as:
6836 (b / borrow-01
6837 :ARG0 (p / person
6838 :name (n / name
6839 :op1 "Asha"))
6840 :ARG1 (b2 / book
6841 :topic (a / algebra)))
6842 This indicates that the variable p is a person, whose name is the variable n; that name
6843 has one token, the string Asha. Similarly, the variable b2 is a book, and the topic of b2
6844 is a variable a whose type is algebra. The relations name and topic are examples of
6845 non-core roles, which are similar to adjunct modifiers in PropBank. However, AMR’s
6846 inventory is more extensive, including more than 70 non-core roles, such as negation,
6847 time, manner, frequency, and location. Lists and sequences — such as the list of tokens in
6848 a name — are described using the roles op1, op2, etc.
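The bracketing conventions described above (parentheses introduce variables, slashes give the concept, colons introduce relations, and bare variables are re-entrancies) can be made concrete with a small reader. The following is a minimal sketch for a simplified subset of PENMAN notation; the function name parse_penman and the "instance" label for the slash relation are illustrative choices, not part of any AMR toolkit.

```python
import re

def parse_penman(text):
    """Read a simplified PENMAN string into (source, relation, target) triples."""
    # Tokens: parentheses, slash, :relation names, quoted strings, bare symbols.
    tokens = re.findall(r'\(|\)|/|:[^\s()]+|"[^"]*"|[^\s()/:]+', text)
    pos = 0
    triples = []

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(" and tokens[pos + 2] == "/"
        var, concept = tokens[pos + 1], tokens[pos + 3]
        pos += 4
        triples.append((var, "instance", concept))
        while tokens[pos] != ")":
            relation = tokens[pos][1:]      # drop the leading colon
            pos += 1
            if tokens[pos] == "(":          # a newly introduced variable
                target = parse_node()
            else:                           # a reused variable or a constant
                target = tokens[pos]
                pos += 1
            triples.append((var, relation, target))
        pos += 1                            # consume the closing parenthesis
        return var

    parse_node()
    return triples

amr = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
for triple in parse_penman(amr):
    print(triple)
```

Because b is reused rather than re-introduced, the triples (w, ARG0, b) and (g, ARG0, b) share a target, which is exactly what makes the structure a graph rather than a tree.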
6849 Another feature of AMR is that a semantic predicate can be introduced by any syntac-
6850 tic element, as in the following examples from Banarescu et al. (2013):
6855 (d / destroy-01
6856 :ARG0 (b / boy)
6857 :ARG1 (r / room))
6858 The noun destruction is linked to the verb destroy, which is captured by the PropBank
6859 frame destroy-01. This can happen with adjectives as well: in the phrase the attractive
6860 spy, the adjective attractive is linked to the PropBank frame attract-01:
6861 (s / spy
6862 :ARG0-of (a / attract-01))
6863 In this example, ARG0-of is an inverse relation, indicating that s is the ARG0 of the
6864 predicate a. Inverse relations make it possible for all AMR parses to have a single root
6865 concept, which should be the focus of the utterance.
6866 While AMR goes farther than semantic role labeling, it does not link semantically-
6867 related frames such as buy/sell (as FrameNet does), does not handle quantification (as
6868 first-order predicate calculus does), and makes no attempt to handle noun number and
6869 verb tense (as PropBank does). A recent survey by Abend and Rappoport (2017) situ-
6870 ates AMR with respect to several other semantic representation schemes. Other linguistic
6871 features of AMR are summarized in the original paper (Banarescu et al., 2013) and the
6872 tutorial slides by Schneider et al. (2015).
6890 Graph-based parsing One family of approaches to AMR parsing is similar to the graph-
6891 based methods that we encountered in syntactic dependency parsing (chapter 11). For
6892 these systems (Flanigan et al., 2014), parsing is a two-step process:
6893 1. Concept identification (Figure 13.4a). This involves constructing concept subgraphs
6894 for individual words or spans of adjacent words. For example, in the sentence,
6895 Asha likes algebra, we would hope to identify the minimal subtree including just the
6896 concept like-01 for the word like, and the subtree (p / person :name (n /
6897 name :op1 Asha)) for the word Asha.
6898 2. Relation identification (Figure 13.4b). This involves building a directed graph over
6899 the concepts, where the edges are labeled by the relation type. AMR imposes a
6900 number of constraints on the graph: all concepts must be included, the graph must
6901 be connected (there must be a path between every pair of nodes in the undirected
6902 version of the graph), and every node must have at most one outgoing edge of each
6903 type.
Both of these problems are solved by structured prediction. Concept identification requires
simultaneously segmenting the text into spans, and labeling each span with a graph
6906 fragment containing one or more concepts. This is done by computing a set of features
6907 for each candidate span s and concept labeling c, and then returning the labeling with the
6908 highest overall score.
Figure 13.4: Subtasks for Abstract Meaning Representation parsing, from Schneider et al. (2015).
6909 Relation identification can be formulated as search for the maximum spanning sub-
6910 graph, under a set of constraints. Each labeled edge has a score, which is computed
6911 from features of the concepts. We then search for the set of labeled edges that maximizes
6912 the sum of these scores, under the constraint that the resulting graph is a well-formed
6913 AMR (Flanigan et al., 2014). This constrained search can be performed by optimization
6914 techniques such as integer linear programming, as described in § 13.2.2.
6915 Transition-based parsing In many cases, AMR parses are structurally similar to syn-
6916 tactic dependency parses. Figure 13.5 shows one such example. This motivates an alter-
6917 native approach to AMR parsing: modify the syntactic dependency parse until it looks
6918 like a good AMR parse. Wang et al. (2015) propose a transition-based method, based on
6919 incremental modifications to the syntactic dependency tree (transition-based dependency
6920 parsing is discussed in § 11.3). At each step, the parser performs an action: for example,
6921 adding an AMR relation label to the current dependency edge, swapping the direction of
6922 a syntactic dependency edge, or cutting an edge and reattaching the orphaned subtree to
6923 a new parent. The overall system is trained as a classifier, learning to choose the action as
6924 would be given by an oracle that is capable of reproducing the ground-truth parse.
Figure 13.5: Syntactic dependency parse and AMR graph for the sentence The police want to arrest Michael Karras in Singapore (borrowed from Wang et al. (2015)).
6929 by linking questions to sentences in a corpus of text. Shen and Lapata (2007) perform
6930 FrameNet semantic role labeling on the query, and then construct a weighted bipartite
6931 graph15 between FrameNet semantic roles and the words and phrases in the sentence.
6932 This is done by first scoring all pairs of semantic roles and assignments, as shown in the
6933 top half of Figure 13.6. They then find the bipartite edge cover, which is the minimum
6934 weighted subset of edges such that each vertex has at least one edge, as shown in the
6935 bottom half of Figure 13.6. After analyzing the question in this manner, Shen and Lapata
6936 then find semantically-compatible sentences in the corpus, by performing graph matching
6937 on the bipartite graphs for the question and candidate answer sentences. Finally, the
6938 expected answer phrase in the question — typically the wh-word — is linked to a phrase in
6939 the candidate answer source, and that phrase is returned as the answer.
Relation extraction The task of relation extraction involves identifying pairs of entities
for which a given semantic relation holds (see § 17.2). For example, we might like to find
all pairs (i, j) such that i is the INVENTOR-OF j. PropBank semantic role labeling can
be applied to this task by identifying sentences whose verb signals the desired relation,
and then extracting ARG1 and ARG2 as arguments. (To fully solve this task, these arguments
must then be linked to entities, as described in chapter 17.) Christensen et al. (2010)
compare a semantic role labeling system against a simpler approach based on surface patterns
(Banko et al., 2007). They find that the SRL system is considerably more accurate,
but that it is several orders of magnitude slower. Conversely, Barnickel et al. (2009) apply
SENNA, a convolutional neural network SRL system (Collobert and Weston, 2008), to the
task of identifying biomedical relations (e.g., which genes inhibit or activate each other).
15
A bipartite graph is one in which the vertices can be divided into two disjoint sets, and every edge
connects a vertex in one set to a vertex in the other.
Figure 13.6: FrameNet semantic role labeling is used in factoid question answering, by
aligning the semantic roles in the question (q) against those of sentences containing answer
candidates (ac). “EAP” is the expected answer phrase, replacing the word who in the
question. Figure reprinted from Shen and Lapata (2007).
Figure 13.7: Fragment of AMR knowledge network for entity linking. Figure reprinted
from Pan et al. (2015).
In comparison with a strong baseline that applies a set of rules to syntactic dependency
structures (Fundel et al., 2007), the SRL system is faster but less accurate. One possible
explanation for these divergent results is that Barnickel et al. compare against a baseline
which is carefully tuned for performance in a relatively narrow domain, while the system
of Banko et al. is designed to analyze text across the entire web.
6956 Entity linking Another core task in information extraction is to link mentions of entities
6957 (e.g., Republican candidates like Romney, Paul, and Johnson . . . ) to entities in a knowledge
base (e.g., LYNDON JOHNSON or GARY JOHNSON). This task, which is described in § 17.1,
6959 is often performed by examining nearby “collaborator” mentions — in this case, Romney
6960 and Paul. By jointly linking all such mentions, it is possible to arrive at a good overall
6961 solution. Pan et al. (2015) apply AMR to this problem. For each entity, they construct a
6962 knowledge network based on its semantic relations with other mentions within the same
6963 sentence. They then rerank a set of candidate entities, based on the overlap between
6964 the entity’s knowledge network and the semantic relations present in the sentence (Fig-
6965 ure 13.7).
6966 Exercises
6967 1. Write out an event semantic representation for the following sentences. You may
6968 make up your own predicates.
2. Find the PropBank framesets for share and hate at http://verbs.colorado.edu/propbank/framesets-english-aliases/, and rewrite your answers from the
previous question, using the thematic roles ARG0, ARG1, and ARG2.
3. Compute the syntactic path features for Abigail and Max in each of the example sentences
(13.26) and (13.28) in Question 1, with respect to the verb share. If you’re not
sure about the parse, you can try an online parser such as http://nlp.stanford.edu:8080/parser/.
4. Compute the dependency path features for Abigail and Max in each of the example
sentences (13.26) and (13.28) in Question 1, with respect to the verb share. Again, if
you’re not sure about the parse, you can try an online parser such as http://nlp.stanford.edu:8080/parser/.
As a hint, the dependency relation between share
and Max is OBL according to the Universal Dependency treebank (version 2).
6984 5. PropBank semantic role labeling includes reference arguments, such as,
The label R-AM-LOC indicates that the word which is a reference to The bed, which expresses
the location of the event. Reference arguments must have referents: the tag
R-AM-LOC can appear only when AM-LOC also appears in the sentence. Show how
to express this as a linear constraint, specifically for the tag R-AM-LOC. Be sure to
correctly handle the case in which neither AM-LOC nor R-AM-LOC appears in the
sentence.
6992 6. Explain how to express the constraints on semantic role labeling in Equation 13.8
6993 and Equation 13.9 in the general form Az ≥ b.
7010 For (13.32), recall that multi-token names are created using op1, op2, etc. You will
7011 need to consult Banarescu et al. (2013) for (13.34), and Schneider et al. (2015) for
7012 (13.35). You may assume that her refers to the girl in this example.
7013 10. Using an off-the-shelf PropBank SRL system,17 build a simplified question answer-
7014 ing system in the style of Shen and Lapata (2007). Specifically, your system should
7015 do the following:
17
At the time of writing, the following systems are available: SENNA (http://ronan.collobert.com/senna/),
Illinois Semantic Role Labeler (https://cogcomp.cs.illinois.edu/page/software_view/SRL),
and mate-tools (https://code.google.com/archive/p/mate-tools/).
7016 • For each document in a collection, it should apply the semantic role labeler,
7017 and should store the output as a tuple.
7018 • For a question, your system should again apply the semantic role labeler. If
7019 any of the roles are filled by a wh-pronoun, you should mark that role as the
7020 expected answer phrase (EAP).
7021 • To answer the question, search for a stored tuple which matches the question as
7022 well as possible (same predicate, no incompatible semantic roles, and as many
7023 matching roles as possible). Align the EAP against its role filler in the stored
7024 tuple, and return this as the answer.
7025 To evaluate your system, download a set of three news articles on the same topic,
7026 and write down five factoid questions that should be answerable from the arti-
7027 cles. See if your system can answer these questions correctly. (If this problem is
7028 assigned to an entire class, you can build a large-scale test set and compare various
7029 approaches.)
7033 A recurring theme in natural language processing is the complexity of the mapping from
7034 words to meaning. In chapter 4, we saw that a single word form, like bank, can have mul-
7035 tiple meanings; conversely, a single meaning may be created by multiple surface forms,
7036 a lexical semantic relationship known as synonymy. Despite this complex mapping be-
7037 tween words and meaning, natural language processing systems usually rely on words
7038 as the basic unit of analysis. This is especially true in semantics: the logical and frame
7039 semantic methods from the previous two chapters rely on hand-crafted lexicons that map
7040 from words to semantic predicates. But how can we analyze texts that contain words
7041 that we haven’t seen before? This chapter describes methods that learn representations
7042 of word meaning by analyzing unlabeled data, vastly improving the generalizability of
7043 natural language processing systems. The theory that makes it possible to acquire mean-
7044 ingful representations from unlabeled data is the distributional hypothesis.
334 CHAPTER 14. DISTRIBUTIONAL AND DISTRIBUTED SEMANTICS
Table 14.1: Distributional statistics for tezgüino and five related terms
7054 What other words fit into these contexts? How about: loud, motor oil, tortillas, choices,
7055 wine? Each row of Table 14.1 is a vector that summarizes the contextual properties for
7056 each word, with a value of one for contexts in which the word can appear, and a value of
7057 zero for contexts in which it cannot. Based on these vectors, we can conclude: wine is very
7058 similar to tezgüino; motor oil and tortillas are fairly similar to tezgüino; loud is completely
7059 different.
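One way to make "similar" precise is cosine similarity between the binary context vectors. The sketch below uses illustrative values for the four contexts (the table body is not reproduced here, so the numbers are an assumption chosen to agree with the conclusions stated above).

```python
import math

def cosine(x, y):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0

# Illustrative binary context vectors (assumed values, one entry per context).
contexts = {
    "tezgüino":  [1, 1, 1, 1],
    "loud":      [0, 0, 0, 0],
    "motor oil": [1, 0, 0, 1],
    "tortillas": [0, 1, 0, 1],
    "choices":   [0, 1, 0, 0],
    "wine":      [1, 1, 1, 0],
}

target = contexts["tezgüino"]
similarity = {w: cosine(target, v) for w, v in contexts.items() if w != "tezgüino"}
for word, score in sorted(similarity.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.3f}")
```

With these values, wine ranks first, motor oil and tortillas tie below it, and loud receives a similarity of zero.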
7060 These vectors, which we will call word representations, describe the distributional
7061 properties of each word. Does vector similarity imply semantic similarity? This is the dis-
7062 tributional hypothesis, stated by Firth (1957) as: “You shall know a word by the company
7063 it keeps.” The distributional hypothesis has stood the test of time: distributional statistics
7064 are a core part of language technology today, because they make it possible to leverage
7065 large amounts of unlabeled data to learn about rare words that do not appear in labeled
7066 training data.
7067 Distributional statistics have a striking ability to capture lexical semantic relationships
7068 such as analogies. Figure 14.1 shows two examples, based on two-dimensional projections
7069 of distributional word embeddings, discussed later in this chapter. In each case, word-
7070 pair relationships correspond to regular linear patterns in this two dimensional space. No
7071 labeled data about the nature of these relationships was required to identify this underly-
7072 ing structure.
7073 Distributional semantics are computed from context statistics. Distributed seman-
7074 tics are a related but distinct idea: that meaning can be represented by numerical vectors
rather than symbolic structures. Distributed representations are often estimated from distributional
statistics, as in latent semantic analysis and word2vec, described later in this
7077 chapter. However, distributed representations can also be learned in a supervised fashion
7078 from labeled data, as in the neural classification models encountered in chapter 3.
Figure 14.1: Lexical semantic relationships have regular linear structures in two-dimensional
projections of distributional statistics. From http://nlp.stanford.edu/projects/glove/.
Table 14.2: Contexts for the word learns, according to various word representations. For
dependency context, (one, NSUBJ) means that there is a relation of type NSUBJ (nominal
subject) to the word one, and (moment, ACL−1 ) means that there is a relation of type ACL
(adjectival clause) from the word moment.
7099 by integrating unsupervised clustering with word embedding estimation (Huang and
7100 Yates, 2012; Li and Jurafsky, 2015). However, Arora et al. (2016) argue that it is unnec-
7101 essary to model distinct word senses explicitly, because the embeddings for each surface
7102 form are a linear combination of the embeddings of the underlying senses.
7121 can be viewed in terms of word similarity. Applying latent semantic analysis (§ 14.3) to
7122 contexts of size h = 2 and h = 30 yields the following nearest-neighbors for the word
7123 dog:1
7124 • (h = 2): cat, horse, fox, pet, rabbit, pig, animal, mongrel, sheep, pigeon
7125 • (h = 30): kennel, puppy, pet, bitch, terrier, rottweiler, canine, cat, to bark, Alsatian
7126 Which word list is better? Each word in the h = 2 list is an animal, reflecting the fact that
7127 locally, the word dog tends to appear in the same contexts as other animal types (e.g., pet
7128 the dog, feed the dog). In the h = 30 list, nearly everything is dog-related, including specific
7129 breeds such as rottweiler and Alsatian. The list also includes words that are not animals
7130 (kennel), and in one case (to bark), is not a noun at all. The 2-word context window is more
7131 sensitive to syntax, while the 30-word window is more sensitive to topic.
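The effect of the window size h can be seen directly in how co-occurrence counts are collected. The following sketch counts (word, context word) pairs within h tokens on either side; the toy sentence and the function name cooccurrence_counts are made up for illustration.

```python
from collections import Counter

def cooccurrence_counts(tokens, h):
    """Count (word, context word) pairs within a symmetric window of size h."""
    counts = Counter()
    for m, word in enumerate(tokens):
        start = max(0, m - h)
        end = min(len(tokens), m + h + 1)
        for n in range(start, end):
            if n != m:
                counts[(word, tokens[n])] += 1
    return counts

tokens = "we pet the dog and feed the dog in the kennel".split()
narrow = cooccurrence_counts(tokens, h=2)
wide = cooccurrence_counts(tokens, h=30)

print(narrow[("dog", "kennel")], wide[("dog", "kennel")])
```

With h = 2, dog never co-occurs with kennel in this sentence, so only its immediate syntactic neighbors are counted; with h = 30, the topically related word kennel enters the context of both occurrences of dog.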
Matrix factorization The matrix C = {count(i, j)} stores the co-occurrence counts of
word i and context j. Word representations can be obtained by approximately factoring
this matrix, so that count(i, j) is approximated by a function of a word embedding ui and
a context embedding vj . These embeddings can be obtained by minimizing the norm of
the reconstruction error,
1
The example is from lecture slides by Marco Baroni, Alessandro Lenci, and Stefan Evert, who applied
latent semantic analysis to the British National Corpus. You can find an online demo here: http://clic.
cimec.unitn.it/infomap-query/
where V is the size of the vocabulary, |C| is the number of contexts, and K is the size of the
resulting embeddings, which are set equal to the rows of the matrix U. The matrix S is
constrained to be diagonal (these diagonal elements are called the singular values), and
the columns of the product $SV^\top$ provide descriptions of the contexts. Each element $c_{i,j}$ is
then reconstructed as a bilinear product,
\[
c_{i,j} \approx \sum_{k=1}^{K} u_{i,k}\, s_k\, v_{j,k}. \tag{14.6}
\]
The objective is to minimize the sum of squared approximation errors. The orthonormality
constraints $U^\top U = V^\top V = I$ ensure that all pairs of dimensions in U and V are
uncorrelated, so that each dimension conveys unique information. Efficient implementations
of truncated singular value decomposition are available in numerical computing
packages such as scipy and matlab.2
Latent semantic analysis is most effective when the count matrix is transformed before
the application of SVD. One such transformation is pointwise mutual information (PMI;
Church and Hanks, 1990), which captures the degree of association between word i and
2
An important implementation detail is to represent C as a sparse matrix, so that the storage cost is equal
to the number of non-zero entries, rather than the size V × |C|.
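context j,
\begin{align}
\text{PMI}(i, j) &= \log \frac{p(i, j)}{p(i)\,p(j)} = \log \frac{p(i \mid j)\,p(j)}{p(i)\,p(j)} = \log \frac{p(i \mid j)}{p(i)} \tag{14.7}\\
&= \log \text{count}(i, j) - \log \sum_{i'=1}^{V} \text{count}(i', j) \tag{14.8}\\
&\quad - \log \sum_{j' \in \mathcal{C}} \text{count}(i, j') + \log \sum_{i'=1}^{V} \sum_{j' \in \mathcal{C}} \text{count}(i', j'). \tag{14.9}
\end{align}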
7160 The pointwise mutual information can be viewed as the logarithm of the ratio of the con-
7161 ditional probability of word i in context j to the marginal probability of word i in all
7162 contexts. When word i is statistically associated with context j, the ratio will be greater
7163 than one, so PMI(i, j) > 0. The PMI transformation focuses latent semantic analysis on re-
7164 constructing strong word-context associations, rather than on reconstructing large counts.
7165 The PMI is negative when a word and context occur together less often than if they
7166 were independent, but such negative correlations are unreliable because counts of rare
7167 events have high variance. Furthermore, the PMI is undefined when count(i, j) = 0. One
7168 solution to these problems is to use the Positive PMI (PPMI),
\[
\text{PPMI}(i, j) =
\begin{cases}
\text{PMI}(i, j), & p(i \mid j) > p(i) \\
0, & \text{otherwise.}
\end{cases} \tag{14.10}
\]
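The count-based form of PMI and the PPMI truncation translate directly into code. The sketch below stores only non-zero counts in a dictionary, so that pairs with count(i, j) = 0 are implicitly assigned a PPMI of zero; the toy counts and the function name ppmi are illustrative.

```python
import math
from collections import defaultdict

def ppmi(counts):
    """Map sparse {(word, context): count} to {(word, context): PPMI}."""
    total = sum(counts.values())
    word_totals = defaultdict(float)      # sum over contexts j'
    context_totals = defaultdict(float)   # sum over words i'
    for (i, j), c in counts.items():
        word_totals[i] += c
        context_totals[j] += c
    scores = {}
    for (i, j), c in counts.items():
        if c == 0:
            continue  # PMI is undefined here; PPMI treats the pair as zero
        pmi = (math.log(c) - math.log(context_totals[j])
               - math.log(word_totals[i]) + math.log(total))
        scores[(i, j)] = max(pmi, 0.0)    # keep only positive associations
    return scores

counts = {("wine", "drink"): 8, ("wine", "the"): 2,
          ("oil", "drink"): 1, ("oil", "the"): 9}
scores = ppmi(counts)
print(scores[("wine", "drink")], scores[("wine", "the")])
```

In this toy example, wine is positively associated with the context drink, while its association with the frequent context the is negative and is therefore clipped to zero.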
Bullinaria and Levy (2007) compare a range of matrix transformations for latent semantic
analysis, using a battery of tasks related to word meaning and word similarity
(for more on evaluation, see § 14.6). They find that PPMI-based latent semantic analysis
yields strong performance across these tasks: for example,
PPMI-based LSA vectors can be used to solve multiple-choice word similarity questions
from the Test of English as a Foreign Language (TOEFL), obtaining 85% accuracy.
Figure 14.2: Some subtrees produced by bottom-up Brown clustering (Miller et al., 2004)
on news text.
7184 A solution to this problem is hierarchical clustering: using the distributional statistics
7185 to induce a tree-structured representation. Fragments of Brown cluster trees are shown in
7186 Figure 14.2 and Table 14.3. Each word’s representation consists of a binary string describ-
7187 ing a path through the tree: 0 for taking the left branch, and 1 for taking the right branch.
7188 In the subtree in the upper right of the figure, the representation of the word conversation
7189 is 10; the representation of the word assessment is 0001. Bitstring prefixes capture similar-
7190 ity at varying levels of specificity, and it is common to use the first eight, twelve, sixteen,
7191 and twenty bits as features in tasks such as named entity recognition (Miller et al., 2004)
7192 and dependency parsing (Koo et al., 2008).
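Prefix features of this kind are simple to extract. The sketch below is one possible encoding, with hypothetical bitstrings and feature names; strings shorter than a given prefix length are used whole.

```python
def prefix_features(bitstring, lengths=(8, 12, 16, 20)):
    """Return one feature per prefix length; short strings are kept whole."""
    return {f"prefix{n}": bitstring[:n] for n in lengths}

def shared_prefix_length(a, b):
    """Depth of the deepest tree node that dominates both words."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth

# Hypothetical bitstrings for two words in the same region of the tree.
word_a = "0010110011101"
word_b = "0010110010"

print(prefix_features(word_a))
print(shared_prefix_length(word_a, word_b))
```

Two words that agree on a long prefix sit in the same small subtree, so a classifier can generalize between them through the shared short-prefix features even if one of the words never appeared in the labeled data.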
Hierarchical trees can be induced from a likelihood-based objective, using a discrete
latent variable $k_i \in \{1, 2, \ldots, K\}$ to represent the cluster of word i:
\begin{align}
\log p(w; k) &\approx \sum_{m=1}^{M} \log p(w_m \mid w_{m-1}; k) \tag{14.11}\\
&\triangleq \sum_{m=1}^{M} \log p(w_m \mid k_{w_m}) + \log p(k_{w_m} \mid k_{w_{m-1}}). \tag{14.12}
\end{align}
7193 This is similar to a hidden Markov model, with the crucial difference that each word can
Table 14.3: Fragment of a Brown clustering of Twitter data (Owoputi et al., 2013). Each
row is a leaf in the tree, showing the ten most frequent words. This part of the tree
emphasizes verbs of communicating and knowing, especially in the present partici-
ple. Each leaf node includes orthographic variants (thinking, thinkin, thinkn), semanti-
cally related terms (excited, thankful, grateful), and some outliers (5’2, +k). See http:
//www.cs.cmu.edu/˜ark/TweetNLP/cluster_viewer.html for more.
7195 where p(k1 , k2 ) is the joint probability of a bigram involving a word in cluster k1 followed
7196 by a word in k2 . This probability and the PMI are both computed from the co-occurrence
Algorithm 17 Exchange clustering algorithm. Assumes that words are sorted by frequency,
and that MAXMI finds the cluster pair whose merger maximizes the mutual information,
as defined in Equation 14.13.

procedure EXCHANGECLUSTERING({count(·, ·)}, K)
    for i ∈ 1 . . . K do                        ▷ Initialization
        k_i ← i, i = 1, 2, . . . , K
        for j ∈ 1 . . . K do
            c_{i,j} ← count(i, j)
    τ ← {(i)}_{i=1}^{K}
    for i ∈ {K + 1, K + 2, . . . , V} do        ▷ Iteratively add each word to the clustering
        τ ← τ ∪ (i)
        for k ∈ τ do
            c_{k,i} ← count(k, i)
            c_{i,k} ← count(i, k)
        î, ĵ ← MAXMI(C)
        τ, C ← MERGE(î, ĵ, C, τ)
    repeat                                      ▷ Merge the remaining clusters into a tree
        î, ĵ ← MAXMI(C, τ)
        τ, C ← MERGE(î, ĵ, C, τ)
    until |τ| = 1
    return τ

procedure MERGE(i, j, C, τ)
    τ ← τ \ i \ j ∪ (i, j)                      ▷ Merge the clusters in the tree
    for k ∈ τ do                                ▷ Aggregate the counts across the merged clusters
        c_{k,(i,j)} ← c_{k,i} + c_{k,j}
        c_{(i,j),k} ← c_{i,k} + c_{j,k}
    return τ, C
7197 counts between clusters. After each merger, the co-occurrence vectors for the merged
7198 clusters are simply added up, so that the next optimal merger can be found efficiently.
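The merge step and the quantity it is scored by can be sketched as follows. The mutual information is written here in the standard average mutual information form, $\sum_{k_1,k_2} p(k_1,k_2) \log \frac{p(k_1,k_2)}{p(k_1)p(k_2)}$, which is an assumption about the exact objective; the bigram counts and function names are hypothetical.

```python
import math
from collections import defaultdict

def mutual_information(counts):
    """Average mutual information between adjacent clusters."""
    total = sum(counts.values())
    left = defaultdict(float)    # marginal counts for the first position
    right = defaultdict(float)   # marginal counts for the second position
    for (k1, k2), c in counts.items():
        left[k1] += c
        right[k2] += c
    mi = 0.0
    for (k1, k2), c in counts.items():
        if c > 0:
            mi += (c / total) * math.log(c * total / (left[k1] * right[k2]))
    return mi

def merge(counts, a, b, merged):
    """Merge clusters a and b by adding up their co-occurrence counts."""
    out = defaultdict(float)
    for (k1, k2), c in counts.items():
        k1 = merged if k1 in (a, b) else k1
        k2 = merged if k2 in (a, b) else k2
        out[(k1, k2)] += c
    return dict(out)

# Hypothetical bigram counts, with every word initially in its own cluster.
counts = {("the", "dog"): 4, ("the", "cat"): 4, ("a", "dog"): 2, ("a", "cat"): 2,
          ("dog", "barks"): 3, ("cat", "meows"): 3}
merged_counts = merge(counts, "dog", "cat", "dog+cat")
print(mutual_information(counts), mutual_information(merged_counts))
```

A merger can never increase the mutual information, so the search looks for the pair whose merger loses the least. Recomputing the full sum for each candidate pair, as this sketch would, is what makes the naive bottom-up procedure expensive.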
This bottom-up procedure requires iterating over the entire vocabulary, and evaluating
$K_t^2$ possible mergers at each step, where $K_t$ is the current number of clusters at step t
of the algorithm. Furthermore, computing the score for each merger involves a sum over
$K_t^2$ clusters. The maximum number of clusters is $K_0 = V$, which occurs when every word
is in its own cluster at the beginning of the algorithm. The time complexity is thus $O(V^5)$.
7204 To avoid this complexity, practical implementations use a heuristic approximation
7205 called exchange clustering. The K most common words are placed in clusters of their
7206 own at the beginning of the process. We then consider the next most common word, and
Figure 14.3: The CBOW and skipgram variants of word2vec: (a) continuous bag-of-words (CBOW); (b) skipgram. The parameter U is the
matrix of word embeddings, and each $v_m$ is the context embedding for word $w_m$.
7207 merge it with one of the existing clusters. This continues until the entire vocabulary has
7208 been incorporated, at which point the K clusters are merged down to a single cluster,
7209 forming a tree. The algorithm never considers more than K + 1 clusters at any step, and
7210 the complexity is O(V K + V log V ), with the second term representing the cost of sorting
7211 the words at the beginning of the algorithm.
\[
\overline{v}_m = \frac{1}{2h} \sum_{n=1}^{h} \left( v_{w_{m+n}} + v_{w_{m-n}} \right). \tag{14.14}
\]
7228 Thus, CBOW is a bag-of-words model, because the order of the context words does not
7229 matter; it is continuous, because rather than conditioning on the words themselves, we
7230 condition on a continuous vector constructed from the word embeddings. The parameter
7231 h determines the neighborhood size, which Mikolov et al. (2013) set to h = 4.
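The CBOW model optimizes an approximation to the corpus log-likelihood,
\begin{align}
\log p(w) &\approx \sum_{m=1}^{M} \log p(w_m \mid w_{m-h}, w_{m-h+1}, \ldots, w_{m+h-1}, w_{m+h}) \tag{14.15}\\
&= \sum_{m=1}^{M} \log \frac{\exp\left(u_{w_m} \cdot \overline{v}_m\right)}{\sum_{j=1}^{V} \exp\left(u_j \cdot \overline{v}_m\right)} \tag{14.16}\\
&= \sum_{m=1}^{M} u_{w_m} \cdot \overline{v}_m - \log \sum_{j=1}^{V} \exp\left(u_j \cdot \overline{v}_m\right). \tag{14.17}
\end{align}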
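A single term of this objective can be computed as in the sketch below: average the context embeddings in the window, then take the log of the softmax over the vocabulary. The two-dimensional embeddings are made-up values, and skipping window positions that fall outside the text is an implementation choice rather than something specified above.

```python
import math

def average_context(V_ctx, tokens, m, h):
    """Average of the context embeddings within h tokens of position m."""
    dims = len(next(iter(V_ctx.values())))
    total = [0.0] * dims
    for n in range(1, h + 1):
        for idx in (m + n, m - n):
            if 0 <= idx < len(tokens):   # positions outside the text are skipped
                total = [t + v for t, v in zip(total, V_ctx[tokens[idx]])]
    return [t / (2 * h) for t in total]

def log_prob(U, word, v_bar):
    """log p(word | context): the word's score minus the log-sum-exp over all words."""
    scores = {j: sum(a * b for a, b in zip(u, v_bar)) for j, u in U.items()}
    top = max(scores.values())           # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return scores[word] - log_z

# Toy word embeddings U and context embeddings V_ctx (assumed values).
U = {"the": [0.1, 0.2], "whale": [0.5, -0.3], "swims": [-0.2, 0.4]}
V_ctx = {"the": [0.3, 0.1], "whale": [-0.1, 0.2], "swims": [0.2, 0.2]}
tokens = ["the", "whale", "swims"]

v_bar = average_context(V_ctx, tokens, m=1, h=1)
print(log_prob(U, "whale", v_bar))
```

The sum inside log_prob ranges over the whole vocabulary, which is the cost that hierarchical softmax and negative sampling are designed to avoid.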
7233 In the skipgram approximation, each word is generated multiple times; each time it is con-
7234 ditioned only on a single word. This makes it possible to avoid averaging the word vec-
7235 tors, as in the CBOW model. The local neighborhood size hm is randomly sampled from
7236 a uniform categorical distribution over the range {1, 2, . . . , hmax }; Mikolov et al. (2013) set
7237 hmax = 10. Because the neighborhood grows outward with h, this approach has the effect
7238 of weighting near neighbors more than distant ones. Skipgram performs better on most
7239 evaluations than CBOW (see § 14.6 for details of how to evaluate word representations),
7240 but CBOW is faster to train (Mikolov et al., 2013).
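The sampling of the neighborhood size can be sketched as follows. For each position, a window size is drawn uniformly from {1, ..., h_max}, and one (word, context word) training pair is emitted per neighbor within that window; the function name and the integer "tokens" are purely illustrative.

```python
import random
from collections import Counter

def skipgram_pairs(tokens, h_max, rng):
    """Emit (word, context word) pairs with a randomly sampled window per position."""
    pairs = []
    for m, word in enumerate(tokens):
        h_m = rng.randint(1, h_max)      # uniform over {1, ..., h_max}
        for n in range(1, h_m + 1):
            for idx in (m - n, m + n):
                if 0 <= idx < len(tokens):
                    pairs.append((word, tokens[idx]))
    return pairs

tokens = list(range(50))                 # stand-in tokens: position equals identity
pairs = skipgram_pairs(tokens, h_max=10, rng=random.Random(0))
by_distance = Counter(abs(w - c) for w, c in pairs)
print(sorted(by_distance.items()))
```

A neighbor at distance 1 is included for every sampled window size, while a neighbor at distance h_max is included only when the largest size is drawn, so the pair counts fall off with distance.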
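[Tree: the root node 0 has two children, the leaf Ahab (node 1) and the internal node 2; node 2 has two children, the leaves whale (node 3) and blubber (node 4). The leaf probabilities are p(Ahab) = σ(u_0 · v_c), p(whale) = σ(−u_0 · v_c) × σ(u_2 · v_c), and p(blubber) = σ(−u_0 · v_c) × σ(−u_2 · v_c).]
Figure 14.4: A fragment of a hierarchical softmax tree. The probability of each word is
computed as a product of probabilities of local branching decisions in the tree.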
7244 is quadratic in the size of the recurrent state vector. CBOW and skipgram avoid this
7245 computation, and incur only a linear time complexity in the size of the word and con-
7246 text representations. However, all three models compute a normalized probability over
7247 word tokens; a naı̈ve implementation of this probability requires summing over the entire
7248 vocabulary. The time complexity of this sum is O(V × K), which dominates all other com-
7249 putational costs. There are two solutions: hierarchical softmax, a tree-based computation
7250 that reduces the cost to a logarithm of the size of the vocabulary; and negative sampling,
7251 an approximation that eliminates the dependence on vocabulary size. Both methods are
7252 also applicable to RNN language models.
7259 logarithmic in the number of leaf nodes, and thus the number of multiplications is equal
7260 to O(log V ). The number of non-leaf nodes is equal to O(2V − 1), so the number of pa-
7261 rameters to be estimated increases by only a small multiple. The tree can be constructed
7262 using an incremental clustering procedure similar to hierarchical Brown clusters (Mnih
7263 and Hinton, 2008), or by using the Huffman (1952) encoding algorithm for lossless com-
7264 pression.
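The product-of-branching-decisions computation can be illustrated on a three-word tree with leaves Ahab, whale, and blubber: Ahab hangs directly off the root (node 0), and whale and blubber hang off a second internal node (node 2). The node and context vectors below are made-up values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Each word's path: (internal node, +1 for one branch, -1 for the other).
paths = {
    "Ahab":    [(0, +1)],
    "whale":   [(0, -1), (2, +1)],
    "blubber": [(0, -1), (2, -1)],
}
node_vectors = {0: [0.4, -0.2], 2: [-0.3, 0.8]}   # one vector per internal node

def word_probability(word, v_c):
    """Multiply the probabilities of the branching decisions on the word's path."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * dot(node_vectors[node], v_c))
    return p

v_c = [1.0, 0.5]
probs = {w: word_probability(w, v_c) for w in paths}
print(probs, sum(probs.values()))
```

Because σ(x) + σ(−x) = 1 at every internal node, the leaf probabilities sum to one without any explicit normalization over the vocabulary.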
where ψ(i, j) is the score for word i in context j, and $W_{neg}$ is the set of negative samples.
The objective is to maximize the sum over the corpus, $\sum_{m=1}^{M} \psi(w_m, c_m)$, where $w_m$ is
token m and $c_m$ is the associated context.
The set of negative samples $W_{neg}$ is obtained by sampling from a unigram language
model. Mikolov et al. (2013) construct this unigram language model by exponentiating
the empirical word probabilities, setting $\hat{p}(i) \propto (\text{count}(i))^{3/4}$. This has the effect of
redistributing probability mass from common to rare words. The number of negative samples
increases the time complexity of training by a constant factor. Mikolov et al. (2013) report
that 5-20 negative samples work for small training sets, and that two to five samples
suffice for larger corpora.
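The smoothed unigram distribution and the sampling step can be sketched in a few lines. The word counts are hypothetical, and the helper names are illustrative.

```python
import random

def negative_sampling_distribution(counts, power=0.75):
    """Unigram distribution with counts raised to the given power, then normalized."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}

def draw_negatives(dist, k, rng):
    """Draw k negative samples (with replacement) from the distribution."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=k)

counts = {"the": 1000, "whale": 10}       # hypothetical corpus counts
dist = negative_sampling_distribution(counts)
print(dist)
print(draw_negatives(dist, k=5, rng=random.Random(0)))
```

In this example the empirical probability of whale is about 0.01, while the smoothed distribution assigns it about 0.03, illustrating the shift of probability mass toward rare words.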
7287 Word embeddings are obtained by factoring this matrix with truncated singular value
7288 decomposition.
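GloVe (“global vectors”) is a closely related approach (Pennington et al., 2014), in
which the matrix to be factored is constructed from log co-occurrence counts, $M_{ij} = \log \text{count}(i, j)$.
The word embeddings are estimated by minimizing the sum of squares,
\begin{align}
\min_{u, v, b, \tilde{b}} \quad & \sum_{i=1}^{V} \sum_{j \in \mathcal{C}} f(M_{ij}) \left( \widehat{\log M_{ij}} - \log M_{ij} \right)^2 \\
\text{s.t.} \quad & \widehat{\log M_{ij}} = u_i \cdot v_j + b_i + \tilde{b}_j,
\end{align}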
7289 where bi and b̃j are offsets for word i and context j, which are estimated jointly with the
7290 embeddings u and v. The weighting function f (Mij ) is set to be zero at Mij = 0, thus
7291 avoiding the problem of taking the logarithm of zero counts; it saturates at Mij = mmax ,
7292 thus avoiding the problem of overcounting common word-context pairs. This heuristic
7293 turns out to be critical to the method’s performance.
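The weighting function can be sketched as below. The specific functional form, $(x / m_{max})^{\alpha}$ below the saturation point with $\alpha = 3/4$ and $m_{max} = 100$, follows the choice reported by Pennington et al. (2014) and should be treated as an assumption here; the argument x stands for the co-occurrence statistic being weighted, and the helper names are illustrative.

```python
import math

def glove_weight(x, m_max=100.0, alpha=0.75):
    """Zero at x = 0, increasing below m_max, and saturating at 1 from m_max on."""
    if x >= m_max:
        return 1.0
    return (x / m_max) ** alpha

def weighted_squared_error(count, u_i, v_j, b_i, b_j):
    """One term of the weighted least-squares objective for a word-context pair."""
    if count == 0:
        return 0.0                        # zero weight, so log(0) is never evaluated
    predicted = sum(a * b for a, b in zip(u_i, v_j)) + b_i + b_j
    return glove_weight(count) * (predicted - math.log(count)) ** 2

print([round(glove_weight(x), 3) for x in (0, 1, 10, 100, 1000)])
print(weighted_squared_error(10, [0.5, 0.1], [1.0, -0.2], 0.3, 0.2))
```

Because pairs with zero counts contribute nothing, the objective only needs to iterate over the non-zero entries of the co-occurrence matrix.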
The time complexity of sparse matrix reconstruction is determined by the number of
non-zero word-context counts. Pennington et al. (2014) show that this number grows
sublinearly with the size of the dataset: roughly $O(N^{0.8})$ for typical English corpora. In
contrast, the time complexity of word2vec is linear in the corpus size. Computing the
co-occurrence counts also requires linear time in the size of the corpus, but this operation can
easily be parallelized using MapReduce-style algorithms (Dean and Ghemawat, 2008).
Table 14.4: Subset of the WS-353 (Finkelstein et al., 2002) dataset of word similarity ratings
(examples from Faruqui et al. (2016)).
7309 For any embedding method, we can evaluate whether the cosine similarity of word em-
7310 beddings is correlated with human judgments of word similarity. The WS-353 dataset (Finkel-
7311 stein et al., 2002) includes similarity scores for 353 word pairs (Table 14.4). To test the
7312 accuracy of embeddings for rare and morphologically complex words, Luong et al. (2013)
7313 introduce a dataset of “rare words.” Outside of English, word similarity resources are
7314 limited, mainly consisting of translations of WS-353.
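A word similarity evaluation of this kind can be sketched in a few lines. The embeddings and ratings below are invented for illustration and are not taken from WS-353. Rank correlation is computed by hand, assuming no tied scores, to keep the example free of extra dependencies.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranks(x):
    """Rank positions 0..n-1 (assumes no ties, to keep the sketch short)."""
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(len(x))
    return r

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Hypothetical embeddings and human ratings (illustration only).
emb = {
    "cat":   np.array([1.0, 0.1, 0.0]),
    "dog":   np.array([0.9, 0.3, 0.0]),
    "car":   np.array([0.0, 1.0, 0.2]),
    "truck": np.array([0.1, 0.9, 0.3]),
}
rated_pairs = [("cat", "dog", 8.5), ("car", "truck", 8.0),
               ("cat", "car", 2.0), ("dog", "truck", 2.5)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
rho = spearman(model_scores, human_scores)
```

A value of `rho` near one indicates that the embedding similarities order the word pairs in nearly the same way as the human raters.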
7315 Word analogies (e.g., king:queen :: man:woman) have also been used to evaluate word
7316 embeddings (Mikolov et al., 2013). In this evaluation, the system is provided with the first
7317 three parts of the analogy (i_1 : j_1 :: i_2 : ?), and the final element is predicted by finding the
7318 word embedding most similar to u_{j_1} − u_{i_1} + u_{i_2}. Another evaluation tests whether word
7319 embeddings are related to broad lexical semantic categories called supersenses (Ciaramita
7320 and Johnson, 2003): verbs of motion, nouns that describe animals, nouns that describe
7321 body parts, and so on. These supersenses are annotated for English synsets in Word-
7322 Net (Fellbaum, 2010). This evaluation is implemented in the qvec metric, which tests
7323 whether the matrix of supersenses can be reconstructed from the matrix of word embed-
7324 dings (Tsvetkov et al., 2015).
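The analogy evaluation can be illustrated with a small sketch. For an analogy i1 : j1 :: i2 : ?, the code adds the offset from i1 to j1 onto i2 and returns the nearest remaining word by cosine similarity. The two-dimensional embeddings are made up, and excluding the three query words from the candidates is a common convention assumed here.

```python
import numpy as np

def solve_analogy(emb, i1, j1, i2):
    """Return the word most similar to u_j1 - u_i1 + u_i2."""
    target = emb[j1] - emb[i1] + emb[i2]
    best_word, best_score = None, -np.inf
    for word, vec in emb.items():
        if word in (i1, j1, i2):
            continue  # query words are excluded (assumed convention)
        score = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical embeddings: dimension 0 ~ royalty, dimension 1 ~ gender.
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}
answer = solve_analogy(emb, "king", "queen", "man")
```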
7325 Levy et al. (2015) compared several dense word representations for English, includ-
7326 ing latent semantic analysis, WORD2VEC, and GloVe, using six word similarity metrics
7327 and two analogy tasks. None of the embeddings outperformed the others on every task,
7328 but skipgrams were the most broadly competitive. Hyperparameter tuning played a key
7329 role: any method will perform badly if the wrong hyperparameters are used. Relevant
7330 hyperparameters include the embedding size, as well as algorithm-specific details such
7331 as the neighborhood size and the number of negative samples.
[Figure 14.5 diagram: the embedding u_{millicuries} is built from the morpheme embeddings u^{(M)}_{milli+}, u^{(M)}_{curie}, and u^{(M)}_{+s}, in two panels.]
Figure 14.5: Two architectures for building word embeddings from subword units. On the
left, morpheme embeddings u^{(M)} are combined by addition with the non-compositional
word embedding ũ (Botha and Blunsom, 2014). On the right, morpheme embeddings are
combined in a recursive neural network (Luong et al., 2013).
7372 becomes especially important to leverage other sources of information beyond distribu-
7373 tional statistics.
7380 millicuries This word has morphological structure (see § 9.1.2 for more on morphology):
7381 the prefix milli- indicates an amount, and the suffix -s indicates a plural. (A millicurie
7382 is a unit of radioactivity.)
7383 caesium This word is a single morpheme, but the characters -ium are often associated
7384 with chemical elements. (Caesium is the British spelling of a chemical element,
7385 spelled cesium in American English.)
7386 IAEA This term is an acronym, as suggested by the use of capitalization. The prefix I- fre-
7387 quently refers to international organizations, and the suffix -A often refers to agen-
7388 cies or associations. (IAEA is the International Atomic Energy Agency.)
7389 Zhezhgan This term is in title case, suggesting the name of a person or place, and the
7390 character bigram zh indicates that it is likely a transliteration. (Zhezhgan is a mining
7391 facility in Kazakhstan.)
4
https://code.google.com/archive/p/word2vec/, accessed September 20, 2017
7396 where u^{(M)}_m is a morpheme embedding and ũ_i is a non-compositional embedding of the
7397 whole word, which is an additional free parameter of the model (Figure 14.5, left side).
7398 All embeddings are estimated from a log-bilinear language model (Mnih and Hinton,
7399 2007), which is similar to the CBOW model (§ 14.5), but includes only contextual informa-
7400 tion from preceding words. The morphological segments are obtained using an unsuper-
7401 vised segmenter (Creutz and Lagus, 2007). For words that do not appear in the training
7402 data, the embedding can be constructed directly from the morphemes, assuming that each
7403 morpheme appears in some other word in the training data. The free parameter ũ adds
7404 flexibility: words with similar morphemes are encouraged to have similar embeddings,
7405 but this parameter makes it possible for them to be different.
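The additive composition described above can be sketched as follows: a word embedding is the sum of its morpheme embeddings plus, for words seen in training, a non-compositional whole-word vector. All names and vector values are hypothetical, and in the actual model the parameters would be learned from a language modeling objective rather than set by hand.

```python
import numpy as np

def compose_word(morphemes, morph_emb, word_emb=None, word=None):
    """Sum morpheme embeddings, plus the whole-word embedding if one exists."""
    dim = len(next(iter(morph_emb.values())))
    u = np.zeros(dim)
    for m in morphemes:
        u += morph_emb[m]
    if word_emb is not None and word in word_emb:
        u += word_emb[word]  # non-compositional term, available for seen words
    return u

# Hypothetical parameters (illustration only).
morph_emb = {"milli+": np.array([0.5, 0.0]),
             "curie":  np.array([0.0, 1.0]),
             "+s":     np.array([0.1, 0.1])}
word_emb = {"millicuries": np.array([0.05, -0.02])}

# A word seen in training uses both terms; an unseen word uses morphemes only.
seen = compose_word(["milli+", "curie", "+s"], morph_emb, word_emb, "millicuries")
unseen = compose_word(["milli+", "curie"], morph_emb, word_emb, "millicurie")
```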
7406 Word-internal structure can be incorporated into word representations in various other
7407 ways. Here are some of the main parameters.
7408 Subword units. Examples like IAEA and Zhezhgan are not based on morphological com-
7409 position, and a morphological segmenter is unlikely to identify meaningful sub-
7410 word units for these terms. Rather than using morphemes for subword embeddings,
7411 one can use characters (Santos and Zadrozny, 2014; Ling et al., 2015; Kim et al., 2016),
7412 character n-grams (Wieting et al., 2016; Bojanowski et al., 2017), and byte-pair en-
7413 codings, a compression technique which captures frequent substrings (Gage, 1994;
7414 Sennrich et al., 2016).
7415 Composition. Combining the subword embeddings by addition does not differentiate
7416 between orderings, nor does it identify any particular morpheme as the root. A
7417 range of more flexible compositional models have been considered, including re-
7418 currence (Ling et al., 2015), convolution (Santos and Zadrozny, 2014; Kim et al.,
7419 2016), and recursive neural networks (Luong et al., 2013), in which representa-
7420 tions of progressively larger units are constructed over a morphological parse, e.g.
7421 ((milli+curie)+s), ((in+flam)+able), (in+(vis+ible)). A recursive embedding model is
7422 shown in the right panel of Figure 14.5.
7423 Estimation. Estimating subword embeddings from a full dataset is computationally ex-
7424 pensive. An alternative approach is to train a subword model to match pre-trained
7425 word embeddings (Cotterell et al., 2016; Pinter et al., 2017). To train such a model, it
7426 is only necessary to iterate over the vocabulary, and not the corpus.
7428 where ûi is the pretrained embedding of word i, and L = {(i, j)} is a lexicon of word
7429 relations. The hyperparameter βij controls the importance of adjacent words having
7430 similar embeddings; Faruqui et al. (2015) set it to the inverse of the degree of word i,
7431 β_{ij} = |{j : (i, j) ∈ L}|^{−1}. Retrofitting improves performance on a range of intrinsic evalu-
7432 ations, and gives small improvements on an extrinsic document classification task.
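One way to picture retrofitting is as an iterative averaging procedure, in which each vector is repeatedly replaced by a weighted average of its pretrained value and its lexicon neighbors. The sketch below is an assumption-laden illustration in the spirit of Faruqui et al. (2015): the weight of 1 on the pretrained vector, the number of iterations, and the toy lexicon are all choices made here, not details given in the text.

```python
import numpy as np

def retrofit(pretrained, lexicon, n_iters=10):
    """Iteratively pull each vector toward its lexicon neighbors."""
    u = {w: v.copy() for w, v in pretrained.items()}
    for _ in range(n_iters):
        for i, neighbors in lexicon.items():
            neighbors = [j for j in neighbors if j in u]
            if i not in u or not neighbors:
                continue
            beta = 1.0 / len(neighbors)  # inverse degree of word i
            neighbor_sum = sum(beta * u[j] for j in neighbors)
            # Weighted average of the pretrained vector (weight 1, assumed)
            # and the current neighbor vectors (weight beta each).
            u[i] = (pretrained[i] + neighbor_sum) / (1.0 + beta * len(neighbors))
    return u

# Hypothetical pretrained vectors and synonym lexicon.
pretrained = {"happy": np.array([1.0, 0.0]),
              "glad":  np.array([0.0, 1.0]),
              "table": np.array([5.0, 5.0])}
lexicon = {"happy": ["glad"], "glad": ["happy"]}
retrofitted = retrofit(pretrained, lexicon)
```

Words linked in the lexicon move toward each other while staying anchored to their pretrained vectors; words absent from the lexicon are left unchanged.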
7454 (2013) identify such units and then treat them as words when estimating skipgram em-
7455 beddings, showing that the resulting embeddings perform reasonably on a task of solving
7456 phrasal analogies, e.g. New York : New York Times :: Baltimore : Baltimore Sun.
7458 To move beyond short multiword phrases, composition is necessary. A simple but sur-
7459 prisingly powerful approach is to represent a sentence with the average of its word em-
7460 beddings (Mitchell and Lapata, 2010). This can be considered a hybrid of the distribu-
7461 tional and compositional approaches to semantics: the word embeddings are computed
7462 distributionally, and then the sentence representation is computed by composition.
7463 The WORD2VEC approach can be stretched considerably further, embedding entire
7464 sentences using a model similar to skipgrams, in the “skip-thought” model of Kiros et al.
7465 (2015). Each sentence is encoded into a vector using a recurrent neural network: the encod-
7466 ing of sentence t is set to the RNN hidden state at its final token, h^{(t)}_{M_t}. This vector is then
7467 a parameter in a decoder model that is used to generate the previous and subsequent sen-
7468 tences: the decoder is another recurrent neural network, which takes the encoding of the
7469 neighboring sentence as an additional parameter in its recurrent update. (This encoder-
7470 decoder model is discussed at length in chapter 18.) The encoder and decoder are trained
7471 simultaneously from a likelihood-based objective, and the trained encoder can be used to
7472 compute a distributed representation of any sentence. Skip-thought can also be viewed
7473 as a hybrid of distributional and compositional approaches: the vector representation of
7474 each sentence is computed compositionally from the representations of the individual
7475 words, but the training objective is distributional, based on sentence co-occurrence across
7476 a corpus.
7477 Autoencoders are a variant of encoder-decoder models in which the decoder is trained
7478 to produce the same text that was originally encoded, using only the distributed encod-
7479 ing vector (Li et al., 2015). The encoding acts as a bottleneck, so that generalization is
7480 necessary if the model is to successfully fit the training data. In denoising autoencoders,
7481 the input is a corrupted version of the original sentence, and the auto-encoder must re-
7482 construct the uncorrupted original (Vincent et al., 2010; Hill et al., 2016). By interpolating
7483 between distributed representations of two sentences, αui +(1−α)uj , it is possible to gen-
7484 erate sentences that combine aspects of the two inputs, as shown in Figure 14.6 (Bowman
7485 et al., 2016).
7486 Autoencoders can also be applied to longer texts, such as paragraphs and documents.
7487 This enables applications such as question answering, which can be performed by match-
7488 ing the encoding of the question with encodings of candidate answers (Miao et al., 2016).
Figure 14.6: By interpolating between the distributed representations of two sentences (in
bold), it is possible to generate grammatical sentences that combine aspects of both (Bow-
man et al., 2016)
7490 Given a supervision signal, such as a label describing the sentiment or meaning of a sen-
7491 tence, a wide range of compositional methods can be applied to compute a distributed
7492 representation that then predicts the label. The simplest is to average the embeddings
7493 of each word in the sentence, and pass this average through a feedforward neural net-
7494 work (Iyyer et al., 2015). Convolutional and recurrent neural networks go further, with
7495 the ability to capture multiword phenomena such as negation (Kalchbrenner
7496 et al., 2014; Kim, 2014; Li et al., 2015; Tang et al., 2015). Another approach is to incorpo-
7497 rate the syntactic structure of the sentence into a recursive neural network, in which the
7498 representation for each syntactic constituent is computed from the representations of its
7499 children (Socher et al., 2012). However, in many cases, recurrent neural networks perform
7500 as well as or better than recursive networks (Li et al., 2015).
7501 Whether convolutional, recurrent, or recursive, a key question is whether supervised
7502 sentence representations are task-specific, or whether a single supervised sentence repre-
7503 sentation model can yield useful performance on other tasks. Wieting et al. (2015) train a
7504 variety of sentence embedding models for the task of labeling pairs of sentences as para-
7505 phrases. They show that the resulting sentence embeddings give good performance for
7506 sentiment analysis. The Stanford Natural Language Inference corpus classifies sentence
7507 pairs as entailments (the truth of sentence i implies the truth of sentence j), contradictions
7508 (the truth of sentence i implies the falsity of sentence j), and neutral (i neither entails nor
7509 contradicts j). Sentence embeddings trained on this dataset transfer to a wide range of
7510 classification tasks (Conneau et al., 2017).
7520 Symbolic representations are relatively brittle to this sort of variation, but are better
7521 suited to describe individual entities, the things that they do, and the things that are done
7522 to them. In examples (14.5)-(14.7), we not only know that somebody thanked someone
7523 else, but we can make a range of inferences about what has happened between the en-
7524 tities named Donald and Vlad. Because distributed representations do not treat entities
7525 symbolically, they lack the ability to reason about the roles played by entities across a sen-
7526 tence or larger discourse.5 A hybrid between distributed and symbolic representations
7527 might give the best of both worlds: robustness to the many different ways of describing
7528 the same event, plus the expressiveness to support inferences about entities and the roles
7529 that they play.
7530 A “top-down” hybrid approach is to begin with logical semantics (of the sort de-
7531 scribed in the previous two chapters), but replace the predefined lexicon with a set
7532 of distributional word clusters (Poon and Domingos, 2009; Lewis and Steedman, 2013). A
7533 “bottom-up” approach is to add minimal symbolic structure to existing distributed repre-
7534 sentations, such as vector representations for each entity (Ji and Eisenstein, 2015; Wiseman
7535 et al., 2016). This has been shown to improve performance on two problems that we will
7536 encounter in the following chapters: classification of discourse relations between adja-
7537 cent sentences (chapter 16; Ji and Eisenstein, 2015), and coreference resolution of entity
7538 mentions (chapter 15; Wiseman et al., 2016; Ji et al., 2017). Research on hybrid seman-
7539 tic representations is still in an early stage, and future representations may deviate more
7540 boldly from existing symbolic and distributional approaches.
7544 similarity-based evaluations of word embeddings, and present a novel evaluation that
7545 controls for word frequency. Baroni et al. (2014) address linguistic issues that arise in
7546 attempts to combine distributed and compositional representations.
7547 In bilingual and multilingual distributed representations, embeddings are estimated
7548 for translation pairs or tuples, such as (dog, perro, chien). These embeddings can improve
7549 machine translation (Zou et al., 2013; Klementiev et al., 2012), transfer natural language
7550 processing models across languages (Täckström et al., 2012), and make monolingual word
7551 embeddings more accurate (Faruqui and Dyer, 2014). A typical approach is to learn a pro-
7552 jection that maximizes the correlation of the distributed representations of each element
7553 in a translation pair, which can be obtained from a bilingual dictionary. Distributed rep-
7554 resentations can also be linked to perceptual information, such as image features. Bruni
7555 et al. (2014) use textual descriptions of images to obtain visual contextual information for
7556 various words, which supplements traditional distributional context. Image features can
7557 also be inserted as contextual information in log bilinear language models (Kiros et al.,
7558 2014), making it possible to automatically generate text descriptions of images.
7559 Exercises
7560 1. Prove that the sum of probabilities of paths through a hierarchical softmax tree is
7561 equal to one.
2. In skipgram word embeddings, the negative sampling objective can be written as,
$$L = \sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{C}} \text{count}(i, j) \, \psi(i, j),$$ [14.29]
7569 3. * In Brown clustering, prove that the cluster merge that maximizes the average mu-
7570 tual information (Equation 14.13) also maximizes the log-likelihood objective (Equa-
7571 tion 14.12).
7572 For the next two problems, download a set of pre-trained word embeddings, such as the
7573 WORD2VEC or polyglot embeddings.
7574 4. Use cosine similarity to find the most similar words to: dog, whale, before, however,
7575 fabricate.
7576 5. Use vector addition and subtraction to compute target vectors for the analogies be-
7577 low. After computing each target vector, find the top three candidates by cosine
7578 similarity.
7583 The remaining problems will require you to build a classifier and test its properties. Pick
7584 a multi-class text classification dataset, such as RCV1⁶. Divide your data into training
7585 (60%), development (20%), and test sets (20%), if no such division already exists.
7586 6. Train a convolutional neural network, with inputs set to pre-trained word embed-
7587 dings from the previous problem. Use a special, fine-tuned embedding for out-of-
7588 vocabulary words. Train until performance on the development set does not im-
7589 prove. You can also use the development set to tune the model architecture, such
7590 as the convolution width and depth. Report F-MEASURE and accuracy, as well as
7591 training time.
7592 7. Now modify your model from the previous problem to fine-tune the word embed-
7593 dings. Report F-MEASURE, accuracy, and training time.
7594 8. Try a simpler approach, in which word embeddings in the document are averaged,
7595 and then this average is passed through a feed-forward neural network. Again, use
7596 the development data to tune the model architecture. How close is the accuracy to
7597 the convolutional networks from the previous problems?
6
http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm
7600 References are one of the most noticeable forms of linguistic ambiguity, afflicting not just
7601 automated natural language processing systems, but also fluent human readers. Warn-
7602 ings to avoid “ambiguous pronouns” are ubiquitous in manuals and tutorials on writing
7603 style. But referential ambiguity is not limited to pronouns, as shown in the text in Fig-
7604 ure 15.1. Each of the bracketed substrings refers to an entity that is introduced earlier
7605 in the passage. These references include the pronouns he and his, but also the shortened
7606 name Cook, and nominals such as the firm and the firm’s biggest growth market.
7607 Reference resolution subsumes several subtasks. This chapter will focus on corefer-
7608 ence resolution, which is the task of grouping spans of text that refer to a single underly-
7609 ing entity, or, in some cases, a single event: for example, the spans Tim Cook, he, and Cook
7610 are all coreferent. These individual spans are called mentions, because they mention an
7611 entity; the entity is sometimes called the referent. Each mention has a set of antecedents,
7612 which are preceding mentions that are coreferent; for the first mention of an entity, the an-
7613 tecedent set is empty. The task of pronominal anaphora resolution requires identifying
7614 only the antecedents of pronouns. In entity linking, references are resolved not to other
7615 spans of text, but to entities in a knowledge base. This task is discussed in chapter 17.
7616 Coreference resolution is a challenging problem for several reasons. Resolving differ-
7617 ent types of referring expressions requires different types of reasoning: the features and
7618 methods that are useful for resolving pronouns are different from those that are useful
7619 to resolve names and nominals. Coreference resolution involves not only linguistic rea-
7620 soning, but also world knowledge and pragmatics: you may not have known that China
7621 was Apple’s biggest growth market, but it is likely that you effortlessly resolved this ref-
7622 erence while reading the passage in Figure 15.1.1 A further challenge is that coreference
7623 resolution decisions are often entangled: each mention adds information about the entity,
7624 which affects other coreference decisions. This means that coreference resolution must
7625 be addressed as a structure prediction problem. But as we will see, there is no dynamic
7626 program that allows the space of coreference decisions to be searched efficiently.
1
This interpretation is based in part on the assumption that a cooperative author would not use the
expression the firm’s biggest growth market to refer to an entity not yet mentioned in the article (Grice, 1975).
Pragmatics is the discipline of linguistics concerned with the formalization of such assumptions (Huang,
2015).
360 CHAPTER 15. REFERENCE RESOLUTION
(15.1) [[Apple Inc] Chief Executive Tim Cook] has jetted into [China] for talks with
government officials as [he] seeks to clear up a pile of problems in [[the firm]
’s biggest growth market] ... [Cook] is on [his] first trip to [the country] since
taking over...
Figure 15.1: Running example (Yee and Jones, 2012). Coreferring entity mentions are
underlined and bracketed.
2
Pronouns whose referents come later are known as cataphora, as in this example from Márquez (1970):
(15.1) Many years later, as [he] faced the firing squad, [Colonel Aureliano Buendía] was to remember that
distant afternoon when his father took him to discover ice.
7639 parsing the text to identify all such noun phrases.3 Filtering heuristics can help to prune
7640 the search space to noun phrases that are likely to be coreferent (Lee et al., 2013; Durrett
7641 and Klein, 2013). In nested noun phrases, mentions are generally considered to be the
7642 largest unit with a given head word: thus, Apple Inc. Chief Executive Tim Cook would be
7643 included as a mention, but Tim Cook would not, since they share the same head word,
7644 Cook.
7648 (15.2) Tim Cook has jetted in for talks with officials as [he] seeks to clear up a pile of
7649 problems...
7650 The pronoun and possible antecedents have the following features:
7655 The SMASH method searches backwards from he, discarding officials and talks because they
7656 do not satisfy the agreement constraints.
7657 Another source of constraints comes from syntax — specifically, from the phrase struc-
7658 ture trees discussed in chapter 10. Consider a parse tree in which both x and y are phrasal
7659 constituents. The constituent x c-commands the constituent y iff the first branching node
7660 above x also dominates y. For example, in Figure 15.2a, Abigail c-commands her, because
7661 the first branching node above Abigail, S, also dominates her. Now, if x c-commands y,
7662 government and binding theory (Chomsky, 1982) states that y can refer to x only if it is
7663 a reflexive pronoun (e.g., herself ). Furthermore, if y is a reflexive pronoun, then its an-
7664 tecedent must c-command it. Thus, in Figure 15.2a, her cannot refer to Abigail; conversely,
7665 if we replace her with herself, then the reflexive pronoun must refer to Abigail, since this is
7666 the only candidate antecedent that c-commands it.
7667 Now consider the example shown in Figure 15.2b. Here, Abigail does not c-command
7668 her, but Abigail’s mom does. Thus, her can refer to Abigail — and we cannot use reflexive
3
In the OntoNotes coreference annotations, verbs can also be antecedents, if they are later referenced by
nominals (Pradhan et al., 2011):
(15.1) Sales of passenger cars [grew] 22%. [The strong growth] followed year-to-year increases.
[Figure 15.2 parse trees: (a) Abigail speaks ...; (b) Abigail’s mom speaks ...; (c) Abigail hopes ...]
Figure 15.2: In (a), Abigail c-commands her; in (b), Abigail does not c-command her, but
Abigail’s mom does; in (c), the scope of Abigail is limited by the S non-terminal, so that she
or her can bind to Abigail, but not both.
7669 herself in this context, unless we are talking about Abigail’s mom. However, her does not
7670 have to refer to Abigail. Finally, Figure 15.2c shows how these constraints are limited.
7671 In this case, the pronoun she can refer to Abigail, because the S non-terminal puts Abigail
7672 outside the domain of she. Similarly, her can also refer to Abigail. But she and her cannot be
7673 coreferent, because she c-commands her.
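The c-command relation defined above is straightforward to compute over a tree with parent pointers. The following sketch uses simplified trees that are assumptions made for illustration (including the preposition with); they need not match the trees in Figure 15.2. It also excludes pairs in which one node dominates the other, which is a standard refinement not stated in the text.

```python
class Node:
    """A phrase-structure tree node with a pointer to its parent."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def dominates(a, b):
    """True if a is a proper ancestor of b."""
    node = b.parent
    while node is not None:
        if node is a:
            return True
        node = node.parent
    return False

def c_commands(x, y):
    """True if the first branching node above x also dominates y."""
    if x is y or dominates(x, y) or dominates(y, x):
        return False  # assumed refinement: no c-command under dominance
    node = x.parent
    while node is not None and len(node.children) < 2:
        node = node.parent
    return node is not None and dominates(node, y)

# (a) Assumed tree for "Abigail speaks with her".
abigail_a, her_a = Node("NP:Abigail"), Node("NP:her")
tree_a = Node("S", [abigail_a,
                    Node("VP", [Node("V:speaks"),
                                Node("PP", [Node("P:with"), her_a])])])

# (b) Assumed tree for "Abigail's mom speaks with her".
abigail_b, her_b = Node("NP:Abigail's"), Node("NP:her")
mom_b = Node("NP", [abigail_b, Node("N:mom")])
tree_b = Node("S", [mom_b,
                    Node("VP", [Node("V:speaks"),
                                Node("PP", [Node("P:with"), her_b])])])
```

In tree (a), `c_commands(abigail_a, her_a)` holds, so the non-reflexive her cannot refer to Abigail. In tree (b), only the larger noun phrase `mom_b` c-commands `her_b`.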
7679 (15.3) The doctor found an old map in the captain’s chest. Jim found an even older map
7680 hidden on the shelf. [It] described an island.
7681 Readers are expected to prefer the older map as the referent for the pronoun it.
7682 However, subjects are often preferred over objects, and this can contradict the prefer-
7683 ence for recency when two candidate referents are in the same sentence. For example,
7684 (15.4) Asha loaned Mei a book on Spanish. [She] is always trying to help people.
7685 Here, we may prefer to link she to Asha rather than Mei, because of Asha’s position in the
7686 subject role of the preceding sentence. (Arguably, this preference would not be strong
7687 enough to select Asha if the second sentence were She is visiting Valencia next month.)
7688 A third heuristic is parallelism:
7689 (15.5) Asha loaned Mei a book on Spanish. Olya loaned [her] a book on Portuguese.
[Figure 15.3 parse tree; recoverable words: ... of ... when he moved it to London]
Figure 15.3: Left-to-right breadth-first tree traversal (Hobbs, 1978), indicating that the
search for an antecedent for it (NP1 ) would proceed in the following order: 536; the castle
in Camelot; the residence of the king; Camelot; the king. Hobbs (1978) proposes semantic con-
straints to eliminate 536 and the castle in Camelot as candidates, since they are unlikely to
be the direct object of the verb move.
7690 Here Mei is preferred as the referent for her, contradicting the preference for the subject
7691 Asha in the preceding sentence.
7692 The recency and subject role heuristics can be unified by traversing the document in
7693 a syntax-driven fashion (Hobbs, 1978): each preceding sentence is traversed breadth-first,
7694 left-to-right (Figure 15.3). This heuristic successfully handles (15.4): Asha is preferred as
7695 the referent for she because the subject NP is visited first. It also handles (15.3): the older
7696 map is preferred as the referent for it because the more recent sentence is visited first. (An
7697 alternative unification of recency and syntax is proposed by centering theory (Grosz et al.,
7698 1995), which is discussed in detail in chapter 16.)
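The traversal order can be made concrete with a small sketch. It covers only the part described above (preceding sentences, most recent first, each searched breadth-first and left-to-right for noun phrases) and omits the rest of the algorithm of Hobbs (1978), such as the search within the current sentence. The tuple-based tree encoding and the parse itself are assumptions for illustration.

```python
from collections import deque

def np_candidates(tree):
    """Yield NP subtrees of one sentence in breadth-first, left-to-right order."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):
            continue  # a word; nothing to expand
        label, children = node[0], node[1:]
        if label == "NP":
            yield node
        queue.extend(children)

def words(tree):
    """Collect the words under a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in words(child)]

def candidate_order(preceding_sentences):
    """Most recent sentence first; within a sentence, breadth-first."""
    for tree in reversed(preceding_sentences):
        for np_node in np_candidates(tree):
            yield " ".join(words(np_node))

# Assumed parse of "Asha loaned Mei a book on Spanish", as (label, children...).
sentence = ("S",
            ("NP", "Asha"),
            ("VP", ("VBD", "loaned"),
                   ("NP", "Mei"),
                   ("NP", ("NP", "a", "book"),
                          ("PP", ("IN", "on"), ("NP", "Spanish")))))
order = list(candidate_order([sentence]))
```

The subject Asha is the first candidate visited, ahead of Mei, which matches the subject preference discussed for example (15.4).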
7699 In early work on reference resolution, the number of heuristics was small enough that
7700 a set of numerical weights could be set by hand (Lappin and Leass, 1994). More recent
7701 work uses machine learning to quantify the importance of each of these factors. However,
7702 pronoun resolution cannot be completely solved by constraints and heuristics alone. This
7703 is shown by the classic example pair (Winograd, 1972):
7704 (15.6) The [city council] denied [the protesters] a permit because [they] advocated/feared
7705 violence.
7706 Without reasoning about the motivations of the city council and protesters, it is unlikely
7707 that any system could correctly resolve both versions of this example.
7711 (15.7) They told me that I was too ugly for show business, but I didn’t believe [it].
7712 (15.8) Asha saw Babak get angry, and I saw [it] too.
7713 (15.9) Asha said she worked in security. I suppose [that]’s one way to put it.
7714 These forms of reference are generally not annotated in large-scale coreference resolution
7715 datasets such as OntoNotes (Pradhan et al., 2011).
7716 Pronouns may also have generic referents:
7720 In the OntoNotes dataset, coreference is not annotated for generic referents, even in cases
7721 like these examples, in which the same generic entity is mentioned multiple times.
7722 Some pronouns do not refer to anything at all:
7726 How can we automatically distinguish these usages of it from referential pronouns?
7727 Consider the difference between the following two examples (Bergsma et al., 2008):
7730 In the second example, the pronoun it is non-referential. One way to see this is by substi-
7731 tuting another pronoun, like them, into these examples:
7734 The questionable grammaticality of the second example suggests that it is not referential.
7735 Bergsma et al. (2008) operationalize this idea by comparing distributional statistics for the
7736 n-grams around the word it, testing how often other pronouns or nouns appear in the
7737 same context. In cases where nouns and other pronouns are infrequent, the pronoun it is unlikely
7738 to be referential.
7744 (15.20) Apple Inc Chief Executive [Tim Cook] has jetted into China . . . [Cook] is on his
7745 first business trip to the country . . .
7746 A typical solution for proper noun coreference is to match the syntactic head words
7747 of the reference with the referent. In § 10.5.2, we saw that the head word of a phrase can
7748 be identified by applying head percolation rules to the phrasal parse tree; alternatively,
7749 the head can be identified as the root of the dependency subtree covering the name. For
7750 sequences of proper nouns, the head word will be the final token.
7751 There are a number of caveats to the practice of matching head words of proper nouns.
7752 • In the European tradition, family names tend to be more specific than given names,
7753 and family names usually come last. However, other traditions have other practices:
7754 for example, in Chinese names, the family name typically comes first; in Japanese,
7755 honorifics come after the name, as in Nobu-San (Mr. Nobu).
7756 • In organization names, the head word is often not the most informative, as in Georgia
7757 Tech and Virginia Tech. Similarly, Lebanon does not refer to the same entity as South-
7758 ern Lebanon, necessitating special rules for the specific case of geographical modi-
7759 fiers (Lee et al., 2011).
7760 • Proper nouns can be nested, as in [the CEO of [Microsoft]], resulting in head word
7761 match without coreference.
7762 Despite these difficulties, proper nouns are the easiest category of references to re-
7763 solve (Stoyanov et al., 2009). In machine learning systems, one solution is to include a
7764 range of matching features, including exact match, head match, and string inclusion. In
7765 addition to matching features, competitive systems (e.g., Bengtson and Roth, 2008) in-
7766 clude large lists, or gazetteers, of acronyms (e.g., the National Basketball Association/NBA),
7767 demonyms (e.g., the Israelis/Israel), and other aliases (e.g., the Georgia Institute of Technol-
7768 ogy/Georgia Tech).
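The three string-matching features mentioned above (exact match, head match, and string inclusion) might be computed as in the following sketch. The function names are hypothetical, and the head is taken to be the final token, which holds for simple sequences of proper nouns but not in general.

```python
def contains(longer, shorter):
    """True if the token list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[k:k + n] == shorter for k in range(len(longer) - n + 1))

def name_match_features(mention, antecedent):
    """Boolean string-match features for a pair of proper-noun mentions."""
    m = mention.lower().split()
    a = antecedent.lower().split()
    return {
        "exact_match": m == a,
        # Assumes the head of a proper-noun sequence is its final token.
        "head_match": m[-1] == a[-1],
        "string_inclusion": contains(a, m) or contains(m, a),
    }

cook = name_match_features("Cook", "Tim Cook")
tech = name_match_features("Virginia Tech", "Georgia Tech")
```

The second pair shows why head match alone is unreliable: the heads agree, but neither name includes the other, which a learned model can use as evidence against coreference.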
7779 Each row specifies the token spans that mention an entity. (“Singleton” entities, which are
7780 mentioned only once (e.g., talks, government officials), are excluded from the annotations.)
7781 Equivalently, given a set of M mentions, {m_i}_{i=1}^{M}, each mention i can be assigned to a
7782 cluster z_i, where z_i = z_j if i and j are coreferent. The cluster assignments z are invariant
7783 under permutation of the cluster labels. The unique clustering associated with the assignment z is written
7784 c(z).
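The relationship between an assignment z and its clustering c(z) can be sketched as a function that discards the cluster labels and keeps only the groups of mention indices. The function name and the example assignments are made up for illustration.

```python
def clustering(z):
    """Map cluster assignments z to c(z), a set of sets of mention indices."""
    clusters = {}
    for i, label in enumerate(z):
        clusters.setdefault(label, set()).add(i)
    return frozenset(frozenset(members) for members in clusters.values())

# Two assignments that differ only by a relabeling of the clusters.
z_a = [0, 1, 0, 2, 1]
z_b = [2, 0, 2, 1, 0]
same = clustering(z_a) == clustering(z_b)
```

Both assignments group mentions {0, 2}, {1, 4}, and {3}, so they map to the same clustering even though the label values differ.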
7785 Mention identification The task of identifying mention spans for coreference resolution
7786 is often performed by applying a set of heuristics to the phrase structure parse of each
7787 sentence. A typical approach is to start with all noun phrases and named entities, and
7788 then apply filtering rules to remove nested noun phrases with the same head (e.g., [Apple
7789 CEO [Tim Cook]]), numeric entities (e.g., [100 miles], [97%]), non-referential it, etc (Lee
7790 et al., 2013; Durrett and Klein, 2013). In general, these deterministic approaches err in
7791 favor of recall, since the mention clustering component can choose to ignore false positive
7792 mentions, but cannot recover from false negatives. An alternative is to consider all spans
4
In many annotations, the term markable is used to refer to spans of text that can potentially mention an
entity. The set of markables includes non-referential pronouns, which do not mention any entity. Part of the
job of the coreference system is to avoid incorrectly linking these non-referential markables to any mention
chains.
7793 (up to some finite length) as candidate mentions, performing mention identification and
7794 clustering jointly (Daumé III and Marcu, 2005; Lee et al., 2017).
7829 The variable y_{1,6} is not part of the training data, because the first mention appears before
7830 the true antecedent a_6 = 2.
In mention ranking (Denis and Baldridge, 2007), the classifier learns to identify a single
antecedent ai ∈ {ε, 1, 2, . . . , i − 1} for each referring expression i,
\[ \hat{a}_i = \operatorname*{argmax}_{a \in \{\epsilon, 1, 2, \ldots, i-1\}} \psi_M(a, i), \]
7832 where ψM (a, i) is a score for the mention pair (a, i). If a = ε, then mention i does not refer
7833 to any previously-introduced entity — it is not anaphoric. Mention-ranking is similar to
7834 the mention-pair model, but all candidates are considered simultaneously, and at most
7835 a single antecedent is selected. The mention-ranking model explicitly accounts for the
7836 possibility that mention i is not anaphoric, through the score ψM (ε, i). The determination
7837 of anaphoricity can be made by a special classifier in a preprocessing step, so that
7838 non-ε antecedents are identified only for spans that are determined to be anaphoric (Denis and
7839 Baldridge, 2008).
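As a minimal sketch of mention-ranking decoding, the function below picks the highest-scoring antecedent for one mention. Mentions are numbered from 1 as in the text, Python's `None` stands in for the null antecedent, and `score` is a hypothetical callable playing the role of ψM.

```python
def rank_antecedent(i, score):
    """Return the best antecedent for mention i (1-indexed), or None if i is not anaphoric."""
    # Candidates are the null antecedent plus every earlier mention 1..i-1.
    candidates = [None] + list(range(1, i))
    return max(candidates, key=lambda a: score(a, i))
```

Because the null antecedent competes with the real candidates in the same argmax, anaphoricity is decided jointly with antecedent selection.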
7840 As a learning problem, ranking can be trained using the same objectives as in dis-
7841 criminative classification. For each mention i, we can define a gold antecedent a∗i , and an
7842 associated loss, such as the hinge loss, ℓi = (1 − ψM (a∗i , i) + ψM (â, i))+ or the negative
7843 log-likelihood, ℓi = − log p(a∗i | i; θ). (For more on learning to rank, see § 17.1.1.) But as
7844 with the mention-pair model, there is a mismatch between the labeled data, which comes
7845 in the form of mention sets, and the desired supervision, which would indicate the specific
7846 antecedent of each mention. The antecedent variables {a_i}_{i=1}^M relate to the mention
7847 sets in a many-to-one mapping: each set of antecedents induces a single clustering, but a
7848 clustering can correspond to many different settings of antecedent variables.
A heuristic solution is to set a∗i = max{j : j < i ∧ zj = zi }, the most recent mention in
the same cluster as i. But the most recent mention may not be the most informative: in the
running example, the most recent antecedent of the mention Cook is the pronoun he, but
a more useful antecedent is the earlier mention Apple Inc Chief Executive Tim Cook. Rather
than selecting a specific antecedent to train on, the antecedent can be treated as a latent
variable, in the manner of the latent variable perceptron from § 12.4.2 (Fernandes et al.,
2014):
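\begin{align}
\hat{a} &= \operatorname*{argmax}_{a} \sum_{i=1}^{M} \psi_M(a_i, i) \tag{15.5} \\
a^{*} &= \operatorname*{argmax}_{a \in A(c)} \sum_{i=1}^{M} \psi_M(a_i, i) \tag{15.6} \\
\theta &\leftarrow \theta + \sum_{i=1}^{M} \frac{\partial L}{\partial \theta} \psi_M(a^{*}_i, i) - \sum_{i=1}^{M} \frac{\partial L}{\partial \theta} \psi_M(\hat{a}_i, i) \tag{15.7}
\end{align}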
where A(c) is the set of antecedent structures that is compatible with the ground truth
coreference clustering c. Another alternative is to sum over all the conditional probabili-
ties of antecedent structures that are compatible with the ground truth clustering (Durrett
and Klein, 2013; Lee et al., 2017). For the set of mentions m, we compute the following
probabilities:
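\begin{align}
p(c \mid m) &= \sum_{a \in A(c)} p(a \mid m) = \sum_{a \in A(c)} \prod_{i=1}^{M} p(a_i \mid i, m) \tag{15.8} \\
p(a_i \mid i, m) &= \frac{\exp\left(\psi_M(a_i, i)\right)}{\sum_{a' \in \{\epsilon, 1, 2, \ldots, i-1\}} \exp\left(\psi_M(a', i)\right)}. \tag{15.9}
\end{align}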
7849 This objective rewards models that assign high scores for all valid antecedent structures.
7850 In the running example, this would correspond to summing the probabilities of the two
7851 valid antecedents for Cook, he and Apple Inc Chief Executive Tim Cook. In one of the exer-
7852 cises, you will compute the number of valid antecedent structures for a given clustering.
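The marginal likelihood can be sketched in a few lines. The sketch relies on one observation: a compatible antecedent structure chooses, independently for each mention, either the null antecedent (if the mention is first in its cluster) or any earlier mention in the same cluster, so the sum over structures factors into a product of per-mention sums. Mentions are 0-indexed list positions here, `None` stands in for the null antecedent, and `score` is a hypothetical stand-in for ψM. The most-recent-mention heuristic is included for comparison.

```python
import math

def valid_antecedents(z, i):
    """Antecedents of mention i that are compatible with the cluster labels z."""
    earlier = [j for j in range(i) if z[j] == z[i]]
    return earlier if earlier else [None]  # first mention of an entity: null only

def most_recent_antecedent(z, i):
    """Heuristic gold antecedent: the closest earlier mention in the same cluster."""
    return valid_antecedents(z, i)[-1]

def log_marginal_likelihood(z, score):
    """log p(c | m), summing over all antecedent structures compatible with z."""
    total = 0.0
    for i in range(len(z)):
        candidates = [None] + list(range(i))
        log_norm = math.log(sum(math.exp(score(a, i)) for a in candidates))
        log_valid = math.log(sum(math.exp(score(a, i)) for a in valid_antecedents(z, i)))
        total += log_valid - log_norm
    return total
```

With a constant score, three coreferent mentions give probabilities 1, 1/2, and 2/3 for the three positions, so the clustering has probability 1/3.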
7854 A mention-pair system might predict ŷ1,2 = 1, ŷ2,3 = 1, ŷ1,3 = 0. Similarly, a mention-
7855 ranking system might choose â2 = 1 and â3 = 2. Logically, if mentions 1 and 3 are both
7856 coreferent with mention 2, then all three mentions must refer to the same entity. This
7857 constraint is known as transitive closure.
7858 Transitive closure can be applied post hoc, revising the independent mention-pair or
7859 mention-ranking decisions. However, there are many possible ways to enforce transitive
7860 closure: in the example above, we could set ŷ1,3 = 1, or ŷ1,2 = 0, or ŷ2,3 = 0. For docu-
7861 ments with many mentions, there may be many violations of transitive closure, and many
7862 possible fixes. Transitive closure can be enforced by always adding edges, so that ŷ1,3 = 1
7863 is preferred (e.g., Soon et al., 2001), but this can result in overclustering, with too many
7864 mentions grouped into too few entities.
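The "always add edges" strategy amounts to taking connected components of the predicted links. A minimal union-find sketch, assuming mentions are numbered from 0 and `links` is a list of predicted coreferent pairs (both names are illustrative):

```python
def transitive_closure(num_mentions, links):
    """Group mentions into clusters by merging every predicted coreference link."""
    parent = list(range(num_mentions))

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in links:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    clusters = {}
    for m in range(num_mentions):
        clusters.setdefault(find(m), []).append(m)
    return sorted(clusters.values())
```

A single spurious link is enough to merge two otherwise separate entities, which is the overclustering risk described above.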
Mention-pair coreference resolution can be viewed as a constrained optimization prob-
lem,
\[ \max_{y \in \{0,1\}^{M}} \sum_{j=1}^{M} \sum_{i=1}^{j} \psi_M(i, j) \times y_{i,j} \]
7865 with the constraint enforcing transitive closure. This constrained optimization problem
7866 is equivalent to graph partitioning with positive and negative edge weights: construct a
7867 graph where the nodes are mentions, and the edges are the pairwise scores ψM (i, j); the
7868 goal is to partition the graph so as to maximize the sum of the edge weights between all
7869 nodes within the same partition (McCallum and Wellner, 2004). This problem is NP-hard,
7870 motivating approximations such as correlation clustering (Bansal et al., 2004) and integer
7871 linear programming (Klenner, 2007; Finkel and Manning, 2008, also see § 13.2.2).
7873 where zi indicates the entity referenced by mention i, and ψE ({i : zi = e}) is a scoring
7874 function applied to all mentions i that are assigned to entity e.
Figure 15.4: The Bell Tree for the sentence Abigail hopes she speaks with her. Which paths
are excluded by the syntactic constraints mentioned in § 15.1.1?
Entity-based coreference resolution is conceptually similar to the unsupervised clustering
problems encountered in chapter 5: the goal is to obtain clusters of mentions that
are internally coherent. The number of possible clusterings is the Bell number, which is
\[ B_n = \sum_{k=0}^{n-1} \binom{n-1}{k} B_k = \frac{1}{e} \sum_{k=0}^{\infty} \frac{k^{n}}{k!}. \tag{15.14} \]
7875 This recurrence is illustrated by the Bell tree, which is applied to a short coreference prob-
7876 lem in Figure 15.4. The Bell number Bn grows exponentially with n, making exhaustive
7877 search of the space of clusterings impossible. For this reason, entity-based coreference
7878 resolution typically involves incremental search, in which clustering decisions are based
7879 on local evidence, in the hope of approximately optimizing the full objective in Equa-
7880 tion 15.13. This approach is sometimes called cluster ranking, in contrast to mention
7881 ranking.
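To get a feel for how quickly the search space grows, the recurrence for the Bell number can be evaluated directly. This sketch assumes the standard base case B_0 = 1; the function name is illustrative.

```python
from math import comb

def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max] using B_n = sum_k C(n-1, k) * B_k."""
    bell = [1]  # B_0 = 1 by convention
    for n in range(1, n_max + 1):
        bell.append(sum(comb(n - 1, k) * bell[k] for k in range(n)))
    return bell
```

For the five-word example in Figure 15.4 there are three mentions and therefore five possible clusterings, but twenty mentions already yield more than 10^13.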
7894 clusters; but unlike SMASH, a cluster is selected only if all members of its cluster are compatible
7895 with the current mention. As mentions are added to a cluster, so are their features
7896 (e.g., gender, number, animacy). In this way, incoherent chains like {Hillary Clinton, Clinton, Bill Clinton}
7897 can be avoided. However, an incorrect assignment early in the document — a search error
7898 — might lead to a cascade of errors later on.
7899 More sophisticated search strategies can help to ameliorate the risk of search errors.
7900 One approach is beam search (§ 11.3), in which a set of hypotheses is maintained through-
7901 out search. Each hypothesis represents a path through the Bell tree (Figure 15.4). Hy-
7902 potheses are “expanded” either by adding the next mention to an existing cluster, or by
7903 starting a new cluster. Each expansion receives a score, based on Equation 15.13, and the
7904 top K hypotheses are kept on the beam as the algorithm moves to the next step.
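A minimal sketch of beam search over the Bell tree follows. It assumes that a hypothesis is scored by summing a cluster-level score over its clusters; `score_cluster` is a hypothetical stand-in for ψE, and mentions are numbered from 0.

```python
def beam_search_clusterings(num_mentions, score_cluster, beam_size=2):
    """Return the final beam as a list of (score, clustering) pairs, best first."""
    beam = [(0.0, [])]  # start with a single empty hypothesis
    for i in range(num_mentions):
        expansions = []
        for _, clusters in beam:
            # Expansion type 1: add mention i to an existing cluster.
            for c in range(len(clusters)):
                new = [list(cluster) for cluster in clusters]
                new[c].append(i)
                expansions.append(new)
            # Expansion type 2: start a new cluster with mention i.
            expansions.append([list(cluster) for cluster in clusters] + [[i]])
        scored = [(sum(score_cluster(c) for c in cs), cs) for cs in expansions]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        beam = scored[:beam_size]  # keep only the top K hypotheses
    return beam
```

With `beam_size=1` this reduces to greedy incremental clustering, so an early mistake cannot be undone; larger beams keep alternative paths alive at a proportional cost in computation.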
7905 Incremental cluster ranking can be made more accurate by performing multiple passes
7906 over the document, applying rules (or “sieves”) with increasing recall and decreasing
7907 precision at each pass (Lee et al., 2013). In the early passes, coreference links are pro-
7908 posed only between mentions that are highly likely to corefer (e.g., exact string match
7909 for full names and nominals). Information can then be shared among these mentions,
7910 so that when more permissive matching rules are applied later, agreement is preserved
7911 across the entire cluster. For example, in the case of {Hillary Clinton, Clinton, she}, the
7912 name-matching sieve would link Clinton and Hillary Clinton, and the pronoun-matching
7913 sieve would then link she to the combined cluster. A deterministic multi-pass system
7914 won nearly every track of the 2011 CoNLL shared task on coreference resolution (Prad-
7915 han et al., 2011). Given the dominance of machine learning in virtually all other areas
7916 of natural language processing — and more than fifteen years of prior work on machine
7917 learning for coreference — this was a surprising result, even if learning-based methods
7918 have subsequently regained the upper hand (e.g., Lee et al., 2017, the state-of-the-art at
7919 the time of this writing).
7921 where the score for each cluster is computed as the sum of scores of all mentions that are
7922 linked into the cluster, and f (i, ∅) is a set of features for the non-anaphoric mention that
7923 initiates the cluster.
7924 Using Figure 15.4 as an example, suppose that the ground truth is,
7925 but that with a beam of size one, the learner reaches the hypothesis,
7926 This style of incremental update can also be applied to a margin loss between the gold
7927 clustering and the top clustering on the beam. By backpropagating from this loss, it is also
7928 possible to train a more complicated scoring function, such as a neural network in which
7929 the score for each entity is a function of embeddings for the entity mentions (Wiseman
7930 et al., 2015).
\[ \Pr(z_i = e; \theta) = \frac{\exp \psi_E(i \cup \{j : z_j = e\}; \theta)}{\sum_{e'} \exp \psi_E(i \cup \{j : z_j = e'\}; \theta)}, \tag{15.21} \]
7941 where ψE (i ∪ {j : zj = e}; θ) is the score under parameters θ for assigning mention i to
7942 cluster e. This score can be an arbitrary function of the mention i, the cluster e and its
7943 (possibly empty) set of mentions; it can also include the history of actions taken thus far.
5
A draft of the second edition can be found here: http://incompleteideas.net/book/the-book-2nd.html.
Reinforcement learning has been used in spoken dialogue systems (Walker, 2000)
and text-based game playing (Branavan et al., 2009), and was applied to coreference resolution by Clark and
Manning (2015).
7944 If a policy assigns probability p(c; θ) to clustering c, then its expected loss is,
\[ L(\theta) = \sum_{c \in C(m)} p_\theta(c) \times \ell(c), \tag{15.22} \]
7945 where C(m) is the set of possible clusterings for mentions m. The loss ℓ(c) can be based on
7946 any arbitrary scoring function, including the complex evaluation metrics used in corefer-
7947 ence resolution (see § 15.4). This is an advantage of reinforcement learning, which can be
7948 trained directly on the evaluation metric — unlike traditional supervised learning, which
7949 requires a loss function that is differentiable and decomposable across individual deci-
7950 sions.
Rather than summing over the exponentially many possible clusterings, we can ap-
proximate the expectation by sampling trajectories of actions, z = (z1 , z2 , . . . , zM ), from
the current policy. Each action zi corresponds to a step in the Bell tree: adding mention
mi to an existing cluster, or forming a new cluster. Each trajectory z corresponds to a
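single clustering c, and so we can write the loss of an action sequence as ℓ(c(z)). The
policy gradient algorithm computes the gradient of the expected loss as an expectation
over trajectories (Sutton et al., 2000),
\begin{align}
\frac{\partial}{\partial \theta} L(\theta) &= E_{z \sim Z(m)} \left[ \ell(c(z)) \sum_{i=1}^{M} \frac{\partial}{\partial \theta} \log p(z_i \mid z_{1:i-1}, m) \right] \tag{15.23} \\
&\approx \frac{1}{K} \sum_{k=1}^{K} \ell(c(z^{(k)})) \sum_{i=1}^{M} \frac{\partial}{\partial \theta} \log p(z_i^{(k)} \mid z_{1:i-1}^{(k)}, m) \tag{15.24}
\end{align}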
7951 where the action sequence z (k) is sampled from the current policy. Unlike the incremental
7952 perceptron, an update is not made until the complete action sequence is available.
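The sampled estimate can be sketched for a deliberately tiny policy with a single scalar parameter. Everything below is an illustrative assumption rather than the setup of any cited system: the score of each action is `theta` times one feature from a hypothetical `feature_fn(i, e, z)`, the last action at each step starts a new cluster, and `loss_fn` maps a complete action sequence to a loss.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_trajectory(theta, num_mentions, feature_fn, rng):
    """Sample actions z and accumulate d/dtheta of sum_i log p(z_i | z_{1:i-1})."""
    z, grad_log_p = [], 0.0
    for i in range(num_mentions):
        num_clusters = len(set(z))
        # One feature per action: join cluster e, or (last action) start a new cluster.
        feats = [feature_fn(i, e, z) for e in range(num_clusters + 1)]
        probs = softmax([theta * f for f in feats])
        chosen = rng.choices(range(len(feats)), weights=probs)[0]
        # Gradient of the log-softmax with respect to the scalar theta.
        grad_log_p += feats[chosen] - sum(p * f for p, f in zip(probs, feats))
        z.append(chosen)
    return z, grad_log_p

def policy_gradient(theta, num_mentions, feature_fn, loss_fn, num_samples, rng):
    """Monte Carlo estimate of the gradient of the expected loss."""
    total = 0.0
    for _ in range(num_samples):
        z, grad_log_p = sample_trajectory(theta, num_mentions, feature_fn, rng)
        total += loss_fn(z) * grad_log_p  # the loss is known only at the end
    return total / num_samples
```

Note that the loss multiplies the whole trajectory's gradient, which is why no update is possible until the action sequence is complete.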
7960 • The oracle can be used to generate partial hypotheses that are likely to score well,
7961 by generating i actions from the initial state. These partial hypotheses are then used
7962 as starting points for the learned policy. This is known as roll-in.
7963 • The oracle can be used to compute the minimum possible loss from a given state, by
7964 generating M − i actions from the current state until completion. This is known as
7965 roll-out.
7966 The oracle can be combined with the existing policy during both roll-in and roll-out, sam-
7967 pling actions from each policy (Daumé III et al., 2009). One approach is to gradually
7968 decrease the number of actions drawn from the oracle over the course of learning (Ross
7969 et al., 2011).
7970 In the context of entity-based coreference resolution, Clark and Manning (2016) use
7971 the learned policy for roll-in and the oracle policy for roll-out. Algorithm 18 shows how
7972 the gradients on the policy weights are computed in this case. In this application, the
7973 oracle is “noisy”, because it selects the action that minimizes only the local loss — the
7974 accuracy of the coreference clustering up to mention i — rather than identifying the action
7975 sequence that will lead to the best final coreference clustering on the entire document.
7976 When learning from noisy oracles, it can be helpful to mix in actions from the current
7977 policy with the oracle during roll-out (Chang et al., 2015).
7998 Mention type. Each span can be identified as a pronoun, name, or nominal, using the
7999 part-of-speech of the head word of the mention: both the Penn Treebank and Uni-
8000 versal Dependencies tagsets (§ 8.1.1) include tags for pronouns and proper nouns,
8001 and all other heads can be marked as nominals (Haghighi and Klein, 2009).
8002 Mention width. The number of tokens in a mention is a rough predictor of its anaphoric-
8003 ity, with longer mentions being less likely to refer back to previously-defined enti-
8004 ties.
8005 Lexical features. The first, last, and head words can help to predict anaphoricity; they are
8006 also useful in conjunction with features such as mention type and part-of-speech,
8007 providing a rough measure of agreement (Björkelund and Nugues, 2011). The num-
8008 ber of lexical features can be very large, so it can be helpful to select only frequently-
8009 occurring features (Durrett and Klein, 2013).
8010 Morphosyntactic features. These features include the part-of-speech, number, gender,
8011 and dependency ancestors.
8012 The features for mention i and candidate antecedent a can be conjoined, producing
8013 joint features that can help to assess the compatibility of the two mentions. For example,
8014 Durrett and Klein (2013) conjoin each feature with the mention types of the anaphor
8015 and the antecedent. Coreference resolution corpora such as ACE and OntoNotes contain
8016 documents from various genres. By conjoining the genre with other features, it is possible
8017 to learn genre-specific feature weights.
6
The Universal Dependencies project has produced dependency treebanks for more than sixty languages.
However, coreference features and mention detection are generally based on phrase structure trees, which
exist for roughly two dozen languages. A list is available here: https://en.wikipedia.org/wiki/Treebank
8020 Distance. The number of intervening tokens, mentions, and sentences between i and j
8021 can all be used as distance features. These distances can be computed on the surface
8022 text, or on a transformed representation reflecting the breadth-first tree traversal
8023 (Figure 15.3). Rather than using the distances directly, they are typically binned,
8024 creating binary features.
8025 String match. A variety of string match features can be employed: exact match, suffix
8026 match, head match, and more complex matching rules that disregard irrelevant
8027 modifiers (Soon et al., 2001).
8028 Compatibility. Building on the model, features can measure whether the anaphor and
8029 antecedent agree with respect to morphosyntactic attributes such as gender, number, and
8030 animacy.
8031 Nesting. If one mention is nested inside another (e.g., [The President of [France]]), they
8032 generally cannot corefer.
8033 Same speaker. For documents with quotations, such as news articles, personal pronouns
8034 can be resolved only by determining the speaker for each mention (Lee et al., 2013).
8035 Coreference is also more likely between mentions from the same speaker.
8036 Gazetteers. These features indicate that the anaphor and candidate antecedent appear in
8037 a gazetteer of acronyms (e.g., USA/United States, GATech/Georgia Tech), demonyms
8038 (e.g., Israel/Israeli), or other aliases (e.g., Knickerbockers/New York Knicks).
8039 Lexical semantics. These features use a lexical resource such as WordNet to determine
8040 whether the head words of the mentions are related through synonymy, antonymy,
8041 and hypernymy (§ 4.2).
8042 Dependency paths. The dependency path between the anaphor and candidate antecedent
8043 can help to determine whether the pair can corefer, under the government and bind-
8044 ing constraints described in § 15.1.1.
8045 Comprehensive lists of mention-pair features are offered by Bengtson and Roth (2008) and
8046 Rahman and Ng (2011). Neural network approaches use far fewer mention-pair features:
8047 for example, Lee et al. (2017) include only speaker, genre, distance, and mention width
8048 features.
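Two of the mention-pair features listed above, binned distance and string match, can be sketched as follows. The bin edges, feature names, and function signatures are assumptions chosen for illustration, not values taken from any of the cited systems.

```python
from bisect import bisect_right

def distance_features(distance, edges=(1, 2, 3, 4, 8, 16, 32)):
    """One-hot binary features indicating which bin the distance falls into."""
    bin_index = bisect_right(edges, distance)
    return {f"dist_bin={k}": int(k == bin_index) for k in range(len(edges) + 1)}

def string_match_features(anaphor, antecedent, anaphor_head, antecedent_head):
    """Binary string-match features over the mention text and the head words."""
    a, b = anaphor.lower(), antecedent.lower()
    return {
        "exact_match": int(a == b),
        "suffix_match": int(a.endswith(b) or b.endswith(a)),
        "head_match": int(anaphor_head.lower() == antecedent_head.lower()),
    }
```

Binning lets a linear model assign different weights to short and long distances without assuming that the effect of distance is linear.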
8049 Semantics In many cases, coreference seems to require knowledge and semantic in-
8050 ferences, as in the running example, where we link China with a country and a growth
8051 market. Some of this information can be gleaned from WordNet, which defines a graph
8052 over synsets (see § 4.2). For example, one of the synsets of China is an instance of an
8053 Asian nation#1, which in turn is a hyponym of country#2, a synset that includes
8054 country.7 Such paths can be used to measure the similarity between concepts (Pedersen
8055 et al., 2004), and this similarity can be incorporated into coreference resolution as a fea-
8056 ture (Ponzetto and Strube, 2006). Similar ideas can be applied to knowledge graphs in-
8057 duced from Wikipedia (Ponzetto and Strube, 2007). But while such approaches improve
8058 relatively simple classification-based systems, they have proven less useful when added
8059 to the current generation of techniques.8 For example, Durrett and Klein (2013) employ
8060 a range of semantics-based features — WordNet synonymy and hypernymy relations on
8061 head words, named entity types (e.g., person, organization), and unsupervised cluster-
8062 ing over nominal heads — but find that these features give minimal improvement over a
8063 baseline system using surface features.
8074 For scalar mention-pair features (e.g., distance features), aggregation can be performed by
8075 computing the minimum, maximum, and median values across all mentions in the cluster.
8076 Additional entity-mention features include the number of mentions currently clustered in
8077 the entity, and ALL-X and MOST-X features for each mention type.
8081 mantic compatibility underlying nominal coreference, helping with difficult cases like
8082 (Apple, the firm) and (China, the firm’s biggest growth market). Furthermore, a distributed
8083 representation of entities can be trained to capture semantic features that are added by
8084 each mention.
8086 Entity mentions can be embedded into a vector space, providing the base layer for neural
8087 networks that score coreference decisions (Wiseman et al., 2015).
8088 Constructing the mention embedding Various approaches for embedding multiword
8089 units can be applied (see § 14.8). Figure 15.5 shows a recurrent neural network approach,
8090 which begins by running a bidirectional LSTM over the entire text, obtaining hidden states
8091 from the left-to-right and right-to-left passes, $h_m = [\overleftarrow{h}_m ; \overrightarrow{h}_m]$. Each candidate mention
8092 span (s, t) is then represented by the vertical concatenation of four vectors:
\[ u^{(s,t)} = [u_{\text{first}}^{(s,t)}; u_{\text{last}}^{(s,t)}; u_{\text{head}}^{(s,t)}; \phi^{(s,t)}], \tag{15.26} \]
8093 where $u_{\text{first}}^{(s,t)} = h_{s+1}$ is the embedding of the first word in the span, $u_{\text{last}}^{(s,t)} = h_t$ is the
8094 embedding of the last word, $u_{\text{head}}^{(s,t)}$ is the embedding of the “head” word, and $\phi^{(s,t)}$ is a
8095 vector of surface features, such as the length of the span (Lee et al., 2017).
Attention over head words Rather than identifying the head word from the output of a
parser, it can be computed from a neural attention mechanism:
8096 Each token m gets a scalar score α̃m = θα · hm , which is the dot product of the LSTM
8097 hidden state hm and a vector of weights θα . The vector of scores for tokens in the span
8098 m ∈ {s + 1, s + 2, . . . , t} is then passed through a softmax layer, yielding a vector a(s,t)
8099 that allocates one unit of attention across the span. This eliminates the need for syntactic
8100 parsing to recover the head word; instead, the model learns to identify the most important
8101 words in each span. Attention mechanisms were introduced in neural machine transla-
8102 tion (Bahdanau et al., 2014), and are described in more detail in § 18.3.1.
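The attention computation can be sketched with plain Python lists. Two assumptions are made here: the head embedding is taken to be the attention-weighted sum of the hidden states in the span, and `hidden[s:t]` (0-based) is used to hold the states for tokens s+1 through t in the text's 1-based numbering.

```python
import math

def span_attention(hidden, theta_alpha, s, t):
    """Return (attention weights, head embedding) for the span covering hidden[s:t]."""
    span = hidden[s:t]
    # Scalar score per token: dot product of the weight vector and the hidden state.
    scores = [sum(w * h for w, h in zip(theta_alpha, h_m)) for h_m in span]
    # Softmax over the span allocates one unit of attention.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Assumed head embedding: attention-weighted average of the hidden states.
    dim = len(theta_alpha)
    u_head = [sum(a * h_m[d] for a, h_m in zip(weights, span)) for d in range(dim)]
    return weights, u_head
```

Because the weights are learned end to end, the model is free to spread attention over several tokens rather than committing to a single syntactic head.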
Using mention embeddings Given a set of mention embeddings, each mention i and
candidate antecedent a is scored as,
8103 where u(a) and u(i) are the embeddings for spans a and i respectively, as defined in Equa-
8104 tion 15.26.
8105 • The scores ψS (a) quantify whether span a is likely to be a coreferring mention, independent
8106 of what it corefers with. This allows the model to learn to identify mentions
8107 directly, rather than identifying mentions with a preprocessing step.
8108 • The score ψM (a, i) computes the compatibility of spans a and i. Its base layer is a
8109 vector that includes the embeddings of spans a and i, their elementwise product
8110 u(a) u(i) , and a vector of surface features f (a, i, w), including distance, speaker,
8111 and genre information.
8112 Lee et al. (2017) provide an error analysis that shows how this method can correctly link
8113 a blaze and a fire, while incorrectly linking pilots and flight attendants. In each case, the
8114 coreference decision is based on similarities in the word embeddings.
8115 Rather than embedding individual mentions, Clark and Manning (2016) embed men-
8116 tion pairs. At the base layer, their network takes embeddings of the words in and around
8117 each mention, as well as one-hot vectors representing a few surface features, such as the
8118 distance and string matching features. This base layer is then passed through a multilayer
8119 feedforward network with ReLU nonlinearities, resulting in a representation of the men-
8120 tion pair. The output of the mention pair encoder ui,j is used in the scoring function of
8121 a mention-ranking model, ψM (i, j) = θ · ui,j . A similar approach is used to score cluster
8122 pairs, constructing a cluster-pair encoding by pooling over the mention-pair encodings
8123 for all pairs of mentions within the two clusters.
8156 Exercises
8157 1. Select an article from today’s news, and annotate coreference for the first twenty
8158 noun phrases that appear in the article (include nested noun phrases). That is,
8159 group the noun phrases into entities, where each entity corresponds to a set of noun
8160 phrases. Then specify the mention-pair training data that would result from the first
8161 five noun phrases.
8162 2. Using your annotations from the preceding problem, compute the following statis-
8163 tics:
8164 • The number of times new entities are introduced by each of the three types of
8165 referring expressions: pronouns, proper nouns, and nominals. Include “single-
8166 ton” entities that are mentioned only once.
8167 • For each type of referring expression, compute the fraction of mentions that are
8168 anaphoric.
8169 3. Apply a simple heuristic to all pronouns in the article from the previous exercise.
8170 Specifically, link each pronoun to the closest preceding noun phrase that agrees in
8171 gender, number, animacy, and person. Compute the following evaluation:
8172 • True positive: a pronoun that is linked to a noun phrase with which it is coref-
8173 erent, or is correctly labeled as the first mention of an entity.
8174 • False positive: a pronoun that is linked to a noun phrase with which it is not
8175 coreferent. (This includes mistakenly linking singleton or non-referential pro-
8176 nouns.)
8177 • False negative: a pronoun that is not linked to a noun phrase with which it is
8178 coreferent.
8179 Compute the F-MEASURE for your method, and for a trivial baseline in which every
8180 mention is its own entity. Are there any additional heuristics that would have
8181 improved the performance of this method?
8182 4. Durrett and Klein (2013) compute the probability of the gold coreference clustering
8183 by summing over all antecedent structures that are compatible with the clustering.
8184 Compute the number of antecedent structures for a single entity with K mentions.
8185 5. Use the policy gradient algorithm to compute the gradient for the following sce-
8186 nario, based on the Bell tree in Figure 15.4:
c(a1 ) ={Abigail}
c(a1:2 ) ={Abigail, she}
c(a1:3 ) ={Abigail, she}, {her}.
8188 • At each mention t, the action space At is to merge the mention with each exist-
8189 ing cluster, or the empty cluster, with probability,
8195 Discourse
386 CHAPTER 16. DISCOURSE
[Plot for Figure 16.1: cosine similarity (y-axis, 0.0 to 0.6) against sentence index (x-axis, 0 to 35); series: original, smoothing L=1, smoothing L=3.]
Figure 16.1: Smoothed cosine similarity among adjacent sentences in a news article. Local
minima at m = 10 and m = 29 indicate likely segmentation points.
8218 with kℓ representing the value of a smoothing kernel of size L, e.g. k = [1, 0.5, 0.25]⊤ .
8219 Segmentation points are then identified at local minima in the smoothed similarities s,
8220 since these points indicate changes in the overall distribution of words in the text. An
8221 example is shown in Figure 16.1.
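A minimal sketch of the segmentation step is shown below. The smoothing here is an assumption: each similarity is replaced by a kernel-weighted average of its neighbors on both sides, renormalized at the boundaries, which may differ in detail from the formula used to produce Figure 16.1. The function names are illustrative.

```python
def smooth(sims, kernel):
    """Symmetric kernel smoothing: kernel[l] weights the neighbors at offset +/- l."""
    n = len(sims)
    smoothed = []
    for m in range(n):
        num = den = 0.0
        for l, k in enumerate(kernel):
            offsets = (m,) if l == 0 else (m - l, m + l)
            for idx in offsets:
                if 0 <= idx < n:  # ignore neighbors that fall outside the text
                    num += k * sims[idx]
                    den += k
        smoothed.append(num / den)
    return smoothed

def local_minima(values):
    """Indices whose value is strictly below both neighbors."""
    return [m for m in range(1, len(values) - 1)
            if values[m] < values[m - 1] and values[m] < values[m + 1]]
```

A wider kernel removes spurious minima caused by individual sentences, at the risk of blurring the location of true topic boundaries.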
8222 Text segmentation can also be formulated as a probabilistic model, in which each seg-
8223 ment has a unique language model that defines the probability over the text in the seg-
8224 ment (Utiyama and Isahara, 2001; Eisenstein and Barzilay, 2008; Du et al., 2013).1 A good
8225 segmentation achieves high likelihood by grouping segments with similar word distributions.
8226 This probabilistic approach can be extended to hierarchical topic segmentation, in
8227 which each topic segment is divided into subsegments (Eisenstein, 2009). All of these approaches
8228 are unsupervised. While labeled data can be obtained from well-formatted texts
8229 such as textbooks, such annotations may not generalize to speech transcripts in alternative
8230 domains. Supervised methods have been tried in cases where in-domain labeled data
8231 is available, substantially improving performance by learning weights on multiple types
8232 of features (Galley et al., 2003).
1
There is a rich literature on how latent variable models (such as latent Dirichlet allocation) can track
(16.1) a. John went to his favorite music store to buy a piano.
b. He had frequented the store for many years.
c. He was excited that he could finally buy a piano.
d. He arrived just as the store was closing for the day.
(16.2) a. John went to his favorite music store to buy a piano.
b. It was a store John had frequented for many years.
c. He was excited that he could finally buy a piano.
d. It was closing just as John arrived.
Figure 16.2: Two tellings of the same story (Grosz et al., 1995). The discourse on the left
uses referring expressions coherently, while the one on the right does not.
8262 • The forward-looking centers in utterance m are all the entities that are mentioned
8263 in the utterance, cf (wm ) = {e1 , e2 , . . . , }. The forward-looking centers are partially
8264 ordered by their syntactic prominence, favoring subjects over other positions.
8265 • The backward-looking center cb (wm ) is the highest-ranked element in the set of
8266 forward-looking centers from the previous utterance cf (wm−1 ) that is also men-
8267 tioned in wm .
8268 Given these two definitions, centering theory makes the following predictions about
8269 the form and position of referring expressions:
8270 1. If a pronoun appears in the utterance wm , then the backward-looking center cb (wm )
8271 must also be realized as a pronoun. This rule argues against the use of it to refer
8272 to the piano store in Example (16.2d), since JOHN is the backward-looking center of
8273 (16.2d), and he is mentioned by name and not by a pronoun.
8274 2. Sequences of utterances should retain the same backward-looking center if possible,
8275 and ideally, the backward-looking center should also be the top-ranked element in
8276 the list of forward-looking centers. This rule argues in favor of the preservation of
8277 JOHN as the backward-looking center throughout Example (16.1).
8278 Centering theory unifies aspects of syntax, discourse, and anaphora resolution. However,
8279 it can be difficult to clarify exactly how to rank the elements of each utterance, or even
8280 how to partition a text or dialog into utterances (Poesio et al., 2004).
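The definition of the backward-looking center can be sketched directly, assuming that the forward-looking centers of the previous utterance are already given as a list ranked by prominence. Producing that ranking is exactly the difficult part noted above, so the inputs here are hypothetical.

```python
def backward_looking_center(prev_forward_centers, current_entities):
    """Highest-ranked forward-looking center of the previous utterance that recurs."""
    for entity in prev_forward_centers:  # ranked by syntactic prominence
        if entity in current_entities:
            return entity
    return None  # no entity carries over from the previous utterance
```

For the first two utterances of Example (16.1), ranking JOHN above STORE and PIANO gives JOHN as the backward-looking center of (16.1b).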
Figure 16.3: The entity grid representation for a dialogue from the television show Break-
ing Bad.
8299 text (Lapata, 2003), and disentangling multiple conversational threads in an online multi-
8300 party chat (Elsner and Charniak, 2010).
8308 Unifying these two representations into the form of Equation 16.4 requires linking the
8309 unbound variable y from [16.6] with the quantified variable x in [16.5]. Discourse un-
8310 derstanding therefore requires the reader to update a set of assignments, from variables
8311 to entities. This update would (presumably) link the dog in the first sentence of [16.3]
8312 with the unbound variable y in the second sentence, thereby licensing the conjunction in
8313 [16.4].2 This basic idea is at the root of dynamic semantics (Groenendijk and Stokhof,
8314 1991). Segmented discourse representation theory links dynamic semantics with a set
8315 of discourse relations, which explain how adjacent units of text are rhetorically or con-
8316 ceptually related (Lascarides and Asher, 2007). The next section explores the theory of
8317 discourse relations in more detail.
• TEMPORAL • COMPARISON
Table 16.1: The hierarchy of discourse relations in the Penn Discourse Treebank annotations
(Prasad et al., 2008). For example, PRECEDENCE is a subtype of ASYNCHRONOUS,
which is a type of TEMPORAL relation.
8341 • Each connective is annotated for the discourse relation or relations that it expresses,
8342 if any — many discourse connectives have senses in which they do not signal a
8343 discourse relation (Pitler and Nenkova, 2009).
8344 • For each discourse relation, the two arguments of the relation are specified as ARG1
8345 and ARG2, where ARG2 is constrained to be adjacent to the connective. These arguments
8346 may be sentences, but they may also be smaller or larger units of text.
8347 • Adjacent sentences are annotated for implicit discourse relations, which are not
8348 marked by any connective. When a connective could be inserted between a pair
8349 of sentences, the annotator supplies it, and also labels its sense (e.g., example 16.5).
8350 In some cases, there is no relationship at all between a pair of adjacent sentences;
8351 in other cases, the only relation is that the adjacent sentences mention one or more
8352 shared entities. These phenomena are annotated as NOREL and ENTREL (entity relation),
8353 respectively.
8354 Examples of Penn Discourse Treebank annotations are shown in Figure 16.4. In (16.4), the
8355 word therefore acts as an explicit discourse connective, linking the two adjacent units of
8356 text. The Treebank annotations also specify the “sense” of each relation, linking the con-
8357 nective to a relation in the sense inventory shown in Table 16.1: in (16.4), the relation is
8358 PRAGMATIC CAUSE:JUSTIFICATION because it relates to the author’s communicative intentions.
8359 The word therefore can also signal causes in the external world (e.g., He was
8360 therefore forced to relinquish his plan). In discourse sense classification, the goal is to de-
8361 termine which discourse relation, if any, is expressed by each connective. A related task
8362 is the classification of implicit discourse relations, as in (16.5). In this example, the re-
8363 lationship between the adjacent sentences could be expressed by the connective because,
8364 indicating a CAUSE:REASON relationship.
8366 As suggested by the examples above, many connectives can be used to invoke multiple
8367 types of discourse relations. Similarly, some connectives have senses that are unrelated
8368 to discourse: for example, and functions as a discourse connective when it links propo-
8369 sitions, but not when it links noun phrases (Lin et al., 2014). Nonetheless, the senses of
explicitly-marked discourse relations in the Penn Discourse Treebank are relatively easy to classify,
8371 at least at the coarse-grained level. When classifying the four top-level PDTB relations,
8372 90% accuracy can be obtained simply by selecting the most common relation for each
8373 connective (Pitler and Nenkova, 2009). At the more fine-grained levels of the discourse
8374 relation hierarchy, connectives are more ambiguous. This fact is reflected both in the ac-
8375 curacy of automatic sense classification (Versley, 2011) and in interannotator agreement,
8376 which falls to 80% for level-3 discourse relations (Prasad et al., 2008).
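The most-common-sense baseline described above can be sketched in a few lines. This is only an illustration: the training pairs are invented rather than drawn from the PDTB, the function names are hypothetical, and the back-off label for unseen connectives is an assumption.

```python
from collections import Counter, defaultdict

def train_most_common_sense(examples):
    """Map each connective to the sense it most often carries in the training pairs."""
    counts = defaultdict(Counter)
    for connective, sense in examples:
        counts[connective.lower()][sense] += 1
    return {conn: senses.most_common(1)[0][0] for conn, senses in counts.items()}

def predict_sense(model, connective, default="EXPANSION"):
    # Back off to a default sense for connectives that were never seen in training.
    return model.get(connective.lower(), default)

# Invented (connective, top-level sense) pairs for illustration; not PDTB data.
toy_train = [
    ("because", "CONTINGENCY"),
    ("but", "COMPARISON"),
    ("but", "COMPARISON"),
    ("since", "TEMPORAL"),
    ("since", "CONTINGENCY"),
    ("since", "CONTINGENCY"),
    ("and", "EXPANSION"),
]
model = train_most_common_sense(toy_train)
print(predict_sense(model, "Since"))  # CONTINGENCY
```

An ambiguous connective such as since is always assigned its majority sense, which is exactly where this baseline loses accuracy at finer levels of the sense hierarchy.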
(16.4) . . . as this business of whaling has somehow come to be regarded among landsmen as a
rather unpoetical and disreputable pursuit; therefore, I am all anxiety to convince
ye, ye landsmen, of the injustice hereby done to us hunters of whales.
(16.5) But a few funds have taken other defensive steps. Some have raised their cash
positions to record levels. Implicit = BECAUSE High cash positions help buffer a
fund when the market falls.
(16.6) Michelle lives in a hotel room, and although she drives a canary-colored
Porsche, she hasn’t time to clean or repair it.
(16.7) Most oil companies, when they set exploration and production budgets for this
year, forecast revenue of $15 for each barrel of crude produced.
Figure 16.4: Example annotations of discourse relations. In the style of the Penn Discourse
Treebank, the discourse connective is underlined, the first argument is shown in italics,
and the second argument is shown in bold. Examples (16.5-16.7) are quoted from Prasad
et al. (2008).
A more challenging task for explicitly-marked discourse relations is to identify the
scope of the arguments. Discourse connectives need not be adjacent to ARG 1, as shown
in (16.6), where ARG 1 follows ARG 2; furthermore, the arguments need not be contiguous,
as shown in (16.7). For these reasons, recovering the arguments of each discourse
connective is difficult. Because intra-sentential arguments are often syntactic
constituents (see chapter 10), many approaches train a classifier to predict whether
each constituent is an appropriate argument for each explicit discourse connective
(e.g., Wellner and Pustejovsky, 2007; Lin et al., 2014).
3
In the dataset for the 2015 shared task on shallow discourse parsing, the interannotator agreement was
91% for explicit discourse relations and 81% for non-explicit relations, across all levels of detail (Xue et al.,
2015).
8386 This basic framework can be instantiated in several ways, including both feature-based
8387 and neural encoders. Several recent approaches are compared in the 2015 and 2016 shared
8388 tasks at the Conference on Natural Language Learning (Xue et al., 2015, 2016).
8389 Feature-based approaches Each argument can be encoded into a vector of surface fea-
8390 tures. The encoding typically includes lexical features (all words, or all content words, or
8391 a subset of words such as the first three and the main verb), Brown clusters of individ-
8392 ual words (§ 14.4), and syntactic features such as terminal productions and dependency
8393 arcs (Pitler et al., 2009; Lin et al., 2009; Rutherford and Xue, 2014). The classification func-
8394 tion then has two parts. First, it creates a joint feature vector by combining the encodings
8395 of each argument, typically by computing the cross-product of all features in each encod-
8396 ing:
$$f(y, z^{(i)}, z^{(i+1)}) = \{(a \times b \times y) : (z_a^{(i)} z_b^{(i+1)})\} \qquad [16.10]$$
8397 The size of this feature set grows with the square of the size of the vocabulary, so it can be
8398 helpful to select a subset of features that are especially useful on the training data (Park
8399 and Cardie, 2012). After f is computed, any classifier can be trained to compute the final
score, $\Psi(y, z^{(i)}, z^{(i+1)}) = \theta \cdot f(y, z^{(i)}, z^{(i+1)})$.
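As a rough sketch of the cross-product encoding and linear scoring just described, assume that each argument has already been reduced to a set of active binary features. The feature strings, weights, and relation labels below are invented for illustration.

```python
from itertools import product

def cross_product_features(y, z_left, z_right):
    """Conjoin every active feature a of one argument with every b of the other, and with y."""
    return {(a, b, y) for a, b in product(z_left, z_right)}

def score(theta, y, z_left, z_right):
    # Linear score: sum the weights of all active conjoined features.
    return sum(theta.get(f, 0.0) for f in cross_product_features(y, z_left, z_right))

z1 = {"w=raised", "w=cash"}     # hypothetical features of argument i
z2 = {"w=buffer", "w=falls"}    # hypothetical features of argument i+1
theta = {                       # invented weights
    ("w=raised", "w=buffer", "CAUSE"): 1.5,
    ("w=cash", "w=falls", "CONTRAST"): 0.4,
}
labels = ["CAUSE", "CONTRAST"]
best = max(labels, key=lambda y: score(theta, y, z1, z2))
print(best, len(cross_product_features(best, z1, z2)))  # CAUSE 4
```

Two features per argument already yield four conjoined features per label, which shows why feature selection becomes necessary at vocabulary scale.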
8401 Neural network approaches In neural network architectures, the encoder is learned
8402 jointly with the classifier as an end-to-end model. Each argument can be encoded using
8403 a variety of neural architectures (surveyed in § 14.8): recursive (§ 10.6.1; Ji and Eisenstein,
8404 2015), recurrent (§ 6.3; Ji et al., 2016), and convolutional (§ 3.4; Qin et al., 2017). The clas-
8405 sification function can then be implemented as a feedforward neural network on the two
8406 encodings (chapter 3; for examples, see Rutherford et al., 2017; Qin et al., 2017), or as a
simple bilinear product, $\Psi(y, z^{(i)}, z^{(i+1)}) = (z^{(i)})^\top \Theta_y z^{(i+1)}$ (Ji and Eisenstein, 2015). The
8408 encoding model can be trained by backpropagation from the classification objective, such
8409 as the margin loss. Rutherford et al. (2017) show that neural architectures outperform
8410 feature-based approaches in most settings. While neural approaches require engineering
8411 the network architecture (e.g., embedding size, number of hidden units in the classifier),
8412 feature-based approaches also require significant engineering to incorporate linguistic re-
8413 sources such as Brown clusters and parse trees, and to select a subset of relevant features.
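The bilinear scoring function and a margin loss can be illustrated with plain Python lists. This is a minimal sketch: the two-dimensional encodings and the matrices are made-up values, and a real system would learn them by backpropagation in a neural network library.

```python
def bilinear_score(z_left, theta_y, z_right):
    """Compute z_left^T Theta_y z_right for list-based vectors and matrices."""
    return sum(
        z_left[a] * theta_y[a][b] * z_right[b]
        for a in range(len(z_left))
        for b in range(len(z_right))
    )

def margin_loss(scores, gold):
    # Hinge loss: the best wrong label should score at least 1 below the gold label.
    best_wrong = max(s for y, s in scores.items() if y != gold)
    return max(0.0, 1.0 + best_wrong - scores[gold])

z_i = [1.0, 0.0]       # hypothetical encoding of argument i
z_next = [0.5, 2.0]    # hypothetical encoding of argument i+1
Theta = {              # one invented matrix per relation label
    "CAUSE": [[1.0, 0.0], [0.0, 1.0]],
    "CONTRAST": [[0.0, 1.0], [1.0, 0.0]],
}
scores = {y: bilinear_score(z_i, Theta[y], z_next) for y in Theta}
print(scores)                        # {'CAUSE': 0.5, 'CONTRAST': 2.0}
print(margin_loss(scores, "CAUSE"))  # 2.5
```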
8420 The basic element of RST is the discourse unit, which refers to a contiguous span of
8421 text. Elementary discourse units (EDUs) are the atomic elements in this framework, and
8422 are typically (but not always) clauses.4 Each discourse relation combines two or more
8423 adjacent discourse units into a larger, composite discourse unit; this process ultimately
8424 unites the entire text into a tree-like structure.5
8425 Nuclearity In many discourse relations, one argument is primary. For example:
8428 In this example, the second sentence provides EVIDENCE for the point made in the first
8429 sentence. The first sentence is thus the nucleus of the discourse relation, and the second
8430 sentence is the satellite. The notion of nuclearity is analogous to the head-modifier struc-
8431 ture of dependency parsing (see § 11.1.1). However, in RST, some relations have multiple
8432 nuclei. For example, the arguments of the CONTRAST relation are equally important:
8435 Relations that have multiple nuclei are called coordinating; relations with a single nu-
8436 cleus are called subordinating. Subordinating relations are constrained to have only two
8437 arguments, while coordinating relations (such as CONJUNCTION) may have more than
8438 two.
8439 RST Relations Rhetorical structure theory features a large inventory of discourse rela-
8440 tions, which are divided into two high-level groups: subject matter relations, and presen-
8441 tational relations. Presentational relations are organized around the intended beliefs of
8442 the reader. For example, in (16.8), the second discourse unit provides evidence intended
8443 to increase the reader’s belief in the proposition expressed by the first discourse unit, that
8444 LaShawn loves animals. In contrast, subject-matter relations are meant to communicate ad-
8445 ditional facts about the propositions contained in the discourse units that they relate:
4
Details of discourse segmentation can be found in the RST annotation manual (Carlson and Marcu,
2001).
5
While RST analyses are typically trees, this should not be taken as a strong theoretical commitment to the
principle that all coherent discourses have a tree structure. Taboada and Mann (2006) write:
It is simply the case that trees are convenient, easy to represent, and easy to understand. There
is, on the other hand, no theoretical reason to assume that trees are the only possible represen-
tation of discourse structure and of coherence relations.
The appropriateness of tree structures to discourse has been challenged, e.g., by Wolf and Gibson (2005), who
propose a more general graph-structured representation.
6
from the RST Treebank (Carlson et al., 2002)
[Figure 16.5 tree: Concession(Justify(1A, Conjunction(Elaboration(1B, 1C), 1D, Justify(1E, Conjunction(1F, 1G)))), 1H)]
[It could have been a great movie]1A [It does have beautiful scenery,]1B [some of
the best since Lord of the Rings.]1C [The acting is well done,]1D [and I really liked
the son of the leader of the Samurai.]1E [He was a likable chap,]1F [and I hated to
see him die.]1G [But, other than all that, this movie is nothing more than hidden
rip-offs.]1H
Figure 16.5: A rhetorical structure theory analysis of a short movie review, adapted from
Voll and Taboada (2007). Positive and negative sentiment words are underlined, indicating
RST’s potential utility in document-level sentiment analysis.
8448 In this example, the satellite describes a world state that is realized by the action described
8449 in the nucleus. This relationship is about the world, and not about the author’s commu-
8450 nicative intentions.
8451 Example Figure 16.5 depicts an RST analysis of a paragraph from a movie review. Asym-
8452 metric (subordinating) relations are depicted with an arrow from the satellite to the nu-
8453 cleus; symmetric (coordinating) relations are depicted with lines. The elementary dis-
8454 course units 1F and 1G are combined into a larger discourse unit with the symmetric
CONJUNCTION relation. The resulting discourse unit is then the satellite in a JUSTIFY
8456 relation with 1E.
7
from the RST Treebank (Carlson et al., 2002)
Bottom-up discourse parsing Assume a segmentation of the text into $N$ elementary
discourse units with base representations $\{z^{(i)}\}_{i=1}^{N}$, and assume a composition function
$\text{COMPOSE}(z^{(i)}, z^{(j)}, \ell)$, which maps two encodings and a discourse relation $\ell$ into a new
encoding. The composition function can follow the strong compositionality criterion and
simply select the encoding of the nucleus, or it can do something more complex. We
also need a scoring function $\Psi(z^{(i,k)}, z^{(k,j)}, \ell)$, which computes a scalar score for the
(binarized) discourse relation $\ell$ with left child covering the span $i+1 : k$, and the right
child covering the span $k+1 : j$. Given these components, we can construct vector
representations for each span, and this is the basic idea underlying compositional vector
grammars (Socher et al., 2013).
8485 These same components can also be used in bottom-up parsing, in a manner that is
8486 similar to the CKY algorithm for weighted context-free grammars (see § 10.1): compute
8487 the score and best analysis for each possible span of increasing lengths, while storing
back-pointers that make it possible to recover the optimal parse of the entire input. However,
there is an important distinction from CKY parsing: for each labeled span $(i, j, \ell)$, we
must use the composition function to construct a representation $z^{(i,j,\ell)}$. This representation
is then used to combine the discourse unit spanning $i+1 : j$ in higher-level discourse
relations. The representation $z^{(i,j,\ell)}$ depends on the entire substructure of the unit spanning
$i+1 : j$, and this violates the locality assumption that underlies CKY’s optimality
guarantee. Bottom-up parsing with recursively constructed span representations is generally
not guaranteed to find the best-scoring discourse parse. This problem is explored
in an exercise at the end of the chapter.
8
To use these algorithms, it is also necessary to binarize all discourse relations during parsing, and then to
“unbinarize” them to reconstruct the desired structure (e.g., Hernault et al., 2010).
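The CKY-style procedure with composed span representations can be sketched as follows. Several things here are assumptions made for illustration: spans are half-open index pairs rather than the $i+1 : j$ convention of the text, the two labels "NS" and "SN" encode only nuclearity, the composition function simply copies the nucleus encoding, and the scoring function is a toy that prefers the child with the larger norm as nucleus.

```python
def compose(z_left, z_right, label):
    # Strong compositionality: the span inherits the encoding of its nucleus.
    return z_left if label == "NS" else z_right

def score(z_left, z_right, label):
    # Toy scorer (an assumption): prefer the larger-norm child as the nucleus.
    n_left = sum(v * v for v in z_left)
    n_right = sum(v * v for v in z_right)
    return n_left - n_right if label == "NS" else n_right - n_left

def bottom_up_parse(units, labels=("NS", "SN")):
    n = len(units)
    # chart[(i, j)] = (score, representation, tree) for units i, ..., j-1
    chart = {(i, i + 1): (0.0, units[i], i) for i in range(n)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            best = None
            for k in range(i + 1, j):
                s_l, z_l, t_l = chart[(i, k)]
                s_r, z_r, t_r = chart[(k, j)]
                for lab in labels:
                    total = s_l + s_r + score(z_l, z_r, lab)
                    if best is None or total > best[0]:
                        best = (total, compose(z_l, z_r, lab), (lab, t_l, t_r))
            # Only one representation survives per span, so the search is not exact.
            chart[(i, j)] = best
    return chart[(0, n)]

total, encoding, tree = bottom_up_parse([[1.0], [3.0], [2.0]])
print(total, encoding)  # 13.0 [3.0]
```

Because each chart cell keeps a single analysis and its composed encoding, a lower-scoring subtree whose encoding would have scored better higher up is discarded, which is the loss of optimality discussed above.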
8497 Transition-based discourse parsing One drawback of bottom-up parsing is its cubic
8498 time complexity in the length of the input. For long documents, transition-based parsing
8499 is an appealing alternative. The shift-reduce algorithm can be applied to discourse parsing
8500 fairly directly (Sagae, 2009): the stack stores a set of discourse units and their represen-
8501 tations, and each action is chosen by a function of these representations. This function
could be a linear product of weights and features, or it could be a neural network applied
to encodings of the discourse units. The REDUCE action then performs composition
8504 on the two discourse units at the top of the stack, yielding a larger composite discourse
8505 unit, which goes on top of the stack. All of the techniques for integrating learning and
8506 transition-based parsing, described in § 11.3, are applicable to discourse parsing.
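A minimal shift-reduce loop for discourse parsing might look like the sketch below. The action chooser and the composition function are stand-ins chosen for illustration (a similarity threshold and an element-wise average), not the models used in the cited work, and relation labels are omitted.

```python
from collections import deque

def shift_reduce_parse(units, choose_action, compose):
    """Build a binary tree over a non-empty list of EDU encodings."""
    buffer = deque(enumerate(units))   # (tree, encoding) pairs still to be read
    stack = []
    while buffer or len(stack) > 1:
        if buffer and len(stack) >= 2:
            action = choose_action(stack, buffer)
        else:
            # Only one action is legal in these configurations.
            action = "SHIFT" if buffer else "REDUCE"
        if action == "SHIFT":
            stack.append(buffer.popleft())
        else:
            tree_r, z_r = stack.pop()
            tree_l, z_l = stack.pop()
            stack.append(((tree_l, tree_r), compose(z_l, z_r)))
    return stack[0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def choose_by_similarity(stack, buffer, threshold=0.5):
    # Stand-in policy: merge the top two units when their encodings are similar.
    return "REDUCE" if dot(stack[-2][1], stack[-1][1]) > threshold else "SHIFT"

def average(z_l, z_r):
    return [(a + b) / 2 for a, b in zip(z_l, z_r)]

edus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # invented EDU encodings
tree, encoding = shift_reduce_parse(edus, choose_by_similarity, average)
print(tree)  # ((0, 1), 2)
```

Each unit is shifted once and each reduction removes one stack item, so the number of actions is linear in the number of elementary discourse units.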
8526 Assertions may also support or rebut proposed links between two other assertions, cre-
8527 ating a hypergraph, which is a generalization of a graph to the case in which edges can
8528 join any number of vertices. This can be seen by introducing another sentence into the
8529 example:
S3 acknowledges the validity of S2, but undercuts its support of S1. This can be represented
by introducing a hyperedge, $(S3, S2, S1)_{\text{undercut}}$, indicating that S3 undercuts the
8534 proposed relationship between S2 and S1. S4 then undercuts the relevance of S3.
8535 Argumentation mining is the task of recovering such structures from raw texts. At
8536 present, annotations of argumentation structure are relatively small: Stab and Gurevych
8537 (2014a) have annotated a collection of 90 persuasive essays, and Peldszus and Stede (2015)
8538 have solicited and annotated a set of 112 paragraph-length “microtexts” in German.
8564 text, and is therefore more important to include in an extractive summary (Marcu, 1997a).9
8565 This insight can be generalized from individual relations using the concept of discourse
depth (Hirao et al., 2013): for each elementary discourse unit $e$, the discourse depth $d_e$ is
the number of relations in which a discourse unit containing $e$ is the satellite.
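Discourse depth can be computed with one recursive pass over the discourse tree. The tree encoding below is an assumption made for this sketch: a leaf is a string naming an elementary discourse unit, and an internal node is a pair of a relation name and a list of (role, child) pairs, where the role is "N" for nucleus or "S" for satellite. The small tree is invented.

```python
def discourse_depths(tree, depth=0, out=None):
    """Return {EDU: number of enclosing units that are satellites}."""
    out = {} if out is None else out
    if isinstance(tree, str):            # elementary discourse unit
        out[tree] = depth
        return out
    _relation, children = tree
    for role, child in children:
        # Being the satellite of this relation adds one to every EDU below it.
        discourse_depths(child, depth + (1 if role == "S" else 0), out)
    return out

toy_tree = ("CONCESSION", [
    ("S", ("JUSTIFY", [("N", "a"), ("S", "b")])),
    ("N", "h"),
])
print(discourse_depths(toy_tree))  # {'a': 1, 'b': 2, 'h': 0}
```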
Both discourse depth and nuclearity can be incorporated into extractive summarization,
using constrained optimization. Let $x_n$ be a bag-of-words vector representation of
elementary discourse unit $n$, let $y_n \in \{0, 1\}$ indicate whether $n$ is included in the summary,
and let $d_n$ be the depth of unit $n$. Furthermore, let each discourse unit have a “head” $h$,
which is defined recursively:
8573 • if a discourse unit is produced by a subordinating relation, then its head is the head
8574 of the (unique) nucleus;
8575 • if a discourse unit is produced by a coordinating relation, then its head is the head
8576 of the left-most nucleus;
8577 • for each elementary discourse unit, its parent π(n) ∈ {∅, 1, 2, . . . , N } is the head of
8578 the smallest discourse unit containing n whose head is not n;
8579 • if n is the head of the discourse unit spanning the whole document, then π(n) = ∅.
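$$\begin{aligned}
\max_{y \in \{0,1\}^N} \quad & \sum_{n=1}^{N} y_n \frac{\Psi(x_n, \{x_{1:N}\})}{d_n} \\
\text{s.t.} \quad & \sum_{n=1}^{N} y_n \Big( \sum_{j=1}^{V} x_{n,j} \Big) \le L \\
& y_{\pi(n)} \ge y_n, \quad \forall n
\end{aligned} \qquad [16.11]$$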
where $\Psi(x_n, \{x_{1:N}\})$ measures the coverage of elementary discourse unit $n$ with respect
to the rest of the document, and $\sum_{j=1}^{V} x_{n,j}$ is the number of tokens in $x_n$. The first constraint
ensures that the number of tokens in the summary has an upper bound $L$. The
8583 second constraint ensures that no elementary discourse unit is included unless its parent
8584 is also included. In this way, the discourse structure is used twice: to downweight the
8585 contributions of elementary discourse units that are not central to the discourse, and to
8586 ensure that the resulting structure is a subtree of the original discourse parse. The opti-
9
Conversely, the arguments of a multi-nuclear relation should either both be included in the summary,
or both excluded (Durrett et al., 2016).
[Figure 16.6 tree, by level: h; a; b, d, e; c, f, g]
Figure 16.6: A discourse depth tree (Hirao et al., 2013) for the discourse parse from Fig-
ure 16.5, in which each elementary discourse unit is connected to its parent. The discourse
units in one valid summary are underlined.
8587 mization problem in 16.11 can be solved with integer linear programming, described in
8588 § 13.2.2.10
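For a handful of discourse units, the optimization problem in 16.11 can be checked by exhaustive search, which makes the role of each constraint easy to see. The sketch below substitutes brute-force enumeration for integer linear programming, so it is only practical for very small $N$. All numbers are invented, the coverage scores are supplied directly rather than computed from bag-of-words vectors, and depths are assumed to be positive so that the division is defined.

```python
from itertools import product

def summarize(coverage, depth, length, parent, max_tokens):
    """Exhaustively maximize depth-weighted coverage under the two constraints."""
    n = len(coverage)
    best_value, best_y = float("-inf"), None
    for y in product((0, 1), repeat=n):
        if sum(y[i] * length[i] for i in range(n)) > max_tokens:
            continue    # violates the length budget L
        if any(y[i] and parent[i] is not None and not y[parent[i]] for i in range(n)):
            continue    # a unit is selected without its parent
        value = sum(y[i] * coverage[i] / depth[i] for i in range(n))
        if value > best_value:
            best_value, best_y = value, y
    return best_y, best_value

coverage = [3.0, 2.0, 4.0, 1.0]   # invented values of the coverage function
depth = [1, 2, 2, 3]              # assumed positive
length = [5, 4, 6, 3]             # tokens per unit
parent = [None, 0, 0, 2]          # parent index; None marks the root head
print(summarize(coverage, depth, length, parent, max_tokens=11))
# ((1, 0, 1, 0), 5.0)
```

Unit 2 has the highest raw coverage, but it can only be selected together with its parent, unit 0, and the pair exactly exhausts the budget of 11 tokens.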
8589 Figure 16.6 shows a discourse depth tree for the RST analysis from Figure 16.5, in
which each elementary discourse unit is connected to (and below) its parent. The figure also
8591 shows a valid summary, corresponding to:
8592 (16.13) It could have been a great movie, and I really liked the son of the leader of the
8593 Samurai. But, other than all that, this movie is nothing more than hidden rip-offs.
8608 discourse unit is computed from its arguments and from a parameter correspond-
8609 ing to the discourse relation (Ji and Smith, 2017).
8610 Shallow, non-hierarchical discourse relations have also been applied to document clas-
8611 sification. One approach is to impose a set of constraints on the analyses of individual
8612 discourse units, so that adjacent units have the same polarity when they are connected
8613 by a discourse relation indicating agreement, and opposite polarity when connected by a
8614 contrastive discourse relation, indicating disagreement (Somasundaran et al., 2009; Zirn
8615 et al., 2011). Yang and Cardie (2014) apply explicitly-marked relations from the Penn
8616 Discourse Treebank to the problem of sentence-level sentiment polarity classification (see
8617 § 4.1). They impose the following soft constraints:
8618 • When a CONTRAST relation appears between two sentences, those sentences should
8619 have opposite sentiment polarity.
8620 • When an EXPANSION or CONTINGENCY relation appears between two sentences,
8621 they should have the same polarity.
8622 • When a CONTRAST relation appears within a sentence, it should have neutral polar-
8623 ity, since it is likely to express both sentiments.
8624 These discourse-driven constraints are shown to improve performance on two datasets of
8625 product reviews.
8638 • Hierarchical discourse relations tend to have a “canonical ordering” of the nucleus
8639 and satellite (Mann and Thompson, 1988): for example, in the ELABORATION rela-
8640 tion from rhetorical structure theory, the nucleus always comes first, while in the
8641 JUSTIFICATION relation, the satellite tends to be first (Marcu, 1997b).
8642 • Discourse relations should be signaled by connectives that are appropriate to the
8643 semantic or functional relationship between the arguments: for example, a coherent
8644 text would be more likely to use however to signal a COMPARISON relation than a
8645 temporal relation (Kibble and Power, 2004).
• Discourse relations tend to appear in predictable sequences: for example,
COMPARISON relations tend to immediately precede CONTINGENCY relations
(Pitler et al., 2008). This observation can be formalized by generalizing the
entity grid model (§ 16.2.2), so that each cell (i, j) provides information about the
role of the discourse argument containing a mention of entity j in sentence i (Lin
et al., 2011). For example, if the first sentence is ARG 1 of a comparison relation, then
any entity mentions in the sentence would be labeled COMP.ARG1. This approach
can also be applied to RST discourse relations (Feng et al., 2014).
8654 Datasets One difficulty with evaluating metrics of discourse coherence is that human-
8655 generated texts usually meet some minimal threshold of coherence. For this reason, much
8656 of the research on measuring coherence has focused on synthetic data. A typical setting is
8657 to permute the sentences of a human-written text, and then determine whether the origi-
8658 nal sentence ordering scores higher according to the proposed coherence measure (Barzi-
8659 lay and Lapata, 2008). There are also small datasets of human evaluations of the coherence
8660 of machine summaries: for example, human judgments of the summaries from the partic-
8661 ipating systems in the 2003 Document Understanding Conference are available online.11
8662 Researchers from the Educational Testing Service (an organization which administers sev-
8663 eral national exams in the United States) have studied the relationship between discourse
8664 coherence and student essay quality (Burstein et al., 2003, 2010). A public dataset of es-
8665 says from second-language learners, with quality annotations, has been made available by
8666 researchers at Cambridge University (Yannakoudakis et al., 2011). At the other extreme,
8667 Louis and Nenkova (2013) analyze the structure of professionally written scientific essays,
8668 finding that discourse relation transitions help to distinguish prize-winning essays from
8669 other articles in the same genre.
11
http://homepages.inf.ed.ac.uk/mlap/coherence/
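The sentence-permutation evaluation can be sketched as below. The coherence measure is a deliberately crude stand-in (mean word overlap between adjacent sentences), and the function names, the toy document, and the number of permutations are all assumptions for illustration.

```python
import random

def word_overlap_coherence(sentences):
    # Stand-in coherence measure: mean word overlap of adjacent sentences.
    sets = [set(s.lower().split()) for s in sentences]
    pairs = list(zip(sets, sets[1:]))
    return sum(len(a & b) for a, b in pairs) / max(len(pairs), 1)

def permutation_accuracy(sentences, coherence, n_permutations=20, seed=0):
    """Fraction of shuffled orderings that score strictly below the original."""
    rng = random.Random(seed)
    original = coherence(sentences)
    wins = trials = 0
    for _ in range(n_permutations):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        if shuffled == sentences:
            continue                  # skip the identity permutation
        trials += 1
        wins += original > coherence(shuffled)
    return wins / trials if trials else float("nan")

toy_document = [
    "the whale surfaced near the ship",
    "the ship turned toward the whale",
    "the crew lowered the boats",
    "the boats reached open water",
]
accuracy = permutation_accuracy(toy_document, word_overlap_coherence)
print(accuracy)
```

A measure that cannot tell the original ordering from its reverse, as is the case for this symmetric overlap score, will lose credit on such permutations, which is one reason the setting is informative.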
8673 Exercises
8674 1. • Implement the smoothed cosine similarity metric from Equation 16.2, using the
8675 smoothing kernel k = [.5, .3, .15, .05].
8676 • Download the text of a news article with at least ten paragraphs.
8677 • Compute and plot the smoothed similarity s over the length of the article.
• Identify local minima in $s$ as follows: first find all sentences $m$ such that $s_m < s_{m \pm 1}$.
Then search among these points to find the five sentences with the lowest $s_m$.
8681 • How often do the five local minima correspond to paragraph boundaries?
8682 – The fraction of local minima that are paragraph boundaries is the precision-
8683 at-k, where in this case, k = 5.
8684 – The fraction of paragraph boundaries which are local minima is the recall-
8685 at-k.
8686 – Compute precision-at-k and recall-at-k for k = 3 and k = 10.
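As a starting point for the evaluation step of this exercise, the following sketch selects the lowest local minima and computes precision-at-k and recall-at-k. The similarity values and boundary positions are invented, and the smoothed similarity itself is left to be implemented.

```python
def lowest_local_minima(s, k):
    """Indices m with s[m] below both neighbours, keeping the k lowest values."""
    minima = [m for m in range(1, len(s) - 1) if s[m] < s[m - 1] and s[m] < s[m + 1]]
    return sorted(minima, key=lambda m: s[m])[:k]

def precision_recall_at_k(predicted, boundaries):
    hits = len(set(predicted) & set(boundaries))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(boundaries) if boundaries else 0.0
    return precision, recall

s = [0.9, 0.4, 0.8, 0.7, 0.2, 0.6, 0.5, 0.3, 0.9]   # invented similarity scores
predicted = lowest_local_minima(s, k=2)
print(predicted)                                  # [4, 7]
print(precision_recall_at_k(predicted, [4, 6]))   # (0.5, 0.5)
```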
8687 2. This exercise is to be done in pairs. Each participant selects an article from to-
8688 day’s news, and replaces all mentions of individual people with special tokens like
PERSON1, PERSON2, and so on. The other participant should then use the rules
of centering theory to guess each type of referring expression: full name (e.g., Captain
Ahab), partial name (e.g., Ahab), nominal (e.g., the ship’s captain), or pronoun. Check
8692 whether the predictions match the original article, and whether the original article
8693 conforms to the rules of centering theory.
8702 Applications
Chapter 17
Information extraction
8705 Computers offer powerful capabilities for searching and reasoning about structured records
8706 and relational data. Some even argue that the most important limitation of artificial intel-
8707 ligence is not inference or learning, but simply having too little knowledge (Lenat et al.,
8708 1990). Natural language processing provides an appealing solution: automatically con-
8709 struct a structured knowledge base by reading natural language text.
For example, many Wikipedia pages have an “infobox” that provides structured
information about an entity or event. An example is shown in Figure 17.1a: each row
represents one or more properties of the entity IN THE AEROPLANE OVER THE SEA, a record
8713 album. The set of properties is determined by a predefined schema, which applies to all
8714 record albums in Wikipedia. As shown in Figure 17.1b, the values for many of these fields
8715 are indicated directly in the first few sentences of text on the same Wikipedia page.
8716 The task of automatically constructing (or “populating”) an infobox from text is an
8717 example of information extraction. Much of information extraction can be described in
8718 terms of entities, relations, and events.
• Entities are uniquely specified objects in the world, such as people (JEFF MANGUM),
places (ATHENS, GEORGIA), organizations (MERGE RECORDS), and times (FEBRUARY
10, 1998). Chapter 8 described the task of named entity recognition, which labels
tokens as parts of entity spans. Now we will see how to go further, linking each
entity mention to an element in a knowledge base.
• Relations include a predicate and two arguments: for example, CAPITAL(GEORGIA, ATLANTA).
• Events involve multiple typed arguments. For example, the production and release
408 CHAPTER 17. INFORMATION EXTRACTION
Figure 17.1: From the Wikipedia page for the album “In the Aeroplane Over the Sea”,
retrieved October 26, 2017.
8727 Information extraction is similar to semantic role labeling (chapter 13): we may think
8728 of predicates as corresponding to events, and the arguments as defining slots in the event
8729 representation. However, the goals of information extraction are different. Rather than
8730 accurately parsing every sentence, information extraction systems often focus on recog-
8731 nizing a few key relation or event types, or on the task of identifying all properties of a
8732 given entity. Information extraction is often evaluated by the correctness of the resulting
8733 knowledge base, and not by how many sentences were accurately parsed. The goal is
8734 sometimes described as macro-reading, as opposed to micro-reading, in which each sen-
8735 tence must be analyzed correctly. Macro-reading systems are not penalized for ignoring
8736 difficult sentences, as long as they can recover the same information from other, easier-
8737 to-read sources. However, macro-reading systems must resolve apparent inconsistencies
(was the album released on MERGE RECORDS or BLUE ROSE RECORDS?), requiring reasoning
across the entire dataset.
8740 In addition to the basic tasks of recognizing entities, relations, and events, information
8741 extraction systems must handle negation, and must be able to distinguish statements of
8742 fact from hopes, fears, hunches, and hypotheticals. Finally, information extraction is of-
8743 ten paired with the problem of question answering, which requires accurately parsing a
8744 query, and then selecting or generating a textual answer. Question answering systems can
8745 be built on knowledge bases that are extracted from large text corpora, or may attempt to
8746 identify answers directly from the source texts.
8750 (17.4) The United States Army captured a hill overlooking Atlanta on May 14, 1864.
8752 1. Identify the spans United States Army, Atlanta, and May 14, 1864 as entity mentions.
8753 (The hill is not uniquely identified, so it is not a named entity.) We may also want to
8754 recognize the named entity types: organization, location, and date. This is named
8755 entity recognition, and is described in chapter 8.
2. Link these spans to entities in a knowledge base: U.S. ARMY, ATLANTA, and
1864-MAY-14. This task is known as entity linking.
8758 The strings to be linked to entities are mentions — similar to the use of this term in
8759 coreference resolution. In some formulations of the entity linking task, only named enti-
8760 ties are candidates for linking. This is sometimes called named entity linking (Ling et al.,
8761 2015). In other formulations, such as Wikification (Milne and Witten, 2008), any string
8762 can be a mention. The set of target entities often corresponds to Wikipedia pages, and
8763 Wikipedia is the basis for more comprehensive knowledge bases such as YAGO (Suchanek
8764 et al., 2007), DBPedia (Auer et al., 2007), and Freebase (Bollacker et al., 2008). Entity link-
8765 ing may also be performed in more “closed” settings, where a much smaller list of targets
8766 is provided in advance. The system must also determine if a mention does not refer to
8767 any entity in the knowledge base, sometimes called a NIL entity (McNamee and Dang,
8768 2009).
8769 Returning to (17.4), the three entity mentions may seem unambiguous. But the Wikipedia
8770 disambiguation page for the string Atlanta says otherwise:1 there are more than twenty
1
https://en.wikipedia.org/wiki/Atlanta_(disambiguation), retrieved November 1, 2017.
8771 different towns and cities, five United States Navy vessels, a magazine, a television show,
8772 a band, and a singer — each prominent enough to have its own Wikipedia page. We now
8773 consider how to choose among these dozens of possibilities. In this chapter we will focus
8774 on supervised approaches. Unsupervised entity linking is closely related to the problem
8775 of cross-document coreference resolution, where the task is to identify pairs of mentions
8776 that corefer, across document boundaries (Bagga and Baldwin, 1998b; Singh et al., 2011).
8779 where y is a target entity, x is a description of the mention, Y(x) is a set of candidate
8780 entities, and c is a description of the context — such as the other text in the document,
8781 or its metadata. The function Ψ is a scoring function, which could be a linear model,
8782 Ψ(y, x, c) = θ · f (y, x, c), or a more complex function such as a neural network. In either
8783 case, the scoring function can be learned by minimizing a margin-based ranking loss,
$$\ell(\hat{y}, y^{(i)}, x^{(i)}, c^{(i)}) = \left( \Psi(\hat{y}, x^{(i)}, c^{(i)}) - \Psi(y^{(i)}, x^{(i)}, c^{(i)}) + 1 \right)_+ , \qquad [17.2]$$
where $y^{(i)}$ is the ground truth and $\hat{y} \neq y^{(i)}$ is the predicted target for mention $x^{(i)}$ in
context $c^{(i)}$ (Joachims, 2002; Dredze et al., 2010).
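The margin-based ranking loss can be written directly from its definition. In this sketch the scoring function is a lookup table of invented scores, the entity identifiers are hypothetical, and the competing candidate is taken to be the highest-scoring entity other than the ground truth.

```python
def ranking_loss(score, candidates, gold, x, c):
    """Hinge loss between the highest-scoring wrong candidate and the gold entity."""
    wrong = [y for y in candidates if y != gold]
    if not wrong:
        return 0.0
    y_hat = max(wrong, key=lambda y: score(y, x, c))
    return max(0.0, score(y_hat, x, c) - score(gold, x, c) + 1.0)

toy_scores = {                       # invented scores for the mention "Atlanta"
    "ATLANTA_GEORGIA": 2.0,
    "ATLANTA_MAGAZINE": 1.5,
    "ATLANTA_OHIO": -1.0,
}

def toy_score(y, x, c):
    # Stand-in for a learned scoring function; ignores the mention and context.
    return toy_scores[y]

candidates = list(toy_scores)
print(ranking_loss(toy_score, candidates, "ATLANTA_GEORGIA", "Atlanta", None))  # 0.5
print(ranking_loss(toy_score, candidates, "ATLANTA_OHIO", "Atlanta", None))     # 4.0
```

The first loss is nonzero even though the correct entity is ranked first, because the runner-up falls inside the margin of one.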
8786 Candidate identification For computational tractability, it is helpful to restrict the set of
8787 candidates, Y(x). One approach is to use a name dictionary, which maps from strings
8788 to the entities that they might mention. This mapping is many-to-many: a string such as
8789 Atlanta can refer to multiple entities, and conversely, an entity such as ATLANTA can be
8790 referenced by multiple strings. A name dictionary can be extracted from Wikipedia, with
8791 links between each Wikipedia entity page and the anchor text of all hyperlinks that point
8792 to the page (Bunescu and Pasca, 2006; Ratinov et al., 2011). To improve recall, the name
8793 dictionary can be augmented by partial and approximate matching (Dredze et al., 2010),
8794 but as the set of candidates grows, the risk of false positives increases. For example, the
string Atlanta is a partial match to the Atlanta Fed (a name for the FEDERAL RESERVE BANK
OF ATLANTA), and a noisy match (edit distance of one) to Atalanta (a heroine in Greek
mythology and an Italian soccer team).
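A name dictionary of this kind is essentially a many-to-many map from anchor strings to entities. The sketch below assumes that anchor text and link targets have already been extracted as pairs; the pairs and entity identifiers are invented, and partial matching is reduced to whole-word containment.

```python
from collections import defaultdict

def build_name_dictionary(anchor_links):
    """Map each lower-cased anchor string to the set of entities it links to."""
    names = defaultdict(set)
    for anchor_text, entity in anchor_links:
        names[anchor_text.lower()].add(entity)
    return names

def candidates(names, mention, allow_partial=False):
    mention = mention.lower()
    found = set(names.get(mention, ()))
    if allow_partial:
        # Partial matching raises recall but also admits more false positives.
        for name, entities in names.items():
            if mention in name.split():
                found |= entities
    return found

toy_links = [                        # invented (anchor text, entity) pairs
    ("Atlanta", "ATLANTA_GEORGIA"),
    ("Atlanta", "ATLANTA_MAGAZINE"),
    ("the ATL", "ATLANTA_GEORGIA"),
    ("Atlanta Fed", "FEDERAL_RESERVE_BANK_OF_ATLANTA"),
]
names = build_name_dictionary(toy_links)
print(sorted(candidates(names, "Atlanta")))
# ['ATLANTA_GEORGIA', 'ATLANTA_MAGAZINE']
print(len(candidates(names, "Atlanta", allow_partial=True)))  # 3
```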
8798 Features Feature-based approaches to entity ranking rely on three main types of local
8799 information (Dredze et al., 2010):
8800 • The similarity of the mention string to the canonical entity name, as quantified by
8801 string similarity. This feature would elevate the city ATLANTA over the basketball
team ATLANTA HAWKS for the string Atlanta.
8803 • The popularity of the entity, which can be measured by Wikipedia page views or
PageRank in the Wikipedia link graph. This feature would elevate ATLANTA, GEORGIA
over the unincorporated community of ATLANTA, OHIO.
8806 • The entity type, as output by the named entity recognition system. This feature
8807 would elevate the city of ATLANTA over the magazine ATLANTA in contexts where
8808 the mention is tagged as a location.
8809 In addition to these local features, the document context can also help. If Jamaica is men-
8810 tioned in a document about the Caribbean, it is likely to refer to the island nation; in
8811 the context of New York, it is likely to refer to the neighborhood in Queens; in the con-
8812 text of a menu, it might refer to a hibiscus tea beverage. Such hints can be formalized
8813 by computing the similarity between the Wikipedia page describing each candidate en-
8814 tity and the mention context c(i) , which may include the bag-of-words representing the
8815 document (Dredze et al., 2010; Hoffart et al., 2011) or a smaller window of text around
8816 the mention (Ratinov et al., 2011). For example, we can compute the cosine similarity
8817 between bag-of-words vectors for the context and entity description, typically weighted
8818 using inverse document frequency to emphasize rare words.2
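The context similarity feature can be sketched with bag-of-words counts and logarithmic inverse document frequency weights. The entity descriptions here are short invented stand-ins for Wikipedia pages, the identifiers are hypothetical, and the document frequencies are estimated from those three descriptions alone.

```python
import math
from collections import Counter

def idf_weights(documents):
    """log(N / document count) for every word in a list of tokenized documents."""
    n = len(documents)
    df = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n / count) for word, count in df.items()}

def weighted_cosine(tokens_a, tokens_b, idf):
    a, b = Counter(tokens_a), Counter(tokens_b)

    def weight(counts, w):
        # Words outside the IDF table get weight zero.
        return counts[w] * idf.get(w, 0.0)

    dot = sum(weight(a, w) * weight(b, w) for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(weight(a, w) ** 2 for w in a))
    norm_b = math.sqrt(sum(weight(b, w) ** 2 for w in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

descriptions = {                     # invented stand-ins for entity pages
    "JAMAICA_NATION": "island nation in the caribbean sea".split(),
    "JAMAICA_QUEENS": "neighborhood in queens new york".split(),
    "JAMAICA_DRINK": "hibiscus tea served cold".split(),
}
idf = idf_weights(list(descriptions.values()))
context = "flights to the caribbean island".split()
scores = {y: weighted_cosine(context, d, idf) for y, d in descriptions.items()}
best = max(scores, key=scores.get)
print(best)  # JAMAICA_NATION
```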
8819 Neural entity linking An alternative approach is to compute the score for each entity
8820 candidate using distributed vector representations of the entities, mentions, and context.
8821 For example, for the task of entity linking in Twitter, Yang et al. (2016) employ the bilinear
8822 scoring function,
$$\Psi(y, x, c) = v_y^\top \Theta^{(y,x)} x + v_y^\top \Theta^{(y,c)} c, \qquad [17.3]$$
with $v_y \in \mathbb{R}^{K_y}$ as the vector embedding of entity y, $x \in \mathbb{R}^{K_x}$ as the embedding of the mention, $c \in \mathbb{R}^{K_c}$ as the embedding of the context, and the matrices $\Theta^{(y,x)}$ and $\Theta^{(y,c)}$ as parameters that score the compatibility of each entity with respect to the mention and context. Each of the vector embeddings can be learned from an end-to-end objective, or pre-trained on unlabeled data.
• Pretrained entity embeddings can be obtained from an existing knowledge base (Bordes et al., 2011, 2013), or by running a word embedding algorithm such as WORD2VEC on the text of Wikipedia, with hyperlinks substituted for the anchor text.3
2 The document frequency of word j is $\text{DF}(j) = \frac{1}{N} \sum_{i=1}^{N} \delta(x_j^{(i)} > 0)$, equal to the fraction of documents in which the word appears. The contribution of each word to the cosine similarity of two bag-of-words vectors can be weighted by the inverse document frequency $\frac{1}{\text{DF}(j)}$ or $\log \frac{1}{\text{DF}(j)}$, to emphasize rare words (Spärck Jones, 1972).
8831 • The embedding of the mention x can be computed by averaging the embeddings
8832 of the words in the mention (Yang et al., 2016), or by the compositional techniques
8833 described in § 14.8.
• The embedding of the context c can also be computed from the embeddings of the words in the context. A denoising autoencoder learns a function from raw text to dense K-dimensional vector encodings by minimizing a reconstruction loss (Vincent et al., 2010),
$$\min_{\theta_g, \theta_h} \sum_{i=1}^{N} \| x^{(i)} - g(h(\tilde{x}^{(i)}; \theta_h); \theta_g) \|^2, \qquad [17.4]$$
where $\tilde{x}^{(i)}$ is a noisy version of the bag-of-words counts $x^{(i)}$, which is produced by randomly setting some counts to zero; $h : \mathbb{R}^V \to \mathbb{R}^K$ is an encoder with parameters $\theta_h$; and $g : \mathbb{R}^K \to \mathbb{R}^V$ is a decoder with parameters $\theta_g$. The encoder and decoder functions are typically implemented as feedforward neural networks. To apply this model to entity linking, each entity and context are initially represented by the encoding of their bag-of-words vectors, h(e) and h(c), and these encodings are then fine-tuned from labeled data (He et al., 2013). The context vector c can also be obtained by convolution on the embeddings of words in the document (Sun et al., 2015), or by examining metadata such as the author's social network (Yang et al., 2016).
The remaining parameters $\Theta^{(y,x)}$ and $\Theta^{(y,c)}$ can be trained by backpropagation from the margin loss in Equation 17.2.
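The bilinear scoring function in Equation 17.3 can be written out directly. The sketch below uses plain Python lists so that it is self-contained; the helper names `matvec`, `dot`, and `bilinear_score`, and all numeric values, are assumptions for illustration rather than learned parameters.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bilinear_score(v_y, x, c, theta_yx, theta_yc):
    """Score an entity embedding v_y against mention x and context c.

    theta_yx has shape (K_y, K_x); theta_yc has shape (K_y, K_c).
    """
    return dot(v_y, matvec(theta_yx, x)) + dot(v_y, matvec(theta_yc, c))


# Hypothetical toy values: K_y = 2, K_x = 2, K_c = 3.
v_y = [1.0, 2.0]
x = [3.0, 4.0]
c = [1.0, 0.0, 5.0]
theta_yx = [[1.0, 0.0],
            [0.0, 1.0]]
theta_yc = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
score = bilinear_score(v_y, x, c, theta_yx, theta_yc)
print(score)
```

Because the mention and context terms are added, each matrix can be given a different shape, which lets the entity, mention, and context embeddings have different dimensionalities.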
In each case, the term Washington refers to a different entity, and this reference is strongly suggested by the other entries on the list. In the last list, all three names are highly ambiguous: there are dozens of other Adams and Jefferson entities in Wikipedia. But a preference for coherence motivates collectively linking these references to the first three U.S. presidents.
3 Pre-trained entity embeddings can be downloaded from https://code.google.com/archive/p/word2vec/.
A general approach to collective entity linking is to introduce a compatibility score $\Psi_c(y)$. Collective entity linking is then performed by optimizing the global objective,
$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}(x)} \Psi_c(y) + \sum_{i=1}^{N} \Psi_\ell(y^{(i)}, x^{(i)}, c^{(i)}), \qquad [17.5]$$
where $\mathcal{Y}(x)$ is the set of all possible collective entity assignments for the mentions in x, and $\Psi_\ell$ is the local scoring function for each entity i. The compatibility function is typically decomposed into a sum of pairwise scores, $\Psi_c(y) = \sum_{i=1}^{N} \sum_{j \neq i} \Psi_c(y^{(i)}, y^{(j)})$. These scores can be computed in a number of different ways:
8862 • Wikipedia defines high-level categories for entities (e.g., living people, Presidents of
8863 the United States, States of the United States), and Ψc can reward entity pairs for the
8864 number of categories that they have in common (Cucerzan, 2007).
8865 • Compatibility can be measured by the number of incoming hyperlinks shared by
8866 the Wikipedia pages for the two entities (Milne and Witten, 2008).
• In a neural architecture, the compatibility of two entities can be set equal to the inner product of their embeddings, $\Psi_c(y^{(i)}, y^{(j)}) = v_{y^{(i)}} \cdot v_{y^{(j)}}$.
8869 • A non-pairwise compatibility score can be defined using a type of latent variable
8870 model known as a probabilistic topic model (Blei et al., 2003; Blei, 2012). In this
8871 framework, each latent topic is a probability distribution over entities, and each
8872 document has a probability distribution over topics. Each entity helps to determine
8873 the document’s distribution over topics, and in turn these topics help to resolve am-
8874 biguous entity mentions (Newman et al., 2006). Inference can be performed using
8875 the sampling techniques described in chapter 5.
8876 Unfortunately, collective entity linking is NP-hard even for pairwise compatibility func-
8877 tions, so exact optimization is almost certainly intractable. Various approximate inference
8878 techniques have been proposed, including integer linear programming (Cheng and Roth,
8879 2013), Gibbs sampling (Han and Sun, 2012), and graph-based algorithms (Hoffart et al.,
8880 2011; Han et al., 2011).
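To make the global objective in Equation 17.5 concrete, the following sketch enumerates every joint assignment for a tiny example. Exhaustive search is only feasible for toy inputs (the number of assignments grows exponentially with the number of mentions); it is not one of the approximate inference methods cited above. The candidate sets, local scores, and category sets are hypothetical, and the pairwise score counts shared categories in the spirit of the category-overlap approach.

```python
from itertools import product


def collective_link(candidates, local_score, pair_score):
    """Exhaustively maximize local scores plus pairwise compatibility."""
    best, best_score = None, float("-inf")
    for y in product(*candidates):
        score = sum(local_score(i, y_i) for i, y_i in enumerate(y))
        # Sum over ordered pairs (i, j) with j != i, as in the pairwise decomposition.
        score += sum(pair_score(y[i], y[j])
                     for i in range(len(y)) for j in range(len(y)) if i != j)
        if score > best_score:
            best, best_score = y, score
    return best, best_score


# Hypothetical candidates for the mentions "Washington" and "Jefferson".
candidates = [
    ["WASHINGTON (state)", "GEORGE WASHINGTON"],
    ["JEFFERSON CITY", "THOMAS JEFFERSON"],
]
local = {
    (0, "WASHINGTON (state)"): 1.0, (0, "GEORGE WASHINGTON"): 0.8,
    (1, "JEFFERSON CITY"): 0.3, (1, "THOMAS JEFFERSON"): 0.9,
}
categories = {
    "WASHINGTON (state)": {"States of the United States"},
    "GEORGE WASHINGTON": {"Presidents of the United States"},
    "JEFFERSON CITY": {"Cities"},
    "THOMAS JEFFERSON": {"Presidents of the United States"},
}

best, best_score = collective_link(
    candidates,
    local_score=lambda i, y: local[(i, y)],
    pair_score=lambda a, b: len(categories[a] & categories[b]),
)
print(best, best_score)
```

With these toy numbers the local scores alone would prefer the state for the first mention, but the shared category makes the two presidents the highest-scoring joint assignment.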
not just the top-scoring prediction. Usunier et al. (2009) define a general ranking error function,
$$L_{\text{rank}}(k) = \sum_{j=1}^{k} \alpha_j, \quad \text{with } \alpha_1 \geq \alpha_2 \geq \cdots \geq 0, \qquad [17.6]$$
where k is equal to the number of labels ranked higher than the correct label $y^{(i)}$. This function defines a class of ranking errors: if $\alpha_j = 1$ for all j, then the ranking error is equal to the rank of the correct entity; if $\alpha_1 = 1$ and $\alpha_{j>1} = 0$, then the ranking error is one whenever the correct entity is not ranked first; if $\alpha_j$ decreases smoothly with j, as in $\alpha_j = \frac{1}{j}$, then the error is between these two extremes.
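The three choices of $\alpha_j$ discussed above can be compared numerically. This is a small sketch in which `rank_error` and the three weighting functions are illustrative names, not part of any cited implementation.

```python
def rank_error(k, alpha):
    """Ranking error: the sum of alpha(j) for j = 1..k."""
    return sum(alpha(j) for j in range(1, k + 1))


def constant(j):
    """alpha_j = 1 for all j: the error equals the number of higher-ranked labels."""
    return 1.0


def top_only(j):
    """alpha_1 = 1, alpha_j = 0 for j > 1: the error is one unless ranked first."""
    return 1.0 if j == 1 else 0.0


def harmonic(j):
    """alpha_j = 1/j: an error between the two extremes."""
    return 1.0 / j


k = 3  # three labels are ranked above the correct one
errors = {f.__name__: rank_error(k, f) for f in (constant, top_only, harmonic)}
print(errors)
```

For k = 3 the harmonic weighting gives 1 + 1/2 + 1/3, which lies between the top-only error of 1 and the constant-weight error of 3.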
This ranking error can be integrated into a margin objective. Remember that large margin classification requires not only the correct label, but also that the correct label outscores other labels by a substantial margin. A similar principle applies to ranking: we want a high rank for the correct entity, and we want it to be separated from other entities by a substantial margin. We therefore define the margin-augmented rank,
$$r(y^{(i)}, x^{(i)}) \triangleq \sum_{y \in \mathcal{Y}(x^{(i)}) \setminus y^{(i)}} \delta\left(1 + \psi(y, x^{(i)}) \geq \psi(y^{(i)}, x^{(i)})\right), \qquad [17.7]$$
where δ(·) is a delta function, and $\mathcal{Y}(x^{(i)}) \setminus y^{(i)}$ is the set of all entity candidates minus the true entity $y^{(i)}$. The margin-augmented rank is the rank of the true entity, after augmenting every other candidate with a margin of one, under the current scoring function ψ. (The context c is omitted for clarity, and can be considered part of x.)
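The margin-augmented rank in Equation 17.7 is a simple count, as the following sketch shows. The function name and the candidate scores are hypothetical; the scores stand in for the output of whatever scoring function ψ is currently being trained.

```python
def margin_augmented_rank(scores, true_entity):
    """Count candidates whose score plus a margin of one reaches the true score."""
    true_score = scores[true_entity]
    return sum(
        1
        for entity, score in scores.items()
        if entity != true_entity and 1 + score >= true_score
    )


# Hypothetical scores for the mention "Atlanta".
scores = {
    "ATLANTA, GEORGIA": 3.0,
    "ATLANTA HAWKS": 2.5,
    "ATLANTA (magazine)": 2.0,
    "ATLANTA, OHIO": 1.0,
}
rank = margin_augmented_rank(scores, "ATLANTA, GEORGIA")
print(rank)
```

Here the true entity has the highest raw score, yet two candidates fall within the margin of one, so the margin-augmented rank is 2 rather than 0.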
For each instance, a hinge loss is computed from the ranking error associated with this
8896 The sum in Equation 17.8 includes non-zero values for every label that is ranked at least as
8897 high as the true entity, after applying the margin augmentation. Dividing by the margin-
8898 augmented rank of the true entity thus gives the average violation.
8899 The objective in Equation 17.8 is expensive to optimize when the label space is large,
8900 as is usually the case for entity linking against large knowledge bases. This motivates a
randomized approximation called WARP (Weston et al., 2011), shown in Algorithm 19. In this procedure, we sample random entities until one violates the pairwise margin constraint, $\psi(y, x^{(i)}) + 1 \geq \psi(y^{(i)}, x^{(i)})$. The number of samples N required to find such a violation yields an approximation of the margin-augmented rank of the true entity, $r(y^{(i)}, x^{(i)}) \approx \left\lfloor \frac{|\mathcal{Y}(x)|}{N} \right\rfloor$. If a violation is found immediately, N = 1, the correct entity probably ranks below many others, $r \approx |\mathcal{Y}(x)|$. If many samples are required before a violation is found, $N \to |\mathcal{Y}(x)|$, then the correct entity is probably highly ranked, $r \to 1$. A computational advantage of WARP is that it is not necessary to find the highest-scoring label, which can impose a non-trivial computational cost when $\mathcal{Y}(x^{(i)})$ is large. The objective is conceptually similar to the negative sampling objective in WORD2VEC (chapter 14), which compares the observed word against randomly sampled alternatives.
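The sampling step of WARP can be sketched as follows. This covers only the search for a violating entity and the resulting rank estimate, not the parameter update of the full algorithm; the function name, the sampling budget (one draw per competing candidate, with replacement), and the toy scores are assumptions made for illustration.

```python
import random


def warp_sample(scores, true_entity, rng):
    """Sample candidates until one violates the margin constraint.

    Returns (violating_entity, rank_estimate), or (None, 0) if no violation
    is found within the sampling budget.
    """
    others = [y for y in scores if y != true_entity]
    true_score = scores[true_entity]
    # Assumed budget: at most one draw per competing candidate.
    for n_samples in range(1, len(others) + 1):
        y = rng.choice(others)
        if scores[y] + 1 >= true_score:
            # Rank estimate: floor(|Y(x)| / N).
            return y, len(scores) // n_samples
    return None, 0


rng = random.Random(0)

# Every competitor is within the margin, so the first draw is a violation.
easy = {"gold": 10.0, "a": 9.5, "b": 9.8, "c": 9.2}
violator, rank_estimate = warp_sample(easy, "gold", rng)

# No competitor is within the margin, so no violation is ever found.
hard = {"gold": 10.0, "a": 1.0, "b": 2.0, "c": 0.5}
no_violation = warp_sample(hard, "gold", rng)
print(violator, rank_estimate, no_violation)
```

The two toy cases correspond to the two extremes in the text: an immediate violation yields a large rank estimate, and the absence of violations indicates that the true entity is already well separated.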
8916 This sentence introduces a relation between the entities referenced by George Bush and
8917 France. In the Automatic Content Extraction (ACE) ontology (Linguistic Data Consortium,
8918 2005), the type of this relation is PHYSICAL, and the subtype is LOCATED. This relation
8919 would be written,
8920 Relations take exactly two arguments, and the order of the arguments matters.
8921 In the ACE datasets, relations are annotated between entity mentions, as in the exam-
8922 ple above. Relations can also hold between nominals, as in the following example from
8923 the SemEval-2010 shared task (Hendrickx et al., 2009):
Table 17.1: Relations and example sentences from the SemEval-2010 dataset (Hendrickx
et al., 2009)
This sentence describes a relation of type ENTITY-ORIGIN between tea and ginseng. Nominal relation extraction is closely related to semantic role labeling (chapter 13). The main difference is that relation extraction is restricted to a relatively small number of relation types; for example, Table 17.1 shows the ten relation types from SemEval-2010.
This pattern will be “triggered” whenever the literal string , a native of occurs between an entity of type PERSON and an entity of type LOCATION. Such patterns can be generalized beyond literal matches using techniques such as lemmatization, which would enable the words (buy, buys, buying) to trigger the same patterns (see § 4.3.1.2). A more aggressive strategy would be to group all words in a WordNet synset (§ 4.2), so that, e.g., buy and purchase trigger the same patterns.
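A literal pattern of this kind can be sketched with a regular expression. The bracketed inline annotation format (`[PERSON ...]`, `[LOCATION ...]`) is a hypothetical stand-in for the output of a named entity recognizer, and the function name is illustrative; which relation type to assign to a match is left to the caller.

```python
import re

# Matches: a PERSON entity, the literal string ", a native of", a LOCATION entity.
NATIVE_OF = re.compile(
    r"\[PERSON (?P<person>[^\]]+)\], a native of \[LOCATION (?P<location>[^\]]+)\]"
)


def match_native_of(tagged_text):
    """Return (person, location) pairs that trigger the pattern."""
    return [(m.group("person"), m.group("location"))
            for m in NATIVE_OF.finditer(tagged_text)]


tagged = ("[PERSON Maynard Jackson], a native of [LOCATION Dallas], "
          "moved to [LOCATION Atlanta].")
pairs = match_native_of(tagged)
print(pairs)
```

Generalizing beyond the literal string, for example by lemmatizing the words between the two entities before matching, would require an additional preprocessing step that this sketch omits.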
Relation extraction patterns can be implemented in finite-state automata (§ 9.1). If the named entity recognizer is also a finite-state machine, then the systems can be combined by finite-state transduction (Hobbs et al., 1997). This makes it possible to propagate uncertainty through the finite-state cascade, and disambiguate from higher-level context. For example, suppose the entity recognizer cannot decide whether Starbuck refers to either a PERSON or a LOCATION; in the composed transducer, the relation extractor would be free to select the PERSON annotation when it appears in the context of an appropriate pattern.
where $r \in \mathcal{R}$ is a relation type (possibly NIL), $w_{i+1:j}$ is the span of the first argument, and $w_{m+1:n}$ is the span of the second argument. The argument $w_{m+1:n}$ may appear before or after $w_{i+1:j}$ in the text, or they may overlap; we stipulate only that $w_{i+1:j}$ is the first argument of the relation. We now consider three alternatives for computing the scoring function.
$$\Psi(r, (i, j), (m, n), w) = \theta \cdot f(r, (i, j), (m, n), w), \qquad [17.12]$$
with θ representing a vector of weights, and f(·) a vector of features. The pattern-based methods described in § 17.2.1 suggest several features:
• Local features of $w_{i+1:j}$ and $w_{m+1:n}$, including: the strings themselves; whether they are recognized as entities, and if so, which type; whether the strings are present in a gazetteer of entity names; each string's syntactic head (§ 9.2.2).
• Features of the span between the two arguments, $w_{j+1:m}$ or $w_{n+1:i}$ (depending on which argument appears first): the length of the span; the specific words that appear in the span, either as a literal sequence or a bag-of-words; the WordNet synsets (§ 4.2) that appear in the span between the arguments.
• Features of the syntactic relationship between the two arguments, typically the dependency path between the arguments (§ 13.2.1). Example dependency paths are shown in Table 17.2.
Table 17.2: Candidate instances for the PHYSICAL.LOCATED relation, and their dependency paths
are similar and different, we can instead define a similarity function κ, which computes a score for any pair of instances, $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$. The score for any pair of instances (i, j) is $\kappa(x^{(i)}, x^{(j)}) \geq 0$, with κ(i, j) being large when instances $x^{(i)}$ and $x^{(j)}$ are similar. If the function κ obeys a few key properties it is a valid kernel function.4
Given a valid kernel function, we can build a non-linear classifier without explicitly defining a feature vector or neural network architecture. For a binary classification problem y ∈ {−1, 1}, we have the decision function,
$$\hat{y} = \text{Sign}\left(b + \sum_{i=1}^{N} y^{(i)} \alpha^{(i)} \kappa(x^{(i)}, x)\right) \qquad [17.13]$$
where b and $\{\alpha^{(i)}\}_{i=1}^{N}$ are parameters that must be learned from the training set, under the constraint $\forall i, \alpha^{(i)} \geq 0$. Intuitively, each $\alpha^{(i)}$ specifies the importance of the instance $x^{(i)}$
8982 towards the classification rule. Kernel-based classification can be viewed as a weighted
8983 form of the nearest-neighbor classifier (Hastie et al., 2009), in which test instances are
8984 assigned the most common label among their near neighbors in the training set. This
8985 results in a non-linear classification boundary. The parameters are typically learned from
8986 a margin-based objective (see § 2.3), leading to the kernel support vector machine. To
8987 generalize to multi-class classification, we can train separate binary classifiers for each
8988 label (sometimes called one-versus-all), or train binary classifiers for each pair of possible
8989 labels (one-versus-one).
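The decision function in Equation 17.13 can be sketched directly. The Gaussian (RBF) kernel used here is a standard valid kernel chosen only for illustration; it is not one of the dependency kernels discussed in this section. The weights α and offset b are set by hand rather than learned from a margin-based objective, and the function names are assumptions.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel: large when x and z are close, near zero when far apart."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def kernel_predict(x, train_x, train_y, alpha, b, kernel):
    """Binary decision: the sign of b plus the weighted kernel sum."""
    total = b + sum(
        y_i * a_i * kernel(x_i, x)
        for x_i, y_i, a_i in zip(train_x, train_y, alpha)
    )
    return 1 if total >= 0 else -1  # ties are assigned to the positive class


# Hypothetical training set with hand-set (not learned) parameters.
train_x = [(0.0, 0.0), (2.0, 2.0)]
train_y = [-1, 1]
alpha = [1.0, 1.0]
b = 0.0

near_positive = kernel_predict((1.8, 1.9), train_x, train_y, alpha, b, rbf_kernel)
near_negative = kernel_predict((0.1, 0.2), train_x, train_y, alpha, b, rbf_kernel)
print(near_positive, near_negative)
```

Each test point takes the label of the training instance it is most similar to under the kernel, which illustrates the weighted nearest-neighbor view described above.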
Dependency kernels are particularly effective for relation extraction, due to their ability to capture syntactic properties of the path between the two candidate arguments. One class of dependency tree kernels is defined recursively, with the score for a pair of trees equal to the similarity of the root nodes and the sum of similarities of matched pairs of child subtrees (Zelenko et al., 2003; Culotta and Sorensen, 2004). Alternatively, Bunescu and Mooney (2005) define a kernel function over sequences of unlabeled dependency edges, in which the score is computed as a product of scores for each pair of words in the sequence: identical words receive a high score, words that share a synset or part-of-speech receive a small non-zero score (e.g., travel / visit), and unrelated words receive a score of zero.
4 The Gram matrix K arises from computing the kernel function between all pairs in a set of instances. For a valid kernel, the Gram matrix must be symmetric ($K = K^\top$) and positive semi-definite ($\forall a, a^\top K a \geq 0$). For more on kernel-based classification, see chapter 14 of Murphy (2012).
Recurrent neural networks have also been applied to relation extraction, using a network such as a bidirectional LSTM to encode the words or dependency path between the two arguments. Xu et al. (2015) segment each dependency path into left and right subpaths: the path $\text{George Bush} \xleftarrow{\text{NSUBJ}} \text{wants} \xrightarrow{\text{XCOMP}} \text{travel} \xrightarrow{\text{OBL}} \text{France}$ is segmented into the subpaths,
Xu et al. (2015) then run recurrent networks from the arguments to the root word (in this case, wants), obtaining the final representation by max pooling across all the recurrent states along each path. This process can be applied across separate “channels”, in which the inputs consist of embeddings for the words, parts-of-speech, dependency relations, and WordNet hypernyms. To define the model formally, let s(m) define the successor of word m in either the left or right subpath (in a dependency path, each word can have a successor in at most one subpath). Let $x_m^{(c)}$ indicate the embedding of word (or relation) m in channel c, and let $\overleftarrow{h}_m^{(c)}$ and $\overrightarrow{h}_m^{(c)}$ indicate the associated recurrent states in the left and right subtrees respectively. Then the complete model is specified as follows,
$$h_{s(m)}^{(c)} = \text{RNN}(x_{s(m)}^{(c)}, h_m^{(c)}) \qquad [17.18]$$
$$z^{(c)} = \text{MaxPool}\left(\overleftarrow{h}_i^{(c)}, \overleftarrow{h}_{s(i)}^{(c)}, \ldots, \overleftarrow{h}_{\text{root}}^{(c)}, \overrightarrow{h}_j^{(c)}, \overrightarrow{h}_{s(j)}^{(c)}, \ldots, \overrightarrow{h}_{\text{root}}^{(c)}\right) \qquad [17.19]$$
$$\Psi(r, i, j) = \theta \cdot \left[z^{(\text{word})}; z^{(\text{POS})}; z^{(\text{dependency})}; z^{(\text{hypernym})}\right]. \qquad [17.20]$$
9021 Note that z is computed by applying max-pooling to the matrix of horizontally concate-
9022 nated vectors h, while Ψ is computed from the vector of vertically concatenated vectors
9023 z. Xu et al. (2015) pass the score Ψ through a softmax layer to obtain a probability
9024 p(r | i, j, w), and train the model by regularized cross-entropy. Miwa and Bansal (2016)
9025 show that a related model can solve the more challenging “end-to-end” relation extrac-
9026 tion task, in which the model must simultaneously detect entities and then extract their
9027 relations.
9039 • KBP tasks are often formulated from the perspective of identifying attributes of a
9040 few “query” entities. As a result, these systems often start with an information
9041 retrieval phase, in which relevant passages of text are obtained by search.
9042 • For many entity pairs, there will be multiple passages of text that provide evidence.
9043 Slot filling systems must aggregate this evidence to predict a single relation type (or
9044 set of relations).
9045 • Labeled data is usually available in the form of pairs of related entities, rather than
9046 annotated passages of text. Training from such type-level annotations is a challenge:
9047 two entities may be linked by several relations, or they may appear together in a
9048 passage of text that nonetheless does not describe their relation to each other.
9049 Information retrieval is beyond the scope of this text (see Manning et al., 2008). The re-
9050 mainder of this section describes approaches to information fusion and learning from
9051 type-level annotations.
9056 (17.12) Elected mayor of Atlanta in 1973, Maynard Jackson was the first African Amer-
9057 ican to serve as mayor of a major southern city.
9058 (17.13) Atlanta’s airport will be renamed to honor Maynard Jackson, the city’s first
9059 Black mayor .
9060 (17.14) Born in Dallas, Texas in 1938, Maynard Holbrook Jackson, Jr. moved to Atlanta
9061 when he was 8.
9062 (17.15) Maynard Jackson has gone from one of the worst high schools in Atlanta to one
9063 of the best.
The first and second examples provide evidence for the relation MAYOR holding between the entities ATLANTA and MAYNARD JACKSON, JR. The third example provides evidence for a different relation between these same entities, LIVED-IN. The fourth example poses an entity linking problem, referring to MAYNARD JACKSON HIGH SCHOOL. Knowledge base population requires aggregating this sort of textual evidence, and predicting the relations that are most likely to hold.
5 First three examples from: http://www.georgiaencyclopedia.org/articles/government-politics/maynard-jackson-1938-2003; JET magazine, November 10, 2003; www.todayingeorgiahistory.org/content/maynard-jackson-elected
9070 One approach is to run a single-document relation extraction system (using the tech-
9071 niques described in § 17.2.2), and then aggregate the results (Li et al., 2011). Relations
9072 that are detected with high confidence in multiple documents are more likely to be valid,
9073 motivating the heuristic,
$$\psi(r, e_1, e_2) = \sum_{i=1}^{N} \left(p(r(e_1, e_2) \mid w^{(i)})\right)^{\alpha}, \qquad [17.21]$$
where $p(r(e_1, e_2) \mid w^{(i)})$ is the probability of relation r between entities $e_1$ and $e_2$ conditioned on the text $w^{(i)}$, and $\alpha \geq 1$ is a tunable hyperparameter. Using this heuristic, it is possible to rank all candidate relations, and trace out a precision-recall curve as more relations are extracted.6 Alternatively, features can be aggregated across multiple passages of text, feeding a single type-level relation extraction system (Wolfe et al., 2017).
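The aggregation heuristic in Equation 17.21 amounts to a sum of exponentiated probabilities. In this sketch the per-passage probabilities are invented for illustration, as if produced by a single-document relation extractor, and the names `aggregate_score` and `evidence` are assumptions.

```python
def aggregate_score(probabilities, alpha=1.0):
    """Sum the per-passage relation probabilities, each raised to the power alpha."""
    return sum(p ** alpha for p in probabilities)


# Hypothetical per-passage probabilities for two candidate relations
# between the same pair of entities.
evidence = {
    "MAYOR": [0.9, 0.8],
    "LIVED-IN": [0.6],
}
scores = {rel: aggregate_score(ps, alpha=2.0) for rel, ps in evidence.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Raising α above one shrinks low-confidence detections faster than high-confidence ones, so a relation supported by several confident passages is ranked above one supported by a single uncertain passage.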
9079 Precision can be improved by introducing constraints across multiple relations. For
9080 example, if we are certain of the relation PARENT(e1 , e2 ), then it cannot also be the case
9081 that PARENT(e2 , e1 ). Integer linear programming makes it possible to incorporate such
constraints into a global optimization (Li et al., 2011). Other pairs of relations have positive correlations, such as MAYOR(e1, e2) and LIVED-IN(e1, e2). Compatibility across relation types can be incorporated into probabilistic graphical models (e.g., Riedel et al., 2010).
Figure 17.2: Four training instances for relation classification using distant supervision (Mintz et al., 2009). The first two instances are positive for the MAYOR relation, and the third instance is positive for the BORN-IN relation. The fourth instance is a negative example, constructed from a pair of entities (NEW YORK, MAYNARD JACKSON) that do not appear in any Freebase relation. Each instance's features are computed by aggregating across all sentences in which the two entities are mentioned.
9101 In multiple instance learning, labels are assigned to sets of instances, of which only
9102 an unknown subset are actually relevant (Dietterich et al., 1997; Maron and Lozano-Pérez,
9103 1998). This formalizes the framework of distant supervision: the relation REL(A, B) acts
9104 as a label for the entire set of sentences mentioning entities A and B, even when only a
9105 subset of these sentences actually describes the relation. One approach to multi-instance
9106 learning is to introduce a binary latent variable for each sentence, indicating whether the
9107 sentence expresses the labeled relation (Riedel et al., 2010). A variety of inference tech-
9108 niques have been employed for this probabilistic model of relation extraction: Surdeanu
9109 et al. (2012) use expectation maximization, Riedel et al. (2010) use sampling, and Hoff-
9110 mann et al. (2011) use a custom graph-based algorithm. Expectation maximization and
9111 sampling are surveyed in chapter 5, and are covered in more detail by Murphy (2012);
9112 graph-based methods are surveyed by Mihalcea and Radev (2011).
Table 17.3: Various relation extraction tasks and their properties. VerbNet and FrameNet
are described in chapter 13. ACE (Linguistic Data Consortium, 2005), TAC (McNamee
and Dang, 2009), and SemEval (Hendrickx et al., 2009) refer to shared tasks, each of which
involves an ontology of relation types.
9121 and so on. Extracting such tuples can be viewed as a lightweight version of semantic role
9122 labeling (chapter 13), with only two argument types: first slot and second slot. The task is
9123 generally evaluated on the relation level, rather than on the level of sentences: precision is
9124 measured by the number of extracted relations that are accurate, and recall is measured by
9125 the number of true relations that were successfully extracted. OpenIE systems are trained
9126 from distant supervision or bootstrapping, rather than from labeled sentences.
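Relation-level precision and recall can be computed over sets of extracted tuples. The sketch below assumes exact matching of (first slot, relation, second slot) triples, which is a simplification: in practice, judging whether an open-vocabulary relation string is accurate may require manual assessment. The tuples and the gold set are hypothetical.

```python
def precision_recall(extracted, gold):
    """Relation-level precision and recall over sets of tuples."""
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical system output and gold-standard relations.
extracted = [
    ("Maynard Jackson", "mayor of", "Atlanta"),
    ("Maynard Jackson", "born in", "Dallas"),
    ("Atlanta", "capital of", "Texas"),
]
gold = [
    ("Maynard Jackson", "mayor of", "Atlanta"),
    ("Maynard Jackson", "born in", "Dallas"),
    ("Maynard Jackson", "moved to", "Atlanta"),
    ("Atlanta", "capital of", "Georgia"),
]
precision, recall = precision_recall(extracted, gold)
print(precision, recall)
```

Because the tuples are deduplicated into sets, a relation that is mentioned in many sentences counts only once, which matches evaluation at the relation level rather than the sentence level.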
9127 An early example is the TextRunner system (Banko et al., 2007), which identifies re-
9128 lations with a set of handcrafted syntactic rules. The examples that are acquired from the
9129 handcrafted rules are then used to train a classification model that uses part-of-speech pat-
9130 terns as features. Finally, the relations that are extracted by the classifier are aggregated,
9131 removing redundant relations and computing the number of times that each relation is
9132 mentioned in the corpus. TextRunner was the first in a series of systems that performed
9133 increasingly accurate open relation extraction by incorporating more precise linguistic fea-
9134 tures (Etzioni et al., 2011), distant supervision from Wikipedia infoboxes (Wu and Weld,
9135 2010), and better learning algorithms (Zhu et al., 2009).
9154 Event coreference Because multiple sentences may describe unique properties of a sin-
9155 gle event, event coreference is required to link event mentions across a single passage
9156 of text, or between passages (Humphreys et al., 1997). Bejan and Harabagiu (2014) de-
9157 fine event coreference as the task of identifying event mentions that share the same event
9158 participants (i.e., the slot-filling entities) and the same event properties (e.g., the time and
9159 location), within or across documents. Event coreference resolution can be performed us-
9160 ing supervised learning techniques in a similar way to entity coreference, as described
9161 in chapter 15: move left-to-right through the document, and use a classifier to decide
9162 whether to link each event reference to an existing cluster of coreferent events, or to cre-
9163 ate a new cluster (Ahn, 2006). Each clustering decision is based on the compatibility of
9164 features describing the participants and properties of the event. Due to the difficulty of
9165 annotating large amounts of data for entity coreference, unsupervised approaches are es-
9166 pecially desirable (Chen and Ji, 2009; Bejan and Harabagiu, 2014).
9167 Relations between events Just as entities are related to other entities, events may be
9168 related to other events: for example, the event of winning an election both precedes and
9169 causes the event of serving as mayor; moving to Atlanta precedes and enables the event of
9170 becoming mayor of Atlanta; moving from Dallas to Atlanta prevents the event of later be-
9171 coming mayor of Dallas. As these examples show, events may be related both temporally
and causally. The TimeML annotation scheme specifies a set of six temporal relations between events (Pustejovsky et al., 2005), derived in part from interval algebra (Allen, 1984). The TimeBank corpus provides TimeML annotations for 186 documents (Pustejovsky et al., 2003). Methods for detecting these temporal relations combine supervised machine learning with temporal constraints, such as transitivity (e.g. Mani et al., 2006; Chambers and Jurafsky, 2008).
Table 17.4: Table of factuality values from the FactBank corpus (Saurí and Pustejovsky, 2009). The entry (NA) indicates that this combination is not annotated.
9178 More recent annotation schemes and datasets combine temporal and causal relations (Mirza
9179 et al., 2014; Dunietz et al., 2017): for example, the CaTeRS dataset includes annotations of
9180 320 five-sentence short stories (Mostafazadeh et al., 2016). Abstracting still further, pro-
9181 cesses are networks of causal relations between multiple events. A small dataset of bi-
9182 ological processes is annotated in the ProcessBank dataset (Berant et al., 2014), with the
9183 goal of supporting automatic question answering on scientific textbooks.
9209 (17.23) These results suggest that expression of c-jun, jun B and jun D genes might be in-
9210 volved in terminal granulocyte differentiation. . . (Morante and Daelemans, 2009)
9211 (17.24) A whale is technically a mammal (Lakoff, 1973)
9212 In the first example, the hedges suggest and might communicate uncertainty; in the second
9213 example, there is no uncertainty, but the hedge technically indicates that the evidence for
9214 the proposition will not fully meet the reader’s expectations. Hedging has been studied
9215 extensively in scientific texts (Medlock and Briscoe, 2007; Morante and Daelemans, 2009),
9216 where the goal of large-scale extraction of scientific facts is obstructed by hedges and spec-
9217 ulation. Still another related aspect of modality is evidentiality, in which speakers mark
9218 the source of their information. In many languages, it is obligatory to mark evidentiality
9219 through affixes or particles (Aikhenvald, 2004); while evidentiality is not grammaticalized
9220 in English, authors are expected to express this information in contexts such as journal-
9221 ism (Kovach and Rosenstiel, 2014) and Wikipedia.8
9222 Methods for handling negation and modality generally include two phases:
9224 2. identifying the scope and focus of the negation or modal operator.
7
The classification of negation as extra-propositional is controversial: Packard et al. (2014) argue that
negation is a “core part of compositionally constructed logical-form representations.” Negation is an element
of the semantic parsing tasks discussed in chapter 12 and chapter 13 — for example, negation markers are
treated as adjuncts in PropBank semantic role labeling. However, many of the relation extraction methods
mentioned in this chapter do not handle negation directly. A further consideration is that negation inter-
acts closely with aspects of modality that are generally not considered in propositional semantics, such as
certainty and subjectivity.
8
https://en.wikipedia.org/wiki/Wikipedia:Verifiability
9225 A considerable body of work on negation has employed rule-based techniques such as
9226 regular expressions (Chapman et al., 2001) to detect negated events. Such techniques
9227 match lexical cues (e.g., Norwood was not elected Mayor), while avoiding “double nega-
9228 tives” (e.g., surely all this is not without meaning). More recent approaches employ classi-
9229 fiers over lexical and syntactic features (Uzuner et al., 2009) and sequence labeling (Prab-
9230 hakaran et al., 2010).
9231 The tasks of scope and focus resolution are more fine grained, as shown in the example
9232 from Morante and Sporleder (2012):
9233 (17.25) [ After his habit he said ] nothing, and after mine I asked no questions.
9234 After his habit he said nothing, and [ after mine I asked ] no [ questions ].
9235 In this sentence, there are two negation cues (nothing and no). Each negates an event,
9236 indicated by the underlined verbs said and asked (this is the focus of negation), and each
9237 occurs within a scope: after his habit he said and after mine I asked questions. These tasks
9238 are typically formalized as sequence labeling problems, with each word token labeled
9239 as beginning, inside, or outside of a cue, focus, or scope span (see § 8.3). Conventional
9240 sequence labeling approaches can then be applied, using surface features as well as syn-
9241 tax (Velldal et al., 2012) and semantic analysis (Packard et al., 2014). Labeled datasets
9242 include the BioScope corpus of biomedical texts (Vincze et al., 2008) and a shared task
9243 dataset of detective stories by Arthur Conan Doyle (Morante and Blanco, 2012).
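The sequence labeling formulation can be illustrated by converting annotated spans into beginning/inside/outside tags. This sketch produces one tag sequence per span type, since a focus can lie inside a scope; that layering, the function name, and the token offsets are assumptions for the example, based on the first negation in the sentence above.

```python
def bio_tags(n_tokens, spans):
    """Convert (start, end, label) spans, with end exclusive, into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for k in range(start + 1, end):
            tags[k] = "I-" + label
    return tags


tokens = "After his habit he said nothing".split()

# One tag layer per span type, because the focus lies inside the scope.
cue_tags = bio_tags(len(tokens), [(5, 6, "CUE")])      # nothing
focus_tags = bio_tags(len(tokens), [(4, 5, "FOCUS")])  # said
scope_tags = bio_tags(len(tokens), [(0, 5, "SCOPE")])  # After his habit he said

for token, cue, focus, scope in zip(tokens, cue_tags, focus_tags, scope_tags):
    print(token, cue, focus, scope)
```

Each layer can then be predicted by a conventional sequence labeling model, with the other layers optionally supplied as features.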
9256 returns a boolean value: for example, the question who is the mayor of the capital of Georgia?
9257 would be converted to,
9258 This lambda expression can then be used to query an existing knowledge base, returning
9259 “true” for all entities that satisfy it.
9272 (17.26) James the turtle was always getting into trouble. Sometimes he’d reach into
9273 the freezer and empty out all the food . . .
9274 Q: What is the name of the trouble making turtle?
9275 (a) Fries
9276 (b) Pudding
9277 (c) James
9278 (d) Jane
9279 • Cloze-style “fill in the blank” questions, as in the CNN/Daily Mail comprehension
9280 task (Hermann et al., 2015), the Children’s Book Test (Hill et al., 2016), and the Who-
9281 did-What dataset (Onishi et al., 2016). In these tasks, the system must guess which
9282 word or entity completes a sentence, based on reading a passage of text. Here is an
9283 example from Who-did-What:
(17.27) Q: Tottenham manager Juande Ramos has hinted he will allow ___ to leave if the Bulgaria striker makes it clear he is unhappy. (Onishi et al., 2016)
9286 The query sentence may be selected either from the story itself, or from an external
9287 summary. In either case, datasets can be created automatically by processing large
9288 quantities existing documents. An additional constraint is that that missing element
9289 from the cloze must appear in the main passage of text: for example, in Who-did-
9290 What, the candidates include all entities mentioned in the main passage. In the
9291 CNN/Daily Mail dataset, each entity name is replaced by a unique identifier, e.g.,
9292 ENTITY 37. This ensures that correct answers can only be obtained by accurately
9293 reading the text, and not from external knowledge about the entities.
9294 • Extractive question answering, in which the answer is drawn from the original text.
9295 In WikiQA, answers are sentences (Yang et al., 2015). In the Stanford Question An-
9296 swering Dataset (SQuAD), answers are words or short phrases (Rajpurkar et al.,
9297 2016):
9301 In both WikiQA and SQuAD, the original texts are Wikipedia articles, and the ques-
9302 tions are generated by crowdworkers.
The query is represented by vertically concatenating the final states of the left-to-right
and right-to-left passes:
\[ u = [\overrightarrow{h}^{(q)}_{M_q} ; \overleftarrow{h}^{(q)}_{0}]. \] [17.25]
The attention vector is computed as a softmax over a vector of bilinear products, and
the expected representation is computed by summing over attention values,
\[ \hat{c} = \operatorname{argmax}_{c} \; o \cdot x_c. \] [17.29]
9308 This architecture can be trained end-to-end from a loss based on the log-likelihood of the
9309 correct answer. A number of related architectures have been proposed (e.g., Hermann
9310 et al., 2015; Kadlec et al., 2016; Dhingra et al., 2017; Cui et al., 2017), and the relationships
9311 between these methods are surveyed by Wang et al. (2017).
9319 Exercises
9326 Design a set of ranking weights θ that match the heuristic. You may assume that
9327 edit distance and popularity are always in the range [0, 100], and that the NIL entity
9328 has values of zero for all features except δ (y = NIL).
9329 2. Now consider another heuristic:
9330 • Among all candidate entities that have edit distance zero from the mention and
9331 the right type, choose the most popular one.
9332 • If no entity has edit distance zero from the mention, choose the one with the
9333 right type that is most popular, regardless of edit distance.
9334 • If no entity has the right type, choose NIL.
9335 Using the same features and assumptions from the previous problem, prove that
9336 there is no set of weights that could implement this heuristic. Then show that the
9337 heuristic can be implemented by adding a single feature. Your new feature should
9338 consider only the edit distance.
9339 3. * Consider the following formulation for collective entity linking, which rewards
9340 sets of entities that are all of the same type, where “types” can be elements of any
9341 set:
\[ \psi_c(y) = \begin{cases} \alpha, & \text{all entities in } y \text{ have the same type} \\ \beta, & \text{more than half of the entities in } y \text{ have the same type} \\ 0, & \text{otherwise.} \end{cases} \] [17.30]
9342 Show how to implement this model of collective entity linking in an integer linear
9343 program. You may want to review § 13.2.2.
To get started, here is an integer linear program for entity linking, without including
the collective term ψc :
\[ \max_{z_{i,y} \in \{0,1\}} \sum_{i=1}^{N} \sum_{y \in \mathcal{Y}(x^{(i)})} s_{i,y} z_{i,y} \]
\[ \text{s.t.} \quad \sum_{y \in \mathcal{Y}(x^{(i)})} z_{i,y} \le 1 \quad \forall i \in \{1, 2, \ldots, N\} \]
where z_{i,y} = 1 if entity y is linked to mention i, and s_{i,y} is a parameter that scores
the quality of this individual ranking decision, e.g., s_{i,y} = θ · f(y, x^{(i)}, c^{(i)}).
9346 To incorporate the collective linking score, you may assume parameters r,
\[ r_{y,\tau} = \begin{cases} 1, & \text{entity } y \text{ has type } \tau \\ 0, & \text{otherwise.} \end{cases} \] [17.31]
9347 Hint: You will need to define several auxiliary variables to optimize over.
9351 a) Apply the pattern , such as to this corpus, obtaining candidates for the
IS-A relation, e.g. IS-A(ROMANIA, COUNTRY). What are three pairs that this
method identifies correctly? What are three different pairs that it gets wrong?
b) Design a pattern for the PRESIDENT relation, e.g. PRESIDENT(PHILIPPINES, CORAZON AQUINO).
9355 In this case, you may want to augment your pattern matcher with the ability
9356 to match multiple token wildcards, perhaps using case information to detect
9357 proper names. Again, list three correct
9358 c) Preprocess the Reuters data by running a named entity recognizer, replacing
9359 tokens with named entity spans when applicable. Apply your PRESIDENT
9360 matcher to this new data. Does the accuracy improve? Compare 20 randomly-
9361 selected pairs from this pattern and the one you designed in the previous part.
9362 5. Represent the dependency path x(i) as a sequence of words and dependency arcs
9363 of length Mi , ignoring the endpoints of the path. In example 1 of Table 17.2, the
9364 dependency path is,
\[ x^{(1)} = (\underset{\text{NSUBJ}}{\leftarrow}, \textit{traveled}, \underset{\text{OBL}}{\rightarrow}) \] [17.32]
If x_m^{(i)} is a word, then let pos(x_m^{(i)}) be its part-of-speech, using the tagset defined in
chapter 8.
We can define the following kernel function over pairs of dependency paths (Bunescu
and Mooney, 2005):
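\[ \kappa(x^{(i)}, x^{(j)}) = \begin{cases} 0, & M_i \neq M_j \\ \prod_{m=1}^{M_i} c(x_m^{(i)}, x_m^{(j)}), & M_i = M_j \end{cases} \]
\[ c(x_m^{(i)}, x_m^{(j)}) = \begin{cases} 2, & x_m^{(i)} = x_m^{(j)} \\ 1, & x_m^{(i)} \text{ and } x_m^{(j)} \text{ are words and } \text{pos}(x_m^{(i)}) = \text{pos}(x_m^{(j)}) \\ 0, & \text{otherwise.} \end{cases} \]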
9367 Using this kernel function, compute the kernel similarities of example 1 from Ta-
9368 ble 17.2 with the other five examples.
6. Continuing from the previous problem, suppose that the instances have the follow-
ing labels:
y2 = 1, y3 = −1, y4 = −1, y5 = 1, y6 = 1 [17.33]
9369 Identify the conditions for α and b under which ŷ1 = 1. Remember the constraint
9370 that αi ≥ 0 for all i.
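The dependency path kernel defined in exercise 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the string encoding of arcs (a leading "<-" or "->"), the helper functions is_word and pos, and the toy part-of-speech dictionary are all hypothetical, and the example paths are not the ones in Table 17.2.

```python
from typing import Callable, Sequence


def path_kernel(x_i: Sequence[str], x_j: Sequence[str],
                is_word: Callable[[str], bool],
                pos: Callable[[str], str]) -> int:
    """Product of per-position match scores c; zero if the lengths differ."""
    if len(x_i) != len(x_j):
        return 0
    score = 1
    for a, b in zip(x_i, x_j):
        if a == b:
            score *= 2      # identical word or identical arc
        elif is_word(a) and is_word(b) and pos(a) == pos(b):
            score *= 1      # different words with the same part-of-speech
        else:
            return 0        # a single mismatch zeroes the whole product
    return score


# Hypothetical toy data: arcs are strings that start with "<-" or "->".
TOY_POS = {"traveled": "VERB", "walked": "VERB", "city": "NOUN"}


def is_word(token: str) -> bool:
    return not token.startswith(("<-", "->"))


def pos(token: str) -> str:
    return TOY_POS[token]


p1 = ["<-NSUBJ", "traveled", "->OBL"]
p2 = ["<-NSUBJ", "walked", "->OBL"]
p3 = ["<-NSUBJ", "city", "->OBL"]
```

With these toy paths, comparing p1 with itself multiplies three exact matches, comparing p1 with p2 replaces the middle factor by the part-of-speech match, and comparing p1 with p3 yields zero because the middle words differ in part-of-speech.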
9373 Machine translation (MT) is one of the “holy grail” problems in artificial intelligence,
9374 with the potential to transform society by facilitating communication between people
9375 anywhere in the world. As a result, MT has received significant attention and funding
9376 since the early 1950s. However, it has proved remarkably challenging, and while there
9377 has been substantial progress towards usable MT systems — especially for high-resource
9378 language pairs like English-French — we are still far from translation systems that match
9379 the nuance and depth of human translations.
9382 where w(s) is a sentence in a source language, w(t) is a sentence in the target language,
9383 and Ψ is a scoring function. As usual, this formalism requires two components: a decod-
9384 ing algorithm for computing ŵ(t) , and a learning algorithm for estimating the parameters
9385 of the scoring function Ψ.
9386 Decoding is difficult for machine translation because of the huge space of possible
9387 translations. We have faced large label spaces before: for example, in sequence labeling,
9388 the set of possible label sequences is exponential in the length of the input. In these cases,
9389 it was possible to search the space quickly by introducing locality assumptions: for ex-
9390 ample, that each tag depends only on its predecessor, or that each production depends
9391 only on its parent. In machine translation, no such locality assumptions seem possible:
9392 human translators reword, reorder, and rearrange words; they replace single words with
9393 multi-word phrases, and vice versa. This flexibility means that in even relatively simple
436 CHAPTER 18. MACHINE TRANSLATION
9394 translation models, decoding is NP-hard (Knight, 1999). Approaches for dealing with this
9395 complexity are described in § 18.4.
Estimating translation models is difficult as well. Labeled translation data usually
comes in the form of parallel sentences, e.g.,
9396 A useful feature function would note the translation pairs (gusta, likes), (manzanas, apples),
9397 and even (Vinay, Vinay). But this word-to-word alignment is not given in the data. One
9398 solution is to treat this alignment as a latent variable; this is the approach taken by clas-
9399 sical statistical machine translation (SMT) systems, described in § 18.2. Another solution
9400 is to model the relationship between w(t) and w(s) through a more complex and expres-
9401 sive function; this is the approach taken by neural machine translation (NMT) systems,
9402 described in § 18.3.
9403 The Vauquois Pyramid is a theory of how translation should be done. At the lowest
9404 level, the translation system operates on individual words, but the horizontal distance
9405 at this level is large, because languages express ideas differently. If we can move up the
9406 triangle to syntactic structure, the distance for translation is reduced; we then need only
9407 produce target-language text from the syntactic representation, which can be as simple
9408 as reading off a tree. Further up the triangle lies semantics; translating between semantic
9409 representations should be easier still, but mapping between semantics and surface text is
9410 a difficult, unsolved problem. At the top of the triangle is interlingua, a semantic repre-
9411 sentation that is so generic that it is identical across all human languages. Philosophers
                            Adequate?   Fluent?
To Vinay it like Python     yes         no
Vinay debugs memory leaks   no          yes
Vinay likes Python          yes         yes
Table 18.1: Adequacy and fluency for translations of the Spanish sentence A Vinay le gusta
Python.
9412 debate whether such a thing as interlingua is really possible (Derrida, 1985). While the
first-order logic representations discussed in chapter 12 might be considered to be
language independent, they are built on an inventory of relations that is suspiciously similar to
9415 a subset of English words (Nirenburg and Wilks, 2001). Nonetheless, the idea of linking
9416 translation and semantic understanding may still be a promising path, if the resulting
9417 translations better preserve the meaning of the original text.
9420 • Adequacy: The translation w(t) should adequately reflect the linguistic content of
w(s) . For example, if w(s) = A Vinay le gusta Python, the gloss¹ w(t) = To Vinay it like Python
is considered adequate because it contains all the relevant content. The output
9423 w(t) = Vinay debugs memory leaks is not adequate.
9424 • Fluency: The translation w(t) should read like fluent text in the target language. By
9425 this criterion, the gloss w(t) = To Vinay it like Python will score poorly, and w(t) =
9426 Vinay debugs memory leaks will be preferred.
9427 Automated evaluations of machine translations typically merge both of these criteria,
9428 by comparing the system translation with one or more reference translations, produced
9429 by professional human translators. The most popular quantitative metric is BLEU (bilin-
9430 gual evaluation understudy; Papineni et al., 2002), which is based on n-gram precision:
9431 what fraction of n-grams in the system translation appear in the reference? Specifically,
9432 for each n-gram length, the precision is defined as,
            Translation                               p1    p2    p3    p4    BP    BLEU
Reference   Vinay likes programming in Python
Sys1        To Vinay it like to program Python        2/7   0     0     0     1     .21
Sys2        Vinay likes Python                        3/3   1/2   0     0     .51   .33
Sys3        Vinay likes programming in his pajamas    4/6   3/5   2/4   1/3   1     .76
Figure 18.2: A reference translation and three system outputs. For each output, pn indicates
the precision at each n-gram, and BP indicates the brevity penalty.
The BLEU score is then based on the average, \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right). Two modifications
of Equation 18.2 are necessary: (1) to avoid computing log 0, all precisions are smoothed
to ensure that they are positive; (2) each n-gram in the reference can be used at most once,
so that to to to to to to does not achieve p1 = 1 against the reference to be or not to be.
9438 Furthermore, precision-based metrics are biased in favor of short translations, which can
9439 achieve high scores by minimizing the denominator in [18.2]. To avoid this issue, a brevity
9440 penalty is applied to translations that are shorter than the reference. This penalty is indi-
9441 cated as “BP” in Figure 18.2.
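The n-gram precisions and brevity penalty in Figure 18.2 can be reproduced with a short Python sketch. The clipping of n-gram counts follows modification (2) above. Two details are assumptions rather than the book's definitions: smoothing is done by flooring each precision at a small constant eps, and the brevity penalty uses the form exp(1 − r/c) for a hypothesis of length c shorter than a reference of length r. Because the smoothing scheme is assumed, the final BLEU values need not match the figure.

```python
import math
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams found in the reference, with each
    reference n-gram usable at most as often as it occurs there."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return Fraction(0)
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return Fraction(matched, total)


def brevity_penalty(hyp, ref):
    """Penalize hypotheses that are shorter than the reference."""
    if not hyp:
        return 0.0
    if len(hyp) >= len(ref):
        return 1.0
    return math.exp(1 - len(ref) / len(hyp))


def bleu(hyp, ref, max_n=4, eps=1e-2):
    """Brevity penalty times the geometric mean of floored precisions.
    The floor eps is an assumed smoothing scheme, not the book's."""
    logs = [math.log(max(float(modified_precision(hyp, ref, n)), eps))
            for n in range(1, max_n + 1)]
    return brevity_penalty(hyp, ref) * math.exp(sum(logs) / max_n)


ref = "Vinay likes programming in Python".split()
sys2 = "Vinay likes Python".split()
sys3 = "Vinay likes programming in his pajamas".split()
precisions_sys3 = [modified_precision(sys3, ref, n) for n in range(1, 5)]
```

For Sys3 the four precisions come out as 4/6, 3/5, 2/4, and 1/3 (stored in lowest terms), and the brevity penalty for Sys2 rounds to .51, in agreement with Figure 18.2.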
9442 Automated metrics like BLEU have been validated by correlation with human judg-
9443 ments of translation quality. Nonetheless, it is not difficult to construct examples in which
9444 the BLEU score is high, yet the translation is disfluent or carries a completely different
9445 meaning from the original. To give just one example, consider the problem of translating
9446 pronouns. Because pronouns refer to specific entities, a single incorrect pronoun can oblit-
9447 erate the semantics of the original sentence. Existing state-of-the-art systems generally
9448 do not attempt the reasoning necessary to correctly resolve pronominal anaphora (Hard-
9449 meier, 2012). Despite the importance of pronouns for semantics, they have a marginal
9450 impact on BLEU, which may help to explain why existing systems do not make a greater
9451 effort to translate them correctly.
9452 Fairness and bias The problem of pronoun translation intersects with issues of fairness
9453 and bias. In many languages, such as Turkish, the third person singular pronoun is gender
9454 neutral. Today’s state-of-the-art systems produce the following Turkish-English transla-
9455 tions (Caliskan et al., 2017):
9458 The same problem arises for other professions that have stereotypical genders, such as
9459 engineers, soldiers, and teachers, and for other languages that have gender-neutral pro-
9460 nouns. This bias was not directly programmed into the translation model; it arises from
9461 statistical tendencies in existing datasets. This highlights a general problem with data-
9462 driven approaches, which can perpetuate biases that negatively impact disadvantaged
9463 groups. Worse, machine learning can amplify biases in data (Bolukbasi et al., 2016): if a
9464 dataset has even a slight tendency towards men as doctors, the resulting translation model
9465 may produce translations in which doctors are always he, and nurses are always she.
9466 Other metrics A range of other automated metrics have been proposed for machine
9467 translation. One potential weakness of BLEU is that it only measures precision; METEOR
is a weighted F-MEASURE, which is a combination of recall and precision (see § 4.4.1).
9469 Translation Error Rate (TER) computes the string edit distance (see § 9.1.4.1) between the
9470 reference and the hypothesis (Snover et al., 2006). For language pairs like English and
9471 Japanese, there are substantial differences in word order, and word order errors are not
9472 sufficiently captured by n-gram based metrics. The RIBES metric applies rank correla-
9473 tion to measure the similarity in word order between the system and reference transla-
9474 tions (Isozaki et al., 2010).
9492 and Isahara, 2007): for example, De Gispert and Marino (2006) use Spanish as a bridge for
9493 translation between Catalan and English. For most of the 6000 languages spoken today,
9494 the only source of translation data remains the Judeo-Christian Bible (Resnik et al., 1999).
9495 While relatively small, at less than a million tokens, the Bible has been translated into
9496 more than 2000 languages, far outpacing any other corpus. Some research has explored
9497 the possibility of automatically identifying parallel sentence pairs from unaligned parallel
9498 texts, such as web pages and Wikipedia articles (Kilgarriff and Grefenstette, 2003; Resnik
9499 and Smith, 2003; Adafre and De Rijke, 2006). Another approach is to create large parallel
9500 corpora through crowdsourcing (Zaidan and Callison-Burch, 2011).
9504 The fluency score ΨF need not even consider the source sentence; it only judges w(t) on
9505 whether it is fluent in the target language. This decomposition is advantageous because
9506 it makes it possible to estimate the two scoring functions on separate data. While the
9507 adequacy model must be estimated from aligned sentences — which are relatively expen-
9508 sive and rare — the fluency model can be estimated from monolingual text in the target
9509 language. Large monolingual corpora are now available in many languages, thanks to
9510 resources such as Wikipedia.
An elegant justification of the decomposition in Equation 18.3 is provided by the noisy
channel model, in which each scoring function is a log probability:
9511 By setting the scoring functions equal to the logarithms of the prior and likelihood, their
9512 sum is equal to log pS,T , which is the logarithm of the joint probability of the source and
9513 target. The sentence ŵ(t) that maximizes this joint probability is also the maximizer of the
9514 conditional probability pT |S , making it the most likely target language sentence, condi-
9515 tioned on the source.
9516 The noisy channel model can be justified by a generative story. The target text is orig-
9517 inally generated from a probability model pT . It is then encoded in a “noisy channel”
9518 pS|T , which converts it to a string in the source language. In decoding, we apply Bayes’
9519 rule to recover the string w(t) that is maximally likely under the conditional probability
[Figure 18.3: word alignment grid between the source A Vinay le gusta python and the target Vinay likes python.]
9520 pT |S . Under this interpretation, the target probability pT is just a language model, and
9521 can be estimated using any of the techniques from chapter 6. The only remaining learning
9522 problem is to estimate the translation model pS|T .
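The noisy channel decision rule can be illustrated with a small Python sketch that picks, from a fixed candidate list, the target sentence maximizing log pT + log pS|T. The candidate list, the two lookup tables, and all of their log-probability values are made up for illustration; a real system would use a trained language model and translation model, and would search a far larger space.

```python
def noisy_channel_decode(source, candidates, log_p_target, log_p_source_given_target):
    """Return the candidate target maximizing log p_T(t) + log p_{S|T}(s | t)."""
    def joint_log_prob(target):
        return log_p_target(target) + log_p_source_given_target(source, target)
    return max(candidates, key=joint_log_prob)


# Hypothetical scores: the language model prefers fluent English,
# the translation model prefers targets that explain the source words.
TOY_LM = {
    "To Vinay it like Python": -12.0,
    "Vinay debugs memory leaks": -6.0,
    "Vinay likes Python": -5.0,
}
TOY_TM = {
    "To Vinay it like Python": -1.0,
    "Vinay debugs memory leaks": -20.0,
    "Vinay likes Python": -2.0,
}

source = "A Vinay le gusta Python"
best = noisy_channel_decode(
    source,
    list(TOY_LM),
    log_p_target=lambda t: TOY_LM[t],
    log_p_source_given_target=lambda s, t: TOY_TM[t],
)
```

Under these toy numbers, the disfluent gloss loses on the language model term, the unrelated sentence loses on the translation model term, and the sum selects Vinay likes Python.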
A(w(s) , w(t) ) = {(A, ∅), (Vinay, Vinay), (le, likes), (gusta, likes), (Python,Python)}. [18.7]
9529 This alignment is shown in Figure 18.3. Another, less promising, alignment is:
A(w(s) , w(t) ) = {(A, Vinay), (Vinay, likes), (le, Python), (gusta, ∅), (Python, ∅)}. [18.8]
9530 Each alignment contains exactly one tuple for each word in the source, which serves to
9531 explain how the source word could be translated from the target, as required by the trans-
9532 lation probability pS|T . If no appropriate word in the target can be identified for a source
9533 word, it is aligned to ∅ — as is the case for the Spanish function word a in the example,
9534 which glosses to the English word to. Words in the target can align with multiple words
9535 in the source, so that the target word likes can align to both le and gusta in the source.
The joint probability of the alignment and the translation can be defined conveniently
as,
\[ p(w^{(s)}, A \mid w^{(t)}) = \prod_{m=1}^{M^{(s)}} p(w_m^{(s)}, a_m \mid w_{a_m}^{(t)}, m, M^{(s)}, M^{(t)}) \] [18.9]
\[ = \prod_{m=1}^{M^{(s)}} p(a_m \mid m, M^{(s)}, M^{(t)}) \times p(w_m^{(s)} \mid w_{a_m}^{(t)}). \] [18.10]
9538 This means that each alignment decision is independent of the others, and depends
9539 only on the index m, and the sentence lengths M (s) and M (t) .
9540 • The translation probability also factors across tokens,
\[ p(w^{(s)} \mid w^{(t)}, A) = \prod_{m=1}^{M^{(s)}} p(w_m^{(s)} \mid w_{a_m}^{(t)}), \] [18.12]
9541 so that each word in w(s) depends only on its aligned word in w(t) . This means that
9542 translation is word-to-word, ignoring context. The hope is that the target language
9543 model p(w(t) ) will correct any disfluencies that arise from word-to-word translation.
To translate with such a model, we could sum or max over all possible alignments,
\[ p(w^{(s)}, w^{(t)}) = \sum_{A} p(w^{(s)}, w^{(t)}, A) \] [18.13]
\[ = p(w^{(t)}) \sum_{A} p(A) \times p(w^{(s)} \mid w^{(t)}, A) \] [18.14]
\[ \ge p(w^{(t)}) \max_{A} p(A) \times p(w^{(s)} \mid w^{(t)}, A). \] [18.15]
The term p(A) defines the prior probability over alignments. A series of alignment
models with increasingly relaxed independence assumptions was developed by researchers
at IBM in the 1980s and 1990s, known as IBM Models 1-6 (Och and Ney, 2003). IBM
Model 1 makes the strongest independence assumption:
\[ p(a_m \mid m, M^{(s)}, M^{(t)}) = \frac{1}{M^{(t)}}. \] [18.16]
9544 In this model, every alignment is equally likely. This is almost surely wrong, but it re-
9545 sults in a convex learning objective, yielding a good initialization for the more complex
9546 alignment models (Brown et al., 1993; Koehn, 2009).
9560 1. E-step: Update beliefs about word alignment using Equation 18.18.
9561 2. M-step: Update the translation model using Equations 18.19 and 18.20.
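The two steps above can be sketched for IBM Model 1 in Python. This is a minimal illustration, not the book's exact update equations: it assumes the standard Model 1 updates, in which the uniform alignment prior cancels in the E-step, and it adds a hypothetical NULL target token to play the role of ∅. The two-sentence corpus is a toy example.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical stand-in for the empty alignment target


def ibm_model1(pairs, iterations=10):
    """Estimate t[(s, e)] = p(source word s | target word e) by EM.

    pairs is a list of (source_tokens, target_tokens) sentence pairs.
    """
    src_vocab = {w for src, _ in pairs for w in src}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)           # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)             # expected (s, e) alignment counts
        total = defaultdict(float)             # expected counts per target word
        for src, tgt in pairs:
            tgt_with_null = [NULL] + list(tgt)
            for s in src:
                # E-step: posterior over which target word generated s.
                z = sum(t[(s, e)] for e in tgt_with_null)
                for e in tgt_with_null:
                    q = t[(s, e)] / z
                    count[(s, e)] += q
                    total[e] += q
        # M-step: renormalize expected counts into translation probabilities.
        t = defaultdict(float, {(s, e): c / total[e] for (s, e), c in count.items()})
    return t


toy_pairs = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
]
t = ibm_model1(toy_pairs)
```

Because la co-occurs with the in both sentence pairs, EM shifts probability mass so that the explains la, which in turn lets house explain maison even though no alignments were observed.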
[Figure 18.4: phrasal alignment between the French Nous allons prendre un verre and the English We'll have a drink.]
9572 The line we will take a glass is the word-for-word gloss of the French sentence; the transla-
9573 tion we’ll have a drink is shown on the third line. Such examples are difficult for word-to-
9574 word translation models, since they require translating prendre to have and verre to drink.
9575 These translations are only correct in the context of these specific phrases.
Phrase-based translation generalizes on word-based models by building translation
tables and alignments between multiword spans. (These “phrases” are not necessarily
syntactic constituents like the noun phrases and verb phrases described in chapters 9 and
10.) The generalization from word-based translation is surprisingly straightforward: the
translation tables can now condition on multi-word units, and can assign probabilities to
multi-word units; alignments are mappings from spans to spans, ((i, j), (k, ℓ)), so that
\[ p(w^{(s)} \mid w^{(t)}, A) = \prod_{((i,j),(k,\ell)) \in A} p_{w^{(s)} \mid w^{(t)}}(\{w_{i+1}^{(s)}, w_{i+2}^{(s)}, \ldots, w_j^{(s)}\} \mid \{w_{k+1}^{(t)}, w_{k+2}^{(t)}, \ldots, w_\ell^{(t)}\}). \] [18.21]
The phrase alignment ((i, j), (k, ℓ)) indicates that the span w_{i+1:j}^{(s)} is the translation of the
span w_{k+1:ℓ}^{(t)}. An example phrasal alignment is shown in Figure 18.4. Note that the alignment
set A is required to cover all of the tokens in the source, just as in word-based translation.
The probability model p_{w(s)|w(t)} must now include translations for all phrase pairs,
which can be learned from expectation-maximization just as in word-based statistical machine
translation.
9600 with subscripts indicating the alignment between the Spanish (left) and English (right)
9601 parts of the right-hand side. Terminal productions yield translation pairs,
9602 A synchronous derivation begins with the start symbol S, and derives a pair of sequences
9603 of terminal symbols.
9604 Given an SCFG in which each production yields at most two symbols in each language
9605 (Chomsky Normal Form; see § 9.2.1.2), a sentence can be parsed using only the CKY
9606 algorithm (chapter 10). The resulting derivation also includes productions in the other
9607 language, all the way down to the surface form. Therefore, SCFGs make translation very
similar to parsing. In a weighted SCFG, the log probability log pS|T can be computed from
the sum of the log-probabilities of the productions. However, combining SCFGs with a
target language model is computationally expensive, necessitating approximate search
algorithms (Huang and Chiang, 2007).
4 Key earlier work includes syntax-driven transduction (Lewis II and Stearns, 1968) and stochastic
inversion transduction grammars (Wu, 1997).
9612 Synchronous context-free grammars are an example of tree-to-tree translation, be-
9613 cause they model the syntactic structure of both the target and source language. In string-
9614 to-tree translation, string elements are translated into constituent tree fragments, which
9615 are then assembled into a translation (Yamada and Knight, 2001; Galley et al., 2004); in
9616 tree-to-string translation, the source side is parsed, and then transformed into a string on
9617 the target side (Liu et al., 2006). A key question for syntax-based translation is the extent
to which phrasal constituents align across translations (Fox, 2002), because this governs
the extent to which we can rely on monolingual parsers and treebanks. For more on
9620 syntax-based machine translation, see the monograph by Williams et al. (2016).
\[ z = \text{ENCODE}(w^{(s)}) \] [18.24]
\[ w^{(t)} \mid w^{(s)} \sim \text{DECODE}(z), \] [18.25]
where the second line means that the function DECODE(z) defines the conditional probability
p(w(t) | w(s) ).
The decoder is typically a recurrent neural network, which generates the target lan-
guage sentence one word at a time, while recurrently updating a hidden state. The en-
coder and decoder networks are trained end-to-end from parallel sentences. If the output
layer of the decoder is a logistic function, then the entire architecture can be trained to
maximize the conditional log-likelihood,
\[ \log p(w^{(t)} \mid w^{(s)}) = \sum_{m=1}^{M^{(t)}} \log p(w_m^{(t)} \mid w_{1:m-1}^{(t)}, z) \] [18.26]
\[ p(w_m^{(t)} \mid w_{1:m-1}^{(t)}, w^{(s)}) \propto \exp\left(\beta_{w_m^{(t)}} \cdot h_{m-1}^{(t)}\right) \] [18.27]
where the hidden state h_{m-1}^{(t)} is a recurrent function of the previously generated text
w_{1:m-1}^{(t)} and the encoding z. The second line is equivalent to writing,
\[ w_m^{(t)} \mid w_{1:m-1}^{(t)}, w^{(s)} \sim \text{SoftMax}\left(\beta \cdot h_{m-1}^{(t)}\right), \] [18.28]
where β ∈ R^{(V^{(t)} × K)} is the matrix of output word vectors for the V^{(t)} words in the target
language vocabulary.
The simplest encoder-decoder architecture is the sequence-to-sequence model (Sutskever
et al., 2014). In this model, the encoding is set to the final hidden state of a long short-term
memory (LSTM) (see § 6.3.3) on the source sentence:
\[ h_m^{(s)} = \text{LSTM}(x_m^{(s)}, h_{m-1}^{(s)}) \] [18.29]
\[ z \triangleq h_{M^{(s)}}^{(s)}, \] [18.30]
where x_m^{(s)} is the embedding of source language word w_m^{(s)}. The encoding then provides
the initial hidden state for the decoder LSTM:
\[ h_0^{(t)} = z \] [18.31]
\[ h_m^{(t)} = \text{LSTM}(x_m^{(t)}, h_{m-1}^{(t)}), \] [18.32]
where x_m^{(t)} is the embedding of the target language word w_m^{(t)}.
9627 Sequence-to-sequence translation is nothing more than wiring together two LSTMs:
9628 one to read the source, and another to generate the target. To make the model work well,
9629 some additional tweaks are needed:
9630 • Most notably, the model works much better if the source sentence is reversed, read-
9631 ing from the end of the sentence back to the beginning. In this way, the words at the
9632 beginning of the source have the greatest impact on the encoding z, and therefore
9633 impact the words at the beginning of the target sentence. Later work on more ad-
9634 vanced encoding models, such as neural attention (see § 18.3.1), has eliminated the
9635 need for reversing the source sentence.
• The encoder and decoder can be implemented as deep LSTMs, with multiple layers
of hidden states. As shown in Figure 18.5, each hidden state h_m^{(s,i)} at layer i is treated
as the input to an LSTM at layer i + 1:
\[ h_m^{(s,1)} = \text{LSTM}(x_m^{(s)}, h_{m-1}^{(s,1)}) \] [18.33]
\[ h_m^{(s,i+1)} = \text{LSTM}(h_m^{(s,i)}, h_{m-1}^{(s,i+1)}), \quad \forall i \ge 1. \] [18.34]
9636 The original work on sequence-to-sequence translation used four layers; in 2016,
9637 Google’s commercial machine translation system used eight layers (Wu et al., 2016).5
9638 • Significant improvements can be obtained by creating an ensemble of translation
9639 models, each trained from a different random initialization. For an ensemble of size
9640 N , the per-token decoding probability is set equal to,
\[ p(w^{(t)} \mid z, w_{1:m-1}^{(t)}) = \frac{1}{N} \sum_{i=1}^{N} p_i(w^{(t)} \mid z, w_{1:m-1}^{(t)}), \] [18.35]
9641 where pi is the decoding probability for model i. Each translation model in the
9642 ensemble includes its own encoder and decoder networks.
9643 • The original sequence-to-sequence model used a fairly standard training setup: stochas-
9644 tic gradient descent with an exponentially decreasing learning rate after the first five
9645 epochs; mini-batches of 128 sentences, chosen to have similar length so that each
sentence in the batch will take roughly the same amount of time to process; gradient
clipping (see § 3.3.4) to ensure that the norm of the gradient never exceeds some
9648 predefined value.
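Two of the items in the list above are simple enough to sketch directly: the ensemble average of Equation 18.35 and gradient clipping by norm. The sketch below is a minimal pure-Python illustration; the function names, the list-based representation of distributions and gradients, and the example numbers are all assumptions made for this example.

```python
import math


def ensemble_token_probs(per_model_probs):
    """Average the per-token distributions of N models (one row per model)."""
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def clip_gradient(grad, max_norm):
    """Rescale grad so that its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


# Two hypothetical models scoring a three-word vocabulary at one decoding step.
averaged = ensemble_token_probs([[0.7, 0.2, 0.1],
                                 [0.5, 0.3, 0.2]])
clipped = clip_gradient([3.0, 4.0], max_norm=1.0)
```

The average of valid distributions is itself a valid distribution, so the ensemble output can be used by the decoder exactly as a single model's output would be. Clipping rescales the whole gradient vector, preserving its direction.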
[Figure 18.6 diagram; labels: Query, Key, Value, score ψα, activation α, Output]
Figure 18.6: A general view of neural attention. The dotted box indicates that each α_{m→n}
can be viewed as a gate on value n.
9658 Is it possible for translation to be both contextualized and compositional? One ap-
9659 proach is to augment neural translation with an attention mechanism. The idea of neural
9660 attention was described in § 17.5, but its application to translation bears further discus-
9661 sion. In general, attention can be thought of as using a query to select from a memory
9662 of key-value pairs. However, the query, keys, and values are all vectors, and the entire
9663 operation is differentiable. For each key n in the memory, we compute a score ψα (m, n)
9664 with respect to the query m. That score is a function of the compatibility of the key and
9665 the query, and can be computed using a small feedforward neural network. The vector
9666 of scores is passed through an activation function, such as softmax. The output of this
activation function is a vector of non-negative numbers [α_{m→1}, α_{m→2}, ..., α_{m→N}]^⊤, with
length N equal to the size of the memory. Each value in the memory v_n is multiplied by
the attention α_{m→n}; the sum of these scaled values is the output. This process is shown in
Figure 18.6. In the extreme case that α_{m→n} = 1 and α_{m→n′} = 0 for all other n′, then the
attention mechanism simply selects the value v_n from the memory.
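The query, key, and value view of attention can be sketched in pure Python as follows. The score function is passed in as a parameter; the dot product used in the example is an assumed stand-in for the small feedforward network mentioned above, and the two-entry memory is a toy example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(query, keys, values, score):
    """Score each key against the query, normalize, and mix the values."""
    alphas = softmax([score(query, key) for key in keys])
    dim = len(values[0])
    output = [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(dim)]
    return alphas, output


def dot(u, v):
    """Assumed compatibility function; the text allows any scoring network."""
    return sum(a * b for a, b in zip(u, v))


keys = values = [[1.0, 0.0], [0.0, 1.0]]
flat_alphas, flat_output = attend([0.0, 0.0], keys, values, dot)
peaked_alphas, peaked_output = attend([50.0, 0.0], keys, values, dot)
```

A query that is equally compatible with both keys returns the average of the values, while a query strongly aligned with the first key approaches the extreme case in which the mechanism selects a single value from the memory.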
Neural attention makes it possible to integrate alignment into the encoder-decoder
architecture. Rather than encoding the entire source sentence into a fixed length vector z,
it can be encoded into a matrix Z ∈ R^{K×M^{(s)}}, where K is the dimension of the hidden
state, and M^{(s)} is the number of tokens in the source input. Each column of Z represents
9676 the state of a recurrent neural network over the source sentence. These vectors are con-
9677 structed from a bidirectional LSTM (see § 7.6), which can be a deep network as shown in
9678 Figure 18.5. These columns are both the keys and the values in the attention mechanism.
At each step m in decoding, the attentional state is computed by executing a query,
which is equal to the state of the decoder, h_m^{(t)}. The resulting compatibility scores are,
9679 The function ψ is thus a two layer feedforward neural network, with weights vα on the
9680 output layer, and weights Θα on the input layer. To convert these scores into atten-
9681 tion weights, we apply an activation function, which can be vector-wise softmax or an
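Softmax attention
\[ \alpha_{m \to n} = \frac{\exp \psi_\alpha(m, n)}{\sum_{n'=1}^{M^{(s)}} \exp \psi_\alpha(m, n')} \] [18.37]
Sigmoid attention
\[ \alpha_{m \to n} = \sigma(\psi_\alpha(m, n)) \] [18.38]
\[ c_m = \sum_{n=1}^{M^{(s)}} \alpha_{m \to n} z_n, \] [18.39]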
where αm→n ∈ [0, 1] is the amount of attention from word m of the target to word n of the
source. The context vector can be incorporated into the decoder’s word output probability
model, by adding another layer to the decoder (Luong et al., 2015):
\[ \tilde{h}_m^{(t)} = \tanh\left(\Theta_c [h_m^{(t)} ; c_m]\right) \] [18.40]
\[ p(w_{m+1}^{(t)} \mid w_{1:m}^{(t)}, w^{(s)}) \propto \exp\left(\beta_{w_{m+1}^{(t)}} \cdot \tilde{h}_m^{(t)}\right). \] [18.41]
Here the decoder state h_m^{(t)} is concatenated with the context vector, forming the input
to compute a final output vector h̃_m^{(t)}. The context vector can be incorporated into the
decoder recurrence in a similar manner (Bahdanau et al., 2014).
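Equations 18.37 to 18.40 can be traced numerically with a small Python sketch. It takes the compatibility scores ψα(m, n) as given, since any scoring network could produce them, and uses plain lists for vectors and matrices. The function names, the toy columns of Z, and the toy weight matrix are assumptions for illustration only.

```python
import math


def softmax_attention(scores):
    """Equation 18.37: weights are non-negative and sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid_attention(scores):
    """Equation 18.38: each weight lies in [0, 1] independently."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


def context_vector(alphas, z_columns):
    """Equation 18.39: attention-weighted sum of the columns of Z."""
    dim = len(z_columns[0])
    return [sum(a * z[k] for a, z in zip(alphas, z_columns)) for k in range(dim)]


def attentional_output(theta_c, h, c):
    """Equation 18.40: tanh of a linear map applied to the concatenation [h; c]."""
    concat = list(h) + list(c)
    return [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in theta_c]


z_columns = [[1.0, 0.0], [0.0, 1.0]]           # toy encoder states z_1, z_2
c_m = context_vector(softmax_attention([0.0, 0.0]), z_columns)
theta_c = [[1.0, 0.0, 0.0, 0.0],               # toy weights: pick h[0]
           [0.0, 0.0, 1.0, 0.0]]               # and c[0]
h_tilde = attentional_output(theta_c, [0.5, 9.0], [0.25, 7.0])
```

The practical difference between the two activations is visible in the weights themselves: softmax forces the source positions to compete for a fixed budget of attention, whereas sigmoid weights can all be close to one at the same time.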
\[ z_m^{(i)} = \sum_{n=1}^{M^{(s)}} \alpha_{m \to n}^{(i)} (\Theta_v h_n^{(i-1)}) \] [18.42]
\[ h_m^{(i)} = \Theta_2 \text{ReLU}\left(\Theta_1 z_m^{(i)} + b_1\right) + b_2. \] [18.43]
9687 For each token m at level i, we compute self-attention over the entire source sentence:
the keys, values, and queries are all projections of the vector h^{(i−1)}. The attention scores
α_{m→n}^{(i)} are computed using a scaled form of softmax attention,
9690 where M is the length of the input. This encourages the attention to be more evenly
9691 dispersed across the input. Self-attention is applied across multiple “heads”, each using
9692 different projections of h(i−1) to form the keys, values, and queries.
The output of the self-attentional layer is the representation z_m^{(i)}, which is then passed
9694 through a two-layer feed-forward network, yielding the input to the next layer, h(i) . To
9695 ensure that information about word order in the source is integrated into the model, the
9696 encoder includes positional encodings of the index of each word in the source. These
9697 encodings are vectors for each position m ∈ {1, 2, . . . , M }. The positional encodings are
9698 concatenated with the word embeddings xm at the base layer of the model.6
Convolutional neural networks (see § 3.4) have also been applied as encoders in neural
machine translation. For each word w_m^{(s)}, a convolutional network computes a
representation h_m^{(s)} from the embeddings of the word and its neighbors. This procedure is
9702 applied several times, creating a deep convolutional network. The recurrent decoder then
9703 computes a set of attention weights over these convolutional representations, using the
9704 decoder’s hidden state h(t) as the queries. This attention vector is used to compute a
9705 weighted average over the outputs of another convolutional neural network of the source,
9706 yielding an averaged representation cm , which is then fed into the decoder. As with the
9707 transformer, speed is the main advantage over recurrent encoding models; another sim-
9708 ilarity is that word order information is approximated through the use of positional en-
9709 codings. It seems likely that there are limitations to how well positional encodings can
9710 account for word order and deeper linguistic structure. But for the moment, the com-
9711 putational advantages of such approaches have put them on par with the best recurrent
9712 translation models.7
Source: The ecotax portico in Pont-de-buis was taken down on Thursday morning
Figure 18.7: Translation with unknown words. The system outputs unk to indicate words
that are outside its vocabulary. Figure adapted from Luong et al. (2015).
9717 • New proper nouns, such as family names or organizations, are constantly arising —
9718 particularly in the news domain. The same is true, to a lesser extent, for technical
9719 terminology. This issue is shown in Figure 18.7.
9720 • In many languages, words have complex internal structure, known as morphology.
9721 An example is German, which uses compounding to form nouns like Abwasserbe-
9722 handlungsanlage (sewage water treatment plant; example from Sennrich et al. (2016)).
9723 While compounds could in principle be addressed by better tokenization (see § 8.4),
9724 other morphological processes involve more complex transformations of subword
9725 units.
9726 Names and technical terms can be handled in a postprocessing step: after first identi-
9727 fying alignments between unknown words in the source and target, we can look up each
9728 aligned source word in a dictionary, and choose a replacement (Luong et al., 2015). If the
9729 word does not appear in the dictionary, it is likely to be a proper noun, and can be copied
9730 directly from the source to the target. This approach can also be integrated directly into
9731 the translation model, rather than applying it as a postprocessing step (Jean et al., 2015).
9732 Words with complex internal structure can be handled by translating subword units
9733 rather than entire words. A popular technique for identifying subword units is byte-pair
9734 encoding (BPE; Gage, 1994; Sennrich et al., 2016). The initial vocabulary is defined as the
9735 set of characters used in the text. The most common character bigram is then merged into
9736 a new symbol, and the vocabulary is updated. The merging operation is applied repeat-
9737 edly, until the vocabulary reaches some maximum size. For example, given the dictionary
9738 {fish, fished, want, wanted, bike, biked}, we would first merge e+d into the subword unit ed,
9739 since this bigram appears in three words of the six words. Next, there are several bigrams
9740 that each appear in a pair of words: f+i, i+s, s+h, w+a, a+n, etc. These can be merged
9741 in any order, resulting in the segmentation, {fish, fish+ed, want, want+ed, bik+e, bik+ed}. At
9742 this point, there are no subword bigrams that appear more than once. In real data, merg-
9743 ing is performed until the number of subword units reaches some predefined threshold,
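The merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Sennrich et al. (2016): the function names `learn_bpe` and `merge_pair` are hypothetical, bigrams are counted once per dictionary entry as in the toy example, and ties are broken alphabetically (an assumption; the text notes that tied merges can be applied in any order).

```python
from collections import Counter

def merge_pair(seg, pair):
    """Replace every adjacent occurrence of `pair` in `seg` with one merged symbol."""
    out, i = [], 0
    while i < len(seg):
        if i + 1 < len(seg) and (seg[i], seg[i + 1]) == pair:
            out.append(seg[i] + seg[i + 1])
            i += 2
        else:
            out.append(seg[i])
            i += 1
    return tuple(out)

def learn_bpe(words, max_merges=100, min_count=2):
    """Learn merges until no bigram occurs at least `min_count` times."""
    # Initial vocabulary: each word is split into single characters.
    segs = {w: tuple(w) for w in words}
    merges = []
    for _ in range(max_merges):
        counts = Counter()
        for seg in segs.values():
            counts.update(zip(seg, seg[1:]))
        if not counts:
            break
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best, n = max(sorted(counts.items()), key=lambda kv: kv[1])
        if n < min_count:
            break
        merges.append(best)
        segs = {w: merge_pair(seg, best) for w, seg in segs.items()}
    return merges, segs

merges, segs = learn_bpe(["fish", "fished", "want", "wanted", "bike", "biked"])
for word, seg in segs.items():
    print(word, "->", "+".join(seg))
```

On the six-word dictionary from the text, the first merge is e+d, and the loop stops once every remaining bigram appears only once. In practice `max_merges` plays the role of the vocabulary-size threshold.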
9751 where w(t) is a sequence of tokens from the target vocabulary V. It is not possible to
9752 efficiently obtain exact solutions to the decoding problem, for even minimally effective
9753 models in either statistical or neural machine translation. Today’s state-of-the-art transla-
9754 tion systems use beam search (see § 11.3.1.4), which is an incremental decoding algorithm
9755 that maintains a small constant number of competitive hypotheses. Such greedy approxi-
9756 mations are reasonably effective in practice, and this may be in part because the decoding
9757 objective is only loosely correlated with measures of translation quality, so that exact op-
9758 timization of [18.45] may not greatly improve the resulting translations.
Decoding in neural machine translation is somewhat simpler than in phrase-based
statistical machine translation.9 The scoring function Ψ is defined,
$$\Psi(w^{(t)}, w^{(s)}) = \sum_{m=1}^{M^{(t)}} \psi(w_m^{(t)}; w_{1:m-1}^{(t)}, z) \qquad [18.46]$$
$$\psi(w^{(t)}; w_{1:m-1}^{(t)}, z) = \beta_{w^{(t)}} \cdot h_m^{(t)} - \log \sum_{w \in \mathcal{V}} \exp\left(\beta_w \cdot h_m^{(t)}\right), \qquad [18.47]$$
where z is the encoding of the source sentence $w^{(s)}$, and $h_m^{(t)}$ is a function of the encoding z and the decoding history $w_{1:m-1}^{(t)}$. This formulation subsumes the attentional translation
9761 model, where z is a matrix encoding of the source.
Now consider the incremental decoding algorithm,
$$\hat{w}_m^{(t)} = \operatorname*{argmax}_{w \in \mathcal{V}} \psi(w; \hat{w}_{1:m-1}^{(t)}, z), \quad m = 1, 2, \ldots \qquad [18.48]$$
8
Transliteration is crucial for converting names and other foreign words between languages that do not
share a single script, such as English and Japanese. It is typically approached using the finite-state methods
discussed in chapter 9 (Knight and Graehl, 1998).
9
For more on decoding in phrase-based statistical models, see Koehn (2009).
This algorithm selects the best target language word at position m, assuming that it has already generated the sequence $\hat{w}_{1:m-1}^{(t)}$. (Termination can be handled by augmenting the vocabulary $\mathcal{V}$ with a special end-of-sequence token.) The incremental algorithm
9765 is likely to produce a suboptimal solution to the optimization problem defined in Equa-
9766 tion 18.45, because selecting the highest-scoring word at position m can set the decoder
9767 on a “garden path,” in which there are no good choices at some later position n > m. We
9768 might hope for some dynamic programming solution, as in sequence labeling (§ 7.3). But
9769 the Viterbi algorithm and its relatives rely on a Markov decomposition of the objective
9770 function into a sum of local scores: for example, scores can consider locally adjacent tags
$(y_m, y_{m-1})$, but not the entire tagging history $y_{1:m}$. This decomposition is not applicable to recurrent neural networks, because the hidden state $h_m^{(t)}$ is impacted by the entire history $w_{1:m}^{(t)}$; this sensitivity to long-range context is precisely what makes recurrent neural
9774 networks so effective.10 In fact, it can be shown that decoding from any recurrent neural
9775 network is NP-complete (Siegelmann and Sontag, 1995; Chen et al., 2018).
9776 Beam search Beam search is a general technique for avoiding search errors when ex-
9777 haustive search is impossible; it was first discussed in § 11.3.1.4. Beam search can be
9778 seen as a variant of the incremental decoding algorithm sketched in Equation 18.48, but
at each step m, a set of K different hypotheses are kept on the beam. For each hypothesis $k \in \{1, 2, \ldots, K\}$, we compute both the current score $\sum_{m=1}^{M} \psi(w_{k,m}^{(t)}; w_{k,1:m-1}^{(t)}, z)$ as well as the current hidden state $h_k^{(t)}$. At each step in the beam search, the K top-scoring children
9782 of each hypothesis currently on the beam are “expanded”, and the beam is updated. For
9783 a detailed description of beam search for RNN decoding, see Graves (2012).
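A minimal sketch of beam search over an incremental scorer may help make the procedure concrete. Everything here is illustrative: `step_logprobs` is a hypothetical interface that maps a prefix to next-token log-probabilities (in a real decoder it would be computed from the hidden state), and the toy model `toy` is constructed so that greedy decoding (K = 1) is led down a garden path while K = 2 recovers the higher-scoring output.

```python
import math

def beam_search(step_logprobs, K, max_len, eos="</s>"):
    """Return (score, tokens) for the best hypothesis found with beam size K."""
    beam = [(0.0, ())]   # (cumulative log-probability, prefix)
    finished = []
    for _ in range(max_len):
        # Expand every hypothesis on the beam by every possible next token.
        candidates = []
        for score, prefix in beam:
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((score + lp, prefix + (tok,)))
        # Keep only the K best expansions overall.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, hyp in candidates[:K]:
            (finished if hyp[-1] == eos else beam).append((score, hyp))
        if not beam:
            break
    return max(finished + beam, key=lambda c: c[0])

def toy(prefix):
    """Hypothetical next-token distribution, standing in for a trained decoder."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "y": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

print(beam_search(toy, K=1, max_len=5))  # greedy: commits to "a" at the first step
print(beam_search(toy, K=2, max_len=5))  # wider beam: keeps "b" alive and finds "b z"
```

With K = 1 the search commits to "a" (probability 0.6) and ends with total probability 0.3; with K = 2 it retains "b" and finds a sequence with probability 0.36.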
9784 Learning and search Conventionally, the learning algorithm is trained to predict the
9785 right token in the translation, conditioned on the translation history being correct. But
9786 if decoding must be approximate, then we might do better by modifying the learning
9787 algorithm to be robust to errors in the translation history. Scheduled sampling does this
9788 by training on histories that sometimes come from the ground truth, and sometimes come
9789 from the model’s own output (Bengio et al., 2015).11 As training proceeds, the training
9790 wheels come off: we increase the fraction of tokens that come from the model rather than
9791 the ground truth. Another approach is to train on an objective that relates directly to beam
9792 search performance (Wiseman et al., 2016). Reinforcement learning has also been applied
9793 to decoding of RNN-based translation models, making it possible to directly optimize
translation metrics such as BLEU (Ranzato et al., 2016).
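The core of scheduled sampling, mixing ground-truth and model-generated tokens in the training history, can be sketched as follows. This is a simplified illustration rather than the procedure of Bengio et al. (2015): `predict` is a hypothetical stand-in for the model's next-token prediction, and the linear schedule is an assumption chosen for brevity.

```python
import random

def sample_history(gold, predict, p_model, rng=random):
    """Build a training history, taking each token from the model with probability p_model."""
    history = []
    for gold_token in gold:
        if rng.random() < p_model:
            # Use the model's own prediction given the (possibly noisy) history so far.
            history.append(predict(tuple(history)))
        else:
            history.append(gold_token)
    return history

def linear_schedule(step, total_steps, p_max=1.0):
    """Hypothetical schedule: the fraction of model tokens grows linearly with training."""
    return p_max * min(1.0, step / total_steps)

gold = ["my", "mother", "died", "today"]
predict = lambda history: "<model>"   # placeholder for a real decoder
for step in (0, 500, 1000):
    p = linear_schedule(step, total_steps=1000)
    print(step, sample_history(gold, predict, p, random.Random(0)))
```

At the start of training the history is entirely ground truth; by the end it comes entirely from the model.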
10
Note that this problem does not impact RNN-based sequence labeling models (see § 7.6). This is because
the tags produced by these models do not affect the recurrent state.
11
Scheduled sampling builds on earlier work on learning to search (Daumé III et al., 2009; Ross et al.,
2011), which are also described in § 15.2.4.
9804 However, identifying the top-scoring translation ŵ(t) is usually intractable, as described
9805 in the previous section. In minimum error-rate training (MERT), ŵ(t) is selected from a
9806 set of candidate translations Y(w(s) ); this is typically a strict subset of all possible transla-
9807 tions, so that it is only possible to optimize an approximation to the true error rate (Och
9808 and Ney, 2003).
A further issue is that the objective function in Equation 18.50 is discontinuous and
non-differentiable, due to the argmax over translations: an infinitesimal change in the
parameters θ could cause another translation to be selected, with a completely different
error. To address this issue, we can instead minimize the risk, which is defined as the
expected error rate,
9809 Minimum risk training minimizes the sum of R(θ) across all instances in the training set.
The risk can be generalized by exponentiating the translation probabilities,
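$$\tilde{p}(w^{(t)}; \theta, \alpha) \propto \left(p(w^{(t)} \mid w^{(s)}; \theta)\right)^{\alpha} \qquad [18.53]$$
$$\tilde{R}(\theta) = \sum_{\hat{w}^{(t)} \in \mathcal{Y}(w^{(s)})} \tilde{p}(\hat{w}^{(t)} \mid w^{(s)}; \alpha, \theta) \times \Delta(\hat{w}^{(t)}, w^{(t)}) \qquad [18.54]$$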
9810 where Y(w(s) ) is now the set of all possible translations for w(s) . Exponentiating the prob-
9811 abilities in this way is known as annealing (Smith and Eisner, 2006). When α = 1, then
9812 R̃(θ) = R(θ); when α = ∞, then R̃(θ) is equivalent to the sum of the errors of the maxi-
9813 mum probability translations for each sentence in the dataset.
Clearly the set of candidate translations Y(w(s) ) is too large to explicitly sum over.
Because the error function ∆ generally does not decompose into smaller parts, there is
no efficient dynamic programming solution to sum over this set. We can approximate the sum $\sum_{\hat{w}^{(t)} \in \mathcal{Y}(w^{(s)})}$ with a sum over a finite number of samples, $\{w_1^{(t)}, w_2^{(t)}, \ldots, w_K^{(t)}\}$.
If these samples were drawn uniformly at random, then the (annealed) risk would be
approximated as (Shen et al., 2016),
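$$\tilde{R}(\theta) \approx \frac{1}{Z} \sum_{k=1}^{K} \tilde{p}(w_k^{(t)} \mid w^{(s)}; \theta, \alpha) \times \Delta(w_k^{(t)}, w^{(t)}) \qquad [18.55]$$
$$Z = \sum_{k=1}^{K} \tilde{p}(w_k^{(t)} \mid w^{(s)}; \theta, \alpha). \qquad [18.56]$$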
9814 Shen et al. (2016) report that performance plateaus at K = 100 for minimum risk training
9815 of neural machine translation.
Uniform sampling over the set of all possible translations is undesirable, because most
translations have very low probability. A solution from Monte Carlo estimation is impor-
tance sampling, in which we draw samples from a proposal distribution q(w(t) ). This
distribution can be set equal to the current translation model $p(w^{(t)} \mid w^{(s)}; \theta)$. Each sample is then weighted by an importance score, $\omega_k = \tilde{p}(w_k^{(t)} \mid w^{(s)}) / q(w_k^{(t)})$. The effect of this weighting
is to correct for any mismatch between the proposal distribution q and the true distribu-
tion p̃. The risk can then be approximated as,
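$$w_k^{(t)} \sim q(w^{(t)}) \qquad [18.57]$$
$$\omega_k = \frac{\tilde{p}(w_k^{(t)} \mid w^{(s)})}{q(w_k^{(t)})} \qquad [18.58]$$
$$\tilde{R}(\theta) \approx \frac{1}{\sum_{k=1}^{K} \omega_k} \sum_{k=1}^{K} \omega_k \times \Delta(w_k^{(t)}, w^{(t)}). \qquad [18.59]$$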
Importance sampling will generally give a more accurate approximation than uniform sampling for a given number of samples. The only formal requirement is that the proposal assigns non-zero
9818 probability to every w(t) ∈ Y(w(s) ). For more on importance sampling and related meth-
9819 ods, see Robert and Casella (2013).
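The self-normalized importance sampling estimate of the annealed risk can be written compactly. The sketch below is an illustration under stated assumptions: `annealed_risk` and its arguments are hypothetical names, the toy distribution over three candidate translations is invented for the example, and the samples are passed in as a list rather than drawn from a real translation model. Because the weights are normalized by their sum, the annealed distribution never needs to be normalized explicitly.

```python
import math

def annealed_risk(samples, logp, logq, delta, alpha=1.0):
    """Self-normalized importance sampling estimate of the annealed risk."""
    # Log importance weights: log(p^alpha / q), computed in log space for stability.
    logw = [alpha * logp(w) - logq(w) for w in samples]
    mx = max(logw)
    weights = [math.exp(l - mx) for l in logw]
    return sum(wk * delta(s) for wk, s in zip(weights, samples)) / sum(weights)

# Toy example (invented numbers): three candidate translations with error rates.
p = {"A": 0.5, "B": 0.3, "C": 0.2}
err = {"A": 0.0, "B": 0.5, "C": 1.0}
logp = lambda w: math.log(p[w])
delta = lambda w: err[w]
uniform = lambda w: math.log(1.0 / 3.0)

# Proposal equal to the model: all weights are equal, so the estimate is a plain average.
print(annealed_risk(["A", "B", "C", "A"], logp, logp, delta))
# Uniform proposal: the weights correct the mismatch between q and p.
print(annealed_risk(["A", "B", "C"], logp, uniform, delta))
# Large alpha concentrates the weight on the most probable translation.
print(annealed_risk(["A", "B", "C"], logp, uniform, delta, alpha=50.0))
```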
9836 Then, too, Smith has reconsidered the book’s famous opening. Camus’s
9837 original is deceptively simple: “Aujourd’hui, maman est morte.” Gilbert influ-
9838 enced generations by offering us “Mother died today”—inscribing in Meur-
9839 sault [the narrator] from the outset a formality that could be construed as
9840 heartlessness. But maman, after all, is intimate and affectionate, a child’s name
9841 for his mother. Matthew Ward concluded that it was essentially untranslatable
9842 (“mom” or “mummy” being not quite apt), and left it in the original French:
9843 “Maman died today.” There is a clear logic in this choice; but as Smith has
9844 explained, in an interview in The Guardian, maman “didn’t really tell the reader
9845 anything about the connotation.” She, instead, has translated the sentence as
9846 “My mother died today.”
9847 I chose “My mother” because I thought about how someone would
9848 tell another person that his mother had died. Meursault is speaking
9849 to the reader directly. “My mother died today” seemed to me the
9850 way it would work, and also implied the closeness of “maman” you
9851 get in the French.
9852 Elsewhere in the book, she has translated maman as “mama” — again, striving
9853 to come as close as possible to an actual, colloquial word that will carry the
9854 same connotations as maman does in French.
12 The book review is currently available online at http://www.nybooks.com/articles/2014/06/05/camus-new-letranger/.
9855 The passage is a useful reminder that while the quality of machine translation has
9856 improved dramatically in recent years, expert human translations draw on considerations
9857 that are beyond the ken of any known computational approach.
9858 Exercises
9862 As above, the second line shows a word-for-word gloss, and the third line shows
9863 the desired translation. Use the synchronized production rule in [18.22], and design
9864 the other production rules necessary to derive this sentence pair. You may derive
9865 (atacado, attacked) directly from VP.
9868 In many of the most interesting problems in natural language processing, language is
9869 the output. The previous chapter described the specific case of machine translation, but
9870 there are many other applications, from summarization of research articles, to automated
9871 journalism, to dialogue systems. This chapter emphasizes three main scenarios: data-to-
9872 text, in which text is generated to explain or describe a structured record or unstructured
9873 perceptual input; text-to-text, which typically involves fusing information from multiple
9874 linguistic sources into a single coherent summary; and dialogue, in which text is generated
9875 as part of an interactive conversation with one or more human participants.
9886 The earlier stages of this process are sometimes called content selection and text plan-
9887 ning; the later stages are often called surface realization.
9888 Early systems for data-to-text generation were modular, with separate software com-
9889 ponents for each task. Artificial intelligence planning algorithms can be applied to both
460 CHAPTER 19. TEXT GENERATION
Figure 19.1: An example input-output pair for the task of generating text descriptions of
weather forecasts (Konstas and Lapata, 2013). [todo: permission]
9890 the high-level information structure and the organization of individual sentences, ensur-
9891 ing that communicative goals are met (McKeown, 1992; Moore and Paris, 1993). Surface
9892 realization can be performed by grammars or templates, which link specific types of data
9893 to candidate words and phrases. A simple example template is offered by Wiseman et al.
9894 (2017), for generating descriptions of basketball games:
9898 For more complex cases, it may be necessary to apply morphological inflections such as
9899 pluralization and tense marking — even in the simple example above, languages such
9900 as Russian would require case marking suffixes for the team names. Such inflections can
9901 be applied as a postprocessing step. Another difficult challenge for surface realization is
9902 the generation of varied referring expressions (e.g., The Knicks, New York, they), which is
9903 critical to avoid repetition. As discussed in § 16.2.1, the form of referring expressions is
9904 constrained by the discourse and information structure.
9905 An example at the intersection of rule-based and statistical techniques is the Nitrogen
9906 system (Langkilde and Knight, 1998). The input to Nitrogen is an abstract meaning rep-
9907 resentation (AMR; see § 13.3) of semantic content to be expressed in a single sentence. In
9908 data-to-text scenarios, the abstract meaning representation is the output of a higher-level
9909 text planning stage. A set of rules then converts the abstract meaning representation into
9910 various sentence plans, which may differ in both the high-level structure (e.g., active ver-
9911 sus passive voice) as well as the low-level details (e.g., word and phrase choice). Some
9912 examples are shown in Figure 19.2. To control the combinatorial explosion in the number
9913 of possible realizations for any given meaning, the sentence plans are unified into a single
9914 finite-state acceptor, in which word tokens are represented by arcs (see § 9.1.1). A bigram
Figure 19.2: Abstract meaning representation and candidate surface realizations from the
Nitrogen system. Example adapted from Langkilde and Knight (1998).
9915 language model is then used to compute weights on the arcs, so that the shortest path is
9916 also the surface realization with the highest bigram language model probability.
9917 More recent systems are unified models that are trained end-to-end using backpropa-
9918 gation. Data-to-text generation shares many properties with machine translation, includ-
9919 ing a problem of alignment: labeled examples provide the data and the text, but they do
9920 not specify which parts of the text correspond to which parts of the data. For example, to
learn from Figure 19.1, the system must align the word cloudy to records in CLOUD SKY COVER, the phrases 10 and 20 degrees to the MIN and MAX fields in TEMPERATURE, and so on. As in machine translation, both latent variables and neural attention have been
9924 proposed as solutions.
where $\mathcal{V}^*$ is the set of strings over a discrete vocabulary, and θ is a vector of parameters.
9929 The relationship between w and y is complex: the data y may contain dozens of records,
9930 and w may extend to several sentences. To facilitate learning and inference, it would be
helpful to decompose the scoring function Ψ into subcomponents. This would be possible given an alignment that specifies which element of y is expressed in each part of
9933 w (Angeli et al., 2010):
$$\Psi(w, y; \theta) = \sum_{m=1}^{M} \psi_{w,y}(w_m, y_{z_m}) + \psi_z(z_m, z_{m-1}), \qquad [19.2]$$
9934 where zm indicates the record aligned to word m. For example, in Figure 19.1, z1 might
9935 specify that the word cloudy is aligned to the record cloud-sky-cover:percent. The
9936 score for this alignment would then be given by the weight on features such as
which could be learned from labeled data $\{(w^{(i)}, y^{(i)}, z^{(i)})\}_{i=1}^{N}$. The function $\psi_z$ can learn
9938 to assign higher scores to alignments that are coherent, referring to the same records in
9939 adjacent parts of the text.1
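The decomposition in Equation 19.2 can be illustrated with a small scoring function. This is a sketch under several assumptions: the feature weights are invented, the record name temperature:min is hypothetical (only cloud-sky-cover:percent appears in the text), $\psi_z$ is reduced to a single bonus for staying on the same record, and the transition term is skipped for the first word because there is no preceding alignment.

```python
def alignment_score(words, records, z, emit_weights, same_record_bonus):
    """Score a word-to-record alignment z as a sum of local terms."""
    score = 0.0
    for m, word in enumerate(words):
        # psi_{w,y}: weight on the (word, aligned record) feature, zero if unseen.
        score += emit_weights.get((word, records[z[m]]), 0.0)
        # psi_z: reward coherent alignments that stay on the same record.
        if m > 0 and z[m] == z[m - 1]:
            score += same_record_bonus
    return score

words = ["cloudy", "skies", "10", "degrees"]
records = ["cloud-sky-cover:percent", "temperature:min"]
emit_weights = {
    ("cloudy", "cloud-sky-cover:percent"): 2.0,
    ("10", "temperature:min"): 1.0,
    ("degrees", "temperature:min"): 1.5,
}

print(alignment_score(words, records, [0, 0, 1, 1], emit_weights, 0.5))  # coherent
print(alignment_score(words, records, [0, 1, 0, 1], emit_weights, 0.5))  # incoherent
```

The coherent alignment scores higher both because more word-record features fire and because it earns the bonus for adjacent words that refer to the same record.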
9940 Several datasets include structured records and natural language text (Barzilay and
9941 McKeown, 2005; Chen and Mooney, 2008; Liang and Klein, 2009), but the alignments
9942 between text and records are usually not available.2 One solution is to model the problem
9943 probabilistically, treating the alignment as a latent variable (Liang et al., 2009; Konstas
9944 and Lapata, 2013). The model can then be estimated using expectation maximization or
9945 sampling (see chapter 5).
Sequences Some types of structured records have a natural ordering, such as events in
a game (Chen and Mooney, 2008) and steps in a recipe (Tutin and Kittredge, 1992). For
example, the following records describe a sequence of events in a robot soccer match (Mei
1
More expressive decompositions of Ψ are possible. For example, Wong and Mooney (2007) use a syn-
chronous context-free grammar (see § 18.2.4) to “translate” between a meaning representation and natural
language text.
2
An exception is a dataset of records and summaries from American football games, containing annota-
tions of alignments between sentences and records (Snyder and Barzilay, 2007).
Figure 19.3: Examples of the image captioning task, with attention masks shown for each
of the underlined words. From Xu et al. (2015). [todo: permission]
et al., 2016):
Each event is a single record, and can be encoded by a concatenation of vector representations for the event type (e.g., PASS), the field (e.g., arg1), and the values (e.g., PURPLE3), e.g.,
$$X = [u_{\text{PASS}}, u_{\text{arg1}}, u_{\text{PURPLE6}}, u_{\text{arg2}}, u_{\text{PURPLE3}}]. \qquad [19.4]$$
This encoding can then act as the input layer for a recurrent neural network, yielding a sequence of vector representations $\{z_r\}_{r=1}^{R}$, where r indexes over records. Interestingly,
9969 this sequence-based approach is effective even in cases where there is no natural ordering
9970 over the records, such as the weather data in Figure 19.1 (Mei et al., 2016).
9971 Images Another flavor of data-to-text generation is the generation of text captions for
9972 images. Examples from this task are shown in Figure 19.3. Images are naturally repre-
9973 sented as tensors: a color image of 320 × 240 pixels would be stored as a tensor with
9974 320 × 240 × 3 intensity values. The dominant approach to image classification is to en-
9975 code images as vectors using a combination of convolution and pooling (Krizhevsky et al.,
9976 2012). Chapter 3 explains how to use convolutional networks for text; for images, convo-
9977 lution is applied across the vertical, horizontal, and color dimensions. By pooling the re-
9978 sults of successive convolutions, the image is converted to a vector representation, which
Figure 19.4: Neural attention in text generation. Figure from Mei et al. (2016).[todo: per-
mission]
9979 can then be fed directly into the decoder as the initial state (Vinyals et al., 2015), just as
9980 in the sequence-to-sequence translation model (see § 18.3). Alternatively, one can apply
9981 a set of convolutional networks, yielding vector representations for different parts of the
9982 image, which can then be combined using neural attention (Xu et al., 2015).
9984 where f is an elementwise nonlinearity such as tanh or ReLU (see § 3.2.1). The weighted
9985 average cm can then be included in the recurrent update to the decoder state, or in the
9986 emission probabilities, as described in § 18.3.1. Figure 19.4 shows the attention to compo-
9987 nents of a weather record, while generating the text shown on the x-axis.
where $\delta(w_r^{(s)} = w^{(t)})$ is an indicator function, taking the value 1 iff the text of the record $w_r^{(s)}$ is identical to the target word $w^{(t)}$. The probability of copying record r from the source is $\delta(s_m = \text{copy}) \times \alpha_{m \to r}$, the product of the copy probability by the local attention. Note that in this model, the attention weights $\alpha_m$ are computed from the previous decoder state $h_{m-1}$. The computation graph therefore remains a feedforward network, with recurrent paths such as $h_{m-1}^{(t)} \to \alpha_m \to w_m^{(t)} \to h_m^{(t)}$.
3
A number of variants of this strategy have been proposed (e.g., Gu et al., 2016; Merity et al., 2017). See
Wiseman et al. (2017) for an overview.
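$$p(w^{(t)} \mid w_{1:m}^{(t)}, Z) = \pi_m \times \underbrace{\frac{\exp\left(\beta_{w^{(t)}} \cdot h_{m-1}^{(t)}\right)}{\sum_{j=1}^{V} \exp\left(\beta_j \cdot h_{m-1}^{(t)}\right)}}_{\text{generate}} + (1 - \pi_m) \times \underbrace{\sum_{r=1}^{R} \delta(w_r^{(s)} = w^{(t)}) \times \alpha_{m \to r}}_{\text{copy}}. \qquad [19.10]$$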
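The mixture in Equation 19.10 can be checked numerically with a short sketch. The function name `copy_generate_prob` and the toy inputs are hypothetical; `vocab_scores` stands in for the inner products $\beta_w \cdot h_{m-1}^{(t)}$, which a real model would compute from the decoder state.

```python
import math

def copy_generate_prob(word, vocab_scores, record_words, attention, pi):
    """Probability of `word` as a mixture of generating from the vocabulary and copying."""
    # Generate: softmax over vocabulary scores (words outside the vocabulary get 0).
    mx = max(vocab_scores.values())
    norm = sum(math.exp(s - mx) for s in vocab_scores.values())
    p_gen = math.exp(vocab_scores[word] - mx) / norm if word in vocab_scores else 0.0
    # Copy: total attention on records whose text matches the word.
    p_copy = sum(a for w, a in zip(record_words, attention) if w == word)
    return pi * p_gen + (1.0 - pi) * p_copy

vocab_scores = {"the": 0.0, "is": 0.0}   # toy scores: uniform over two words
record_words = ["cloudy", "20"]          # text of each source record
attention = [0.75, 0.25]                 # attention weights over records
for w in ["the", "is", "cloudy", "20"]:
    print(w, copy_generate_prob(w, vocab_scores, record_words, attention, pi=0.5))
```

Since the generation and copy components are each normalized, the mixture sums to one over the union of the vocabulary and the record texts.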
10026 These problems can be approached in two ways: through the encoder-decoder architec-
10027 ture discussed in the previous section, or by operating directly on the input text.
4
In § 16.3.4.1, we encountered a special case of single-document summarization, which involved extract-
ing the most important sentences or discourse units. We now consider the more challenging problem of
abstractive summarization, in which the summary can include words that do not appear in the original text.
10034 (19.3) Russian defense minister Ivanov called sunday for the creation of a joint front for
10035 combating global terrorism.
10036 Russia calls for joint front against terrorism.
10038 Sentence summarization is closely related to sentence compression, in which the sum-
10039 mary is produced by deleting words or phrases from the original (Clarke and Lapata,
10040 2008). But as shown in (19.3), a sentence summary can also introduce new words, such as
against, which replaces the phrase for combating.
10042 Sentence summarization can be treated as a machine translation problem, using the at-
10043 tentional encoder-decoder translation model discussed in § 18.3.1 (Rush et al., 2015). The
10044 longer sentence is encoded into a sequence of vectors, one for each token. The decoder
10045 then computes attention over these vectors when updating its own recurrent state. As
10046 with data-to-text generation, it can be useful to augment the encoder-decoder model with
10047 the ability to copy words directly from the source. Rush et al. (2015) train this model by
10048 building four million sentence pairs from news articles. In each pair, the longer sentence is
10049 the first sentence of the article, and the summary is the article headline. Sentence summa-
10050 rization can also be trained in a semi-supervised fashion, using a probabilistic formulation
10051 of the encoder-decoder model called a variational autoencoder (Miao and Blunsom, 2016,
10052 also see § 14.8.2).
When summarizing longer documents, an additional concern is that the summary not
be repetitive: each part of the summary should cover new ground. This can be addressed by maintaining a vector of the sum total of all attention values thus far, $t_m = \sum_{n=1}^{m} \alpha_n$. This total can be used as an additional input to the computation of the attention weights,
$$\alpha_{m \to n} \propto \exp\left(v_\alpha \cdot \tanh\left(\Theta_\alpha [h_m^{(t)}; h_n^{(s)}; t_m]\right)\right), \qquad [19.11]$$
which enables the model to learn to prefer parts of the source which have not been at-
tended to yet (Tu et al., 2016). To further encourage diversity in the generated summary,
See et al. (2017) introduce a coverage loss to the objective function,
$$\ell_m = \sum_{n=1}^{M^{(s)}} \min(\alpha_{m \to n}, t_{m \to n}). \qquad [19.12]$$
10053 This loss will be low if αm→· assigns little attention to words that already have large
10054 values in tm→· . Coverage loss is similar to the concept of marginal relevance, in which
10055 the reward for adding new content is proportional to the extent to which it increases
10056 the overall amount of information conveyed by the summary (Carbonell and Goldstein,
10057 1998).
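A small numerical sketch shows how the coverage loss penalizes repeated attention. One assumption should be flagged: here the running total used at step m accumulates attention from strictly earlier steps, which I believe matches the definition in See et al. (2017); if the total included the current step, the minimum would always equal the current attention weight. The function name `coverage_losses` is hypothetical.

```python
def coverage_losses(attention):
    """Per-step coverage loss for a list of attention vectors over the source."""
    coverage = [0.0] * len(attention[0])   # attention accumulated over earlier steps
    losses = []
    for alpha in attention:
        # The loss is large when current attention overlaps with past attention.
        losses.append(sum(min(a, c) for a, c in zip(alpha, coverage)))
        coverage = [c + a for c, a in zip(coverage, alpha)]
    return losses

print(coverage_losses([[1.0, 0.0], [1.0, 0.0]]))  # re-attends to the same source word
print(coverage_losses([[1.0, 0.0], [0.0, 1.0]]))  # moves on to new material
```

Attending twice to the same source position incurs a loss at the second step, while shifting attention to uncovered positions incurs none.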
10064 (19.4) Palin actually turned against the bridge project only after it became a national
10065 symbol of wasteful spending.
10066 (19.5) Ms. Palin supported the bridge project while running for governor, and aban-
10067 doned it after it became a national scandal.
10068 An intersection preserves only the content that is present in both sentences:
10069 (19.6) Palin turned against the bridge project after it became a national scandal.
10071 (19.7) Ms. Palin supported the bridge project while running for governor, but turned
10072 against it when it became a national scandal and a symbol of wasteful spending.
Dependency parsing is often used as a technique for sentence fusion. After parsing
each sentence, the resulting dependency trees can be aggregated into a lattice (Barzilay
and McKeown, 2005) or a graph structure (Filippova and Strube, 2008), in which identical
or closely related words (e.g., Palin, bridge, national) are fused into a single node. The
resulting graph can then be pruned back to a tree by solving an integer linear program
(see § 13.2.2),
$$\max_{y} \sum_{i,j,r} \psi(i \xrightarrow{r} j, w; \theta) \times y_{i,j,r} \qquad [19.13]$$
$$\text{s.t.} \quad y \in \mathcal{C}, \qquad [19.14]$$
where the variable $y_{i,j,r} \in \{0, 1\}$ indicates whether there is an edge from i to j of type r, the score of this edge is $\psi(i \xrightarrow{r} j, w; \theta)$, and $\mathcal{C}$ is a set of constraints, described below. As usual, w is the list of words in the graph, and θ is a vector of parameters. The score $\psi(i \xrightarrow{r} j, w; \theta)$ reflects the “importance” of the modifier j to the overall meaning: in intersective fusion, this score indicates the extent to which the content in this edge is expressed in all sentences; in union fusion, the score indicates whether the content in the edge is expressed in any sentence.
10080 The constraint set C ensures that y forms a valid dependency graph. It can also im-
10081 pose additional linguistic constraints: for example, ensuring that coordinated nouns are
10082 sufficiently similar. The resulting tree must then be linearized into a sentence. This is
10083 typically done by generating a set of candidate linearizations, and choosing the one with
10084 the highest score under a language model (Langkilde and Knight, 1998; Song et al., 2016).
(19.9) I’d/O like/O anchovies/B-TOPPING ,/O and/O please/O bring/O it/O to/O the/B-ADDR College/I-ADDR of/I-ADDR Computing/I-ADDR Building/I-ADDR ./O
10114 The tagger can be driven by a bi-directional recurrent neural network, similar to recurrent
10115 approaches to semantic role labeling described in § 13.2.3.
10116 The input in (19.9) could not be handled by the finite-state system from Figure 19.5,
10117 which forces the user to provide the topping first, and then the location. In this sense,
10118 the initiative is driven completely by the system. Agenda-based dialogue systems ex-
10119 tend finite-state architectures by attempting to recognize all slots that are filled by the
10120 user’s reply, thereby handling these more complex examples. Agenda-based systems dy-
10121 namically pose additional questions until the frame is complete (Bobrow et al., 1977; Allen
10122 et al., 1995; Rudnicky and Xu, 1999). Such systems are said to be mixed-initiative, because
10123 both the user and the system can drive the direction of the dialogue.
10131 • Each state is a tuple of information about whether the topping and address are
10132 known, and whether the order has been confirmed. For example,
10133 is a possible state. Any state in which the pizza order is confirmed is a terminal
10134 state, and the Markov decision process stops after entering such a state.
10135 • The set of actions includes querying for the topping, querying for the address, and
10136 requesting confirmation. Each action induces a probability distribution over states,
10137 p(st | at , st−1 ). For example, requesting confirmation of the order is not likely to
10138 result in a transition to the terminal state if the topping is not yet known. This
10139 probability distribution over state transitions may be learned from data, or it may
10140 be specified in advance.
10141 • Each state-action-state tuple earns a reward, ra (st , st+1 ). In the context of the pizza
10142 ordering system, a simple reward function would be,
$$r_a(s_t, s_{t+1}) = \begin{cases} 0, & a = \text{CONFIRM},\ s_{t+1} = (*, *, \text{CONFIRMED}) \\ -10, & a = \text{CONFIRM},\ s_{t+1} = (*, *, \text{NOT CONFIRMED}) \\ -1, & a \neq \text{CONFIRM} \end{cases} \qquad [19.16]$$
10143 This function assigns zero reward for successful transitions to the terminal state, a
10144 large negative reward to a rejected request for confirmation, and a small negative re-
10145 ward for every other type of action. The system is therefore rewarded for reaching
10146 the terminal state in few steps, and penalized for prematurely requesting confirma-
10147 tion.
In a Markov decision process, a policy is a function $\pi : \mathcal{S} \mapsto \mathcal{A}$ that maps from states to actions (see § 15.2.4.3). The value of a policy is the expected sum of discounted rewards, $E_\pi[\sum_{t=1}^{T} \gamma^t r_{a_t}(s_t, s_{t+1})]$, where γ is the discount factor, $\gamma \in [0, 1)$. Discounting has the effect of emphasizing rewards that can be obtained immediately over less certain rewards in the distant future.
10153 An optimal policy can be obtained by dynamic programming, by iteratively updating
10154 the value function V (s), which is the expectation of the cumulative reward from s under
10155 the optimal action a,
$$V(s) \leftarrow \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\left[r_a(s, s') + \gamma V(s')\right]. \qquad [19.17]$$
The value function $V(s)$ is computed in terms of $V(s')$ for all states $s' \in \mathcal{S}$. A series
10157 of iterative updates to the value function will eventually converge to a stationary point.
10158 This algorithm is known as value iteration. Given the converged value function V (s), the
10160 Value iteration and related algorithms are described in detail by Sutton and Barto (1998).
10161 For applications to dialogue systems, see Levin et al. (1998) and Walker (2000).
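The update in Equation 19.17 can be sketched in a few lines of Python. The three-state dialogue below (`unknown`, `known`, `done`), its transition probabilities, and the zero reward in the absorbing terminal state are illustrative assumptions rather than part of the pizza example; only the reward pattern (0, −10, −1) follows Equation 19.16.

```python
def backup(s, a, V, transition, reward, gamma):
    """Expected reward plus discounted value of taking action a in state s."""
    return sum(p * (reward(a, s, s2) + gamma * V[s2])
               for s2, p in transition(s, a).items())


def value_iteration(states, actions, transition, reward,
                    gamma=0.9, tol=1e-10, max_iters=10_000):
    """Repeat V(s) <- max_a sum_s' p(s'|s,a)[r_a(s,s') + gamma V(s')] until stable."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        new_V = {s: max(backup(s, a, V, transition, reward, gamma) for a in actions)
                 for s in states}
        change = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if change < tol:
            break
    return V


def greedy_policy(states, actions, V, transition, reward, gamma=0.9):
    """Pick, in each state, the action that maximizes the one-step backup."""
    return {s: max(actions, key=lambda a: backup(s, a, V, transition, reward, gamma))
            for s in states}


# Hypothetical toy dialogue: is the topping known, and is the order confirmed?
STATES = ["unknown", "known", "done"]
ACTIONS = ["ask", "confirm"]


def transition(s, a):
    if s == "done":                      # absorbing terminal state (assumption)
        return {"done": 1.0}
    if a == "ask":
        return {"known": 0.8, "unknown": 0.2} if s == "unknown" else {"known": 1.0}
    # a == "confirm": only likely to succeed once the topping is known
    return {"done": 0.9, "known": 0.1} if s == "known" else {"unknown": 1.0}


def reward(a, s, s2):
    if s == "done":
        return 0.0                       # no further reward after termination (assumption)
    if a == "confirm":
        return 0.0 if s2 == "done" else -10.0
    return -1.0


V = value_iteration(STATES, ACTIONS, transition, reward)
policy = greedy_policy(STATES, ACTIONS, V, transition, reward)
```

Under these assumed numbers, the greedy policy asks for the topping while it is unknown and requests confirmation once it is known, which matches the intuition that premature confirmation requests are penalized.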
10162 The Markov decision process framework assumes that the current state of the dialogue
is known. In reality, the system may misinterpret the user’s statements: for example,
believing that a specification of the delivery location (PEACHTREE) is in fact a specification
of the topping (PEACHES). In a partially observable Markov decision process (POMDP),
the system receives an observation o, which is probabilistically conditioned on the state,
p(o | s). It must therefore maintain a distribution of beliefs about which state it is in, with
$q_t(s)$ indicating the degree of belief that the dialogue is in state s at time t. The POMDP
10169 formulation can help to make dialogue systems more robust to errors, particularly in the
10170 context of spoken language dialogues, where the speech itself may be misrecognized (Roy
10171 et al., 2000; Williams and Young, 2007). However, finding the optimal policy in a POMDP
10172 is computationally intractable, requiring additional approximations.
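Although finding the optimal policy is intractable, maintaining the belief distribution itself is straightforward. The sketch below uses the standard Bayes filter, in which the new belief is proportional to p(o | s′) Σ_s p(s′ | s, a) q_{t−1}(s); this update rule is not spelled out in the text, and the state names, observation names, and probabilities are made up for illustration.

```python
def belief_update(belief, action, observation, transition, obs_model):
    """One Bayes-filter step: predict with the transition model, then reweight by p(o | s)."""
    predicted = {
        s2: sum(transition(s, action).get(s2, 0.0) * q for s, q in belief.items())
        for s2 in belief
    }
    unnormalized = {s2: obs_model(observation, s2) * predicted[s2] for s2 in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s2: v / total for s2, v in unnormalized.items()}


# Hypothetical two-state example: did the user name a topping or a delivery location?
OBS = {
    "topping=peaches": {"heard_peaches": 0.7, "heard_peachtree": 0.3},
    "location=peachtree": {"heard_peaches": 0.4, "heard_peachtree": 0.6},
}


def transition(s, a):
    return {s: 1.0}          # assumption: the user's intent does not change while we listen


def obs_model(o, s):
    return OBS[s][o]


prior = {"topping=peaches": 0.2, "location=peachtree": 0.8}
posterior = belief_update(prior, "listen", "heard_peaches", transition, obs_model)
```

Hearing "peaches" raises the belief in the topping interpretation, but with this prior the delivery-location reading remains more probable, so a cautious system might ask a clarifying question before committing.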
While encoder-decoder models can generate responses that make sense in the context of the
immediately preceding turn, they struggle to maintain coherence over longer conversations.
One solution is to model the dialogue context recurrently. This creates a hierarchical
recurrent network, including both word-level and turn-level recurrences. The turn-level
hidden state is then used as additional context in the decoder (Serban et al., 2016), as
shown in Figure 19.6.

Figure 19.6: A hierarchical recurrent neural network for dialogue, with recurrence over
both words and turns, from Serban et al. (2016).

⁵ Twitter is also frequently used for construction of dialogue datasets (Ritter et al., 2011;
Sordoni et al., 2015). Another source is technical support chat logs from the Ubuntu Linux
distribution (Uthus and Aha, 2013; Lowe et al., 2015).
⁶ All examples are translated from Chinese by Shang et al. (2015).
10192 An open question is how to integrate the encoder-decoder architecture into task-oriented
10193 dialogue systems. Neural chatbots can be trained end-to-end: the user’s turn is analyzed
10194 by the encoder, and the system output is generated by the decoder. This architecture
10195 can be trained by log-likelihood using backpropagation (e.g., Sordoni et al., 2015; Serban
10196 et al., 2016), or by more elaborate objectives, using reinforcement learning (Li et al., 2016).
10197 In contrast, the task-oriented dialogue systems described in § 19.3.1 typically involve a
10198 set of specialized modules: one for recognizing the user input, another for deciding what
10199 action to take, and a third for arranging the text of the system output.
Recurrent neural network decoders can be integrated into Markov decision process
dialogue systems, by conditioning the decoder on a representation of the information
that is to be expressed in each turn (Wen et al., 2015). Specifically, the long short-term
memory (LSTM; § 6.3) architecture is augmented so that the memory cell at turn m takes
an additional input $d_m$, which is a representation of the slots and values to be expressed
10205 in the next turn. However, this approach still relies on additional modules to recognize
10206 the user’s utterance and to plan the overall arc of the dialogue.
10207 Another promising direction is to create embeddings for the elements in the domain:
10208 for example, the slots in a record and the entities that can fill them. The encoder then
10209 encodes not only the words of the user’s input, but the embeddings of the elements that
10210 the user mentions. Similarly, the decoder is endowed with the ability to refer to specific
10211 elements in the knowledge base. He et al. (2017) show that such a method can learn to
10212 play a collaborative dialogue game, in which both players are given a list of entities and
10213 their properties, and the goal is to find an entity that is on both players’ lists.
10228 Exercises
10229 1. The SimpleNLG system produces surface realizations from representations of de-
10230 sired syntactic structure (Gatt and Reiter, 2009). This system can be accessed on
10231 github at https://github.com/simplenlg/simplenlg. Download the sys-
10232 tem, and produce realizations of the following examples:
Then convert each example to a question.
10239 Probability
10240 Probability theory provides a way to reason about random events. The sorts of random
10241 events that are typically used to explain probability theory include coin flips, card draws,
10242 and the weather. It may seem odd to think about the choice of a word as akin to the flip of
10243 a coin, particularly if you are the type of person to choose words carefully. But random or
10244 not, language has proven to be extremely difficult to model deterministically. Probability
10245 offers a powerful tool for modeling and manipulating linguistic data.
Probability can be thought of in terms of random outcomes: for example, a single coin
flip has two possible outcomes, heads or tails. The set of possible outcomes is the sample
space, and a subset of the sample space is an event. For a sequence of two coin flips,
there are four possible outcomes, {HH, HT, TH, TT}, representing the ordered sequences
heads-heads, heads-tails, tails-heads, and tails-tails. The event of getting exactly one head
includes two outcomes: {HT, TH}.
Formally, a probability is a function from events to the interval between zero and one:
Pr : F → [0, 1], where F is the set of possible events. An event that is certain has
probability one; an event that is impossible has probability zero. For example, the probability
of getting fewer than three heads on two coin flips is one. Each outcome is also an event
(a set with exactly one element), and for two flips of a fair coin, the probability of each
outcome is,
\[
\Pr(\{HH\}) = \Pr(\{HT\}) = \Pr(\{TH\}) = \Pr(\{TT\}) = \frac{1}{4}. \tag{A.1}
\]
In the coin flip example, the event of obtaining a single head on two flips corresponds to
the set of outcomes {HT, TH}; the complement event includes the other two outcomes,
{TT, HH}.
Furthermore, the event B is the union of two disjoint events: A ∩ B and B − (A ∩ B).
\[
\Pr(B) = \Pr(B - (A \cap B)) + \Pr(A \cap B). \tag{A.7}
\]
Reorganizing and substituting into Equation A.6 gives the desired result:
\begin{align}
\Pr(B - (A \cap B)) &= \Pr(B) - \Pr(A \cap B) \tag{A.8} \\
\Pr(A \cup B) &= \Pr(A) + \Pr(B) - \Pr(A \cap B). \tag{A.9}
\end{align}
[Figure: Venn diagram of events A and B, with the overlap A ∩ B.]
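For any event B, the events B and ¬B form a partition of the sample space. Therefore, a
special case of the law of total probability is,
\[
\Pr(A) = \Pr(A \cap B) + \Pr(A \cap \neg B).
\]
The conditional probability of A given B is defined as,
\[
\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}. \tag{A.12}
\]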
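The chain rule of probability states that Pr(A ∩ B) = Pr(A | B) × Pr(B), which is just
a rearrangement of terms from Equation A.12. The chain rule can be applied repeatedly;
for example, for three events,
\[
\Pr(A \cap B \cap C) = \Pr(A \mid B \cap C) \times \Pr(B \mid C) \times \Pr(C).
\]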
Bayes’ rule (sometimes called Bayes’ law or Bayes’ theorem) gives us a way to convert
between Pr(A | B) and Pr(B | A). It follows from the definition of conditional probability
and the chain rule:
\[
\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{\Pr(B \mid A) \times \Pr(A)}{\Pr(B)} \tag{A.13}
\]
Each term in Bayes’ rule has a name, which we will occasionally use:
10282 • Pr(A) is the prior, since it is the probability of event A without knowledge about
10283 whether B happens or not.
10284 • Pr(B | A) is the likelihood, the probability of event B given that event A has oc-
10285 curred.
10286 • Pr(A | B) is the posterior, the probability of event A with knowledge that B has
10287 occurred.
Example. The classic examples for Bayes’ rule involve tests for rare diseases, but Manning
and Schütze (1999) reframe this example in a linguistic setting. Suppose that you are
interested in a rare syntactic construction, such as parasitic gaps, which occur on average
once in 100,000 sentences. Here is an example of a parasitic gap:
10292 (A.1) Which class did you attend without registering for ?
10293 Lana Linguist has developed a complicated pattern matcher that attempts to identify
10294 sentences with parasitic gaps. It’s pretty good, but it’s not perfect:
• If a sentence has a parasitic gap, the pattern matcher will find it with probability
0.95. (This is the recall, which is one minus the false negative rate.)
• If the sentence doesn’t have a parasitic gap, the pattern matcher will wrongly say it
does with probability 0.005. (This is the false positive rate, which is one minus the
specificity.)
10300 Suppose that Lana’s pattern matcher says that a sentence contains a parasitic gap. What
10301 is the probability that this is true?
Let G be the event of a sentence having a parasitic gap, and T be the event of the test
being positive. We are interested in the probability of a sentence having a parasitic gap
given that the test is positive. This is the conditional probability Pr(G | T ), and it can be
computed by Bayes’ rule:
\[
\Pr(G \mid T) = \frac{\Pr(T \mid G) \times \Pr(G)}{\Pr(T)}. \tag{A.14}
\]
We already know both terms in the numerator: Pr(T | G) is the recall, which is 0.95; Pr(G)
is the prior, which is $10^{-5}$.
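We are not given the denominator, but it can be computed using tools developed earlier
in this section. First apply the law of total probability, using the partition {G, ¬G}:
\[
\Pr(T) = \Pr(T \cap G) + \Pr(T \cap \neg G).
\]
This says that the probability of the test being positive is the sum of the probability of a
true positive (T ∩ G) and the probability of a false positive (T ∩ ¬G). The probability of
each of these events can be computed using the chain rule:
\begin{align*}
\Pr(T \cap G) &= \Pr(T \mid G) \times \Pr(G) = 0.95 \times 10^{-5} \\
\Pr(T \cap \neg G) &= \Pr(T \mid \neg G) \times \Pr(\neg G) = 0.005 \times (1 - 10^{-5}).
\end{align*}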
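Plugging these terms into Bayes’ rule gives the desired posterior probability,
\begin{align}
\Pr(G \mid T) &= \frac{\Pr(T \mid G) \Pr(G)}{\Pr(T)} \tag{A.20} \\
&= \frac{0.95 \times 10^{-5}}{0.95 \times 10^{-5} + 0.005 \times (1 - 10^{-5})} \tag{A.21} \\
&\approx 0.002. \tag{A.22}
\end{align}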
Lana’s pattern matcher seems accurate, with false positive and false negative rates
of at most 5%. Yet the extreme rarity of the phenomenon means that a positive result from the
detector is most likely to be wrong.
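The arithmetic in this example is easy to check numerically. The following minimal sketch computes the posterior from the prior, the recall, and the false positive rate; the function name `posterior` is only an illustrative choice.

```python
def posterior(prior, recall, false_positive_rate):
    """Pr(G | T) by Bayes' rule, with Pr(T) expanded by the law of total probability."""
    true_positive = recall * prior                         # Pr(T | G) * Pr(G)
    false_positive = false_positive_rate * (1.0 - prior)   # Pr(T | not G) * Pr(not G)
    return true_positive / (true_positive + false_positive)


p_gap_given_positive = posterior(prior=1e-5, recall=0.95, false_positive_rate=0.005)
```

With the numbers from the example, fewer than two in a thousand positive results correspond to real parasitic gaps. Raising the prior to 0.01 while keeping the same detector would already push the posterior above one half, which shows how strongly the result depends on the base rate.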
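Two events A and B are independent if Pr(A ∩ B) = Pr(A) × Pr(B). For example, for two
flips of a fair coin, the probability of getting heads on the first flip is independent of
the probability of getting heads on the second flip: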
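\begin{align}
\Pr(\{HT, HH\}) &= \Pr(HT) + \Pr(HH) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2} \tag{A.23} \\
\Pr(\{HH, TH\}) &= \Pr(HH) + \Pr(TH) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2} \tag{A.24} \\
\Pr(\{HT, HH\}) \times \Pr(\{HH, TH\}) &= \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \tag{A.25} \\
\Pr(\{HT, HH\} \cap \{HH, TH\}) &= \Pr(HH) = \frac{1}{4} \tag{A.26} \\
&= \Pr(\{HT, HH\}) \times \Pr(\{HH, TH\}). \tag{A.27}
\end{align}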
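The same check can be carried out by enumerating the sample space. This is a small sketch that assumes a fair coin, so that every outcome is equally likely; the helper names `pr` and `independent` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for two flips of a fair coin: HH, HT, TH, TT
OUTCOMES = {"".join(flips) for flips in product("HT", repeat=2)}


def pr(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))


def independent(a, b):
    """Events are independent iff Pr(A and B) = Pr(A) * Pr(B)."""
    return pr(a & b) == pr(a) * pr(b)


first_heads = {"HT", "HH"}
second_heads = {"HH", "TH"}
at_least_one_head = {"HH", "HT", "TH"}

flips_are_independent = independent(first_heads, second_heads)
overlapping_events_independent = independent(first_heads, at_least_one_head)
```

The first pair of events passes the test, while "heads on the first flip" and "at least one head" do not: knowing that the first flip was heads guarantees that the second event has occurred.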
If Pr(A ∩ B | C) = Pr(A | C) × Pr(B | C), then the events A and B are conditionally
independent, written A ⊥ B | C. Conditional independence plays an important role in
probabilistic models such as Naïve Bayes (chapter 2).
• An indicator random variable is a function from events to the set {0, 1}. In the coin
flip example, we can define Y as an indicator random variable, taking the value
1 when the coin has come up heads on at least one flip. This would include the
outcomes {HH, HT, TH}. The probability Pr(Y = 1) is the sum of the probabilities
of these outcomes, Pr(Y = 1) = 1/4 + 1/4 + 1/4 = 3/4.
• A discrete random variable is a function from events to a discrete subset of ℝ. Consider
the coin flip example: the number of heads on two flips, X, can be viewed as a
discrete random variable, X ∈ {0, 1, 2}. The event probability Pr(X = 1) can again be
computed as the sum of the probabilities of the events in which there is one head,
{HT, TH}, giving Pr(X = 1) = 1/4 + 1/4 = 1/2.
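Both random variables can be written directly as functions on outcomes, with their distributions obtained by summing outcome probabilities. The sketch below assumes a fair coin; the names `distribution`, `num_heads`, and `at_least_one_head` are chosen for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Outcomes of two fair coin flips, each with probability 1/4
OUTCOMES = ["".join(flips) for flips in product("HT", repeat=2)]
PROB = {outcome: Fraction(1, len(OUTCOMES)) for outcome in OUTCOMES}


def distribution(random_variable):
    """Sum the probability of every outcome that maps to each value of the variable."""
    dist = defaultdict(Fraction)
    for outcome, p in PROB.items():
        dist[random_variable(outcome)] += p
    return dict(dist)


def num_heads(outcome):
    """The discrete random variable X: number of heads in the outcome."""
    return outcome.count("H")


def at_least_one_head(outcome):
    """The indicator random variable Y: 1 if any flip came up heads, else 0."""
    return int("H" in outcome)


p_X = distribution(num_heads)
p_Y = distribution(at_least_one_head)
```

Exact fractions are used so that the results can be compared directly with the values 3/4 and 1/2 computed by hand above.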
Each possible value of a random variable is associated with a subset of the sample
space. In the coin flip example, X = 0 is associated with the event {TT}, X = 1 is
associated with the event {HT, TH}, and X = 2 is associated with the event {HH}.
Assuming a fair coin, the probabilities of these events are, respectively, 1/4, 1/2, and 1/4.
This list of numbers represents the probability distribution over X, written $p_X$, which
maps from the possible values of X to the non-negative reals. For a specific value x, we
write $p_X(x)$, which is equal to the event probability Pr(X = x).¹ The function $p_X$ is called
a probability mass function (pmf) when X is discrete.

¹ In general, capital letters (e.g., X) refer to random variables, and lower-case letters (e.g., x) refer to
specific values. When the distribution is clear from context, I will simply write p(x).
Probabilities over multiple random variables can be written as joint probabilities, e.g.,
$p_{A,B}(a, b) = \Pr(A = a \cap B = b)$. Several properties of event probabilities carry over to
probability distributions over random variables:
• The marginal probability distribution is $p_A(a) = \sum_b p_{A,B}(a, b)$.
• The conditional probability distribution is $p_{A \mid B}(a \mid b) = \frac{p_{A,B}(a, b)}{p_B(b)}$.
• Random variables A and B are independent iff $p_{A,B}(a, b) = p_A(a) \times p_B(b)$.
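These three properties can be illustrated on a small joint distribution. In the sketch below, A is the result of the first of two fair coin flips and B is the total number of heads; this particular choice of variables is an assumption made for the example.

```python
from fractions import Fraction
from itertools import product

# Joint distribution over A (first flip) and B (number of heads in two fair flips)
joint = {}
for first, second in product("HT", repeat=2):
    key = (first, (first + second).count("H"))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 4)


def marginal_a(a):
    """p_A(a): sum the joint over all values of B."""
    return sum(p for (a2, _), p in joint.items() if a2 == a)


def marginal_b(b):
    """p_B(b): sum the joint over all values of A."""
    return sum(p for (_, b2), p in joint.items() if b2 == b)


def conditional_a_given_b(a, b):
    """p_{A|B}(a | b) = p_{A,B}(a, b) / p_B(b)."""
    return joint.get((a, b), Fraction(0)) / marginal_b(b)


def are_independent():
    """A and B are independent iff the joint factorizes for every pair of values."""
    a_values = {a for a, _ in joint}
    b_values = {b for _, b in joint}
    return all(joint.get((a, b), Fraction(0)) == marginal_a(a) * marginal_b(b)
               for a in a_values for b in b_values)
```

Here the two variables are clearly dependent: observing two heads in total makes it certain that the first flip was heads, so `conditional_a_given_b("H", 2)` is one even though `marginal_a("H")` is one half.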
For example, a fast food restaurant in Quebec has a special offer for cold days: they give
a 1% discount on poutine for every degree below zero. Assuming a thermometer with
infinite precision, the expected price would be an integral over all possible temperatures,
\[
E[\text{price}(x)] = \int_{\mathcal{X}} \min\left(1, 1 + \frac{x}{100}\right) \times \text{original-price} \times p(x)\,dx, \tag{A.31}
\]
where x is the temperature in degrees, so that each degree below zero reduces the price by 1%.
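An expectation of this kind can be approximated by numerical integration. The sketch below reads the offer as a price factor of min(1, 1 + x/100) for a temperature of x degrees, and it assumes, purely for illustration, that temperatures follow a normal distribution; neither the distribution nor its parameters come from the text.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution, used here as an assumed model of temperature."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def expected_price(original_price, mean, std, steps=20_000):
    """Midpoint-rule approximation of the integral of price(x) * p(x) dx."""
    lo, hi = mean - 8.0 * std, mean + 8.0 * std   # tails beyond 8 std are negligible
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        discount_factor = min(1.0, 1.0 + x / 100.0)   # 1% off per degree below zero
        total += discount_factor * original_price * normal_pdf(x, mean, std) * width
    return total


warm_day = expected_price(10.0, mean=20.0, std=1.0)     # essentially never below zero
cold_day = expected_price(10.0, mean=-10.0, std=0.5)    # about ten degrees below zero
winter = expected_price(10.0, mean=-5.0, std=10.0)      # a mix of both regimes
```

When nearly all of the probability mass lies above zero the expected price equals the original price, and when it lies tightly around −10 degrees the expected price is about 90% of the original, as the discount rule suggests.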
Probabilistic models provide a principled way to reason about random events and random
variables. Let’s consider the coin toss example. Each toss can be modeled as a random
event, with probability θ of the event H, and probability 1 − θ of the complementary
event T. If we write a random variable X as the total number of heads on three coin
flips, then the distribution of X depends on θ. In this case, X is distributed as a binomial
random variable, meaning that it is drawn from a binomial distribution, with parameters
(θ, N = 3). This is written,
\[
X \sim \text{Binomial}(\theta, N = 3).
\]
The properties of the binomial distribution enable us to make statements about X,
such as its expected value and the likelihood that its value will fall within some interval.
Now suppose that θ is unknown, but we have run an experiment, in which we executed
N trials, and obtained x heads. We can estimate θ by the principle of maximum
likelihood:
\[
\hat{\theta} = \operatorname*{argmax}_{\theta} \, p_X(x; \theta, N).
\]
This says that the estimate θ̂ should be the value that maximizes the likelihood of the
data. The semicolon indicates that θ and N are parameters of the probability function.
The likelihood $p_X(x; \theta, N)$ can be computed from the binomial distribution,
\[
p_X(x; \theta, N) = \frac{N!}{x!(N - x)!} \theta^x (1 - \theta)^{N - x}. \tag{A.34}
\]
This likelihood is proportional to the product of the probability of individual outcomes:
for example, the sequence T, H, H, T, H would have probability $\theta^3 (1 - \theta)^2$. The
term $\frac{N!}{x!(N - x)!}$ arises from the many possible orderings by which we could obtain x heads
on N trials. This term does not depend on θ, so it can be ignored during estimation.
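In practice, we maximize the log-likelihood, which is a monotonic function of the
likelihood. Under the binomial distribution, the log-likelihood is a concave function of θ
(equivalently, the negative log-likelihood is convex; see § 2.3), so it can be maximized by
taking the derivative and setting it equal to zero:
\begin{align*}
\ell(\theta) &= x \log \theta + (N - x) \log(1 - \theta) + \text{const} \\
\frac{\partial \ell}{\partial \theta} &= \frac{x}{\theta} - \frac{N - x}{1 - \theta} = 0.
\end{align*}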
In this case, the maximum likelihood estimate is equal to x/N, the fraction of trials that
came up heads. This intuitive solution is also known as the relative frequency estimate,
since it is equal to the relative frequency of the outcome.
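The relative frequency estimate can be checked numerically by evaluating the binomial log-likelihood on a grid of candidate values for θ. This is only a sanity check under an assumed experiment (x = 7 heads in N = 10 trials), not a general-purpose estimator.

```python
import math


def log_likelihood(theta, x, n):
    """Log of the binomial probability of x heads in n trials with heads probability theta."""
    return (math.log(math.comb(n, x))
            + x * math.log(theta)
            + (n - x) * math.log(1.0 - theta))


x, n = 7, 10                                   # hypothetical experiment
grid = [i / 1000 for i in range(1, 1000)]      # candidate values strictly inside (0, 1)
theta_hat = max(grid, key=lambda theta: log_likelihood(theta, x, n))
```

The grid search lands on θ̂ = 0.7 = x/N. Dropping the `math.comb` term would not change the maximizer, since that term does not depend on θ.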
Is maximum likelihood estimation always the right choice? Suppose you conduct one
trial, and get heads. Would you conclude that θ = 1, meaning that the coin is guaran-
teed to come up heads? If not, then you must have some prior expectation about θ. To
incorporate this prior information, we can treat θ as a random variable, and use Bayes’
rule:
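\begin{align}
p(\theta \mid x; N) &= \frac{p(x \mid \theta) \times p(\theta)}{p(x)} \tag{A.41} \\
&\propto p(x \mid \theta) \times p(\theta) \tag{A.42} \\
\hat{\theta} &= \operatorname*{argmax}_{\theta} \, p(x \mid \theta) \times p(\theta). \tag{A.43}
\end{align}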
This is the maximum a posteriori (MAP) estimate. Given a form for p(θ), you can derive
the MAP estimate using the same approach that was used to derive the maximum
likelihood estimate.
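As an illustration, suppose the prior p(θ) is a Beta(a, b) distribution. This choice is an assumption made here for convenience, since the text does not commit to any form for the prior. With a, b > 1, setting the derivative of the log-posterior to zero gives the closed form (x + a − 1)/(N + a + b − 2), which the sketch below compares against a grid search.

```python
import math


def map_estimate(x, n, a, b):
    """Closed-form MAP estimate under an assumed Beta(a, b) prior with a, b > 1."""
    return (x + a - 1) / (n + a + b - 2)


def log_posterior(theta, x, n, a, b):
    """Unnormalized log-posterior: binomial log-likelihood plus Beta log-prior."""
    return ((x + a - 1) * math.log(theta)
            + (n - x + b - 1) * math.log(1.0 - theta))


# One trial that came up heads, with a mild prior belief that the coin is fair.
x, n, a, b = 1, 1, 2.0, 2.0
theta_map = map_estimate(x, n, a, b)
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda theta: log_posterior(theta, x, n, a, b))
```

After a single head, the maximum likelihood estimate would be θ = 1, whereas this prior pulls the estimate back to 2/3.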
Figure B.1: (a) The function f(x) = (x − 2)² + 3. (b) The function f(x) = |x| − 2 cos(x).
APPENDIX B. NUMERICAL OPTIMIZATION
derivatives are,
\begin{align}
\frac{\partial d}{\partial x_1} &= x_1 - 2 \tag{B.2} \\
\frac{\partial d}{\partial x_2} &= x_2 - 1 \tag{B.3}
\end{align}
The unique minimum is $x^* = [2, 1]^\top$.
For non-convex functions, critical points are not necessarily global minima. A local
minimum x* is a point at which the function takes a smaller value than at all nearby
neighbors: formally, x* is a local minimum if there is some positive ε such that f(x*) ≤
f(x) for all x within distance ε of x*. Figure B.1b shows the function f(x) = |x| − 2 cos(x),
which has many local minima, as well as a unique global minimum at x = 0. A critical
point may also be the local or global maximum of the function; it may be a saddle point,
which is a minimum with respect to at least one coordinate, and a maximum with respect
to at least one other coordinate; it may be an inflection point, which is neither a minimum
nor a maximum. When available, the second derivative of f can help to distinguish these
cases.
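Gradient descent iteratively updates the current point in the direction of the negative gradient,
\[
x^{(t+1)} \leftarrow x^{(t)} - \eta^{(t)} \nabla_x f(x^{(t)}),
\]
where $\eta^{(t)} > 0$ is a step size. If the step size is chosen appropriately, this procedure will
find the global minimum of a differentiable convex function. For non-convex functions,
gradient descent will typically converge to a local minimum. The extension to non-differentiable
convex functions is discussed in § 2.3.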
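As a concrete illustration, the sketch below runs gradient descent with a constant step size on the convex function f(x) = (x − 2)² + 3 from Figure B.1a. The starting point, step size, and number of iterations are arbitrary choices made for the example.

```python
def f(x):
    """The convex function from Figure B.1a."""
    return (x - 2.0) ** 2 + 3.0


def grad_f(x):
    """Derivative of f with respect to x."""
    return 2.0 * (x - 2.0)


def gradient_descent(grad, x0, step_size=0.1, num_steps=200):
    """Repeatedly move against the gradient using a constant step size."""
    x = x0
    for _ in range(num_steps):
        x = x - step_size * grad(x)
    return x


x_min = gradient_descent(grad_f, x0=-3.0)
```

With a step size of 0.1, each update shrinks the distance to the minimizer x = 2 by a factor of 0.8, so the iterates converge quickly. For this function a step size above 1.0 would make the iterates diverge, which is one reason the choice of step size matters.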
where each $g_i(x)$ is a scalar function of x. For example, suppose that x must be
non-negative, and that its sum cannot exceed a budget b. Then there are D + 1 inequality
constraints,
\begin{align}
g_i(x) &= -x_i, \quad \forall i = 1, 2, \ldots, D \tag{B.7} \\
g_{D+1}(x) &= -b + \sum_{i=1}^{D} x_i. \tag{B.8}
\end{align}
Inequality constraints can be combined with the original objective function f by forming
a Lagrangian,
\[
L(x, \lambda) = f(x) + \sum_{c=1}^{C} \lambda_c g_c(x), \tag{B.9}
\]
where $\lambda_c$ is a Lagrange multiplier. For any Lagrangian, there is a corresponding dual
form, which is a function of λ:
\[
D(\lambda) = \min_{x} L(x, \lambda).
\]
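where $\theta^{(i-1)}$ is the previous set of weights, and $\ell^{(i)}(\theta)$ is the margin loss on instance i. As
in § 2.3.1, this loss is defined as,
\[
\ell^{(i)}(\theta) = \left(1 - \theta \cdot f(x^{(i)}, y^{(i)}) + \max_{y \neq y^{(i)}} \theta \cdot f(x^{(i)}, y)\right)_+ .
\]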
When the margin loss is zero for $\theta^{(i-1)}$, the optimal solution is simply to set $\theta^* = \theta^{(i-1)}$,
so we will focus on the case where $\ell^{(i)}(\theta^{(i-1)}) > 0$. The Lagrangian for this problem
is,
\[
L(\theta, \lambda) = \frac{1}{2} \|\theta - \theta^{(i-1)}\|^2 + \lambda \ell^{(i)}(\theta), \tag{B.14}
\]
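Holding λ constant, we can solve for θ by differentiating,
\begin{align}
\nabla_\theta L &= \theta - \theta^{(i-1)} + \lambda \frac{\partial}{\partial \theta} \ell^{(i)}(\theta) \tag{B.15} \\
\theta^* &= \theta^{(i-1)} + \lambda \delta, \tag{B.16}
\end{align}
where $\delta = f(x^{(i)}, y^{(i)}) - f(x^{(i)}, \hat{y})$ and $\hat{y} = \operatorname*{argmax}_{y \neq y^{(i)}} \theta \cdot f(x^{(i)}, y)$.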
The Lagrange multiplier λ acts as the learning rate in a perceptron-style update to θ.
We can solve for λ by plugging θ ∗ back into the Lagrangian, obtaining the dual function,
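\begin{align}
D(\lambda) &= \frac{1}{2} \|\theta^{(i-1)} + \lambda\delta - \theta^{(i-1)}\|^2 + \lambda\left(1 - (\theta^{(i-1)} + \lambda\delta) \cdot \delta\right) \tag{B.17} \\
&= \frac{\lambda^2}{2} \|\delta\|^2 - \lambda^2 \|\delta\|^2 + \lambda\left(1 - \theta^{(i-1)} \cdot \delta\right) \tag{B.18} \\
&= -\frac{\lambda^2}{2} \|\delta\|^2 + \lambda \ell^{(i)}(\theta^{(i-1)}). \tag{B.19}
\end{align}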
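Differentiating and solving for λ,
\begin{align}
\frac{\partial D}{\partial \lambda} &= -\lambda \|\delta\|^2 + \ell^{(i)}(\theta^{(i-1)}) \tag{B.20} \\
\lambda^* &= \frac{\ell^{(i)}(\theta^{(i-1)})}{\|\delta\|^2}. \tag{B.21}
\end{align}
Substituting λ* into Equation B.16 gives the complete update,
\[
\theta^* = \theta^{(i-1)} + \frac{\ell^{(i)}(\theta^{(i-1)})}{\|f(x^{(i)}, y^{(i)}) - f(x^{(i)}, \hat{y})\|^2} \left(f(x^{(i)}, y^{(i)}) - f(x^{(i)}, \hat{y})\right). \tag{B.22}
\]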
10408 This update has strong intuitive support. The numerator of the learning rate grows with
10409 the loss. The denominator grows with the norm of the difference between the feature
10410 vectors associated with the correct and predicted label. If this norm is large, then the step
10411 with respect to each feature should be small, and vice versa.
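The update in Equation B.22 can be written compactly in code. The sketch below represents feature vectors as plain Python lists and assumes that the caller has already computed f(x, ŷ) for the highest-scoring incorrect label; the function names and the two-dimensional example are illustrative only.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def passive_aggressive_update(theta, feats_true, feats_pred):
    """Apply the update theta* = theta + (loss / ||delta||^2) * delta."""
    delta = [a - b for a, b in zip(feats_true, feats_pred)]
    loss = max(0.0, 1.0 - dot(theta, delta))       # margin loss at the current weights
    norm_sq = dot(delta, delta)
    if loss == 0.0 or norm_sq == 0.0:
        return list(theta)                          # margin already satisfied: no change
    learning_rate = loss / norm_sq                  # the optimal Lagrange multiplier
    return [t + learning_rate * d for t, d in zip(theta, delta)]


# Hypothetical example: two features, weights initialized to zero.
theta_old = [0.0, 0.0]
feats_true = [1.0, 0.0]     # f(x, y) for the correct label
feats_pred = [0.0, 1.0]     # f(x, y_hat) for the best-scoring incorrect label
theta_new = passive_aggressive_update(theta_old, feats_true, feats_pred)
new_margin = dot(theta_new, [a - b for a, b in zip(feats_true, feats_pred)])
```

After the update, the margin on this instance is exactly one, so the margin loss is zero: the step is just large enough to satisfy the constraint and no larger.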
Abadi, M., A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis,
J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia,
R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore,
D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A.
Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg,
M. Wicke, Y. Yu, and X. Zheng (2016). TensorFlow: Large-scale machine learning
on heterogeneous distributed systems. CoRR abs/1603.04467.
10420 Abend, O. and A. Rappoport (2017). The state of the art in semantic representation. In
10421 Proceedings of the Association for Computational Linguistics (ACL).
10422 Abney, S., R. E. Schapire, and Y. Singer (1999). Boosting applied to tagging and PP attach-
10423 ment. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp.
10424 132–134.
10425 Abney, S. P. (1987). The English noun phrase in its sentential aspect. Ph. D. thesis, Mas-
10426 sachusetts Institute of Technology.
10427 Abney, S. P. and M. Johnson (1991). Memory requirements and local ambiguities of pars-
10428 ing strategies. Journal of Psycholinguistic Research 20(3), 233–250.
Adafre, S. F. and M. De Rijke (2006). Finding similar sentences across multiple languages
in Wikipedia. In Proceedings of the Workshop on NEW TEXT Wikis and blogs and other
dynamic text sources.
10432 Ahn, D. (2006). The stages of event extraction. In Proceedings of the Workshop on Annotating
10433 and Reasoning about Time and Events, pp. 1–8. Association for Computational Linguistics.
10434 Aho, A. V., M. S. Lam, R. Sethi, and J. D. Ullman (2006). Compilers: Principles, techniques,
10435 & tools.
10437 Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
10438 Automatic Control 19(6), 716–723.
10442 Allauzen, C., M. Riley, J. Schalkwyk, W. Skut, and M. Mohri (2007). OpenFst: A gen-
10443 eral and efficient weighted finite-state transducer library. In International Conference on
10444 Implementation and Application of Automata, pp. 11–23. Springer.
10445 Allen, J. F. (1984). Towards a general theory of action and time. Artificial intelligence 23(2),
10446 123–154.
10447 Allen, J. F., B. W. Miller, E. K. Ringger, and T. Sikorski (1996). A robust system for natural
10448 spoken dialogue. In Proceedings of the Association for Computational Linguistics (ACL), pp.
10449 62–70.
10454 Alm, C. O., D. Roth, and R. Sproat (2005). Emotions from text: machine learning for
10455 text-based emotion prediction. In Proceedings of Empirical Methods for Natural Language
10456 Processing (EMNLP), pp. 579–586.
Aluísio, S., J. Pelizzoni, A. Marchi, L. de Oliveira, R. Manenti, and V. Marquiafável (2003).
An account of the challenge of tagging a reference corpus for Brazilian Portuguese.
Computational Processing of the Portuguese Language, 194–194.
10460 Anand, P., M. Walker, R. Abbott, J. E. Fox Tree, R. Bowmani, and M. Minor (2011). Cats rule
10461 and dogs drool!: Classifying stance in online debate. In Proceedings of the 2nd Workshop
10462 on Computational Approaches to Subjectivity and Sentiment Analysis, Portland, Oregon, pp.
10463 1–9. Association for Computational Linguistics.
10464 Anandkumar, A. and R. Ge (2016). Efficient approaches for escaping higher order saddle
10465 points in non-convex optimization. In Proceedings of the Conference On Learning Theory
10466 (COLT), pp. 81–102.
10467 Anandkumar, A., R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky (2014). Tensor decompo-
10468 sitions for learning latent variable models. The Journal of Machine Learning Research 15(1),
10469 2773–2832.
10470 Ando, R. K. and T. Zhang (2005). A framework for learning predictive structures from
10471 multiple tasks and unlabeled data. The Journal of Machine Learning Research 6, 1817–
10472 1853.
10473 Andor, D., C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and
10474 M. Collins (2016). Globally normalized transition-based neural networks. In Proceedings
10475 of the Association for Computational Linguistics (ACL), pp. 2442–2452.
10476 Angeli, G., P. Liang, and D. Klein (2010). A simple domain-independent probabilistic ap-
10477 proach to generation. In Proceedings of Empirical Methods for Natural Language Processing
10478 (EMNLP), pp. 502–512.
Antol, S., A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh
(2015). VQA: Visual question answering. In Proceedings of the International Conference on
Computer Vision (ICCV), pp. 2425–2433.
10483 Arora, S. and B. Barak (2009). Computational complexity: a modern approach. Cambridge
10484 University Press.
10485 Arora, S., R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu (2013).
10486 A practical algorithm for topic modeling with provable guarantees. In Proceedings of the
10487 International Conference on Machine Learning (ICML), pp. 280–288.
10488 Arora, S., Y. Li, Y. Liang, T. Ma, and A. Risteski (2016). Linear algebraic structure of word
10489 senses, with applications to polysemy. arXiv preprint arXiv:1601.03764.
10490 Artstein, R. and M. Poesio (2008). Inter-coder agreement for computational linguistics.
10491 Computational Linguistics 34(4), 555–596.
10492 Artzi, Y. and L. Zettlemoyer (2013). Weakly supervised learning of semantic parsers for
10493 mapping instructions to actions. Transactions of the Association for Computational Linguis-
10494 tics 1, 49–62.
10497 Auer, P. (2013). Code-switching in conversation: Language, interaction and identity. Routledge.
Auer, S., C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives (2007). DBpedia: A
nucleus for a web of open data. The semantic web, 722–735.
10500 Austin, J. L. (1962). How to do things with words. Oxford University Press.
10501 Aw, A., M. Zhang, J. Xiao, and J. Su (2006). A phrase-based statistical model for SMS text
10502 normalization. In Proceedings of the Association for Computational Linguistics (ACL), pp.
10503 33–40.
10504 Ba, J. L., J. R. Kiros, and G. E. Hinton (2016). Layer normalization. arXiv preprint
10505 arXiv:1607.06450.
10506 Bagga, A. and B. Baldwin (1998a). Algorithms for scoring coreference chains. In Proceed-
10507 ings of the Language Resources and Evaluation Conference, pp. 563–566.
10508 Bagga, A. and B. Baldwin (1998b). Entity-based cross-document coreferencing using the
10509 vector space model. In Proceedings of the International Conference on Computational Lin-
10510 guistics (COLING), pp. 79–85.
10511 Bahdanau, D., K. Cho, and Y. Bengio (2014). Neural machine translation by jointly learn-
10512 ing to align and translate. In Neural Information Processing Systems (NIPS).
10513 Baldwin, T. and S. N. Kim (2010). Multiword expressions. In Handbook of natural language
10514 processing, Volume 2, pp. 267–292. Boca Raton, USA: CRC Press.
10515 Balle, B., A. Quattoni, and X. Carreras (2011). A spectral learning algorithm for finite state
10516 transducers. In Proceedings of the European Conference on Machine Learning and Principles
10517 and Practice of Knowledge Discovery in Databases (ECML), pp. 156–171.
10523 Banko, M., M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni (2007). Open
10524 information extraction from the web. In Proceedings of the International Joint Conference
10525 on Artificial Intelligence (IJCAI), pp. 2670–2676.
10526 Bansal, N., A. Blum, and S. Chawla (2004). Correlation clustering. Machine Learning 56(1-
10527 3), 89–113.
10528 Barber, D. (2012). Bayesian reasoning and machine learning. Cambridge University Press.
10529 Barman, U., A. Das, J. Wagner, and J. Foster (2014, October). Code mixing: A challenge for
10530 language identification in the language of social media. In Proceedings of the First Work-
10531 shop on Computational Approaches to Code Switching, Doha, Qatar, pp. 13–23. Association
10532 for Computational Linguistics.
10533 Barnickel, T., J. Weston, R. Collobert, H.-W. Mewes, and V. Stümpflen (2009). Large scale
10534 application of neural network based semantic role labeling for automated relation ex-
10535 traction from biomedical texts. PLoS One 4(7), e6393.
10536 Baron, A. and P. Rayson (2008). Vard2: A tool for dealing with spelling variation in his-
10537 torical corpora. In Postgraduate conference in corpus linguistics.
10538 Baroni, M., R. Bernardi, and R. Zamparelli (2014). Frege in space: A program for compo-
10539 sitional distributional semantics. Linguistic Issues in Language Technologies.
Barzilay, R. and M. Lapata (2008). Modeling local coherence: An entity-based approach.
Computational Linguistics 34(1), 1–34.
10542 Barzilay, R. and K. R. McKeown (2005). Sentence fusion for multidocument news summa-
10543 rization. Computational Linguistics 31(3), 297–328.
10544 Beesley, K. R. and L. Karttunen (2003). Finite-state morphology. Stanford, CA: Center for
10545 the Study of Language and Information.
10546 Bejan, C. A. and S. Harabagiu (2014). Unsupervised event coreference resolution. Compu-
10547 tational Linguistics 40(2), 311–347.
10548 Bell, E. T. (1934). Exponential numbers. The American Mathematical Monthly 41(7), 411–419.
Bender, E. M. (2013). Linguistic Fundamentals for Natural Language Processing: 100
Essentials from Morphology and Syntax, Volume 6 of Synthesis Lectures on Human Language
Technologies. Morgan & Claypool Publishers.
10552 Bengio, S., O. Vinyals, N. Jaitly, and N. Shazeer (2015). Scheduled sampling for sequence
10553 prediction with recurrent neural networks. In Neural Information Processing Systems
10554 (NIPS), pp. 1171–1179.
10555 Bengio, Y., R. Ducharme, P. Vincent, and C. Janvin (2003). A neural probabilistic language
10556 model. The Journal of Machine Learning Research 3, 1137–1155.
10557 Bengio, Y., P. Simard, and P. Frasconi (1994). Learning long-term dependencies with gra-
10558 dient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166.
10559 Bengtson, E. and D. Roth (2008). Understanding the value of features for coreference
10560 resolution. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP),
10561 pp. 294–303.
10562 Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and
10563 powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B
10564 (Methodological), 289–300.
10565 Berant, J., A. Chou, R. Frostig, and P. Liang (2013). Semantic parsing on freebase from
10566 question-answer pairs. In Proceedings of Empirical Methods for Natural Language Process-
10567 ing (EMNLP), pp. 1533–1544.
10568 Berant, J., V. Srikumar, P.-C. Chen, A. Vander Linden, B. Harding, B. Huang, P. Clark, and
10569 C. D. Manning (2014). Modeling biological processes for reading comprehension. In
10570 Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
10571 Berg-Kirkpatrick, T., A. Bouchard-Côté, J. DeNero, and D. Klein (2010). Painless unsuper-
10572 vised learning with features. In Proceedings of the North American Chapter of the Associa-
10573 tion for Computational Linguistics (NAACL), pp. 582–590.
10574 Berg-Kirkpatrick, T., D. Burkett, and D. Klein (2012). An empirical investigation of sta-
10575 tistical significance in NLP. In Proceedings of Empirical Methods for Natural Language
10576 Processing (EMNLP), pp. 995–1005.
10577 Berger, A. L., V. J. D. Pietra, and S. A. D. Pietra (1996). A maximum entropy approach to
10578 natural language processing. Computational linguistics 22(1), 39–71.
10579 Bergsma, S., D. Lin, and R. Goebel (2008). Distributional identification of non-referential
10580 pronouns. In Proceedings of the Association for Computational Linguistics (ACL), pp. 10–18.
10581 Bernardi, R., R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Mus-
10582 cat, and B. Plank (2016). Automatic description generation from images: A survey of
10583 models, datasets, and evaluation measures. Journal of Artificial Intelligence Research 55,
10584 409–442.
10585 Bertsekas, D. P. (2012). Incremental gradient, subgradient, and proximal methods for
10586 convex optimization: A survey. See Sra et al. (2012).
10587 Bhatia, P., R. Guthrie, and J. Eisenstein (2016). Morphological priors for probabilistic neu-
10588 ral word embeddings. In Proceedings of Empirical Methods for Natural Language Processing
10589 (EMNLP).
10590 Bhatia, P., Y. Ji, and J. Eisenstein (2015). Better document-level sentiment analysis from
10591 RST discourse parsing. In Proceedings of Empirical Methods for Natural Language Processing
10592 (EMNLP).
10593 Biber, D. (1991). Variation across speech and writing. Cambridge University Press.
10594 Bird, S., E. Klein, and E. Loper (2009). Natural language processing with Python. California:
10595 O’Reilly Media.
10597 Björkelund, A. and P. Nugues (2011). Exploring lexicalized features for coreference reso-
10598 lution. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 45–50.
10599 Blackburn, P. and J. Bos (2005). Representation and inference for natural language: A first
10600 course in computational semantics. CSLI.
10601 Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM 55(4), 77–84.
10602 Blei, D. M. (2014). Build, compute, critique, repeat: Data analysis with latent variable
10603 models. Annual Review of Statistics and Its Application 1, 203–232.
10604 Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent Dirichlet allocation. Journal of
10605 Machine Learning Research 3, 993–1022.
10606 Blitzer, J., M. Dredze, and F. Pereira (2007). Biographies, Bollywood, boom-boxes and
10607 blenders: Domain adaptation for sentiment classification. In Proceedings of the Associa-
10608 tion for Computational Linguistics (ACL), pp. 440–447.
10609 Blum, A. and T. Mitchell (1998). Combining labeled and unlabeled data with co-training.
10610 In Proceedings of the Conference On Learning Theory (COLT), pp. 92–100.
10613 Bohnet, B. (2010). Very high accuracy and fast dependency parsing is not a contradiction.
10614 In Proceedings of the International Conference on Computational Linguistics (COLING), pp.
10615 89–97.
10616 Boitet, C. (1988). Pros and cons of the pivot and transfer approaches in multilingual ma-
10617 chine translation. Readings in machine translation, 273–279.
10618 Bojanowski, P., E. Grave, A. Joulin, and T. Mikolov (2017). Enriching word vectors with
10619 subword information. Transactions of the Association for Computational Linguistics 5, 135–
10620 146.
10621 Bollacker, K., C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008). Freebase: a collabora-
10622 tively created graph database for structuring human knowledge. In Proceedings of the
10623 ACM International Conference on Management of Data (SIGMOD), pp. 1247–1250. ACM.
10624 Bolukbasi, T., K.-W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai (2016). Man is to
10625 computer programmer as woman is to homemaker? Debiasing word embeddings. In
10626 Neural Information Processing Systems (NIPS), pp. 4349–4357.
10627 Bordes, A., N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013). Translating
10628 embeddings for modeling multi-relational data. In Neural Information Processing Systems
10629 (NIPS), pp. 2787–2795.
10630 Bordes, A., J. Weston, R. Collobert, Y. Bengio, et al. (2011). Learning structured embed-
10631 dings of knowledge bases. In Proceedings of the National Conference on Artificial Intelligence
10632 (AAAI), pp. 301–306.
10633 Borges, J. L. (1993). Other Inquisitions 1937–1952. University of Texas Press. Translated by
10634 Ruth L. C. Simms.
10635 Botha, J. A. and P. Blunsom (2014). Compositional morphology for word representations
10636 and language modelling. In Proceedings of the International Conference on Machine Learn-
10637 ing (ICML).
10638 Bottou, L. (2012). Stochastic gradient descent tricks. In Neural networks: Tricks of the trade,
10639 pp. 421–436. Springer.
10640 Bottou, L., F. E. Curtis, and J. Nocedal (2016). Optimization methods for large-scale ma-
10641 chine learning. arXiv preprint arXiv:1606.04838.
10642 Bowman, S. R., L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio (2016). Gen-
10643 erating sentences from a continuous space. In Proceedings of the Conference on Natural
10644 Language Learning (CoNLL), pp. 10–21.
10645 boyd, d. and K. Crawford (2012). Critical questions for big data. Information, Communica-
10646 tion & Society 15(5), 662–679.
10647 Boyd, S. and L. Vandenberghe (2004). Convex Optimization. New York: Cambridge Uni-
10648 versity Press.
10649 Branavan, S., H. Chen, J. Eisenstein, and R. Barzilay (2009). Learning document-level
10650 semantic properties from free-text annotations. Journal of Artificial Intelligence Re-
10651 search 34(2), 569–603.
10652 Branavan, S. R., H. Chen, L. S. Zettlemoyer, and R. Barzilay (2009). Reinforcement learning
10653 for mapping instructions to actions. In Proceedings of the Association for Computational
10654 Linguistics (ACL), pp. 82–90.
10655 Braud, C., O. Lacroix, and A. Søgaard (2017). Does syntax help discourse segmentation?
10656 Not so much. In Proceedings of Empirical Methods for Natural Language Processing
10657 (EMNLP), pp. 2432–2442.
10662 Brown, P. F., P. V. Desouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai (1992). Class-based
10663 n-gram models of natural language. Computational linguistics 18(4), 467–479.
10664 Brown, P. F., V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer (1993). The mathematics
10665 of statistical machine translation: Parameter estimation. Computational linguistics 19(2),
10666 263–311.
10667 Brun, C. and C. Roux (2014). Décomposition des “hash tags” pour l’amélioration de la
10668 classification en polarité des “tweets”. Proceedings of Traitement Automatique des Langues
10669 Naturelles, 473–478.
10670 Bruni, E., N.-K. Tran, and M. Baroni (2014). Multimodal distributional semantics. Journal
10671 of Artificial Intelligence Research 49(2014), 1–47.
10672 Bullinaria, J. A. and J. P. Levy (2007). Extracting semantic representations from word co-
10673 occurrence statistics: A computational study. Behavior research methods 39(3), 510–526.
10674 Bunescu, R. C. and R. J. Mooney (2005). A shortest path dependency kernel for relation
10675 extraction. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP),
10676 pp. 724–731.
10677 Bunescu, R. C. and M. Pasca (2006). Using encyclopedic knowledge for named entity
10678 disambiguation. In Proceedings of the European Chapter of the Association for Computational
10679 Linguistics (EACL), pp. 9–16.
10680 Burstein, J., D. Marcu, and K. Knight (2003). Finding the WRITE stuff: Automatic identi-
10681 fication of discourse structure in student essays. IEEE Intelligent Systems 18(1), 32–39.
10682 Burstein, J., J. Tetreault, and S. Andreyev (2010). Using entity-based features to model
10683 coherence in student essays. In Human language technologies: The 2010 annual conference
10684 of the North American chapter of the Association for Computational Linguistics, pp. 681–684.
10685 Association for Computational Linguistics.
10686 Burstein, J., J. Tetreault, and M. Chodorow (2013). Holistic discourse coherence annotation
10687 for noisy essay writing. Dialogue & Discourse 4(2), 34–52.
10688 Cai, Q. and A. Yates (2013). Large-scale semantic parsing via schema matching and lexicon
10689 extension. In Proceedings of the Association for Computational Linguistics (ACL), pp. 423–
10690 433.
10691 Caliskan, A., J. J. Bryson, and A. Narayanan (2017). Semantics derived automatically from
10692 language corpora contain human-like biases. Science 356(6334), 183–186.
10695 Cappé, O. and E. Moulines (2009). On-line expectation–maximization algorithm for latent
10696 data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(3),
10697 593–613.
10698 Carbonell, J. and J. Goldstein (1998). The use of MMR, diversity-based reranking for
10699 reordering documents and producing summaries. In Proceedings of ACM SIGIR conference
10700 on Research and development in information retrieval, pp. 335–336.
10703 Cardie, C. and K. Wagstaff (1999). Noun phrase coreference as clustering. In Proceedings
10704 of Empirical Methods for Natural Language Processing (EMNLP), pp. 82–89.
10705 Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Com-
10706 putational linguistics 22(2), 249–254.
10707 Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the
10708 multi-everything AMI meeting corpus. Language Resources and Evaluation 41(2), 181–190.
10709 Carlson, L. and D. Marcu (2001). Discourse tagging reference manual. Technical Report
10710 ISI-TR-545, Information Sciences Institute.
10711 Carlson, L., M. E. Okurowski, and D. Marcu (2002). RST discourse treebank. Linguistic
10712 Data Consortium, University of Pennsylvania.
10714 Carreras, X., M. Collins, and T. Koo (2008). TAG, dynamic programming, and the
10715 perceptron for efficient, feature-rich parsing. In Proceedings of the Conference on Natural Language
10716 Learning (CoNLL), pp. 9–16.
10717 Carreras, X. and L. Màrquez (2005). Introduction to the CoNLL-2005 shared task: Semantic
10718 role labeling. In Proceedings of the Ninth Conference on Computational Natural Language
10719 Learning, pp. 152–164. Association for Computational Linguistics.
10720 Carroll, L. (1917). Through the looking glass: And what Alice found there. Chicago: Rand,
10721 McNally.
10722 Chambers, N. and D. Jurafsky (2008). Jointly combining implicit constraints improves
10723 temporal ordering. In Proceedings of Empirical Methods for Natural Language Processing
10724 (EMNLP), pp. 698–706.
10725 Chang, K.-W., A. Krishnamurthy, A. Agarwal, H. Daume III, and J. Langford (2015).
10726 Learning to search better than your teacher. In Proceedings of the International Confer-
10727 ence on Machine Learning (ICML).
10728 Chang, M.-W., L. Ratinov, and D. Roth (2007). Guiding semi-supervision with constraint-
10729 driven learning. In Proceedings of the Association for Computational Linguistics (ACL), pp.
10730 280–287.
10731 Chang, M.-W., L.-A. Ratinov, N. Rizzolo, and D. Roth (2008). Learning and inference with
10732 constraints. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp.
10733 1513–1518.
10737 Charniak, E. (1997). Statistical techniques for natural language parsing. AI magazine 18(4),
10738 33–43.
10739 Charniak, E. and M. Johnson (2005). Coarse-to-fine n-best parsing and maxent discrimi-
10740 native reranking. In Proceedings of the Association for Computational Linguistics (ACL), pp.
10741 173–180.
10742 Chelba, C. and A. Acero (2006). Adaptation of maximum entropy capitalizer: Little data
10743 can help a lot. Computer Speech & Language 20(4), 382–399.
10744 Chelba, C., T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson (2013).
10745 One billion word benchmark for measuring progress in statistical language modeling.
10746 arXiv preprint arXiv:1312.3005.
10747 Chen, D., J. Bolton, and C. D. Manning (2016). A thorough examination of the CNN/Daily
10748 Mail reading comprehension task. In Proceedings of the Association for Computational
10749 Linguistics (ACL).
10750 Chen, D. and C. D. Manning (2014). A fast and accurate dependency parser using neural
10751 networks. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP),
10752 pp. 740–750.
10753 Chen, D. L. and R. J. Mooney (2008). Learning to sportscast: a test of grounded language
10754 acquisition. In Proceedings of the International Conference on Machine Learning (ICML), pp.
10755 128–135.
10756 Chen, H., S. Branavan, R. Barzilay, and D. R. Karger (2009). Content modeling using latent
10757 permutations. Journal of Artificial Intelligence Research 36(1), 129–163.
10758 Chen, M., Z. Xu, K. Weinberger, and F. Sha (2012). Marginalized denoising autoencoders
10759 for domain adaptation. In Proceedings of the International Conference on Machine Learning
10760 (ICML).
10761 Chen, M. X., O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar,
10762 M. Schuster, Z. Chen, Y. Wu, and M. Hughes (2018). The best of both worlds: Combin-
10763 ing recent advances in neural machine translation. In Proceedings of the Association for
10764 Computational Linguistics (ACL).
10765 Chen, S. F. and J. Goodman (1999). An empirical study of smoothing techniques for lan-
10766 guage modeling. Computer Speech & Language 13(4), 359–393.
10767 Chen, T. and C. Guestrin (2016). XGBoost: A scalable tree boosting system. In Proceedings
10768 of Knowledge Discovery and Data Mining (KDD), pp. 785–794.
10769 Chen, X., X. Qiu, C. Zhu, P. Liu, and X. Huang (2015). Long short-term memory neural
10770 networks for Chinese word segmentation. In Proceedings of Empirical Methods for Natural
10771 Language Processing (EMNLP), pp. 1197–1206.
10772 Chen, Y., S. Gilroy, A. Maletti, K. Knight, and J. May (2018). Recurrent neural networks
10773 as weighted language recognizers. In Proceedings of the North American Chapter of the
10774 Association for Computational Linguistics (NAACL).
10778 Cheng, X. and D. Roth (2013). Relational inference for wikification. In Proceedings of
10779 Empirical Methods for Natural Language Processing (EMNLP), pp. 1787–1796.
10782 Chiang, D., J. Graehl, K. Knight, A. Pauls, and S. Ravi (2010). Bayesian inference for
10783 finite-state transducers. In Proceedings of the North American Chapter of the Association for
10784 Computational Linguistics (NAACL), pp. 447–455.
10787 Cho, K., B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and
10788 Y. Bengio (2014). Learning phrase representations using RNN encoder-decoder for
10789 statistical machine translation. In Proceedings of Empirical Methods for Natural Language
10790 Processing (EMNLP).
10791 Chomsky, N. (1957). Syntactic structures. The Hague: Mouton & Co.
10792 Chomsky, N. (1982). Some concepts and consequences of the theory of government and binding,
10793 Volume 6. MIT press.
10794 Choromanska, A., M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun (2015). The loss
10795 surfaces of multilayer networks. In Proceedings of Artificial Intelligence and Statistics (AIS-
10796 TATS), pp. 192–204.
10797 Christensen, J., S. Soderland, O. Etzioni, et al. (2010). Semantic role labeling for open
10798 information extraction. In Proceedings of the Workshop on Formalisms and Methodology for
10799 Learning by Reading, pp. 52–60. Association for Computational Linguistics.
10800 Christodoulopoulos, C., S. Goldwater, and M. Steedman (2010). Two decades of
10801 unsupervised POS induction: How far have we come? In Proceedings of Empirical Methods for
10802 Natural Language Processing (EMNLP), pp. 575–584.
10803 Chu, Y.-J. and T.-H. Liu (1965). On shortest arborescence of a directed graph. Scientia
10804 Sinica 14(10), 1396–1400.
10805 Chung, C. and J. W. Pennebaker (2007). The psychological functions of function words.
10806 In K. Fiedler (Ed.), Social communication, pp. 343–359. New York and Hove: Psychology
10807 Press.
10808 Church, K. (2011). A pendulum swung too far. Linguistic Issues in Language Technology 6(5),
10809 1–27.
10810 Church, K. W. (2000). Empirical estimates of adaptation: the chance of two Noriegas
10811 is closer to $p/2$ than $p^2$. In Proceedings of the International Conference on Computational
10812 Linguistics (COLING), pp. 180–186.
10813 Church, K. W. and P. Hanks (1990). Word association norms, mutual information, and
10814 lexicography. Computational linguistics 16(1), 22–29.
10815 Ciaramita, M. and M. Johnson (2003). Supersense tagging of unknown nouns in WordNet.
10816 In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 168–
10817 175.
10818 Clark, K. and C. D. Manning (2015). Entity-centric coreference resolution with model
10819 stacking. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1405–
10820 1415.
10821 Clark, K. and C. D. Manning (2016). Improving coreference resolution by learning entity-
10822 level distributed representations. In Proceedings of the Association for Computational Lin-
10823 guistics (ACL).
10824 Clark, P. (2015). Elementary school science and math tests as a driver for AI: take the Aristo
10825 challenge! In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp.
10826 4019–4021.
10827 Clarke, J., D. Goldwasser, M.-W. Chang, and D. Roth (2010). Driving semantic parsing
10828 from the world’s response. In Proceedings of the Conference on Natural Language Learning
10829 (CoNLL), pp. 18–27.
10830 Clarke, J. and M. Lapata (2008). Global inference for sentence compression: An integer
10831 linear programming approach. Journal of Artificial Intelligence Research 31, 399–429.
10832 Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and psychologi-
10833 cal measurement 20(1), 37–46.
10834 Cohen, S. (2016). Bayesian analysis in natural language processing. Synthesis Lectures on
10835 Human Language Technologies. San Rafael, CA: Morgan & Claypool Publishers.
10836 Collier, N., C. Nobata, and J.-i. Tsujii (2000). Extracting the names of genes and gene
10837 products with a hidden Markov model. In Proceedings of the International Conference on
10838 Computational Linguistics (COLING), pp. 201–207.
10839 Collins, M. (1997). Three generative, lexicalised models for statistical parsing. In Proceed-
10840 ings of the Association for Computational Linguistics (ACL), pp. 16–23.
10841 Collins, M. (2002). Discriminative training methods for hidden Markov models: theory
10842 and experiments with perceptron algorithms. In Proceedings of Empirical Methods for
10843 Natural Language Processing (EMNLP), pp. 1–8.
10846 Collins, M. and T. Koo (2005). Discriminative reranking for natural language parsing.
10847 Computational Linguistics 31(1), 25–70.
10848 Collins, M. and B. Roark (2004). Incremental parsing with the perceptron algorithm. In
10849 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp.
10850 111. Association for Computational Linguistics.
10851 Collobert, R., K. Kavukcuoglu, and C. Farabet (2011). Torch7: A Matlab-like environment
10852 for machine learning. Technical Report EPFL-CONF-192376, EPFL.
10853 Collobert, R. and J. Weston (2008). A unified architecture for natural language process-
10854 ing: Deep neural networks with multitask learning. In Proceedings of the International
10855 Conference on Machine Learning (ICML), pp. 160–167.
10856 Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011). Nat-
10857 ural language processing (almost) from scratch. Journal of Machine Learning Research 12,
10858 2493–2537.
10859 Colton, S., J. Goodwin, and T. Veale (2012). Full-face poetry generation. In Proceedings of
10860 the International Conference on Computational Creativity, pp. 95–102.
10861 Conneau, A., D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017). Supervised learning
10862 of universal sentence representations from natural language inference data. In Proceed-
10863 ings of Empirical Methods for Natural Language Processing (EMNLP), pp. 681–691.
10864 Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2009). Introduction to algorithms
10865 (third ed.). MIT press.
10866 Cotterell, R., H. Schütze, and J. Eisner (2016). Morphological smoothing and extrapolation
10867 of word embeddings. In Proceedings of the Association for Computational Linguistics (ACL),
10868 pp. 1651–1660.
10869 Coviello, L., Y. Sohn, A. D. Kramer, C. Marlow, M. Franceschetti, N. A. Christakis, and
10870 J. H. Fowler (2014). Detecting emotional contagion in massive social networks. PloS
10871 one 9(3), e90315.
10872 Covington, M. A. (2001). A fundamental algorithm for dependency parsing. In Proceedings
10873 of the 39th annual ACM southeast conference, pp. 95–102.
10874 Crammer, K., O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer (2006, December).
10875 Online passive-aggressive algorithms. The Journal of Machine Learning Research 7, 551–
10876 585.
10877 Crammer, K. and Y. Singer (2001). Pranking with ranking. In Neural Information Processing
10878 Systems (NIPS), pp. 641–647.
10879 Creutz, M. and K. Lagus (2007). Unsupervised models for morpheme segmentation and
10880 morphology learning. ACM Transactions on Speech and Language Processing (TSLP) 4(1),
10881 3.
10882 Cross, J. and L. Huang (2016). Span-based constituency parsing with a structure-label
10883 system and provably optimal dynamic oracles. In Proceedings of Empirical Methods for
10884 Natural Language Processing (EMNLP), pp. 1–11.
10885 Cucerzan, S. (2007). Large-scale named entity disambiguation based on Wikipedia data.
10886 In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
10887 Cui, H., R. Sun, K. Li, M.-Y. Kan, and T.-S. Chua (2005). Question answering passage
10888 retrieval using dependency relations. In Proceedings of the 28th annual international ACM
10889 SIGIR conference on Research and development in information retrieval, pp. 400–407. ACM.
10890 Cui, Y., Z. Chen, S. Wei, S. Wang, T. Liu, and G. Hu (2017). Attention-over-attention neural
10891 networks for reading comprehension. In Proceedings of the Association for Computational
10892 Linguistics (ACL).
10893 Culotta, A. and J. Sorensen (2004). Dependency tree kernels for relation extraction. In
10894 Proceedings of the Association for Computational Linguistics (ACL).
10895 Culotta, A., M. Wick, and A. McCallum (2007). First-order probabilistic models for coref-
10896 erence resolution. In Proceedings of the North American Chapter of the Association for Com-
10897 putational Linguistics (NAACL), pp. 81–88.
10898 Curry, H. B. and R. Feys (1958). Combinatory Logic, Volume I. Amsterdam: North Holland.
10902 Das, D., D. Chen, A. F. Martins, N. Schneider, and N. A. Smith (2014). Frame-semantic
10903 parsing. Computational Linguistics 40(1), 9–56.
10904 Daumé III, H. (2007). Frustratingly easy domain adaptation. In Proceedings of the Associa-
10905 tion for Computational Linguistics (ACL).
10906 Daumé III, H., J. Langford, and D. Marcu (2009). Search-based structured prediction.
10907 Machine learning 75(3), 297–325.
10908 Daumé III, H. and D. Marcu (2005). A large-scale exploration of effective global features
10909 for a joint entity detection and tracking model. In Proceedings of Empirical Methods for
10910 Natural Language Processing (EMNLP), pp. 97–104.
10911 Dauphin, Y. N., R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio (2014). Iden-
10912 tifying and attacking the saddle point problem in high-dimensional non-convex opti-
10913 mization. In Neural Information Processing Systems (NIPS), pp. 2933–2941.
10914 Davidson, D. (1967). The logical form of action sentences. In N. Rescher (Ed.), The Logic of
10915 Decision and Action. Pittsburgh: University of Pittsburgh Press.
10919 De Marneffe, M.-C. and C. D. Manning (2008). The Stanford typed dependencies
10920 representation. In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain
10921 Parser Evaluation, pp. 1–8. Association for Computational Linguistics.
10922 Dean, J. and S. Ghemawat (2008). MapReduce: simplified data processing on large clusters.
10923 Communications of the ACM 51(1), 107–113.
10928 Deisenroth, M. P., A. A. Faisal, and C. S. Ong (2018). Mathematics For Machine Learning.
10929 Cambridge UP.
10930 Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from
10931 incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B
10932 (Methodological), 1–38.
10933 Denis, P. and J. Baldridge (2007). A ranking approach to pronoun resolution. In IJCAI.
10934 Denis, P. and J. Baldridge (2008). Specialized models and ranking for coreference resolu-
10935 tion. In Proceedings of the Conference on Empirical Methods in Natural Language Processing,
10936 EMNLP ’08, Stroudsburg, PA, USA, pp. 660–669. Association for Computational Lin-
10937 guistics.
10938 Denis, P. and J. Baldridge (2009). Global joint models for coreference resolution and named
10939 entity classification. Procesamiento del Lenguaje Natural 42.
10940 Derrida, J. (1985). Des tours de Babel. In J. Graham (Ed.), Difference in translation. Ithaca,
10941 NY: Cornell University Press.
10942 Dhingra, B., H. Liu, Z. Yang, W. W. Cohen, and R. Salakhutdinov (2017). Gated-attention
10943 readers for text comprehension. In Proceedings of the Association for Computational Lin-
10944 guistics (ACL).
10945 Diaconis, P. and B. Skyrms (2017). Ten Great Ideas About Chance. Princeton University
10946 Press.
10947 Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classifica-
10948 tion learning algorithms. Neural computation 10(7), 1895–1923.
10949 Dietterich, T. G., R. H. Lathrop, and T. Lozano-Pérez (1997). Solving the multiple instance
10950 problem with axis-parallel rectangles. Artificial intelligence 89(1), 31–71.
10951 Dimitrova, L., N. Ide, V. Petkevic, T. Erjavec, H. J. Kaalep, and D. Tufis (1998). Multext-East:
10952 Parallel and comparable corpora and lexicons for six Central and Eastern European
10953 languages. In Proceedings of the 17th international conference on Computational linguistics-
10954 Volume 1, pp. 315–319. Association for Computational Linguistics.
10959 dos Santos, C., B. Xiang, and B. Zhou (2015). Classifying relations by ranking with con-
10960 volutional neural networks. In Proceedings of the Association for Computational Linguistics
10961 (ACL), pp. 626–634.
10962 Dowty, D. (1991). Thematic proto-roles and argument selection. Language, 547–619.
10963 Dredze, M., P. McNamee, D. Rao, A. Gerber, and T. Finin (2010). Entity disambiguation
10964 for knowledge base population. In Proceedings of the 23rd International Conference on
10965 Computational Linguistics, pp. 277–285. Association for Computational Linguistics.
10966 Dredze, M., M. J. Paul, S. Bergsma, and H. Tran (2013). Carmen: A Twitter geolocation
10967 system with applications to public health. In AAAI workshop on expanding the boundaries
10968 of health informatics using AI (HIAI), pp. 20–24.
10969 Dreyfus, H. L. (1992). What computers still can’t do: a critique of artificial reason. MIT press.
10970 Du, L., W. Buntine, and M. Johnson (2013). Topic segmentation with a structured topic
10971 model. In Proceedings of the North American Chapter of the Association for Computational
10972 Linguistics (NAACL), pp. 190–200.
10973 Duchi, J., E. Hazan, and Y. Singer (2011). Adaptive subgradient methods for online learn-
10974 ing and stochastic optimization. The Journal of Machine Learning Research 12, 2121–2159.
10975 Dunietz, J., L. Levin, and J. Carbonell (2017). The BECauSE corpus 2.0: Annotating causality
10976 and overlapping relations. In Proceedings of the Linguistic Annotation Workshop.
10980 Durrett, G. and D. Klein (2013). Easy victories and uphill battles in coreference resolution.
10981 In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
10982 Durrett, G. and D. Klein (2015). Neural CRF parsing. In Proceedings of the Association for
10983 Computational Linguistics (ACL).
10984 Dyer, C., M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith (2015). Transition-based
10985 dependency parsing with stack long short-term memory. In Proceedings of the Association
10986 for Computational Linguistics (ACL), pp. 334–343.
10987 Dyer, C., A. Kuncoro, M. Ballesteros, and N. A. Smith (2016). Recurrent neural network
10988 grammars. In Proceedings of the North American Chapter of the Association for Computational
10989 Linguistics (NAACL), pp. 199–209.
10990 Edmonds, J. (1967). Optimum branchings. Journal of Research of the National Bureau of
10991 Standards B 71(4), 233–240.
10994 Eisenstein, J. (2009). Hierarchical text segmentation from multi-scale lexical cohesion. In
10995 Proceedings of the North American Chapter of the Association for Computational Linguistics
10996 (NAACL).
10997 Eisenstein, J. and R. Barzilay (2008). Bayesian unsupervised topic segmentation. In Pro-
10998 ceedings of Empirical Methods for Natural Language Processing (EMNLP).
10999 Eisner, J. (1997). State-of-the-art algorithms for minimum spanning trees: A tutorial dis-
11000 cussion.
11001 Eisner, J. (2000). Bilexical grammars and their cubic-time parsing algorithms. In Advances
11002 in probabilistic and other parsing technologies, pp. 29–61. Springer.
11003 Eisner, J. (2002). Parameter estimation for probabilistic finite-state transducers. In Proceed-
11004 ings of the Association for Computational Linguistics (ACL), pp. 1–8.
11005 Eisner, J. (2016). Inside-outside and forward-backward algorithms are just backprop. In
11006 Proceedings of the Workshop on Structured Prediction for NLP, pp. 1–17.
11007 Eisner, J. M. (1996). Three new probabilistic models for dependency parsing: An explo-
11008 ration. In Proceedings of the International Conference on Computational Linguistics (COL-
11009 ING), pp. 340–345.
11010 Ekman, P. (1992). Are there basic emotions? Psychological Review 99(3), 550–553.
11011 Elman, J. L. (1990). Finding structure in time. Cognitive science 14(2), 179–211.
11015 Elsner, M. and E. Charniak (2010). Disentangling chat. Computational Linguistics 36(3),
11016 389–409.
11017 Esuli, A. and F. Sebastiani (2006). SentiWordNet: A publicly available lexical resource for
11018 opinion mining. In LREC, Volume 6, pp. 417–422. Citeseer.
11019 Etzioni, O., A. Fader, J. Christensen, S. Soderland, and M. Mausam (2011). Open informa-
11020 tion extraction: The second generation. In Proceedings of the International Joint Conference
11021 on Artificial Intelligence (IJCAI), pp. 3–10.
11022 Faruqui, M., J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith (2015). Retrofitting
11023 word vectors to semantic lexicons. In Proceedings of the North American Chapter of the
11024 Association for Computational Linguistics (NAACL).
11025 Faruqui, M. and C. Dyer (2014). Improving vector space word representations using mul-
11026 tilingual correlation. In Proceedings of the European Chapter of the Association for Computa-
11027 tional Linguistics (EACL), pp. 462–471.
11028 Faruqui, M., R. McDonald, and R. Soricut (2016). Morpho-syntactic lexicon generation
11029 using graph-based semi-supervised learning. Transactions of the Association for Computa-
11030 tional Linguistics 4, 1–16.
11031 Faruqui, M., Y. Tsvetkov, P. Rastogi, and C. Dyer (2016, August). Problems with evaluation
11032 of word embeddings using word similarity tasks. In Proceedings of the 1st Workshop on
11033 Evaluating Vector-Space Representations for NLP, Berlin, Germany, pp. 30–35. Association
11034 for Computational Linguistics.
11036 Feng, V. W., Z. Lin, and G. Hirst (2014). The impact of deep hierarchical discourse struc-
11037 tures in the evaluation of text coherence. In Proceedings of the International Conference on
11038 Computational Linguistics (COLING), pp. 940–949.
11039 Feng, X., L. Huang, D. Tang, H. Ji, B. Qin, and T. Liu (2016). A language-independent
11040 neural network for event detection. In Proceedings of the Association for Computational
11041 Linguistics (ACL), pp. 66–71.
11042 Fernandes, E. R., C. N. dos Santos, and R. L. Milidiú (2014). Latent trees for coreference
11043 resolution. Computational Linguistics.
11047 Ficler, J. and Y. Goldberg (2017, September). Controlling linguistic style aspects in neural
11048 language generation. In Proceedings of the Workshop on Stylistic Variation, Copenhagen,
11049 Denmark, pp. 94–104. Association for Computational Linguistics.
11050 Filippova, K. and M. Strube (2008). Sentence fusion via dependency graph compression.
11051 In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 177–
11052 185.
11053 Fillmore, C. J. (1968). The case for case. In E. Bach and R. Harms (Eds.), Universals in
11054 linguistic theory. Holt, Rinehart, and Winston.
11055 Fillmore, C. J. (1976). Frame semantics and the nature of language. Annals of the New York
11056 Academy of Sciences 280(1), 20–32.
11057 Fillmore, C. J. and C. Baker (2009). A frames approach to semantic analysis. In The Oxford
11058 Handbook of Linguistic Analysis. Oxford University Press.
Finkel, J. R., T. Grenager, and C. Manning (2005). Incorporating non-local information
into information extraction systems by Gibbs sampling. In Proceedings of the Association
for Computational Linguistics (ACL), pp. 363–370.
11062 Finkel, J. R., T. Grenager, and C. D. Manning (2007). The infinite tree. In Proceedings of the
11063 Association for Computational Linguistics (ACL), pp. 272–279.
11064 Finkel, J. R., A. Kleeman, and C. D. Manning (2008). Efficient, feature-based, conditional
11065 random field parsing. In Proceedings of the Association for Computational Linguistics (ACL),
11066 pp. 959–967.
Finkel, J. R. and C. Manning (2009). Hierarchical Bayesian domain adaptation. In
Proceedings of the North American Chapter of the Association for Computational Linguistics
(NAACL), pp. 602–610.
11074 Finkelstein, L., E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin
11075 (2002). Placing search in context: The concept revisited. ACM Transactions on Information
11076 Systems 20(1), 116–131.
11078 Flanigan, J., S. Thomson, J. Carbonell, C. Dyer, and N. A. Smith (2014). A discrimina-
11079 tive graph-based parser for the abstract meaning representation. In Proceedings of the
11080 Association for Computational Linguistics (ACL), pp. 1426–1436.
11081 Foltz, P. W., W. Kintsch, and T. K. Landauer (1998). The measurement of textual coherence
11082 with latent semantic analysis. Discourse processes 25(2-3), 285–307.
Fordyce, C. S. (2007). Overview of the IWSLT 2007 evaluation campaign. In International
Workshop on Spoken Language Translation (IWSLT) 2007.
Fox, H. (2002). Phrasal cohesion and statistical machine translation. In Proceedings of
Empirical Methods for Natural Language Processing (EMNLP), pp. 304–311.
11087 Francis, W. and H. Kucera (1982). Frequency analysis of English usage. Houghton Mifflin
11088 Company.
Francis, W. N. (1964). A standard sample of present-day English for use with digital
computers. Report to the U.S. Office of Education on Cooperative Research Project No.
E-007.
11092 Freund, Y. and R. E. Schapire (1999). Large margin classification using the perceptron
11093 algorithm. Machine learning 37(3), 277–296.
11094 Fromkin, V., R. Rodman, and N. Hyams (2013). An introduction to language. Cengage
11095 Learning.
11096 Fundel, K., R. Küffner, and R. Zimmer (2007). Relex – relation extraction using depen-
11097 dency parse trees. Bioinformatics 23(3), 365–371.
11098 Gabow, H. N., Z. Galil, T. Spencer, and R. E. Tarjan (1986). Efficient algorithms for finding
11099 minimum spanning trees in undirected and directed graphs. Combinatorica 6(2), 109–
11100 122.
11104 Gage, P. (1994). A new algorithm for data compression. The C Users Journal 12(2), 23–38.
11105 Gale, W. A., K. W. Church, and D. Yarowsky (1992). One sense per discourse. In Pro-
11106 ceedings of the workshop on Speech and Natural Language, pp. 233–237. Association for
11107 Computational Linguistics.
11108 Galley, M., M. Hopkins, K. Knight, and D. Marcu (2004). What’s in a translation rule? In
11109 Proceedings of the North American Chapter of the Association for Computational Linguistics
11110 (NAACL), pp. 273–280.
11111 Galley, M., K. R. McKeown, E. Fosler-Lussier, and H. Jing (2003). Discourse segmentation
11112 of multi-party conversation. In Proceedings of the Association for Computational Linguistics
11113 (ACL).
11114 Ganchev, K. and M. Dredze (2008). Small statistical models by random feature mixing. In
11115 Proceedings of the ACL08 HLT Workshop on Mobile Language Processing, pp. 19–20.
11116 Ganchev, K., J. Graça, J. Gillenwater, and B. Taskar (2010). Posterior regularization for
11117 structured latent variable models. The Journal of Machine Learning Research 11, 2001–
11118 2049.
11122 Gao, J., G. Andrew, M. Johnson, and K. Toutanova (2007). A comparative study of param-
11123 eter estimation methods for statistical natural language processing. In Proceedings of the
11124 Association for Computational Linguistics (ACL), pp. 824–831.
11125 Gatt, A. and E. Krahmer (2018). Survey of the state of the art in natural language genera-
11126 tion: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61,
11127 65–170.
Gatt, A. and E. Reiter (2009). SimpleNLG: A realisation engine for practical applications.
In Proceedings of the 12th European Workshop on Natural Language Generation, pp. 90–93.
Association for Computational Linguistics.
Ge, D., X. Jiang, and Y. Ye (2011). A note on the complexity of L_p minimization.
Mathematical Programming 129(2), 285–299.
11133 Ge, N., J. Hale, and E. Charniak (1998). A statistical approach to anaphora resolution. In
11134 Proceedings of the sixth workshop on very large corpora, Volume 71, pp. 76.
11135 Ge, R., F. Huang, C. Jin, and Y. Yuan (2015). Escaping from saddle points — online stochas-
11136 tic gradient for tensor decomposition. In P. Grünwald, E. Hazan, and S. Kale (Eds.),
11137 Proceedings of the Conference On Learning Theory (COLT).
11138 Ge, R. and R. J. Mooney (2005). A statistical semantic parser that integrates syntax and
11139 semantics. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp.
11140 9–16.
11141 Geach, P. T. (1962). Reference and generality: An examination of some medieval and modern
11142 theories. Cornell University Press.
11143 Gildea, D. and D. Jurafsky (2002). Automatic labeling of semantic roles. Computational
11144 linguistics 28(3), 245–288.
11145 Gimpel, K., N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yo-
11146 gatama, J. Flanigan, and N. A. Smith (2011). Part-of-speech tagging for Twitter: an-
11147 notation, features, and experiments. In Proceedings of the Association for Computational
11148 Linguistics (ACL), pp. 42–47.
Glass, J., T. J. Hazen, S. Cyphers, I. Malioutov, D. Huynh, and R. Barzilay (2007). Recent
progress in the MIT spoken lecture processing project. In Eighth Annual Conference of the
International Speech Communication Association.
11152 Glorot, X. and Y. Bengio (2010). Understanding the difficulty of training deep feedforward
11153 neural networks. In Proceedings of Artificial Intelligence and Statistics (AISTATS), pp. 249–
11154 256.
11155 Glorot, X., A. Bordes, and Y. Bengio (2011). Deep sparse rectifier networks. In Proceedings
11156 of the 14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP
11157 Volume, Volume 15, pp. 315–323.
11158 Godfrey, J. J., E. C. Holliman, and J. McDaniel (1992). Switchboard: Telephone speech
11159 corpus for research and development. In Proceedings of the International Conference on
11160 Acoustics, Speech, and Signal Processing (ICASSP), pp. 517–520. IEEE.
11164 Goldberg, Y. (2017b). Neural Network Methods for Natural Language Processing. Synthesis
11165 Lectures on Human Language Technologies. Morgan & Claypool Publishers.
11166 Goldberg, Y. and M. Elhadad (2010). An efficient algorithm for easy-first non-directional
11167 dependency parsing. In Proceedings of the North American Chapter of the Association for
11168 Computational Linguistics (NAACL), pp. 742–750.
11169 Goldberg, Y. and J. Nivre (2012). A dynamic oracle for arc-eager dependency parsing.
11170 In Proceedings of the International Conference on Computational Linguistics (COLING), pp.
11171 959–976.
Goldberg, Y., K. Zhao, and L. Huang (2013). Efficient implementation of beam-search
incremental parsers. In Proceedings of the Association for Computational Linguistics (ACL),
pp. 628–633.
Goldwater, S. and T. Griffiths (2007). A fully Bayesian approach to unsupervised
part-of-speech tagging. In Proceedings of the Association for Computational Linguistics
(ACL), Volume 45.
11176 Gonçalo Oliveira, H. R., F. A. Cardoso, and F. C. Pereira (2007). Tra-la-lyrics: An approach
11177 to generate text based on rhythm. In Proceedings of the 4th. International Joint Workshop
11178 on Computational Creativity. A. Cardoso and G. Wiggins.
11179 Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT Press.
11180 Goodman, J. T. (2001). A bit of progress in language modeling. Computer Speech & Lan-
11181 guage 15(4), 403–434.
11182 Gouws, S., D. Metzler, C. Cai, and E. Hovy (2011). Contextual bearing on linguistic varia-
11183 tion in social media. In LASM.
Goyal, A., H. Daumé III, and S. Venkatasubramanian (2009). Streaming for large scale
NLP: Language modeling. In Proceedings of the North American Chapter of the Association
for Computational Linguistics (NAACL), pp. 512–520.
11187 Graves, A. (2012). Sequence transduction with recurrent neural networks. In Proceedings
11188 of the International Conference on Machine Learning (ICML).
11189 Graves, A. and N. Jaitly (2014). Towards end-to-end speech recognition with recur-
11190 rent neural networks. In Proceedings of the International Conference on Machine Learning
11191 (ICML), pp. 1764–1772.
Graves, A. and J. Schmidhuber (2005). Framewise phoneme classification with bidirectional
LSTM and other neural network architectures. Neural Networks 18(5), 602–610.
11194 Grice, H. P. (1975). Logic and conversation. In P. Cole and J. L. Morgan (Eds.), Syntax and
11195 Semantics Volume 3: Speech Acts, pp. 41–58. Academic Press.
11196 Grishman, R. (2012). Information extraction: Capabilities and challenges. Notes prepared
11197 for the 2012 International Winter School in Language and Speech Technologies, Rovira
11198 i Virgili University, Tarragona, Spain.
11199 Grishman, R. (2015). Information extraction. IEEE Intelligent Systems 30(5), 8–15.
11200 Grishman, R., C. Macleod, and J. Sterling (1992). Evaluating parsing strategies using
11201 standardized parse files. In Proceedings of the third conference on Applied natural language
11202 processing, pp. 156–161. Association for Computational Linguistics.
11203 Grishman, R. and B. Sundheim (1996). Message understanding conference-6: A brief his-
11204 tory. In Proceedings of the International Conference on Computational Linguistics (COLING),
11205 pp. 466–471.
11206 Groenendijk, J. and M. Stokhof (1991). Dynamic predicate logic. Linguistics and philoso-
11207 phy 14(1), 39–100.
Grosz, B. J. (1979). Focusing and description in natural language dialogues. Technical
report, SRI International, Menlo Park, CA.
11210 Grosz, B. J., S. Weinstein, and A. K. Joshi (1995). Centering: A framework for modeling
11211 the local coherence of discourse. Computational linguistics 21(2), 203–225.
11212 Gu, J., Z. Lu, H. Li, and V. O. Li (2016). Incorporating copying mechanism in sequence-to-
11213 sequence learning. In Proceedings of the Association for Computational Linguistics (ACL),
11214 pp. 1631–1640.
11215 Gulcehre, C., S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio (2016). Pointing the unknown
11216 words. In Proceedings of the Association for Computational Linguistics (ACL), pp. 140–149.
11222 Haghighi, A. and D. Klein (2009). Simple coreference resolution with rich syntactic and
11223 semantic features. In Proceedings of Empirical Methods for Natural Language Processing
11224 (EMNLP), pp. 1152–1161.
11228 Hajič, J. and B. Hladká (1998). Tagging inflective languages: Prediction of morphological
11229 categories for a rich, structured tagset. In Proceedings of the Association for Computational
11230 Linguistics (ACL), pp. 483–490.
11232 Hammerton, J. (2003). Named entity recognition with long short-term memory. In Pro-
11233 ceedings of the Conference on Natural Language Learning (CoNLL), pp. 172–175.
11234 Han, X. and L. Sun (2012). An entity-topic model for entity linking. In Proceedings of
11235 Empirical Methods for Natural Language Processing (EMNLP), pp. 105–115.
11236 Han, X., L. Sun, and J. Zhao (2011). Collective entity linking in web text: a graph-based
11237 method. In Proceedings of ACM SIGIR conference on Research and development in informa-
11238 tion retrieval, pp. 765–774.
Hannak, A., E. Anderson, L. F. Barrett, S. Lehmann, A. Mislove, and M. Riedewald (2012).
Tweetin’ in the rain: Exploring societal-scale effects of weather on mood. In Proceedings
of the International Conference on Web and Social Media (ICWSM).
Hardmeier, C. (2012). Discourse in statistical machine translation: A survey and a case
study. Discours. Revue de linguistique, psycholinguistique et informatique. A journal of
linguistics, psycholinguistics and computational linguistics (11).
11246 Hastie, T., R. Tibshirani, and J. Friedman (2009). The elements of statistical learning (Second
11247 ed.). New York: Springer.
11251 Hayes, A. F. and K. Krippendorff (2007). Answering the call for a standard reliability
11252 measure for coding data. Communication methods and measures 1(1), 77–89.
11253 He, H., A. Balakrishnan, M. Eric, and P. Liang (2017). Learning symmetric collaborative
11254 dialogue agents with dynamic knowledge graph embeddings. In Proceedings of the As-
11255 sociation for Computational Linguistics (ACL), pp. 1766–1776.
He, K., X. Zhang, S. Ren, and J. Sun (2015). Delving deep into rectifiers: Surpassing
human-level performance on ImageNet classification. In Proceedings of the International
Conference on Computer Vision (ICCV), pp. 1026–1034.
He, K., X. Zhang, S. Ren, and J. Sun (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778.
11261 He, L., K. Lee, M. Lewis, and L. Zettlemoyer (2017). Deep semantic role labeling: What
11262 works and what’s next. In Proceedings of the Association for Computational Linguistics
11263 (ACL).
11264 He, Z., S. Liu, M. Li, M. Zhou, L. Zhang, and H. Wang (2013). Learning entity repre-
11265 sentation for entity disambiguation. In Proceedings of the Association for Computational
11266 Linguistics (ACL), pp. 30–34.
11267 Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In
11268 Proceedings of the International Conference on Computational Linguistics (COLING), pp. 539–
11269 545. Association for Computational Linguistics.
Hearst, M. A. (1997). TextTiling: Segmenting text into multi-paragraph subtopic passages.
Computational Linguistics 23(1), 33–64.
11272 Heerschop, B., F. Goossen, A. Hogenboom, F. Frasincar, U. Kaymak, and F. de Jong (2011).
11273 Polarity analysis of texts using discourse structure. In Proceedings of the 20th ACM inter-
11274 national conference on Information and knowledge management, pp. 1061–1070. ACM.
11285 Hernault, H., H. Prendinger, D. A. duVerle, and M. Ishizuka (2010). HILDA: A discourse
11286 parser using support vector machine classification. Dialogue and Discourse 1(3), 1–33.
11287 Hill, F., A. Bordes, S. Chopra, and J. Weston (2016). The goldilocks principle: Reading
11288 children’s books with explicit memory representations. In Proceedings of the International
11289 Conference on Learning Representations (ICLR).
11290 Hill, F., K. Cho, and A. Korhonen (2016). Learning distributed representations of sentences
11291 from unlabelled data. In Proceedings of the North American Chapter of the Association for
11292 Computational Linguistics (NAACL).
11293 Hindle, D. and M. Rooth (1993). Structural ambiguity and lexical relations. Computational
11294 linguistics 19(1), 103–120.
11295 Hirao, T., Y. Yoshida, M. Nishino, N. Yasuda, and M. Nagata (2013). Single-document
11296 summarization as a tree knapsack problem. In Proceedings of Empirical Methods for Nat-
11297 ural Language Processing (EMNLP), pp. 1515–1520.
Hirschman, L. and R. Gaizauskas (2001). Natural language question answering: the view
from here. Natural Language Engineering 7(4), 275–300.
11300 Hirschman, L., M. Light, E. Breck, and J. D. Burger (1999). Deep read: A reading compre-
11301 hension system. In Proceedings of the Association for Computational Linguistics (ACL), pp.
11302 325–332.
Hobbs, J. R., D. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, and M. Tyson (1997).
FASTUS: A cascaded finite-state transducer for extracting information from
natural-language text. Finite-state language processing, 383–406.
11307 Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory. Neural computa-
11308 tion 9(8), 1735–1780.
Hockenmaier, J. and M. Steedman (2007). CCGbank: A corpus of CCG derivations and
dependency structures extracted from the Penn Treebank. Computational Linguistics 33(3),
355–396.
11315 Hoffmann, R., C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld (2011). Knowledge-based
11316 weak supervision for information extraction of overlapping relations. In Proceedings of
11317 the Association for Computational Linguistics (ACL), pp. 541–550.
11318 Holmstrom, L. and P. Koistinen (1992). Using additive noise in back-propagation training.
11319 IEEE Transactions on Neural Networks 3(1), 24–38.
11320 Hovy, E. and J. Lavid (2010). Towards a ’science’ of corpus annotation: a new method-
11321 ological challenge for corpus linguistics. International journal of translation 22(1), 13–36.
Hsu, D., S. M. Kakade, and T. Zhang (2012). A spectral algorithm for learning hidden
Markov models. Journal of Computer and System Sciences 78(5), 1460–1480.
11324 Hu, M. and B. Liu (2004). Mining and summarizing customer reviews. In Proceedings of
11325 Knowledge Discovery and Data Mining (KDD), pp. 168–177.
11326 Hu, Z., Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017). Toward controlled
11327 generation of text. In International Conference on Machine Learning, pp. 1587–1596.
11328 Huang, F. and A. Yates (2012). Biased representation learning for domain adaptation. In
11329 Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1313–1323.
11330 Huang, L. and D. Chiang (2007). Forest rescoring: Faster decoding with integrated lan-
11331 guage models. In Proceedings of the Association for Computational Linguistics (ACL), pp.
11332 144–151.
11333 Huang, L., S. Fayong, and Y. Guo (2012). Structured perceptron with inexact search. In
11334 Proceedings of the North American Chapter of the Association for Computational Linguistics
11335 (NAACL), pp. 142–151.
11336 Huang, Y. (2015). Pragmatics (Second ed.). Oxford Textbooks in Linguistics. Oxford Uni-
11337 versity Press.
Huang, Z., W. Xu, and K. Yu (2015). Bidirectional LSTM-CRF models for sequence tagging.
arXiv preprint arXiv:1508.01991.
11342 Humphreys, K., R. Gaizauskas, and S. Azzam (1997). Event coreference for information
11343 extraction. In Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora
11344 Resolution for Unrestricted Texts, pp. 75–81. Association for Computational Linguistics.
11345 Ide, N. and Y. Wilks (2006). Making sense about sense. In Word sense disambiguation, pp.
11346 47–73. Springer.
11347 Ioffe, S. and C. Szegedy (2015). Batch normalization: Accelerating deep network train-
11348 ing by reducing internal covariate shift. In Proceedings of the International Conference on
11349 Machine Learning (ICML), pp. 448–456.
11350 Isozaki, H., T. Hirao, K. Duh, K. Sudoh, and H. Tsukada (2010). Automatic evaluation
11351 of translation quality for distant language pairs. In Proceedings of Empirical Methods for
11352 Natural Language Processing (EMNLP), pp. 944–952.
11353 Iyyer, M., V. Manjunatha, J. Boyd-Graber, and H. Daumé III (2015). Deep unordered com-
11354 position rivals syntactic methods for text classification. In Proceedings of the Association
11355 for Computational Linguistics (ACL), pp. 1681–1691.
11356 James, G., D. Witten, T. Hastie, and R. Tibshirani (2013). An introduction to statistical learn-
11357 ing, Volume 112. Springer.
Janin, A., D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau,
E. Shriberg, A. Stolcke, et al. (2003). The ICSI meeting corpus. In Proceedings of the
International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
Volume 1, pp. I–I. IEEE.
11362 Jean, S., K. Cho, R. Memisevic, and Y. Bengio (2015). On using very large target vocab-
11363 ulary for neural machine translation. In Proceedings of the Association for Computational
11364 Linguistics (ACL), pp. 1–10.
11365 Jeong, M., C.-Y. Lin, and G. G. Lee (2009). Semi-supervised speech act recognition in
11366 emails and forums. In Proceedings of Empirical Methods for Natural Language Processing
11367 (EMNLP), pp. 1250–1259.
11368 Ji, H. and R. Grishman (2011). Knowledge base population: Successful approaches and
11369 challenges. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1148–
11370 1158.
11371 Ji, Y., T. Cohn, L. Kong, C. Dyer, and J. Eisenstein (2015). Document context language
11372 models. In International Conference on Learning Representations, Workshop Track, Volume
11373 abs/1511.03962.
11374 Ji, Y. and J. Eisenstein (2014). Representation learning for text-level discourse parsing. In
11375 Proceedings of the Association for Computational Linguistics (ACL).
11376 Ji, Y. and J. Eisenstein (2015, June). One vector is not enough: Entity-augmented distribu-
11377 tional semantics for discourse relations. Transactions of the Association for Computational
11378 Linguistics (TACL).
11379 Ji, Y., G. Haffari, and J. Eisenstein (2016). A latent variable recurrent neural network for
11380 discourse relation language models. In Proceedings of the North American Chapter of the
11381 Association for Computational Linguistics (NAACL).
11382 Ji, Y. and N. A. Smith (2017). Neural discourse structure for text categorization. In Pro-
11383 ceedings of the Association for Computational Linguistics (ACL), pp. 996–1005.
11384 Ji, Y., C. Tan, S. Martschat, Y. Choi, and N. A. Smith (2017). Dynamic entity representations
11385 in neural language models. In Proceedings of Empirical Methods for Natural Language
11386 Processing (EMNLP), pp. 1831–1840.
11387 Jiang, L., M. Yu, M. Zhou, X. Liu, and T. Zhao (2011). Target-dependent twitter sentiment
11388 classification. In Proceedings of the Association for Computational Linguistics (ACL), pp.
11389 151–160.
11390 Jing, H. (2000). Sentence reduction for automatic text summarization. In Proceedings of
11391 the sixth conference on Applied natural language processing, pp. 310–315. Association for
11392 Computational Linguistics.
11393 Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of
11394 Knowledge Discovery and Data Mining (KDD), pp. 133–142.
11395 Jockers, M. L. (2015). Szuzhet? http:bla.bla.com.
Johnson, A. E., T. J. Pollard, L. Shen, L.-W. H. Lehman, M. Feng, M. Ghassemi, B. Moody,
P. Szolovits, L. A. Celi, and R. G. Mark (2016). MIMIC-III, a freely accessible critical care
database. Scientific Data 3, 160035.
Johnson, M. (1998). PCFG models of linguistic tree representations. Computational
Linguistics 24(4), 613–632.
11401 Johnson, R. and T. Zhang (2017). Deep pyramid convolutional neural networks for text
11402 categorization. In Proceedings of the Association for Computational Linguistics (ACL), pp.
11403 562–570.
11404 Joshi, A. K. (1985). How much context-sensitivity is required to provide reasonable struc-
11405 tural descriptions? – tree adjoining grammars. In Natural Language Processing – Theoret-
11406 ical, Computational and Psychological Perspective. New York, NY: Cambridge University
11407 Press.
11408 Joshi, A. K. and Y. Schabes (1997). Tree-adjoining grammars. In Handbook of formal lan-
11409 guages, pp. 69–123. Springer.
11410 Joshi, A. K., K. V. Shanker, and D. Weir (1991). The convergence of mildly context-sensitive
11411 grammar formalisms. In Foundational Issues in Natural Language Processing. Cambridge
11412 MA: MIT Press.
11413 Jozefowicz, R., O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu (2016). Exploring the limits
11414 of language modeling. arXiv preprint arXiv:1602.02410.
11415 Jozefowicz, R., W. Zaremba, and I. Sutskever (2015). An empirical exploration of recurrent
11416 network architectures. In Proceedings of the International Conference on Machine Learning
11417 (ICML), pp. 2342–2350.
11418 Jurafsky, D. (1996). A probabilistic model of lexical and syntactic access and disambigua-
11419 tion. Cognitive Science 20(2), 137–194.
11420 Jurafsky, D. and J. H. Martin (2009). Speech and Language Processing (Second ed.). Prentice
11421 Hall.
11422 Jurafsky, D. and J. H. Martin (2018). Speech and Language Processing (Third ed.). Prentice
11423 Hall.
11424 Kadlec, R., M. Schmid, O. Bajgar, and J. Kleindienst (2016). Text understanding with
11425 the attention sum reader network. In Proceedings of the Association for Computational
11426 Linguistics (ACL), pp. 908–918.
11427 Kalchbrenner, N. and P. Blunsom (2013, August). Recurrent convolutional neural net-
11428 works for discourse compositionality. In Proceedings of the Workshop on Continuous Vec-
11429 tor Space Models and their Compositionality, Sofia, Bulgaria, pp. 119–126. Association for
11430 Computational Linguistics.
11431 Kalchbrenner, N., E. Grefenstette, and P. Blunsom (2014). A convolutional neural network
11432 for modelling sentences. In Proceedings of the Association for Computational Linguistics
11433 (ACL), pp. 655–665.
11436 Kate, R. J., Y. W. Wong, and R. J. Mooney (2005). Learning to transform natural to formal
11437 languages. In Proceedings of the National Conference on Artificial Intelligence (AAAI).
11438 Kehler, A. (2007). Rethinking the SMASH approach to pronoun interpretation. In Interdis-
11439 ciplinary perspectives on reference processing, New Directions in Cognitive Science Series,
11440 pp. 95–122. Oxford University Press.
11441 Kibble, R. and R. Power (2004). Optimizing referential coherence in text generation. Com-
11442 putational Linguistics 30(4), 401–416.
11443 Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities 31(2),
11444 91–113.
11445 Kilgarriff, A. and G. Grefenstette (2003). Introduction to the special issue on the web as
11446 corpus. Computational linguistics 29(3), 333–347.
Kim, M.-J. (2002). Does Korean have adjectives? MIT Working Papers in Linguistics 43,
71–89.
11449 Kim, S.-M. and E. Hovy (2006, July). Extracting opinions, opinion holders, and topics
11450 expressed in online news media text. In Proceedings of the Workshop on Sentiment and
11451 Subjectivity in Text, Sydney, Australia, pp. 1–8. Association for Computational Linguis-
11452 tics.
11453 Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings
11454 of Empirical Methods for Natural Language Processing (EMNLP), pp. 1746–1751.
11455 Kim, Y., C. Denton, L. Hoang, and A. M. Rush (2017). Structured attention networks. In
11456 Proceedings of the International Conference on Learning Representations (ICLR).
11457 Kim, Y., Y. Jernite, D. Sontag, and A. M. Rush (2016). Character-aware neural language
11458 models. In Proceedings of the National Conference on Artificial Intelligence (AAAI).
11459 Kingma, D. and J. Ba (2014). Adam: A method for stochastic optimization. arXiv preprint
11460 arXiv:1412.6980.
Kiperwasser, E. and Y. Goldberg (2016). Simple and accurate dependency parsing using
bidirectional LSTM feature representations. Transactions of the Association for
Computational Linguistics 4, 313–327.
11466 Kiros, R., R. Salakhutdinov, and R. Zemel (2014). Multimodal neural language models. In
11467 Proceedings of the International Conference on Machine Learning (ICML), pp. 595–603.
Kiros, R., Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler
(2015). Skip-thought vectors. In Neural Information Processing Systems (NIPS).
11470 Klein, D. and C. D. Manning (2003). Accurate unlexicalized parsing. In Proceedings of the
11471 Association for Computational Linguistics (ACL), pp. 423–430.
11472 Klein, D. and C. D. Manning (2004). Corpus-based induction of syntactic structure: Mod-
11473 els of dependency and constituency. In Proceedings of the Association for Computational
11474 Linguistics (ACL).
Klein, G., Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017). OpenNMT: Open-source
toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
11477 Klementiev, A., I. Titov, and B. Bhattarai (2012). Inducing crosslingual distributed repre-
11478 sentations of words. In Proceedings of the International Conference on Computational Lin-
11479 guistics (COLING), pp. 1459–1474.
11480 Klenner, M. (2007). Enforcing consistency on coreference sets. In Recent Advances in Natu-
11481 ral Language Processing (RANLP), pp. 323–328.
11484 Knight, K. and J. Graehl (1998). Machine transliteration. Computational Linguistics 24(4),
11485 599–612.
11486 Knight, K. and D. Marcu (2000). Statistics-based summarization-step one: Sentence com-
11487 pression. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp.
11488 703–710.
11489 Knight, K. and J. May (2009). Applications of weighted automata in natural language
11490 processing. In Handbook of Weighted Automata, pp. 571–596. Springer.
Knott, A. (1996). A data-driven methodology for motivating a set of coherence relations. Ph.D.
thesis, The University of Edinburgh.
11493 Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT
11494 summit, Volume 5, pp. 79–86.
11497 Konstas, I. and M. Lapata (2013). A global model for concept-to-text generation. Journal
11498 of Artificial Intelligence Research 48, 305–346.
Koo, T., X. Carreras, and M. Collins (2008, June). Simple semi-supervised dependency
parsing. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 595–603. Association for
Computational Linguistics.
11502 Koo, T. and M. Collins (2005). Hidden-variable models for discriminative reranking. In
11503 Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 507–514.
11504 Koo, T. and M. Collins (2010). Efficient third-order dependency parsers. In Proceedings of
11505 the Association for Computational Linguistics (ACL).
11506 Koo, T., A. Globerson, X. Carreras, and M. Collins (2007). Structured prediction models
11507 via the matrix-tree theorem. In Proceedings of Empirical Methods for Natural Language
11508 Processing (EMNLP), pp. 141–150.
11509 Kovach, B. and T. Rosenstiel (2014). The elements of journalism: What newspeople should know
11510 and the public should expect. Three Rivers Press.
11511 Krishnamurthy, J. (2016). Probabilistic models for learning a semantic parser lexicon. In
11512 Proceedings of the North American Chapter of the Association for Computational Linguistics
11513 (NAACL), pp. 606–616.
Krizhevsky, A., I. Sutskever, and G. E. Hinton (2012). ImageNet classification with deep
convolutional neural networks. In Neural Information Processing Systems (NIPS), pp.
1097–1105.
11520 Kübler, S., R. McDonald, and J. Nivre (2009). Dependency parsing. Synthesis Lectures on
11521 Human Language Technologies 1(1), 1–127.
11522 Kuhlmann, M. and J. Nivre (2010). Transition-based techniques for non-projective depen-
11523 dency parsing. Northern European Journal of Language Technology (NEJLT) 2(1), 1–19.
11524 Kummerfeld, J. K., T. Berg-Kirkpatrick, and D. Klein (2015). An empirical analysis of op-
11525 timization for max-margin NLP. In Proceedings of Empirical Methods for Natural Language
11526 Processing (EMNLP).
11531 Lafferty, J., A. McCallum, and F. Pereira (2001). Conditional random fields: Probabilistic
11532 models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML).
11533 Lakoff, G. (1973). Hedges: A study in meaning criteria and the logic of fuzzy concepts.
11534 Journal of philosophical logic 2(4), 458–508.
11535 Lample, G., M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016). Neural
11536 architectures for named entity recognition. In Proceedings of the North American Chapter
11537 of the Association for Computational Linguistics (NAACL), pp. 260–270.
11538 Langkilde, I. and K. Knight (1998). Generation that exploits corpus-based statistical
11539 knowledge. In Proceedings of the Association for Computational Linguistics (ACL), pp. 704–
11540 710.
11541 Lapata, M. (2003). Probabilistic text structuring: Experiments with sentence ordering. In
11542 Proceedings of the Association for Computational Linguistics (ACL), pp. 545–552.
11543 Lappin, S. and H. J. Leass (1994). An algorithm for pronominal anaphora resolution.
11544 Computational linguistics 20(4), 535–561.
11545 Lari, K. and S. J. Young (1990). The estimation of stochastic context-free grammars using
11546 the inside-outside algorithm. Computer speech & language 4(1), 35–56.
11547 Lascarides, A. and N. Asher (2007). Segmented discourse representation theory: Dynamic
11548 semantics with discourse structure. In Computing meaning, pp. 87–124. Springer.
11549 Law, E. and L. v. Ahn (2011). Human computation. Synthesis Lectures on Artificial Intelli-
11550 gence and Machine Learning 5(3), 1–121.
11551 Lebret, R., D. Grangier, and M. Auli (2016). Neural text generation from structured data
11552 with application to the biography domain. In Proceedings of Empirical Methods for Natural
11553 Language Processing (EMNLP), pp. 1203–1213.
11554 LeCun, Y. and Y. Bengio (1995). Convolutional networks for images, speech, and time
11555 series. The handbook of brain theory and neural networks 3361.
11556 LeCun, Y., L. Bottou, G. B. Orr, and K.-R. Müller (1998). Efficient backprop. In Neural
11557 networks: Tricks of the trade, pp. 9–50. Springer.
11558 Lee, C. M. and S. S. Narayanan (2005). Toward detecting emotions in spoken dialogs.
11559 IEEE transactions on speech and audio processing 13(2), 293–303.
11560 Lee, H., A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, and D. Jurafsky (2013). De-
11561 terministic coreference resolution based on entity-centric, precision-ranked rules. Com-
11562 putational Linguistics 39(4), 885–916.
11563 Lee, H., Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky (2011).
11564 Stanford’s multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In
11565 Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 28–34. Associa-
11566 tion for Computational Linguistics.
11567 Lee, K., L. He, M. Lewis, and L. Zettlemoyer (2017). End-to-end neural coreference reso-
11568 lution. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
11569 Lenat, D. B., R. V. Guha, K. Pittman, D. Pratt, and M. Shepherd (1990). Cyc: toward
11570 programs with common sense. Communications of the ACM 33(8), 30–49.
11571 Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries:
11572 how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual interna-
11573 tional conference on Systems documentation, pp. 24–26. ACM.
11574 Levesque, H. J., E. Davis, and L. Morgenstern (2011). The Winograd schema challenge.
11575 In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, Volume 46, pp.
11576 47.
11577 Levin, E., R. Pieraccini, and W. Eckert (1998). Using Markov decision process for learning
11578 dialogue strategies. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the
11579 1998 IEEE International Conference on, Volume 1, pp. 201–204. IEEE.
11582 Levy, O., Y. Goldberg, and I. Dagan (2015). Improving distributional similarity with
11583 lessons learned from word embeddings. Transactions of the Association for Computational
11584 Linguistics 3, 211–225.
11586 Lewis, M. and M. Steedman (2013). Combined distributional and logical semantics. Trans-
11587 actions of the Association for Computational Linguistics 1, 179–192.
11588 Lewis II, P. M. and R. E. Stearns (1968). Syntax-directed transduction. Journal of the ACM
11589 (JACM) 15(3), 465–488.
11590 Li, J. and D. Jurafsky (2015). Do multi-sense embeddings improve natural language
11591 understanding? In Proceedings of Empirical Methods for Natural Language Processing
11592 (EMNLP), pp. 1722–1732.
11593 Li, J. and D. Jurafsky (2017). Neural net models of open-domain discourse coherence. In
11594 Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 198–209.
11595 Li, J., R. Li, and E. Hovy (2014). Recursive deep models for discourse parsing. In Proceed-
11596 ings of Empirical Methods for Natural Language Processing (EMNLP).
11597 Li, J., M.-T. Luong, and D. Jurafsky (2015). A hierarchical neural autoencoder for para-
11598 graphs and documents. In Proceedings of Empirical Methods for Natural Language Process-
11599 ing (EMNLP).
11600 Li, J., T. Luong, D. Jurafsky, and E. Hovy (2015). When are tree structures necessary
11601 for deep learning of representations? In Proceedings of Empirical Methods for Natural
11602 Language Processing (EMNLP), pp. 2304–2314.
11603 Li, J., W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao (2016, November). Deep
11604 reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on
11605 Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1192–1202. Associ-
11606 ation for Computational Linguistics.
11607 Li, Q., S. Anzaroot, W.-P. Lin, X. Li, and H. Ji (2011). Joint inference for cross-document
11608 information extraction. In Proceedings of the International Conference on Information and
11609 Knowledge Management (CIKM), pp. 2225–2228.
11610 Li, Q., H. Ji, and L. Huang (2013). Joint event extraction via structured prediction with
11611 global features. In Proceedings of the Association for Computational Linguistics (ACL), pp.
11612 73–82.
11613 Liang, P., A. Bouchard-Côté, D. Klein, and B. Taskar (2006). An end-to-end discrimina-
11614 tive approach to machine translation. In Proceedings of the Association for Computational
11615 Linguistics (ACL), pp. 761–768.
11616 Liang, P., M. Jordan, and D. Klein (2009). Learning semantic correspondences with less
11617 supervision. In Proceedings of the Association for Computational Linguistics (ACL), pp. 91–
11618 99.
11619 Liang, P., M. I. Jordan, and D. Klein (2013). Learning dependency-based compositional
11620 semantics. Computational Linguistics 39(2), 389–446.
11621 Liang, P. and D. Klein (2009). Online EM for unsupervised models. In Proceedings of the
11622 North American Chapter of the Association for Computational Linguistics (NAACL), pp. 611–
11623 619.
11624 Liang, P., S. Petrov, M. I. Jordan, and D. Klein (2007). The infinite PCFG using hierarchical
11625 Dirichlet processes. In Proceedings of Empirical Methods for Natural Language Processing
11626 (EMNLP), pp. 688–697.
11627 Liang, P. and C. Potts (2015). Bringing machine learning and compositional semantics
11628 together. Annual Review of Linguistics 1(1), 355–376.
11630 Lin, D. (1998). Automatic retrieval and clustering of similar words. In Proceedings of the
11631 17th international conference on Computational linguistics-Volume 2, pp. 768–774. Associa-
11632 tion for Computational Linguistics.
11633 Lin, J. and C. Dyer (2010). Data-intensive text processing with MapReduce. Synthesis
11634 Lectures on Human Language Technologies 3(1), 1–177.
11635 Lin, Z., M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017). A
11636 structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
11637 Lin, Z., M.-Y. Kan, and H. T. Ng (2009). Recognizing implicit discourse relations in the
11638 Penn Discourse Treebank. In Proceedings of Empirical Methods for Natural Language
11639 Processing (EMNLP), pp. 343–351.
11640 Lin, Z., H. T. Ng, and M.-Y. Kan (2011). Automatically evaluating text coherence using
11641 discourse relations. In Proceedings of the Association for Computational Linguistics (ACL),
11642 pp. 997–1006.
11643 Lin, Z., H. T. Ng, and M. Y. Kan (2014, nov). A PDTB-styled end-to-end discourse parser.
11644 Natural Language Engineering FirstView, 1–34.
11645 Ling, W., C. Dyer, A. Black, and I. Trancoso (2015). Two/too simple adaptations of
11646 word2vec for syntax problems. In Proceedings of the North American Chapter of the As-
11647 sociation for Computational Linguistics (NAACL).
11648 Ling, W., T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso
11649 (2015). Finding function in form: Compositional character models for open vocabulary
11650 word representation. In Proceedings of Empirical Methods for Natural Language Processing
11651 (EMNLP).
11652 Ling, W., G. Xiang, C. Dyer, A. Black, and I. Trancoso (2013). Microblogs as parallel cor-
11653 pora. In Proceedings of the Association for Computational Linguistics (ACL).
11654 Ling, X., S. Singh, and D. S. Weld (2015). Design challenges for entity linking. Transactions
11655 of the Association for Computational Linguistics 3, 315–328.
11656 Linguistic Data Consortium (2005, July). ACE (automatic content extraction) English an-
11657 notation guidelines for relations. Technical Report Version 5.8.3, Linguistic Data Con-
11658 sortium.
11659 Liu, B. (2015). Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge
11660 University Press.
11661 Liu, D. C. and J. Nocedal (1989). On the limited memory BFGS method for large scale
11662 optimization. Mathematical programming 45(1-3), 503–528.
11663 Liu, Y., Q. Liu, and S. Lin (2006). Tree-to-string alignment template for statistical machine
11664 translation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 609–
11665 616.
11666 Loper, E. and S. Bird (2002). NLTK: The natural language toolkit. In Proceedings of the
11667 ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and
11668 computational linguistics-Volume 1, pp. 63–70. Association for Computational Linguistics.
11669 Louis, A., A. Joshi, and A. Nenkova (2010). Discourse indicators for content selection in
11670 summarization. In Proceedings of the 11th Annual Meeting of the Special Interest Group on
11671 Discourse and Dialogue, pp. 147–156. Association for Computational Linguistics.
11672 Louis, A. and A. Nenkova (2013). What makes writing great? first experiments on article
11673 quality prediction in the science journalism domain. Transactions of the Association for
11674 Computational Linguistics 1, 341–352.
11676 Lowe, R., N. Pow, I. V. Serban, and J. Pineau (2015). The Ubuntu dialogue corpus: A large
11677 dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the
11678 Special Interest Group on Discourse and Dialogue (SIGDIAL).
11681 Luo, X., A. Ittycheriah, H. Jing, N. Kambhatla, and S. Roukos (2004). A
11682 mention-synchronous coreference resolution algorithm based on the Bell tree. In Proceedings
11683 of the Association for Computational Linguistics (ACL).
11684 Luong, M.-T., R. Socher, and C. D. Manning (2013). Better word representations with
11685 recursive neural networks for morphology. In Proceedings of the Conference on Natural Language Learning (CoNLL), pp. 104–113.
11686 Luong, T., H. Pham, and C. D. Manning (2015). Effective approaches to attention-based
11687 neural machine translation. In Proceedings of Empirical Methods for Natural Language
11688 Processing (EMNLP), pp. 1412–1421.
11689 Luong, T., I. Sutskever, Q. Le, O. Vinyals, and W. Zaremba (2015). Addressing the rare
11690 word problem in neural machine translation. In Proceedings of the Association for Compu-
11691 tational Linguistics (ACL), pp. 11–19.
11692 Maas, A. L., A. Y. Hannun, and A. Y. Ng (2013). Rectifier nonlinearities improve neu-
11693 ral network acoustic models. In Proceedings of the International Conference on Machine
11694 Learning (ICML).
11695 Mairesse, F. and M. A. Walker (2011). Controlling user perceptions of linguistic style:
11696 Trainable generation of personality traits. Computational Linguistics 37(3), 455–488.
11697 Mani, I., M. Verhagen, B. Wellner, C. M. Lee, and J. Pustejovsky (2006). Machine learning
11698 of temporal relations. In Proceedings of the Association for Computational Linguistics (ACL),
11699 pp. 753–760.
11700 Mann, W. C. and S. A. Thompson (1988). Rhetorical structure theory: Toward a functional
11701 theory of text organization. Text 8(3), 243–281.
11702 Manning, C. D. (2015). Computational linguistics and deep learning. Computational Lin-
11703 guistics 41(4), 701–707.
11704 Manning, C. D. (2016). Computational linguistics and deep learning. Computational Lin-
11705 guistics 41(4).
11706 Manning, C. D., P. Raghavan, H. Schütze, et al. (2008). Introduction to information retrieval,
11707 Volume 1. Cambridge university press.
11708 Manning, C. D. and H. Schütze (1999). Foundations of Statistical Natural Language Process-
11709 ing. Cambridge, Massachusetts: MIT press.
11710 Marcu, D. (1996). Building up rhetorical structure trees. In Proceedings of the National
11711 Conference on Artificial Intelligence, pp. 1069–1074.
11712 Marcu, D. (1997a). From discourse structures to text summaries. In Proceedings of the
11713 workshop on Intelligent Scalable Text Summarization.
11714 Marcu, D. (1997b). From local to global coherence: A bottom-up approach to text plan-
11715 ning. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 629–635.
11716 Marcus, M. P., M. A. Marcinkiewicz, and B. Santorini (1993). Building a large annotated
11717 corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330.
11720 Márquez, G. G. (1970). One Hundred Years of Solitude. Harper & Row. English translation
11721 by Gregory Rabassa.
11722 Martins, A. F. T., N. A. Smith, and E. P. Xing (2009). Concise integer linear programming
11723 formulations for dependency parsing. In Proceedings of the Association for Computational
11724 Linguistics (ACL), pp. 342–350.
11728 Matsuzaki, T., Y. Miyao, and J. Tsujii (2005). Probabilistic CFG with latent annotations. In
11729 Proceedings of the Association for Computational Linguistics (ACL), pp. 75–82.
11730 Matthiessen, C. and J. A. Bateman (1991). Text generation and systemic-functional linguistics:
11731 experiences from English and Japanese. Pinter Publishers.
11732 McCallum, A. and W. Li (2003). Early results for named entity recognition with condi-
11733 tional random fields, feature induction and web-enhanced lexicons. In Proceedings of
11734 the North American Chapter of the Association for Computational Linguistics (NAACL), pp.
11735 188–191.
11736 McCallum, A. and B. Wellner (2004). Conditional models of identity uncertainty with
11737 application to noun coreference. In Neural Information Processing Systems (NIPS), pp. 905–912.
11738 McDonald, R., K. Crammer, and F. Pereira (2005). Online large-margin training of depen-
11739 dency parsers. In Proceedings of the Association for Computational Linguistics (ACL), pp.
11740 91–98.
11741 McDonald, R., K. Hannan, T. Neylon, M. Wells, and J. Reynar (2007). Structured models
11742 for fine-to-coarse sentiment analysis. In Proceedings of the Association for Computational Linguistics (ACL).
11743 McDonald, R. and F. Pereira (2006). Online learning of approximate dependency parsing
11744 algorithms. In Proceedings of the European Chapter of the Association for Computational
11745 Linguistics (EACL).
11747 McKeown, K., S. Rosenthal, K. Thadani, and C. Moore (2010). Time-efficient creation of
11748 an accurate sentence fusion corpus. In Proceedings of the North American Chapter of the
11749 Association for Computational Linguistics (NAACL), pp. 317–320.
11754 McNamee, P. and H. T. Dang (2009). Overview of the TAC 2009 knowledge base population
11755 track. In Text Analysis Conference (TAC), Volume 17, pp. 111–113.
11756 Medlock, B. and T. Briscoe (2007). Weakly supervised learning for hedge classification in
11757 scientific literature. In Proceedings of the Association for Computational Linguistics (ACL),
11758 pp. 992–999.
11759 Mei, H., M. Bansal, and M. R. Walter (2016). What to talk about and how? Selective
11760 generation using LSTMs with coarse-to-fine alignment. In Proceedings of the North American
11761 Chapter of the Association for Computational Linguistics (NAACL), pp. 720–730.
11762 Merity, S., N. S. Keskar, and R. Socher (2018). Regularizing and optimizing LSTM language
11763 models. In Proceedings of the International Conference on Learning Representations (ICLR).
11764 Merity, S., C. Xiong, J. Bradbury, and R. Socher (2017). Pointer sentinel mixture models.
11765 In Proceedings of the International Conference on Learning Representations (ICLR).
11766 Messud, C. (2014, June). A new ‘l’étranger’. New York Review of Books.
11767 Miao, Y. and P. Blunsom (2016). Language as a latent variable: Discrete generative mod-
11768 els for sentence compression. In Proceedings of Empirical Methods for Natural Language
11769 Processing (EMNLP), pp. 319–328.
11770 Miao, Y., L. Yu, and P. Blunsom (2016). Neural variational inference for text processing. In
11771 Proceedings of the International Conference on Machine Learning (ICML).
11772 Mihalcea, R., T. A. Chklovski, and A. Kilgarriff (2004, July). The Senseval-3 English lexical
11773 sample task. In Proceedings of SENSEVAL-3, Barcelona, Spain, pp. 25–28. Association for
11774 Computational Linguistics.
11775 Mihalcea, R. and D. Radev (2011). Graph-based natural language processing and information
11776 retrieval. Cambridge University Press.
11777 Mikolov, T., K. Chen, G. Corrado, and J. Dean (2013). Efficient estimation of word repre-
11778 sentations in vector space. In Proceedings of International Conference on Learning Represen-
11779 tations.
11780 Mikolov, T., A. Deoras, D. Povey, L. Burget, and J. Černocký (2011). Strategies for
11781 training large scale neural network language models. In Automatic Speech Recognition and
11782 Understanding (ASRU), 2011 IEEE Workshop on, pp. 196–201. IEEE.
11783 Mikolov, T., M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur (2010). Recurrent
11784 neural network based language model. In INTERSPEECH, pp. 1045–1048.
11785 Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed rep-
11786 resentations of words and phrases and their compositionality. In Advances in Neural
11787 Information Processing Systems, pp. 3111–3119.
11788 Mikolov, T., W.-t. Yih, and G. Zweig (2013). Linguistic regularities in continuous space
11789 word representations. In Proceedings of the North American Chapter of the Association for
11790 Computational Linguistics (NAACL), pp. 746–751.
11791 Mikolov, T. and G. Zweig (2012). Context dependent recurrent neural network language model.
11792 In Proceedings of Spoken Language Technology (SLT), pp. 234–239.
11793 Miller, G. A., G. A. Heise, and W. Lichten (1951). The intelligibility of speech as a function
11794 of the context of the test materials. Journal of experimental psychology 41(5), 329.
11795 Miller, M., C. Sathi, D. Wiesenthal, J. Leskovec, and C. Potts (2011). Sentiment flow
11796 through hyperlink networks. In Proceedings of the International Conference on Web and
11797 Social Media (ICWSM).
11798 Miller, S., J. Guinness, and A. Zamanian (2004). Name tagging with word clusters and
11799 discriminative training. In Proceedings of the North American Chapter of the Association for
11800 Computational Linguistics (NAACL), pp. 337–342.
11801 Milne, D. and I. H. Witten (2008). Learning to link with Wikipedia. In Proceedings of the
11802 International Conference on Information and Knowledge Management (CIKM), pp. 509–518.
11803 ACM.
11804 Miltsakaki, E. and K. Kukich (2004). Evaluation of text coherence for electronic essay
11805 scoring systems. Natural Language Engineering 10(1), 25–55.
11806 Minka, T. P. (1999). From hidden Markov models to linear dynamical systems. Tech. Rep.
11807 531, Vision and Modeling Group of Media Lab, MIT.
11808 Minsky, M. (1974). A framework for representing knowledge. Technical Report 306, MIT
11809 AI Laboratory.
11811 Mintz, M., S. Bills, R. Snow, and D. Jurafsky (2009). Distant supervision for relation extrac-
11812 tion without labeled data. In Proceedings of the Association for Computational Linguistics
11813 (ACL), pp. 1003–1011.
11814 Mirza, P., R. Sprugnoli, S. Tonelli, and M. Speranza (2014). Annotating causality in the
11815 TempEval-3 corpus. In Proceedings of the EACL 2014 Workshop on Computational
11816 Approaches to Causality in Language (CAtoCL), pp. 10–19.
11817 Misra, D. K. and Y. Artzi (2016). Neural shift-reduce CCG semantic parsing. In Proceedings
11818 of Empirical Methods for Natural Language Processing (EMNLP).
11821 Miwa, M. and M. Bansal (2016). End-to-end relation extraction using LSTMs on sequences
11822 and tree structures. In Proceedings of the Association for Computational Linguistics (ACL),
11823 pp. 1105–1116.
11824 Mnih, A. and G. Hinton (2007). Three new graphical models for statistical language mod-
11825 elling. In Proceedings of the 24th international conference on Machine learning, ICML ’07,
11826 New York, NY, USA, pp. 641–648. ACM.
11827 Mnih, A. and G. E. Hinton (2008). A scalable hierarchical distributed language model. In
11828 Neural Information Processing Systems (NIPS), pp. 1081–1088.
11829 Mnih, A. and Y. W. Teh (2012). A fast and simple algorithm for training neural probabilis-
11830 tic language models. In Proceedings of the International Conference on Machine Learning
11831 (ICML).
11834 Mohri, M., F. Pereira, and M. Riley (2002). Weighted finite-state transducers in speech
11835 recognition. Computer Speech & Language 16(1), 69–88.
11836 Mohri, M., A. Rostamizadeh, and A. Talwalkar (2012). Foundations of machine learning.
11837 MIT press.
11838 Montague, R. (1973). The proper treatment of quantification in ordinary English. In
11839 Approaches to natural language, pp. 221–242. Springer.
11840 Moore, J. D. and C. L. Paris (1993, dec). Planning text for advisory dialogues: Capturing
11841 intentional and rhetorical information. Computational Linguistics 19(4), 651–694.
11842 Morante, R. and E. Blanco (2012). *SEM 2012 shared task: Resolving the scope and
11843 focus of negation. In Proceedings of the First Joint Conference on Lexical and Computational
11844 Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2:
11845 Proceedings of the Sixth International Workshop on Semantic Evaluation, pp. 265–274. Asso-
11846 ciation for Computational Linguistics.
11847 Morante, R. and W. Daelemans (2009). Learning the scope of hedge cues in biomedical
11848 texts. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language
11849 Processing, pp. 28–36. Association for Computational Linguistics.
11850 Morante, R. and C. Sporleder (2012). Modality and negation: An introduction to the
11851 special issue. Computational linguistics 38(2), 223–260.
11852 Mostafazadeh, N., A. Grealish, N. Chambers, J. Allen, and L. Vanderwende (2016, June).
11853 Caters: Causal and temporal relation scheme for semantic annotation of event struc-
11854 tures. In Proceedings of the Fourth Workshop on Events, San Diego, California, pp. 51–61.
11855 Association for Computational Linguistics.
11856 Mueller, T., H. Schmid, and H. Schütze (2013). Efficient higher-order CRFs for morpholog-
11857 ical tagging. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP),
11858 pp. 322–332.
11859 Müller, C. and M. Strube (2006). Multi-level annotation of linguistic data with MMAX2.
11860 Corpus technology and language pedagogy: New resources, new tools, new methods 3, 197–
11861 214.
11862 Muralidharan, A. and M. A. Hearst (2013). Supporting exploratory text analysis in litera-
11863 ture study. Literary and linguistic computing 28(2), 283–295.
11864 Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.
11865 Nakagawa, T., K. Inui, and S. Kurohashi (2010). Dependency tree-based sentiment
11866 classification using CRFs with hidden variables. In Proceedings of the North American Chapter of
11867 the Association for Computational Linguistics (NAACL), pp. 786–794.
11868 Nakazawa, T., M. Yaguchi, K. Uchimoto, M. Utiyama, E. Sumita, S. Kurohashi, and H. Isa-
11869 hara (2016). ASPEC: Asian scientific paper excerpt corpus. In Proceedings of the Language
11870 Resources and Evaluation Conference, pp. 2204–2208.
11871 Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys
11872 (CSUR) 41(2), 10.
11873 Neal, R. M. and G. E. Hinton (1998). A view of the EM algorithm that justifies incremental,
11874 sparse, and other variants. In Learning in graphical models, pp. 355–368. Springer.
11877 Neubig, G. (2017). Neural machine translation and sequence-to-sequence models: A tu-
11878 torial. arXiv preprint arXiv:1703.01619.
11884 Neubig, G., Y. Goldberg, and C. Dyer (2017). On-the-fly operation batching in dynamic
11885 computation graphs. In Neural Information Processing Systems (NIPS).
11886 Neuhaus, P. and N. Bröker (1997). The complexity of recognition of linguistically adequate
11887 dependency grammars. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pp. 337–343.
11888 Newman, D., C. Chemudugunta, and P. Smyth (2006). Statistical entity-topic models. In
11889 Proceedings of Knowledge Discovery and Data Mining (KDD), pp. 680–686.
11890 Ng, V. (2010). Supervised noun phrase coreference research: The first fifteen years. In
11891 Proceedings of the 48th annual meeting of the association for computational linguistics, pp.
11892 1396–1411. Association for Computational Linguistics.
11893 Nguyen, D. and A. S. Doğruöz (2013). Word level language identification in online
11894 multilingual communication. In Proceedings of Empirical Methods for Natural Language
11895 Processing (EMNLP).
11896 Nguyen, D. T. and S. Joty (2017). A neural local coherence model. In Proceedings of the
11897 Association for Computational Linguistics (ACL), pp. 1320–1330.
11898 Nigam, K., A. K. McCallum, S. Thrun, and T. Mitchell (2000). Text classification from
11899 labeled and unlabeled documents using EM. Machine learning 39(2-3), 103–134.
11900 Nirenburg, S. and Y. Wilks (2001). What’s in a symbol: ontology, representation and lan-
11901 guage. Journal of Experimental & Theoretical Artificial Intelligence 13(1), 9–23.
11902 Nivre, J. (2008). Algorithms for deterministic incremental dependency parsing. Computa-
11903 tional Linguistics 34(4), 513–553.
11904 Nivre, J., M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajič, C. D. Manning, R. McDonald,
11905 S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman (2016, may). Universal
11906 Dependencies v1: A multilingual treebank collection. In N. Calzolari (Conference Chair), K. Choukri,
11907 T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk,
11908 and S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Re-
11909 sources and Evaluation (LREC 2016), Paris, France. European Language Resources Asso-
11910 ciation (ELRA).
11911 Nivre, J. and J. Nilsson (2005). Pseudo-projective dependency parsing. In Proceedings of the
11912 43rd Annual Meeting on Association for Computational Linguistics, pp. 99–106. Association
11913 for Computational Linguistics.
11916 Och, F. J. and H. Ney (2003). A systematic comparison of various statistical alignment
11917 models. Computational linguistics 29(1), 19–51.
11918 O’Connor, B., M. Krieger, and D. Ahn (2010). TweetMotif: Exploratory search and topic
11919 summarization for Twitter. In Proceedings of the International Conference on Web and Social
11920 Media (ICWSM), pp. 384–385.
11921 Oflazer, K. and İ. Kuruöz (1994). Tagging and morphological disambiguation of Turkish
11922 text. In Proceedings of the fourth conference on Applied natural language processing, pp. 144–
11923 149. Association for Computational Linguistics.
11924 Ohta, T., Y. Tateisi, and J.-D. Kim (2002). The GENIA corpus: An annotated research abstract
11925 corpus in molecular biology domain. In Proceedings of the second international conference
11926 on Human Language Technology Research, pp. 82–86. Morgan Kaufmann Publishers Inc.
11927 Onishi, T., H. Wang, M. Bansal, K. Gimpel, and D. McAllester (2016). Who did what: A
11928 large-scale person-centered cloze dataset. In Proceedings of Empirical Methods for Natural
11929 Language Processing (EMNLP), pp. 2230–2235.
11930 Owoputi, O., B. O’Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith (2013).
11931 Improved part-of-speech tagging for online conversational text with word clusters. In
11932 Proceedings of the North American Chapter of the Association for Computational Linguistics
11933 (NAACL), pp. 380–390.
11934 Packard, W., E. M. Bender, J. Read, S. Oepen, and R. Dridan (2014). Simple negation
11935 scope resolution through deep parsing: A semantic solution to a semantic problem. In
11936 Proceedings of the Association for Computational Linguistics (ACL), pp. 69–78.
11937 Paice, C. D. (1990). Another stemmer. In ACM SIGIR Forum, Volume 24, pp. 56–61.
11938 Pak, A. and P. Paroubek (2010). Twitter as a corpus for sentiment analysis and opinion
11939 mining. In LREC, Volume 10, pp. 1320–1326.
11941 Palmer, M., D. Gildea, and P. Kingsbury (2005). The proposition bank: An annotated
11942 corpus of semantic roles. Computational linguistics 31(1), 71–106.
11943 Pan, S. J. and Q. Yang (2010). A survey on transfer learning. IEEE Transactions on knowledge
11944 and data engineering 22(10), 1345–1359.
11945 Pan, X., T. Cassidy, U. Hermjakob, H. Ji, and K. Knight (2015). Unsupervised entity linking
11946 with abstract meaning representation. In Proceedings of the North American Chapter of the
11947 Association for Computational Linguistics (NAACL), pp. 1130–1139.
11948 Pang, B. and L. Lee (2004). A sentimental education: Sentiment analysis using subjectivity
11949 summarization based on minimum cuts. In Proceedings of the Association for Computa-
11950 tional Linguistics (ACL), pp. 271–278.
11951 Pang, B. and L. Lee (2005). Seeing stars: Exploiting class relationships for sentiment cate-
11952 gorization with respect to rating scales. In Proceedings of the Association for Computational
11953 Linguistics (ACL), pp. 115–124.
11954 Pang, B. and L. Lee (2008). Opinion mining and sentiment analysis. Foundations and trends
11955 in information retrieval 2(1-2), 1–135.
11956 Pang, B., L. Lee, and S. Vaithyanathan (2002). Thumbs up?: sentiment classification using
11957 machine learning techniques. In Proceedings of Empirical Methods for Natural Language
11958 Processing (EMNLP), pp. 79–86.
11959 Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu (2002). BLEU: a method for automatic
11960 evaluation of machine translation. In Proceedings of the Association for Computational
11961 Linguistics (ACL), pp. 311–318.
11962 Park, J. and C. Cardie (2012). Improving implicit discourse relation recognition through
11963 feature set optimization. In Proceedings of the Special Interest Group on Discourse and Dia-
11964 logue (SIGDIAL), pp. 108–112.
11965 Parsons, T. (1990). Events in the Semantics of English, Volume 5. MIT Press.
11966 Pascanu, R., T. Mikolov, and Y. Bengio (2013). On the difficulty of training recurrent neural
11967 networks. In Proceedings of the 30th International Conference on Machine Learning (ICML-
11968 13), pp. 1310–1318.
11969 Paul, M., M. Federico, and S. Stüker (2010). Overview of the IWSLT 2010 evaluation
11970 campaign. In International Workshop on Spoken Language Translation (IWSLT) 2010.
11971 Pedersen, T., S. Patwardhan, and J. Michelizzi (2004). WordNet::Similarity - measuring the
11972 relatedness of concepts. In Proceedings of the North American Chapter of the Association for
11973 Computational Linguistics (NAACL), pp. 38–41.
Pei, W., T. Ge, and B. Chang (2015). An effective neural network model for graph-based dependency parsing. In Proceedings of the Association for Computational Linguistics (ACL), pp. 313–322.
Peldszus, A. and M. Stede (2013). From argument diagrams to argumentation mining in texts: A survey. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI) 7(1), 1–31.
Peng, F., F. Feng, and A. McCallum (2004). Chinese segmentation and new word detection using conditional random fields. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 562.
Pennington, J., R. Socher, and C. Manning (2014). GloVe: Global vectors for word representation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1532–1543.
Pereira, F. and Y. Schabes (1992). Inside-outside reestimation from partially bracketed corpora. In Proceedings of the Association for Computational Linguistics (ACL), pp. 128–135.
Pereira, F. C. N. and S. M. Shieber (2002). Prolog and natural-language analysis. Microtome Publishing.
Peters, M. E., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018). Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Peterson, W. W., T. G. Birdsall, and W. C. Fox (1954). The theory of signal detectability. Transactions of the IRE Professional Group on Information Theory 4(4), 171–212.
Petrov, S., L. Barrett, R. Thibaux, and D. Klein (2006). Learning accurate, compact, and interpretable tree annotation. In Proceedings of the Association for Computational Linguistics (ACL).
Petrov, S., D. Das, and R. McDonald (2012, May). A universal part-of-speech tagset. In Proceedings of LREC.
Petrov, S. and R. McDonald (2012). Overview of the 2012 shared task on parsing the web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL), Volume 59.
Pinker, S. (2003). The language instinct: How the mind creates language. Penguin UK.
Pinter, Y., R. Guthrie, and J. Eisenstein (2017). Mimicking word embeddings using subword RNNs. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Pitler, E., A. Louis, and A. Nenkova (2009). Automatic sense prediction for implicit discourse relations in text. In Proceedings of the Association for Computational Linguistics (ACL).
Pitler, E. and A. Nenkova (2009). Using syntax to disambiguate explicit discourse connectives in text. In Proceedings of the Association for Computational Linguistics (ACL), pp. 13–16.
Pitler, E., M. Raghupathy, H. Mehta, A. Nenkova, A. Lee, and A. Joshi (2008). Easily identifiable discourse relations. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 87–90.
Plank, B., A. Søgaard, and Y. Goldberg (2016). Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the Association for Computational Linguistics (ACL).
Poesio, M., R. Stevenson, B. Di Eugenio, and J. Hitzeman (2004). Centering: A parametric theory and its instantiations. Computational Linguistics 30(3), 309–363.
Polanyi, L. and A. Zaenen (2006). Contextual valence shifters. In Computing attitude and affect in text: Theory and applications. Springer.
Ponzetto, S. P. and M. Strube (2006). Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 192–199.
Ponzetto, S. P. and M. Strube (2007). Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research 30, 181–212.
Poon, H. and P. Domingos (2008). Joint unsupervised coreference resolution with Markov logic. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 650–659.
Poon, H. and P. Domingos (2009). Unsupervised semantic parsing. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1–10.
Popel, M., D. Mareček, J. Štěpánek, D. Zeman, and Z. Žabokrtský (2013). Coordination structures in dependency treebanks. In Proceedings of the Association for Computational Linguistics (ACL), pp. 517–527.
Popescu, A.-M., O. Etzioni, and H. Kautz (2003). Towards a theory of natural language interfaces to databases. In Proceedings of Intelligent User Interfaces (IUI), pp. 149–157.
Poplack, S. (1980). Sometimes I’ll start a sentence in Spanish y termino en español: toward a typology of code-switching. Linguistics 18(7-8), 581–618.
Porter, M. F. (1980). An algorithm for suffix stripping. Program 14(3), 130–137.
Prabhakaran, V., O. Rambow, and M. Diab (2010). Automatic committed belief tagging. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 1014–1022.
Pradhan, S., X. Luo, M. Recasens, E. Hovy, V. Ng, and M. Strube (2014). Scoring coreference partitions of predicted mentions: A reference implementation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 30–35.
Pradhan, S., L. Ramshaw, M. Marcus, M. Palmer, R. Weischedel, and N. Xue (2011). CoNLL-2011 shared task: Modeling unrestricted coreference in OntoNotes. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1–27. Association for Computational Linguistics.
Pradhan, S., W. Ward, K. Hacioglu, J. H. Martin, and D. Jurafsky (2005). Semantic role labeling using different syntactic views. In Proceedings of the Association for Computational Linguistics (ACL), pp. 581–588.
Prasad, R., N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, and B. Webber (2008). The Penn Discourse Treebank 2.0. In Proceedings of LREC.
Punyakanok, V., D. Roth, and W.-t. Yih (2008). The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34(2), 257–287.
Pustejovsky, J., P. Hanks, R. Saurí, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro, et al. (2003). The TimeBank corpus. In Corpus Linguistics, Volume 2003, pp. 40. Lancaster, UK.
Pustejovsky, J., B. Ingria, R. Saurí, J. Castaño, J. Littman, R. Gaizauskas, A. Setzer, G. Katz, and I. Mani (2005). The specification language TimeML. In The language of time: A reader, pp. 545–557. Oxford University Press.
Qin, L., Z. Zhang, H. Zhao, Z. Hu, and E. Xing (2017). Adversarial connective-exploiting networks for implicit discourse relation classification. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1006–1017.
Qiu, G., B. Liu, J. Bu, and C. Chen (2011). Opinion word expansion and target extraction through double propagation. Computational Linguistics 37(1), 9–27.
Quattoni, A., S. Wang, L.-P. Morency, M. Collins, and T. Darrell (2007). Hidden conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(10).
Rahman, A. and V. Ng (2011). Narrowing the modeling gap: a cluster-ranking approach to coreference resolution. Journal of Artificial Intelligence Research 40, 469–521.
Rajpurkar, P., J. Zhang, K. Lopyrev, and P. Liang (2016). SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2383–2392.
Ranzato, M., S. Chopra, M. Auli, and W. Zaremba (2016). Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).
Rao, D., D. Yarowsky, A. Shreevats, and M. Gupta (2010). Classifying latent user attributes in Twitter. In Proceedings of the Workshop on Search and Mining User-Generated Contents.
Ratinov, L. and D. Roth (2009). Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pp. 147–155. Association for Computational Linguistics.
Ratinov, L., D. Roth, D. Downey, and M. Anderson (2011). Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1375–1384.
Ratliff, N. D., J. A. Bagnell, and M. Zinkevich (2007). (Approximate) subgradient methods for structured prediction. In Proceedings of Artificial Intelligence and Statistics (AISTATS), pp. 380–387.
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 133–142.
Ratnaparkhi, A., J. Reynar, and S. Roukos (1994). A maximum entropy model for prepositional phrase attachment. In Proceedings of the Workshop on Human Language Technology, pp. 250–255. Association for Computational Linguistics.
Read, J. (2005). Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop, pp. 43–48. Association for Computational Linguistics.
Reisinger, D., R. Rudinger, F. Ferraro, C. Harman, K. Rawlins, and B. Van Durme (2015). Semantic proto-roles. Transactions of the Association for Computational Linguistics 3, 475–488.
Reisinger, J. and R. J. Mooney (2010). Multi-prototype vector-space models of word meaning. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 109–117.
Reiter, E. and R. Dale (2000). Building natural language generation systems. Cambridge University Press.
Resnik, P., M. B. Olsen, and M. Diab (1999). The Bible as a parallel corpus: Annotating the ‘book of 2000 tongues’. Computers and the Humanities 33(1-2), 129–153.
Resnik, P. and N. A. Smith (2003). The web as a parallel corpus. Computational Linguistics 29(3), 349–380.
Ribeiro, F. N., M. Araújo, P. Gonçalves, M. A. Gonçalves, and F. Benevenuto (2016). SentiBench - a benchmark comparison of state-of-the-practice sentiment analysis methods. EPJ Data Science 5(1), 1–29.
Richardson, M., C. J. Burges, and E. Renshaw (2013). MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 193–203.
Riedel, S., L. Yao, and A. McCallum (2010). Modeling relations and their mentions without labeled text. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML), pp. 148–163.
Riedl, M. O. and R. M. Young (2010). Narrative planning: Balancing plot and character. Journal of Artificial Intelligence Research 39, 217–268.
Rieser, V. and O. Lemon (2011). Reinforcement learning for adaptive dialogue systems: a data-driven methodology for dialogue management and natural language generation. Springer Science & Business Media.
Riloff, E. (1996). Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 1044–1049.
Riloff, E. and J. Wiebe (2003). Learning extraction patterns for subjective expressions. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 105–112. Association for Computational Linguistics.
Ritchie, G. (2001). Current directions in computational humour. Artificial Intelligence Review 16(2), 119–135.
Ritter, A., C. Cherry, and W. B. Dolan (2011). Data-driven response generation in social media. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 583–593.
Roark, B., M. Saraclar, and M. Collins (2007). Discriminative n-gram language modeling. Computer Speech & Language 21(2), 373–392.
Robert, C. and G. Casella (2013). Monte Carlo statistical methods. Springer Science & Business Media.
Rosenfeld, R. (1996). A maximum entropy approach to adaptive statistical language modelling. Computer Speech & Language 10(3), 187–228.
Ross, S., G. Gordon, and D. Bagnell (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of Artificial Intelligence and Statistics (AISTATS), pp. 627–635.
Roy, N., J. Pineau, and S. Thrun (2000). Spoken dialogue management using probabilistic reasoning. In Proceedings of the Association for Computational Linguistics (ACL), pp. 93–100.
Rush, A. M., S. Chopra, and J. Weston (2015). A neural attention model for abstractive sentence summarization. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 379–389.
Rush, A. M., D. Sontag, M. Collins, and T. Jaakkola (2010). On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1–11.
Russell, S. J. and P. Norvig (2009). Artificial intelligence: a modern approach (3rd ed.). Prentice Hall.
Rutherford, A., V. Demberg, and N. Xue (2017). A systematic study of neural discourse models for implicit discourse relation. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pp. 281–291.
Rutherford, A. T. and N. Xue (2014). Discovering implicit discourse relations through Brown cluster pair representation and coreference patterns. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL).
Sag, I. A., T. Baldwin, F. Bond, A. Copestake, and D. Flickinger (2002). Multiword expressions: A pain in the neck for NLP. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 1–15. Springer.
Sagae, K. (2009). Analysis of discourse structure with syntactic dependencies and data-driven shift-reduce parsing. In Proceedings of the 11th International Conference on Parsing Technologies, pp. 81–84.
Santos, C. D. and B. Zadrozny (2014). Learning character-level representations for part-of-speech tagging. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1818–1826.
Sato, M.-A. and S. Ishii (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation 12(2), 407–432.
Saurí, R. and J. Pustejovsky (2009). FactBank: a corpus annotated with event factuality. Language Resources and Evaluation 43(3), 227.
Saxe, A. M., J. L. McClelland, and S. Ganguli (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).
Schank, R. C. and R. Abelson (1977). Scripts, goals, plans, and understanding. Hillsdale, NJ: Erlbaum.
Schapire, R. E. and Y. Singer (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning 39(2-3), 135–168.
Schaul, T., S. Zhang, and Y. LeCun (2013). No more pesky learning rates. In Proceedings of the International Conference on Machine Learning (ICML), pp. 343–351.
Schnabel, T., I. Labutov, D. Mimno, and T. Joachims (2015). Evaluation methods for unsupervised word embeddings. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 298–307.
Schneider, N., J. Flanigan, and T. O’Gorman (2015). The logic of AMR: Practical, unified, graph-based sentence semantics for NLP. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 4–5.
Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics 24(1), 97–123.
Schwarm, S. E. and M. Ostendorf (2005). Reading level assessment using support vector machines and statistical language models. In Proceedings of the Association for Computational Linguistics (ACL), pp. 523–530.
See, A., P. J. Liu, and C. D. Manning (2017). Get to the point: Summarization with pointer-generator networks. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1073–1083.
Sennrich, R., B. Haddow, and A. Birch (2016). Neural machine translation of rare words with subword units. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1715–1725.
Serban, I. V., A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016). Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pp. 3776–3784.
Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6(1), 1–114.
Shang, L., Z. Lu, and H. Li (2015). Neural responding machine for short-text conversation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1577–1586.
Shen, D. and M. Lapata (2007). Using semantic roles to improve question answering. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 12–21.
Shen, S., Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu (2016). Minimum risk training for neural machine translation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 1683–1692.
Shen, W., J. Wang, and J. Han (2015). Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering 27(2), 443–460.
Shieber, S. M. (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy 8(3), 333–343.
Siegelmann, H. T. and E. D. Sontag (1995). On the computational power of neural nets. Journal of Computer and System Sciences 50(1), 132–150.
Singh, S., A. Subramanya, F. Pereira, and A. McCallum (2011). Large-scale cross-document coreference using distributed inference and hierarchical models. In Proceedings of the Association for Computational Linguistics (ACL), pp. 793–803.
Smith, D. A. and J. Eisner (2006). Minimum risk annealing for training log-linear models. In Proceedings of the Association for Computational Linguistics (ACL), pp. 787–794.
Smith, D. A. and J. Eisner (2008). Dependency parsing by belief propagation. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 145–156.
Smith, N. A. (2011). Linguistic structure prediction. Synthesis Lectures on Human Language Technologies 4(2), 1–274.
Snover, M., B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, Volume 200.
Snow, R., B. O’Connor, D. Jurafsky, and A. Y. Ng (2008). Cheap and fast – but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 254–263.
Snyder, B. and R. Barzilay (2007). Database-text alignment via structured multilabel classification. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1713–1718.
Socher, R., J. Bauer, C. D. Manning, and A. Y. Ng (2013). Parsing with compositional vector grammars. In Proceedings of the Association for Computational Linguistics (ACL).
Socher, R., A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Søgaard, A. (2013). Semi-supervised learning and domain adaptation in natural language processing. Synthesis Lectures on Human Language Technologies 6(2), 1–103.
Solorio, T. and Y. Liu (2008). Learning to predict code-switching points. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 973–981. Association for Computational Linguistics.
Somasundaran, S., G. Namata, J. Wiebe, and L. Getoor (2009). Supervised and unsupervised methods in employing discourse relations for improving opinion polarity classification. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Somasundaran, S. and J. Wiebe (2009). Recognizing stances in online debates. In Proceedings of the Association for Computational Linguistics (ACL), pp. 226–234.
Song, L., B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola (2010). Hilbert space embeddings of hidden Markov models. In Proceedings of the International Conference on Machine Learning (ICML), pp. 991–998.
Song, L., Y. Zhang, X. Peng, Z. Wang, and D. Gildea (2016). AMR-to-text generation as a traveling salesman problem. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2084–2089.
Soon, W. M., H. T. Ng, and D. C. Y. Lim (2001). A machine learning approach to coreference resolution of noun phrases. Computational Linguistics 27(4), 521–544.
Sordoni, A., M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan (2015). A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Soricut, R. and D. Marcu (2003). Sentence level discourse parsing using syntactic and lexical information. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 149–156.
Sowa, J. F. (2000). Knowledge representation: logical, philosophical, and computational foundations. Pacific Grove, CA: Brooks/Cole.
Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1), 11–21.
Spitkovsky, V. I., H. Alshawi, D. Jurafsky, and C. D. Manning (2010). Viterbi training improves unsupervised dependency parsing. In CoNLL, pp. 9–17.
Sporleder, C. and M. Lapata (2005). Discourse chunking and its application to sentence compression. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 257–264.
Sproat, R., A. Black, S. Chen, S. Kumar, M. Ostendorf, and C. Richards (2001). Normalization of non-standard words. Computer Speech & Language 15(3), 287–333.
Sproat, R., W. Gale, C. Shih, and N. Chang (1996). A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics 22(3), 377–404.
Sra, S., S. Nowozin, and S. J. Wright (2012). Optimization for machine learning. MIT Press.
Srivastava, R. K., K. Greff, and J. Schmidhuber (2015). Training very deep networks. In Neural Information Processing Systems (NIPS), pp. 2377–2385.
Stab, C. and I. Gurevych (2014a). Annotating argument components and relations in persuasive essays. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 1501–1510.
Stab, C. and I. Gurevych (2014b). Identifying argumentative discourse structures in persuasive essays. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 46–56.
Stede, M. (2011, November). Discourse Processing, Volume 4 of Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.
Stenetorp, P., S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, and J. Tsujii (2012). Brat: a web-based tool for NLP-assisted text annotation. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pp. 102–107.
Stern, M., J. Andreas, and D. Klein (2017). A minimal span-based neural constituency parser. In Proceedings of the Association for Computational Linguistics (ACL).
Stolcke, A., K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. Van Ess-Dykema, and M. Meteer (2000). Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics 26(3), 339–373.
Stone, P. J. (1966). The General Inquirer: A Computer Approach to Content Analysis. The MIT Press.
Stoyanov, V., N. Gilbert, C. Cardie, and E. Riloff (2009). Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of the Association for Computational Linguistics (ACL), pp. 656–664.
Strang, G. (2016). Introduction to linear algebra (Fifth ed.). Wellesley, MA: Wellesley-Cambridge Press.
Strubell, E., P. Verga, D. Belanger, and A. McCallum (2017). Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
Suchanek, F. M., G. Kasneci, and G. Weikum (2007). YAGO: a core of semantic knowledge. In Proceedings of the Conference on World-Wide Web (WWW), pp. 697–706.
Sun, X., T. Matsuzaki, D. Okanohara, and J. Tsujii (2009). Latent variable perceptron algorithm for structured classification. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Volume 9, pp. 1236–1242.
Sun, Y., L. Lin, D. Tang, N. Yang, Z. Ji, and X. Wang (2015). Modeling mention, context and entity with neural networks for entity disambiguation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1333–1339.
Sundermeyer, M., R. Schlüter, and H. Ney (2012). LSTM neural networks for language modeling. In INTERSPEECH.
Surdeanu, M., J. Tibshirani, R. Nallapati, and C. D. Manning (2012). Multi-instance multi-label learning for relation extraction. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 455–465.
Sutskever, I., O. Vinyals, and Q. V. Le (2014). Sequence to sequence learning with neural networks. In Neural Information Processing Systems (NIPS), pp. 3104–3112.
Sutton, R. S. and A. G. Barto (1998). Reinforcement learning: An introduction, Volume 1. Cambridge, MA: MIT Press.
Sutton, R. S., D. A. McAllester, S. P. Singh, and Y. Mansour (2000). Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems (NIPS), pp. 1057–1063.
Taboada, M., J. Brooke, M. Tofiloski, K. Voll, and M. Stede (2011). Lexicon-based methods for sentiment analysis. Computational Linguistics 37(2), 267–307.
Taboada, M. and W. C. Mann (2006). Rhetorical structure theory: Looking back and moving ahead. Discourse Studies 8(3), 423–459.
Täckström, O., K. Ganchev, and D. Das (2015). Efficient inference and structured learning for semantic role labeling. Transactions of the Association for Computational Linguistics 3, 29–41.
Täckström, O., R. McDonald, and J. Uszkoreit (2012). Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 477–487.
Tang, D., B. Qin, and T. Liu (2015). Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1422–1432.
Taskar, B., C. Guestrin, and D. Koller (2003). Max-margin Markov networks. In Neural Information Processing Systems (NIPS).
Tausczik, Y. R. and J. W. Pennebaker (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1), 24–54.
Teh, Y. W. (2006). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics (ACL), pp. 985–992.
Tesnière, L. (1966). Éléments de syntaxe structurale (second ed.). Paris: Klincksieck.
Teufel, S., J. Carletta, and M. Moens (1999). An annotation scheme for discourse-level argumentation in research articles. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), pp. 110–117.
Teufel, S. and M. Moens (2002). Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics 28(4), 409–445.
Thomas, M., B. Pang, and L. Lee (2006). Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 327–335.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
Titov, I. and J. Henderson (2007). Constituent parsing with incremental sigmoid belief networks. In Proceedings of the Association for Computational Linguistics (ACL), pp. 632–639.
Toutanova, K., D. Klein, C. D. Manning, and Y. Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Trivedi, R. and J. Eisenstein (2013). Discourse connectors for latent subjectivity in sentiment analysis. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 808–813.
Tromble, R. W. and J. Eisner (2006). A fast finite-state relaxation method for enforcing global constraints on sequence decoding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 423.
Tsochantaridis, I., T. Hofmann, T. Joachims, and Y. Altun (2004). Support vector machine learning for interdependent and structured output spaces. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pp. 104. ACM.
Tsvetkov, Y., M. Faruqui, W. Ling, G. Lample, and C. Dyer (2015). Evaluation of word vector representations by subspace alignment. In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2049–2054.
Tu, Z., Z. Lu, Y. Liu, X. Liu, and H. Li (2016). Modeling coverage for neural machine translation. In Proceedings of the Association for Computational Linguistics (ACL), pp. 76–85.
Turian, J., L. Ratinov, and Y. Bengio (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the Association for Computational Linguistics (ACL), pp. 384–394.
Turing, A. M. (2009). Computing machinery and intelligence. In Parsing the Turing Test, pp. 23–65. Springer.
Turney, P. D. and P. Pantel (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37, 141–188.
Tutin, A. and R. Kittredge (1992). Lexical choice in context: generating procedural texts. In Proceedings of the International Conference on Computational Linguistics (COLING), pp. 763–769.
Tzeng, E., J. Hoffman, T. Darrell, and K. Saenko (2015). Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4068–4076.
Usunier, N., D. Buffoni, and P. Gallinari (2009). Ranking with ordered weighted pairwise classification. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1057–1064.
Uthus, D. C. and D. W. Aha (2013). The Ubuntu chat corpus for multiparticipant chat analysis. In AAAI Spring Symposium: Analyzing Microtext, Volume 13, pp. 01.
Utiyama, M. and H. Isahara (2001). A statistical model for domain-independent text segmentation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 499–506. Association for Computational Linguistics.
Utiyama, M. and H. Isahara (2007). A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pp. 484–491.
Uzuner, Ö., X. Zhang, and T. Sibanda (2009). Machine learning and rule-based approaches to assertion classification. Journal of the American Medical Informatics Association 16(1), 109–115.
Vadas, D. and J. R. Curran (2011). Parsing noun phrases in the Penn Treebank. Computational Linguistics 37(4), 753–809.
Van Eynde, F. (2006). NP-internal agreement and the structure of the noun phrase. Journal of Linguistics 42(1), 139–186.
12435 Van Gael, J., A. Vlachos, and Z. Ghahramani (2009). The infinite hmm for unsuper-
12436 vised pos tagging. In Proceedings of Empirical Methods for Natural Language Processing
12437 (EMNLP), pp. 678–687.
12438 Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and
12439 I. Polosukhin (2017). Attention is all you need. In Neural Information Processing Systems
12440 (NIPS), pp. 6000–6010.
12441 Velldal, E., L. Øvrelid, J. Read, and S. Oepen (2012). Speculation and negation: Rules,
12442 rankers, and the role of syntax. Computational linguistics 38(2), 369–410.
12446 Vilain, M., J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman (1995). A model-
12447 theoretic coreference scoring scheme. In Proceedings of the 6th conference on Message
12448 understanding, pp. 45–52. Association for Computational Linguistics.
12449 Vincent, P., H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol (2010). Stacked de-
12450 noising autoencoders: Learning useful representations in a deep network with a local
12451 denoising criterion. Journal of Machine Learning Research 11(Dec), 3371–3408.
12452 Vincze, V., G. Szarvas, R. Farkas, G. Móra, and J. Csirik (2008). The bioscope corpus:
12453 biomedical texts annotated for uncertainty, negation and their scopes. BMC bioinformat-
12454 ics 9(11), S9.
12455 Vinyals, O., A. Toshev, S. Bengio, and D. Erhan (2015). Show and tell: A neural image cap-
12456 tion generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference
12457 on, pp. 3156–3164. IEEE.
12458 Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum
12459 decoding algorithm. IEEE transactions on Information Theory 13(2), 260–269.
12460 Voll, K. and M. Taboada (2007). Not all words are created equal: Extracting semantic
12461 orientation as a function of adjective relevance. In Proceedings of Australian Conference
12462 on Artificial Intelligence.
12463 Wager, S., S. Wang, and P. S. Liang (2013). Dropout training as adaptive regularization. In
12464 Neural Information Processing Systems (NIPS), pp. 351–359.
12465 Wainwright, M. J. and M. I. Jordan (2008). Graphical models, exponential families, and
12466 variational inference. Foundations and Trends
R in Machine Learning 1(1-2), 1–305.
12470 Walker, M. A., J. E. Cahn, and S. J. Whittaker (1997). Improvising linguistic style: Social
12471 and affective bases for agent personality. In Proceedings of the first international conference
12472 on Autonomous agents, pp. 96–105. ACM.
12473 Wang, C., N. Xue, and S. Pradhan (2015). A Transition-based Algorithm for AMR Parsing.
12474 In Proceedings of the North American Chapter of the Association for Computational Linguistics
12475 (NAACL), pp. 366–375.
12476 Wang, H., T. Onishi, K. Gimpel, and D. McAllester (2017). Emergent predication structure
12477 in hidden state vectors of neural readers. In Proceedings of the 2nd Workshop on Represen-
12478 tation Learning for NLP, pp. 26–36.
12480 Webber, B. (2004, sep). D-LTAG: extending lexicalized TAG to discourse. Cognitive Sci-
12481 ence 28(5), 751–779.
12482 Webber, B., M. Egg, and V. Kordoni (2012). Discourse structure and language technology.
12483 Journal of Natural Language Engineering 1.
12484 Webber, B. and A. Joshi (2012). Discourse structure and computation: past, present and
12485 future. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Dis-
12486 coveries, pp. 42–54. Association for Computational Linguistics.
12487 Wei, G. C. and M. A. Tanner (1990). A monte carlo implementation of the em algorithm
12488 and the poor man’s data augmentation algorithms. Journal of the American Statistical
12489 Association 85(411), 699–704.
12490 Weinberger, K., A. Dasgupta, J. Langford, A. Smola, and J. Attenberg (2009). Feature
12491 hashing for large scale multitask learning. In Proceedings of the International Conference
12492 on Machine Learning (ICML), pp. 1113–1120.
12493 Weizenbaum, J. (1966). Eliza—a computer program for the study of natural language
12494 communication between man and machine. Communications of the ACM 9(1), 36–45.
12495 Wellner, B. and J. Pustejovsky (2007). Automatically identifying the arguments of dis-
12496 course connectives. In Proceedings of Empirical Methods for Natural Language Processing
12497 (EMNLP), pp. 92–101.
12498 Wen, T.-H., M. Gasic, N. Mrkšić, P.-H. Su, D. Vandyke, and S. Young (2015). Semantically
12499 conditioned lstm-based natural language generation for spoken dialogue systems. In
12500 Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 1711–1721.
12501 Weston, J., S. Bengio, and N. Usunier (2011). Wsabie: Scaling up to large vocabulary image
12502 annotation. In IJCAI, Volume 11, pp. 2764–2770.
12503 Wiebe, J., T. Wilson, and C. Cardie (2005). Annotating expressions of opinions and emo-
12504 tions in language. Language resources and evaluation 39(2), 165–210.
12505 Wieting, J., M. Bansal, K. Gimpel, and K. Livescu (2015). Towards universal paraphrastic
12506 sentence embeddings. arXiv preprint arXiv:1511.08198.
12507 Wieting, J., M. Bansal, K. Gimpel, and K. Livescu (2016). CHARAGRAM: Embedding
12508 words and sentences via character n-grams. In Proceedings of Empirical Methods for Nat-
12509 ural Language Processing (EMNLP), pp. 1504–1515.
12510 Williams, J. D. and S. Young (2007). Partially observable markov decision processes for
12511 spoken dialog systems. Computer Speech & Language 21(2), 393–422.
12512 Williams, P., R. Sennrich, M. Post, and P. Koehn (2016). Syntax-based statistical machine
12513 translation. Synthesis Lectures on Human Language Technologies 9(4), 1–208.
12514 Wilson, T., J. Wiebe, and P. Hoffmann (2005). Recognizing contextual polarity in phrase-
12515 level sentiment analysis. In Proceedings of Empirical Methods for Natural Language Pro-
12516 cessing (EMNLP), pp. 347–354.
12517 Winograd, T. (1972). Understanding natural language. Cognitive psychology 3(1), 1–191.
12518 Wiseman, S., A. M. Rush, and S. M. Shieber (2016). Learning global features for corefer-
12519 ence resolution. In Proceedings of the North American Chapter of the Association for Compu-
12520 tational Linguistics (NAACL), pp. 994–1004.
12521 Wiseman, S., S. Shieber, and A. Rush (2017). Challenges in data-to-document generation.
12522 In Proceedings of Empirical Methods for Natural Language Processing (EMNLP), pp. 2253–
12523 2263.
12524 Wiseman, S. J., A. M. Rush, S. M. Shieber, and J. Weston (2015). Learning anaphoricity and
12525 antecedent ranking features for coreference resolution. In Proceedings of the Association
12526 for Computational Linguistics (ACL).
12527 Wolf, F. and E. Gibson (2005). Representing discourse coherence: A corpus-based study.
12528 Computational Linguistics 31(2), 249–287.
12529 Wolfe, T., M. Dredze, and B. Van Durme (2017). Pocket knowledge base population. In
12530 Proceedings of the Association for Computational Linguistics (ACL), pp. 305–310.
12531 Wong, Y. W. and R. Mooney (2007). Generation by inverting a semantic parser that uses
12532 statistical machine translation. In Proceedings of the North American Chapter of the Associ-
12533 ation for Computational Linguistics (NAACL), pp. 172–179.
12534 Wong, Y. W. and R. J. Mooney (2006). Learning for semantic parsing with statistical ma-
12535 chine translation. In Proceedings of the North American Chapter of the Association for Com-
12536 putational Linguistics (NAACL), pp. 439–446.
12537 Wu, B. Y. and K.-M. Chao (2004). Spanning trees and optimization problems. CRC Press.
12538 Wu, D. (1997). Stochastic inversion transduction grammars and bilingual parsing of par-
12539 allel corpora. Computational linguistics 23(3), 377–403.
12540 Wu, F. and D. S. Weld (2010). Open information extraction using wikipedia. In Proceedings
12541 of the Association for Computational Linguistics (ACL), pp. 118–127.
12542 Wu, X., R. Ward, and L. Bottou (2018). Wngrad: Learn the learning rate in gradient de-
12543 scent. arXiv preprint arXiv:1803.02865.
12544 Wu, Y., M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao,
12545 Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Łukasz Kaiser, S. Gouws,
12546 Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young,
12547 J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016).
12548 Google’s neural machine translation system: Bridging the gap between human and ma-
12549 chine translation. CoRR abs/1609.08144.
12550 Xia, F. (2000). The part-of-speech tagging guidelines for the penn chinese treebank (3.0).
12551 Technical report, University of Pennsylvania Institute for Research in Cognitive Science.
12552 Xu, K., J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio
12553 (2015). Show, attend and tell: Neural image caption generation with visual attention.
12554 In Proceedings of the International Conference on Machine Learning (ICML), pp. 2048–2057.
12555 Xu, W., X. Liu, and Y. Gong (2003). Document clustering based on non-negative matrix
12556 factorization. In SIGIR, pp. 267–273. ACM.
12557 Xu, Y., L. Mou, G. Li, Y. Chen, H. Peng, and Z. Jin (2015). Classifying relations via long
12558 short term memory networks along shortest dependency paths. In Proceedings of Empir-
12559 ical Methods for Natural Language Processing (EMNLP), pp. 1785–1794.
12560 Xuan Bach, N., N. L. Minh, and A. Shimazu (2012). A reranking model for discourse seg-
12561 mentation using subtree features. In Proceedings of the Special Interest Group on Discourse
12562 and Dialogue (SIGDIAL).
12563 Xue, N. et al. (2003). Chinese word segmentation as character tagging. Computational
12564 Linguistics and Chinese Language Processing 8(1), 29–48.
12565 Xue, N., H. T. Ng, S. Pradhan, R. Prasad, C. Bryant, and A. T. Rutherford (2015). The
12566 CoNLL-2015 shared task on shallow discourse parsing. In Proceedings of the Conference
12567 on Natural Language Learning (CoNLL).
12568 Xue, N., H. T. Ng, S. Pradhan, A. Rutherford, B. L. Webber, C. Wang, and H. Wang (2016).
12569 Conll 2016 shared task on multilingual shallow discourse parsing. In CoNLL Shared
12570 Task, pp. 1–19.
12571 Yamada, H. and Y. Matsumoto (2003). Statistical dependency analysis with support vector
12572 machines. In Proceedings of IWPT, Volume 3, pp. 195–206.
12573 Yamada, K. and K. Knight (2001). A syntax-based statistical translation model. In Proceed-
12574 ings of the 39th Annual Meeting on Association for Computational Linguistics, pp. 523–530.
12575 Association for Computational Linguistics.
12576 Yang, B. and C. Cardie (2014). Context-aware learning for sentence-level sentiment anal-
12577 ysis with posterior regularization. In Proceedings of the Association for Computational Lin-
12578 guistics (ACL).
12579 Yang, Y., M.-W. Chang, and J. Eisenstein (2016). Toward socially-infused information ex-
12580 traction: Embedding authors, mentions, and entities. In Proceedings of Empirical Methods
12581 for Natural Language Processing (EMNLP).
12582 Yang, Y. and J. Eisenstein (2013). A log-linear model for unsupervised text normalization.
12583 In Proceedings of Empirical Methods for Natural Language Processing (EMNLP).
12584 Yang, Y. and J. Eisenstein (2015). Unsupervised multi-domain adaptation with feature em-
12585 beddings. In Proceedings of the North American Chapter of the Association for Computational
12586 Linguistics (NAACL).
12587 Yang, Y., W.-t. Yih, and C. Meek (2015). WikiQA: A challenge dataset for open-domain
12588 question answering. In Proceedings of Empirical Methods for Natural Language Processing
12589 (EMNLP), pp. 2013–2018.
12590 Yannakoudakis, H., T. Briscoe, and B. Medlock (2011). A new dataset and method for
12591 automatically grading esol texts. In Proceedings of the 49th Annual Meeting of the Associ-
12592 ation for Computational Linguistics: Human Language Technologies-Volume 1, pp. 180–189.
12593 Association for Computational Linguistics.
12594 Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised meth-
12595 ods. In Proceedings of the Association for Computational Linguistics (ACL), pp. 189–196.
12596 Association for Computational Linguistics.
12597 Yee, L. C. and T. Y. Jones (2012, March). Apple ceo in china mission to clear up problems.
12598 Reuters. retrieved on March 26, 2017.
12599 Yi, Y., C.-Y. Lai, S. Petrov, and K. Keutzer (2011, October). Efficient parallel cky parsing on
12600 gpus. In Proceedings of the 12th International Conference on Parsing Technologies, Dublin,
12601 Ireland, pp. 175–185. Association for Computational Linguistics.
12602 Yu, C.-N. J. and T. Joachims (2009). Learning structural svms with latent variables. In
12603 Proceedings of the International Conference on Machine Learning (ICML), pp. 1169–1176.
12604 Yu, F. and V. Koltun (2016). Multi-scale context aggregation by dilated convolutions. In
12605 Proceedings of the International Conference on Learning Representations (ICLR).
12606 Zaidan, O. F. and C. Callison-Burch (2011). Crowdsourcing translation: Professional qual-
12607 ity from non-professionals. In Proceedings of the Association for Computational Linguistics
12608 (ACL), pp. 1220–1229.
12609 Zaremba, W., I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv
12610 preprint arXiv:1409.2329.
12611 Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint
12612 arXiv:1212.5701.
12613 Zelenko, D., C. Aone, and A. Richardella (2003). Kernel methods for relation extraction.
12614 The Journal of Machine Learning Research 3, 1083–1106.
12615 Zelle, J. M. and R. J. Mooney (1996). Learning to parse database queries using induc-
12616 tive logic programming. In Proceedings of the National Conference on Artificial Intelligence
12617 (AAAI), pp. 1050–1055.
12618 Zeng, D., K. Liu, S. Lai, G. Zhou, and J. Zhao (2014). Relation classification via convolu-
12619 tional deep neural network. In Proceedings of the International Conference on Computational
12620 Linguistics (COLING), pp. 2335–2344.
12621 Zettlemoyer, L. S. and M. Collins (2005). Learning to map sentences to logical form: Struc-
12622 tured classification with probabilistic categorial grammars. In Proceedings of UAI.
12623 Zhang, X., J. Zhao, and Y. LeCun (2015). Character-level convolutional networks for text
12624 classification. In Neural Information Processing Systems (NIPS), pp. 649–657.
12625 Zhang, Y. and S. Clark (2008). A tale of two parsers: investigating and combining graph-
12626 based and transition-based dependency parsing using beam-search. In Proceedings of
12627 Empirical Methods for Natural Language Processing (EMNLP), pp. 562–571.
12628 Zhang, Y., T. Lei, R. Barzilay, T. Jaakkola, and A. Globerson (2014). Steps to excellence:
12629 Simple inference with refined scoring of dependency trees. In Proceedings of the Associa-
12630 tion for Computational Linguistics (ACL), pp. 197–207.
12631 Zhang, Y. and J. Nivre (2011). Transition-based dependency parsing with rich non-local
12632 features. In Proceedings of the Association for Computational Linguistics (ACL), pp. 188–193.
12633 Zhou, J. and W. Xu (2015). End-to-end learning of semantic role labeling using recurrent
12634 neural networks. In Proceedings of the Association for Computational Linguistics (ACL), pp.
12635 1127–1137.
12636 Zhu, J., Z. Nie, X. Liu, B. Zhang, and J.-R. Wen (2009). Statsnowball: a statistical approach
12637 to extracting entity relationships. In Proceedings of the Conference on World-Wide Web
12638 (WWW), pp. 101–110.
12639 Zhu, X., Z. Ghahramani, and J. D. Lafferty (2003). Semi-supervised learning using gaus-
12640 sian fields and harmonic functions. In Proceedings of the International Conference on Ma-
12641 chine Learning (ICML), pp. 912–919.
12644 Zipf, G. K. (1949). Human behavior and the principle of least effort.
12645 Zirn, C., M. Niepert, H. Stuckenschmidt, and M. Strube (2011). Fine-grained sentiment
12646 analysis with structural features. In IJCNLP, Chiang Mai, Thailand, pp. 336–344.
12647 Zou, W. Y., R. Socher, D. Cer, and C. D. Manning (2013). Bilingual word embeddings
12648 for phrase-based machine translation. In Proceedings of Empirical Methods for Natural
12649 Language Processing (EMNLP), pp. 1393–1398.
attention mechanism, 379, 430, 449, 462
autoencoder, 353
autoencoder, denoising, 353, 412
automated theorem provers, 290
automatic differentiation, 70
auxiliary verbs, 188
average mutual information, 341
averaged perceptron, 42

backchannel, 197
backoff, 142
backpropagation, 69, 148
backpropagation through time, 148
backward recurrence, 175, 176
backward-looking center, 388
bag of words, 29
balanced F-MEASURE, 95
balanced test set, 93
batch learning, 41
batch normalization, 74
batch optimization, 53
Baum-Welch algorithm, 180
Bayes’ rule, 478
Bayesian nonparametrics, 115, 256
beam sampling, 182
beam search, 275, 372, 453, 454
Bell number, 370
best-path algorithm, 206
bias (learning theory), 38, 139
bias-variance tradeoff, 38, 141
biconvexity, 114
bidirectional LSTM, 449
bidirectional recurrent neural network, 178
bigrams, 40, 82
bilexical, 255
bilexical features, 270
bilinear product, 338
binarization (context-free grammar), 219, 236
binomial distribution, 96
binomial random variable, 482
binomial test, 96
BIO notation, 194, 321
biomedical natural language processing, 193
bipartite graph, 327
Bonferroni correction, 99
boolean semiring, 207
boosting, 62
bootstrap samples, 98
brevity penalty (machine translation), 438
Brown clusters, 335, 340
byte-pair encoding, 351, 452

c-command, 361
case marking, 192, 227
Catalan number, 233
cataphora, 360
center embedding, 215
centering theory, 363, 388
chain FSA, 213
chain rule of probability, 477
chance agreement, 102
character-level language models, 153
chart parsing, 234
chatbots, 472
Chomsky Normal Form (CNF), 219
Chu-Liu-Edmonds algorithm, 268
CKY algorithm, 234
class imbalance, 93
classification weights, 29
closed-vocabulary, 153
closure (regular languages), 200
cloze question answering, 429
cluster ranking, 371
clustering, 108
co-training, 120
coarse-to-fine attention, 465
code switching, 190, 196
Cohen’s Kappa, 102
derivation (semantic parsing), 296, 300
derivational ambiguity, 230
derivational morphology, 202
determiner, 189
determiner phrase, 223
deterministic FSA, 202
development set, 39, 93
dialogue acts, 102, 197, 474
dialogue management, 470
dialogue systems, 137, 469
digital humanities, 17, 81
dilated convolution, 77, 195
Dirichlet distribution, 127
discounting (language models), 142
discourse, 385
discourse connectives, 391
discourse depth, 400
discourse depth tree, 401
discourse parsing, 390
discourse relations, 355, 390
discourse segment, 385
discourse sense classification, 392
discourse unit, 395
discrete random variable, 480
discriminative learning, 40
disjoint events (probability), 476
distant supervision, 130, 422, 423
distributed semantics, 334
distributional hypothesis, 333, 334
distributional semantics, 22, 334
distributional statistics, 87, 255, 334
document frequency, 411
domain adaptation, 107, 123
dropout, 71, 149
dual decomposition, 321
dynamic computation graphs, 70
dynamic oracle, 280
dynamic programming, 159
dynamic semantics, 305, 390

E-step (expectation-maximization), 111
early stopping, 43, 75
early update, 280
easy-first parsing, 283
edit distance, 209, 439
effective count (language models), 141
elementary discourse units, 395
elementwise nonlinearity, 63
Elman unit, 147
ELMo (embeddings from language models), 349
embedding, 177, 343
emission features, 158
emotion, 84
empirical Bayes, 128
empty string, 200
encoder-decoder, 446
encoder-decoder model, 353, 462
ensemble, 322
ensemble learning, 448
ensemble methods, 62
entailment, 293, 354
entities, 407
entity embeddings, 411
entity grid, 389
entity linking, 359, 407, 409, 420
entropy, 57, 111
estimation, 482
EuroParl corpus, 439
event, 425
event (probability), 475
event coreference, 425
event detection, 425
event semantics, 309
events, 407
evidentiality, 192, 427
exchange clustering, 342
expectation, 481
expectation maximization, 109, 143
expectation semiring, 215
expectation-maximization, in machine translation, 443
schema, 407, 408, 424
search error, 259, 372
second-order dependency parsing, 267
second-order logic, 292
seed lexicon, 85
segmented discourse representation theory (SDRT), 390
self-attention, 450
self-training, 120
semantic, 186
semantic concordance, 88
semantic parsing, 294
semantic role, 310
Semantic role labeling, 310
semantic role labeling, 416, 424
semantic underspecification, 305
semantics, 250, 287
semi-supervised learning, 107, 117, 348
semiring algebra, 182
semiring notation, 207
semisupervised, 88
senses, 313
sentence (logic), 292
sentence compression, 467
sentence fusion, 468
sentence summarization, 466
sentiment, 81
sentiment lexicon, 32
sequence-to-sequence, 447
shift-reduce parsing, 258
shifted positive pointwise mutual information, 346
shortest-path algorithm, 205
sigmoid, 63
simplex, 127
singular value decomposition, 73
singular value decomposition (SVD), 116
singular vectors, 74
skipgram word embeddings, 344
slack variables, 48
slot filling, 420
slots (dialogue systems), 469
smooth functions, 45
smoothing, 38, 141
soft K-means, 109
softmax, 63, 146, 420
source domain, 122
source language, 435
spanning tree, 262
sparse matrix, 338
sparsity, 56
spectral learning, 129
speech acts, 197
speech recognition, 137
split constituents, 316
spurious ambiguity, 230, 272, 278, 300
squashing function, 147
squashing functions, 65
stand-off annotations, 101
Stanford Natural Language Inference corpus, 354
statistical learning theory, 42
statistical machine translation, 436
statistical significance, 96
stem, 18
stem (morphology), 203
stemmer, 90
step size, 53, 486
stochastic gradient descent, 45, 54
stoplist, 92
stopwords, 92
string (formal language theory), 199
string-to-tree translation, 446
strong compositionality criterion (RST), 397
strongly equivalent grammars, 218
structure induction, 180
structured attention, 465
structured perceptron, 170
structured prediction, 30
structured support vector machine, 171
subgradient, 45, 56