Natural Language Processing: Processing Concerns What We Do Once We Know What The Words Are
Introduction
Previously, we discussed the problem of speech recognition, i.e., hypothesizing from an acoustic signal the words a speaker is saying. Natural language processing concerns what we do once we know what the words are. Some important examples of natural language processing tasks are the following:

- Natural language front ends (often to database management systems)
- Intelligent help or tutoring systems
- Text summarization
- Information access and extraction
- Machine translation
- Intelligent authoring tools (e.g., grammar or style checkers)

There are two ways in which we might approach these problems. One is to hope we can find some clever tricks that can solve the problem at least partially, perhaps well enough for some application. It is generally the case that, for some tasks, e.g., machine translation, it is possible to do at least a very poor job with clever tricks. Similarly, there are various technical sub-problems (to be discussed below) that can be solved approximately with clever tricks. We can also build systems that behave reasonably well in a very highly constrained domain (e.g., asking about airline flight availability). However, it doesn't seem possible to get very far with clever tricks for some tasks, e.g., front ends. Moreover, achieving (near) human-level performance in any of these tasks almost certainly involves solving very hard problems, i.e., simulating people at some level.

An Example Hack: ELIZA

A well-known example of a conversation program based on clever tricks is ELIZA, written by Joseph Weizenbaum. (A version of this program called DOCTOR was popularized; indeed, it's still available in EMACS.) ELIZA imitated a Rogerian (i.e., so-called "non-directed") psychotherapist. Here is a sample dialogue, with ELIZA's responses in capitals:

Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I am depressed much of the time.
I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
It's true. I am unhappy.
DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY
My mother takes care of me.
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
My father.
YOUR FATHER
You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think you don't want me to notice that.
WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
You don't argue with me.
WHY DO YOU THINK I DON'T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO THINK I AM AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT THAT YOUR BOYFRIEND MADE YOU COME HERE

Impressive as this dialogue seems, there's not much to this system. Once again, it operates using keywords and simple pattern matching. ELIZA looks for keywords that occur in the input, and has patterns associated with these; patterns typically have transformations associated with them that are used to generate an output. For example, a pattern associated with the keyword "me" might be

(* YOU * ME)

and an associated transformation would be

(WHAT MAKES YOU THINK I 2 YOU)
where * indicates a string of any number of words, and 2 designates the second element matched. In a similar manner, if the program sees the word "everyone" in the input, it might have a pattern that causes it to respond with "who in particular?" In addition, ELIZA used a number of heuristics:

- eliminate portions of the input which don't contain keywords
- use first the more complex patterns associated with the keywords present in the input
- prefer a pattern that hasn't yet been used to one that has
- if nothing matches, output a default string such as "Please go on" or "Very interesting"
- substitute for keywords certain associated words, such as "family" for "mother" (and remember this substitution through the course of the dialogue)

Even with these tricks:

- A conversation with ELIZA gets old fast; the program gets repetitious quickly.
- Since its response formation is so stupid, mistakes and typos in the input often get echoed back to the user; responses are sometimes completely ungrammatical.
- It is extremely doubtful that any computer-knowledgeable person would be fooled by ELIZA for very long.

So why does it work even as well as it does? Once again, the task is at fault. Indeed, Weizenbaum chose this particular kind of therapist to emulate because it is "one of the few examples of ... NL conversation in which one of the participating pair is free to assume the pose of knowing almost nothing of the real world." In other words, it is not a very good example of real natural language, in that in almost any other context, these kinds of interactions would be unacceptable. But even within this context, the program doesn't function very well.

In addition, if we believe that the entity we are communicating with is really intelligent, we might imbue its utterances with significance of which the entity is unaware.
The following is a true story told by Danny Bobrow about his time at BBN:

One Saturday morning about 9 A.M., a BBN Vice President in charge of selling our Telcomp commercial service arrived at our PDP-1 Computer Room hoping to be able to use the computer for a demonstration. He found a note on the PDP-1 teletype that read: "TYPE ALL MESSAGES INTO THIS TELETYPE, AND END THEM WITH A PERIOD." This note was left there by one of the programming bums who had just succeeded in making run a LISP version of the DOCTOR program (based on ELIZA as developed by Weizenbaum). However, the VP thought that this note indicated that this teletype was linked with the one in my home, since I often used the computer all night. Therefore he started typing to "me," but in fact was typing to the version of the DOCTOR program. The following conversation ensued:

VP: DO YOU THINK I CAN USE THE COMPUTER THIS MORNING?
Teletype Response: WHY DO YOU ASK?
VP: I MIGHT BE ABLE TO MAKE SOME ADDITIONAL TELCOMP SALES.
Teletype Response: WHY AREN'T YOU SURE?
VP: MY PERSPECTIVE CUSTOMERS HAVE NEVER SEEN A DEMONSTRATION OF THE SYSTEM.
Teletype Response: DOES THIS MEAN VERY MUCH TO YOU?
VP: OF COURSE IT DOES.
Teletype Response: YOU SEEM VERY POSITIVE ABOUT THAT.
VP: THIS IS RIDICULOUS.
Teletype Response: WHAT DO YOU MEAN THIS IS RIDICULOUS?
VP: IS THERE ANY REASON WHY I CAN'T USE THE COMPUTER NOW?
Teletype Response: TELL ME MORE ABOUT YOUR PERSPECTIVE CUSTOMERS.
VP: PLEASE DIAL ME UP ON 491-1850

Note that after the last remark the VP did not obey instructions, and left out the period. Therefore, of course, the computer didn't answer him. This so infuriated the VP, who thought I was playing games with him, that he called me up, woke me from a deep sleep, and said:

VP: Why are you being so snotty to me?
Bobrow: What do you mean why am I being snotty to you?

The VP angrily read me the dialog that "we" had been having, and couldn't get any response but laughter from me. It took a while to convince him it really was the computer.
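The keyword-and-pattern mechanism described for ELIZA earlier can be sketched in a few lines of Python. This is only a rough illustration, not Weizenbaum's program: the rule table `RULES`, the function `respond`, and the use of regular expressions in place of the (* YOU * ME) notation are all assumptions made for this sketch, and only two of the heuristics (keyword lookup and a default reply) are included.

```python
import re

# Hypothetical rule table: (keyword, pattern, response template).
# Each "*" of the notes' notation becomes a regex group; "{2}" in a
# template stands for the second group matched.
RULES = [
    ("me", r"(.*)\byou\b(.*)\bme\b", "WHAT MAKES YOU THINK I {2} YOU"),
    ("everyone", r".*\beveryone\b.*", "WHO IN PARTICULAR?"),
    ("mother", r".*\bmother\b.*", "TELL ME MORE ABOUT YOUR FAMILY"),
]
DEFAULTS = ["PLEASE GO ON", "VERY INTERESTING"]


def respond(utterance, turn=0):
    """Return a canned transformation of the input, or a default reply."""
    text = utterance.lower().strip(" .!?")
    words = re.findall(r"[a-z']+", text)
    for keyword, pattern, template in RULES:
        if keyword not in words:
            continue  # this rule's keyword does not occur in the input
        match = re.match(pattern, text)
        if match:
            groups = [g.strip().upper() for g in match.groups()]
            # The leading "" makes {1}, {2}, ... line up with the groups.
            return template.format("", *groups)
    # Nothing matched: cycle through the default strings.
    return DEFAULTS[turn % len(DEFAULTS)]


print(respond("You don't argue with me."))
print(respond("Everyone hates me"))
print(respond("It rained.", turn=1))
```

Note that the sketch simply echoes the matched words back, which is why typos in the input would be echoed too.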
So, we can easily build a very limited system with a bag of tricks, but doing better than this requires solving some very hard problems. The AI approach to NLP is fundamentally concerned with these very hard problems. The holy grail of AI/NLP is to do any of these tasks as well as a five-year-old child.
Despite the differences in these tasks, the underlying nature of them is really quite similar. In general, words (and maybe other linguistic objects) have meanings associated with them. The words can be placed in linear arrangements that suggest or constrain how their meanings might be combined together. These utterances occur in some context, e.g., as part of a conversation or as sentences in a text, where they bear some relations to one another. In understanding, we get the words, and have to extract the meaning, and figure out how it relates to the larger task. In production, we are engaged in some task, and have to come up with something to say, and then figure out how to encode it into words.

So, why is this so hard? Consider the following short discourse (from the old Burns and Allen TV show):

Gracie: Did you go to the hospital today to visit your aunt?
George: Yes, in fact, I took her flowers.
Gracie: That's a terrible thing to do! If she was sick, you should have brought her flowers!

The problem here is that the phrase "took her flowers" is ambiguous in several ways. Let's examine these ambiguities briefly:

Notice that there are two senses of the verb "take" here, meaning, roughly, "bring" or "take away". This is an example of lexical ambiguity, i.e., of a given word having multiple meanings. "Her" is also ambiguous, being either a personal pronoun or a possessive pronoun.

Notice also that each sense of "take" has different grammatical properties, leading to different parses of the sentence:

[S [NP [Pro I]] [VP [V took] [NP [Pro her]] [NP [N flowers]]]]
[S [NP [Pro I]] [VP [V took] [NP [DET her] [N flowers]]]]

Each of these presents a constituent structure of the sentence. Constituent structure tells us how the components combine together. Here I show the constituents with labeled brackets. This is just a lazy man's way of drawing a tree. (To see this, draw a node corresponding to each bracket label, starting with S, with a link to each element inside the brackets. Thus, S would have a link to the children NP and VP; NP would have a link to the single child Pro, which would have a link to the child "I". And so on.)

Note that in the second interpretation, the labeling indicates that "her flowers" is a constituent (in fact, an NP, or noun phrase). In this reading, "her flowers" is treated analogously to "his flowers" or "some flowers". In the first interpretation, there is no constituent corresponding to this string. Instead, the string "took her flowers" (a VP, or verb phrase) is the first constituent that contains "her flowers". In this reading, the phrase "took her flowers" is treated analogously to "took him flowers" or "took Auntie flowers". When there are alternative constituent structures we can assign to a string, we say it is syntactically ambiguous.

We humans resolve the ambiguities, indeed, generally without becoming conscious of them. To do so requires world knowledge, in this case, about what the appropriate social behavior is in a given situation. Thus, we need discourse and inference to figure out which of various possible interpretations of an utterance is intended. NLP is like this all the time; the example just brings it to light.
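The claim that labeled brackets are just another way of drawing a tree can be checked mechanically. The following Python sketch (the function names `parse_brackets`, `words`, and `constituents` are made up for this illustration) reads a bracketed string into a nested structure and lists the constituents of each reading, so that one can see that "her flowers" is a constituent in only one of them.

```python
def parse_brackets(text):
    """Turn a labeled-bracket string into nested (label, children) tuples."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "["
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())       # a sub-constituent
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing "]"
        return (label, children)

    return node()


def words(tree):
    """Collect the words under a node, left to right."""
    _, children = tree
    out = []
    for child in children:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


def constituents(tree):
    """Yield (label, phrase) for every bracketed node in the tree."""
    label, children = tree
    yield (label, " ".join(words(tree)))
    for child in children:
        if isinstance(child, tuple):
            yield from constituents(child)


first = parse_brackets("[S [NP [Pro I]] [VP [V took] [NP [Pro her]] [NP [N flowers]]]]")
second = parse_brackets("[S [NP [Pro I]] [VP [V took] [NP [DET her] [N flowers]]]]")

print(("NP", "her flowers") in set(constituents(first)))   # first reading
print(("NP", "her flowers") in set(constituents(second)))  # second reading
```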
Kinds of Knowledge
Doing anything interesting with language requires lots of knowledge. It is useful to classify this into different kinds:

- Lexical: knowledge about individual words.
- Syntactic: knowledge about grammatical structure.
- Semantic: knowledge about what words, etc., mean.
- Pragmatic: knowledge about how language is used. This might be less familiar to people. For example, the sentence "Do you know if there's a telephone nearby?" is probably a request for information, rather than a question about someone's state of knowledge. This is one kind of pragmatic knowledge. Another example: "I may or may not go to the party tonight" seems to be something more than a tautology. Indeed, all aspects of reference, e.g., definiteness and indefiniteness, are essentially pragmatic issues. Since language use will be important to systems that use language, pragmatics will loom large here. (Indeed, one radical position would have us state that there really is no semantics independent of pragmatics.)
- World knowledge: all the general knowledge we have.
- Use: knowledge of how terms actually occur and co-occur in large corpora.

There are other kinds of linguistic knowledge, for example, morphology (the structure of words) and phonetics (the nature of the sounds we use), but we won't have much to say about these here.
transportation and go by means-of-transportation have to both map into this same structure. If we have a representation in which identical meanings have identical representations, then we say that the representation has canonical form. If we believe that the representation is independent of the particular language, then it is called an interlingua. The degree to which this is possible is controversial. But it is perhaps worth noting that bilingual speakers sometimes report that they can't remember what language they had a conversation in, but that they can recall the content of the conversation.

Another issue is shown by the following example of a surface ambiguity that should be resolved in the meaning:

The chicken is ready to eat.

Note that here, no lexical ambiguity is relevant, nor are there alternative structures to assign to the sentence. Yet there are two quite distinct readings of the sentence. This example reflects the issue of roles. That is, the situation of eating has roles for the eater and the eaten. We would like to specify the roles of each type of situation, and then map the fillers of the roles into the right places. (This sentence is interesting because the subject can fill either role.) Roles are sometimes also called cases. There are a number of theories that try to specify a very limited number of cases for all situations. These tend to be things like

- Agent (the one responsible for an action)
- Theme (a thing acted upon)
- Experiencer (the entity undergoing an experience or psychological state)
- Beneficiary (the one for whom the action is performed)
- Instrument (the intermediate cause of an event)
- Donor
- Recipient
- Origin
- Destination

although there is no agreement on the set.
This representation has been derived compositionally. That is, we composed the meaning of the whole from the meanings of the parts. Before going further, it is important to realize that compositional meaning is a small part of the intended meaning. For example, when someone says "red pen" or "Chinese restaurant", they probably mean something like "pen that writes in red" and "restaurant that serves food prepared in the Chinese manner". However, these interpretations can't come from the language itself. This is because in expressions like "red shirt" and "Chinese diplomat", the modification works quite differently. So, all we can get out of the language by compositional means is something like "pen somehow related to red", etc. Thus the compositional meaning is just a kind of framework upon which real meaning must (eventually, somehow) be built.

To get even compositional meaning, we needed to know how the elements of the sentence formed constituents, and how to interpret these constituents semantically. Which things can go where is called grammar. If we have a grammar for a language, it will assign a structure to a string of words (generally, many structures). Assigning a grammatical structure to a string is called parsing. In general, grammars consist of rules, like

S -> NP VP
VP -> V NP
VP -> V
NP -> DET N
DET -> the, a
N -> boy, girl
V -> saw, dreamed, hit

The terminals are, of course, the lexicon. The pre-terminals are the parts of speech, or lexical categories. There is generally a small, fixed number of these. (In traditional grammar, probably about 6; in most grammatical theories, several more; in some applications, many more. E.g., 35 might be distinguished for practical purposes.) Everything else is a phrase of some sort. Generally, the phrases, or syntactic categories, are bigger versions of the (major) parts of speech. The major rules of a grammar build up these phrases and relate them to one another. Grammars let us assign parses to sentences.
For example, the above grammar lets us parse a sentence like "The girl saw a boy":

[S [NP [DET The] [N girl]] [VP [V saw] [NP [DET a] [N boy]]]]
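To see how a grammar of this kind assigns a parse, here is a minimal top-down parser for the toy grammar, written as a sketch rather than as a serious parsing algorithm. The names `GRAMMAR`, `LEXICON`, `expand`, `expand_sequence`, and `parse` are introduced only for this example, and the approach (trying every rule and backtracking) relies on the toy grammar having no left-recursive rules.

```python
# The toy grammar from the text, split into phrase rules and a lexicon.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V"]],
    "NP": [["DET", "N"]],
}
LEXICON = {
    "DET": {"the", "a"},
    "N": {"boy", "girl"},
    "V": {"saw", "dreamed", "hit"},
}


def expand(symbol, words, start):
    """Yield (bracketed tree, end) for each way symbol covers words[start:end]."""
    if symbol in LEXICON:
        if start < len(words) and words[start] in LEXICON[symbol]:
            yield f"[{symbol} {words[start]}]", start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand_sequence(symbol, rhs, words, start, [])


def expand_sequence(symbol, rhs, words, start, done):
    """Match the remaining right-hand-side symbols one after another."""
    if not rhs:
        yield f"[{symbol} {' '.join(done)}]", start
        return
    for tree, end in expand(rhs[0], words, start):
        yield from expand_sequence(symbol, rhs[1:], words, end, done + [tree])


def parse(sentence):
    """Return every parse of the whole sentence as a labeled-bracket string."""
    words = sentence.lower().split()
    return [tree for tree, end in expand("S", words, 0) if end == len(words)]


for tree in parse("The girl saw a boy"):
    print(tree)
```

Because `parse` returns a list, a string with several structures would yield several bracketings, and a string the grammar rejects yields an empty list.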
Grammars such as the one above are called context-free grammars (CFGs). (This is because the symbols on the left-hand side get rewritten without any context.) This is a very natural way to think about grammars, and a lot is known about how to parse them. Note, though, that CFGs are not as powerful as can be. (For example, we can show that the language a^n b^n c^n can't be described with a context-free grammar.) It is a subject of debate whether natural languages require more than a CFG. The general belief is that they generally don't, except possibly for certain isolated exceptions. However, it is also generally believed that you can indeed write a better grammar if you allow a slight extension of the CFG rules.

To demonstrate this, note that the above grammar is clearly inadequate even for this simple vocabulary. According to this grammar, the strings "A boy dreamed the girl" and "A girl hit" are fine, and it will assign parses to them. Similarly, it would be awkward to express things like agreement in this notation. E.g., if we extended our grammar to allow present tense verb forms, like "dream" and "dreams", as well as pronouns, like "I", then we need to be able to express the grammatical fact that "Jan dreams" is okay, but "Jan dream" is not. Similarly, "I dream" is okay, but "I dreams" is not; "a boy" is okay, but "a boys" is not. (Of course, other languages have different particular rules.)

One way to capture these regularities well is to add features to grammatical categories. The idea is that, rather than say that a word like "man" is of lexical category N, we say that it is a bundle of features (one of which is the traditional lexical category name). For example, here is a feature structure for a word like "man":
[ cat: N
  number: sg
  person: 3 ]
Now we need to augment our rules to deal with features. In particular, we will add feature constraints. Here is the rule for NPs above with some constraints added:

NP -> D N
  (N number) = (D number)
  (NP number) = (N number)
  (NP person) = (N person)
  (NP definite) = (D definite)

That is, you still get an NP when you see a D followed by an N, but only if they have the same number feature values. Moreover, the resulting NP has the same number and the same person as the N, and the same definiteness as the D. Having the same feature values means that, if the values are symbols, they must be the same, and if they are themselves feature structures, each of the features present in both must have the same value. This idea is called, not surprisingly, unification. It is related to, but different from, what we did in logic. Our lexical items would have similar extensions:

N -> man
  (N person) = 3
  (N number) = sg

D -> a
  (D definite) = -
  (D number) = sg

Each of these simply means to build a feature structure like the one above. Our grammar rule above now can determine if the constraints can be satisfied, and specify a new feature structure for the phrase. In this case they can, and we get the resulting structure:
[ cat: NP
  number: sg
  person: 3
  definite: - ]
This sort of thing is called unification grammar. (Technically, unification grammar is more powerful than CFGs, although it is not as powerful as the CSGs, the context-sensitive grammars.) Note that we have carried along a little semantics in the feature structure, namely, the meaning of "a". It is not too hard to extend this. For example, we could add a category feature to "man" (of, say, Man), and hence carry around this information. It is not too hard to see how one might read off these features into predicate calculus formulae.

Actually, we can take advantage of the fact that we recurse through the values of feature structures to make these descriptions more compact. For example, an alternative version of the previous rules might be the following:

N -> man
  (N head agreement person) = 3
  (N head agreement number) = sg
  (N head semantics category) = Man

D -> a
  (D head semantics definite) = -
  (D head agreement number) = sg

NP -> D N
  (NP head) = (N head)
  (NP head) = (D head)

The first two rules correspond to these feature structures, respectively:

[ cat: N
  head: [ agreement: [ person: 3
                       number: sg ]
          semantics: [ category: Man ] ] ]

[ cat: D
  head: [ semantics: [ definite: - ]
          agreement: [ number: sg ] ] ]
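The recursive comparison of feature structures described above can be sketched with nested Python dictionaries. This is a simplified illustration under stated assumptions: the function names `unify` and `build_np` are invented here, the value "-" for the definite feature of "a" is assumed (the article is indefinite), the plural entry `MEN` is a hypothetical addition used only to show a clash, and real unification grammars also support shared (re-entrant) values, which this sketch ignores.

```python
def unify(a, b):
    """Unify two feature values; return the merged value, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for feature, value in b.items():
            if feature in merged:
                sub = unify(merged[feature], value)  # recurse into shared features
                if sub is None:
                    return None
                merged[feature] = sub
            else:
                merged[feature] = value  # feature present on one side only
        return merged
    # Symbols (atomic values) unify only if they are identical.
    return a if a == b else None


# Lexical entries written as nested feature structures.
MAN = {"cat": "N",
       "head": {"agreement": {"person": 3, "number": "sg"},
                "semantics": {"category": "Man"}}}
A = {"cat": "D",
     "head": {"semantics": {"definite": "-"},  # assumed value for the indefinite article
              "agreement": {"number": "sg"}}}
MEN = {"cat": "N",  # hypothetical plural entry, not in the text
       "head": {"agreement": {"person": 3, "number": "pl"},
                "semantics": {"category": "Man"}}}


def build_np(det, noun):
    """NP -> D N with (NP head) = (N head) and (NP head) = (D head)."""
    head = unify(noun["head"], det["head"])
    return None if head is None else {"cat": "NP", "head": head}


print(build_np(A, MAN))  # agreement succeeds
print(build_np(A, MEN))  # number clash: sg vs. pl
```

Requiring the NP head to equal both the N head and the D head amounts to unifying the two heads, which is what `build_np` does.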
Entries for verbs tend to be more complex, since they have to take into account the ways various verbs line up with their complements. I.e., "give" can arise in contexts like "I gave the mouse a cookie" as well as "I gave a cookie to the mouse". These correspond to VPs that look like V NP NP and V NP PP. This business is called subcategorization. We can get this to work right with feature structures too, although the details are a bit cumbersome.

The issue of exactly what the right grammars of natural languages are is extremely controversial. If we had a unification grammar, or some other grammar, we could use it to parse the sentences of a language, and then try to choose the parses and interpretations we like best. There are many details about how best to parse, which we will ignore here. This is close to what is done in practice. In reality, it is complicated by a number of issues:

Almost all sentences have lots of syntactic parses. How many? Charniak estimates that, for the average sentence in the WSJ, which is 23 words long, a million parses is a conservative estimate. Why so many? The number is a little misleading because, for long sentences, each of the clauses may have different, essentially independent parses, and hence the number of real parses is the product of these. Also, perhaps the grammars are a bit simplistic, and a better grammar would eliminate a few parses. However, fundamentally, the problem is real. A classic example is a sentence like the following:

Time flies like an arrow.

This sentence has a number of parses. The obvious one corresponds to the following series of POS tags:

N V P D N
Another, unintuitive couple are the following:

V [N P D N]    ([You] time flies that resemble an arrow.)
[V N] [P D N]  ([You] time flies in the manner in which an arrow times flies.)
In general, most parses admitted by a grammar don't make any sense. In any case, they need to be eliminated by reasoning, not by grammar.

In context, various pieces of an utterance can be omitted (details varying from language to language):

Do you like that song? I know twelve more [songs].
Jan thinks chocolate is Lynn's favorite kind of ice cream. Pat thinks vanilla.

These are examples of ellipsis.

People are really good at understanding poorly-formed utterances. (Virtually all of speech works this way.)

There are a slew of issues related to how exactly people do this. (Note that NLP is one area where doing it better than people doesn't make much sense. E.g., we don't want to find an interpretation of a sentence that no speaker of a language sees. Similarly, there is no point in producing utterances that a theory likes but which no one can understand.) For example, in our "time flies" example, we don't have the subjective impression of waiting until the end of the syntactic parse phase and then ruling out semantically ill-suited possibilities. Instead, we combine syntactic, semantic and world knowledge at the same time, making local decisions. As examples, there are lots of sentences that at first seem ungrammatical, but actually aren't. Some famous ones are

The old man the boats.
The horse raced past the barn fell.
The old man's glasses were filled with sherry.
One moral of the story is that combining many types of knowledge together is the key. Unfortunately, how to do this is not well understood. Instead, one trick that people employ today is to use statistical techniques to approximate the answer.

One success of statistical techniques is POS tagging, i.e., assigning POSs to words in a discourse. This is a subpart of the parsing problem, but a useful subpart, in that many (but by no means all) syntactic ambiguities go away once we know the right part of speech. It turns out that if we try to predict the POS of a word based on the word and the POS of the previous word or two, this is just what an HMM does. Such techniques (among others) perform at about 95% correct, which is pretty good, considering that humans agree with each other only about 98% of the time on this task.

We can try to do similar tricks with parsing per se. There are probabilistic CFGs, for example. These don't work too well by themselves, but with some additional statistics, they parse big corpora correctly about 88% of the time.
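As a rough illustration of HMM-style tagging, the sketch below finds the most probable tag sequence for a sentence with the Viterbi algorithm, using only the previous tag as context. Everything numeric here is an assumption: the probabilities in `START_P`, `TRANS_P`, and `EMIT_P` are invented for the example, whereas a real tagger would estimate them from a tagged corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence under a bigram HMM."""
    # best[t] = (probability, path) of the best tag sequence ending in tag t
    best = {t: (start_p[t] * emit_p[t].get(words[0], 0.0), [t]) for t in tags}
    for word in words[1:]:
        step = {}
        for t in tags:
            # Pick the previous tag p that makes "... p t" most probable.
            prob, path = max(
                ((best[p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), best[p][1])
                 for p in tags),
                key=lambda item: item[0],
            )
            step[t] = (prob, path + [t])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


TAGS = ["N", "V", "P", "D"]
# Made-up probabilities, for illustration only.
START_P = {"N": 0.6, "V": 0.2, "P": 0.1, "D": 0.1}
TRANS_P = {
    "N": {"N": 0.2, "V": 0.5, "P": 0.2, "D": 0.1},
    "V": {"N": 0.2, "V": 0.1, "P": 0.3, "D": 0.4},
    "P": {"N": 0.3, "V": 0.05, "P": 0.05, "D": 0.6},
    "D": {"N": 0.9, "V": 0.05, "P": 0.0, "D": 0.05},
}
EMIT_P = {
    "N": {"time": 0.4, "flies": 0.3, "arrow": 0.3},
    "V": {"time": 0.1, "flies": 0.5, "like": 0.4},
    "P": {"like": 1.0},
    "D": {"an": 1.0},
}

sentence = "time flies like an arrow".split()
print(viterbi(sentence, TAGS, START_P, TRANS_P, EMIT_P))
```

With these particular numbers the tagger settles on the obvious reading of the sentence; different corpus statistics could of course favor another.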
Such scenarios for common activities are called scripts (Schank and Abelson). These are just complex action descriptions. To the extent that events conform to a script, some connecting inference can be done. A standard example of script-based understanding is the following:

John went to a restaurant. He ordered a hamburger. The waiter brought it to him. Later, when the check came, he paid it and left.

Here are some examples of how the script knowledge helps:

- We assume that the "he" in "he paid it" is John.
- We assume that John ate the hamburger.
- We are unsurprised by the definite reference to "the waiter", even though he had not been previously introduced to us.
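One very simplified way to picture script-based inference is as a fixed sequence of events with role slots: events the story skips, but which lie between events it does mention, are assumed to have happened. The sketch below is only an illustration of that idea, not Schank and Abelson's formalism; the script contents, the role names, and the function `understand` are all assumptions made here.

```python
# Hypothetical restaurant script: ordered events, each with the roles involved.
RESTAURANT_SCRIPT = [
    ("enter", ["customer"]),
    ("order", ["customer", "food"]),
    ("serve", ["waiter", "food", "customer"]),
    ("eat", ["customer", "food"]),
    ("pay", ["customer", "check"]),
    ("leave", ["customer"]),
]


def understand(script, mentioned, bindings):
    """Return script events the story skipped, with role fillers substituted."""
    names = [event for event, _ in script]
    positions = [names.index(event) for event in mentioned]
    inferred = []
    # Only look between the first and last events the story mentions.
    for event, roles in script[min(positions):max(positions) + 1]:
        if event not in mentioned:
            # Unbound roles get a definite description, e.g. "the waiter".
            fillers = [bindings.get(role, "the " + role) for role in roles]
            inferred.append((event, fillers))
    return inferred


story_events = ["enter", "order", "serve", "pay", "leave"]
story_bindings = {"customer": "John", "food": "hamburger"}
inferred = understand(RESTAURANT_SCRIPT, story_events, story_bindings)
print(inferred)
```

The default filler "the waiter" loosely mirrors why a definite reference to a script role needs no prior introduction.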
Of course, not even all simple texts conform to scripts. However, it is still possible to make sense out of them by reasoning about the plans under which agents are operating. For example, consider the following stories:

John, the vice-president, wanted to become president. He went and got some arsenic.

Willa was hungry. She picked up the Michelin guide.

While the overall scenario may not be familiar, the individual components are. We should be able to fit these together to make sense out of a text. This is called plan recognition. Plan recognition is harder than script-based understanding, but somewhat more general. However, neither is anything like a complete solution to our problem.