Unit II
Syntax
• Parsing Natural Languages
• Tree Banks: A Data Driven Approach to Syntax
• Representation of Syntactic Structure
• Parsing Algorithms
• Models for Ambiguity Resolution in Parsing
• Multilingual Issues
Syntax-Introduction
• Parsing uncovers the hidden structure of linguistic input.
• In natural language applications, the predicate-argument structure of
sentences can be useful.
• In NLP the syntactic analysis of the input can range from:
• Very low level: POS tagging.
• Very high level: recovering a structural analysis that identifies the dependency
between predicates and arguments in the sentence.
• The major problem in parsing natural language is the problem of
ambiguity.
Parsing Natural Languages
• Let us look at the following spoken sentences:
• He wanted to go for a drive-in movie.
• He wanted to go for a drive in the country.
• There is a natural pause between “drive” and “in” in the second sentence.
• This gap reflects an underlying hidden structure to the sentence.
• Parsing provides a structural description that identifies such a break
in the intonation.
Parsing Natural Languages
• Let us look at another sentence:
• The cat who lives dangerously had nine lives.
• A text-to-speech system needs to know that the first instance of the
word lives is a verb and the second instance a noun.
• This is an instance of the POS tagging problem.
• Another application where parsing is important is text
summarization.
Parsing Natural Languages
• Let us look at examples for summarization:
• Beyond the basic level, the operations of the three products vary widely.
• The above sentence can be summarized as follows:
• The operations of the products vary.
• To do this task we first parse the first sentence.
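As a rough sketch of how a parse supports this kind of compression, the snippet below prunes selected constituents from a hand-written tree. The tree, the tag names, and the choice of which constituents to drop (the sentence-initial PP with its comma, plus any CD and ADVP) are illustrative assumptions, not the output of a real parser.

```python
# Hand-written parse of the example sentence (an assumption, not parser output).
# A tree is a tuple (label, child, ...); a leaf is a plain string.
TREE = ("S",
        ("PP", ("IN", "Beyond"),
               ("NP", ("DT", "the"), ("JJ", "basic"), ("NN", "level"))),
        (",", ","),
        ("NP", ("NP", ("DT", "the"), ("NNS", "operations")),
               ("PP", ("IN", "of"),
                      ("NP", ("DT", "the"), ("CD", "three"), ("NNS", "products")))),
        ("VP", ("VBP", "vary"), ("ADVP", ("RB", "widely"))),
        (".", "."))

def prune(tree, drop_anywhere, drop_at_top, top=True):
    """Return a copy of the tree without the selected constituents."""
    label, *children = tree
    kept = []
    for child in children:
        if isinstance(child, str):
            kept.append(child)          # keep words of surviving constituents
            continue
        if child[0] in drop_anywhere or (top and child[0] in drop_at_top):
            continue                    # delete the whole constituent
        kept.append(prune(child, drop_anywhere, drop_at_top, top=False))
    return (label, *kept)

def words(tree):
    """Read the words off the tree from left to right."""
    label, *children = tree
    out = []
    for child in children:
        out.extend([child] if isinstance(child, str) else words(child))
    return out

short = prune(TREE, drop_anywhere={"CD", "ADVP"}, drop_at_top={"PP", ","})
tokens = words(short)
sentence = " ".join(tokens[:-1]).capitalize() + tokens[-1]
print(sentence)  # The operations of the products vary.
```

Only the PP directly under S is dropped here; the PP "of the products" sits deeper in the tree and survives, which is why the deletion needs the structure and not just the tags.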
Parsing Natural Languages
• Deleting the circled constituents PP, CD and ADVP in the previous
diagram results in the short sentence.
• Let us look at another example:
• Open borders imply increasing racial fragmentation in EUROPEAN
COUNTRIES.
• In the above example the capitalized phrase can be replaced with
other phrases without changing the meaning of the sentence.
• Open borders imply increasing racial fragmentation in the countries of Europe.
• Open borders imply increasing racial fragmentation in European states.
• Open borders imply increasing racial fragmentation in Europe.
• Open borders imply increasing racial fragmentation in European nations.
• Open borders imply increasing racial fragmentation in the European countries.
Parsing Natural Languages
• In NLP syntactic parsing is used in many applications like:
• Statistical Machine Translation
• Information extraction from text collections
• Language summarization
• Producing entity grids for language generation
• Error correction in text
• Knowledge acquisition from language
Treebanks: A Data-Driven Approach to Syntax
The above CFG can produce the syntax analysis of a sentence like:
John bought a shirt with pockets
Treebanks: A Data-Driven Approach to Syntax
• Parsing the previous sentence gives us two possible derivations.
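The two derivations can be reproduced with a small chart parser. The sketch below counts parses with a CKY-style chart; the grammar in it is an assumed toy CFG in Chomsky normal form written for this example, since the grammar from the earlier slide is not shown here.

```python
from collections import defaultdict

# Assumed toy grammar (hypothetical, chosen so the PP can attach to NP or VP).
BINARY = [
    ("S", "NP", "VP"),
    ("NP", "D", "N"),
    ("NP", "NP", "PP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("PP", "P", "NP"),
]
LEXICAL = {
    "John": ["NP"], "pockets": ["NP"], "bought": ["V"],
    "a": ["D"], "shirt": ["N"], "with": ["P"],
}

def count_parses(words, start="S"):
    n = len(words)
    # chart[i][j][X] = number of distinct trees rooted in X over words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for tag in LEXICAL.get(w, []):
            chart[i][i + 1][tag] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):            # split point
                for lhs, left, right in BINARY:
                    l = chart[i][k].get(left, 0)
                    r = chart[k][j].get(right, 0)
                    if l and r:
                        chart[i][j][lhs] += l * r
    return chart[0][n].get(start, 0)

n_parses = count_parses("John bought a shirt with pockets".split())
print(n_parses)  # 2
```

The count of 2 comes from the PP "with pockets" attaching either to the noun phrase "a shirt" or to the verb phrase "bought a shirt".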
Treebanks: A Data-Driven Approach to Syntax
• Writing a CFG for the syntactic analysis of natural language is
problematic.
• A simple list of rules does not consider interactions between different
components in the grammar.
• Listing all possible syntactic constructions in a language is a difficult
task.
• It is difficult to exhaustively list lexical properties of words. This is a
typical knowledge acquisition problem.
• One more problem is that the rules interact with each other in
combinatorially explosive ways.
Treebanks: A Data-Driven Approach to Syntax
• Let us look at an example of noun phrases as a binary branching tree.
• N -> N N (recursive rule)
• N -> ‘natural’ | ‘language’ | ‘processing’ | ‘book’
• For the input ‘natural language processing’ the recursive rules
produce two ambiguous parses.
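The ambiguity can be made concrete by enumerating every binary bracketing that the recursive rule allows. This is a minimal sketch; nested tuples stand in for N nodes, which is a representation chosen here for brevity.

```python
def bracketings(words):
    """Yield every binary tree over the words that N -> N N can build."""
    if len(words) == 1:
        yield words[0]
        return
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                yield (left, right)

parses = list(bracketings(["natural", "language", "processing"]))
for p in parses:
    print(p)
# ('natural', ('language', 'processing'))
# (('natural', 'language'), 'processing')
```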
Treebanks: A Data-Driven Approach to Syntax
• For CFGs it can be proved that the number of parses obtained by
using the recursive rule n times is the Catalan number of n (1, 1, 2, 5, 14, 42,
132, 429, 1430, 4862, ...):
Cat(n) = (2n)! / ((n + 1)! n!)
• For the input “natural language processing book” only one of the
five parses obtained using the above CFG is correct:
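The growth in the number of parses can be checked numerically. The sketch below uses the standard closed form for the Catalan numbers; a phrase of k words applies the recursive rule k - 1 times, so the four-word input corresponds to n = 3.

```python
from math import comb

def catalan(n):
    """n-th Catalan number, C(2n, n) / (n + 1); the division is always exact."""
    return comb(2 * n, n) // (n + 1)

first_ten = [catalan(n) for n in range(10)]
print(first_ten)   # [1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]
print(catalan(3))  # 5 parses for a four-word noun compound
```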
Treebanks: A Data-Driven Approach to Syntax
• This is the second knowledge acquisition problem: we need to know
not only the rules but also which analysis is most plausible for a given
input sentence.
• The construction of a treebank is a data-driven approach to syntax
analysis that allows us to address both knowledge acquisition
bottlenecks in one stroke.
• A treebank is a collection of sentences where each sentence is
provided a complete syntax analysis.
• The syntax analysis for each sentence has been judged by a human
expert.
Treebanks: A Data-Driven Approach to Syntax
• A set of annotation guidelines is written before the annotation process to
ensure a consistent scheme of annotation throughout the treebank.
• No set of syntactic rules is provided by a treebank.
• No exhaustive set of rules is assumed to exist, even though assumptions
about syntax are implicit in a treebank.
• The consistency of syntax analysis in a treebank is measured using
inter-annotator agreement: approximately 10% of the material is
annotated by more than one annotator.
• Treebanks provide annotations of syntactic structure for a large sample of
sentences.
Treebanks: A Data-Driven Approach to Syntax
• A supervised machine learning method can be used to train the parser.
• Treebanks solve the first knowledge acquisition problem of finding the
grammar underlying the syntax analysis because the analysis is directly
given instead of a grammar.
• The second problem of knowledge acquisition is also solved by
treebanks.
• Each sentence in a treebank has been given its most plausible syntactic
analysis.
• Supervised learning algorithms can be used to learn a scoring function
over all possible syntax analyses.
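One simple way to obtain such a scoring function is sketched below: rule probabilities are estimated from a treebank by relative frequency, and a candidate tree is scored by the product of its rule probabilities. This is only one possible choice of scoring function; the three-tree treebank, the helper names, and the decision to ignore lexical rules are assumptions made to keep the example short.

```python
from collections import Counter

def rules(tree):
    """Yield (parent, child labels) for every node whose children are all subtrees."""
    label, *children = tree
    if all(isinstance(c, tuple) for c in children):
        yield (label, tuple(c[0] for c in children))
        for c in children:
            yield from rules(c)

def np_attachment(w):
    """Tree shape in which the PP modifies the object noun phrase."""
    s, v, d, n, p, o = w
    return ("S", ("NP", s),
            ("VP", ("V", v),
                   ("NP", ("NP", ("D", d), ("N", n)),
                          ("PP", ("P", p), ("NP", o)))))

def vp_attachment(w):
    """Tree shape in which the PP modifies the verb phrase."""
    s, v, d, n, p, o = w
    return ("S", ("NP", s),
            ("VP", ("VP", ("V", v), ("NP", ("D", d), ("N", n))),
                   ("PP", ("P", p), ("NP", o))))

# Hypothetical miniature treebank: one human-approved tree per sentence.
TREEBANK = [
    np_attachment("Mary wore a coat with buttons".split()),
    np_attachment("Ann read a book about parsing".split()),
    vp_attachment("Tom ate a pizza with friends".split()),
]

def train(treebank):
    """Relative-frequency estimate of P(rule | parent label)."""
    counts = Counter(r for tree in treebank for r in rules(tree))
    totals = Counter()
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {r: c / totals[r[0]] for r, c in counts.items()}

def score(tree, probs):
    """Product of rule probabilities; unseen rules score 0."""
    s = 1.0
    for r in rules(tree):
        s *= probs.get(r, 0.0)
    return s

probs = train(TREEBANK)
sentence = "John bought a shirt with pockets".split()
candidates = [np_attachment(sentence), vp_attachment(sentence)]
best = max(candidates, key=lambda t: score(t, probs))
print([round(score(t, probs), 4) for t in candidates])  # [0.18, 0.1125]
```

With this toy treebank the noun-phrase attachment wins because NP -> NP PP was observed more often, relative to other NP rules, than VP -> VP PP was relative to other VP rules.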
Treebanks: A Data-Driven Approach to Syntax
• For new input at run time, the parser uses the scoring function to return
the syntax analysis that has the highest score.
• The two main approaches to syntax analysis used to construct
treebanks are:
• Dependency graphs
• Phrase structure trees
• These two representations are very closely connected to each other and
under some assumptions one can be converted to the other.
• Dependency analysis is used for free word order languages like Indian
languages.
• Phrase structure analysis is used to provide additional information about
long distance dependencies for languages like English and French.
Treebanks: A Data-Driven Approach to Syntax
• In the discussion to follow we examine three main components for
building a parser:
• The representation of syntactic structure: it involves the use of a
varying amount of linguistic knowledge to build a treebank.
• The training and decoding algorithms: they deal with the potentially
exponential search space.
• Methods to model ambiguity: they provide a way to rank parses to recover
the most likely parse.
Representation of Syntactic Structure
• Syntax Analysis using dependency graphs
• Syntax Analysis using phrase structure trees
Syntax Analysis using Dependency Graphs
• In dependency graphs the head of a phrase is connected with the
dependents in that phrase using directed connections.
• The head-dependent relationship can be semantic (head-modifier) or
syntactic (head-specifier).
• The main difference between dependency graphs and phrase structure
trees is that dependency analysis makes minimal assumptions about
syntactic structure.
• Dependency graphs treat the words in the input sentence as the only
vertices in the graph which are linked together by directed arcs
representing syntactic dependencies.
Syntax Analysis using Dependency Graphs
• One typical definition of dependency graph is as follows:
• In dependency syntactic parsing the task is to derive a syntactic structure for an
input sentence by identifying the syntactic head of each word in the sentence.
• The nodes are the words of the input sentence and the arcs are the binary relations
from head to dependent.
• It is often assumed that all words except one have a syntactic head.
• It means that the graph will be a tree with the single independent node as the root.
• In labeled dependency parsing the parser assigns a specific type to each dependency
relation holding between head word and dependent word.
• In the current discussion we consider only dependency trees,
where each word depends on exactly one parent: either another word or a
dummy root symbol.
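The definition above maps naturally onto a list of head indices. The sketch below stores one head per word, with 0 standing for the dummy root symbol, and checks that the result is a tree. The example sentence and its head assignments are assumptions made for illustration, not an analysis taken from a treebank.

```python
def is_dependency_tree(heads):
    """heads[i-1] is the head of word i (1-based); 0 is the dummy root symbol."""
    n = len(heads)
    # Every head must be a valid position, and no word may be its own head.
    if any(h < 0 or h > n or h == i for i, h in enumerate(heads, start=1)):
        return False
    # Exactly one word attaches to the root symbol.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Following heads upward from any word must reach the root (no cycles).
    for i in range(1, n + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

words = ["John", "bought", "a", "shirt", "with", "pockets"]
heads = [2, 0, 4, 2, 4, 5]   # assumed analysis: "with" modifies "shirt"

for i, (w, h) in enumerate(zip(words, heads), start=1):
    head_word = "ROOT" if h == 0 else words[h - 1]
    print(f"{i}\t{w}\t<- {head_word}")

print(is_dependency_tree(heads))      # True
print(is_dependency_tree([2, 1, 0]))  # False: words 1 and 2 form a cycle
```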
Syntax Analysis using Dependency Graphs
• In dependency trees the 0 index is used to indicate the root symbol
and the directed arcs are drawn from head word to the dependent
word.
Syntax Analysis using Dependency Graphs
• In the figure in the previous slide [fakulte,N3,7] is the seventh word in
the sentence with POS tag N3 and it has dative case.
• Here is a textual representation of a labeled dependency tree:
Syntax Analysis using Dependency Graphs
• An important notion in dependency analysis is the notion of projectivity.
• To test projectivity, we put the words in linear order as they appear in
the sentence, with the root symbol in the first position.
• A dependency tree is projective if the dependency arcs can then be drawn
above the words without any crossing dependencies.
• Example:
Syntax Analysis using Dependency Graphs
• Let us look at an example where a sentence contains an extraposition
to the right of a noun phrase modifier, which requires a
crossing dependency.
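Projectivity can be tested mechanically by looking for a pair of arcs that cross. The sketch below assumes the tree is given as a list of head positions with the root symbol at position 0. Both example sentences and their head assignments are illustrative assumptions and are not the examples from the original figures.

```python
def is_projective(heads):
    """heads[i-1] is the head position of word i; position 0 is the root symbol."""
    # Store each arc as (left end, right end), ignoring its direction.
    arcs = [tuple(sorted((h, d))) for d, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            # Arcs cross when one starts strictly inside the other and ends outside it.
            if a < c < b < d:
                return False
    return True

# "John bought a shirt with pockets" (assumed heads): no arcs cross.
print(is_projective([2, 0, 4, 2, 4, 5]))        # True

# "A hearing is scheduled on the issue today" (assumed heads):
# "on the issue" modifies "hearing" but follows "scheduled",
# so the arc hearing -> on crosses the arc scheduled -> today.
print(is_projective([2, 3, 0, 3, 2, 7, 5, 4]))  # False
```

The check compares every pair of arcs, which is quadratic in sentence length; that is adequate for illustration.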
Syntax Analysis using Dependency Graphs
• English has very few cases in a treebank that need such a non-projective
analysis.
• In languages like Czech, Turkish and Telugu the number of non-projective
dependencies is much higher.
• Let us look at a multilingual comparison of crossing dependencies across
a few languages:
• Ar=Arabic;Ba=Basque;Ca=Catalan;Ch=Chinese;Cz=Czech;En=English;
Gr=Greek;Hu=Hungarian;It=Italian;Tu=Turkish
Syntax Analysis using Dependency Graphs
• Dependency graphs in treebanks do not explicitly distinguish between
projective and non-projective dependency tree analyses.
• Parsing algorithms are sometimes forced to distinguish between
projective and non-projective dependencies.
• Let us try to set up dependency links in a CFG.
Syntax Analysis using Dependency Graphs