Session 7 - Syntax Parsing


HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

Syntax Parsing

Lecturer: Dr. Bùi Thanh Hùng


Data Science Department
Faculty of Information Technology
Industrial University of Ho Chi Minh City
Email: [email protected]
Website: https://sites.google.com/site/hungthanhbui1980/
Examples

Build a parse tree for the following sentence:

I gave him the book

Examples

If (b==0) a=b;
How it works
(Diagram: Rule Table, Parser)
Syntax Parser

A parse tree typically has the following components:
- Tree structure (there are many data structure options)
- Structure of a tree node:
  • Symbol at the current node
  • The associated word (if it is a leaf node)
  • List of child nodes
  • Additional information, for further analysis
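A minimal sketch of such a node structure in Python (the class, field, and symbol names are illustrative assumptions, not part of the lecture):

class ParseNode:
    """One node of a parse tree."""
    def __init__(self, symbol, word=None, children=None, info=None):
        self.symbol = symbol             # grammar symbol at this node
        self.word = word                 # associated word/token if this is a leaf
        self.children = children or []   # list of child ParseNode objects
        self.info = info or {}           # extra annotations for further analysis

    def is_leaf(self):
        return not self.children

# Example tree for "I gave him the book"
tree = ParseNode("S", children=[
    ParseNode("NP", children=[ParseNode("PRP", word="I")]),
    ParseNode("VP", children=[
        ParseNode("V", word="gave"),
        ParseNode("NP", children=[ParseNode("PRP", word="him")]),
        ParseNode("NP", children=[ParseNode("DT", word="the"),
                                  ParseNode("N", word="book")]),
    ]),
])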
Steps to build a Syntax Parser
Describe the source language grammar rules.
These descriptions may initially be in natural language:
- Specify the meaning of non-terminal symbols
- Translate them into rigorous grammar rules
- Analyze the grammar to select the most appropriate parsing method:
  + Is the grammar ambiguous?
  + Is the grammar left-recursive?
  + How complex is the grammar?
Steps to build a Syntax Parser
Selecting the appropriate parsing method:
Building a direct parser (for languages with simple complexity)
Building a two-stage parser:
- Based on the input grammar, build a prediction automaton
- Use the automaton to process the token sequence coming from the lexical analyzer
Building a universal parser: in case the grammar is too complex, universal parsing methods can be used to build the parser
Steps to build a Syntax Parser
Choosing the right parsing method
Choosing how to handle syntactic errors, generating correction suggestions, and handling situations that require semantic composition
Inference

Definition: αAβ ⇒ αγβ (we say that αAβ directly derives αγβ) if A → γ is a production and α, β are strings of symbols belonging to the language L.
For example, with the production A → ab we have xAy ⇒ xaby.
Inference

Notation:
⇒    direct derivation
⇒*   derivation through 0 or more steps
⇒+   derivation through 1 or more steps
Some properties:
- if α ⇒* β and β ⇒* γ, then α ⇒* γ
Left and right inference
The problem of parsing is essentially the problem of finding the derivation string S ⇒ α ⇒ … ⇒ β, where:
- S: the start symbol
- α: an intermediate string, still containing non-terminal symbols
- β: the string consisting only of terminal symbols
Left and right inference
It is easy to see in the above derivation process:
▪ There are many derivation options from S to β
▪ An intermediate symbol appearing in α must sooner or later be transformed by some production rule
▪ If the intermediate symbol chosen for transformation is always the leftmost one in α, then we call this option a left derivation
▪ The right derivation is defined similarly
Left and right inference

Given a grammar G with its productions and an input string w, find a left derivation from S into w.
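For illustration (a toy grammar, not the one from the slide): with the productions E → E + E | id and the input string w = id + id,

Left derivation:  E ⇒ E + E ⇒ id + E ⇒ id + id
Right derivation: E ⇒ E + E ⇒ E + id ⇒ id + id

In the left derivation the leftmost non-terminal is rewritten at every step; in the right derivation, the rightmost one.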


Left and right inference

What encoding should we use to store the derivation, and how should we use that information to recover the derivation?
Parse tree
The parse tree represents the structure of an inference:
- The root node is the start symbol
- The leaf nodes are always terminal symbols
- The internal nodes are always intermediate (non-terminal) symbols
The tree does not represent the order in which the direct inferences are executed:
- Traversing the tree recovers an order in which the inferences can be executed
- Left inference is equivalent to traversing the tree in root-left-right (pre-order) order
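A small follow-up sketch (assuming the illustrative ParseNode class above): a root-left-right traversal reads off the order in which a leftmost derivation expands the non-terminals, while the leaves read off the input sentence.

def preorder(node):
    # Visit the node itself first, then its children from left to right
    yield node
    for child in node.children:
        yield from preorder(child)

expansion_order = [n.symbol for n in preorder(tree) if not n.is_leaf()]
sentence = [n.word for n in preorder(tree) if n.is_leaf()]
print(expansion_order)   # ['S', 'NP', 'VP', 'NP', 'NP']
print(sentence)          # ['I', 'gave', 'him', 'the', 'book']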
Inference vs Tree structure
• A derivation is a one-dimensional representation of the information
• A tree structure is a two-dimensional representation of the information
• The tree structure illustrates the relationships between components in a spatial structure
• A parse tree most fully describes the transformation from the start symbol to the string being analyzed, and is suitable for most purposes
• A syntax tree eliminates the intermediate components and focuses on describing the relationships between the terminal symbols; this structure is suitable for information aggregation
Ambiguous grammar
A loosely specified grammar leads to many different parse trees for a given input string
Ambiguous grammar

A grammar in which some string w has two or more corresponding parse trees is called an ambiguous grammar.
The biggest problem with an ambiguous grammar is the polysemy of the string w (there are many ways to understand it), which makes it impossible to compute the exact semantics of w.
Eliminate Ambiguity
Disambiguation actually creates a new grammar based on the old grammar, but the two grammars are not exactly equivalent.
There are many disambiguation strategies, such as:
- Adding intermediate variables
- Introducing constraints outside the grammar (e.g. specifying the precedence of operators)
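As an illustration (a standard textbook example, not taken from these slides): the grammar E → E + E | E * E | id is ambiguous, because the string id + id * id has two parse trees, one grouping id + id first and one grouping id * id first. Adding intermediate variables that encode operator precedence removes the ambiguity:

E → E + T | T
T → T * F | F
F → id

With the rewritten grammar, id + id * id has exactly one parse tree, in which * binds more tightly than +.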
Parsing strategies
Parsing strategies are divided into 3 groups:
• Trial-and-error (backtracking) strategies: top-down, bottom-up
• Dynamic programming strategies: CYK, Earley, …
• Deterministic strategies: LL, LR, …
Parsing strategies

There are also other methods that rely on the characteristics of the language to apply effective techniques (e.g., operator-precedence parsing).
Some methods are general (e.g., Earley), but most methods only work with grammars that have specific characteristics.
Trial and error strategy
• Trial-and-error (backtracking) strategies are the first to come to mind when solving the parsing problem.
• These strategies simply involve trying out the production rules until the target derivation sequence is achieved.
Trial and error strategy
Divided into 2 opposite approaches:
Top-down method:
• Looking at the parse tree, go from top to bottom
• Try to gradually transform S into w
Bottom-up method:
• Looking at the parse tree, go from bottom to top
• Try to reduce from w back to S
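A minimal sketch of the top-down trial-and-error idea in Python, for an illustrative non-left-recursive toy grammar (the grammar and names are assumptions, not from the slides):

# Toy grammar:  S -> 'a' S 'b' | 'a' 'b'   (the language a^n b^n, n >= 1)
GRAMMAR = {"S": [["a", "S", "b"], ["a", "b"]]}

def derives(symbols, tokens):
    """Backtracking recognizer: can the symbol list derive exactly the token list?"""
    if not symbols:
        return not tokens                 # success only if the whole input is consumed
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                  # non-terminal: try each alternative in turn (backtracking)
        return any(derives(prod + rest, tokens) for prod in GRAMMAR[first])
    # terminal: must match the next input token
    return bool(tokens) and tokens[0] == first and derives(rest, tokens[1:])

print(derives(["S"], list("aabb")))   # True
print(derives(["S"], list("abb")))    # False

Note that a left-recursive rule such as S -> S 'a' would make this procedure loop forever, which is exactly the restriction mentioned on the next slide.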
Trial and error strategy
Both methods have restrictions on the input grammar:
• Top-down requires the input grammar to be non-left-recursive
• Bottom-up requires the input grammar to have no empty productions (A → ε) and no derivations of the form A ⇒+ A
Trial and error strategy
Both methods have input grammar constraints.
These strategies are only theoretically meaningful, because they are slow and grammar-constrained, but trial-and-error provides many suggestions for other algorithms:
• Removing grammar constraints: universal methods
• Removing backtracking: deterministic methods
Dynamic strategy
The idea of dynamic programming aims to:
• Build methods that have no input grammar restrictions
• Store analyzed substrings to avoid backtracking
Dynamic strategy
CYK Algorithm:
• Transform the grammar to Chomsky normal form
• The grammar has no empty (ε) productions
• No backtracking
Dynamic strategy
CYK Algorithm
Let w be the string of length n to be parsed, and let G be the set of rules in our grammar with start symbol S.

Time and Space Complexity:
• Time complexity: O(n³ · |G|)
• Space complexity: O(n²)
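A compact sketch of the CYK recognizer in Python; the grammar is assumed to be already in Chomsky normal form, and the dictionary encoding is an illustrative choice:

from itertools import product

# Illustrative CNF grammar: S -> A B, A -> 'a', B -> 'b'
UNARY  = {"a": {"A"}, "b": {"B"}}        # terminal -> non-terminals that produce it
BINARY = {("A", "B"): {"S"}}             # (B, C)   -> non-terminals X with X -> B C

def cyk(words, start="S"):
    n = len(words)
    # table[i][j] = set of non-terminals that derive words[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i] = set(UNARY.get(w, set()))
    for span in range(2, n + 1):          # length of the substring
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):         # split point between the two halves
                for B, C in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((B, C), set())
    return start in table[0][n - 1]

print(cyk(list("ab")))   # True
print(cyk(list("ba")))   # False

The three nested loops over (span, i, k) give the O(n³) time bound; the table itself gives the O(n²) space bound.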
Earley

▪ Problems with left recursion in top-down parsing
  – VP → VP PP
▪ Background
  – Developed by Jay Earley in 1970
  – No need to convert the grammar to CNF
  – Left to right
▪ Complexity
  – Faster than O(n³) in many cases
  – More versatile: no grammar constraints, no backtracking
Earley

Looks for both full and partial constituents
• When reading word k, it has already identified all hypotheses that are consistent with words 1 to k-1
• Example:
  – S [i,j] → Aux . NP VP
  – NP [j,k] → N
  – S [i,k] → Aux NP . VP
Earley

• It uses a dynamic programming table, just like CKY
• Example entry in column 1:
  – [0:1] VP -> VP . PP
  – Created when processing word 1
  – Corresponds to words 0 to 1 (these words correspond to the VP part of the RHS of the rule)
  – The dot separates the completed (known) part from the incomplete (and possibly unattainable) part
Earley

Three types of entries:
– ‘scan’ – for words
– ‘predict’ – for non-terminals
– ‘complete’ – otherwise
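A compact sketch of an Earley recognizer in Python built from the three operations above (predict, scan, complete); the toy grammar for "take this book" is an illustrative assumption, not the lecture's grammar:

from collections import namedtuple

# A chart entry: rule lhs, rule rhs, dot position, start index of the span
Item = namedtuple("Item", "lhs rhs dot start")

GRAMMAR = {
    "S":   [("VP",)],
    "VP":  [("V", "NP")],
    "NP":  [("Det", "N")],
    "V":   [("take",)],
    "Det": [("this",)],
    "N":   [("book",)],
}

def earley(words, start="S"):
    # chart[k] = set of items ending at position k (grammar assumed to have no ε productions)
    chart = [set() for _ in range(len(words) + 1)]
    chart[0] = {Item(start, rhs, 0, 0) for rhs in GRAMMAR[start]}
    for k in range(len(words) + 1):
        agenda = list(chart[k])
        while agenda:
            item = agenda.pop()
            if item.dot < len(item.rhs):
                sym = item.rhs[item.dot]
                if sym in GRAMMAR:                        # PREDICT: expand a non-terminal
                    for rhs in GRAMMAR[sym]:
                        new = Item(sym, rhs, 0, k)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
                elif k < len(words) and words[k] == sym:  # SCAN: match the next word
                    chart[k + 1].add(Item(item.lhs, item.rhs, item.dot + 1, item.start))
            else:                                         # COMPLETE: advance items waiting on this symbol
                for old in list(chart[item.start]):
                    if old.dot < len(old.rhs) and old.rhs[old.dot] == item.lhs:
                        new = Item(old.lhs, old.rhs, old.dot + 1, old.start)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
    return any(it.lhs == start and it.dot == len(it.rhs) and it.start == 0
               for it in chart[len(words)])

print(earley(["take", "this", "book"]))   # True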
Earley
Example, for the input "take this book" (the completer combines the first two entries to produce the third):
  S [i,j] → Aux . NP VP
  NP [j,k] → N
  S [i,k] → Aux NP . VP
Deterministic strategy
The idea of determinism goes the other way: sacrificing richness of grammar for speed.
Deterministic strategy
Common features:
• Grammars have certain constraints
• Based on pre-analysis of the grammar to predict possible situations
• Building tables of options, which indicate what to do when encountering specific situations
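A minimal sketch of the table-driven idea in Python (LL(1) style); the toy grammar and the hand-written table are illustrative assumptions:

# Illustrative LL(1) grammar:  S -> 'a' S 'b' | 'c'
# In practice the table is pre-computed from FIRST/FOLLOW sets; here it is written by hand.
TABLE = {
    ("S", "a"): ["a", "S", "b"],
    ("S", "c"): ["c"],
}
NONTERMINALS = {"S"}

def ll_parse(tokens, start="S"):
    tokens = list(tokens) + ["$"]        # append an end-of-input marker
    stack = ["$", start]                 # parse stack with the start symbol on top
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look)) # the table decides the production, no backtracking
            if rhs is None:
                return False             # syntax error: no applicable rule
            stack.extend(reversed(rhs))  # push the RHS with its leftmost symbol on top
        elif top == look:
            pos += 1                     # match a terminal (or the '$' marker)
        else:
            return False
    return pos == len(tokens)

print(ll_parse("aacbb"))   # True
print(ll_parse("aacb"))    # False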
Deterministic strategy
This is the strategy that all compilers use
due to its speed advantage (no
backtracking).
