
LESSON 16

Overview of Previous Lesson(s)

Overview
• For a deterministic finite automaton M, the minimum number of
states in any equivalent deterministic finite automaton is the same
as the number of equivalence groups of M's states.

Final states = A = {s2, s7}
Non-final states = B = {s0, s1, s3, s4, s5, s6}

Overview..
• The following table shows the result of applying the inputs a and b to
these states.

• The input a leads from s1 to s5 in group B, and input b leads to s2 in
group A.

• Looking at the table, we find that the input b distinguishes two of the
states (s1 and s6) from the rest of the group, since it leads to group A
for these two instead of group B.
Overview...
• The states in the set {s0, s3, s4, s5} cannot be equivalent to those in
the set {s1, s6}, and we must partition B into two groups.

• Now we have the groups:

A = {s2, s7}, B = {s0, s3, s4, s5}, C = {s1, s6}

• The next examination of where the inputs lead shows us that s3 is
not equivalent to the rest of group B.

• Continuing this process until we cannot distinguish between the states
in any group by employing our input tests, we end up with the groups:

A = {s2, s7}, B = {s0, s4, s5}, C = {s1}, D = {s3}, E = {s6}

Minimizing DFA States...

• In view of the above theoretical definitions and results, it is easy to
argue that all of the states in each group are equivalent, because they
all go to the same groups under the inputs a and b.
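
The group-splitting process above can be written down directly. The
following Python code is a minimal sketch of the idea (an illustration, not
the lecture's exact algorithm): start from the partition {final, non-final}
and split a group whenever some input sends two of its states to different
groups. The four-state DFA at the bottom is hypothetical, chosen only to
make the sketch runnable.

# Partition refinement: split groups until no input distinguishes
# two states that share a group.
def minimize_groups(states, alphabet, delta, finals):
    groups = [set(finals), set(states) - set(finals)]
    changed = True
    while changed:
        changed = False
        new_groups = []
        for g in groups:
            # Two states stay together only if every input symbol
            # sends them to the same current group.
            def signature(s):
                return tuple(
                    next(i for i, h in enumerate(groups) if delta[s][a] in h)
                    for a in alphabet)
            buckets = {}
            for s in g:
                buckets.setdefault(signature(s), set()).add(s)
            if len(buckets) > 1:
                changed = True
            new_groups.extend(buckets.values())
        groups = new_groups
    return groups

# Hypothetical DFA over {a, b} accepting strings that end in "ab".
delta = {0: {"a": 1, "b": 0},
         1: {"a": 1, "b": 2},
         2: {"a": 1, "b": 0},
         3: {"a": 1, "b": 2}}   # state 3 behaves exactly like state 1
print(minimize_groups([0, 1, 2, 3], "ab", delta, finals={2}))
# states 1 and 3 end up in the same equivalence group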

Overview...
• Transition Diagrams:

• The behavior of a lexical analyzer can often be described by a
transition diagram.

• These diagrams have states, each of which represents something
about the history of the characters seen during the current search for
a lexeme that matches one of the possible patterns.

• There are arrows, or transitions, from one state to another, each of
which indicates the possible next input characters that cause the
lexical analyzer to make that change of state.

Overview...
• Finite Automata:

• These are a formalization of transition diagrams that includes a
designation of a start state and one or more accepting states, as well
as the set of states, input characters, and transitions among states.

• Accepting states indicate that the lexeme for some token has been
found.

• Unlike transition diagrams, finite automata can make transitions on
empty input as well as on input characters.

Overview...
• DFA:
• A DFA is a special kind of finite automaton that has exactly one
transition out of each state for each input symbol.

• Transitions on empty input are disallowed (see the sketch below).

• NFA:
• Automata that are not DFAs are called nondeterministic.

• NFAs are often easier to design than DFAs.

• They allow ε-transitions.
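
Because a DFA has exactly one transition per state and input symbol,
running one is plain table lookup. Here is a minimal Python sketch; the
DFA (recognizing unsigned integers), its state names, and the convention
that missing table entries stand for a dead state are all assumptions
made for illustration.

# Transition table: (state, character class) -> next state.
# Entries absent from the table represent the dead (error) state.
DELTA = {("start", "digit"): "in_num",
         ("in_num", "digit"): "in_num"}
ACCEPTING = {"in_num"}

def accepts(text):
    state = "start"
    for ch in text:
        cls = "digit" if ch.isdigit() else "other"
        state = DELTA.get((state, cls))   # exactly one possible move
        if state is None:                 # dead state: reject at once
            return False
    return state in ACCEPTING

print(accepts("2024"))   # True
print(accepts("20x4"))   # False
print(accepts(""))       # False: the start state is not accepting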
Overview...
• Lex:
• There is a family of software systems, including Lex and Flex, that are
lexical-analyzer generators.

• The user specifies the patterns for tokens using an extended regular-
expression notation. Lex converts these expressions into a lexical
analyzer that is essentially a DFA that recognizes any of the patterns.
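
Lex itself works from a specification file, but its effect can be roughly
sketched with Python's re module: the token patterns are combined into one
recognizer, and the analyzer repeatedly emits the next (token name, lexeme)
pair. The token names and patterns below are illustrative assumptions, not
Lex's input format.

import re

# Illustrative token patterns, tried in priority order.
TOKEN_SPEC = [("NUM",  r"\d+"),
              ("ID",   r"[A-Za-z_]\w*"),
              ("OP",   r"[+*/()=-]"),
              ("SKIP", r"\s+")]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(text):
    # Each match records which named pattern fired (m.lastgroup).
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":          # discard whitespace
            yield m.lastgroup, m.group()

print(list(tokens("rate = rate + 60")))
# [('ID', 'rate'), ('OP', '='), ('ID', 'rate'), ('OP', '+'), ('NUM', '60')]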

• Minimization of Finite Automata:

• For every DFA there is a minimum-state DFA accepting the same
language.

TODAY’S LESSON

Contents
• Syntax Analysis
  • Introduction
  • The Role of the Parser
  • Representative Grammars
  • Syntax Error Handling
  • Error-Recovery Strategies
• Context-Free Grammars
  • Formal Definition of a CFG
  • Notational Conventions
  • Derivations
  • Parse Trees and Derivations
  • Ambiguity

Syntax Analysis
• In this section we will see parsing methods that are typically used
in compilers.

• Initially we discuss basic concepts, then techniques suitable for hand
implementation, and finally algorithms that have been used in
automated tools.

• Since programs may contain syntactic errors, we discuss extensions of
the parsing methods for recovery from common errors.

• The syntax of programming language constructs can be specified by
context-free grammar (CFG, also known as Backus-Naur Form) notation.
Syntax Analysis..
• Grammars offer significant benefits for both language designers and
compiler writers.

• A grammar gives a precise syntactic specification of a programming
language.

• From certain classes of grammars, we can automatically construct an
efficient parser that determines the syntactic structure of a source
program.

• The structure imparted to a language by a properly designed grammar
is useful for translating source programs into correct object code and
for detecting errors.
Role of the Parser
• In the compiler model, the parser obtains a string of tokens from the
lexical analyzer and verifies that the string of token names can be
generated by the grammar for the source language.

Role of the Parser..
• There are three general types of parsers for grammars: universal,
top-down, and bottom-up.

• Universal parsing methods such as the Cocke-Younger-Kasami
algorithm and Earley's algorithm can parse any grammar. These
general methods are, however, too inefficient to use in production
compilers.

• The methods commonly used in compilers are either top-down or
bottom-up.

• Top-down methods build parse trees from the top (root) to the bottom
(leaves), while bottom-up methods start from the leaves and work
their way up to the root.
Representative Grammars
• Expressions with + and *

E → E + T | T
T → T * F | F
F → ( E ) | id

• This grammar takes care of precedence but, as we saw before, gives us
trouble for top-down parsing since it is left-recursive.

• So we use the following non-left-recursive grammar, which generates
the same language.

Representative Grammars..
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
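
Because this grammar is non-left-recursive, it can be parsed top-down by
straightforward recursive descent: one function per non-terminal. Here is
a minimal Python sketch; the token representation (plain strings such as
"id" and "+") is an assumption made for illustration.

class Parser:
    def __init__(self, toks):
        self.toks, self.pos = toks, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def E(self):                       # E  -> T E'
        self.T(); self.Eprime()

    def Eprime(self):                  # E' -> + T E' | epsilon
        if self.peek() == "+":
            self.eat("+"); self.T(); self.Eprime()

    def T(self):                       # T  -> F T'
        self.F(); self.Tprime()

    def Tprime(self):                  # T' -> * F T' | epsilon
        if self.peek() == "*":
            self.eat("*"); self.F(); self.Tprime()

    def F(self):                       # F  -> ( E ) | id
        if self.peek() == "(":
            self.eat("("); self.E(); self.eat(")")
        else:
            self.eat("id")

p = Parser(["id", "+", "id", "*", "id"])
p.E()
print(p.peek() is None)   # True: all input consumed, string accepted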
• The following ambiguous grammar will be used for illustration, but in
general we try to avoid ambiguity.

E → E + E | E * E | ( E ) | id

• This grammar enforces neither precedence nor left vs. right
associativity. For example, id + id + id and id * id + id each have two
parse trees, as shown below.
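
To see the ambiguity concretely, here are the two parse trees for
id + id + id written as nested Python tuples; this representation is our
own choice for illustration. Both trees yield the same token string.

left_grouping  = ("+", ("+", "id", "id"), "id")   # (id + id) + id
right_grouping = ("+", "id", ("+", "id", "id"))   # id + (id + id)
print(left_grouping != right_grouping)            # True: two distinct trees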

Syntax Error Handling
• This lesson covers syntactic errors and general strategies for error
recovery, in particular panic-mode and phrase-level recovery.

• Common programming errors can occur at many different levels.

• Lexical errors include misspellings of identifiers, keywords, or
operators (e.g., the use of the identifier elipseSize instead of
ellipseSize) and missing quotes around text intended as a string.

• Semantic errors include type mismatches between operators and
operands. An example is a return statement in a Java method with
result type void.
Syntax Error Handling..
• Syntactic errors include misplaced semicolons or extra or missing
braces, that is, "{" or "}".
As another example, in C or Java, the appearance of a case statement
without an enclosing switch is a syntactic error.

• Logical errors can be anything from incorrect reasoning on the part of
the programmer to the use in a C program of the assignment
operator = instead of the comparison operator ==.

Syntax Error Handling...
• The error handler in a parser has goals that are simple to state but
challenging to realize:

• Report the presence of errors clearly and accurately.

• Recover from each error quickly enough to detect subsequent errors.

• Add minimal overhead to the processing of correct programs.

Error Recovery Strategies
• Once an error is detected, how should the parser recover?

• The simplest approach is for the parser to quit with an informative
error message when it detects the first error.

• Error-recovery strategies (a sketch of panic mode follows this list):

• Trivial Approach: No Recovery
• Print an error message when parsing cannot continue and then
terminate parsing.

• Panic-Mode Recovery
• The parser discards input until it encounters a synchronizing token.
These tokens are chosen so that the parser can make a fresh beginning.
Good examples are ; and }.
Error Recovery Strategies..
• Phrase-Level Recovery
• Locally replace some prefix of the remaining input by some string.
Simple cases are exchanging ; with , and = with ==.
Difficulties occur when the real error happened long before it was
detected.

• Error Productions
• Include productions for common errors.

• Global Correction
• Change the input I to the closest correct input I' and produce the
parse tree for I'.
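
Here is a minimal Python sketch of panic-mode recovery (an illustration,
not the lecture's code). The toy statement form id = id ; is an assumption,
and ; serves as the synchronizing token: after an error, the parser skips
ahead to it and makes a fresh start on the next statement.

def parse_stmt(toks, pos):
    # Parse one toy statement of the form: id = id ;
    for expected in ("id", "=", "id", ";"):
        if pos >= len(toks) or toks[pos] != expected:
            raise SyntaxError(f"expected {expected!r} at token {pos}")
        pos += 1
    return pos

def parse_program(toks):
    pos, errors = 0, 0
    while pos < len(toks):
        try:
            pos = parse_stmt(toks, pos)
        except SyntaxError as err:
            errors += 1
            print("error:", err)
            while pos < len(toks) and toks[pos] != ";":
                pos += 1              # panic: discard up to the sync token
            pos += 1                  # consume ";" and resume parsing
    return errors

toks = ["id", "=", "id", ";",   # fine
        "id", "+", "id", ";",   # error: recovery skips to ";"
        "id", "=", "id", ";"]   # parsed normally after recovery
print("errors:", parse_program(toks))   # errors: 1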

CFG
• Grammars are used to systematically describe the syntax of
programming language constructs like expressions and statements.

stmt → if ( expr ) stmt else stmt

• The syntactic variable stmt is used to denote statements, and the
variable expr to denote expressions.

• Other productions then define precisely what an expr is and what else
a stmt can be.

• A language generated by a (context-free) grammar is called a
context-free language.
CFG Definition
• A context-free grammar (grammar for short) consists of terminals,
non-terminals, a start symbol, and productions.

• Terminals:
• The basic components found by the lexer.
• They are sometimes called token names, i.e., the first component of
the token as produced by the lexer.

• Non-terminals:
• Syntactic variables that denote sets of strings.
• The sets of strings denoted by non-terminals help define the language
generated by the grammar.

CFG Definition..
• Start Symbol:
• A non-terminal that forms the root of the parse tree.
• Conventionally, the productions for the start symbol are listed first.

• Productions:
• The productions of a grammar specify the manner in which the
terminals and non-terminals can be combined to form strings.

• Each production consists of:

1. A non-terminal called the head or left side of the production; the
production defines some of the strings denoted by the head.

CFG Definition..
2. The symbol →. Sometimes ::= has been used in place of the arrow.

3. A body or right side consisting of zero or more terminals and
non-terminals.

The components of the body describe one way in which strings of
the non-terminal at the head can be constructed.

CFG Definition..
• Example grammar:

• Terminals: id + - * / ( )
• Non-terminals: expression, term, factor
• Start symbol: expression
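
Written as plain data, the four components of this example grammar might
look as follows in Python. The production bodies are our assumption: they
follow the usual expression/term/factor pattern used with these terminals,
and the dict-of-lists layout is just one convenient encoding.

grammar = {
    "terminals":    {"id", "+", "-", "*", "/", "(", ")"},
    "nonterminals": {"expression", "term", "factor"},
    "start":        "expression",
    # head -> list of alternative production bodies
    "productions": {
        "expression": [["expression", "+", "term"],
                       ["expression", "-", "term"],
                       ["term"]],
        "term":       [["term", "*", "factor"],
                       ["term", "/", "factor"],
                       ["factor"]],
        "factor":     [["(", "expression", ")"], ["id"]],
    },
}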

Notational Conventions
• Notational conventions for grammars (used in this course):

• These symbols are terminals:

(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, ..., 9.
(e) Boldface strings such as id or if, each of which represents a single
terminal symbol.

Notational Conventions..
• These symbols are non-terminals:

(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be
used to represent non-terminals for the constructs.
For example, non-terminals for expressions, terms, and factors are
often represented by E, T, and F, respectively.

Notational Conventions…
• Uppercase letters late in the alphabet, such as X, Y, Z, represent
grammar symbols, that is, either non-terminals or terminals.

• Lowercase letters late in the alphabet, chiefly u, v, ..., z, represent
(possibly empty) strings of terminals.

• Lowercase Greek letters, such as α, β, γ, represent (possibly empty)
strings of grammar symbols.

• A set of productions A → α1, A → α2, ..., A → αk with a common
head A (call them A-productions) may be written as
A → α1 | α2 | ... | αk
Notational Conventions…
• The grammar we defined earlier, written using these conventions:

E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | id
Derivations
• Assume we have a production A → α.
• We would then say that A derives α and write A ⇒ α.

• We generalize this: if, in addition, β and γ are strings, we say that
βAγ derives βαγ and write
βAγ ⇒ βαγ

• We generalize further: if α derives β and β derives γ, we say α
derives γ and write
α ⇒* γ

• ⇒* means derives in zero or more steps.

Derivations..
• Formal definition of derives in zero or more steps:
1. α ⇒* α, for any string α.
2. If α ⇒* β and β ⇒ γ, then α ⇒* γ.

• If S is the start symbol and S ⇒* α, we say α is a sentential form of
the grammar.

• A sentential form may contain non-terminals and terminals.

• If it contains only terminals, it is a sentence of the grammar, and the
language generated by a grammar G, written L(G), is the set of its
sentences.

• Two grammars generating the same language are called equivalent.

Derivations...
• Example: E → E + E | E * E | ( E ) | id

• We see that id + id is a sentence. Indeed, it can be derived in two
ways from the start symbol E:

E ⇒ E + E ⇒ id + E ⇒ id + id
E ⇒ E + E ⇒ E + id ⇒ id + id

• In the first derivation, we replaced the leftmost non-terminal by the
body of a production having that non-terminal as head. This is called
a leftmost derivation.

• Similarly, the second derivation, in which the rightmost non-terminal
is replaced, is called a rightmost derivation or a canonical derivation.
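
A leftmost derivation can be mimicked mechanically: at each step, replace
the leftmost non-terminal by a chosen production body. Here is a tiny
Python sketch of the first derivation above; the list-of-symbols
representation is our own choice for illustration.

def leftmost_step(form, head, body):
    i = form.index(head)                 # leftmost occurrence of the head
    return form[:i] + body + form[i + 1:]

form = ["E"]
for head, body in [("E", ["E", "+", "E"]),   # E => E + E
                   ("E", ["id"]),            # replace the leftmost E
                   ("E", ["id"])]:           # replace the remaining E
    form = leftmost_step(form, head, body)
    print(" ".join(form))
# E + E
# id + E
# id + id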

Thank You
