Compiler Lecture 4


Lecture 4: Lexical Analysis II: From REs to DFAs

Source code -> Front-End (Lexical Analysis) -> IR -> Back-End -> Object code

(from last lecture) Lexical Analysis:


• Regular Expressions (REs) are formulae to describe a (regular) language.
• Every RE can be converted to a Deterministic Finite Automaton (DFA).
• DFAs can automate the construction of lexical analysers.

Today’s lecture:
Algorithms to derive a DFA from a RE.
CSE 359 - Compiler Design, Masud Ibn Afjal, CSE, HSTU
An Example (recognise r0 through r31)
Register → r ((0|1|2) (Digit|ε) | (4|5|6|7|8|9) | (3|30|31))
DFA:
S0 --r--> S1
S1 --0|1|2--> S2 (final) --digit--> S3 (final)
S1 --3--> S5 (final) --0|1--> S6 (final)
S1 --4|5|6|7|8|9--> S4 (final)

Transition table:
State      'r'  0,1  2    3    4,5,…,9
0          1    -    -    -    -
1          -    2    2    5    4
2(final)   -    3    3    3    3
3(final)   -    -    -    -    -
4(final)   -    -    -    -    -
5(final)   -    6    -    -    -
6(final)   -    -    -    -    -

• Same code skeleton (Lecture 3, slide 15) can be used!
• Different (bigger) transition table.
• Our Deterministic Finite Automaton (DFA) recognises only r0 through r31.
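
A minimal sketch of a table-driven recogniser for this DFA follows. The transition table is copied from the slide; the skeleton from Lecture 3 is not reproduced here, so the loop below and the names char_class, TABLE and is_register are assumptions made for illustration.

```python
ERR = -1  # error entry, shown as '-' in the transition table


def char_class(ch):
    """Map a character to its table column: 'r', 0|1, 2, 3, 4..9."""
    if ch == "r":
        return 0
    if ch in ("0", "1"):
        return 1
    if ch == "2":
        return 2
    if ch == "3":
        return 3
    if ch in ("4", "5", "6", "7", "8", "9"):
        return 4
    return None  # character outside the alphabet


# Rows are states 0..6, columns are the character classes above.
TABLE = [
    [1, ERR, ERR, ERR, ERR],    # state 0
    [ERR, 2, 2, 5, 4],          # state 1
    [ERR, 3, 3, 3, 3],          # state 2 (final)
    [ERR] * 5,                  # state 3 (final)
    [ERR] * 5,                  # state 4 (final)
    [ERR, 6, ERR, ERR, ERR],    # state 5 (final)
    [ERR] * 5,                  # state 6 (final)
]
FINAL = {2, 3, 4, 5, 6}


def is_register(word):
    """Run the DFA over word; accept only if it ends in a final state."""
    state = 0
    for ch in word:
        col = char_class(ch)
        if col is None:
            return False
        state = TABLE[state][col]
        if state == ERR:
            return False
    return state in FINAL
```

With this table, is_register accepts r0 through r31 and rejects r32. It also accepts two-digit forms such as r07, because the RE allows any digit after 0, 1 or 2.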
Non-deterministic Finite Automata
What about a RE such as (a | b)*abb?
S0 --ε--> S1 --a--> S2 --b--> S3 --b--> S4 (final), with a self-loop on S1 labelled a|b

• This is a Non-deterministic Finite Automaton (NFA):


– S0 has a transition on ε; S1 has two transitions on a (not possible for a DFA).
• A DFA is a special case of an NFA:
– for each state and each input symbol there is at most one transition, and there are no ε-transitions.
• A DFA can be simulated with an NFA (obvious!)
• An NFA can be simulated with a DFA (less obvious).
– Simulate sets of possible states.
Why study NFAs? DFAs can lead to faster recognisers than NFAs but
may be much bigger. Converting a RE into an NFA is more direct.
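
The idea of simulating sets of possible states can be sketched directly on the NFA for (a | b)*abb shown above. This is an illustrative sketch, not the lecture's own code: the dictionary encoding and the names NFA, eps_closure and nfa_accepts are assumptions.

```python
EPS = ""  # label used here for ε-transitions (an encoding choice)

# NFA for (a|b)*abb: S0 -ε-> S1, S1 loops on a and b, S1 -a-> S2 -b-> S3 -b-> S4.
NFA = {
    (0, EPS): {1},
    (1, "a"): {1, 2},   # two transitions on a: the non-deterministic choice
    (1, "b"): {1},
    (2, "b"): {3},
    (3, "b"): {4},
}
START, FINAL = 0, {4}


def eps_closure(states):
    """All states reachable from `states` using only ε-transitions."""
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return closure


def nfa_accepts(word):
    """Track the set of states the NFA could be in after each symbol."""
    current = eps_closure({START})
    for ch in word:
        moved = set()
        for s in current:
            moved |= NFA.get((s, ch), set())
        current = eps_closure(moved)
    return bool(current & FINAL)
```

Each input symbol costs work proportional to the size of the current state set, which is why a DFA, with exactly one current state, can be faster.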
The Big Picture:
Automatic Lexical Analyser Construction
To convert a specification into code:
• Write down the RE for the input language.
• Convert the RE to an NFA (Thompson’s construction)
• Build the DFA that simulates the NFA (subset construction)
• Shrink the DFA (Hopcroft’s algorithm)
(for the curious: the cycle can be closed. A DFA can be converted back to a RE by an
all-pairs, all-paths construction)
Lexical analyser generators:
• lex or flex work along these lines.
• Algorithms are well-known and understood.
• Key issue is the interface to parser.
RE to NFA using Thompson’s construction
Key idea (Ken Thompson; CACM, 1968): NFA pattern for
each symbol and/or operator: join them in precedence order.
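NFA for a:   S0 --a--> S1
NFA for ab:  S0 --a--> S1 --ε--> S2 --b--> S3
NFA for a|b: S0 --ε--> S1 --a--> S2 --ε--> S5
             S0 --ε--> S3 --b--> S4 --ε--> S5
NFA for a*:  S0 --ε--> S1 --a--> S2 --ε--> S3
             S2 --ε--> S1 (repeat), S0 --ε--> S3 (skip)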
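
The four patterns can be sketched as small combinators that return NFA fragments. This is a sketch under assumptions: the tuple representation (start, accept, edges) and the names symbol, concat, alt, star, states and accepts are chosen for illustration and are not taken from the lecture.

```python
import itertools

EPS = None                      # label used here for ε-transitions
_next_id = itertools.count()    # fresh state numbers


def symbol(ch):
    """NFA for a single symbol: start --ch--> accept."""
    s, f = next(_next_id), next(_next_id)
    return s, f, [(s, ch, f)]


def concat(x, y):
    """NFA for xy: link the accept state of x to the start of y with ε."""
    xs, xf, xe = x
    ys, yf, ye = y
    return xs, yf, xe + ye + [(xf, EPS, ys)]


def alt(x, y):
    """NFA for x|y: new start and accept states, joined by ε edges."""
    xs, xf, xe = x
    ys, yf, ye = y
    s, f = next(_next_id), next(_next_id)
    return s, f, xe + ye + [(s, EPS, xs), (s, EPS, ys), (xf, EPS, f), (yf, EPS, f)]


def star(x):
    """NFA for x*: new start and accept, plus repeat and skip ε edges."""
    xs, xf, xe = x
    s, f = next(_next_id), next(_next_id)
    return s, f, xe + [(s, EPS, xs), (xf, EPS, f), (xf, EPS, xs), (s, EPS, f)]


def states(frag):
    """Set of all states mentioned by the fragment's edges."""
    return {q for src, _, dst in frag[2] for q in (src, dst)}


def accepts(frag, word):
    """Check a word by tracking the set of reachable states."""
    start, accept, edges = frag

    def closure(group):
        seen, todo = set(group), list(group)
        while todo:
            q = todo.pop()
            for src, label, dst in edges:
                if src == q and label is EPS and dst not in seen:
                    seen.add(dst)
                    todo.append(dst)
        return seen

    current = closure({start})
    for ch in word:
        current = closure({dst for src, label, dst in edges
                           if src in current and label == ch})
    return accept in current


# a(b|c)*, joined in precedence order: alternation, then closure, then concatenation.
nfa = concat(symbol("a"), star(alt(symbol("b"), symbol("c"))))
```

Under these patterns the fragment for ab has 4 states, a|b has 6, a* has 4, and a(b|c)* has 10, matching the state counts in the diagrams.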

Example: Construct the NFA of a (b|c)*
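First: NFAs for a, b, c:
  S0 --a--> S1      S0 --b--> S1      S0 --c--> S1

Second: NFA for b|c:
  S0 --ε--> S1 --b--> S2 --ε--> S5
  S0 --ε--> S3 --c--> S4 --ε--> S5

Third: NFA for (b|c)*:
  S0 --ε--> S1 --ε--> S2 --b--> S3 --ε--> S6 --ε--> S7
            S1 --ε--> S4 --c--> S5 --ε--> S6
  S6 --ε--> S1 (repeat), S0 --ε--> S7 (skip)

Fourth: NFA for a(b|c)*:
  S0 --a--> S1 --ε--> S2 --ε--> S3 --ε--> S4 --b--> S5 --ε--> S8 --ε--> S9
                                S3 --ε--> S6 --c--> S7 --ε--> S8
  S8 --ε--> S3 (repeat), S2 --ε--> S9 (skip)

Of course, a human would design a simpler one: S0 --a--> S1, with a self-loop on S1 labelled b|c.
But we can automate production of the complex one...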
NFA to DFA: two key functions
• move(si,a): the (union of the) set of states to which there is a transition
on input symbol a from state si
• ε-closure(si): the (union of the) set of states reachable by ε from si.
Example (see the diagram below):
• ε-closure(3)={3,4,7}; ε-closure({3,10})={3,4,7,10};
• move(ε-closure({3,10}),a)={8};

Diagram (NFA fragment): 3 --ε--> 4 --ε--> 7 --a--> 8, and 10 --ε--> 7
The Algorithm:
• start with the ε-closure of s0 from the NFA.
• Do for each unmarked state until there are no unmarked states:
– for each symbol take their ε-closure(move(state,symbol))
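
The two functions can be sketched on the small fragment used in the example. The edge list below is an assumption: it is one set of edges consistent with the closures stated on the slide, since the diagram does not survive in this copy. The names EDGES, eps_closure and move are illustrative.

```python
EPS = "ε"

# Assumed edges, chosen to be consistent with the example:
# 3 -ε-> 4 -ε-> 7 -a-> 8 and 10 -ε-> 7.
EDGES = {
    (3, EPS): {4},
    (4, EPS): {7},
    (10, EPS): {7},
    (7, "a"): {8},
}


def eps_closure(states):
    """States reachable from any state in `states` by zero or more ε moves."""
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in EDGES.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return closure


def move(states, symbol):
    """Union of the targets of `symbol` transitions leaving `states`."""
    result = set()
    for s in states:
        result |= EDGES.get((s, symbol), set())
    return result
```

Note that eps_closure always includes the starting states themselves (zero ε moves), whereas move does not follow any ε edges on its own.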
NFA to DFA with subset construction
Initially, ε-closure(s0) is the only state in Dstates and it is unmarked.
while there is an unmarked state T in Dstates
    mark T
    for each input symbol a
        U := ε-closure(move(T,a))
        if U is not in Dstates then add U as unmarked to Dstates
        Dtable[T,a] := U

• Dstates (set of states for DFA) and Dtable form the DFA.
• Each state of the DFA corresponds to a set of NFA states that the NFA
could be in after reading some sequence of input symbols.
• This is a fixed-point computation.

It sounds more complex than it actually is!


Example: NFA for (a | b)*abb

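0 --ε--> 1 --ε--> 2 --a--> 3 --ε--> 6 --ε--> 7 --a--> 8 --b--> 9 --b--> 10 (final)
         1 --ε--> 4 --b--> 5 --ε--> 6
6 --ε--> 1 (repeat), 0 --ε--> 7 (skip)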

• A=ε-closure(0)={0,1,2,4,7}
• for each input symbol (that is, a and b):
– B=ε-closure(move(A,a))=ε-closure({3,8})={1,2,3,4,6,7,8}
– C=ε-closure(move(A,b))=ε-closure({5})={1,2,4,5,6,7}
– Dtable[A,a]=B; Dtable[A,b]=C
• B and C are unmarked. Repeating the above we end up with:
– C={1,2,4,5,6,7}; D={1,2,4,5,6,7,9}; E={1,2,4,5,6,7,10}; and
– Dtable[B,a]=B; Dtable[B,b]=D; Dtable[C,a]=B; Dtable[C,b]=C;
Dtable[D,a]=B; Dtable[D,b]=E; Dtable[E,a]=B; Dtable[E,b]=C;
no more unmarked sets at this point!
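
The subset construction and the worked example above can be reproduced with a short sketch. The NFA edges below are an assumption: they are the usual Thompson-style NFA for (a | b)*abb with states 0 to 10, chosen because they yield exactly the sets A to E computed above. The function and variable names are illustrative.

```python
from collections import deque
import string

EPS = "ε"

# Assumed NFA for (a|b)*abb, states 0..10 (10 is final).
NFA = {
    0: {EPS: {1, 7}},
    1: {EPS: {2, 4}},
    2: {"a": {3}},
    3: {EPS: {6}},
    4: {"b": {5}},
    5: {EPS: {6}},
    6: {EPS: {1, 7}},
    7: {"a": {8}},
    8: {"b": {9}},
    9: {"b": {10}},
    10: {},
}
ALPHABET = "ab"


def eps_closure(states):
    """States reachable from `states` by zero or more ε moves."""
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in NFA[s].get(EPS, ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return frozenset(closure)


def move(states, symbol):
    """Union of the `symbol` successors of every state in `states`."""
    return {t for s in states for t in NFA[s].get(symbol, ())}


def subset_construction(start):
    """Return (Dstates in discovery order, Dtable)."""
    first = eps_closure({start})
    dstates, dtable = [first], {}
    unmarked = deque([first])
    while unmarked:
        T = unmarked.popleft()              # mark T
        for a in ALPHABET:
            U = eps_closure(move(T, a))     # an empty U would be a dead state
            if U not in dstates:
                dstates.append(U)           # add U as unmarked
                unmarked.append(U)
            dtable[T, a] = U
    return dstates, dtable


dstates, dtable = subset_construction(0)
names = {T: string.ascii_uppercase[i] for i, T in enumerate(dstates)}
for T in dstates:
    row = "  ".join(names[dtable[T, a]] for a in ALPHABET)
    print(names[T], sorted(T), row)
```

The loop stops when no unmarked state remains, which is the fixed point mentioned earlier: processing any state again would add nothing new to Dstates.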
Result of applying subset construction
Transition table:
state      a    b
A          B    C
B          B    D
C          B    C
D          B    E
E(final)   B    C

DFA diagram: the five states A to E with the edges listed in the table; A is the start state and E is the final state.
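
As a sanity check, the transition table can be run as a DFA and compared against the RE on all short strings over {a, b}. This is a sketch for illustration: Python's re module is used only as an independent oracle, and the names DTABLE, dfa_accepts and agrees_with_re are assumptions.

```python
import itertools
import re

# Transition table from the slide; A is the start state, E is final.
DTABLE = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
START, FINAL = "A", {"E"}


def dfa_accepts(word):
    """One table lookup per input symbol; word must be over {a, b}."""
    state = START
    for ch in word:
        state = DTABLE[state][ch]
    return state in FINAL


def agrees_with_re(max_len):
    """Compare the DFA with the RE on every string up to max_len symbols."""
    pattern = re.compile(r"(a|b)*abb")
    for n in range(max_len + 1):
        for letters in itertools.product("ab", repeat=n):
            word = "".join(letters)
            if dfa_accepts(word) != bool(pattern.fullmatch(word)):
                return False
    return True
```

A check such as agrees_with_re(8) is exhaustive only up to the chosen length, so it is evidence rather than a proof that the table is right.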
Another NFA version of the same RE
N0 --ε--> N1 --a--> N2 --b--> N3 --b--> N4, with a self-loop on N1 labelled a|b

Apply the subset construction algorithm:


Iteration  State  Contains  ε-closure(move(s,a))  ε-closure(move(s,b))
0          A      N0,N1     N1,N2 (B)             N1 (C)
1          B      N1,N2     N1,N2 (B)             N1,N3 (D)
           C      N1        N1,N2 (B)             N1 (C)
2          D      N1,N3     N1,N2 (B)             N1,N4 (E)
3          E      N1,N4     N1,N2 (B)             N1 (C)

Note:
• iteration 3 adds nothing new, so the algorithm stops.
• state E contains N4 (final state)
Enough theory… Let’s conclude!
• We presented algorithms to construct a DFA from a RE.
• The DFA is not necessarily the smallest possible.
• Using an (automatically generated) transition table and
the standard code skeleton (Lecture 3, slide 15) we can
build a lexical analyser from regular expressions
automatically. But, the size of the table can be large...
• Next time:
– DFA minimisation; Practical considerations; Lexical
Analysis wrap-up.
• Reading: Aho2 Sections 3.6-3.7; Aho1 pp. 147-166;
Grune 2.1.6.1-2.1.6.6 (different style); Hunter 3.3 (very
condensed); Cooper1 2.4-2.4.3
