04 Regular Expressions & FAs

Download as pdf or txt
Download as pdf or txt
You are on page 1of 46

Compiler Construction

Lexical Analysis — Part II

Course 2316410
Lecture No. 4
Quick Review

compile source code parts of speech & words


time Scanner
tables
or code Represent
design specifications Scanner words as
time Generator indices into a
global table
Specifications written as
“regular expressions”

Last class:
— The scanner is the first stage in the front end
— Specifications can be expressed using regular expressions
— Build tables and code from a DFA

1
Quick Review of Regular Expressions
• All strings of 1s and 0s ending in a 1
( 0 | 1 )* 1

• All strings over lowercase letters where the vowels (a,e,i,o,


& u) occur exactly once, in ascending order
Let Cons be (b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z)
Cons* a Cons* e Cons* i Cons* o Cons* u Cons*

• All strings of 1s and 0s that do not contain three 0s in a row:


( 1* (  |01 | 001 ) 1* )* (  | 0 | 00 )

2
Example (from Lab 1)
Consider the problem of recognizing ILOC register names
Register  r (0|1|2| … | 9) (0|1|2| … | 9)*

• Allows registers of arbitrary number


• Requires at least one digit

RE corresponds to a recognizer (or DFA)

(0|1|2| … 9)
r (0|1|2| … 9)
S0 S1 S2
accepting state

Recognizer for Register

Transitions on other inputs go to an error state, se

3
Example (continued)
DFA operation
• Start in state S0 & make transitions on each input character
• DFA accepts a word x iff x leaves it in a final state (S2 )
(0|1|2| … 9)
r (0|1|2| … 9)
S0 S1 S2
accepting state

Recognizer for Register


So,
• r17 takes it through s0, s1, s2 and accepts
• r takes it through s0, s1 and fails
• a takes it straight to se

4
Example (continued)
To be useful, the recognizer must be converted into code

0,1,2,3,4, All
r 5,6,7,8,9 others
Char  next character 
State  s0
s0 s1 se se
while (Char  EOF)
State  (State,Char)
s1 se s2 se
Char  next character
if (State is a final state ) s2 se s2 se
then report success
else report failure
se se se se

Skeleton recognizer Table encoding the RE


(0|1|2| … 9)
O(1) cost per character r (0|1|2| … 9)
(or per transition) S0 S1 S2 5
Example (continued)
We can add “actions” to each transition

0,1,2,3,4, All
Char  next character  r 5,6,7,8,9 others
State  s0 
while (Char  EOF) s0 s1 se se
Next  (State,Char) start error error
Act  (State,Char)
perform action Act s1 se s2 se
State  Next error add error
Char  next character
s2 se s2 se
if (State is a final state ) error add error
then report success
se se se se
else report failure
error error error

Skeleton recognizer Table encoding RE

Typical action is to capture the lexeme 6


What if we need a tighter specification?
r Digit Digit* allows arbitrary numbers
• Accepts r00000
• Accepts r99999
• What if we want to limit it to r0 through r31 ?
Write a tighter regular expression
— Register  r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )
— Register  r0|r1|r2| … |r31|r00|r01|r02| … |r09

Produces a more complex DFA


• DFA has more states
• DFA has same cost per transition (or per character)
• DFA has same basic implementation

7
Tighter register specification (continued)

The DFA for


Register  r ( (0|1|2) (Digit | ) | (4|5|6|7|8|9) | (3|30|31) )

(0|1|2| … 9)
S2 S3
0,1,2

r 3 0,1
S0 S1 S5 S6

4,5,6,7,8,9
S4

• Accepts a more constrained set of register names


• Same set of actions, more states

8
Tighter register specification (continued)

All This table runs


 r 0,1 2 3 4-9 others in the same
s0 s1 se se se se se skeleton
recognizer
s1 se s2 s2 s5 s4 se

s2 se s3 s3 s3 s3 se This table uses


the same O(1)
s3 se se se se se se time per
character
s4 se se se se se se

s5 se s6 se se se se The extra
precision costs
s6 se se se se se se us table space,
not time
se se se se se se se
(0|1|2| … 9)

Table encoding RE for the tighter register specification


S2 S3

0,1,2

r 3 0,1
S0 S1 S5 S6

4,5,6,7,8,9
9
S4
Tighter register specification (continued)

State 4,5,6
r 0,1 2 3 other
Action 7,8,9
1
0 e e e e e
start
2 2 5 4
1 e e
add add add add
3 3 3 3 e
2 e
add add add add exit
e
3,4 e e e e e
exit
6 e
5 e e e e
add exit
x
6 e e e e e
exit

e e e e e e e

(0|1|2| … 9)
S2 S3

We care about path lengths (time) and finite 0,1,2

size of set of states (representability), but we S0


r
S1
3
S5
0,1
S6

don’t worry (much) about number of states. 4,5,6,7,8,9


10
S4
Where are we going?
• We will show how to construct a finite state automaton
to recognize any RE Introduce NFAs
• Overview:
— Direct construction of a nondeterministic finite automaton
(NFA) to recognize a given RE
 Easy to build in an algorithmic way
 Requires -transitions to combine regular subexpressions
— Construct a deterministic finite automaton (DFA) to simulate
the NFA
 Use a set-of-states construction Optional, but
— Minimize the number of states in the DFA worthwhile
 Hopcroft state minimization algorithm
— Generate the scanner code
 Additional specifications needed for the actions

11
Non-deterministic Finite Automata
What about an RE such as ( a | b )* abb ?

b a

a b b
S0 S1 S2 S3
a
a

Each RE corresponds to a deterministic finite automaton (DFA)


• We know a DFA exists for each RE
• The DFA may be hard to build directly
• Automatic techniques will build it for us …
Nothing here that would change
12
the O(1) cost per transition
Non-deterministic Finite Automata
Here is a simpler RE for ( a | b )* abb

a|b

 a b b
S0 S1 S2 S3 S4

This recognizer is more intuitive


• Structure seems to follow the RE’s structure
This recognizer is not a DFA
• S0 has a transition on 
• S1 has two transitions on a
This is a non-deterministic finite automaton (NFA)

This NFA needs one more transition,


13
at O(1) cost per transition
Non-deterministic Finite Automata
An NFA accepts a string x iff  a path though the
transition graph from s0 to a final state such that
the edge labels spell x, ignoring ’s
• Transitions on  consume no input
• To “run” the NFA, start in s0 and guess the right transition at
each step
— Always guess correctly
— If some sequence of correct guesses accepts x then accept

Why study NFAs?


• They are the key to automating the REDFA construction
• We can paste together NFAs with -transitions

NFA
 NFA becomes an NFA

14
Relationship between NFAs and DFAs
DFA is a special case of an NFA
• DFA has no  transitions
• DFA’s transition function is single-valued
• Same rules will work

DFA can be simulated with an NFA


— Obviously

NFA can be simulated with a DFA (less obvious)


• Simulate sets of possible states
• Possible exponential blowup in the state space
• Still, one state per character in the input stream

Rabin & Scott, 1959 15


Automating Scanner Construction
To convert a specification into code:
1 Write down the RE for the input language
2 Build a big NFA
3 Build the DFA that simulates the NFA
4 Systematically shrink the DFA
5 Turn it into code

Scanner generators
• Lex and Flex work along these lines
• Algorithms are well-known and well-understood
• Key issue is interface to parser (define all parts of speech)
• You could build one in a weekend!

16
Where are we? Why are we doing this?
RE  NFA (Thompson’s construction) 
• Build an NFA for each term
• Combine them with -moves
NFA  DFA (Subset construction) 
• Build the simulation
The Cycle of Constructions
DFA  Minimal DFA 
• Hopcroft’s algorithm minimal
RE NFA DFA
DFA
DFA  RE
• All pairs, all paths problem
• Union together paths from s0 to a final state

17
RE NFA using Thompson’s Construction
Key idea
• NFA pattern for each symbol & each operator
• Join them with  moves in precedence order
a a  b
S0 S1 S0 S1 S3 S4

NFA for a NFA for ab

a 
S1 S2
 
S0 S5 S0
 S1
a
S3
 S4
 S3
b
S4
 
NFA for a*
NFA for a | b

Ken Thompson, CACM, 1968

18
Example of Thompson’s Construction
Let’s try a ( b | c )*
1. a, b, & c S0
a
S1 S0
b
S1 S0
c
S1

b
S2
2. b | c  S1 
S0 S5
 c 
S3 S4


3. ( b | c )* b
 S2 S3 
S0
 S1
 S7
S6
 c 
S4 S5

19
Example of Thompson’s Construction (con’t)


4. a ( b | c )*
b
 S4 S5 
S0
a
S1   
S2 S3 S8 S9
 c 
S6 S7

Of course, a human would design something simpler ...

b|c
But, we can automate production of
a
S0 S1 the more complex NFA version ...

20
Where are we? Why are we doing this?
RE  NFA (Thompson’s construction) 
• Build an NFA for each term
• Combine them with -moves
NFA  DFA (subset construction) 
• Build the simulation
The Cycle of Constructions
DFA  Minimal DFA 
• Hopcroft’s algorithm minimal
RE NFA DFA
DFA
DFA  RE
• All pairs, all paths problem
• Union together paths from s0 to a final state

21
NFA DFA with Subset Construction
Need to build a simulation of the NFA

Two key functions


• Move(si , a) is the set of states reachable from si by a
• -closure(si) is the set of states reachable from si by 
The algorithm:
• Start state derived from s0 of the NFA
• Take its -closure S0 = -closure({s0})
• Take the image of S0, Move(S0, ) for each   , and take
its -closure
• Iterate until no more states are added
Sounds more complex than it is…

Rabin & Scott, 1959 22


NFA DFA with Subset Construction

The algorithm: The algorithm halts:


s0 -closure({n0}) 1. S contains no duplicates
(test before adding)
S  { s0 }
W  { s0 } 2. 2{NFA states} is finite
while ( W ≠ Ø ) 3. while loop adds to S, but
select and remove s from W does not remove from S
(monotone)
for each   
t  -closure(Move(s,))  the loop halts
T[s,]  t S contains all the reachable
if ( t  S ) then NFA states
add t to S It tries each character in each si.
add t to W It builds every possible NFA
configuration.
Let’s think about why this works  S and T form the DFA

s0 is a set of states This test is a little tricky


23
S & W are sets of sets of states
NFA DFA with Subset Construction

The algorithm: The algorithm halts:


s0 -closure({n0}) 1. S contains no duplicates
(test before adding)
S  { s0 }
W  { s0 } 2. 2{NFA states} is finite
while ( W ≠ Ø ) 3. while loop adds to S, but
select and remove s from W does not remove from S
(monotone)
for each   
t  -closure(Move(s,))  the loop halts
T[s,]  t S contains all the reachable
if ( t  S ) then NFA states
add t to S It tries each character in each si.
add t to W It builds every possible NFA
configuration.
Let’s think about why this works  S and T form the DFA

Any DFA state containing a final state of the NFA


24
final state becomes a final state of the DFA.
NFA DFA with Subset Construction
Example of a fixed-point computation
• Monotone construction of some finite set
• Halts when it stops adding to the set
• Proofs of halting & correctness are similar
• These computations arise in many contexts
Other fixed-point computations
• Canonical construction of sets of LR(1) items
— Quite similar to the subset construction
• Classic data-flow analysis (& Gaussian Elimination)
— Solving sets of simultaneous set equations

We will see many more fixed-point computations

25
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

26
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9
q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,
s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

27
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

28
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

29
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

30
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9,


s1 q4, q6, q 9 none q 3 , q4 , q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

31
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

32
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

33
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

34
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none
q3 , q 4 , q 6

s3 q7, q8, q9 , none


q3 , q 4 , q 6

35
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7
q7 is the core
 state of s3

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

36
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7
q5 is the core
 state of s2

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

37
NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 q 1 , q2 , q 3 ,
q 4 , q6 , q 9 none none

q1, q2, q3 , q5, q 8, q9, q7, q8, q 9,


s1 q4, q6, q 9 none q 3 , q4 , q 6 q3, q4, q 6
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

Final states because of q9 38


NFA DFA with Subset Construction

a ( b | c )* : b
 q4 q5 
a   
q0 q1 q2 q3 q8 q9
 q6 c 
q7

States -closure(Move(s,*))
DFA NFA a b c
s0 q0 s1 none none

q1, q2, q3 ,
s1 none s2 s3
q4, q6, q 9
q5, q8, q9 ,
s2 none s2 s3
q3 , q 4 , q 6

s3 q7, q8, q9 , none s2 s3


q3 , q 4 , q 6

Transition table for the DFA 39


NFA DFA with Subset Construction
The DFA for a ( b | c )*

b
a b c
s2 s0 s1 none none
b
a s1 none s2 s3
s0 s1 b c
c s2 none s2 s3
s3
c s3 none s2 s3

• Much smaller than the NFA (no -transitions)


• All transitions are deterministic
• Use same code skeleton as before
b|c
But, remember
our goal: S0
a
S1
40
Where are we? Why are we doing this?
RE  NFA (Thompson’s construction) 
• Build an NFA for each term
• Combine them with -moves
NFA  DFA (subset construction) 
• Build the simulation
The Cycle of Constructions
DFA  Minimal DFA 
• Hopcroft’s algorithm minimal
RE NFA DFA
DFA
DFA  RE
• All pairs, all paths problem
• Union together paths from s0 to a final state
Not enough time to teach Hopcroft’s algorithm today

41
Alternative Approach to DFA Minimization
The Intuition
• The subset construction merges prefixes in the NFA

a b c
s1 s2 s3 s4
 abc | bc | ad
 b c
s0 s5 s6 s7

 Thompson’s construction would leave


a d
s8 s9 s10 -transitions between each single-
character automaton

s6
d
b Subset construction eliminates -
c
s1 s2 s3 transitions and merges the paths for a.
a
It leaves duplicate tails, such as bc.
b c
s0 s4 s5

42
Alternative Approach to DFA Minimization
Idea: use the subset construction twice
• For an NFA N
— Let reverse(N) be the NFA constructed by making initial states
final (& vice-versa) and reversing the edges
— Let subset(N) be the DFA that results from applying the
subset construction to N
— Let reachable(N) be N after removing all states that are not
reachable from the initial state
• Then,
reachable(subset(reverse[reachable(subset(reverse(N))]))

is the minimal DFA that implements N [Brzozowski, 1962]


This result is not intuitive, but it is true.
Neither algorithm dominates the other.

43
Alternative Approach to DFA Minimization
Step 1
• The subset construction on reverse(NFA) merges suffixes in
original NFA
a b c
s1 s2 s3 s4
 
 b c  Reversed NFA
s0 s5 s6 s7 s11

 
a d
s8 s9 s10

a b c
s1 s2 s3
subset(reverse(NFA))
a d
s8 s9 s11

44
Alternative Approach to DFA Minimization
Step 2
• Reverse it again & use subset to merge prefixes …
a b c
s1 s2 s3

s0  Reverse it, again
d s11
 a
s8 s9

d
a s2
b
And subset it, again
b c
s0 s3 s11
The Cycle of Constructions

Minimal DFA
minimal
RE NFA DFA
DFA
45
Brzozowski

You might also like