Patterns, Automata, and Regular Expressions

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

• Finite Automata – formal or abstract

machine to recognize patterns


Patterns, Automata, and
• Regular expressions – formal notation to
Regular Expressions describe/generate patterns

Finite Automata Regular Expressions


• A finite collection of states • Defines a set of strings over the characters contained in
• An alphabet some alphabet, defining a language
• A set of transitions between those states labeled • Atomic operand can be
– a character,
with symbols in the alphabet
– the symbol ε,
• A start state, S0 – the symbol Φ, or
• One or more final states – a variable whose value can be any pattern defined by a regular
expression
• Three basic operations/operators
Deterministic – single-valued transition, no epsilon
– Union – e.g., a|b
transitions – Concatenation – e.g., ab
Non-deterministic – multi-valued transitions – Closure – (Kleene closure) – e.g., a* - where a can be a set
concatenated with itself zero or more times

Precedence of Regular Expression Algebraic Laws for Regular


Operators Expressions
• Closure (highest) • Identity for Union Φ|R = R|Φ= R
• Identity for concatenation εR=Rε=R
• Concatenation • Associativity and commutativity of union
• Union R|S=S|R, ((R|S)|T) = (R|(S|T))
• Associativity of concatenation (RS)T = R(ST)
• Non-commutativity of concatenation
• Left distributivity of concatenation over union
(R(S|T)) = (RS|RT)
• Right distributivity of concatenation over union
((S|T)R) = (SR|TR)
• Idempotence of union (R | R) = R

1
RE <-> DFA Automated RE->NFA

State-Elimination
• Build NFA for each term, connect them
a b
Construction
DFA
with moves s0 s1 s2 s3
RE
– Concatenate - ab
a  b
s0 s1 s2
Subset s3
Thompson’s
Construction
Construction
– Union – a|b
 s0
a

s1
NFA s4
 s2
b  s5

– Kleene Closure – a*  s3

 a

s2 s0 s1 s3


Thompson’s Construction NFA->DFA


• Each NFA has a single start state and a single • Subset construction algorithm
– Each state in DFA corresponds to a set of states in NFA
final state, with no transitions leaving the final
state and only the initial transition entering the q0 ε-closure(n0)
start state initialize Q with {q0}

• An  -move always connects two states that were


while (Q is still changing)

for each qi Q
start or final states for each character  
• A state has at most 2 entering and 2 exiting  - t ε-closure(move(qi, )) 
T[qi, ]  t
moves, and at most 1 entering and 1 exiting 
if t Q then
move on a symbol in the alphabet add t to Q

Example DFA Minimization


 
NFA P P {SF, {S - SF}
while (P is still changing)
m a i n
S0 Sm Sa Si Sn T 0
for each set p  P
 -m  -a,m  -i,m  -n,m for each  

Corresponding DFA partition p by 


m a i
S0 S0,Sm S0,Sa S0,Si into p1, p2, p3, … pk
n
 -m m m m T T U p1, p2, p3, … pk
m m m
if T P then
m
a So,Sn,
i
S0,Sn, P T
S0,Sn S0,Sn,
Sa Si  -m
 -a,m
Sm
 -i,m

2
Automated DFA->RE Regular Expression and DFA
Identifier
for i = 1 to N
for j = 1 to N letter (a|b|c| … |z|A|B|C| … |Z)
Rij0 = {a| (si,a) = sj} digit (0|1|2|3|4|5|6|7|8|9)
if (i == j)
Rij0 = Rij0 U 
id letter (letter|digit)*
letter
for k = 1 to N digit
for i = 1 to N
letter
for j = 1 to N S0 S1

Rijk = Rikk-1 Rkkk-1 Rkjk-1 URijk-1


digit
L = U R1jN accept
s S
j


F S2 error
letter
digit

Implementing Scanners Code for Semi-Mechanical Pure


(Recognizer) DFA
state = S0; /* code for S0 */
• Ad-hoc done = false;
token_value = “” /* empty string */
token_type = error;

• Semi-mechanical pure DFA char = next_char();


while (not done) {
class = char_class[char];

• Table-driven DFA switch(state) {


case S0:
switch (class)
case ‘letter’: token_type = identifier; token_value = token_value+char;
state = S1; char = next_char(); break;
case ‘digit’: done = true; break;
case ‘other’: done = true; break;
break;
case S1:
switch(class)
case ‘letter’:
case ‘digit’: token_value = token_value+char; char = next_char(); break;
case ‘other’: done = true; break;
break;
}
}
return(token_type);

Table-Driven Recognizer Tables Driving the Recognizer


a-z A-Z 0-9 other
letter char_class
digit value letter letter digit other
letter other
S0 S1 S2

digit
other
class S0 S1 S2 S3
accept

S3 error letter S1 S1 -- --
next_state
digit S3 S1 -- --
other S3 S2 -- --
To change language, we can just change tables

3
Table-Driven Recognizer/Scanner Error Recovery
char = next_char();
state = S0; /* code for S0 */
token_value = “” /* empty string */
• E.g., illegal character
while (not done) {
class = char_class[char]; • What should the scanner do?
state = next_state[class,state];
switch(state) {
case S2: /* accept state */
– Report the error
token_type = identifier;
done = true; break; – Try to correct it?
case S3: /* error */
token_type = error;
done = true; break;
• Error correction techniques
default: /* building an id */
token_value = token_value + char; – Minimum distance corrections
char = next_char; break;

}
} – Hard token recovery
return(token_type);
– Skip until match

Scanner Summary
• Break up input into tokens
• Catch lexical errors
• Difficulty affected by language design
• Issues
– Input buffering
– Lookahead
– Error recovery
• Scanner generators
– Tokens specified by regular expressions
– Regular expression -> DFA
– Highly efficient in practice

You might also like