Patterns, Automata, and Regular Expressions
RE <-> DFA

[Diagram: RE --(Thompson's Construction)--> NFA with ε-moves --(Subset Construction)--> DFA --(State-Elimination Construction)--> RE]

Automated RE->NFA Construction

• Build NFA for each term, connect them with ε-moves
– Concatenate – ab
  [Diagram: s0 --a--> s1 --ε--> s2 --b--> s3]
– Union – a|b
  [Diagram: s0 --ε--> s1 --a--> s2 --ε--> s5 and s0 --ε--> s3 --b--> s4 --ε--> s5]
– Kleene Closure – a*
  [Diagram: s0 --ε--> s1 --a--> s2 --ε--> s3, with ε-moves s2 --> s1 (repeat) and s0 --> s3 (bypass)]
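To make the construction above concrete, here is a minimal Python sketch of Thompson's construction. It is illustrative only: the fragment representation and function names are my own assumptions, not from the slides. Each NFA fragment is a (start, accept, edges) triple, and the three combinators mirror the concatenation, union, and Kleene-closure rules.

# Minimal sketch of Thompson's construction (illustrative; representation and
# names are assumptions, not from the slides).
# An NFA fragment is (start, accept, edges); edges maps state -> list of
# (label, state) pairs, and a label of None stands for an ε-move.
import itertools

_ids = itertools.count()

def new_state():
    return next(_ids)

def symbol(a):
    # NFA for a single character a:  start --a--> accept
    s, t = new_state(), new_state()
    return s, t, {s: [(a, t)], t: []}

def concat(f, g):
    # ab: ε-move from f's accept state to g's start state
    fs, ft, fe = f
    gs, gt, ge = g
    edges = {**fe, **ge}
    edges[ft] = edges[ft] + [(None, gs)]
    return fs, gt, edges

def union(f, g):
    # a|b: new start and accept states with ε-moves into and out of both fragments
    fs, ft, fe = f
    gs, gt, ge = g
    s, t = new_state(), new_state()
    edges = {**fe, **ge, s: [(None, fs), (None, gs)], t: []}
    edges[ft] = edges[ft] + [(None, t)]
    edges[gt] = edges[gt] + [(None, t)]
    return s, t, edges

def star(f):
    # a*: new start/accept states; ε-moves allow repeating or bypassing the fragment
    fs, ft, fe = f
    s, t = new_state(), new_state()
    edges = {**fe, s: [(None, fs), (None, t)], t: []}
    edges[ft] = edges[ft] + [(None, fs), (None, t)]
    return s, t, edges

# Example: build the NFA for (a|b)* from the three rules above.
start, accept, edges = star(union(symbol('a'), symbol('b')))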
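The Subset Construction step named in the RE <-> DFA pipeline is not spelled out on these slides. The following rough Python sketch (my own representation: edges map each state to (label, state) pairs, with None as an ε-move) shows the ε-closure/move computation that turns an NFA into a DFA.

# Sketch of the subset construction (illustrative; not from the slides).
def eps_closure(states, edges):
    # All states reachable from `states` using only ε-moves.
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in edges.get(s, []):
            if label is None and t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, a, edges):
    # States reachable from `states` on one occurrence of symbol a.
    return {t for s in states for label, t in edges.get(s, []) if label == a}

def subset_construction(start, accept, edges, alphabet):
    # Each DFA state is a frozenset of NFA states.
    dfa_start = eps_closure({start}, edges)
    dfa_trans, worklist, seen = {}, [dfa_start], {dfa_start}
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            T = eps_closure(move(S, a, edges), edges)
            if not T:
                continue
            dfa_trans[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    dfa_accepts = {S for S in seen if accept in S}
    return dfa_start, dfa_trans, dfa_accepts

# Tiny example: NFA for a|b with an ε-move fan-out from state 0.
edges = {0: [(None, 1), (None, 3)], 1: [('a', 2)], 3: [('b', 4)],
         2: [(None, 5)], 4: [(None, 5)], 5: []}
print(subset_construction(0, 5, edges, 'ab'))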
Automated DFA->RE

for i = 1 to N
    for j = 1 to N
        R_ij^0 = { a | δ(s_i, a) = s_j }
        if (i == j)
            R_ij^0 = R_ij^0 ∪ { ε }

for k = 1 to N
    for i = 1 to N
        for j = 1 to N
            R_ij^k = R_ik^(k-1) (R_kk^(k-1))* R_kj^(k-1) ∪ R_ij^(k-1)

The RE for the DFA's language is the union of R_1j^N over all final states s_j ∈ F (with s_1 the start state). A code sketch of this construction follows the next slide.

Regular Expression and DFA

• Identifier
    letter → (a|b|c| … |z|A|B|C| … |Z)
    digit → (0|1|2|3|4|5|6|7|8|9)
    id → letter (letter|digit)*

[DFA diagram: S0 --letter--> S1; S1 --letter--> S1; S1 --digit--> S1; S1 --other--> S2 (accept); S0 --digit--> S3 (error); S0 --other--> S3 (error)]

next_state table (rows are character classes, columns are states; S2 = accept, S3 = error):

  class     S0   S1   S2   S3
  letter    S1   S1   --   --
  digit     S3   S1   --   --
  other     S3   S2   --   --

• To change the language, we just change the tables
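Here is the code sketch of the R_ij^k construction promised above. It is illustrative only: the function names and the string-based RE representation are my own assumptions; the loops themselves follow the slide, for a DFA given as a dict from (state, symbol) to state with states numbered 1..N and start state 1.

# Illustrative sketch of the R_ij^k construction (my names, not the slides').
def dfa_to_re(dfa, n, finals, alphabet):
    EPS = 'ε'

    def cat(x, y):                      # concatenation; None represents the empty set
        if x is None or y is None:
            return None
        if x == EPS:
            return y
        if y == EPS:
            return x
        if '|' in x:
            x = '(' + x + ')'
        if '|' in y:
            y = '(' + y + ')'
        return x + y

    def alt(x, y):                      # union
        if x is None:
            return y
        if y is None:
            return x
        return x + '|' + y

    # Base case: R_ij^0 = symbols taking s_i directly to s_j (plus ε when i == j).
    R = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            syms = [a for a in alphabet if dfa.get((i, a)) == j]
            if i == j:
                syms.append(EPS)
            R[i][j] = '|'.join(syms) if syms else None

    # Inductive step: allow intermediate states s_1 .. s_k.
    for k in range(1, n + 1):
        loop = EPS if R[k][k] in (None, EPS) else '(' + R[k][k] + ')*'
        newR = [[None] * (n + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                through_k = cat(cat(R[i][k], loop), R[k][j])
                newR[i][j] = alt(through_k, R[i][j])
        R = newR

    # The language is the union of R_1j^N over all final states s_j.
    result = None
    for j in sorted(finals):
        result = alt(result, R[1][j])
    return result

# Example: DFA over {a, b} accepting strings that end in b (state 2 is final).
dfa = {(1, 'a'): 1, (1, 'b'): 2, (2, 'a'): 1, (2, 'b'): 2}
print(dfa_to_re(dfa, 2, {2}, 'ab'))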
Table-Driven Recognizer/Scanner

char = next_char();
state = S0;                 /* code for S0 */
token_value = "";           /* empty string */
while (not done) {
    class = char_class[char];
    state = next_state[class, state];
    switch (state) {
        case S2:            /* accept state */
            token_type = identifier;
            done = true; break;
        case S3:            /* error */
            token_type = error;
            done = true; break;
        default:            /* building an id */
            token_value = token_value + char;
            char = next_char(); break;
    }
}
return(token_type);

Error Recovery

• E.g., illegal character
• What should the scanner do?
– Report the error
– Try to correct it?
• Error correction techniques
– Minimum distance corrections
– Hard token recovery
– Skip until match
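As a runnable companion to the pseudocode above, here is a hedged Python sketch of the same table-driven recognizer. The table contents follow the slides; the function names and the skip-until-whitespace error recovery (one form of the "skip until match" idea) are my own assumptions.

# Table-driven identifier recognizer (illustrative sketch; tables follow the slides).
def char_class(c):
    if c.isalpha(): return 'letter'
    if c.isdigit(): return 'digit'
    return 'other'

# next_state[class][state]; S2 = accept, S3 = error, None = no transition.
next_state = {
    'letter': {'S0': 'S1', 'S1': 'S1', 'S2': None, 'S3': None},
    'digit':  {'S0': 'S3', 'S1': 'S1', 'S2': None, 'S3': None},
    'other':  {'S0': 'S3', 'S1': 'S2', 'S2': None, 'S3': None},
}

def next_token(text, pos):
    # Return (token_type, token_value, new_pos) for the token starting at pos.
    state, value = 'S0', ''
    while pos < len(text):
        state = next_state[char_class(text[pos])][state]
        if state == 'S2':                    # accept: the 'other' char is not consumed
            return 'identifier', value, pos
        if state == 'S3':                    # error: skip until whitespace ("skip until match")
            while pos < len(text) and not text[pos].isspace():
                pos += 1
            return 'error', value, pos
        value += text[pos]                   # still building an id
        pos += 1
    return ('identifier', value, pos) if state == 'S1' else ('error', value, pos)

print(next_token("count1 = 0", 0))   # ('identifier', 'count1', 6)

Changing the recognized language only means changing char_class and next_state, as the slides note.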
Scanner Summary
• Break up input into tokens
• Catch lexical errors
• Difficulty affected by language design
• Issues
– Input buffering
– Lookahead
– Error recovery
• Scanner generators
– Tokens specified by regular expressions
– Regular expression -> DFA
– Highly efficient in practice
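To illustrate the scanner-generator idea (tokens specified by regular expressions, compiled into one efficient recognizer), here is a minimal sketch using Python's re module. The token names and patterns are illustrative assumptions, not from the slides.

# Minimal regular-expression-based tokenizer in the spirit of a scanner generator.
import re

TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('ID',     r'[A-Za-z][A-Za-z0-9]*'),
    ('OP',     r'[+\-*/=]'),
    ('SKIP',   r'\s+'),
    ('ERROR',  r'.'),          # any other single character is a lexical error
]
MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def tokenize(text):
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind == 'SKIP':
            continue
        yield kind, m.group()

print(list(tokenize("x1 = y + 42")))
# [('ID', 'x1'), ('OP', '='), ('ID', 'y'), ('OP', '+'), ('NUMBER', '42')]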