Code Source Tokens Scanner Parser IR
Code Source Tokens Scanner Parser IR
Code Source Tokens Scanner Parser IR
source tokens
code scanner parser IR
errors
• maps characters into tokens – the basic unit of syntax
x = x + y;
becomes
<id, x> = <id, x> + <id, y> ;
• character string value for a token is a lexeme
• typical tokens: number, id, +, -, *, /, do, end
• eliminates white space (tabs, blanks, comments)
• a key issue is speed
⇒ use specialized recognizer (as opposed to lex)
1
Specifying patterns
white space
<ws> ::= <ws> ’ ’
| <ws> ’\t’
| ’ ’
| ’\t’
keywords and operators
specified as literal patterns: do, end
comments
opening and closing delimiters: /* · · · */
2
Specifying patterns
identifiers
alphabetic followed by k alphanumerics ( , $, &, . . . )
numbers
integers: 0 or digit from 1-9 followed by digits from 0-9
decimals: integer ’.’ digits from 0-9
reals: (integer or decimal) ’E’ (+ or -) digits from 0-9
complex: ’(’ real ’,’ real ’)’
Operation Definition
union of L and M L ∪ M = {s | s ∈ L or s ∈ M}
written L ∪ M
concatenation of L and M LM = {st | s ∈ L and t ∈ M}
written LM
Kleene closure of L L∗ = ∞ i
S
i=0 L
written L∗
+ S∞
positive closure of L L = i=1 Li
written L+
4
Regular expressions
identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
numbers
integer → (+ | − | ε) (0 | (1 | 2 | 3 | ... | 9) digit ∗)
decimal → integer . ( digit )∗
real → ( integer | decimal ) E (+ | −) digit ∗
complex → ’(’ real , real ’)’
Axiom Description
r|s = s|r | is commutative
r|(s|t) = (r|s)|t | is associative
(rs)t = r(st) concatenation is associative
r(s|t) = rs|rt concatenation distributes over |
(s|t)r = sr|tr
εr = r ε is the identity for concatenation
rε = r
r∗ = (r|ε)∗ relation between ∗ and ε
r∗∗ = r∗ ∗ is idempotent
7
Examples
Let Σ = {a, b}
8
Recognizers
From a regular expression we can construct a
deterministic finite automaton (DFA)
Recognizer for identifier :
letter
digit
letter other
0 1 2
digit accept
other
3
error
identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
9
Code for the recognizer
char ← next char();
state ← 0; /* code for state 0 */
done ← false;
token value ← "" /* empty string */
while( not done ) {
class ← char class[char];
state ← next state[class,state];
switch(state) {
case 1: /* building an id */
token value ← token value + char;
char ← next char();
break;
case 2: /* accept state */
token type = identifier;
done = true;
break;
case 3: /* error */
token type = error;
done = true;
break;
}
}
return token type;
10
Tables for the recognizer
class 0 1 2 3
letter 1 1 — —
next state:
digit 3 1 — —
other 3 2 — —
• construct a DFA
• use state minimization techniques
• emit code for the scanner
(table driven or direct code )
Provable fact:
1
s0 s1
1
0 0 0 0
1
s2 s3
1
The RE is (00 | 11)∗((01 | 10)(00 | 11)∗(01 | 10)(00 | 11)∗)∗
14
More regular expressions
a jb
a b b
s0 s1 s2 s3
15
Finite automata
A DFA accepts x iff. ∃ a unique path through the transition graph from s0 to
a final state such that the edges spell x.
16
DFAs and NFAs are equivalent
17
NFA to DFA using the subset construction: example 1
ajb
a b b
s0 s1 s2 s3
a b
{s0} {s0, s1} {s0}
{s0, s1} {s0, s1} {s0, s2}
{s0, s2} {s0, s1} {s0, s3}
{s0, s3} {s0, s1} {s0}
b
b
a
a b b
fs0 g fs0 ; s1 g fs0 ; s2 g fs0 ; s3 g
a
a
18
Constructing a DFA from a regular expression
DFA
minimized
RE DFA
NFA
ε moves
19
RE to NFA
ε
N(ε)
a
N(a)
N(A) A
ε ε
N(A|B)
ε ε
N(B) B
N(A∗) ε
ε
ε
N(A) A
20
RE to NFA: example
N(A) A
ε ε
a|b
ε ε
N(B) B
a
2 3
ε ε
(a|b)∗ 0
ε
1 6
ε
7
ε ε
4 5
b
a b b
abb 7 8 9 10
21
NFA to DFA: the subset construction
Input: NFA N
Output: A DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states, and using the following operations:
Operation Definition
ε-closure(s) set of NFA states reachable from NFA state s on ε-transitions
alone
ε-closure(T ) set of NFA states reachable from some NFA state s in T on
ε-transitions alone
move(T, a) set of NFA states to which there is a transition on input symbol
a from some NFA state s in T
22
NFA to DFA using subset construction: example 2
ε
a
2 3
ε ε
ε ε a b b
0 1 6 7 8 9 10
ε ε
4 5
b
a b
A B C
A = {0, 1, 2, 4, 7} D = {1, 2, 4, 5, 6, 7, 9}
B B D
B = {1, 2, 3, 4, 6, 7, 8} E = {1, 2, 4, 5, 6, 7, 10}
C B C
C = {1, 2, 4, 5, 6, 7}
D B E
E B C
b
b
b
a
a
a b b
A B D E
a
a
23
Limits of regular languages
• L = {pk qk }
• L = {wcwr | w ∈ Σ∗}
24
Ramification - Internet Protocol
This is done through two DFAs in the client and server, respectively.
25
Ramification - Intrusion Detection
Code Operating System
FILE * f;
f=fopen("demo", "r"); ------> SYS_OPEN
strcpy(...); //vulnerability
if (!f)
printf("Fail to open\n"); ------> SYS_WRITE
else
fgets(f, buf); ------> SYS_READ
...
26