Code Source Tokens Scanner Parser IR

Scanner
source tokens
code scanner parser IR
errors
• maps characters into tokens – the basic unit of syntax
x = x + y;
becomes
<id, x> = <id, x> + <id, y> ;
• character string value for a token is a lexeme
• typical tokens: number, id, +, -, *, /, do, end
• eliminates white space (tabs, blanks, comments)
• a key issue is speed
⇒ use specialized recognizer (as opposed to lex)
1
Specifying patterns
A scanner must recognize the units of syntax

Some parts are easy:
white space
<ws> ::= <ws> ’ ’
| <ws> ’\t’
| ’ ’
| ’\t’
keywords and operators
specified as literal patterns: do, end
comments
opening and closing delimiters: /* · · · */
2
Specifying patterns
A scanner must recognize the units of syntax

Other parts are much harder:
identifiers
alphabetic followed by k alphanumerics ( , $, &, . . . )
numbers
integers: 0 or digit from 1-9 followed by digits from 0-9
decimals: integer ’.’ digits from 0-9
reals: (integer or decimal) ’E’ (+ or -) digits from 0-9
complex: ’(’ real ’,’ real ’)’
We need a powerful notation to specify these patterns

3
Operations on languages
Operation Definition
union of L and M L ∪ M = {s | s ∈ L or s ∈ M}
written L ∪ M
concatenation of L and M LM = {st | s ∈ L and t ∈ M}
written LM
Kleene closure of L L∗ = ∞ i
S
i=0 L
written L∗
+ S∞
positive closure of L L = i=1 Li
written L+
4
Regular expressions
Patterns are often specified as regular languages

Notations used to describe a regular language (or a regular set) include
both regular expressions and regular grammars
Regular expressions (over an alphabet Σ):
1. ε is a RE denoting the set {ε}

2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs, denoting L(r) and L(s), then:
(r) is a RE denoting L(r)
(r) | (s) is a RE denoting L(r) L(s)
S
(r)(s) is a RE denoting L(r)L(s)

(r)∗ is a RE denoting L(r)∗
If we adopt a precedence for operators, the extra parentheses can go

away. We assume closure, then concatenation, then alternation as the
order of precedence.
5
Examples
identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
numbers
integer → (+ | − | ε) (0 | (1 | 2 | 3 | ... | 9) digit ∗)
decimal → integer . ( digit )∗
real → ( integer | decimal ) E (+ | −) digit ∗
complex → ’(’ real , real ’)’
Numbers can get much more complicated
Most programming language tokens can be described with REs
We can use REs to build scanners automatically

6
Algebraic properties of REs
Axiom Description
r|s = s|r | is commutative
r|(s|t) = (r|s)|t | is associative
(rs)t = r(st) concatenation is associative
r(s|t) = rs|rt concatenation distributes over |
(s|t)r = sr|tr
εr = r ε is the identity for concatenation
rε = r
r∗ = (r|ε)∗ relation between ∗ and ε
r∗∗ = r∗ ∗ is idempotent
7
Examples
Let Σ = {a, b}
1. a|b denotes {a, b}

2. (a|b)(a|b) denotes {aa, ab, ba, bb}
i.e., (a|b)(a|b) = aa|ab|ba|bb
3. a∗ denotes {ε, a, aa, aaa, . . .}
4. (a|b)∗ denotes the set of all strings of a’s and b’s (including ε)
i.e., (a|b)∗ = (a∗b∗)∗
5. a|a∗b denotes {a, b, ab, aab, aaab, aaaab, . . .}
8
Recognizers
From a regular expression we can construct a
deterministic finite automaton (DFA)
Recognizer for identifier :
letter
digit
letter other
0 1 2
digit accept
other
3
error
identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
9
Code for the recognizer
char ← next char();
state ← 0; /* code for state 0 */
done ← false;
token value ← "" /* empty string */
while( not done ) {
class ← char class[char];
state ← next state[class,state];
switch(state) {
case 1: /* building an id */
token value ← token value + char;
char ← next char();
break;
case 2: /* accept state */
token type = identifier;
done = true;
break;
case 3: /* error */
token type = error;
done = true;
break;
}
}
return token type;
10
Tables for the recognizer
Two tables control the recognizer

a−z A−Z 0−9 other
char class:
value letter letter digit other
class 0 1 2 3
letter 1 1 — —
next state:
digit 3 1 — —
other 3 2 — —
To change languages, we can just change tables

11
Automatic construction
Scanner generators automatically construct code from RE-like

descriptions
• construct a DFA
• use state minimization techniques
• emit code for the scanner
(table driven or direct code )
A key issue in automation is an interface to the parser
lex is a scanner generator supplied with UNIX
• emits C code for scanner

• provides macro definitions for each token
(used in the parser)
12
Grammars for regular languages
Can we place a restriction on the form of a grammar to ensure that it

describes a regular language?
Provable fact:
For any RE r, ∃ a grammar g such that L(r) = L(g)
Grammars that generate regular sets are called regular grammars:
They have productions in one of 2 forms:

1. A → aA
2. A → a
where A is any non-terminal and a is any terminal symbol
These are also called type 3 grammars (Chomsky)

13
More regular languages
Example: the set of strings containing an even number of zeros and an

even number of ones
1
s0 s1
1
0 0 0 0
1
s2 s3
1
The RE is (00 | 11)∗((01 | 10)(00 | 11)∗(01 | 10)(00 | 11)∗)∗
14
More regular expressions
What about the RE (a | b)∗abb ?
a jb
a b b
s0 s1 s2 s3
State s0 has multiple transitions on a!

⇒ nondeterministic finite automaton
a b
s0 {s0, s1} {s0}
s1 – {s2}
s2 – {s3}
15
Finite automata
A non-deterministic finite automaton (NFA) consists of:
1. a set of states S = {s0, . . . , sn}

2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of
states
4. a distinguished start state s0
5. a set of distinguished accepting or final states F
A Deterministic Finite Automaton (DFA) is a special case of an NFA:
1. no state has a ε-transition, and

2. for each state s and input symbol a, there is at most one edge labelled
a leaving s
A DFA accepts x iff. ∃ a unique path through the transition graph from s0 to
a final state such that the edges spell x.
16
DFAs and NFAs are equivalent
1. DFAs are clearly a subset of NFAs

2. Any NFA can be converted into a DFA, by simulating sets of
simultaneous states:
• each DFA state corresponds to a set of NFA states
• possible exponential blowup
17
NFA to DFA using the subset construction: example 1
ajb
a b b
s0 s1 s2 s3
a b
{s0} {s0, s1} {s0}
{s0, s1} {s0, s1} {s0, s2}
{s0, s2} {s0, s1} {s0, s3}
{s0, s3} {s0, s1} {s0}
b
b
a
a b b
fs0 g fs0 ; s1 g fs0 ; s2 g fs0 ; s3 g
a
a
18
Constructing a DFA from a regular expression
DFA
minimized
RE DFA
NFA
ε moves
RE →NFA w/ε moves

build NFA for each term
connect them with ε moves
NFA w/ε moves to DFA
construct the simulation
the “subset” construction
DFA → minimized DFA
merge compatible states
DFA → RE
construct Rkij = Rk−1
ik (Rk−1 ∗ k−1 S k−1
kk ) Rk j Ri j
19
RE to NFA
ε
N(ε)
a
N(a)
N(A) A
ε ε
N(A|B)
ε ε
N(B) B
N(AB) N(A) A N(B) B
N(A∗) ε
ε
ε
N(A) A
20
RE to NFA: example
N(A) A
ε ε
a|b
ε ε
N(B) B
a
2 3
ε ε
(a|b)∗ 0
ε
1 6
ε
7
ε ε
4 5
b
a b b
abb 7 8 9 10
21
NFA to DFA: the subset construction
Input: NFA N
Output: A DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states, and using the following operations:
Operation Definition
ε-closure(s) set of NFA states reachable from NFA state s on ε-transitions
alone
ε-closure(T ) set of NFA states reachable from some NFA state s in T on
ε-transitions alone
move(T, a) set of NFA states to which there is a transition on input symbol
a from some NFA state s in T
add state T = ε-closure(s0) unmarked to Dstates

while ∃ unmarked state T in Dstates
mark T
for each input symbol a
U = ε-closure(move(T, a))
if U 6∈ Dstates then add U to Dstates unmarked
Dtrans[T, a] = U
endfor
endwhile
ε-closure(s0) is the start state of D

A state of D is final if it contains at least one final state in N
22
NFA to DFA using subset construction: example 2
ε
a
2 3
ε ε
ε ε a b b
0 1 6 7 8 9 10
ε ε
4 5
b
a b
A B C
A = {0, 1, 2, 4, 7} D = {1, 2, 4, 5, 6, 7, 9}
B B D
B = {1, 2, 3, 4, 6, 7, 8} E = {1, 2, 4, 5, 6, 7, 10}
C B C
C = {1, 2, 4, 5, 6, 7}
D B E
E B C
b
b
b
a
a
a b b
A B D E
a
a
23
Limits of regular languages
Not all languages are regular
One cannot construct DFAs to recognize these languages:
• L = {pk qk }
• L = {wcwr | w ∈ Σ∗}
Note: neither of these is a regular expression!

(DFAs cannot count!)
But, this is a little subtle. One can construct DFAs for:
• alternating 0’s and 1’s

(ε | 1)(01)∗(ε | 0)
• sets of pairs of 0’s and 1’s
(01 | 10)+
24
Ramification - Internet Protocol
How does your browser establish a connection with a web server?
• The client sends a SYN message to the server.
• In response, the server replies with a SYN-ACK.
• Finally the client sends an ACK back to the server.
This is done through two DFAs in the client and server, respectively.
25
Ramification - Intrusion Detection
Code Operating System
FILE * f;
f=fopen("demo", "r"); ------> SYS_OPEN
strcpy(...); //vulnerability
if (!f)
printf("Fail to open\n"); ------> SYS_WRITE
else
fgets(f, buf); ------> SYS_READ
...
A DFA will be exercised simultaneously with the program on the OS side

to detect intrusion.
26

Code Source Tokens Scanner Parser IR

Uploaded by

Copyright:

Available Formats

Code Source Tokens Scanner Parser IR

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Code Source Tokens Scanner Parser IR

Uploaded by

Copyright:

Available Formats

Scanner

A scanner must recognize the units of syntax

A scanner must recognize the units of syntax

We need a powerful notation to specify these patterns

Patterns are often specified as regular languages

1. ε is a RE denoting the set {ε}

(r)(s) is a RE denoting L(r)L(s)

If we adopt a precedence for operators, the extra parentheses can go

Numbers can get much more complicated

Most programming language tokens can be described with REs

We can use REs to build scanners automatically

1. a|b denotes {a, b}

Two tables control the recognizer

To change languages, we can just change tables

Scanner generators automatically construct code from RE-like

A key issue in automation is an interface to the parser

lex is a scanner generator supplied with UNIX

• emits C code for scanner

Can we place a restriction on the form of a grammar to ensure that it

For any RE r, ∃ a grammar g such that L(r) = L(g)

Grammars that generate regular sets are called regular grammars:

They have productions in one of 2 forms:

These are also called type 3 grammars (Chomsky)

Example: the set of strings containing an even number of zeros and an

What about the RE (a | b)∗abb ?

State s0 has multiple transitions on a!

A non-deterministic finite automaton (NFA) consists of:

1. a set of states S = {s0, . . . , sn}

A Deterministic Finite Automaton (DFA) is a special case of an NFA:

1. no state has a ε-transition, and

1. DFAs are clearly a subset of NFAs

RE →NFA w/ε moves

N(AB) N(A) A N(B) B

add state T = ε-closure(s0) unmarked to Dstates

ε-closure(s0) is the start state of D

Not all languages are regular

One cannot construct DFAs to recognize these languages:

Note: neither of these is a regular expression!

But, this is a little subtle. One can construct DFAs for:

• alternating 0’s and 1’s

How does your browser establish a connection with a web server?

• The client sends a SYN message to the server.

• In response, the server replies with a SYN-ACK.

• Finally the client sends an ACK back to the server.

A DFA will be exercised simultaneously with the program on the OS side

You might also like