Code Source Tokens Scanner Parser IR

Download as pdf or txt
Download as pdf or txt
You are on page 1of 26

Scanner

source tokens
code scanner parser IR

errors
• maps characters into tokens – the basic unit of syntax
x = x + y;
becomes
<id, x> = <id, x> + <id, y> ;
• character string value for a token is a lexeme
• typical tokens: number, id, +, -, *, /, do, end
• eliminates white space (tabs, blanks, comments)
• a key issue is speed
⇒ use specialized recognizer (as opposed to lex)

1
Specifying patterns

A scanner must recognize the units of syntax


Some parts are easy:

white space
<ws> ::= <ws> ’ ’
| <ws> ’\t’
| ’ ’
| ’\t’
keywords and operators
specified as literal patterns: do, end
comments
opening and closing delimiters: /* · · · */

2
Specifying patterns

A scanner must recognize the units of syntax


Other parts are much harder:

identifiers
alphabetic followed by k alphanumerics ( , $, &, . . . )
numbers
integers: 0 or digit from 1-9 followed by digits from 0-9
decimals: integer ’.’ digits from 0-9
reals: (integer or decimal) ’E’ (+ or -) digits from 0-9
complex: ’(’ real ’,’ real ’)’

We need a powerful notation to specify these patterns


3
Operations on languages

Operation Definition
union of L and M L ∪ M = {s | s ∈ L or s ∈ M}
written L ∪ M
concatenation of L and M LM = {st | s ∈ L and t ∈ M}
written LM
Kleene closure of L L∗ = ∞ i
S
i=0 L
written L∗
+ S∞
positive closure of L L = i=1 Li
written L+

4
Regular expressions

Patterns are often specified as regular languages


Notations used to describe a regular language (or a regular set) include
both regular expressions and regular grammars
Regular expressions (over an alphabet Σ):

1. ε is a RE denoting the set {ε}


2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs, denoting L(r) and L(s), then:
(r) is a RE denoting L(r)
(r) | (s) is a RE denoting L(r) L(s)
S

(r)(s) is a RE denoting L(r)L(s)


(r)∗ is a RE denoting L(r)∗

If we adopt a precedence for operators, the extra parentheses can go


away. We assume closure, then concatenation, then alternation as the
order of precedence.
5
Examples

identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
numbers
integer → (+ | − | ε) (0 | (1 | 2 | 3 | ... | 9) digit ∗)
decimal → integer . ( digit )∗
real → ( integer | decimal ) E (+ | −) digit ∗
complex → ’(’ real , real ’)’

Numbers can get much more complicated

Most programming language tokens can be described with REs

We can use REs to build scanners automatically


6
Algebraic properties of REs

Axiom Description
r|s = s|r | is commutative
r|(s|t) = (r|s)|t | is associative
(rs)t = r(st) concatenation is associative
r(s|t) = rs|rt concatenation distributes over |
(s|t)r = sr|tr
εr = r ε is the identity for concatenation
rε = r
r∗ = (r|ε)∗ relation between ∗ and ε
r∗∗ = r∗ ∗ is idempotent

7
Examples

Let Σ = {a, b}

1. a|b denotes {a, b}


2. (a|b)(a|b) denotes {aa, ab, ba, bb}
i.e., (a|b)(a|b) = aa|ab|ba|bb
3. a∗ denotes {ε, a, aa, aaa, . . .}
4. (a|b)∗ denotes the set of all strings of a’s and b’s (including ε)
i.e., (a|b)∗ = (a∗b∗)∗
5. a|a∗b denotes {a, b, ab, aab, aaab, aaaab, . . .}

8
Recognizers
From a regular expression we can construct a
deterministic finite automaton (DFA)
Recognizer for identifier :
letter
digit

letter other
0 1 2

digit accept
other
3

error

identifier
letter → (a | b | c | ... | z | A | B | C | ... | Z)
digit → (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
id → letter ( letter | digit )∗
9
Code for the recognizer
char ← next char();
state ← 0; /* code for state 0 */
done ← false;
token value ← "" /* empty string */
while( not done ) {
class ← char class[char];
state ← next state[class,state];
switch(state) {
case 1: /* building an id */
token value ← token value + char;
char ← next char();
break;
case 2: /* accept state */
token type = identifier;
done = true;
break;
case 3: /* error */
token type = error;
done = true;
break;
}
}
return token type;

10
Tables for the recognizer

Two tables control the recognizer


a−z A−Z 0−9 other
char class:
value letter letter digit other

class 0 1 2 3
letter 1 1 — —
next state:
digit 3 1 — —
other 3 2 — —

To change languages, we can just change tables


11
Automatic construction

Scanner generators automatically construct code from RE-like


descriptions

• construct a DFA
• use state minimization techniques
• emit code for the scanner
(table driven or direct code )

A key issue in automation is an interface to the parser

lex is a scanner generator supplied with UNIX

• emits C code for scanner


• provides macro definitions for each token
(used in the parser)
12
Grammars for regular languages

Can we place a restriction on the form of a grammar to ensure that it


describes a regular language?

Provable fact:

For any RE r, ∃ a grammar g such that L(r) = L(g)

Grammars that generate regular sets are called regular grammars:

They have productions in one of 2 forms:


1. A → aA
2. A → a
where A is any non-terminal and a is any terminal symbol

These are also called type 3 grammars (Chomsky)


13
More regular languages

Example: the set of strings containing an even number of zeros and an


even number of ones

1
s0 s1
1
0 0 0 0
1
s2 s3
1
The RE is (00 | 11)∗((01 | 10)(00 | 11)∗(01 | 10)(00 | 11)∗)∗

14
More regular expressions

What about the RE (a | b)∗abb ?

a jb

a b b
s0 s1 s2 s3

State s0 has multiple transitions on a!


⇒ nondeterministic finite automaton
a b
s0 {s0, s1} {s0}
s1 – {s2}
s2 – {s3}

15
Finite automata

A non-deterministic finite automaton (NFA) consists of:

1. a set of states S = {s0, . . . , sn}


2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of
states
4. a distinguished start state s0
5. a set of distinguished accepting or final states F

A Deterministic Finite Automaton (DFA) is a special case of an NFA:

1. no state has a ε-transition, and


2. for each state s and input symbol a, there is at most one edge labelled
a leaving s

A DFA accepts x iff. ∃ a unique path through the transition graph from s0 to
a final state such that the edges spell x.
16
DFAs and NFAs are equivalent

1. DFAs are clearly a subset of NFAs


2. Any NFA can be converted into a DFA, by simulating sets of
simultaneous states:
• each DFA state corresponds to a set of NFA states
• possible exponential blowup

17
NFA to DFA using the subset construction: example 1
ajb

a b b
s0 s1 s2 s3

a b
{s0} {s0, s1} {s0}
{s0, s1} {s0, s1} {s0, s2}
{s0, s2} {s0, s1} {s0, s3}
{s0, s3} {s0, s1} {s0}
b

b
a

a b b
fs0 g fs0 ; s1 g fs0 ; s2 g fs0 ; s3 g

a
a

18
Constructing a DFA from a regular expression

DFA
minimized

RE DFA

NFA
ε moves

RE →NFA w/ε moves


build NFA for each term
connect them with ε moves
NFA w/ε moves to DFA
construct the simulation
the “subset” construction
DFA → minimized DFA
merge compatible states
DFA → RE
construct Rkij = Rk−1
ik (Rk−1 ∗ k−1 S k−1
kk ) Rk j Ri j

19
RE to NFA
ε
N(ε)
a
N(a)

N(A) A
ε ε

N(A|B)
ε ε
N(B) B

N(AB) N(A) A N(B) B

N(A∗) ε
ε
ε
N(A) A

20
RE to NFA: example

N(A) A
ε ε

a|b
ε ε
N(B) B

a
2 3

ε ε

(a|b)∗ 0
ε
1 6
ε
7

ε ε

4 5
b

a b b
abb 7 8 9 10

21
NFA to DFA: the subset construction
Input: NFA N
Output: A DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states, and using the following operations:

Operation Definition
ε-closure(s) set of NFA states reachable from NFA state s on ε-transitions
alone
ε-closure(T ) set of NFA states reachable from some NFA state s in T on
ε-transitions alone
move(T, a) set of NFA states to which there is a transition on input symbol
a from some NFA state s in T

add state T = ε-closure(s0) unmarked to Dstates


while ∃ unmarked state T in Dstates
mark T
for each input symbol a
U = ε-closure(move(T, a))
if U 6∈ Dstates then add U to Dstates unmarked
Dtrans[T, a] = U
endfor
endwhile

ε-closure(s0) is the start state of D


A state of D is final if it contains at least one final state in N

22
NFA to DFA using subset construction: example 2
ε

a
2 3

ε ε

ε ε a b b
0 1 6 7 8 9 10

ε ε

4 5
b

a b
A B C
A = {0, 1, 2, 4, 7} D = {1, 2, 4, 5, 6, 7, 9}
B B D
B = {1, 2, 3, 4, 6, 7, 8} E = {1, 2, 4, 5, 6, 7, 10}
C B C
C = {1, 2, 4, 5, 6, 7}
D B E
E B C
b

b
b
a
a

a b b
A B D E

a
a

23
Limits of regular languages

Not all languages are regular

One cannot construct DFAs to recognize these languages:

• L = {pk qk }
• L = {wcwr | w ∈ Σ∗}

Note: neither of these is a regular expression!


(DFAs cannot count!)

But, this is a little subtle. One can construct DFAs for:

• alternating 0’s and 1’s


(ε | 1)(01)∗(ε | 0)
• sets of pairs of 0’s and 1’s
(01 | 10)+

24
Ramification - Internet Protocol

How does your browser establish a connection with a web server?

• The client sends a SYN message to the server.

• In response, the server replies with a SYN-ACK.

• Finally the client sends an ACK back to the server.

This is done through two DFAs in the client and server, respectively.

25
Ramification - Intrusion Detection
Code Operating System
FILE * f;
f=fopen("demo", "r"); ------> SYS_OPEN
strcpy(...); //vulnerability
if (!f)
printf("Fail to open\n"); ------> SYS_WRITE
else
fgets(f, buf); ------> SYS_READ
...

A DFA will be exercised simultaneously with the program on the OS side


to detect intrusion.

26

You might also like