CD Unit I Part II Lexical Analysis


Compiler Design

Dr. S. Gopi Krishna
Professor in CSE
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

1. The Role of Lexical Analyzer (Scanner)
2. Input Buffering
3. Specification of Tokens
4. Recognition of Tokens
5. Lexical Analyzer Generator - Lex
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

1. The Role of Lexical Analyzer
 1.1 Tokens, Patterns, and Lexemes
 1.2 Attributes for Tokens
 1.3 Lexical Errors
The Role of Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is
 to read the input characters of the source program,
 group them into lexemes, and
 produce as output a sequence of tokens for each lexeme in the source program.

The stream of tokens is sent to the parser for syntax analysis.

It is common for the lexical analyzer to interact with the symbol table as well.
The Role of Lexical Analyzer

These interactions are suggested in the figure.

Commonly, the interaction is implemented by having the parser call the lexical analyzer.

The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
The Role of Lexical Analyzer

The lexical analyzer may perform certain other tasks besides identification of lexemes.

1. One such task is stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input).

2. Another task is correlating error messages generated by the compiler with the source program.

Sometimes, lexical analyzers are divided into a cascade of two processes:

 Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one (see the sketch below).

 Lexical analysis proper is the more complex portion, which produces the tokens from the output of the scanner.
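To make the scanning pass concrete, here is a minimal C sketch of such a process, assuming C-style // and /* */ comments; it ignores string literals for brevity, and the function name and I/O choices are illustrative, not prescribed by the text.

#include <stdio.h>

/* Scanning pass sketch: copy input to output, deleting comments and
 * compacting runs of whitespace into a single blank. */
void scan_pass(FILE *in, FILE *out)
{
    int c, next;
    while ((c = getc(in)) != EOF) {
        if (c == '/') {
            next = getc(in);
            if (next == '/') {                    /* line comment: skip it */
                while ((c = getc(in)) != EOF && c != '\n')
                    ;
                putc(' ', out);                   /* comment acts as separator */
            } else if (next == '*') {             /* block comment: skip it */
                int prev = 0;
                while ((c = getc(in)) != EOF && !(prev == '*' && c == '/'))
                    prev = c;
                putc(' ', out);
            } else {                              /* a lone '/', keep it */
                putc('/', out);
                if (next != EOF) ungetc(next, in);
            }
        } else if (c == ' ' || c == '\t' || c == '\n') {
            putc(' ', out);                       /* compact whitespace run */
            while ((c = getc(in)) != EOF &&
                   (c == ' ' || c == '\t' || c == '\n'))
                ;
            if (c != EOF) ungetc(c, in);
        } else {
            putc(c, out);                         /* ordinary character */
        }
    }
}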
1.1 Tokens, Patterns, and Lexemes

When discussing lexical analysis, we use three related but distinct terms:

1. A token is a pair consisting of a token name and an optional attribute value.

2. A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.

3. A lexeme is a sequence of characters that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

To see how these concepts are used in practice, consider the C statement

printf("Total = %d\n", score);

• printf and score are lexemes matching the pattern for token id, and
• "Total = %d\n" is a lexeme matching the pattern for token literal.
1.1 Tokens, Patterns, and Lexemes

[Table: examples of tokens, their informal descriptions, and sample lexemes]
1.2 Attributes for Tokens

The lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token:

 the token name influences parsing decisions,
 while the attribute value influences translation of tokens after the parse.

Normally, information about an identifier — e.g., its lexeme, its type, and the location at which it is first found (in case an error message about that identifier must be issued) — is kept in the symbol table.

Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
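A hedged C sketch of a token carrying such an attribute; the type and field names here are illustrative, not mandated by the text.

/* Illustrative token representation: a token name plus an attribute.
 * For an id, the attribute is a pointer to the symbol-table entry;
 * for a number, it might be the numeric value itself. */
typedef enum { ID, NUMBER, ASSIGN_OP, MULT_OP, EXP_OP } TokenName;

typedef struct SymEntry SymEntry;  /* symbol-table entry, defined elsewhere */

typedef struct {
    TokenName name;                /* influences parsing decisions */
    union {
        SymEntry *entry;           /* attribute of an id */
        int       value;           /* attribute of a number */
    } attr;                        /* influences translation after the parse */
} Token;

Under this representation, the FORTRAN example on the next slide would be delivered to the parser as a sequence of Token values.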
1.2 Attributes for Tokens

Example: The token names and associated attribute values for the FORTRAN statement

E = M * C ** 2

are written below as a sequence of pairs:

<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
1.3 Lexical Errors

It is hard for a lexical analyzer to tell that there is a source-code error.

For instance, if the string fi is encountered for the first time in a C program in the context:

fi ( a == f(x)) ...

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.

Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler handle an error due to transposition of the letters.
1.3 Lexical Errors

However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.

Possible error-recovery actions are:

1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
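The simplest recovery strategy applies action 1 repeatedly, deleting characters until the scanner can continue. A minimal C sketch, assuming a hypothetical predicate matches_some_prefix():

extern int matches_some_prefix(const char *input);  /* hypothetical helper */

/* Recovery sketch: delete characters one at a time (action 1) until
 * some token pattern again matches a prefix of the remaining input. */
const char *recover(const char *input)
{
    while (*input != '\0' && !matches_some_prefix(input))
        input++;                       /* delete one character and retry */
    return input;                      /* resume scanning here */
}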
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

1. The Role of Lexical Analyzer
 1.1 Tokens, Patterns, and Lexemes
 1.2 Attributes for Tokens
 1.3 Lexical Errors

…. Thank You
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

1. The Role of Lexical Analyzer (Scanner)
2. Input Buffering
3. Specification of Tokens
4. Recognition of Tokens
5. Lexical Analyzer Generator - Lex
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

2. Input Buffering
 2.1 Buffer Pairs
 2.2 Sentinels
 Input Buffering Program
Input Buffering

We cannot be sure we've seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id.

In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=.

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character:

1. Buffer Pairs
2. Sentinels
2.1 Buffer Pairs

An important scheme involves two buffers that are alternately reloaded, as suggested in the figure.

[Figure: Using a pair of input buffers]

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes.

Using one system read command we can read N characters into a buffer, rather than using one system call per character.
2.1 Buffer Pairs

If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any possible character of the source program.

Two pointers to the input are maintained:

1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.

2. Pointer forward scans ahead until a pattern match is found.
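A hedged C sketch of this arrangement. The function name fill_buffer, the file-descriptor parameter, and the use of '\0' to stand for eof are illustrative assumptions, not part of the text.

#include <unistd.h>                  /* read() */

#define N 4096                       /* one disk block per buffer */

static char buf[2 * N];              /* the two buffers, laid end to end */
static char *lexemeBegin, *forward;  /* the two pointers described above */

/* Reload one half of the buffer pair with a single system call. */
static void fill_buffer(int half, int fd)    /* half is 0 or 1 */
{
    char *start = buf + half * N;
    ssize_t n = read(fd, start, N);  /* N characters in one system call */
    if (n < N)                       /* short read: input is exhausted, */
        start[n < 0 ? 0 : n] = '\0'; /* so plant the special eof marker */
}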


2.2 Sentinels

If we use the scheme of buffer pairs, we must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we make two tests:

1. one for the end of the buffer, and
2. one to determine what character is read (the latter may be a multiway branch).

We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
2.2 Sentinels

The figure shows the same arrangement as the buffer pair, but with the sentinels added.

[Figure: Sentinels at the end of each buffer]

Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means that the input is at an end.
Input Buffering Program
Program 1.1 summarizes the algorithm for advancing forward. Notice how the first test, which can be part of a
multiway branch based on the character pointed to by forward, is the only test we make, except in the case where
we actually are at the end of a buffer or the end of the input.
Program 1.1: Lookahead code with sentinels
switch (*forward++) {
case eof:
    if (forward is at the end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at the end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
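Program 1.1 is pseudocode; for concreteness, here is one way the same scheme might look in real C. This is a sketch under stated assumptions: '\0' is usable as the eof sentinel, buffer 1 is preloaded and both sentinel slots are initialized at start-up, and reload() and terminate_lexical_analysis() are hypothetical helpers, not a fixed interface.

/* Buffer pair with sentinels, rendered in C (illustrative sketch).
 * Buffer 1 occupies buf[0..N-1] with its sentinel at buf[N];
 * buffer 2 occupies buf[N+1..2N] with its sentinel at buf[2N+1]. */
#define N      4096
#define EOF_CH '\0'                 /* assumes '\0' never occurs in source */

static char buf[2 * (N + 1)];
static char *forward = buf;

extern void reload(char *start);    /* hypothetical: one read() of up to N
                                       chars into start[0..N-1], planting
                                       EOF_CH early after a short read */
extern void terminate_lexical_analysis(void);

int next_char(void)
{
    int c = (unsigned char)*forward++;
    if (c != EOF_CH)
        return c;                              /* the common, cheap path */
    if (forward == buf + N + 1) {              /* sentinel of buffer 1 */
        reload(buf + N + 1);
        forward = buf + N + 1;
    } else if (forward == buf + 2 * (N + 1)) { /* sentinel of buffer 2 */
        reload(buf);
        forward = buf;
    } else {                                   /* eof inside a buffer */
        terminate_lexical_analysis();
        return EOF_CH;
    }
    return next_char();                        /* reread in the new buffer */
}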
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

2. Input Buffering
 2.1 Buffer Pairs
 2.2 Sentinels
 Input Buffering Program

… Thank You
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

1. The Role of Lexical Analyzer (Scanner)
2. Input Buffering
3. Specification of Tokens
4. Recognition of Tokens
5. Lexical Analyzer Generator - Lex
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

3. Specification of Tokens
 Strings and Languages
 Operations on Languages
 Regular Expressions
 Regular Definitions
1 Strings and Languages

Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.

1. An alphabet is any finite set of symbols.
 Typical examples of symbols are letters, digits, and punctuation.
 The set {0,1} is the binary alphabet.

2. A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
 The length of a string s, usually written |s|, is the number of occurrences of symbols in s.
 For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.
1 Strings and Languages

3. A language is any countable set of strings over some fixed alphabet.
 Abstract languages like ∅, the empty set, or {ε}, the set containing only the empty string, are languages under this definition.
 If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed by appending y to x. For example, if x = dog and y = house, then xy = doghouse.

The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.

If we think of concatenation as a product, we can define the "exponentiation" of strings as follows. Define s^0 to be ε, and for all i > 0, define s^i to be s^(i-1)s. Since εs = s, it follows that s^1 = s. Then s^2 = ss, s^3 = sss, and so on.
2 Operations on Languages

[Table: definitions of union, concatenation, Kleene closure, and positive closure of languages]
2 Operations on Languages

In lexical analysis, the most important operations on languages are union, concatenation, and closure, which are defined formally in the table.

Union is the familiar operation on sets.

The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.

The (Kleene) closure of a language L, denoted L*, is the set of strings obtained by concatenating L zero or more times. Note that L^0, the "concatenation of L zero times," is defined to be {ε}, and inductively, L^i is L^(i-1)L.

Finally, the positive closure, denoted L+, is the same as the Kleene closure, but without the term L^0. That is, ε will not be in L+ unless it is in L itself.
3 Regular Expressions

We can describe identifiers by giving names to sets of letters and digits and using the language operators union, concatenation, and closure.

This process is so useful that a notation called regular expressions has come into common use for describing all the languages that can be built from these operators applied to the symbols of some alphabet.

In this notation, if letter_ is established to stand for any letter or the underscore, and digit is established to stand for any digit, then we could describe the language of C identifiers by:

letter_ ( letter_ | digit )*

The vertical bar above means union, the parentheses are used to group subexpressions, the star means "zero or more occurrences of," and the juxtaposition of letter_ with the remainder of the expression signifies concatenation.
3 Regular Expressions

The regular expressions are built recursively out of smaller regular expressions, using the rules described below.

Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's subexpressions.

Here are the rules that define the regular expressions over some alphabet ∑ and the languages that those expressions denote.

BASIS: There are two rules that form the basis:

1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position. Note that by convention, we use italics for symbols, and boldface for their corresponding regular expression.
3 Regular Expressions

INDUCTION: There are four parts to the induction whereby larger regular expressions are
built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and
L(s), respectively.
1. (r)|(s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r). This last rule says that we can add additional pairs
of parentheses around expressions without changing the language they denote.
3 Regular Expressions

As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop
certain pairs of parentheses if we adopt the conventions that:
a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.

Under these conventions, for example, we may replace the regular expression (a)|((b)*(c)) by
a|b*c. Both expressions denote the set of strings that are either a single a or are zero or more
b's followed by one c.
3 Regular Expressions

Example: Let ∑ = {a, b}.

1. The regular expression a|b denotes the language {a, b}.
2. (a|b)(a|b) denotes {aa, ab, ba, bb}, the language of all strings of length two over the alphabet ∑. Another regular expression for the same language is aa|ab|ba|bb.
3. a* denotes the language consisting of all strings of zero or more a's, that is, {ε, a, aa, aaa, ...}.
4. (a|b)* denotes the set of all strings consisting of zero or more instances of a or b, that is, all strings of a's and b's: {ε, a, b, aa, ab, ba, bb, aaa, ...}. Another regular expression for the same language is (a*b*)*.
5. a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings consisting of zero or more a's and ending in b.
3 Regular Expressions

A language that can be defined by a regular expression is called a regular set.

If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s. For instance, (a|b) = (b|a).

There are a number of algebraic laws for regular expressions; each law asserts that expressions of two different forms are equivalent. The figure shows some of the algebraic laws that hold for arbitrary regular expressions r, s, and t.

[Table: algebraic laws for regular expressions]
4 Regular Definitions

For notational convenience, we may wish to give names to certain regular expressions and use those names in subsequent expressions, as if the names were themselves symbols.

If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

d1 -> r1
d2 -> r2
...
dn -> rn

where:
1. Each di is a new symbol, not in ∑ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet ∑ ∪ {d1, d2, ..., di-1}.
4 Regular Definitions

Example 1: C identifiers are strings of letters, digits, and underscores. Here is a regular definition for the language of C identifiers. We shall conventionally use italics for the symbols defined in regular definitions.

letter_ -> A | B | ... | Z | a | b | ... | z | _
digit   -> 0 | 1 | ... | 9
id      -> letter_ ( letter_ | digit )*

Example 2: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition:

digit            -> 0 | 1 | ... | 9
digits           -> digit digit*
optionalFraction -> . digits | ε
optionalExponent -> ( E ( + | - | ε ) digits ) | ε
number           -> digits optionalFraction optionalExponent
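These definitions translate almost symbol-for-symbol into POSIX extended regular expressions. The following C sketch checks sample strings against the number definition using regcomp()/regexec(); it is illustrative only — a real scanner matches prefixes incrementally rather than testing whole strings.

#include <regex.h>
#include <stdio.h>

int main(void)
{
    /* number -> digits (. digits)? ( E (+|-)? digits )?  as a POSIX ERE */
    const char *pattern = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
    const char *tests[] = { "5280", "0.01234", "6.336E4", "1.89E-4", "1.2.3" };
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 1;
    for (int i = 0; i < 5; i++)
        printf("%-8s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "number" : "no match");
    regfree(&re);
    return 0;
}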
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

3. Specification of Tokens
 Strings and Languages
 Operations on Languages
 Regular Expressions
 Regular Definitions

… Thank You
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

1. The Role of Lexical Analyzer (Scanner)
2. Input Buffering
3. Specification of Tokens
4. Recognition of Tokens
5. Lexical Analyzer Generator - Lex
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

4. Recognition of Tokens
 Transition Diagrams
 Recognition of Reserved Words and Identifiers
Recognition of Tokens

Consider a grammar for branching statements, and study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.

Our discussion will make use of the following running example:

stmt -> if expr then stmt
      | if expr then stmt else stmt

expr -> term relop term
      | term

term -> id | number

The grammar fragment describes a simple form of branching statements and conditional expressions.
Recognition of Tokens

The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned.

The patterns for these tokens are described using regular definitions, as given below:

digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? ( E [+-]? digits )?
letter -> [A-Za-z]
id     -> letter ( letter | digit )*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>
Recognition of Tokens
In addition, we assign the lexical analyzer the job of stripping out whitespace,
by recognizing the "token" ws defined by:

ws -> ( blank | tab | newline )+


Here, blank, tab, and newline are abstract symbols that we use to express the
ASCII characters of the same names.

Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character
that follows the whitespace.
Recognition of Tokens

Our goal for the lexical analyzer is summarized in the table below (tokens, their patterns, and attribute values). It shows, for each lexeme or family of lexemes, which token name is returned to the parser and what attribute value is returned.

LEXEMES      TOKEN NAME   ATTRIBUTE VALUE
Any ws       --           --
if           if           --
then         then         --
else         else         --
Any id       id           Pointer to table entry
Any number   number       --
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE

Note that for the six relational operators, symbolic constants LT, LE, and so on are used as the attribute value, in order to indicate which instance of the token relop we have found. The particular operator found will influence the code that is output from the compiler.
1 Transition Diagrams

As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts, called "transition diagrams."

Transition diagrams have a collection of nodes or circles, called states.
 Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
 The start state, or initial state, is indicated by an edge labeled "start" entering from nowhere.
 Accepting, or final, states indicate that a lexeme has been found.
 Each edge is labeled by a symbol or set of symbols.
1 Transition Diagrams

Example: The figure is a transition diagram that recognizes the lexemes matching the token relop.

[Figure: Transition diagram for relop]

We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=.

We therefore go to state 1, and look at the next character.

1. If it is =, then we recognize lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular comparison operator.
2. If in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to return an indication that the not-equals operator has been found.
3. On any other character, the lexeme is <, and we enter state 4 to return that information. Note, however, that state 4 has a * to indicate that we must retract the input one position.
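The diagram translates almost mechanically into code, one branch per state. A hedged C sketch, where next_char() and retract() are assumed input helpers and the attribute type is illustrative:

typedef enum { LT, LE, EQ, NE, GT, GE } RelopAttr;

extern int  next_char(void);   /* assumed: returns the next input character */
extern void retract(void);     /* assumed: gives one character back */

/* Returns 1 and sets *attr if a relop lexeme is recognized, else 0. */
int match_relop(RelopAttr *attr)
{
    int c = next_char();                          /* state 0 */
    switch (c) {
    case '<':                                     /* state 1 */
        c = next_char();
        if (c == '=') { *attr = LE; return 1; }   /* state 2 */
        if (c == '>') { *attr = NE; return 1; }   /* state 3 */
        retract();     *attr = LT; return 1;      /* state 4: retract one */
    case '=':          *attr = EQ; return 1;      /* state 5 */
    case '>':                                     /* state 6 */
        c = next_char();
        if (c == '=') { *attr = GE; return 1; }   /* state 7 */
        retract();     *attr = GT; return 1;      /* state 8: retract one */
    default:
        retract();     return 0;                  /* fail: not a relop */
    }
}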
2 Recognition of Reserved Words and Identifiers

Recognizing keywords and identifiers presents a problem. Usually, keywords like if or then are reserved (as they are in our running example), so they are not identifiers even though they look like identifiers.

Thus, although we typically use a transition diagram like that of the figure to search for identifier lexemes, this diagram will also recognize the keywords if, then, and else of our running example.

[Figure: A transition diagram for id's and keywords]
2 Recognition of Reserved Words and Identifiers

There are two ways that we can handle reserved words that look like identifiers:

1. Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent (see the sketch below).

2. Create separate transition diagrams for each keyword; an example for the keyword then is shown in the figure.
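A hedged C sketch of method 1; the token codes, flat table layout, and function names are illustrative assumptions. The table is seeded with the reserved words, so the ordinary identifier lookup resolves them to keyword tokens:

#include <string.h>

enum { ID = 256, IF, THEN, ELSE };            /* illustrative token codes */

typedef struct { const char *lexeme; int token; } Entry;

static Entry table[1024];                     /* a real table would hash */
static int   n_entries;

static void install(const char *lexeme, int token)
{
    table[n_entries].lexeme = lexeme;         /* a real table copies the string */
    table[n_entries].token  = token;
    n_entries++;
}

void init_reserved(void)                      /* called once at start-up */
{
    install("if", IF);
    install("then", THEN);
    install("else", ELSE);
}

/* When the id diagram accepts a lexeme, one lookup decides: a hit on a
 * seeded entry yields the keyword's token, anything else is token id. */
int token_for(const char *lexeme)
{
    for (int i = 0; i < n_entries; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0)
            return table[i].token;
    install(lexeme, ID);                      /* first occurrence: install it */
    return ID;
}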
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

4. Recognition of Tokens
 Transition Diagrams
 Recognition of Reserved Words and Identifiers

…. Thank You
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

1. The Role of Lexical Analyzer (Scanner)
2. Input Buffering
3. Specification of Tokens
4. Recognition of Tokens
5. Lexical Analyzer Generator - Lex
Agenda UNIT I : PART II (Phase I: Lexical Analysis)

5. Lexical Analyzer Generator - Lex
 Use of Lex
 Structure of Lex Programs
1 Use of Lex

A tool called Lex allows specifying a lexical analyzer by writing regular expressions to describe patterns for tokens.

The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler.

Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
1 Use of Lex

The translation from regular expressions to transition diagrams happens behind the scenes; the figure suggests how Lex is used.

[Figure: Creating a lexical analyzer with Lex]

1. An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated.
2. The Lex compiler transforms lex.l to a C program, in a file that is always named lex.yy.c.
3. The latter file is compiled by the C compiler into a file called a.out, as always.
4. The C-compiler output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens.
1 Use of Lex

The normal use of the compiled C program, referred to as a.out, is as a subroutine of the parser.

It is a C function that returns an integer, which is a code for one of the possible token names.

The attribute value, whether it is
 another numeric code,
 a pointer to the symbol table, or
 nothing,
is placed in a global variable yylval, which is shared between the lexical analyzer and parser, thereby making it simple to return both the name and an attribute value of a token.
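A hedged sketch of that calling convention from the parser's side; the token codes are illustrative, and yylval is assumed to have Lex's default int type:

extern int yylex(void);   /* the scanner generated in lex.yy.c */
extern int yylval;        /* attribute of the most recently returned token */

void parse(void)
{
    int tok;
    while ((tok = yylex()) != 0) {   /* yylex() returns 0 at end of input */
        /* tok is the token-name code; yylval carries the attribute,
           e.g. a symbol-table pointer/index for an id, or a symbolic
           constant such as LT for a particular relop */
    }
}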
2 Structure of Lex Programs

A Lex program has the following form:

declarations
%%
translation rules
%%
auxiliary functions

1. The declarations section includes declarations of variables, manifest constants, and regular definitions.

2. The translation rules each have the form

Pattern { Action }

 Each pattern is a regular expression, which may use the regular definitions of the declarations section.
 The actions are fragments of code, typically written in C.

3. The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
2 Structure of Lex Programs

The figure is a Lex program that recognizes the tokens of the relop transition diagram and returns the token found.

A few observations about this code will introduce us to many of the important features of Lex.

[Figure: Lex program for the tokens of the relop transition diagram]
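The figure itself is not reproduced in this text, so the following is a hedged reconstruction of a Lex program along those lines. The manifest constants and the install functions are assumptions about what such a program would declare, not a verbatim copy of the figure.

%{
    /* manifest constants: token names and relop attribute values,
       assumed to be shared with the parser */
    enum { IF = 256, THEN, ELSE, ID, NUMBER, RELOP };
    enum { LT = 1, LE, EQ, NE, GT, GE };
    int yylval;
    int installID(void);
    int installNum(void);
%}

/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      { /* no action and no return: whitespace is discarded */ }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = installID();  return ID; }
{number}  { yylval = installNum(); return NUMBER; }
"<"       { yylval = LT; return RELOP; }
"<="      { yylval = LE; return RELOP; }
"="       { yylval = EQ; return RELOP; }
"<>"      { yylval = NE; return RELOP; }
">"       { yylval = GT; return RELOP; }
">="      { yylval = GE; return RELOP; }

%%

int installID(void)  { /* install yytext, of length yyleng, in the
                          symbol table and return a handle to it */ return 0; }
int installNum(void) { /* similar, but for a table of numeric constants */ return 0; }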


Agenda UNIT I : PART II (Phase I: Lexical Analysis)

5. Lexical Analyzer Generator - Lex
 Use of Lex
 Structure of Lex Programs

… Thank You
Agenda END OF UNIT I : PART II (Phase I: Lexical Analysis)

1. The Role of Lexical Analyzer (Scanner)
2. Input Buffering
3. Specification of Tokens
4. Recognition of Tokens
5. Lexical Analyzer Generator - Lex

… Thank You
