CD Unit I Part II Lexical Analysis
Dr.S.Gopi Krishna
Professor in CSE
Agenda UNIT I : PART II (Phase I: Lexical Analysis)
The Role of Lexical Analyzer
Tokens, Patterns, and Lexemes
Attributes for Tokens
Lexical Errors
The Role of Lexical Analyzer
Commonly, the interaction between the parser and the lexical analyzer is implemented by having the parser call the lexical analyzer. The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
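This interaction can be pictured as a simple loop. Below is a minimal C sketch of a parser repeatedly requesting tokens; the names Token, getNextToken, and the token codes are illustrative assumptions, not from the slides.

#include <stdio.h>

/* Hypothetical token names; a real compiler derives these from the grammar. */
typedef enum { TOK_ID, TOK_NUMBER, TOK_IF, TOK_RELOP, TOK_EOF } TokenName;

/* A token is a token name plus an optional attribute value. */
typedef struct {
    TokenName name;
    int       attr;   /* e.g., an index into the symbol table */
} Token;

/* Supplied by the lexical analyzer: reads characters until it can
   identify the next lexeme, then returns the corresponding token. */
Token getNextToken(void);

void parse(void) {
    Token t = getNextToken();
    while (t.name != TOK_EOF) {
        /* ... grammar-directed processing of token t ... */
        t = getNextToken();
    }
}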
The lexical analyzer may perform certain other tasks besides identification of lexemes.
1. One such task is stripping out comments and whitespace (blank, newline, tab, and
perhaps other characters that are used to separate tokens in the input).
2. Another task is correlating error messages generated by the compiler with the source
program.
Sometimes, lexical analyzers are divided into a cascade of two processes: scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one; lexical analysis proper is the more complex portion, which produces tokens from the output of the scanner.
1.1 Tokens, Patterns, and Lexemes
When discussing lexical analysis, we use three related but distinct terms:
1. A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword or a sequence of input characters denoting an identifier.
2. A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.
3. A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
Examples of tokens
1.2 Attributes for Tokens
The lexical analyzer returns to the parser not only a token name but also an attribute value that describes the lexeme represented by the token.
Normally, information about an identifier — e.g., its lexeme, its type, and the location at which
it is first found (in case an error message about that identifier must be issued) — is kept in the
symbol table.
Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for
that identifier.
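As a concrete illustration, here is a minimal C sketch of such a symbol table. All names (SymEntry, install, the fields) are hypothetical, chosen only to mirror the information mentioned above: the lexeme, its type, and where it was first found. Bounds checking is omitted for brevity.

#include <stdlib.h>
#include <string.h>

typedef struct {
    char *lexeme;      /* the identifier's spelling                    */
    char *type;        /* its type, once determined                    */
    int   first_line;  /* where it was first found, for error messages */
} SymEntry;

#define MAX_SYMS 1024
static SymEntry symtab[MAX_SYMS];
static int nsyms = 0;

/* Return a pointer to the entry for this lexeme, installing it on
   first sight. This pointer is the attribute value of the id token. */
SymEntry *install(const char *lexeme, int line) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return &symtab[i];
    symtab[nsyms].lexeme = strdup(lexeme);
    symtab[nsyms].type = NULL;
    symtab[nsyms].first_line = line;
    return &symtab[nsyms++];
}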
Example: The token names and associated attribute values for the FORTRAN statement
E = M * C ** 2
are written below as a sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
Note that in certain pairs, especially operators, punctuation, and keywords, there is no need for an attribute value.
1.3 Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, if the string fi is encountered for the first time in a C program in the context:
fi ( a == f(x)) ...
the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler handle the error due to transposition of the letters.
However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode": we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.
Agenda UNIT I : PART II (Phase I: Lexical Analysis)
Input Buffering
2.1 Buffer Pairs
2.2 Sentinels
Input Buffering Program
Input Buffering
We often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme. For instance, we cannot be sure we've seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id.
Because of the amount of time taken to process characters and the large number
of characters that must be processed during the compilation of a large source
program, specialized buffering techniques have been developed to reduce the
amount of overhead required to process a single input character.
1. Buffer Pairs
2. Sentinels
2.1 Buffer Pairs
An important scheme involves two buffers that are alternately reloaded, as
suggested in Fig. Using a pair of input buffers
Each buffer is of the same size N, and N is usually the size of a disk block, e.g.,
4096 bytes.
Using one system read command we can read N characters into a buffer, rather
than using one system call per character.
If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file and is different from any possible character of the source program.
2.2 Sentinels
We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
The figure shows the same arrangement as the buffer-pair figure, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input: any eof that appears other than at the end of a buffer means that the input is at an end.
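The scheme can be sketched in C as follows. This is a minimal illustration under the assumptions above: two buffers of N characters plus one sentinel slot each, with the eof character as the sentinel, assumed never to occur inside the source program. The names reload, next_char, and init_input are illustrative.

#include <stdio.h>

#define N 4096                    /* buffer size: typically one disk block */

static char buf[2][N + 1];        /* two buffers, each with a sentinel slot */
static char *forward;             /* scanning pointer into the buffers      */
static FILE *src;

/* Refill buffer i with one system read and plant the sentinel right
   after the last character actually read. */
static void reload(int i) {
    size_t n = fread(buf[i], 1, N, src);
    buf[i][n] = (char)EOF;        /* sentinel: ends the buffer, or the input */
}

/* One-time setup: fill the first buffer and start scanning there. */
static void init_input(FILE *f) {
    src = f;
    reload(0);
    forward = buf[0];
}

/* Advance forward and return the next character. In the common case
   only one test per character is needed: is it the sentinel? */
static int next_char(void) {
    for (;;) {
        char c = *forward++;
        if (c != (char)EOF)
            return (unsigned char)c;            /* ordinary character        */
        if (forward - 1 == &buf[0][N]) {        /* sentinel ending buffer 0  */
            reload(1);
            forward = buf[1];
        } else if (forward - 1 == &buf[1][N]) { /* sentinel ending buffer 1  */
            reload(0);
            forward = buf[0];
        } else {
            return EOF;     /* eof not at a buffer end: true end of input */
        }
    }
}

Without the sentinel, every advance of forward would need two tests, one for the end of the buffer and one to determine the character itself; the sentinel combines them into one test in the common case.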
Agenda UNIT I : PART II (Phase I: Lexical Analysis)
Specification of Tokens
Strings and Languages
Operations on Languages
Regular Expressions
Regular Definitions
1 Strings and Languages
Regular expressions are an important notation for specifying lexeme patterns. While they
cannot express all possible patterns, they are very effective in specifying those types of
patterns that we actually need for tokens.
1. An alphabet is any finite set of symbols. Typical examples of symbols are letters and digits; the set {0,1} is the binary alphabet.
2. A string over an alphabet is a finite sequence of symbols drawn from that alphabet. The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.
3. A language is any countable set of strings over some fixed alphabet. Abstract languages like ∅, the empty set, or {ε}, the set containing only the empty string, are languages under this definition.
If x and y are strings, then the concatenation of x and y, denoted xy, is the string formed by appending y to x. For example, if x = dog and y = house, then xy = doghouse.
The empty string is the identity under concatenation; that is, for any string s, εs = sε = s.
2 Operations on Languages
The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.
The (Kleene) closure of a language L, denoted L*, is the set of strings you get by concatenating L zero or more times. Note that L^0, the "concatenation of L zero times," is defined to be {ε}, and inductively, L^i is L^(i-1)L.
Finally, the positive closure, denoted L+, is the same as the Kleene closure, but without the term L^0. That is, ε will not be in L+ unless it is in L itself.
For example, if L is the set of letters and D the set of digits, then L ∪ D is the set of single letters or digits, LD is the set of all two-character letter-digit strings, D+ is the set of all strings of one or more digits, and L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
3 Regular Expressions
We can describe identifiers by giving names to sets of letters and digits and using the language operators union, concatenation, and closure.
This process is so useful that a notation called regular expressions has come into common
use for describing all the languages that can be built from these operators applied to the
symbols of some alphabet.
In this notation, if letter_ is established to stand for any letter or the underscore, and digit is established to stand for any digit, then we could describe the language of C identifiers by:
letter_ ( letter_ | digit )*
The vertical bar above means union, the parentheses are used to group subexpressions, the star means "zero or more occurrences of," and the juxtaposition of letter_ with the remainder of the expression signifies concatenation.
The regular expressions are built recursively out of smaller regular expressions, using the rules
described below.
Each regular expression r denotes a language L(r), which is also defined recursively from the
languages denoted by r's sub expressions.
Here are the rules that define the regular expressions over some alphabet ∑ and the languages that those expressions denote.
BASIS: There are two rules that form the basis:
1. ε is a regular expression, and L(ε) is {ε}, the language whose sole member is the empty string.
2. If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, the language with one string, of length one.
INDUCTION: There are four parts to the induction whereby larger regular expressions are built from smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively. Then:
1. (r)|(s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r). This last rule says that we can add additional pairs
of parentheses around expressions without changing the language they denote.
As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop
certain pairs of parentheses if we adopt the conventions that:
a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.
Under these conventions, for example, we may replace the regular expression (a)|((b)*(c)) by
a|b*c. Both expressions denote the set of strings that are either a single a or are zero or more
b's followed by one c.
4 Regular Definitions
For notational convenience, we may give names to certain regular expressions and use those names in subsequent expressions; such a sequence of definitions is called a regular definition.
Example 2: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. They are described by the regular definition:
digit -> 0 | 1 | ... | 9
digits -> digit digit*
optionalFraction -> . digits | ε
optionalExponent -> ( E ( + | - | ε ) digits ) | ε
number -> digits optionalFraction optionalExponent
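To make the definition concrete, here is a small hand-written C matcher for this number pattern. It is a sketch only, with made-up function names, and not how a generated lexer works, but it follows the definition line by line.

#include <ctype.h>

/* Count leading decimal digits of s (the 'digits' definition). */
static int match_digits(const char *s) {
    int n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Return the length of the longest prefix of s matching
   number -> digits optionalFraction optionalExponent,
   or 0 if s does not start with a number. */
int match_number(const char *s) {
    int n = match_digits(s);
    if (n == 0) return 0;                 /* digits: at least one required */
    if (s[n] == '.') {                    /* optionalFraction -> . digits  */
        int f = match_digits(s + n + 1);
        if (f > 0) n += 1 + f;
    }
    if (s[n] == 'E') {                    /* optionalExponent              */
        int k = n + 1;
        if (s[k] == '+' || s[k] == '-') k++;   /* E ( + | - | ε ) digits   */
        int e = match_digits(s + k);
        if (e > 0) n = k + e;
    }
    return n;                             /* e.g., 7 for "6.336E4"         */
}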
Agenda UNIT I : PART II (Phase I: Lexical Analysis)
Recognition of Tokens
Transition Diagrams
Recognition of Reserved Words and Identifiers
Recognition of Tokens
Consider a grammar for branching statements, and study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
The following grammar fragment describes a simple form of branching statements and conditional expressions:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term
      | term
term -> id
      | number
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names
of tokens as far as the lexical analyzer is concerned.
The patterns for these tokens are described using regular definitions, as given below (with optionalFraction and optionalExponent as defined earlier):
digit -> 0 | 1 | ... | 9
digits -> digit digit*
number -> digits optionalFraction optionalExponent
letter -> A | B | ... | Z | a | b | ... | z
id -> letter ( letter | digit )*
if -> if
then -> then
else -> else
relop -> < | <= | = | <> | > | >=
In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:
ws -> ( blank | tab | newline )+
Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace.
Our goal for the lexical analyzer is summarized in the table below: tokens, their patterns, and attribute values. The table shows, for each lexeme or family of lexemes, which token name is returned to the parser and what attribute value is returned.
LEXEMES      TOKEN NAME   ATTRIBUTE VALUE
Any ws       --           --
if           if           --
then         then         --
else         else         --
Any id       id           Pointer to table entry
Any number   number       Pointer to table entry
<            relop        LT
<=           relop        LE
=            relop        EQ
<>           relop        NE
>            relop        GT
>=           relop        GE
Note that for the six relational operators, symbolic constants LT, LE, EQ, NE, GT, and GE are used as the attribute value, in order to indicate which instance of the token relop we have found. The particular operator found will influence the code that is output from the compiler.
1 Transition Diagrams
As an intermediate step in the construction of a lexical analyzer, we first convert
patterns into stylized flowcharts, called "transition diagrams."
2 Recognition of Reserved Words and Identifiers
Recognizing keywords presents a problem, because keywords like if and then also match the pattern for identifiers. Thus, although we typically use a transition diagram like that of the figure to search for identifier lexemes, this diagram will also recognize the keywords if, then, and else of our running example.
There are two ways that we can handle reserved words that look like identifiers (a sketch of the first follows this list):
1. Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent.
2. Create separate transition diagrams for each keyword; an example for the keyword then is shown in the figure.
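A minimal C sketch of the first approach follows. The names installReserved and getToken, the simplified entry type, and the token codes are assumptions for illustration: the reserved words are installed before scanning starts, and the accepting action of the identifier diagram asks the table which token to return.

#include <string.h>

enum { IF = 256, THEN, ELSE, ID };   /* illustrative token codes */

typedef struct {
    const char *lexeme;
    int         token;   /* IF/THEN/ELSE for reserved words, ID otherwise */
} Entry;

static Entry symtab[1024];
static int nsyms = 0;

static void installReserved(const char *word, int token) {
    symtab[nsyms].lexeme = word;
    symtab[nsyms++].token = token;
}

/* Call once before scanning begins. */
void initReserved(void) {
    installReserved("if", IF);
    installReserved("then", THEN);
    installReserved("else", ELSE);
}

/* Accepting action of the identifier transition diagram: a lexeme that
   was pre-installed as a reserved word yields its own token; anything
   else is an ordinary identifier. */
int getToken(const char *lexeme) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return symtab[i].token;
    return ID;   /* not reserved: install as id (omitted) and return id */
}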
Agenda UNIT I : PART II (Phase I: Lexical Analysis)
Lexical Analyzer Generator - Lex
Use of Lex
Structure of Lex Programs
1 Use of Lex
A tool called Lex allows us to specify a lexical analyzer by writing regular expressions to describe patterns for tokens.
The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler.
Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
The mechanics of how this translation from regular expressions to a working lexical analyzer occurs is as follows:
1. An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated.
2. The Lex compiler transforms lex.l to a C program, in a file that is always named lex.yy.c.
3. The latter file is compiled by the C compiler into a file called a.out, as always.
4. The C-compiler output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens.
The normal use of the compiled C program, referred to as a.out, is as a subroutine of the parser: it is a C function that returns an integer code for one of the possible token names. The attribute value is placed in a global variable yylval, which is shared between the lexical analyzer and the parser.
2 Structure of Lex Programs
A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
1. The declarations section includes declarations of variables, manifest constants, and regular definitions.
2. The translation rules each have the form Pattern { Action }, where each pattern is a regular expression and each action is a fragment of C code describing what to do when a lexeme matches that pattern.
3. The third section holds whatever additional functions are used in the actions. Alternatively, these functions can be compiled separately and loaded with the lexical analyzer.
The figure referred to in the slides is a Lex program that recognizes the tokens of the relop transition diagram and returns the token found; a sketch in that spirit appears below.
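Since the figure itself did not survive extraction, here is a sketch of what such a Lex program looks like. The manifest constants and the helpers installID()/installNum() are assumptions: ordinarily the constants come from the parser, and the helpers would install the lexeme into the appropriate table.

%{
/* Manifest constants; ordinarily provided by or for the parser. */
enum { IF = 256, THEN, ELSE, ID, NUMBER, RELOP,
       LT, LE, EQ, NE, GT, GE };
int yylval;   /* attribute value, shared with the parser */
int installID(void);
int installNum(void);
%}

/* regular definitions */
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      { /* no action and no return: whitespace is skipped */ }
if        { return(IF); }
then      { return(THEN); }
else      { return(ELSE); }
{id}      { yylval = installID(); return(ID); }
{number}  { yylval = installNum(); return(NUMBER); }
"<"       { yylval = LT; return(RELOP); }
"<="      { yylval = LE; return(RELOP); }
"="       { yylval = EQ; return(RELOP); }
"<>"      { yylval = NE; return(RELOP); }
">"       { yylval = GT; return(RELOP); }
">="      { yylval = GE; return(RELOP); }

%%

int installID(void) {
    /* install the lexeme, which yytext points to and whose length is
       yyleng, into the symbol table and return a code for the entry */
    return 0;   /* placeholder in this sketch */
}

int installNum(void) {
    /* similar, but for the separate table of numeric constants */
    return 0;   /* placeholder in this sketch */
}

With the traditional Lex library, this could be built as, e.g., lex lex.l followed by cc lex.yy.c -ll, where the library supplies a default main and yywrap; the result is the a.out subroutine described above.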
END OF UNIT I : PART II (Phase I: Lexical Analysis)