Lecture 3

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Lecture 3 Uzma Yaseen Bhat

Lexical Analysis
Lexical analysis also known as scanner, tokenization, or lexer.
As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the
source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in
the source program.

keyword operators

separators constants

special
identifiers token characters

Calculate the number of tokens


Q1) Int main()
{
*/find max of a & b */
Int a= 20, b=30;
If ( a< b)
Return (b)
else
return (a);
}
Q2) printf(“i=%d,&i=%x”, i, &i);
Finite automata is used for defining the tokens.
Lecture 3 Uzma Yaseen Bhat

The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact
with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it
needs to enter that lexeme into the symbol table. In some cases, information regarding the kind of identifier
may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it
must pass to the parser. These interactions are suggested in Fig. 3.1. Commonly, the interaction is
implemented by having the parser call the lexical analyzer. The call, suggested by the getNextToken
command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme
and produce for it the next token, which it returns to the parser.

Since the lexical analyzer is the part of the compiler that reads the source text, it may perform certain other
tasks besides identification of lexemes. One such task is stripping out comments and whitespace (blank,
newline, tab, and perhaps other characters that are used to separate tokens in the input). Another task is
correlating error messages generated by the compiler with the source program.
Sometimes, lexical analyzers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion
of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens
as output.

Lexical Analysis Versus Parsing


There are a number of reasons why the analysis portion of a compiler is normally separated into lexical
analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis
often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments
and whitespace as syntactic units would be considerably more complex than one that can assume comments
and whitespace have already been removed by the lexical analyzer. If we are designing a new language,
separating lexical and syntactic concerns can lead to a cleaner overall language design.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques
that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for
reading input characters can speed up the compiler significantly.
Lecture 3 Uzma Yaseen Bhat

3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical


analyzer.

Tokens, Patterns, and Lexemes


When discussing lexical analysis, we use three related but distinct terms:
A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract
symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters
denoting an identifier. The token names are the input symbols that the parser processes. In what follows,
we shall generally write the name of a token in boldface. We will often refer to a token by its token name.
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a
token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other
tokens, the pattern is a more complex structure that is matched by many strings.
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is
identified by the lexical analyzer as an instance of that token.
Example 3.1 : Figure 3.2 gives some typical tokens, their informally described patterns, and some sample
lexemes. To see how these concepts are used in practice, in the C statement printf ("Total = %d\nI1, score)
; both printf and score are lexemes matching the pattern for token id, and "Total = %d\nI1 is a lexeme
matching literal. In many programming languages, the following classes cover most or all of the tokens:

1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison mentioned in Fig.
3.2.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

Attributes for Tokens


Lecture 3 Uzma Yaseen Bhat

When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler
phases additional information about the particular lexeme that matched. For example, the pattern for token
number matches both 0 and 1, but it is extremely important for the code generator to know which lexeme
was found in the source program. Thus, in many cases the lexical analyzer returns to the parser not only a
token name, but an attribute value that describes the lexeme represented by the token; the token name in-
fluences parsing decisions, while the attribute value influences translation of tokens after the parse.

Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.
For instance, if the string f i is encountered for the first time in a C program in the context:
Fi( a== f(X) )
a lexical analyzer cannot tell whether f i is a misspelling of the keyword if or an undeclared function
identifier. Since f i is a valid lexeme for the token id, the lexical analyzer must return the token id to the
parser and let some other phase of the compiler - probably the parser in this case - handle an error due to
transposition of the letters.
The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the
remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is
left. This recovery technique may confuse the parser, but in an interactive computing environment it may
be quite adequate. Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.

You might also like