
AUTOMATA THEORY AND COMPILER DESIGN

NAME : SMITHA.A
USN : 1VJ21CS042
BRANCH : CSE

TOPICS

1. Lexical Analysis
1.1 The Role of the Lexical Analyzer
1.2 Lexical Analysis Versus Parsing
1.3 Tokens, Patterns, and Lexemes
1.4 Attributes for Tokens
1.5 Lexical Errors

3.1 The Role of the Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce as output a
sequence of tokens for each lexeme in the source program. The stream of tokens is sent
to the parser for syntax analysis. It is common for the lexical analyzer to interact with the
symbol table as well. When the lexical analyzer discovers a lexeme constituting an
identifier, it needs to enter that lexeme into the symbol table. In some cases, information
regarding the

kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the
proper token it must pass to the parser.
These interactions are suggested in Fig. 3.1. Commonly, the interaction is implemented by having the parser
call the lexical analyzer. The call, suggested by the getNextToken command, causes the lexical analyzer to
read characters from its input until it can identify the next lexeme and produce for it the next token, which it
returns to the parser.
[Figure: source program → lexical analyzer ⇄ (token / getNextToken) ⇄ parser → to semantic analysis, with both phases consulting the symbol table]

Figure 3.1: Interactions between the lexical analyzer and the parser
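The sketch below illustrates this pull-style interaction: the parser drives the loop, and each call groups input characters into the next lexeme. The Lexer and Token names, and the restriction to identifier lexemes, are simplifying assumptions made for illustration, not a prescribed design.

from dataclasses import dataclass

@dataclass
class Token:
    name: str            # abstract token name, e.g. "id"
    attribute: object    # optional attribute value

class Lexer:
    def __init__(self, source: str):
        self.source = source
        self.pos = 0

    def get_next_token(self) -> Token:
        # Skip whitespace separating lexemes.
        while self.pos < len(self.source) and self.source[self.pos].isspace():
            self.pos += 1
        if self.pos == len(self.source):
            return Token("eof", None)
        start = self.pos
        # Group input characters into the next lexeme (identifiers only, for brevity).
        while self.pos < len(self.source) and self.source[self.pos].isalnum():
            self.pos += 1
        if self.pos == start:            # a single non-alphanumeric character
            self.pos += 1
            return Token("punct", self.source[start])
        return Token("id", self.source[start:self.pos])

# The parser repeatedly asks for the next token, as suggested in Fig. 3.1.
lexer = Lexer("position = initial + rate")
token = lexer.get_next_token()
while token.name != "eof":
    print(token)
    token = lexer.get_next_token()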

Since the lexical analyzer is the part of the compiler that reads the source text, it may perform
certain other tasks besides identification of lexemes. One such task is stripping out comments and
whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the
input). Another task is correlating error messages generated by the compiler with the source
program. For instance, the lexical analyzer may keep track of the number of newline characters
seen, so it can associate a line number with each error message. In some compilers, the lexical
analyzer makes a copy of the source program with the error messages inserted at the appropriate
positions. If the source program uses a macro-preprocessor, the expansion of macros may also be
performed by the lexical analyzer.
Sometimes, lexical analyzers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization of the
input, such as deletion of comments and compaction of consecutive whitespace
characters into one.
b) Lexical analysis proper is the more complex portion, which produces the sequence
of tokens from the output of the scanner (a cascade sketched below).
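As a rough illustration of this two-process cascade, the following sketch keeps the two stages separate; the use of C-style // comments and whitespace-splitting tokenization are assumptions chosen to keep the example short.

import re

def scan(source: str) -> str:
    """Stage (a): delete comments and compact runs of whitespace into one blank."""
    without_comments = re.sub(r"//[^\n]*", "", source)
    return re.sub(r"\s+", " ", without_comments).strip()

def lexical_analysis(cleaned: str) -> list[str]:
    """Stage (b): produce the token sequence from the scanner's output."""
    return cleaned.split(" ")

tokens = lexical_analysis(scan("x = 1  // initialize\ny = 2"))
print(tokens)  # ['x', '=', '1', 'y', '=', '2']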

3.1.1 Lexical Analysis Versus Parsing


There are a number of reasons why the analysis portion of a compiler is normally
separated into lexical analysis and parsing (syntax analysis) phases.

1. Simplicity of design is the most important consideration. The separation of lexical
and syntactic analysis often allows us to simplify at least one of these tasks. For
example, a parser that had to deal with comments and whitespace as syntactic units
would be considerably more complex than one that can assume comments and
whitespace have already been removed by the lexical analyzer. If we are designing
a new language, separating lexical and syntactic concerns can lead to a cleaner
overall language design.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not the job of parsing. In addition,
specialized buffering techniques for reading input characters can speed up the compiler
significantly.
3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted
to the lexical analyzer.

3.1.2 Tokens, Patterns, and Lexemes


When discussing lexical analysis, we use three related but distinct terms:
• A token is a pair consisting of a token name and an optional attribute value. The token
name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or
a sequence of input characters denoting an identifier. The token names are the input
symbols that the parser processes. In what follows, we shall generally write the name of a
token in boldface. We will often refer to a token by its token name.
• A pattern is a description of the form that the lexemes of a token may take. In the case of a
keyword as a token, the pattern is just the sequence of characters that form the keyword.
For identifiers and some other tokens, the pattern is a more complex structure
that is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token.
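The following sketch makes the three terms concrete by pairing token names with regular-expression patterns and reporting the lexemes that match; the particular token set and patterns are assumptions chosen for illustration.

import re

# Token names paired with the patterns their lexemes must match.
TOKEN_PATTERNS = [
    ("if",     r"if"),             # keyword: the pattern is the keyword itself
    ("id",     r"[A-Za-z_]\w*"),   # identifier: a pattern matched by many strings
    ("number", r"\d+"),            # numeric literal
]

def tokenize(source: str):
    """Yield (token name, lexeme) pairs for each lexeme in the input."""
    for word in source.split():
        for name, pattern in TOKEN_PATTERNS:
            if re.fullmatch(pattern, word):
                yield (name, word)   # the lexeme is the matched character sequence
                break

print(list(tokenize("if count 42")))
# [('if', 'if'), ('id', 'count'), ('number', '42')]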

3.1.3 Attributes for Tokens


When more than one lexeme can match a pattern, the lexical analyzer must
provide the subsequent compiler phases additional information about the
particular lexeme that matched. For example, the pattern for token number
matches both 0 and 1, but it is extremely important for the code generator to
know which lexeme was found in the source program. Thus, in many cases the
lexical analyzer returns to the parser not only a token name, but an attribute
value that describes the lexeme represented by the token; the token name
influences parsing decisions, while the attribute value influences translation of
tokens after the parse.
We shall assume that tokens have at most one associated attribute, although
this attribute may have a structure that combines several pieces of information.
The most important example is the token id, where we need to associate with the
token a great deal of information. Normally, information about an identifier — e.g., its
lexeme, its type, and the location at which it is first found (in case an error message about
that identifier must be issued) — is kept in the symbol table. Thus, the appropriate
attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
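As a small illustration, the sketch below represents the attribute of an id token as a reference to its symbol-table entry, so every later phase shares one record per identifier; the SymbolTable interface shown is an assumption for the example, not a fixed design.

class SymbolTable:
    def __init__(self):
        self.entries = {}   # lexeme -> entry record

    def enter(self, lexeme: str, line: int) -> dict:
        """Create an entry on first sight; later calls return the same record."""
        if lexeme not in self.entries:
            self.entries[lexeme] = {"lexeme": lexeme, "type": None,
                                    "first_seen_line": line}
        return self.entries[lexeme]

symtab = SymbolTable()
entry = symtab.enter("count", line=3)
# The token carries the symbol-table entry as its attribute value.
token = ("id", entry)
print(token)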

3.1.4 Lexical Errors


It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-
code error. For instance, if the string fi is encountered for the first time in a C program in the
context:
fi ( a == f(x) ) . . .
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared
function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return
the token id to the parser and let some other phase of the compiler — probably the parser in this
case — handle an error due to transposition of the letters.
However, suppose a situation arises in which the lexical analyzer is unable to proceed
because none of the patterns for tokens matches any prefix of the remaining input. The simplest
recovery strategy is "panic mode" recovery. We delete successive characters from the remaining
input, until the lexical analyzer can find a well-formed token at the beginning of
what input is left. This recovery
technique may confuse the parser, but in an interactive computing environment it may
be quite adequate.
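A minimal sketch of panic-mode recovery, assuming we can test whether any token pattern matches a prefix of the remaining input (the pattern list here is illustrative):

import re

# Illustrative token patterns: identifiers, numbers, and a few operators.
TOKEN_PATTERNS = [r"[A-Za-z_]\w*", r"\d+", r"[()=+\-*/]"]

def matches_some_token(text: str) -> bool:
    return any(re.match(p, text) for p in TOKEN_PATTERNS)

def panic_mode(remaining: str) -> str:
    """Delete leading characters until a well-formed token begins the input."""
    while remaining and not matches_some_token(remaining):
        remaining = remaining[1:]
    return remaining

print(panic_mode("@#$count = 2"))  # 'count = 2'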
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Transformations like these may be tried in an attempt to repair the input. The simplest
such strategy is to see whether a prefix of the remaining input can be transformed into a
valid lexeme by a single transformation. This strategy makes sense, since in practice
most lexical errors involve a single character. A more general correction strategy is to
find the smallest number of transformations needed to convert the source program into
one that consists only of valid lexemes, but this approach is generally considered too
expensive in practice to be worth the effort.
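The sketch below tries the four single-character transformations listed above on a candidate prefix and accepts any edit that yields a valid lexeme; treating a small keyword set as the set of valid lexemes is an assumption made to keep the example short.

import string

KEYWORDS = {"if", "else", "while", "return"}   # illustrative "valid lexemes"

def single_edits(s: str):
    letters = string.ascii_lowercase
    for i in range(len(s)):
        yield s[:i] + s[i+1:]                   # 1. delete one character
    for i in range(len(s) + 1):
        for c in letters:
            yield s[:i] + c + s[i:]             # 2. insert a missing character
    for i in range(len(s)):
        for c in letters:
            yield s[:i] + c + s[i+1:]           # 3. replace a character
    for i in range(len(s) - 1):
        yield s[:i] + s[i+1] + s[i] + s[i+2:]   # 4. transpose adjacent characters

def repair(prefix: str):
    """Return a repaired lexeme if a single transformation produces one."""
    for candidate in single_edits(prefix):
        if candidate in KEYWORDS:
            return candidate
    return None

print(repair("fi"))  # 'if' — the transposition example from the text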
