DEPARTMENT OF CSE

Compiler Design UNIT - I

UNIT I:
Lexical Analysis: Language Processors, Structure of a Compiler, Lexical Analysis,
The Role of the Lexical Analyzer, Bootstrapping, Input Buffering, Specification of
Tokens, Recognition of Tokens, Lexical Analyzer Generator-LEX, Finite Automata,
Regular Expressions and Finite Automata, Design of a Lexical Analyzer Generator.
______________________________________________________________________
1.Language Processors: An integrated software development
environment includes many different kinds of language processors such as
compilers, interpreters, assemblers, linkers, and loaders.
Language-processing system: - The preprocessor may also expand
shorthands, called macros, into source language statements. The modified
source program is then fed to a compiler. The compiler may produce an
assembly-language program as its output, because assembly language is easier
to produce as output and is easier to debug. The assembly language is then
processed by a program called an assembler that produces relocatable machine
code as its output.
Large programs are often compiled in pieces, so the relocatable machine
code may have to be linked together with other relocatable object files and
library files into the code that actually runs on the machine. The linker resolves
external memory addresses, where the code in one file may refer to a location in
another file. The loader then puts together all of the executable object files into
memory for execution.


Preprocessor:- A preprocessor is a program that processes its input data (i.e. the
source program) to produce output (i.e. a modified source program) that is used as
input to the compiler. The preprocessor expands shorthands, called macros, into
source language statements.
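
For instance, macro expansion happens purely textually, before the compiler proper
ever sees the program. A minimal C illustration (the macro name is arbitrary):

#include <stdio.h>

/* The preprocessor replaces every use of SQUARE(x) with ((x) * (x))
   before compilation proper begins. */
#define SQUARE(x) ((x) * (x))

int main(void)
{
    /* After preprocessing, the next line reads:
       printf("%d\n", ((5) * (5)));             */
    printf("%d\n", SQUARE(5));
    return 0;
}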
Compiler
The language processor that reads the complete source program written in
high-level language as a whole in one go and translates it into an equivalent
program in machine language is called a Compiler. An important role of
compiler is to report errors in the source program that it detects during
translation process.
Examples of compiled languages: C, C++, C#, Java.

Assembler:- The compiler may produce an assembly-language program as its
output, because assembly language is easier to produce as output and is easier
to debug. The assembly-language program is then processed by a program called
an assembler, which produces relocatable machine code as its output.

Linker:- A linker or link editor is a program that takes one or more object
files generated by a compiler and combines them into a single executable
program. The linker also searches for the library routines used by the
program, e.g. printf() and math routines.


Loader:- The loader loads the program from the hard disk into main memory,
loads the starting address of the program into the program counter (PC), and
makes the program ready for execution.
Cousins of compiler: Preprocessor, assembler, linker, loader

A Cross Compiler is a compiler that runs on one platform and produces
executable code for a different platform.
Interpreter
An interpreter is another common kind of language processor. Instead of
producing a target program as a translation, an interpreter directly executes the
operations specified in the source program on inputs supplied by the user. It
executes the program statement by statement.

The machine-language target program produced by a compiler executes much


faster than an interpreter. An interpreter, however, can give better error
messages than a compiler, because it executes the source program statement by
statement.

Hybrid Compiler

Java language processors combine compilation and interpretation.

◦ A Java source program is first compiled into bytecodes, which are
then interpreted by a virtual machine.



◦ Bytecodes compiled on one machine can be interpreted on another machine
(“Write once, run anywhere”)
◦ For faster processing, just-in-time compilers translate bytecodes into machine
language immediately before they run the intermediate program to process
the input

2.The structure of a compiler-


The compilation process can be subdivided into two main parts.
They are

1.Analysis part

2.Synthesis part

1.Analysis part:- The analysis part is often called the front end of the
compiler.
In the analysis part, an intermediate code is created from the given source
program. The analysis part consists of lexical analysis, syntax analysis,
semantic analysis, and intermediate code generation.
It breaks up the source program and checks whether the source program is
syntactically and semantically correct.
It provides informative error messages, so that the user can take corrective
action. The analysis part also collects information about the source program
and stores it in a data structure called a symbol table, which is passed along
with the intermediate representation to the synthesis part.

2.Synthesis part:- The synthesis part constructs the desired target program
from the intermediate representation and the information in the symbol table.
Code Optimizer and Code Generator are the parts of this phase. The synthesis
part is the back end of the compiler.

Fig: Analysis part and synthesis part of a compiler


Phases of A Compiler:- The compilation process operates as a sequence of
phases. Each phase transforms the source program from one representation into
another representation. A compiler is decomposed into phases as shown in Fig. The
symbol table, which stores information about the entire source program, is used by all


phases of the compiler. During the compilation process, each phase can encounter errors.
The error handler reports the presence of errors clearly and accurately, and
helps recover from each error quickly enough so that subsequent errors can be
detected.
Lexical Analyzer:- The first phase of a compiler is called lexical analysis or linear analysis or
scanning. The lexical analyzer reads the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces output in the
form: <token-name, attribute-value>
This output is passed to the subsequent phase i.e syntax analysis.

Fig: Phases of compiler


Syntax Analyzer:- The second phase of the compiler is syntax analysis or


hierarchical analysis or parsing. The parser uses the tokens produced by the
lexical analyzer to create a syntax tree. In syntax tree, each interior node
represents an operation and the children of the node represent the arguments of
the operation.
Semantic Analyzer:- The semantic analyzer uses the syntax tree and the
information in the symbol table to check the source program for semantic
errors.
It also gathers type information and saves it in either the syntax tree or the
symbol table, for subsequent use during intermediate-code generation. An
important part of semantic analysis is type checking, where the compiler checks
that each operator has matching operands.
Intermediate Code Generator: - In the process of translating a source program
into target code, a compiler may construct one or more intermediate
representations, which can have a variety of forms. Syntax trees are a form of
intermediate representation; they are commonly used during syntax and
semantic analysis. After syntax and semantic analysis of the source program,
many compilers generate an intermediate representation. This intermediate
representation should have two important properties: it should be easy to
produce and it should be easy to translate into the target code. One form of
intermediate code is three-address code, which consists of a sequence of
assembly-like instructions with three operands per instruction.
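
For instance, the assignment position = initial + rate * 60 (compiled phase by
phase in the worked example later in this unit) might be translated into
three-address code of the form:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

where id1, id2, and id3 denote the symbol-table entries for position, initial,
and rate, and each instruction has at most one operator on the right side.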
Code Optimizer:- The machine-independent code-optimization phase attempts
to improve the intermediate code so that better target code will result, i.e.
target code that executes faster and consumes less power.
Symbol Table:- The symbol table is a data structure containing a record for each
variable name, with fields for the attributes of the name. The data structure
should be designed to allow the compiler to find the record for each name
quickly and to store or retrieve data from that record quickly.
Error Handler:- Error handler should report the presence of an error. It
must report the place in the source program where an error is detected.
Common programming errors can occur at many different levels.
• Lexical errors include misspellings of identifiers, keywords, or
operators.
• Syntax errors include misplaced semicolons or extra or
missing braces.
• Semantic errors include type mismatches between operators
and operands.
• Logical errors can be anything from incorrect reasoning on the
part of the programmer to the use in a C program of the
assignment operator = instead of the comparison operator ==.
The main goal of error handler is
1. Report the presence of errors clearly and accurately.
2. Recover from each error quickly
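
For illustration, here are some deliberately erroneous C fragments, one per
class above (these will not compile or behave as intended):

fi (a == f(x)) { }   /* "fi" is a valid identifier lexeme, so the error
                        surfaces only at the syntax level               */
int x = 10           /* syntax error: missing semicolon                 */
int y = "hello";     /* semantic error: type mismatch                   */
if (a = b) { }       /* logical error: = used instead of ==             */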



Example:- Compile the statement position = initial + rate * 60 (translation)

Fig:- Output of each phase of compiler


Grouping of Phases into Passes


Phase is a logical organization of compiler.
Pass : The activities from several phases are grouped in a
pass
◦ Front-end pass (lexical analysis, syntax analysis,
semantic analysis, intermediate code generation)
◦ Code optimization (optional pass)
◦ Back-end pass (code generation)

Differences between compiler and Interpreter


1. Compiler: A compiler is a program that takes the whole source program and
   translates it into a target program, which is stored on the hard disk. An
   important role of the compiler is to report any errors in the source
   program that it detects during the translation process.
   Interpreter: An interpreter is another translator which translates each
   statement of the source program and directly executes it on inputs
   supplied by the user.

2. Compiler: The first compilation may take more time, but because the target
   program is saved on the hard disk, from then on it executes much faster
   than an interpreted program.
   Interpreter: The first interpretation may complete in less time, but
   because the target program is not saved on the hard disk, the interpreter
   has to re-translate the program every time it runs, so execution is much
   slower than with a compiler.

3. Compiler: As compilation is a complex task, the compiler occupies more
   main memory than an interpreter.
   Interpreter: As interpretation is a simpler task, the interpreter occupies
   less main memory than a compiler.

4. Compiler: A compiler lists out all error messages at once, so it is harder
   to debug the errors.
   Interpreter: An interpreter can give better error messages than a
   compiler, because it executes the source program statement by statement.

5. Compiler: Examples of compiled languages: C, C++, Java. (Borland C and
   Turbo C compile programs written in C or C++.)
   Interpreter: Examples of interpreted languages: BASIC, LISP. (Java uses
   both compilation and interpretation.)

3.Compiler construction tools


The compiler writer can use some specialized tools that help in implementing
various phases of a compiler. These tools assist in the creation of an entire
compiler or its parts. Some commonly used compiler construction tools include:
1. Parser Generator –
It produces syntax analyzers (parsers) from input based on a grammatical
description of a programming language or on a context-free grammar. It is
useful because the syntax analysis phase is highly complex, and implementing
it by hand would consume considerable time and effort.


2. Scanner Generator –
It generates lexical analyzers from the input that consists of regular
expression description based on tokens of a language. It generates a
finite automaton to recognize the regular expression.
Example: Lex

3. Syntax directed translation engines –
They generate intermediate code in three-address format from an input that
consists of a parse tree. These engines have routines to traverse the parse
tree and produce the intermediate code. Each node of the parse tree is
associated with one or more translations.
4. Automatic code generators –
They take a collection of rules that define the translation of each operation
of the intermediate language into the machine language for a target machine.
A template-matching process is used: an intermediate-language statement is
replaced by its equivalent machine-language statement using templates.
5. Data-flow analysis engines –
They are used in code optimization. Data-flow analysis is a key part of code
optimization; it gathers information about how values flow from one part of a
program to another.
6. Compiler construction toolkits –
It provides an integrated set of routines that aids in building compiler
components or in the construction of various phases of compiler.



4.Lexical Analysis
Lexical Analysis:- It is also called linear analysis. The first phase of a
compiler is called lexical analysis or scanning. The lexical analyzer reads
the sequence of characters from the source program, groups the characters into
meaningful sequences called lexemes, and produces a token for each lexeme.
For each lexeme, the lexical analyzer produces output in the form
<token-name, attribute-value>
This output is passed to the subsequent phase i.e. syntax analysis.

4.1 Role of Lexical Analyzer: - Lexical analyzer is the first phase


of a compiler. The main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce
tokens for each lexeme in the source program. The stream of tokens is sent to
the parser for syntax analysis. When the lexical analyzer discovers a lexeme
constituting an identifier, it interacts with the symbol table to enter that
lexeme into the symbol table. Commonly, the interaction is implemented by
having the parser call the lexical analyzer. The getNextToken command,
given by the parser, causes the lexical analyzer to read characters from its
input until it can identify the next lexeme and produce the next token, which
it returns to the parser.

Fig: Interaction between lexical Analyzer and Parser

Since the lexical analyzer is the part of the compiler that reads the source
text, it may perform certain other tasks besides identification of lexemes. One
such task is stripping out comments and whitespace (blank, newline, tab, and

perhaps other characters that are used to separate tokens in the input).
Another task is correlating error messages generated by the compiler with the
source program.
For instance, the lexical analyzer may keep track of the number of newline
characters seen, so it can associate a line number with each error message. In
some compilers, the lexical analyzer makes a copy of the source program
with the error messages inserted at the appropriate positions. If the source
program uses a macro-preprocessor, the expansion of macros may also be
performed by the lexical analyzer.

Sometimes, lexical analyzers are divided into a cascade of two processes:


a. Scanning consists of the simple processes, such as deletion of
comments and compaction of consecutive whitespace characters into one.
b. Lexical analysis proper is the more complex portion, which
produces the sequence of tokens as output.

4.2 Lexical Analysis Vs. Parsing: - There are a number of reasons why the
analysis portion of a compiler is normally separated into lexical analysis and
parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration.
The separation of lexical and syntactic analysis often
allows us to simplify at least one of these tasks.
2. Compiler efficiency is improved. A separate lexical
analyzer allows us to apply specialized techniques that
serve only the lexical task, not the job of parsing. In
addition, specialized buffering techniques for reading
input characters can speed up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific
peculiarities can be restricted to the lexical analyzer.
4.3 Token, patterns and Lexemes: - When discussing lexical analysis, we
use three related but distinct terms:

A token is a pair consisting of a token name and an optional attribute value.


The token name is an abstract symbol representing the category of the lexical
unit, e.g., a particular keyword, or a sequence of input characters denoting
an identifier.


A pattern is a description that specifies the rules that lexemes must follow
in order to belong to that token.

A lexeme is a sequence of characters in the source program that matches the


pattern for a token

In many programming languages, the following classes cover most or all of
the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the
keyword itself.
2. Tokens for the operators, either individually or in classes
such as the token comparison mentioned.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as
numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and
right parentheses, comma, and semicolon.
Consider the following example of an assignment
statement:

result = oldsum - value / 100;


Following are the tokens and lexemes of above statement:


Token Lexeme

IDENTIFIER result

ASSIGN_OP =

IDENTIFIER oldsum

SUB_OP -

IDENTIFIER value

DIV_OP /

NUMBER 100

SEMICOLON ;

4.4 Lexical Errors: -


Lexical errors include misspellings of identifiers, keywords, or
operators.
• Some errors cannot be detected by the lexical analyzer alone.
◦ fi (a == f(x) ) ...
Here fi is a valid lexeme for an identifier, so the lexical analyzer
cannot tell whether fi is a misspelled keyword if or an undeclared
function identifier; such errors are left to later phases.
Error recovery
• Panic mode: successive characters are ignored until we reach a
well-formed token.
• Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Transformations like these may be tried in an attempt to repair the input.



5.Bootstrapping

o Bootstrapping is widely used in the compilation development.


o Bootstrapping is used to produce a self-hosting compiler. Self-hosting
compiler is a type of compiler that can compile its own source code.
o Bootstrap compiler is used to compile the compiler and then you can use this
compiled compiler to compile everything else as well as future versions of itself.
A compiler can be characterized by three languages:
1. Source Language (S)
2. Target Language (T)
3. Implementation Language (I)
The T-diagram shows a compiler SCIT: a compiler for source S and target T,
implemented in language I.

Follow these steps to produce a compiler for a new language L on machine A:

1. Create a compiler SCAA for a subset S of the desired language L, written
in language A, such that the compiler runs on machine A.

2. Create a compiler LCSA for the full language L, written in the subset S of
L, producing code for machine A.

3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for
language L, which runs on machine A and produces code for machine A.

The process described by the T-diagrams is called bootstrapping.


6.Input Buffering
The lexical analyzer scans the input from left to right one character at a time.
Lexical analyzer needs to look ahead one or more characters before a lexeme
is found
In this scheme, only one buffer is used to store the input string. But the
problem with this scheme is that if lexeme is very long then it crosses
the buffer boundary, to scan rest of the lexeme the buffer has to be
refilled, that makes overwriting the first part of lexeme.
• Speed of a lexical analyzer is important.
• When lookahead is necessary, two buffer scheme is introduced to
handle large lookaheads safely
• Techniques for speeding up the process of lexical analyzer such as
the use of sentinels (eof character) to mark the buffer end have been
adopted.
There are two methods used in this context
Buffer pairs
Sentinels
Here, we use a buffer divided into two N-character halves (N is generally
equal to the maximum number of characters that can be read at one time), as
shown below.

Scheme
• Consists of two buffers, each consists of N-character size which are reloaded
alternatively.
N = size of a disk block (e.g., 4096 bytes)
N characters are read from the input file to the buffer using one system read
command.

A block of data is first read into a buffer, and then scanned by lexical
analyzer

• eof is inserted at the end if the number of characters is less than N.


Pointers

- Two pointers, lexemeBegin and forward, are maintained.

- lexemeBegin points to the beginning of the current lexeme, which is yet to
be found.
- forward scans ahead until a match for a pattern is found.
- Once a lexeme is found, forward is at the character at its right end; after
the lexeme is processed, lexemeBegin is set to the character immediately
after the lexeme just found.

- The current lexeme is the set of characters between the two pointers.

Sentinel

(eof character is used as sentinel)

◦ An eof character is added at each buffer end. An eof appearing within a
buffer (rather than at its end) marks the end of the input.
◦ The sentinel is a special character that cannot be part of the source program.

The sentinel is used to identify the end of a buffer without testing the
forward pointer against the buffer boundary on every character.

When forward encounters the eof at the end of the first buffer, the second
buffer is refilled and scanning continues there; in the same way, the eof at
the end of the second buffer signals that the first buffer should be refilled.


Alternatively, both buffers can be filled repeatedly until the end of the
input program is reached and the whole stream of tokens has been identified.
The eof character introduced at the end of each half is called the sentinel.

switch (*forward++) {
case eof:
    if (forward is at the end of the first buffer) {
        reload second buffer;
        forward = beginning of the second buffer;
    }
    else if (forward is at the end of the second buffer) {
        reload first buffer;
        forward = beginning of the first buffer;
    }
    else
        /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}

Fig: Lookahead code with sentinels
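
A self-contained C sketch of this two-buffer scheme is shown below. It is a
minimal illustration rather than production code: the input file name
input.txt is a placeholder, and the NUL character '\0' stands in for the eof
sentinel, which assumes the source text contains no NUL bytes.

#include <stdio.h>

#define N 4096                    /* half size, typically one disk block */

static char buf[2 * N + 2];       /* two N-char halves + a sentinel slot each */
static char *forward;             /* scanning pointer */
static FILE *src;

/* Load half 0 or 1 from the input and terminate it with the sentinel. */
static void reload(int half)
{
    size_t n = fread(buf + half * (N + 1), 1, N, src);
    buf[half * (N + 1) + n] = '\0';          /* sentinel */
}

/* Return the next input character, or EOF at the real end of input. */
static int next_char(void)
{
    for (;;) {
        char c = *forward++;
        if (c != '\0')
            return (unsigned char)c;
        if (forward == buf + N + 1) {            /* sentinel ended half 0 */
            reload(1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 2) { /* sentinel ended half 1 */
            reload(0);
            forward = buf;
        } else {
            return EOF;    /* sentinel inside a half: real end of input */
        }
    }
}

int main(void)
{
    long count = 0;
    int c;

    src = fopen("input.txt", "r");   /* placeholder file name */
    if (src == NULL)
        return 1;
    reload(0);
    forward = buf;
    while ((c = next_char()) != EOF)
        count++;          /* a real scanner would form lexemes here */
    printf("%ld characters read\n", count);
    fclose(src);
    return 0;
}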


7.SPECIFICATION OF TOKENS USING REGULAR EXPRESSION

Regular Expression
Def: The languages accepted by finite automata can be easily described by
simple expressions called Regular Expressions. Regular expressions are the
most effective way to represent these languages.
The languages denoted by regular expressions are referred to as Regular
languages.
A regular expression can also be described as a sequence of patterns that
defines a string.
Regular expressions are used to match character combinations in strings.

Operations performed on regular expressions:

1. Union: The union of two regular languages, L1 and L2, which are
represented using L1 ∪ L2, is also regular and which represents the set of
strings that are either in L1 or L2 or both.
Example:
L1 = (1+0).(1+0) = {00 , 10, 11, 01} and
L2 = {∈ , 100}
then L1 ∪ L2 = {∈, 00, 10, 11, 01, 100}.
2. Concatenation:
The concatenation of two regular languages, L1 and L2, which are represented
using L1.L2 is also regular and which represents the set of strings that are
formed by taking any string in L1 concatenating it with any string in L2.

Example:
L1 = { 0,1 } and L2 = { 00, 11} then L1.L2 = {000, 011, 100, 111}.
3. Kleene closure:
If L1 is a regular language, then its Kleene closure L1* is also regular; it
represents the set of strings formed by concatenating zero or more strings
from L1 (the same string may be repeated any number of times).

Example:
L1 = { 0,1}
L1*= {∈, 0, 1, 00, 01, 10, 11 …….} , then L1* is all strings possible with symbols 0 and
1 including a null string.
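
These operators can also be exercised directly with the POSIX regex library
in C; a minimal sketch, using the expression (a|b)*abb that serves as the
running example later in this unit (the test strings are arbitrary):

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    /* POSIX extended RE for (a|b)*abb, anchored so the whole
       string must match */
    if (regcomp(&re, "^(a|b)*abb$", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;

    const char *tests[] = { "abb", "aabb", "babb", "ab" };
    for (int i = 0; i < 4; i++) {
        int ok = (regexec(&re, tests[i], 0, NULL, 0) == 0);
        printf("%-5s %s\n", tests[i], ok ? "match" : "no match");
    }
    regfree(&re);
    return 0;
}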



Properties of Regular Expressions

Regular expressions are an important notation for specifying lexeme
patterns (rules). To describe the set of valid C identifiers we use this
notation: if letter_ is established to stand for any letter or the
underscore, and digit is established to stand for any digit, then the
identifiers of the C language are defined by

letter_ (letter_ | digit)*

The vertical bar above means union, the parentheses are used
to group subexpressions, the star means "zero or more
occurrences of".

The letter_ at the beginning indicates that an identifier must begin with
a letter or an underscore ( _ ).

Regular Definitions:-

Regular Definitions are names given to certain regular expressions and those
names can be used in subsequent expressions as symbols of the language.
If Σ is an alphabet of basic symbols, then a regular definition is a sequence
of definitions of the form:

d1 → r1
d2 → r2
...
dn → rn

where

1. Each di is a new symbol, not in Σ and not the same as any other of
the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.


Example 1: C identifiers are strings of letters, digits, and
underscores. Write a regular definition for the language of C identifiers.

letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
id → letter_ ( letter_ | digit )*

Using shorthand notations, the regular definition can be rewritten as:

letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ ( letter_ | digit )*

Example 2: Unsigned numbers (integer or floating point) are
strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. Write a
regular definition for unsigned numbers in C language.

digit → 0 | 1 | ... | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent

Using shorthand notations, the regular definition can be rewritten as:

digit → [0-9]
digits → digit+
number → digits (. digits)? ( E [+-]? digits )?

8.Recognition of Tokens


Regular expressions (patterns) are used to specify tokens.

Transition diagram is used to recognize tokens

Transition Diagrams: -

Compilers convert regular-expression patterns into transition
diagrams. A transition diagram has a collection of nodes or
circles, called states.

Each state represents a condition that could occur during the
process of scanning the input looking for a lexeme that matches
one of several patterns. Edges are directed from one state of the
transition diagram to another. Each edge is labeled by a symbol
or set of symbols. If we are in some state s, and the next input
symbol is a, we look for an edge out of state s labeled by a (and
perhaps by other symbols, as well). If we find such an edge, we
enter the state of the transition diagram to which that edge leads.

Example 1: Draw a transition diagram for the relational operators (relop).

relop → < | <= | = | <> | > | >=

Figure: Transition diagram for relop
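
A transition diagram like this translates mechanically into code. The
following C sketch simulates the relop diagram, where nextChar() and
retract() are assumed helper routines that read the next input character and
push one character back, respectively:

/* Sketch of simulating the relop transition diagram in C. */
enum relop_token { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

extern int nextChar(void);   /* assumed: read next input character */
extern void retract(void);   /* assumed: push one character back   */

enum relop_token get_relop(void)
{
    int c = nextChar();
    if (c == '<') {               /* states for <, <=, <> */
        c = nextChar();
        if (c == '=') return LE;
        if (c == '>') return NE;
        retract();
        return LT;
    }
    if (c == '=')                 /* state for = */
        return EQ;
    if (c == '>') {               /* states for >, >= */
        c = nextChar();
        if (c == '=') return GE;
        retract();
        return GT;
    }
    retract();                    /* not a relational operator */
    return NOT_RELOP;
}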


Recognition of Reserved Words and Identifiers

Recognizing keywords and identifiers presents a problem. Usually,


keywords
like if or then are reserved , so they are not identifiers even though they look
like identifiers.

Example 2 : Draw a transition diagram for the


identifier in C language.

Figure: Transition diagram for identifier


There are two ways that we can handle reserved words that look like identifiers

1. Install the reserved words in the symbol table initially. A field of the symbol-
table entry indicates that these strings are never ordinary identifiers, and tells
which token they represent. When we find an identifier, a call to installID places
it in the symbol table if it is not already there and returns a pointer to the symbol-
table entry for the lexeme found. Of course, any identifier not in the symbol table
during lexical analysis cannot be a reserved word, so its token is id. The function
getToken examines the symbol table entry for the lexeme found, and returns
whatever token name the symbol table says this lexeme represents: either id or
one of the keyword tokens that was initially installed in the table.
2.Create separate transition diagrams for each keyword; an example for the keyword then
is shown below. Note that such a transition diagram consists of states representing the
situation after each successive letter of the keyword is seen, followed by a test for a
“nonletter-or-digit," i.e., any character that cannot be the continuation of an identifier. It
is necessary to check that the identifier has ended, or else we would return token then in
situations where the correct token was id.


Example 3: Draw a transition diagram for the keyword then in C language.

Figure: Transition diagram for the keyword then

Other Transition Diagrams

Example 4: Draw a transition diagram for white space in C language.

delim → ( blank | tab | newline )
ws → delim+

Figure: Transition diagram for whitespace

Example 5: Draw a transition diagram for unsigned numbers in C language.

Figure: Transition diagram for unsigned numbers



9.Lexical analyzer generator(Lex):-
Lex is a tool (or, in a more recent implementation, Flex) which allows one to specify a
lexical analyzer by writing regular expressions to describe patterns for tokens.
The input notation for the Lex tool is referred to as the Lex language and the tool
itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input
patterns into a transition diagram and generates code, in a file called lex.yy.c that
simulates this transition diagram.

Use of Lex:- An input file, which we call lex.l, is written in the Lex language
and describes the lexical analyzer to be generated. The Lex compiler
transforms lex.l to a C program, in a file that is always named lex.yy.c . The
lex.yy.c is compiled by the C compiler into a file called a.out. The C compiler
output is a working lexical analyzer that can take a stream of input characters
and produce a stream of tokens.

Fig: Creating a lexical analyzer with Lex


A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions

1. The declarations section includes declarations of
variables, manifest constants, and regular definitions.

2. The translation rules each have the form

Pattern { Action }


Each pattern is a regular expression, which may use the regular definitions of
the declaration section. The actions are fragments of code, typically written
in C, although other languages can also be used

3. The third section holds whatever additional functions are used in the
actions. Alternatively, these functions can be compiled separately and
loaded with the lexical analyzer.

Lex program for tokens:
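
A sketch of such a Lex specification, in the spirit of the classic textbook
example, is shown below. The token-name constants (IF, THEN, ELSE, ID,
NUMBER, RELOP, LT, ..., GE) and the helpers installID() and installNum() are
assumed to be supplied by the rest of the compiler.

%{
/* manifest constants LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE,
   ID, NUMBER, RELOP are assumed to be defined elsewhere */
%}

delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      { /* no action and no return: whitespace is discarded */ }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = (int) installID(); return ID; }
{number}  { yylval = (int) installNum(); return NUMBER; }
"<"       { yylval = LT; return RELOP; }
"<="      { yylval = LE; return RELOP; }
"="       { yylval = EQ; return RELOP; }
"<>"      { yylval = NE; return RELOP; }
">"       { yylval = GT; return RELOP; }
">="      { yylval = GE; return RELOP; }

%%

int installID()  { /* place the lexeme (yytext, yyleng) in the symbol
                      table and return a pointer/index to the entry */ }
int installNum() { /* similar, but for the table of numeric constants */ }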


The lexical analyzer created by Lex works along with the parser as follows.
When called by the parser, the lexical analyzer begins reading its remaining
input, one character at a time, until it finds the longest prefix of the input
that matches one of the patterns Pi. It then executes the associated action
Ai. Typically, Ai will return to the parser, but if it does not (e.g., because
Pi describes whitespace or comments), then the lexical analyzer proceeds to
find additional lexemes, until one of the corresponding actions causes a
return to the parser.
The lexical analyzer returns a single value, the token name, to the parser,
but uses the shared, integer variable yylval to pass additional information
about the lexeme found, if needed.
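
As a sketch of this protocol, a hypothetical driver that consumes tokens the
way a parser would (yylex() and yylval come from the generated lex.yy.c;
returning 0 at end of input is the Lex convention):

/* Hypothetical driver: pulls tokens from the Lex-generated scanner. */
#include <stdio.h>

extern int yylex(void);   /* next token name, 0 at end of input */
extern int yylval;        /* attribute of the most recent token */

int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)
        printf("token %d, attribute %d\n", tok, yylval);
    return 0;
}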

10.FINITE AUTOMATA:
A recognizer for a language is a program that takes as input a string x and
answers "yes" if x is a sentence of the language and "no" otherwise. The
regular expression is compiled into a recognizer by constructing a generalized
transition diagram called a finite automaton. A finite automaton can be
deterministic or nondeterministic, where "nondeterministic" means that more
than one transition out of a state may be possible on the same input symbol.


Deterministic Finite Automata (DFA)

A Finite Automaton (FA) or Finite State Machine (FSM) is a 5-tuple
(Q, Σ, q0, A, δ)
where,
• Q is the set of finite states
• Σ is the set of input symbols (finite alphabet)
• q0 is the initial state
• A is the set of all accepting (final) states
• δ is the transition function, δ: Q × Σ → Q
For any state q in Q and any symbol a in Σ, we interpret δ(q, a) as the state
to which the finite automaton moves if it is in state q and receives the
input a.

Non-Deterministic Finite Automata (NFA)

Definition: An NFA is defined as a 5-tuple, M = (Q, Σ, q0, A, δ) where,
• Q is the set of finite states
• Σ is the set of input symbols (finite alphabet)
• q0 is the initial state
• A is the set of all accepting (final) states
• δ is the transition function, δ: Q × Σ → 2^Q
(on a given state and input symbol, the NFA may move to any one of a set
of states).


11.Regular expression to NFA (Thompson's construction method):

For each kind of RE, define an NFA.


Input: A Regular Expression R over an alphabet ∑ Output:
An NFA N accepting L(R) Method:

1. R = ε

2. R = a

3. R= a | b

4. R=ab

5. R= a*


Example: Construct an NFA for (a|b)*abb

REGULAR EXPRESSION TO FINITE AUTOMATA – NFA TO MINIMIZED DFA


Converting NFA to DFA: The Subset Construction Algorithm. The algorithm for
constructing a DFA from a given NFA such that it recognizes the same language
is called subset construction. The reason for the name is that each state of
the DFA corresponds to a set of states of the NFA. After processing a
sequence of input symbols, the DFA is in the state that corresponds to the
set of NFA states reachable, on the same inputs, from the NFA's start state.
There are three operations that can be applied to NFA states:

1. ε-closure(s): the set of NFA states reachable from state s on
ε-transitions alone.
2. ε-closure(T): the set of NFA states reachable from some state s in set T
on ε-transitions alone.
3. move(T, a): the set of NFA states to which there is a transition on input
symbol a from some state s in T.


The starting state of the automaton is assumed to be s0. The ε-closure(s)
operation computes exactly the states reachable from a particular state s on
ε-transitions alone, and move(T, a) computes the states reachable from the
set T on seeing the input symbol a. With these operations, the set of states
to which our automaton can make a transition from set T on input a can be
simply specified as: ε-closure( move( T, a ))
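
These operations are straightforward to prototype. Below is a minimal C
sketch using bit sets; it hard-codes one plausible numbering (states 0
through 10) of the Thompson NFA for (a|b)*abb, chosen to match the sets
computed in the worked example that follows:

#include <stdio.h>
#include <stdint.h>

#define NSTATES 11
enum { EPS = 2 };            /* symbols: 0 = 'a', 1 = 'b', 2 = epsilon */

/* trans[s][c] is a bit set of successor states of s on symbol c */
static uint32_t trans[NSTATES][3];

static void add(int s, int c, int t) { trans[s][c] |= 1u << t; }

static void build_nfa(void)  /* Thompson NFA for (a|b)*abb, states 0..10 */
{
    add(0,EPS,1); add(0,EPS,7);
    add(1,EPS,2); add(1,EPS,4);
    add(2,0,3);   add(4,1,5);
    add(3,EPS,6); add(5,EPS,6);
    add(6,EPS,1); add(6,EPS,7);
    add(7,0,8);   add(8,1,9); add(9,1,10);
}

/* move(T, c): states reachable from any state in T on symbol c */
static uint32_t move(uint32_t T, int c)
{
    uint32_t R = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) R |= trans[s][c];
    return R;
}

/* eps_closure(T): states reachable from T on epsilon transitions alone */
static uint32_t eps_closure(uint32_t T)
{
    uint32_t R = T;
    for (;;) {                      /* iterate to a fixed point */
        uint32_t next = R | move(R, EPS);
        if (next == R) return R;
        R = next;
    }
}

static void print_set(const char *name, uint32_t T)
{
    printf("%s = {", name);
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) printf(" %d", s);
    printf(" }\n");
}

int main(void)
{
    build_nfa();
    uint32_t A = eps_closure(1u << 0);   /* DFA start state: {0,1,2,4,7} */
    print_set("A", A);
    print_set("B = eps-closure(move(A,a))", eps_closure(move(A, 0)));
    print_set("C = eps-closure(move(A,b))", eps_closure(move(A, 1)));
    return 0;
}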

Example: Convert the NFA for the expression: (a|b)*abb into a DFA using
the subset construction algorithm.
Step 1: Convert the above expression in to NFA using Thompson
rule constructions


Step 2: Start state of equivalent DFA is ε-closure(0)

ε-closure(0) ={ 0,1,2,4,7}

Step 2.1: Compute ε-closure(move(A,a))

move(A,a) = {3,8}
ε-closure(move(A,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4}
ε-closure(move(A,a)) = {1,2,3,4,6,7,8}
Dtran[A,a] = B
Step 2.2: Compute ε-closure(move(A,b))

move(A,b) = {5}
ε-closure(move(A,b)) = ε-closure(5) = {5,6,7,1,2,4}
ε-closure(move(A,b)) = {1,2,4,5,6,7}
Dtran[A,b] = C
DFA and transition table after step 2 is shown below.


Step 3: Compute Transition from state B on input symbol {a,b}


B = {1,2,3,4,6,7,8}

Step 3.1: Compute ε-closure(move(B,a))


move(B,a) ={3,8}
ε-closure(move(B,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4}
ε-closure(move(B,a)) ={1,2,3,4,6,7,8}
Dtran[B,a]=B

Step 3.2: Compute ε-closure(move(B,b))

move(B,b) = {5,9}
ε-closure(move(B,b)) = ε-closure(5,9) = {5,9,6,7,1,2,4}
ε-closure(move(B,b)) = {1,2,4,5,6,7,9}
Dtran[B,b] = D
DFA and transition table after step 3 is shown below.

Step 4: Compute transitions from state C on input symbols {a,b}

C = {1,2,4,5,6,7}
Step 4.1: Compute ε-closure(move(C,a))
move(C,a) ={3,8}
ε-closure(move(C,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4}
ε-closure(move(C,a)) ={1,2,3,4,6,7,8} Dtran[C,a]= B


Step 4.2: Compute ε-closure(move(C,b))

move(C,b) = {5}
ε-closure(move(C,b)) = ε-closure(5) = {5,6,7,1,2,4}
ε-closure(move(C,b)) = {1,2,4,5,6,7}
Dtran[C,b] = C
DFA and transition table after step 4 is shown below.

Step 5: Compute transitions from state D on input symbols {a,b}

D = {1,2,4,5,6,7,9}
Step 5.1: Compute ε-closure(move(D,a)) move(D,a) ={3,8}
ε-closure(move(D,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4} ε-
closure(move(D,a)) ={1,2,3,4,6,7,8}
Dtran[D,a]= B

Step 5.2: Compute ε-closure(move(D,b))

move(D,b) = {5,10}
ε-closure(move(D,b)) = ε-closure(5,10) = {5,10,6,7,1,2,4}
ε-closure(move(D,b)) = {1,2,4,5,6,7,10}
Dtran[D,b] = E

Step 6: Compute transitions from state E on input symbols {a,b}

E = {1,2,4,5,6,7,10}
Step 6.1: Compute ε-closure(move(E,a))
move(E,a) ={3,8}
ε-closure(move(E,a)) = ε-closure(3,8) = {3,8,6,7,1,2,4}
ε-closure(move(E,a)) ={1,2,3,4,6,7,8}
Dtran[E,a]= B


Step 6.2: Compute ε-closure(move(E,b))

move(E,b) = {5}
ε-closure(move(E,b)) = ε-closure(5) = {5,6,7,1,2,4}
ε-closure(move(E,b)) = {1,2,4,5,6,7}
Dtran[E,b] = C

DFA and transition table after step 6 is shown below.

Step 7: No more new DFA states are formed. Stop the subset construction
method.
The start state and accepting states are then marked in the DFA.

Minimized DFA: Convert the above DFA in to minimized DFA by applying


the following algorithm.
Minimized DFA algorithm:
Input: DFA with s states.
Output: Minimized DFA with a reduced number of states.
Steps:
1. Partition the set of states into two groups: the set of accepting states
and the set of non-accepting states. Call this partition π.
2. For each group G of π, do the following steps until π = πnew.
3. Divide G into as many subgroups as possible, such that two states s and t
remain in the same subgroup only when, for every input symbol, s and t have
transitions into states of the same group of π. Place the newly formed
groups in πnew.
4. Choose a representative state for each group.
5. Remove any dead state from the group.
After applying the minimization algorithm for the regular expression
(a|b)*abb, the groups are {A,C}, {B}, {D}, and {E}; merging A and C gives the
transition table for the minimized DFA:

State        a    b
A (start)    B    A
B            B    D
D            B    E
E (accept)   B    A
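
The minimized DFA is small enough to transcribe directly into code; a minimal
C sketch follows (for simplicity it treats every input character other than
'a' as 'b'):

#include <stdbool.h>
#include <stdio.h>

/* Minimized DFA for (a|b)*abb: states A, B, D, E with E accepting. */
static bool matches(const char *s)
{
    enum { A, B, D, E } state = A;
    for (; *s; s++) {
        int a = (*s == 'a');         /* anything else behaves like 'b' */
        switch (state) {
        case A: state = a ? B : A; break;
        case B: state = a ? B : D; break;
        case D: state = a ? B : E; break;
        case E: state = a ? B : A; break;
        }
    }
    return state == E;
}

int main(void)
{
    printf("abb:  %d\n", matches("abb"));    /* 1 */
    printf("aabb: %d\n", matches("aabb"));   /* 1 */
    printf("ab:   %d\n", matches("ab"));     /* 0 */
    return 0;
}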

12.Design of a Lexical-Analyzer Generator

1 The Structure of the Generated Analyzer


2 Pattern Matching Based on NFA's
3 DFA's for Lexical Analyzers


1. The Structure of the Generated Analyzer

Figure 3.49 shows the architecture of a lexical analyzer generated by Lex. The
program that serves as the lexical analyzer includes a fixed program that simulates
an automaton; at this point we leave open whether that automaton is deterministic
or nondeterministic. The rest of the lexical analyzer consists of components that
are created from the Lex program by Lex itself.

Figure 3.49: A Lex program is turned into a transition table and actions, which are
used by a finite-automaton simulator

These components include a transition table for the automaton, and the actions
from the input program, which appear as fragments of code to be invoked at the
appropriate time by the automaton simulator.

To construct the automaton, we begin by taking each regular-expression pattern
in the Lex program and converting it, using Algorithm 3.23 (Thompson's
construction), to an NFA. We need a single automaton that will recognize
lexemes matching any of the patterns in the program, so we combine all the
NFA's into one by introducing a new start state with ε-transitions to each of
the start states of the NFA's Ni for pattern pi. This construction is shown
in Fig. 3.50.


Example 3.26: We shall illustrate the ideas of this section with the following
simple, abstract example of three patterns and their actions:

a      { action A1 for pattern p1 }
abb    { action A2 for pattern p2 }
a*b+   { action A3 for pattern p3 }

Figure 3.51 shows three NFA's that recognize the three patterns. The third is a
simplification of what would come out of Algorithm 3.23. Then, Fig. 3.52 shows
these three NFA's combined into a single NFA by the addition of start state 0
and three ε-transitions.
2. Pattern Matching Based on NFA's

If the lexical analyzer simulates an NFA such as that of Fig. 3.52, then it must read
input beginning at the point on its input which we have referred to as lexemeBegin.
As it moves the pointer called forward ahead in the input, it calculates the set of
states it is in at each point, following Algorithm 3.22.

Eventually, the NFA simulation reaches a point on the input where there are no
next states. At that point, there is no hope that any longer prefix of the input would
ever get the NFA to an accepting state; rather, the set of states will always be
empty. Thus, we are ready to decide on the longest prefix that is a lexeme
matching some pattern.


3. DFA's for Lexical Analyzers
Another architecture, resembling the output of Lex, is to convert the NFA for all the patterns into
an equivalent DFA, using the subset construction of Algorithm 3.20. Within each DFA state, if
there are one or more accepting NFA states, determine the first pattern whose accepting state is
represented, and make that pattern the output of the DFA state.

Important problems

Q1) Describe the languages denoted by the following regular expressions:

(i) a*ba*ba*ba*
(ii) (a|b)*a(a|b)(a|b)

(i) L = { strings of a's and b's that contain exactly three b's }

(ii) L = { strings of a's and b's in which the third letter from the right
end is a }

Q2) Write regular expressions for the following languages: Explain operations on Regular expressions.

i) All strings over the English alphabet that contain the five vowels in order.

ii) All strings of a’s and b’s that do not contain the subsequence abb
