UNIT I:
Lexical Analysis: Language Processors, Structure of a Compiler, Lexical Analysis,
The Role of the Lexical Analyzer, Bootstrapping, Input Buffering, Specification of
Tokens, Recognition of Tokens, Lexical Analyzer Generator-LEX, Finite Automata,
Regular Expressions and Finite Automata, Design of a Lexical Analyzer Generator.
______________________________________________________________________
1.Language Processors: An integrated software development
environment includes many different kinds of language processors, such as
compilers, interpreters, assemblers, linkers, and loaders.
Language-processing system: - The preprocessor may also expand
shorthands, called macros, into source language statements. The modified
source program is then fed to a compiler. The compiler may produce an
assembly-language program as its output, because assembly language is easier
to produce as output and is easier to debug. The assembly language is then
processed by a program called an assembler that produces relocatable machine
code as its output.
Large programs are often compiled in pieces, so the relocatable machine
code may have to be linked together with other relocatable object files and
library files into the code that actually runs on the machine. The linker resolves
external memory addresses, where the code in one file may refer to a location in
another file. The loader then puts together all of the executable object files into
memory for execution.
Linker:- A linker or link editor is a program that takes one or more object files
generated by a compiler and combines them into a single executable program.
The linker also searches the program to find the library routines it uses, e.g.,
printf() and the math routines.
Loader:- The loader loads the program from the hard disk into main memory,
loads the starting address of the program into the program counter (PC), and
thus makes the program ready for execution.
Cousins of the compiler: preprocessor, assembler, linker, and loader.
Hybrid Compiler:- A hybrid compiler combines compilation and interpretation:
the source program is first compiled into an intermediate form (e.g., Java
bytecode), which is then interpreted or further compiled by a virtual machine.
2.Structure of a Compiler:- A compiler is broadly divided into two parts:
1.Analysis part
2.Synthesis part
1.Analysis part:- The analysis part is often called the front end of the
compiler.
In the analysis part, an intermediate representation is created from the given
source program. The analysis part consists of lexical analysis, syntax analysis,
semantic analysis, and intermediate code generation.
It breaks up the source program into its constituent pieces and checks whether
the program is syntactically and semantically correct.
It provides informative error messages, so that the user can take corrective
action. The analysis part also collects information about the source program
and stores it in a data structure called a symbol table, which is passed along
with the intermediate representation to the synthesis part.
2.Synthesis part:- The synthesis part constructs the desired target program
from the intermediate representation and the information in the symbol table.
The code optimizer and the code generator are the components of this part.
The synthesis part is the back end of the compiler.
Phases of a Compiler:- During the compilation process, each phase can encounter
errors. The error handler reports the presence of errors clearly and accurately,
and it specifies how to recover from each error quickly so that subsequent
errors can be detected.
Lexical Analyzer:- The first phase of a compiler is called lexical analysis, linear analysis, or
scanning. The lexical analyzer reads the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the
form: <token-name, attribute-value>
This output is passed to the subsequent phase, syntax analysis.
Compiler vs. Interpreter:
3. Compilation may take more time the first time, but because the target
program is saved on the hard disk, from the next time onwards the target
program produced by the compiler executes much faster. In contrast, the first
interpretation may complete in less time; but because the target program is
not saved on the hard disk, the interpreter has to re-interpret the program
every time it executes, so it is much slower than a compiler.
4. As the compiler performs the more complex task, it occupies more main
memory than an interpreter; as the interpreter performs the simpler task, it
occupies less main memory than a compiler.
2. Scanner Generator –
It generates a lexical analyzer from an input that consists of
regular-expression descriptions of the tokens of a language. Internally,
it generates a finite automaton to recognize each regular expression.
Example: Lex
4.The Role of the Lexical Analyzer:-
Since the lexical analyzer is the part of the compiler that reads the source
text, it may perform certain other tasks besides identification of lexemes. One
such task is stripping out comments and whitespace (blank, newline, tab, and
perhaps other characters that are used to separate tokens in the input).
Another task is correlating error messages generated by the compiler with the
source program.
For instance, the lexical analyzer may keep track of the number of newline
characters seen, so it can associate a line number with each error message. In
some compilers, the lexical analyzer makes a copy of the source program
with the error messages inserted at the appropriate positions. If the source
program uses a macro-preprocessor, the expansion of macros may also be
performed by the lexical analyzer.
4.2 Lexical Analysis Vs. Parsing: - There are a number of reasons why the
analysis portion of a compiler is normally separated into lexical analysis and
parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration.
The separation of lexical and syntactic analysis often
allows us to simplify at least one of these tasks.
2. Compiler efficiency is improved. A separate lexical
analyzer allows us to apply specialized techniques that
serve only the lexical task, not the job of parsing. In
addition, specialized buffering techniques for reading
input characters can speed up the compiler significantly.
3. Compiler portability is enhanced. Input-device-specific
peculiarities can be restricted to the lexical analyzer.
4.3 Token, patterns and Lexemes: - When discussing lexical analysis, we
use three related but distinct terms:
A token is a pair consisting of a token name and an optional attribute value.
The token name is an abstract symbol representing a kind of lexical unit,
e.g., a particular keyword or an identifier.
A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance
of that token.
A pattern is a description of the form that the lexemes of a token may take;
it specifies the rules that a lexeme must follow in order to belong to that
token.
In many programming languages, the following classes cover most or all of the
tokens:
1. One token for each keyword. The pattern for a keyword is the same
as the keyword itself.
2. Tokens for the operators, either individually or in classes
such as the token comparison mentioned.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as
numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and
right parentheses, comma, and semicolon.
Consider the following example of an assignment
statement:
result = oldsum - value / 100;
The lexical analyzer produces the following tokens and lexemes for it:
Token        Lexeme
IDENTIFIER   result
ASSIGN_OP    =
IDENTIFIER   oldsum
SUB_OP       -
IDENTIFIER   value
DIV_OP       /
NUMBER       100
SEMICOLON    ;
5.Bootstrapping:
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language
L, which runs on machine A and produces code for machine A.
6.Input Buffering
The lexical analyzer scans the input from left to right one character at a time,
and it may need to look ahead one or more characters before a lexeme can be
identified.
In a one-buffer scheme, only one buffer is used to store the input string. The
problem with this scheme is that if a lexeme is very long it crosses the buffer
boundary, and to scan the rest of the lexeme the buffer has to be refilled,
which overwrites the first part of the lexeme.
• The speed of the lexical analyzer is important.
• When lookahead is necessary, a two-buffer scheme is used to handle
large lookaheads safely.
• Techniques for speeding up the lexical analyzer, such as the use of
sentinels (an eof character) to mark the buffer end, have been adopted.
There are two methods used in this context:
Buffer pairs
Sentinels
Here, we use a buffer divided into two N-character halves (N is generally
equal to the maximum number of characters that can be read at one time),
as shown below.
Scheme:
• The scheme consists of two buffers, each of N characters, which are reloaded
alternately.
• N is typically the size of a disk block (e.g., 4096 bytes).
• N characters are read from the input file into a buffer using one system read
command.
• A block of data is first read into a buffer and then scanned by the lexical
analyzer.
Pointers:- Two pointers into the input are maintained: lexemeBegin, which
marks the beginning of the current lexeme, and forward, which scans ahead
until a pattern match is found.
Sentinel:- An eof character is added at the end of each buffer half; an eof
within a buffer marks the end of the input. The sentinel is a special character
that cannot be part of the source program. The end of the second buffer is
likewise recognized by the end-of-buffer mark present at its end. When forward
encounters the first eof, the end of the first buffer is recognized and the
second buffer begins to be filled. In the same way, when the second eof is
reached, it indicates the end of the second buffer.
Alternatively, both buffers can be filled until the end of the input program is
reached and the stream of tokens is identified. The eof character introduced at
the end is handled by the following lookahead code, which also reloads the
buffer halves:

switch ( *forward++ ) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else
        /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
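The pseudocode above can be made concrete. Below is a minimal runnable sketch,
my own and not from these notes: the half size N, the choice of '\0' as the
sentinel, and reading from a FILE* are all assumptions.

#include <stdio.h>

#define N 4096                      /* size of each buffer half (assumed) */
#define SENTINEL '\0'               /* assumed not to occur in the source */

static char buf[2 * N + 2];         /* two halves, one sentinel after each */
static char *forward;               /* scanning pointer */
static FILE *src;

/* Fill one half with up to N characters and plant the sentinel after them.
   If fewer than N characters are read, the sentinel marks end of input. */
static void reload(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

static void init_buffers(FILE *f) {
    src = f;
    reload(buf);
    forward = buf;
}

/* Advance forward one character, reloading across buffer boundaries. */
static int next_char(void) {
    int c = (unsigned char)*forward++;
    if (c == SENTINEL) {
        if (forward == buf + N + 1) {            /* end of first half */
            reload(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 2) { /* end of second half */
            reload(buf);
            forward = buf;
        } else {
            return EOF;                          /* eof within a buffer */
        }
        c = (unsigned char)*forward++;
        if (c == SENTINEL) return EOF;           /* reload got no data */
    }
    return c;
}

int main(void) {
    init_buffers(stdin);
    int c, count = 0;
    while ((c = next_char()) != EOF) count++;
    printf("%d characters scanned\n", count);
    return 0;
}

Note that the common case, an ordinary character, costs only one test
(c == SENTINEL); the boundary checks run only when a sentinel is seen, which
is exactly the speedup the sentinel technique is meant to provide.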
7.Regular Expressions:
Def: The languages accepted by finite automata can be described by simple
expressions called regular expressions; they are a convenient and effective
notation for specifying the patterns of tokens.
The languages denoted by regular expressions are referred to as regular
languages.
A regular expression can also be described as a sequence of patterns that
defines a set of strings; regular expressions are used to match character
combinations in strings.
The following operations on regular languages preserve regularity:
1. Union: The union of two regular languages L1 and L2, written L1 ∪ L2,
is also regular; it is the set of strings that are in L1 or L2 or both.
Example:
L1 = (1+0).(1+0) = {00, 10, 11, 01} and
L2 = {ε, 100}
then L1 ∪ L2 = {ε, 00, 10, 11, 01, 100}.
2. Concatenation:
The concatenation of two regular languages L1 and L2, written L1.L2, is also
regular; it is the set of strings formed by taking any string in L1 and
concatenating it with any string in L2.
Example:
L1 = {0, 1} and L2 = {00, 11}, then L1.L2 = {000, 011, 100, 111}.
3. Kleene closure:
If L1 is a regular language, then its Kleene closure L1* is also regular; it is
the set of strings formed by concatenating zero or more strings from L1, where
the same string may be repeated any number of times.
Example:
L1 = {0, 1}
L1* = {ε, 0, 1, 00, 01, 10, 11, ...}, i.e., L1* is the set of all strings over
the symbols 0 and 1, including the empty string.
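Stated formally in set-builder notation (my restatement of the three operations
above, not part of the original notes):

L_1 \cup L_2 = \{\, w \mid w \in L_1 \text{ or } w \in L_2 \,\}
L_1 . L_2 = \{\, xy \mid x \in L_1,\ y \in L_2 \,\}
L_1^{*} = \bigcup_{i \ge 0} L_1^{i}, \quad \text{where } L_1^{0} = \{\varepsilon\} \text{ and } L_1^{i} = L_1 . L_1^{i-1}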
For example, identifiers can be described by the regular expression:
letter_ ( letter_ | digit )*
The vertical bar above means union, the parentheses are used to group
subexpressions, and the star means "zero or more occurrences of".
The letter_ at the beginning indicates that an identifier must begin with
a letter or an underscore ( _ ).
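As a quick check of this pattern, here is a small C sketch, my own, using the
POSIX regex library (an assumption; the notes themselves do not use it):

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* ^[A-Za-z_][A-Za-z0-9_]*$ is the POSIX spelling of letter_(letter_|digit)* */
    if (regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;

    const char *tests[] = { "_count", "rate1", "9lives" };
    for (int i = 0; i < 3; i++)
        printf("%-8s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "identifier" : "no match");

    regfree(&re);
    return 0;
}

Running it prints "identifier" for _count and rate1 but "no match" for 9lives,
which begins with a digit.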
Regular Definitions:-
Regular definitions are names given to certain regular expressions; those names
can then be used in subsequent expressions as if they were symbols of the
language.
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form:
d1 → r1
d2 → r2
...
dn → rn
where
1. Each di is a new symbol, not in Σ and not the same as any other of
the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.
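For example, identifiers can be given by the following regular definition (the
standard textbook example, supplied here since the notes' own figure is not
reproduced above):

letter_ → A | B | ... | Z | a | b | ... | z | _
digit   → 0 | 1 | ... | 9
id      → letter_ ( letter_ | digit )*

Each name (letter_, digit) is defined before it is used, as rules 1 and 2
require.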
8.Recognition of Tokens
Transition Diagrams:- As an intermediate step in constructing a lexical
analyzer, patterns are converted into transition diagrams. As an example,
consider the relational operators, described by the regular definition:
relop → < | <= | = | <> | > | >=
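As an illustration, here is a small C sketch, mine rather than the notes', that
codes this transition diagram directly; the token names and the retraction
convention are assumptions:

#include <stdio.h>

enum relop { LT, LE, EQ, NE, GT, GE, NONE };   /* assumed token codes */

/* Recognize a relational operator at *s; on success advance *s past it.
   Stopping one character early plays the role of the diagram's retraction. */
enum relop get_relop(const char **s) {
    const char *p = *s;
    enum relop tok;
    switch (*p++) {
    case '<':
        if (*p == '=')      { p++; tok = LE; }  /* <= */
        else if (*p == '>') { p++; tok = NE; }  /* <> */
        else                tok = LT;           /* <, retract */
        break;
    case '=': tok = EQ; break;
    case '>':
        if (*p == '=') { p++; tok = GE; }       /* >= */
        else           tok = GT;                /* >, retract */
        break;
    default:  return NONE;                      /* not a relop */
    }
    *s = p;
    return tok;
}

int main(void) {
    const char *input = "<=";
    printf("%d\n", get_relop(&input));          /* prints 1 (LE) */
    return 0;
}

Each case label corresponds to a state of the diagram, and the lookahead test
on *p corresponds to the edge taken on the next input character.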
There are two ways to handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially. A field of the
symbol-table entry indicates that these strings are never ordinary identifiers
and tells which token they represent. When we find an identifier, a call to
installID places it in the symbol table if it is not already there and returns
a pointer to the symbol-table entry for the lexeme found. Of course, any
identifier not in the symbol table during lexical analysis cannot be a reserved
word, so its token is id. The function getToken examines the symbol-table entry
for the lexeme found and returns whatever token name the symbol table says this
lexeme represents: either id or one of the keyword tokens that was initially
installed in the table. (A C sketch of this scheme appears after this
two-item list.)
2. Create separate transition diagrams for each keyword; an example for the keyword then
is shown below. Note that such a transition diagram consists of states representing the
situation after each successive letter of the keyword is seen, followed by a test for a
"nonletter-or-digit", i.e., any character that cannot be the continuation of an identifier.
It is necessary to check that the identifier has ended, or else we would return the token
then in situations where the correct token was id.
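As mentioned in method 1 above, here is a small C sketch of the symbol-table
scheme; it is mine, not the notes': the token codes, table size, and layout are
assumptions, while installID and getToken follow the names used in the text.

#include <string.h>

enum token { ID, IF, THEN, ELSE };          /* assumed token codes */

struct entry { const char *lexeme; enum token tok; };

/* the symbol table, pre-loaded with the reserved words */
static struct entry symtab[100] = {
    { "if", IF }, { "then", THEN }, { "else", ELSE },
};
static int nentries = 3;

/* installID: return the entry for a lexeme, adding it as an id if absent */
static struct entry *installID(const char *lexeme) {
    for (int i = 0; i < nentries; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return &symtab[i];                  /* keyword or known id */
    symtab[nentries].lexeme = strdup(lexeme);   /* strdup is POSIX */
    symtab[nentries].tok = ID;
    return &symtab[nentries++];                 /* bounds check omitted */
}

/* getToken: the token name stored in the entry - a keyword token or id */
static enum token getToken(const struct entry *e) { return e->tok; }

Because the reserved words are installed first, a single lookup decides between
a keyword token and id, with no separate keyword transition diagrams needed.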
Whitespace is handled by the regular definition:
delim → blank | tab | newline
ws → delim+
9.Lexical Analyzer Generator - LEX:
Use of Lex:- An input file, which we call lex.l, is written in the Lex language
and describes the lexical analyzer to be generated. The Lex compiler
transforms lex.l into a C program, in a file that is always named lex.yy.c.
The file lex.yy.c is compiled by the C compiler into a file called a.out. The
C-compiler output is a working lexical analyzer that can take a stream of input
characters and produce a stream of tokens.
Structure of a Lex program:- A Lex program has the form:
declarations
%%
translation rules
%%
auxiliary functions
1. The declarations section includes declarations of variables, manifest
constants, and regular definitions.
2. The translation rules each have the form Pattern { Action }. Each pattern is
a regular expression, which may use the regular definitions of the declarations
section. The actions are fragments of code, typically written in C, although
other languages can also be used.
3. The third section holds whatever additional functions are used in the
actions. Alternatively, these functions can be compiled separately and
loaded with the lexical analyzer.
Lex program for tokens: a small example is sketched below.
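This sketch is mine, not the notes' original figure; the token codes, the
yylval convention, and the selection of tokens are assumptions:

%{
#include <stdio.h>
#include <stdlib.h>
/* token codes (assumed; in a real compiler these come from the parser) */
#define IF     258
#define ID     259
#define NUMBER 260
int yylval;               /* attribute value of the current token */
%}

delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+

%%

{ws}      { /* strip whitespace: no return, keep scanning */ }
if        { return IF; }
{id}      { return ID; }
{number}  { yylval = atoi(yytext); return NUMBER; }
.         { return yytext[0]; /* punctuation: the character itself */ }

%%

int yywrap(void) { return 1; }

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)   /* yylex() returns 0 at end of input */
        printf("token %d, lexeme %s\n", tok, yytext);
    return 0;
}

Compiled with "lex lex.l" followed by "cc lex.yy.c", this produces the working
scanner a.out described above. Note that the rule for "if" precedes the {id}
rule, so the keyword wins when both match a lexeme of the same length.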
The lexical analyzer created by Lex works along with the parser as follows.
When called by the parser, the lexical analyzer begins reading its remaining
input, one character at a time, until it finds the longest prefix of the input
that matches one of the patterns Pi. It then executes the associated action Ai.
Typically, Ai will return to the parser, but if it does not (e.g., because Pi
describes whitespace or comments), then the lexical analyzer proceeds to find
additional lexemes, until one of the corresponding actions causes a return to
the parser.
The lexical analyzer returns a single value, the token name, to the parser, but
uses the shared, integer variable yylval to pass additional information about
the lexeme found, if needed.
10.FINITE AUTOMATA:
A recognizer for a language is a program that takes as input a string x and
answers "yes" if x is a sentence of the language and "no" otherwise. A regular
expression is compiled into a recognizer by constructing a generalized
transition diagram called a finite automaton. A finite automaton can be
deterministic (DFA) or nondeterministic (NFA), where "nondeterministic" means
that more than one transition out of a state may be possible on the same input
symbol.
11.Regular Expressions and Finite Automata:
Thompson's construction builds an NFA for a regular expression R case by case:
1. R = ε : a single ε-transition from a new start state to a new accepting state.
2. R = a : a single transition labeled a from a new start state to a new accepting state.
3. R = a | b : the NFAs for a and b placed in parallel between a new start state
and a new accepting state, joined by ε-transitions.
4. R = ab : the NFAs for a and b placed in series, with the accepting state of
the first joined to the start state of the second.
5. R = a* : the NFA for a wrapped with ε-transitions that allow it to be skipped
entirely or repeated any number of times.
Example: Convert the NFA for the expression (a|b)*abb into a DFA using
the subset construction algorithm.
Step 1: Convert the above expression into an NFA using Thompson's
construction rules.
Step 2: The start state of the DFA is A = ε-closure(0) = {0, 1, 2, 4, 7}.
Step 3.2:
Compute ε-closure(move(B, b)):
move(B, b) = {5, 9}
ε-closure(move(B, b)) = ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D
Dtran[B, b] = D
The DFA and transition table after step 3 are shown below.
Step 4.2:
Compute ε-closure(move(C, b)):
move(C, b) = {5}
ε-closure(move(C, b)) = {1, 2, 4, 5, 6, 7} = C
Dtran[C, b] = C
Step 7: No more new DFA states are formed, so the subset construction
method stops.
The start state and the accepting states are marked in the DFA.
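The whole construction can be carried out mechanically. The following C sketch,
my own and not part of the notes, encodes the NFA of Step 1 as bitmasks and
reproduces the Dtran entries computed above:

#include <stdio.h>

#define NSTATES 11
enum { A_SYM, B_SYM, EPS };     /* input symbols: 'a', 'b', and epsilon */

/* trans[s][c] is a bitmask of the states reachable from s on symbol c */
static unsigned trans[NSTATES][3];

static void edge(int s, int c, int t) { trans[s][c] |= 1u << t; }

/* epsilon-closure of a set of states, by iterating to a fixed point */
static unsigned eps_closure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) set |= trans[s][EPS];
    } while (set != prev);
    return set;
}

/* move(set, c): the states reachable from set on input symbol c */
static unsigned move_on(unsigned set, int c) {
    unsigned out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s)) out |= trans[s][c];
    return out;
}

int main(void) {
    /* NFA for (a|b)*abb from Step 1 (states 0..10, state 10 accepting) */
    edge(0,EPS,1); edge(0,EPS,7); edge(1,EPS,2); edge(1,EPS,4);
    edge(2,A_SYM,3); edge(4,B_SYM,5); edge(3,EPS,6); edge(5,EPS,6);
    edge(6,EPS,1); edge(6,EPS,7); edge(7,A_SYM,8); edge(8,B_SYM,9);
    edge(9,B_SYM,10);

    unsigned dstates[32];
    int ndstates = 0;
    dstates[ndstates++] = eps_closure(1u << 0);        /* start state A */

    for (int i = 0; i < ndstates; i++)                 /* process each DFA state */
        for (int c = A_SYM; c <= B_SYM; c++) {
            unsigned u = eps_closure(move_on(dstates[i], c));
            int j = 0;
            while (j < ndstates && dstates[j] != u) j++;
            if (j == ndstates) dstates[ndstates++] = u; /* new DFA state */
            printf("Dtran[%c,%c] = %c\n", 'A' + i, 'a' + c, 'A' + j);
        }

    for (int i = 0; i < ndstates; i++)                 /* accepting: contain NFA state 10 */
        if (dstates[i] & (1u << 10))
            printf("state %c is accepting\n", 'A' + i);
    return 0;
}

Running it discovers the five DFA states A through E in the same order as the
worked steps above and reports E as the only accepting state.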
3. Divide the group G into as many subgroups as possible, such that two states
s and t remain in the same subgroup only when, for every input symbol, s and t
have transitions into states of the same group. Place the newly formed groups
in Πnew.
4. Choose a representative state for each group.
5. Remove any dead states from the groups.
After applying the minimization algorithm to the DFA for the regular expression
(a|b)*abb, the transition table for the minimized DFA becomes:
State           a    b
A (start)       B    A
B               B    D
D               B    E
E (accepting)   B    A
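To see the table in action, here is a small C sketch (my own; the state
numbering is an assumption) that simulates the minimized DFA on input strings:

#include <stdio.h>

/* states: 0 = A (start), 1 = B, 2 = D, 3 = E (accepting) */
static const int dtran[4][2] = {
    /*        a  b          */
    /* A */ { 1, 0 },
    /* B */ { 1, 2 },
    /* D */ { 1, 3 },
    /* E */ { 1, 0 },
};

/* Run the DFA over w; accept iff we finish in state E */
static int matches(const char *w) {
    int s = 0;
    for (; *w; w++) {
        if (*w != 'a' && *w != 'b') return 0;   /* not in the alphabet */
        s = dtran[s][*w - 'a'];
    }
    return s == 3;
}

int main(void) {
    printf("%d\n", matches("aababb"));   /* 1: the string ends in abb */
    printf("%d\n", matches("abab"));     /* 0 */
    return 0;
}

Each input character costs a single table lookup, which is why a DFA-based
scanner is faster than simulating the NFA directly.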
12.Design of a Lexical Analyzer Generator:
Figure 3.49 gives an overview of the architecture of a lexical analyzer
generated by Lex. The program that serves as the lexical analyzer includes a
fixed program that simulates an automaton; at this point we leave open whether
that automaton is deterministic or nondeterministic. The rest of the lexical
analyzer consists of components that are created from the Lex program by Lex
itself.
Figure 3.49: A Lex program is turned into a transition table and actions, which
are used by a finite-automaton simulator.
These components include the transition table for the automaton and the actions
from the input program, which appear as fragments of code to be invoked at the
appropriate time by the automaton simulator.
Example 3.26: We shall illustrate the ideas of this section with the following
simple, abstract example:
Figure 3.51 shows three NFA's that recognize the three patterns. The third is a
simplification of what would come out of Algorithm 3.23. Then, Fig. 3.52 shows
these three NFA's combined into a single NFA by the addition of start state 0
and three ε-transitions.
2. Pattern Matching Based on NFA's
If the lexical analyzer simulates an NFA such as that of Fig. 3.52, then it must read
input beginning at the point on its input which we have referred to as lexemeBegin.
As it moves the pointer called forward ahead in the input, it calculates the set of
states it is in at each point, following Algorithm 3.22.
Eventually, the NFA simulation reaches a point on the input where there are no
next states. At that point, there is no hope that any longer prefix of the input would
ever get the NFA to an accepting state; rather, the set of states will always be
empty. Thus, we are ready to decide on the longest prefix that is a lexeme
matching some pattern.
Important problems
Q1) Describe the languages denoted by the following regular expressions:
(i) a*ba*ba*ba*
(ii) (a|b)*a(a|b)(a|b)
Q2) Write regular expressions for the following languages. Explain the
operations on regular expressions.
(i) All strings over the English alphabet that contain the five vowels in order.
(ii) All strings of a's and b's that do not contain the subsequence abb.