Principles of Compiler Design
INTRODUCTION TO COMPILER
Computers are a balanced mix of software and hardware. Hardware is
just a piece of machinery whose functions are controlled by compatible
software. Hardware understands instructions in the form of electronic
charge, which is the counterpart of binary language in software
programming. Binary language has only two alphabets, 0 and 1. To
instruct the machine, hardware codes must be written in binary format,
which is simply a series of 1s and 0s. Writing such codes would be a
difficult and cumbersome task for computer programmers, which is why
we have compilers.
1.1.1. Preprocessor
A preprocessor, generally considered a part of the compiler, is a tool
that produces input for compilers. It deals with macro processing,
augmentation, file inclusion, language extension, etc.
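The C preprocessor is a familiar example: it performs file inclusion and macro processing before the compiler proper runs. A small illustration:

#include <stdio.h>       /* file inclusion: the contents of stdio.h are pasted in */
#define PI 3.14159       /* macro processing: PI is textually replaced below */

int main(void) {
    printf("PI = %f\n", PI);    /* becomes printf("PI = %f\n", 3.14159); */
    return 0;
}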
1.1.2. Interpreter
An interpreter, like a compiler, translates high-level language into low-
level machine language. The difference lies in the way they read the source
code. A compiler reads the whole source code at once, creates tokens,
checks semantics, generates intermediate code, translates the whole
program, and may involve many passes. In contrast, an interpreter reads a
statement from the input, converts it to intermediate code, executes it,
then takes the next statement in sequence. If an error occurs, an
interpreter stops execution and reports it, whereas a compiler reads the
whole program even if it encounters several errors.
1.1.3. Assembler
An assembler translates assembly language programs into machine
code. The output of an assembler is called an object file, which contains a
combination of machine instructions as well as the data required to place
these instructions in memory.
1.1.4 Linker
A linker is a computer program that links and merges various object
files together in order to make an executable file. These files may have
been compiled by separate assemblers. The major task of a linker is to
search and locate referenced modules/routines in a program and to
determine the memory locations where these codes will be loaded, so that
the program instructions have absolute references.
1.2. Cross-Compiler
A compiler that runs on one platform and is capable of generating
executable code for a different platform is called a cross-compiler.
The principal aids provided by compiler-compilers are:
1. For the scanner generator, regular expressions are used.
2. For the parser generator, context-free grammars are used.
1.4 Bootstrapping
A compiler is characterized by three languages:
1. the source language,
2. the object (target) language, and
3. the language in which it is written.
LEXICAL ANALYSIS
2.1. Introduction
Lexical analysis is the first phase of a compiler. It takes the modified
source code from language preprocessors, written in the form of
sentences. The lexical analyzer breaks these sentences into a series of
tokens, removing any whitespace and comments in the source code. If the
lexical analyzer finds an invalid token, it generates an error. The lexical
analyzer works closely with the syntax analyzer: it reads character streams
from the source code, checks for legal tokens, and passes the data to the
syntax analyzer on demand.
2.2 Tokens
A lexeme is a sequence of (alphanumeric) characters in the source
program; a lexeme matching the pattern for a token is recognized as that
token. There are predefined rules for every lexeme to be identified as a
valid token. These rules are defined by grammar rules, by means of a
pattern. A pattern explains what can be a token, and these patterns are
defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings,
numbers, operators, and punctuation symbols can be considered tokens.
For example, in the C language, the variable declaration line

int value = 100;

contains the tokens:
1) int (keyword)
2) value (identifier)
3) = (operator)
4) 100 (constant)
5) ; (symbol)
2.2.1.2 Strings
Any finite sequence of alphabet symbols is called a string. The length of
a string is the total number of symbols in it; e.g., if the string S is "INDIA",
the length of S is 5, denoted |S| = 5. A string having no symbols, i.e. a
string of zero length, is known as the empty string and is denoted by
ε (epsilon).
2.3 Language
A language is a set of strings over some finite set of alphabet symbols.
Languages are sets, so mathematical set operations can be performed on
them. Regular languages, in particular, can be described by means of
regular expressions.
Example 1: Convert the NFA obtained (by Thompson's construction) from
the regular expression (a|b)*abb into a DFA.
Solution:
ε-closure(0) = {0, 1, 2, 4, 7} = A
Move(A, a) = {3, 8}    (state 2 reads 'a' and goes to 3; state 7 reads 'a' and goes to 8)
Move(A, b) = {5}
ε-closure(Move(A, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
ε-closure(Move(A, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
Move(B, a) = {3, 8}
Move(B, b) = {5, 9}
Move(C, a) = {3, 8}
Move(C, b) = {5}
ε-closure(Move(B, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
ε-closure(Move(B, b)) = ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D
ε-closure(Move(C, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
ε-closure(Move(C, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
Move(D, a) = {3, 8}
Move(D, b) = {5, 10}
ε-closure(Move(D, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
ε-closure(Move(D, b)) = ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E
Minimizing the DFA:
Π = {A, B, C, D, E}
Πnew = {A, B, C, D} {E}
Π = Πnew = {A, B, C, D} {E}
Πnew = {A, B, C} {D} {E}
Πnew = {A, C} {B} {D} {E}
A and C have identical transitions, so they are merged into a single state A.
Transition table:
States Input System
a b
A B A
B B D
D B E
E B A
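The minimized DFA can be run directly from its transition table. The following C sketch is an illustration only: the state encoding (A=0, B=1, D=2, E=3) and the sample input are assumptions, and E is accepting because it contains the NFA's final state 10.

#include <stdio.h>

int main(void) {
    /* delta[state][symbol]: rows A, B, D, E; columns 'a', 'b' (copied from the table above) */
    int delta[4][2] = {
        {1, 0},   /* A: a -> B, b -> A */
        {1, 2},   /* B: a -> B, b -> D */
        {1, 3},   /* D: a -> B, b -> E */
        {1, 0}    /* E: a -> B, b -> A */
    };
    const char *w = "abaabb";          /* sample input; it ends in "abb" */
    int s = 0;                         /* start state A */
    for (const char *p = w; *p; p++)
        s = delta[s][*p - 'a'];
    printf("%s\n", s == 3 ? "accepted" : "rejected");   /* prints "accepted" */
    return 0;
}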
2.7.1 Implementation
LEX can build, from its input, a lexical analyzer that behaves roughly
like a finite automaton. The idea is to construct an NFA N for each token
pattern P in the translation rules, by building an NFA from its regular
expression, and then to link these NFAs together with a new start state.
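For instance, a small LEX specification might look like the sketch below (the token patterns and actions are illustrative assumptions, not taken from the text):

%{
#include <stdio.h>
%}
%%
[0-9]+                  { printf("NUMBER(%s)\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*  { printf("ID(%s)\n", yytext); }
[ \t\n]+                ;   /* whitespace produces no token */
.                       { printf("SYMBOL(%s)\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }

LEX builds one combined automaton from the NFAs of these patterns, exactly as described above.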
Summary
The lexical analyzer is also known as a scanner.
The lexical analyzer and the parser can be in the same pass.
Lexical analysis simplifies the overall design.
Preliminary scanning is also done in the first phase.
Extra buffering is needed for preliminary scanning.
LEX is a tool to construct lexical analyzers.
The output of the lexical analyzer is a stream of tokens.
A finite automaton is also called a recognizer.
Questions
1. Explain the need and role of the lexical analyzer.
2. Describe regular expressions.
3. Construct an NFA for (i) R = (a/b)* a (a/b)
(ii) R = (a/b)* (a/b)*
4. Convert each of the following into a minimized DFA:
(i) aa*/bb*
(ii) (a/b)(a/b)(a/b)
(iii) (a/b)*abb(a/b)*
(iv) 001*(1|0)*11
(v) (00)*|(11)*
SYNTAX ANALYSIS
3.1 Introduction
Syntax analysis or parsing is the second phase of a compiler. In this
chapter, we shall learn the basic concepts used in the construction of a
parser.
We have seen that a lexical analyzer can identify tokens with the help of
regular expressions and pattern rules. Due to the limitations of regular
expressions, the lexical analyzer cannot check the syntax of a given
sentence; for instance, regular expressions cannot check balancing tokens,
such as parentheses. Therefore, this phase uses context-free grammar
(CFG), which is recognized by push-down automata.
The strings are derived from the start symbol by repeatedly replacing a
non-terminal (initially the start symbol) by the right side of a production
for that non-terminal.
In this way, the parser accomplishes two tasks, i.e., parsing the code
and looking for errors. Finally a parse tree is generated as the output of
this phase.
Parsers are expected to parse the whole code even if some errors exist in
the program. Parsers use error-recovery strategies, which we will cover
later in the chapter on error handling.
Example
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
In a parse tree:
1. All leaf nodes are terminals.
2. All interior nodes are non-terminals.
3. An in-order traversal gives the original input string.
A parse tree depicts the associativity and precedence of operators. The
deepest sub-tree is traversed first; therefore the operator in that sub-tree
gets precedence over the operator in its parent nodes.
Exercise 1:
Consider the grammar
S → iCtS
S → iCtSeS
S → a
3.6.1 Associativity
If an operand has operators on both sides, the side on which the
operator takes this operand is decided by the associativity of those
operators. If the operation is left-associative, the operand is taken by the
left operator; if it is right-associative, the right operator takes the operand.
For example, since subtraction is left-associative, a - b - c is grouped as
(a - b) - c.
3.6.2 Precedence
If two different operators share a common operand, the precedence of
operators decides which will take the operand. That is, 2+3*4 can have two
different parse trees, one corresponding to (2+3)*4 and another
corresponding to 2+(3*4). By setting precedence among operators, this
problem can be easily removed. As in the previous example,
mathematically * (multiplication) has precedence over + (addition), so the
expression
2+3*4 will always be interpreted as:
2 + (3 * 4)
These methods decrease the chances of ambiguity in a language or its
grammar.
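For instance, the expression grammar used again in the parsing examples of this chapter builds precedence and associativity directly into the productions:

E → E + T | T
T → T * F | F
F → (E) | id

Because + can appear only above * in any parse tree of this grammar, 2+3*4 has the single interpretation 2+(3*4), and repeated operators at the same level associate to the left.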
A top-down parser will first parse A, which in turn will yield a string
consisting of A itself, and the parser may go into a loop forever.
Example
The production set
S => Aα | β
A => Sd
after applying the above algorithm,
should become S => Aα | β
A => Aαd | βd
and then, remove immediate left recursion using the first technique.
A => βdA'
A' => αdA' | ε
Now none of the production has either direct or indirect left recursion.
Example
In general, if the production is A → Aα | β (immediate left recursion), it
can be rewritten without left recursion as A → βA', A' → αA' | ε.
Example
If two productions share a common prefix, say A → αβ1 | αβ2, they can be
left-factored as
A => αA'
A' => β1 | β2
Now the parser has only one production per prefix, which makes it easier
to take decisions.
Top-down Parsing
When the parser starts constructing the parse tree from the start
symbol and then tries to transform the start symbol to the input, it is
called top-down parsing.
Recursive descent parsing: It is a common form of top-down parsing. It
is called recursive, as it uses recursive procedures to process the input.
Recursive descent parsing suffers from backtracking.
Backtracking: if one derivation of a production fails, the syntax
analyzer restarts the process using different rules of the same
production. This technique may process the input string more than
once to determine the right production.
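The following C sketch illustrates the idea (the tiny grammar E → T { + T }, T → id, with single-letter identifiers standing in for tokens, is an assumption chosen for brevity; this version is predictive, so it succeeds without backtracking):

#include <stdio.h>
#include <ctype.h>

static const char *ip;                /* current position in the input */

/* T -> id : a single-letter identifier */
static int term(void) {
    if (isalpha((unsigned char)*ip)) { ip++; return 1; }
    return 0;                         /* failure is reported to the caller */
}

/* E -> T { '+' T } : one recursive procedure per non-terminal */
static int expr(void) {
    if (!term()) return 0;
    while (*ip == '+') {
        ip++;                         /* consume '+' */
        if (!term()) return 0;
    }
    return 1;
}

int main(void) {
    ip = "a+b+c";
    printf("%s\n", expr() && *ip == '\0' ? "accepted" : "rejected");
    return 0;
}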
Bottom-up Parsing
As the name suggests, bottom-up parsing starts with the input symbols
and tries to construct the parse tree up to the start symbol.
Note: in both cases the input to the parser is scanned from left to
right, one symbol at a time.
The bottom-up parsing method is called "shift-reduce" parsing; the
top-down method is called "recursive descent" parsing. An
operator-precedence parser is one kind of shift-reduce parser, and a
predictive parser is one kind of recursive descent parser.
Example:
Input string : a + b * c
Production rules:
S→E
E→E+T
E → E *T
E→T
T → id
Let us start bottom-up parsing.
a+b*c
Read the input and check if any production matches with the input:
a+b*c
T+b*c
E+b*c
E+T*c
E*c
E*T
E
S
3.9.1 TOP-DOWN PARSING
We have learnt in the last chapter that the top-down parsing technique
parses the input and starts constructing a parse tree from the root node,
gradually moving down to the leaf nodes. The main types of top-down
parsing, back-tracking parsers and predictive parsers, are described in
the following sections.
Back-tracking
Top- down parsers start from the root node (start symbol) and match
the input string against the production rules to replace them (if matched).
To understand this, take the following example of CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string read, a top-down parser will behave like this:
It will start with S from the production rules and will match its yield to
the left-most letter of the input, i.e. 'r'. The very first production of S
(S → rXd) matches it, so the top-down parser advances to the next input
letter, 'e'. The parser tries to expand non-terminal X and checks its
productions from the left (X → oa). This does not match the next input
symbol, so the top-down parser backtracks to obtain the next production
rule of X, (X → ea).
In recursive descent parsing, the parser may have more than one
production to choose from for a single instance of input, whereas in
predictive parsing, each step has at most one production to choose. There
may be instances where no production matches the input string, causing
the parsing procedure to fail.
Predictive Parser
Predictive parser is a recursive descent parser, which has the capability
to predict which production is to be used to replace the input string. The
predictive parser does not suffer from backtracking.
To accomplish its task, the predictive parser uses a look-ahead pointer,
which points to the next input symbol. To make the parser backtracking-
free, the predictive parser puts some constraints on the grammar and
accepts only a class of grammars known as LL(k) grammars.
Predictive parsing uses a stack and a parsing table to parse the input
and generate a parse tree. Both the stack and the input contain an end
symbol $, to denote that the stack is empty and that the input is consumed.
The parser refers to the parsing table to take a decision on each input and
stack element combination.
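A sketch of this driver loop in C, offered as an illustration only: the grammar and table are those of the example below, and the single-character encoding (E' as 'e', T' as 't', and 'i' for the token id) is an assumption made to keep the table small.

#include <stdio.h>
#include <string.h>

/* Parsing table M[non-terminal][lookahead] for
   E->Te, e->+Te|ε, T->Ft, t->*Ft|ε, F->(E)|i ; "" encodes ε, NULL an error */
static const char *M(char X, char a) {
    switch (X) {
    case 'E': return (a == 'i' || a == '(') ? "Te" : NULL;
    case 'e': return a == '+' ? "+Te" : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "Ft" : NULL;
    case 't': return a == '*' ? "*Ft"
                   : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return a == 'i' ? "i" : a == '(' ? "(E)" : NULL;
    default:  return NULL;
    }
}

int main(void) {
    const char *ip = "i+i*i$";    /* id + id * id; '$' marks end of input */
    char stack[64] = "$E";        /* '$' at the bottom, start symbol on top */
    int top = 1;
    while (top >= 0) {
        char X = stack[top], a = *ip;
        if (X == a) {                       /* terminal on top: match it */
            top--; ip++;
            if (X == '$') { printf("accepted\n"); return 0; }
        } else {
            const char *rhs = M(X, a);      /* non-terminal: consult the table */
            if (rhs == NULL) { printf("error before '%c'\n", a); return 1; }
            top--;
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[++top] = rhs[k];      /* push the RHS in reverse */
        }
    }
    return 0;
}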
First Set
This set is created to know what terminal symbol is derived in the first
position by a non-terminal.
To compute FIRST(X) for all grammar symbols X, apply the following
rules until no more terminals or ε can be added to any FIRST set.
For example, given a production
A → tβ
A derives t (a terminal) in the very first position, so t ∈ FIRST(A).
Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-
terminal A in production rules. We do not consider what the non-terminal
can generate but instead, we see what would be the next terminal symbol
that follows the productions of a non-terminal.
To compute FOLLOW(A) for all non-terminals A, apply the following rules
until nothing can be added to any FOLLOW set.
Consider the grammar
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
Then
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { ), +, $ }
FOLLOW(F) = { ), *, +, $ }
         id           +            *            (            )          $
E        E → TE'                                E → TE'
E'                    E' → +TE'                              E' → ε     E' → ε
T        T → FT'                                T → FT'
T'                    T' → ε       T' → *FT'                 T' → ε     T' → ε
F        F → id                                 F → (E)
Shift step
The shift step refers to the advancement of the input pointer to the next
input symbol, which is called the shifted symbol. This symbol is pushed
onto the stack and treated as a single node of the parse tree.
Reduce step
When the parser finds a complete grammar rule (RHS) and replaces it by
its (LHS), this is known as a reduce step. It occurs when the top of the
stack contains a handle. To reduce, a POP function is performed on the
stack, which pops off the handle and replaces it with the LHS non-terminal
symbol.
Reducing a string W to the start symbol S of a grammar: at each step a
string matching the right side of a production is replaced by the symbol
on the left.
Example:
S → aAcBe; A → Ab; A → b; B → d, and the string is abbcde; we have to
reduce it to S:
abbcde
aAbcde     (A → b)
aAcde      (A → Ab)
aAcBe      (B → d)
S          (S → aAcBe)
Each replacement of the right side of a production by its left side in the
process above is called a reduction. The reverse of a rightmost derivation
yields the handles: if S ⇒* αAw ⇒ αβw, then A → β in the position
following α is a handle of αβw. The string w to the right of the handle
contains only terminal symbols.
A rightmost derivation in reverse, often called a canonical reduction
sequence, is obtained by "handle pruning".
Example:
E → E+E
E → E*E
E → (E)
E → id
Input: id1 + id2 * id3

Right Sentential Form     Handle    Reducing Production
id1+id2*id3               id1       E → id
E+id2*id3                 id2       E → id
E+E*id3                   id3       E → id
E+E*E                     E*E       E → E*E
E+E                       E+E       E → E+E
E
A shift-reduce parser performs four actions: 1) shift, 2) reduce, 3) accept,
and 4) error.
1) In shift action, the next input symbol is shifted to the top of the stack.
2) In the reduce action, the parser knows the right end of the handle is at
the top of the stack. It must locate the left end of the handle within the
stack and decide what non-terminal replaces the handle.
3) In an accept action, the parser announces successful completion of
parsing.
4) In an error action, parser looks for syntax error and calls an error
recovery routine.
Example:
E → E+E
E → E*E
E → (E)
E → id
Input: id1+id2*id3
Stack        Input           Action
$            id1+id2*id3$    Shift
$id1         +id2*id3$       Reduce by E → id
$E           +id2*id3$       Shift
$E+          id2*id3$        Shift
$E+id2       *id3$           Reduce by E → id
$E+E         *id3$           Shift
$E+E*        id3$            Shift
$E+E*id3     $               Reduce by E → id
$E+E*E       $               Reduce by E → E*E
$E+E         $               Reduce by E → E+E
$E           $               Accept
Definition:
An operator-precedence grammar is an ε-free operator grammar in which
the precedence relations <·, =·, and ·> constructed are disjoint.
Note:
LEADING(A) = { a | A ⇒+ γaδ, where γ is ε or a single non-terminal }
TRAILING(A) = { a | A ⇒+ γaδ, where δ is ε or a single non-terminal }
Precedence Functions
The precedence table can be encoded by two precedence functions f and g.
We select f and g for symbols a and b such that:
f(a) < g(b) whenever a <· b
f(a) = g(b) whenever a =· b
f(a) > g(b) whenever a ·> b
Example (one consistent choice of precedence functions for id, +, *, $):

        id    +    *    $
f       4     2    4    0
g       5     1    3    0
3.10 LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It
handles a wide class of context-free grammars, which makes it a very
efficient syntax analysis technique. LR parsers are also known as LR(k)
parsers, where L stands for left-to-right scanning of the input stream, R
stands for the construction of a rightmost derivation in reverse, and k
denotes the number of lookahead symbols used to make decisions.
An LL Parser accepts LL grammar. LL grammar is a subset of context-
free grammar but with some restrictions to get the simplified version, in
order to achieve easy implementation. LL grammar can be implemented by
means of both algorithms, namely, recursive-descent or table-driven.
An LL parser is denoted LL(k). The first L in LL(k) means the input is
parsed from left to right, the second L stands for left-most derivation, and
k represents the number of lookaheads. Generally k = 1, so LL(k) may also
be written as LL(1).
Example:
CLOSURE(E' → .E) = I0:
E' → .E
E → .E+T
E → .T
T → .T*F
T → .F
F → .(E)
F → .id
Step 3: perform GOTO(I, X) (I is a closure set and X ranges over all
grammar symbols). GOTO(I, X) is defined to be the closure of the set of all
items [A → αX.β] such that [A → α.Xβ] is in I.
Suppose an item in I is of the form A → .XBY. Then GOTO(I, X) is
computed by the following rules:
1. If A → X.BY where B is a terminal, include only this item in GOTO(I, X).
2. If A → X.BY where B is a non-terminal, include this item along with
CLOSURE(B).
Example:
I0: E' → .E
    E → .E+T
    E → .T
    T → .T*F
    T → .F
    F → .(E)
    F → .id

GOTO(I0, E) = I1: E' → E.
                  E → E.+T
GOTO(I0, T) = I2: E → T.
                  T → T.*F
GOTO(I0, F) = I3: T → F.
GOTO(I0, +) = GOTO(I0, *) = GOTO(I0, )) = null
GOTO(I0, () = I4: F → (.E)        // rule 2 of step 3 is applied here
                  E → .E+T
                  E → .T
                  T → .T*F
                  T → .F
                  F → .(E)
                  F → .id
GOTO(I0, id) = I5: F → id.
I2: E → T.
    T → T.*F
GOTO(I2, X):
GOTO(I2, E) = GOTO(I2, T) = GOTO(I2, F) = GOTO(I2, +) = GOTO(I2, () =
GOTO(I2, )) = GOTO(I2, id) = null
GOTO(I2, *) = I7: T → T*.F        // rule 2 of step 3
                  F → .(E)
                  F → .id

I3: T → F.   GOTO(I3, X):
GOTO(I3, E) = GOTO(I3, T) = GOTO(I3, F) = GOTO(I3, +) = GOTO(I3, *) =
GOTO(I3, () = GOTO(I3, )) = GOTO(I3, id) = null
I4: F → (.E)
    E → .E+T
    E → .T
    T → .T*F
    T → .F
    F → .(E)
    F → .id
GOTO(I4, X) is:
GOTO(I4, E) = I8: F → (E.)
                  E → E.+T
GOTO(I4, T) = I2: E → T.
                  T → T.*F
GOTO(I4, F) = I3: T → F.
GOTO(I4, () = I4 ; GOTO(I4, id) = I5
GOTO(I4, +) = null ; GOTO(I4, *) = null ; GOTO(I4, )) = null.
I5: F → id.
GOTO(I5, E) = GOTO(I5, T) = GOTO(I5, F) = GOTO(I5, +) = GOTO(I5, *) =
GOTO(I5, () = GOTO(I5, )) = GOTO(I5, id) = null
I6: E → E+.T
    T → .T*F
    T → .F
    F → .(E)
    F → .id
GOTO(I6, E) = GOTO(I6, +) = GOTO(I6, *) = GOTO(I6, )) = null
GOTO(I6, T) = I9: E → E+T.
                  T → T.*F
GOTO(I6, F) = I3: T → F.
GOTO(I6, () = I4: F → (.E)
                  E → .E+T
                  E → .T
                  T → .T*F
                  T → .F
                  F → .(E)
                  F → .id
GOTO(I6, id) = I5: F → id.
I7: T → T*.F
    F → .(E)
    F → .id
GOTO(I7, E) = GOTO(I7, T) = GOTO(I7, +) = GOTO(I7, *) = GOTO(I7, )) = null
GOTO(I7, F) = I10: T → T*F.
GOTO(I7, () = I4: F → (.E)
                  E → .E+T
                  E → .T
                  T → .T*F
                  T → .F
                  F → .(E)
                  F → .id
GOTO(I7, id) = I5: F → id.
I8: F → (E.)
    E → E.+T
GOTO(I8, E) = GOTO(I8, T) = GOTO(I8, F) = GOTO(I8, *) = GOTO(I8, () =
GOTO(I8, id) = null
GOTO(I8, +) = I6 ; GOTO(I8, )) = I11: F → (E).

I9: E → E+T.
    T → T.*F
GOTO(I9, *) = I7 ; all other GOTO(I9, X) are null.
LR(0) items:
I0: E' → .E
    E → .E+T
    E → .T
    T → .T*F
    T → .F
    F → .(E)
    F → .id
I1: E' → E.
    E → E.+T
I2: E → T.
    T → T.*F
I3: T → F.
I4: F → (.E)
    E → .E+T
    E → .T
    T → .T*F
    T → .F
    F → .(E)
    F → .id
I5: F → id.
I6: E → E+.T
    T → .T*F
    T → .F
    F → .(E)
    F → .id
I7: T → T*.F
    F → .(E)
    F → .id
I8: F → (E.)
    E → E.+T
I9: E → E+T.
    T → T.*F
I10: T → T*F.
I11: F → (E).
Construction of the SLR parsing table:
This is also a two-dimensional array in which the rows are states and the
columns are terminals and non-terminals. The table has two parts:
1) Action entries
2) Goto entries
An action may be one of the following:
1) Shift
2) Reduce
3) Accept
4) Error
Example:
1) E → E+T. is in I9, and FOLLOW(E) = {+, ), $}
2) E → T. is in I2, and FOLLOW(E) = {+, ), $}
3) T → T*F. is in I10, and FOLLOW(T) = {+, *, ), $}
4) T → F. is in I3, and FOLLOW(T) = {+, *, ), $}
State    id    +     *     (     )     $      E    T    F
0        S5                S4                 1    2    3
1              S6                      Acc
2              r2    S7          r2    r2
3              r4    r4          r4    r4
4        S5                S4                 8    2    3
5              r6    r6          r6    r6
6        S5                S4                      9    3
7        S5                S4                           10
8              S6                S11
9              r1    S7          r1    r1
10             r3    r3          r3    r3
11             r5    r5          r5    r5
LL vs. LR

LL                                           LR
Starts with the root non-terminal on the     Ends with the root non-terminal on the
stack.                                       stack.
Uses the stack for designating what is       Uses the stack for designating what is
still to be expected.                        already seen.
Builds the parse tree top-down.              Builds the parse tree bottom-up.
Reads the terminals when it pops one         Reads the terminals while it pushes them
off the stack.                               on the stack.
Summary
The syntax analyzer is also called a parser.
Context-free grammar specifies the language; it is recognized by push-down automata.
The parse tree is the output of the parser.
Precedence rules decide the order in which operators are applied.
There are two types of parsing: bottom-up parsing and top-down parsing.
Recursive descent parsing is a kind of top-down parsing.
Shift-reduce parsing is a kind of bottom-up parsing.
A predictive parser is a kind of recursive descent parser.
Questions
1. Explain the types of parsers.
2. Construct a parsing table for the grammar
E → E+T | T
T → T*F | F
F → (E) | id
ERROR HANDLING
4.1 Introduction
A parser should be able to detect and report any error in the program. It
is expected that when an error is encountered, the parser should be able to
handle it and carry on parsing the rest of the input. Mostly it is expected
from the parser to check for errors but errors may be encountered at
various stages of the compilation process. A program may have the
following kinds of errors at various stages:
errors detected by the lexical analyzer, errors detected by the parser, and
errors detected by the semantic checker.
If we look closely, we find that most of the leaf nodes are single children
of their parent nodes. This information can be eliminated before feeding
the tree to the next phase. By hiding this extra information, we can obtain
a more compact tree.
Summary
Errors may be encountered at various stages of the compilation process.
The parser may discover syntactic errors.
Syntactic errors occur when the grammar of the language is violated.
An AST is a data structure more compact than the parse tree.
An augmented grammar (error productions) can be used to recognize
erroneous constructs.
Global error correction may be done.
SEMANTIC ANALYSIS
5.1 Introduction
We have learnt how a parser constructs parse trees in the syntax
analysis phase. The plain parse tree constructed in that phase is generally
of no use to a compiler, as it does not carry any information about how to
evaluate the tree. The productions of the context-free grammar, which
make up the rules of the language, do not specify how to interpret them.
For example:
E→E+T
The above CFG production has no semantic rule associated with it, and
it cannot help in making any sense of the production.
Semantics
Semantics of a language provide meaning to its constructs, like tokens
and syntax structure. Semantics help interpret symbols, their types, and
their relations with each other. Semantic analysis judges whether the
syntax structure constructed in the source program derives any meaning
or not.
CFG + semantic rules = Syntax-Directed Definitions
For example:
int a = "value";
should not issue an error in the lexical or syntax analysis phase, as it is
lexically and structurally correct, but it should generate a semantic error,
as the type of the assignment differs. These rules are set by the grammar
of the language and evaluated in semantic analysis. The following tasks
should be performed in semantic analysis:
Scope resolution
Type checking
Array-bound checking
Example:
S → E$
E → E + E
E → E * E
E → (E)
E → I
I → I digit
I → digit
[Note: I stands for an integer]
To implement this syntax-directed translation scheme, we need to
construct a lexical analyzer and a bottom-up parser.
A compiler-compiler would tie the parser and the semantic-action
program fragments together, producing one module.
PRODUCTION          PROGRAM FRAGMENT
1. S → E$           print VAL[TOP]
2. E → E + E        VAL[TOP] := VAL[TOP] + VAL[TOP-2]
3. E → E * E        VAL[TOP] := VAL[TOP] * VAL[TOP-2]
4. E → (E)          VAL[TOP] := VAL[TOP-1]
5. E → I            none
6. I → I digit      VAL[TOP] := 10 * VAL[TOP] + LEXVAL
7. I → digit        VAL[TOP] := LEXVAL
Input : 23*5+4$
NO    INPUT      STATE    VAL       PRODUCTION USED
1.    23*5+4$    --       --
2.    3*5+4$     2        --
3.    3*5+4$     I        2         I → digit
4.    *5+4$      I3       2-
5.    *5+4$      I        (23)      I → I digit
6.    *5+4$      E        (23)      E → I
7.    5+4$       E*       (23)-
8.    +4$        E*5      (23)--
9.    +4$        E*I      (23)-5    I → digit
10.   +4$        E*E      (23)-5    E → I
11.   +4$        E        (115)     E → E*E
12.   4$         E+       (115)-
13.   $          E+4      (115)--
14.   $          E+I      (115)-4   I → digit
15.   $          E+E      (115)-4   E → I
16.   $          E        (119)     E → E+E
17.   --         E$       (119)-
18.   --         S        --        S → E$
[Sequence of moves]
Example:
E → E + T { E.value = E.value + T.value }
The right part of the CFG contains the semantic rules that specify how
the grammar should be interpreted. Here, the values of non-terminals E
and T are added together and the result is copied to the non-terminal E.
Semantic attributes may be assigned to their values from their domain
at the time of parsing and evaluated at the time of assignment or
conditions. Based on the way the attributes get their values, they can be
broadly divided into two categories : synthesized attributes and inherited
attributes.
For a production S → ABC: A can get values from S, B, and C; B can take
values from S, A, and C; likewise, C can take values from S, A, and B.
Expansion: When a non-terminal is expanded to terminals as per a
grammatical rule.
Reduction
When a terminal is reduced to its corresponding non-terminal according
to grammar rules. Syntax trees are parsed top-down and left to right.
Whenever reduction occurs, we apply its corresponding semantic rules
(actions).
Semantic analysis uses Syntax Directed Translations to perform the
above tasks. Semantic analyzer receives AST (Abstract Syntax Tree)
from its previous stage (syntax
analysis).
Semantic analyzer attaches attribute information with AST, which are
called Attributed AST.
An attribute is a two-tuple value: <attribute name, attribute value>.
For example:
int value = 5;
<type, “integer”>
<presentvalue, “5”>
For every production, we attach a semantic rule.
S-attributed SDT
If an SDT uses only synthesized attributes, it is called an S-attributed
SDT. In S-attributed SDTs, the semantic actions are written after the
production (at the right-hand end).
L-attributed SDT
This form of SDT uses both synthesized and inherited attributes with
restriction of not taking values from right siblings.
In L-attributed SDTs, a non-terminal can get values from its parent,
child, and sibling nodes. As in the following production,
S → ABC
S can take values from A, B, and C (synthesized). A can take values
from S only. B can take values from S and A. C can get values from S, A,
and B. No non-terminal can get values from the sibling to its right.
Attributes in L-attributed SDTs are evaluated by depth-first and left-to-
right parsing manner.
Return Value : stores return values.
SUMMARY
Syntax-directed translation is based on a context-free grammar.
Syntax-directed translation helps to generate intermediate code.
Semantic actions are enclosed in curly braces.
When the translation of a non-terminal on the right side of a production is
defined in terms of the non-terminal on the left side, it is called an
inherited translation.
Input-to-output mapping is described using syntax-directed translation.
A parse tree having values in its nodes is called an annotated parse tree.
Questions
1. Explain syntax-directed translation.
2. Define semantic actions.
3. Write a note on the dependency graph.
4. Generate semantic actions for the productions
S → E$
E → E + E | E * E | (E) | I
I → I digit | digit (where I is an integer)
Let us consider the data that can be associated with a name in the symbol
table. This information includes:
1. The string of characters denoting the name.
2. Attributes of the name and information identifying what use is being
made of the name.
3. Parameters, such as the dimensions of arrays.
4. An offset describing the position in storage to be allocated for the name.
5. The syntax of the language may also cause implicitly declared variables
to play certain roles.
8.2.1.1 Names and Symbol-Table Records
The simplest way to implement a symbol table is as a linear array of
records, one record per name. A record consists of a known number of
consecutive words of memory.
8.2.1.2 Reusing Symbol-Table Space
The identifier used by the programmer to denote a particular name must
be preserved in the symbol table until no further references to that
identifier can possibly denote the same name.
This is essential so that all users of the identifier can be associated with the
same symbol table entry, and hence the same name.
A compiler can be designed to run in less space if the space used to store
identifiers can be reused in subsequent passes.
8.2.1.3 Array Names
If the language places a limit on the number of dimensions, then all
subscript information can, in principle, be placed in the symbol-table
record itself.
The upper and lower limits of a dynamically allocated array can be any
expressions evaluated at run time, when the storage is allocated for the
array. If an expression is a constant, its value can be stored in the symbol
table. If a limit is declared to be an expression, the compiler must generate
code to evaluate that expression and assign the result to a temporary
variable T.
8.2.1.4 Indirection in Symbol-Table Entries
Symbol-table formats can be designed with pointers to information that is
of variable length. This saves space compared with allocating in each
symbol-table entry the maximum possible amount of space. The most
significant advantage of using indirection comes when we have a type of
information that is applicable to only a minority of the entries.
8.2.1.5 Storage Allocation Information
The symbol table also records the locations in storage belonging to objects
at run time.
If the object code is assembly language, the compiler can scan the symbol
table and generate assembly-language definitions, appended to the
executable portion of the program. If machine code is generated by the
compiler, objects are stored with a fixed origin. The same remark applies
to blocks of data loaded as a module separate from the executable
program.
In the case of names whose storage is allocated on a stack, the compiler
need not allocate storage at all; instead, it must plan out the activation
record for each procedure.
Self-Organizing Lists
These need a little extra space but save a little time. Three fields, NAME1,
DATA1, and LINK1, are used, along with two tables: a hash table and a
storage table.
Hashing is a variation of searching techniques. Open hashing places no
limit on the number of entries. The average time to insert n names and
make e inquiries is proportional to n(n + e)/m. If m is large, the average
time is reduced; if m is small, the average time is high.
Hashing Method
The scheme consists of a fixed array of m pointers to table entries; table
entries are organized into m separate linked lists called buckets. The hash
table consists of k words, numbered 0, 1, ..., k-1; these contain pointers to
the storage table.
A hash function h, such that h(NAME) is an integer value between 0 and
k-1, is used to find whether NAME is in the symbol table.
Characteristic: uniform distribution of names over the buckets.
8.4 Operations
A symbol table, whether linear or hashed, should provide the following
operations.
8.4.1 insert()
This operation is more frequently used by analysis phase, i.e., the first
half of the compiler where tokens are identified and names are stored in
the table. This operation is used to add information in the symbol table
about unique names occurring in the source code. The format or structure
in which the names are stored depends upon the compiler in hand.
An attribute for a symbol in the source code is the information
associated with that symbol. This information contains the value, state,
scope, and type about the symbol. The insert() function takes the symbol
and its attributes as arguments and stores the information in the symbol
table.
For example:
int a;
should be processed by the compiler as:
insert(a, int);
8.4.2 lookup()
lookup () operation is used to search a name in the symbol table to
determine:
1. If the symbol exists in the table.
2. If it is declared before it is being used.
3. If the name is used in the scope.
4. If the symbol is initialized.
5. If the symbol is declared multiple times.
The format of lookup() function varies according to the programming
language. The
basic format should match the following:
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol
table. If the symbol exists in the symbol table, it returns its attributes
stored in the table.
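A minimal open-hashing sketch of these two operations in C (the entry layout, with just a name and a type string, and the hash function are assumptions for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 211                   /* a prime table size helps uniform spread */

struct entry {
    char *name;                       /* the identifier */
    char *type;                       /* one attribute: its declared type */
    struct entry *next;               /* chain for open hashing */
};

static struct entry *bucket[BUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = 31 * h + (unsigned char)*s++;
    return h % BUCKETS;
}

void insert(const char *name, const char *type) {
    unsigned h = hash(name);
    struct entry *e = malloc(sizeof *e);
    e->name = strdup(name);
    e->type = strdup(type);
    e->next = bucket[h];              /* push onto the front of its bucket */
    bucket[h] = e;
}

struct entry *lookup(const char *name) {
    for (struct entry *e = bucket[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;                 /* found: return the stored attributes */
    return NULL;                      /* 0: the symbol does not exist */
}

int main(void) {
    insert("a", "int");               /* models processing "int a;" */
    struct entry *e = lookup("a");
    if (e) printf("%s : %s\n", e->name, e->type);
    return 0;
}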
int value = 10;
void pro_one()
{
    int one_1;
    int one_2;
    {
        int one_3;      /* inner scope 1 */
        int one_4;
    }
    int one_5;
    {
        int one_6;      /* inner scope 2 */
        int one_7;
    }
}
void pro_two()
{
    int two_1;
    int two_2;
    {
        int two_3;      /* inner scope 3 */
        int two_4;
    }
    int two_5;
}
...
The above program can be represented in a hierarchical structure of
symbol tables:
The global symbol table contains names for one global variable (int
value) and two procedure names, which should be available to all the child
nodes shown above. The names mentioned in the pro_one symbol table
(and all its child tables) are not available for pro_two symbols and its child
tables.
This symbol-table hierarchy is stored in the semantic analyzer. Whenever
a name needs to be searched, it is looked up in the current scope's table
first; if it is not found there, the search continues outward through the
enclosing scopes' tables until the name is found or the global table has
been searched.
Data in C can be global, meaning it is allocated static storage and is
available to any procedure, or local, meaning it can be accessed only by
the procedure in which it is declared.
Procedures: their text part is static, but they are called in a random
manner. That is why stack storage is used to manage procedure calls
and activations.
Variables: variables are known at runtime only, unless they are global or
constant. The heap memory allocation scheme is used for managing
allocation and de-allocation of memory for variables at runtime.
6.9.1 Static Allocation
In this allocation scheme, the compilation data is bound to a fixed
location in the memory and it does not change when the program executes.
As the memory requirement and storage locations are known in advance,
runtime support package for memory allocation and de-allocation is not
required.
6.9.2 Stack Allocation
Procedure calls and their activations are managed by means of stack
memory allocation. It works in last-in-first-out (LIFO) method and this
allocation strategy is very useful for recursive procedure calls.
6.9.3 Heap Allocation
Variables local to a procedure are allocated and de-allocated only at
runtime. Heap allocation is used to dynamically allocate memory to
variables and to claim it back when they are no longer required. Except
for the statically allocated memory area, both stack and heap memory can
grow and shrink dynamically and unexpectedly; therefore, they cannot be
provided with a fixed amount of memory in the system.
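The three schemes map directly onto familiar C storage; a small sketch for illustration (the variable names are hypothetical):

#include <stdio.h>
#include <stdlib.h>

int counter = 0;                      /* static allocation: one fixed location for the whole run */

void visit(int depth) {
    int local = depth;                /* stack allocation: lives in this call's activation record */
    counter++;
    if (depth > 0)
        visit(depth - 1);             /* recursion works because records are stacked (LIFO) */
    printf("leaving depth %d\n", local);
}

int main(void) {
    int *buf = malloc(8 * sizeof *buf);   /* heap allocation: size chosen at runtime */
    visit(2);
    free(buf);                            /* heap storage is reclaimed explicitly */
    return 0;
}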
Parameter Passing
The communication medium among procedures is known as parameter
passing. The values of the variables from a calling procedure are
transferred to the called procedure by some mechanism. Before moving
ahead, first go through some basic terminology pertaining to the values in
a program.
r-value
The value of an expression is called its r-value. The value contained in a
single variable also becomes an r-value if it appears on the right-hand side
of the assignment operator. r-values can always be assigned to some other
variable.
l-value
The location of memory (address) where an expression is stored is
known as the l-value of that expression. It always appears at the left hand
side of an assignment operator.
For example:
day = 1;
week = day * 7;
month = 1;
year = month * 12;
From this example, we understand that constant values like 1, 7, 12,
and variables like day, week, month, and year, all have r-values. Only
variables have l-values, as they also represent the memory location
assigned to them.
For example:
7 = x + y;
is an l-value error, as the constant 7 does not represent any memory
location.
Formal Parameters
Variables that take the information passed by the caller procedure are
called formal parameters. These variables are declared in the definition of
the called function.
Actual Parameters
Variables whose values or addresses are being passed to the called
procedure are called actual parameters. These variables are specified in
the function call as arguments.
Example:
fun_one()
{
int actual_parameter = 10;
call fun_two(int actual_parameter);
}
fun_two(int formal_parameter)
{
print formal_parameter;
}
Formal parameters hold the information of the actual parameter,
depending upon the parameter passing technique used. It may be a value
or an address.
Pass by Value
In the pass-by-value mechanism, the calling procedure passes the r-values
of actual parameters, and the compiler puts them into the called
procedure's activation record. Formal parameters then hold the values
passed by the calling procedure.
Pass by Reference
In pass by reference mechanism, the l-value of the actual parameter is
copied to the activation record of the called procedure. This way, the called
procedure now has the address (memory location) of the actual parameter
and the formal parameter refers to the same memory location. Therefore, if
the value pointed by the formal parameter is changed, the impact should
be seen on the actual parameter, as they should also point to the same
value.
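C itself always passes r-values, so pass by reference is usually modelled by passing the l-value (address) explicitly; a short illustrative sketch:

#include <stdio.h>

void by_value(int x)      { x = 99; }     /* only the copy in the callee changes */
void by_reference(int *x) { *x = 99; }    /* the caller's memory location changes */

int main(void) {
    int a = 10;
    by_value(a);
    printf("%d\n", a);    /* prints 10: the actual parameter is untouched */
    by_reference(&a);
    printf("%d\n", a);    /* prints 99: the callee wrote through the l-value */
    return 0;
}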
Pass by Copy-restore
This parameter-passing mechanism works similarly to pass-by-reference,
except that the changes to actual parameters are made when the called
procedure ends. Upon a function call, the values of actual parameters are
copied into the activation record of the called procedure. Formal
parameters, if manipulated, have no real-time effect on actual parameters
(as l-values are passed), but when the called procedure ends, the values of
the formal parameters are copied back to the l-values of the actual
parameters. Example:
int y;
calling_procedure()
{
    y = 10;
    copy_restore(y);   /* l-value of y is passed */
    printf y;          /* prints 99 */
}
copy_restore(int x)
{
    x = 99;            /* y still has value 10 (unaffected) */
    y = 0;             /* y is now 0 */
}
When this function ends, the l-value of formal parameter x is copied to
the actual parameter y. Even if the value of y is changed before the
procedure ends, the l-value of x is copied to the l-value of y, making it
behave like call by reference.
Pass by Name
Languages like ALGOL provide a kind of parameter-passing mechanism
that works like the preprocessor in the C language. Pass-by-name
textually substitutes the argument expressions in a procedure call for the
corresponding parameters in the body of the procedure, so that the
procedure now works on the actual parameters, much like pass-by-
reference.
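Since the text compares pass by name to the C preprocessor, a macro can give a rough feel for the textual substitution involved (the macro and helper function are hypothetical; C itself does not support pass by name):

#include <stdio.h>

#define TWICE(e) ((e) + (e))     /* the argument TEXT is substituted, not its value */

static int calls = 0;
static int next(void) { return ++calls; }

int main(void) {
    int r = TWICE(next());       /* expands to ((next()) + (next())) */
    printf("r=%d calls=%d\n", r, calls);   /* r=3, calls=2: the expression ran twice */
    return 0;
}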
Question
1. Explain the contents of Symbol Table.
2. Describe the data structures for the symbol table.
3. Explain the stack allocation scheme.
4. Write a note on storage allocation in block-structured languages.
5. Briefly explain
a. Common data area.
b. Equivalence data area.
Intermediate code eliminates the need of a new full compiler for every
unique machine by keeping the analysis portion same for all the compilers.
The second part of compiler, synthesis, is changed according to the target
machine.
It becomes easier to apply the source code modifications to improve
code performance by applying code optimization techniques on the
intermediate code.
Usually, a three-address code statement contains two addresses for the
operands and one for the result.
Additional three-address statements:
1. Assignment statements of the form A := B op C (binary arithmetic or
logical operator).
2. Assignment instructions A := op B (unary operator) and plain copies
A := B (B is assigned to A).
3. Unconditional jumps: goto L.
4. Conditional jumps: if A relop B goto L (relational operator).
5. param A and call p, n for procedure calls, e.g.:
param A1
...
param An
call p, n
6. Indexed assignments: A := B[I] (a location within an array).
7. Pointer and address assignments:
A := addr B (the address of B), A := *B (the value at location B), and
*A := B.
8.8.3.2 Triples
Triples are used to avoid entering temporary names into the symbol table.
Only three fields are used: the operator and its two arguments.
E.g.:
A := -B * (C + D)
is broken into:
T1 := -B
T2 := C + D
T3 := T1 * T2
A := T3
Example:
A → id := E (an assignment statement) consists of code to evaluate E into
some temporary, followed by an assignment of that temporary to id, where
E → E+E | E*E | -E | (E) | id
CODE OPTIMIZATION
8.11 Introduction
Optimization is a program transformation technique which tries to
improve the code by making it consume fewer resources (i.e., CPU,
memory) and deliver higher speed.
1. The term "code optimization" refers to techniques a compiler can
employ in an attempt to produce a better object-language program than
the most obvious one for a given source program.
2. The quality of the object program is generally measured by its size (for
small computations) or its running time (for large computations).
3. It is theoretically impossible for a compiler to produce the best possible
object program for every source program under any reasonable cost
function.
4. A more accurate term for "code optimization" is "code improvement".
5. There are two main aspects to code optimization:
(i) Cost
(ii) Quick and straightforward translation (time).
The above control flow graph depicts a chunk of a program where variable
'a' is used to hold the output of the expression 'x * y'. Let us assume that
the value assigned to 'a' is never used inside the loop. Immediately after
the control leaves the loop, 'a' is assigned the value of variable 'z', which is
used later in the program. We conclude that the assignment to 'a' is never
used anywhere, and it is therefore eligible to be eliminated.
Likewise, if a conditional statement is always false, the code written in its
true case will never be executed, and hence it can be removed.
if (condition)
{
    ...
    tmp = y OP z;
    a = tmp;
    ...
}
else
{
    ...
    tmp = y OP z;
}
c = tmp;
Here, whether the condition is true or false, y OP z should be computed
only once.
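One possible transformed form, assuming y and z are not modified inside either branch, hoists the computation so it is performed exactly once:

tmp = y OP z;    /* computed once, before the branch */
if (condition)
{
    ...
    a = tmp;
    ...
}
else
{
    ...
}
c = tmp;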
Summary
Code optimization is also called code improvement.
The two aspects of code optimization are cost and time.
Constant folding is used to aid later optimization.
Optimization mainly depends on the algorithm used.
The DAG is the data structure used to analyze basic blocks.
A DAG is a directed graph with no cycles.
A DAG is useful for determining common subexpressions.
Questions
1. Explain the principal sources of optimization.
2. Describe loop optimization.
3. Write a note on strength reduction.
4. Describe the DAG representation.
5. State the properties and uses of DAGs.
9. CODE GENERATION
9.1 Introduction
Code generation can be considered the final phase of compilation. After
code generation, further optimization can be applied, but that can be seen
as part of the code generation phase itself. The code generated by the
compiler is object code in some lower-level programming language, for
example, assembly language. We have seen that source code written in a
higher-level language is transformed into a lower-level language, resulting
in lower-level object code, which should have the following minimum
properties:
1. It should carry the exact meaning of the source code.
2. It should be efficient in terms of CPU usage and memory management.
We will now see how the intermediate code is transformed into target
object code (assembly code, in this case).
For a three-address statement x := y op z, the code generator may emit:
MOV y', L
OP z', L
where L is the location chosen for the result x, and y' and z' denote the
current locations of y and z.
Example:
t0 = a + b
t1 = t0 + c
d = t0 + t1
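Following the MOV/OP template above, a code generator might emit something like the following (the register names and instruction set are assumptions; a real generator would consult GETREG() and the register and address descriptors):

MOV a, R0
ADD b, R0      ; R0 now holds t0
MOV R0, R1
ADD c, R1      ; R1 now holds t1 = t0 + c
ADD R1, R0     ; R0 now holds t0 + t1
MOV R0, d      ; store the result in d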
1. Heuristic ordering for DAGs: an evaluation order such as
T1 := A + B;
T2 := C + D;
T3 := E - T2;
T4 := T1 - T3;
is produced by a heuristic ordering for DAGs.
2. Optimal ordering of trees: if the DAG representation of the quadruples
is a tree, an optimal order can be found, leaving the result in a designated
register. Using the labeling algorithm, we can label each node and leaf
with the number of registers it needs. Multi-register operations and
algebraic properties can be accommodated, and common subexpressions
are handled by partitioning the tree into subtrees.
Example:
int add_ten(int x)
{
    return x + 10;
    printf("value of x is %d", x);   /* dead code: never executed */
}
In this code segment, the printf statement will never be executed, as the
program control returns back before it can execute, hence printf can be
removed.
Strength Reduction
There are operations that consume more time and space. Their „strength‟
can be reduced by replacing them with other operations that consume less
time and space, but produce the same result.
For example, x * 2 can be replaced by x << 1, which involves only one
left shift. Though the output of a * a and a^2 is the same, a * a is much
more efficient to implement than a general power routine.
Summary
Code generation can be considered as the final phase of compilation.
The output of the code generator is machine code.
The address descriptor is used to track the location in the memory.
DAG representation made the code generation simple.
If the value of a name is found at more than one place, the register's value
will be preferred over the cache and main memory.
Redundancy in loads and stores is reduced by peephole optimization.
Question
1. Explain register descriptor and address descriptor.
2. Write note on GETREG ().
Exercise
1. If w is a string of terminals and A, B are two non-terminals, then
which of the following are right-linear grammars?
(a) A → Bw
(b) A → Bw | w
(c) A → wB | w
(d) None of the above
14. The language generated by the above grammar is the set of all
strings made up of a, b, c, such that
(a) the number of a's, b's, and c's will be equal
(b) a's always precede b's
(c) b's always precede c's
(d) the number of a's, b's, and c's is the same, and the a's precede the b's,
which precede the c's.
60. The graph that shows the basic blocks and their successor
relationship is called:
(a) Control graph
(b) Flow graph
(c) DAG
(d) Hamiltonian graph
67. du-chaining
(a) Stands for use definition chaining
(b) Is useful for copy propagation removal
(c) Is useful for induction variable removal
(d) None of the above
introduction-to-compiler-design-part-i-lexical-analysis/