Compiler Design
[R18A0512]
LECTURE NOTES
DEPARTMENT OF
COMPUTER SCIENCE AND ENGINEERING
Objectives:
UNIT – I:
Language Translation: Basics, Necessity, Steps involved in a typical language processing system,
Types of translators, Compilers: Overview and Phases of a Compiler, Pass and Phases of
translation, bootstrapping, data structures in compilation
Lexical Analysis (Scanning): Functions of Lexical Analyzer, Specification of tokens: Regular
expressions and Regular grammars for common PL constructs. Recognition of Tokens: Finite
Automata in recognition and generation of tokens. Scanner generators: LEX-Lexical Analyzer
Generators. Syntax Analysis (Parsing) : Functions of a parser, Classification of parsers. Context
free grammars in syntax specification, benefits and usage in compilers.
UNIT – II:
Top down parsing –Definition, types of top down parsers: Backtracking, Recursive descent,
Predictive, LL (1), Preprocessing the grammars to be used in top down parsing, Error recovery, and
Limitations. Bottom up parsing: Definition, types of bottom up parsing, Handle pruning. Shift
Reduce parsing, LR parsers: LR(0), SLR, CALR and LALR parsing, Error recovery, Handling
ambiguous grammar, Parser generators: YACC-yet another compiler compiler.
UNIT – III:
Semantic analysis: Attributed grammars, Syntax directed definition and Translation schemes, Type
checker: functions, type expressions, type systems, types checking of various constructs.
Intermediate Code Generation: Functions, different intermediate code forms- syntax tree, DAG,
Polish notation, and Three address codes. Translation of different source language constructs into
intermediate code.
Symbol Tables: Definition, contents, and formats to represent names in a Symbol table. Different
approaches used in the symbol table implementation for block structured and non block structured
languages, such as Linear Lists, Self Organized Lists, and Binary trees, Hashing based STs.
UNIT – IV:
Runtime Environment: Introduction, Activation Trees, Activation Records, Control stacks.
Runtime storage organization: Static, Stack and Heap storage allocation. Storage allocation for
arrays, strings, and records etc.
Code optimization: goals and Considerations for Optimization, Scope of Optimization: Local
optimizations, DAGs, Loop optimization, Global Optimizations. Common optimization techniques:
Folding, Copy propagation, Common Sub expression eliminations, Code motion, Frequency
reduction, Strength reduction etc.
UNIT – V:
Control flow and Data flow analysis: Flow graphs, Data flow equations, global optimization:
Redundant sub expression elimination, Induction variable eliminations, Live Variable analysis.
Object code generation: Object code forms, machine dependent code optimization, register
allocation and assignment generic code generation algorithms, DAG for register allocation.
TEXT BOOKS:
1. Compilers: Principles, Techniques, and Tools – Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey
D. Ullman; 2nd Edition, Pearson Education.
2. Modern Compiler Implementation in C – Andrew W. Appel, Cambridge University Press.
REFERENCES:
1. lex & yacc – John R. Levine, Tony Mason, Doug Brown; O'Reilly.
2. Compiler Construction – Louden, Thomson.
3. Engineering a Compiler – Keith Cooper & Linda Torczon, Elsevier.
4. Modern Compiler Design – Dick Grune, Henri E. Bal, Ceriel J. H. Jacobs, Wiley Dreamtech.
Outcomes:
By the end of the semester, the student will be able to:
UNIT-I
INTRODUCTION TO LANGUAGE PROCESSING:
As computers became an inevitable and integral part of human life, and as several languages
with different and more advanced features evolved to make communication with the machine easier,
the development of translator (mediator) software became essential to bridge the huge gap between
human and machine understanding. This process is called Language Processing, reflecting the goal
and intent of the process. To understand it better, we should be familiar with some key terms and
concepts explained in the following lines.
LANGUAGE TRANSLATORS:
A language translator is a computer program which translates a program written in one (source)
language into its equivalent program in another (target) language. The source program is typically in a
high level language, whereas the target language can be anything from the machine language of a
target machine (from microprocessor to supercomputer) to another high level language.
Interpreter: An interpreter takes the source program along with its input and directly executes
(interprets) it to produce the output, without producing a separate target program.
Preprocessor: A preprocessor
Collects all the modules and files, in case the source program is divided into different modules
stored in different files.
Expands shorthands / macros into source language statements.
Compiler: A compiler is a translator that takes as input a source program written in a high level
language and converts it into its equivalent target program in machine language. In addition to the
above, the compiler also
Reports to its user the presence of errors in the source program.
Facilitates the user in rectifying those errors and executing the code.
Assembler: An assembler is a program that takes as input an assembly language program and
converts it into its equivalent machine language code.
Loader / Linker: This is a program that takes as input relocatable code, collects the library
functions and relocatable object files, and produces the equivalent absolute machine code.
Specifically,
Loading consists of taking the relocatable machine code, altering the relocatable addresses,
and placing the altered instructions and data in memory at the proper locations.
Linking allows us to make a single program from several files of relocatable machine
code. These files may have been the result of several different compilations, and one or more
may be library routines provided by the system and available to any program that needs them.
In addition to these translators, programs like interpreters, text formatters, etc., may be used in a
language processing system. To translate a program in a high level language into an executable one,
the compiler by default performs the compile and link functions.
Normally, the steps in a language processing system include preprocessing the skeletal source
program, which produces an extended or expanded source program (a ready-to-compile unit),
followed by compiling the result, then linking / loading, and finally producing the equivalent
executable code. As said earlier, not all these steps are mandatory; in some cases the compiler
performs the linking and loading functions implicitly.
The steps involved in a typical language processing system can be understood with following
diagram.
Source Program [Example: filename.c] → Preprocessor → Compiler → Assembler → Linker / Loader → Executable code
TYPES OF COMPILERS:
Based on the specific input it takes and the output it produces, the Compilers can be classified
into the following types;
Traditional Compilers(C, C++, Pascal): These Compilers convert a source program in a HLL
into its equivalent in native machine code or object code.
Interpreters (LISP, SNOBOL, Java 1.0): These translators first convert the source code into an
intermediate code, and then interpret (emulate) it instead of producing native machine code directly.
Cross-Compilers: These are compilers that run on one machine and produce code for
another machine.
Incremental Compilers: These compilers separate the source into user-defined steps,
compiling / recompiling step by step and interpreting the steps in a given order.
Converters (e.g., COBOL to C++): These programs compile from one high level
language to another.
Just-In-Time (JIT) Compilers (Java, Microsoft .NET): These are runtime compilers from an
intermediate language (byte code, MSIL) to executable or native machine code. They
perform type-based verification, which makes the executable code more trustworthy.
Ahead-of-Time (AOT) Compilers (e.g., .NET ngen): These are pre-compilers to native
code for Java and .NET.
Binary Compilation: These compilers compile the object code of one platform into object code
of another platform.
PHASES OF A COMPILER:
Compiler phases are the individual modules which are executed in order to perform their
respective sub-activities and finally integrate their solutions to give the target code.
It is desirable to have relatively few phases, since it takes time to read and write intermediate files.
The following diagram (Figure 1.4) depicts the phases of a compiler through which it goes during
compilation. A typical compiler therefore has the following phases: Lexical Analyzer (Scanner),
Syntax Analyzer (Parser), Semantic Analyzer, Intermediate Code Generator, Code Optimizer and
Code Generator.
In addition to these, it also has Symbol table management and Error handling. Not all
the phases are mandatory in every compiler; e.g., the Code Optimizer phase is optional in some
compilers.
The phases of a compiler are divided into two parts: the first three phases are called the
Analysis part, and the remaining three the Synthesis part.
In some applications we can have a compiler that is organized into what are called passes,
where a pass is a collection of phases that converts the input from one representation to a
completely different representation. Each pass makes a complete scan of the input and produces
output to be processed by the subsequent pass; for example, a two pass assembler.
All of these phases of a general compiler are conceptually divided into the Front-end
and the Back-end. This division is based on their dependence on either the source language or
the target machine. This model is called the Analysis & Synthesis model of a compiler.
The Front-end of the compiler consists of phases that depend primarily on the source
language and are largely independent of the target machine. For example, the front-end of the
compiler includes the Scanner, the Parser, creation of the Symbol table, the Semantic Analyzer and
the Intermediate Code Generator.
The Back-end of the compiler consists of phases that depend on the target machine; those
portions do not depend on the source language, just on the intermediate language. The back-end
covers the different aspects of the Code Optimization phase and code generation, along with the
necessary error handling and symbol table operations.
LEXICAL ANALYZER (SCANNER): The Scanner is the first phase that works as interface
between the compiler and the Source language program and performs the following functions:
Reads the characters in the source program and groups them into a stream of tokens, in
which each token specifies a logically cohesive sequence of characters, such as an
identifier, a keyword, a punctuation mark, or a multi-character operator like :=.
The Scanner generates a token-id, and also enters the identifier's name in the Symbol
table if it does not already exist.
SYNTAX ANALYZER (PARSER): The Parser interacts with the Scanner and its subsequent
phase, the Semantic Analyzer, and performs the following functions:
Groups the received and recorded token stream into syntactic structures, usually
into a structure called a Parse Tree whose leaves are tokens.
The interior nodes of this tree represent streams of tokens that logically belong
together.
SEMANTIC ANALYZER: This phase receives the syntax tree as input and checks the
semantic correctness of the program. Though the tokens are valid and syntactically correct, it
may happen that they are not correct semantically. Therefore the semantic analyzer checks the
semantics (meaning) of the statements formed.
INTERMEDIATE CODE GENERATOR: The syntactically and semantically correct structures are
produced here in the form of a syntax tree, a DAG, or some other sequential representation like a
matrix. The intermediate code should be easy to produce and easy to translate into the target
program. Example intermediate code forms are syntax trees, postfix (Polish) notation and
three-address code.
CODE OPTIMIZER: This phase is optional in some compilers, but it is very useful and beneficial in
terms of saving development time, effort and cost. This phase performs the following specific
functions:
Sometimes the data structures used in representing the intermediate forms may also be
changed.
CODE GENERATOR: This is the final phase of the compiler and generates the target code,
normally consisting of the relocatable machine code or Assembly code or absolute machine
code.
Memory locations are selected for each variable used, and assignment of variables to
registers is done.
The Compiler also performs Symbol table management and Error handling throughout the
compilation process. The Symbol table is nothing but a data structure that stores information about
the different source language constructs and the tokens generated during the compilation. These two
components interact with all phases of the Compiler.
For example, if the source program is an assignment statement, the following figure shows how the
phases of the compiler will process the program.
LEXICAL ANALYSIS:
As the first phase of a compiler, the main task of the lexical analyzer is to read the
input characters of the source program, group them into lexemes, and produce as output tokens
for each lexeme in the source program. This stream of tokens is sent to the parser for syntax
analysis. It is common for the lexical analyzer to interact with the symbol table as well.
When the lexical analyzer identifies the first token it sends it to the parser; the parser
receives the token and calls the lexical analyzer for the next token by issuing the
getNextToken() command. This process continues until the lexical analyzer identifies all the
tokens. During this process the lexical analyzer neglects or discards the white spaces and
comment lines.
A token is a pair consisting of a token name and an optional attribute value. The token
name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a
sequence of input characters denoting an identifier. The token names are the input symbols that
the parser processes. In what follows, we shall generally write the name of a token in boldface.
We will often refer to a token by its token name.
A pattern is a description of the form that the lexemes of a token may take [ or match]. In the
case of a keyword as a token, the pattern is just the sequence of characters that form the
keyword. For identifiers and some other tokens, the pattern is a more complex structure that is
matched by many strings.
A lexeme is a sequence of characters in the source program that matches the pattern for a
token and is identified by the lexical analyzer as an instance of that token.
For example, in the C statement printf("Total = %d\n", score); both printf and score are lexemes
matching the pattern for token id, and "Total = %d\n" is a lexeme matching the pattern for token
literal (or string).
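As an illustration, the statement above yields the following lexeme / token pairs (the token names
are only indicative; the exact names depend on the lexical specification used):
printf             id
(                  open parenthesis
"Total = %d\n"     literal
,                  comma
score              id
)                  close parenthesis
;                  semicolon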
There are a number of reasons why the analysis portion of a compiler is normally separated into
lexical analysis and parsing (syntax analysis) phases.
INPUT BUFFERING:
Before discussing the problem of recognizing lexemes in the input, let us examine
some ways in which the simple but important task of reading the source program can be sped up.
This task is made difficult by the fact that we often have to look one or more characters beyond
the next lexeme before we can be sure we have the right lexeme. There are many situations
where we need to look at least one additional character ahead. For instance, we cannot be sure
we've seen the end of an identifier until we see a character that is not a letter or digit, and
therefore is not part of the lexeme for id. In C, single-character operators like -, =, or <
could also be the beginning of a two-character operator like ->, ==, or <=. Thus, we shall
introduce a two-buffer scheme that handles large look aheads safely. We then consider an
improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs
Because of the amount of time taken to process characters and the large number of characters
that must be processed during the compilation of a large source program, specialized buffering
techniques have been developed to reduce the amount of overhead required to process a single
input character. An important scheme involves two buffers that are alternately reloaded.
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes. Using one system read command we can read N characters in to a buffer, rather than
using one system call per character. If fewer than N characters remain in the input file, then a
special character, represented by eof, marks the end of the source file and is different from any
possible character of the source program.
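A minimal sketch in C of how the two buffers might be declared and refilled with one read per
block; the names N, SENTINEL, buf0, buf1 and fill_buffer are invented for this illustration (the
notes use a special eof character as the sentinel, represented here by '\0'):

#include <stdio.h>

#define N 4096                  /* buffer size, typically one disk block        */
#define SENTINEL '\0'           /* a character that cannot occur in the source  */

static char buf0[N + 1];        /* the extra slot at the end holds the sentinel */
static char buf1[N + 1];

/* Refill one buffer with a single fread() call and place the sentinel
   right after the last character actually read. Returns the count read. */
static size_t fill_buffer(FILE *src, char *buf)
{
    size_t n = fread(buf, 1, N, src);
    buf[n] = SENTINEL;
    return n;
}

int main(void)
{
    size_t n = fill_buffer(stdin, buf0);   /* first load; buf1 is refilled later */
    printf("read %zu characters\n", n);
    return 0;
}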
1. The Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent
we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found; the exact strategy
whereby this determination is made will be covered in the balance of this chapter.
Once the next lexeme is determined, forward is set to the character at its right end. Then,
after the lexeme is recorded as an attribute value of a token returned to the parser, 1exemeBegin
is set to the character immediately after the lexeme just found. In Fig, we see forward has passed
the end of the next lexeme, ** (the FORTRAN exponentiation operator), and must be retracted
one position to its left.
Advancing forward requires that we first test whether we have reached the end of one
of the buffers, and if so, we must reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the
actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater
than N, we shall never overwrite the lexeme in its buffer before determining it.
If we use the above scheme as described, we must check, each time we advance forward,
that we have not moved off one of the buffers; if we do, then we must also reload the other
buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one
to determine what character is read (the latter may be a multi way branch). We can combine the
buffer-end test with the test for the current character if we extend each buffer to hold a sentinel
character at the end. The sentinel is a special character that cannot be part of the source program,
and a natural choice is the character eof. Figure 1.8 shows the same arrangement as Figure 1.7,
but with the sentinels added. Note that eof retains its use as a marker for the end of the entire
input.
Any eof that appears other than at the end of a buffer means that the input is at an end. Figure 1.9
summarizes the algorithm for advancing forward. Notice how the first test, which can be part of
a multiway branch based on the character pointed to by forward, is the only test we make, except
in the case where we actually are at the end of a buffer or the end of the input.
switch ( *forward++ ) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer; }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer; }
    else terminate lexical analysis;  /* eof within a buffer marks the end of input */
    break;
/* cases for the other characters */
}
SPECIFICATION OF TOKENS:
Regular expressions are an important notation for specifying lexeme patterns. While they cannot express
all possible patterns, they are very effective in specifying those types of patterns that we actually need for
tokens.
Lex is a tool used to generate a lexical analyzer. The input notation for the Lex tool is
referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the
Lex compiler transforms the input patterns into a transition diagram and generates code in a
file called lex.yy.c; this is a C program which, when compiled with a C compiler, gives the object
code of the lexical analyzer. Here we need to know how to write the Lex language. The structure of
a Lex program is given below.
Declarations
%%
Translation rules
%%
Auxiliary functions
In the Translation rules section we place Pattern-Action pairs, where each pair has the form
Pattern {Action}
The auxiliary function definitions section includes the definitions of functions used to install
identifiers and numbers in the Symbol table.
%{
/* definitions of manifest constants IF, THEN, ELSE, ID, NUMBER, ... */
%}
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}     { /* no action and no return */ }
if       { return(IF); }
then     { return(THEN); }
else     { return(ELSE); }
{id}     { yylval = (int) installID(); return(ID); }
{number} { yylval = (int) installNum(); return(NUMBER); }
%%
int installID() { /* function to install the lexeme, whose first character is pointed to by yytext
                     and whose length is yyleng, into the symbol table and return a pointer thereto */ }
int installNum() { /* similar to installID, but puts numerical constants into a separate table */ }
Figure 1.10: Lex program for some common tokens
In our compiler model, the parser obtains a string of tokens from the lexical analyzer,
as shown in the below Figure, and verifies that the string of token names can be generated
by the grammar for the source language. We expect the parser to report any syntax errors in
an intelligible fashion and to recover from commonly occurring errors to continue processing the
remainder of the program. Conceptually, for well-formed programs, the parser constructs a parse
tree and passes it to the rest of the compiler for further processing.
Figure 2.1: Parser in the Compiler
During the process of parsing, the parser may encounter some errors and present the error
information back to the user.
Syntactic errors include misplaced semicolons or extra or missing braces, that is,
"{" or "}". As another example, in C or Java, the appearance of a case statement without
an enclosing switch is a syntactic error (however, this situation is usually allowed by the
parser and caught later in the processing, as the compiler attempts to generate code).
Based on the way/order the Parse Tree is constructed, Parsing is basically classified in to
following two types:
1. Top Down Parsing : Parse tree construction start at the root node and moves to the
children nodes (i.e., top down order).
2. Bottom up Parsing: Parse tree construction begins from the leaf nodes and proceeds
towards the root node (called the bottom up order).
IMPORTANT AND EXPECTED QUESTIONS:
1. What is a Compiler? Explain the working of a Compiler with your own example?
2. What is the Lexical analyzer? Discuss the Functions of Lexical Analyzer.
3. Write short notes on tokens, pattern and lexemes?
4. Write short notes on Input buffering scheme? How do you change the basic input
buffering algorithm to achieve better performance?
5. What do you mean by a Lexical analyzer generator? Explain LEX tool.
ASSIGNMENT QUESTIONS:
1. Write the differences between compilers and interpreters?
UNIT-II
TOP DOWN PARSING:
Top-down parsing can be viewed as the problem of constructing a parse tree for the given
input string, starting from the root and creating the nodes of the parse tree in preorder
(depth-first left to right).
It is classified into two variants: one which uses backtracking and the other which is
non-backtracking in nature.
Non Backtracking Parsing: There are two variants of this parser, as given below.
1. Table Driven Predictive Parsing:
i. LL (1) Parsing
2. Recursive Descent Parsing
Backtracking:
1. Brute Force method
An LL (1) parser uses:
An input buffer that contains the string to be parsed, followed by a $ symbol used to
indicate the end of the input.
A stack containing a sequence of grammar symbols with a $ at the bottom of the stack,
which initially contains the start symbol of the grammar on top of $.
A parsing table containing the production rules to be applied. This is a two dimensional
array M [Non terminal, Terminal].
A context free grammar G = (V, T, S, P) consists of four components:
V is a finite set of Non terminals; Non terminals are syntactic variables that denote sets of
strings. The sets of strings denoted by non terminals help define the language generated
by the grammar. Non terminals impose a hierarchical structure on the language that
is key to syntax analysis and translation.
T is a Finite set of Terminal; Terminals are the basic symbols from which strings are
formed. The term "token name" is a synonym for '"terminal" and frequently we will use
the word "token" for terminal when it is clear that we are talking about just the token
name. We assume that the terminals are the first components of the tokens output by the
lexical analyzer.
S is the Starting Symbol of the grammar, one non terminal is distinguished as the start
symbol, and the set of strings it denotes is the language generated by the grammar. P
is finite set of Productions; the productions of a grammar specify the manner in which the
terminals and non terminals can be combined to form strings. Each production is of the form
α → β, where α is a single non terminal and β is a string in (V ∪ T)*. Each production consists of:
(a) A non terminal called the head or left side of the production; this
production defines some of the strings denoted by the head.
(b) The symbol →. Sometimes ::= has been used in place of the arrow.
(c) A body or right side consisting of zero or more terminals and non-
terminals. The components of the body describe one way in which strings of the non
terminal at the head can be constructed.
Conventionally, the productions for the start symbol are listed first.
DERIVATIONS:
The construction of a parse tree can be made precise by taking a derivational view, in
which productions are treated as rewriting rules. Beginning with the start symbol, each rewriting
step replaces a non terminal by the body of one of its productions. This derivational view
corresponds to the top-down construction of a parse tree as well as to the bottom-up construction of
the parse tree.
Derivations are classified into Leftmost Derivations and Rightmost Derivations.
Leftmost Derivation: For the grammar E → E + E | E * E | id and the input string id + id * id,
E => E + E
  => id + E
  => id + E * E
  => id + id * E
  => id + id * id
NOTE: Every time we start from the root (start symbol); the underline at a non terminal indicates
that it is the leftmost non terminal being chosen for rewriting while deriving the string.
Rightmost Derivation:
E => E + E
  => E + E * E
  => E + E * id
  => E + id * id
  => id + id * id
NOTE: Every time we start from the root (start symbol); the underline at a non terminal indicates
that it is the rightmost non terminal being chosen for rewriting while deriving the string.
What is a Parse Tree?
A parse tree is a graphical representation of a derivation that filters out the order in which
productions are applied to replace non terminals.
Each interior node of a parse tree represents the application of a production.
All the interior nodes are Non terminals and all the leaf nodes terminals.
All the leaf nodes reading from the left to right will be the output of the parse tree.
If a node n is labeled A and has children n1, n2, n3, …, nk with labels X1, X2, …, Xk
respectively, then there must be a production A → X1X2…Xk in the grammar.
Example1:- Parse tree for the input string - (id + id) using the above Context free Grammar is
Figure 2.4 : Parse Tree for the input string - (id + id)
The Following figure shows step by step construction of parse tree using CFG for the parse tree
for the input string - (id + id).
Figure 2.5 : Sequence outputs of the Parse Tree construction process for the input string –(id+id)
Example2:- Parse tree for the input string id+id*id using the above Context free Grammar is
Figure 2.6: Parse tree for the input string id+ id*id
AMBIGUITY in CFGs:
Definition: A grammar that produces more than one parse tree for some sentence (input string)
is said to be ambiguous.
In other words, an ambiguous grammar is one that produces more than one leftmost
derivation or more than one rightmost derivation for the same sentence.
In particular, if the right hand side of a production contains two occurrences of the non terminal
that appears on its left hand side (as in E → E + E), the grammar is liable to be ambiguous.
Example: If the grammar is E → E + E | E * E | - E | (E) | id and the input string is id + id * id,
the two parse trees for the given input string are:
(a)
(b)
Two leftmost derivations for the given input string are:
(a) E => E + E                 (b) E => E * E
      => id + E                      => E + E * E
      => id + E * E                  => id + E * E
      => id + id * E                 => id + id * E
      => id + id * id                => id + id * id
The above grammar gives two parse trees (two derivations) for the given input string, so it
is an ambiguous grammar.
Note: An LL (1) parser will not accept ambiguous grammars, i.e., we cannot construct an
LL (1) parser for an ambiguous grammar, because such a grammar may cause the top down
parser to go into an infinite loop or make it consume more time for parsing. If necessary,
we must first remove all ambiguity from the grammar and then construct the parser.
ELIMINATING AMBIGUITY: Since ambiguous grammars may cause the top down parser to go
into an infinite loop or to consume more time during parsing, an ambiguous grammar can sometimes
be rewritten to eliminate the ambiguity. The general form of the productions that cause such
ambiguity is
A → Aα | β
This can be written as (introducing one new non terminal A′):
A → βA′
A′ → αA′ | ε
Example: Let the grammar be E → E + E | E * E | - E | (E) | id. It has been shown above that this
grammar is ambiguous. Its productions can be written as
E → E + E
E → E * E
E → - E
E → (E)
E → id
In the above grammar the first two productions (E → E + E and E → E * E) are the ones causing
the ambiguity, so they can be written as
E → E + E | E * E, and again as
E → E + E | β, where β is E * E.
This production now matches the general form above, so it can be rewritten as
E → E + T | T
T → β
LEFT RECURSION:
Another feature of the CFGs which is not desirable to be used in top down parsers is left
recursion. A grammar is left recursive if it has a non terminal A such that there is a derivation
A => Aα for some string α in (V ∪ T)*. LL (1) or top down parsers cannot handle left
recursive grammars, so we need to remove the left recursion from a grammar before it is
used in top down parsing.
A → Aα | β
The above left recursive production can be written as the non left recursive equivalent:
A → βA′
A′ → αA′ | ε
Example : - Is the following grammar left recursive? If so, find a non left recursive grammar
equivalent to it.
E → E + T | T
T → T * F | F
F → - E | (E) | id
Yes, the grammar is left recursive due to the first two productions, which match the general form
of left recursion, so they can be rewritten after removing the left recursion from
E → E + T and T → T * F as
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
LEFT FACTORING:
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive or top-down parsing. A grammar in which more than one production has common
prefix is to be rewritten by factoring out the prefixes.
For example, in the following grammar there are n A-productions with the common prefix α:
A → αA1 | αA2 | αA3 | … | αAn
The prefix should be factored out without changing the language defined for A. We can factor out
the α from all n productions by adding a new production A → αA′ and rewriting the A′ productions
as
A → αA′
A′ → A1 | A2 | A3 | A4 … | An
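For instance, applying this transformation to the classical dangling-else style grammar (a standard
textbook example, not taken from these notes)
S → iEtS | iEtSeS | a
E → b
the common prefix iEtS of the first two S-productions is factored out, giving
S → iEtSS′ | a
S′ → eS | ε
E → b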
FIRST and FOLLOW:
The construction of both top-down and bottom-up parsers is aided by two functions,
FIRST and FOLLOW, associated with a grammar G. During top down parsing, FIRST and
FOLLOW allow us to choose which production to apply, based on the next input (look a head)
symbol.
Computation of FIRST:
FIRST function computes the set of terminal symbols with which the right hand side of
the productions begin. To compute FIRST (A) for all grammar symbols, apply the following
rules until no more terminals or € can be added to any FIRST set.
1. If A is a terminal, then FIRST (A) = {A}.
2. If A is a non terminal and A → X1X2…Xi is a production, then FIRST (A) = FIRST (X1) if X1
does not derive ε; if X1 is a non terminal and X1 → ε, add FIRST (X2) to FIRST (A); if X2 → ε,
add FIRST (X3) to FIRST (A); and so on. If all the Xi (i = 1..i) derive ε, add ε to FIRST (A).
3. If A → ε is a production, then add ε to FIRST (A).
Computation Of FOLLOW:
FOLLOW (A) is the set of terminal symbols of the grammar that can immediately follow the
non terminal A in some sentential form. If a is to the immediate right of the non terminal A, then
a is in FOLLOW (A). To compute FOLLOW (A) for all non terminals A, apply the following rules
until no more symbols can be added to any FOLLOW set.
1. Place $ in FOLLOW(S), where S is the start symbol, and $ is the input right end
marker.
2. If there is a production A → αBβ, then everything in FIRST (β) except ε is in
FOLLOW (B).
3. If there is a production A → αB, or a production A → αBβ where FIRST (β) contains ε,
then everything in FOLLOW (A) is in FOLLOW (B).
Example: Compute the FIRST and FOLLOW values for the expression grammar
1. E → TE′
2. E′ → +TE′ | ε
3. T → FT′
4. T′ → *FT′ | ε
5. F → (E) | id
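Applying the rules above to this grammar gives the following sets (a standard result for this
grammar; ε denotes the empty string):
FIRST (E) = FIRST (T) = FIRST (F) = { ( , id }
FIRST (E′) = { + , ε }
FIRST (T′) = { * , ε }
FOLLOW (E) = FOLLOW (E′) = { ) , $ }
FOLLOW (T) = FOLLOW (T′) = { + , ) , $ }
FOLLOW (F) = { + , * , ) , $ }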
                                     INPUT SYMBOLS
NON-TERMINAL    +            *             (            )           id           $
E                                          E → TE′                  E → TE′
E′              E′ → +TE′                               E′ → ε                   E′ → ε
T                                          T → FT′                  T → FT′
T′              T′ → ε       T′ → *FT′                  T′ → ε                   T′ → ε
F                                          F → (E)                  F → id
Table 2.2: LL (1) Parsing Table for the Expressions Grammar
Note: if there are no multiple entries in the table for any single terminal, then the grammar is
accepted by the LL (1) parser.
LL (1) Parsing Algorithm:
The parser acts on the basis of two symbols:
i. A, the symbol on the top of the stack
ii. a, the current input symbol
There are three conditions for A and a that are used by the parsing program:
1. If A = a = $, then parsing is successful.
2. If A = a ≠ $, then the parser pops A off the stack and advances the current input pointer to the
next symbol.
3. If A is a non terminal, the parser consults the entry M [A, a] in the parsing table. If
M [A, a] is a production A → X1X2..Xn, then the program replaces the A on the top of
the stack by X1X2..Xn in such a way that X1 comes on the top.
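A minimal sketch in C of this table-driven driver loop for the expression grammar; the encoding
(nonterminals as the characters E, e for E′, T, t for T′, F; the terminal 'i' standing for id) and all
names are invented for this illustration, not taken from the notes:

#include <stdio.h>
#include <string.h>

/* Return the right-hand side chosen by M[A, a], "" for an epsilon
   production, or NULL for an error entry. */
static const char *lookup(char A, char a)
{
    switch (A) {
    case 'E': if (a == '(' || a == 'i') return "Te";                 break;
    case 'e': if (a == '+') return "+Te";
              if (a == ')' || a == '$') return "";                   break;
    case 'T': if (a == '(' || a == 'i') return "Ft";                 break;
    case 't': if (a == '*') return "*Ft";
              if (a == '+' || a == ')' || a == '$') return "";       break;
    case 'F': if (a == '(') return "(E)";
              if (a == 'i') return "i";                              break;
    }
    return NULL;
}

int main(void)
{
    const char *input = "i+i*i$";            /* id + id * id, followed by $      */
    char stack[100] = "$E";                  /* $ at the bottom, start symbol on top */
    int top = 1, ip = 0;

    for (;;) {
        char A = stack[top], a = input[ip];
        if (A == '$' && a == '$') { printf("accepted\n"); return 0; }
        if (A == a) { top--; ip++; continue; }            /* match a terminal    */
        const char *rhs = lookup(A, a);
        if (rhs == NULL) { printf("error\n"); return 1; } /* error entry         */
        top--;                                            /* pop A               */
        for (int k = (int)strlen(rhs) - 1; k >= 0; k--)   /* push RHS reversed   */
            stack[++top] = rhs[k];
    }
}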
RECURSIVE DESCENT PARSING: In recursive descent parsing a separate procedure is written for
each non terminal of the grammar; each procedure matches the body of one of its productions,
calling the procedures of the non terminals appearing in that body and advance( ) for the terminals.
Part of such a parser for the expression grammar is given below.
T′( );
return true;
}
else return error;
}
procedure F( )
{
if input = '('
{
advance( );
E ( );
if input = ')'
advance( );
return true;
}
else if input = 'id'
{
advance( );
return true;
}
else return error;
}
advance()
{
input = next token;
}
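For completeness, here is a compact runnable sketch in C of the whole recursive descent parser for
the grammar E → TE′, E′ → +TE′ | ε, T → FT′, T′ → *FT′ | ε, F → (E) | id. The token id is
abbreviated as the character 'i', and all function names are illustrative rather than taken from the
notes:

#include <stdio.h>
#include <stdlib.h>

static const char *ip;                 /* current input position */

static void error(void) { printf("syntax error\n"); exit(1); }
static void match(char c) { if (*ip == c) ip++; else error(); }

static void E(void);
static void Eprime(void);
static void T(void);
static void Tprime(void);
static void F(void);

static void E(void)      { T(); Eprime(); }
static void Eprime(void) { if (*ip == '+') { match('+'); T(); Eprime(); } /* else epsilon */ }
static void T(void)      { F(); Tprime(); }
static void Tprime(void) { if (*ip == '*') { match('*'); F(); Tprime(); } /* else epsilon */ }
static void F(void)
{
    if (*ip == '(') { match('('); E(); match(')'); }
    else if (*ip == 'i') match('i');   /* 'i' abbreviates the token id */
    else error();
}

int main(void)
{
    ip = "i+i*i";                      /* stands for id + id * id */
    E();
    if (*ip == '\0') printf("string accepted\n");
    else error();
    return 0;
}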
BACK TRACKING: This parsing method uses the technique called Brute Force method
during the parse tree construction process. This allows the process to go back (back track) and
redo the steps by undoing the work done so far in the point of processing.
Brute force method: This is a top down parsing technique that comes into play when there is more
than one alternative in the productions to be tried while parsing the input string. It selects the
alternatives in the order they appear, and when it realizes that something has gone wrong it tries the
next alternative.
For example, consider the grammar bellow.
S → cAd
A → ab | a
To generate the input string "cad", initially the first parse tree given below is generated.
As the string generated is not "cad", the input pointer is backtracked to the position of "A", to
examine the next alternative of "A". Now a match to the input string occurs, as shown in the second
parse tree given below.
( 1) (2)
IMPORTANT AND EXPECTED QUESTIONS
1. Explain the components of working of a Predictive Parser with an example?
2. What do the FIRST and FOLLOW values represent? Give the algorithm for computing
FIRST n FOLLOW of grammar symbols with an example?
3. Construct the LL (1) Parsing table for the following grammar?
E → E + T | T
T → T * F | F
F → (E) | id
4. For the above grammar construct, and explain the Recursive Descent Parser?
5. What happens if multiple entries occur in your LL (1) parsing table? Justify your answer.
How does the parser behave in such a case?
ASSIGNMENT QUESTIONS
4. Will the predictive parser accept an ambiguous grammar? Justify your answer.
BOTTOM-UP PARSING
Bottom-up parsing corresponds to the construction of a parse tree for an input string
beginning at the leaves (the bottom nodes) and working up towards the root (the top node). It
involves "reducing" an input string w to the start symbol of the grammar. In each reduction
step, a particular substring matching the right side of a production is replaced by the symbol on the
left of that production; this traces out a rightmost derivation in reverse. For example, consider the
following grammar:
E → E + T | T
T → T * F | F
F → (E) | id
Bottom up parsing of the input string "id * id" is as follows:
Figure 3.1 : A Bottom-up Parse tree for the input String “id*id”
Bottom up parsing is classified in to 1. Shift-Reduce Parsing, 2. Operator Precedence parsing ,
and 3. [Table Driven] L R Parsing
i. SLR( 1 )
ii. CALR ( 1 )
iii.LALR( 1 )
SHIFT-REDUCE PARSING:
Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar
symbols and an input buffer holds the rest of the string to be parsed. We use $ to mark the
bottom of the stack and also the right end of the input. The parser makes use of shift
and reduce actions to accept the input string. Here, the parse tree is constructed bottom up, from
the leaf nodes towards the root node.
While parsing the given input string, if a match for the right side of some production occurs on
top of the stack, the parser takes the reduce action; otherwise it goes for the shift action. Shift-reduce
parsing can also handle some ambiguous grammars.
For example, consider the grammar below to accept the input string "id * id" using a shift-reduce parser:
E → E + T | T
T → T * F | F
F → (E) | id
Actions of the Shift-reduce parser using Stack implementation
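A typical sequence of stack, input and action steps for this example (a standard trace for this
grammar) is:
STACK        INPUT         ACTION
$            id * id $     shift id
$ id         * id $        reduce with F → id
$ F          * id $        reduce with T → F
$ T          * id $        shift *
$ T *        id $          shift id
$ T * id     $             reduce with F → id
$ T * F      $             reduce with T → T * F
$ T          $             reduce with E → T
$ E          $             accept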
OPERATOR PRECEDENCE PARSING:
An operator precedence parsing program takes an input string and determines whether it
conforms to the grammar specification. It uses an operator precedence parsing table and a
stack to arrive at the decision.
(Figure: model of an operator precedence parser - an input buffer a1 a2 a3 … $, a stack with $ at the
bottom, the operator precedence parsing algorithm together with its precedence table, and the output.)
E → E + E
E → E - E
E → E * E
E → E / E
E → E ^ E
E → - E
E → (E)
E → id
Construct the operator precedence table and accept the input string "id + id * id".
The intention of the precedence relations is to delimit the handle of the given input String with <•
marking the left end of the Handle and •> marking the right end of the handle.
Parsing Action:
To locate the handle the following steps are followed:
1. Add the $ symbol at both ends of the given input string.
2. Scan the input string from left to right until the first •> is encountered.
3. Scan towards the left over all the equal precedences until the first <• is encountered.
4. Everything between <• and •> is a handle.
5. $ on $ means parsing is successful.
Example: Explain the parsing actions of the operator precedence parser for the input string "id * id"
and the grammar:
E → E + E
E → E * E
E → id
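For this small grammar, the precedence relations among the terminals (a standard construction,
with * given higher precedence than + and both treated as left associative) can be tabulated as:
        +      *      id     $
+       •>     <•     <•     •>
*       •>     •>     <•     •>
id      •>     •>            •>
$       <•     <•     <•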
1. $ <• id •> * <• id •> $
The first handle is id, and the match for id in the grammar is E → id. So id is replaced
with the non terminal E; the given input string can be written as
2. $ <• E •> * <• id •> $
The parser does not consider a non terminal as an input symbol, so the non terminal is not
considered in the input string. The string thus becomes
3. $ <• * <• id •> $
The next handle is id, and the match for id in the grammar is E → id. So id is replaced
with the non terminal E; the input string can be written as
4. $ <• * <• E •> $
The parser does not consider the non terminal as an input symbol, so the string becomes
5. $ <• * •> $
The next handle is *, and the match in the grammar is E → E * E. So E * E is replaced
with the non terminal E; the input string can be written as
6. $ E $
The parser does not consider the non terminal as an input symbol, so the string becomes
7. $ $
$ on $ means parsing is successful.
Operator Precedence Parsing Algorithm:
The operator precedence parsing program determines the action of the parser depending on
1. a, the topmost terminal symbol on the stack, and
2. b, the current input symbol.
There are three conditions for a and b that are important for the parsing program:
1. If a = b = $, the parsing is successful.
2. If a <• b or a = b, the parser shifts the input symbol onto the stack and advances the
input pointer to the next input symbol.
3. If a •> b, the parser performs a reduce action: it pops elements one by one from the
stack until the current top-of-stack terminal has lower precedence than the most recently
popped terminal.
Example: the sequence of actions taken by the parser using the stack for the input string "id * id",
and the corresponding parse tree, are as under.
(Parse tree: E deriving E * E, with each E deriving id.)
Advantages and Disadvantages of Operator Precedence Parsing:
The following are the advantages of operator precedence parsing
1. It is simple and easy to implement parsing technique.
2. The operator precedence parser can be constructed by hand after understanding the
grammar. It is simple to debug.
The following are the disadvantages of operator precedence parsing:
1. It is difficult to handle operators like '-' which can be either unary or binary and hence
have different precedences and associativities.
2. It can parse only a small class of grammar.
LR Parsing:
The most prevalent type of bottom up parsing is LR (k) parsing, where L stands for a left to right
scan of the given input string, R for a rightmost derivation in reverse, and k for the number of
input symbols of lookahead.
It is the most general non back tracking shift reduce parsing method
The class of grammars that can be parsed using the LR methods is a proper superset of
the class of grammars that can be parsed with predictive parsers.
(Figure: model of an LR parser - an input buffer, a stack of grammar symbols and states, the LR
parsing program, and the LR parsing table with its ACTION and GOTO parts.)
1. ACTION Part
The ACTION part of the table is a two dimensional array indexed by state and the
input symbol, i.e. ACTION[state][input], An action table entry can have one of
following four kinds of values in it. They are:
1. Shift X, where X is a State number.
2. Reduce X, where X is a Production number.
3. Accept, signifying the completion of a successful parse.
4. Error entry.
2. GO TO Part
The GO TO part of the table is a two dimensional array indexed by state and a
Non terminal, i.e. GOTO[state][NonTerminal]. A GO TO entry has a state
number in the table.
A parsing algorithm uses the current state X and the next input symbol a to consult the
entry action[X][a]. It makes one of the following four moves:
1. If action[X][a] = shift Y, the parser pushes Y (the new state) onto the top of the stack
and advances the input pointer.
2. If action[X][a] = reduce Y (Y is the number of the production used for reduction in state X),
and production Y is A → β, then the parser pops 2*|β| symbols from the stack, pushes A,
and pushes the state given by the GOTO entry for A.
3. If the action[X][a]= accept, then the parsing is successful and the input string is
accepted.
4. If the action[X][a]= error, then the parser has discovered an error and calls the error
routine.
LR parsing is classified into
1. LR ( 0 )
2. Simple LR ( 1 )
3. Canonical LR ( 1 )
4. Look ahead LR ( 1 )
The augmented grammar G′ is G with a new start symbol S′ and an additional production
S′ → S. This helps the parser to identify when to stop parsing and announce acceptance of
the input. The input string is accepted if and only if the parser is about to reduce by S′ → S. For
example, let us consider the grammar below:
E → E + T | T
T → T * F | F
F → (E) | id
The augmented grammar G′ is represented by
E′ → E
E → E + T | T
T → T * F | F
F → (E) | id
NOTE: Augmenting a grammar simply adds one extra production while preserving the actual
meaning of the given grammar G.
Canonical collection of LR (0) items
LR (0) items
An LR (0) item of a grammar G is a production of G with a dot at some position in the right
side of the production. An item indicates how much of a production has been seen at a given
point in the process of parsing. For example, if the production is X → YZ then the LR (0)
items are:
1. X → •YZ, indicating that the parser expects a string derivable from YZ.
2. X → Y•Z, indicating that the parser has scanned a string derivable from Y and is
expecting a string derivable from Z.
3. X → YZ•, indicating that the parser has scanned the string derivable from YZ.
If the grammar has a production X → ε, then the corresponding LR (0) item is
X → •, indicating that the production is ready to be reduced.
Canonical collection of LR(0) Items:
This is the process of grouping the LR (0) items together based on the closure and Go to
operations
Closure operation
If I is an initial state, then Closure (I) is constructed as follows:
1. Initially, add the Augment production to the state and check for the • symbol on the right
hand side: if the • is followed by a non terminal, add the productions starting with that
non terminal to the state I.
2. If a production X → α•Aβ is in I, then add the productions starting with A to the
state I. Rule 2 is applied until no more productions can be added to the state I (that is,
until every • is followed by a terminal symbol or stands at the end of a production).
Example: For the grammar
0. E′ → E
1. E → E + T
2. T → F
3. T → T * F
4. F → (E)
5. F → id
the LR (0) items with the dot in the leftmost position are
E′ → •E,  E → •E + T,  T → •F,  T → •T * F,  F → •(E),  F → •id
GO TO Operation
Goto (I0, X), where I0 is a set of items and X is the grammar symbol on which we are moving the
• symbol, is like finding the next state of the DFA for a given state I0 and input symbol X. For
example, if I0 contains the item E → •E + T, then Goto (I0, E) contains E → E•+T.
Note: Once we complete the Go to operation, we need to compute closure operation for the
output production
For instance, if I0 = { E′ → •E, E → •E + T, T → •T * F, … }, then Goto (I0, E) = { E′ → E•, E → E•+T }.
Construction of the LR (0) parsing table:
1. If there is a transition from one state (Ii) to another state (Ij) on a terminal value a (an item
A → α•aβ in Ii becomes A → αa•β in Ij), then write the shift entry Sj in the ACTION part of
row Ii under the column of a.
2. If there is a transition from one state (Ii) to another state (Ij) on a non terminal value A (an item
B → α•Aβ in Ii becomes B → αA•β in Ij), then write the state number j in the GOTO part of
row Ii under the column of A.
3. If a state (Ii) contains a production A → αβ• which has no transition to a next state, then the
production is said to be a reduced production. Such productions get a reduce entry, along with
their production number, in every column of the ACTION part of row Ii. If the Augment
production is reducing, write accept in the ACTION part.
For Example, Construct the LR (0) parsing Table for the given Grammar (G)
S → aB
B → bB | b
Solution: 1. Add the Augment production and insert the • symbol at the first position of every
production in G:
0. S′ → •S
1. S → •aB
2. B → •bB
3. B → •b
I0 State:
1. Add the Augment production to the I0 state and compute the Closure:
I0 = Closure (S′ → •S)
Since • is followed by the non terminal S, add all productions starting with S to the I0 state. So
the I0 state becomes
I0:  S′ → •S
     S → •aB       (here • is followed by the terminal a, so close the state)
I1 = Goto (I0, S) = Closure (S′ → S•) = { S′ → S• }   (the production is reduced, so close the state)
I2 = Goto (I0, a) = Closure (S → a•B). Since • is followed by the non terminal B, add the
B productions:
I2:  S → a•B
     B → •bB
     B → •b        (• is followed by the terminal b, so close the state)
I3 = Goto (I2, B) = { S → aB• }   (reduced item, close the state)
I4 = Goto (I2, b) = Closure ({ B → b•B, B → b• }):
I4:  B → b•B
     B → b•
     B → •bB
     B → •b
I5 = Goto (I4, B) = { B → bB• }   (reduced item, close the state)
Goto (I4, b) = I4 itself.
Drawing Finite State diagram DFA: Following DFA gives the state transitions of the parser
and is useful in constructing the LR parsing table.
(Figure: DFA for the canonical collection of LR (0) items. Transitions: I0 on S goes to I1, I0 on a
goes to I2, I2 on B goes to I3, I2 on b goes to I4, I4 on B goes to I5, and I4 on b goes back to I4.)
LR (0) Parsing Table:
States      ACTION                        GOTO
            a         b         $         S     B
I0          S2                            1
I1                              ACC
I2                    S4                        3
I3          R1        R1        R1
I4          R3        S4/R3     R3              5
I5          R2        R2        R2
Note: if there are multiple entries in the LR (0) parsing table, then the grammar is not accepted by
the LR (0) parser. In the above table, row I4 has two entries (S4/R3) for the single terminal value b,
and this is called a shift-reduce conflict.
Shift-Reduce Conflict in LR (0) Parsing: A shift-reduce conflict in LR (0) parsing occurs when a
state has both
1. a reduced item of the form A → α•, and
2. an incomplete item of the form A → β•aγ,
so that the same ACTION entry of that state receives both a shift and a reduce value (Sj/r).
SLR (1) PARSING: The various steps involved in SLR (1) parsing are:
1. Write the context free grammar for the given input string.
2. Check for ambiguity, add the Augment production, create the canonical collection of LR (0)
items and draw the DFA.
3. Construct the SLR (1) parsing table:
If there is a transition from one state (Ii) to another state (Ij) on a terminal value, write the
shift entry Sj in the ACTION part, as before.
If there is a transition from one state (Ii) to another state (Ij) on a non terminal value, write the
state number j in the GOTO part.
If there is a state (Ii) containing a production A → αβ• which has no transition to a next state,
the production is said to be a reduced production. For all terminals X in FOLLOW (A), write
the reduce entry along with the production number. If the Augment production is reducing,
write accept.
4. Based on the information from the table, and with the help of the stack and the parsing
algorithm, generate the output.
For example, for the productions
1. S → aAb
2. A → αβ
we have Follow (S) = {$} and Follow (A) = {b}, so the reduce entry r2 for the item A → αβ• is
written only under b in the ACTION part of its state.
For the earlier grammar
S → aB
B → bB | b
the SLR (1) parsing table is:
States      ACTION                        GOTO
            a         b         $         S     B
I0          S2                            1
I1                              ACCEPT
I2                    S4                        3
I3                              R1
I4                    S4        R3              5
I5                              R2
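A minimal sketch in C of the LR driver loop working from this SLR (1) table; the numeric encoding
of actions and all function and variable names here are invented for the illustration, not taken from
the notes:

#include <stdio.h>

struct prod { char lhs; int len; } prods[] = {
    {0,0}, {'S',2}, {'B',2}, {'B',1}          /* index = production number */
};

/* ACTION encoding: +n = shift to state n, -n = reduce by production n,
   100 = accept, 0 = error.  Terminals: a, b, $                          */
static int action(int s, char t)
{
    if (s == 0 && t == 'a') return +2;
    if (s == 1 && t == '$') return 100;
    if (s == 2 && t == 'b') return +4;
    if (s == 3 && t == '$') return -1;
    if (s == 4 && t == 'b') return +4;
    if (s == 4 && t == '$') return -3;
    if (s == 5 && t == '$') return -2;
    return 0;
}

static int go_to(int s, char A)
{
    if (s == 0 && A == 'S') return 1;
    if (s == 2 && A == 'B') return 3;
    if (s == 4 && A == 'B') return 5;
    return 0;
}

int main(void)
{
    const char *input = "abb$";               /* a string of the language S -> aB, B -> bB | b */
    int states[100], top = 0, ip = 0;
    states[0] = 0;                            /* start in state 0 */

    for (;;) {
        int act = action(states[top], input[ip]);
        if (act == 100) { printf("accepted\n"); return 0; }
        if (act > 0)  { states[++top] = act; ip++; }           /* shift  */
        else if (act < 0) {                                    /* reduce */
            struct prod p = prods[-act];
            top -= p.len;                     /* pop |beta| states                */
            states[top + 1] = go_to(states[top], p.lhs);       /* then GOTO push  */
            top++;
        }
        else { printf("error\n"); return 1; }
    }
}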
Note: when multiple entries occur in the SLR table, the grammar is not accepted by the SLR (1)
parser.
Conflicts in SLR (1) Parsing:
When multiple entries occur in the table, the situation is said to be a conflict.
Shift-Reduce Conflict in SLR (1) Parsing: A shift-reduce conflict in SLR (1) parsing occurs when
a state has
1. a reduced item of the form A → α• where FOLLOW (A) includes the terminal a, and
2. an incomplete item of the form A → β•aγ,
so that ACTION [Ii, a] receives both a shift entry and a reduce entry (Sj/r).
Reduce-Reduce Conflict in SLR (1) Parsing: A reduce-reduce conflict occurs when a state has two
reduced items A → α• and B → β• whose FOLLOW sets have a terminal a in common, so that
ACTION [Ii, a] receives two reduce entries (r1/r2).
Canonical LR (1) Parsing: The various steps involved in CLR (1) parsing are:
1. Write the context free grammar for the given input string.
2. Check for ambiguity.
3. Add the Augment production.
4. Create the canonical collection of LR (1) items.
5. Draw the DFA.
6. Construct the CLR (1) parsing table.
7. Based on the information from the table, and with the help of the stack and the parsing
algorithm, generate the output.
LR (1) items:
An LR (1) item is defined by a production, the position of the dot, and a terminal symbol. The
terminal is called the lookahead symbol. The general form of an LR (1) item is [A → α•β, a],
where a is the lookahead.
Example: Consider the grammar S → CC, C → cC | d.
I0 State: Add the Augment production and compute the Closure; the lookahead symbol for the
Augment production is $.
I0 = Closure (S′ → •S, $)
The dot symbol is followed by the non terminal S, so add the productions starting with S to the
I0 state:
S → •CC, $
The dot symbol is followed by the non terminal C, so add the productions starting with C to the
I0 state. Their lookahead is FIRST (C$) = FIRST (C) = { c, d }:
C → •cC, c/d
C → •d, c/d
The dot symbol is now followed by terminal values, so close the I0 state. The productions in I0 are
I0:  S′ → •S, $
     S → •CC, $
     C → •cC, c/d
     C → •d, c/d
I1 = Goto (I0, S) = { S′ → S•, $ }
I2 = Goto (I0, C) = Closure (S → C•C, $). The dot is followed by C, so add the C productions with
lookahead FIRST ($) = { $ }. So the I2 state is
I2:  S → C•C, $
     C → •cC, $
     C → •d, $
I3 = Goto (I0, c) = Closure (C → c•C, c/d):
I3:  C → c•C, c/d
     C → •cC, c/d
     C → •d, c/d
I4 = Goto (I0, d) = { C → d•, c/d }
I5 = Goto (I2, C) = { S → CC•, $ }
I6 = Goto (I2, c) = Closure (C → c•C, $):
I6:  C → c•C, $
     C → •cC, $
     C → •d, $
I7 = Goto (I2, d) = { C → d•, $ }
I8 = Goto (I3, C) = { C → cC•, c/d }
I9 = Goto (I6, C) = { C → cC•, $ }
Drawing the Finite State Machine DFA for the above LR (1) items
(Figure: DFA of the LR (1) item sets. Transitions: I0 on S to I1, I0 on C to I2, I0 on c to I3, I0 on d
to I4, I2 on C to I5, I2 on c to I6, I2 on d to I7, I3 on C to I8, I3 on c to I3, I3 on d to I4, I6 on C
to I9, I6 on c to I6, I6 on d to I7.)
The CLR (1) parsing table is:
States      ACTION                        GOTO
            c         d         $         S     C
I0          S3        S4                  1     2
I1                              ACCEPT
I2          S6        S7                        5
I3          S3        S4                        8
I4          R3        R3
I5                              R1
I6          S6        S7                        9
I7                              R3
I8          R2        R2
I9                              R2
LALR (1) Parsing: In LALR (1) parsing, states of the CLR (1) collection that have the same core
(the same productions) and differ only in their lookaheads are merged. The states I4 = { C → d•, c/d }
and I7 = { C → d•, $ } differ only in the lookaheads; they have the same productions, hence these
states are combined to form a single state called I47.
Similarly, the states I3 and I6 differ only in their lookaheads, as given below:
I3 = Goto (I0, c):
C → c•C, c/d
C → •cC, c/d
C → •d, c/d
I6 = Goto (I2, c):
C → c•C, $
C → •cC, $
C → •d, $
These states differ only in the lookaheads and have the same productions; hence they are combined
to form a single state called I36.
Similarly, the states I8 and I9 differ only in their lookaheads, and are combined to form the state I89.
The LALR (1) parsing table is:
States      ACTION                        GOTO
            c         d         $         S     C
I0          S36       S47                 1     2
I1                              ACCEPT
I2          S36       S47                       5
I36         S36       S47                       89
I47         R3        R3        R3
I5                              R1
I89         R2        R2        R2
Shift-Reduce Conflict in CLR (1) Parsing: A shift-reduce conflict in CLR (1) parsing occurs when
a state has
1. a reduced item of the form A → α•, a and
2. an incomplete item of the form A → β•aγ,
so that ACTION [Ii, a] receives both a shift entry and a reduce entry (Sj/r).
Reduce-Reduce Conflict in CLR (1) Parsing: A reduce-reduce conflict occurs when a state has two
or more reduced items of the form
1. A → α•, a and
2. B → β•, a
i.e., two productions in a state (I) reduce on the same lookahead symbol a, so that ACTION [Ii, a]
receives two reduce entries (r1/r2).
String Acceptance using LR Parsing:
Consider the above example; if the input string is cdd, the parser works with the CLR (1) parsing
table given above as follows.
$0         cdd$    Shift S3
$0c3       dd$     Shift S4
$0c3d4     d$      Reduce with R3: C → d; pop 2*|β| symbols from the stack
$0c3C      d$      Goto (I3, C) = 8
$0c3C8     d$      Reduce with R2: C → cC; pop 2*|β| symbols from the stack
$0C        d$      Goto (I0, C) = 2
$0C2       d$      Shift S7
$0C2d7     $       Reduce with R3: C → d; pop 2*|β| symbols from the stack
$0C2C      $       Goto (I2, C) = 5
$0C2C5     $       Reduce with R1: S → CC; pop 2*|β| symbols from the stack
$0S        $       Goto (I0, S) = 1
$0S1       $       Accept
Ambiguity: A grammar can have more than one parse tree for a string. For example, consider the
grammar
string → string + string | string - string | 0 | 1 | … | 9
A grammar is said to be an ambiguous grammar if there is some string that it can generate in
more than one way (i.e., the string has more than one parse tree or more than one leftmost
derivation). A language is inherently ambiguous if it can only be generated by ambiguous
grammars.
In this grammar, the string 9-5+2 has two possible parse trees as shown in the next slide.
Consider the parse trees for the string 9-5+2. An expression like this has more than one parse tree.
The two trees for 9-5+2 correspond to the two ways of parenthesizing the expression: (9-5)+2 and
9-(5+2). The second parenthesization gives the expression the value 2 instead of 6.
Ambiguity is harmful to the intent of the program. The input might be deciphered in a way which
was not really the intention of the programmer, as shown above in the 9-5+2 example. There is
no general technique to handle ambiguity, i.e., it is not possible to develop some feature which
automatically identifies and removes ambiguity from any grammar. However, it can be
removed, broadly speaking, in the following possible ways:
1) Rewriting the grammar into an equivalent unambiguous grammar, as was done for the
expression grammar earlier.
2) Implementing precedence and associativity rules in the grammar. We shall discuss this
technique in the later slides.
If an operand has operators on both of its sides, the side whose operator takes the operand
determines the associativity of that operator.
Grammar to generate strings with a right associative operator (assignment):
right → letter = right | letter
letter → a | b | … | z
A binary operation * on a set S that does not satisfy the associative law is called non-
associative. A left-associative operation is a non-associative operation that is conventionally
evaluated from left to right i.e., operand is taken by the operator on the left side.
For example,
6*5*4 = (6*5)*4 and not 6*(5*4)
6/5/4 = (6/5)/4 and not 6/(5/4)
Following is the form of a grammar that generates strings with a left associative operator (for
instance, for *):
left → left * letter | letter
letter → a | b | … | z
(Note that this is left recursive and may go into an infinite loop in a top down parser; this problem
is handled later by making it right recursive.)
IMPORTANT QUESTIONS
1. Discuss the working of bottom up parsing, and specifically operator precedence parsing, with
an example?
2. What do you mean by an LR parser? Explain the LR (1) Parsing technique?
3. Write the differences between canonical collection of LR (0) items and LR (1) items?
4. Write the Difference between CLR (1) and LALR(1) parsing?
5. What is YACC? Explain how do you use it in constructing the parser using it.
ASSIGNMENT QUESTIONS
5. For the grammar E → E + T | T, T → T * F | F, F → (E) | id, construct the LALR(1) parsing table and explain the conflicts, if any.
UNIT-III
INTERMEDIATE CODE GENERATION
In intermediate code generation we use syntax-directed methods to translate the source program into an intermediate form; programming-language constructs such as declarations, assignments and flow-of-control statements are translated into this form.
The intermediate representation is:
- the output of the parser and the input to the code generator;
- relatively machine-independent, which allows the compiler to be retargeted;
- relatively easy to manipulate (optimize).
Benefits of a machine-independent intermediate form:
1. Retargeting is facilitated - a compiler for a new machine can be built by attaching a new code generator to an existing front end.
2. A machine-independent code optimizer can be applied to the intermediate representation.
There are three kinds of intermediate representations:
1. Syntax trees
2. Postfix notation
3. Three-address code
Semantic rules for generating three-address code from common programming language constructs are similar to those for constructing syntax trees or for generating postfix notation.
Graphical Representations
A syntax tree depicts the natural hierarchical structure of a source program. A DAG
(Directed Acyclic Graph) gives the same information but in a more compact way because
common sub-expressions are identified. A syntax tree for the assignment statement a:=b*-c+b*-c
appears in the following figure.
assign
    a
    +
        *
            b
            uminus
                c
        *
            b
            uminus
                c
Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of the tree in which a node appears immediately after its children. The postfix notation for the syntax tree in the figure is
a b c uminus * b c uminus * + assign
The edges in a syntax tree do not appear explicitly in postfix notation. They can be recovered from the order in which the nodes appear and the number of operands that the operator at a node expects. The recovery of edges is similar to the evaluation, using a stack, of an expression in postfix notation.
Three-Address Code: In three-address code each statement has at most one operator on the right-hand side. For example, a source expression such as x + y * z is translated into the sequence
t1 := y * z
t2 := x + t1
where t1 and t2 are compiler-generated temporary names. This unraveling of complicated arithmetic expressions and of nested flow-of-control statements makes three-address code desirable for target-code generation and optimization. The use of names for the intermediate values computed by a program allows three-address code to be easily rearranged, unlike postfix notation. Three-address code is a linearized representation of a syntax tree or a DAG in which explicit names correspond to the interior nodes of the graph.
For example, the three-address code for the assignment a := b * -c + b * -c is:
t1 := -c
t2 := b * t1
t3 := -c
t4 := b * t3
t5 := t2 + t4
a := t5
The reason for the term "three-address code" is that each statement usually contains three addresses, two for the operands and one for the result. In the implementations of three-address code given later in this section, a programmer-defined name is replaced by a pointer to a symbol-table entry for that name.
Three-address statements are akin to assembly code. Statements can have symbolic labels and there are statements for flow of control. A symbolic label represents the index of a three-address statement in the array holding intermediate code. Actual indices can be substituted for the labels either by making a separate pass, or by using "backpatching," discussed in Section 8.6. Here are the common three-address statements used in the remainder of these notes:
1. Assignment statements of the form x := y op z, where op is a binary arithmetic or logical operation.
2. Assignment instructions of the form x := op y, where op is a unary operation. Essential unary operations include unary minus, logical negation, shift operators, and conversion operators that, for example, convert a fixed-point number to a floating-point number.
3. Copy statements of the form x := y, where the value of y is assigned to x.
4. The unconditional jump goto L. The three-address statement with label L is the next to be executed.
5. Conditional jumps such as if x relop y goto L. This instruction applies a relational operator
(<, =, >=, etc.) to x and y, and executes the statement with label L next if x stands in relation
relop to y. If not, the three-address statement following if x relop y goto L is executed next, as in
the usual sequence.
6. param x and call p, n for procedure calls, and return y, where y (representing a returned value) is optional. Their typical use is as the sequence of three-address statements
param x1
param x2
...
param xn
call p, n
generated as part of a call of the procedure p(x1, x2, ..., xn). The integer n, indicating the number of actual parameters in call p, n, is not redundant because calls can be nested. The implementation of procedure calls is outlined in Section 8.7.
7. Indexed assignments of the form x: = y[ i ] and x [ i ]: = y. The first of these sets x to the
value in the location i memory units beyond location y. The statement x[i]:=y sets the contents of
the location i units beyond x to the value of y. In both these instructions, x, y, and i refer to data
objects.
8. Address and pointer assignments of the form x := &y, x := *y and *x := y. The first of these sets the value of x to be the location of y. Presumably y is a name, perhaps a temporary, that denotes an expression with an l-value such as A[i, j], and x is a pointer name or temporary. That is, the r-value of x is the l-value (location) of some object. In the statement x := *y, presumably y is a pointer or a temporary whose r-value is a location; the r-value of x is made equal to the contents of that location. Finally, *x := y sets the r-value of the object pointed to by x to the r-value of y.
When three-address code is generated, temporary names are made up for the interior nodes of a syntax tree. The value of the nonterminal on the left side of a production is computed into a new temporary t. In general, the three-address code for id := E consists of code to evaluate E into some temporary t, followed by the assignment id.place := t. If an expression is a single identifier, say y, then y itself holds the value of the expression. For the moment, we create a new name every time a temporary is needed; techniques for reusing temporaries are given later. The S-attributed definition in Fig. 8.6 generates three-address code for assignment statements. Given input a := b * -c + b * -c, it produces the code in Fig. 8.5(a). The synthesized attribute S.code represents the three-address code for the assignment S. The nonterminal E has two attributes:
1. E.place, the name that will hold the value of E, and
2. E.code, the sequence of three-address statements evaluating E.
The function newtemp returns a sequence of distinct names t1, t2, ... in response to successive calls. For convenience, we use the notation gen(x ':=' y '+' z) in Fig. 8.6 to represent the three-address statement x := y + z. Expressions appearing in place of variables like x, y, and z are evaluated when passed to gen, and quoted operators or operands, like '+', are taken literally. In practice, three-address statements might be sent to an output file, rather than built up into the code attributes. Flow-of-control statements can be added to the language of assignments in Fig. 8.6 by productions and semantic rules like the ones for while statements in Fig. 8.7. In the figure, the code for S → while E do S1 is generated using new attributes S.begin and S.after to mark the first statement in the code for E and the statement following the code for S, respectively. These attributes represent labels created by a function newlabel that returns a new label every time it is called.
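The following is a minimal sketch in C, written for these notes, of how such an S-attributed translation of expressions can be implemented; the expression-node layout is an assumption made here for illustration, and unary minus is omitted to keep the sketch short. newtemp and gen play the roles of the functions of the same names described above.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical expression node: a leaf holding an identifier, or a binary operator. */
typedef struct Expr {
    char op;                    /* '+', '*', ... ; 0 marks a leaf               */
    const char *name;           /* identifier name when op == 0                 */
    struct Expr *left, *right;  /* operand subtrees when op != 0                */
} Expr;

/* newtemp: return a fresh temporary name t1, t2, ... on each call. */
static char *newtemp(void) {
    static int n = 0;
    char *t = malloc(16);
    sprintf(t, "t%d", ++n);
    return t;
}

/* gen: emit (here, simply print) the three-address statement  x := y op z. */
static void gen(const char *x, const char *y, char op, const char *z) {
    printf("%s := %s %c %s\n", x, y, op, z);
}

/* Returns E.place; E.code is emitted as a side effect while the tree is walked. */
static const char *place(Expr *e) {
    if (e->op == 0)
        return e->name;                 /* a single identifier holds its own value       */
    const char *y = place(e->left);     /* generate code for the operands first          */
    const char *z = place(e->right);
    char *t = newtemp();                /* the result of this node goes into a new temp  */
    gen(t, y, e->op, z);
    return t;
}

static Expr *leaf(const char *n) { Expr *e = calloc(1, sizeof *e); e->name = n; return e; }
static Expr *node(char op, Expr *l, Expr *r) {
    Expr *e = calloc(1, sizeof *e); e->op = op; e->left = l; e->right = r; return e;
}

int main(void) {
    /* a := b * c + b * c */
    Expr *rhs = node('+', node('*', leaf("b"), leaf("c")),
                          node('*', leaf("b"), leaf("c")));
    const char *r = place(rhs);
    printf("a := %s\n", r);             /* the final copy  id.place := t                 */
    return 0;
}

Running the sketch prints t1 := b * c, t2 := b * c, t3 := t1 + t2, a := t3, mirroring the hand-generated code shown earlier.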
QUADRUPLES:
A quadruple is a record structure with four fields, which we call op, arg1, arg2, and result. The op field contains an internal code for the operator. The three-address statement x := y op z is represented by placing y in arg1, z in arg2, and x in result. Statements with unary operators like x := -y or x := y do not use arg2. Operators like param use neither arg2 nor result. Conditional and unconditional jumps put the target label in result. The quadruples in Fig. 8.8(a) are for the assignment a := b * -c + b * -c; they are obtained from the three-address code given earlier. The contents of fields arg1, arg2, and result are normally pointers to the symbol-table entries for the names represented by these fields. If so, temporary names must be entered into the symbol table as they are created.
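A small sketch in C of a quadruple record, written for these notes: the notes store pointers to symbol-table entries in arg1/arg2/result, while plain strings are used here as a simplifying assumption.

#include <stdio.h>

typedef struct Quad {
    const char *op;       /* internal code for the operator             */
    const char *arg1;     /* first operand ("" when unused)             */
    const char *arg2;     /* second operand ("" when unused)            */
    const char *result;   /* result name, or target label for jumps     */
} Quad;

/* Quadruples for  a := b * -c + b * -c,  taken from the three-address code above. */
static const Quad quads[] = {
    { "uminus", "c",  "",   "t1" },
    { "*",      "b",  "t1", "t2" },
    { "uminus", "c",  "",   "t3" },
    { "*",      "b",  "t3", "t4" },
    { "+",      "t2", "t4", "t5" },
    { ":=",     "t5", "",   "a"  },
};

int main(void) {
    printf("     %-8s %-5s %-5s %s\n", "op", "arg1", "arg2", "result");
    for (int i = 0; i < 6; i++)
        printf("(%d)  %-8s %-5s %-5s %s\n",
               i, quads[i].op, quads[i].arg1, quads[i].arg2, quads[i].result);
    return 0;
}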
TRIPLES:
To avoid entering temporary names into the symbol table, we might refer to a temporary value by the position of the statement that computes it. If we do so, three-address statements can be represented by records with only three fields: op, arg1 and arg2, as shown below. The fields arg1 and arg2, for the arguments of op, are either pointers to the symbol table (for programmer-defined names or constants) or pointers into the triple structure (for temporary values). Since three fields are used, this intermediate code format is known as triples. Except for the treatment of programmer-defined names, triples correspond to the representation of a syntax tree or DAG by an array of nodes.
Parenthesized numbers represent pointers into the triple structure, while symbol-table pointers are represented by the names themselves. In practice, the information needed to interpret the different kinds of entries in the arg1 and arg2 fields can be encoded into the op field or some additional fields. The triples in Fig. 8.8(b) correspond to the quadruples in Fig. 8.8(a). Note that
the copy statement a:= t5 is encoded in the triple representation by placing a in the arg 1 field
and using the operator assign. A ternary operation like x[ i ]: = y requires two entries in the triple
structure, as shown in Fig. 8.9(a), while x: = y[i] is naturally represented as two operations in
Fig. 8.9(b).
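A corresponding sketch in C of a triple record, written for these notes: here an operand is either a name or a parenthesized reference "(i)" to the triple that computes the value; encoding the reference as a string is a simplifying assumption made for illustration.

#include <stdio.h>

typedef struct Triple {
    const char *op;
    const char *arg1;   /* a name, or "(i)" referring to triple number i */
    const char *arg2;
} Triple;

/* Triples for  a := b * -c + b * -c  (compare with the quadruples above;
   no temporary names t1..t5 are needed). */
static const Triple triples[] = {
    /* (0) */ { "uminus", "c",   ""    },
    /* (1) */ { "*",      "b",   "(0)" },
    /* (2) */ { "uminus", "c",   ""    },
    /* (3) */ { "*",      "b",   "(2)" },
    /* (4) */ { "+",      "(1)", "(3)" },
    /* (5) */ { "assign", "a",   "(4)" },
};

int main(void) {
    for (int i = 0; i < 6; i++)
        printf("(%d)  %-8s %-4s %-4s\n", i, triples[i].op, triples[i].arg1, triples[i].arg2);
    return 0;
}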
Indirect Triples
Another implementation of three-address code that has been considered is that of listing
pointers to triples, rather than listing the triples themselves. This implementation is naturally
called indirect triples. For example, let us use an array called statement to list pointers to triples in the
desired order. Then the triples in Fig. 8.8(b) might be represented as in Fig. 8.10.
SEMANTIC ANALYSIS
Semantic analysis covers the following aspects of translation:
- Type checking
- Control flow checking
- Uniqueness checking
- Name checking
Assume that the program has been verified to be syntactically correct and converted into some kind of intermediate representation (a parse tree). The parse tree is now available. The next phase will be semantic analysis of the generated parse tree. Semantic analysis also includes error reporting in case any semantic error is found.
Semantic analysis is a pass by a compiler that adds semantic information to the parse tree
and performs certain checks based on this information. It logically follows the parsing phase, in
which the parse tree is generated, and logically precedes the code generation phase, in which
(intermediate/target) code is generated. (In a compiler implementation, it may be possible to fold
different phases into one pass.) Typical examples of semantic information that is added and checked are typing information (type checking) and the binding of variables and function names to their definitions (object binding). Sometimes some early code optimization is also done in this phase. For this phase the compiler usually maintains symbol tables in which it stores what
this phase. For this phase the compiler usually maintains symbol tables in which it stores what
each symbol (variable names, function names, etc.) refers to.
TYPE CHECKING: The process of verifying and enforcing the constraints of types is called
type checking. This may occur either at compile-time (a static check) or run-time (a dynamic
check). Static type checking is a primary task of the semantic analysis carried out by a compiler.
If type rules are enforced strongly (that is, generally allowing only those automatic type conversions which do not lose information), the language is said to be strongly typed; if not, weakly typed.
UNIQUENESS CHECKING: Whether a variable name is unique or not in its scope.
Type coercion: Some kind of mixing of types is allowed. This is done in languages which are not strongly typed, and can be done dynamically as well as statically.
NAME CHECKS: Check whether any variable has a name which is not allowed, e.g., a name that is the same as a keyword (such as int in Java).
- Whether an identifier has been declared before use: this problem amounts to recognizing a language of the form {wαw | w ∈ Σ*}, which is not context free.
A parser has its own limitations in catching program errors related to semantics, something that
is deeper than syntax analysis. Typical features of semantic analysis cannot be modeled using
context free grammar formalism. If one tries to incorporate those features in the definition of a
language then that language doesn't remain context free anymore.
Examples:
string x; int y;
y = x + 3;        here the use of x is a type error
int a, b;
a = b + c;        here c is not declared
An identifier may refer to different variables in different parts of the program. An identifier may be usable in one part of the program but not in another. These are a couple of examples which tell us what a compiler typically has to do beyond syntax analysis. The third point can be explained like this: an identifier x can be declared in two separate functions in the program, once of type int and once of type char. Hence the same identifier will have to be bound to these two different properties in the two different contexts. The fourth point can be explained in this manner: a variable declared within one function cannot be used within the scope of the definition of another function unless declared there separately. This is just one example; you can probably think of many more in which a variable declared in one scope cannot be used in another scope. A small illustration in C is given below.
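The following short C program, written for these notes, illustrates both points: the same identifier x is bound to different types in different functions, and a variable declared in one function is not visible in another.

#include <stdio.h>

void f(void) {
    int x = 10;           /* here the identifier x is bound to an int              */
    printf("%d\n", x);
}

void g(void) {
    char x = 'A';         /* the same identifier x is bound to a char here         */
    printf("%c\n", x);
    /* printf("%d\n", y);    error: y is declared in h(), not visible inside g()   */
}

void h(void) {
    int y = 5;            /* y is local to h() and cannot be used inside g()       */
    printf("%d\n", y);
}

int main(void) { f(); g(); h(); return 0; }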
ABSTRACT SYNTAX TREE: An abstract syntax tree is nothing but a condensed form of a parse tree.
In the next few slides we will see how abstract syntax trees can be constructed from syntax
directed definitions. Abstract syntax trees are condensed form of parse trees. Normally operators
and keywords appear as leaves but in an abstract syntax tree they are associated with the interior
nodes that would be the parent of those leaves in the parse tree. This is clearly indicated by the
examples in these slides.
Chains of single productions may be collapsed, and operators move to the parent nodes. A chain of single productions is collapsed into one node, with the operator moving up to become the node.
. Each node of the tree can be represented as a record consisting of at least two fields to store operators and operands.
. operator : one field for the operator, the remaining fields are pointers to the operands - mknode(op, left, right)
. identifier : one field with label id and another pointer to the symbol table - mkleaf(id, id.entry)
. number : one field with label num and another to keep the value of the number - mkleaf(num, val)
Each node in an abstract syntax tree can be implemented as a record with several fields. In the
node for an operator one field identifies the operator (called the label of the node) and the
remaining contain pointers to the nodes for operands. Nodes of an abstract syntax tree may have
additional fields to hold values (or pointers to values) of attributes attached to the node. The
functions given in the slide are used to create the nodes of abstract syntax trees for expressions.
Each function returns a pointer to a newly created node.
For example, the following sequence of function calls creates a syntax tree for a - 4 + c:
P 1 = mkleaf(id, entry.a)
P 2 = mkleaf(num, 4)
P 3 = mknode(-, P 1 , P 2 )
P 4 = mkleaf(id, entry.c)
P 5 = mknode(+, P 3 , P 4 )
An example showing the formation of an abstract syntax tree by the given function calls for the expression a-4+c. The call sequence can be defined based on its postfix form, which is explained below.
A- Write the postfix equivalent of the expression for which we want to construct a syntax tree
B- Call the functions in the sequence, as defined by the sequence in the postfix expression which
results in the desired tree. In the case above, call mkleaf() for a, mkleaf() for 4, mknode() for
-, mkleaf() for c , and mknode() for + at last.
1. P1 = mkleaf(id, a.entry) : A leaf node made for the identifier a, and an entry for a is made in
the symbol table.
2. P2 = mkleaf(num,4) : A leaf node made for the number 4, and entry for its value.
3. P3 = mknode(-,P1,P2) : An internal node for the -, takes the pointer to previously made nodes
P1, P2 as arguments and represents the expression a-4.
4. P4 = mkleaf(id, c.entry) : A leaf node made for the identifier c , and an entry for c.entry made
in the symbol table.
5. P5 = mknode(+,P3,P4) : An internal node for the + , takes the pointer to previously made
nodes P3,P4 as arguments and represents the expression a- 4+c .
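A sketch in C, written for these notes, of the node record and the constructor functions used in steps 1-5 above; the exact field layout (and the split of mkleaf into mkleaf_id/mkleaf_num) is an assumption made for illustration.

#include <stdio.h>
#include <stdlib.h>

typedef struct Node Node;
struct Node {
    enum { OP, ID, NUM } label;   /* kind of node                                  */
    char op;                      /* operator symbol when label == OP              */
    const char *entry;            /* symbol-table entry (name) when label == ID    */
    int val;                      /* value of the number when label == NUM         */
    Node *left, *right;           /* pointers to operand nodes when label == OP    */
};

static Node *mknode(char op, Node *left, Node *right) {
    Node *p = calloc(1, sizeof *p);
    p->label = OP; p->op = op; p->left = left; p->right = right;
    return p;                     /* each function returns a pointer to the new node */
}
static Node *mkleaf_id(const char *entry) {
    Node *p = calloc(1, sizeof *p); p->label = ID; p->entry = entry; return p;
}
static Node *mkleaf_num(int val) {
    Node *p = calloc(1, sizeof *p); p->label = NUM; p->val = val; return p;
}

int main(void) {
    /* builds the abstract syntax tree for  a - 4 + c  exactly as in steps 1-5 above */
    Node *p1 = mkleaf_id("a");
    Node *p2 = mkleaf_num(4);
    Node *p3 = mknode('-', p1, p2);
    Node *p4 = mkleaf_id("c");
    Node *p5 = mknode('+', p3, p4);
    printf("root operator: %c\n", p5->op);
    return 0;
}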
Following is the syntax-directed definition for constructing the syntax tree above.
Now we have the syntax directed definitions to construct the parse tree for a given grammar. All
the rules mentioned in slide 29 are taken care of and an abstract syntax tree is formed.
In an AG, the values of attributes at a parse tree node are computed by semantic rules. There are
two different specifications of AGs used by the Semantic Analyzer in evaluating the semantics
of the program constructs. They are,
1) Syntax directed definition (SDD): It is a high level specification in which implementation details are hidden, e.g., S.sys =
A.sys + B.sys;
/* does not give any implementation details. It just tells us. This kind of attribute
equation we will be using, Details like at what point of time is it evaluated and in what manner
are hidden from the programmer.*/
2) Syntax directed Translation(SDT) scheme: Sometimes we want to control the way the
attributes are evaluated, the order and place where they are evaluated. This is of a slightly lower
level.
An SDT is an SDD in which semantic actions can be placed at any position in the body of the
production.
For example, the following SDT prints the prefix equivalent of an arithmetic expression consisting of + and * operators.
L → E n                    { printf(E.val) }
E → { printf('+') } E1 + T
E → T
T → { printf('*') } T1 * F
T → F
F → (E)
F → id                     { printf(id.lexval) }
F → num                    { printf(num.lexval) }
An action in an SDT is executed as soon as its node in the parse tree is visited in a preorder traversal of the tree.
To avoid repeated traversal of the parse tree, actions are taken simultaneously when a token is
found. So calculation of attributes goes along with the construction of the parse tree.
Along with the evaluation of the semantic rules the compiler may simultaneously generate code,
save the information in the symbol table, and/or issue error messages etc. at the same time while
building the parse tree.
Example
Build an attribute grammar that annotates Number with the value it represents.
Symbol     Attributes
Number     value
sign       negative
list       position, value
bit        position, value
For the production Number → sign list, the attribute rules are:
list.position = 0
Number.value = if sign.negative then - list.value else list.value
Attributes of RHS can be computed from attributes of LHS and vice versa.
Dependence graph shows the dependence of attributes on other attributes, along with the
syntax tree. Top down traversal is followed by a bottom up traversal to resolve the dependencies.
Here value and negative are synthesized attributes; position is an inherited attribute.
Attributes : . Attributes fall into two classes namely synthesized attributes and inherited
attributes. Value of a synthesized attribute is computed from the values of its children nodes.
Value of an inherited attribute is computed from the sibling and parent nodes.
The attributes are divided into two groups, called synthesized attributes and inherited
attributes. The synthesized attributes are the result of the attribute evaluation rules also using the
values of the inherited attributes. The values of the inherited attributes are inherited from parent
nodes and siblings.
Each grammar production A → α has associated with it a set of semantic rules of the form b := f(c1, c2, ..., ck), where f is a function and c1 to ck are attributes of A and of the grammar symbols in α.
Dependence relation tells us what attributes we need to know before hand to calculate a
particular attribute.
Here the value of the attribute b depends on the values of the attributes c1 to ck. If c1 to ck belong to the children nodes and b to A, then b is called a synthesized attribute. If b belongs to one of the symbols in α (the right-hand side), then it is an inherited attribute of one of the grammar symbols on the right.
Synthesized Attributes : A syntax directed definition that uses only synthesized attributes
is said to be an S- attributed definition
. A parse tree for an S-attributed definition can be annotated by evaluating semantic rules for
attributes
S-attributed grammars are a class of attribute grammars, comparable with L-attributed grammars
but characterized by having no inherited attributes at all. Inherited attributes, which must be
passed down from parent nodes to children nodes of the abstract syntax tree during the semantic
analysis, pose a problem for bottom-up parsing because in bottom-up parsing, the parent nodes
of the abstract syntax tree are created after creation of all of their children. Attribute evaluation
in S-attributed grammars can be incorporated conveniently in both top-down parsing and bottom-
up parsing .
. Terminals are assumed to have only synthesized attributes, the values of which are supplied by the lexical analyzer.
This is a grammar which uses only synthesized attributes. Start symbol has no parents, hence no
inherited attributes.
Using the previous attribute grammar, the calculations have been worked out here for 3 * 4 + 5 n; bottom-up parsing has been done.
. It is possible to use only S-attributes, but it is more natural to use inherited attributes.
D → T L        L.in = T.type
T → real       T.type = real
T → int        T.type = int
L → L1 , id    L1.in = L.in; addtype(id.entry, L.in)
L → id         addtype(id.entry, L.in)
Inherited attributes help to find the context (type, scope etc.) of a token e.g., the type of a token
or scope when the same variable name is used multiple times in a program in different functions.
An inherited attribute system may be replaced by an S -attribute system but it is more natural to
use inherited attributes in some cases like the example given above.
Here the function addtype(a, b) adds a symbol-table entry for the identifier a and attaches to it the type b.
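A minimal sketch in C, written for these notes, of a recursive-descent front end for the D → T L grammar above, showing how the inherited attribute L.in is passed down as a function parameter. The pre-tokenized input and the loop that replaces the left recursion of L → L1 , id are assumptions made to keep the example short.

#include <stdio.h>
#include <string.h>

/* Input is a pre-tokenized declaration such as:  int x , y , z           */
static const char *tokens[] = { "int", "x", ",", "y", ",", "z", NULL };
static int pos = 0;
static const char *look(void)  { return tokens[pos]; }
static const char *match(void) { return tokens[pos++]; }

/* addtype(a, b): record in the symbol table that identifier a has type b. */
static void addtype(const char *id, const char *type) {
    printf("addtype(%s, %s)\n", id, type);
}

/* T -> int | real         returns the synthesized attribute T.type        */
static const char *T(void) { return match(); }

/* L -> id | L1 , id       the inherited attribute L.in arrives as a parameter;
   the left recursion is handled here by a loop.                           */
static void L(const char *in) {
    addtype(match(), in);              /* first identifier                  */
    while (look() && strcmp(look(), ",") == 0) {
        match();                       /* consume ','                       */
        addtype(match(), in);          /* L1.in = L.in for the rest of list */
    }
}

/* D -> T L                L.in = T.type                                   */
static void D(void) { L(T()); }

int main(void) { D(); return 0; }

For the input int x, y, z the sketch prints addtype(x, int), addtype(y, int), addtype(z, int), exactly the effect described for the dependency graph below.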
Dependence of attributes in an inherited attribute system. The value of in (an inherited attribute)
at the three L nodes gives the type of the three identifiers x , y and z . These are determined by
computing the value of the attribute T.type at the left child of the root and then valuating L.in top
down at the three L nodes in the right subtree of the root. At each L node the procedure addtype
is called which inserts the type of the identifier to its entry in the symbol table. The figure also
shows the dependence graph which is introduced later.
Dependence Graph : . If an attribute b depends on an attribute c then the semantic rule for b
must be evaluated after the semantic rule for c
. The dependencies among the nodes can be depicted by a directed graph called dependency
graph
Dependency Graph : Directed graph indicating interdependencies among the synthesized and
inherited attributes of various nodes in a parse tree.
for each node n in the parse tree do
    for each attribute a of the grammar symbol at n do
        construct a node in the dependency graph for a;
for each node n in the parse tree do
    for each semantic rule b := f(c1, c2, ..., ck) associated with the production used at n do
        for i = 1 to k do
            construct an edge from the node for ci to the node for b;
This is the algorithm to construct the dependency graph. After making one node for every attribute of all the nodes of the parse tree, an edge is made to each attribute from each of the other attributes on which it depends.
For example ,
The semantic rule A.a = f(X.x , Y.y) for the production A -> XY defines the synthesized
attribute a of A to be dependent on the attribute x of X and the attribute y of Y . Thus the
dependency graph will contain an edge from X.x to A.a and Y.y to A.a accounting for the two
dependencies. Similarly for the semantic rule X.x = g(A.a , Y.y) for the same production there
will be an edge from A.a to X.x and an edge from Y.y to X.x.
Example
The synthesized attribute E.val depends on E1.val and E2.val, hence the two edges, one each from E1.val and E2.val.
For example, the dependency graph for the string real id1, id2, id3
. Put a dummy synthesized attribute b for a semantic rule that consists of a procedure call
The figure shows the dependency graph for the statement real id1, id2, id3 along with the
parse tree. Procedure calls can be thought of as rules defining the values of dummy synthesized
attributes of the nonterminal on the left side of the associated production. Blue arrows constitute
the dependency graph and black lines, the parse tree. Each of the semantic rules addtype
(id.entry, L.in) associated with the L productions leads to the creation of the dummy attribute.
Evaluation Order :
Any topological sort of dependency graph gives a valid order in which semantic rules must be
evaluated
a4 = real
a5 = a4
addtype(id3.entry, a5)
a7 = a5
addtype(id2.entry, a7)
a9 = a7
addtype(id1.entry, a9)
A topological sort of a directed acyclic graph is any ordering m1, m2, m3 .......mk of the
nodes of the graph such that edges go from nodes earlier in the ordering to later nodes. Thus if
mi -> mj is an edge from mi to mj then mi appears before mj in the ordering. The order of the
statements shown in the slide is obtained from the topological sort of the dependency graph in
the previous slide. 'an' stands for the attribute associated with the node numbered n in the
dependency graph. The numbering is as shown in the previous slide.
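The evaluation order can be computed mechanically. The sketch below, written for these notes, applies a simple topological sort (Kahn's algorithm) to a small hypothetical dependency graph given as an edge list; the node numbering is an assumption made for illustration only.

#include <stdio.h>

#define N 4   /* attribute nodes a0..a3 of a hypothetical dependency graph */

int main(void) {
    /* edge u -> v means: attribute u must be evaluated before attribute v */
    int edges[][2] = { {0, 1}, {1, 2}, {1, 3}, {2, 3} };
    int nedges = 4, indeg[N] = {0}, done[N] = {0};

    for (int e = 0; e < nedges; e++) indeg[edges[e][1]]++;

    for (int emitted = 0; emitted < N; emitted++) {
        int u = -1;
        for (int v = 0; v < N; v++)               /* pick any node with no    */
            if (!done[v] && indeg[v] == 0) u = v; /* unevaluated predecessor  */
        if (u < 0) { printf("cycle: circular attribute dependency\n"); return 1; }
        printf("evaluate a%d\n", u);
        done[u] = 1;
        for (int e = 0; e < nedges; e++)          /* remove u's outgoing edges */
            if (edges[e][0] == u) indeg[edges[e][1]]--;
    }
    return 0;
}

For the edge list shown, the program prints evaluate a0, a1, a2, a3, i.e., a valid order in which the semantic rules may be applied; a cyclic dependency graph would be reported as an error.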
Translation schemes: A translation scheme is a CFG in which semantic actions occur within the right-hand sides of productions. A translation scheme to map infix to postfix:
E → T R
R → addop T { print(addop) } R | ε
T → num { print(num) }
We assume that the actions are terminal symbols; performing a depth-first traversal for the input 9-5+2, we obtain 9 5 - 2 +. A small implementation sketch is given below.
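A minimal sketch in C, written for these notes, of the translation scheme above as a recursive-descent translator; it handles only single-digit numbers and the operators + and -, an assumption made to keep the example short.

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

static const char *input = "9-5+2";
static int pos = 0;

static void T(void);
static void R(void);

/* E -> T R */
static void E(void) { T(); R(); }

/* R -> addop T { print(addop) } R | e      (epsilon when no addop follows) */
static void R(void) {
    if (input[pos] == '+' || input[pos] == '-') {
        char addop = input[pos++];
        T();
        printf("%c ", addop);      /* the embedded action { print(addop) } */
        R();
    }
}

/* T -> num { print(num) } */
static void T(void) {
    if (!isdigit((unsigned char)input[pos])) { fprintf(stderr, "syntax error\n"); exit(1); }
    printf("%c ", input[pos++]);   /* the embedded action { print(num) }   */
}

int main(void) {
    E();                           /* prints: 9 5 - 2 +                    */
    printf("\n");
    return 0;
}

Executing the actions exactly where they appear in the productions is what turns the depth-first traversal into the postfix output.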
When designing translation scheme, ensure attribute value is available when referred to
In case of synthesized attribute it is trivial (why?)
In a translation scheme, since we are dealing with implementation, we have to explicitly worry about the order of traversal. We can now put actions in between the symbols on the right-hand sides of the rules; we place these actions in order to control the order of traversal. In the given example, we have two terminals (num and addop). The input can generally be seen as a number followed by R (which necessarily has to begin with an addop). The given grammar is in infix notation and we need to convert it into postfix notation. If we ignore all the actions, we get the ordinary parse tree; if we include them, we get a parse tree with actions, the actions being treated so far as terminals. Now, if we do a depth-first traversal and execute each action as we encounter it, we get the postfix notation. In a translation scheme we have to take care of the evaluation order, otherwise some of the parts may be left undefined; different placements of the actions give different results. Actions are something we write, and we have to control them. Note that a translation scheme is different from a syntax-directed definition: in the latter we do not specify any evaluation order, while here we have an explicit evaluation order. With an explicit evaluation order we have to place the correct action at the correct place in order to get the desired output; the position of each action is very important, and finding appropriate places is what a translation scheme is all about. If we use only synthesized attributes, the translation scheme is trivial: when we reach a node we know that all its children must already have been evaluated, along with all their attributes, so finding the place for evaluation is simple - it is the rightmost place, at the end of the production.
. An inherited attribute for a symbol on the RHS of a production must be computed in an action placed before that symbol.
. A synthesized attribute for the nonterminal on the LHS can be computed only after all the attributes it references have been computed; the action should normally be placed at the end of the RHS.
We have a problem when we have both synthesized and inherited attributes. For the given example, if we place the actions as shown, we cannot evaluate it: when doing a depth-first traversal, we cannot print anything for A1 because A1 has not yet been initialized. We therefore have to find the correct places for the actions. The rule is that the inherited attribute of A must be calculated in an action placed to its left. This follows logically from the definition of an L-attributed definition, which says that when we reach a node, everything to its left must already have been computed. If we do this, we will always have the attribute evaluated at the correct place. For such specific cases (like the given example) calculating it anywhere on the left will work, but in general it must be calculated immediately to the left.
S → B             B.pts = 10
                  S.ht = B.ht
B → B1 B2         B1.pts = B.pts
                  B2.pts = B.pts
                  B.ht = max(B1.ht, B2.ht)
B → B1 sub B2     B1.pts = B.pts
                  B2.pts = shrink(B.pts)
                  B.ht = disp(B1.ht, B2.ht)
B → text          B.ht = text.h * B.pts
We now look at another example: a grammar describing how text is composed. EQN was an equation-typesetting system used as an early typesetting tool for UNIX; it served a role similar to LaTeX for equations. We say that the start symbol is a block: S → B. We can also have subscripts and superscripts; here we look at subscripts. A block is composed of several blocks: B → B1 B2, where B2 is a subscript of B1. We have to determine the point size (inherited) and the height (synthesized). The relevant functions for height and point size are given alongside. After putting actions in the right place
We have put all the actions at the correct places as per the rules stated. Read it from left to right,
and top to bottom. We note that all inherited attribute are calculated on the left of B symbols and
synthesized attributes are on the right.
We now come to implementation. We decide how we use parse tree and L-attribute
definitions to construct the parse tree with a one-to-one correspondence. We first look at the top-
down translation scheme. The first major problem is left recursion. If we remove left recursion
by our standard mechanism, we introduce new symbols, and new symbols will not work with the
existing actions. Also, we have to do the parsing in a single pass.
. If both the operands of arithmetic operators +, -, x are integers then the result is of type integer
. The result of unary & operator is a pointer to the object referred to by the operand.
- If the type of operand is X then type of result is pointer to X
1. Basic types: These are atomic types with no internal structure. They include the types boolean,
character, integer and real.
2. Sub-range types: A sub-range type defines a range of values within the range of another type.
For example, type A = 1..10; B = 100..1000; U = 'A'..'Z';
3. Enumerated types: An enumerated type is defined by listing all of the possible values for the
type. For example: type Colour = (Red, Yellow, Green); Country = (NZ, Aus, SL, WI, Pak, Ind,
SA, Ken, Zim, Eng); Both the sub-range and enumerated types can be treated as basic types.
4. Constructed types: A constructed type is constructed from basic types and other constructed types.
Examples of constructed types are arrays, records and sets. Additionally, pointers and functions
can also be treated as constructed types.
TYPE EXPRESSION:
It is an expression that denotes the type of an expression. The type of a language construct is
denoted by a type expression
The type of a language construct is denoted by a type expression. A type expression is either a
basic type or is formed by applying an operator called a type constructor to other type
expressions. Formally, a type expression is recursively defined as:
1. A basic type is a type expression. Among the basic types are boolean , char , integer , and real
.A special basic type, type_error , is used to signal an error during type checking. Another
special basic type is void which denotes "the absence of a value" and is used to check statements.
2. Since type expressions may be named, a type name is a type expression.
3. The result of applying a type constructor to a type expression is a type expression.
4. Type expressions may contain variables whose values are type expressions themselves.
TYPE CONSTRUCTORS: Type constructors are used to construct the type expressions of user-defined types from the type expressions they depend on.
Arrays : If T is a type expression and I is a range of integers, then array ( I , T ) is the type
expression denoting the type of array with elements of type T and index set I.
For example, the Pascal declaration, var A: array[1 .. 10] of integer; associates the type
expression array ( 1..10, integer ) with A.
Products : If T1 and T2 are type expressions, then their Cartesian product T1 X T2 is also a type
expression.
Records: A record type constructor is applied to a tuple formed from field names and field types. For example, the declaration
type row = record
    addr : integer;
    lexeme : array[1 .. 15] of char
end;
var table : array[1 .. 10] of row;
gives the type row the type expression record((addr x integer) x (lexeme x array(1 .. 15, char))), and the type expression of table is array(1 .. 10, row).
Note: Including the field names in the type expression allows us to define another record type
with the same fields but with different names without being forced to equate the two.
Pointers: If T is a type expression, then pointer ( T ) is a type expression denoting the type
"pointer to an object of type T".
For example, in Pascal, the declaration
var p: row declares variable p to have type pointer( row ).
P D;E
D D ; D | id : T
A type checker is a translation scheme that synthesizes the type of each expression from the
types of its sub-expressions. Consider the above given grammar that generates programs
consisting of a sequence of declarations D followed by a single expression E.
Specifications of a type checker for the language of the above grammar: A program generated
by this grammar is
key : integer;
key mod 1999
Assumptions:
1. The language has three basic types: char , int and type-error
2. For simplicity, all arrays start at 1. For example, the declaration array[256] of char leads to the
type expression array ( 1.. 256, char).
The rules for the symbol table entry are specified above. These are basically the way in which
the symbol table entries corresponding to the productions are done.
The production E -> E ( E ) where an expression is the application of one expression to another
can be used to represent the application of a function to an argument. The rule for checking the
type of a function application is
E -> E1 ( E2 ) { E.type := if E2.type == s and E1. type == s -> t then t else type_error }
This rule says that in an expression formed by applying E1 to E2, the type of E1 must be a
function s -> t from the type s of E2 to some range type t ; the type of E1 ( E2 ) is t . The above
rule can be generalized to functions with more than one argument byconstructing a product type
consisting of the arguments. Thus n arguments of type T1, T2, ..., Tn can be viewed as a single argument of the type T1 X T2 X ... X Tn. For example,
root : ( real → real ) × real → real
declares a function root that takes a function from reals to reals and a real as arguments and returns a real. The Pascal-like syntax for this declaration is
function root ( function f ( real ) : real ; x : real ) : real
TYPE CHECKING FOR EXPRESSIONS: consider the following SDD for expressions.
E → literal       E.type = char
E → num           E.type = integer
E → id            E.type = lookup(id.entry)
E → E1 mod E2     E.type = if E1.type == integer and E2.type == integer
                           then integer
                           else type_error
E → E1 [ E2 ]     E.type = if E2.type == integer and E1.type == array(s, t)
                           then t
                           else type_error
E → E1 ^          E.type = if E1.type == pointer(t)
                           then t
                           else type_error
To perform type checking of expressions, following rules are used. Where the synthesized
attribute type for E gives the type expression assigned by the type system to the expression
generated by E.
The following semantic rules say that constants represented by the tokens literal and num have
type char and integer , respectively:
. The function lookup ( e ) is used to fetch the type saved in the symbol-table entry pointed to by
e. When an identifier appears in an expression, its declared type is fetched and assigned to the
attribute type:
. According to the following rule, the expression formed by applying the mod operator to two
sub-expressions of type integer has type integer ; otherwise, its type is type_error .
E -> E1 mod E2 { E.type := if E1.type == integer and E2.type == integer then integer else
type_error }
In an array reference E1 [ E2 ], the index expression E2 must have type integer, in which case the result is the element type t obtained from the type array ( s, t ) of E1.
Within expressions, the postfix operator ^ yields the object pointed to by its operand. The type of E is the type t of the object pointed to by the pointer E1:
Since statements do not have values, the special basic type void is assigned to them, but if an
error is detected within a statement, the type assigned to the statement is type_error .
The statements considered below are assignment, conditional, and while statements. Sequences
of statements are separated by semi-colons. The productions given below can be combined with
those given before if we change the production for a complete program to P -> D; S. The
program now consists of declarations followed by statements.
1. S → id := E    { S.type := if id.type == E.type then void else type_error }
This rule checks that the left and right sides of an assignment statement have the same type.
2. S → if E then S1    { S.type := if E.type == boolean then S1.type else type_error }
This rule specifies that the expression in an if-then statement must have the type boolean.
3. S → while E do S1    { S.type := if E.type == boolean then S1.type else type_error }
This rule specifies that the expression in a while statement must have the type boolean.
4. S → S1 ; S2    { S.type := if S1.type == void and S2.type == void then void else type_error }
Errors are propagated by this last rule because a sequence of statements has type void only if
each sub-statement has type void.
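A compact sketch in C, written for these notes, of the expression type-checking rules above; representing type expressions by an enum plus one element-type pointer is a simplifying assumption made here for illustration.

#include <stdio.h>

typedef enum { T_INT, T_CHAR, T_BOOL, T_ARRAY, T_POINTER, T_ERROR } Kind;

typedef struct Type {
    Kind kind;
    const struct Type *elem;   /* element type for array(s, t) or pointer(t) */
} Type;

static const Type INT_T   = { T_INT,   0 };
static const Type ERROR_T = { T_ERROR, 0 };

/* E -> E1 mod E2 : integer if both operands are integer, else type_error   */
static const Type *check_mod(const Type *e1, const Type *e2) {
    return (e1->kind == T_INT && e2->kind == T_INT) ? &INT_T : &ERROR_T;
}

/* E -> E1 [ E2 ] : element type t if E1 is array(s, t) and E2 is integer   */
static const Type *check_index(const Type *e1, const Type *e2) {
    return (e1->kind == T_ARRAY && e2->kind == T_INT) ? e1->elem : &ERROR_T;
}

/* E -> E1 ^ : the pointed-to type t if E1 is pointer(t)                    */
static const Type *check_deref(const Type *e1) {
    return (e1->kind == T_POINTER) ? e1->elem : &ERROR_T;
}

int main(void) {
    Type arr_of_int = { T_ARRAY, &INT_T };
    Type ptr_to_int = { T_POINTER, &INT_T };

    printf("%d\n", check_mod(&INT_T, &INT_T)->kind);         /* T_INT   */
    printf("%d\n", check_index(&arr_of_int, &INT_T)->kind);  /* T_INT   */
    printf("%d\n", check_deref(&ptr_to_int)->kind);          /* T_INT   */
    printf("%d\n", check_mod(&INT_T, &ptr_to_int)->kind);    /* T_ERROR */
    return 0;
}

Each helper mirrors one semantic rule: the result type is computed from the operand types, and type_error is propagated when the rule's condition fails.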
1. What do you mean by THREE ADDRESS CODE? Generate the three-address code for the following code.
begin
    PROD := 0;
    I := 1;
    do
    begin
        PROD := PROD + A[I] * B[I];
        I := I + 1;
    end
    while I <= 20
end
ASSIGNMENT QUESTIONS:
1. Write Three address code for the below example
while (i < 10)
{
a= b+c*-d;
i++;
}
2. What is a Syntax Directed Definition? Write a syntax-directed definition to convert a binary value into decimal.
SYMBOL TABLE
Symbol Table(ST) : Is a data structure used by the compiler to keep track of scope and binding
information about names
- The symbol table is changed every time a name is encountered in the source; changes occur whenever a new name is discovered, or new information about an existing name is discovered.
As we know the compiler uses a symbol table to keep track of scope and binding information
about names. It is filled after the AST is made by walking through the tree, discovering and
assimilating information about the names. There should be two basic operations - to insert a new
name or information into the symbol table as and when discovered and to efficiently lookup a
name in the symbol table to retrieve its information.
Two common data structures used for the symbol table organization are -
1. Linear lists:- Simple to implement, Poor performance.
2. Hash tables:- Greater programming / space overhead, but, Good performance.
Ideally a compiler should be able to grow the symbol table dynamically, i.e., insert new entries
or information as and when needed.
But if the size of the table is fixed in advance then (an array implementation for example), then
the size must be big enough in advance to accommodate the largest possible program.
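A minimal sketch in C, written for these notes, of a hash-table-based symbol table with the two basic operations, insert and lookup; chaining and the hash function used here are ordinary textbook choices, not something prescribed by these notes, and the entry stores only one attribute for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 211   /* a prime number of chains */

typedef struct Entry {
    char name[32];
    char type[16];          /* one attribute; real tables store many more      */
    struct Entry *next;     /* chain of entries that hash to the same bucket   */
} Entry;

static Entry *table[BUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % BUCKETS;
}

/* lookup: return the entry for a name, or NULL if it is not in the table      */
static Entry *lookup(const char *name) {
    for (Entry *e = table[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;
}

/* insert: add a new entry for a name (the table grows dynamically, no fixed limit) */
static Entry *insert(const char *name, const char *type) {
    unsigned h = hash(name);
    Entry *e = calloc(1, sizeof *e);
    strncpy(e->name, name, sizeof e->name - 1);
    strncpy(e->type, type, sizeof e->type - 1);
    e->next = table[h];
    table[h] = e;
    return e;
}

int main(void) {
    insert("count", "int");
    insert("name", "char*");
    Entry *e = lookup("count");
    printf("%s : %s\n", e->name, e->type);                   /* count : int   */
    printf("%s\n", lookup("missing") ? "found" : "not found");
    return 0;
}

This reflects the trade-off stated above: a hash table costs some programming and space overhead compared with a linear list, but gives good performance for lookups.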
For each declaration of a name there is an entry in the symbol table.
- The format need not be uniform, because the information depends upon the usage of the name.
- Each entry is a record consisting of consecutive words.
- To keep records uniform, some entries may be kept outside the symbol table.
Information is entered into symbol table at various times. For example,
- keywords are entered initially,
- identifier lexemes are entered by the lexical analyzer.
. Symbol table entry may be set up when role of name becomes clear , attribute values are filled
in as information is available during the translation process.
For each declaration of a name, there is an entry in the symbol table. Different entries
need to store different information because of the different contexts in which a name can occur.
An entry corresponding to a particular name can be inserted into the symbol table at different
stages depending on when the role of the name becomes clear. The various attributes that an
entry in the symbol table can have are lexeme, type of name, size of storage and in case of
functions - the parameter list etc.
A name may denote several objects in the same block
- int x; struct x { float y, z; }
The lexical analyzer returns the name itself and not pointer to symbol table entry. A record in the
symbol table is created when role of the name becomes clear. In this case two symbol table
entries are created.
Attributes of a name are entered in response to declarations.
There might be multiple entries in the symbol table for the same name, all of them having
different roles. It is quite intuitive that the symbol table entries have to be made only when the
role of a particular name becomes clear. The lexical analyzer therefore just returns the name and
not the symbol table entry as it cannot determine the context of that name. Attributes
corresponding to the symbol table are entered for a name in response to the corresponding
declaration. There has to be an upper limit for the length of the lexemes for them to be stored in
the symbol table.
If target code is assembly code, then assembler can take care of storage for various names and
the compiler needs to generate data definitions to be appended to assembly code
If target code is machine code, then compiler does the allocation. No storage allocation is done
for names whose storage is allocated at runtime
Information about the storage locations that will be bound to names at run time is kept in
the symbol table. If the target is assembly code, the assembler can take care of storage for
various names. All the compiler has to do is to scan the symbol table, after generating assembly
code, and generate assembly language data definitions to be appended to the assembly language
program for each name. If machine code is to be generated by the compiler, then the position of
each data object relative to a fixed origin must be ascertained. The compiler has to do the
allocation in this case. In the case of names whose storage is allocated on a stack or heap, the
compiler does not allocate storage at all, it plans out the activation record for each procedure.
When the called procedure returns, the interrupted activation can be restarted after restoring the
saved machine state. The heap may be used to store dynamically allocated data objects, and also
other stuff such as activation information (in the case of languages where an activation tree
cannot be used to represent lifetimes). Both the stack and the heap change in size during program
execution, so they cannot be allocated a fixed amount of space. Generally they start from
opposite ends of the memory and can grow as required, towards each other, until the space
available has filled up.
For Pascal and C, the activation record is generally stored on the run-
time stack during the period when the procedure is executing.
Of the fields shown in the figure, access link and control link are optional (e.g. FORTRAN
doesn't need access links). Also, actual parameters and return values are often stored in registers
instead of the activation record, for greater efficiency.
The activation record for a procedure call is generated by the compiler. Generally, all
field sizes can be determined at compile time.
However, this is not possible in the case of a procedure which has a local array whose size depends on a parameter. The strategies used for storage allocation in such cases are discussed in the following sections.
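As a concrete picture of the fields mentioned above, the sketch below shows one possible layout of an activation record; the field order, sizes and C types are assumptions made for illustration, since in a real compiler this is a region of the run-time stack laid out by the compiler, not a C struct.

#include <stdio.h>

struct activation_record {
    int   ret_value;               /* returned value (often kept in a register instead)   */
    int   actual_params[4];        /* actual parameters (often passed in registers)       */
    void *control_link;            /* points to the caller's activation record            */
    void *access_link;             /* for non-local names (optional, e.g. not in FORTRAN) */
    int   saved_machine_status[4]; /* return address and saved register values            */
    int   local_data[8];           /* storage for the procedure's local variables         */
    int   temporaries[8];          /* compiler-generated temporaries                      */
};

int main(void) {
    printf("activation record size: %zu bytes\n", sizeof(struct activation_record));
    return 0;
}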
Static allocation: lays out storage at compile time for all data objects
Stack allocation: manages the runtime storage as a stack
Heap allocation :allocates and de-allocates storage as needed at runtime from heap
These represent the different storage-allocation strategies used in the distinct parts of the
run-time memory organization (as shown in slide 8). We will now look at the possibility of using
these strategies to allocate memory for activation records. Different languages use different
strategies for this purpose. For example, old FORTRAN used static allocation, Algol type
languages use stack allocation, and LISP type languages use heap allocation.
STATIC ALLOCATION: In this approach memory is allocated statically, so names are bound to storage as the program is compiled.
These are the fundamental characteristics of static allocation. Since name binding occurs during
compilation, there is no need for a run-time support package. The retention of local name values
across procedure activations means that when control returns to a procedure, the values of the
locals are the same as they were when control last left. For example, suppose we had the
following code, written in a language using static allocation:
function F( )
{
    int a;
    print(a);    /* on a second call this prints 10, the value retained from the first call */
    a = 10;
}
After calling F( ) once, if it was called a second time, the value of a would initially be 10, and
this is what would get printed.
The type of a name determines its storage requirement. The address for this storage is an offset
from the procedure's activation record, and the compiler positions the records relative to the
target code and to one another (on some computers, it may be possible to leave this relative
position unspecified, and let the link editor link the activation records to the executable code).
After this position has been decided, the addresses of the activation records, and hence of the
storage for each name in the records, are fixed. Thus, at compile time, the addresses at which the
target code can find the data it operates upon can be filled in. The addresses at which information
is to be saved when a procedure call takes place are also known at compile time. Static allocation
does have some limitations.
- Size of data objects, as well as any constraints on their positions in memory, must be
available at compile time.
- No recursion, because all activations of a given procedure use the same bindings for local
names.
- No dynamic data structures, since no mechanism is provided for run time storage allocation.
STACK ALLOCATION: The figure shows the activation records that are pushed onto and popped off the run-time stack as control flows through the given activation tree.
First the procedure sort is activated. Procedure readarray's activation is pushed onto the stack when control reaches the first call in the body of sort. After control returns from the activation of readarray, its activation is popped. In the activation of sort, control then reaches a call of qsort with actuals 1 and 9, and an activation of qsort is pushed onto the top of the
stack. In the last stage the activations for partition (1,3) and qsort (1,0) have begun and ended
during the life time of qsort (1,3), so their activation records have come and gone from the stack,
leaving the activation record for qsort (1,3) on top.
Calling sequence and activation records differ, even for the same language. The code in the
calling sequence is often divided between the calling procedure and the procedure it calls.
The code for the Callee can access its temporaries and the
local data using offsets from stack top.
The fields whose sizes are fixed early are placed in the middle. The decision of whether
or not to use the control and access links is part of the design of the compiler, so these fields can
be fixed at compiler construction time. If exactly the same amount of machine-status information
is saved for each activation, then the same code can do the saving and restoring for all
activations. The size of temporaries may not be known to the front end. Temporaries needed by
the procedure may be reduced by careful code generation or optimization. This field is shown
after that for the local data. The caller usually evaluates the parameters and communicates them
to the activation record of the callee. In the runtime stack, the activation record of the caller is
just below that for the callee. The fields for parameters and a potential return value are placed
next to the activation record of the caller. The caller can then access these fields using offsets
from the end of its own activation record. In particular, there is no reason for the caller to know
about the local data or temporaries of the callee.
As described earlier, in the runtime stack, the activation record of the caller is just below
that for the callee. The fields for parameters and a potential return value are placed next to the
activation record of the caller. The caller can then access these fields using offsets from the end
of its own activation record. The caller copies the return value into its own activation record. In
particular, there is no reason for the caller to know about the local data or temporaries of the
callee. The given calling sequence allows the number of arguments of the called procedure to
depend on the call. At compile time, the target code of the caller knows the number of arguments
it is supplying to the callee. The caller knows the size of the parameter field. The target code of
the called procedure must be prepared to handle other calls as well, so it waits until it is called, then
examines the parameter field. Information describing the parameters must be placed next to the
status field so the callee can find it.
The procedure P has three local arrays. The storage for these arrays is not part of the
activation record for P; only a pointer to the beginning of each array appears in the activation
record. The relative addresses of these pointers are known at the compile time, so the target code
can access array elements through the pointers. Also shown is the procedure Q called by P . The
activation record for Q begins after the arrays of P. Access to data on the stack is through two
pointers, top and stack top. The first of these marks the actual top of the stack; it points to the
position at which the next activation record begins. The second is used to find the local data. For
consistency with the organization of the figure in slide 16, suppose the stack top points to the end
of the machine status field. In this figure the stack top points to the end of this field in the
activation record for Q. Within the field is a control link to the previous value of stack top when
control was in calling activation of P. The code that repositions top and stack top can be
generated at compile time, using the sizes of the fields in the activation record. When Q returns,
the new value of top is stack top minus the length of the machine status and the parameter fields
in Q's activation record. This length is known at the compile time, at least to the caller. After
adjusting top, the new value of stack top can be copied from the control link of Q.
int *dangle(void)
{
    int i = 23;
    return &i;   /* dangling reference: the storage for i is de-allocated when dangle returns */
}
The problem of dangling references arises, whenever storage is de-allocated. A dangling
reference occurs when there is a reference to storage that has been de-allocated. It is a logical
error to use dangling references, since the value of de-allocated storage is undefined according to
the semantics of most languages. Since that storage may later be allocated to another datum,
mysterious bugs can appear in the programs with dangling references.
HEAP ALLOCATION: If a procedure wants to put a value that is to be used after its activation is over, then we cannot use the stack for that purpose. Languages like Pascal allow data to be allocated under program control. Also, in certain languages a called activation may outlive the caller procedure. In such cases a last-in first-out discipline will not work, and we will require a data structure like a heap to store the activations. The last case does not arise in languages whose activation trees correctly depict the flow of control between procedures.
Heap allocation is needed when:
o The values of local variables must be retained when an activation ends.
o A called activation outlives the caller.
In such cases, de-allocation of activation records cannot occur in last-in first-out fashion. Heap allocation gives out pieces of contiguous storage for activation records.
As mentioned earlier, for efficiency reasons we can handle small activations and activations of
predictable size as a special case as follows:
1. For each size of interest, keep a linked list of free blocks of that size.
2. If possible, fill a request for size s with a block of size s', where s' is the smallest size greater
than or equal to s. When the block is eventually de-allocated, it is returned to the linked list it
came from.
A heap manager dynamically allocates memory. This comes with a runtime overhead, since the heap manager has to take care of defragmentation and garbage collection. But because the heap manager saves space (otherwise we would have to fix the size of every activation at compile time), the runtime overhead is a price worth paying.
A block is any sequence of operations or instructions that is used to perform a [sub]task. In any programming language,
Blocks contain their own local data structures.
Blocks can be nested, and their starts and ends are marked by delimiters.
They ensure that a block is either independent of another block or nested in it. That is, it is not possible for two blocks B1 and B2 to overlap in such a way that block B1 begins, then B2 begins, but B1 ends before B2 ends.
This nesting property is called block structure. The scope of a declaration in a block-
structured language is given by the most closely nested rule:
For example, in the figure above, the scope of the declaration of b in B0 does not include B1,
because b is re-declared in B1. We assume that variables are declared before the first statement
in which they are accessed. The scope of the variables is as follows:
[The figure tabulates each DECLARATION against its SCOPE.]
1. STACK ALLOCATION: This is based on the observation that, since the scope of a declaration
does not extend outside the block in which it appears, the space for a declared name can be
allocated when the block is entered and de-allocated when control leaves the block. This view
treats a block as a "parameterless procedure", called only from the point just before the block
and returning only to the point just after the block.
2. COMPLETE ALLOCATION: Here the complete memory is allocated at one time. If there
are blocks within the procedure, then allowance is made for the storage needed for declarations
within those blocks. If two variables are never live at the same time and are at the same depth,
they can be assigned the same storage.
Generally, languages like Lisp and ML, which do not allow explicit de-allocation of memory,
perform garbage collection. A reference to storage that is no longer valid is called a 'dangling
reference'. For example, consider the C function dangle() shown earlier, which returns the
address of a local variable that ceases to exist when the function returns.
The simplest form of dynamic allocation involves blocks of a fixed size. By linking the blocks in
a list, as shown in the figure, allocation and de-allocation can be done quickly with little or no
storage overhead.
Explicit Allocation of Fixed-Sized Blocks: In this approach, blocks are drawn from a
contiguous area of storage, and a portion of each block is used as a pointer to the next block.
The pointer available points to the first block
Allocation means removing a block from the available list
De-allocation means putting the block in the available list
Compiler routines need not know the type of objects to be held in the blocks
Each block is treated as a variant record
Suppose that blocks are to be drawn from a contiguous area of storage. Initialization of
the area is done by using a portion of each block for a link to the next block. A pointer available
points to the first block. Generally a list of free nodes and a list of allocated nodes is maintained,
and whenever a new block has to be allocated, the block at the head of the free list is taken off
and allocated (added to the list of allocated nodes). When a node has to be de-allocated, it is
removed from the list of allocated nodes by changing the pointer to it in the list to point to the
block previously pointed to by it, and then the removed block is added to the head of the list of
free blocks. The compiler routines that manage blocks do not need to know the type of object
that will be held in the block by the user program. These blocks can contain any type of data
(i.e., they are used as generic memory locations by the compiler). We can treat each block as a
variant record, with the compiler routines viewing the block as consisting of some other type.
Thus, there is no space overhead because the user program can use the entire block for its own
purposes. When the block is returned, then the compiler routines use some of the space from the
block itself to link it into the list of available blocks, as shown in the figure in the last slide.
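The following C sketch illustrates this scheme under simple assumptions (100 blocks of 32
bytes, and a union standing in for the "variant record" mentioned above); it is meant only to
make the idea concrete, not to define the notes' implementation.
#include <stddef.h>

#define NBLOCKS 100

/* While a block is free, its storage is viewed as a link; while allocated, as payload. */
typedef union Block { union Block *next; char payload[32]; } Block;

static Block  area[NBLOCKS];      /* contiguous area of storage carved into equal blocks */
static Block *available;          /* head of the list of free blocks                     */

void init_blocks(void)
{
    for (int i = 0; i < NBLOCKS - 1; i++)
        area[i].next = &area[i + 1];      /* link block i to block i+1 */
    area[NBLOCKS - 1].next = NULL;
    available = &area[0];
}

void *allocate_block(void)        /* allocation = remove the block at the head of the list */
{
    Block *b = available;
    if (b) available = b->next;
    return b;
}

void deallocate_block(void *p)    /* de-allocation = push the block back onto the list     */
{
    Block *b = p;
    b->next = available;
    available = b;
}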
The situation shown can occur if a program allocates five blocks and then de-allocates the
second and the fourth, for example.
Fragmentation is of no consequence if blocks are of fixed size, but if they are of variable size, a
situation like this is a problem, because we could not allocate a block larger than any one of the
free blocks, even though the space is available in principle.
So, if variable-sized blocks are allocated, then internal fragmentation can be avoided, as we only
allocate as much space as we need in a block. But this creates the problem of external
fragmentation, where enough space is available in total for our requirements, but not enough
space is available in continuous memory locations, as needed for a block of allocated memory.
For example, consider another case where we need to allocate 400 bytes of data for the next
request, and the available continuous regions of memory that we have are of sizes 300, 200 and
100 bytes. So we have a total of 600 bytes, which is more than what we need. But still we are
unable to allocate the memory as we do not have enough contiguous storage.
The amount of external fragmentation while allocating variable-sized blocks can become very
high on using certain strategies for memory allocation.
So we try to use certain strategies for memory allocation, so that we can minimize memory
wastage due to external fragmentation. These strategies are discussed in the next few lines.
. Storage can become fragmented: this situation may arise if a program allocates five blocks
and then de-allocates the second and fourth blocks.
IMPORTANT QUESTIONS:
1. What are calling sequences and return sequences? Explain briefly.
2. What is the main difference between Static & Dynamic storage allocation? Explain the
problems associated with dynamic storage allocation schemes.
3. What is the need of a display associated with a procedure? Discuss the procedures for
maintaining the display when the procedures are not passed as parameters.
4. Write notes on the static storage allocation strategy with example and discuss its
limitations?
5. Discuss about the stack allocation strategy of runtime environment with an example?
6. Explain the concept of implicit de-allocation of memory.
7. Give an example of creating dangling references and explain how garbage is created.
ASSIGNMENT QUESTIONS:
UNIT-IV
To study the run-time storage management system it is sufficient to focus on the statements:
action, call, return and halt, because they by themselves give us sufficient insight into the
behavior shown by functions in calling each other and returning.
And the run-time allocation and de-allocation of activations occur on the call of functions and
when they return.
There are mainly two kinds of run-time allocation systems: Static allocation and Stack
Allocation. While static allocation is used by the FORTRAN class of languages, stack allocation
is used by the Ada class of languages.
The call is implemented by the sequence
MOV #here+20, callee.static-area
GOTO callee.code-area
callee.static-area and callee.code-area are constants referring to the address of the activation
record and the first address of the called procedure's code, respectively.
. #here+20 in the MOV instruction is the return address: the address of the instruction following
the GOTO instruction.
A return from the callee is implemented by
GOTO *callee.static-area
For the call statement, we need to save the return address somewhere and then jump to
the location of the callee function. And to return from a function, we have to access the return
address as stored by its caller, and then jump to it. So for a call, we first say MOV #here+20,
callee.static-area. Here, #here refers to the location of the current MOV instruction, and
callee.static-area is a fixed location in memory. 20 is added to #here because the code
corresponding to the call takes 20 bytes (counting 4 bytes per word: 4*3 = 12 for this instruction
and 8 for the next). Then we say GOTO callee.code-area, to take us to the code of the callee, as
callee.code-area is simply the address where the code of the callee starts. A return from the
callee is then implemented by GOTO *callee.static-area. Note that this works only because
callee.static-area is a constant.
Example: The activation records are statically allocated starting at addresses 300 and 364 (the
figure shows the records occupying locations 300:, 304:, ... and 364:, 368:, ...).
This example corresponds to the code discussed above. Statically, we say that the code
for c starts at 100 and that for p starts at 200. At some point, c calls p. Using the strategy
discussed earlier, and assuming that callee.static-area is at memory location 364, we get the
code as given. Here we assume that a call to 'action' corresponds to a single machine instruction
which takes 20 bytes.
STACK ALLOCATION : . Position of the activation record is not known until run time
. Position is stored in a register at run time, and words in the record are accessed with an
offset from the register
. The code for the first procedure initializes the stack by setting up SP to the start of the
stack area
MOV #Stackstart, SP
HALT
In stack allocation we do not need to know the position of the activation record until run-
time. This gives us an advantage over static allocation, as we can have recursion. So this is used
in many modern programming languages like C, Ada, etc. The positions of the activations are
stored in the stack area, and the position for the most recent activation is pointed to by the stack
pointer. Words in a record are accessed with an offset from the register. The code for the first
procedure initializes the stack by setting up SP to the stack area by the following command:
MOV #Stackstart, SP. Here, #Stackstart is the location in memory where the stack starts.
A procedure call sequence increments SP, saves the return address and transfers control to the
called procedure
ADD #caller.recordsize, SP
MOV #here+16, *SP
GOTO callee.code_area
Consider the situation when a function (the caller) calls another function (the callee): the
procedure call sequence increments SP by the caller's record size, saves the return address, and
transfers control to the callee by jumping to its code area. In the MOV instruction here, we only
need to add 16, as SP is a register and so no extra space is needed to store *SP as an operand.
The activations keep getting pushed on the stack, so #caller.recordsize needs to be added to SP
to update SP to its new value. This works because #caller.recordsize is a constant for a function,
regardless of the particular activation being referred to.
DATA STRUCTURES: The following data structures are used to implement symbol tables.
LIST DATA STRUCTURE: This could be an array-based or pointer-based list. This
implementation is:
- Simplest to implement
- Uses a single array to store names and information
- Search for a name is linear
- Entry and lookup are independent operations
- Cost of entry and search operations is very high, and a lot of time goes into bookkeeping
Hash table: A hash table is a data structure which gives O(1) average-case performance in
accessing any element. It uses the features of both arrays and pointer-based lists.
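As a rough illustration, a chained hash table supporting lookup and insert might look like the
C sketch below; the entry fields and the hash function are assumptions made for the example,
not a layout prescribed by these notes.
#include <stdlib.h>
#include <string.h>

#define TABSIZE 211

typedef struct Entry {
    char         *name;
    int           type;
    int           offset;
    struct Entry *next;      /* chain of entries whose names hash to the same bucket */
} Entry;

static Entry *bucket[TABSIZE];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 65599u + (unsigned char)*s++;
    return h % TABSIZE;
}

Entry *lookup(const char *name)
{
    for (Entry *e = bucket[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;                          /* not found */
}

Entry *insert(const char *name, int type, int offset)
{
    unsigned h = hash(name);
    Entry *e = malloc(sizeof *e);
    e->name = malloc(strlen(name) + 1);
    strcpy(e->name, name);
    e->type = type;
    e->offset = offset;
    e->next = bucket[h];                  /* new entries go at the head of the chain */
    bucket[h] = e;
    return e;
}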
The entries in the symbol table are for declaration of names. When an occurrence of a name in
the source text is looked up in the symbol table, the entry for the appropriate declaration,
according to the scoping rules of the language, must be returned. A simple approach is to
maintain a separate symbol table for each scope.
Most closely nested scope rules can be implemented by adapting the data structures
discussed in the previous section. Each procedure is assigned a unique number. If the language is
block-structured, the blocks must also be assigned unique numbers. The name is represented as a
pair of a number and a name. This new name is added to the symbol table. Most scope rules can
be implemented in terms of following operations:
int fun2()
{
int a;
int c;
....
}
Visibility: The visibility of a variable determines how much of the rest of the program
can access that variable. You can arrange that a variable is visible only within one part of
one function, or in one function, or in one source file, or anywhere in the program.
r) Local and Global variables: A variable declared within the braces {} of a function is
visible only within that function; variables declared within functions are called local
variables. On the other hand, a variable declared outside of any function is a global
variable, and it is potentially visible anywhere within the program.
s) Automatic Vs Static duration: How long do variables last? By default, local variables
(those declared within a function) have automatic duration: they spring into existence
when the function is called, and they (and their values) disappear when the function
returns. Global variables, on the other hand, have static duration: they last, and the values
stored in them persist, for as long as the program does. (Of course, the values can in
general still be overwritten, so they don't necessarily persist forever.) By default, local
variables have automatic duration. To give them static duration (so that, instead of
coming and going as the function is called, they persist for as long as the function does),
you precede their declaration with the static keyword: static int i; By default, a
declaration of a global variable (especially if it specifies an initial value) is the defining
instance. To make it an external declaration, of a variable which is defined somewhere
else, you precede it with the keyword extern: extern int j; Finally, to arrange that a global
variable is visible only within its containing source file, you precede it with the static
keyword: static int k; Notice that the static keyword can do two different things: it adjusts
the duration of a local variable from automatic to static, or it adjusts the visibility of a
global variable from truly global to private-to-the-file (see the example after the list below).
t) Symbol attributes and symbol table entries
u) Symbols have associated attributes
v) Typical attributes are name, type, scope, size, addressing mode etc.
w) A symbol table entry collects together attributes such that they can be easily set and
retrieved
x) Example of typical names in symbol table
Name Type
name character string
class enumeration
size integer
type enumeration
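Returning to the discussion of duration and visibility in item s) above, the following small C
fragment (the file names, identifiers and values are invented for illustration) summarizes the
three uses of static and the role of extern:
/* file1.c */
int j = 42;              /* global definition: visible to other files via extern      */
static int k = 7;        /* file-scope static: visible only within file1.c            */

int next_id(void)
{
    static int i = 0;    /* static duration: i keeps its value between calls          */
    int n = 0;           /* automatic duration: n is re-created on every call         */
    n = n + 1;           /* n is always 1 at this point                               */
    return ++i;          /* i counts 1, 2, 3, ... across calls                        */
}

/* file2.c */
extern int j;            /* external declaration: j is defined elsewhere (in file1.c) */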
Following are prototypes of typical function declarations used for managing local symbol table.
The right hand side of the arrows is the output of the procedure and the left side has the input.
A major consideration in designing a symbol table is that insertion and retrieval should be as fast
as possible
. One dimensional table: search is very slow
. Balanced binary tree: quick insertion, searching and retrieval; extra work required to keep the
tree balanced
. Hash tables: quick insertion, searching and retrieval; extra work to compute hash keys
A major consideration in designing a symbol table is that insertion and retrieval should be
as fast as possible. We talked about the one dimensional and hash tables a few slides back. Apart
from these balanced binary trees can be used too. Hashing is the most common approach.
Hash tables can clearly implement 'lookup' and 'insert' operations. For implementing the
'delete', we do not want to scan the entire hash table looking for lists containing entries to be
deleted. Each entry should have two links:
a) A hash link that chains the entry to other entries whose names hash to the same value - the
usual link in the hash table.
b) A scope link that chains all entries in the same scope - an extra link. If the scope link is left
undisturbed when an entry is deleted from the hash table, then the chain formed by the scope
links will constitute an inactive symbol table for the scope in question.
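A possible C rendering of such an entry, and of closing a scope, is sketched below; the structure
layout and the hash-function parameter are assumptions, and error handling is omitted.
typedef struct STEntry {
    char           *name;
    struct STEntry *hashlink;   /* next entry in the same hash bucket           */
    struct STEntry *scopelink;  /* next entry declared in the same scope        */
} STEntry;

/* Remove every entry of a scope (given by the head of its scope-link chain) from the
 * hash table, leaving the scope-link chain intact as an inactive symbol table.       */
void close_scope(STEntry *scope_head, STEntry *bucket[], unsigned (*hash)(const char *))
{
    for (STEntry *e = scope_head; e; e = e->scopelink) {
        STEntry **p = &bucket[hash(e->name)];
        while (*p && *p != e)
            p = &(*p)->hashlink;
        if (*p) *p = e->hashlink;         /* unlink e from its hash chain only */
    }
}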
Look at the nesting structure of this program (see figure). Variables a, b and c appear in global
as well as local scopes. The local scope of a variable overrides the global scope of another
variable with the same name within its own scope. The figure also shows the global and local
symbol tables for this structure. Here procedures i and h lie within the scope of g (they are
nested within g).
GLOBAL SYMBOL TABLE STRUCTURE: The global symbol table will be a collection of
symbol tables connected with pointers. The exact structure is determined by the scope and
visibility rules of the language. Whenever a new scope is encountered, a new symbol table is
created. This new table contains a pointer back to the enclosing scope's symbol table, and the
enclosing one also contains a pointer to this new symbol table. Any variable used inside the new
scope should be present either in its own symbol table or in the enclosing scope's symbol table,
and so on all the way up to the root symbol table. A sample global symbol table is shown in the
figure below.
Storage binding and symbolic registers : Translates variable names into addresses and the
process must occur before or during code generation
There is a base address and every name is given an offset with respect to this base which changes
with every invocation. The variables can be divided into four categories:
a) Global Variables : fixed relocatable address or offset with respect to base as global pointer
b) Global Static Variables: Global static variables also have static duration: they last, and the
values stored in them persist, for as long as the program does (though the values can in general
still be overwritten). Therefore they too have a fixed relocatable address or offset with respect
to the base/global pointer.
c) Stack Variables: may be allocated on the stack or kept in registers; since registers are not
indexable, arrays cannot be kept in registers.
d) Stack Static Variables : By default, local variables (stack variables) (those declared within a
function) have automatic duration: they spring into existence when the function is called, and
they (and their values) disappear when the function returns. This is why they are stored in stacks
and have offset from stack/frame pointer.
Register allocation is usually done for global variables. Since registers are not indexable,
therefore, arrays cannot be in registers as they are indexed data structures. Graph coloring is a
simple technique for allocating register and minimizing register spills that works well in practice.
Register spills occur when a register is needed for a computation but all available registers are in
use. The contents of one of the registers must be stored in memory to free it up for immediate
use. We assign symbolic registers to scalar variables which are used in the graph coloring.
Alignment: an object may need to be aligned on
word boundaries - the most significant byte of the object must be located at an address whose
two least significant bits are zero relative to the frame pointer, or on
half-word boundaries - the most significant byte of the object being located at an address
whose least significant bit is zero relative to the frame pointer.
While allocating memory to the variables, sort variables by the alignment they need. You may:
Store largest variables first: It automatically aligns all the variables and does not require padding
since the next variable's memory allocation starts at the end of that of the earlier variable
. Store smallest variables first: It requires more space (padding) since you have to accommodate
for the biggest possible length of any variable data structure. The advantage is that for large stack
frame, more variables become accessible within small offsets
How should large local data structures be stored? They require large space in local frames
and therefore large offsets.
- If large object is put near the boundary other objects require large offset either from fp (if
put near beginning) or sp (if put near end)
- Allocate another base register to access large objects
- Allocate space in the middle or elsewhere; store pointer to these locations from at a small
offset from fp
- Requires extra loads
Large local data structures require large space in local frames and therefore large offsets.
As told in the previous slide's notes, if large objects are put near the boundary then the other
objects require large offset. You can either allocate another base register to access large objects
or you can allocate space in the middle or elsewhere and then store pointers to these locations
starting from at a small offset from the frame pointer, fp.
In the unsorted allocation you can see the waste of space in green. In sorted frame there is no
waste of space.
Elements of an array are stored in a block of consecutive locations. For a single-dimensional
array, if low is the lower bound of the index, w is the width of each element, and base is the
relative address of the storage allocated to the array (i.e., the relative address of A[low]), then
the i-th element begins at the location: base + (i - low) * w. This expression can be reorganized
as i*w + (base - low*w). The sub-expression base - low*w is calculated and stored in the symbol
table at compile time when the array declaration is processed, so that the relative address of
A[i] can be obtained by just adding i*w to it; the address therefore has the form
i * w + const
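For instance, assuming low = 1, w = 4 and base = 1000, A[i] begins at 1000 + (i - 1)*4 =
i*4 + 996; the constant 996 = base - low*w is computed once when the declaration is processed,
and only i*4 is computed at run time.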
2-DIMENSIONAL ARRAY: For a row-major two-dimensional array, the address of A[i][j]
can be calculated by the formula:
base + ((i - lowi) * n2 + j - lowj) * w, where lowi and lowj are the lower bounds of i and j and
n2 is the number of values j can take, i.e. n2 = high2 - low2 + 1.
This can be reorganized as ((i * n2) + j) * w + (base - ((lowi * n2) + lowj) * w), where the second
term can be calculated at compile time.
In the same manner, the expression for the location of an element in column major two-
dimensional array can be obtained. This addressing can be generalized to multidimensional
arrays. Storage can be either row major or column major approach.
Example: Let A be a 10 x 20 array, so n1 = 10 and n2 = 20, and assume the width of the type
stored in the array is 4. The three-address code to access A[y, z] is:
t1 = y * 20
t1 = t1 + z
t2 = 4 * t1
t3 = baseA - 84      { ((low1 * n2) + low2) * w = (1*20 + 1)*4 = 84 }
t4 = t2 + t3
x = t4
The following operations are defined:
1. mktable(previous): creates a new symbol table and returns a pointer to this table. previous is
a pointer to the symbol table of the enclosing (parent) procedure.
2. enter(table, name, type, offset): creates a new entry for name in the symbol table pointed to
by table.
3. addwidth(table, width): records the cumulative width of all entries in the header of the
symbol table pointed to by table.
4. enterproc(table, name, newtable): creates an entry for procedure name in the symbol table
pointed to by table. newtable is a pointer to the symbol table for name.
P -> { t = mktable(nil);
       push(t, tblptr);
       push(0, offset) }
     D
     { addwidth(top(tblptr), top(offset));
       pop(tblptr);
       pop(offset) }
D -> D ; D
The symbol tables are created using two stacks: tblptr to hold pointers to symbol tables of
the enclosing procedures and offset whose top element is the next available relative address for a
local of the current procedure. Declarations in nested procedures can be processed by the syntax-
directed definitions given below. Note that they are basically the same as those given above, but
the epsilon productions (the markers M and N) are dealt with separately. The explanation follows
the productions.
D -> proc id ;
     { t = mktable(top(tblptr));
       push(t, tblptr); push(0, offset) }
     D1 ; S
     { t = top(tblptr);
       addwidth(t, top(offset));
       pop(tblptr); pop(offset);
       enterproc(top(tblptr), id.name, t) }
D -> id : T
     { enter(top(tblptr), id.name, T.type, top(offset));
       top(offset) = top(offset) + T.width }
The action for M creates a symbol table for the outermost scope and hence a nil pointer is passed
in place of previous. When the declaration D -> proc id ; N D1 ; S is processed, the action
corresponding to N causes the creation of a symbol table for the procedure; the pointer to symbol
table of enclosing procedure is given by top(tblptr). The pointer to the new table is pushed on to
the stack tblptr and 0 is pushed as the initial offset on the offset stack. When the actions
corresponding to the subtrees of N, D1 and S have been executed, the offset corresponding to the
current procedure i.e., top(offset) contains the total width of entries in it. Hence top(offset) is
added to the header of symbol table of the current procedure. The top entries of tblptr and offset
are popped so that the pointer and offset of the enclosing procedure are now on top of these
stacks. The entry for id is added to the symbol table of the enclosing procedure. When the
declaration D -> id :T is processed entry for id is created in the symbol table of current
procedure. Pointer to the symbol table of current procedure is again obtained from top(tblptr).
Offset corresponding to the current procedure i.e. top(offset) is incremented by the width
required by type T to point to the next available location.
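To make the four operations concrete, here is a hedged C sketch of one possible implementation;
the struct layout, the fixed array sizes and the use of strings for types are assumptions chosen
for brevity, not the design prescribed by these notes.
#include <stdlib.h>
#include <string.h>

typedef struct SymTab {
    struct SymTab *previous;             /* symbol table of the enclosing procedure        */
    int            width;                /* total width of the entries (set by addwidth)   */
    int            nentries;
    struct { char name[32]; char type[16]; int offset; struct SymTab *proc; } e[64];
} SymTab;

SymTab *mktable(SymTab *previous)        /* new table chained to its parent                */
{
    SymTab *t = calloc(1, sizeof *t);
    t->previous = previous;
    return t;
}

void enter(SymTab *t, const char *name, const char *type, int offset)
{
    int i = t->nentries++;               /* record name, type and relative address         */
    strcpy(t->e[i].name, name);          /* (no bounds checking in this sketch)            */
    strcpy(t->e[i].type, type);
    t->e[i].offset = offset;
    t->e[i].proc = NULL;
}

void addwidth(SymTab *t, int width)      /* store the total width in the table header      */
{
    t->width = width;
}

void enterproc(SymTab *t, const char *name, SymTab *newtable)
{
    int i = t->nentries++;               /* the entry for a nested procedure points at its */
    strcpy(t->e[i].name, name);          /* own symbol table                               */
    strcpy(t->e[i].type, "proc");
    t->e[i].offset = 0;
    t->e[i].proc = newtable;
}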
T -> record
     { t = mktable(nil);
       push(t, tblptr); push(0, offset) }
     D end
     { T.type = record(top(tblptr));
       T.width = top(offset);
       pop(tblptr); pop(offset) }
The operation lookup checks whether there is an entry for an occurrence of a name in the
symbol table. If an entry is found, a pointer to the entry is returned; otherwise nil is returned.
lookup first checks whether the name appears in the current symbol table. If
not then it looks for the name in the symbol table of the enclosing procedure and so on. The
pointer to the symbol table of the enclosing procedure is obtained from the header of the symbol
table.
CODE OPTIMIZATION
Considerations for optimization: The code produced by straightforward compiling
algorithms can often be made to run faster or take less space, or both. This improvement is
achieved by program transformations that are traditionally called optimizations. Machine-
independent optimizations are program transformations that improve the target code without
taking into consideration any properties of the target machine. Machine-dependent optimizations
are based on register allocation and utilization of special machine-instruction sequences.
- Simply stated, the best program transformations are those that yield the most benefit for
the least effort.
- First, the transformation must preserve the meaning of programs. That is, the
optimization must not change the output produced by a program for a given input, or
cause an error.
Some transformations can only be applied after detailed, often time-consuming analysis of the
source program, so there is little point in applying them to programs that will be run only a few
times.
1. Exploit the fast path in case of multiple paths for a given situation.
4. Trade off between the size of the code and the speed with which it gets executed.
5. Place code and data together whenever it is required to avoid unnecessary searching of
data/code
During code transformation in the process of optimization, the basic requirements are as follows:
Consider all that has happened up to this point in the compiling process—lexical
analysis, syntactic analysis, semantic analysis and finally intermediate-code generation. The
compiler has done an enormous amount of analysis, but it still doesn't really know how the
program does what it does. In control-flow analysis, the compiler figures out even more
information about how the program does its work, only now it can assume that there are no
syntactic or semantic errors in the code.
Now we can construct the control-flow graph between the blocks. Each basic block is a
node in the graph, and the possible different routes a program might take are the connections, i.e.
if a block ends with a branch, there will be a path leading from that block to the branch target.
The blocks that can follow a block are called its successors. There may be multiple successors or
just one. Similarly the block may have many, one, or no predecessors. Connect up the flow graph
for Fibonacci basic blocks given above. What does an if then-else look like in a flow graph?
What about a loop? You probably have all seen the gcc warning or javac error about:
"Unreachable code at line XXX." How can the compiler tell when code is unreachable?
LOCAL OPTIMIZATIONS
1. Common Sub-expression Elimination: An expression is alive if the operands used to compute
the expression have not been changed. An expression that is no longer alive is dead.
Example :
a=b*c;
d=b*c+x-y;
We can eliminate the second evaluation of b*c from this code if none of the intervening
statements has changed its value. We can thus rewrite the code as
t1=b*c;
a=t1;
d=t1+x-y;
2. Variable Propagation:
c=a*b;
x=a;
d=x*b+4;
If we replace x by a in the last statement, we can identify a*b and x*b as common sub-
expressions. This technique is called variable propagation, where the use of one variable is
replaced by another variable that has been assigned the same value.
3. Compile-Time Evaluation:
The execution efficiency of the program can be improved by shifting execution time
actions to compile time so that they are not performed repeatedly during the program execution.
We can evaluate an expression with constant operands at compile time and replace that
expression by a single value. This is called folding. Consider the following statement:
a= 2*(22.0/7.0)*r;
Here, we can perform the computation 2*(22.0/7.0) at compile time itself.
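After folding, the statement would become, approximately,
a = 6.285714 * r;
so the division and one multiplication disappear from the generated code.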
4. Code Movement:
The motivation for performing code movement in a program is to improve the execution time of
the program by reducing the evaluation frequency of expressions. This can be done by moving
the evaluation of an expression to other parts of the program. Let us consider the bellow code:
If(a<10)
{
b=x^2-y^2;
}
else
{
b=5;
a=( x^2-y^2)*10;
}
The expression x^2 - y^2 appears in both branches of the conditional. We can optimize
the code by moving its evaluation outside the conditional, as follows:
t= x^2-y^2;
If(a<10)
{
b=t;
}
else
{
b=5;
a=t*10;
}
5. Strength Reduction:
In the frequency reduction transformation we tried to reduce the execution frequency of
expressions by moving code. There is another class of transformations which performs
equivalent actions indicated in the source program by reducing the strength of operators. By
strength reduction, we mean replacing a high-strength operator with a low-strength operator
without affecting the program meaning. Consider the example below:
i=1;
while (i<10)
{
y=i*4;
i=i+1;
}
This can be replaced by:
i=1;
t=4;
while (i<10)
{
y=t;
t=t+4;
i=i+1;
}
Here the high-strength operator * is replaced with +.
Let's consider a global common sub-expression elimination optimization as our example.
Careful analysis across blocks can determine whether an expression is alive on entry to a block.
Such an expression is said to be available at that point. Once the set of available expressions is
known, common sub-expressions can be eliminated on a global basis. Each block is a node in
the flow graph of a program. The successor set (succ(x)) for a node x is the set of all nodes that x
directly flows into. The predecessor set (pred(x)) for a node x is the set of all nodes that flow
directly into x. An expression is defined at the point where it is assigned a value and killed when
one of its operands is subsequently assigned a new value. An expression is available at some
point p in a flow graph if every path leading to p contains a prior definition of that expression
which is not subsequently killed. Let us define the sets used in this data-flow analysis:
avail[B] = set of expressions available on entry to block B
exit[B] = set of expressions available on exit from B
avail[B] = ∩ exit[x]: x ∈ pred[B] (i.e. B has available the intersection of the exit of its
predecessors)
killed[B] = set of the expressions killed in B
defined[B] = set of expressions defined in B
exit[B] = avail[B] - killed[B] + defined[B]
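A compact C sketch of how these equations might be solved iteratively is shown below.
Representing each expression set as a 64-bit mask, and the particular block layout used here,
are assumptions made purely for illustration.
#include <stdint.h>

#define MAX_BLOCKS 64

typedef uint64_t ExprSet;              /* expressions are numbered 0..63; one bit per expression */

typedef struct {
    ExprSet defined;                   /* expressions evaluated in the block and not killed later */
    ExprSet killed;                    /* expressions whose operands are redefined in the block   */
    int     npred;
    int     pred[MAX_BLOCKS];          /* indices of predecessor blocks                           */
    ExprSet avail, exit;               /* results of the analysis                                 */
} Block;

/* Solve  avail[B] = intersection of exit[P] over P in pred[B]
 *        exit[B]  = (avail[B] - killed[B]) + defined[B]
 * by iterating until no exit set changes.  Block 0 is the entry block. */
void available_expressions(Block *b, int nblocks, ExprSet universe)
{
    for (int i = 0; i < nblocks; i++)
        b[i].exit = universe;          /* optimistic initialization                               */
    b[0].avail = 0;                    /* nothing is available on entry to the program            */
    b[0].exit  = b[0].defined;

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 1; i < nblocks; i++) {
            ExprSet in = universe;
            for (int p = 0; p < b[i].npred; p++)
                in &= b[b[i].pred[p]].exit;
            ExprSet out = (in & ~b[i].killed) | b[i].defined;
            if (out != b[i].exit || in != b[i].avail) {
                b[i].avail = in;
                b[i].exit  = out;
                changed = 1;
            }
        }
    }
}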
First, divide the code below into basic blocks. Now calculate the available expressions for each
block. Then find an expression available in a block and eliminate it as a global common
sub-expression. What common sub-expression can you share between the two blocks? What
changes if the code contains the additional line marked below?
main:
BeginFunc 28;
b=a+2;
c=4*b;
tmp1 = b < c ;
IfNZ tmp1 Goto L1 ;
b=1;
z = a + 2 ; <========= an additional line here
L1:
d=a+2;
EndFunc ;
MACHINE OPTIMIZATIONS
In final code generation, there is a lot of opportunity for cleverness in generating
efficient target code. In this pass, specific machine features (specialized instructions, hardware
pipeline abilities, register details) are taken into account to produce code optimized for the
particular target architecture.
REGISTER ALLOCATION:
One machine optimization of particular importance is register allocation, which is
perhaps the single most effective optimization for all architectures. Registers are the fastest kind
of memory available, but as a resource, they can be scarce.
The problem is how to minimize traffic between the registers and what lies beyond them
in the memory hierarchy to eliminate time wasted sending data back and forth across the bus and
the different levels of caches. A naïve back-end (such as the Decaf compiler used in some
compiler courses) uses a very simple and inefficient means of assigning registers: it just fills
them before performing an operation and spills them right afterwards.
A much more effective strategy would be to consider which variables are more heavily
in demand and keep those in registers and spill those that are no longer needed or won't be
needed until much later.
One common register allocation technique is called "register coloring", after the central
idea to view register allocation as a graph coloring problem. If we have 8 registers, then we try to
color a graph with eight different colors. The graph's nodes are made of "webs" and the arcs are
determined by calculating interference between the webs. A web represents a variable's
definitions, places where it is assigned a value (as in x = ...), and the possible different uses of
those definitions (as in y = x + 2). This problem, in fact, can be approached as another graph.
The definition and uses of a variable are nodes, and if a definition reaches a use, there is an arc
between the two nodes. If two portions of a variable's definition-use graph are unconnected, then
we have two separate webs for a variable. In the interference graph for the routine, each node is a
web. We seek to determine which webs don't interfere with one another, so we know we can use
the same register for those two variables. For example, consider the following code:
i = 10;
j = 20;
x = i + j;
y = j + k;
We say that i interferes with j because at least one pair of i's definitions and uses is
separated by a definition or use of j; thus, i and j are "alive" at the same time. A variable is alive
between the time it has been defined and that definition's last use, after which the variable is
dead. If two variables interfere, then we cannot use the same register for each. But two variables
that do not interfere can occupy the same register, since their live ranges do not overlap.
Once we have the interference graph constructed, we r-color it so that no two adjacent nodes
share the same color (r is the number of registers we have, each color represents a different
register).
We may recall that graph-coloring is NP-complete, so we employ a heuristic rather than
an optimal algorithm. Here is a simplified version of something that might be used:
1. Find the node with the least neighbors. (Break ties arbitrarily.)
2. Remove it from the interference graph and push it onto a stack
3. Repeat steps 1 and 2 until the graph is empty.
4. Now, rebuild the graph as follows:
a. Take the top node off the stack and reinsert it into the graph
b. Choose a color for it based on the color of any of its neighbors presently in the graph,
rotating colors in case there is more than one choice.
c. Repeat a , and b until the graph is either completely rebuilt, or there is no color
available to color the node.
If we get stuck, then the graph may not be r-colorable; we could try again with a different
heuristic, say reusing colors as often as possible. If there is no other choice, we have to spill a
variable to memory.
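The heuristic above might be coded roughly as follows in C. This sketch always assigns the
lowest unused colour instead of rotating colours, uses a simple adjacency matrix, and marks
spilled webs with -1; all of these are simplifying assumptions rather than the prescribed method.
#include <stdbool.h>

#define MAX_WEBS 128

/* adj[i][j] is true if webs i and j interfere; color[i] is the register assigned to web i,
 * or -1 if the web must be spilled.  r is the number of available registers (colours). */
void color_interference_graph(bool adj[MAX_WEBS][MAX_WEBS], int n, int r, int color[MAX_WEBS])
{
    int  stack[MAX_WEBS], top = 0;
    bool removed[MAX_WEBS] = { false };

    /* Simplify: repeatedly remove the node with the fewest remaining neighbours. */
    for (int k = 0; k < n; k++) {
        int best = -1, bestdeg = MAX_WEBS + 1;
        for (int i = 0; i < n; i++) {
            if (removed[i]) continue;
            int deg = 0;
            for (int j = 0; j < n; j++)
                if (!removed[j] && adj[i][j]) deg++;
            if (deg < bestdeg) { bestdeg = deg; best = i; }
        }
        removed[best] = true;
        stack[top++] = best;
    }

    /* Select: pop nodes and give each the lowest colour not used by a reinserted neighbour. */
    for (int i = 0; i < n; i++) color[i] = -1;
    while (top > 0) {
        int v = stack[--top];
        bool used[MAX_WEBS] = { false };
        for (int j = 0; j < n; j++)
            if (adj[v][j] && color[j] >= 0) used[color[j]] = true;
        int c = 0;
        while (c < r && used[c]) c++;
        color[v] = (c < r) ? c : -1;       /* -1 means no colour was available: spill this web */
    }
}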
INSTRUCTION SCHEDULING:
PEEPHOLE OPTIMIZATIONS:
Peephole optimization is a pass that operates on the target assembly and only considers a
few instructions at a time (through a "peephole"), attempting simple, machine-dependent
improvements such as removing redundant loads and stores, eliminating jumps to jumps, or
replacing a sequence of instructions with a shorter or faster equivalent.
Abstract Syntax Tree/DAG: An abstract syntax tree is the condensed form of a parse tree, and a
DAG is more compact than an abstract syntax tree because common sub-expressions are
eliminated. A syntax tree depicts the natural hierarchical structure of a source program; its
structure has already been discussed in earlier lectures. DAGs are generated as a combination of
trees: operands that are being reused are linked together, and nodes may be annotated with
variable names (to denote assignments). This way, DAGs are highly compact, since they
eliminate local common sub-expressions. On the other hand, they are not so easy to optimize,
since they are more specific tree forms. However, proper building of a DAG for a given
expression requires checking whether an equivalent node already exists before a new one is
created.
As an example of a syntax tree and a DAG, consider the assignment:
a := b * -c + b * -c
You can see that the node " * " comes only once in the DAG as well as the leaf " b ", but the
meaning conveyed by both the representations (AST as well as the DAG) remains the same.
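As a sketch (the temporary names are chosen here only for illustration), a tree-based translation
of this assignment and the DAG-based one might be:
From the syntax tree:
t1 = -c
t2 = b * t1
t3 = -c
t4 = b * t3
t5 = t2 + t4
a = t5
From the DAG, where the node for -c and the node for b * (-c) are shared:
t1 = -c
t2 = b * t1
t5 = t2 + t2
a = t5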
IMPORTANT QUESTIONS:
1. What is Code optimization? Explain the objectives of it. Also discuss Function preserving
transformations with your own examples?
2. Explain the following optimization techniques
(a) Copy Propagation
(b) Dead-Code Elimination
(c) Code Motion
(d) Reduction in Strength.
4. Explain the principal sources of code-improving transformations.
5. What do you mean by machine dependent and machine independent code optimization?
Explain about machine dependent code optimization with examples.
ASSIGNMENT QUESTIONS:
UNIT-V
FLOW GRAPHS :
We can add control-flow information to the set of basic blocks making up a program by
constructing a directed graph called a flow graph. The nodes of a flow graph are the basic
blocks. One node is distinguished as initial; it is the block whose leader is the first statement.
There is a directed edge from block B1 to block B2 if B2 can immediately follow B1 in some
execution sequence; that is, if
- there is a conditional or unconditional jump from the last statement of B1 to the first
statement of B2, or
- B2 immediately follows B1 in the order of the program, and B1 does not end in an
unconditional jump. We say that B1 is a predecessor of B2, and B2 is a successor of B1.
We wish to determine for each three-address statement x := y op z, what the next uses of
x, y and z are. We collect next-use information about names in basic blocks. If the name in a
register is no longer needed, then the register can be assigned to some other name. This idea of
keeping a name in storage only if it will be used subsequently can be applied in a number of
contexts. It is used to assign space for attribute values.
The simple code generator applies it to register assignment. Our algorithm to determine
next uses makes a backward pass over each basic block, recording (in the symbol table) for each
name x whether x has a next use in the block and, if not, whether it is live on exit from that block.
We can assume that all non-temporary variables are live on exit and all temporary variables are
dead on exit.
1. Attach to statement i the information currently found in the symbol table regarding the
next use and live ness of x, y and z.
2. In the symbol table, set x to "not live" and "no next use".
3. In the symbol table, set y and z to "live" and the next uses of y and z to i. Note that the
order of steps (2) and (3) may not be interchanged because x may be y or z.
If three-address statement i is of the form x := y or x := op y, the steps are the same as above,
ignoring z. Consider the example below:
1: t1 = a * a
2: t2 = a * b
3: t3 = 2 * t2
4: t4 = t1 + t3
5: t5 = b * b
6: t6 = t4 + t5
7: X = t6
Example :
We can allocate storage locations for temporaries by examining each in turn and
assigning a temporary to the first location in the field for temporaries that does not contain a live
temporary. If a temporary cannot be assigned to any previously created location, add a new
location to the data area for the current procedure. In many cases, temporaries can be packed into
registers rather than memory locations, as in the next section.
Example: The six temporaries in the basic block above can be packed into two locations. These
locations correspond to t1 and t2; for instance, the last two statements become:
6: t1 = t1 + t2
7: X = t1
Data-flow analysis is needed for global code optimization. For example: Is a variable live on
exit from a block? Does a definition reach a certain point in the code? Data-flow equations are
used to collect data-flow information. A typical data-flow equation has the form
out[S] = gen[S] ∪ (in[S] - kill[S])
The notion of generation and killing depends on the data-flow problem to be solved. Let us first
consider reaching-definitions analysis for structured programs. A definition of a variable x is a
statement that assigns, or may assign, a value to x. An assignment to x is an unambiguous
definition of x. An ambiguous assignment to x can be an assignment through a pointer, or a
function call where x is passed by reference. When x is defined, we say the definition is
generated. An unambiguous definition of x kills all other definitions of x. When all definitions of
x are the same at a certain point, we can use this information to do some optimizations. For
example, if all definitions of x that reach a use define x to be 1, then by constant propagation the
statement z = x * y becomes z = 1 * y, which strength reduction simplifies to z = y.
So far we were only considering making changes within one basic block. With some
additional analysis, we can apply similar optimizations across basic blocks, making them global
optimizations. It's worth pointing out that global in this case does not mean across the entire
program. We usually only optimize one function at a time. Interprocedural analysis is an even
larger task, one not even attempted by some compilers. The additional analysis the optimizer
must do to perform optimizations across basic blocks is called data-flow analysis. Data-flow
analysis is much more complicated than control-flow analysis.
Let's consider a global common sub-expression elimination optimization as our example.
Careful analysis across blocks can determine whether an expression is alive on entry to a block.
Such an expression is said to be available at that point.
Once the set of available expressions is known, common sub-expressions can be
eliminated on a global basis. Each block is a node in the flow graph of a program. The successor
set (succ(x)) for a node x is the set of all nodes that x directly flows into. The predecessor set
(pred(x)) for a node x is the set of all nodes that flow directly into x. An expression is defined at
the point where it is assigned a value and killed when one of its operands is subsequently
assigned a new value. An expression is available at some point p in a flow graph if every path
leading to p contains a prior definition of that expression which is not
subsequently killed.
BeginFunc 28;
b=a+2;
c=4*b;
tmp1 = b < c;
ifNZ tmp1 goto L1 ;
b=1;
L1:
d=a+2;
EndFunc ;
First, divide the code above into basic blocks. Now calculate the available expressions
for each block. Then find an expression available in a block and eliminate it as a global
common sub-expression.
What common subexpression can you share between the two blocks? What if the above
code were:
main:
BeginFunc 28;
b=a+2;
c=4*b;
tmp1 = b < c ;
IfNZ tmp1 Goto L1 ;
b=1;
z = a + 2 ; <========= an additional line here
L1:
d=a+2;
EndFunc ;
main()
{
int x, y, z;
x = (1+20)* -x;
y = x*x+(x/y);
y = z = (x/y)/(x*x);
}
straight translation:
tmp1 = 1 + 20 ;
tmp2 = -x ;
x = tmp1 * tmp2 ;
tmp3 = x * x ;
tmp4 = x / y ;
y = tmp3 + tmp4 ;
tmp5 = x / y ;
tmp6 = x * x ;
z = tmp5 / tmp6 ;
y=z;
What sub-expressions can be eliminated? How can valid common sub-expressions (live ones) be
determined? Here is an optimized version, after constant folding and propagation and elimination
of common sub-expressions:
tmp2 = -x ;
x = 21 * tmp2 ;
tmp3 = x * x ;
tmp4 = x / y ;
y = tmp3 + tmp4 ;
tmp5 = x / y ;
z = tmp5 / tmp3 ;
y=z;
def[B] is the set of variables assigned values in B prior to any use of that variable in B. use[B]
is the set of variables whose values may be used in B prior to any definition of the variable.
A variable comes live into a block (is in in[B]) if it is either used before redefinition in the block,
or is live coming out of the block and is not redefined in the block. A variable comes live out of
a block (is in out[B]) if and only if it is live coming into one of its successors:
in[B] = use[B] ∪ (out[B] - def[B])
out[B] = ∪ { in[S] : S ∈ succ[B] }
Note the relation to the reaching-definitions equations: the roles of in and out are interchanged.
Copy Propagation
This optimization is similar to constant propagation, but generalized to non-constant
values. If we have an assignment a = b in our instruction stream, we can replace later
occurrences of a with b (assuming there are no changes to either variable in-between). Given the
way we generate TAC code, this is a particularly valuable optimization since it is able to
eliminate a large number of instructions that only serve to copy values from one variable to
another. The first sequence below makes a copy of tmp1 in tmp2 and a copy of tmp3 in tmp4. In
the optimized version that follows it, we eliminated those unnecessary copies and propagated the
original variable into the later uses:
tmp2 = tmp1 ;
tmp3 = tmp2 * tmp1;
tmp4 = tmp3 ;
tmp5 = tmp3 * tmp2 ;
c = tmp5 + tmp4 ;
tmp3 = tmp1 * tmp1 ;
tmp5 = tmp3 * tmp1 ;
c = tmp5 + tmp3 ;
We can also drive this optimization "backwards", where we can recognize that the original
assignment made to a temporary can be eliminated in favor of direct assignment to the final goal:
tmp1 = LCall _Binky ;
a = tmp1;
tmp2 = LCall _Winky ;
b = tmp2 ;
tmp3 = a * b ;
c = tmp3 ;
a = LCall _Binky;
b = LCall _Winky;
c=a*b;
IMPORTANT QUESTIONS:
ASSIGNMENT QUESTIONS:
In final code generation, there is a lot of opportunity for cleverness in generating efficient
target code. In this pass, specific machine features (specialized instructions, hardware pipeline
abilities, register details) are taken into account to produce code optimized for this particular
architecture.
Register Allocation
i = 10;
j = 20;
x = i + j;
y = j + k;
We say that i interferes with j because at least one pair of i's definitions and uses is
separated by a definition or use of j; thus, i and j are "alive" at the same time. A variable is alive
between the time it has been defined and that definition's last use, after which the variable is
dead. If two variables interfere, then we cannot use the same register for each. But two variables
that do not interfere can occupy the same register, since their live ranges do not overlap.
Once we have the interference graph constructed, we r-color it so that no two adjacent nodes
share the same color (r is the number of registers we have, each color represents a different
register). You may recall that graph-coloring is NP-complete, so we employ a heuristic rather
than an optimal algorithm. Here is a simplified version of something that might be used:
1. Find the node with the least neighbors. (Break ties arbitrarily.)
2. Remove it from the interference graph and push it onto a stack
3. Repeat steps 1 and 2 until the graph is empty.
4. Now, rebuild the graph as follows:
a. Take the top node off the stack and reinsert it into the graph
b. Choose a color for it based on the color of any of its neighbors presently in the
graph, rotating colors in case there is more than one choice.
c. Repeat a and b until the graph is either completely rebuilt, or there is no color
available to color the node.
If we get stuck, then the graph may not be r-colorable; we could try again with a different
heuristic, say reusing colors as often as possible. If there is no other choice, we have to spill a
variable to memory.
Instruction Scheduling:
Another extremely important optimization of the final code generator is instruction
scheduling. Because many machines, including most RISC architectures, have some sort of
pipelining capability, effectively harnessing that capability requires judicious ordering of
instructions. In MIPS, each instruction is issued in one cycle, but some take multiple cycles to
complete. It takes an additional cycle before the value of a load is available and two cycles for a
branch to reach its destination, but an instruction can be placed in the "delay slot" after a branch
and executed in that slack time. The first arrangement of a set of instructions shown below
requires 7 cycles: it assumes no hardware interlock and thus explicitly stalls between the second
and third slots while the load completes, and it has a dead cycle after the branch because the
delay slot holds a noop. The rearrangement that follows executes the same instructions in 5
cycles with no dead cycles.
First arrangement (7 cycles):
lw $t2, 4($fp)
lw $t3, 8($fp)
noop                   # stall while the load of $t3 completes
add $t4, $t2, $t3
subi $t5, $t5, 1
goto L1
noop                   # wasted branch delay slot
Rearranged (5 cycles):
lw $t2, 4($fp)
lw $t3, 8($fp)
subi $t5, $t5, 1
goto L1
add $t4, $t2, $t3      # useful work placed in the delay slot
CODE GENERATION:
The code generator generates target code for a sequence of three-address statements. It
considers each statement in turn, remembering if any of the operands of the statement are
currently in registers, and taking advantage of that fact, if possible. The code-generation uses
descriptors to keep track of register contents and addresses for names.
1. A register descriptor keeps track of what is currently in each register. It is consulted whenever
a new register is needed. We assume that initially the register descriptor shows that all registers
are empty. (If registers are assigned across blocks, this would not be the case). As the code
generation for the block progresses, each register will hold the value of zero or more names at
any given time.
2. An address descriptor keeps track of the location (or locations) where the current value of the
name can be found at run time. The location might be a register, a stack location, a memory
address, or some set of these, since when copied, a value also stays where it was. This
information can be stored in the symbol table and is used to determine the accessing method for
a name.
For each X = Y op Z do:
- Generate MOV Y', L
- Generate OP Z', L
. If current value of Y and/or Z has no next use and are dead on exit from block and are in
registers, change register descriptor to indicate that they no longer contain Y and/or Z.
The code generation algorithm takes as input a sequence of three-address statements constituting
a basic block. For each three-address statement of the form x := y op z we perform the following
actions:
1. Invoke a function getreg to determine the location L where the result of the
computation y op z should be stored. L will usually be a register, but it could also be a
memory location. We shall describe getreg shortly.
2. Consult the address descriptor for y to determine y', (one of) the current location(s) of
y. Prefer the register for y' if the value of y is currently both in memory and a register. If
the value of y is not already in L, generate the instruction MOV y', L to place a copy of y
in L.
3. Generate the instruction OP z', L where z' is a current location of z. Again, prefer a
register to a memory location if z is in both. Update the address descriptor to indicate that
x is in location L. If L is a register, update its descriptor to indicate that it contains the
value of x, and remove x from all other register descriptors.
4. If the current values of y and/or z have no next uses, are not live on exit from the
block, and are in registers, alter the register descriptor to indicate that, after execution of
x := y op z, those registers no longer will contain y and/or z, respectively.
FUNCTION getreg:
1. If Y is in register (that holds no other values) and Y is not live and has no next use after
X = Y op Z
then return register of Y for L.
2. Failing (1) return an empty register
3. Failing (2) if X has a next use in the block or op requires register then get a register R, store its
content into M (by Mov R, M) and use it.
4. Else select memory location X as L
The function getreg returns the location L to hold the value of x for the assignment x := y op z.
1. If the name y is in a register that holds the value of no other names (recall that copy
instructions such as x := y could cause a register to hold the value of two or more variables
simultaneously), and y is not live and has no next use after execution of x := y op z, then return
the register of y for L. Update the address descriptor of y to indicate that y is no longer in L.
2. Failing (1), return an empty register for L if there is one.
3. Failing (2), if x has a next use in the block, or op is an operator, such as indexing, that requires
a register, find an occupied register R. Store the value of R into memory location (by MOV R,
M) if it is not already in the proper memory location M, update the address descriptor M, and
return R. If R holds the value of several variables, a MOV instruction must be generated for each
variable that needs to be stored. A suitable occupied register might be one whose datum is
referenced furthest in the future, or one whose value is also in memory.
4. If x is not used in the block, or no suitable occupied register can be found, select the memory
location of x as L.
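The following C sketch shows, under deliberately simple assumptions (single-letter names, two
registers, memory-resident operands and a naïve spill choice), how the descriptors and getreg
might cooperate; it is illustrative only, not a full implementation of the algorithm.
#include <stdio.h>

#define NREGS 2

/* regdesc[r] holds the name currently in register r (0 = empty);
 * addrdesc[name].reg is the register holding that name's value, or -1 if none. */
static char regdesc[NREGS];
static struct { int reg; } addrdesc[128];

static void init(void)
{
    for (int r = 0; r < NREGS; r++) regdesc[r] = 0;
    for (int n = 0; n < 128; n++)   addrdesc[n].reg = -1;
}

static int getreg(char y, int y_live)
{
    int ry = addrdesc[(int)y].reg;
    if (!y_live && ry >= 0 && regdesc[ry] == y)   /* 1. reuse y's register if y is dead    */
        return ry;
    for (int r = 0; r < NREGS; r++)               /* 2. otherwise use an empty register    */
        if (regdesc[r] == 0) return r;
    printf("MOV R0, %c\n", regdesc[0]);           /* 3. otherwise spill R0 (naïve victim)  */
    addrdesc[(int)regdesc[0]].reg = -1;
    regdesc[0] = 0;
    return 0;
}

/* Emit code for x := y op z and update both descriptors. */
static void gen(char x, char y, char op, char z, int y_live)
{
    int L = getreg(y, y_live);
    if (addrdesc[(int)y].reg != L)                /* MOV y', L unless y already sits in L  */
        printf("MOV %c, R%d\n", y, L);
    printf("%s %c, R%d\n", op == '-' ? "SUB" : "ADD", z, L);
    if (addrdesc[(int)y].reg == L) addrdesc[(int)y].reg = -1;  /* L no longer holds y      */
    regdesc[L] = x;
    addrdesc[(int)x].reg = L;                     /* the value of x now lives in L         */
}

int main(void)
{
    init();
    gen('t', 'a', '-', 'b', 1);   /* t := a - b  ->  MOV a, R0 ; SUB b, R0 */
    gen('u', 'a', '-', 'c', 1);   /* u := a - c  ->  MOV a, R1 ; SUB c, R1 */
    return 0;
}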
Example: consider the basic block
t1 = a - b
t2 = a - c
t3 = t1 + t2
d = t3 + t2
The code generation algorithm that we discussed would produce the code sequence shown in
the figure; shown alongside are the values of the register and address descriptors as code
generation progresses.
DAG (Directed Acyclic Graphs) are useful data structures for implementing
transformations on basic blocks. A DAG gives a picture of how the value computed by a
statement in a basic block is used in subsequent statements of the block. Constructing a DAG
from three-address statements is a good way of determining common sub-expressions
(expressions computed more than once) within a block, determining which names are used inside
the block but evaluated outside the block, and determining which statements of the block could
have their computed value used outside the block.
A DAG for a basic block is a directed acyclic graph with the following labels on nodes:
1. Leaves are labeled by unique identifiers, either variable names or constants. From the
operator applied to a name we determine whether the l-value or r-value of a name is needed;
most leaves represent r- values. The leaves represent initial values of names, and we subscript
them with 0 to avoid confusion with labels denoting "current" values of names as in (3) below.
3. Nodes are also optionally given a sequence of identifiers for labels. The intention is
that interior nodes represent computed values, and the identifiers labeling a node are deemed to
have that value.
For example, consider a three-address code fragment and its corresponding DAG (see figure).
We observe that each node of the DAG represents a formula in terms of the leaves, that is, the
values possessed by variables and constants upon entering the block. For example, the node
labeled t4 represents the formula
b[4 * i]
that is, the value of the word whose address is 4*i bytes offset from address b, which is the
intended value of t4.
S1 = 4 * i
S2 = addr(A) - 4
S3 = S2[S1]
S4 = 4 * i
S5 = addr(B) - 4
S6 = S5[S4]
S7 = S3 * S6
S8 = prod + S7
prod = S8
S9 = I + 1
I = S9
If I <= 20 goto (1)
(Rewritten from the DAG, the copies disappear: S8 and prod collapse into prod = prod + S7,
and S9 and I collapse into I = I + 1; the remaining statements are unchanged.)
We see how to generate code for a basic block from its DAG representation. The
advantage of doing so is that from a DAG we can more easily see how to rearrange the order of
the final computation sequence than we can starting from a linear sequence of three-address
statements or quadruples. If the DAG is a tree, we can generate code that we can prove is optimal
under such criteria as program length or the fewest number of temporaries used. The algorithm
for optimal code generation from a tree is also useful when the intermediate code is a parse tree.
t1 = a + b
t2 = c + d
t3 = e - t2
X = t1 - t3
Here, we briefly consider how the order in which computations are done can affect the
cost of the resulting object code. Consider the basic block above and its corresponding DAG
representation.
Rearranging the order:
If we generate code for the three-address statements using the code generation algorithm
described before, we get the code sequence as shown (assuming two registers R0 and R1 are
available, and only X is live on exit). On the other hand suppose we rearranged the order of the
statements so that the computation of t 1 occurs immediately before that of X as:
t2 = c + d
t3 = e - t2
t1 = a + b
X = t1 - t3
Then, using the code generation algorithm, we get the new code sequence as shown (again only
R0 and R1 are available). By performing the computation in this order, we have been able to
save two instructions: MOV R0, t1 (which stores the value of R0 in memory location t1) and
MOV t1, R1 (which reloads the value of t1 into register R1).
1. What is Object code? Explain about the following object code forms:
(a) Absolute machine-language
(b) Relocatable machine-language
(c) Assembly-language.
2. Explain about Generic code generation algorithm?
3. Write and explain about object code forms?
4. Explain Peephole Optimization
ASSIGNMENT QUESTIONS: