Compiler Design Notes
PART – A
UNIT – 1 (8 Hours)
Introduction, Lexical analysis: Language processors; The structure of a Compiler; The
evolution of programming languages; The science of building a Compiler; Applications
of compiler technology; Programming language basics.
Lexical analysis: The Role of Lexical Analyzer; Input Buffering; Specifications of
Tokens; Recognition of Tokens.
UNIT – 2 (6 Hours)
Syntax Analysis – 1: Introduction; Context-free Grammars; Writing a Grammar. Top-
down Parsing; Bottom-up Parsing.
UNIT – 3 (6 Hours)
Syntax Analysis – 2: Top-down Parsing; Bottom-up Parsing.
UNIT – 4 (6 Hours)
Syntax Analysis – 3: Introduction to LR Parsing: Simple LR; More powerful LR parsers
(excluding Efficient construction and compaction of parsing tables); Using ambiguous
grammars; Parser Generators.
PART – B
UNIT – 5 (7 Hours)
Syntax-Directed Translation: Syntax-directed definitions; Evaluation orders for SDDs;
Applications of syntax-directed translation; Syntax-directed translation schemes.
UNIT – 6 (6 Hours)
Intermediate Code Generation: Variants of syntax trees; Three-address code;
Translation of expressions; Control flow; Back patching; Switch statements; Procedure
calls.
UNIT – 7 (6 Hours)
Run-Time Environments: Storage Organization; Stack allocation of space; Access to
non-local data on the stack; Heap management; Introduction to garbage collection.
UNIT – 8 (7 Hours)
Code Generation: Issues in the design of Code Generator; The Target Language;
Addresses in the target code; Basic blocks and Flow graphs; Optimization of basic
blocks; A Simple Code Generator
Text Books:
1. Alfred V Aho, Monica S.Lam, Ravi Sethi, Jeffrey D Ullman: Compilers- Principles,
Techniques and Tools, 2nd Edition, Pearson Education, 2007.
(Chapters 1, 3.1 to 3.4, 4 excluding 4.7.5 and 4.7.6, 5.1 to 5.4, 6.1, 6.2, 6.4, 6.6, 6.7 to
6.9, 7.1 to 7.5, 8.1 to 8.6.)
Reference Books:
1. Charles N. Fischer, Richard J. LeBlanc, Jr.: Crafting a Compiler with C, Pearson
Education, 1991.
2. Andrew W. Appel: Modern Compiler Implementation in C, Cambridge University
Press, 1997.
3. Kenneth C Louden: Compiler Construction Principles & Practice, Cengage Learning,
1997.
COMPILER DESIGN
Preliminaries Required
• Basic knowledge of programming languages.
• Basic knowledge of DFA and NFA (FAFL concepts).
• Knowledge of a high-level programming language for the programming assignments.
Textbook:
Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”
Addison-Wesley, 2nd Edition.
Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
– Context Free Grammars
– Top-Down Parsing, LL Parsing
– Bottom-Up Parsing, LR Parsing
• Syntax-Directed Translation
– Attribute Definitions
– Evaluation of Attribute Definitions
• Semantic Analysis, Type Checking
• Run-Time Organization
• Intermediate Code Generation
UNIT I: INTRODUCTION, LEXICAL ANALYSIS
SYLLABUS:
• Lexical analysis: Language processors;
• The structure of a Compiler;
• The evolution of programming languages;
• The science of building a Compiler;
• Applications of compiler technology;
• Programming language basics.
• Lexical analysis: The Role of Lexical Analyzer;
• Input Buffering;
• Specifications of Tokens;
• Recognition of Tokens.
INTERPRETER
An interpreter directly executes the operations specified in the source program on inputs supplied by the user, instead of producing a target program.
[Figure: source program and input → interpreter → output.]
HYBRID COMPILER
A hybrid compiler combines compilation and interpretation: a translator first converts the source program into an intermediate program, which a virtual machine then runs together with the input (Java language processors work this way).
[Figure: source program → translator → intermediate program; intermediate program and input → virtual machine → output.]
COMPILERS
• A compiler is a program that takes a program written in a source language and
translates it into an equivalent program in a target language; as an important part of
this process, it reports the presence of errors in the source program to its user.
[Figure: source program → compiler (analysis + synthesis) → target program, with error messages as a side output.]
• Assignments:
• 1. What is the difference between a compiler and an interpreter?
• 2. What are the advantages of (i) a compiler over an interpreter and (ii) an interpreter over a
compiler?
Lexical Analyzer
• Lexical Analyzer reads the source program character by character and returns the
tokens of the source program.
• A Lexical Analyzer is also called a scanner.
• It reads the stream of characters making up the source program and groups the
characters into meaningful sequences called Lexemes.
• For each lexeme, it produces as output a token of the form
• <token_name, attribute_value>
• token_name is an abstract symbol that is used during syntax analysis, and
the second component attribute_value points to an entry in the symbol
table for this token.
• A token describes a pattern of characters having the same meaning in the source
program (such as identifiers, operators, keywords, numbers, delimiters and so
on).
Ex: newval := oldval + 12 => tokens: newval identifier
:= assignment operator
oldval identifier
+ add operator
12 a number
• The lexical analyzer puts information about identifiers into the symbol table (not all attributes are known at this point).
• Regular expressions are used to describe tokens (lexical constructs).
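As a concrete illustration, a token can be held in a small struct pairing the token name with its attribute. This is a minimal C sketch, not code from the text; the names TOK_ID, TOK_ASSIGN etc. are invented for the example.

#include <stdio.h>

/* Illustrative token kinds; for an identifier the attribute would be a
   symbol-table index, for a number the numeric value itself. */
enum token_name { TOK_ID, TOK_ASSIGN, TOK_PLUS, TOK_NUM };

struct token {
    enum token_name name;   /* abstract symbol used during syntax analysis */
    int attribute;          /* e.g. symbol-table entry or number value     */
};

int main(void) {
    /* token stream for: newval := oldval + 12 (symbol-table slots 0, 1) */
    struct token toks[] = {
        {TOK_ID, 0}, {TOK_ASSIGN, 0}, {TOK_ID, 1}, {TOK_PLUS, 0}, {TOK_NUM, 12}
    };
    for (unsigned i = 0; i < sizeof toks / sizeof toks[0]; i++)
        printf("<%d, %d>\n", toks[i].name, toks[i].attribute);
    return 0;
}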
Syntax Analyzer
A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the
given program.
A syntax analyzer is also called a parser.
A parse tree describes a syntactic structure.
[Figure: parse tree for newval := oldval + 12, with := at the root, newval as its left child, and + (over oldval and 12) as its right child.]
Parsing Techniques
• Depending on how the parse tree is created, there are different parsing techniques.
• These parsing techniques are categorized into two groups:
– Top-Down Parsing, Bottom-Up Parsing
• Top-Down Parsing:
– Construction of the parse tree starts at the root, and proceeds towards the
leaves.
– Efficient top-down parsers can be easily constructed by hand.
– Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL
Parsing).
• Bottom-Up Parsing:
– Construction of the parse tree starts at the leaves, and proceeds towards
the root.
– Normally efficient bottom-up parsers are created with the help of some
software tools.
– Bottom-up parsing is also known as shift-reduce parsing.
– Operator-Precedence Parsing – simple, restrictive, easy to implement
– LR Parsing – the most general form of shift-reduce parsing; LR, SLR, LALR
Semantic Analyzer
• A semantic analyzer checks the source program for semantic errors and collects
the type information for the code generation.
• Determines the meaning of the source string:
– matching of parentheses,
– matching each else with the proper if statement,
– checking the scope of operations.
• Type-checking is an important part of semantic analyzer.
• Normally semantic information cannot be represented by the context-free grammars
used in syntax analyzers.
• Context-free grammars used in the syntax analysis are integrated with attributes
(semantic rules)
– the result is a syntax-directed translation,
– Attribute grammars
• Ex:
The type of the identifier newval must match the type of the expression (oldval+12).
Intermediate Code Generation
• A compiler may produce explicit intermediate code representing the source
program.
• It is a kind of code that is easy to generate and is easily converted to target code.
• It can be in three-address code, quadruples, triples, etc.
• Ex:
newval = oldval + 1
id1 = id2 + 1 (the identifiers are replaced by their symbol-table entries)
PARALLELISM
• Parallelism can be found at several levels: at the instruction level, where multiple
operations are executed simultaneously, and at the processor level, where different
threads of the same application are run on different processors.
MEMORY HIERARCHIES
• Memory hierarchies are a response to the basic limitation that we can build very
fast storage or very large storage, but not storage that is both fast and large.
• A memory hierarchy consists of several levels of storage with different speeds and
sizes.
• A processor usually has a small number of registers holding hundreds of bytes,
several levels of caches containing kilobytes to megabytes, and finally secondary
storage that contains gigabytes and beyond.
• Correspondingly, the speed of accesses between adjacent levels of the hierarchy
can differ by two or three orders of magnitude.
• The performance of a system is often limited not by the speed of the processor but
by the performance of the memory subsystem.
• While compilers traditionally focus on optimizing processor execution, more
emphasis is now placed on making the memory hierarchy more effective.
PROGRAM TRANSLATIONS
Normally we think of compiling as the translation of a high-level language to
machine-level language, but the same technology can be applied to translate between
different kinds of languages.
The following are some of the important applications of program translation techniques:
HARDWARE SYNTHESIS
• Not only is most software written in high-level languages; even hardware designs
are mostly described in high-level hardware description languages like Verilog and
VHDL (VHSIC Hardware Description Language).
• Hardware designs are typically described at the register transfer level (RTL).
• Hardware synthesis tools translate RTL descriptions automatically into gates,
which are then mapped to transistors and eventually to a physical layout. Unlike
compilers for programming languages, this process can take many hours to
optimize the circuits.
Examples of tokens:
Token        Informal description                      Sample lexemes
if           characters i, f                           if
else         characters e, l, s, e                     else
comparison   < or > or <= or >= or == or !=            <=, !=
id           letter followed by letters and digits     pi, score, D2
number       any numeric constant                      3.14, 0.6, 20
literal      anything surrounded by " "                "total = %d\n", "core dumped"
In FORTRAN, blanks are not significant, so in
DO 5 I = 1.25      DO5I is a lexeme (the statement is an assignment to the variable DO5I)
DO 5 I = 1, 25     a do-statement
Disadvantage for the lexical analyzer: it cannot tell the two apart until it reaches the . or the , and so must look ahead.
Input Buffering
To recognize tokens, the source program must be read from the hard disk. Accessing
the disk for every character is time consuming, so special buffering techniques have
been developed to reduce the amount of overhead required.
- One such technique is a two-buffer scheme in which the two halves are alternately
reloaded.
- The size of each half is N (the size of a disk block), e.g. 4096 bytes.
– One read command is used to read N characters.
– If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file.
A sentinel is a special character that cannot be part of the source program; eof is used as
the sentinel.
• Two pointers to the input are maintained
– Pointer lexemeBegin, marks the beginning of the current lexeme, whose
extent we are attempting to determine.
– Pointer forward scans ahead until a pattern match is found.
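A minimal C sketch of the sentinel idea described above; the buffer layout follows the two-buffer scheme, but reload(), the sentinel value and the tiny test input are illustrative stand-ins.

#include <stdio.h>

#define N 4096                 /* size of one buffer half (one disk block) */
#define SENTINEL '\0'          /* stands in for the eof sentinel character */

static char buf[2 * N + 2];    /* two halves, each followed by a sentinel  */
static char *forward = buf;

/* Pretend to read N characters into one half and mark its end. */
static void reload(char *half) { half[N] = SENTINEL; }

/* Advance forward with a single sentinel test per character; only when the
   sentinel is seen do we pay the extra test "which half just ended?".    */
static char next_char(void) {
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward == buf + N + 1) {            /* end of first half      */
            reload(buf + N + 1); forward = buf + N + 1; c = *forward++;
        } else if (forward == buf + 2 * N + 2) { /* end of second half     */
            reload(buf); forward = buf; c = *forward++;
        }
        /* otherwise the sentinel marks the real end of the input */
    }
    return c;
}

int main(void) {
    reload(buf);                          /* first half "loaded"            */
    buf[0] = 'a'; buf[1] = SENTINEL;      /* input shorter than N: eof here */
    printf("%c\n", next_char());          /* prints a                       */
    return 0;
}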
Terminology of Languages
• Alphabet: a finite set of symbols (e.g. the ASCII characters)
• String:
– a finite sequence of symbols over an alphabet
– sentence and word are also used in place of string
– ε is the empty string
– |s| is the length of string s
• Language: a set of strings over some fixed alphabet
– the empty set is a language.
– {ε}, the set containing only the empty string, is a language
– the set of well-formed C programs is a language
– the set of all possible identifiers is a language.
• Operations on strings:
– Concatenation: xy represents the concatenation of strings x and y. sε = s and εs = s
– sⁿ = ss...s (n times), s⁰ = ε
Example
• L1 = {a,b,c,d}   L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}   (concatenation)
• L1 ∪ L2 = {a,b,c,d,1,2}   (union)
Regular Expressions
• We use regular expressions to describe tokens of a programming language.
• A regular expression is built up of simpler regular expressions (using defining
rules)
• Each regular expression denotes a language.
• A language denoted by a regular expression is called a regular set.
• (r)+ = (r)(r)*
• (r)? = (r) | ε
• Ex:
– Σ = {0,1}
– 0|1 => {0,1}
– (0|1)(0|1) => {00,01,10,11}
– 0* => {ε, 0, 00, 000, 0000, ...}
– (0|1)* => all strings of 0s and 1s, including the empty string
Regular Definitions
• Ex: Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter ( letter | digit )*
– If we try to write the regular expression representing identifiers without
using regular definitions, that regular expression will be complex.
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*
• Ex: Unsigned numbers in Pascal
digit → 0 | 1 | ... | 9
digits → digit+
opt-fraction → ( . digits )?
opt-exponent → ( E (+|-)? digits )?
unsigned-num → digits opt-fraction opt-exponent
Our current goal is to perform the lexical analysis needed for the following grammar.
Recall that the terminals are the tokens, the nonterminals produce terminals.
digit → [0-9]
digits → digit+
number → digits (. digits)? (E[+-]? digits)?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
On the board show how this can be done with just REs.
ws → ( blank | tab | newline )+
where blank, tab, and newline are symbols used to represent the corresponding ASCII
characters.
Recall that the lexer will be called by the parser when the latter needs a new token. If the
lexer then recognizes the token ws, it does not return it to the parser but instead goes on
to recognize the next token, which is then returned. Note that you can't have two
consecutive ws tokens in the input because, for a given token, the lexer will match the
longest lexeme starting at the current position that yields this token.
For the parser, all the relational ops are to be treated the same so they are all the same
token, relop. Naturally, other parts of the compiler, for example the code generator, will
need to distinguish between the various relational ops so that appropriate code is
generated. Hence, they have distinct attribute values.
Specification of Token
To specify tokens Regular Expressions are used.
Recognition of Token
To recognize tokens there are two steps: first the tokens are specified (with regular expressions), and then transition diagrams that recognize them are built.
Transition Diagrams
A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for
each possible token. It shows the decisions that must be made based on the input seen.
The two main components are circles representing states (think of them as decision
points of the lexer) and arrows representing edges (think of them as the decisions made).
1. The double circles represent accepting or final states at which point a lexeme has
been found. There is often an action to be done (e.g., returning the token), which
is written to the right of the double circle.
2. If we have moved one (or more) characters too far in finding the token, one (or
more) stars are drawn.
3. An imaginary start state exists and has an arrow coming from it to indicate where
to begin the process.
It is fairly clear how to write code corresponding to this diagram. You look at the first
character, if it is <, you look at the next character. If that character is =, you return
(relop,LE) to the parser. If instead that character is >, you return (relop,NE). If it is
another character, return (relop,LT) and adjust the input buffer so that you will read this
character again since you have not used it for the current lexeme. If the first character
was =, you return (relop,EQ).
• Each state represents a condition that could occur during the process of
scanning the input looking for a lexeme that matches one of several
patterns.
• Edges are directed from one state of the transition diagram to another.
[Figure: transition diagram for relop; from the start state, < leads to a state with further branches on =, > or any other character.]
We will continue to assume that the keywords are reserved, i.e., may not be used as
identifiers. (What if this is not the case, as in PL/I, which had no reserved words? Then
the lexer does not distinguish between keywords and identifiers and the parser must.)
We will use the method mentioned last chapter and have the keywords installed into the
identifier table prior to any invocation of the lexer. The table entry will indicate that the
entry is a keyword.
gettoken() examines the lexeme and returns the token name, either id or a name
corresponding to a reserved keyword.
The text also gives another method to distinguish between identifiers and keywords.
So far we have transition diagrams for identifiers (this diagram also handles keywords)
and the relational operators. What remains are whitespace, and numbers, which are
respectively the simplest and most complicated diagrams seen so far.
Recognizing Whitespace
The diagram itself is quite simple reflecting the simplicity of the corresponding regular
expression.
• The delim in the diagram represents any of the whitespace characters, say space,
tab, and newline.
• The final star is there because we needed to find a non-whitespace character in
order to know when the whitespace ends and this character begins the next token.
• There is no action performed at the accepting state. Indeed, the lexer does not
return to the parser, but starts again from its beginning as it still must find the next
token.
Recognizing Numbers
This certainly looks formidable, but it is not that bad; it follows from the regular
expression.
In class go over the regular expression and show the corresponding parts in the diagram.
When an accepting state is reached, action is required but is not shown on the diagram.
Just as identifiers are stored in an identifier table and a pointer is returned, there is a
corresponding number table in which numbers are stored. These numbers are needed
when code is generated. Depending on the source language, we may wish to indicate in
the table whether this is a real or an integer. A similar, but more complicated, transition
diagram could be produced if the language permitted complex numbers as well.
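As a hedged sketch (written from the regular definition digits (. digits)? (E (+|-)? digits)? rather than copied from any book's diagram), the accepting action for numbers might look like this in C; scan_number is an invented name:

#include <ctype.h>
#include <stdio.h>

/* Scan digits (. digits)? (E (+|-)? digits)? and return how many characters
   were consumed, so the caller can install the lexeme in the number table. */
int scan_number(const char *s) {
    int i = 0;
    if (!isdigit((unsigned char)s[i])) return 0;      /* must start with digit */
    while (isdigit((unsigned char)s[i])) i++;         /* digits                */
    if (s[i] == '.' && isdigit((unsigned char)s[i+1])) {
        i++;                                          /* optional fraction     */
        while (isdigit((unsigned char)s[i])) i++;
    }
    if (s[i] == 'E') {                                /* optional exponent     */
        int j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (isdigit((unsigned char)s[j])) {           /* exponent needs digits */
            i = j;
            while (isdigit((unsigned char)s[i])) i++;
        }                                             /* else: retract to i    */
    }
    return i;                                         /* length of the lexeme  */
}

int main(void) {
    printf("%d\n", scan_number("3.14E+2;"));          /* prints 7 */
    return 0;
}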
The idea is that we write a piece of code for each decision diagram. I will show the one
for relational operations below. This piece of code contains a case for each state, which
typically reads a character and then goes to the next case depending on the character read.
The numbers in the circles are the names of the cases.
Accepting states often need to take some action and return to the parser. Many of these
accepting states (the ones with stars) need to restore one character of input. This is called
retract() in the code.
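One possible shape of that code for the relop diagram, as a C sketch; the state numbers are illustrative, and nextchar(), retract() and fail() are given toy definitions here so the sketch is self-contained:

#include <stdio.h>

enum relop { LT, LE, NE, EQ, GT, GE };

static const char *src = "<=";     /* sample input for the sketch */
static int pos = 0;
static int nextchar(void) { return src[pos++]; }
static void retract(void) { pos--; }       /* give one character back */
static int fail(void) { return 0; }        /* this diagram does not match */

/* One case per state of the relop transition diagram; starred (retracting)
   states call retract() before accepting. */
static int relop_token(enum relop *attr) {
    int state = 0, c;
    for (;;) switch (state) {
    case 0:
        c = nextchar();
        if (c == '<') { state = 1; break; }
        if (c == '=') { *attr = EQ; return 1; }
        if (c == '>') { state = 6; break; }
        return fail();
    case 1:                                    /* seen '<' */
        c = nextchar();
        if (c == '=') { *attr = LE; return 1; }
        if (c == '>') { *attr = NE; return 1; }
        retract(); *attr = LT; return 1;       /* starred state */
    case 6:                                    /* seen '>' */
        c = nextchar();
        if (c == '=') { *attr = GE; return 1; }
        retract(); *attr = GT; return 1;       /* starred state */
    default:
        return fail();
    }
}

int main(void) {
    enum relop r;
    if (relop_token(&r)) printf("(relop, %d)\n", r);   /* LE for "<=" */
    return 0;
}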
What should the code for a particular diagram do if at one state the character read is not
one of those for which a next state has been defined? That is, what if the character read is
not the label of any of the outgoing arcs? This means that we have failed to find the token
corresponding to this diagram.
The code calls fail(). This is not an error case. It simply means that the current input does
not match this particular token. So we need to go to the code section for another diagram
after restoring the input pointer so that we start the next diagram at the point where this
failing diagram started. If we have tried all the diagrams, then we have a real failure and
need to print an error message and perhaps try to repair the input.
Note that the order the diagrams are tried is important. If the input matches more than one
token, the first one tried will be chosen.
1. Unlike the method above, which tries the diagrams one at a time, the first new
method tries them in parallel. That is, each character read is passed to each
diagram (that hasn't already failed). Care is needed when one diagram has
accepted the input, but others still haven't failed and may accept a longer prefix of
the input.
2. The final possibility discussed, which appears to be promising, is to combine all
the diagrams into one. That is easy for the example we have been considering because
all the diagrams begin with different characters being matched. Hence we just have
one large start with multiple outgoing edges. It is more difficult when there is a
character that can begin more than one diagram.
Introduction
• Syntax Analyzer creates the syntactic structure of the given source program.
• This syntactic structure is mostly a parse tree.
• Syntax Analyzer is also known as parser.
• The syntax of a programming language is described by a context-free grammar (CFG). We
will use BNF (Backus-Naur Form) notation in the description of CFGs.
• The syntax analyzer (parser) checks whether a given source program satisfies the
rules implied by a context-free grammar or not.
– If it satisfies, the parser creates the parse tree of that program.
– Otherwise the parser gives the error messages.
• A context-free grammar
– gives a precise syntactic specification of a programming language.
– the design of the grammar is an initial phase of the design of a compiler.
– a grammar can be directly converted into a parser by some tools.
• Parser works on a stream of tokens.
• The smallest item is a token.
[Fig: position of the parser in the compiler model: the lexical analyzer passes tokens to the parser, which builds the parse tree for the rest of the front end; both phases consult the SYMBOL TABLE.]
• We categorize the parsers into two groups:
1. Top-Down Parser
– the parse tree is created top to bottom, starting from the root.
2. Bottom-Up Parser
– the parse tree is created bottom to top, starting from the leaves.
• Both top-down and bottom-up parsers scan the input from left to right (one
symbol at a time).
• Efficient top-down and bottom-up parsers can be implemented only for sub-
classes of context-free grammars.
– LL for top-down parsing
– LR for bottom-up parsing
Syntax Error Handling
• Common Programming errors can occur at many different levels.
1. Lexical errors: include misspelling of identifiers, keywords, or operators.
2. Syntactic errors : include misplaced semicolons or extra or missing braces.
• Error-correcting algorithms exist: given an incorrect input string x and grammar G,
they will find a parse tree for a related string y,
• such that the number of insertions, deletions and changes of tokens required to
transform x into y is as small as possible.
• It is too costly to implement in terms of time and space, so these techniques are
only of theoretical interest.
Context-Free Grammars
• Inherently recursive structures of a programming language are defined by a
context-free grammar.
• In a context-free grammar, we have:
– A finite set of terminals (in our case, this will be the set of tokens)
– A finite set of non-terminals (syntactic-variables)
– A finite set of productions rules in the following form
• A → α where A is a non-terminal and
α is a string of terminals and non-terminals (including the empty
string)
– A start symbol (one of the non-terminal symbols)
NOTATIONAL CONVENTIONS
1. Symbols used for terminals are :
Lower case letters early in the alphabet (such as a, b, c, . . .)
Operator symbols (such as +, *, . . . )
Punctuation symbols (such as parenthesis, comma and so on)
The digits (0…9)
Boldface strings and keywords (such as id or if) each of which represents
a single terminal symbol
2. Symbols used for non terminals are:
Uppercase letters early in the alphabet (such as A, B, C, …)
The letter S, which when it appears is usually the start symbol.
Lowercase, italic names (such as expr or stmt).
3. Lowercase Greek letters, for example α, β, γ, represent (possibly empty)
strings of grammar symbols.
Example: using above notations list out terminals, non terminals and start symbol in
the following example
E → E+T | E–T | T
T → T*F | T/F | F
F → ( E ) | id
Here the terminals are +, -, *, /, (, ), id
The non-terminals are E, T, F
The start symbol is E
Derivations
E ⇒ E+E
• E+E derives from E
– we can replace E by E+E
– to be able to do this, we have to have a production rule E → E+E in our
grammar.
E ⇒ E+E ⇒ id+E ⇒ id+id
• A sequence of replacements of non-terminal symbols is called a derivation of
id+id from E.
• In general a derivation step is
αAβ ⇒ αγβ if there is a production rule A → γ in our grammar,
where α and β are arbitrary strings of terminal and non-terminal symbols
α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1, or α1 derives αn)
⇒ : derives in one step
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps
CFG – Terminology
• L(G) is the language of G (the language generated by G) which is a set of
sentences.
• A sentence of L(G) is a string of terminal symbols of G.
• If S is the start symbol of G then
ω is a sentence of L(G) iff S ⇒+ ω where ω is a string of terminals of G.
• If G is a context-free grammar, L(G) is a context-free language.
• Two grammars are equivalent if they produce the same language.
• S ⇒* α
– If α contains non-terminals, it is called a sentential form of G.
– If α does not contain non-terminals, it is called a sentence of G.
Derivation Example
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
OR
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
• At each derivation step, we can choose any of the non-terminals in the sentential
form of G for the replacement.
• If we always choose the left-most non-terminal in each derivation step, this
derivation is called a left-most derivation.
• If we always choose the right-most non-terminal in each derivation step, this
derivation is called a right-most derivation.
Left-Most and Right-Most Derivations
Left-Most Derivation
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
Right-Most Derivation
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
• We will see that the top-down parsers try to find the left-most derivation of the
given source program.
• We will see that the bottom-up parsers try to find the right-most derivation of the
given source program in the reverse order.
Parse Tree
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.
Ambiguity (cont.)
[Figure: the two parse trees (1 and 2) for if E1 then if E2 then S1 else S2.]
• We prefer the second parse tree (the else matches the closest if).
• So, we have to disambiguate our grammar to reflect this choice.
• The unambiguous grammar will be:
stmt → matchedstmt | unmatchedstmt
Left-Recursion – Problem
• A grammar may not be immediately left-recursive, but it still can be
left-recursive.
• By just eliminating the immediate left-recursion, we may not get
a grammar which is not left-recursive.
S → Aa | b
A → Sc | d      This grammar is not immediately left-recursive, but it is still left-recursive:
S ⇒ Aa ⇒ Sca, or
A ⇒ Sc ⇒ Aac, causes a left-recursion
• So, we have to eliminate all left-recursions from our grammar
Eliminate Left-Recursion – Algorithm
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
-   for j from 1 to i-1 do {
      replace each production
      Ai → Aj γ
      by
      Ai → α1 γ | ... | αk γ
      where Aj → α1 | ... | αk
    }
-   eliminate the immediate left-recursions among the Ai productions
  }
Example2:
S → Aa | b
A → Ac | Sd | f
- Order of non-terminals: A, S
for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A:
A → SdA' | fA'
A' → cA' | ε
for S:
- Replace S → Aa with S → SdA'a | fA'a
So, we will have S → SdA'a | fA'a | b
- Eliminate the immediate left-recursion in S:
S → fA'aS' | bS'
S' → dA'aS' | ε
So, the resulting equivalent grammar which is not left-recursive is:
S → fA'aS' | bS'
S' → dA'aS' | ε
A → SdA' | fA'
A' → cA' | ε
Problems of left recursion
1. S → S(S)S | ε
2. S → S+S | SS | (S) | S* | a
3. S → SS+ | SS* | a
4. bexpr → bexpr or bterm | bterm
bterm → bterm and bfactor | bfactor
bfactor → not bfactor | (bexpr) | true | false
5. S → (L) | a,   L → L,S | S
Left-Factoring
• A predictive parser (a top-down parser without backtracking) insists that the
grammar must be left-factored.
Consider the following grammar and find a new equivalent grammar suitable for predictive parsing:
stmt → if expr then stmt else stmt |
       if expr then stmt
• when we see if, we cannot know which production rule to choose to re-write stmt
in the derivation.
• In general,
A → αβ1 | αβ2 where α is non-empty and the first symbols
of β1 and β2 (if they have one) are different.
• when processing α we cannot know whether to expand
A to αβ1 or
A to αβ2
• But, if we re-write the grammar as follows
A → αA'
A' → β1 | β2 so, we can immediately expand A to αA'
Left-Factoring – Algorithm
• For each non-terminal A with two or more alternatives (production rules) with a
common non-empty prefix, let us say
A → αβ1 | ... | αβn | γ1 | ... | γm
convert it into
A → αA' | γ1 | ... | γm
A' → β1 | ... | βn
Left-Factoring – Example1
A → abB | aB | cdg | cdeB | cdfB
Factoring out the common prefix a, and then the common prefix cd, gives:
A → aA' | cdA''
A' → bB | B
A'' → g | eB | fB
Predictive Parser
• When re-writing a non-terminal in a derivation step, a predictive parser can
uniquely choose a production rule by just looking at the current symbol in the input
string.
A → α1 | ... | αn      input: ... a .......
(a is the current token)
example
stmt → if ...... |
       while ...... |
       begin ...... |
       for ......
• When we are trying to expand the non-terminal stmt, if the current token is if we
have to choose the first production rule.
• When we are trying to expand the non-terminal stmt, we can uniquely choose the
production rule by just looking at the current token.
• Even after we eliminate the left recursion in the grammar and left factor it, it may
not be suitable for predictive parsing (i.e. it may not be an LL(1) grammar).
Non-Recursive Predictive Parsing -- LL(1) Parser
• Non-Recursive predictive parsing is a table-driven parser.
• It is a top-down parser.
• It is also known as LL(1) Parser.
LL(1) Parser
input buffer
– our string to be parsed. We will assume that its end is marked with a
special symbol $.
output
– a production rule representing a step of the derivation sequence (left-most
derivation) of the string in the input buffer.
stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end marker symbol $.
– initially the stack contains only the symbol $ and the starting symbol S
($S = initial stack).
– when the stack is emptied (i.e. only $ is left in the stack), the parsing is
completed.
parsing table
– a two-dimensional array M[A, a]
– each row is a non-terminal symbol
– each column is a terminal symbol or the special symbol $
each entry holds a production rule.
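To make the scheme concrete, here is a minimal C sketch of the LL(1) parsing loop for the expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id used in the examples below. The parsing table is hard-coded in M(); the single-character encoding ('e' for E', 't' for T', 'i' for id) is invented for the sketch. The productions it prints form the left-most derivation of the input.

#include <stdio.h>
#include <string.h>

/* Parsing table M[X, a]; returns the RHS to push, "" for epsilon,
   NULL for an error entry. */
static const char *M(char X, char a) {
    switch (X) {
    case 'E': return (a=='i'||a=='(') ? "Te"  : NULL;
    case 'e': return a=='+' ? "+Te" : (a==')'||a=='$') ? "" : NULL;
    case 'T': return (a=='i'||a=='(') ? "Ft"  : NULL;
    case 't': return a=='*' ? "*Ft" : (a=='+'||a==')'||a=='$') ? "" : NULL;
    case 'F': return a=='i' ? "i" : a=='(' ? "(E)" : NULL;
    default:  return NULL;
    }
}

int main(void) {
    const char *input = "i+i*i$";
    char stack[100] = "$E";              /* bottom marker $, start symbol on top */
    int top = 1, ip = 0;
    while (stack[top] != '$') {
        char X = stack[top], a = input[ip];
        if (X == a) { top--; ip++; }               /* match a terminal   */
        else if (strchr("EeTtF", X)) {             /* expand nonterminal */
            const char *rhs = M(X, a);
            if (!rhs) { printf("error at %c\n", a); return 1; }
            printf("%c -> %s\n", X, *rhs ? rhs : "eps");
            top--;                                 /* pop X              */
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[++top] = rhs[k];             /* push RHS reversed  */
        } else { printf("error\n"); return 1; }
    }
    puts(input[ip] == '$' ? "accept" : "error");
    return 0;
}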
Constructing LL(1) Parsing Tables
• Two functions are used in the construction of LL(1) parsing tables:
– FIRST and FOLLOW
• FIRST(α) is the set of terminal symbols which occur as first symbols in strings
derived from α, where α is any string of grammar symbols.
• If α derives ε, then ε is also in FIRST(α).
• FOLLOW(A) is the set of terminals which occur immediately after (follow)
the non-terminal A in strings derived from the starting symbol:
– a terminal a is in FOLLOW(A) if S ⇒* αAaβ
– $ is in FOLLOW(A) if S ⇒* αA
Compute FIRST for Any String X
• If X is a terminal symbol, FIRST(X) = {X}.
• If X is a non-terminal symbol and X → ε is a production rule, then ε is in FIRST(X).
• If X is a non-terminal symbol and X → Y1Y2..Yn is a production rule:
if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j=1,...,i-1, then a is in FIRST(X);
if ε is in all FIRST(Yj) for j=1,...,n, then ε is in FIRST(X).
• If X is ε, FIRST(X) = {ε}.
• If X is a string Y1Y2..Yn, the same two rules compute FIRST(X) from
FIRST(Y1), ..., FIRST(Yn).
Compute FOLLOW (for non-terminals)
• If S is the start symbol, $ is in FOLLOW(S).
• If A → αBβ is a production rule, everything in FIRST(β) except ε is in FOLLOW(B).
• If A → αB is a production rule, or A → αBβ is a production rule and ε is in
FIRST(β), then everything in FOLLOW(A) is in FOLLOW(B).
We apply these rules until nothing more can be added to any FOLLOW set.
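The phrase "until nothing more can be added" describes a fixed-point iteration. The following compact C sketch runs the analogous iteration for the FIRST sets of the expression grammar of the example below; the one-character-per-symbol encoding with '#' standing for ε is invented for the sketch.

#include <stdio.h>

/* Productions as "A=rhs": E->TE', E'->+TE'|eps, T->FT', T'->*FT'|eps,
   F->(E)|id.  'e' is E', 't' is T', 'i' is id, '#' is eps. */
static const char *prods[] = {"E=Te", "e=+Te", "e=#", "T=Ft",
                              "t=*Ft", "t=#", "F=(E)", "F=i"};
static const char NT[] = "EeTtF", TERM[] = "+*()i#";
static unsigned first[5];              /* bitmask over TERM per nonterminal */

static int nt(char c){ for(int i=0;NT[i];i++) if(NT[i]==c) return i; return -1; }
static int tm(char c){ for(int i=0;TERM[i];i++) if(TERM[i]==c) return i; return -1; }

int main(void) {
    int changed = 1;
    while (changed) {                  /* iterate until nothing is added */
        changed = 0;
        for (int p = 0; p < 8; p++) {
            int A = nt(prods[p][0]), all_eps = 1;
            unsigned add = 0;
            for (const char *s = prods[p] + 2; *s && all_eps; s++) {
                int B = nt(*s);
                if (B < 0) {           /* terminal (or the eps marker) */
                    add |= 1u << tm(*s); all_eps = 0;
                } else {               /* nonterminal: FIRST(B) minus eps */
                    add |= first[B] & ~(1u << tm('#'));
                    all_eps = (first[B] >> tm('#')) & 1u;
                }
            }
            if (all_eps) add |= 1u << tm('#');   /* every Yi derives eps */
            if ((first[A] | add) != first[A]) { first[A] |= add; changed = 1; }
        }
    }
    for (int i = 0; i < 5; i++) {      /* '#' prints where eps is in the set */
        printf("FIRST(%c) = {", NT[i]);
        for (int j = 0; TERM[j]; j++)
            if (first[i] >> j & 1) printf(" %c", TERM[j]);
        printf(" }\n");
    }
    return 0;
}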
3. If X is a non-terminal, the parser looks at the parsing table entry M[X, a]. If M[X, a] holds a production
rule X → Y1Y2...Yk, it pops X from the stack and pushes Yk, Yk-1, ..., Y1 onto the stack. The
parser also outputs the production rule X → Y1Y2...Yk to represent a step of the derivation.
4. none of the above: error
– all empty entries in the parsing table are errors.
– If X is a terminal symbol different from a, this is also an error case.
stack   input   output
$aB     ba$     B → bB
$aBb    ba$
$aB     a$      B → ε
$a      a$
$       $       accept, successful completion
Outputs: S → aBa   B → bB   B → bB   B → ε
Derivation (left-most): S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba
Example2
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
Soln:
FIRST Example
FIRST(F) = {(, id}      FIRST(TE') = {(, id}
FIRST(T') = {*, ε}      FIRST(+TE') = {+}
FIRST(T) = {(, id}      FIRST(ε) = {ε}
FIRST(E') = {+, ε}      FIRST(FT') = {(, id}
FIRST(E) = {(, id}      FIRST(*FT') = {*}
FIRST((E)) = {(}        FIRST(id) = {id}
FOLLOW Example
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
FOLLOW(E) = {$, )}
FOLLOW(E') = {$, )}
FOLLOW(T) = {+, ), $}
FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}
Parse of the input "id" (end of the trace):
stack      input    output
$E'T       id$      T → FT'
$E'T'F     id$      F → id
$E'T'id    id$
$E'T'      $        T' → ε
$E'        $        E' → ε
$          $        accept
Construct the predictive parser LL (1) for the following grammar and parse the
given string
1. S → S(S)S | ε with the string "( ( ) ( ) )"
2. S → +SS | *SS | a with the string "+*aaa"
3. S → aSbS | bSaS | ε with the string "aabbbab"
4. bexpr → bexpr or bterm | bterm
bterm → bterm and bfactor | bfactor
bfactor → not bfactor | (bexpr) | true | false
string "not (true or false)"
5. S → 0S1 | 01 string "00011"
6. S → aB | aC | Sd | Se
B → bBc | f
C → g
7. P → Ra | Qba
R → aba | caba | Rbc
Q → bbc | bc
string "cababca"
8. S → PQR
P → a | Rb | ε
Q → c | dP | ε
R → e | f
string "adeb"
9. E → E+T | T
T → id | id[ ] | id[X]
X → E, E | E
string "id[id]"
10. S → (A) | 0
A → SB
B → ,SB | ε
string "(0, (0,0))"
11. S → a | n | (T)
T → T,S | S
strings "(a,(a,a))" and "((a,a), n, (a), a)"
LL (1) Grammars
• A grammar whose parsing table has no multiply-defined entries is said to be an
LL(1) grammar. The name means: input scanned from Left to right, Leftmost
derivation produced, and 1 input symbol used as a look-ahead to determine the parser action.
• The parsing table of a grammar may contain more than one production rule in an entry. In
this case, we say that it is not an LL(1) grammar.
A Grammar which is not LL (1)
S → iCtSE | a
E → eS | ε
C → b
FIRST(iCtSE) = {i}     FOLLOW(S) = {$, e}
FIRST(a) = {a}         FOLLOW(E) = {$, e}
FIRST(eS) = {e}        FOLLOW(C) = {t}
FIRST(ε) = {ε}
FIRST(b) = {b}
       a         b        e            i              t        $
S      S → a                           S → iCtSE
E                         E → eS                               E → ε
                          E → ε
C                C → b
two production rules for M[E, e]
Problem: ambiguity
• What do we have to do it if the resulting parsing table contains multiply defined
entries?
– If we didn’t eliminate left recursion, eliminate the left recursion in the
grammar.
– If the grammar is not left factored, we have to left factor the grammar.
– If it’s (new grammar’s) parsing table still contains multiply defined
entries, that grammar is ambiguous or it is inherently not a LL(1)
grammar.
• A left-recursive grammar cannot be an LL(1) grammar:
– A → Aα | β
any terminal that appears in FIRST(β) also appears in FIRST(Aα),
because Aα ⇒ βα.
If β is ε, any terminal that appears in FIRST(α) also appears in
FIRST(Aα) and FOLLOW(A).
• A grammar that is not left-factored cannot be an LL(1) grammar:
– A → αβ1 | αβ2
any terminal that appears in FIRST(αβ1) also appears in
FIRST(αβ2).
• An ambiguous grammar cannot be a LL (1) grammar.
Error recovery with sync entries (grammar: S → AbS | e | ε, A → a | cAd):
       a          b      c          d      e        $
S      S → AbS    sync   S → AbS    sync   S → e    S → ε
A      A → a      sync   A → cAd    sync   sync     sync
Eg: input string "aab"
stack   input   output
$S      aab$    S → AbS
$SbA    aab$    A → a
$Sba    aab$
$Sb     ab$     Error: missing b, inserted
$S      ab$     S → AbS
$SbA    ab$     A → a
$Sba    ab$
$Sb     b$
$S      $       S → ε
$       $       accept
Bottom-Up Parsing
• A bottom-up parser creates the parse tree of the given input starting from leaves
towards the root.
• A bottom-up parser tries to find the right-most derivation of the given input in the
reverse order.
S ⇒ ... ⇒ ω    (the right-most derivation of ω)
⇐              (the bottom-up parser finds the right-most derivation in the reverse
order)
• Bottom-up parsing is also known as shift-reduce parsing because its two main
actions are shift and reduce.
– At each shift action, the current symbol in the input string is pushed to a
stack.
– At each reduction step, the symbols at the top of the stack (this symbol
sequence is the right side of a production) will be replaced by the non-
terminal at the left side of that production.
– There are also two more actions: accept and error.
Shift-Reduce Parsing
• A shift-reduce parser tries to reduce the given input string into the starting
symbol.
(the given string is reduced, step by step, to the starting symbol)
• At each reduction step, a substring of the input matching to the right side of a
production rule is replaced by the non-terminal at the left side of that production
rule.
• If the substring is chosen correctly, the right most derivation of that string is
created in the reverse order.
Rightmost Derivation: S ⇒* ω
Example
S → aABb        input string: aaabb
A → aA | a      aaAbb
B → bB | b      aAbb        (reductions)
                aABb
                S
Read in reverse, these reductions trace out the right-most derivation
S ⇒ aABb ⇒ aAbb ⇒ aaAbb ⇒ aaabb.
A handle is a substring matching the right side of a production whose reduction is a
step of such a derivation: S ⇒* αAω ⇒ αβω, where β is the handle.
A Shift-Reduce Parser
Consider the following grammars and parse the respective strings using a shift-
reduce parser.
(1) E → E+T | T
T → T*F | F
F → (E) | id        string: "id + id * id"
Here we follow 2 rules:
1. If the incoming operator has higher priority than the operator on top of the
stack, shift.
2. If the operator on top of the stack has the same or higher priority than the
incoming operator, reduce.
(2) S → TL;
T → int | float
L → L, id | id
String "int id, id;": do shift-reduce parsing.
(3) S → (L) | a
L → L,S | S
String "(a,(a,a))": do shift-reduce parsing.
Shift-Reduce Parsers
• There are two main categories of shift-reduce parsers:
1. Operator-Precedence Parser
– simple, but handles only a small class of grammars.
2. LR-Parsers
LR Parsers
• LR-Parsers
– cover a wide range of grammars.
– SLR – simple LR parser
– LR – the most general LR parser (canonical LR)
– LALR – intermediate LR parser (look-ahead LR parser)
– SLR, LR and LALR work the same way (they use the same driver algorithm);
only their parsing tables are different.
LR Parsing Algorithm
[Figure: an LR parser has an input buffer a1 ... ai ... an $, a stack holding s0 X1 s1 ... Xm sm (grammar symbols Xi interleaved with state numbers si), an output stream, and a parsing table in two parts: an ACTION table indexed by states and terminals (including $), each entry holding one of four actions, and a GOTO table indexed by states and non-terminals, each entry holding a state number.]
Actions of an LR-Parser
1. shift s -- shifts the next input symbol and the state s onto the stack:
( s0 X1 s1 ... Xm sm, ai ai+1 ... an $ )  =>  ( s0 X1 s1 ... Xm sm ai s, ai+1 ... an $ )
2. reduce A → β:
• pop 2r items from the stack, where r = |β| and β = Y1Y2...Yr
• then push A and s, where s = goto[sm-r, A]:
( s0 X1 s1 ... Xm-r sm-r Y1 ... Yr sm, ai ai+1 ... an $ )  =>  ( s0 X1 s1 ... Xm-r sm-r A s, ai ai+1 ... an $ )
• In fact, Y1Y2...Yr is a handle:
X1 ... Xm-r A ai ... an $  <=  X1 ... Xm Y1...Yr ai ai+1 ... an $
3. accept -- parsing is successfully completed.
4. error -- the parser detected an error (an empty entry in the action table).
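A minimal C sketch of this driver loop, run on a hand-built SLR table for the tiny grammar 1: S → (S), 2: S → x. The table and its encoding are illustrative; a real table would come from the constructions described next.

#include <stdio.h>

enum { SH = 1, RE = 2, ACC = 3 };                 /* action kinds          */
struct act { int kind, n; };                      /* n: state or production */
static const struct act ACTION[6][4] = {
    /* (        x        )        $       */
    {{SH,2}, {SH,3}, {0,0},  {0,0}},   /* 0 */
    {{0,0},  {0,0},  {0,0},  {ACC,0}}, /* 1 */
    {{SH,2}, {SH,3}, {0,0},  {0,0}},   /* 2 */
    {{0,0},  {0,0},  {RE,2}, {RE,2}},  /* 3 */
    {{0,0},  {0,0},  {SH,5}, {0,0}},   /* 4 */
    {{0,0},  {0,0},  {RE,1}, {RE,1}},  /* 5 */
};
static const int GOTO_S[6] = {1, -1, 4, -1, -1, -1}; /* goto on S only */
static const int RHS_LEN[3] = {0, 3, 1};             /* |rhs| per prod */

static int col(char c){ return c=='(' ? 0 : c=='x' ? 1 : c==')' ? 2 : 3; }

int main(void) {
    const char *in = "((x))$";
    int stack[100] = {0}, top = 0, ip = 0;        /* state stack, s0 = 0 */
    for (;;) {
        struct act a = ACTION[stack[top]][col(in[ip])];
        if (a.kind == SH) { stack[++top] = a.n; ip++; }        /* shift  */
        else if (a.kind == RE) {                               /* reduce */
            top -= RHS_LEN[a.n];                  /* pop |rhs| states    */
            stack[top + 1] = GOTO_S[stack[top]];  /* push goto[s, S]     */
            top++;
            printf("reduce by production %d\n", a.n);
        }
        else if (a.kind == ACC) { puts("accept"); return 0; }
        else { puts("error"); return 1; }
    }
}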
Constructing SLR Parsing Tables – LR(0) Item
• An LR(0) item of a grammar G is a production of G with a dot at some position of
the right side.
• Ex: A → aBb. The possible LR(0) items (four different possibilities) are:
A → .aBb
A → a.Bb
A → aB.b
A → aBb.
• Sets of LR(0) items will be the states of action and goto table of the SLR parser.
• A collection of sets of LR(0) items (the canonical LR(0) collection) is the basis
for constructing SLR parsers.
• Augmented Grammar:
G' is G with a new production rule S' → S, where S' is the new starting symbol.
The Closure Operation
• If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0)
items constructed from I by the two rules:
1. Initially, every LR(0) item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production rule of G, then B → .γ is
added to closure(I). We apply this rule until no more new LR(0) items can be added.
[Figure: the DFA of the canonical LR(0) collection for the expression grammar, with states I0 ... I11 connected by transitions on E, T, F, (, ), id, + and *.]
SLR(1) Grammar
• An LR parser using SLR(1) parsing tables for a grammar G is called as the
SLR(1) parser for G.
• If a grammar G has an SLR(1) parsing table, it is called SLR(1) grammar (or SLR
grammar in short).
• Every SLR grammar is unambiguous, but not every unambiguous grammar is an
SLR grammar.
shift/reduce and reduce/reduce conflicts
• If a state does not know whether it will make a shift operation or reduction for a
terminal, we say that there is a shift/reduce conflict.
• If a state does not know whether it will make a reduction operation using the
production rule i or j for a terminal, we say that there is a reduce/reduce conflict.
• If the SLR parsing table of a grammar G has a conflict, we say that that grammar
is not SLR grammar.
Problems on SLR
1. S → SS+ | SS* | a with the string "aa+a*"
2. S → (L) | a,   L → L,S | S
3. S → aSb | ab
4. S → aSbS | bSaS | ε
5. S → E#
E → E-T | T
T → F↑T | F
F → (E) | i
Construct the parsing table for this grammar. The table has 2 actions in one entry,
which is why it is not an SLR(1) grammar.
6. S → +SS | *SS | a with the string "+*aaa"
7. Show that the following grammar is SLR(1) but not LL(1):
S → SA | A
A → a
8. X → Xb | a; parse the string "abb"
9. Given the grammar A → (A) | a, string "((a))"
Conflict Example2
S → AaAb       I0: S' → .S
S → BbBa           S → .AaAb
A → ε              S → .BbBa
B → ε              A → .
                   B → .
Problem
FOLLOW(A) = {a, b}
FOLLOW(B) = {a, b}
on a: reduce by A → ε or reduce by B → ε   (reduce/reduce conflict)
on b: reduce by A → ε or reduce by B → ε   (reduce/reduce conflict)
In LR(1) items the lookahead resolves such choices: an item A → α. , a1/.../an calls
for reduction by A → α only when the next input symbol is one of a1, ..., an.
Canonical Collection of Sets of LR(1) Items
• The construction of the canonical collection of the sets of LR(1) items are similar
to the construction of the canonical collection of the sets of LR(0) items, except
that closure and goto operations work a little bit different.
goto operation
• If I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal),
then goto(I, X) is defined as follows:
– If A → α.Xβ, a is in I, then
every item in closure({A → αX.β, a}) will be in goto(I, X).
C is { closure({S' → .S, $}) }
repeat the following until no more sets of LR(1) items can be added to C:
for each I in C and each grammar symbol X,
if goto(I, X) is not empty and not in C,
add goto(I, X) to C
The goto function is a DFA on the sets in C.
A Short Notation for The Sets of LR(1) Items
• A set of LR(1) items containing the items
A → α.β, a1
...
A → α.β, an
can be written as
A → α.β, a1/a2/.../an
[Figure: part of the canonical LR(1) collection for the grammar S → L=R | R, L → *R | id, R → L, e.g.
I6: S → L=.R, $    R → .L, $    L → .*R, $    L → .id, $   (goto on R to I9, on L to I10, on * to I11, on id to I12)
I9: S → L=R. , $
I7: L → *R. , $/=      I13: L → *R. , $
I8: R → L. , $/=       I10: R → L. , $
I11: L → *.R, $        I12: L → id. , $
The parsing table built from these sets (with entries such as r3, r5, s11, s12) has no conflicting entries, so it is an LR(1) grammar. Note the pairs of states with the same core: I4 and I11, I5 and I12, I7 and I13, I8 and I10.]
• In fact, the number of the states of the LALR parser for a grammar will be equal
to the number of states of the SLR parser for that grammar.
• Find each core; find all sets having that same core; replace those sets having same
cores with a single set which is their union.
Shift/Reduce Conflict
• We claim that we cannot introduce a shift/reduce conflict during the shrinking process
for the creation of the states of a LALR parser.
• Assume the contrary, that we can introduce a shift/reduce conflict. In this case, a state
of the LALR parser must have an item A → α. , a (calling for a reduction on a) and an
item B → β.aγ, b (calling for a shift on a). Since the cores are unchanged by merging,
some LR(1) state with this same core already had both items, and hence already had
the shift/reduce conflict.
[Figure: merging the same-core LR(1) states of the previous example (productions 4) L → id, 5) R → L) yields the LALR states
I411: L → *.R, $/=     I512: L → id. , $/=
I713: L → *R. , $/=    I810: R → L. , $/=
with I6: S → L=.R, $ going on R to I9: S → L=R. , $.]
• Ex: the ambiguous grammar
E → E+E | E*E | (E) | id
(compare with the unambiguous version E → E+T | T, T → T*F | F, F → (E) | id).
[Figure: the LR(0) sets of items I0 ... I9 for the ambiguous grammar. States I7 (containing E → E+E. together with E → E.+E and E → E.*E) and I8 (containing E → E*E. together with E → E.+E and E → E.*E) have shift/reduce conflicts on + and *, which are resolved using the precedence and associativity of the operators.]
PART-B
UNIT V: SYNTAX-DIRECTED DEFINITIONS
SYLLABUS:
• Syntax-directed definitions;
• Evaluation orders for SDDs;
• Applications of syntax-directed translation;
• Syntax-directed translation schemes
Overview
input → parse tree → dependency graph → attribute evaluation order
Grammar symbols are associated with attributes to associate information with
the programming language constructs that they represent.
Values of these attributes are evaluated by the semantic rules associated with the
production rules.
Evaluation of these semantic rules:
o may generate intermediate codes
o may put information into the symbol table
o may perform type checking
o may issue error messages
o may perform some other activities
o in fact, they may perform almost any activities.
An attribute may hold almost anything.
o a string, a number, a memory location, a complex record.
Attributes for expressions:
type of value: int, float, double, char, string,…
type of construct: variable, constant, operations, …
Attributes for constants: values
Attributes for variables: name, scope
o Attributes for operations: arity, operands, operator,…
When we associate semantic rules with productions, we use two notations:
o Syntax-Directed Definitions
o Translation Schemes
Syntax-Directed Definitions:
o give high-level specifications for translations
o Hide many implementation details such as order of evaluation of semantic
actions.
o We associate a production rule with a set of semantic actions, and we do
not say when they will be evaluated.
Translation Schemes:
o Indicate the order of evaluation of semantic actions associated with a
production rule.
o In other words, translation schemes give a little bit information about
implementation details.
Syntax directed definition (SDD) :
To translate a programming language construct, the compiler has to keep track of
many quantities, such as the type of the construct, the location of the first instruction
in the target code, or the number of instructions generated.
For the non-terminals E, T and F the values can be obtained using the attribute val.
The token digit has the synthesized attribute lexval.
In S → E N, the symbol S is the start symbol. This rule is used to print the final answer
of the expression.
Following steps are followed to Compute S attributed definition.
1. Write the SDD using the appropriate semantic actions for corresponding
production rule of the given Grammar.
2. The annotated parse tree is generated and attribute values are computed. The
Computation is done in bottom up manner.
3. The value obtained at the root node is the final output.
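A small C sketch of step 2: the synthesized attribute val is computed bottom-up (postorder) over a hand-built tree for 5*6+7; the node layout is invented for the example.

#include <stdio.h>

struct node { char op; int val; struct node *l, *r; };

static int eval(struct node *n) {            /* postorder = bottom-up      */
    if (!n->l) return n->val;                /* leaf: digit.lexval         */
    int a = eval(n->l), b = eval(n->r);
    n->val = (n->op == '+') ? a + b : a * b; /* E.val = E1.val op T.val    */
    return n->val;
}

int main(void) {
    struct node d5 = {0,5,0,0}, d6 = {0,6,0,0}, d7 = {0,7,0,0};
    struct node mul = {'*',0,&d5,&d6}, add = {'+',0,&mul,&d7};
    printf("S prints E.val = %d\n", eval(&add));   /* 5*6+7 = 37 */
    return 0;
}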
PROBLEM 1:
Consider the string 5*6+7; Construct Syntax tree, parse tree and annotated tree.
Solution :
The corresponding annotated parse tree is shown below for the string 5*6+7.
Syntax tree: [figure omitted]
Parse tree:
[S has children E and N; E expands to E + T; the left E expands through T to T * F covering digits 5 and 6, and the right T expands through F to digit 7.]
Advantages: SDDs are more readable and hence useful for specifications
Disadvantages: not very efficient.
Ex2:
PROBLEM : Consider the grammar that is used for Simple desk calculator. Obtain
the Semantic action and also the annotated parse tree for the string
3*5+4n.
L → E n
E → E1 + T
E → T
T → T1 * F
T → F
F → (E)
F → digit
Solution :
Production rule      Semantic actions
L → E n              L.val = E.val
E → E1 + T           E.val = E1.val + T.val
E → T                E.val = T.val
T → T1 * F           T.val = T1.val * F.val
T → F                T.val = F.val
F → (E)              F.val = E.val
F → digit            F.val = digit.lexval
The corresponding annotated parse tree is shown below, for the string 3*5+4n.
Exercise :
For the SDD of the problem 1 give annotated parse tree for the Following expressions
a) (3+4)*(5+6)n
b) 1*2*3*(4+5)n
c) (9+8*(7+6)+5)*4n
Solution: a)
b)
c)
Dependency Graphs
2) Inherited attributes :
Consider an example and compute the inherited attributes, annotate the parse
tree for the computation of inherited attributes for the given string int a, b, c
Ex:
S → TL
T → int
T → float
T → char
T → double
L → L, id
L → id
The steps are to be followed are:
1) Construct the syntax directed definition using semantic action.
2) Annotate the parser tree with inherited attributes by processing in top down
fashion.
S → TL          L.inh = T.type
T → int         T.type = int
T → float       T.type = float
T → char        T.type = char
T → double      T.type = double
L → L1, id      L1.inh = L.inh
                addType(id.entry, L.inh)
L → id          addType(id.entry, L.inh)
String: int a, b, c
Ex2: PROBLEMS: consider the following context free grammar for evaluating
arithmetic expressions with operator *
T → FT'
T' → *FT'
T' → ε
F → digit
[Figure: dependency graph for a string of the form digit * digit, with attribute instances numbered 1 through 9; each digit.lexval flows into F.val, and the inherited and synthesized attributes of T' carry the running product.]
A topological sort of the above graph gives the order of evaluation of the SDD. One
topological sort order is (1,2,3,4,5,6,7,8,9); other valid topological orders exist.
Advantages: dependency graph helps in computing the order of evaluation of the
attributes
Disadvantage: it cannot give the order of evaluation of attributes if there is a cycle
formation in the graph. However, this disadvantage can be overcome by using S –
attributed and L – attributed definitions.
1. E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
Strings: (i) "id+id*id"  (ii) "(id+id*id)"
2. T → BC
B → int | float
C → [num]C | ε
Strings: (i) "int[2][3]" (note: int[2][3] should be parsed as array(2, array(3, integer)))
(ii) "float[3]"  (iii) "float[3][3][2]"
S-Attributed Definitions
• Syntax-directed definitions are used to specify syntax-directed translations.
• To create a translator for an arbitrary syntax-directed definition can be difficult.
• We would like to evaluate the semantic rules during parsing (i.e. in a single pass,
we will parse and we will also evaluate semantic rules during the parsing).
• We will look at two sub-classes of the syntax-directed definitions:
– S-Attributed Definitions: only synthesized attributes used in the syntax-
directed definitions.
– L-Attributed Definitions: in addition to synthesized attributes, we may
also use inherited attributes in a restricted fashion.
• S-attributed definitions and L-attributed definitions are easy to implement (we
can evaluate the semantic rules in a single pass during the parsing).
• Implementations of S-attributed definitions are a little bit easier than
implementations of L-attributed definitions.
L-Attributed Definitions
• S-Attributed Definitions can be efficiently implemented.
• We are looking for a larger (larger than S-Attributed Definitions) subset of
syntax-directed definitions which can be efficiently evaluated.
• L-Attributed Definitions can always be evaluated by the depth first visit of the
parse tree.
This means that they can also be evaluated during the parsing
• A syntax-directed definition is L-attributed if each inherited attribute of Xj,
where 1 ≤ j ≤ n, on the right side of A → X1X2...Xn depends only on:
1. The attributes of the symbols X1,...,Xj-1 to the left of Xj in the production
and
2. the inherited attribute of A
• Every S-attributed definition is L-attributed, the restrictions only apply to the
inherited attributes (not to synthesized attributes).
Syntax Trees
Postfix SDT's
• Leftmost: the leftmost nonterminal is always chosen for expansion at each step of
derivation.
L-attributed SDT's
• A parse tree shows a graphical depiction of a derivation.
Semantic Actions
• Ex: infix to postfix translation: a+b+c → ab+c+
Background
Eg2: syntax tree for an assignment statement; DAG for a = b*-c + b*-c
[Figure: the syntax tree for a = b*-c + b*-c duplicates the subexpression b*-c, while the DAG shares a single node for it.]
The DAG representation may expose instances where redundancies can be eliminated.
SDD to construct DAG for the expression a + a * ( b - c ) + ( b - c ) * d.
6.2.3 Triples
• A triple has only three fields: op, arg1, and arg2
• Using triples, we refer to the result of an operation x op y by its position, rather by
an explicit temporary name.
Example 6.7
Fig6.11: Representations of a = b * - c + b * - c
Sequences of Declarations
• An expression with more than one operator, like a + b * c, will translate into
instructions with at most one operator per instruction.
• An array reference A[ i ][ j ] will expand into a sequence of three-address
instructions that calculate an address for the reference.
6.4.1 Operations within Expressions
• Example 6.11:
a = b+ -c
t1 = minus c
t2 = b + t1
a = t2
Incremental Translation
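Incremental translation means emitting each three-address instruction as soon as it is produced instead of collecting long code strings in attributes. A tiny C sketch for the statement a = b + -c above; newtemp() and the printed instruction format are illustrative.

#include <stdio.h>

static int ntemp = 0;
static void newtemp(char *buf) { sprintf(buf, "t%d", ++ntemp); }

int main(void) {
    char t1[8], t2[8];
    /* each instruction is emitted the moment it is generated */
    newtemp(t1); printf("%s = minus c\n", t1);    /* unary minus       */
    newtemp(t2); printf("%s = b + %s\n", t2, t1);
    printf("a = %s\n", t2);                       /* copy into the lhs */
    return 0;
}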
Control Flow
Boolean expressions are often used to:
Alter the flow of control.
Compute logical values.
Short-Circuit Code
Flow-of-Control Statements
Syntax-directed definition
Backpatching
Previous codes for Boolean expressions insert symbolic labels for jumps
It therefore needs a separate pass to set them to appropriate addresses
We can use a technique named backpatching to avoid this
We assume we save instructions into an array and labels will be indices in the
array
For nonterminal B we use two attributes B.truelist and B.falselist together with
following functions:
makelist(i): creates a new list containing only i, an index into the array of
instructions
merge(p1, p2): concatenates the lists pointed to by p1 and p2 and returns a
pointer to the concatenated list
backpatch(p, i): inserts i as the target label for each of the instructions on
the list pointed to by p
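A minimal C sketch of the three helpers over an array of instructions; the struct layout and the target array are illustrative.

#include <stdio.h>
#include <stdlib.h>

struct node { int instr; struct node *next; };  /* one pending instruction */
static int target[100];                         /* jump target per instruction */

struct node *makelist(int i) {                  /* new list containing only i  */
    struct node *p = malloc(sizeof *p);
    p->instr = i; p->next = NULL; return p;
}
struct node *merge(struct node *p1, struct node *p2) {
    if (!p1) return p2;                         /* concatenate the two lists   */
    struct node *q = p1;
    while (q->next) q = q->next;
    q->next = p2; return p1;
}
void backpatch(struct node *p, int i) {         /* fill i in as the target     */
    for (; p; p = p->next) target[p->instr] = i;
}

int main(void) {
    /* instructions 0 and 2 are "goto _"; both get patched to target 4 */
    struct node *l = merge(makelist(0), makelist(2));
    backpatch(l, 4);
    printf("instr 0 -> goto %d, instr 2 -> goto %d\n", target[0], target[2]);
    return 0;
}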
Flow-of-Control Statements
Translation of a switch-statement
Compiler must do the storage allocation and provide access to variables and data
Memory management
Stack allocation
Heap management
Garbage collection
Storage Organization
Activation records
Procedure calls and returns are usually managed by a run-time stack called the
control stack.
Each live activation has an activation record (sometimes called a frame).
The root of the activation tree is at the bottom of the stack.
The current execution path specifies the content of the stack, with the record of the
most recent activation at the top of the stack.
Activation Record
Temporary values
Local data
A saved machine status
An “access link”
A control link
Space for the return value of the called function
The actual parameters used by the calling procedure
Elements in the activation record:
temporary values that could not fit into registers
local variables of the procedure
saved machine status for the point at which this procedure was called,
including the return address and the contents of registers to be restored.
access link to activation record of previous block or procedure in lexical
scope chain
control link pointing to the activation record of the caller
space for the return value of the function, if any
actual parameters (or they may be placed in registers, if possible)
ML
ML is a functional language
Variables are defined, and have their unchangeable values initialized, by a statement
of the form:
val (name) = (expression)
Functions are defined using the syntax:
fun (name) ( (arguments) ) = (body)
For function bodies we shall use let-statements of the form:
let (list of definitions) in (statements) end
A version of quicksort, in ML style, using nested functions
Memory Manager
Two basic functions:
Allocation
Deallocation
Properties of memory managers:
Space efficiency
Program efficiency
Low overhead
Typical Memory Hierarchy Configurations
Locality in Programs
The conventional wisdom is that programs spend 90% of their time executing 10% of the
code:
Programs often contain many instructions that are never executed.
Only a small fraction of the code that could be invoked is actually executed in a
typical run of the program.
The typical program spends most of its time executing innermost loops and tight
recursive cycles in a program.
• The quality of the generated code is usually determined by its speed and size.
• A given IR program can be implemented by many different code sequences, with
significant cost differences between the different implementations.
• A naïve translation of the intermediate code may therefore lead to correct but
unacceptably inefficient target code.
• For example, use INC for a = a+1 instead of
LD R0,a
ADD R0, R0, #1
ST a, R0
• We need to know instruction costs in order to design good code sequences but,
unfortunately, accurate cost information is often difficult to obtain.
Register Allocation
• A key problem in code generation is deciding what values to hold in what
registers.
• Efficient utilization is particularly important.
• The use of registers is often subdivided into two subproblems:
1. Register Allocation, during which we select the set of variables that will
reside in registers at each point in the program.
2. Register assignment, during which we pick the specific register that a
variable will reside in.
• Finding an optimal assignment of registers to variables is difficult, even with
single-register machine.
• Mathematically, the problem is NP-complete.
Register pairs (even/odd numbered) for some operands & results
• The multiplication instruction is of the form M x, y where x, the multiplicand, is the
even register of an even/odd register pair and y is the multiplier. The product
occupies the entire even/odd register pair.
• The division instruction is of the form D x, y where the dividend occupies an
even/odd register pair whose even register is x, and the divisor is y. After division,
the even register holds the remainder and the odd register the quotient.
Example: for t := a * b, the multiplicand a is loaded into the odd register of an even/odd pair, say L R1, a, and then M R0, b leaves the product in the pair (R0, R1).
Evaluation Order
• The order in which computations are performed can affect the efficiency of the
target code.
• Some computation orders require fewer registers to hold intermediate results than
others.
• However, picking a best order in the general case is a difficult NP-complete
problem.
• Code optimization:
– A transformation to a program to make it run faster and/or take up less
space
– Optimization should be safe, preserve the meaning of a program.
– Code optimization is an important component in a compiler
– Examples :
• Flow-of-control optimization:
goto L1                      goto L2
…                    ==>     …
L1: goto L2                  L1: goto L2
• Algebraic simplification:
x := x + 0
x := x * 1     == nop
• Reduction in strength:
x^2 ==> x * x
x * 4 ==> x << 2
• Instruction selection
Sometimes some hardware instructions can implement certain operation efficiently.
• Code optimization can either be high level or low level:
– High level code optimizations:
– Register allocation
– Common subexpression elimination
– Code motion
– Strength reduction
– Induction variable elimination
– Dead code elimination
– Branch chaining
– Jump elimination
– Instruction scheduling
– Procedure inlining
– Loop unrolling
– Loop fusing
– Code hoisting
• Instruction selection:
– Using a more efficient instruction to replace a sequence of instructions
(space and speed).
– Example: the sequence
Mov R2, (R3)
Add R2, #1, R2
Mov (R3), R2
can be replaced by
Add (R3), 1, (R3)
• Register allocation: allocate variables to registers (speed)
ALGEBRAIC TRANSFORMATION
FLOW GRAPHS
We can add the flow-of-control information to the set of basic blocks making
up a program by constructing a directed graph called a flow graph. The nodes of the flow
graph are the basic blocks. One node is distinguished as initial; it is the block whose
leader is the first statement. There is a directed edge from block B1 to block B2 if B2 can
immediately follow B1 in some execution sequence; that is, if
1. there is a conditional or unconditional jump from the last statement of B1 to the
first statement of B2, or
2. B2 immediately follows B1 in the order of the program, and B1 does not end in an
unconditional jump.
We say B1 is a predecessor of B2, and B2 is a successor of B1.
Example 4: The flow graph of the program of fig. 7 is shown in fig. 9; B1 is the initial
node.

B1:  prod := 0
     i := 1

B2:  t1 := 4 * i
     t2 := a[t1]
     t3 := 4 * i
     t4 := b[t3]
     t5 := t2 * t4
     t6 := prod + t5
     prod := t6
     t7 := i + 1
     i := t7
     if i <= 20 goto B2
LOOPS
A loop is a collection of nodes in a flow graph such that
1. all nodes in the collection are strongly connected: from any node in the loop to any
other, there is a path of length one or more, wholly within the loop, and
2. the collection of nodes has a unique entry, that is, a node in the loop such that the
only way to reach a node of the loop from a node outside the loop is to first go
through the entry.
A loop that contains no other loops is called an inner loop.
REDUNDANT LOADS AND STORES
If we see the instruction sequence

    (1) MOV R0, a
    (2) MOV a, R0

we can delete instruction (2), because whenever (2) is executed, (1) will ensure that the
value of a is already in register R0. If (2) had a label, we could not be sure that (1)
was always executed immediately before (2), and so we could not remove (2).
UNREACHABLE CODE
Another opportunity for peephole optimization is the removal of unreachable
instructions. An unlabeled instruction immediately following an unconditional jump may
be removed. This operation can be repeated to eliminate a sequence of instructions. For
example, for debugging purposes, a large program may have within it certain segments
that are executed only if a variable debug is 1. In C, the source code might look like:

    #define debug 0
    …
    if (debug) {
        print debugging information
    }

In the intermediate representation the if-statement may be translated as:

    if debug = 1 goto L1
    goto L2
    L1: print debugging information
    L2: …                                          (a)
One obvious peephole optimization is to eliminate jumps over jumps. Thus, no matter
what the value of debug, (a) can be replaced by:

    if debug ≠ 1 goto L2
    print debugging information
    L2: …                                          (b)

Because debug is the constant 0, constant propagation turns the test in (b) into 0 ≠ 1,
giving:

    if 0 ≠ 1 goto L2
    print debugging information
    L2: …                                          (c)

As the argument of the first statement of (c) evaluates to a constant true, it can be
replaced by goto L2. Then all the statements that print debugging information are
manifestly unreachable and can be eliminated one at a time.
FLOW-OF-CONTROL OPTIMIZATIONS
Unnecessary jumps can be eliminated in either the intermediate code or the target code
by the following types of peephole optimizations. We can replace the jump sequence

    goto L1
    …
    L1: goto L2

by the sequence

    goto L2
    …
    L1: goto L2
If there are now no jumps to L1, then it may be possible to eliminate the statement
L1: goto L2, provided it is preceded by an unconditional jump. Similarly, the sequence
if a < b goto L1
….
L1 : goto L2
can be replaced by
if a < b goto L2
….
L1 : goto L2
Finally, suppose there is only one jump to L1 and L1 is preceded by an unconditional
goto. Then the sequence
    goto L1
    …
    L1: if a < b goto L2
    L3: …                                          (1)

may be replaced by

    if a < b goto L2
    goto L3
    …
    L3: …                                          (2)
While the number of instructions in (1) and (2) is the same, we sometimes skip the
unconditional jump in (2), but never in (1). Thus (2) is superior to (1) in execution
time.
ALGEBRAIC SIMPLIFICATION
There is no end to the amount of algebraic simplification that can be attempted
through peephole optimization, but only a few algebraic identities occur frequently
enough that it is worth implementing them. For example, statements such as

    x := x + 0
or
    x := x * 1

are often produced by straightforward intermediate code-generation algorithms, and they
can be eliminated easily through peephole optimization.
Frequently, a program will include several calculations of the same value, such as an
offset in an array. Some of these duplicate calculations cannot be avoided by the
programmer because they lie below the level of detail accessible within the source
language. For example, block B5 shown in the figure recalculates 4*i and 4*j.
Common Subexpressions
An occurrence of an expression E is called a common subexpression if E was previously
computed, and the values of variables in E have not changed since the previous
computation. We can avoid recomputing the expression if we can use the previously
computed value. For example, the assignments to t7 and t10 have the common
subexpressions 4*i and 4*j, respectively, on the right side in Fig. They have been
eliminated in Fig. by using t6 instead of t7 and t8 instead of t10. This change is what
would result if we reconstructed the intermediate code from the DAG for the basic block.
Example: Fig shows the result of eliminating both global and local common
subexpressions from blocks B5 and B6 in the flow graph of Fig. We first discuss the
transformation of B5 and then mention some subtleties involving arrays.
For example, the statement t8 := 4*j in B5 can be replaced by t8 := t4, using t4
computed in block B3. In Fig., observe that as control passes from the evaluation of
4*j in B3 to B5, there is no change in j, so t4 can be used if 4*j is needed.
Another common subexpression comes to light in B5 after t4 replaces t8. The new
expression a[t4] corresponds to the value of a[j] at the source level. Not only does j
retain its value as control leaves B3 and then enters B5, but a[j], a value computed
into a temporary t5, does too, because there are no assignments to elements of the
array a in the interim. The statement
    t9 := a[t4]; a[t6] := t9

in B5 can therefore be replaced by

    a[t6] := t5
The expression a[t1] in blocks B1 and B6 is not considered a common subexpression,
although t1 can be used in both places. After control leaves B1 and before it reaches
B6, it can go through B5, where there are assignments to a. Hence, a[t1] may not have
the same value on reaching B6 as it did on leaving B1, and it is not safe to treat
a[t1] as a common subexpression.
Copy Propagation
Block B5 in Fig. can be further improved by eliminating x, using two new
transformations. One concerns assignments of the form f := g, called copy statements,
or copies for short. Had we gone into more detail in Example 10.2, copies would have
arisen much sooner, because the algorithm for eliminating common subexpressions
introduces them, as do several other algorithms. For example, when the common
subexpression in c := d+e is eliminated in Fig., the algorithm uses a new variable t to
hold the value of d+e. Since control may reach c := d+e either after the assignment to
a or after the assignment to b, it would be incorrect to replace c := d+e by either
c := a or by c := b.
The idea behind the copy-propagation transformation is to use g for f, wherever possible
after the copy statement f:=g. For example, the assignment x:=t3 in block B5 of Fig. is a
copy. Copy propagation applied to B5 yields:
    x := t3
    a[t2] := t5
    a[t4] := t3
    goto B2
Dead-Code Elimination
A variable is live at a point in a program if its value can be used subsequently;
otherwise it is dead at that point, and statements that compute values that never get
used are dead code. An earlier example discussed the use of a variable debug that is
set to true or false at various points in the program, and used in statements like

    if (debug) print …

By data-flow analysis, it may be possible to deduce that each time the program reaches
this statement, the value of debug is false. Usually, this is because there is one
particular statement

    debug := false

that we can deduce to be the last assignment to debug prior to the test, no matter
what sequence of branches the program actually takes. If copy propagation replaces
debug by false, then the print statement is dead because it cannot be reached. We can
eliminate both the test and the printing from the object code. More generally,
deducing at compile time that the value of an expression is a constant and using the
constant instead is known as constant folding.
One advantage of copy propagation is that it often turns the copy statement into dead
code. For example, copy propagation followed by dead-code elimination removes the
assignment to x and transforms the code above into:

    a[t2] := t5
    a[t4] := t3
    goto B2
Loop Optimizations
We now give a brief introduction to a very important place for optimizations, namely
loops, especially the inner loops where programs tend to spend the bulk of their time.
The running time of a program may be improved if we decrease the number of
instructions in an inner loop, even if we increase the amount of code outside that
loop. Three techniques are important for loop optimization: code motion, which moves
code outside a loop; induction-variable elimination, which we apply to eliminate i and
j from the inner loops B2 and B3; and reduction in strength, which replaces an
expensive operation by a cheaper one, such as a multiplication by an addition.
Code Motion
An important modification that decreases the amount of code in a loop is code motion.
This transformation takes an expression that yields the same result independent of the
number of times a loop is executed ( a loop-invariant computation) and places the
expression before the loop. Note that the notion “before the loop” assumes the existence
of an entry for the loop. For example, evaluation of limit-2 is a loop-invariant
computation in the following while-statement:
    while (i <= limit-2)

Code motion will result in the equivalent of:

    t = limit-2;
    while (i <= t)
Induction Variables and Reduction in Strength
While code motion is not applicable to the quicksort example we have been considering,
the other two transformations are. Loops are usually processed inside out. For
example, consider the loop around B3.
Note that the values of j and t4 remain in lock-step; every time the value of j
decreases by 1, that of t4 decreases by 4, because 4*j is assigned to t4. Such
identifiers are called induction variables.
When there are two or more induction variables in a loop, it may be possible to get
rid of all but one, by the process of induction-variable elimination. For the inner
loop around B3 in Fig., we cannot get rid of either j or t4 completely; t4 is used in
B3 and j in B4. However, we can illustrate reduction in strength and a part of the
process of induction-variable elimination. Eventually j will be eliminated when the
outer loop of B2–B5 is considered.
Example: As the relationship t4 := 4*j surely holds after such an assignment to t4 in
Fig., and t4 is not changed elsewhere in the inner loop around B3, it follows that
just after the statement j := j-1 the relationship t4 = 4*j - 4 must hold. We may
therefore replace the assignment t4 := 4*j by t4 := t4-4. The only problem is that t4
does not have a value when we enter block B3 for the first time. Since we must
maintain the relationship t4 = 4*j on entry to block B3, we place an initialization of
t4 at the end of the block where j itself is initialized, shown by the dashed addition
to block B1 in the second Fig.
The replacement of a multiplication by a subtraction will speed up the object code if
multiplication takes more time than addition or subtraction, as is the case on many
machines.
[Figure: the flow graph before and after strength reduction applied to 4*j in block B3]