Compiler Construction
Module 1
B.Tech. CSE, 6th Semester
Textbooks
• Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman,
"Compilers: Principles, Techniques, and Tools", Second Edition,
Pearson, 2007.
Reference Book
• Des Watson, "A Practical Approach to Compiler Construction",
First Edition, Springer, 2017.
Why Compiler?
• Computers are a balanced mix of software and hardware.
• Hardware is just electronic machinery, and its functions are
controlled by compatible software. Hardware understands
instructions in the form of electronic charge, which is the
counterpart of binary language in software programming.
• Binary language has only two symbols, 0 and 1. To instruct the
machine, hardware codes must be written in binary format, which
is simply a series of 1s and 0s.
• Writing such codes directly would be a difficult and cumbersome
task for programmers; that is why we have compilers to produce
them for us.
Introduction to Compilers
A compiler is a program that translates a program written in a
high-level source language into an equivalent program in a target
language, typically machine code.
Introduction to Compilers…
• A cross compiler runs on a machine 'A' and produces code for
another machine 'B'. It is capable of creating code for a platform
other than the one on which the compiler itself is running.
Language processing systems (using Compiler)
• We know a computer is a logical assembly of software and
hardware. The hardware understands a language that is hard for
us to grasp; consequently, we tend to write programs in a
high-level language that is much easier for us to comprehend and
remember.
• These programs then go through a series of transformations so
that they can readily be used by machines. This is where language
processing systems come in handy.
Language processing systems (using Compiler)…
• High Level Language – If a program contains pre-processor
directives such as #include or #define, it is called an HLL
program. High-level languages are closer to humans but farther
from machines. These (#) tags are called pre-processor directives;
they tell the pre-processor what to do.
• Pre-Processor – The pre-processor removes all #include directives
by including the referenced files (file inclusion) and all #define
directives using macro expansion. It performs file inclusion,
augmentation, macro-processing, etc. A small illustration follows
this list.
• Assembly Language – It is neither in binary form nor high level. It
is an intermediate state: a combination of machine instructions
and some other useful data needed for execution.
• Assembler – For every platform (hardware + OS) there is an
assembler; assemblers are not universal, since each platform has
its own. An assembler translates assembly language to machine
code, and its output is called an object file.
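A minimal sketch of what the pre-processor does, in C (the file and
macro names are illustrative assumptions):

#include <stdio.h>   /* file inclusion: the contents of stdio.h are pasted in here    */
#define PI 3.14159   /* macro definition: each use of PI below is textually replaced  */

int main(void) {
    /* after macro expansion this line reads:
       printf("Area = %f\n", 3.14159 * 2.0 * 2.0); */
    printf("Area = %f\n", PI * 2.0 * 2.0);
    return 0;
}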
Language processing systems (using Compiler)…
• Interpreter – An interpreter converts high-level language into
low-level machine language, just like a compiler, but the two differ
in how they read the input. The compiler reads the input in one
go, processes it, and translates the source code as a whole,
whereas the interpreter does the same line by line: a compiler
scans the entire program and translates it as a whole into machine
code, while an interpreter translates the program one statement at
a time. Interpreted programs are usually slower than compiled
ones.
• Relocatable Machine Code – Code that can be loaded at any
address and then run; the addresses within the program are
arranged so that the program can be moved in memory.
• Loader/Linker – These convert the relocatable code into absolute
code and try to run the program, resulting in a running program or
an error message (or sometimes both). The linker combines a
variety of object files into a single executable file; the loader then
loads it into memory and executes it.
Phases of a Compiler
We basically have two phases of compilers, namely the Analysis
phase and the Synthesis phase. The analysis phase creates an
intermediate representation from the given source code. The
synthesis phase creates an equivalent target program from that
intermediate representation.
Phases of a Compiler …
Analysis Phase – An intermediate representation is created from the
given source code by:
• Lexical Analyzer
• Syntax Analyzer
• Semantic Analyzer
• Intermediate Code Generator
• The lexical analyzer divides the program into "tokens", the syntax
analyzer recognizes "sentences" in the program using the syntax of
the language, and the semantic analyzer checks the static semantics
of each construct. The intermediate code generator then generates
"abstract" code.
Phases of a Compiler…
Lexical Analyzer
• It is also called a scanner. It takes as input the output of the
pre-processor (which performs file inclusion and macro
expansion), which is in pure high-level language.
• It reads the characters of the source program and groups them
into lexemes (sequences of characters that "go together").
• Each lexeme corresponds to a token.
• Tokens are defined by regular expressions, which are understood
by the lexical analyzer.
• It also detects lexical errors (e.g., erroneous characters) and
removes comments and white space.
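For example, for the input statement position = initial + rate * 60 ;
(a standard textbook illustration, assumed here) the scanner
produces the lexemes "position", "=", "initial", "+", "rate", "*",
"60" and ";", which correspond to tokens such as id, assign_op, id,
add_op, id, mul_op, number and semicolon.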
Phases of a Compiler…
Syntax Analyzer
• It is sometimes called a parser. It constructs the parse tree: it takes
the tokens one by one and uses a Context Free Grammar to
construct the parse tree.
• Why grammar? The rules of a programming language can be
entirely represented by a few productions. Using these productions
we can represent what the program actually is, and the input can
be checked against them to see whether it is in the desired format.
• The parse tree is also called the derivation tree. Parse trees are
generally constructed to check for ambiguity in the given grammar,
and there are certain rules associated with the derivation tree.
• Any identifier is an expression, and any number can be called an
expression. Performing any operation on given expressions always
results in an expression; for example, the sum of two expressions
is also an expression. A grammar sketch for these rules follows
below.
• The parse tree can be compressed to form a syntax tree. Syntax
errors can be detected at this level if the input is not in accordance
with the grammar.
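The expression rules just described can be written as productions of
a small expression grammar (assumed here for illustration):
expr → expr + expr
expr → expr * expr
expr → id
expr → number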
Phases of a Compiler…
• Semantic Analyzer: It verifies the parse tree, checking whether it
is meaningful or not, and produces a verified parse tree. It also
performs type checking, label checking and flow-control checking.
Phases of a Compiler…
• Code Optimizer: It transforms the code so that it consumes fewer
resources and runs faster, without altering the meaning of the code
being transformed. Optimization can be categorized into two
types: machine dependent and machine independent.
Compiler Construction Tools…
• Scanner Generator: It generates lexical analyzers from an input
that consists of regular-expression descriptions of the tokens of a
language. It generates a finite automaton to recognize the regular
expressions. Example: Lex.
Compiler Construction Tools…
• Syntax directed translation engines: These generate intermediate
code in three-address format from an input that consists of a parse
tree. They have routines to traverse the parse tree and produce the
intermediate code; each node of the parse tree is associated with
one or more translations (see the three-address example after this
list).
• Automatic code generators: These generate the machine language
for a target machine. Each operation of the intermediate language
is translated using a collection of rules and then taken as input by
the code generator. A template-matching process is used: an
intermediate-language statement is replaced by its equivalent
machine-language statement using templates.
• Data-flow analysis engines: These are used in code optimization.
Data-flow analysis is a key part of code optimization that gathers
information about how values flow from one part of a program to
another.
• Compiler construction toolkits: These provide an integrated set of
routines that aid in building compiler components or in the
construction of the various phases of a compiler.
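As referenced above, a small three-address-code illustration (a
standard textbook form, assumed here): for the source statement
a = b + c * d, a syntax directed translation engine might emit
t1 = c * d
t2 = b + t1
a  = t2
where t1 and t2 are compiler-generated temporaries.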
Symbol Table in Compiler
The symbol table is an important data structure created and
maintained by the compiler in order to keep track of the semantics
of variables, i.e. it stores information about the scope and binding of
names, and information about instances of various entities such as
variable and function names, classes, objects, etc.
• It is built in the lexical and syntax analysis phases.
• The information is collected by the analysis phases of the compiler
and used by the synthesis phases to generate code.
• It is used by the compiler to achieve compile-time efficiency.
Symbol Table in Compiler…
Items stored in Symbol table:
• Variable names and constants
• Procedure and function names
• Literal constants and strings
• Compiler generated temporaries
• Labels in source languages
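A minimal sketch of a symbol table in C (the entry layout, fixed
table size and linear lookup are illustrative assumptions; real
compilers typically use hash tables):

#include <stdio.h>
#include <string.h>

struct symbol {
    char name[32];  /* identifier name                   */
    char type[16];  /* e.g. "int", "float"               */
    int  scope;     /* nesting level (scope information) */
};

static struct symbol table[100];  /* the symbol table    */
static int count = 0;             /* entries used so far */

/* insert a new name into the table */
void insert(const char *name, const char *type, int scope) {
    strncpy(table[count].name, name, 31);
    strncpy(table[count].type, type, 15);
    table[count].scope = scope;
    count++;
}

/* look a name up; returns its index, or -1 if absent */
int lookup(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

int main(void) {
    insert("a", "int", 0);
    insert("main", "function", 0);
    printf("a found at index %d\n", lookup("a"));
    return 0;
}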
Compiler Passes
A pass is a complete traversal of the source program. A compiler
may traverse the source program once or several times.
Multi-pass Compiler
• A multi-pass compiler processes the source code of a program
several times.
• In the first pass, the compiler can read the source program, scan
it, extract the tokens and store the result in an output file.
• In the second pass, the compiler can read the output file produced
by the first pass, build the syntactic tree and perform the
syntactical analysis. The output of this phase is a file that contains
the syntactic tree.
• In the third pass, the compiler can read the output file produced
by the second pass and check whether the tree follows the rules of
the language or not. The output of the semantic analysis phase is
the annotated syntax tree.
• Passes continue like this until the target output is produced.
Compiler Passes…
One-pass Compiler
• A one-pass compiler traverses the program only once, passing
through the parts of each compilation unit a single time and
translating each part into its final machine code.
• In a one-pass compiler, when a source line is processed, it is
scanned and its tokens are extracted.
• Then the syntax of the line is analyzed and the tree structure is
built. After the semantic part, the code is generated.
• The same process is repeated for each line of code until the entire
program is compiled.
Bootstrapping
• Bootstrapping is widely used in compiler development.
• Bootstrapping is used to produce a self-hosting compiler: a
compiler that can compile its own source code.
• A bootstrap compiler is used to compile the compiler; the
compiled compiler can then compile everything else, including
future versions of itself.
Bootstrapping…
The T-diagram shows a compiler SCIT with source language S and
target language T, implemented in language I.
Bootstrapping…
3. Compile LCSA (a compiler for L written in S, targeting A) using
the compiler SCAA to obtain LCAA. LCAA is a compiler for
language L which runs on machine A and produces code for
machine A.
Introduction to Lexical Analysis
Lexical analysis is the first phase of the compiler; the lexical
analyzer is also known as a scanner. It converts the high-level input
program into a sequence of tokens.
• Lexical analysis can be implemented with deterministic finite
automata.
• The output is a sequence of tokens that is sent to the parser for
syntax analysis.
Introduction to Lexical Analysis…
What is a token?
A lexical token is a sequence of characters that can be treated as a
unit in the grammar of a programming language.
Examples of tokens:
• Type tokens (id, number, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
• Keywords: for, while, if, etc.
• Identifiers: variable names, function names, etc.
• Operators: '+', '++', '-', etc.
• Separators: ',', ';', etc.
Introduction to Lexical Analysis…
Examples of non-tokens:
Comments, preprocessor directives, macros, blanks, tabs, newlines,
etc.
Introduction to Lexical Analysis…
How the lexical analyzer functions:
• Tokenization, i.e. dividing the program into valid tokens.
• Removing white-space characters.
• Removing comments.
• It also helps in generating error messages by providing row and
column numbers.
Introduction to Lexical Analysis…
The lexical analyzer identifies errors with the help of the
automaton and the grammar of the given language on which it is
based (e.g. C, C++), and gives the row number and column number
of each error.
Introduction to Lexical Analysis…
Example 1: for the C fragment
int main() { int a, b; a = 10; return 0; }
the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b'
';' 'a' '=' '10' ';' 'return' '0' ';' '}'
Introduction to Lexical Analysis…
Example 2: a printf statement such as
printf("Hello");
contains 5 valid tokens: 'printf', '(', '"Hello"', ')' and ';'.
Input Buffering
The lexical analyzer scans the input from left to right one character
at a time. It uses two pointers, begin ptr (bp) and forward ptr (fp), to
keep track of the portion of the input scanned so far.
Initially, both pointers point to the first character of the input
string.
Input Buffering…
The forward ptr (fp) moves ahead to search for the end of the
lexeme. As soon as a blank space is encountered, it indicates the
end of the lexeme; for instance, as soon as fp encounters a blank
space after "int", the lexeme "int" is identified.
When fp encounters white space, it ignores it and moves ahead;
then both the begin ptr (bp) and the forward ptr (fp) are set at the
next token.
Input characters are thus read from secondary storage, but
reading from secondary storage character by character is costly;
hence a buffering technique is used. A block of data is first read
into a buffer and then scanned by the lexical analyzer. Two methods
are used in this context: the one-buffer scheme and the two-buffer
scheme. A sketch of the two-pointer scan follows below.
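A minimal sketch of the bp/fp two-pointer scan in C (the buffer
contents and white-space-only delimiting are illustrative
assumptions):

#include <stdio.h>

int main(void) {
    char buf[] = "int a = 10 ;";  /* block of input already read into a buffer */
    int bp = 0, fp = 0;           /* begin pointer and forward pointer         */

    while (buf[fp] != '\0') {
        /* advance fp until the end of the current lexeme */
        while (buf[fp] != ' ' && buf[fp] != '\0')
            fp++;
        printf("lexeme: %.*s\n", fp - bp, &buf[bp]);
        /* skip white space, then set both pointers at the next token */
        while (buf[fp] == ' ')
            fp++;
        bp = fp;
    }
    return 0;
}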
LEX: LEX file format
A Lex program is separated into three sections by %% delimiters.
The format of a Lex source file is as follows:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Definitions include declarations of constants, variables and regular
definitions.
Rules are statements of the form p1 {action1} p2 {action2} ... pn
{actionn}, where each pi is a regular expression and each actioni
describes the action the lexical analyzer should take when pattern pi
matches a lexeme.
User subroutines are auxiliary procedures needed by the actions.
The subroutines can be compiled separately and loaded with the
lexical analyzer.
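A minimal Lex specification in this format (the token set and the
printed messages are illustrative assumptions):

%{
#include <stdio.h>
%}
digit   [0-9]
letter  [a-zA-Z]
%%
{digit}+                    { printf("NUMBER: %s\n", yytext); }
{letter}({letter}|{digit})* { printf("ID: %s\n", yytext); }
[ \t\n]                     { /* skip white space */ }
.                           { printf("ERROR: %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }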
YACC
• YACC stands for Yet Another Compiler Compiler.
• YACC provides a tool to produce a parser for a given grammar.
• YACC is a program designed to compile a LALR(1) grammar.
• It is used to produce the source code of the syntactic analyzer of
the language produced by a LALR(1) grammar.
• The input of YACC is a rule or grammar, and the output is a C
program.
Some points about YACC:
Input: a CFG (file.y)
Output: a parser y.tab.c (yacc)
• The output file "file.output" contains the parsing tables.
• The file "file.tab.h" contains declarations.
• The parser routine is called yyparse().
• The parser expects to use a function called yylex() to get tokens.
A grammar sketch in YACC format follows below.
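A minimal YACC grammar sketch (the expression language and
token names are illustrative assumptions; yylex() would come from a
Lex-generated scanner):

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUMBER
%left '+'
%left '*'
%%
expr : expr '+' expr   { /* semantic action for addition */ }
     | expr '*' expr   { /* semantic action for multiplication */ }
     | NUMBER
     ;
%%
int main(void) { return yyparse(); }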
YACC…
[Diagram: file.y, containing the desired grammar in YACC format,
is given to YACC, which produces y.tab.c; a C compiler then turns
y.tab.c into the executable parser.]
Formal grammar
• A formal grammar is a set of rules used to decide whether strings
of tokens in a language are correct or incorrect. A formal grammar
is denoted by G.
• A formal grammar is used to generate all possible strings over the
alphabet that are syntactically correct in the language.
• Formal grammars are used mostly in the syntactic analysis phase
(parsing), particularly during compilation.
Formal grammar: Example
L = {a, b}, N = {S, R, B}
Production rules:
S → bR
R → aR
R → aB
B → b
Through these productions we can produce strings like bab, baab,
baaab, etc.
These productions describe strings of the shape b aⁿ b (n ≥ 1).
BNF Notation
BNF stands for Backus-Naur Form. It is used to write a formal
representation of a context-free grammar and to describe the syntax
of a programming language.
BNF notation is basically just a variant of context-free grammar
notation.
In BNF, productions have the form:
leftside → definition
where, in general, leftside ∈ (Vn ∪ Vt)+ and definition ∈ (Vn ∪ Vt)*;
in BNF the leftside consists of a single non-terminal.
Several productions can be defined with the same leftside; they
are separated by the vertical bar symbol "|".
For example, given the productions:
S → aSa
S → bSb
S → c
in BNF we can represent the grammar as follows:
S → aSa | bSb | c
Context Free Grammar
A context free grammar is a formal grammar which is used to
generate all possible strings in a given formal language.
A context free grammar G can be defined by four tuples as:
G = (V, T, P, S)
where
V is a finite set of non-terminal symbols,
T is a finite set of terminal symbols,
P is a set of production rules, and
S is the start symbol.
In a CFG, the start symbol is used to derive a string. You derive a
string by repeatedly replacing a non-terminal by the right-hand side
of a production, until all non-terminals have been replaced by
terminal symbols.
Context Free Grammar: Example
L = { w c wᴿ | w ∈ (a, b)* }
Production rules:
S → aSa
S → bSb
S → c
Now check that the string abbcbba can be derived from the given
CFG:
S ⇒ aSa ⇒ abSba ⇒ abbSbba ⇒ abbcbba
By applying the productions S → aSa and S → bSb recursively and
finally applying the production S → c, we get the string abbcbba.
Capabilities of CFG
• Context free grammars can describe most programming
languages.
• If the grammar is properly designed, an efficient parser can be
constructed from it automatically.
• Using associativity and precedence information, suitable
grammars for expressions can be constructed.
• Context free grammars are capable of describing nested structures
such as balanced parentheses, matching begin-end pairs,
corresponding if-then-else's, and so on.
Derivation
A derivation is a sequence of production rule applications used to
obtain the input string. During parsing we have to take two
decisions:
• We have to decide which non-terminal is to be replaced.
• We have to decide the production rule by which that non-terminal
will be replaced.
There are two common orders for deciding which non-terminal to
replace:
• Left-most Derivation: In a left-most derivation, the left-most
non-terminal of the sentential form is replaced at each step, so the
input string is effectively produced from left to right.
• Right-most Derivation: In a right-most derivation, the right-most
non-terminal of the sentential form is replaced at each step, so the
input string is effectively produced from right to left.
A worked example follows below.
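A short worked example, using a standard expression grammar
(assumed here for illustration): E → E + E | E * E | id, with input
id + id * id.
Left-most derivation:
E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
Right-most derivation:
E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id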
Ambiguity
A grammar is said to be ambiguous if some string can be derived
with more than one parse tree, i.e. with more than one left-most or
right-most derivation.
Ambiguity: Example
S → aSb | SS
S → ε
For the string aabb, the above grammar generates two parse trees:
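The two parse trees correspond to two distinct left-most derivations
of aabb:
1. S ⇒ aSb ⇒ aaSbb ⇒ aabb                  (root production S → aSb)
2. S ⇒ SS ⇒ aSbS ⇒ aaSbbS ⇒ aabbS ⇒ aabb   (root production S → SS)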