Module 1


Amity School of Engineering and Technology

Compiler Construction
Module 1
B.TECH. - CSE - Semester 6TH

Dr. Prabhishek Singh


Assistant Professor – II
Dept. of CSE, ASET, Amity University Uttar Pradesh, Noida
Index
1. Why Compiler
2. Introduction of Compiler
3. Language processing systems (using Compiler)
4. Phases of a Compiler
5. Compiler Construction Tools
6. Symbol Table in Compiler
7. Compiler Passes
8. Bootstrapping
9. Introduction of Lexical Analysis
10. Input Buffering
11. LEX
12. LEX: LEX file format
13. YACC
14. Formal grammar
15. BNF Notation
16. Context Free Grammar
17. Capabilities of CFG
18. Derivation
19. Parse Tree
20. Ambiguity
Objectives

After completing this section, you will be able to:
1. Understand the working of a compiler.
2. Understand the various processes involved in the working of a compiler.
3. Analyze and implement a lexical analyzer.

3
Textbooks
• Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D.
Ullman, “Compilers: Principles, Techniques, and Tools”,
Second Edition, Pearson Publication, 2007.

• A. A. Puntambekar, “Compiler Construction”, Technical
Publications, 2009.

Reference Book
• Des Watson, A Practical Approach to Compiler
Construction, First Edition, Springer, 2017
4
Why Compiler?
• Computers are a balanced mix of software and hardware.
• Hardware is just a machine; its functions are controlled by
compatible software. Hardware understands instructions in the form of
electronic charge, which is the counterpart of binary language in
software programming.
• Binary language has only two alphabets, 0 and 1. To instruct the
machine, the hardware codes must be written in binary format,
which is simply a series of 1s and 0s.
• It would be a difficult and cumbersome task for computer
programmers to write such codes directly; that is why we have
compilers to translate human-readable programs into them.

5
Introduction of Compiler

• A compiler is software that converts a program written in a
high-level language (the source language) into a low-level language
(the object/target/machine language).

6
Introduction of
Compiler…
• A Cross Compiler is a compiler that runs on a machine ‘A’ and
produces code for another machine ‘B’. It is capable of
creating code for a platform other than the one on which
the compiler is running.

• A Source-to-source Compiler (transcompiler or transpiler) is a
compiler that translates source code written in one programming
language into source code of another programming language.

7
Language processing
systems (using Compiler)
• We know a computer is a logical assembly of
software and hardware. The hardware knows a
language that is hard for us to grasp;
consequently we tend to write programs in a
high-level language, which is much easier for
us to comprehend and maintain.
• These programs go through a series of
transformations so that they can readily be
used by machines. This is where language
processing systems come in handy.

8
Language processing
systems (using Compiler)…
• High Level Language – If a program contains pre-processor directives
such as #include or #define, it is called HLL. HLL is closer to
humans but far from machines. These (#) tags are called pre-processor
directives; they direct the pre-processor about what to do.
• Pre-Processor – The pre-processor removes all the #include directives by
including the files (file inclusion) and all the #define directives using
macro expansion. It performs file inclusion, augmentation, macro-
processing, etc.
• Assembly Language – It is neither in binary form nor high level. It is an
intermediate state that is a combination of machine instructions and some
other useful data needed for execution.
• Assembler – For every platform (hardware + OS) we have an
assembler; assemblers are not universal, since each platform has its own.
The output of the assembler is called an object file. It translates
assembly language to machine code.

9
Language processing
systems (using Compiler)…
• Interpreter – An interpreter converts a high-level language into low-level
machine language, just like a compiler, but the two differ in the way
they read the input. The compiler reads the input in one go, does the
processing and executes the source code, whereas the interpreter does the
same line by line. A compiler scans the entire program and translates it as
a whole into machine code, whereas an interpreter translates the program
one statement at a time. Interpreted programs are usually slower than
compiled ones.
• Relocatable Machine Code – It can be loaded at any point and run;
the addresses within the program are arranged so that the program can be
moved around in memory.
• Loader/Linker – It converts the relocatable code into absolute code and
tries to run the program, resulting in a running program or an error message
(or sometimes both). The linker combines a variety of object files into a
single file to make it executable. The loader then loads it into memory and
executes it.

10
Phases of a Compiler
We basically have two phases of compilation, namely the analysis
phase and the synthesis phase.
The analysis phase creates an intermediate representation from the
given source code.
The synthesis phase creates an equivalent target program from the
intermediate representation.

11
Phases of a Compiler …
Analysis Phase – An intermediate representation is created from the given source code:
• Lexical Analyzer
• Syntax Analyzer
• Semantic Analyzer
• Intermediate Code Generator
• The lexical analyzer divides the program into “tokens”, the syntax analyzer
recognizes “sentences” in the program using the syntax of the language, and
the semantic analyzer checks the static semantics of each construct. The
intermediate code generator generates “abstract” code.

Synthesis Phase – An equivalent target program is created from the
intermediate representation. It has two parts:
• Code Optimizer
• Code Generator
• The code optimizer optimizes the abstract code, and the final code generator
translates the abstract intermediate code into specific machine instructions.

12
Phases of a Compiler…
Lexical Analyzer
• It is also called the scanner. It takes the output of the pre-processor
(which performs file inclusion and macro expansion) as its input, which is
in pure high-level language.
• It reads the characters from the source program and groups them into
lexemes (sequences of characters that “go together”).
• Each lexeme corresponds to a token.
• Tokens are defined by regular expressions which are understood by
the lexical analyzer.
• It also handles lexical errors (e.g., erroneous characters) and removes
comments and white space.

13
Phases of a Compiler…
Syntax Analyzer
• It is sometimes called the parser. It constructs the parse tree, taking the
tokens one by one and using a context-free grammar to do so.
• Why grammar? The rules of a programming language can be entirely
represented in a few productions. Using these productions we can represent
what the program actually is. The input has to be checked to see whether it
is in the desired format or not.
• The parse tree is also called the derivation tree. Parse trees are generally
constructed to check for ambiguity in the given grammar. There are certain rules
associated with the derivation tree.
• Any identifier is an expression. Any number can be called an expression.
Performing any operations in the given expression will always result in an
expression. For example, the sum of two expressions is also an expression.
• The parse tree can be compressed to form a syntax tree. Syntax error can be
detected at this level if the input is not in accordance with the grammar.

14
Phases of a Compiler…
• Semantic Analyzer: It verifies the parse tree, checking whether it is
meaningful, and produces a verified parse tree. It also performs type
checking, label checking and flow-control checking.

• Intermediate Code Generator: It generates intermediate code, that is,
a form which can be readily executed by a machine. We have many
popular intermediate codes; three-address code is one example.
Intermediate code is converted to machine language using the last
two phases, which are platform dependent. Up to the intermediate code,
compilation is the same for every compiler, but after that it depends
on the platform. To build a new compiler we don’t need to build it from
scratch: we can take the intermediate code from an already
existing compiler and build the last two parts.
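As an illustration of three-address code, the sketch below generates it for the assignment a = b + c * d from a small expression AST. This is a minimal illustrative sketch, not the textbook's algorithm; the tuple AST shape and the temporary names t1, t2, … are assumptions of the example.

```python
# Minimal three-address code generator for expression ASTs.
# ASTs are nested tuples, e.g. ('+', 'b', ('*', 'c', 'd')).
# Illustrative sketch only; temporaries are named t1, t2, ...

def gen_tac(node, code, counter):
    """Emit three-address instructions for `node`; return its result name."""
    if isinstance(node, str):           # leaf: a variable name
        return node
    op, left, right = node
    l = gen_tac(left, code, counter)
    r = gen_tac(right, code, counter)
    counter[0] += 1
    temp = f"t{counter[0]}"
    code.append(f"{temp} = {l} {op} {r}")
    return temp

def compile_assign(target, expr):
    """Return the three-address instructions for `target = expr`."""
    code = []
    result = gen_tac(expr, code, [0])
    code.append(f"{target} = {result}")
    return code

if __name__ == "__main__":
    for line in compile_assign('a', ('+', 'b', ('*', 'c', 'd'))):
        print(line)            # t1 = c * d / t2 = b + t1 / a = t2
```

Note how the deepest subexpression (c * d) gets its temporary first, which is exactly the evaluation order a later code generator would follow.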

15
Phases of a Compiler…
• Code Optimizer: It transforms the code so that it consumes
fewer resources and runs faster. The meaning of the code being
transformed is not altered. Optimisation can be categorized into
two types: machine dependent and machine independent.

• Target Code Generator: The main purpose of the target code
generator is to produce code that the machine can understand; it also
performs register allocation, instruction selection, etc. The output
depends on the type of assembler. This is the final stage of
compilation. The optimized code is converted into relocatable machine
code, which then forms the input to the linker and loader.
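A classic machine-independent optimization is constant folding: constant subexpressions are evaluated at compile time without changing the meaning of the code. The sketch below is illustrative only; the nested-tuple AST shape is an assumption of the example.

```python
# Constant folding: a machine-independent optimization that evaluates
# constant subexpressions at compile time without changing the meaning.
# Illustrative sketch over nested-tuple ASTs like ('+', 'x', ('*', 2, 3)).

def fold(node):
    if not isinstance(node, tuple):
        return node                     # variable name or integer literal
    op, left, right = node
    l, r = fold(left), fold(right)
    if isinstance(l, int) and isinstance(r, int):
        # Both operands are known at compile time: evaluate now.
        return {'+': l + r, '-': l - r, '*': l * r}[op]
    return (op, l, r)

if __name__ == "__main__":
    print(fold(('+', 'x', ('*', 2, 3))))   # ('+', 'x', 6)
```

The transformed tree computes the same value for every value of x, which is the defining property of a semantics-preserving optimization.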
16
Compiler Construction
Tools
• Parser Generator: It produces syntax analyzers
(parsers) from input based on a grammatical
description of a programming language, i.e. a context-
free grammar. It is useful because the syntax analysis
phase is highly complex and consumes a lot of manual
effort and time. Example: YACC, Bison

17
Compiler Construction
Tools…
• Scanner Generator: It generates lexical analyzers from
the input that consists of regular expression description
based on tokens of a language. It generates a finite
automaton to recognize the regular expression.
Example: Lex

18
Compiler Construction
Tools…
• Syntax-directed translation engines: They generate intermediate code in
three-address format from input that consists of a parse tree. These
engines have routines to traverse the parse tree and produce the
intermediate code; each node of the parse tree is associated with
one or more translations.
• Automatic code generators: They generate the machine language for a
target machine. Each operation of the intermediate language is translated
using a collection of rules and then taken as input by the code
generator. A template-matching process is used: an intermediate-language
statement is replaced by its equivalent machine-language statement using
templates.
• Data-flow analysis engines: These are used in code optimization. Data-flow
analysis is a key part of code optimization; it gathers information about
how values flow from one part of a program to another.
• Compiler construction toolkits: They provide an integrated set of routines
that aid in building compiler components or in the construction of various
phases of a compiler.
19
Symbol Table in
Compiler
The symbol table is an important data structure created and
maintained by the compiler in order to keep track of the semantics of
variables, i.e. it stores scope and binding information about names, and
information about instances of various entities such as variable and
function names, classes, objects, etc.
• It is built in lexical and syntax analysis phases.
• The information is collected by the analysis phases of
compiler and is used by synthesis phases of compiler to
generate code.
• It is used by compiler to achieve compile time efficiency.

20
Symbol Table in
Compiler…
Items stored in Symbol table:
• Variable names and constants
• Procedure and function names
• Literal constants and strings
• Compiler generated temporaries
• Labels in source languages

Information used by compiler from Symbol table:


• Data type and name
• Declaring procedure
• Offset in storage
• For a structure or record, a pointer to the structure table
• For parameters, whether parameter passing is by value or by reference
• Number and type of arguments passed to a function
• Base address
Symbol Table in
Compiler…
Operations of Symbol table:
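The table of operations was presented as a figure; typical symbol-table operations — insert, lookup, and entering/exiting scopes — can be sketched as below. This is an illustrative sketch only (the class and method names are mine), not a reproduction of the original slide.

```python
# A minimal symbol table supporting the usual operations:
# insert(name, attrs), lookup(name), and nested scopes via
# enter_scope()/exit_scope(). Illustrative sketch only.

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]              # stack of scopes, global scope first

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def insert(self, name, **attrs):    # attrs might be type, offset, ...
        self.scopes[-1][name] = attrs

    def lookup(self, name):
        # Search from the innermost scope outward; None if undeclared.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None
```

A local declaration shadows a global one until its scope is exited, which is exactly the behavior the analysis phases rely on when resolving names.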

22
Compiler Passes
A pass is a complete traversal of the source program. A compiler may
traverse the source program in a single pass or in multiple passes.

Multi-pass Compiler
• Multi pass compiler is used to process the source code of a program
several times.
• In the first pass, compiler can read the source program, scan it, extract the
tokens and store the result in an output file.
• In the second pass, compiler can read the output file produced by first pass,
build the syntactic tree and perform the syntactical analysis. The output of
this phase is a file that contains the syntactical tree.
• In the third pass, the compiler can read the output file produced by the
second pass and check whether the tree follows the rules of the language.
The output of the semantic analysis phase is the annotated syntax tree.
• Passes continue in this way until the target output is produced.

23
Compiler Passes…
One-pass Compiler
• One-pass compiler is used to traverse the program only
once. The one-pass compiler passes only once through
the parts of each compilation unit. It translates each part
into its final machine code.
• In the one pass compiler, when the line source is
processed, it is scanned and the token is extracted.
• Then the syntax of each line is analyzed and the tree
structure is built. After the semantic part, the code is
generated.
• The same process is repeated for each line of code until
the entire program is compiled.
24
Bootstrapping
• Bootstrapping is widely used in the compilation development.
• Bootstrapping is used to produce a self-hosting compiler. Self-
hosting compiler is a type of compiler that can compile its own
source code.
• Bootstrap compiler is used to compile the compiler and then you can
use this compiled compiler to compile everything else as well as
future versions of itself.

A compiler can be characterized by three languages:


• Source Language
• Target Language
• Implementation Language

25
Bootstrapping…
The T-diagram shows a compiler, written in implementation language I,
that translates source language S to target language T (written S→T in I).

Follow these steps to produce a compiler for a new language L on machine A:

1. Create a compiler (S→A in A) for a subset S of the desired language L,
written in machine language A, so that the compiler runs on machine A.

2. Create a compiler (L→A in S) for the full language L, written in the
subset S of L.

26
Bootstrapping…
3. Compile the compiler (L→A in S) using the compiler (S→A in A) to obtain
(L→A in A): a compiler for language L which runs on machine A and produces
code for machine A.

27
Introduction of Lexical
Analysis
Lexical analysis is the first phase of the compiler, also known
as scanning. It converts the high-level input program into a sequence
of tokens.
• Lexical analysis can be implemented with deterministic finite
automata.
• The output is a sequence of tokens that is sent to the parser for
syntax analysis.

28
Introduction of Lexical
Analysis…
What is a token?
A lexical token is a sequence of characters that can be
treated as a unit in the grammar of the programming
languages.
Example of tokens:
• Type token (id, number, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
• Keywords: for, while, if, etc.
• Identifiers: variable names, function names, etc.
• Operators: '+', '++', '-', etc.
• Separators: ',', ';', etc.
29
Introduction of Lexical
Analysis…
Example of Non-Tokens:
Comments, preprocessor directive, macros, blanks,
tabs, newline, etc.

Lexeme: The sequence of characters matched by a pattern to form the
corresponding token, or a sequence of input characters that comprises a
single token, is called a lexeme.
Example: “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;”.

30
Introduction of Lexical
Analysis…
How Lexical Analyzer functions:
• Tokenization i.e. Dividing the program into
valid tokens.
• Remove white space characters.
• Remove comments.
• It also provides help in generating error
messages by providing row numbers and
column numbers.
31
Introduction of Lexical
Analysis…
The lexical analyzer identifies errors with the help of the
automaton and the grammar of the given language (such as C or
C++) on which it is based, and gives the row number and column
number of the error.

Suppose we pass a statement through the lexical analyzer:

• a = b + c ; → It will generate a token sequence like this:
• id = id + id ; → where each id refers to its variable in
the symbol table, referencing all details

32
Introduction of Lexical
Analysis…
Example 1:

Valid Tokens: 'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b'
';' 'a' '=' '10' ';' 'return' '0' ';' '}'
33
Introduction of Lexical
Analysis…
Example 2: → There are 5 valid tokens in this
printf statement.

Example 3: int max(int i);

• The lexical analyzer first reads int, finds it to be valid and
accepts it as a token.
• max is read and found to be a valid function name
after reading (.
• int is also a token, then i is another token and
finally ;.
Answer: Total number of tokens is 7: int, max, (, int, i, ), ;
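The token counts in the examples above can be reproduced with a small regex-based scanner. This is a toy sketch of the idea (the token classes and names are mine), not the internals of a real lexer, and it handles only the constructs used in these examples.

```python
import re

# A toy scanner: classifies lexemes with regular expressions and skips
# whitespace, in the spirit of the examples above. Illustrative only.
TOKEN_SPEC = [
    ('KEYWORD',   r'\b(?:int|return|if|while|for)\b'),
    ('ID',        r'[A-Za-z_]\w*'),
    ('NUMBER',    r'\d+'),
    ('OPERATOR',  r'[+\-*/=]'),
    ('SEPARATOR', r'[(),;{}]'),
    ('SKIP',      r'\s+'),
]
MASTER = re.compile('|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token_class, lexeme) pairs, whitespace removed."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != 'SKIP':
            tokens.append((m.lastgroup, m.group()))
    return tokens

if __name__ == "__main__":
    print(len(tokenize('int max(int i);')))   # 7, matching Example 3
```

For `a = b + c ;` it yields the six lexemes a, =, b, +, c, ; — each classified, just as the id = id + id sequence on the previous slide suggests.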

34
Input Buffering
The lexical analyzer scans the input from left to right one character at a
time. It uses two pointers, begin ptr (bp) and forward ptr (fp), to keep
track of the portion of the input scanned.

Initially both the pointers point to the first character of the input string as
shown below

35
Input Buffering…
The forward pointer moves ahead in search of the end of the lexeme. As
soon as a blank space is encountered, it indicates the end of the lexeme. In
the above example, as soon as the forward ptr (fp) encounters a blank space,
the lexeme “int” is identified.
The fp is moved ahead over white space: when fp encounters white
space, it ignores it and moves ahead. Then both the begin ptr (bp) and
forward ptr (fp) are set at the next token.
The input characters are thus read from secondary storage, but
reading from secondary storage in this way is costly, hence a buffering
technique is used. A block of data is first read into a buffer, and then
scanned by the lexical analyzer. There are two methods used in this
context: the one-buffer scheme and the two-buffer scheme. These are
explained below.

36
Input Buffering…

One Buffer Scheme: In this scheme, only
one buffer is used to store the input string. The
problem with this scheme is that if a lexeme is
very long, it crosses the buffer boundary; to
scan the rest of the lexeme the buffer has to be
refilled, which overwrites the first part of the
lexeme.
Input Buffering…
Two Buffer Scheme: To overcome the problem of the one-
buffer scheme, two buffers are used to store the input string.
The first buffer and second buffer are scanned alternately; when
the end of the current buffer is reached, the other buffer is filled.
The only problem with this method is that if the length of a
lexeme is longer than the length of a buffer, the input cannot be
scanned completely.
Initially both bp and fp point to the first character of the
first buffer. Then fp moves towards the right in search of the end
of the lexeme. As soon as a blank character is recognized, the
string between bp and fp is identified as the corresponding
token. To identify the boundary of the first buffer, an end-of-
buffer character is placed at the end of the first buffer.
Similarly, the end of the second buffer is recognized by
the end-of-buffer mark present at its end. When fp encounters
the first eof, it recognizes the end of the first buffer, and filling of
the second buffer is started. In the same way, when the second
eof is reached, it indicates the end of the second buffer.
Alternately, both buffers can be refilled until the end of the input
program, and the stream of tokens identified. The eof character
introduced at the end is called a sentinel and is used to identify
the end of a buffer.
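The two-buffer scheme above can be simulated in a few lines. This is an illustrative sketch only (buffer size, the use of None as the sentinel, and the function names are assumptions of the example); it splits lexemes on blanks, as the slides do, rather than running a full lexer.

```python
# Simulation of the two-buffer scheme: the input is read in fixed-size
# blocks into alternating buffers, each terminated by a sentinel
# (here None) marking end-of-buffer. Illustrative sketch only.

BUF_SIZE = 4
SENTINEL = None

def load(source, start):
    """Fill one buffer from `source` starting at `start`, then add the sentinel."""
    block = list(source[start:start + BUF_SIZE])
    block.append(SENTINEL)
    return block

def scan_lexemes(source):
    """Return blank-separated lexemes, refilling a buffer at each sentinel."""
    lexemes, current = [], []
    pos, buf, fp = 0, load(source, 0), 0
    while True:
        ch = buf[fp]
        if ch is SENTINEL:
            pos += BUF_SIZE
            if pos >= len(source):          # real end of input, not just eof mark
                break
            buf, fp = load(source, pos), 0  # refill the other buffer
            continue
        if ch.isspace():
            if current:                     # blank ends the current lexeme
                lexemes.append(''.join(current))
                current = []
        else:
            current.append(ch)
        fp += 1
    if current:
        lexemes.append(''.join(current))
    return lexemes

if __name__ == "__main__":
    print(scan_lexemes("int a = 10;"))   # ['int', 'a', '=', '10;']
```

Note how hitting the sentinel forces exactly one check per buffer end, which is the point of the sentinel trick: the scanner does not test for end-of-buffer on every character.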
38
LEX
• Lex is a program that generates lexical analyzers. It is used with the
YACC parser generator.
• The lexical analyzer is a program that transforms an input stream into a
sequence of tokens.
• It reads the input stream and produces the lexical analyzer as C source
code.

The function of Lex is as follows:

• First, the programmer writes a Lex specification lex.l in the Lex
language. The Lex compiler processes the lex.l program and produces a C
program, lex.yy.c.
• The C compiler then compiles lex.yy.c and produces an object
program, a.out.
• a.out is the lexical analyzer that transforms an input stream into a
sequence of tokens.
39
LEX…

40
LEX: LEX file format
A Lex program is separated into three sections by %% delimiters. The format
of a Lex source file is as follows:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Definitions include declarations of constants, variables and regular definitions.
Rules are statements of the form p1 {action1} p2 {action2} ... pn {actionn},
where pi describes a regular expression and actioni describes what action the
lexical analyzer should take when pattern pi matches a lexeme.
User subroutines are auxiliary procedures needed by the actions. The
subroutines can be compiled separately and loaded with the lexical analyzer.

41
YACC
• YACC stands for Yet Another Compiler Compiler.
• YACC provides a tool to produce a parser for a given grammar.
• YACC is a program designed to compile a LALR (1) grammar.
• It is used to produce the source code of the syntactic analyzer of the
language produced by LALR (1) grammar.
• The input to YACC is the rules or grammar and the output is a C program.
Some points about YACC:
Input: a CFG - file.y
Output: a parser y.tab.c (yacc)
• The output file "file.output" contains the parsing tables.
• The file "file.tab.h" contains declarations.
• The parser is invoked by calling yyparse().
• The parser expects to use a function called yylex() to get tokens.

42
YACC…
The flow is: a file containing the desired grammar in YACC format
(the YACC program) → YACC → the C source program created by YACC →
C compiler → an executable file that will parse the grammar given in gram.Y.

43
Formal grammar
• Formal grammar is a set of rules. It is used to identify correct or incorrect
strings of tokens in a language. The formal grammar is represented as G.
• Formal grammar is used to generate all possible strings over the alphabet
that is syntactically correct in the language.
• Formal grammar is used mostly in the syntactic analysis phase (parsing)
particularly during the compilation.

Formal grammar G is written as follows:

G = <N, T, P, S>
Where:
N describes a finite set of non-terminal symbols.
T describes a finite set of terminal symbols.
P describes a set of production rules.
S is the start symbol.

44
Formal grammar:
Example
T = {a, b}, N = {S, R, B}
Production rules:
S → bR
R → aR
R → aB
B → b
Through these productions we can produce strings like bab, baab, baaab,
etc.
The grammar describes strings of the shape baⁿb (n ≥ 1).
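The strings baⁿb can be produced mechanically by applying the productions in order. The sketch below is illustrative only (the helper name `generate` is mine); it applies S → bR once, R → aR a chosen number of times, then R → aB and B → b.

```python
# Generating strings from the example grammar
#   S -> bR,  R -> aR | aB,  B -> b
# Illustrative: derive the strings b a^n b for small n.

def generate(n):
    """Apply S -> bR, then R -> aR (n-1 times), then R -> aB and B -> b."""
    s = 'bR'                       # S -> bR
    for _ in range(n - 1):
        s = s.replace('R', 'aR')   # R -> aR
    s = s.replace('R', 'aB')       # R -> aB
    s = s.replace('B', 'b')        # B -> b
    return s

if __name__ == "__main__":
    print([generate(n) for n in (1, 2, 3)])   # ['bab', 'baab', 'baaab']
```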

45
BNF Notation
BNF stands for Backus-Naur Form. It is used to write a formal representation
of a context-free grammar. It is also used to describe the syntax of a programming
language.
BNF notation is basically just a variant of a context-free grammar.
In BNF, productions have the form:
leftside → definition
where leftside ∈ Vn (a single non-terminal) and definition ∈ (Vn ∪ Vt)*.
We can define several productions with the same leftside. Such
productions are separated by a vertical bar symbol "|".
There is the production for any grammar as follows:
S → aSa
S → bSb
S→c
In BNF, we can represent above grammar as follows:
S → aSa| bSb| c

46
Context Free Grammar
Context free grammar is a formal grammar which is used to generate
all possible strings in a given formal language.
Context-free grammar G can be defined by a four-tuple:
G = (V, T, P, S)
Where,
V describes a finite set of non-terminal symbols.
T describes a finite set of terminal symbols.
P describes a set of production rules.
S is the start symbol.

In a CFG, the start symbol is used to derive the string. You can derive
a string by repeatedly replacing a non-terminal by the right-hand side of a
production, until all non-terminals have been replaced by terminal symbols.

47
Context Free Grammar:
Example
L = { wcwᴿ | w ∈ (a, b)* }
Production rules:
S → aSa
S → bSb
S→c
Now check that the string abbcbba can be derived from the given
CFG:
S ⇒ aSa
  ⇒ abSba
  ⇒ abbSbba
  ⇒ abbcbba
By applying the productions S → aSa and S → bSb recursively, and
finally applying the production S → c, we get the string abbcbba.
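The derivation above can be checked with a tiny recursive recognizer that mirrors the three productions. Illustrative sketch only; the function name is mine.

```python
# A recursive recognizer for L = { w c w^R | w in {a,b}* } following the
# grammar S -> aSa | bSb | c. Illustrative sketch only.

def derives(s):
    """True iff the string s can be derived from S."""
    if s == 'c':                                   # S -> c
        return True
    if len(s) >= 3 and s[0] == s[-1] and s[0] in 'ab':
        return derives(s[1:-1])                    # S -> aSa or S -> bSb
    return False

if __name__ == "__main__":
    print(derives('abbcbba'))   # True
```

Each recursive step strips one matching outer pair, exactly mirroring one application of S → aSa or S → bSb; the recursion bottoms out at S → c.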
48
Capabilities of CFG
• Context free grammar is useful to describe most of the
programming languages.
• If the grammar is properly designed then an efficient
parser can be constructed automatically.
• Using associativity and precedence information, suitable
grammars for expressions can be constructed.
• Context free grammar is capable of describing nested
structures like: balanced parentheses, matching begin-
end, corresponding if-then-else's & so on.

49
Derivation
Derivation is a sequence of production rules. It is used to get
the input string through these production rules. During parsing we have
to take two decisions. These are as follows:
• We have to decide the non-terminal which is to be replaced.
• We have to decide the production rule by which the non-terminal will
be replaced.
We have two options to decide which non-terminal to be
replaced with production rule.
• Left-most Derivation: In the left-most derivation, the input is
scanned and replaced with production rules from left to right: at
each step, the left-most non-terminal is replaced.
• Right-most Derivation: In the right-most derivation, the input is
scanned and replaced with production rules from right to left: at
each step, the right-most non-terminal is replaced.
50
Left-most Derivation

• Production rules:
S → S + S
S → S - S
S → a | b | c

• Input: a - b + c

• The left-most derivation is:
S ⇒ S + S
  ⇒ S - S + S
  ⇒ a - S + S
  ⇒ a - b + S
  ⇒ a - b + c
51
Right-most Derivation

• Production rules:
S → S + S
S → S - S
S → a | b | c

• Input: a - b + c

• The right-most derivation is:
S ⇒ S - S
  ⇒ S - S + S
  ⇒ S - S + c
  ⇒ S - b + c
  ⇒ a - b + c
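The two derivations above differ only in which occurrence of S is replaced at each step. A small sketch can apply a production sequence in leftmost or rightmost order and reproduce both slide derivations; the function name and the string encoding (S is the only non-terminal) are assumptions of the example.

```python
# Applying a sequence of productions as a leftmost or rightmost
# derivation of a - b + c, mirroring the two slides above.
# 'S' is the only non-terminal; everything else is terminal.

def derive(start, productions, leftmost=True):
    """Replace the leftmost (or rightmost) 'S' with each production in turn."""
    steps, s = [start], start
    for rhs in productions:
        i = s.index('S') if leftmost else s.rindex('S')
        s = s[:i] + rhs + s[i + 1:]
        steps.append(s)
    return steps

left = derive('S', ['S+S', 'S-S', 'a', 'b', 'c'], leftmost=True)
right = derive('S', ['S-S', 'S+S', 'c', 'b', 'a'], leftmost=False)

if __name__ == "__main__":
    print(left[-1], right[-1])   # a-b+c a-b+c
```

Both orders reach the same sentence a-b+c, but through different intermediate sentential forms, as on the slides.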
52
Parse tree
• A parse tree is the graphical representation of symbols; the symbols
can be terminals or non-terminals.
• In parsing, the string is derived using the start symbol. The root of
the parse tree is that start symbol.
• The parse tree follows the precedence of operators: the deepest sub-
tree is traversed first, so the operator in a parent node has lower
precedence than the operators in its sub-trees.

The parse tree follows these points:

• All leaf nodes have to be terminals.
• All interior nodes have to be non-terminals.
• In-order traversal gives the original input string.
53
Parse tree: Example
Production rules:
T → T + T | T * T
T → a | b | c

Input: a * b + c
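The parse tree for a*b+c can be built directly, with the * in the deeper subtree so that + is applied last, and its in-order traversal reproduces the input, as the previous slide states. Illustrative sketch only; the Node class is mine.

```python
# The parse tree for a*b+c under the grammar T -> T+T | T*T | a|b|c,
# with the '*' in the deeper subtree so '+' binds last.
# In-order traversal of the tree reproduces the input string.

class Node:
    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = list(children)

def in_order(node):
    if not node.children:               # leaf: a terminal symbol
        return node.symbol
    left, op, right = node.children
    return in_order(left) + op.symbol + in_order(right)

# (a * b) + c : the deeper '*' subtree is traversed first.
tree = Node('T', [
    Node('T', [Node('a'), Node('*'), Node('b')]),
    Node('+'),
    Node('c'),
])

if __name__ == "__main__":
    print(in_order(tree))   # a*b+c
```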

54
Ambiguity

• A grammar is said to be ambiguous if
there exists more than one leftmost
derivation, more than one rightmost
derivation, or more than one parse tree for
a given input string.
• If the grammar is not ambiguous, then it is
called unambiguous.

55
Ambiguity: Example
S → aSb | SS
S → ε
For the string aabb, the above grammar generates two parse trees.

If a grammar has ambiguity, then it is not good for
compiler construction. No method can automatically detect
and remove ambiguity, but you can remove ambiguity by
re-writing the whole grammar without ambiguity.
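The two parse trees for aabb correspond to two distinct leftmost derivations: one starting with S → aSb, the other with S → SS. The sketch below applies each production sequence and confirms both reach aabb; it is illustrative only ('' stands for ε, and the helper name is mine).

```python
# Two distinct leftmost derivations of "aabb" under the ambiguous grammar
#   S -> aSb | SS | epsilon
# Each step replaces the leftmost S; '' stands for epsilon. Sketch only.

def apply_leftmost(start, productions):
    """Apply each production to the leftmost 'S' in turn; return the result."""
    s = start
    for rhs in productions:
        i = s.index('S')
        s = s[:i] + rhs + s[i + 1:]
    return s

tree1 = apply_leftmost('S', ['aSb', 'aSb', ''])            # S => aSb => aaSbb => aabb
tree2 = apply_leftmost('S', ['SS', 'aSb', 'aSb', '', ''])  # S => SS => aSbS => ... => aabb

if __name__ == "__main__":
    print(tree1, tree2)   # aabb aabb
```

Since the two derivations use a different production at the root (S → aSb versus S → SS), they correspond to different parse trees for the same string, which is exactly what makes the grammar ambiguous.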
56
Thank You
91
