CD Unit 1


Compiler Design

Introduction to the Course


Divyaprabha K N
Assistant Professor,
Department of Computer Science & Engineering
COMPILER DESIGN: UE21CS341B

INTRODUCTION

Divyaprabha K N
Department of Computer Science & Engineering
COMPILER DESIGN
Syllabus

Unit 1 : Compilers: The Language Processing System, The Phases of a Compiler,
The Grouping of Phases into Passes. Lexical Analysis: The Role of the Lexical
Analyzer, Input Buffering, Specification of Tokens, Recognition of Tokens, Design
of a Lexical Analyzer Generator. Syntax Analysis: The Role of the Parser, Syntax
Error Handling, Error-Recovery Strategies. Top-down Parsing: Recursive Descent
Parser (RDP) with Backtracking, LL(1) Parser. - 22 Slots

Unit 2 : Bottom-up Parsing: Shift-Reduce Parsing, LR(0), SLR, viable prefixes, CLR,
LALR. Syntax-Directed Translation: Syntax-directed definitions, Evaluation orders
for SDD’s, Applications of Syntax-Directed Translation, and Syntax-directed
Translation Schemes – Postfix Translation Schemes. – 20 Slots
COMPILER DESIGN
Syllabus

Unit 3 : Parser Stack Implementation: Parser Stack Implementation of Postfix SDT's, SDT's
with actions inside Productions, SDT's for L-Attributed Definitions. Implementing
L-Attributed SDD’s: Bottom-Up Parsing. Intermediate-Code Generation: Variants of Syntax
Trees – Directed Acyclic Graphs for Expressions, Three-Address Code – Addresses and
Instructions, Quadruples, Triples, Indirect Triples, SSA Form, Control Flow Graph. – 22 Slots

Unit 4 : Machine-Independent Optimization: Different Optimizations, Optimization of Basic
Blocks. Data Flow Analysis: Live-variable analysis, Next-use algorithm. Run-Time
Environments: Storage Organization, Different Allocation Strategies, Stack Allocation of
Space, Access to Non-local Data on the Stack. Code Generation: Issues in the Design of a
Code Generator, the Target Language, Addresses in the Target Code, Static Allocation,
Stack Allocation, Run-time Addresses for Names. A Simple Code Generator – The Code
Generation Algorithm. - 22 Slots

Desirable Knowledge:

1. UE19CS202 - Data Structures and its Applications
2. UE19CS205 - Automata Formal Languages & Logic
COMPILER DESIGN
Introduction - Overview

Course Contents

Compiler Design : UE21CS341B

• Unit 1 – Basics of Compiler & Lexical Analysis
• Unit 2 – Syntax Analysis
• Unit 3 – Syntax Directed Translation
• Unit 4 – Intermediate Code Generation, Run-Time Environments
Compiler Design
Tools, Text Book and Reference Book

T1 – “Compilers: Principles, Techniques and Tools”, Alfred V. Aho, Monica S. Lam,
Ravi Sethi, Jeffrey D. Ullman; 2nd Edition

R1 – “Modern Compiler Design”, Dick Grune, Kees van Reeuwijk, Henri E. Bal,
Ceriel J.H. Jacobs, Koen Langendoen; 2nd Edition

Tools/Languages: Lex and Yacc.
COMPILER DESIGN
Course Objectives

1. Introduce the major concept areas of language translation and compiler design.

2. Develop a greater understanding of the issues involved in programming
   language design and implementation.

3. Provide practical programming skills necessary for constructing a compiler.

4. Develop an awareness of the function and complexity of modern compilers.

5. Provide an understanding of the importance and techniques of optimizing
   code from the compiler's perspective.
COMPILER DESIGN
Course Outcome

At the end of this course, the student will be able to:

1. Use the knowledge of patterns, tokens and regex for solving problems in
   the field of data mining.

2. Analyze and design the semantic behavior of a compiler.

3. Choose the appropriate compiler internal representation for different
   kinds of compiler tasks.

4. Translate a source-level language into a low-level compiler internal
   representation.

5. Optimize the performance of a program in terms of speed and space using
   code optimization techniques.
COMPILER DESIGN
Introduction

Compiler Input and Output


Compiler Design
Why study compilers?

• Build a large, ambitious software system. See theory come to life.
• Learn how to build programming languages.
• Learn how programming languages work.
• Learn tradeoffs in language design.

A short history of compilers:
• First, there was nothing.
• Then, there was machine code.
• Then, there were assembly languages.
• Programming was expensive; 50% of the cost of a machine went into
  programming.
Compiler Design
Introduction: Compiler

Rear Admiral Grace Hopper: inventor of A-0 (1952), COBOL (1959), and the term
"compiler".

FORTRAN (Formula Translation): the first widely used high-level programming
language; the first unambiguously complete compiler (1954-57).
Compiler Design
First C Compiler

• First C Compiler (a modified B compiler, written in B)

• TMG (TransMoGrifier) was the compiler definition tool used by Ken Thompson
  to write the compiler for the B language on his PDP-7 (itself written in
  assembler) in 1970.

• B was the immediate ancestor of C.
Compiler Design
Self-hosting Compiler

• Self-hosting Compiler: a compiler that can compile its own source code.

• Examples: C, C++, C#, Java, Pascal, Python, Visual Basic, etc.

•In some of these cases, the initial implementation was not self-
hosted, but rather, written in another language (or even in machine
language); in other cases, the initial implementation was developed
using bootstrapping.
Compiler Design
Bootstrapping (Compilers)

•Bootstrapping is the technique for producing a self- compiling compiler.


•An initial core version of the compiler - the bootstrap compiler - is
generated in a different language (which could be assembly language);
successive expanded versions of the compiler are developed using this
minimal subset of the language.
• Example : BASIC, ALGOL, C, D, Pascal, Lisp, Java, Rust, Python.

https://www.geeksforgeeks.org/bootstrapping-in-compiler-design/

• Assemblers were the first language tools to bootstrap themselves!


Compiler Design
Cross Compiler

Native Compiler
Generates code for the same platform on which it runs; converts a high-level
language into the computer's native language.
Example: Turbo C or GCC.

Cross Compiler
A cross compiler is a compiler capable of creating executable code for a
software platform other than the one on which the compiler is running. A cross
compiler is used for cross-platform development of binary code.
For example, a compiler that runs on a Windows 7 PC but generates code that
runs on an Android smartphone is a cross compiler.
Use: embedded computers (microwave oven, washing machine).
Compiler Design
Other Variants

TRANSPILER: source-to-source compiler (high-level language to high-level
language).

DECOMPILER: low-level language to high-level language.

COMPILER-COMPILER: refers to tools used to create parsers that perform syntax
analysis.
Compiler Design
Other Variants - Compiler-compiler (Compiler generator)

• A programming tool that creates a parser, interpreter, or compiler from some
  form of formal description of a programming language and machine.
• The most common type of compiler-compiler is more precisely called a parser
  generator, and only handles syntactic analysis:
  • its input is a grammar, typically written in Backus–Naur form (BNF)
    or extended Backus–Naur form (EBNF), that defines the syntax of a
    programming language
  • its output is source code of a parser for the programming language.
  • Parser generators do not handle the semantics of the programming
    language, or the generation of machine code for the target machine.
COMPILER DESIGN
Introduction

Application of Compiler

• Compiler design enables the full implementation of high-level
  programming languages
• Supports optimization for computer architecture parallelism
• Design of new memory hierarchies of machines
• Widely used for translating programs
• Used with other software productivity tools
THANK YOU
Divyaprabha K N
Department of Computer Science & Engineering
[email protected]
Compiler Design

Divyaprabha K N
Assistant Professor,
Department of Computer Science & Engineering
COMPILER DESIGN
Unit 1

• The Language Processing System
• Phases of a Compiler
• Grouping of Phases into Passes
• Lexical Analysis
  • Role of the Lexical Analyzer
  • Input Buffering
  • Specification of Tokens
  • Recognition of Tokens
  • Design of a Lexical Analyzer Generator
COMPILER DESIGN: UNIT 1

The Language Processing System
COMPILER DESIGN
The Language Processing System
Compiler
• A compiler is a program that can read a program in one language (the source
  language) and translate it into an equivalent program in another language
  (the target language).

• An important task of a compiler is to report any errors in the source
  program that it detects during the translation process.

Source program → Compiler → Target program
COMPILER DESIGN
The Language Processing System

Compiler
• The target program is an executable machine-language program.

• It is loaded and executed with the help of a loader to process inputs and
  produce outputs.

Input → Target Program → Output
COMPILER DESIGN
The Language Processing System

Interpreter

- An interpreter is another common kind of language processor.
- An interpreter directly executes the operations specified in the source
  program on inputs supplied by the user.
- It executes the source program statement by statement.

Source program + Input → Interpreter → Output
COMPILER DESIGN
The Language Processing System
• Preprocessor
• Interpreter
• Assembler
• Linker
• Loader
• Cross Compiler
• Source-to-Source Compiler
COMPILER DESIGN
The Language Processing System
printf – stdio.h
strlen – string.h
clrscr – conio.h

#define PI 3.14

Steps for Language Processing


COMPILER DESIGN
Detail study of Language processing system

GCC: The acronym GCC is used to refer to the "GNU Compiler Collection". Over
time GCC has been extended to support many additional languages, including
Fortran, Ada, Java and Objective-C.

GCC is written in C with a strong focus on portability, and can compile
itself, so it can be adapted to new systems easily. GCC is a portable
compiler: it runs on most platforms available today, and can produce output
for many types of processors.

GCC is not only a native compiler: it can also cross-compile any program,
producing executable files for a different system from the one used by GCC
itself. GCC has multiple language front ends for parsing different languages.
Programs in each language can be compiled, or cross-compiled, for any
architecture. For example, an Ada program can be compiled for a
microcontroller, or a C program for a supercomputer.
Compiler Design
The language processing system

Compiled version in
memory is interpreted.
COMPILER DESIGN
Example of Language Processing System(LPS)

Input Program: <hello.c>

#include <stdio.h>
int main (void)
{
    printf ("Hello, world!\n");
    return 0;
}

The sequence of commands executed by a single invocation of GCC consists of
the following stages:
• preprocessing (to expand macros)
• compilation (from source code to assembly language)
• assembly (from assembly language to machine code)
• linking (to create the final executable)

Separate PDF file is shared for details.
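For reference, each stage can also be invoked separately with standard GCC
flags (a quick sketch; the intermediate file names are conventional choices,
not requirements):

    gcc -E hello.c -o hello.i     # preprocessing: expand macros and #includes
    gcc -S hello.i -o hello.s     # compilation: C to assembly language
    gcc -c hello.s -o hello.o     # assembly: assembly language to machine code
    gcc hello.o -o hello          # linking: produce the final executable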


Compiler Design
Static Library

A library is a collection of pre-compiled object files that can be linked into
your programs via the linker. Examples are the system functions such as
printf() and sqrt().

There are two types of external libraries: static libraries and shared
libraries.

A static library has the file extension ".a" (archive file) in Unixes or
".lib" (library) in Windows. When your program is linked against a static
library, the machine code of the external functions used in your program is
copied into the executable. A static library can be created via the archive
program "ar".
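As an illustrative sketch (hypothetical file names; gcc and ar are the
standard Unix tools mentioned above), a static library can be built and
linked like this:

    gcc -c mylib.c                   # compile to the object file mylib.o
    ar rcs libmylib.a mylib.o        # archive object file(s) into a static library
    gcc main.c -L. -lmylib -o main   # link: the library's machine code is copied into main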
Compiler Design
Shared Library

• A shared library has the file extension ".so" (shared objects) in Unixes or
  ".dll" (dynamic link library) in Windows.
• When your program is linked against a shared library, only a small table is
  created in the executable.
• Before the executable starts running, the operating system loads the
  machine code needed for the external functions - a process known as dynamic
  linking.
• Furthermore, most operating systems allow one copy of a shared library in
  memory to be used by all running programs, thus saving memory. The shared
  library code can be upgraded without the need to recompile your program.

Dynamic linking makes executable files smaller and saves disk space, because
one copy of a shared library can be shared between multiple programs.
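A corresponding sketch for a shared library (again with hypothetical file
names; -fPIC and -shared are standard GCC flags):

    gcc -fPIC -c mylib.c                 # compile position-independent code
    gcc -shared -o libmylib.so mylib.o   # produce the shared library
    gcc main.c -L. -lmylib -o main       # link: only a small table is placed in main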
Compiler Design
GCC Compilation Process
Compiler Design
GCC Compiler framework

Refer to the following link for further information


http://gcc.gnu.org/onlinedocs/gccint/
Picture taken from grc/iitb
Compiler Design
7 Phases of Compiler

The compiler does its work in seven different phases, and each of them has
access to something known as a "Symbol Table" and to an "Error Handler".
These two entities are explained in the upcoming slides.
Compiler Design
Grouping of phases into passes

Single pass Compiler – All phases are grouped into one part.

Source program (HLL)

Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Code Generation

Target Machine code


(Assembly language code)
Compiler Design
Grouping of phases into passes

Two-pass Compiler – The phases are grouped into two parts.

Source Program (HLL) → Analysis Part (Front end of Compiler):
Lexical Analysis → Syntax Analysis → Semantic Analysis
→ Intermediate Code →
Synthesis Part (Back end of Compiler): Code Optimizer → Code Generator
→ Target Machine code (Assembly level)
Compiler Design
Traditional two-pass compiler

• Use an intermediate representation (IR)
• Front end maps legal source code into IR
• Back end maps IR into target machine code
• Admits multiple front ends and multiple passes
• Typically, the front end is O(n) or O(n log n), while the back end is
  NP-complete (NPC)
• Different phases of the compiler also interact through the symbol table
Compiler Design
The front-end

Responsibilities
• Recognize legal programs
• Report errors for the illegal programs in a useful way
• Produce IR and construct the symbol table
• Much of front end construction can be automated
Compiler Design
The back-end

Responsibilities
• Translate IR into target machine code
• Choose instructions to implement each IR operation
• Decide which values to keep in registers
• Schedule the instructions for the instruction pipeline

We cannot run these "IR" files directly, since the IR is machine independent;
the code must be translated into another form.

Automation has been much less successful in the back end.
Compiler Design
The back-end (Instruction Selection)

• Produce fast, compact code
• Take advantage of target language features such as addressing modes
• Usually viewed as a pattern-matching problem. This was "the problem of the
  future" in the late 70's, when instruction sets were complex
• RISC architectures simplified this problem
Compiler Design
Code generation: Instruction Selection

Source code:
a = b + c;
d = a + e;

If we generate code for each statement separately, we will not generate
efficient code.

Target code:
MOV b,R0
ADD c,R0      (code for first statement)
MOV R0,a
MOV a,R0      (redundant)
ADD e,R0      (code for second statement)
MOV R0,d

If c = 1, then:
MOV b,R0
INC R0
ADD e,R0
MOV R0,d
Compiler Design
The back-end (Instruction Scheduling)

• Avoid hardware stalls (keep pipeline moving)


• Use all functional units productively
• Optimal scheduling is the focus
Compiler Design
The back-end (Register Allocation)

• Have each value in a register when it is used
• Manage a limited set of registers
• Can change instruction choices and insert LOADs and STOREs
• Optimal allocation is a focus
Compiler Design
Three-pass compiler

It also includes a middle-end that involves


optimization of intermediate code
Compiler Design
The middle-end (Intermediate Representation - IR)

• Analyzes and transforms the IR
• Primary goal is to reduce the running time of the compiled code
• May also improve space, power consumption (mobile computing)
• Must also preserve the "meaning" of the code

Typical Transformations:
• Discover and propagate constant values
• Discover a redundant computation and remove it
• Remove unreachable code
THANK YOU
Divyaprabha K N
Department of Computer Science & Engineering
[email protected]
Compiler Design

Divyaprabha K N
Assistant Professor,
Department of Computer Science & Engineering
COMPILER DESIGN
The Structure of a Compiler

Phases of a Compiler

SYMBOL TABLE
COMPILER DESIGN
The Structure of a Compiler

Phases of a Compiler

The lexical analyzer converts the Source Program into a sequence of tokens.
Each token has the form <token-name, attribute value>.

Source program: a = b + c * 60
Lexeme a → token <id, 1>
Output token stream: <id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
COMPILER DESIGN
The Structure of a Compiler (Phases of a Compiler)
a = b + c * 60
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>

The syntax analyzer builds the tree:

        =
      /   \
   id1     +
         /   \
      id2     *
            /   \
         id3     60
COMPILER DESIGN
The Structure of a Compiler (Phases of a Compiler)
The semantic analyzer checks the tree for consistency (symbol-table entries:
a, b, c) and performs type-checking:

        =
      /   \
   id1     +
         /   \
      id2     *
            /   \
         id3   inttofloat
                   |
                   60

The integer 60 is wrapped in inttofloat so that the multiplication is done in
floating point.
COMPILER DESIGN
The Structure of a Compiler (Phases of a Compiler)
COMPILER DESIGN
The Structure of a Compiler (Phases of a Compiler)

Three-address code assignment instructions:

• Each instruction can have at most one operator on the right-hand side.
• The compiler generates a temporary variable to store each result.
• Some instructions may contain fewer than three addresses/operands.

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
COMPILER DESIGN
The Structure of a Compiler (Phases of a Compiler)

t1 = id3 * 60.0
id1 = id2 + t1
COMPILER DESIGN
The Structure of a Compiler
(Phases of a Compiler)
Code generated for the example:
LDF  R2, id3
MULF R2, #60.0
LDF  R1, id2
ADDF R1, R2
STF  id1, R1

Instruction legend:
LD – Load: LD Register, Memory
ST – Store: ST Memory, Register
ADD – Addition: ADD R1, R2 or ADD R1, R2, R3
MUL – Multiplication: MUL R1, R2
For floating-point numbers: LDF, STF, ADDF, MULF
COMPILER DESIGN
The Structure of a Compiler
(Phases of a Compiler)
COMPILER DESIGN
The Structure of a Compiler

Symbol Table

• Data structure containing a record for each variable name, with fields for
  the attributes of the name
• Attributes provide information about:
  ▪ Storage allocation
  ▪ Type
  ▪ Scope (where in the program its value may be used)
  ▪ Procedure names
  ▪ Number and types of its arguments
  ▪ Method of passing arguments (call by value or reference)
COMPILER DESIGN
Grouping of Phases into passes

• Single-pass Compiler – All phases are grouped into one part.

• Two-pass Compiler – The phases are grouped into two parts.
COMPILER DESIGN
Grouping of Phases into passes

• Single-pass Compiler:
Source program (HLL) → Lexical Analysis → Syntax Analysis → Semantic Analysis
→ Intermediate Code Generation → Code Optimization → Code Generation
→ Target Machine code (Assembly language code)

• Two-pass Compiler:
Source program (HLL) → Analysis Part (Front end of Compiler):
Lexical Analyzer → Syntax Analyzer → Semantic Analyzer
→ Intermediate Code Generator → Intermediate code →
Synthesis Part (Back end of Compiler): Code Optimizer → Code Generator
→ Target Machine code (Assembly level language)
COMPILER DESIGN: UNIT 1

The Role of the Lexical Analyzer

Divyaprabha K N
Department of Computer Science & Engineering
COMPILER DESIGN
Detailed study of Lexical Analysis

The first phase of a compiler is called lexical analysis or scanning. The


lexical analyzer reads the stream of characters making up the source
program and groups the characters into meaningful sequences called
lexemes. For each lexeme, the lexical analyzer produces as output a token
of the form
(token-name, attribute-value)
that it passes on to the subsequent phase, syntax analysis. In the token,
the first component token-name is an abstract symbol that is used during
syntax analysis,
and the second component attribute-value points to an entry in the
symbol table for this token.
Information from the symbol-table entry is needed for semantic analysis
and code generation.
COMPILER DESIGN
Lexical Analysis Example

For example, suppose a source program contains the assignment statement


position = initial + rate * 60
The characters in this assignment could be grouped into the following lexemes and mapped into the
following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract symbol
standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry
for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token needs no
attribute-value, we have omitted the second component. We could have used any abstract symbol
such as assign for the token-name, but for notational convenience we have chosen to use the
lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table entry for
initial .
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry for
rate .
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
COMPILER DESIGN
Detailed study of Lexical Analysis

Final representation of the assignment statement after lexical analysis as


the sequence of tokens
(id, 1) (=) (id, 2) (+) (id, 3) (*) (60)
In this representation, the token names =, +, and * are abstract symbols
for the assignment, addition, and multiplication operators, respectively.

Technically speaking, for the lexeme 60 we should make up a token like


(number, 4), where 4 points to the symbol table for the internal
representation of integer 60 but we shall defer the discussion of tokens
for numbers until Chapter 2.
Chapter 3 discusses techniques for building lexical analyzers.
COMPILER DESIGN
Role of Lexical Analyzer

• It reads the input characters of the source program and groups them into lexemes.
• It produces as output a sequence of tokens, one for each lexeme in the source
  program.
• The stream of tokens is sent to the parser for syntax analysis.
• When it discovers a lexeme representing an identifier, it enters that lexeme
  into the symbol table.
COMPILER DESIGN
Role of Lexical Analyzer

Other Tasks of the Lexical Analyzer

• Stripping out comments and whitespace (blank, newline, tab and other
  characters that are used to separate tokens in the input).
• Correlating error messages generated by the compiler with the source
  program.
• It keeps track of the number of newline characters and associates a line
  number with each error message.
• Expansion of macros.
COMPILER DESIGN
Interaction Between the Lexical Analyzer and Parser
COMPILER DESIGN
Role of Lexical Analyzer

Lexical Analyzer
• The lexical analyzer is divided into two processes:
1. Scanning: consists of the simple processes that do not require tokenization
   of the input, such as deletion of comments and compaction of consecutive
   whitespace characters into one.
2. Lexical Analysis: produces the sequence of tokens as output.
COMPILER DESIGN
Role of Lexical Analyzer

Lexical Analysis Versus Parsing

Reasons to separate lexical analysis and syntax analysis into different phases
of the compiler:
1. Simplicity of design (comments and spaces are removed in LA).
2. Compiler efficiency is improved (LA uses specialized techniques suited to
   the lexical task, but not to parsing; buffering techniques for reading
   characters can speed up the compiler).
3. Compiler portability is enhanced.
COMPILER DESIGN
Role of Lexical Analyzer

Tokens, Patterns and Lexemes – related but distinct terms:

• Token: a pair consisting of a token name and an optional attribute value.
  < token name, attribute value >
  - The token name is an abstract symbol representing a kind of lexical unit,
    e.g. keywords, identifiers.
  - The attribute value is a pointer to the symbol-table entry.
COMPILER DESIGN
Role of Lexical Analyzer
Tokens, Patterns and Lexemes
• Pattern: a description of the form that the lexemes of a token may take.
  - Keyword: a fixed sequence of characters
  - Identifier: a sequence of characters with some more complex structure

• Lexeme: a sequence of characters in the source program that matches the
  pattern for a token; it is an instance of the token.
COMPILER DESIGN
Role of Lexical Analyzer
Examples of Tokens (sample input: printf("total = %d\n", score);)

Token          | Informal Description (pattern)         | Sample Lexemes
if (keyword)   | characters i, f                        | if
else (keyword) | characters e, l, s, e                  | else
comparison     | < or > or <= or >= or == or !=         | <=, !=
id             | letter followed by letters and digits  | sum, avg1, pi
number         | any numeric constant                   | 10001, 3.14
literal        | anything but ", surrounded by "s       | "compiler design"
COMPILER DESIGN
Role of Lexical Analyzer
Example: Find the number of tokens
1. int a = 20;

Lexeme | Tokens | Count of Tokens
COMPILER DESIGN
Role of Lexical Analyzer
Example: Find the number of tokens
2. Find the number of tokens in the given code:

int main()
{
    //Compiler Design
    int a, b;
    a = 10;
    b = 20;
    return 0;
}

Lexeme | Tokens | Count of Tokens
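One possible worked answer (assuming the usual C lexer conventions: comments
and whitespace are stripped, and each punctuation symbol is a separate token):
1. int a = 20; → lexemes int, a, =, 20, ; → 5 tokens.
2. The comment //Compiler Design produces no tokens; the remaining lexemes are
   int, main, (, ), {, int, a, ,, b, ;, a, =, 10, ;, b, =, 20, ;, return, 0,
   ;, } → 22 tokens.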
COMPILER DESIGN
Role of Lexical Analyzer
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical analyzer must
  provide additional information about the lexeme.
• The lexical analyzer returns a token name and an attribute value to the
  parser.
• Consider the token id: its attributes (its lexeme, its type, and the
  location at which it is first found) are kept in the symbol table.
• The attribute value is a pointer to the symbol-table entry for that
  identifier.
COMPILER DESIGN
Role of Lexical Analyzer
Example: Write the token names and associated attribute values for the
statement E = M * C ** 2:

• <id, pointer to symbol-table entry for E>
• <assign_op>
• <id, pointer to symbol-table entry for M>
• <mult_op>
• <id, pointer to symbol-table entry for C>
• <exp_op>
• <number, integer value 2>
COMPILER DESIGN
Role of Lexical Analyzer
Lexical Errors

Suppose a situation arises in which the lexical analyzer is unable to proceed
because none of the patterns for tokens matches any prefix of the remaining
input. Typical lexical errors:

• Spelling error, e.g.
    di {…} while();     (di instead of do)
    fi (a == f(x))      (fi instead of if)
• Unmatched string, e.g.
    printf("hi);        instead of printf("hi");
• Appearance of illegal characters, e.g.
    printf("hi")$$;
• Exceeding the length of identifiers, e.g.
    more than 32 characters in an identifier
COMPILER DESIGN
Role of Lexical Analyzer
Error Recovery Actions
฀ Delete one character from the remaining input.
฀ Insert missing character into the remaining input
฀ Replace character by another character
฀ Transpose two adjacent characters.
฀ Delete successive characters from the remaining input until
Lexical Analyzer recognizes a well-formed token. This is
also called PANIC MODE error recovery.
Examples:
• Delete one character:         dio {…} while();  →  do {…} while();
• Insert a missing character:   prntf("hi");      →  printf("hi");
• Replace a character:          di {…} while();   →  do {…} while();
• Transpose two characters:     rpintf("hi");     →  printf("hi");
• Panic-mode deletion:          printf("hi")$$;   →  printf("hi");
Compiler Design
Error Message

When writing an error message, make sure it satisfies the following properties:
● It must pinpoint the error correctly
● It must be understandable
● It must be specific
● It must not have any redundancy
THANK YOU
Divyaprabha K N
Department of Computer Science & Engineering
[email protected]
Compiler Design

Divyaprabha K N
Assistant Professor,
Department of Computer Science & Engineering
Compiler Design

Unit 1
Symbol Table

Divyaprabha K N
Department of Computer Science & Engineering
Compiler Design
Lecture Overview

In this lecture, you will learn about:


● Symbol Table
● Construction of symbol table
Compiler Design
7 Phases of Compiler

The compiler does its work in seven different phases, and each of them has
access to something known as a "Symbol Table" and to an "Error Handler".
These two entities are explained in the upcoming slides.
These two entities are explained
in the upcoming slides
Compiler Design
Symbol Table

What is a symbol table?


Data structure containing a record for each variable name with fields for
the attributes of the name

What does it look like?

Attributes provide information about:
• Storage allocation
• Type
• Scope (where in the program its value may be used)
• Procedure names
• Number and types of its arguments
• Method of passing arguments (call by value or reference)
Compiler Design
Symbol Table

Why symbol table?


• Used during all phases of compilation
• Maintains information about many source language constructs
• Incrementally constructed and expanded during the analysis phases
• Used directly in the code generation phases
• Efficient storage and access important in practice
Compiler Design
Symbol Table

When and where is it used?


• Lexical Analysis time
– Lexical Analyzer scans program
– Finds Symbols
– Adds Symbols to symbol table
• Syntactic Analysis Time
– Information about each symbol is filled in/updated
• Used for type checking during semantic analysis
Compiler Design
Symbol Table

More about symbol tables


• Each piece of info associated with a name is called an attribute.
• Attributes are language dependent.
– Actual Characters of the name
– Type
– Storage allocation info (number of bytes).
– Line number where declared
– Lines where referenced.
– Scope.
Compiler Design
Symbol Table

Different Classes of Symbols have different attributes


• Variable, Type, Constant, parameter, record field.
– Type, storage, name, scope.
• Procedure or function.
– Number of parameters, parameters themselves, result type.
• Array
– # of Dimensions, Array bounds.
• File
– record size, record type
Compiler Design
Symbol Table

Other attributes:
• A scope of a variable can be represented by
– A number (Scope is just one of attributes)
– A different symbol table is constructed for different scope.

• Object-oriented languages have classes:
  – Method names, class names, object names.
  – Scoping is VERY important (inheritance).
Compiler Design
Constructing the symbol table

There are three main operations to be carried out on the symbol


table:
• determining whether a string has already been stored
• inserting an entry for a string
• deleting a string when it goes out of scope

This requires three functions:


•lookup(s): returns the index of the entry for string s, or 0 if there is
no entry
•insert(s,t): add a new entry for string s (of token t), and return its
index
• delete(s): deletes s from the table (or, typically, hides it)
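A minimal sketch of these operations in C, using a linear list (the names,
sizes and token codes below are illustrative assumptions, not from the
slides):

    #include <string.h>

    #define MAX_SYMS 1024

    struct entry {
        char name[32];   /* actual characters of the name */
        int  token;      /* token code for this entry     */
    };

    static struct entry table[MAX_SYMS]; /* entry 0 is left unused so     */
    static int n_syms = 0;               /* that lookup can return 0      */

    /* returns the index of the entry for string s, or 0 if there is no entry */
    int lookup(const char *s) {
        for (int i = 1; i <= n_syms; i++)
            if (strcmp(table[i].name, s) == 0)
                return i;
        return 0;
    }

    /* adds a new entry for string s (of token t) and returns its index */
    int insert(const char *s, int t) {
        if (n_syms + 1 >= MAX_SYMS)
            return 0;                    /* table full */
        n_syms++;
        strncpy(table[n_syms].name, s, sizeof table[n_syms].name - 1);
        table[n_syms].token = t;
        return n_syms;
    }

As noted above, delete(s) would typically just hide the entry (e.g., mark it
out of scope) rather than physically remove it.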
Compiler Design
Constructing the symbol table

The two symbol table mechanisms:


Linear lists and Hash tables

•Each scheme is evaluated on the basis of time required to add n


entries and make e inquiries.

•A linear list is the simplest to implement, but its performance is


poor when n and e are large.

•Hashing schemes provide better performance for greater


programming effort and space overhead.
Compiler Design
Constructing the symbol table (Example)

Code:
1. float fun2(int I, float j,){
2.     int k,e;
3.     float z;
4.     ;
5.     e=0;
6.     k = I *j+k;
7.     z = fun1(k,e,);
8.     *z; }

Name | Token | Dtype    | Value | Size | Scope | Declared | Referred | Other Attribute
fun2 | TK_ID | procname |       |      |       | 1        |          |
I    | TK_ID | int      |       | 4    | 1     | 1        | 6        | parameter
j    | TK_ID | float    |       | 4    | 1     | 1        | 6        | parameter
k    | TK_ID | int      |       | 4    | 0     | 2        | 6,7      | argument
e    | TK_ID | int      | 0     | 4    | 0     | 2        | 7        | argument
z    | TK_ID | float    |       | 4    | 0     | 3        | 7,8      | return
fun1 | TK_ID | procname |       |      |       |          | 7        | proccall
https://www.tutorialspoint.com/compiler_design/compiler_design_symbol_table.htm
Compiler Design
Constructing the symbol table (Linked list implementation)

http://codingloverlavi.blogspot.com/2014/02/symbol-table-implementation-using.html
Compiler Design
Symbol table and scope

• Symbol tables typically need to support multiple declarations of the same
  identifier within a program.
• We shall implement scopes by setting up a separate symbol table for each
  scope.

The scope of a declaration is the portion of a program to which the
declaration applies.
Compiler Design
Symbol table organization
THANK YOU

Divyaprabha K N
Department of Computer Science &
Engineering
[email protected]
Compiler Design

Divyaprabha K N
Assistant Professor,
Department of Computer Science & Engineering
COMPILER DESIGN: UNIT 1

Challenges in Lexical Analyzer

Divyaprabha K N
Department of Computer Science & Engineering
Compiler Design
Lecture Overview

We will learn about:


● Challenges in scanning in different languages
● C Language Grammar
Compiler Design
Challenge : Scanning is hard

Example 1: In Fortran, whitespace is insignificant:

VAR1 is exactly the same as VA R1


Fortran idea: removing all the whitespace should not change the meaning of the program.

Loop example
DO 5 I = 1,25 Here DO is a keyword representing a loop (similar to FOR in C), I = 1,25 represents that the
iteration variable I ranges from 1..25. The number 5 represents that the loop extends from
the DO statement till the statement with label 5, including all statements that are in between.

Another example:
DO 5 I = 1.25   The only difference from the previous code is the replacement of ',' with '.'. This
simply means that the variable DO5I = 1.25, i.e., a variable has been assigned a value (there is no
loop). The problem is that just by looking at the first three characters, we cannot tell whether DO is a
keyword or a prefix of a variable name. Hence we need lookahead. In this example, a large lookahead is
required. Ideally, the language design should ensure that the lookahead is small.
Compiler Design
Challenge : Scanning is hard

Lookahead is always required
Lookahead may be required to decide where one token ends and the next token
begins. We would like to minimize lookahead.

For example, when we read the first character of the keyword else, we need
lookahead to decide whether it is an identifier or a keyword. Similarly, we
need lookahead to disambiguate between == and =.
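A minimal sketch in C of that one-character lookahead (the token names and
the use of stdin are illustrative assumptions; the caller has already
consumed the first '='):

    #include <stdio.h>

    enum { TOK_ASSIGN, TOK_EQ };

    int scan_after_equals(void) {
        int next = getchar();        /* look ahead one character   */
        if (next == '=')
            return TOK_EQ;           /* the lexeme was '=='        */
        ungetc(next, stdin);         /* push the lookahead back    */
        return TOK_ASSIGN;           /* the lexeme was a plain '=' */
    }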

Example 2: PL/1 keywords are not reserved, i.e., keywords can be used as
identifiers.

This makes lexical analysis a bit more difficult: we need to decide what is a
variable name and what is a keyword, and so we need to look at what's going
on in the rest of the expression.
Compiler Design
Challenge : Scanning is hard

Example 3: Different uses of the same characters >, < in C++

Template syntax: a<b>
Stream syntax: cin>>var;
Binary right-shift syntax: a>>4
Nested template: A<B<C>>D;
Compiler Design
Syntactic Sugar

Syntactic Sugar: syntax designed to make things easier to read or express.

Examples:
• In C: a[i] is syntactic sugar for *(a+i)
• In C: a += b is equivalent to a = a + b
• In C#: var x = expr (the compiler deduces the type of x from expr)

Compilers expand sugared constructs into fundamental constructs (desugaring).


Compiler Design
Lexer Feedback

C and C++ Lexers require lexical feedback to differentiate between typedef


names and identifiers.
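A classic illustration (a sketch with hypothetical names): the statement
A * B; cannot be classified without knowing whether A is a typedef name,
which is why the lexer consults the symbol table:

    typedef int A;
    A * B;        /* declaration: B is a pointer to A (i.e., int *B) */

    /* ...whereas if A is an ordinary variable: */
    int A, B;
    A * B;        /* expression: multiply A by B */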
Compiler Design
Challenge : Scanning is hard

Example 4: Scanning in Python, where scope is handled using whitespace.

This requires whitespace tokens: special tokens inserted to indicate changes
in levels of indentation.
• NEWLINE marks the end of a line.
• INDENT indicates an increase in indentation.
• DEDENT indicates a decrease in indentation.
• Note that INDENT and DEDENT encode a change in indentation, not the total
  amount of indentation.
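As a rough illustration (token names follow Python's tokenize module; the
exact stream shown is an assumption):

    if x:
        y = 1
    z = 2

tokenizes approximately as

    NAME(if) NAME(x) OP(:) NEWLINE
    INDENT NAME(y) OP(=) NUMBER(1) NEWLINE
    DEDENT NAME(z) OP(=) NUMBER(2) NEWLINE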

Reference:
https://docs.python.org/3/reference/lexical_analysis.html
Compiler Design
Challenge : Scanning is hard

Example 4: Scanning in python (cont.)


THANK YOU

Divyaprabha K N
Department of Computer Science & Engineering
[email protected]
Compiler Design

Preet Kanwal
Department of Computer Science & Engineering

Teaching Assistant: Lavitra Kshitij Madan
Compiler Design

Unit 1
The Phases of a Compiler

Preet Kanwal
Department of Computer Science &
Engineering
Compiler Design
Lecture Overview

In this lecture, you will learn about:


● C Language Grammar
● Running an input through the different phases of a compiler
● Syntax Tree versus Parse Tree
Compiler Design
C Language Grammar

To define the grammar for the C language, let us first define the following:
P : Program Beginning
S : Statement
Declr : Declaration
Assign : Assignment
Cond : Condition
UnaryExpr : Unary Expression
Type : Data type
ListVar : List of variables
X : (can take any identifier or assignment)
RelOp : Relational Operator
Compiler Design
C Language Grammar

P → S
S → Declr; S | Assign; S | if (Cond) {S} S | while (Cond) {S} S |
    if (Cond) {S} else {S} S | for (Assign; Cond; UnaryExpr) {S} S | return E; S | λ
Declr → Type ListVar
Type → int | float
ListVar → X | ListVar, X
X → id | Assign
Assign → id = E
Cond → E RelOp E
RelOp → < | > | <= | >= | == | !=
UnaryExpr → E++ | ++E | E-- | --E

Recall that
E → E + T | E – T | T
T → T * F | T / F | F
F → G ^ F | G
G → ( E ) | id | num
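For instance, here is a sketch of how the statement int a; derives from this
grammar (with a read as the token id):

    P ⇒ S ⇒ Declr; S ⇒ Type ListVar; S ⇒ int ListVar; S
      ⇒ int X; S ⇒ int id; S ⇒ int id;    (using S → λ)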
Compiler Design
Another example for all phases of a compiler

Example 1 : Consider the following piece of code:


while (number > 0)
{
factorial = factorial * number;
--number;
}

We will now run this code through the different phases of a compiler and
see what the output will be at each stage.
Compiler Design
Example 1
(continued)

Phase 1 : Lexical Analysis or Linear Analysis or Scanner

Input:
while (number > 0)
{
    factorial = factorial * number;
    --number;
}

Pattern                               | Lexemes matched
Keyword                               | while
Identifier                            | number, factorial
RelOp (<, <=, >, >=, ==, !=)          | >
LogOp (&&, ||, ~)                     | –
Assign (=)                            | =
ArithOp (*, +, -, /)                  | *
Punctuation ( (, ), {, }, [, ], ; )   | ( ) { ; ; }
Number                                | 0
Literal – anything in double quotes   | –
Inc_op / Dec_op                       | --
Compiler Design
Example 1
(continued)
Output: tokens
< keyword, while >
< punctuation, ( > or < ( >
< ID, 3 >
< RelOp, > > or < > >
< Number, 0 >
< punctuation, ) > or < ) >
< punctuation, { > or < { >
< ID, 4 >
< Assign, = > or < = >
< ID, 4 >
< ArithOp, * > or < * >
< ID, 3 >
< punctuation, ; > or < ; >
< Dec_op, -- > or < -- >
< ID, 3 >
< punctuation, ; > or < ; >
< punctuation, } > or < } >

(< ID, 3 > corresponds to the third identifier, number, in the symbol table;
< ID, 4 > to the fourth, factorial.)
Compiler Design
Example 1
(continued)
Phase 2 : Syntax Analysis or Parser
Output: Parse Tree or Syntax Tree
(Use the grammar we have defined earlier.)

while (number > 0)


{
factorial = factorial *
number;
--number;
}
Compiler Design
Example 1
(continued)
Phase 3 : Semantic Analyzer
Output: Semantically checked Syntax Tree

while (number > 0)


{
factorial = factorial *
number;
--number;
}
Compiler Design
Example 1
(continued)
Phase 4 : Intermediate Code Generator
Output: Three Address Code (3AC or TAC)

L0: if number > 0 goto L1
    goto L2
L1: t1 = factorial * number
    factorial = t1
    t2 = number – 1
    number = t2
    goto L0
L2: …
Compiler Design
Example 1
(continued)
Phase 5 : Machine-Independent Code Optimization
Output: Optimized IR (via DAG)

L0: if number > 0 goto L1
    goto L2
L1: factorial = factorial * number
    number = number – 1
    goto L0
L2: …
Compiler Design
Example 1
(continued)
Phase 6 : Code Generator
Output: Assembly Code

     LD   R1, #0          // R1 = 0
L1:  LD   R4, number      // R4 = number
     SUB  R2, R1, R4      // R2 = 0 - number
     BGEZ R2, L2          // leave the loop when number <= 0
     LD   R3, factorial
     MUL  R3, R3, R4      // factorial * number
     ST   factorial, R3
     SUB  R4, R4, #1      // number - 1
     ST   number, R4
     BR   L1
L2:  …

Phase 7 : Machine-dependent Code Optimization
Output: Optimized Target Code
Compiler Design
Example 2

Example 2 : Consider the following piece of code:


int rhs, lhs = 0;
rhs = 2;

if (lhs <= rhs)


{
lhs = lhs – 2 * rhs;
}

Run this code through the different phases of a compiler and produce the
output at each stage.
Compiler Design
C Language Grammar

P → S
S → Declr; S | Assign; S | if (Cond) {S} S | while (Cond) {S} S |
    if (Cond) {S} else {S} S | for (Assign; Cond; UnaryExpr) {S} S | return E; S | λ
Declr → Type ListVar
Type → int | float
ListVar → X | ListVar, X
X → id | Assign
Assign → id = E
Cond → E RelOp E
RelOp → < | > | <= | >= | == | !=
UnaryExpr → E++ | ++E | E-- | --E

Recall that
E → E + T | E – T | T
T → T * F | T / F | F
F → G ^ F | G
G → ( E ) | id | num
Compiler Design
Simple IR
Optimizations

Constant folding is the process of recognizing and evaluating constant


expressions at compile time rather than computing them at runtime.

Constant propagation is the process of substituting the values of known


constants in expressions at compile time.

Common Subexpression Elimination (CSE) is a compiler optimization that


searches for instances of identical expressions (i.e., they all evaluate to the
same value), and analyzes whether it is worthwhile replacing them with a
single variable holding the computed value.
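A small before/after sketch in three-address code (the variable names are
illustrative):

    Before:              After:
    t1 = 2 * 3           t1 = 6            (constant folding)
    t2 = t1 + a          t2 = 6 + a        (constant propagation)
    t3 = b * c           t3 = b * c
    t4 = b * c           t4 = t3           (common subexpression elimination)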
Compiler Design
Syntax Tree vs Parse Tree

Parse Tree (Concrete Syntax Tree)                      | (Abstract) Syntax Tree
It is huge and focuses on all lexemes                  | It focuses only on the syntax
Consists of both non-terminals (P, S, E, Cond, RelOp,  | All nodes are tokens; it also has dummy
etc.) and terminals (leaf nodes like tokens)           | nodes (<body>) to maintain proper syntax
Complicated                                            | No non-terminals; easy
Has punctuation (; , { })                              | No punctuation symbols
THANK YOU
Preet Kanwal
Department of Computer Science & Engineering
[email protected]
Compiler Design

Preet Kanwal
Department of Computer Science & Engineering

Teaching Assistant: Lavitra Kshitij Madan
Compiler Design

Unit 1
The Phases of a Compiler

Preet Kanwal
Department of Computer Science &
Engineering
Compiler Design
Lecture Overview

In this lecture, you will practise:


● C Language Grammar
● Running an input through the different phases of a
compiler
Compiler Design
Example 2

Example 2 : Consider the following piece of code:

int rhs, lhs = 0;
rhs = 2;

if (lhs <= rhs)
{
    lhs = lhs – 2 * rhs;
}

We will now run this code through the different phases of a compiler and
see what the output will be at each stage.
Compiler Design
Example 2
(continued)

Phase 1 : Lexical Analysis or Linear Analysis or Scanner

Pattern                               | Lexemes matched
Keyword                               | int, if
Identifier                            | rhs, lhs, rhs, lhs, rhs, lhs, lhs, rhs
RelOp (<, <=, >, >=, ==, !=)          | <=
LogOp (&&, ||, ~)                     | –
Assign (=)                            | =, =, =
ArithOp (*, +, -, /)                  | -, *
Punctuation ( (, ), {, }, [, ], ; )   | , ; ; ( ) { ; }
Number                                | 0, 2, 2
Literal – anything in double quotes   | –
Compiler Design
Example 2
(continued)
Output: # of tokens : 27

< keyword, int > < ID, 3 > < punctuation, , > < ID, 4 > < Assign, = > < Number, #0 > < punctuation, ; >
< ID, 3 > < Assign, = > < Number, #2 > < punctuation, ; >
< keyword, if > < punctuation, ( > < ID, 4 > < RelOp, <= > < ID, 3 > < punctuation, ) >
< punctuation, { > < ID, 4 > < Assign, = > < ID, 4 > < ArithOp, - > < Number, #2 > < ArithOp, * > < ID, 3 > < punctuation, ; > < punctuation, } >

Symbol Table:
index | name
1     | …
2     | …
3     | rhs
4     | lhs
Compiler Design
Example 2
(continued)
Phase 2 : Syntax Analysis or Parser
(Use the grammar we have defined earlier.)

int rhs, lhs = 0;
rhs = 2;

if (lhs <= rhs)
{
    lhs = lhs – 2 * rhs;
}
Compiler Design
Example 2
(continued)
Phase 3 : Semantic Analyzer
Output: Semantically checked Syntax
Tree

int rhs, lhs = 0;


rhs = 2;
if (lhs <= rhs)
{
lhs = lhs – 2 * rhs;
}
Compiler Design
Example 2
(continued)
Phase 4 : Intermediate Code Generator
Output: Three Address Code (3AC or TAC)

lhs = 0
rhs = 2
t0 = lhs <= rhs
if t0 goto L1
goto L2
L1: t1 = 2 * rhs
    t2 = lhs - t1
    lhs = t2
L2: …
Compiler Design
Example 2
(continued)
Phase 5 : Machine-Independent Code Optimization
Output: Optimized IR

lhs = 0
rhs = 2
t0 = lhs <= rhs
if t0 goto L1
goto L2
L1: t1 = 2 * rhs
    lhs = lhs - t1
L2: next
Compiler Design
Example 2
(continued)
Phase 6 : Code Generator
Output: Assembly Code

MOV  R1, #0         // lhs = 0
ST   lhs, R1
MOV  R2, #2         // rhs = 2
ST   rhs, R2
SUB  R3, R2, R1     // rhs - lhs
BLTZ R3, L2         // skip the if-body when lhs > rhs
LD   R4, #2
MUL  R4, R2         // t1 = 2 * rhs
SUB  R5, R1, R4     // lhs - t1
ST   lhs, R5
L2:  …
Compiler Design
Example 3

Example 3 : Consider the following piece of code:


n = 23;
for (i = 0; i < n; i++)
{
sum = sum * i;
}

We will now run this code through the different phases of a compiler and
see what the output will be at each stage.
Compiler Design
Example 3
(continued)

Phase 1 : Lexical Analysis or Linear Analysis or Scanner

Pattern                               | Lexemes matched
Keyword                               | for
Identifier                            | n, i, i, n, i, sum, sum, i
RelOp (<, <=, >, >=, ==, !=)          | <
LogOp (&&, ||, ~)                     | –
Assign (=)                            | =, =, =
ArithOp (*, +, -, /)                  | *
Punctuation ( (, ), {, }, [, ], ; )   | ; ( ; ; ) { ; }
Number                                | 23, 0
Compiler Design
Example 3
(continued)
Output: # of tokens : 25

< ID, 1 > < = > < Number, 23 > < ; >
< Keyword, for > < ( > < ID, 2 > < assign_op, = > < Number, 0 > < ; >
< ID, 2 > < < > < ID, 1 > < ; > < ID, 2 > < Inc_op, ++ > < ) >
< { > < ID, 3 > < assign_op, = > < ID, 3 > < arith_op, * > < ID, 2 > < ; > < } >

Symbol Table:
index | name
1     | n
2     | i
3     | sum
Compiler Design
Example 3
(continued)
Phase 2 : Syntax Analysis or Parser
(Assume a slightly modified grammar, with functions.)

n = 23;
for (i = 0; i < n; i++)
{
    sum = sum * i;
}
Compiler Design
Example 3
(continued)
Phase 3 : Semantic Analyzer
Output: Semantically checked Syntax
Tree

n = 23;
for (i = 0; i < n; i++)
{
sum = sum * i;
}
Compiler Design
Example 3
(continued)
Phase 4 : Intermediate Code Generator
Output: Three Address Code (3AC or TAC)

n = 23
i = 0
L0: if i < n goto L1
    goto L2
L1: t1 = sum * i
    sum = t1
    t2 = i + 1
    i = t2
    goto L0
L2: …
Compiler Design
Example 3
(continued)
Phase 5 : Machine-Independent Code Optimization
Output: Optimized IR

n = 23
i = 0
L0: if i < n goto L1
    goto L2
L1: sum = sum * i
    i = i + 1
    goto L0
L2: …
Compiler Design
Example 3
(continued)
Phase 6 : Code Generator
Output: Assembly Code

     MOV  R1, #23        // R1 → n
     MOV  R2, #0         // R2 → i
     MOV  R4, sum        // R4 = contents(sum)
L0:  SUB  R3, R1, R2     // n - i
     BLEZ R3, L2         // leave the loop when i >= n
     MUL  R4, R4, R2     // sum * i
     ADD  R2, R2, #1     // i + 1
     BR   L0
L2:  ST   sum, R4
     …
Example 3

Example 3 : Consider the following piece of code:


n = 23;
for (i = 0; i < n; i++)
{
printf(“%c”, 65+i);
}

We will now run this code through the different phases of a compiler and
see what the output will be at each stage.
Compiler Design
Example 3
(continued)

Phase 1 : Lexical Analysis or Linear Analysis or Scanner

Pattern                               | Lexemes matched
Keyword                               | for
Identifier                            | n, i, i, n, i, printf, i
RelOp (<, <=, >, >=, ==, !=)          | <
LogOp (&&, ||, ~)                     | –
Assign (=)                            | =, =
ArithOp (*, +, -, /)                  | +
Punctuation ( (, ), {, }, [, ], ; )   | ; ( ; ; ) { ( , ) ; }
Number                                | 23, 0, 65
Literal – anything in double quotes   | "%c"
Compiler Design
Example 3
(continued)
Output: tokens

< ID, 1 > < = > < Number, 23 > < ; >
< Keyword, for > < ( > < ID, 2 > < = > < Number, 0 > < ; >
< ID, 2 > < < > < ID, 1 > < ; > < ID, 2 > < ++ > < ) >
< { > < ID, 3 > < ( > < Literal, "%c" > < , > < Number, 65 > < + > < ID, 2 > < ) > < ; > < } >
Compiler Design
Example 3
(continued)
Phase 2 : Syntax Analysis or Parser
(Assume a slightly modified grammar, with functions:
S → id ( Param ); S
Param → E, Param | E)

n = 23;
for (i = 0; i < n; i++)
{
    printf("%c", 65+i);
}
Compiler Design
Example 3
(continued)
Phase 3 : Semantic Analyzer
Output: Semantically checked Syntax
Tree

n = 23;
for (i = 0; i < n; i++)
{
printf(“%c”, 65+i);
}
Compiler Design
Example 3
(continued)
Phase 4 : Intermediate Code Generator
Output: Three Address Code (3AC or TAC)

n = 23
i = 0
L0: if i < n goto L1
    goto L2
L1: t1 = 65 + i
    call(printf, t1)
    t2 = i + 1
    i = t2
    goto L0
L2: …
Compiler Design
Example 3
(continued)
Phase 5 : Machine-Independent Code Optimization
Output: Optimized IR (via DAG)

n = 23
i = 0
L0: if i < n goto L1
    goto L2
L1: t1 = 65 + i
    call(printf, t1)
    i = i + 1
    goto L0
L2: …
Compiler Design
Example 3
(continued)
Phase 6 : Code Generator
Output: Assembly Code

     MOV  R1, #23        // n = 23
     ST   n, R1
     LD   R2, #0         // i = 0
L1:  SUB  R3, R1, R2     // n - i
     BLEZ R3, L2         // leave the loop when i >= n
     LD   R4, #65
     ADD  R5, R4, R2     // 65 + i
     PRINT R5
     ADD  R2, R2, #1     // i + 1
     BR   L1
L2:  …
Compiler Design
Example 4 - Take-home exercise

Exercise Problem: Consider the following piece of code:

int n = 3;
do
{
    if( (n/2)*2 == n )
    {
        n = n / 2;
    }
    else
    {
        n = 3 * n + 1;
    }
} while(n != 1 || n != 2 || n != 4);
return n;

This problem is for the understanding of the concept by testing you on
if-else and do-while constructs.

Note: The parse tree may become large; try to do the if-else part separately.

Hint: Consider S → do {S} while (Cond); S in the grammar.
THANK YOU
Preet Kanwal
Department of Computer Science & Engineering
[email protected]
Compiler Design

Preet Kanwal
Department of Computer Science & Engineering

Teaching Assistant: Kavya P K
Compiler Design

Unit 1
Lexical Analysis

Preet Kanwal
Department of Computer Science &
Engineering
Compiler Design
Lecture Overview

In this lecture, you will learn about:


● The Lexical Analyser
● Role of a lexer
● The need to separate Lexical Analyser
and Parser and the interaction between them
● Challenge of Lexical Analyser - Speed
● Input Buffer
○ Buffer Pairs
○ Buffer Pair Algorithm
○ Sentinels
● Lexical Errors
Compiler Design
Some important definitions

1. Token: a pair consisting of a token name and an optional attribute value.
   <token name, attribute value>
   ● The token name is an abstract symbol representing a kind of lexical unit
     - keywords, identifiers etc.
   ● The attribute value is a pointer to the symbol-table entry.
2. Pattern: a description of the form that the lexemes of a token may take.
   a. Keyword - a fixed sequence of characters
   b. Identifier - a sequence of characters with some more complex structure
3. Lexeme: a sequence of characters in the source program that matches the
   pattern for a token, i.e., an instance of the token.
Compiler Design
Examples

Token Informal Description of Pattern Lexemes


if Character i,f if
else Characters e,l,s,e else
Comparison < or > or <= or >= or == or != <=,!=

id Letter followed by letters and digits Sum, avg1, pi


number Any numeric constant 10001, 3.14

literal Anything but “, surrounded by “ “compiler design”


Compiler Design
Introduction to Lexical Analyzer

● The Lexical Analyzer is the first phase of the compiler.
● It is language dependent.
● It performs two main tasks:
1. Scanning
   a. Reading the input characters of the source program.
   b. Performing simple processes that do not require tokenization of the
      input: deletion of comments, compaction of consecutive whitespaces.
2. Lexical Analysis - producing the sequence of tokens as output.
Compiler Design
Goal of a Lexical Analyzer

1. To convert the physical description of a program into a sequence of
   tokens.
2. Each token represents one logical piece of the source file - a keyword,
   the name of a variable, etc.
3. Each token is associated with a lexeme, i.e., the actual text of the
   token: "137", "int", etc.
4. Each token can have optional attributes - extra information derived from
   the text, perhaps a numeric value.
5. The token sequence will be used in the parser to recover the program
   structure.
Compiler Design
Role of a lexer
(1)

The Lexical analyzer reads the input characters of the source program and
then performs the following steps:
1. Groups the input characters into Lexemes.
2. Produces a sequence of tokens for each lexeme in the source program as
output.
3. Sends the stream of tokens to the parser for Syntax Analysis.
4. When a lexeme representing an identifier is discovered, it is entered into
the symbol table.
Compiler Design
Role of a Lexer
(2)

The Lexical Analyzer also performs the following secondary tasks:


● Stripping out comments and white space like blank, new line, tab and other
  characters that are used to separate tokens in the input.
● Correlating error messages generated by the compiler with the source
  program file.
● Correlating error messages generated by the compiler
with the source program file.
● Keeping track of the number of the newline characters and associate
each error message with its line number.
● Make a copy of the source program with error
messages inserted at appropriate positions.
● Expansion of macros
Compiler Design
The need to separate Lexical Analyzer and
Parser
● There exists a tight coupling between the lexical analyzer and the parser.
● The parser works as the master program, and directs the lexical analyzer.
● However, there are several reasons to separate lexical analysis and syntax
  analysis into different phases:
1. Simplicity of design
   ● Comments and spaces are removed in the lexical analysis phase.
   ● If performed by the parser, parser design becomes very complex.
   ● Separating lexical and syntactical concerns leads to cleaner design.
2. Compiler efficiency is improved - specialised buffering schemes can be
   used for reading characters, which can speed up the compiler.
3. Compiler portability is enhanced - input-specific peculiarities can be
   restricted to the lexical analyzer.
Compiler Design
Lexical Analyzer Versus
Parser
Compiler Design
Interaction Between the Lexical Analyzer and
Parser(1)

[Figure: The parser calls getNextToken; the lexical analyzer (lexer/scanner)
reads lexemes from the input file (source code) and returns a token for each.
The lexer is specified by regular expressions and the parser by a grammar;
both consult the symbol table and report problems to the error handler. The
token stream then goes on to semantic analysis.]
Compiler Design
Interaction Between the Lexical Analyzer and Parser
(2)

● Lexer stops when parser stops since the parser is its


master.
● If working only with the lexer, then the lexer must be
explicitly stopped, or it will continue running
indefinitely.
● Comments are eliminated by the preprocessor, but
the compiler can also remove comments in the lexer
phase. With gcc, running only the preprocessor
("-E") normally strips comments, unless "-C" is also
given to retain them. Either way, the lexer can make
sure that comments do not reach the parser.
Compiler Design
Challenge: Speed of Lexical
Analyzer

● The lexical analyzer reads characters of the source program one at a time to
discover tokens.
● Thus, speed of lexical analysis is a concern.
● In many languages, there are scenarios when the lexer needs to look ahead
at least one character before a match can be announced.
Solution - The lexical analyzer reads from Input Buffers
● There are many schemes that can be used to buffer input.
This course discusses two schemes -
○ Buffer Pairs
○ Sentinels
Compiler Design
Input Buffers - Buffer
Pairs

● This scheme makes use of two buffers of the same size (N).


● Usually, N = size of disk block = 4096 bytes.
● These buffers are alternately reloaded.
● A single system read can be used to read N characters into the buffer instead
of one system call per character.
● If fewer than N characters remain in the input file, EOF is read into the
buffer, marking the end of the source file.
Compiler Design
Input Buffers - Buffer
Pairs
● Two pointers are maintained to the input buffer -
1. lexeme_begin - marks the beginning of the current lexeme
2. forward - scans ahead till a pattern is found.
● Initially, both pointers point to the first character of the next lexeme to be
found.
● The forward pointer scans ahead until a match for the pattern is found.
● The string of characters between the two pointers is the current lexeme.
● Once the lexeme is determined, the forward pointer is set to the character at
the right end of the lexeme.
● After the lexeme is processed, both pointers are set
to the character immediately after the lexeme.
Compiler Design
Input Buffering - Buffer Pairs
Algorithm

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else
    forward := forward + 1;
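The pseudocode above can be fleshed out in C. The sketch below is illustrative only (the names buf, N, fp, reload() and advance() are chosen here, not taken from the course files); it shows the two halves, the one-system-read-per-N-characters reload, and the two checks made on every advance:

#include <stdio.h>

#define N 4096                      /* one half = one disk block */
static char buf[2 * N];             /* the two buffer halves, back to back */
static char *forward = buf;         /* scanning pointer */
static FILE *fp;                    /* source file, opened elsewhere */

static void reload(char *half) {    /* one read for N characters */
    size_t n = fread(half, 1, N, fp);
    if (n < N)
        half[n] = EOF;              /* mark the real end of the source */
}

char advance(void) {
    if (forward == buf + N - 1) {            /* check 1: end of first half? */
        reload(buf + N);                     /* reload second half */
        forward++;
    } else if (forward == buf + 2*N - 1) {   /* check 1: end of second half? */
        reload(buf);                         /* reload first half */
        forward = buf;                       /* wrap to beginning of first half */
    } else {
        forward++;
    }
    return *forward;                         /* check 2: which character was read */
}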
Compiler Design
Input Buffering - Buffer Pairs
Example

Consider the following statement -
abc = pqr * xyz ;
Compiler Design
Input Buffering - Buffer Pairs
Example

abc = pqr * xyz ;


Initially, both pointers are set to the beginning of the
buffer.
Compiler Design
Input Buffering - Buffer Pairs
Example

abc = pqr * xyz ;


Forward pointer advances till a pattern is
matched.
Compiler Design
Input Buffering - Buffer Pairs
Example

abc = pqr * xyz ;


Forward pointer advances till a pattern is
matched.
Compiler Design
Input Buffering - Buffer Pairs
Example

abc = pqr * xyz ;


Here, a pattern is matched. The current lexeme lies
between the two pointers.
Current lexeme: abc
Compiler Design
Input Buffering - Buffer Pairs
Example

abc = pqr * xyz ;


Both forward and lexeme_begin are set to the next
character after the lexeme.
Compiler Design
Input Buffering - Buffer Pairs
Example

abc = pqr * xyz ;


The current lexeme is =
Both forward and lexeme_begin are moved to the
next character after the current lexeme.
Compiler Design
Input Buffering - Buffer Pairs
Example

abc = pqr * xyz ;


The forward pointer has reached the end of the first
buffer.
Compiler Design
Input Buffering - Buffer Pairs
Example

abc = pqr * xyz ;


The second input buffer is reloaded. The forward pointer
moves ahead till the next pattern is matched.
This process continues till EOF is reached.
Compiler Design
Drawbacks of Buffer
Pairs

The drawback of the buffer-pairs algorithm is that each
time we advance forward, two checks need to be
performed -
1. For end of buffer
2. To determine which character is read

Solution - Combine the buffer-end check with the check for current
character.
This can be done by extending each buffer to hold a Sentinel character (EOF)
at the end.
Compiler Design
Input Buffering : Buffer pairs with
Sentinels

Sentinels (eof) are added to the end of each input
buffer.
Compiler Design
Input Buffering - Buffer pair Algorithm (with
sentinels)

forward := forward + 1;
if character at forward = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else
        /* eof within buffer, i.e., end of input */
        terminate lexical analysis
end
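In C, the sentinel version can reuse the names from the earlier sketch; each half now stores at most N-1 input characters, and its last cell permanently holds the eof sentinel, so the hot path is a single comparison (again an illustrative sketch, not the course's code):

static void reload_s(char *half) {
    size_t n = fread(half, 1, N - 1, fp);
    half[n] = EOF;                   /* real end of input, if the read was short */
    half[N - 1] = EOF;               /* the permanent sentinel */
}

char advance_sentinel(void) {
    forward++;
    if (*forward == (char)EOF) {                /* the only check in the common case */
        if (forward == buf + N - 1) {           /* sentinel ending the first half */
            reload_s(buf + N);
            forward = buf + N;
        } else if (forward == buf + 2*N - 1) {  /* sentinel ending the second half */
            reload_s(buf);
            forward = buf;
        }
        /* otherwise eof lies inside a half: end of input, terminate lexical analysis */
    }
    return *forward;
}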
Compiler Design
Lexical Errors

Suppose a situation arises in which the lexical analyzer is unable to
proceed because none of the patterns for tokens matches any prefix of
the remaining input.

Some possible scenarios are -


1. Spelling Error
2. Unmatched string
3. Appearance of illegal characters
4. Exceeding length of identifiers
Compiler Design
Lexical Errors

1. Spelling Error - e.g., di{ ………………… } while();
(here di is a misspelling of the keyword do)
2. Unmatched string
3. Appearance of illegal characters
4. Exceeding length of identifiers
Compiler Design
Lexical Errors

1. Spelling Error
2. Unmatched string - e.g., printf("hi);   should be   printf("hi");
3. Appearance of illegal characters
4. Exceeding length of identifiers
Compiler Design
Lexical Errors

1. Spelling Error
2. Unmatched string
3. Appearance of illegal characters - e.g., printf("hi")$$;
4. Exceeding length of identifiers
Compiler Design
Lexical Errors

1. Spelling Error
2. Unmatched string
3. Appearance of illegal characters
4. Exceeding length of identifiers - e.g., more than 32 characters in an identifier
Compiler Design
Error Message

When writing an error message, make sure it satisfies the following


properties:
● It must pinpoint the error correctly
● It must be understandable
● It must be specific
● It must not have any redundancy
THANK
YOU
Preet Kanwal
Department of Computer Science & Engineering
[email protected]
Compiler Design

Preet Kanwal
Department of Computer Science & Engineering

Teaching Assistant : Kavya P K
Compiler Design

Lex
Tutorial

Preet Kanwal
Department of Computer Science &
Engineering
Compiler Design
Lex and Yacc

● Lex - reads a specification file containing regular expressions and


generates a C routine that performs lexical analysis. Matches sequences
that identify tokens.
● Yacc - reads a specification file that codifies the grammar of a language
and generates a parsing routine.
Compiler Design
Lecture
Overview
In this lecture, you will learn about:
● What is lex and yacc?
● Flow of code
● How to write a Lex Specification file ?
● Example 1 - A simple program
● Predefined variables in lex
● Example 2 - Prepend Line Numbers
● Example 3 - Use of definitions
● Example 4 - Word count
● Example 5 - Removing Comments
● Start conditions
● Commonly used Regular expressions in lex specification
file
Compiler Design
Flow of code - using lex and yacc
tools
Compiler Design
How to write a Lex Specification file (*.l)
?

• lex.l is an input file written in a language which
describes the generation of a lexical analyzer. The lex
compiler transforms lex.l to a C program known
as lex.yy.c.
• lex.yy.c is compiled by the C compiler to a file
called a.out.
• The output of the C compiler is the working lexical
analyzer, which takes a stream of input characters and
produces a stream of tokens.
• yylval is a global variable which is shared by lexical
analyzer and parser to return the name and an
attribute value of token.
• The attribute value can be numeric code, pointer to
symbol table or nothing.
• Another tool for lexical analyzer generation is Flex.
Compiler Design
How to write a Lex Specification file (*.l)
?

● The Lex specification file is divided into three sections, with %% dividing
the sections.
● The structure of a lex specification file is as follows -

Definitions include declarations of constants, variables and regular definitions.

Rules are statements of the form p1 {action1} p2 {action2} .... pn {actionn},
where pi describes a regular expression and actioni describes what action the lexical
analyzer should take when pattern pi matches a lexeme.

User subroutines are auxiliary procedures needed by the actions. The subroutines can be loaded with
the lexical analyzer and compiled separately.
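As a minimal, runnable skeleton (the patterns here are placeholders chosen for illustration, not part of any course file), the three sections look like this:

%{
#include <stdio.h>   /* code between %{ and %} is copied verbatim into lex.yy.c */
%}
digit [0-9]
%%
{digit}+   { printf("number: %s\n", yytext); }
.|\n       { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }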
Compiler Design
Lex Specification
file
● Input is copied to output one character at a time.
● The first %% is always required as there must always be a rules section.
● However, if we don’t specify any rules, then the default action is to
match everything and copy it to output.
● Defaults for input and output are stdin and stdout.
● Any source code not intercepted by lex is copied
into the generated program.
○ A line that is not part of a lex rule or action, which begins with a blank
or tab, is copied out as above (useful for global declarations)
○ Anything included between lines containing only %{ and %}
(useful for preprocessor statements that must start in
col.1)
○ Anything after the second %% delimiter is copied
Compiler Design
Lex Specification
file
● Definitions intended for lex are given before the first %%. Any line
in this section that does not begin with a blank or tab, and is not
enclosed by %{...%}, is assumed to be defining a lex substitution
string of the form
name translation , e.g., letter [a-zA-Z]
● yytext : a character array that contains the
actual string that matches a pattern.
● yyleng : the no. of characters matched.
● Values associated with the token are returned by lex in a union
variable yylval, for example:
[0-9]+  { yylval = atoi(yytext); return INTEGER; }
● Local variables can be defined.
● Do not leave extra spaces and/or empty lines at the end of the
lex specification file.
Compiler Design
Lex Specification file - A Simple
Example
https://www.epaperpress.com/lexandyacc/prl.html

● Here, two patterns have been specified in the rules section.
● Each pattern must begin in column one.
● This is followed by whitespace (space, tab or newline) and an
optional action associated with the pattern.
● The action may be a single C statement, or multiple C
statements enclosed in braces.
● Anything not starting in column one is copied to the generated
C file. This behavior can be used to specify comments in the lex file.
● In this example there are two patterns, "." and "\n", with an
ECHO action associated with each pattern.

%%
.    ECHO;
\n   ECHO;
%%
int yywrap() {
    return 1;
}
int main(void) {
    yylex();
    return 0;
}
Compiler Design
Lex Specification file - A Simple Example
(cont.)

● Several macros and variables are predefined by lex.
○ ECHO is a macro that writes out the text matched by the pattern.
This is the default action for any unmatched strings.
Typically, ECHO is defined as:
#define ECHO fwrite(yytext, yyleng, 1, yyout)
● Function yywrap is called by lex when input is exhausted.
Return 1 if done, 0 if more processing is required.
● Every C program requires a main function. In this case we
simply call yylex, which is the main entry-point for lex.
● Some implementations of lex include copies of main and
yywrap in a library, thus eliminating the need to code them
explicitly.

%%
.    ECHO;
\n   ECHO;
%%
int yywrap() {
    return 1;
}
int main(void) {
    yylex();
    return 0;
}
Compiler Design
Some predefined variables in
Lex
Table : Lex Predefined Variables
Name Function
int yylex(void) call to invoke lexer, returns token
char *yytext pointer to matched string
yyleng length of matched string
yylval value associated with token
int yywrap(void) wrapup, return 1 if done, 0 if not done
FILE *yyout output file
FILE *yyin input file
INITIAL initial start condition
BEGIN condition switch start condition
ECHO write matched string
Compiler Design
Lex Specification file - Example 2 - Line
Numbers
● This example prepends line numbers to each line in a file.
● Some implementations of lex predefine and calculate yylineno.
○ %option yylineno is added to the definitions
○ Check program file : start_conditions.l
● Here, the input file for lex is yyin (which defaults to stdin);
it is initialised in the main function.
● Variable yytext is a pointer to the matched (NULL-terminated)
string and yyleng is the length of the matched string.
● Variable yyout is the output file and defaults to stdout.

%{
int yylineno;
%}
%%
(.*)\n  printf("%4d\t%s", ++yylineno, yytext);
%%
int yywrap() { return(1); }
int main() {
    yyin = fopen("input.txt", "r");
    yylex();
    fclose(yyin);
}
Program file - line_numbers.l
Compiler Design
Lex Specification file - Example 3 - Using
definitions

● This example demonstrates the use of the definitions section.
● The definitions section is composed of substitutions, code, and
start states.
● Code in the definitions section is simply copied as-is to the top of
the generated C file and must be bracketed with %{ and %} markers.
● Substitutions simplify pattern-matching rules.

digit  [0-9]
letter [_A-Za-z]
%{
#include <stdio.h>
%}
%%
{letter}({letter}|{digit})*  printf("Identifier : %s", yytext);
. ;
%%
int yywrap(){ return(1); }
int main()
{
    yyin = fopen("input.txt", "r");
    yylex();
    fclose(yyin);
    return 0;
}
Program file -
Compiler Design
Lex Specification file - Example 4 - Word
count

● This example counts the number of characters, words, and lines in a
file (similar to Unix wc).
● Whitespace must separate the defining term and the
associated expression.
● References to substitutions in the rules section are surrounded by
braces ({letter}) to distinguish them from literals.
● When we have a match in the rules section, the associated C
code is executed.

%{
int nchar, nword, nline;
%}
%%
\n         { nline++; nchar++; }
[^ \t\n]+  { nword++; nchar += yyleng; }
.          { nchar++; }
%%
int yywrap(){ return(1); }
int main(void) {
    yyin = fopen("input.txt", "r");
    yylex();
    printf("%d\t%d\t%d\n", nchar, nword, nline);
    return 0;
}
Program file -
Compiler Design
Lex Specification file - Example 5 - Removing
comments

● This example removes single-line and multi-line comments.

%%
\/\/(.*)            ;
\/\*(.*\n*)*.*\*\/  ;
%%
int yywrap(){ return(1); }
int main()
{
    yyin = fopen("input.txt", "r");
    yylex();
    return 0;
}
Program file - removing_comments.l
Compiler Design
yyparse() and yylex()
https://www.ibm.com/docs/en/zos/2.4.0?topic=yacc-function-section#yacfun

yyparse() returns a value of 0 if the input it parses is valid according to the
given grammar rules. It returns 1 if the input is incorrect and error recovery
is impossible.
yyparse() does not do its own lexical analysis. In other words, it does not pull
the input apart into tokens ready for parsing. Instead, it calls a routine
called yylex() every time it wants to obtain a token from the input.

yylex() returns a value indicating the type of token that has been obtained. If
the token has an actual value, this value (or some representation of the
value, for example, a pointer to a string containing the value) is returned in
an external variable named yylval.

It is up to the user to write a yylex() routine that breaks the input into tokens
and returns the tokens one by one to yyparse(). See Function section for
more information on the lexical analyzer.
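To make the division of labour concrete, here is a minimal hand-written yylex() that yyparse() could call. NUMBER is an assumed token code (in a real yacc build it would come from y.tab.h), and the main() here is only a stand-in driver for yyparse():

#include <ctype.h>
#include <stdio.h>

#define NUMBER 258     /* assumed token code; normally generated in y.tab.h */

int yylval;            /* token value shared with the parser */

int yylex(void) {
    int c = getchar();
    while (c == ' ' || c == '\t')
        c = getchar();                      /* skip blanks */
    if (isdigit(c)) {                       /* collect an integer token */
        yylval = 0;
        while (isdigit(c)) {
            yylval = yylval * 10 + (c - '0');
            c = getchar();
        }
        if (c != EOF) ungetc(c, stdin);     /* push back the lookahead character */
        return NUMBER;
    }
    if (c == EOF) return 0;                 /* 0 tells yyparse() the input is exhausted */
    return c;                               /* single characters stand for themselves */
}

int main(void) {                            /* stand-in driver for yyparse() */
    int t;
    while ((t = yylex()) != 0) {
        if (t == NUMBER) printf("NUMBER(%d) ", yylval);
        else             printf("'%c' ", t);
    }
    putchar('\n');
    return 0;
}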
Compiler Design
Start conditions

● Start conditions are a mechanism for conditionally activating patterns.


● This is useful for handling -
○ Conceptually different components of an input
○ Situations where the lex defaults (e.g., “longest
possible match”) don’t work, like in comments or quoted strings.

● Declare a set of start condition names using


%Start name1 name2 . . .
● If scn is a start condition name, then a pattern prefixed with <scn> will only be
active when the scanner is in start condition scn.
Compiler Design
Start conditions

● The scanner begins in start condition INITIAL,


of which all non-<scn>-prefixed rules are members.
● Start conditions such as these are inclusive: i.e., being
in that start condition adds appropriately prefixed rules to the
active rule set.
● flex also allows exclusive start conditions (declared using %x), which are
sometimes more convenient.
○ %x scn

Example - Detecting multi-line comments using start conditions
Program File - start_conditions.l
● This program makes use of start conditions to detect the beginning
and end of a multiline comment.
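A minimal sketch of the idea (this is not the course's start_conditions.l file; it uses a flex exclusive start condition to strip C-style comments while echoing everything else):

%x COMMENT
%%
"/*"           BEGIN(COMMENT);   /* enter the exclusive start condition */
<COMMENT>"*/"  BEGIN(INITIAL);   /* end of comment: resume normal scanning */
<COMMENT>.|\n  ;                 /* discard the comment contents */
.|\n           ECHO;             /* copy everything outside comments */
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }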
Compiler Design
Example
Compiler Design
Commonly used Regular expressions in lex specification
file

Table 1: Special Characters


Pattern Matches
. any character except newline
\. literal .
\n newline
\t tab
\\ matches character \
^ beginning of line
$ end of line
Compiler Design
Commonly used Regular expressions in lex specification
file
Table 2: Operators
Pattern Matches
? zero or one copy of the preceding expression
* zero or more copies of the preceding expression
+ one or more copies of the preceding expression
a|b a or b (alternating)
(ab)+ one or more copies of ab (grouping)
abc abc
abc* ab abc abcc abccc ...
"abc*" literal abc*
abc+ abc abcc abccc abcccc ...
a(bc)+ abc abcbc abcbcbc ...
a(bc)? a abc
Compiler Design
Commonly used Regular expressions in lex specification
file
Table 3: Character Class
Pattern Matches
[abc] one of: a b c
[a-z] any letter a through z
[a\-z] one of: a - z
[-az] one of: - a z
[A-Za-z0-9]+ one or more alphanumeric characters
[ \t\n]+ whitespace
[^ab] anything except: a b
[a^b] one of: a ^ b
[a|b] one of: a | b

https://support.workiva.com/hc/en-us/articles/4407304269204-Regular-expression-operators
Compiler Design
Commonly used Regular expressions in lex specification
file
Pattern   Matches
[a]       a but nothing else
[]a]      ] or a but nothing else
[[a]      [ or a but nothing else
[][]      ] or [ but nothing else (why?)
\[        [ (not a character class)
[-az]     - or a or z but nothing else
[az-]     - or a or z but nothing else
[]-az]    any character from ] to a, or z
[]az-]    ] or a or z or - but nothing else
[ab^]     a or b or ^ but nothing else
[]az^-]   ] or a or z or ^ or - but nothing else

Pattern Matches
[^ab] any character EXCEPT a or b
[^^] any character EXCEPT ^

https://www.ics.uci.edu/~alspaugh/cls/shr/regularExpression.html
THANK
YOU
Preet Kanwal
Department of Computer Science & Engineering
[email protected]
Compiler Design

Preet Kanwal
Department of Computer Science & Engineering

Teaching Assistant : Kavya P K
Compiler Design

Unit 1
Design of a Lexical Analyzer
Generator

Preet Kanwal
Department of Computer Science &
Engineering
Compiler Design
Lecture
Overview

In this lecture, you will learn about:


● Maximal Munch Rule
● Questions
● Lexical Analyzer with Lex
● The Structure of the Lex Analyzer Generator
● Constructing Automata - RE to NFA to DFA
● Example - Constructing Automata
● Lookahead Operator in Lex (/)
● Program - To identify keywords and
identifiers
● Program - Demonstrate Maximal munch rule
Compiler Design
Behaviour of Lexer - Example
1

Given -
abb printf(“1”)
aba printf(“2”)
a printf(“3”)

Questions:
• Provide output for the String : a
• Provide output for the String :
ababa
Compiler Design
How does a Lexer resolve
ambiguities?

Lex follows the principle of Maximal Munch rule.


● When more than one pattern can match the input, lex chooses as follows:
1. The longest match is preferred.
2. Among rules that match the same number of characters, the rule that
occurs earliest in the list is preferred.
Compiler Design
Example 1 -
Solution
Given -
abb printf(“1”)
aba printf(“2”)
a printf(“3”)

Questions:
● Provide output for the String : a
○ Answer - 3
● Provide output for the String : ababa
○ Answer - 2b3 (recall panic mode recovery)
■ aba - 2
■ b - matches nothing; echoed as-is
■ a - 3
Compiler Design
Example 2

Given -
a*b printf(“1”)
(a|b)*b printf(“2”)
c* printf(“3”)

Questions:
• Provide output for the String : cbabc
• Provide output for the String : cbbbbac
• Provide a String such that the Output is
132
Compiler Design
Example 2 -
Solution
a*b printf(“1”)
(a|b)*b printf(“2”)
c* printf(“3”)
● Provide output for the String : cbabc
○ Answer - 323
■ c - 3
■ bab - 2
■ c - 3
● Provide output for the String : cbbbbac
○ Answer - 32a3
■ c - 3
■ bbbb - 2
■ a - matches nothing; echoed as-is
■ c - 3
● Provide a String such that the Output is 132
○ Answer - abccbb
■ ab - 1
■ cc - 3
■ bb - 2
○ Other answers are possible
Compiler Design
Example 3

Given -
aa printf(“1”)
b?a+b? printf(“2”)
b?a*b? printf(“3”)

Questions:
● Provide output for the String : bbbaabb
● Provide a String such that the Output is
123
● Provide a String such that the Output is
321
Compiler Design
Example 3 -
Solution
aa printf(“1”)
b?a+b? printf(“2”)
b?a*b? printf(“3”)
● Provide output for the String : bbbaabb
○ Answer - 323
■ bb - 3
■ baab - 2
■ b - 3
● Provide a String such that the Output is 123
○ Not possible
● Provide a String such that the Output is 321
○ bbbabaa
■ bb - 3
■ bab - 2
■ aa - 1
Compiler Design
Example 4

Give an example of a set of regular expressions and an input string
such that:

1. The string can be broken apart into substrings, where
each substring matches one of the regular expressions, BUT
2. The longest-prefix algorithm will fail to break the string in a way where
each piece matches one of the regexes.
Compiler Design
Example 4 -
Solution

Answer -
Suppose the patterns are as follows:
a*   print("1")
ab   print("2")
bb   print("3")

Let the input string be aabbb.

Here, a matches 1, ab matches 2 and bb matches 3.

However, due to the maximal munch rule, the longest match is found:
aa matches 1, bb matches 3 and the last b is left unmatched. So the
output is 13b.
Compiler Design
Verify the answers of the Example 1,2,3,4 using
lex

Program Files -
● Example 1 - example1.l
● Example 2 - example2.l
● Example 3 - example3.l
● Example 4 - example4.l
○ In this program, lex shows a warning - “rule cannot be matched”,
because the first rule always matches the lexemes matched by the
other 2 rules.
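For reference, a plausible version of example1.l would be the file below (the actual course file may differ). Note that no catch-all rule is given, so unmatched characters fall through to lex's default rule and are echoed - which is exactly how the stray b appears in the output 2b3:

%%
abb  printf("1");
aba  printf("2");
a    printf("3");
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }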
Compiler Design
Lexical Analyzer with
Lex
Compiler Design
The Structure of the Lex Analyzer
Generator
Compiler Design
Constructing Automata - RE to NFA to
DFA
Compiler Design
Constructing Automata - RE to NFA to
DFA

● Convert each pattern (regular expression) in the lex program to an NFA.
● Combine all NFAs into one using a new start state, with ε-transitions to
each of the start states of the NFAs.
● Convert the combined NFA to a DFA.
Compiler Design
The Structure of the Lex Analyzer
Generator

The generated lexical analyzer consists of components that are created from
the lex specification file.
These components are -
● Transition Table - Created for all patterns defined in the lex program.
● Actions - Fragments of code associated with their corresponding patterns.
○ Actions are invoked by the automaton simulator at the appropriate
time.
● Functions - Defined in the auxiliary functions section of the lex program,
these are passed directly through lex to the output file.
● Automaton Simulator - The program that serves as the lexical analyzer and
uses the components mentioned above.
○ It simulates an automaton (NFA or DFA).
Compiler Design
Question 1 - Construct an automaton for the lex program

Given patterns and corresponding actions -
a     A1
abb   A2
a*b+  A3
Compiler Design
Question 1 - Construct an automaton for the lex program

Step 1 - Convert each lex pattern to an NFA
Compiler Design
Question 1 - Construct an automaton for the lex program

Step 2 - Construct a combined NFA by adding a new start state, and
ε-transitions to each start state (1, 3, 7)
Compiler Design
Question 1 - Construct an automaton for the lex program
If the Lexical analyzer simulates the above NFA, pattern matching is done as
follows -
● Let the input lexeme be aaba
● Initially, the combined NFA reads the input character pointed
by lexeme_begin.
● The forward pointer moves ahead. Set of states is calculated at each point.
● When the NFA simulation reaches a point in the input where there are no
next states, the set of states is empty (none).

Sequence of
states entered
while processing
input aaba
Compiler Design
Question 1 - Construct an automaton for the lex program
● At this point, the longest prefix that matches the NFA can be identified.
● As indicated in the diagram, in state 8, aab matches the pattern a*b+.
● Prefix aab is the longest prefix, and so it executes action A3.
● The forward pointer is moved to the end of the lexeme.
● Note - If there are several accepting states in the set, the action
associated with the earliest pattern is executed.

Sequence of
states entered
while
processing
input aaba
Compiler Design
Question 1 - Construct an automaton for the lex program

Step 3 - Convert the NFA to a DFA
Compiler Design
Question 1 - Construct an automaton for the lex program
If the Lexical analyzer simulates the above DFA, pattern matching is done as
follows -
● Let the input lexeme be abba
● Set of states is calculated at each point.
● When the DFA simulation reaches a point in the input where there are no
next states, the action associated with the last accepting state is executed.
● The lexeme is abb, and the accepting state is 68
● This matches pattern abb, and action A2 is executed

Sequence of
states entered
while
processing
input abba
Compiler Design
Question 2 - Construct an automaton for the lex program

Given patterns and corresponding actions -
a*b      A1
(a|b)*b  A2
c*       A3
Compiler Design
Question 2 - Construct an automaton for the lex program

Step 1 - Convert each lex pattern to an NFA
Compiler Design
Question 2 - Construct an automaton for the lex program

Step 2 - Construct a combined NFA by adding a new start state, and
ε-transitions to each start state (1, 3, 5)
Compiler Design
Question 2 - Construct an automaton for the lex program
If the Lexical analyzer simulates the above NFA, pattern matching is done as
follows -
● Let the input lexeme be ababc
● Initially, the combined NFA reads the input character pointed
by lexeme_begin.
● The forward pointer moves ahead. Set of states is calculated at each point.
● When the NFA simulation reaches a point in the input where there are no
next states, the set of states is empty (none).

Sequence of
states entered
while processing
input ababc
Compiler Design
Question 2 - Construct an automaton for the lex program
● At this point, the longest prefix that matches the NFA can be identified.
● As indicated in the diagram, in state 8, abab matches the pattern (a|b)*b.
● Prefix abab is the longest prefix, and so it executes action A2.
● The forward pointer is moved to the end of the lexeme.
● Note - If there are several accepting states in the set, the action
associated with the earliest pattern is executed.

Sequence of
states entered
while
processing
input ababc
Compiler Design
Question 2 - Construct an automaton for the lex program

Step 3 - Convert the NFA to a DFA
Compiler Design
Question 2 - Construct an automaton for the lex program
If the Lexical analyzer simulates the above DFA, pattern matching is done as
follows -
● Let the input lexeme be ababc
● Set of states is calculated at each point.
● When the DFA simulation reaches a point in the input where there are no
next states, the action associated with the last accepting state is executed.
● The lexeme is abab, and the accepting state is 24
● This matches pattern (a|b)*b, and action A2 is executed

Sequence of
states entered
while
processing
input ababc
Compiler Design
Lookahead Operator in Lex
(/)

● When a certain pattern is required to be matched only when it is followed
by certain other characters, the lookahead operator / is used.
● r1 / r2
○ r1 - the pattern that matches the lexeme
○ r2 - the additional pattern that must be matched before deciding the
token of the lexeme. Note - the characters that match r2 are NOT part of
the lexeme.
● Example - the FORTRAN IF statement of the form IF(condition) THEN ...
● In FORTRAN, keywords are not reserved.
Compiler Design
Lookahead Operator in Lex
(/)
● IF is always followed by -
○ Left parenthesis (
○ Some text
○ Right parenthesis )
○ Any letter
● The lex rule can be written as IF/\(.*\){letter}
● This lex code demonstrates the use of / to detect a FORTRAN IF:

letter [A-Za-z]
%{
#include <stdio.h>
%}
%%
IF/\(.*\){letter}  printf("FORTRAN IF");
. ;
%%
int yywrap(){ return(1); }
int main()
{
    yyin = fopen("input.txt", "r");
    yylex();
    fclose(yyin);
}
Program file : lookahead.l
Compiler Design
Implementation of Lookahead
Operator
● When converting to an NFA, the lookahead operator (/) is
treated as an ε-transition.
● The diagram below represents the NFA for the FORTRAN IF using lookahead.
● The ε-transition from state 2 to state 3 represents the lookahead operator.
● State 6 indicates the presence of IF.
● The lexeme IF is found by scanning backwards to the last occurrence of state
2 whenever state 6 is reached.
● Note - If the NFA has more than one ε-transition on /, then the
problem of finding the correct state (e.g., state 2 in this example) is difficult.
Compiler Design
Write the lex file to identify identifiers and
keywords

digit  [0-9]
letter [A-Za-z]
%{
#include <stdio.h>
%}
%%
if|else|int|char|float       { printf("Keyword :\t %s", yytext); }
{letter}({letter}|{digit})*  { printf("Valid Identifier:\t %s", yytext); }
. ;
%%
int yywrap(){ return(1); }
int main() {
    yyin = fopen("input.txt", "r");
    yylex();
    return 0;
}

The rule ". ;" is used to ignore all other characters.
Program file : keywords_and_identifiers.l
THANK
YOU
Preet Kanwal
Department of Computer Science & Engineering
[email protected]
Compiler Design

Preet Kanwal
Department of Computer Science & Engineering

Teaching Assistant : Kavya P K


Compiler Design

Unit 1
Specification and Recognition of Tokens

Preet Kanwal
Department of Computer Science & Engineering
Compiler Design
Lecture Overview

In this lecture, you will learn about:


● Revision :
○ What are tokens?
○ Classes of tokens
○ Attributes
● Implementation of Lexer:
○ Hand-written Implementation
○ Using a tool like: lex, PLY, CUP(in Java)
● Handwritten Implementation:
○ Specification of Tokens
○ Recap - Regular expressions and Transition diagrams
○ Implementation of Transition Diagrams
○ How does the Lexical Analyzer implement all transition
diagrams?
● Questions
Compiler Design
Some important definitions

Recall that
Token :- It is a pair consisting of a token name and an optional attribute
value.
<token name, attribute value>
● Token name is an abstract symbol representing a kind of lexical unit -
keywords, identifiers etc.
● Attribute value is a pointer to the symbol table entry.
Compiler Design
Examples

Token Informal Description of Pattern Lexemes


if Characters i, f if
else Characters e,l,s,e else
Comparison < or > or <= or >= or == or != <=,!=

id Letter followed by letters and digits Sum, avg1, pi


number Any numeric constant 10001, 3.14

literal Anything but “, surrounded by “ “compiler design”


Compiler Design
Tokens

● Tokens are required to classify substrings of source program according to


their role.
● The parser relies on token distinctions.
● get next token is the command sent from the Parser to the Lexical analyzer.
● On receipt of the command, the lexical analyzer scans the input until it
determines the next token, and returns it.
● Input is read left to right.
● Lookahead is required to decide where one token ends.
Compiler Design
Classes of Tokens

Choosing good tokens is important, and is mostly language specific.


The most common classes of tokens are listed below:
● Keywords - One token for each keyword. The pattern for a keyword is the
same as the keyword itself.
● Operators - Tokens either individually, or in classes for operators.
● Identifiers - One token representing all identifiers.
● Constants - One or more tokens representing constants, such as numbers and
literal strings.
● Punctuation Symbols - Tokens for each punctuation symbol, such as left and
right parentheses, comma and semicolon.
Compiler Design
Attributes of Tokens

● When more than one lexeme can match a pattern, lexical analyzer must
provide additional information about each lexeme matched.
● This additional information is stored in the attribute-value field of the token.
● The attribute-value field can include several pieces of information.
● Consider the example : token ID (an identifier)
○ Its attributes are the lexeme, the type and the location at which it is first
found. This is added to the symbol table.
○ The attribute value is the pointer to the symbol table entry for that
identifier.
● Lexical analyzer returns token name and attribute value to parser.
● The token name influences parsing decisions, whereas the attribute value
influences translation of tokens after parsing.
Compiler Design
Implementation of Lexer

There are two ways :


● Hand-written Implementation : One writes the lexer from scratch
as a program. For example, the C compiler is hand-written. This is
commonly called the loop-switch implementation.

● Using a tool : lex, PLY


Compiler Design
Implementation of Lexer

Hand-written Implementation
Compiler Design
Specification of Tokens

The specification of tokens is done using formal notations, i.e., regular
expressions.

Some important definitions -


● Symbol : A letter, digit or punctuation character
● Alphabet : A finite set of symbols, e.g., {0,1} is the binary alphabet
● String over an alphabet : A finite sequence of symbols drawn from that alphabet
● If s is a string, then |s| denotes the length of the string
● Language : A set of strings over a fixed alphabet
● Abstract languages : e.g., ∅ (the empty set) or {ε} (the set containing only the empty string)
Compiler Design
Recap - Regular Expressions

● Regular expressions are special text strings that describe a search pattern.
● It is mainly used for pattern matching.
● It is a metalanguage - a form of language used for description or analysis of
another language.
● It uses metacharacters - characters that have special meaning. For eg -
○ * matches any number of characters
○ . matches single characters except newline
○ \ drops the special meaning of the next character
● A Regular language is a formal language that can be expressed using a regular
expression.
● Regular expressions are the grammatical notations used for describing
structure of tokens, i.e, patterns.
Compiler Design
Regular Definitions

● For notational convenience, names can be given to regular expressions,


and these names can be used to define other regular expressions.
● This is called a regular definition.
● Example 1 : Regular definition for Identifiers
letter -> [a-zA-Z_]
digit -> [0-9]
id -> letter(letter|digit)*
● Example 2: Regular definition for whitespace
blank -> “ “
tab -> \t
newline -> \n
ws -> (blank|tab|newline)+
Compiler Design
Recap : Transition Diagrams

● Patterns are converted to stylized flowcharts called Transition Diagrams to


understand how a pattern traverses a given input.
● Transition diagrams consist of a collection of nodes called states.
● Each state represents a condition that summarises the characters seen.
● Two types of states -
1. Final or Accepting State
a. Indicates that a lexeme has been found.
b. An action is attached to this state - returning the token and attribute
value to the parser.
c. If it is necessary to retract the forward pointer by one position, a * is placed
next to the state.
2. Start State
Compiler Design
Implementing Transition Diagrams

● A variable state holds current state number


● A switch based on the value of state executes the code for that state.
● The following functions are used -
○ getc() or nextChar() - reads the next character from input.
○ install_id() - places the lexeme (identifier) in the symbol table if it’s not
present and returns a pointer to the same.
○ install_num() - similar to install_id(), enters the number in the table of
numbers if not present, and returns a pointer to the same.
○ retract() - If the accepting state has a * , this function is used to retract
forward pointer.
Compiler Design
Implementing Transition Diagrams

○ getToken() - examines the symbol table entry for the lexeme found and
returns the corresponding token name.
○ fail() - resets forward ptr to lexeme_begin to try another transition
diagram.
■ What the fail() function does depends on the global error recovery
strategy of the Lexical Analyzer
○ isalpha(c) - returns true iff c is a letter
○ isdigit(c) - returns true iff c is a digit
○ isalnum(c) - returns true iff c is a letter or a digit
○ isdelim(c) - returns true iff c is a delimiter
Compiler Design
Transition Diagram for Relational operators

Regular Expression -

relop -> < | <= | > | >= | <> | ==
Compiler Design
Implementation - Relational Operators
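A sketch of what that implementation looks like in C, using the helper routines listed earlier (nextChar(), retract(), fail()); the TOKEN type, RELOP and the attribute constants LT, LE, EQ, NE, GT, GE are assumed declarations for illustration, and the state numbers are illustrative:

typedef struct { int name; int attribute; } TOKEN;
enum { RELOP = 1, LT, LE, EQ, NE, GT, GE };

TOKEN getRelop(void) {
    TOKEN t = { RELOP, 0 };
    int state = 0;
    char c;
    while (1) {
        switch (state) {
        case 0:
            c = nextChar();
            if (c == '<')      state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();                    /* lexeme is not a relop */
            break;
        case 1:                             /* seen '<' */
            c = nextChar();
            if (c == '=')      { t.attribute = LE; return t; }
            else if (c == '>') { t.attribute = NE; return t; }
            else { retract(); t.attribute = LT; return t; }   /* accepting state marked * */
        case 5:                             /* seen '=' : must be '==' */
            c = nextChar();
            if (c == '=') { t.attribute = EQ; return t; }
            else fail();
            break;
        case 6:                             /* seen '>' */
            c = nextChar();
            if (c == '=') { t.attribute = GE; return t; }
            else { retract(); t.attribute = GT; return t; }
        }
    }
}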
Compiler Design
Transition Diagram for Identifiers

Regular Expression -

letter [a-zA-Z]
digit  [0-9]
id -> letter ( letter | digit | "_" )*
Compiler Design
Implementation for Identifiers

For pseudo-code, please refer to the file:


identifier_loopswitch_pseudocode.c.txt

Note: This file is only a conversion of the transition


diagram from the previous slide and is not executable
without a proper interface and implementations of
other functions like retract(), fail(), nextChar(), etc.
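Still, a compact sketch of what the identifier diagram becomes in loop-switch form may help. It assumes <ctype.h> plus the helper routines nextChar(), retract(), installID(), getToken() and fail() described earlier; the state numbers are illustrative:

int getId(void) {
    int state = 9;                           /* start state of the identifier diagram */
    char c;
    while (1) {
        switch (state) {
        case 9:
            c = nextChar();
            if (isalpha(c) || c == '_') state = 10;
            else fail();                     /* not an identifier: try another diagram */
            break;
        case 10:
            c = nextChar();
            if (isalnum(c) || c == '_') state = 10;   /* stay while letter/digit/_ */
            else state = 11;
            break;
        case 11:                             /* accepting state marked * : retract one char */
            retract();
            installID();                     /* enter the lexeme in the symbol table */
            return getToken();               /* keyword token if preloaded, else id */
        }
    }
}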
Compiler Design
Transition Diagram for Keywords

int | char | float | if | while | else | for

non-letter/digit
simply means that
the transition
should be over a
non-letter, non-digit
character
Compiler Design
Implementation for Keywords

● For pseudo-code, please refer to the file:


keyword_loopswitch_pseudocode.c

Note: This file is only a conversion of the transition


diagram from the previous slide and is not executable
without a proper interface and implementations of other
functions like retract(), fail(), nextChar(), etc.

● For code that combines identification of keywords and


identifiers, please refer to the file:
loopswitch.c

This code is an altered version of the code from:


https://gist.github.com/luckyshq/6749155
Compiler Design
Transition Diagram for Whitespaces

Regular Expression -

blank   " "
tab     "\t"
newline "\n"
ws -> (blank | tab | newline)+

Here, the input is retracted to begin at a non-whitespace character.

There is no return to the parser.

The process of lexical analysis is restarted after whitespace.
Compiler Design
Transition Diagram for Unsigned numbers

Regular Expression -

digits [0-9]+
number -> digits (. digits)? (E [+-]? digits)?
Compiler Design
Transition Diagram for Single and Multi-Line Comments in C

Regular Expression -
single-line comments: \/\/(.*)
multiline comments: \/\*(.*\n)*.*\*\/
Compiler Design
Keywords vs Identifiers - How does the lexer handle keywords?

The transition diagram for identifiers will also recognise Keywords.

There are two ways to handle keywords that look like identifiers -
1. Install the reserved words in the symbol table beforehand.
a. One field of the symbol table entry can indicate that these are reserved
words (e.g. token_name = keyword)
b. Thus, when an identifier is found
i. installID() is called to place the identifier in the symbol table.
ii. Since keywords were preloaded in the symbol table, installID()
returns the pointer to the keyword found.
c. getToken() can be used to return the token_name
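A runnable sketch of approach 1 (the tiny linear-scan table, the token codes and the function names here are illustrative, not a fixed interface):

#include <stdio.h>
#include <string.h>

enum { ID = 256, KW_IF, KW_ELSE, KW_WHILE };   /* assumed token codes */

typedef struct { char lexeme[32]; int token; } Entry;
static Entry table[128];
static int nentries;

static Entry *lookup(const char *s) {
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i].lexeme, s) == 0) return &table[i];
    return NULL;
}

static Entry *insert(const char *s, int token) {
    strcpy(table[nentries].lexeme, s);
    table[nentries].token = token;
    return &table[nentries++];
}

/* installID(): enter the lexeme if absent. Keywords were preloaded, so the
   entry found for "if" already carries the keyword token (getToken's job). */
static int installID(const char *lexeme) {
    Entry *e = lookup(lexeme);
    if (e == NULL) e = insert(lexeme, ID);
    return e->token;
}

int main(void) {
    insert("if", KW_IF);                       /* preload the reserved words */
    insert("else", KW_ELSE);
    insert("while", KW_WHILE);
    printf("%d %d\n", installID("if"), installID("count"));  /* keyword vs. id */
    return 0;
}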
Compiler Design
Keywords vs Identifiers - How does the lexer handle keywords?

2. Create a separate transition diagram for each keyword.


a. In this approach, each pattern for the keywords should be defined
before the pattern for identifiers.
b. This enables reserved word tokens to be recognised when the lexeme
matches both patterns.
Compiler Design
How does the Lexical Analyzer implement all transition diagrams?

There are three possible approaches -


● Approach 1 - Arrange transition diagrams for each token to be tried sequentially.
● Approach 2 - Run various transition diagrams in parallel
● Approach 3 - Combine all transition diagrams into one.
Compiler Design
How does the Lexical Analyzer implement all transition diagrams?

● Approach 1 - Arrange transition diagrams for each token to be tried


sequentially.
○ In this case, each time the function fail() is called, it resets the
forward pointer to lexeme_begin, and then tries the next transition
diagram.
○ In this method, each keyword can have its own transition diagram,
but it must be defined before the transition diagram for identifiers.
Compiler Design
How does the Lexical Analyzer implement all transition diagrams?

● Approach 2 - Run various transition diagrams in parallel


○ This involves feeding the next input character to all transition diagrams.
○ If more than one transition diagram matches the lexeme, the usual strategy
is to take the longest prefix.

● Approach 3 - Combine all transition diagrams into one.


○ The transition diagram reads input until there is no possible next state.
○ The longest lexeme that matches a pattern is taken.
○ This is the preferred approach. However, the problem of combining several
transition diagrams is more complex.
Compiler Design
Question 1: Find the number of tokens

int a = 20;

Tokens Count
int keyword 1
a Identifier 1
= Operator 1

20 Constant 1
; Special Symbol 1
Total 5
Compiler Design
Question 2: Find the number of tokens

int main() {
    //Compiler Design
    int a, b;
    a = 10;
    b = 20;
    return 0;
}

Tokens                                  Count
int main            Keyword                 2
( )                 Special symbol          2
{                   Special symbol          1
//Compiler Design   Comment                 0
int                 Keyword                 1
a                   Identifier              1
,                   Separator               1
b ;                 Identifier, symbol      2
a = 10 ;            …                       4
b = 20 ;            …                       4
return 0 ;          …                       3
}                   Special symbol          1
Total                                      22
COMPILER DESIGN
Question 3: Write the tokens and associated attribute

E = M * C ** 2

● <id, Pointer to symbol table entry for E >


● <assign_op>
● <id , Pointer to symbol table entry for M>
● <mult-op>
● <id, Pointer to symbol table entry for C>
● <exp-op>
● <number, integer value 2>
COMPILER DESIGN
Question 4: List all the tokens in the following code snippet

if(a > b)
{
    a = a + 1;
}
else
{
    b = b + 1;
}

Pattern        Lexemes
Keyword        if, else
Identifier     a, b
Relop          >
Assignop       =
Arithop        +
Number         1
Punctuation    ( ) { ; } { ; }
COMPILER DESIGN
Question 4: List all the tokens in the following code snippet

if(a > b)
{
    a = a + 1;
}
else
{
    b = b + 1;
}

< keyword, if >      < punctuation, ( >   < identifier, a >
< relop, > >         < identifier, b >    < punctuation, ) >
< punctuation, { >   < identifier, a >    < assignop, = >
< identifier, a >    < arithop, + >       < number, 1 >
< punctuation, ; >   < punctuation, } >   < keyword, else >
< punctuation, { >   < identifier, b >    < assignop, = >
< identifier, b >    < arithop, + >       < number, 1 >
< punctuation, ; >   < punctuation, } >
THANK YOU

Preet Kanwal
Department of Computer Science & Engineering
[email protected]
Compiler Design

Preet Kanwal
Department of Computer Science & Engineering

Teaching Assistant : Sanket N D
Compiler Design

Unit 2
Parsers and Error Recovery
Strategies

Preet Kanwal
Department of Computer Science &
Engineering
Compiler Design
Lecture
Overview

In this lecture, you will learn about -

• What are parsers?


• Role of a Parser
• Error Recovery Strategies in Parsers
• Overview of Types of Parsers
• Top-down and Bottom-Up
Parsers
Compiler Design
What is a
Parser?

● Parser, also called a Syntax Analyzer, is a program which
takes as input a grammar G and a string w, and produces a
parse tree for w if the string w belongs to the language of
grammar G, else throws an error.

● Note - Input is read from left to right.


Compiler Design
Role of a Parser

The Role of a parser is-


● To validate the syntax of the program.
● To construct a parse tree and pass it to the
semantic analyzer in the compiler.
● To store information in the symbol table about tokens received from
the lexer.
● Report syntax errors.
Compiler Design
Properties of Error
Messages
The compiler must report errors by generating messages of the following
format -
● The message must pinpoint the error - for example, the line number of
the statement containing the error.
● The message should be specific and must localize the problem.
○ Eg: Undeclared variable in function foo() at line 29.
● The message should be clear and understandable.
● Messages should not be redundant. For example,
if an undeclared variable is referenced 20 times, the error
should be displayed only once.
Compiler Design
Types of Errors

Common programming errors can occur at many different levels.

• Lexical errors include misspellings of identifiers, keywords, or operators — e.g., the use of
an identifier elipseSize instead of ellipseSize — and missing quotes around text intended as
a string.

• Syntactic errors include misplaced semicolons or extra or missing braces, that is, "{" or "}".
As another example, in C or Java, the appearance of a case statement without an enclosing
switch is a syntactic error (however, this situation is usually allowed by the parser and
caught later in the processing, as the compiler attempts to generate code).

• Semantic errors include type mismatches between operators and operands. An example is
a return statement in a Java method with result type void.

• Logical errors can be anything from incorrect reasoning on the part of the programmer to
the use in a C program of the assignment operator = instead of the comparison operator ==.
The program containing = may be well formed; however, it may not reflect the
programmer's intent.
Compiler Design
Error Recovery
Strategies

Error Recovery
Strategies
Compiler Design
Error Recovery in
Parsers

● There is no universally accepted strategy for error recovery in Parsers.


● Some simple strategies would be to quit with an informative error
message (e.g. Python), or gather additional errors before quitting (e.g. C),
presuming it is possible to restore the parser state so that it can
continue processing.
● There are four commonly used error recovery strategies -
1. Panic Mode Recovery
2. Phrase level Recovery
3. Error Productions
4. Global Corrections
Compiler Design
Error Recovery
Strategies

1. Panic Mode Recovery


● The parser will discard the input symbols
until it finds a delimiter such as a semicolon or } and
then restarts the parsing process.
● Such delimiters are called synchronizing tokens.
● Consider the following expression -
(1 + + 2) + 3
In this case, we skip ahead to the next integer
and continue parsing.
● It is a very simple method and will not go into an infinite loop.
Compiler Design
Error Recovery
Strategies

2. Phrase level recovery
● This involves performing local corrections on the input
program, in case it contains errors.
● For example,
○ missing semicolon ; - insert it
○ missing closing bracket ) - insert it
● The choice of the correction is left to the compiler designer.
However, it needs to be done carefully.
○ Incorrectly inserting characters could cause the program
to perform erroneously.
Compiler Design
Error Recovery
Strategies

3. Error productions
● The idea here is to specify in the grammar known common
mistakes.
For example,
○ S → if (cond) {S} else {S}

is grammar for valid if-else

code.
○ S → fi (cond) {S} else {S}
can be written as an error production, with action being
a more informative message about the error to the user.
Compiler Design
Error Recovery
Strategies

4. Global Correction
● Extremely hard to implement
● Takes the grammar and the input program (with errors) as
input, and passes them to the algorithm to find a
program(string) y that is closest correct program to the
original program x.
● This has minimal changes done to x.
● Disadvantages
○ The closest correct program y may not exactly be what
the programmer intended to write.
○ It slows down the parsing of correct programs and is
expensive.
Compiler Design
Error Recovery
Strategies

Types of
Parsers
Compiler Design
Types of Parsers - An
Overview

Parsers
- Top-down
    - RDP
        - With backtracking
        - Without backtracking
    - Predictive Parser / LL(1)
- Universal (CYK)
- Bottom-up
    - Shift-reduce parsing
    - Table-driven LR Parsers
        - LR(0), SLR(1), LALR(1), CLR(1)
Compiler Design
Types of Parsers

● There are mainly two types of parsers - top-down parsers


and bottom-up parsers.
● A Top-down Parser generates parse tree for the given input
string with the help of grammar productions by expanding
the non-terminals.
○ It starts from the start symbol and ends on the terminals.
○ It uses left most derivation.
● A Bottom-up Parser generates the parse tree for the given
input string with the help of grammar productions by
compressing the non-terminals
○ It starts from terminals and ends on the start symbol.
○ It uses the reverse of the rightmost derivation.
THANK
YOU
Preet Kanwal
Department of Computer Science & Engineering
[email protected]
Compiler Design

Preet Kanwal
Department of Computer Science & Engineering

Teaching Assistant : Sanket N D
Compiler Design

Unit 2 - Types of
Parsers

Preet Kanwal
Department of Computer Science &
Engineering
Compiler Design
Lecture
Overview

In this lecture, you will learn about


-

• Overview of Types of Parsers


• Top-down Parsers
• Bottom-Up Parsers
Compiler Design
Types of Parsers - An
Overview

Parsers
- Top-down
    - With backtracking: RDP
    - Without backtracking: Predictive Parser / LL(1)
- Universal (CYK)
- Bottom-up
    - Shift-reduce parsing
    - Table-driven LR Parsers
        - LR(0), SLR(1), LALR(1), CLR(1)
Compiler Design
Types of Parsers

● There are two types of parsers - top-down parsers and


bottom-up parsers.
● A Top-down Parser generates parse tree for the given input
string with the help of grammar productions by
expanding the non-terminals.
○ It starts from the start symbol and ends on the terminals.
○ It uses left most derivation.
● A Bottom-up Parser generates the parse tree for the given
input string with the help of grammar productions by
compressing the non-terminals
○ It starts from the terminals and ends on the start
symbol.
○ It uses the reverse of the rightmost derivation.
Compiler Design
Top Down
Parsers
● Top-down parsing involves constructing a parse
tree from the start symbol and attempting to
transform the start symbol to the input.
● Top-down parsing is based on the Leftmost Derivation
of the input string.
● There are two types of top-down parsing techniques:
○ With Backtracking
○ Without Backtracking
● Backtracking - repeated scan of input symbols.

[Figure: Top-down parsers split into "with backtracking" (RDP) and
"without backtracking" (Predictive Parser / LL(1)).]
Compiler Design
Top Down Parsing with Backtracking -
RDP

● Top-down parsers with backtracking can be


implemented by using procedures.
● This implementation design is called Recursive Descent Parsing (RDP).
● In RDP, procedures are written for each non terminal in the

grammar. While performing RDP with backtracking -


1. Try different alternatives of a nonterminal in the way it is listed in
the grammar.
2. If at least one alternative matches, the input string can be parsed.
Compiler Design
Top-down Parsing without
backtracking

● There are two implementations of TDP without


backtracking-
○ RDP without backtracking
○ Table-driven approach, also known as Predictive Parser
or LL(1) parser.
● In order for a top-down parser to work without
backtracking, we must -
○ Left factor the grammar
○ Eliminate Left recursion
● The two operations need not be done in
any particular sequence.
Compiler Design
RDP without backtracking

● To deal with the drawbacks of RDP with backtracking, we


introduce a few modifications to apply to grammars to
make it compatible for TDP without backtracking.
● The grammar is modified such that
○ It is Left factored
○ Left recursion is eliminated
● Procedures are written for each non-terminal in RDP.
Compiler Design
Predictive Parsers or
LL(1)

● Predictive Parser is a table driven top-down parser without


backtracking.
● Class of grammars for which a table driven parser can be constructed is
LL(k).
● A predictive parser uses a lookahead pointer which points to the next
input symbols.
● It uses a stack and a parsing table to parse the input and produce a
parse tree.
● Both the stack and the input contain an end symbol $ to denote that
the stack is empty and the input is consumed.
Compiler Design
Bottom Up
Parsers

● Bottom-up parsers read input from left-to-right and provide


the rightmost derivation in reverse.
● It generates the parse tree for the given input string with
the help of grammar productions by compressing
the non-terminals.
● Bottom-up Parsing is the process of reducing a string “w” to
the start symbol “S”.
● It makes use of stacks.
● There are two broad categories of bottom-up parsers -
○ Shift reduce parsers
○ Table driven Parsers
Compiler Design
Shift Reduce
Parsers

Shift reduce parsers make use of a table consisting of three


columns:
● Stack (status)
● Input buffer (status)
● Action (Shift, reduce, error, accept)
● Note: Whatever occurs on the stack is known as viable prefix.
Compiler Design
LR(0)

● LR(0) parser is one of the four table-


driven bottom-up parsers.
○ LR(0) - The ‘L’ in ‘LR’ parsers mean
that the input is scanned from Left-to-right.
○ LR(0) - The output of a bottom-up parser is the Rightmost
derivation of the input string in Reverse.
○ LR(0) - The ‘0’ indicates that the parser
does not look-ahead into the input buffer before
making decisions.
● These parsers are very weak and are almost never used.
● The DFA of an LR(0) parser is known
as LR(0) automata because it uses LR(0) items.
Compiler Design
SLR(1)

● SLR(1) parser is one of the four table-


driven bottom-up parsers. It is also known as SLR parser.
● The ‘S’ stands for Simple.
● These parsers are more powerful than LR(0) parsers.
● SLR uses look-ahead that is not specific and so, even though
it does better than LR(0) by looking ahead, it is not
powerful enough as it generalises the look-aheads
instead of looking into particular derivation/productions.
● The DFA of an SLR parser is also known as LR(0) automata
because it uses LR(0) items.
● These parsers have the same number of states as an LR(0)
parser.
Compiler Design
CLR(1)

● CLR(1) parser is one of the four table-driven bottom-up


parsers. It is also known as CLR parser or LR(1) parser or
simply LR parser.
● The ‘C’ stands for Canonical.
● These parsers are more powerful than LR(0) and SLR parsers
as they accept a set of grammars that is a superset of
those grammars that are accepted by LR(0) and SLR(1).
● CLR uses look-ahead that is highly production/derivation
specific and so it catches errors almost immediately
without making any more reductions to be sure, unlike
LR(0) and SLR.
● The DFA of an CLR parser is also known as LR(1) automata
because it uses LR(1) items.
● These parsers may have more number of states than an
LR(0) parser but never lesser. [nCLR ≥ nLR(0)]
Compiler Design
LALR(1)

● LALR(1) parser is one of the four table-driven bottom-up


parsers. It is also known as LALR parser.
● The ‘LA’ stands for Look-Ahead.
● These parsers are more powerful than LR(0) and SLR parsers
but less powerful than CLR parsers.
● Yacc is an LALR type parser.
● LALR uses look-ahead that is highly production/derivation
specific. It uses the CLR parsing table and merges states
based on a condition.
● The DFA of an LALR parser is also known as LR(1) automata
because it uses LR(1) items.
● These parsers may have the same number of states as an
LR(0) parser
[nLR(0) = nSLR = nLALR] ≤ nCLR
THANK
YOU
Preet Kanwal
Department of Computer Science & Engineering
[email protected]
Compiler Design

Preet Kanwal
Department of Computer Science & Engineering

Teaching Assistant : Sanket N D
Compiler Design

Unit 2
Recursive Descent Parsers
with Backtracking

Preet Kanwal
Department of Computer Science &
Engineering
Compiler Design
Recap - Top Down Parsers

● Top-down parsing involves constructing a parse
tree from the start symbol and attempting to
transform the start symbol to the input.
● Top-down parsing is based on the Leftmost
Derivation of the input string.
● There are two types of top-down parsing
techniques:
○ With Backtracking
○ Without Backtracking

[Figure: Top-down parsers split into "with backtracking" (RDP) and
"without backtracking" (Predictive Parser / LL(1)).]
Compiler Design
Top Down Parsing with Backtracking

● Top-down parsers with backtracking can


be implemented by using procedures.
● This implementation design is called
Recursive Descent Parsing (RDP).
● In RDP, procedures are written for each non terminal in
the grammar.
Compiler Design
Recursive Descent Parsing with
backtracking
For example, the pseudocode below shows procedures written
for the non-terminals present in the following grammar -
S → cAd
A → ab | a

S() {
    if(inputSymbol++ == 'c') {
        if(A()) {
            if(inputSymbol++ == 'd') {
                return true;
            }
        }
    }
    return false;
}

A() {
    isave = inputPointer;       /* save the current input position */
    if(inputSymbol++ == 'a') {
        if(inputSymbol++ == 'b') {
            return true;
        }
    }
    inputPointer = isave;       /* backtrack to where A started */
    if(inputSymbol++ == 'a') {
        return true;
    }
    return false;
}
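The pseudocode becomes a runnable C program once the input pointer is made explicit. The sketch below (names like input and pos are chosen here for illustration) parses S → cAd, A → ab | a and accepts only when the whole input is consumed:

#include <stdio.h>
#include <stdbool.h>

static const char *input;   /* string being parsed */
static int pos;             /* current input position */

static bool A(void) {
    int save = pos;                       /* remember where A started */
    if (input[pos] == 'a') {              /* first alternative: A -> ab */
        pos++;
        if (input[pos] == 'b') { pos++; return true; }
    }
    pos = save;                           /* backtrack */
    if (input[pos] == 'a') {              /* second alternative: A -> a */
        pos++;
        return true;
    }
    return false;
}

static bool S(void) {                     /* S -> cAd */
    if (input[pos] == 'c') {
        pos++;
        if (A() && input[pos] == 'd') { pos++; return true; }
    }
    return false;
}

int main(void) {
    input = "cad";
    pos = 0;
    if (S() && input[pos] == '\0')        /* accept only if all input is consumed */
        printf("string accepted\n");
    else
        printf("string rejected\n");
    return 0;
}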
Compiler Design
Recursive Descent Parsing with
backtracking

Points to keep in mind while performing RDP with backtracking:
1. Try the alternatives of a non-terminal in the order in which they are listed in the grammar.
2. If at least one alternative matches, the input string can be parsed.
Compiler Design
RDP With backtracking - Example 1

Consider the earlier grammar:
S → cAd
A → ab | a

Let the input string be w = ‘cad’.
Compiler Design
RDP With backtracking - Example 1

● We will start with S, which is the start


symbol for the grammar.
● The first production of S, (S -> cAd)
matches with the leftmost letter of input c.
● So we consider this production and continue.
● The input letter now advances to a.
Compiler Design
RDP With backtracking - Example 1

● Now, we need to expand the non-terminal A.
● The input letter is a. The first production of A is A → ab, which matches the current input letter a.
● However, the next input letter is d, which does not match the b required by the production A → ab.
● So, we backtrack to check the next production.
Compiler Design
RDP With backtracking - Example 1

● The next production is A → a. It matches the input letter a, so we accept this production and expand A.
● After expansion, the next input letter d matches the remaining part of the first production.
● We see that all the input letters are matched correctly and in order.
● Hence, the string is accepted by the parser.
Compiler Design
RDP With backtracking - Example 2

Consider the following grammar:
S → rXd | rZd
X → oa | ea
Z → ai

Let the input string be w = ‘read’.
Compiler Design
RDP With backtracking - Example 2

● We will start with S, which is the start


symbol for the grammar.
● The first production of S, (S -> rXd)
matches with the left-most letter of input r.
● So we consider this production and continue.
● The input letter now advances to e
Compiler Design
RDP With backtracking - Example 2

● Now, we need to expand the non-terminal X.


● The input letter is e. The first production of X is X -> oa. This
doesn’t match with the current input letter e. So,
we backtrack and check the next production X ->
ea. This matches with the input letter e, so we accept
this production and expand X.
● After expansion, we see that all the input letters are
matched correctly and are also in order. Hence, the
string is accepted by the parser.
Compiler Design
Drawbacks of RDP with backtracking

1. Parser program may not accept the input even if the input
string belonged to the grammar because of the order of
productions.
2. RDP with backtracking cannot work with left recursive
grammars because it would cause the program to enter an
infinite loop.
3. Reversing semantic actions during the parsing of a string
using RDP with backtracking is an overhead.
Compiler Design
Drawbacks of RDP with backtracking

1. Parser may not accept a valid input


● Alternative productions are tried out in the order in which they are listed in the grammar.
● This could sometimes cause the parser program to throw an error even if the input string belonged to the grammar.
● Example - consider the grammar from Example 1:
S → cAd
A → ab | a
○ Suppose the second production were changed to A → a | ab.
○ How would the input string cabd be parsed?
Compiler Design
Drawbacks of RDP with backtracking

Given -
S → cAd
A → a | ab
○ Input string - cabd

● The first production of S, S → cAd, matches the leftmost input letter c.
● The input letter now advances to a.
● Expand the non-terminal A. The first production of A is A → a.
● This matches the current input letter a.
Compiler Design
Drawbacks of RDP with backtracking

● The next input letter is b.
● This does not match the d expected by the remaining part of the first production S → cAd.
● So, we backtrack to check the next production of A.
● The next production is A → ab. This matches the input letter a, so we accept this production and expand A.
● The next input letter b matches the remaining part of the production.
● The next input letter d matches the remaining part of the first production.
● We see that all the input letters are matched correctly and in order. Hence, the string is accepted by the parser.
Compiler Design
Drawbacks of RDP with backtracking

2. Cannot work with Left Recursive Grammars
● Left-recursive grammars would cause the program to enter an infinite loop.
● Example -
○ Consider the following grammar rule: A → Aa | b
○ Procedure A() would roughly have the following pseudocode:

A() {
    if (A()) {   // A calls itself before consuming any input symbol
        ...
    }
}

● Since A() recurses without consuming any input, this causes an infinite loop.
● Thus, RDP with backtracking cannot work with left-recursive grammars.
Compiler Design
Drawbacks of RDP with backtracking

3. Overhead
● RDP with backtracking involves a series
of erroneous expansions and corresponding semantic
actions.
● Reversing semantic actions during the
parsing of a string causes substantial
overhead.
Compiler Design
Question - Working of RDP

● Consider the language defined by the grammar S → aSa | aa, which ideally accepts L(G) = { a^(2n) | n ≥ 1 }.
● Show the working of RDP for the following input strings:
○ aa
○ aaaa
○ aaaaaa
○ aaaaaaaa
● All of these strings belong to L(G).
Compiler Design
Solution - aa
Compiler Design
Solution - aaaa
Compiler Design
Solution - aaaaaa
Compiler Design
Solution - aaaaaaaa
Compiler Design
Question - Working of RDP

● The working above shows that the string aaaaaa cannot be parsed using RDP with backtracking.

Input String | RDP with Backtracking | Language
a^2          | Accepts               | Accepts
a^4          | Accepts               | Accepts
a^6          | Not accepted          | Accepts
a^8          | Accepts               | Accepts

● It is evident that only strings of the form a^(2^n), n ≥ 1, are accepted by the RDP with backtracking.
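A minimal C sketch (same assumed style as the earlier one, not from the slides) reproduces this behaviour: once the procedure for S succeeds with one alternative, its caller cannot make it retry the other, so a^6 is rejected even though it belongs to L(G).

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: backtracking RDP for S -> aSa | aa. */
static const char *input;
static int pos;

static bool S(void) {
    int save = pos;
    if (input[pos] == 'a') {                         /* try S -> aSa */
        pos++;
        if (S() && input[pos] == 'a') { pos++; return true; }
    }
    pos = save;                                      /* backtrack, try S -> aa */
    if (input[pos] == 'a' && input[pos + 1] == 'a') { pos += 2; return true; }
    return false;
}

int main(void) {
    const char *tests[] = { "aa", "aaaa", "aaaaaa", "aaaaaaaa" };
    for (int i = 0; i < 4; i++) {
        input = tests[i]; pos = 0;
        bool ok = S() && input[pos] == '\0';         /* must consume everything */
        printf("a^%zu: %s\n", strlen(tests[i]), ok ? "accepted" : "rejected");
    }
    return 0;
}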
Compiler Design

Unit 2
Recursive Descent Parsing
without Backtracking

Compiler Design
Recap - Top Down
Parsers

● Top down parsing involves constructing a parse


tree from the start symbol and attempting to
transform the start symbol to the input.
Top-down
● Top down parsing is based on Left Most
Derivation of input string.
● There are two types of top-down parsing
With Without
techniques: backtrackin backtrackin
○ With Backtracking g g
○ Without
Backtracking RDP Predictive
■ RDP Parser/LL(1
■ Table driven/Predictive )
parser
Compiler Design
Top-down Parsing without
backtracking

● There are two implementations of TDP without backtracking - RDP and the table-driven approach, also known as the Predictive Parser or LL(1) parser.
● To deal with the drawbacks of TDP with backtracking, we introduce a few modifications to apply to grammars to make them compatible with TDP without backtracking.
Compiler Design
Top-down Parsing without
backtracking

● To deal with the drawbacks of TDP with backtracking, a few modifications are applied to grammars to make them compatible with TDP without backtracking.
● This involves two steps -
○ Left factoring of the grammar
○ Eliminating Left recursion
● The two operations need not be done in
any particular sequence.
● Left factoring can be done before or
after eliminating left recursion.
● Eliminating left recursion once is sufficient.
Compiler Design
Top-down Parsing without
backtracking

1. Left factoring of grammar
○ Factor out the common prefix present in the grammar, i.e., scan the common part only once.
○ In general, A → αβ1 | αβ2 becomes A → αA', A' → β1 | β2.
○ The following example shows a grammar before and after left factoring is done.

Before left factoring:
S → iCtSeS | iCtS | a
C → b

On left factoring:
S → iCtSX | a
X → eS | λ
C → b
Compiler Design
Top-down Parsing without
backtracking
2. Elimination of Left Recursion
A left-recursive grammar is typically of the form A → Aα, where α ∈ (V ∪ T)* [V = variables, T = terminals].
● Left recursion is eliminated to prevent top-down parsers from entering an infinite loop.
● Left recursion can be:
○ Immediate - for example, A → Aa | b
○ Indirect - the recursion occurs in two or more steps, for example:
S → Aab
A → Sd | λ
Compiler Design
How to eliminate Left
Recursion?

In case of indirect left recursion, whenever there are multiple non-terminals in the grammar, specify an order of non-terminals, which is usually the order in which they are presented.
● Substituting earlier non-terminals into later productions converts indirect left recursion into immediate left recursion.
● Example -
○ Given -
S → Aab
A → Sd | λ
○ Here, S is replaced by Aab in the second production:
S → Aab
A → Aabd | λ
Compiler Design
Algorithm to eliminate Left
Recursion?

Algorithm to eliminate left recursion -
● Input: Grammar with no cycles or λ-productions
● Output: Grammar without left recursion
1. Arrange the non-terminals in some order (usually the order in which they are presented).
2. Replace each group of productions of the form
A → Aα1 | Aα2 | … | Aαr | β1 | β2 | … | βs
with
A → β1A' | β2A' | … | βsA'
A' → α1A' | α2A' | … | αrA' | λ
Compiler Design
Eliminating Left Recursion - Example
1

Consider the following grammar:
A → Bxy | x
B → CD
C → A | c
D → d

Eliminate left recursion.
Compiler Design
Eliminating Left Recursion - Example
1

Given -
A → Bxy | x
B → CD
C → A | c
D → d

Step 1 - Replace indirect left recursion:
A → Bxy | x
B → CD
C → Bxy | x | c  =>  C → CDxy | c | x
D → d
Compiler Design
Eliminating Left Recursion - Example
1
A → Bxy | x
B → CD
C → CDxy | c | x
D → d

Step 2 - Eliminate immediate left recursion:
A → Bxy | x
B → CD
C → cC' | xC'
C' → DxyC' | λ
D → d
Compiler Design
Eliminating Left Recursion - Example
2

Consider the following grammar:
S → aAcB
A → Ab | b | bc
B → d

Eliminate left recursion.
Compiler Design
Eliminating Left Recursion - Example
2

Consider the following grammar:
S → aAcB
A → Ab | b | bc
B → d

Answer -
S → aAcB
A → bA' | bcA'
A' → bA' | λ
B → d
Compiler Design
Eliminating Left Recursion -
Exercise

Eliminate left recursion in the following grammar:
E → E+T | T
T → T*F | F
F → id | num | (E)
Compiler Design
Eliminating Left Recursion - Exercise
Solution
Given -
E → E+T | T
T → T*F | F
F → id | num | (E)

Eliminate the immediate left recursion of the E and T productions using the algorithm:
E → TE'
E' → +TE' | λ
T → FT'
T' → *FT' | λ
F → id | num | (E)
Compiler Design
RDP without backtracking - Example
1
Consider the earlier grammar for parsing
arithmetic expressions -
E → T E'
E' → + T E' | λ
T → F T'
T' → * F T' | λ
F → id | num | (E)

Notice that the grammar is left factored, and left-recursion is


removed.
The RDP implementation for this grammar is
present at - RDPwithoutBacktracking.c
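That file is not reproduced here, but a minimal C sketch of such a parser (an illustration under assumptions - single-character tokens, with i standing for id and n for num - not the RDPwithoutBacktracking.c referenced above) could be:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sketch: recursive-descent parser (no backtracking) for
   E -> T E' ; E' -> + T E' | λ ; T -> F T' ; T' -> * F T' | λ ;
   F -> id | num | ( E ).  Each token is one character. */

static const char *input;
static int pos;
static bool error;

static void E(void);

static void F(void) {
    if (input[pos] == 'i' || input[pos] == 'n') pos++;   /* F -> id | num */
    else if (input[pos] == '(') {                        /* F -> ( E ) */
        pos++; E();
        if (input[pos] == ')') pos++; else error = true;
    } else error = true;
}

static void Tp(void) {                                   /* T' */
    if (input[pos] == '*') { pos++; F(); Tp(); }         /* T' -> * F T' */
    /* else T' -> λ: consume nothing */
}

static void T(void) { F(); Tp(); }                       /* T -> F T' */

static void Ep(void) {                                   /* E' */
    if (input[pos] == '+') { pos++; T(); Ep(); }         /* E' -> + T E' */
    /* else E' -> λ */
}

static void E(void) { T(); Ep(); }                       /* E -> T E' */

int main(void) {
    input = "i+i*i";                                     /* id + id * id */
    pos = 0; error = false;
    E();
    bool ok = !error && input[pos] == '\0';
    printf("%s\n", ok ? "accepted" : "rejected");
    return 0;
}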
Compiler Design
RDP without backtracking -
Example

Consider the following grammar:
S → iCtSX | a
X → eS | λ
C → c

Notice that the grammar is left factored, and left recursion is removed.
Compiler Design
RDP without backtracking

● It is a top-down parser that implements a set of recursive procedures to process the input without backtracking.
● A recursive procedure is written for each non-terminal in the grammar.
● The recursive procedures are easy to write and adequately efficient if written in a language that implements procedure calls efficiently.
Compiler Design

Unit 2
Predictive Parsers
(LL(1))

Compiler Design
Lecture
Overview

In this lecture, you will learn about -

• What are Predictive Parsers?


• Model of a Table-driven Parser
• Computing First Sets
• Computing Follow Sets
• What is an LL(1) Parsing Table?
• Construction of an LL(1) Parsing
Table
Compiler Design
What are Predictive
Parsers?

● Predictive Parser is a table driven top-down parser without


backtracking.
● Class of grammars for which a table driven
parser can be constructed is LL(k).
● A predictive parser uses a lookahead pointer which points
to the next input symbols.
● It uses a stack and a parsing table to
parse the input and produce a parse tree.
● Both the stack and the input contain an end symbol $ to denote that the stack is empty and the input is consumed.
Compiler Design
Predictive
Parsers

● A table-driven parser can be constructed for the class of grammars LL(k), where
○ The first L indicates that the input is parsed from left to right.
○ The second L indicates that the output of the parser, which is a parse tree, is the leftmost derivation of the input string.
○ k denotes the number of symbols used for lookahead.
Compiler Design
Model of a table driven
Parser
Compiler Design
Computing First
sets
● First(X) of a non-terminal X is the set of symbols that can appear as the first symbol in a string that derives from X.
● Rules to compute First sets:
1. First(t) = { t }, where t is a terminal
2. If X → λ, then add λ to First(X)
3. If X → Y1 Y2 … Yn:
○ First(X) = First(Y1)
○ If λ ∈ First(Y1), then First(X) = { First(Y1) - λ } ∪ First(Y2)
○ If λ ∈ First(Yi) for all 1 ≤ i ≤ n, then add λ to First(X)
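These rules can be run to a fixed point mechanically. The following is a minimal C sketch (an assumed encoding, not from the slides; X = E', Y = T', i = id, n = num, and '#' stands for λ) that computes the First sets of the expression grammar used below:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: FIRST sets by fixed-point iteration.  Upper-case
   letters are non-terminals; other characters are terminals; '#' = λ. */

struct Prod { char lhs; const char *rhs; };
static struct Prod g[] = {
    {'E', "TX"}, {'X', "+TX"}, {'X', "#"},
    {'T', "FY"}, {'Y', "*FY"}, {'Y', "#"},
    {'F', "i"},  {'F', "n"},   {'F', "(E)"},
};

static bool isNT(char c) { return c >= 'A' && c <= 'Z'; }

/* Add c to a set stored as a string; report whether it was new. */
static bool addSym(char *set, char c) {
    if (strchr(set, c)) return false;
    size_t n = strlen(set); set[n] = c; set[n + 1] = '\0'; return true;
}

int main(void) {
    static char first[128][32];
    bool changed = true;
    while (changed) {                        /* iterate until nothing changes */
        changed = false;
        for (size_t p = 0; p < sizeof g / sizeof g[0]; p++) {
            char *F = first[(int)g[p].lhs];
            const char *r = g[p].rhs;
            bool nullable = true;            /* prefix scanned so far derives λ? */
            for (size_t i = 0; r[i] && nullable; i++) {
                if (!isNT(r[i])) {           /* rule 1: a terminal (or '#') */
                    changed |= addSym(F, r[i]);
                    nullable = false;
                } else {                     /* rule 3: copy First(Yi) minus λ */
                    nullable = false;
                    for (const char *t = first[(int)r[i]]; *t; t++) {
                        if (*t == '#') { nullable = true; continue; }
                        changed |= addSym(F, *t);
                    }
                }
            }
            if (nullable) changed |= addSym(F, '#');    /* rule 2 */
        }
    }
    for (const char *c = "EXTYF"; *c; c++)
        printf("First(%c) = { %s }\n", *c, first[(int)*c]);
    return 0;
}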
Compiler Design
Computing First set -
Example 1

● Consider the following grammar:
E → TE'
E' → +TE' | λ
T → FT'
T' → *FT' | λ
F → id | num | (E)

Find the First set of the non-terminal T.
Compiler Design
Computing First set -
Example 1

● Production of T: T → FT'
● According to Rule 3, First(T) = First(F)
● Consider all productions of F:
F → id
F → num
F → (E)
● According to Rule 1, First(F) = { id, num, ( }
● Since F does not have any λ-productions, we stop here.
Since First(T) = First(F),
First(T) = { id, num, ( }
Compiler Design
Follow Sets

● Consider parsing the input ab with the following grammar:
A → aBb
B → c | λ
● When the input symbol is b, the non-terminal to derive is B. There is no production of B that starts with b, so the parser is now stuck.
● However, if the λ-production of B, B → λ, is applied, B vanishes and the parser parses ab correctly.
● A non-terminal should vanish if the symbol that follows the non-terminal in the input is the same as a symbol that can follow it in some sentential form.
Compiler Design
Follow Sets

● Thus, when there are nullable non-terminals in a grammar, the Follow set of each nullable non-terminal must be calculated so that its λ-production can be applied during parsing.
● Follow(X) is the set of terminals that can appear immediately to the right of the non-terminal X in some sentential form.
Compiler Design
Computing Follow
Sets

● Rules to compute Follow sets, where α, β ∈ (V ∪ T)* [V = variables (non-terminals), T = terminals]:
1. Follow(S) = { $ }, where S is the start symbol.
2. If A → αBβ, then Follow(B) includes First(β) - λ; additionally, if λ ∈ First(β), then Follow(B) includes Follow(A).
3. If A → αB, then Follow(B) includes Follow(A).
Note: a Follow set should not contain λ.
Compiler Design
Follow Sets - Example
1

● Consider the following grammar:
E → TE'
E' → +TE' | λ
T → FT'
T' → *FT' | λ
F → num | id | (E)
● Compute the Follow sets of all non-terminals in the grammar, for each production.
Compiler Design
Follow Sets - Example 1 -
Solution

E → TE':  Follow(E) = { $ } (start symbol); Follow(T) ⊇ { + } ∪ Follow(E); Follow(E') ⊇ Follow(E)
E' → +TE' | λ:  Follow(T) ⊇ { + } ∪ Follow(E'); Follow(E') ⊇ Follow(E')
T → FT':  Follow(F) ⊇ { * } ∪ Follow(T); Follow(T') ⊇ Follow(T)
T' → *FT' | λ:  Follow(F) ⊇ { * } ∪ Follow(T'); Follow(T') ⊇ Follow(T')
F → (E):  Follow(E) ⊇ { ) }

Solving these constraints:
Follow(E) = Follow(E') = { ), $ }
Follow(T) = Follow(T') = { +, ), $ }
Follow(F) = { *, +, ), $ }
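The Follow rules can likewise be run to a fixed point. A minimal C sketch (an assumed encoding, not from the slides; it hard-codes the First sets from the earlier sketch, with X = E', Y = T', i = id, n = num, '#' = λ):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: FOLLOW sets by fixed-point iteration for the
   expression grammar.  First sets are taken as already computed. */

struct Prod { char lhs; const char *rhs; };
static struct Prod g[] = {
    {'E', "TX"}, {'X', "+TX"}, {'X', "#"},
    {'T', "FY"}, {'Y', "*FY"}, {'Y', "#"},
    {'F', "i"},  {'F', "n"},   {'F', "(E)"},
};

static const char *firstOf(char s) {        /* results of the First sketch */
    switch (s) {
        case 'E': case 'T': case 'F': return "in(";
        case 'X': return "+#";
        case 'Y': return "*#";
        default:  return NULL;               /* terminals handled inline */
    }
}

static bool isNT(char c) { return strchr("EXTYF", c) != NULL; }

static bool addSym(char *set, char c) {
    if (strchr(set, c)) return false;
    size_t n = strlen(set); set[n] = c; set[n + 1] = '\0'; return true;
}

int main(void) {
    static char follow[128][32];
    addSym(follow['E'], '$');                /* rule 1: start symbol gets $ */
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t p = 0; p < sizeof g / sizeof g[0]; p++) {
            const char *r = g[p].rhs;
            size_t len = strlen(r);
            for (size_t i = 0; i < len; i++) {
                if (!isNT(r[i])) continue;
                char B = r[i];
                bool nullableTail = true;    /* does everything after B derive λ? */
                for (size_t j = i + 1; j < len && nullableTail; j++) {
                    char s = r[j];
                    if (!isNT(s)) {          /* rule 2: terminal in First(beta) */
                        if (s != '#') { changed |= addSym(follow[(int)B], s); nullableTail = false; }
                    } else {
                        nullableTail = false;
                        for (const char *t = firstOf(s); *t; t++) {
                            if (*t == '#') { nullableTail = true; continue; }
                            changed |= addSym(follow[(int)B], *t);
                        }
                    }
                }
                if (nullableTail)            /* rules 2 and 3: add Follow(A) */
                    for (const char *t = follow[(int)g[p].lhs]; *t; t++)
                        changed |= addSym(follow[(int)B], *t);
            }
        }
    }
    for (const char *c = "EXTYF"; *c; c++)
        printf("Follow(%c) = { %s }\n", *c, follow[(int)*c]);
    return 0;
}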
Compiler Design
Steps to construct LL(1) Parsing
Table

● Eliminate left recursion (if present) and left factor the grammar (if required).
● Calculate the First() and Follow() sets for all non-terminals.
● Draw a table whose first column contains the non-terminals and whose first row contains the terminal symbols.
● All the null (λ) productions of the grammar go under the elements of Follow(LHS).
○ Reason - a null production must be used only when the parser knows that the character that follows the LHS in the production rule is the same as the current input character.
Compiler Design
Steps to construct LL(1) Parsing
Table

For each production A –> α


1. For each terminal in First(α), make entry A –> α in the table.
2. If First(α) contains λ, then for each terminal in Follow(A) (i.e,
Follow(LHS) ) make an entry A –> α in the table.
3. If the First(α) contains λ and Follow(A)
contains $ as a terminal, then make an
entry A –> α in the table under $.
Compiler Design
Constructing LL(1) Parsing Table -
Example 1

Construct the parsing table for the earlier grammar:
E → TE'
E' → +TE' | λ
T → FT'
T' → *FT' | λ
F → num | id | (E)
Compiler Design
Constructing LL(1) Parsing Table -
Example 1

From the earlier examples, the First and Follow sets for the given grammar are as follows:

First Table:
First(E) = { id, num, ( }
First(E') = { +, λ }
First(T) = { id, num, ( }
First(T') = { *, λ }
First(F) = { id, num, ( }

Follow Table:
Follow(E) = { ), $ }
Follow(E') = { ), $ }
Follow(T) = { +, ), $ }
Follow(T') = { +, ), $ }
Follow(F) = { *, +, ), $ }
Compiler Design
Constructing LL(1) Parsing Table -
Example 1

Construct a table with terminals on the first row and non-terminals on the first column.

      | id | + | * | num | ( | ) | $
E     |    |   |   |     |   |   |
E'    |    |   |   |     |   |   |
T     |    |   |   |     |   |   |
T'    |    |   |   |     |   |   |
F     |    |   |   |     |   |   |
Compiler Design
Constructing LL(1) Parsing Table -
Example 1
For each production in the grammar, add entries to the appropriate locations.

E → TE': add E → TE' under each element in First(T) = { id, num, ( }.

      | id    | + | * | num   | (     | ) | $
E     | E→TE' |   |   | E→TE' | E→TE' |   |
Compiler Design
Constructing LL(1) Parsing Table -
Example 1

For each production in the grammar, add entries


to the appropriate locations.

id + * num ( ) $ E’ → +
T E’ Add E’ →
E E -> TE’ E -> TE’ E -> TE’ + T E’ under +

E’ E’->+TE’ E’ ->λ E’ ->λ E’ → λ


So, Add E’ → λ
T under all
elements in
T’
Follow(LHS), i.e,
F { ), $ }
Compiler Design
Constructing LL(1) Parsing Table -
Example 1

For each production in the grammar, add entries to the appropriate locations.

T → FT': check First(F) = { id, num, ( } and add T → FT' under each element.

      | id    | + | * | num   | (     | ) | $
T     | T→FT' |   |   | T→FT' | T→FT' |   |
Compiler Design
Constructing LL(1) Parsing Table -
Example 1
For each production in the grammar, add entries to the appropriate locations.

T' → *FT': add T' → *FT' under *.
T' → λ: add T' → λ under all elements of Follow(T') = { +, ), $ }.

      | id | +    | *       | num | ( | )    | $
T'    |    | T'→λ | T'→*FT' |     |   | T'→λ | T'→λ
Compiler Design
Constructing LL(1) Parsing Table -
Example 1
For each production in the grammar, add entries to the appropriate locations.

F → num | id | (E):
Add F → num under num.
Add F → id under id.
Add F → (E) under (.

      | id   | + | * | num   | (     | ) | $
F     | F→id |   |   | F→num | F→(E) |   |
Compiler Design
Constructing LL(1) Parsing Table -
Example 1

Thus, this is the final LL(1) parsing table for the given grammar:

      | id    | +       | *       | num    | (      | )     | $
E     | E→TE' |         |         | E→TE'  | E→TE'  |       |
E'    |       | E'→+TE' |         |        |        | E'→λ  | E'→λ
T     | T→FT' |         |         | T→FT'  | T→FT'  |       |
T'    |       | T'→λ    | T'→*FT' |        |        | T'→λ  | T'→λ
F     | F→id  |         |         | F→num  | F→(E)  |       |

Compiler Design
LL(1) Parsing Table -
Conclusions

● The LL(1) parsing table can now be used to parse an input string. This will be demonstrated in the next class.
● To check whether a grammar belongs to LL(1), check whether each cell in the LL(1) parsing table has at most 1 production.
● If a grammar is not left factored, or left recursion is not removed, it will not belong to LL(1).
● The First table will never contain $.
● The Follow table will never contain λ.
Compiler Design

Unit 2
Predictive Parsers
(LL(1))

Compiler Design
Lecture
Overview

In this lecture, you will learn about -

• Additional examples on LL(1) parsing


algorithm
• LL(k) Grammars
Compiler Design
LL(1) Parsing Algorithm - Example
5

Construct the LL(1) parsing table for the given grammar:
S → AB
A → a | λ
B → b | λ
Then, demonstrate parsing using the input string λ (the empty string).
Compiler Design
LL(1) Parsing Algorithm - Example
5

Given -
S → AB
A → a | λ
B → b | λ
No left recursion is present, and the grammar is left factored.
Compiler Design
LL(1) Parsing Algorithm - Example
5

Construct the First and Follow tables for the grammar:
S → AB
A → a | λ
B → b | λ

First(S) = { a, b, λ }; First(A) = { a, λ }; First(B) = { b, λ }
Follow(S) = { $ }; Follow(A) = { b, $ }; Follow(B) = { $ }
Compiler Design
LL(1) Parsing Algorithm - Example
5

Construct the LL(1) parsing table.

      | a    | b    | $
S     | S→AB | S→AB | S→AB
A     | A→a  | A→λ  | A→λ
B     |      | B→b  | B→λ
Compiler Design
LL(1) Parsing Algorithm - Example
5

Parse the input string λ:

Stack | Input Buffer | Action
S$    | $            | M[S,$] = S → AB
AB$   | $            | M[A,$] = A → λ
B$    | $            | M[B,$] = B → λ
$     | $            | Accept

Parsing successful.
Compiler Design
LL(1) Parsing Algorithm - Example
6

Is the given grammar LL(1)? Also, compute the First and Follow sets, and parse the input string cde.

S → ABCDE
A → a | λ
B → b | λ
C → c
D → d | λ
E → e | λ
Compiler Design
LL(1) Parsing Algorithm - Example
6
S → ABCDE
A → a | λ
B → b | λ
C → c
D → d | λ
E → e | λ

Compute the First and Follow sets:
First(S) = { a, b, c }; First(A) = { a, λ }; First(B) = { b, λ }; First(C) = { c }; First(D) = { d, λ }; First(E) = { e, λ }
Follow(S) = { $ }; Follow(A) = { b, c }; Follow(B) = { c }; Follow(C) = { d, e, $ }; Follow(D) = { e, $ }; Follow(E) = { $ }
Compiler Design
LL(1) Parsing Algorithm - Example
6

Construct the LL(1) parsing table.

      | a       | b       | c       | d   | e   | $
S     | S→ABCDE | S→ABCDE | S→ABCDE |     |     |
A     | A→a     | A→λ     | A→λ     |     |     |
B     |         | B→b     | B→λ     |     |     |
C     |         |         | C→c     |     |     |
D     |         |         |         | D→d | D→λ | D→λ
E     |         |         |         |     | E→e | E→λ

Each cell has at most one production, so the grammar is LL(1).
Compiler Design
LL(1) Parsing Algorithm - Example
6

Parse the input string cde:

Stack  | Input Buffer | Action
S$     | cde$         | M[S,c] = S → ABCDE
ABCDE$ | cde$         | M[A,c] = A → λ
BCDE$  | cde$         | M[B,c] = B → λ
CDE$   | cde$         | M[C,c] = C → c
cDE$   | cde$         | match c
DE$    | de$          | M[D,d] = D → d
dE$    | de$          | match d
E$     | e$           | M[E,e] = E → e
e$     | e$           | match e
$      | $            | Accept

Parsing successful.
Compiler Design
Identify whether a grammar belongs to LL(k) -
Examples

Are the following grammars in LL(k)? If yes, find k.
1. S → Abbx | Bbby
   A → x
   B → x
2. S → Z
   Z → aMa | bMb | aRb | bRa
   M → c
   R → c
Compiler Design
Identify whether a grammar belongs to LL(k) -
Solution

Are the following grammars in LL(k)? If yes, find k.
1. S → Abbx | Bbby
   A → x
   B → x
Answer - Yes. It belongs to LL(4): the sentences xbbx and xbby differ only in their 4th symbol.
2. S → Z
   Z → aMa | bMb | aRb | bRa
   M → c
   R → c
Answer - Yes. It belongs to LL(3): for example, aca and acb differ only in their 3rd symbol.
Compiler Design

Unit 2
Predictive Parsers
(LL(1))

Compiler Design
Lecture
Overview

In this lecture, you will learn about -

• LL(1) Parsing Algorithm


• Examples on LL(1) parsing algorithm
• Identify whether a grammar belongs to
LL(1)
• LL(k) Grammars
Compiler Design
Parsing using an LL(1) Parsing
Table
● Given an LL(1) Parsing table, how do we parse a given input
string?
● Initially,
○ Stack contains start symbol S
○ Input buffer contains the input string w
Stack Input Buffer Action
S$ w$
● Parsing Algorithm is applied.
● Finally, if the string is accepted, the stack and input buffer
contain $, indicating that they are empty.

Stack Input Buffer Action


$ $ accept
Compiler Design
LL(1) Parsing
Algorithm
// Let a be the current input symbol and M be the parsing table.
// Let X be the current stack top.
while (X != $) {
    if (X == a) {                          // X is a terminal matching the input
        stack.pop();
        // delete a from the input buffer; the next symbol becomes a
    }
    else if (M[X,a] == X -> Y1Y2...Yk) {   // X is a non-terminal
        stack.pop();
        stack.push(Yk, ..., Y2, Y1);       // push in reverse so Y1 is the stack top
    }
    else
        error();
}
// When X == $ and the input symbol is also $, the string is accepted.
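As a concrete illustration, here is a minimal, self-contained C sketch of this loop (an assumed encoding, not from the slides) for the small grammar S → AB; A → a | λ; B → b | λ seen earlier. The stack is a string whose first character is the top; "" in the table encodes a λ-production and NULL a blank (error) cell:

#include <stdio.h>
#include <string.h>

static const char *table[3][3] = {
    /*        a      b      $   */
    /* S */ {"AB",  "AB",  "AB"},
    /* A */ {"a",   "",    ""  },
    /* B */ {NULL,  "b",   ""  },
};

static int row(char nt) { return nt == 'S' ? 0 : nt == 'A' ? 1 : 2; }
static int col(char t)  { return t == 'a' ? 0 : t == 'b' ? 1 : 2; }

int main(void) {
    const char *w = "ab$";                 /* input string followed by $ */
    char stack[64] = "S$";                 /* stack top is stack[0] */
    int ip = 0;

    while (stack[0] != '$') {
        char X = stack[0], a = w[ip];
        if (X == 'S' || X == 'A' || X == 'B') {      /* non-terminal on top */
            const char *rhs = table[row(X)][col(a)];
            if (!rhs) { printf("error at %c\n", a); return 1; }
            char rest[64];
            strcpy(rest, stack + 1);       /* pop X ... */
            strcpy(stack, rhs);            /* ... and push rhs so that its */
            strcat(stack, rest);           /* first symbol is the new top  */
        } else if (X == a) {               /* terminal: match */
            memmove(stack, stack + 1, strlen(stack));
            ip++;
        } else {
            printf("error: expected %c\n", X); return 1;
        }
    }
    puts(w[ip] == '$' ? "accepted" : "rejected");
    return 0;
}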
Compiler Design
LL(1) Parsing Algorithm - Example
1
● Given the following LL(1) Parsing table M,
parse the input string
“id + id * id”
Compiler Design
LL(1) Parsing Algorithm - Example
1

● Given the following LL(1) Parsing table M,


parse the input string:
“id + id * id”

Initially, the stack and input buffer contain -

Stack Input Buffer Action


E$ id + id * id $
Compiler Design
LL(1) Parsing Algorithm - Example
1

Now, find the action for the stack top E and the current input symbol id in M:
M[E,id] = E → TE'

Stack | Input Buffer   | Action
E$    | id + id * id $ | M[E,id] = E → TE': pop E, push TE'
TE'$  | id + id * id $ |

Based on the parsing algorithm, pop E from the stack and push TE'. The current stack top changes to T.
Compiler Design
LL(1) Parsing Algorithm - Example
1

Stack  | Input Buffer   | Action
E$     | id + id * id $ | M[E,id] = E → TE': pop E, push TE'
TE'$   | id + id * id $ | M[T,id] = T → FT': pop T, push FT'
FT'E'$ | id + id * id $ |

The current stack top changes to F.
Compiler Design
LL(1) Parsing Algorithm - Example
1

Stack   | Input Buffer   | Action
E$      | id + id * id $ | M[E,id] = E → TE': pop E, push TE'
TE'$    | id + id * id $ | M[T,id] = T → FT': pop T, push FT'
FT'E'$  | id + id * id $ | M[F,id] = F → id: pop F, push id
idT'E'$ | id + id * id $ |

The current stack top changes to id.
Compiler Design
LL(1) Parsing Algorithm - Example
1
Stack   | Input Buffer   | Action
E$      | id + id * id $ | M[E,id] = E → TE': pop E, push TE'
TE'$    | id + id * id $ | M[T,id] = T → FT': pop T, push FT'
FT'E'$  | id + id * id $ | M[F,id] = F → id: pop F, push id
idT'E'$ | id + id * id $ | M[id,id]: id == id, so pop id and delete id from the input buffer
T'E'$   | + id * id $    |

The current stack top changes to T'.
Compiler Design
LL(1) Parsing Algorithm - Example
1
Stack   | Input Buffer   | Action
E$      | id + id * id $ | M[E,id] = E → TE': pop E, push TE'
TE'$    | id + id * id $ | M[T,id] = T → FT': pop T, push FT'
FT'E'$  | id + id * id $ | M[F,id] = F → id: pop F, push id
idT'E'$ | id + id * id $ | M[id,id]: match, pop id, delete id from the input buffer
T'E'$   | + id * id $    | M[T',+] = T' → λ: pop T'
E'$     | + id * id $    |

The current stack top changes to E'.
Compiler Design
LL(1) Parsing Algorithm - Example
1

Stack  | Input Buffer | Action
…      | …            | …
E'$    | + id * id $  | M[E',+] = E' → +TE': pop E', push +TE'
+TE'$  | + id * id $  |

The current stack top changes to +.
Compiler Design
LL(1) Parsing Algorithm - Example
1

Stack  | Input Buffer | Action
…      | …            | …
E'$    | + id * id $  | M[E',+] = E' → +TE': pop E', push +TE'
+TE'$  | + id * id $  | M[+,+]: match, pop +, remove + from the input buffer
TE'$   | id * id $    |

The current stack top changes to T.
Compiler Design
LL(1) Parsing Algorithm - Example
1
Stack  | Input Buffer | Action
…      | …            | …
TE'$   | id * id $    | M[T,id] = T → FT': pop T, push FT'
FT'E'$ | id * id $    |

The current stack top changes to F.
Compiler Design
LL(1) Parsing Algorithm - Example
1
Stack   | Input Buffer | Action
…       | …            | …
FT'E'$  | id * id $    | M[F,id] = F → id: pop F, push id
idT'E'$ | id * id $    |

The current stack top changes to id.
LL(1) Parsing Algorithm - Example
1
Stack Input Buffer Action
… …. …
id T’ E’ $ id * id $ M[id,id] = match
pop id from
stack M[id,id] = match
remove id from i/p buffer pop id from
stack
T’ E’ $ * id $
remove id from i/p buffer
Current stack top changes to
F.
Compiler Design
LL(1) Parsing Algorithm - Example
1
Stack   | Input Buffer | Action
…       | …            | …
T'E'$   | * id $       | M[T',*] = T' → *FT': pop T', push *FT'
*FT'E'$ | * id $       |

The current stack top changes to *.
Compiler Design
LL(1) Parsing Algorithm - Example
1
Stack   | Input Buffer | Action
…       | …            | …
*FT'E'$ | * id $       | M[*,*]: match, pop *, remove * from the input buffer
FT'E'$  | id $         |

The current stack top changes to F.
Compiler Design
LL(1) Parsing Algorithm - Example
1
Stack   | Input Buffer | Action
…       | …            | …
FT'E'$  | id $         | M[F,id] = F → id: pop F, push id
idT'E'$ | id $         |

The current stack top changes to id.
Compiler Design
LL(1) Parsing Algorithm - Example
1

Stack   | Input Buffer | Action
…       | …            | …
idT'E'$ | id $         | M[id,id]: match, pop id, remove id from the input buffer
T'E'$   | $            | M[T',$] = T' → λ: pop T'
E'$     | $            | M[E',$] = E' → λ: pop E'
$       | $            | Accept

Thus, the string id+id*id is accepted by the parser.
Compiler Design
LL(1) Parsing Algorithm - Example 1 -
Homework

Parse the string (id * id).
Compiler Design
LL(1) Parsing Algorithm - Example
2

Construct the LL(1) parsing table for the given grammar:
S → a | (L)
L → L,S | S
Then, demonstrate parsing using the input string (a,a).
Compiler Design
LL(1) Parsing Algorithm - Example
2

Given -
S → a | (L)
L → L,S | S

First, remove left recursion.
● Arrange the non-terminals: S, L.
● Substitute the S-productions into L:
S → a | (L)
L → L,S | a | (L)
● Eliminate the immediate left recursion of L:
S → a | (L)
L → aL' | (L)L'
L' → ,SL' | λ
Compiler Design
LL(1) Parsing Algorithm - Example
2

Construct the First and Follow tables for the modified grammar:
S → a | (L)
L → aL' | (L)L'
L' → ,SL' | λ

First(S) = { a, ( }; First(L) = { a, ( }; First(L') = { ',', λ }
Follow(S) = { ',', ), $ }; Follow(L) = { ) }; Follow(L') = { ) }
Compiler Design
LL(1) Parsing Algorithm - Example
2

Construct the LL(1) parsing table.

      | a     | (       | )    | ,       | $
S     | S→a   | S→(L)   |      |         |
L     | L→aL' | L→(L)L' |      |         |
L'    |       |         | L'→λ | L'→,SL' |
Compiler Design
LL(1) Parsing Algorithm - Example
2
Parse the input string (a,a):

Stack  | Input Buffer | Action
S$     | (a,a)$       | M[S,(] = S → (L)
(L)$   | (a,a)$       | match (
L)$    | a,a)$        | M[L,a] = L → aL'
aL')$  | a,a)$        | match a
L')$   | ,a)$         | M[L',,] = L' → ,SL'
,SL')$ | ,a)$         | match ,
SL')$  | a)$          | M[S,a] = S → a
aL')$  | a)$          | match a
L')$   | )$           | M[L',)] = L' → λ
)$     | )$           | match )
$      | $            | Accept

Parsing successful.
Compiler Design
Identify whether a grammar belongs to
LL(1)

• If the grammar is not left factored and/or left recursion is not eliminated, it does not belong to LL(1).
• If the LL(1) table has more than one entry in any cell, the grammar does not belong to the LL(1) class of grammars.
• Formally, to check whether a grammar belongs to LL(1) without constructing the table, the following rules are checked. A grammar is NOT in LL(1) if:
1. For A → α | β (two or more alternatives for a non-terminal), First(α) ∩ First(β) ≠ Φ, i.e., the alternatives have a first symbol in common.
2. For A → α | λ (a non-terminal with a λ-alternative), First(α) ∩ Follow(A) ≠ Φ.
Compiler Design
Identify whether a grammar belongs to LL(1) - Example
3

Is the given grammar LL(1)? Also, compute the First and Follow sets.

S → AaAb | BbBa
A → λ
B → λ
Compiler Design
Identify whether a grammar belongs to LL(1) - Example
3
S → AaAb | BbBa
A → λ
B → λ

Compute the First and Follow sets:
First(S) = { a, b }; First(A) = { λ }; First(B) = { λ }
Follow(S) = { $ }; Follow(A) = { a, b }; Follow(B) = { a, b }
Compiler Design
Identify whether a grammar belongs to LL(1) - Example
3

S → AaAb | BbBa is of the form S → α | β.

First(AaAb) = { a } (A is nullable, so the a that follows it contributes)
First(BbBa) = { b } (similarly)

Since First(AaAb) ∩ First(BbBa) = Φ, the given grammar belongs to LL(1).
Compiler Design
Identify whether a grammar belongs to LL(1) - Example
4

Is the given grammar LL(1)? If not, modify and re-check.

S → iCtS | iCtSeS | a
C → b
Compiler Design
Identify whether a grammar belongs to LL(1) - Example 4
• If we construct the parse table M for the following grammar:
S → iCtS | iCtSeS | a
C → b
• M[S,i] will have 2 productions:
S → iCtS
S → iCtSeS
• Therefore, it is not in LL(1).
• To modify, left factor the above grammar:
S → iCtSX | a
X → eS | λ
C → b
Compiler Design
Identify whether a grammar belongs to LL(1) - Example
4
• Is the modified grammar in LL(1)?
S → iCtSX | a
X → eS | λ
C → b
• Notice that the production X → eS | λ is of the form A → α | λ.
• First(eS) = { e }
• Follow(X) = { e, $ }
• Thus, First(eS) ∩ Follow(X) = { e } ≠ Φ.
• Therefore, the modified grammar is also not in LL(1).
LL(k) grammars,
k>1
As mentioned
previously,

• When k = 1, we can choose the next production rule using 1


lookahead symbol.
• That is, by reading 1 input character.
• For example,
S → a | b
Compiler Design
LL(k) grammars,
k>1
• For k > 1, we can choose the next production after reading k input characters.
• For example, S → ab | ac | ad
• Here, the production can be chosen only after reading 2 input characters.
• Therefore, the above grammar belongs to the LL(2) class of grammars.
• LL(k) ⊂ LL(k+1)
• If the LL(k) table (whose cells are indexed by k-symbol look-aheads) has more than one entry in any cell, the grammar does not belong to the LL(k) class of grammars.
Compiler Design
Identify whether a grammar belongs to LL(k) -
Solution
1. S → iCtS | iCtSeS | a
   C → b
Answer - No. It is not left-factored.

2. S → aaB | aaC
   B → b
   C → c
Answer - Yes. It belongs to LL(3).

3. E → T+E | T
   T → id | id * T | (E)
Answer - Yes. It belongs to LL(2), as the grammar needs 2 input symbols to choose a production.
Compiler Design
Identify whether a grammar belongs to
LL(k)
#  | Grammar                                           | LL(k)
1  | S → ab | ac | ad                                  | LL(2)
2  | S → aaB | aaC ; B → b ; C → c                     | LL(3)
3  | E → T+E | T ; T → id | id * T | ( E )             | LL(2)
4  | S → Abbx | Bbby ; A → x ; B → x                   | LL(4)
5  | S → Z ; Z → aMa | bMb | aRb | bRa ; M → c ; R → c | LL(3)
Compiler Design
Identify whether a grammar belongs to LL(k) -
Examples

Are the following grammars in LL(k)? If yes, find k.
1. S → iCtS | iCtSeS | a
   C → b
2. S → aaB | aaC
   B → b
   C → c
3. E → T+E | T
   T → id | id * T | (E)
Compiler Design

Unit 2
Predictive Parsers
(LL(1))

Compiler Design
Lecture
Overview

In this lecture, you will learn about -
• Error recovery in the LL(1) parser
• RDP without backtracking vs Predictive Parsers
Compiler Design
Error Recovery in LL(1)
Parser

● When an action M[X,Y] is blank in the parsing table and it is encountered during the parsing of an input string, it indicates a syntax error.
● How can we handle errors in LL(1) parsers?
● One of the most commonly used error recovery strategies in parsers is Panic Mode Recovery.
● Recall that in Panic Mode Recovery,
○ the parser discards input symbols until it finds a delimiter, and then restarts the parsing process;
○ such delimiters are called synchronizing tokens.
Compiler Design
Error Recovery in LL(1)
Parser
How is Panic Mode Recovery implemented in LL(1) parsers?
● For each non-terminal X in the parsing table M, add a synchronizing token sync under the elements of Follow(X) wherever the cell is blank.
● Parsing procedure:
If M[X,a] == blank        // i.e., syntax error
    ignore the input symbol a
If M[X,a] == sync
    if X is the only symbol on the stack, ignore the input symbol a
    else pop X from the stack
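A minimal C sketch of the same procedure (an assumed encoding, not from the slides) for the expression grammar of Example 1 below. Tokens are single characters (i = id, n = num); non-terminals are E, X = E', T, Y = T', F; NULL cells are blank (skip the input), "@" marks sync, and "" encodes a λ-production. All input characters are assumed to be valid tokens:

#include <stdio.h>
#include <string.h>

static const char *nts = "EXTYF";
static const char *ts  = "i+*n()$";

static const char *table[5][7] = {
    /*        i      +      *      n      (      )      $    */
    /* E */ {"TX",  NULL,  NULL,  "TX",  "TX",  "@",   "@" },
    /* X */ {NULL,  "+TX", NULL,  NULL,  NULL,  "",    ""  },
    /* T */ {"FY",  "@",   NULL,  "FY",  "FY",  "@",   "@" },
    /* Y */ {NULL,  "",    "*FY", NULL,  NULL,  "",    ""  },
    /* F */ {"i",   "@",   "@",   "n",   "(E)", "@",   "@" },
};

static void pop(char *stack) { memmove(stack, stack + 1, strlen(stack)); }

int main(void) {
    const char *w = ")i*+i$";            /* the erroneous string ) id * + id */
    char stack[64] = "E$";               /* stack top is stack[0] */
    int ip = 0;

    while (stack[0] != '$') {
        char X = stack[0], a = w[ip];
        const char *nt = strchr(nts, X);
        if (!nt) {                                   /* terminal on top */
            if (X == a) ip++;                        /* match */
            else printf("error: discard unmatched %c\n", X);
            pop(stack);
        } else {
            const char *rhs = table[nt - nts][strchr(ts, a) - ts];
            if (rhs == NULL) {                       /* blank cell */
                printf("error: skip input %c\n", a);
                ip++;
            } else if (rhs[0] == '@') {              /* sync cell */
                if (stack[1] == '$' && stack[2] == '\0') {
                    printf("sync: %c is alone, skip input %c\n", X, a);
                    ip++;
                } else {
                    printf("sync: pop %c\n", X);
                    pop(stack);
                }
            } else {                                 /* expand X -> rhs */
                char rest[64];
                strcpy(rest, stack + 1);
                strcpy(stack, rhs);
                strcat(stack, rest);
            }
        }
    }
    puts(w[ip] == '$' ? "parsed to the end of the input"
                      : "stopped with input remaining");
    return 0;
}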
Compiler Design
Error Recovery in LL(1) Parser - Example
1
Implement the parsing table for the following grammar with panic mode recovery. Also, parse the string ")id*+id".

E → TE'
E' → +TE' | λ
T → FT'
T' → *FT' | λ
F → num | id | (E)
Compiler Design
Error Recovery in LL(1) Parser - Example
1
Given,
E → TE'
E' → +TE' | λ
T → FT'
T' → *FT' | λ
F → num | id | (E)

First(E) = First(T) = First(F) = { id, num, ( }; First(E') = { +, λ }; First(T') = { *, λ }
Follow(E) = Follow(E') = { ), $ }; Follow(T) = Follow(T') = { +, ), $ }; Follow(F) = { *, +, ), $ }
Compiler Design
Error Recovery in LL(1) Parser - Example
1
The LL(1) parsing table without panic mode recovery:

      | id    | +       | *       | num    | (      | )     | $
E     | E→TE' |         |         | E→TE'  | E→TE'  |       |
E'    |       | E'→+TE' |         |        |        | E'→λ  | E'→λ
T     | T→FT' |         |         | T→FT'  | T→FT'  |       |
T'    |       | T'→λ    | T'→*FT' |        |        | T'→λ  | T'→λ
F     | F→id  |         |         | F→num  | F→(E)  |       |


Compiler Design
Error Recovery in LL(1) Parser - Example
1
Adding sync tokens under the elements of Follow(X), for each non-terminal X, wherever the cell is blank:

      | id    | +       | *       | num    | (      | )     | $
E     | E→TE' |         |         | E→TE'  | E→TE'  | sync  | sync
E'    |       | E'→+TE' |         |        |        | E'→λ  | E'→λ
T     | T→FT' | sync    |         | T→FT'  | T→FT'  | sync  | sync
T'    |       | T'→λ    | T'→*FT' |        |        | T'→λ  | T'→λ
F     | F→id  | sync    | sync    | F→num  | F→(E)  | sync  | sync
Compiler Design
Error Recovery in LL(1) Parser - Example
1

Now, using the new LL(1) Parsing table M, parse the input
string
) id * + id
Initially, the stack and input buffer contain -

Stack Input Buffer Action


E$ ) id * + id $
Compiler Design
LL(1) Parsing Algorithm - Example
1

Stack | Input Buffer | Action
E$    | )id*+id$     | M[E,)] = sync: skip )
E$    | id*+id$      |

Based on the new parsing algorithm, since E is the only symbol on the stack, ignore the current input symbol ).
Compiler Design
LL(1) Parsing Algorithm - Example
1
Continue parsing.

Stack   | Input Buffer | Action
E$      | )id*+id$     | M[E,)] = sync: skip )
E$      | id*+id$      | M[E,id] = E → TE'
TE'$    | id*+id$      | M[T,id] = T → FT'
FT'E'$  | id*+id$      | M[F,id] = F → id
idT'E'$ | id*+id$      | match id
T'E'$   | *+id$        | M[T',*] = T' → *FT'
*FT'E'$ | *+id$        | match *
FT'E'$  | +id$         | M[F,+] = sync: pop F
T'E'$   | +id$         |

F is not the only symbol on the stack, so pop F. The current stack top is T'.
Compiler Design
LL(1) Parsing Algorithm - Example
1
Continue parsing.

Stack   | Input Buffer | Action
…       | …            | …
T'E'$   | +id$         | M[T',+] = T' → λ
E'$     | +id$         | M[E',+] = E' → +TE'
+TE'$   | +id$         | match +
TE'$    | id$          | M[T,id] = T → FT'
FT'E'$  | id$          | M[F,id] = F → id
idT'E'$ | id$          | match id
T'E'$   | $            | M[T',$] = T' → λ
E'$     | $            | M[E',$] = E' → λ
$       | $            | Accept

Thus, the input string is parsed to the end even though it contained several syntax errors.
Compiler Design
LL(1) Parsing Algorithm - Example 1 -
Homework

Parse the erroneous string id ( + ) id.
Compiler Design
RDP without Backtracking vs Predictive
Parsers

RDP without Backtracking | Predictive Parsers
It uses mutually recursive procedures, one for every non-terminal, to parse strings. | It uses a lookahead pointer which points to the next k input symbols; this places a constraint on the grammar.
It accepts all grammars. | It accepts only LL(k) grammars.

Thus, RDP without backtracking is more powerful, as it accepts a larger set of grammars.
Compiler Design

Unit 2
Predictive Parsers
(LL(1))

Compiler Design
Lecture
Overview

In this lecture, you will learn about -

• Additional examples on Error recovery in LL(1)


Parser
Compiler Design
Error Recovery in LL(1) Parser - Example
2
Implement the parsing table for the following grammar with panic mode recovery. Also, parse the string a(a).

S → a | (L)
L → aL' | (L)L'
L' → ,SL' | λ
Compiler Design
Error Recovery in LL(1) Parser - Example
2

Construct the First and Follow tables for the grammar:
S → a | (L)
L → aL' | (L)L'
L' → ,SL' | λ

First(S) = { a, ( }; First(L) = { a, ( }; First(L') = { ',', λ }
Follow(S) = { ',', ), $ }; Follow(L) = { ) }; Follow(L') = { ) }
Compiler Design
Error Recovery in LL(1) Parser - Example
2

The LL(1) parsing table without panic mode recovery is:

      | a     | (       | )    | ,       | $
S     | S→a   | S→(L)   |      |         |
L     | L→aL' | L→(L)L' |      |         |
L'    |       |         | L'→λ | L'→,SL' |
Compiler Design
Error Recovery in LL(1) Parser - Example 2

Adding sync tokens under the elements of Follow(X) for each non-terminal X (Follow(S) = { ',', ')', $ }; Follow(L) = Follow(L') = { ) }):

      | a     | (       | )    | ,       | $
S     | S→a   | S→(L)   | sync | sync    | sync
L     | L→aL' | L→(L)L' | sync |         |
L'    |       |         | L'→λ | L'→,SL' |
Compiler Design
Error Recovery in LL(1) Parser - Example 2

Now, using the new LL(1) parsing table M, parse the input string a(a). Initially, the stack and input buffer contain -

Stack Input Buffer Action


S$ a(a)$
Compiler Design
LL(1) Parsing Algorithm - Example
2

Stack Input Buffer Action


S$ a ( a ) $ M[S,a] = S->a
a$ a ( a ) $ M[a,a] = match
$ (a)$
Compiler Design
LL(1) Parsing Algorithm - Example 2 -
Homework

Parse the erroneous string ^(,a,a).