Compiler Design
Compiler writing spans programming languages, machine architecture, language theory, algorithms, and
software engineering. Fortunately, a few basic compiler-writing techniques can be used to construct translators for a
wide variety of languages and machines.
A compiler is a program that reads a program written in one language – the source language – and translates
it into an equivalent program in another language – the target language – as illustrated in Figure 1.1. An
important part of the translation process is that the compiler reports to its user the presence of errors in the source
program.
[Figure 1.1: the compiler takes a source program as input and produces a target program, reporting error messages to the user.]
At first glance, the variety of compilers may appear overwhelming. There are thousands of source languages,
ranging from traditional programming languages such as Fortran and Pascal to specialized languages that have arisen
in virtually every area of computer application. Target languages are equally varied; a target language may be
another programming language, or the machine language of any computer, ranging from a microprocessor to a
supercomputer.
A compiler translates a source program into machine language. An interpreter program reads individual words
of a source program and immediately executes corresponding machine-language segments. Interpretation occurs each
time the program is used. Thus, once it has been compiled, a program written in a compiled language will run more
quickly than a program in an interpreted language.
An interpreter is a computer program that translates commands written in a high-level computer language into
machine-language commands that the computer can understand and execute. An interpreter's function is thus similar
to that of a compiler, but the two differ in their mode of operation. A compiler translates a complete set of high-level
commands, producing a corresponding set of machine-language commands that are then executed, whereas an
interpreter translates and executes each command as the user enters it. Interpretive languages, such as the widely
used BASIC, are relatively easy to learn, and programs written in them are easy to edit and correct. Compiled
programs, on the other hand, operate more rapidly and efficiently.
There are two parts of compilation: analysis and synthesis. The analysis part breaks up the source program
into constituent pieces and creates an intermediate representation of the source program. The synthesis part
constructs the desired target program from the intermediate representation.
During analysis, the operations implied by the source program are determined and recorded in a hierarchical
structure called a tree. It is a special kind of tree called a syntax tree, in which each node represents an operation and
the children of a node represent the arguments of the operation. An example is shown in
Figure 1.2.
[Figure 1.2: syntax tree for the assignment position := initial + rate * 60, with := at the root, children position and +, the + node having children initial and *, and the * node having children rate and 60.]
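As an illustration (not part of the text), the tree of Figure 1.2 can be modeled with nested tuples, where each node is an operation followed by its arguments:

```python
# A minimal sketch of the syntax tree in Figure 1.2 for the assignment
#   position := initial + rate * 60
# Each operation node is (operator, children...); leaves are plain strings.

tree = (":=", "position",
        ("+", "initial",
         ("*", "rate", "60")))

def operands(node):
    """Return the children (arguments) of an operation node."""
    return node[1:] if isinstance(node, tuple) else ()

# The children of the root ':=' node are its two arguments.
print(operands(tree)[0])   # position
```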
Besides compilers, several other software tools perform similar analysis of source programs:
1. Structure editors – takes as input a sequence of commands to build a source program. The structure editor
not only performs the text-creation and modification functions of an ordinary text editor, but it also analyzes
the program text, imposing an appropriate hierarchical structure on the source program.
2. Pretty printers – analyzes a program and prints it in such a way that the structure of the program becomes
clearly visible. For example, comments may appear in a special font, and statements may appear with an
amount of indentation proportional to the depth of their nesting in the hierarchical organization of the
statements.
3. Static checkers – reads a program, analyzes it, and attempts to discover potential bugs without running the
program. For example, a static checker may detect that parts of the source program can never be executed,
or that a certain variable might be used before being defined.
4. Interpreters – performs the operations implied by the source program. For an assignment statement, for
example, an interpreter might build a tree like in Figure 1.2 and then carry out the operations at the nodes as
it “walks” the tree. Interpreters are frequently used to execute command languages, since each operator
executed in a command language is usually an invocation of a complex routine such as an editor or compiler.
In compiling, analysis consists of three phases:
1. Linear analysis, in which the stream of characters making up the source program is read from left to right and
grouped into tokens that are sequences of characters having a collective meaning.
2. Hierarchical analysis, in which characters or tokens are grouped hierarchically into nested collections with
collective meaning.
3. Semantic analysis, in which certain checks are performed to ensure that the components of a program fit
together meaningfully.
[Figure 1.3: phases of a compiler. The source program passes through the lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator, code optimizer, and code generator to become the target program; the symbol-table manager and the error handler interact with all six phases.]
LEXICAL ANALYSIS
In a compiler, linear analysis is called lexical analysis or scanning. For example, in lexical analysis the
characters in the assignment statement
position := initial + rate * 60
would be grouped into the following tokens: the identifier position, the assignment symbol :=, the identifier initial,
the plus sign, the identifier rate, the multiplication sign, and the number 60. The blanks separating the characters
of these tokens would normally be eliminated during lexical analysis.
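As an illustration (not the text's implementation), a small Python scanner can group the characters of such a statement into tokens; the token names and regular-expression patterns here are assumptions chosen for the example:

```python
import re

# A hedged sketch of linear (lexical) analysis. The token names follow
# the discussion above; the patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("num",    r"\d+(?:\.\d+)?"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("assign", r":="),
    ("op",     r"[+\-*/]"),
    ("skip",   r"\s+"),          # blanks are eliminated during scanning
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokens(source):
    """Group characters into (token, lexeme) pairs, dropping blanks."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(source)
            if m.lastgroup != "skip"]

print(tokens("position := initial + rate * 60"))
# [('id', 'position'), ('assign', ':='), ('id', 'initial'),
#  ('op', '+'), ('id', 'rate'), ('op', '*'), ('num', '60')]
```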
SYNTAX ANALYSIS
Hierarchical analysis is called parsing or syntax analysis. It involves grouping the tokens of the source
program into grammatical phrases that are used by the compiler to synthesize output. Usually, a parse tree such as
the one shown in Figure 1.4 represents the grammatical phrases of the source program.
SEMANTIC ANALYSIS
The semantic analysis phase checks the source program for semantic errors and gathers type information for
the subsequent code-generation phase. It uses the hierarchical structure determined by the syntax-analysis phase to
identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking. Here the compiler checks that each operator
has operands that are permitted by the source language specification. For example, many programming language
definitions require a compiler to report an error every time a real number is used to index an array. However, the
language specification may permit some operand coercions, for example, when a binary arithmetic operator is applied
to an integer and real. In this case, the compiler may need to convert the integer to a real.
SYMBOL-TABLE MANAGEMENT
An essential function of a compiler is to record the identifiers used in the source program and collect
information about various attributes of each identifier. These attributes may provide information about the storage
allocated for an identifier, its type, its scope, and in the case of procedure names, such things as the number and
types of its arguments, the method of passing each argument, and the type returned, if any.
A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the
identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from
that record quickly.
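A minimal sketch of such a data structure (an assumption, not the text's design) is a dictionary keyed by identifier, whose record holds the attributes the section lists:

```python
# A minimal symbol-table sketch: one record (a dict of attributes such as
# type, scope, storage) per identifier, with fast store and retrieval.

class SymbolTable:
    def __init__(self):
        self._records = {}

    def enter(self, name):
        # The lexical analyzer enters the identifier; attributes come later.
        self._records.setdefault(name, {})

    def set_attr(self, name, attr, value):
        # Later phases (e.g. semantic analysis) fill in the attributes.
        self._records[name][attr] = value

    def lookup(self, name):
        return self._records.get(name)

table = SymbolTable()
for ident in ("position", "initial", "rate"):
    table.enter(ident)                   # seen by the lexical analyzer
table.set_attr("rate", "type", "real")   # learned during later analysis
print(table.lookup("rate"))              # {'type': 'real'}
```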
When the lexical analyzer detects an identifier in the source program, the identifier is entered into the symbol
table. However, the attributes of an identifier cannot normally be determined during lexical analysis. For example, in
a Pascal declaration like
var position, initial, rate : real;
the type real is not known when position, initial, and rate are seen by the lexical analyzer.
INTERMEDIATE CODE GENERATION
After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the
source program. We can think of this intermediate representation as a program for an abstract machine. This
intermediate representation should have two important properties: it should be easy to produce, and easy to translate
into the target program.
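For the assignment statement of Figure 1.2, a common intermediate form is three-address code, in which each instruction has at most one operator in addition to the assignment. A hedged sketch (the temporary names and the inttoreal conversion are illustrative):

```
temp1 := inttoreal(60)
temp2 := rate * temp1
temp3 := initial + temp2
position := temp3
```

Code optimization might later observe that the conversion of 60 can be done at compile time and that temp3 is used only once, shrinking this sequence to two instructions.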
CODE OPTIMIZATION
The code optimization phase attempts to improve the intermediate code, so that faster-running machine code
will result.
CODE GENERATION
The final phase of the compiler is the generation of target code, consisting normally of relocatable machine
code or assembly code. Memory locations are selected for each of the variables used by the program. Then,
intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A
crucial aspect is the assignment of variables to registers.
ERROR-HANDLING
Each phase can encounter errors. However, after detecting an error, a phase must somehow deal with that
error, so that compilation can proceed, allowing further errors in the source program to be detected. A compiler that
stops when it finds the first error is not as helpful as it could be.
The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the
compiler. The lexical phase can detect errors where the characters remaining in the input do not form any token of
the language. Errors where the token stream violates the structure rules (syntax) of the language are determined by
the syntax analysis phase. During semantic analysis the compiler tries to detect constructs that have the right
syntactic structure but no meaning to the operation involved.
COMPILER-WRITING TOOLS
The compiler writer, like any programmer, can profitably use software tools such as debuggers, version
managers, profilers, and so on. Shortly after the first compilers were written, systems to help with the compiler-
writing process appeared. These systems have often been referred to as compiler-compilers, compiler-generators, or
translator-writing systems. Largely, they are oriented around a particular model of languages, and they are most
suitable for generating compilers of languages similar to the model.
Some general tools have been created for the automatic design of specific compiler components. These tools
use specialized languages for specifying and implementing the component, and many use algorithms that are quite
sophisticated. The most successful tools are those that hide the details of the generation algorithm and produce
components that can be easily integrated into the remainder of a compiler. The following is a list of some useful
compiler-construction tools:
1. Parser generators. These produce syntax analyzers, normally from input that is based on a context-free grammar.
In early compilers, syntax analysis consumed not only a large fraction of the running time of a compiler, but also a
large fraction of the intellectual effort of writing a compiler. This phase is now considered one of the easiest to
implement. Many parser generators utilize powerful parsing algorithms that are too complex to be carried out by
hand.
2. Scanner generators. These automatically generate lexical analyzers, normally from a specification based on regular
expressions. The basic organization of the resulting lexical analyzer is in effect a finite automaton.
3. Syntax-directed translation engines. These produce collections of routines that walk the parse tree, generating
intermediate code. The basic idea is that one or more “translations” are associated with each node of the parse
tree, and each translation is defined in terms of translations at its neighbor nodes in the tree.
4. Automatic code generators. Such a tool takes a collection of rules that define the translation of each operation of
the intermediate language into the machine language for the target machine. The rules must include sufficient
detail that we can handle the different possible access methods for data; e.g., variables may be in registers, in a
fixed (static) location in memory, or may be allocated a position on a stack. The basic technique is “template
matching.” The intermediate code statements are replaced by “templates” that represent sequences of machine
instructions, in such a way that the assumptions about storage of variables match from template to template.
5. Data-flow engines. Much of the information needed to perform good code optimization involves “data-flow
analysis,” the gathering of information about how values are transmitted from one part of a program to each other
part. Different tasks of this nature can be performed by essentially the same routine, with the user supplying
details of the relationship between intermediate code statements and the information being gathered.
The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce
as output a sequence of tokens that the parser uses for syntax analysis. Upon receiving a “get next token” command
from the parser, the lexical analyzer reads input characters until it can identify the next token.
Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain
secondary tasks at the user interface. One such task is stripping out from the source program comments and white
spaces in the form of blank, tab, and new line characters. Another is correlating error messages from the compiler
with the source program.
Sometimes lexical analyzers are divided into a cascade of two phases, the first called “scanning” and the
second “lexical analysis”. The scanner is responsible for doing simple tasks, while the lexical analyzer proper does the
more complex operations.
Reasons for separating the analysis phase of compiling into lexical analysis and parsing:
1. Simpler design is perhaps the most important consideration. The separation of lexical analysis from syntax
analysis often allows us to simplify one or the other of these phases.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to construct a specialized and potentially
more efficient processor for the task.
3. Compiler portability is enhanced. Input alphabet peculiarities and other device-specific anomalies can be
restricted to the lexical analyzer.
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Examples of tokens
Token      Sample Lexemes          Informal Description of Pattern
const      const                   const
if         if                      if
relation   <, <=, =, <>, >, >=     < or <= or = or <> or > or >=
id         pi, count, D2           letter followed by letters and digits
num        3.1416, 0, 6.02E23      any numeric constant
literal    “core dumped”           any characters between “ and ” except “
Tokens are treated as terminal symbols in the grammar for the source language, using boldface names to
represent tokens. The lexemes matched by the pattern for the token represent strings of characters in the source
program that can be treated together as a lexical unit.
In most programming languages, the following constructs are treated as tokens: keywords, operators, identifiers,
constants, literal strings, and punctuation symbols such as parentheses, commas, and semicolons.
A pattern is a rule describing the set of lexemes that can represent a particular token in a source program.
The pattern for the token const is just the single string const that spells out the keyword.
LEXICAL ERRORS
Few errors are discernible at the lexical level alone, because a lexical analyzer has a very localized view of the
source program.
But suppose a situation does arise in which the lexical analyzer is unable to proceed because none of the
patterns for the tokens matches a prefix of the remaining input. Perhaps the simplest recovery strategy is “panic
mode” recovery: successive characters are deleted from the remaining input until the lexical analyzer can find a
well-formed token.
Other possible error-recovery actions are deleting an extraneous character, inserting a missing character,
replacing an incorrect character by a correct one, and transposing two adjacent characters. Error transformations
like these may be tried in attempting to repair the input. The simplest such strategy is to see whether a prefix of
the remaining input can be transformed into a valid lexeme by just a single error transformation.
One way of finding the error in a program is to compute the minimum number of error transformations
required to transform the erroneous program into one that is syntactically well formed.
INPUT BUFFERING
Three general approaches to the implementation of a lexical analyzer:
1. Use a lexical analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-
expression-based specification.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that
language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.
The three are listed in order of increasing difficulty for the implementor. Since the lexical analyzer is the only
phase of the compiler that reads the source program character by character, it is possible to spend a considerable
amount of time in the lexical analysis phase, even though the later phases are conceptually more complex.
Buffer Pairs
There are many times when the lexical analyzer needs to look ahead several characters beyond the lexeme for
a pattern before a match can be announced. Because a large amount of time can be consumed moving characters,
specialized buffering techniques have been developed to reduce the amount of overhead required to process an input
character.
[Figure: an input buffer divided into two halves, holding the input E := M * C ** 2 followed by an eof marker.]
Sentinels
With a buffer pair, each advance of the forward pointer requires two tests: one to check for the end of a
buffer half, and one to determine the character read. The two tests can be reduced to one if each buffer half is
extended to hold a sentinel character at the end.
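The sentinel idea can be sketched as follows (the buffer contents and the choice of sentinel are illustrative assumptions):

```python
# Each buffer half ends with a sentinel (here an eof marker that is not
# part of the source alphabet), so the inner loop needs only ONE test per
# character instead of two ("end of half?" and "which character?").

EOF = "\0"            # sentinel character, never part of the source text

def scan_with_sentinel(buffer):
    """Count source characters up to the sentinel, one test per advance."""
    forward = 0
    count = 0
    while buffer[forward] != EOF:   # the single test per advance
        count += 1
        forward += 1
    return count

buf = list("E := M * C ** 2") + [EOF]
print(scan_with_sentinel(buf))      # 15
```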
Term                     Definition
prefix of s              A string obtained by removing zero or more trailing symbols of string s; e.g., ban
                         is a prefix of banana.
suffix of s              A string formed by deleting zero or more of the leading symbols of s; nana is a
                         suffix of banana.
substring of s           A string obtained by deleting a prefix and a suffix from s; nan is a substring of
                         banana. Every prefix and every suffix of s is a substring of s, but not every
                         substring of s is a prefix or a suffix of s.
proper prefix, suffix,   Any nonempty string x that is, respectively, a prefix, suffix, or substring of s
or substring of s        such that s is not equal to x.
subsequence of s         Any string formed by deleting zero or more not necessarily contiguous symbols
                         from s; e.g., baaa is a subsequence of banana.
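The definitions above can be sketched with Python slices (an illustration, not part of the text):

```python
# Prefixes remove trailing symbols, suffixes remove leading symbols, and
# substrings remove a prefix and a suffix.

def prefixes(s):
    return {s[:i] for i in range(len(s) + 1)}

def suffixes(s):
    return {s[i:] for i in range(len(s) + 1)}

def substrings(s):
    return {s[i:j] for i in range(len(s)) for j in range(i, len(s) + 1)}

print(sorted(prefixes("ban")))   # ['', 'b', 'ba', 'ban']
```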
Operations on Languages
There are several important operations that can be applied to languages. For lexical analysis, the primary
operations are union, concatenation, and closure.
Operation                                Definition
union of L and M, written L ∪ M          L ∪ M = { s | s is in L or s is in M }
concatenation of L and M, written LM     LM = { st | s is in L and t is in M }
Kleene closure of L, written L*          L* denotes “zero or more concatenations of” L
positive closure of L, written L+        L+ denotes “one or more concatenations of” L
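These operations can be sketched on small finite languages (an illustration; Python sets stand in for languages, so only a finite slice of the closure can be computed):

```python
# Union, concatenation, and a finite slice of the Kleene closure.

def union(L, M):
    return L | M

def concat(L, M):
    return {s + t for s in L for t in M}

def power(L, n):                 # L^n: L concatenated with itself n times
    result = {""}                # L^0 = {e}
    for _ in range(n):
        result = concat(result, L)
    return result

def closure_up_to(L, n):         # the strings of L* of "degree" <= n
    out = set()
    for i in range(n + 1):
        out |= power(L, i)
    return out

L, M = {"a"}, {"b"}
print(union(L, M))                   # {'a', 'b'} (order may vary)
print(concat(L, M))                  # {'ab'}
print(sorted(closure_up_to(L, 3)))   # ['', 'a', 'aa', 'aaa']
```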
REGULAR EXPRESSIONS
A regular expression is built up out of simpler regular expressions using a set of defining rules. Each regular
expression r denotes a language L(r). The defining rules specify how L(r) is formed by combining in various ways
the languages denoted by the subexpressions of r.
The rules that define the regular expressions over an alphabet ∑ are:
1. ∈ is a regular expression that denotes { ∈ }, that is, the set containing only the empty string.
2. If a is a symbol in ∑, then a is a regular expression that denotes {a}, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
b) (r) (s) is a regular expression denoting L(r) L(s)
c) (r)* is a regular expression denoting (L(r))*
d) (r) is a regular expression denoting L(r)
A language denoted by a regular expression is said to be a regular set.
PRACTICE SET
1. What is the input alphabet of each of the following languages?
a. Pascal
b. C
c. C++
d. Visual Basic
2. What are the conventions regarding the use of the blanks in each language in number 1?
3. In a string of length n, how many of the following are there?
a. Prefixes
b. Suffixes
c. Substrings
d. Proper prefixes
e. Subsequences
4. Describe the language denoted by the following regular expressions:
a. 0(0|1)*
b. (0|1)*0(0|1)(0|1)
c. 0*10*10*
d. (00|11)*((01|10)(00|11))
5. Write regular definitions for the following languages:
a. all strings of letters that contain the five vowels in order
b. all strings of letters in which the letters are in ascending lexicographic order
c. comments consisting of a string surrounded by /* and */ without an intervening */, unless it appears
inside quotes “ ”
d. all strings of digits with at most one repeated digit
Definition
Consider a string w made up from the symbols in the alphabet Σ = {a, b} such as “a” or “b” or “ab” or “ababbbbb” or
“aaa” or “abbbaaabb” etc., then we define string “e”, called the empty string such that for any string w:
ew = we = w
The symbol “e” is then “neutral element” for strings when performing the concatenation operation. (This is similar to
defining 0 as the neutral element for real numbers when performing the operation of addition or defining 1 as the
neutral element for the operation of multiplication). It is important to notice that e is never part of the alphabet.
a* = {a}* = {e, a, aa, aaa, aaaa, ....}; this is the Kleene star (or Kleene closure) operation applied to one symbol of the alphabet.
a | a*b denotes the set containing the string a or all strings of zero or more a’s followed by a b.
Note : The notation “a | a*b” can also be written as “(a + a*b)” as we shall see when reviewing the regular
expressions. The symbol “|” or the symbol “+” in this case means the so-called “or” operation used in set theory.
Therefore, the expression a*b means the following:
a*b = {e, a, aa, aaa, ...}b = {b, ab, aab, aaab, ....},
that is, the set of all strings of zero or more a’s followed by a b.
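The claim can be checked with Python's re module (an illustration; fullmatch tests whether the whole string is in the language):

```python
import re

# a*b should match exactly the strings of zero or more a's followed by a b.
pattern = re.compile(r"a*b")

for w in ("b", "ab", "aab", "aaab"):
    assert pattern.fullmatch(w)          # in the language
for w in ("", "a", "ba", "abb"):
    assert not pattern.fullmatch(w)      # not in the language
print("a*b behaves as described")
```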
• Define a set of strings (i.e., a language L) over a given alphabet Σ and use a set of construction rules:
- The empty string e is a regular expression.
- Every symbol a in the alphabet Σ is a regular expression.
- Given the regular expressions r and s denoting the languages L(r) and L(s), then:
• (r) + (s) is a regular expression denoting L(r) ∪ L(s)
• (r)°(s) is a regular expression denoting L(r)°L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
• Precedence rules to avoid parentheses:
- the unary operator * has the highest precedence
- concatenation ° has the second highest precedence
- + (also “|”, U-union-, “or”) has the lowest precedence.
[Figure: transition diagram with states s0, s1, and s2. s0 loops on 0 and moves to s1 on 1; s1 loops on 1 and 0 and moves to s2 on 0; s2 loops on 0 and is the accepting (double-circled) state.]
this figure (which corresponds to the transition table above) is a non-deterministic finite automaton, NFA. The big
circles represent states and the double circles represent accepting or final states. The state with an unlabeled arrow
coming from its left is the starting state. (A brief list of the differences between a Deterministic Finite Automaton,
DFA, and a Non-deterministic Finite Automaton, NFA, given below.)
NFA vs. DFA
Example
The FAs below are equivalent and they recognize the same regular language. A regular expression (there may be
many regular expressions) corresponding to the regular language can be easily captured from the NFA
construction:
0*1Σ*00*
that is, the regular language includes all strings of 0’s and 1’s containing at least one 1 that is eventually
followed by a 0, regardless of how many characters appear between them, before them, or after them.
[Figure: the NFA above (states s0, s1, s2) and its equivalent DFA (states A, B, C): A loops on 0 and goes to B on 1; B loops on 1 and goes to C on 0; C loops on 0 and goes back to B on 1.]
ALGORITHM TO SIMULATE A DFA
Algorithm 1
Input: A string x ended by an EOF character and a DFA defined as a 5-tuple with s0 as its initial state and F as its
set of accepting states.
Output: A “YES” if the DFA accepts x or “NO” otherwise.
Method:
Apply the pseudo-code algorithm below to the input string x. The function move(s, c) gives the state to which
there is a transition from state s on input character c. The function nextchar returns the next character of the input
string x.
s = s0;
c = nextchar;
while c != EOF
    s = move(s, c);
    c = nextchar;
end
if s is in F then
    return “YES”
else
    return “NO”;
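A runnable version of Algorithm 1 can be sketched in Python (the sample DFA and its state names are assumptions chosen for illustration; move is represented as a transition table):

```python
# Mirrors the pseudo code above: start in s0, follow move(s, c) for each
# input character, and accept iff the final state is in F.

def simulate_dfa(x, move, s0, F):
    s = s0
    for c in x:                 # c = nextchar, until the input is exhausted
        s = move.get((s, c))    # s = move(s, c)
        if s is None:           # no transition defined: reject
            return "NO"
    return "YES" if s in F else "NO"

# Example DFA accepting binary strings that end in 0.
move = {("s0", "0"): "s1", ("s0", "1"): "s0",
        ("s1", "0"): "s1", ("s1", "1"): "s0"}
print(simulate_dfa("0110", move, "s0", {"s1"}))   # YES
print(simulate_dfa("01",   move, "s0", {"s1"}))   # NO
```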
Example
[Figure: an NFA with start state q0 and states q1 through q4, whose transition table is:]
      |    0       |    1
  q0  | {q0, q3}   | {q0, q1}
  q1  |    Ø       | {q2}
  q2  | {q2}       | {q2}
  q3  | {q4}       |    Ø
  q4  | {q4}       | {q4}
We show next the proliferation of states of the NFA under the input string 01001. Starting in {q0}, the set of
possible states evolves as:
{q0} –0→ {q0, q3} –1→ {q0, q1} –0→ {q0, q3} –0→ {q0, q3, q4} –1→ {q0, q1, q4}
DEFINITIONS RELATED TO THE e-closure ALGORITHM
Definition
Given a state s of some NFA N, the e-closure(s) is the set of states in N reachable from s on e-transitions
only.
Definition
Given a set of states T of some NFA N, the e-closure(T) is the set of states in N reachable from some state
s in T on e-transitions alone.
Example
X = e-closure({0}) = {0, 1, 2, 4, 7}
Y = e-closure({2}) = {2}
Z = e-closure({6}) = {1, 2, 4, 6, 7}
T = e-closure({1, 2, 5}) = e-closure({1}) ∪ e-closure({2}) ∪ e-closure({5}) = {1, 2, 4, 5, 6, 7}
[Figure: an NFA with e-transitions and states 0 through 7: e-moves 0→1, 0→7, 1→2, 1→4, 3→6, 5→6, 6→1, and 6→7, plus the labeled moves 2 –a→ 3 and 4 –b→ 5.]
THE e-closure ALGORITHM
The computation of e-closure(T) is a typical process of searching a graph for nodes reachable from a given
set T of nodes by following the e-labeled edges of the NFA.
The algorithm used to compute e-closure(T) uses a stack to hold the states whose edges have not yet
been checked for e-labeled transitions.
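A Python sketch of this stack-based computation follows (the e-edge set is an assumption reconstructed from the closure values in the example above):

```python
# Stack-based e-closure: push the states of T, then repeatedly pop a
# state and follow its unchecked e-labeled edges, adding new states.

def e_closure(T, e_edges):
    """States reachable from any state in T on e-transitions alone."""
    closure = set(T)
    stack = list(T)                      # states not yet checked
    while stack:
        s = stack.pop()
        for t in e_edges.get(s, ()):     # follow each e-labeled edge
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

# e-edges matching the example NFA above (states 0 through 7).
e_edges = {0: [1, 7], 1: [2, 4], 3: [6], 5: [6], 6: [1, 7]}

print(sorted(e_closure({0}, e_edges)))   # [0, 1, 2, 4, 7]  (X above)
print(sorted(e_closure({6}, e_edges)))   # [1, 2, 4, 6, 7]  (Z above)
```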
ALGORITHM TO BUILD THE DFA FROM AN NFA
Potentially, the total number of states of the DFA is the size of the power set of the states in the NFA (2^N
states for an NFA with N states).
Input: An NFA N with starting state s0, accepting some regular language.
Output: A DFA D accepting the same language as N.
Method:
The algorithm constructs a transition table DTran for D. We call Dstates the set of states in D. We will use the
definitions given for the e-closure algorithm plus:
move(T, a) = set of states to which there is a transition on input symbol a in Σ from some NFA state s in T.
Consider an NFA N and a DFA D accepting the same regular language. [Equivalent to Algorithm 2].
1) Initially the list of states in D is empty. Create the starting state as e-closure(s0), named after initial state {s0} in N.
That is, new start state: state(1) = e-closure (s0).
2) While (there is an uncompleted row in the transition table for D) do:
a) Let x = {s1, s2, ..., sk} be the state for this row.
b) For each input symbol a do:
i) Find e-closure(move({s1, s2, ..., sk}, a)) = e-closure(N({s1}, a) U N({s2}, a) U ... U N({sk}, a)) = some set we'll call T.
ii) Create the D-state y = T, corresponding to T.
iii) If y is not already in the list of D-states, add it to the list. (This results in a new row in the table)
iv) Add the rule D(x , a) = y to the list of transition rules for D.
3) Identify the accepting states in D. (They must include the accepting states in N.) That is, make state(j) a final
state iff at least one state in state(j) is a final state in the NFA.
[Figure: the earlier NFA with states renamed 0, 1, 2. State 0 loops on 0 and moves to state 1 on 1; state 1 loops on 1 and 0 and moves to state 2 on 0; state 2 (accepting) loops on 0.]
There are in total 2^3 = 8 states for the corresponding DFA. This is the power set of the set S = {0, 1, 2},
represented by 2^|S|. That is, the 8 states of the DFA are:
2^|S| = {Ø, {0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}}
[Figure: transition diagram of the 8-state DFA over the subsets {0}, {1}, {2}, Ø, {0, 1}, {0, 2}, {1, 2}, and {0, 1, 2}.]
Trivially, the states {2} and Ø, plus the states {0, 2} and {0, 1}, which have no input transitions, can be
eliminated. After the first elimination cycle is complete, {0, 1, 2} has no input to it and can be eliminated. Only
A = {0}, B = {1}, and C = {1, 2} remain.
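The subset construction can be sketched in Python; run on this three-state NFA, it generates only the reachable subsets, arriving directly at A = {0}, B = {1}, and C = {1, 2}. (The encoding of the NFA and the use of frozensets are implementation choices, not part of the text.)

```python
# nfa maps (state, symbol) -> set of next states; the key symbol None
# marks e-transitions (this example NFA has none).

def e_closure(T, nfa):
    closure, stack = set(T), list(T)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, None), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(T, a, nfa):
    return {t for s in T for t in nfa.get((s, a), ())}

def subset_construction(nfa, s0, alphabet):
    start = e_closure({s0}, nfa)
    dstates, todo, dtran = {start}, [start], {}
    while todo:                              # uncompleted rows
        x = todo.pop()
        for a in alphabet:
            y = e_closure(move(x, a, nfa), nfa)
            dtran[(x, a)] = y
            if y not in dstates:             # a new row in the table
                dstates.add(y)
                todo.append(y)
    return start, dstates, dtran

# The three-state NFA: 0 loops on 0 and goes to 1 on 1; 1 loops on 0 and 1
# and also goes to 2 on 0; 2 loops on 0.
nfa = {(0, "0"): {0}, (0, "1"): {1},
       (1, "0"): {1, 2}, (1, "1"): {1},
       (2, "0"): {2}}
start, dstates, dtran = subset_construction(nfa, 0, "01")
print(sorted(sorted(s) for s in dstates))    # [[0], [1], [1, 2]]
```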
Example
The NFA below represents the regular expression letter (letter | digit)*. Find its corresponding DFA. (The
theoretical number of states in this case is 2^10 = 1,024.)
[Figure: an NFA with states 1 through 10; labeled moves 1 –letter→ 2, 5 –letter→ 6, and 7 –digit→ 8, plus e-moves 2→3, 3→4, 3→10, 4→5, 4→7, 6→9, 8→9, 9→4, and 9→10. State 1 is the start state and state 10 is the accepting state.]
Solution:
A = e-closure ({1}) = {1}
move(A, letter) = {2} (function move defined in Alg. 2)
move(A, digit) = Ø
B = e-closure({2}) = {2, 3, 4, 5, 7, 10}
move(B, letter) = {6}
move(B, digit) = {8}
C = e-closure({6}) = {6, 9, 10, 4, 5, 7} = {4, 5, 6, 7, 9, 10}
move(C, letter) = {6}
move(C, digit) = {8}
D = e-closure({8}) = {8, 9, 10, 4, 5, 7} = {4, 5, 7, 8, 9, 10}
move(D, letter) = {6}
move(D, digit) = {8}
State A is the start state of the DFA and all states that include state 10 of the NFA are accepting states of the DFA.
The transition diagram of the DFA is given below.
Saint Michael College
Cantilan, Surigao del Sur