24 Steps of Compiler Design
24 Steps of Compiler Design
24 Steps of Compiler Design
Abdulaziz Ghuloum
Department of Computer Science, Indiana University, Bloomington, IN 47408
[email protected]
Abstract too simplistic and do not prepare the novice compiler writer to
Compilers are perceived to be magical artifacts, carefully crafted construct a useful compiler. The source language of these compilers
by the wizards, and unfathomable by the mere mortals. Books on often lacks depth and the target machine is often fictitious. Niklaus
compilers are better described as wizard-talk: written by and for Wirth states that “to keep both the resulting compiler reasonably
a clique of all-knowing practitioners. Real-life compilers are too simple and the development clear of details that are of relevance
complex to serve as an educational tool. And the gap between only for a specific machine and its idiosyncrasies, we postulate
real-life compilers and the educational toy compilers is too wide. an architecture according to our own choice”[20]. On the other
The novice compiler writer stands puzzled facing an impenetrable extreme, advanced books focus mainly on optimization techniques,
barrier, “better write an interpreter instead.” and thus target people who are already well-versed in the topic.
The goal of this paper is to break that barrier. We show that There is no gradual progress into the field of compiler writing.
building a compiler can be as easy as building an interpreter. The The usual approach to introducing compilers is by describing
compiler we construct accepts a large subset of the Scheme pro- the structure and organization of a finalized and polished compiler.
gramming language and produces assembly code for the Intel-x86 The sequencing of the material as presented in these books mirrors
architecture, the dominant architecture of personal computing. The the passes of the compilers. Many of the issues that a compiler
development of the compiler is broken into many small incremen- writer has to be aware of are solved beforehand and only the final
tal steps. Every step yields a fully working compiler for a progres- solution is presented. The reader is not engaged in the process of
sively expanding subset of Scheme. Every compiler step produces developing the compiler.
real assembly code that can be assembled then executed directly In these books, the sequential presentation of compiler imple-
by the hardware. We assume that the reader is familiar with the mentation leads to loss of focus on the big picture. Too much focus
basic computer architecture: its components and execution model. is placed on the individual passes of the compiler; thus the reader
Detailed knowledge of the Intel-x86 architecture is not required. is not actively aware of the relevance of a single pass to the other
The development of the compiler is described in detail in an passes and where it fits in the whole picture. Andrew Appel states
extended tutorial. Supporting material for the tutorial such as an that “a student who implements all the phases described in Part I of
automated testing facility coupled with a comprehensive test suite the book will have a working compiler”[2]. Part I of Appel’s book
are provided with the tutorial. It is our hope that current and future concludes with a 6-page chapter on “Putting it all together” after
implementors of Scheme find in this paper the motivation for de- presenting 11 chapters on the different passes of Tiger.
veloping high-performance compilers and the means for achieving Moreover, practical topics such as code generation for a real
that goal. machine, interfacing to the operating system or to other languages,
heap allocation and garbage collection, and the issues surround-
Categories and Subject Descriptors D.3.4 [Processors]: Compil- ing dynamic languages are either omitted completely or placed
ers; K.3.2 [Computer and Information Science Education]: Com- in an appendix. Muchnick states that “most of the compiler ma-
puter science education terial in this book is devoted to languages that are well suited for
compilation: languages that have static, compile-time type systems,
Keywords Scheme, Compilers
that do not allow the user to incrementally change the code, and
that typically make much heavier use of stack storage than heap
1. Introduction storage”[13].
Compilers have traditionally been regarded as complex pieces of
software. The perception of complexity stems mainly from tradi-
tional methods of teaching compilers as well as the lack of available
2. Preliminary Issues
examples of small and functional compilers for real languages. To develop a compiler, there are a few decisions to be made.
Compiler books are polarized into two extremes. Some of them The source language, the implementation language, and the target
focus on “educational” toy compilers while the others focus on architecture must be selected. The development time frame must
“industrial-strength” optimizing compilers. The toy compilers are be set. The development methodology and the final goal must be
decided. For the purpose of our tutorial, we made the decisions
presented below.
27
We assume that the reader has basic knowledge of C and the C 2.6 Development Methodology
standard library (e.g. malloc, printf, etc.). Although our com- We advocate the following iterative development methodology:
piler will produce assembly code, some functionality is easier to
implement in C; implementing it directly as assembly routines dis- 1. Choose a small subset of the source language that we can
tracts the reader from the more important tasks. compile directly to assembly.
2. Write as many test cases as necessary to cover the chosen subset
2.2 The Source Language of the language.
In our tutorial, we choose a subset of Scheme as the source pro- 3. Write a compiler that accepts an expression (in the chosen sub-
gramming language. The simple and uniform syntax of Scheme set of the source language) and outputs the equivalent sequence
obviates the need for a lengthy discussion of scanners and parsers. of assembly instructions.
The execution model of Scheme, with strict call-by-value evalua-
tion, simplifies the implementation. Moreover, all of the Scheme 4. Ensure that the compiler is functional, i.e. it passes all the tests
primitives in the subset can be implemented in short sequences of that are written beforehand.
assembly instructions. Although not all of Scheme is implemented 5. Refactor the compiler, if necessary, making sure that none of
in the first compiler, all the major compiler-related issues are tack- the tests are broken due to incorrect refactoring.
led. The implementation is a middle-ground between a full Scheme
compiler and a toy compiler. 6. Enlarge the subset of the language in a very small step and re-
In choosing a specific source language, we gain the advantage peat the cycle by writing more tests and extending the compiler
that the presentation is more concrete and eliminates the burden to meet the newly-added requirements.
of making the connection from the abstract concepts to the actual
A fully working compiler for the given subset of the language
language.
is available at every step in the development cycle starting from
the first day of development. The test cases are written to help en-
2.3 The Implementation Language sure that the implementation meets the specifications and to guard
Scheme is chosen as the implementation language of the compiler. against bugs that may be introduced during the refactoring steps.
Scheme’s data structures are simple and most Scheme program- Knowledge of compilation techniques as well as the target machine
mers are familiar with the basic tasks such as constructing and pro- is built incrementally. The initial overhead of learning the assembly
cessing lists and trees. The ability to manipulate Scheme programs instructions of the target machine is eliminated—instructions are
as Scheme data structures greatly simplifies the first steps of con- introduced only when they are needed. The compiler starts small
structing a compiler, since the issue of reading the input program is and is well focused on translating the source language to assembly,
solved. Implementing a lexical-scanner and a parser are pushed to and every incremental step reinforces that focus.
the end of the tutorial.
Choosing Scheme as the implementation language also elimi- 2.7 Testing Infrastructure
nates the need for sophisticated and specialized tools. These tools The interface to the compiler is defined by one Scheme procedure,
add a considerable overhead to the initial learning process and dis- compile-program, that takes as input an s-expression represent-
tracts the reader from acquiring the essential concepts. ing a Scheme program. The output assembly is emitted using an
emit form that routes the output of the compiler to an assembly
2.4 Choosing The Target Architecture file.
We choose the Intel-x86 architecture as our target platform. The Defining the compiler as a Scheme procedure allows us to de-
velop and debug the compiler interactively by inspecting the output
x86 architecture is the dominant architecture on personal comput-
ers and thus is widely available. assembly code. It also allows us to utilize an automated testing fa-
Talking about compilers that are detached from a particular cility. There are two core components of the testing infrastructure:
the test-cases and the test-driver.
architecture puts the burden on the reader to make the connection
from the abstract ideas to the concrete machine. Novice compiler The test cases are made of sample programs and their expected
writers are unlikely to be able to derive the connection on their own. output. For example, the test cases for the primitive + may be
defined as follows:
Additionally, the compiler we develop is small enough to be easily
portable to other architectures, and the majority of the compiler (test-section "Simple Addition")
passes are platform independent. (test-case ’(+ 10 15) "25")
(test-case ’(+ -10 15) "5")
2.5 Development Time Frame ...
The development of the compiler must proceed in small steps
where every step can be implemented and tested in one sitting. Fea- The test-driver iterates over the test cases performing the follow-
tures that require many sittings to complete are broken down into ing actions: (1) The input expression is passed to compile-program
smaller steps. The result of completing every step is a fully working to produce assembly code. (2) The assembly code and a minimal
compiler. The compiler writer, therefore, achieves progress in every run-time system (to support printing) are assembled and linked to
step in the development. This is in contrast with the traditional de- form an executable. (3) The executable is run and the output is
velopment strategies that advocate developing the compiler as a se- compared to the expected output string. An error is signaled if any
ries of passes only the last of which gives the sense of accomplish- of the previous steps fails.
ment. With our approach of incremental development, where every
step results in a fully working compiler for some subset of Scheme, 2.8 The End Goal
the risk of not “completing” the compiler is minimized. This ap- For the purpose of this paper, we define the end goal to be writing
proach is useful for people learning about compilers on their own, a compiler powerful enough to compile an interactive evaluator.
where the amount of time they can dedicate constantly changes. It Building such a compiler forces us to solve many interesting prob-
is also useful in time-limited settings such as an academic semester. lems.
%esp-12
The primitives car and cdr are simple; we only need to re- local3
locals local2 %esp-8
member that the pointer to the pair is its address incremented by 1. %esp-4
local1
Consequently, the car and cdr fields are located at −1 and 3 from base return point %esp
movl 3(%eax), %eax # cdr (A) Caller’s View (B) Callee’s View
movl 3(%eax), %eax # cddr
movl -1(%eax), %eax # caddr
Figure 2. The view of the stack from (A) the Caller’s side before
Vectors and strings are different from pairs in that they vary in making the procedure call, and (B) the Callee’s side on entry to the
length. This has two implications: (1) we must reserve one extra procedure.
• For the labels form, a new set of unique labels are created and
free
the initial environment is constructed to map each of the lvars stack
to its corresponding label. space …
free
• For each code expression, the label is first emitted, followed arg4 %esp-28
by the code of the body. The environment used for the body outgoing
arguments
arg3 %esp-24
%esp-20
stack
space …
arg2
contains, in addition to the lvars, a mapping of each of the arg1 %esp-16 arg4 %esp-16
formal parameters to the first set of stack locations (−4, −8, local3 %esp-12
outgoing arg3 %esp-12
etc.). The stack index used for evaluating the body starts above locals local2 %esp-8 arguments arg2 %esp-8
traverses the list of symbols looking for one having the same string.
A new symbol is constructed if one with the same name does not
exist. This new symbol is then added to the list before it is returned.
Once string->symbol is implemented, adding symbols to our
high address high address
set of valid complex constants is straightforward by the following
transformation: (A) Parameter passing to Scheme (B) Parameter passing to C
(labels ((f0 (code () () ’foo)))
(let ((f (closure f0))) Figure 4. The parameters to Scheme functions are placed on the
(eq? (funcall f) (funcall f)))) stack above the return point while the parameters to C functions are
=> placed below the return point.
(labels ((f0 (code () () (constant-ref t1)))
(t1 (datum)))
(constant-init t1 3.16 Error Checking and Safe Primitives
(funcall (primitive-ref string->symbol) Using our newly acquired ability to write and exit, we can define
(string #\f #\o #\o))) a simple error procedure that takes two arguments: a symbol
(let ((f (closure f0))) (denoting the caller of error), and a string (describing the error).
(eq? (funcall f) (funcall f)))) The error procedure would write an error message to the console,
then causes the program to exit.
3.15 Foreign Functions With error, we can secure some parts of our implementation
Our Scheme implementation cannot exist in isolation. It needs to provide better debugging facilities. Better debugging allows us
a way of interacting with the host operating system in order to to progress with implementing the rest of the system more quickly
perform Input/Output and many other useful operations. We now since we won’t have to hunt for the causes of segfaults.
add a very simple way of calling to foreign C procedures. There are three main causes of fatal errors:
We add one additional form to our compiler:
1. Attempting to call non-procedures.
<Expr> ::= (foreign-call <string> <Expr> ...)
2. Passing an incorrect number of arguments to a procedure.
The foreign-call form takes a string literal as the first argu-
3. Calling primitives with invalid arguments. For example: per-
ment. The string denotes the name of the C procedure that we intend
forming (car 5) causes an immediate segfault. Worse, per-
to call. Each of the expressions are evaluated first and their values
forming vector-set! with an index that’s out of range causes
are passed as arguments to the C procedure. The calling convention
other parts of the system to get corrupted, resulting in hard-to-
for C differs from the calling convention that we have been using
debug errors.
for Scheme in that the arguments are placed below the return point
and in reverse order. Figure 4 illustrates the difference. Calling nonprocedures can be handled by performing a proce-
To accommodate the C calling conventions, we evaluate the dure check before making the procedure call. If the operator is not
arguments to a foreign-call in reverse order, saving the values a procedure, control is transferred to an error handler label that sets
on the stack, adjusting the value of %esp, issuing a call to the up a call to a procedure that reports the error and exits.
named procedure, then adjusting the stack pointer back to its initial Passing an incorrect number of arguments to a procedure can be
position. We need not worry about the C procedure clobbering the handled by a collaboration from the caller and the callee. The caller,
values of the allocation and closure pointer because the Application once it performs the procedure check, sets the value of the %eax
Binary Interface (ABI) guarantees that the callee would preserve register to be the number of arguments passed. The callee checks
the values of the %edi, %esi, %ebp and %esp registers[14]. that the value of %eax is consistent with the number of arguments
Since the values we pass to the foreign procedure are tagged, it expects. Invalid arguments cause a jump to a label that calls a
we would write wrapper procedures in our run-time file that take procedure that reports the error and exits.
care of converting from Scheme to C values and back. For primitive calls, we can modify the compiler to insert explicit
We first implement and test calling the exit procedure. Call- checks at every primitive call. For example, car translates to:
ing (foreign-call "exit" 0) should cause our program to
exit without performing any output. We also implement a wrapper movl %eax, %ebx
around write as follows: andl $7, %ebx
cmpl $1, %ebx
ptr s_write(ptr fd, ptr str, ptr len){ jne L_car_error
int bytes = write(unshift(fd), movl -1(%eax), %eax
string_data(str), ...
unshift(len)); L_car_error:
return shift(bytes); movl car_err_proc, %edi # load handler
} movl $0, %eax # set arg-count
jmp *-3(%edi) # call the handler
1 Symbols are similar to pairs in having two fields: a string and a value ...