PLDI Week 03 Irs

CS4212: Compiler Design
Week 3: Compiling Function Calls to x86;

Intermediate Representations
Ilya Sergey
[email protected]
ilyasergey.net/CS4212/
From the Last week:
Comparisons and Conditioning
2
X86lite State: Flags & Condition Codes
• X86 instructions set flags as a side effect
• X86lite has only 3 flags:
– OF: “overflow” set when the result is too big/small to fit in 64-bit reg.
– SF: “sign” set to the sign or the result (0=positive, 1 = negative)
– ZF: “zero” set when the result is 0
• From these flags, we can define Condition Codes

– You can think of Cond. Codes as of additional registers,
whose value changes depending on the current flags
• E.g., cmpq SRC1, SRC2 computes SRC1 – SRC2 to set the flags
• Now we can check conditional codes:
– eq equality holds when ZF is set
– neq inequality holds when (not ZF)
– lt less than holds when SF <> OF
• Equivalently: ((SF && not OF) || (not SF && OF))
– ge greater or equal holds when (not lt) holds, i.e. (SF = OF)
– le than or equal holds when SF <> OF or ZF
– gt greater than holds when (not le) holds,
• i.e. (SF = OF) && not(ZF)
3
Conditional Instructions
• cmpq SRC1, SRC2 Compute SRC2 – SRC1, set condition flags
• setb CC DEST DEST’s lower byte ← if CC then 1 else 0
• jCC SRC rip ← if CC then SRC else do nothing
• Example:
cmpq %rcx, %rax Compare rax to ecx

jeq __truelbl If rax = rcx then jump to __truelbl
4
Code Blocks & Labels
• X86 assembly code is organized into labeled blocks:
label1:
<instruction>
<instruction>
…
<instruction>
label2:
<instruction>
<instruction>
…
<instruction>
Labels indicate code locations that can be jump targets (either through conditional branch instructions or function calls).
• Labels are translated away by the linker and loader – instructions live in the heap in the “code segment”
• An X86 program begins executing at a designated code label (usually “main”).
5
Basic Control Flow
6
Jumps, Calls, and Return
• jmp SRC rip ← SRC Jump to location in SRC
• callq SRC Push rip; rip ← SRC

– Call a procedure: Push the program counter to the stack (decrementing rsp)
and then jump to the machine instruction at the address given by SRC.
• retq Pop into rip

– Return from a procedure: Pop the current top of the stack into rip
(incrementing rsp).
– This instruction effectively jumps to the address at the top of the stack
7
Loop-based Factorial in Assembly
.globl _program
_program:
int program() {
movq $1, %rax
int acc = 1;
movq $6, %rdi
int n = 6;
loop:
while (0 < n) {
cmpq $0, %rdi
acc = acc * n;
je exit
n = n - 1;
imulq %rdi, %rax
}
decq %rdi
return acc;
jmp loop
}
exit:
retq
8
Demo: Hand-Coded x86Lite
• https://github.com/cs4212/week-02-x86lite
• Basic definitions: x86.ml
• Linking with assembly: test.c
• Example program, simple output, factorial
9
Compiling, Linking, Running
• To use hand-coded X86:
1. Compile OCaml program main1.ml to the executable
by running make
2. Run it, redirecting the output to some .s file, e.g.:

./main1.native >> prog.s
3. Use clang (or gcc) to compile & link with test.c:

clang -o test test.c prog.s
One M1/M2 (Apple Silicon) Mac, use the following flags:

clang -arch x86_64 -o test prog.s test.c
4. You should be able to run the resulting executable:

./test
10
Implementing Functions &
C Calling Conventions
11
X86 Schematic
Processor Memory
Code & Data

RIP
Instruction
Decoder
Larger Addresses
Heap
OF
ALU
Control SF ALU
ZF
Flags
rax rbx rcx rdx
rsi rdi rbp rsp
r08 r09 r10 r11 Stack
r12 r13 r14 r15
Registers
3 parts of the C memory model
Memory
• The code & data (or "text") segment Code & Data
– contains compiled code, constant strings, etc.
• The Heap
– Stores dynamically allocated objects
– Allocated via "malloc"
Larger Addresses
Heap
– Deallocated via "free"
– C runtime system
• The Stack
– Stores local variables
– Stores the return address of a function
Stack
• In practice, most languages use this model.
13
Local/Temporary Variable Storage
• Need space to store:
– Global variables
– Values passed as arguments to procedures
– Local variables (either defined in the source program or introduced by the compiler)
• Processors provide two options

– Registers: fast, small size (64 bits), very limited number (e.g., only 16 in x86Lite)
– Memory: slow, very large amount of space (2GB or more)
• caching important
• In practice on X86:
– Registers are limited (and have restrictions)
– Divide memory into regions including the stack and the heap
14
Calling Conventions
• Specify the locations (e.g. register or stack) of arguments passed to a function and returned by the function
int64_t g(int64_t a, int64_t b) {

return a + b;
f is a caller
}
int64_t f(int64_t x) {
int64_t ans = g(3,4) + x;
return ans;
}
g is a callee
• Designate registers either:
– Caller Save – e.g., freely usable by the called code
– Callee Save – e.g., must be restored by the called code
• Define the protocol for deallocating stack-allocated arguments
– Caller cleans up
– Callee cleans up (makes variable number arguments harder — the callee doesn’t know how many are those)
15
x64 Calling Conventions: Caller Protocol
f: …
… # set up arguments
%rip f …
callq g
lret: …
…
Machine state when
executing in function f.
larger memory addresses

“empty”
%rdi arg1 stack
%rsi arg2 space
%rdx arg3
%rcx arg4
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
16
f: …
%rip …
callq g
lret: …
Calling conventions:
…
args 1…6 in registers
as shown below.

%rdi arg1
%rsi arg2
%rdx arg3
%rcx arg4
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
17
f: …
%rip …
callq g
lret: …
args > 6 pushed onto
…
the stack (from right to left)
Note: %rsp changes

%rdi arg1
%rsi arg2
%rdx arg3 arg7

%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
18
Call Instruction
f: …
%rip …
callq g
lret: …
To execute the call:
…
1. push the return
address

(here shown as lret)
%rdi arg1
%rsi arg2 lret

%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
19
Call Instruction
g: pushq %rbp
movq %rsp, %rbp
%rip subq $128, %rsp
…
To execute the call:
…
2. set rip to address g

%rdi arg1
%rsi arg2 lret

%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
20
Callee Function Prologue
g: pushq %rbp
movq %rsp, %rbp
…
Callee protocol:
…
1. store the old %rbp
g’s
frame

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
21
g: pushq %rbp
movq %rsp, %rbp
…
Callee protocol:
…
2. adjust the %rbp to
point to the new “base” g’s
(%rbp is the “base pointer”) frame

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
22
g: pushq %rbp
movq %rsp, %rbp
…
Callee protocol:
…
3. allocate 128 bytes of
“scratch” stack space g’s
frame

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
23
Callee Invariants: Function Arguments
g: pushq %rbp
movq %rsp, %rbp
…
Now g’s body can run...
• its arguments are accessible
…
either in registers or as
g’s
offsets from %rbp
frame

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 16(%rbp) arg7
%rcx arg4 24(%rbp) arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
24
Callee Invariants: Callee Same Registers
g: pushq %rbp
movq %rsp, %rbp
…
…
g’s
frame

Callee save registers:
%rdi arg1 %rbp, %rbx, %r12-%r15
old %rbp
%rsi arg2 Their values
lret must be the same
%rdx arg3 when g returns
arg7 as when g was
%rcx arg4 called. (Ifarg8
g uses these registers,
%r8 arg5 it should save their values on the
%r9 arg6 stack and then restore them.)
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
25
Callee Epilogue (Return Protocol)
g: …
movq ANS, %rax
addq $128, %rsp
%rip
popq %rbp
retq
Step 1: Move the result
…
(if any) into %rax.
g’s
%rax ANS frame

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
26
g: …
movq ANS, %rax
addq $128, %rsp
%rip
popq %rbp
Step 2: deallocate the retq
scratch space
…
g’s
%rax ANS frame

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
27
g: …
movq ANS, %rax
addq $128, %rsp
%rip
popq %rbp
Step 3: restore the caller’s retq
%rbp
…
g’s
%rax ANS frame

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
28
g: …
movq ANS, %rax
addq $128, %rsp
%rip
popq %rbp
Step 4: the return instruction retq
pops the stack into %rip
…
g’s
%rax ANS frame

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
29
f: …
…
%rip
callq g
lret: …
Step 4: the return instruction
pops the stack into %rip
…
%rax ANS

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
30
Back in f
f: …
…
%rip
callq g
lret: …
At this point, f has the result of
…
g in %rax. It should clean up its
stack as needed.
%rax ANS

%rdi arg1 old %rbp
%rsi arg2 lret
%rdx arg3 arg7
%rcx arg4 arg8
%r8 arg5
%r9 arg6
%rsp f ’s
%rbp frame
registers old %rbp
(not all of them)
31
X86-64 SYSTEM v AMD 64 ABI
• More modern variant of C calling conventions
– used on Linux, Solaris, BSD, OS X
• Callee save: %rbp, %rbx, %r12-%r15

• Caller save: all others
• Parameters 1 .. 6 go in: %rdi, %rsi, %rdx, %rcx, %r8, %r9

• Parameters 7+ go on the stack (in right-to-left order)
– so: for n > 6, the nth argument is located at (((n-7)+2)*8)(%rbp)
– e.g.: argument 7 is at 16(%rbp) and argument 8 is at 24(%rbp)
• Return value: in %rax
• 128 byte "red zone" – scratch pad for the callee's data
– typical of C compilers, not required
– can be optimised away
32
Announcements
• HW2: X86lite
• Due: Sunday, September 11 at 23:59
• Pair Programming:
• Use GitHub Classroom link to create a new team for the project or join an existing one
• Choose a funny group name!
• Submission by any group member done on Canvas counts for the group
Demo: Directly Compiling Expressions to X86lite
• https://github.com/cs4212/week-02-x86lite
• Definition of compilation: compile.ml
• Example programs: main2.ml
• Linking with assembly: calculator.c
34
Directly Translating AST to Assembly
• For simple languages, no need for intermediate representation.

– e.g. the arithmetic expression language from
• Main Idea: Maintain invariants

– e.g. Code emitted for a given expression always computes the answer into %rax
• Key Challenges:
– storing intermediate values needed to compute complex expressions
– some instructions use specific registers (e.g. shift)
One Simple Strategy
• Compilation is the process of “emitting” instructions into an instruction stream.

• To compile an expression, we recursively compile sub expressions and then process the results.
• Invariants:
– Compilation of an expression yields its result in %rax
– Argument (Xi) is stored in a dedicated operand register
– Intermediate values are pushed onto the stack
– Stack slot is popped after use (so the space is reclaimed)
• Resulting code is wrapped (e.g., with retq) to comply with cdecl calling conventions
• Alternative strategy: see the compile2 in compile.ml

Why do something else?
• We have seen a simple syntax-directed translation
– Input syntax uniquely determines the output, no complex analysis or code transformation is done.
– It works fine for simple languages.
But…
• The resulting code quality is poor.
• Richer source language features are hard to encode
– Structured data types, objects, first-class functions, etc.
• It’s hard to optimize the resulting assembly code.
– The representation is too concrete – e.g. it has committed to using certain registers and the stack
– Only a fixed number of registers
– Some instructions have restrictions on where the operands are located
• Control-flow is not structured:
– Arbitrary jumps from one code block to another
– Implicit fall-through makes sequences of code non-modular
(i.e. you can’t rearrange sequences of code easily)
• Retargeting the compiler to a new architecture is hard.
– Target assembly code is hard-wired into the translation
Intermediate Representations (IR’s)
• Abstract machine code: hides details of the target architecture
• Allows machine independent code generation and optimization.
x86
Java
AST IR Byte-
code
Optimization Arm
Multiple IR’s
• Goal: get program closer to machine code without losing the
information needed to do analysis and optimizations
• In practice, multiple intermediate representations

might be used (for different purposes)
x86
Java
AST HIR MIR Byte-
code
Optimization Optimization Arm
Optimizations
What makes a good IR?
• Easy translation target (from the level above)
• Easy to translate (to the level below)
• Narrow interface
– Fewer constructs means simpler phases/optimizations
• Example: Source language might have “while”, “for”, and “foreach” loops
(and maybe more variants)
– IR might have only “while” loops and sequencing
– Translation eliminates “for” and “foreach”
⟦for(pre; cond; post) {body}⟧

=
⟦pre; while(cond) {body;post}⟧
– Here the notation ⟦cmd⟧ denotes the “translation” or “compilation” of the command cmd.
IR’s at the extreme
• High-level IR’s
– Abstract syntax + new node types not generated by the parser
• e.g. Type checking information or disambiguated syntax nodes
– Typically preserves the high-level language constructs
• Structured control flow, variable names, methods, functions, etc.
• May do some simplification (e.g. convert for to while)
– Allows high-level optimizations based on program structure
• e.g. inlining “small” functions, reuse of constants, etc.
– Useful for semantic analyses like type checking
• Low-level IR’s
– Machine dependent assembly code + extra pseudo-instructions
• e.g. a pseudo instruction for interfacing with garbage collector or memory allocator (parts of the language runtime
system)
• e.g. (on x86) a imulq instruction that doesn’t restrict register usage
– Source structure of the program is lost:
• Translation to assembly code is straightforward
– Allows low-level optimizations based on target architecture
• e.g. register allocation, instruction selection, memory layout, etc.
• What’s in between?
Mid-level IR’s: Many Varieties
• Intermediate between AST (abstract syntax) and assembly

• May have unstructured jumps, abstract registers, or memory locations
• Convenient for translation to high-quality machine code
– Example: all intermediate values are named to facilitate optimizations that attempt to minimize stack/register usage
• Many examples:
– Triples: OP a b
• Useful for instruction selection on X86 via “graph tiling” (a way to better utilise registers)
– Quadruples: a = b OP c (RISC-like “three address form”)
– SSA: variant of quadruples where each variable is assigned exactly once
• Easy dataflow analysis for optimization
• e.g. LLVM: industrial-strength IR, based on SSA
– Stack-based:
• Easy to generate
• e.g. Java Bytecode, UCODE
Growing an IR
• Develop an IR in detail… starting from the very basic.
• Start: a (very) simple intermediate representation for the arithmetic language

– Very high level
– No control flow
• Goal: A simple subset of the LLVM IR

– LLVM = “Low-level Virtual Machine”
– Used in HW3+
• Add features needed to compile rich source languages

Simple let-based IR
Eliminating Nested Expressions
• Fundamental problem:
– Compiling complex & nested expression forms to simple operations.
Source ((1 + X4) + (3 + (X1 * 5)))
Add(Add(Const 1, Var X4),

AST Add(Const 3, Mul(Var X1,
Const 5)))
IR ?
• Idea: name intermediate values, make order of evaluation explicit.
– No nested operations.
Translation to SLL
Add(Add(Const 1, Var X4),
• Given this: Add(Const 3, Mul(Var X1,
Const 5)))
• Translate to this desired SLL form:

let tmp0 = add 1L varX4 in
let tmp1 = mul varX1 5L in
let tmp2 = add 3L tmp1 in
let tmp3 = add tmp0 tmp2 in
tmp3
• Translation makes the order of evaluation explicit.

• Names intermediate values
• Note: introduced temporaries are never modified
Demo
• https://github.com/cs4212/week-03-intermediate-2023
• Using IRs: ir_by_hand.ml
• Definitions: ir<X>.ml
• IR1: Expressions
– simple arithmetic expressions, immutable global variables
• IR2: Commands
– global mutable variables
– commands for update and sequencing
• IR3: Local control flow

– conditional commands & while loops
– basic blocks
• IR4: Procedures (top-level functions)

– local state
– call stack
• IR5: ”almost” LLVM IR

IR3: Basic Blocks
• A sequence of instructions that is always executed starting at the first

instruction and always exits at the last instruction.
– Starts with a label that names the entry point of the basic block.
– Ends with a control-flow instruction (e.g. branch or return) the “link”
– Contains no other control-flow instructions
– Contains no interior label used as a jump target
• Basic blocks can be arranged into a control-flow graph

– Nodes are basic blocks
– There is a directed edge from node A to node B if the control flow instruction at the
end of basic block A might jump to the label of basic block B.
Next Lecture
• LLVM

PLDI Week 03 Irs

Uploaded by

Copyright:

Available Formats

PLDI Week 03 Irs

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

PLDI Week 03 Irs

Uploaded by

Copyright:

Available Formats

CS4212: Compiler Design

Week 3: Compiling Function Calls to x86;

• From these flags, we can define Condition Codes

• cmpq SRC1, SRC2 Compute SRC2 – SRC1, set condition flags

• setb CC DEST DEST’s lower byte ← if CC then 1 else 0

• jCC SRC rip ← if CC then SRC else do nothing

cmpq %rcx, %rax Compare rax to ecx

• X86 assembly code is organized into labeled blocks:

• An X86 program begins executing at a designated code label (usually “main”).

• jmp SRC rip ← SRC Jump to location in SRC

• callq SRC Push rip; rip ← SRC

• retq Pop into rip

• Basic definitions: x86.ml

• Linking with assembly: test.c

• Example program, simple output, factorial

2. Run it, redirecting the output to some .s file, e.g.:

3. Use clang (or gcc) to compile & link with test.c:

One M1/M2 (Apple Silicon) Mac, use the following flags:

4. You should be able to run the resulting executable:

Code & Data

• Processors provide two options

int64_t g(int64_t a, int64_t b) {

larger memory addresses

larger memory addresses

larger memory addresses

%rdx arg3 arg7

larger memory addresses

%rsi arg2 lret

larger memory addresses

%rsi arg2 lret

larger memory addresses

larger memory addresses

larger memory addresses

larger memory addresses

larger memory addresses

larger memory addresses

larger memory addresses

larger memory addresses

pops the stack into %rip

larger memory addresses

larger memory addresses

larger memory addresses

• Callee save: %rbp, %rbx, %r12-%r15

• Parameters 1 .. 6 go in: %rdi, %rsi, %rdx, %rcx, %r8, %r9

• Return value: in %rax

• Definition of compilation: compile.ml

• Example programs: main2.ml

• Linking with assembly: calculator.c

• For simple languages, no need for intermediate representation.

• Main Idea: Maintain invariants

• Compilation is the process of “emitting” instructions into an instruction stream.

• Alternative strategy: see the compile2 in compile.ml

• Allows machine independent code generation and optimization.

• In practice, multiple intermediate representations

Optimization Optimization Arm

⟦for(pre; cond; post) {body}⟧

• Intermediate between AST (abstract syntax) and assembly

• Develop an IR in detail… starting from the very basic.

• Start: a (very) simple intermediate representation for the arithmetic language