David Chisnall
University of Cambridge
LLVM Summer School, Paris, June 12, 2017
Reusable IR
Tokeniser → Token Stream → Parser → Parser Actions → AST Builder → Intermediate Representation → Optimiser → Intermediate Representation → Code Generator → Executable Code
Structure of a Modern Compiler
Source Code → Tokeniser → Token Stream → Parser → Parser Actions → AST Builder → Intermediate Representation → Optimiser → Intermediate Representation → Code Generator → Executable Code

As with any other piece of software, using libraries simplifies development.
Optimisation Passes
Source language:

a = (b+c) * (b+c);

Register IR:

r1 = load b
r2 = load c
r3 = r1 + r2
r4 = load b
r5 = load c
r6 = r4 + r5
r7 = r3 * r6
store a r7
Common Subexpression Elimination: Register IR
Source language:

a = (b+c) * (b+c);

r1 = load b
r2 = load c
r3 = r1 + r2
r4 = load b
r5 = load c
r6 = r1 + r5   ; r4 is a repeated load of b, so reuse r1
r7 = r3 * r6
store a r7
Common Subexpression Elimination: Register IR
Source language:

a = (b+c) * (b+c);

r1 = load b
r2 = load c
r3 = r1 + r2
r4 = load b
r5 = load c
r6 = r1 + r2   ; r5 is a repeated load of c, so reuse r2
r7 = r3 * r6
store a r7
Common Subexpression Elimination: Register IR
Source language:

a = (b+c) * (b+c);

r1 = load b
r2 = load c
r3 = r1 + r2
r4 = load b
r5 = load c
r6 = r1 + r2
r7 = r3 * r3   ; r6 recomputes r3, so the multiply reuses r3
store a r7

The replaced instructions (r4, r5, r6) are now dead, and a later dead-code elimination pass can delete them.
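The steps above amount to local value numbering: remember which register first computed each (operation, operands) tuple and redirect later identical tuples to it. A minimal C++ sketch of the idea, operating on a toy three-address IR like the one above (all names here, such as Instr and runLocalCSE, are illustrative and not LLVM API):

#include <map>
#include <string>
#include <tuple>
#include <vector>

// One instruction of the toy register IR used on these slides.
struct Instr {
  std::string op;    // "load", "add", "mul", "store"
  std::string dst;   // destination register (empty for store)
  std::string a, b;  // operands (registers or variable names)
};

// Local CSE by value numbering. Redundant definitions are left in place as
// dead code, exactly as in the slides; a later DCE pass would delete them.
// Note: reusing the result of a load is only safe if no store to the same
// location can occur in between, which holds within this single expression.
void runLocalCSE(std::vector<Instr> &block) {
  // (op, lhs, rhs) -> register that already holds this value.
  std::map<std::tuple<std::string, std::string, std::string>, std::string> seen;
  // Register -> the earlier register that holds the same value.
  std::map<std::string, std::string> alias;

  auto resolve = [&](const std::string &r) {
    auto it = alias.find(r);
    return it == alias.end() ? r : it->second;
  };

  for (auto &ins : block) {
    // Rewrite operands to use the earliest copy of each value.
    ins.a = resolve(ins.a);
    ins.b = resolve(ins.b);
    if (ins.op == "store")
      continue;  // stores do not define a value
    auto key = std::make_tuple(ins.op, ins.a, ins.b);
    auto it = seen.find(key);
    if (it != seen.end())
      alias[ins.dst] = it->second;  // value already available: reuse it
    else
      seen[key] = ins.dst;          // first time we compute this value
  }
}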
Common Subexpression Elimination: Stack IR
Source language:

a = (b+c) * (b+c);

load b
load c
add
load b
load c
add
mul
store a
Common Subexpression Elimination: Stack IR
Source language:

a = (b+c) * (b+c);

load b
load c
add
dup    ; duplicate the b+c already on the stack instead of recomputing it
mul
store a
Problems with CSE and Stack IR
David Chisnall
University of Cambridge
LLVM Summer School, Paris, June 12, 2017
What Is LLVM IR?
int a = someFunction();
a++;

%a = call i32 @someFunction()
%a = add i32 %a, 1      ; invalid: an SSA register can only be assigned once

%a = call i32 @someFunction()
%a2 = add i32 %a, 1
• Front end must keep track of which register holds the current value of a at any point in the code
• How do we track the new values?
Translating to LLVM IR The Easy Way
; int a
%a = alloca i32, align 4
; a = someFunction()
%0 = call i32 @someFunction()
store i32 %0, i32* %a
; a++
%1 = load i32* %a
%2 = add i32 %1, 1
store i32 %2, i32* %a
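A front end can emit exactly this shape through LLVM's IRBuilder. A hedged sketch is below: the surrounding function F and the declared someFunction callee are assumptions, and IRBuilder signatures differ slightly between LLVM versions (the typed CreateLoad and FunctionCallee forms shown here follow recent releases):

#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Sketch: emit the alloca-based lowering of `int a = someFunction(); a++;`
// into an empty function F. SomeFn is a previously declared `i32 ()` callee.
void emitExample(Module &M, Function *F, FunctionCallee SomeFn) {
  LLVMContext &Ctx = M.getContext();
  IRBuilder<> B(BasicBlock::Create(Ctx, "entry", F));
  Type *I32 = Type::getInt32Ty(Ctx);

  // int a;  -> one stack slot per local variable
  Value *A = B.CreateAlloca(I32, nullptr, "a");
  // a = someFunction();
  B.CreateStore(B.CreateCall(SomeFn), A);
  // a++;  -> load the current value, add one, store it back
  Value *Old = B.CreateLoad(I32, A);
  Value *New = B.CreateAdd(Old, ConstantInt::get(I32, 1));
  B.CreateStore(New, A);
  B.CreateRet(New);  // assume F returns i32, just to terminate the block
}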
int b = 12;
if (a)
  b++;
return b;

entry:
  ; int b = 12
  %b = alloca i32
  store i32 12, i32* %b
  ; if (a)
  %0 = load i32* %a
  %cond = icmp ne i32 %0, 0
  br i1 %cond, label %then, label %end

then:
  ; b++
  %1 = load i32* %b
  %2 = add i32 %1, 1
  store i32 %2, i32* %b
  br label %end

end:
  ; return b
  %3 = load i32* %b
  ret i32 %3
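Constructing the control flow above follows the same pattern: create one BasicBlock per label, then wire them up with a conditional branch. A rough sketch, again hedged (the function name and the pre-existing %a slot are assumptions, and signatures follow recent LLVM):

#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Sketch: emit `int b = 12; if (a) b++; return b;` in the alloca style.
// ASlot is the existing stack slot holding `a`.
void emitIfExample(Function *F, Value *ASlot) {
  LLVMContext &Ctx = F->getContext();
  Type *I32 = Type::getInt32Ty(Ctx);
  BasicBlock *Entry = BasicBlock::Create(Ctx, "entry", F);
  BasicBlock *Then  = BasicBlock::Create(Ctx, "then", F);
  BasicBlock *End   = BasicBlock::Create(Ctx, "end", F);

  IRBuilder<> B(Entry);
  // int b = 12;
  Value *BSlot = B.CreateAlloca(I32, nullptr, "b");
  B.CreateStore(ConstantInt::get(I32, 12), BSlot);
  // if (a)
  Value *A = B.CreateLoad(I32, ASlot);
  Value *Cond = B.CreateICmpNE(A, ConstantInt::get(I32, 0), "cond");
  B.CreateCondBr(Cond, Then, End);

  // b++;
  B.SetInsertPoint(Then);
  Value *BV = B.CreateLoad(I32, BSlot);
  B.CreateStore(B.CreateAdd(BV, ConstantInt::get(I32, 1)), BSlot);
  B.CreateBr(End);

  // return b;
  B.SetInsertPoint(End);
  B.CreateRet(B.CreateLoad(I32, BSlot));
}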
In SSA Form...
entry:
  ; if (a)
  %cond = icmp ne i32 %a, 0
  br i1 %cond, label %then, label %end

then:
  ; b++
  %inc = add i32 12, 1
  br label %end

end:
  ; return b
  %b.0 = phi i32 [ %inc, %then ], [ 12, %entry ]
  ret i32 %b.0
(This is the output of the mem2reg pass, which promotes the allocas to SSA registers.)
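mem2reg is built on a utility that any pass can call directly; a sketch of promoting every suitable alloca in a function is below (hedged: it assumes a DominatorTree is already available, and the exact signature of PromoteMemToReg has shifted between LLVM versions):

#include <vector>

#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/PromoteMemToReg.h"

using namespace llvm;

// Sketch: promote the promotable allocas in F to SSA registers, i.e. what
// the mem2reg pass does for us.
void promoteAllocas(Function &F, DominatorTree &DT) {
  std::vector<AllocaInst *> Allocas;
  // Front ends conventionally place allocas in the entry block.
  for (Instruction &I : F.getEntryBlock())
    if (auto *AI = dyn_cast<AllocaInst>(&I))
      if (isAllocaPromotable(AI))  // only simple scalar slots qualify
        Allocas.push_back(AI);
  if (!Allocas.empty())
    PromoteMemToReg(Allocas, DT);  // inserts phis, rewrites loads and stores
}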
And After Constant Propagation...
entry:
  ; if (a)
  %cond = icmp ne i32 %a, 0
  br i1 %cond, label %then, label %end

end:
  ; b++
  ; return b
  %b.0 = phi i32 [ 13, %then ], [ 12, %entry ]
  ret i32 %b.0
And After CFG Simplification...
entry:
  %tobool = icmp ne i32 %a, 0
  %0 = select i1 %tobool, i32 13, i32 12
  ret i32 %0
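At the source level, the optimiser has effectively rewritten the whole function as the following (not shown on the slides, but equivalent to the select above):

return a ? 13 : 12;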
define i32 @get(i32 %i) {
entry:
  %arrayidx = getelementptr inbounds %struct.a* @a, i32 0, i32 1, i32 %i
  %0 = load i32* %arrayidx
  ret i32 %0
}

As x86 Assembly

get:
  movl 4(%esp), %eax       # load parameter
  movl a+4(,%eax,4), %eax  # GEP + load
  ret
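The IR above indexes into an array member of a global struct. A plausible C-level source is sketched below; the field names and the array bound are assumptions, inferred from the GEP indices and the +4 offset in the generated code:

struct a {
  int first;      // field 0, at offset 0
  int values[16]; // field 1, at offset 4: the array the GEP indexes into
};

struct a a;       // the global @a

int get(int i) {
  return a.values[i];
}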
As ARM Assembly
define i32 @get(i32 %i) {
entry:
  %arrayidx = getelementptr inbounds %struct.a* @a, i32 0, i32 1, i32 %i
  %0 = load i32* %arrayidx
  ret i32 %0
}

get:
  ldr r1, .LCPI0_0        // Load global address
  add r0, r1, r0, lsl #2  // GEP
  ldr r0, [r0, #4]        // load return value
  bx lr
.LCPI0_0:
  .long a
The Most Important LLVM Classes
ARC Optimisations:
• Part of LLVM
• Elide reference counting operations in Objective-C code when not required
• Makes heavy use of LLVM's flow control analysis

GNUstep Objective-C runtime optimisations:
• Distributed with the runtime
• Can be used by clang (Objective-C) or LanguageKit (Smalltalk)
• Cache method lookups, turn dynamic into static behaviour if safe
Writing A Simple Pass
void visitCallInst(CallInst &CI) {
  // Only interested in direct calls to the function we want to optimise.
  if (CI.getCalledValue() == exampleFn)
    // The argument must be (a cast of) a global variable...
    if (auto *arg = dyn_cast<GlobalVariable>(
            CI.getOperand(0)->stripPointerCasts()))
      // ...with a constant data initialiser...
      if (auto *init = dyn_cast<ConstantDataSequential>(
              arg->getInitializer()))
        // ...that is a string: remember the call site and the string.
        if (init->isString())
          sites.push_back({CI, init->getAsString()});
}
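The visitor above needs a pass around it. A plausible skeleton using InstVisitor and the legacy pass manager is sketched below; the class name, the registration string, and the way exampleFn is located are all assumptions rather than the actual GNUstep pass:

#include <utility>
#include <vector>

#include "llvm/ADT/StringRef.h"
#include "llvm/IR/InstVisitor.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
struct CacheLookupsPass : ModulePass, InstVisitor<CacheLookupsPass> {
  static char ID;
  // The function whose calls we want to cache, and the call sites found,
  // paired with the string argument each one passes.
  Value *exampleFn = nullptr;
  std::vector<std::pair<CallInst &, StringRef>> sites;

  CacheLookupsPass() : ModulePass(ID) {}

  void visitCallInst(CallInst &CI) {
    // ... the body shown above ...
  }

  bool runOnModule(Module &M) override {
    exampleFn = M.getFunction("example_lookup");  // hypothetical callee name
    if (!exampleFn)
      return false;
    visit(M);  // InstVisitor walks every instruction in the module
    // ... then insert the caches recorded in `sites` (next slides) ...
    return !sites.empty();
  }
};
} // namespace

char CacheLookupsPass::ID = 0;
static RegisterPass<CacheLookupsPass> X("cache-lookups",
                                        "Cache example lookups (sketch)");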
Creating the Cache
• Once we've found all of the replacement points, we can insert the caches.
• Don't do this during the search: mutating the collection while iterating over it invalidates the iteration...
StringMap<GlobalVariable*> statics;
for (auto &s : sites) {
  auto *lookup = &s.first;
  auto arg = s.second;
  // One cache variable per distinct string argument.
  GlobalVariable *cache = statics[arg];
  if (!cache) {
    // A private global, initialised to null, holding the cached result.
    cache = new GlobalVariable(M, retTy, false,
                               GlobalVariable::PrivateLinkage,
                               Constant::getNullValue(retTy),
                               "._cache");
    statics[arg] = cache;
  }
Restructuring the CFG
auto *preLookupBB = lookup->getParent();
// Split before the lookup call; everything before it stays in preLookupBB.
auto *lookupBB = preLookupBB->splitBasicBlock(lookup);
BasicBlock::iterator iter(lookup);
// Split again after the call, leaving lookupBB containing just the lookup.
auto *afterLookupBB = lookupBB->splitBasicBlock(++iter);
// splitBasicBlock inserts unconditional branches; remove them so that we
// can insert our own control flow.
preLookupBB->getTerminator()->eraseFromParent();
lookupBB->getTerminator()->eraseFromParent();
// The cached or freshly looked-up value arrives through this phi.
auto *phi = PHINode::Create(retTy, 2, "cache",
                            &*afterLookupBB->begin());
lookup->replaceAllUsesWith(phi);
Adding the Test
IRBuilder<> B(preLookupBB);
// Load the cached value and check whether it has been filled in yet.
llvm::Value *cachedClass =
    B.CreateBitCast(B.CreateLoad(cache), retTy);
llvm::Value *needsLookup = B.CreateIsNull(cachedClass);
// Null cache: fall into the real lookup; otherwise skip straight past it.
B.CreateCondBr(needsLookup, lookupBB, afterLookupBB);
// After the real lookup, store the result into the cache.
B.SetInsertPoint(lookupBB);
B.CreateStore(lookup, cache);
B.CreateBr(afterLookupBB);
// The phi picks whichever value we arrived with.
phi->addIncoming(cachedClass, preLookupBB);
phi->addIncoming(lookup, lookupBB);
A Simple Test
int example(char *foo) {
  printf("example(%s)\n", foo);
  int i = 0;
  while (*foo)
    i += *(foo++);
  return i;
}

int main(void) {
  int a = example("a contrived example");
  a += example("a contrived example");
  a += example("a contrived example");
  a += example("a contrived example");
  a += example("a contrived example");
  return a;
}
Running the Test
[SelectionDAG diagram for the call to @printf and the function return: callseq_start/callseq_end, CopyToReg/CopyFromReg through %RDI, %AL and %EAX, an X86ISD::CALL of the printf target global address, and X86ISD::RET_FLAG.]
Instruction Selection
David Chisnall
University of Cambridge
LLVM Summer School, Paris, June 13, 2017
The 1960s - 1970s
[Pipeline diagram: Fetch, Decode, Register Fetch, Execute, Writeback, with an add instruction flowing through the stages.]

for (int i = 100; i != 0; i--)
  ...

start:
  ...
  add r1, r1, -1
  jne r1, 0, start
Stall Example

[Pipeline diagram: the jne enters the pipeline immediately behind the add, but it needs the r1 value that the add has not yet written back, so it stalls.]

for (int i = 100; i != 0; i--)
  ...

start:
  ...
  add r1, r1, -1
  jne r1, 0, start
Fixing the Stall
for (int i = 100; i != 0; i--)
  ...

start:
  add r1, r1, -1   ; the add is scheduled earlier, away from the jne that needs its result
  ...
  jne r1, 0, start
David Chisnall
University of Cambridge
LLVM Summer School, Paris, June 13, 2017
Late Binding
define void @_Z11callVirtualR3Foo(%struct.Foo* %f) uwtable ssp {
  %1 = bitcast %struct.Foo* %f to void (%struct.Foo*)***
  %2 = load void (%struct.Foo*)*** %1, align 8, !tbaa !0
  %3 = load void (%struct.Foo*)** %2, align 8
  tail call void %3(%struct.Foo* %f)
  ret void
}
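The mangled name in the IR demangles to callVirtual(Foo&), so the source was a single virtual call along these lines (the method name is an assumption); the two loads and the indirect call above are the vtable-pointer load, the slot-0 load, and the dispatch:

struct Foo {
  virtual void aVirtualMethod();  // assumed name; slot 0 in the vtable
};

void callVirtual(Foo &f) {
  f.aVirtualMethod();
}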
[Truncated code listing: a speculative method-lookup sequence that compares the receiver's class against expected classes ($expected, $expected2) and falls back to the lookup function on mismatch.]
• Branching is expensive
• Dynamic programming languages have lots of method calls
• Common hot code paths follow a single path
• Chain together basic blocks from different methods into a trace
• Compile with only branches leaving
• Contrast: trace vs basic block (single entry point in both, multiple exit points in a trace)
Type specialisation
JavaScript:

a = b + c;

Deoptimisable pseudocode:

if (!(is_integer(b) && is_integer(c)))
  anycall_interpreter(&a, b, c); // Function does not return
a = b + c;
Case Study: JavaScriptCore (WebKit)
• Pioneered by Self
• Both LLInt and the baseline JIT collect type information
• Later tiers can optimise based on this
• More useful than type inference for optimisation ("this value is usually of type X" vs. "this value must always be of type X, Y, or Z")
General note: for optimisation, "X is usually true" is often more helpful than "Y is always true" if X is a much stronger constraint than Y (and X is cheap to check).
Other profiling
• Function entry
• Branch targets
• Build common control flow graphs
Tiers 3/4: the high-performance JITs
• Continuation-passing style IR
• Every call is a tail call, all data flow is explicit
• Lots of JavaScript-specific optimisations
• Many related to applying type information
• CPS not covered much in this course, but lots of recent research on combining the best aspects of SSA and CPS!
Type inference [benchmark chart; higher is better]

Tier 4: LLVM / B3 [benchmark chart; lower is better]

FTL vs Clang [benchmark chart; lower is better]
Lessons
The End