LLVM Tutorial

Download as pdf or txt
Download as pdf or txt
You are on page 1of 59

Using LLVM For Program

Transformation
ANDREW RUEF
UNIVERSITY OF MARYLAND
COMPUTER SCIENCE

LLVM Overview
Research project at UIUC
Modular compiler tool chain
Integrated in many open source and commercial

projects
Licensed under an open-source license

Introduction

Components of LLVM
Mid-level compiler Intermediate
Representation (IR)
C/C++ compiler frontend (clang)
Target-specific (X86, ARM, etc) code
generators
Divide between clang and LLVM
Clang is a C/C++ compiler with an
LLVM backend
LLVM is everything else

Todays Agenda
Well talk about existing LLVM tools
Well do a few demos using those tools
Well talk about how to build tools on top of LLVM
Well build two analysis tools
Well look at a program re-writing tool

Lab: Where were going


clang C language frontend, translates C into

LLVM bitcode
opt Analyze and transform LLVM bitcode
llc Code generator for LLVM bitcode to native
code

Lab: Commands to run


$
$
$
$

clang c emit-llvm o test.bc test.c!


opt O1 o test.bc test.bc!
llc o test.s test.bc!
gcc o test test.s!

Lab: What just happened?


Full translation of C program to executable program
At each stage we can look at what the compiler

infrastructure is doing

C to un-optimized bitcode
Optimized bitcode
Machine code
Executable

Very good blog post on the life of an LLVM

instruction
http://eli.thegreenplace.net/2012/11/24/life-of-aninstruction-in-llvm/

LLVM Intermediate
Representation

Lab: Find Non-Constant Format String


Condition to check for:
Any time the first parameter to printf, sprintf (others?) is nonconstant, alert for potential security badness
Can we statically detect this in LLVM IR?

Algorithm For Detection


Visit every call instruction in the program
Ask if that call instruction is a format-string

accepting routine
If it is, retrieve the first parameter
If the first parameter is not a constant global, raise
an alert

Structure of Provided Driver


Very basic driver that uses a PassManager
Reads in LLVM bitcode and runs the VarPrintf pass

on it
Produce bitcode file using clang c emit-llvm!
Using the driver might seem clunky, this is easier
than integrating with opt
The pass can later be integrated with opt
!

Building the drivers


$
$
$
$
$

cd tutorial!
mkdir build!
cd build!
cmake DLLVM_ROOT=/usr/local ..!
make!

CMake
CMake is a meta make
Why? Why not
CMake generates your build environment
Makefiles
XCode solution
Visual Studio solution
CMake has its own build specification system for

describing building code

It might be saner than what you are used to

LLVM can be built with cmake or automake/

autoconf

LLVM Intermediate Representation


Language allows for expression of computation
Instructions produce unique values
Collection of statements:
%5 = add nsw i32 %3, %4!
%N

a value!
add a binary instruction!
nsw no signed wrap!

The language is Static Single Assignment (SSA)


Values defined by statements are never re-defined

Hierarchy of the Language

A compilation unit is a Module, contains functions


A function is a Function, contains basic blocks
A basic block is a BasicBlock, contains instructions
An instruction is an Instruction
Instructions can contain operands, each is a Value
All of the above, except Module, is a Value

Types
No implicit casting in LLVM IR, all values must be

explicitly converted
All values have a static type
Integers are specified at arbitrary bitwidth

i1, i2, i3, , i32, i398

Floating point types


Derived types specify arrays, vectors, functions

pointers, structures
Structures have types like {i32, i32, i8}!
Pointers have types like pointer to i32

Note on Integer Types


There are no signed or unsigned integers
LLVM views integers as bit vectors
Frontends destroyed signed/unsigned information
Really, C programmers destroyed signed/unsigned
information
Research prototypes exist that analyze integer

wrapping in LLVM IR (
http://code.google.com/p/wrapped-intervals/ )
Operations are interpreted as signed or unsigned
based on instructions they are used in

Memory Model
LLVM has a low level view of memory
Just a key -> value map
Keys are pointer values
Values stored in LLVM memory must be integers, floating
point, pointers, vectors, structures, or arrays
LLVM has a concept of creating function-local

memory via alloca

The Module
Highest level concept
Contains a set of global values
Global variables
Functions

The Function
Name
Argument list
Return type
Calling convention
Extends from GlobalValue, has properties of

linkage visibility

The BasicBlock
Contains a list of Instructions
All BasicBlocks must end in a TerminatorInst
BasicBlocks descend from values, and are used as

values in branching instructions

The Instruction
Terminator instructions
Binary instructions
Bitwise instructions
Aggregate instructions
Memory instructions
Type conversion instructions
Control and misc instructions

Language By Example
Produced with opt dot-cfg o fib.bc fib.bc and graphviz

Language By Example, Part 2

Static Single Assignment


LLVM contains a pass to promote variable-using

functions to value-using functions


Once transformed by this pass, an LLVM module is
in SSA form
Most LLVM analyses and transformations expect to
operate on an SSA IR
SSA allows for Def-Use and Use-Def chain analysis

Simple function
int foo(int a, int b) {!
int i = a;!
int j = b;!
!
return i+j+1;!
}!

Pre-SSA
define i32 @foo(i32 %a, i32 %b) nounwind uwtable ssp {!
entry:!
%a.addr = alloca i32, align 4!
%b.addr = alloca i32, align 4!
%i = alloca i32, align 4!
%j = alloca i32, align 4!
store i32 %a, i32* %a.addr, align 4!
store i32 %b, i32* %b.addr, align 4!
%0 = load i32* %a.addr, align 4!
store i32 %0, i32* %i, align 4!
%1 = load i32* %b.addr, align 4!
store i32 %1, i32* %j, align 4!
%2 = load i32* %i, align 4!
%3 = load i32* %j, align 4!
%add = add nsw i32 %2, %3!
%add1 = add nsw i32 %add, 1!
ret i32 %add1!
}!

Post-SSA
define i32 @foo(i32 %a, i32 %b) nounwind
uwtable ssp {!
entry:!
%add = add nsw i32 %a, %b!
%add1 = add nsw i32 %add, 1!
ret i32 %add1!
}!

The Phi-Node
To support conditional assignments, we introduce an

imaginary function
Phi defines a value and accepts a list of tuples as an
argument
Each tuple is a (BasicBlock * Value)
Interpret the phi node as defining a value
conditionally based on the previous basic block

Phi node example


int foo(int a, int b) {!
int r;!
!
if( a > b )!
r = a;!
else!
r = b;!
!
return r;!
}!

Phi node example, pre SSA


define i32 @foo(i32 %a, i32 %b) nounwind uwtable ssp {!
entry:!
%a.addr = alloca i32, align 4!
%b.addr = alloca i32, align 4!
%r = alloca i32, align 4!
store i32 %a, i32* %a.addr, align 4!
store i32 %b, i32* %b.addr, align 4!
%0 = load i32* %a.addr, align 4!
%1 = load i32* %b.addr, align 4!
%cmp = icmp sgt i32 %0, %1!
br i1 %cmp, label %if.then, label %if.else!
!
if.then:
!
%2 = load i32* %a.addr, align 4!
store i32 %2, i32* %r, align 4!
br label %if.end!
!
if.else:
!
%3 = load i32* %b.addr, align 4!
store i32 %3, i32* %r, align 4!
br label %if.end!
!
if.end: !
%4 = load i32* %r, align 4!
ret i32 %4!
}!

Phi node example, post SSA


define i32 @foo(i32 %a, i32 %b) nounwind uwtable ssp
{!
entry:!
%cmp = icmp sgt i32 %a, %b!
br i1 %cmp, label %if.then, label %if.else!
!
if.then: br label %if.end!
!
if.else: br label %if.end!
!
if.end: %r.0 = phi i32 [ %a, %if.then ], [ %b,
%if.else ]!
ret i32 %r.0!
}!

Phi node example 2


int aa(int a, int b) {!
int i = 0;!
int k = 0;!
while( k < b) {!
i += a;!
}!
!
return i;!
}!

LLVM CFG

The GetElementPtr instruction


An instruction so frequently misunderstood, it has

its own documentation page about how it is


misunderstood
Frequently abbreviated as GEP
GEP instructions compute offsets from pointer bases

Similar to lea instructions in X86 assembler

GEP instructions are type aware


Asking for the 5th field of a pointer to structure operand will
do the right thing

Well-Formed LLVM
There are specific rules as to what constitutes Well-

Formed LLVM
Phi-nodes dominate their uses
Instruction arguments are defined before use
All blocks end in a terminator
All branch targets are defined values

There is an automatic verification pass that will alert

when IR is not well formed

C++ API

Value Hierarchy
Value has a very rich class hierarchy
LLVM API allows the manipulation of every Value
Any degree of transformation is possible

Value class hierarchy

Everything From Value


Every item contained in a Module inherits from

Value
This allows for some useful APIs
Def-Use / Use-Def iteration
Replace any Value with another Value
Sub

Allows for classification


Instructions can be UnaryInstructions or BinaryInstructions
GlobalValues can be Functions or GlobalVariables

LLVM Context
Frequent argument to LLVM API functions
These can normally be retrieved from a Value via

getContext!

There is also a getGlobalContext!

The same LLVMContext should always be used

across code that interacts with the same Values


LLVM objects are created in a specific context and are unique
by pointer values
For example, type objects can be pointer-compared for
equality between types of different instructions

Run Time Type Information


An evil C++ concept
If you have a function that accepts a parameter of an

abstract class and it could be one of any specific


implementations, how to choose?
Normal C++ methods

dynamic_cast<T> and friends

Compiler stores information about object types off to

the side so that it can be used at run-time

LLVM and Run Time Type Information


The LLVM codebase implements its own RTTI for

LLVM objects
When writing passes, you use LLVM specific helpers
isa<T> - True or false if pointer/reference is of type T!
cast<T> - Checked cast, asserts on failure if not type T!
dyn_cast<T> - unchecked cast, null if not type T

The project advises you not to use big chains of these

to approximate match from ML


Instead they give you a Visitor pattern (yay)
You might find these insufficient (or distasteful)

Common Patterns
Iterate over BasicBlock in a Function
Use begin(), end() iterators of Function
Iterate over Instructions in a Function
Use inst_iterator
Iterate over Def-Use chains
Use use_begin, use_end!

InstVisitor
Pattern to avoid giant blocks of !

!if(T *n = dyn_cast<T>(foo))!
Inherit from InstVisitor class and define a visitTInst
method
Could work for your purposes
Could confuse control flow even more

Including LLVM In Your Project


llvm-config executable that will provide useful

info about the installed LLVM


Provide paths to headers, library files, etc
If LLVM is built with Cmake, it will add a
FindLLVM.cmake to your /usr/share
Compiling your code with fno-rtti will probably be
required
If you compiled LLVM yourself, you can pass
LLVM_REQUIRES_RTTI to cmake
Needed if combining boost and llvm

Passes and transformations

Passes
In the previous lab, we wrote a

pass
Compiling is the act of passing
over and analyzing/transforming
IR
Most things that happen in LLVM
happen in the context of a pass
Passes can have complicated
actions

Pass Dependencies
Passes can depend on the output of other passes
Analysis passes for alias analysis
Passes note their dependencies on other passes
By overriding the getAnalysisUsage method
PassManager figures out the dependency graph
It also attempts to optimize the traversal of the graph
Each Pass returns a bool, PassManager runs until

everyone stops

Pass Manager
PassManager performs dependency maintenance
Note that PassManager invocations could be multi-threaded!
Importance of multiple LLVMContexts
PassManager also performs optimizations of pass

ordering
PassManager defines different kinds of Passes that
can be run
ModulePass Run on entire module
FunctionPass Run on individual functions
BasicBlockpass Run on individual basic blocks

Pass Rules
Non-analysis passes should not remember any

information about a function or basic block


Analysis passes should remember some information

Otherwise why run them

Transformation passes should be idempotent

Lab: Escape Analysis


If a variable is allocated on the local stack, a pointer

to that variable should not outlive the stack


This could happen if a pointer to a local is returned
or assigned to a global
clang currently includes a check for this, but the
check is kind of busted

Algorithm For Escape Analysis


Populate a set of values that escape the function via

return or store
Traverse the set checking for alloca-ed values in
the Values descending from the escapes

Structure of Provided Driver


Driver is laid out similarly to before
Collection of tests are included

Projects built on LLVM


Google AddressSanitizer/ThreadSanitizer
http://code.google.com/p/address-sanitizer/
Utah Integer Overflow Checker
http://embed.cs.utah.edu/ioc/
Emscripten, LLVM to Javascript
https://github.com/kripken/emscripten/wiki
Dagger, decompilation from x86 to LLVM
http://llvm.org/devmtg/2013-04/bougacha-slides.pdf

Important LLVM subprojects


poolalloc field-sensitive, context-sensitive alias

analysis
lldb llvm debugger
klee symbolic execution for LLVM
FreeBSD compiles with clang, soon will switch to

building exclusively with clang

Conclusion
LLVM enables powerful transformations
Includes an industry grade C/C++ frontend
clang is default compiler on OSX, supported by Apple
Can compile much of Linux userspace
Well defined Intermediate Language
Modular and pluggable framework for analysis and

transformation

Project Documentation
Good documentation online
http://www.llvm.org/docs
Documentation covers many aspects of the LLVM

project
Programmers manual details finer points of the C++ API
Language reference is ultimate source for language details and
semantics

Relatively responsive IRC channel on OFTC


Active and responsive mailing list

You might also like