DynamoRIO Tutorial Apr2010
DynamoRIO Tutorial Apr2010
DynamoRIO Tutorial Apr2010
Instrumentation Tools
with
DynamoRIO
Saman Amarasinghe
Derek Bruening
Qin Zhao
Tutorial Outline
• Dynamo
– HP Labs: PA-RISC late 1990’s
– x86 Dynamo: 2000
• RIO DynamoRIO
– MIT: 2001-2004
• Prior releases
– 0.9.1: Jun 2002 (PLDI tutorial)
– 0.9.2: Oct 2002 (ASPLOS tutorial)
– 0.9.3: Mar 2003 (CGO tutorial)
– 0.9.4: Feb 2005
• Efficient
– Near-native performance
• Transparent
– Match native behavior
• Comprehensive
– Control every instruction, in any application
• Customizable
– Adapt to satisfy disparate tool needs
• Multiple threads
– Synchronization
• Application introspection
– Reading of return address
• Transparency corner cases are the norm
– Example: access beyond top of stack
• Scalability
– Must adapt to varying code sizes, thread counts, etc.
• Dynamically generated code
– Performance challenges
• Efficient
– Software code cache overview
– Thread-shared code cache
– Cache capacity limits
– Data structures
• Transparent
• Comprehensive
• Customizable
e9 37 6f 48 92 jmp <callout>
Kernel32!TerminateProcess:
7d4d1028 7c 05 jl 7d4d102f
7d4d102a 33 c0 xor %eax,%eax
7d4d102c 40 inc %eax
7d4d102d eb 08 jmp 7d4d1037
7d4d102f 50 push %eax
7d4d1030 e8 ed 7c 00 00 call 7d4d8d22
cc int3 (breakpoint)
Kernel32!TerminateProcess:
7d4d1028 7c 05 jl 7d4d102f
7d4d102a 33 c0 xor %eax,%eax
7d4d102c 40 inc %eax
7d4d102d eb 08 jmp 7d4d1037
7d4d102f 50 push %eax
7d4d1030 e8 ed 7c 00 00 call 7d4d8d22
e9 37 6f 48 92 jmp <callout>
Kernel32!TerminateProcess:
7d4d1028 7c 05 jl 7d4d102f
7d4d102a 33 c0 xor %eax,%eax
7d4d102c 40 inc %eax
7d4d102d eb 08 jmp 7d4d1037
7d4d102f 50 push %eax
7d4d1030 e8 ed 7c 00 00 call 7d4d8d22
e9 37 6f 48 92 jmp <callout>
Kernel32!TerminateProcess:
7d4d1028 7c 05 jl 7d4d102f
7d4d102a 33 c0 xor %eax,%eax
7d4d102c 40 inc %eax
7d4d102d eb 08 jmp 7d4d1037
7d4d102f 50 push %eax
7d4d1030 e8 ed 7c 00 00 call 7d4d8d22
• Not transparent
– Cannot write jump atomically if crosses cache line
– Even if write is atomic, not safe if overwrites part of next
instruction
– Jump may span code entry point
• Too limited
– Not safe w/o suspending all threads and knowing all entry points
– Limited to inserting callouts
• Code displaced by jump is a mini code cache
– All the same consistency challenges of larger cache
• Inter-operation issues with other hooks
START
Slowdown: ~300x
dispatch
context switch
BASIC BLOCK
CACHE Non-control-flow instructions
non-control-flow executed from software code
instructions
cache
B C
dispatch
context switch
BASIC BLOCK
CACHE
A B D
dispatch
context switch
BASIC BLOCK
CACHE Direct branch to existing
non-control-flow block can bypass dispatch
instructions
B C
dispatch
context switch
BASIC BLOCK
CACHE
A B D
dispatch
context switch
BASIC BLOCK
CACHE Application address
non-control-flow indirect branch
lookup
mapped to code cache
instructions
ib_lookup: ...
...
...
G J G
K
K
G J G
K
K J
dispatch
context switch
• Extra instructions
– Indirect branch target comparisons
– Indirect branch hashtable lookups
• Extra data cache pressure
– Indirect branch hashtable
• Branch mispredictions
– ret becomes jmp*
• Application code modification
dispatch < 1%
context switch
~ 0% ~ 4% ~ 94% ~ 2%
DynamoRIO Tutorial at CGO 24 April 2010
31
Not An Ordinary Application
• Efficient
– Software code cache overview
– Thread-shared code cache
– Cache capacity limits
– Data structures
• Transparent
• Comprehensive
• Customizable
Operating System
Operating System
• Thread-private
– Less synchronization needed
– Absolute addressing for thread-local storage
– Thread-specific optimization and instrumentation
• Thread-shared
– Scales to many-threaded apps
• Efficient
– Software code cache overview
– Thread-shared code cache
– Cache capacity limits
– Data structures
• Transparent
• Comprehensive
• Customizable
exit stubs
19%
net jumps
8% original code
66%
• Thread-private:
– Working set size matching is on by default
– Client may see blocks or traces being deleted in the absence of
any cache consistency event
– Can disable capacity management via
• -no_finite_bb_cache
• -no_finite_trace_cache
• Thread-shared:
– Set to infinite size by default
– Can enable capacity management via
• -finite_shared_bb_cache
• -finite_shared_trace_cache
• Efficient
– Software code cache overview
– Thread-shared code cache
– Cache capacity limits
– Data structures
• Transparent
• Comprehensive
• Customizable
• Fine-grained scheme
– Supports individual code fragment unlink and removal
– Separate data structure per code fragment and each of its exits,
memory regions spanned, and incoming links
• Coarse-grained scheme
– No individual code fragment control
– Permanent intra-cache links
– No per-fragment data structures at all
– Treat entire cache as a unit for consistency
• Fine-grained scheme
– Data structures are highly tuned and compact
• Coarse-grained scheme
– There are no data structures
– Savings on applications with large amounts of code are typically
15%-25% of committed memory and 5%-15% of working set
• Fine-grained scheme
– Current default
• Coarse-grained scheme
– Select with –opt_memory runtime option
– Possible performance hit on certain benchmarks
– In the future will be the default option
– Required for persisted and process-shared caches
• Efficient
• Transparent
– Rules of transparency
– Cache consistency
– Synchronization
• Comprehensive
• Customizable
Linux Windows
DynamoRIO Tutorial at CGO 24 April 2010
58
Rule 2: If It’s Not Broken, Don’t Change It
• Threads
• Executable on disk
• Application data
– Including the stack!
Error
Error
Error
Error
Error
Error
Error
Error
Error
Error
SPEC CPU2000 Server Desktop
DynamoRIO Tutorial at CGO 24 April 2010
60
Rule 3: If You Change It, Emulate
Original Behavior’s Visible Effects
• Application addresses
• Address space
• Error transparency
• Code cache consistency
• Efficient
• Transparent
– Rules of transparency
– Cache consistency
– Synchronization
• Comprehensive
• Customizable
RISC x86
I-Cache D-Cache I-Cache D-Cache
A: A: A: A:
B: B: B: B:
C: C: C: C:
D: D: D: D:
Store B Store B
Flush B Jump B
Jump B
SPECINT 29 0 0
Excel 144 21 20
Photoshop 1168 40 0
Powerpoint 367 28 33
Word 345 20 6
Store B
Jump B
• Efficient
• Transparent
– Rules of transparency
– Cache consistency
– Synchronization
• Comprehensive
• Customizable
• Efficient
• Transparent
• Comprehensive
• Customizable
message pending
save user context
majority of
executed
code in a
time
typical message handler
Windows
application
no message pending
restore context
signal pending
register our
save user context
own signal
handler
DynamoRIO handler
time
signal handler
no signal pending
restore context
message pending
save user context
time
dispatcher
message handler
no message pending
restore context
message pending
modify save user context
shared library
memory image
dispatcher
time
dispatcher
message handler
no message pending
restore context
• To maintain control:
– Calls that affect the flow of control: register signal handler,
create thread, set thread context, etc.
• To maintain transparency:
– Queries of modified state app should not see
• To maintain cache consistency:
– Calls that affect the address space
• To support cache eviction:
– Interruptible system calls must be redirected
• Efficient
• Transparent
• Comprehensive
• Customizable
client client
START
basic block builder trace selector
client
dispatch
context switch
• Step2: Implementation
– Initialization
– Finalization
– Instrumentation
• Step 3: Optimization
– Optimize the instrumentation to improve the performance
dispatch
context switch
BASIC BLOCK
CACHE
uint num_dyn_instrs;
# switch stack
# switch aflags and errorno
# save all registers
# call do_ins_count
push $0x00000003
call $0xb7ef73e4 (do_ins_count)
# restore registers
# switch aflags and errorno back
# switch stack back
# application code
add $0x0000e574 %ebx %ebx
test %al $0x08
jz $0xb80e8a98
DynamoRIO Tutorial at CGO 24 April 2010
92
Step 3: Optimization (I): counter update inlining
where = instrlist_first(ilist);
/* save aflags */
dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
/* num_dyn_instrs += num_instrs */
opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_PTR);
opnd2 = OPND_CREATE_INT32(num_instrs);
instr = INSTR_CREATE_add(drcontext, opnd1, opnd2);
instrlist_meta_preinsert(ilist, where, instr);
/* restore aflags */
dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
}
• File-based scheme
• Per-user local files
– $HOME/.dynamorio/ on Linux
– $USERPROFILE/dynamorio/ on Windows
• Global files
– /etc/dynamorio/ on Linux
– Registry-specified directory on Windows
• Files are lists of var=value
• Standalone API
– Use DynamoRIO as a library of IA-32/AMD64 manipulation
routines
• Start/Stop API
– Can instrument source code with where DynamoRIO should
control the application
• Notifications:
– -stderr_mask 0xN
– -msgbox_mask 0xN
• Windows:
– -no_hide
• Debug-build-only:
– -loglevel N
– -ignore_assert_list ‘*’
client client
START
basic block builder trace selector
client
dispatch
context switch
static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag,
instrlist_t *bb, bool for_trace,
bool translating) {
instr_t *inst;
for (inst = instrlist_first(bb);
inst != NULL;
inst = instr_get_next(inst)) {
/* … */
}
return DR_EMIT_DEFAULT;
}
static dr_emit_flags_t
event_trace(void *drcontext, void *tag,
instrlist_t *trace, bool translating) {
instr_t *inst;
for (inst = instrlist_first(trace);
inst != NULL;
inst = instr_get_next(inst)) {
/* … */
}
return DR_EMIT_DEFAULT;
}
• Transparency support
– Separate memory allocation and I/O
– Alternate stack
• Thread support
– Thread-local memory
– Simple mutexes
– Thread-private code caches, if requested
• Communication
– Nudges: ping from external process
– File creation, reading, and writing
• Sideline support
– Create new client-only thread
– Thread-private itimer (Linux-only)
• Application inspection
– Address space querying
– Module iterator
– Processor feature identification
– Symbol lookup (currently Windows-only)
• Third-party library support
– If transparency is maintained!
– -wrap support on Linux
– ntdll.dll link support on Windows
– Custom loader for private library copy on Windows
• Three flavors:
– Thread-private: no synchronization; thread lifetime
– Global: synchronized, process lifetime
– “Non-heap”: for generated code, etc.
– No header on allocated memory: low overhead but must pass
size on free
• Leak checking
– Debug build complains at exit if memory was not deallocated
• When going to or from the IR, the thread mode and instruction
mode determine how instrs are interpreted
• When decoding, current thread’s mode is used
– Default is 64-bit for 64-bit DynamoRIO
– Can be changed with set_x86_mode()
• When encoding, that instruction’s mode is used
– When created, set to mode of current thread
– Can be changed with instr_set_x86_mode()
• Processor information
• State preservation
– Eflags, arith flags, floating-point state, MMX/SSE state
– Spill slots, TLS
• Clean calls to C code
• Dynamic instrumentation
– Replace code in the code cache
• Branch instrumentation
– Convenience routines
• Processor type
– proc_get_vendor(), proc_get_family(), proc_get_type(),
proc_get_model(), proc_get_stepping(), proc_get_brand_string()
• Processor features
– proc_has_feature(), proc_get_all_feature_bits()
• Cache information
– proc_get_cache_line_size(), proc_is_cache_aligned(),
proc_bump_to_end_of_cache_line(),
proc_get_containing_page()
– proc_get_L1_icache_size(), proc_get_L1_dcache_size(),
proc_get_L2_cache_size(), proc_get_cache_size_str()
• Absolute addressing
– Thread-private only
• Application stack
– Not reliable or transparent
• Stolen register
– Performance hit
• Segment
– Best solution for thread-shared
if (instr_is_mbr(instr)) {
app_pc address = instr_get_app_pc(instr);
uint opcode = instr_get_opcode(instr);
instr_t *nxt = instr_get_next(instr);
dr_insert_clean_call(drcontext, ilist, nxt, (void *) at_mbr,
false/*don't need to save fp state*/,
2 /* 2 parameters */,
/* opcode is 1st parameter */
OPND_CREATE_INT32(opcode),
/* address is 2nd parameter */
OPND_CREATE_INTPTR(address));
}
171%
121%
44MB
15MB
3.5MB
2.6MB
static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag, instrlist_t *bb,
bool for_trace, bool translating) {
instr_t *instr, *first = instrlist_first(bb);
uint flags;
/* Our inc can go anywhere, so find a spot where flags are dead.
* Technically this can be unsafe if app reads flags on fault =>
* stop at instr that can fault, or supply runtime op */
for (instr = first; instr != NULL; instr = instr_get_next(instr)) {
flags = instr_get_arith_flags(instr);
/* OP_inc doesn't write CF but not worth distinguishing */
if (TESTALL(EFLAGS_WRITE_6, flags) && !TESTANY(EFLAGS_READ_6, flags))
break;
}
if (instr == NULL)
dr_save_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
instrlist_meta_preinsert(bb, (instr == NULL) ? first : instr,
INSTR_CREATE_inc(drcontext, OPND_CREATE_ABSMEM((byte *)&global_count, OPSZ_4)));
if (instr == NULL)
dr_restore_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
return DR_EMIT_DEFAULT;
}
569%
231%
233%
226%
185%
• Dynamic Optimization
– Strength Reduction
– Software Prefetching
• Profiling
– Memory Reference Trace
– PiPA
• Shadow Memory
– Umbra
• Dr. Memory
dispatch
context switch
• Pentium 4
– inc is slower add 1
– dec is slower than sub 1
• Pentium 3
– inc is faster add 1
– dec is faster than sub 1
• Microarchitecture-specific optimization best performed
dynamically
Pentium 4?
if (proc_get_family() == FAMILY_PENTIUM_IV)
dr_register_trace_event(event_trace);
}
static void event_trace(void *drcontext, app_pc tag, instrlist_t *trace, bool xl8) {
instr_t *instr, *next_instr;
int opcode;
for (instr = instrlist_first(bb); instr != NULL; instr = next_instr) {
next_instr = instr_get_next(instr);
opcode = instr_get_opcode(instr);
if (opcode == OP_inc || opcode == OP_dec)
replace_inc_with_add(drcontext, instr, trace);
}
}
} Look for inc / dec
static bool replace_inc_with_add(void *drcontext, instr_t *instr, instrlist_t *trace) {
instr_t *in;
uint eflags;
int opcode = instr_get_opcode(instr);
bool ok_to_replace = false;
for (in = instr; in != NULL; in = instr_get_next(in)) {
eflags = instr_get_arith_flags(in);
if ((eflags & EFLAGS_READ_CF) != 0) return false;
if ((eflags & EFLAGS_WRITE_CF) != 0) {
ok_to_replace = true;
break;
1.0
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0.0
applu
mgrid
equake
ammp
sixtrack
wupwise
har. mean
apsi
swim
mesa
art
• Dynamic Optimization
– Strength Reduction
– Software Prefetching
• Profiling
– Memory Reference Trace
– PiPA
• Shadow Memory
– Umbra
• Dr. Memory
• Memtrace
– Profile format
• <r/w, addr>
– Steps
• Thread initialization
– Allocate buffer per thread
• Instrumentation
– Fill the buffer
– Dump to file if buffer is full
• Thread exit
– Delete the buffer
– Optimization
• Inline buffer filling code
• Out-line the clean call invocation code
DR_EXPORT void
dr_init(client_id_t id)
{
…
mutex = dr_mutex_create();
dr_register_exit_event(event_exit);
dr_register_thread_init_event(event_thread_init);
dr_register_thread_exit_event(event_thread_exit);
dr_register_bb_event(event_basic_block);
/* out-lined clean call invocation code */
code_cache_init();
…
}
static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag, instrlist_t *bb,
bool for_trace, bool translating) {
…
for (instr = instrlist_first(bb); instr != NULL; instr = instr_get_next(instr)) {
if (instr_get_app_pc(instr) == NULL) continue;
if (instr_reads_memory(instr))
for (i = 0; i < instr_num_srcs(instr); i++)
if (opnd_is_memory_reference(instr_get_src(instr, i)))
instrument_mem(drcontext, bb, instr, i, false);
if (instr_writes_memory(instr))
for (i = 0; i < instr_num_dsts(instr); i++)
if (opnd_is_memory_reference(instr_get_dst(instr, i)))
instrument_mem(drcontext, bb, instr, i, true);
}
/* jecxz call */
call = INSTR_CREATE_label(drcontext);
instrlist_meta_preinsert(ilist, where, INSTR_CREATE_jecxz(drcontext, opnd_create_instr(call)));
/* jump restore to skip clean call */
restore = INSTR_CREATE_label(drcontext);
instrlist_meta_preinsert(ilist, where,
INSTR_CREATE_jmp(drcontext, opnd_create_instr(restore)));
instrlist_meta_preinsert(ilist, where, call);
/* mov restore REG_XCX */
instr = INSTR_CREATE_mov_st(drcontext,opnd_create_reg(reg2),opnd_create_instr(restore));
instrlist_meta_preinsert(ilist, where, instr);
/* jmp code_cache */
opnd1 = opnd_create_pc(code_cache);
instrlist_meta_preinsert(ilist, where, INSTR_CREATE_jmp(drcontext, opnd1));
/* restore %reg */
instrlist_meta_preinsert(ilist, where, restore);
• Dynamic Optimization
– Strength Reduction
• Profiling
– Memory Reference Trace
• Shadow Memory
– Umbra
• Dr. Memory
• Application
– Store meta-data to track properties of application memory
• Millions of software watchpoints
• Dynamic information flow tracking (taint propagation)
• Race detection
• Memory usage debugging tool (MemCheck/Dr. Memory)
• Issues
– Performance
– Multi-thread applications
– Flexibility
– Platform dependent
– Development challenges
• Design
• Implementation
• Optimization
• Download
– http://people.csail.mit.edu/qin_zhao/umbra/
• Memory Manager
– Monitor and control application memory allocation
• brk, mmap, munmap, mremap
• dr_register_pre_syscall_event
• dr_register_post_syscall_event
– Allocate shadow memory
– Maintain translation table
• Instrumenter
– Instrument every memory reference
• Context save
• Address calculation
• Address translation
• Shadow memory update
• Context restore
DynamoRIO Tutorial at CGO 24 April 2010
199
Instrument Code Example
Context Save mov %ecx [ECX_SLOT]
mov %edx [EDX_SLOT]
mov %eax [EAX_SLOT]
lahf %ah
seto %al
• Translation Optimization
– Thread Local Translation Table
– Memoization Check
– Reference Check
• Instrumentation Optimization
– Context Switch Reduction
– Reference Grouping
– 3-stage Code Layout
• Caching
App 1
Shd 2
Shd 1
App 2
Global translation
table
Thread 1
Thread 2
• Memoization Cache
– Software cache per thread
– Stores frequently used translation entries
• Stack
• Units found in last table lookup
Thread 1
Thread 2
• Reference Cache
– Software cache per static application memory reference
• Last reference unit tag
• Last translation offset
Thread 1
Thread 2
• Inline stub
– Reference cache check
– Jump to lean procedure if miss
• Lean procedure
– Memoization cache check
– Local table lookup
– Clean call to call out
• Callout
– Global table synchronization
– Local table lookup
• Dynamic Optimization
– Strength Reduction
• Profiling
– Memory Reference Trace
• Shadow Memory
– Umbra
• Dr. Memory
freed invalid
• Clean call
– Simplest, but expensive in both time and space
• Shared clean call
– Saves space
• Lean procedure
– Shared routine with smaller context switch than full clean call
• Inlined
– Smallest context switch, but should limit to small sequences of
instrumentation
• Common task
• Dr. Memory monitors malloc, calloc, realloc, free,
malloc_usable_size, etc.
– Alternative is to replace w/ own copies
• Locating entry point
– Module API
• Pre-hooks are easy
• Post-hooks are hard
– Three techniques, each with its own limitations