Binsok

SoK: All You Ever Wanted to Know About x86/x64
Binary Disassembly But Were Afraid to Ask

Chengbin Pang⇤‡ § Ruotong Yu⇤ Yaohui Chen† Eric Koskinen⇤ Georgios Portokalidis⇤ Bing Mao‡ Jun Xu⇤
⇤ Stevens Institute of Technology † Facebook Inc. ‡ Nanjing University
TABLE I: The group of open-source tools that our study covers

Abstract—Disassembly of binary code is hard, but necessary and representative works that use those tools.
for improving the security of binary software. Over the past
few decades, research in binary disassembly has produced many Tool (Version) Source (Release Date) Public Use
tools and frameworks, which have been made available to PSI (1.0) Website [63] (Sep 2014) [50, 88, 111]
researchers and security professionals. These tools employ a U ROBOROS (0.11) Github [93] (Nov. 2016) [103]
variety of strategies that grant them different characteristics. DYNINST (9.3.2) Github [79] (April 2017) [7, 18, 69, 73, 96]
The lack of systematization, however, impedes new research in OBJDUMP (2.30) GNU [47] (Jan. 2018) [21, 103, 111]
the area and makes selecting the right tool hard, as we do G HIDRA (9.0.4) Github [75] (May 2019) [24, 45, 91]
not understand the strengths and weaknesses of existing tools. M C S EMA (2.0.0) Github [13] (Jun. 2019) [22, 41, 44]
In this paper, we systematize binary disassembly through the ANGR (8.19.7.25) Github [8] (Oct. 2019) [20, 71, 81, 98, 112]
study of nine popular, open-source tools. We couple the manual BAP (2.1.0) Github [26] (Mar. 2020) [10, 16, 64]
examination of their code bases with the most comprehensive R ADARE 2 (4.4.0) Github [89] (April 2020) [4, 31, 52, 58]
experimental evaluation (thus far) using 3,788 binaries. Our
study yields a comprehensive description and organization of • Algorithms typically produce results with some correct-
strategies for disassembly, classifying them as either algorithm ness guarantees. They mostly leverage knowledge from the
or else heuristic. Meanwhile, we measure and report the impact binary (e.g., symbols), the machine (e.g., instruction set),
of individual algorithms on the results of each tool. We find that and/or the ABI (e.g., calling conventions).
while principled algorithms are used by all tools, they still heavily
• Heuristics are based on common patterns and typically do
rely on heuristics to increase code coverage. Depending on the
heuristics used, different coverage-vs-correctness trade-offs come not offer assurances of correctness.
in play, leading to tools with different strengths and weaknesses. Moreover, each tool adopts a different set of strategies, with
We envision that these findings will help users pick the right tool technical details not always fully documented or publicized.
and assist researchers in improving binary disassembly.
To complicate the matters, the implemented strategies have
I. I NTRODUCTION evolved over time, further deviating from documentation. The
above have created a knowledge gap that impedes the users
The disassembly of binary programs is a crucial task in of these tools and, specifically, binary analysis researchers. To
reverse engineering and software security, and it is a core bridge the gap, we must answer several questions:
component of innumerable works on malware analysis [54],
• Q1 – What are the algorithms and heuristics used in existing
code-similarity measurement [17, 42, 55], vulnerability dis-
disassembly tools and how do they interact?
covery [25, 66, 82, 95], security retrofitting [2, 78, 80,
• Q2 – What is the coverage & accuracy of heuristic methods
100, 104, 109, 111] and patching [11]. However, correctly
in comparison to algorithmic ones? Are there trade-offs?
disassembling a binary is challenging, mainly owing to the
• Q3 – What errors do existing disassembly tools make and
loss of information (e.g., symbols and types) occurring when
what are the underlying causes?
compiling a program to machine code and the complexity of
To answer these questions, this paper presents a system-
constructs (e.g., jump tables, data embedded in code, etc.) used
atization of binary disassembly research, through the study
to efficiently implement language features.
of nine popular open-source tools shown in Table I. Unlike
Binary disassembly has seen remarkable advancements in
past research [5, 56, 68, 77, 105], we study these tools both
the past decade, awarding researchers and developers with a
qualitatively and quantitatively to understand the tools not only
variety of tools and frameworks, under both open source [3,
as a whole, but also their individual algorithms and heuristics.
33, 90, 94, 95, 102, 103] and commercial [36, 74] licenses.
More specifically, our qualitative study of the tools is
These tools have lifted a significant burden off researchers
based on manually inspecting source code. This allows us
that aim to develop new, advanced binary analysis techniques.
to answer Q1 by presenting their exact and most recent
This new plurality of options encapsulates a broad variety
strategies, avoiding ambiguities and out-of-date information
of underlying strategies with different guarantees, which fall
found in documentation and publications. The quantitative
under two categories:
study answers questions Q2–Q3 by applying the tools on a
§ Pangis a PhD student at Nanjing University. This work was done while Pang corpus of 3,788 benchmark binaries, consisting of utilities,
was a Visiting Scholar at Stevens Institute of Technology. client/server programs, and popular libraries on both Linux
TABLE II: Popular solutions of binary security and the in-
and Windows systems (see Table IV). To evaluate the tools formation the solutions need from binary disassembly. “Inst”,
in terms of coverage and accuracy, we built an analysis “CFG”, “Func”, and “Xrefs” respectively mean legitimate in-
framework based on LLVM, GCC, the Gold Linker, and Visual structions, control flow graph, functions, and cross references.
Studio to automatically collect the ground truth while building
the corpus. We evaluate the tools by individually measuring Category Solutions & Required Information
Vulnerability Finding [25, 34, 97] Inst, CFG, Func, Xrefs
different disassembly phases to quantify the effectiveness of Control Flow [37, 80, 100, 108, 109, 111]
Inst, CFG, Func
the strategies employed. Our evaluation presents the degree of Integrity [19, 38, 51, 72, 85, 87, 99]
Code Layout [28, 53, 60, 104]
use, precision, and pitfalls of each component of each tool. Randomization [61, 65, 78, 106, 109]
Inst, CFG, Func, Xrefs
By systematically dissecting and evaluating the tools, we Execute-only Code [21, 110] Inst, CFG
were able to make new observations that amend or comple- Legacy-code Patching [11, 101, 102, 103] Inst, CFG, Func, Xrefs
Code Similarity [14, 17, 40, 42, 55, 57, 82]
ment prior knowledge. Our major observations include: (1) Measurement [15, 29, 30, 43, 62, 70, 83]
Inst, CFG, Func
For better coverage, mainstream tools incorporate heuristics in Software Fault Isolation [38, 39, 67, 107] Inst, CFG, Func
Software De-bloating [46, 86, 92] Inst, CFG, Func
nearly every phase of disassembly. These heuristics are heavily
used in disassembling real-world binaries and, without them, code regions and correctly identifies the instructions that were
the tools cannot provide practical utility in many tasks. (2) emitted by the compiler or introduced by the developer.
Heuristics typically cannot provide correctness assurances and Symbolization determines cross-references (xrefs for short)
lead to various errors, particularly when encountering complex or precisely, numeric values in the binary that are references
constructs. Moreover, previous works may have overestimated of other code or data objects. Depending on the location of the
the reliability of those heuristics. For instance, a recent study reference and the location of the target, there are four types
[5] (unintentionally) overstated the accuracy of linear sweep of xrefs: code-to-code (c2c), code-to-data (c2d), data-to-code
because many benchmarks containing data-in-code were not (d2c), and data-to-data (d2d).
considered. (3) Tools may share the same group of algorithms Function Entry Identification locates the entry points of
and heuristics, however, they organize and combine them functions. A special but important case is the main function.
differently, leading to different accuracy-coverage trade-offs. CFG Reconstruction re-builds the control flow graph (CFG)
(4) Tools have different strengths across different tasks. For of a binary program. We consider direct control transfers,
instance, commercial tools are better at recovering instructions indirect jumps/calls, tail calls, and non-returning functions.
but open source tools can better identify cross-references.
Contributions: Our main contributions are as follows: B. Targeted Binaries
• We present a thorough systematization of binary disassem- Similarly to the majority of the works we study, we focus on
bly from the perspective of algorithms and heuristics. To our binaries with the following key properties: (1) They have been
knowledge, this is the first research that can answer Q1–Q3. produced with mainstream compilers and linkers; (2) Binaries
• We developed a compiler-based framework for automated may include hand-written assembly; (3) They have not been
end-to-end collection of ground truth for binary disassembly. obfuscated; (4) We do not assume symbol availability, i.e.,
We used it to compose a benchmark data set for assessing binaries are stripped; (5) We only consider X86/X64 binaries.
binary disassembly tools. The framework and benchmarks The majority of effort in prior works has focused on such
are available at https://github.com/junxzm1990/x86-sok. binaries, owing to the popularity of the architectures; (6) They
• We present, to our knowledge, the most comprehensive run on Linux or Windows operating systems.
evaluation of open-source disassembly tools. Our analysis
C. Targeted Tools
unveils the prevalence of heuristics, their contribution to
disassembly, and shortcomings. Our systematization is based on study of disassembly tools.
• We make new observations and improve the understanding We use five criteria to select tools: (1) They are designated for
of binary-disassembly strategies and tools. We envision that disassembly or have an independent module for disassembly.
these insights will facilitate future research in the area of (2) They can do automated disassembly without user interac-
disassembly and drive improvements in existing tools. tions. (3) They are open source tools so that we can study
their implemented strategies. (4) They have unique strategies
II. S COPE OF S YSTEMATIZATION that are not fully covered by other tools. (5) They can run our
targeted binaries to support our quantitative evaluation.
A. Functionality
Following the above criteria, we selected 9 tools, as listed
In general, binary disassembly can involve different tasks in Table I. We also looked at JakStab [59], RetDec [32],
based on the context of use. This work focuses on tasks that and BinCat [12] but excluded them in the study because (1)
relate to binary-software security. Table II classifies popular JakStab cannot run our benchmark binaries due to a parsing
security works and summarizes the information each class error (https://github.com/jkinder/jakstab/issues/9); (2) RetDec
needs to recover from binary code. Our study accordingly aims for de-compilation; it uses preliminary strategies for
concentrates on the disassembly aspects of providing them: disassembly, which are all covered by other tools we selected;
Disassembly is the process of recovering the assembly in- (3) BinCat requires user interactions to do disassembly. The
structions of a binary. Perfect disassembly separates data from disassembly results vary upon different user interactions.
TABLE III: The specifics of existing algorithms (numbered with rings like ¿) and heuristics (numbered with discs like ∂).
Alg. Algorithms & Heuristics Goals Tools
1 Start from code addresses with symbols Code accuracy OBJDUMP , PSI, U ROBOROS
1 Continuous scanning for instructions Code coverage OBJDUMP , PSI, U ROBOROS
Linear
2 Skip bad opcodes Code accuracy OBJDUMP
Disassembly
Sweep
3 Replace padding and re-disassembly Code accuracy PSI
4 Exclude code around errors Code accuracy U ROBOROS
2 Follow control flow to do disassembly Code accuracy DYNINST, G HIDRA, ANGR, BAP, R ADARE 2
3 Start from program entry, main, and symbols Code accuracy DYNINST, G HIDRA, ANGR, BAP, R ADARE 2
Recursive
5 Function entry matching Code coverage DYNINST, G HIDRA, ANGR, BAP, R ADARE 2
Descent
6 Linear sweep code gaps Code coverage ANGR
7 Disassembly from targets of xrefs Code coverage G HIDRA, R ADARE 2
4 Exclude data units that are floating points Xref accuracy ANGR
8 Brute force operands and data units Xref coverage U ROBOROS, M C S EMA, G HIDRA, ANGR
Symbolization
9 Pointers in data have machine size Xref accuracy U ROBOROS, M C S EMA, G HIDRA, ANGR
10 Alignment of pointers in data Xref accuracy U ROBOROS, M C S EMA, G HIDRA
11 Pointers in data or referenced by other xrefs can be non-aligned Xref coverage G HIDRA, ANGR
Xrefs
12 References to code can only point to function entries Xref accuracy G HIDRA
13 Enlarge boundaries of data regions Xref coverage G HIDRA, ANGR
14 Address tables have minimal size of 2 Xref accuracy G HIDRA
15 Exclude pointers that may overlap with a string Xref accuracy M C S EMA, G HIDRA
16 While scanning data regions, use step-length based on type inference Xref accuracy ANGR
MAIN 5 Identify main based on arguments to __libc_start_main Func coverage ANGR ,BAP
Function Entry
Function 17 Identify main using patterns in _start/__scrt_common_main_seh Func coverage DYNINST, R ADARE 2
6 Identify function entries based on symbols Func coverage DYNINST, G HIDRA, ANGR, BAP, R ADARE 2
7 Identify function entries based on exception information Func coverage G HIDRA
General 8 Identify function entries based on targets of direct calls Func coverage DYNINST, G HIDRA, ANGR, BAP, R ADARE 2
Function 9 Identify function entries by resolving indirect calls Func coverage G HIDRA, ANGR
18 Identify function entries based on prologues/decision-tree Func coverage DYNINST, G HIDRA, ANGR, BAP, R ADARE 2
19 Consider begins of code discovered by linear scan as function entries Func coverage ANGR
10 Use VSA to resolve jump table targets CFG accuracy DYNINST, G HIDRA, ANGR
Indirect 20 Follow patterns to determine jump tables CFG accuracy DYNINST, G HIDRA, R ADARE 2
Jump 21 Discard jump tables with index bound larger than a threshold CFG accuracy G HIDRA, ANGR, R ADARE 2
22 Restrict the depth of slice for VSA Efficiency DYNINST, ANGR
Indirect Call 11 Identify targets based on constant propagation CFG coverage G HIDRA, ANGR
CFG
12 Consider a jump to the start of another function as a tail call CFG accuracy DYNINST, ANGR
23 Determine tail call based on distance between the jump and its target CFG accuracy R ADARE 2
24 A tail call and its target cross multiple functions CFG accuracy G HIDRA
Tail Call 25 Tail calls cannot be conditional jumps CFG accuracy G HIDRA, ANGR
26 A tail call tears down its stack CFG accuracy DYNINST, ANGR
27 A tail call does not jump to the middle of a function CFG accuracy ANGR
28 Target of a tail call cannot be target of any conditional jumps CFG accuracy ANGR
13 Identify system calls or library functions that are known non-returning CFG accuracy DYNINST, G HIDRA, ANGR, BAP, R ADARE 2
Non-returning 14 Identify functions with no ret and no tail calls that return CFG accuracy DYNINST, ANGR, R ADARE 2
Function
15 Identify functions that always call non-returning functions CFG accuracy BAP
29 Detect non-returning functions based on fall-through after the call-sites CFG accuracy G HIDRA
III. A NALYSIS OF T OOLS A. Algorithms and Heuristics in Disassembly
To understand the strategies employed in today’s binary The disassembly strategies we study fall into two broad and
disassembly tools, we studied 9 representative examples in well-known classes: linear sweep and recursive descent.
Table I. These tools have varying popularity and cover nearly Linear Sweep [OBJDUMP, PSI, U ROBOROS]: Linear sweep
all publicly known techniques in binary disassembly [105]. continuously scans pre-selected code ranges and identifies
Our investigation is primarily based on studying source valid instructions ( 1 ), exploiting the rationale that modern
code, instead of solely relying on available documentation assemblers tend to layout code successively to reduce the
and publications. Source code reflects the exact semantics of binary’s size. In general, a linear sweep strategy can be
the strategies applied, protecting us from ambiguities in the described by how it selects sweep ranges and how it handles
documents. Also, many tools have evolved over time and their errors during scanning. As such, we summarize algorithms
corresponding documentation and publications are out-of-date. according to these two aspects.
The rest of this section presents our findings, summarized in All tools in this class follow OBJDUMP to select code
Table III. We assign a number to each algorithm and heuristic, regions for sweep: they process code ranges specified by sym-
respectively placing it within a ring (e.g., 1 ) or disc (e.g., 1 ). bols in the .symtab and .dynsym sections ( 1 ), followed by
② ③
jmp loc / jmp * / ret /call
③
f2e0: retq
④
When encountering direct control transfers, the tools expand
jmp loc / jmp * / ret 0x0 0x* (padding) ④ f2e1: 66 * nopw %cs *
0x0 0x* (padding) ③
0x0 0x* (padding) f2e8: 0x0 (padding) 0x*
the disassembly to the targets. However, to handle indirect
0x0 0x* (padding)
…
… … control transfers, different tools adopt different approaches. We
② 0x0 (mid of inst) 0x* ② f2ef: 0x0 * add %cl *
①
(bad) will cover the details in § III-D. Another challenging part of
... jmp (mid of inst) ① 3c40: callq f2f0 ① control flow is to determine non-returning functions. Related
(a) Handling bad opcode (b) Handling invalid control flow (c) Case-b error PSI’s idea can handle
details are also discussed in § III-D.
Fig. 1: Error-handling by PSI. Part (a) and part (b) respectively As indirect control flows are (formally and practically)
show the handling of bad opcode and invalid control transfers. undecidable, recursive descent often leaves behind code gaps.
Part (c) is an invalid control-transfer from BinUtils that PSI Our evaluation in § IV-B1 shows that recursive disassembly
should handle but the actual implementation does not. alone can miss 49.35% of the code on average. As such,
existing tools incorporate heuristics to enlarge code coverage,
remaining gaps in the code sections. In a general sense, these which inevitably jeopardizes correctness guarantees. The most
ranges comprehensively encapsulate legitimate instructions. common heuristic, used by ANGR, DYNINST, R ADARE 2,
Various heuristics are used for error detection and handling. BAP, and G HIDRA, searches for function entry points in the
OBJDUMP deems invalid opcodes as errors, skips a byte, and code gaps based on common function prologues / epilogues or
resumes scanning ( 2 ). Beyond invalid opcodes, PSI considers pre-trained decision-tree models [9] ( 5 ). On finding a function
control transfers to non-instructions as errors. Furthermore, entry, the tools will consider it as a new starting point for
PSI has more sophisticated error-handling as shown in Fig. 1. recursive disassembly. According to our evaluation (§ IV-B3),
Upon a bad opcode, PSI traces backwards from the erroneous function-entry matching on average identifies 17.36% of all
instruction to a non-fall-through control transfer (unconditional the functions, leading to a 31.55% increase of code coverage.
jump, indirect jump, or return) and identifies padding after that Beyond function matching, existing tools also use heuristics
control transfer. Replacing the padding with nop instructions, that are more aggressive. ANGR performs linear sweep on the
PSI then re-runs linear-sweep for re-disassembly ( 3 ). When code gaps and recursive disassembly on legitimate instructions
encountering an invalid control transfer, the public version of ( 6 ). In case of errors, ANGR skips the current basic block
PSI only handles cases where the transfer part is correct but and moves to the next bytes. This linear scan increases the
code around the target contains errors. Specifically, it checks code coverage of ANGR by around 8.20%. However, it will
the code around the target, seeks a preceding instruction that misidentify data as code (e.g., Listing 2 in Appendix G).
starts with zero, and finally identifies a further preceding non- G HIDRA includes targets of xrefs in recursive descent ( 7 ).
fall-through control transfer and its following padding. Again, This strategy discovers 4.33% more code coverage. However,
PSI will replace the padding with nop instructions and re- the xrefs are mostly collected with heuristics, which can lead
run linear-sweep. U ROBOROS follows a similar idea as PSI. to errors like Listing 3 in Appendix G.
But instead of re-disassembly, U ROBOROS simply excludes In summary, strict recursive descent ensures correctness but
the code around the error locations ( 4 ). often produces insufficient coverage. To expand code cov-
The design of PSI can handle cases like the one in Fig. 1(c). erage, existing tools incorporate many aggressive heuristics
Its implementation, however, is too restrictive for high effec- that undermine the correctness guarantees (§ IV-B1).
tiveness. For instance, the public version of PSI only considers B. Algorithms and Heuristics in Symbolization
zero-started padding and cannot correct the error in Fig. 1(c).
Symbolization identifies numerical values in the binary that
To sum up, linear sweep aggressively scans all possible code
are actually references to code or data objects. Tools generally
and hence, maximizes the recovery of instructions. However,
follow the workflow in Fig. 2.
it can run into errors due to data-in-code. To address errors,
Constant Operand and Data Unit Extraction [ANGR,
existing tools rely on heuristics for correction, which are less
G HIDRA, U ROBOROS, M C S EMA]: These tools start by identi-
comprehensive and can have limited utility (§ IV-B1).
fying numerical values that are potential pointers. They search
Recursive Descent [DYNINST, G HIDRA, ANGR, BAP, through all instructions to identify constant operands and scan
R ADARE 2]: Recursive descent starts with a given code address the non-code regions to find data units ( 8 ). As identification
and performs disassembly following the control flow ( 2 ). of constant operands is trivial, we omit the details and focus on
Strategies in this category usually consist of three components: data units. In general, a data unit is composed of consecutive
(1) how to select code addresses, (2) how to resolve control n-bytes at an aligned address. However, different tools have
flow, and (3) how to handle the code gaps left by recursive varying choices of n, alignment, and non-code regions:
disassembly. Accordingly, we summarize the existing tools • All tools assume that a data unit’s size is the same as the
based on the three components. machine size, i.e., 4 bytes in 32-bit and 8 bytes in 64-bit
All the tools we study consider the program entry and binaries ( 9 ). This assumption, however, is not always true.
available symbols as code addresses for recursive disassembly Listing 4 in Appendix G shows a jump table with 4-byte
( 3 ). These addresses are, in principle, known to be safe. entries from a 64-bit binary. In such cases, the assumption
Further, ANGR, BAP, DYNINST and R ADARE 2 also discover about a data unit’s size can mislead symbolization in tools
the main function and the details are covered in § III-C. that do not otherwise handle jump tables (e.g., U ROBOROS).
Instructions Non-instruction Regions 1,024 bytes and G HIDRA adopts a similar idea ( 13 ) because
Data Unit Extract
many pointers are dereferenced with an offset. This strategy
Data Type Inference
Constant Operand Extract
q UROBOROS, ANGR,GHIDRA,MCSEMA
q ANGR
q UROBOROS
q ANGR
indeed benefits coverage (e.g., Listing 6 in Appendix G). It,
q GHIDRA
q GHIDRA however, also introduces errors like Listing 5 in Appendix G.
q MCSEMA
Recall that G HIDRA leverages xrefs to aid recursive descent
Code Pointer Check Data Pointer Check Address Table Check
q UROBOROS q UROBORS q UROBOROS (§ III-A). When a constant operand is symbolized as a xref to a
q ANGR q ANGR q ANGR
q GHIDRA q GHIDRA q GHIDRA non-disassembled code region, G HIDRA recursively disassem-
q MCSEMA q MCSEMA q MCSEMA
bles that region. If G HIDRA runs into errors like bad opcodes
Code Pointers Data Pointers Code & Data Pointers and invalid control transfers, it rolls back the disassembly.
Address Table [ANGR, G HIDRA, U ROBOROS, M C S EMA]:
Fig. 2: A general workflow of symbolization. Beyond constant operands, these tools also symbolize the non-
• U ROBOROS and M C S EMA use machine-size alignment ( 10 ). code regions by locating address tables: a group of consecutive
G HIDRA assumes a 4-byte alignment unless a data unit is data units that are pointers. In general, determining address
the target of another xref. In the latter case, G HIDRA has tables depends on the choice of table size and the rules to
no alignment requirement ( 11 ). ANGR enforces no alignment classify a data unit as a pointer. G HIDRA considers 2 as
requirement ( 11 ) due to observations of unaligned point- the minimal size of an address table ( 14 ) and the others
ers [101]. Our evaluation shows that the choice of alignment consider 1. While the choice of G HIDRA helps more accurately
is a coverage-accuracy trade-off: around 600 pointers are identify grouped pointers like function tables, it misses many
saved at non-aligned addresses while the no alignment individual pointers, leading to false negatives. With regard to
assumption leads to nearly 60% of ANGR’s false positives. determining pointers, all tools follow the approaches as we
• Besides data segments, G HIDRA and ANGR also search for previously discussed. After the initial generation of address
data units from non-disassembled code regions. tables, ANGR, M C S EMA and G HIDRA also apply refinements:
• ANGR excludes table entries that are floating points ( 4 ).
Data Unit Type Inference [ANGR, G HIDRA]: In operand • M C S EMA excludes table entries that may overlap with a
extraction, ANGR and G HIDRA infer the types of data units string ( 15 ). When a piece of data can be both a pointer and
when possible. ANGR identifies memory loads from data units. a string, M C S EMA prefers string. As we will discuss below,
If the loaded values flow to floating-point instructions, ANGR ANGR uses an opposite strategy.
marks the data units as floating points. This inference is in • G HIDRA excludes table entries that point to the middle of
general reliable as it follows data flow. G HIDRA uses a more recovered functions. G HIDRA also excludes table entries
aggressive strategy: given a constant operand “pointing” to a that overlap with strings or cut into other pointers. Finally,
data unit, G HIDRA considers the data unit to be the start of a G HIDRA splits an address table when adjacent entries have
string if it is followed by a sequence of ASCII/Unicode bytes a distance larger than 0xffffff.
and a null-byte. Otherwise, G HIDRA deems the data unit to be • Given an entry to non-disassembled code, G HIDRA expands
a pointer if the value inside meets the following conditions: the recursive descent using the aforementioned approach.
(1) the value is at least 4096; (2) the value is an address of
ANGR uses a special strategy when brute-force searching
an instruction or an address in a non-code region; (3) if the
data regions. Given a location, ANGR in turn checks whether
value is an address of an instruction in a known function, the
the data inside is a pointer, a ASCII/Unicode string, or an
instruction must be the function entry ( 12 ); (4) if the value is
arithmetic sequence. If any type matches, ANGR jumps over
a data address, the address cannot overlap another typed data
the typed bytes and then resumes the search ( 16 ). This strategy
unit. G HIDRA’s type inference has no correctness assurance.
incurs many false negatives like Listing 7 in Appendix G.
Code-to-Code and Code-to-Data Xrefs [ANGR, G HIDRA,
Overall, there is a lack of algorithmic solutions to symbol-
U ROBOROS, M C S EMA]: For each constant operand, ANGR,
ization. Today’s tools incorporate a multitude of heuristics,
U ROBOROS, and M C S EMA seek to symbolize it as a code
striving for a coverage-correctness balance.
pointer, checking whether the operand refers to a legitimate
instruction. Beyond, G HIDRA has two extra rules: (1) the C. Algorithms and Heuristics in Function Entry Identification
operand cannot be a value in {[0-4095], 0xffff, 0xff00, Most tools use separate strategies to identify the entry of
0xffffff, 0xff0000, 0xff0000, 0xffffffff, 0xffffff00, 0xffff0000, main versus the entries of other functions. As such, we first
0xff00000}; (2) the instruction being referred to must be a discuss main, followed by the other types of functions.
function entry (if the function was known) ( 12 ). We measured Main Function [DYNINST, ANGR, BAP, R ADARE 2]: To
heuristic 12 with 3,788 binaries. We discover thousands of locate it, ANGR and BAP analyze the _start function and,
pointers to the middle of functions (e.g., pointers for try-catch
1 48 c7 c7 e2 e0 40 00 mov $0x40e0e2,%rdi ;main
in exception handling), showing that heuristic 12 is unsound. 2 ff 15 ce 48 05 00 ** callq __libc_start_main
For a constant operand that cannot be a code pointer, the
Listing 1: Call to __libc_start_main in _start.
tools attempt to symbolize it as a data pointer, checking if the
operand refers to a legitimate data location. In the checking following calling conventions, infer the first argument passed
process, ANGR enlarges the boundaries of a data region by by _start to __libc_start_main ( 5 ). Take Listing 1
as an example. ANGR and BAP locate the instruction at line cmp/sub $const,%r1 cmp/sub $const,%r1 cmp/sub $const,%r1 lea $const, %r1
ja default ja default ja default …
1 and deem the immediate operand (0x40e0e2) to be the cmp/sub $const,%r2
lea offset(%rip),%r2 ja default
address of main. As __libc_start_main is a standard less
…
mov base(,%r2,size),%r3 than
interface, ANGR and BAP ensure correctness. R ADARE 2 jmpq *base(,%r2,size)
jmpq *%r3 60
mov (%r2,%r3,size),%r4
add %r2, %r4
mov (%r1,%r3,size),%r4
…
bytes
and DYNINST search architecture-specific patterns near the jmpq *%r4 jmpq *%r4
Type-1 jump table Type-2 jump table Type-3 jump table Type-4 jump table
call to __libc_start_main to get the address of main
( 17 ). DYNINST finds the instruction right before the call Fig. 3: Handling of jump tables by R ADARE 2.
and extracts the immediate operand; R ADARE 2 searches the cmp $0x4f,%esi
cmp $0x4f,%esi
address after a fixed sequence of raw bytes (e.g., 48 c7 c7 in jaja default
default
Function
Function entry:
entry: [N/A]
[N/A]
lea 0x24e4(%rip),%r9
Fig. 1). For Windows binaries, R ADARE 2 seeks a pattern in lea
mov
mov
0x24e4(%rip),%r9
%esi,%r8d
%esi,%r8d Base flow: [N/A]
Base flow: [N/A]
__scrt_common_main_seh as it does for Linux binaries movslq
movslq (%r9,%r8,4),%rax
(%r9,%r8,4),%rax Index
Index flow:
flow: [esi[esi → r8d]
→ r8d]
add
add %r9,%rax
%r9,%rax
to locate main. Other tools do not particularly find main. jmpq
jmpq *%rax
*%rax Base
Base Guards:
Guards: [N/A]
[N/A]
General Functions [DYNINST, G HIDRA, ANGR, BAP, Index
Index Guards:
Guards: [0≤[0≤ esi ≤79]
esi ≤79]
Indexbound:
Index bound: [0≤
esiesi ≤79]
R ADARE 2]: To identify the entries of non-main functions, [0≤ ≤79]
Define
Define base:
base: [0x24e4f(rip)
[0x24e4f(rip) → r9]
→ r9]
Define index:
these tools adopt a hybrid approach that consists of three parts: Indexslicing:
Index slicing: [esi
[esi →→ r8d]
r8d]
Define index: [esi[esi → r8d]
→ r8d]
(1) The tools seek symbols remaining in the .symtab and Findbase:
Find base: [0x24e4f(rip)
[0x24e4f(rip) →→r9]r9]
index:
index: = = Base
Base guards:
guards: [N/A]
[N/A]
default Index
default Index guards:
guards: [N/A]
[N/A]
.dynsym sections to determine known-to-be-good functions Findindex:
Find index: [r8]
[r8]
( 6 ). (Only) G HIDRA considers the .eh_frame section to Memoryload:

Memory load: [(r9
[(r9 + r8*4)
+ r8*4) →→ rax]
rax]
Define
Define target:
target: [(r9[(r9 + r8*4)
+ r8*4) → rax]
→ rax]
identify functions that have unwinding information ( 7 ). As Jmp

Jmp target:
target: [rax]
[rax]
Jmptarget:
Jmp target:[rax]
[rax]
mandated by X86-64 ABI, modern compilers (e.g., LLVM and
GCC) keep unwinding information for every function. As we (a) DYNINST (b) G HIDRA
illustrate in § IV-B3, this way G HIDRA can identify nearly all Fig. 4: Jump table resolution in DYNINST and G HIDRA. The
function entries, but (surprisingly) it does not use exception upper half of Sub-Fig. 4a shows a jump table from Gold
information when handling Windows binaries. Linker. Lower half of Sub-Fig. 4a and Sub-Fig. 4b respectively
(2) All tools consider targets of direct calls to be function illustrate how DYNINST and G HIDRA resolve the jump table.
entries ( 8 ), while ANGR and G HIDRA additionally resolve
certain indirect calls to determine more function entries ( 9 ). R ADARE 2 only handles jump tables, by searching four types
Finally, DYNINST, G HIDRA, ANGR and R ADARE 2 include of patterns shown in Fig. 3 ( 20 ). Consider Type-1 jump table as
targets of tail calls as function entries (see § III-D). an example: encountering a jump table of this type, R ADARE 2
(3) All tools use pattern-based approaches to further recover searches an indirect jump in the format of jmp [base +
functions ( 18 ). G HIDRA, ANGR and R ADARE 2 find function reg * size] and a cmp/sub instruction in the preceding
entries based on common prologues (or epilogues); DYNINST basic block. On finding the two items, R ADARE 2 deems base
(by default) and BAP find function entries with pre-trained as the base address and the constant operand in cmp/sub as
decision-tree models [9]. As we will present in § IV-B3, this the upper bound of the index. If that upper bound exceeds
pattern-based approach is heavily used and indeed identifies 512, R ADARE 2 discards the jump table ( 21 ). The strategies
many functions. However, it is very sensitive to architectures, of R ADARE 2 are highly sensitive to compiling configurations
optimizations, and compiler specifics. and are less effective with resolving jump tables (§ IV-B4).
ANGR adopts an extra aggressive approach: during linear
DYNINST also only handles jump tables, using a hybrid
scan over code gaps left by recursive descent, it treats the begin approach shown in Fig. 4a. It performs backward slicing from
of each identified code piece as a new entry ( 19 ). This method the target. In the sliced area, if the first memory read (after
improves the coverage but incurs many errors (§ IV-B3). simplification) has the format of [CONST + reg * size]
( 20 ), DYNINST deems the indirect jump to be a jump table,
To sum up, the identification of function entry mostly uses
respectively using CONST and reg as the base address and
hybrid approaches, mixing algorithms and heuristics.
index. Starting from the index, DYNINST performs backwards
slicing again up to 50 assignments [35] ( 22 ) or the function
D. Algorithms and Heuristics in CFG Reconstruction
entry. In this slice, DYNINST uses a simplified Value Set
CFG reconstruction consists of many tasks. We focus on the Analysis (VSA) to gather value bounds along the flow of the
challenging ones: resolving indirect jumps/calls, detecting tail index (and its aliases) ( 10 ).
calls, and finding non-returning functions. Direct jumps/calls G HIDRA first considers an indirect jump to be a jump table
are not discussed as they are easily derivable after disassembly. and resolves it with a strategy shown in Fig. 4b. In the current
Indirect Jumps [DYNINST, G HIDRA, ANGR, R ADARE 2]: function, G HIDRA seeks a single path that defines both the
From our benchmark binaries (Table IV), we observed three base address and the index ( 20 ). Along the path, G HIDRA
types of indirect jumps: (1) jump tables [23] (compiled from tracks the propagation of the base and index to identify their
switch-case and if-else statements); (2) indirect tail value bounds. Instead of using a full VSA, G HIDRA considers
calls (indirect calls optimized as tail calls); and (3) hand- restrictions by variable types, conditional jumps, and and
written ones (e.g., longjmp and other cases in Glibc [48]). instructions. Similar to R ADARE 2, G HIDRA discards jump
tables with an index bound over 1024 ( 21 ). If G HIDRA cannot Tools take different strategies to detect tail calls. These
resolve the jump table, it will consider the indirect jump as an strategies depend critically on function entry detection, in-
indirect call and perform analysis that we will discuss shortly. heriting inaccuracies from function identification.
ANGR , given an indirect jump, considers the operand as a Non-returning Functions [ANGR, BAP, R ADARE 2,
source and runs backwards slicing. In the sliced area, ANGR DYNINST, G HIDRA]: Tools use a similar workflow for
uses full-scale VSA to identify possible targets ( 10 ). However, detecting non-returning functions. First, they gather the
the public version of ANGR restricts the slicing to at most three group of library functions or system calls that are known
levels of basic blocks ( 22 ), trading utility for efficiency. ANGR to be non-returning ( 13 ). Second, from this initial group of
also adopts heuristic 21 with a very large threshold: 100,000. functions, tools further find other ones.
In summary, tools employ various heuristics to resolve ANGR , R ADARE 2, and DYNINST use the same idea: they
indirect jumps. These heuristics, mostly derived for accuracy, scan each function and, if no ret instruction is found, they
introduce fewer errors but have limited coverage (§ IV-B4). consider the function to be non-returning ( 14 ). However, they
Indirect Calls [G HIDRA, ANGR]: G HIDRA finds targets of adopt different solutions when encountering a call to a child
indirect calls based on constant propagation ( 11 ). It tracks function that has unknown returning status: (1) ANGR simply
the intra-procedure propagation of constants from immediate assumes a fall-through after the call; (2) R ADARE 2 takes a
operands, the LEA instructions, and global memory. Once a similar strategy but when the child status is later updated,
constant flows to an indirect call, G HIDRA takes the constant R ADARE 2 recursively propagates the update to its predecessor
as a target. ANGR also uses constant propagation ( 11 ) to functions; (3) DYNINST takes a depth-first approach to first
handle indirect calls but only considers the current basic block. handle the successor functions. Only until the status of the
Tail Calls [R ADARE 2, ANGR, DYNINST, G HIDRA]: For ef- child function is determined, DYNINST will continue handling
ficiency, function calls at the end of procedures are often the original call. To our understanding, the strategies in
optimized as jumps (i.e., tail calls). Tools adopt different R ADARE 2 and DYNINST are equivalent. Both are principled
strategies to detect tail calls. and produce nearly perfect precision (§ IV-B4).
R ADARE 2 uses a simple heuristic to determine tail calls: G HIDRA also follows the idea of ( 14 ). To handle a call with
the distance between a jump and its target exceeding a certain unknown returning status, G HIDRA checks the fall-through
threshold ( 23 ). This heuristic exploits the insight that different code after the call. If the code runs into abnormal regions
functions are usually apart. However, it is hard to pick a (i.e., data, unrecognizable instructions, or another function) or
threshold that is both effective and accurate. the code is referenced by xrefs (other than jumps), G HIDRA
marks the fall-through unsafe. If the child function of that call
G HIDRA determines a jump as a tail call if the code between
incurs at least three unsafe fall-through, G HIDRA considers the
the jump and its target spans multiple functions ( 24 ). The
child to be non-returning. Further, instead of doing recursive
heuristic can lead to both false positives due to discontinuous
updates, G HIDRA simply runs the detection process twice.
functions and false negatives due to unrecognized functions.
G HIDRA further excludes conditional jumps from considera- BAP considers a function to be non-returning if all the
tion ( 25 ), preventing the detection of 21.6% of the tail calls. paths in the function end at a call to a non-returning function
( 14 ). To handle calls to child functions with unknown status,
DYNINST takes a sophisticated strategy. It considers a jump
it takes a similar recursive-updating strategy as R ADARE 2.
as a tail call if the target is the start of a known function ( 12 ).
Otherwise, DYNINST checks two rules to determine a tail call: Tools use many principled strategies to detect non-returning
(1) the jump’s target cannot be reached by only following false functions, ensuring higher correctness.
branches; (2) right before the jump, the stack is tore down
by [leave; pop $reg] or [add $rsp $const] ( 26 ). IV. L ARGE - SCALE E VALUATION
While the first rule is hard to reason, the second rule exploits
an intrinsic property of tail calls: the current function recycles A. Evaluation Setup
the stack such that the child re-uses its return address. The Benchmarks: We used the software listed in Table IV to
above pattern-based approach is less accurate, leading to 97% experimentally evaluate the tools we studied. The software
of DYNINST’s false positives and most of its false negatives. includes programs and libraries of diverse functionality and
ANGR adopts similar strategies as DYNINST. It first identi- complexity, written in C / C++, and containing hand-written
fies jumps whose targets are starts of known functions ( 12 ). assembly and hard-coded binary code. It also carries a signif-
Beyond that, ANGR requires four conditions to detect a tail icant amount of complex constructs as listed in Table XVII,
call: (1) the jump is unconditional ( 25 ); (2) the stack at the fitting the use as a benchmark. To test the effect of different
jump is tore down based on stack height analysis ( 26 ); (3) the compilers and options, we built each software package for two
target does not belong to any function or it belongs to the operating systems (Linux and Windows), using three compilers
current function ( 27 ); (4) all incoming edges of the target are (GCC-8.1.0 and LLVM-6.0.0 on Linux, and Visual Studio
unconditional jumps or direct calls ( 28 ). As we will show in 2015 on Windows), various optimization levels (O0, O2, O3,
§ IV-B4, these heuristics do not provide correctness guarantees Os, Ofast on Linux, Od, O1, O2, Ox on Windows), and two
and each can lead to both false positives and false negatives. architectures (X86 and X64). This resulted in 3,788 binaries,
TABLE IV: Software used for evaluating tools. TABLE V: Evaluation results of instruction recovery. L and
W are short for Linux and Windows. Pre and Rec means
Programs/Binaries
Type Name
Linux Windows
precision and recall. Ave/Min show the average/minimal
Benchmark SPEC CPU2006 30 / 546 15 / 120 results among all binaries. The best/worst results specific to
Utilities
Unzip-6.0 Coreutils-8.30 7-zip-19
125 / 2500 26 / 196
each optimization level are respectively marked as blue/red.
Findutils-4.4 Binutils-2.26 Tiff-4.0
Openssl-1.1.0l Putty-0.73 D8-6.4 Instructions Instructions
Filezilla-3.44.2 Busybox-1.31 L Avg Min W Avg Min
Clients 13 / 154 13 / 104
Protobuf-c-1 ZSH-5.7.1 VIM-8.1 Pre Rec Pre Rec Pre Rec Pre Rec
XML2-2.9.8 Openssh-8.0 Git-2.23 O0 99.98 99.99 88.80 99.70 Od 98.04 99.98 86.84 99.83
Objdump
Lighttpd-1.4.54 Mysqld-5.7.27 O2 99.94 99.94 84.76 99.56 O1 98.96 99.97 83.84 99.84
Servers 3 / 49 2 / 16
Nginx-1.15.0 SQLite-3.32.0 O3 99.95 99.95 85.84 93.45 O2 97.58 99.97 83.73 99.82
Glibc-2.27 libpcap-1.9.0 libv8-6.4 Os 99.95 99.97 84.30 96.45 Ox 97.57 99.97 83.73 99.82
Libraries libtiff-4.0.10 libxml2-2.9.8 6 / 79 3 / 24 Of 99.90 99.90 86.52 96.65 - - - - -
libsqlite-3.32.0 libprotobuf-c-1.3.2 O0 99.99 99.07 99.90 6.30 - - - - -
Dyninst
Total 177 / 3328 59 / 460 O2 99.99 92.45 99.46 4.15 - - - - -
O3 99.99 91.72 99.51 19.47 - - - - -
Os 99.99 89.79 99.83 5.26 - - - - -
which are less than the theoretically possible 4484 (177⇥20 Of 99.99 91.65 99.07 19.55 - - - - -
+ 59⇥16), as some cannot be built for certain configurations. O0 99.99 99.99 99.96 99.80 Od 99.79 97.49 94.82 83.20
Ghidra Ghidra-NE McSema

O2 99.99 99.97 99.88 99.11 O1 99.83 99.32 95.06 97.47
Ground Truth: We intercept the compiling and linking pro-
O3 99.99 99.83 99.90 98.64 O2 99.84 95.41 95.81 75.07
cess to automatically obtain ground truth about the produced Os 99.99 99.99 99.84 99.35 Ox 99.82 96.16 94.18 78.30
binaries, which includes instructions, functions, control trans- Of 99.99 99.83 99.85 96.49 - - - - -
fers, jump tables, xrefs, and remaining complex constructs. O0 99.99 98.03 99.60 63.26 Od 99.88 96.01 94.84 65.05
O2 99.99 80.65 99.64 2.06 O1 99.86 90.40 95.04 73.67
On Linux, we replicate the approach in CCR [60].
O3 99.99 77.83 99.69 17.77 O2 99.88 92.12 95.80 72.00
CCR extends the LLVM Machine Code layer to record Os 99.99 89.66 99.76 12.76 Ox 99.87 92.09 94.10 69.34
the needed information in each compiled bitcode/assembly Of 99.99 78.48 99.70 17.92 - - - - -
file. It also instruments the GNU Gold Linker to merge O0 99.99 99.65 99.96 64.51 Od 99.88 96.01 94.84 65.05
the information during linking. Throughout experimenta- O2 99.99 94.90 99.68 51.31 O1 99.86 90.40 95.04 73.67
O3 99.99 93.63 99.72 48.26 O2 99.88 92.12 95.80 72.00
tion, we identified several limitation of the CCR linker,
Os 99.99 95.26 99.73 16.23 Ox 99.87 92.09 94.10 69.34
which we addressed by extending it to: (1) collect informa- Of 99.99 93.73 99.73 37.78 - - - - -
tion from previously ignored sections (.text.unlikely, O0 99.99 98.79 99.96 64.88 Od 99.97 93.02 98.67 14.50
Angr-NS
.text.exit,.text.startup,.text.hot); (2) sup- O2 99.99 87.75 99.89 48.75 O1 99.97 93.42 98.10 18.94
port relocatable objects (compiled with -r); (3) record both O3 99.99 89.00 99.90 54.31 O2 99.96 90.31 98.14 17.80
Os 99.99 94.58 99.94 60.73 Ox 99.96 90.17 98.14 17.80
the size and the offset of basic blocks; (4) handle linker- Of 99.99 88.59 99.87 51.07 - - - - -
inserted functions (e.g., _start) and statically-linked Glibc O0 99.99 99.97 99.95 97.64 Od 98.96 99.92 90.01 99.37
functions; (5) support X86 targets. We also extend the CCR O2 99.99 99.98 99.79 99.23 O1 99.33 99.90 87.71 99.16
Angr
approach to GCC by instrumenting its RTL pass to label all O3 99.99 99.98 99.81 99.30 O2 98.66 99.88 87.55 99.20
Os 99.99 99.99 99.77 99.78 Ox 98.64 99.88 87.55 99.20
related constructs when producing assembly code. We also
Of 99.98 99.98 99.84 99.75 - - - - -
customized the GNU Assembler to record the locations of the O0 99.99 82.50 99.96 9.37 Od 99.98 71.44 98.97 30.24
constructs and xrefs in the emitted object files. Finally, we O2 99.98 66.88 99.70 9.81 O1 99.96 78.95 98.54 50.87
Bap
re-use the CCR linker to merge object files. O3 99.99 65.64 99.69 11.73 O2 99.90 70.49 97.48 42.41
Os 99.98 76.92 97.20 9.16 Ox 99.90 70.47 97.45 38.75
On Windows, we combine compiler options, symbol/debug
Of 99.94 66.62 95.56 12.64 - - - - -
information, and lightweight manual analysis to build ground O0 99.98 87.92 94.51 44.59 Od 99.21 86.25 46.59 30.00
truth. Details are covered in Appendix B.
Radare2
O2 99.99 78.80 99.29 8.55 O1 99.82 78.60 92.87 46.21

Tools and Configurations: Besides open-source tools, we also O3 99.98 75.32 99.24 9.23 O2 98.50 75.78 42.78 44.73
Os 99.97 85.31 91.09 8.59 Ox 98.48 75.60 45.74 38.44
test two commercial tools, IDA P RO-7.4 and B INARY N INJA-
Of 99.99 76.50 99.24 8.90 - - - - -
1.2. We detail the configuration of all tools in Appendix C.
O0 99.99 99.99 99.61 99.34 Od 99.80 99.98 94.83 99.54
It should be noted that we run two versions of G HIDRA and O2 99.99 99.95 99.76 98.15 O1 99.82 99.96 95.08 99.39
IDA
ANGR , including a version of G HIDRA not using exception O3 99.99 99.95 99.87 98.03 O2 99.84 99.89 95.83 96.44
information, namely GHIDRA-NE, and a version of ANGR Os 99.99 94.61 99.93 57.31 Ox 99.82 99.90 94.20 96.66
without linear scan, namely ANGR-NS. Of 99.99 93.04 99.88 51.62 - - - - -
O0 99.99 99.81 99.72 89.85 Od 99.61 99.23 94.84 96.61
O2 99.99 98.79 99.50 72.01 O1 99.76 99.74 95.14 96.04
Ninja
B. Evaluation Results & Analysis O3 99.98 97.38 99.56 71.77 O2 99.51 98.96 95.93 93.26
Os 99.99 99.46 99.37 77.51 Ox 99.52 99.22 94.32 96.94
1) Disassembly : This evaluation measures the recovery of Of 99.98 97.49 99.65 71.79 - - - - -
legitimate instructions. We excluded all the padding bytes and
linker-inserted functions (e.g., _start). We also inserted a Overall Performance: In Table V, we show the overall results
symbol of main so that all recursive tools can find it. of instruction recovery. Note that we do not show the results
TABLE VI: Statistics of false positives in instruction recovery.
of PSI and U ROBOROS as they are close to OBJDUMP. Pad, data, Func, Non-Ret, Jump-Tbl respectively represent
In general, the results of disassembly vary across tool errors due to padding, data-in-code, wrong function matching,
categories. Linear tools and linear-sweep aided tools, like non-returning functions, and wrong jump tables.
OBJDUMP and ANGR , have every high coverage (99.95%+ recall).
Recursive tools have lower recovery rates and some can only Percentage of False Positives (%)
Tools
recover less than 80% of the instructions (BAP and R ADARE 2). Pad Data Func Non-Ret Jump-Tbl Other
We also notice that the performance of recursive tools changes Objdump 71.7 28.3 0.0 0.0 0.0 0.0
across optimization levels and architectures. In particular, Dyninst 0.0 0.0 36.2 39.9 23.8 0.0
Ghidra 0.0 0.0 18.2 65.4 3.0 13.4
nearly all recursive tools (ANGR-NS, G HIDRA-NE, DYNINST, BAP,
Ghidra-NE 0.0 0.0 6.4 61.7 5.9 26.0
R ADARE 2) have reduced recovery as the optimization increases
Angr 10.5 10.3 76.3 2.9 0.0 0.0
and when analyzing X64 targets. This is because that opti- Angr-NS 0.0 0.0 70.2 29.1 0.7 0.0
mization levels and architectures affect the function matching Bap 0.0 0.0 89.0 0.8 0.0 10.2
in recursive tools, further leading to missing of instructions. Radare2 0.0 0.0 96.5 2.2 1.3 0.0
Such results well comply with previous observations [5].
In the aspect of precision, we have an opposite observation. aids ANGR to recover 8.20% more instructions. Third, the
Recursive tools have high precision (over 99.5%), regardless use of xrefs in GHIDRA leads to a 4.33% increase in code
of the compiler, architecture, and optimization level. Linear coverage, as shown by the result of G HIDRA on Linux binaries
tools are less precise. The precision of OBJDUMP, in the worst in Table V and XVIII
case, drops to around 85%. This difference is mainly because Understanding of Errors: In Table VI, we summarize the
recursive tools mostly follow the control flow, ensuring cor- statistic of false positives. For linear tools (e.g., OBJDUMP), all
rectness. However, linear tools scan every byte and often run the false positives are caused by misidentifying padding bytes
into errors when data appears in code. For instance, OBJDUMP or data-in-code as code. For recursive tools, the most common
produces the worst result in analyzing Openssl (precision: reasons of errors include (1) considering illegal locations
85.35%) because Openssl has a lot of data in assembly files as function entries; (2) missing non-returning functions and
and OBJDUMP wrongly identifies the data as code. assuming the calls to them fall through; (3) incorrect resolution
Use of Heuristics: To augment the correctness of linear of jump tables. Beyond that, ANGR’s linear sweep incurs 21%
sweep, PSI introduces error detection and handling. In our of the errors due to data in code; BAP and G HIDRA have a
evaluation, PSI can analyze 971 of the X86 binaries on Linux few implementation flaws, also leading to a group of errors.
systems. On the binaries, the linear sweep produces over 10K The reasons of false negatives are consistent. All the false
errors. PSI captures 26% of the errors leading to bad opcode negatives by linear tools are side-effects of false positives. For
and other 6% resulting in invalid control transfers. However, recursive tools, most false negatives are caused by undetected
the public version of PSI has strict requirement of padding function: as shown in Table X, recursive tools averagely miss
patterns, preventing it from fixing the errors. In summary, 25.0% of the functions. The remaining instructions are missed
PSI’s heuristic can capture 32% of the errors in linear sweep. mainly because certain jump tables are not resolved and false
However, we are unable to report the detectable errors due to positives over-run the legitimate instructions.
PSI’s implementation restrictions. 2) Symbolization: In this evaluation, we measure the re-
In contrast, heuristics in recursive tools are mostly for covery of xrefs. Xrefs in direct jumps/calls are trivial to
coverage enhancement. We measure the effectiveness of each identify so we excluded them. We also excluded xrefs in
heuristic in turn. We start with pure recursive descent by wrong instructions since such errors are rooted from incorrect
disabling function matching, linear scan in ANGR, xref aided disassembly. Finally, we did not consider jump tables as they
disassembly in G HIDRA and R ADARE 2. The results are shown are separately measured in § IV-B4.
in Table XVIII in Appendix D. Unsurprisingly, all tools have Overall Performance: We summarize the overall performance
nearly perfect precision. However, without heuristics, the tools of symbolization in Table VII. In general, open-source tools
have very low recall. ANGR, G HIDRA, DYNINST all have a have much higher recovery rates (98.35% on average) than com-
recall around 51% and R ADARE 2 recovers no more than 10% mercial tools (88.63% on average). The difference is even bigger
of the code. We noted that G HIDRA still produces high recall with the worst case. This is because open-source tools brute
with Linux binaries. This is because G HIDRA uses exception force all constant operands and data units while the commer-
information to highly accurately identify functions (§ III-C). cial tools adopt more conservative strategies. Our hypothesis
On top of pure recursive descent, we in turn enable function can be verified by comparing M C S EMA and IDA P RO, since
matching, linear scan, and the use of xrefs. With function M C S EMA just adds a round of brute force on top of IDA P RO.
matching, recursive tools recover significantly more functions Somewhat surprisingly, open-source tools also have high
and code (see Table XIX in Appendix D). In particular, ANGR- precision (99.92% on average). Based on our observation, the
NS and DYNINST respectively identify 21.30% and 18.24% high precision is because (1) the heuristic-based checks are
more functions, leading to recovery of 36.44% and 26.65% generally restrictive and (2) most benchmark programs have
more code (comparing Table V and XVIII). Second, according less data. On programs that have more data, the tools are more
to the results of ANGR and ANGR - NS in Table V, linear scan inclined to make mistakes (e.g., Mysqld, having plenty of
TABLE VII: Evaluation results of symbolization. U ROBOROS TABLE VIII: Statistics of false positives in symbolization.
cannot run Linux-Of binaries and all Window binaries. We Align, Type, Type-based Sliding, and Extended-data respec-
also omitted ANGR’s symbolization on Windows binaries as tively mean no alignment (heuristic 11), ANGR’s moving
ANGR ’s Reassembler component cannot run them. scheme (heuristic 16), enlarging data regions (heuristic 13).
Symbolizations Symbolizations Percentage of False Positives (%)
L Avg Min W Avg Min Tools
Align Type-based Sliding Extended-data Collision
Pre Rec Pre Rec Pre Rec Pre Rec Urobo* 0.00 0.00 0.00 100
O0 99.99 100 99.89 100 - - - - -
Mcsema 0.00 0.00 0.00 100
*Urobo
O2 100 100 100 100 - - - - -

Ghidra 0.29 0.00 3.91 95.80
O3 99.99 100 99.91 100 - - - - -
- - - - -
Angr 58.58 4.47 0.01 36.94
Os 99.99 100 99.88 100
Of - - - - - - - - -
O0 99.99 99.77 99.04 94.05 Od 99.93 96.05 94.79 78.22
TABLE IX: Statistics of false negatives in symbolization.
Align, Type, Extended-data, Function, Address-table respec-
McSema
O2 99.99 99.96 98.78 97.87 O1 99.96 96.08 98.00 84.80

O3 99.99 99.96 98.88 98.79 O2 99.97 95.38 98.70 86.92 tively mean assumption of alignment (heuristic 10), preferring
Os 99.98 99.75 98.68 90.91 Ox 99.95 95.51 96.54 83.42 strings over pointers (heuristic 15), enlarging data regions
Of 99.96 99.68 93.10 81.82 - - - - -
(heuristic 13), and size of address table (heuristic 14).
O0 99.90 95.70 61.99 45.52 Od 99.70 96.53 94.75 78.12
Ghidra
O2 99.91 95.53 87.35 37.71 O1 99.63 95.02 95.59 77.86

Percentage of False Negative (%)
O3 99.92 95.99 88.36 41.60 O2 99.86 95.25 98.67 74.54 Tools
Align Type Extended-data Address Table Func
Os 99.89 95.37 61.98 41.72 Ox 99.83 95.30 96.45 74.52
Of 99.97 96.26 98.19 20.00 - - - - - Urobo* 0.00 0.00 0.00 0.00 0.00
O0 99.86 99.52 81.74 95.38 - - - - - Mcsema 1.14 75.22 23.64 0.00 0.00
O2 99.88 99.54 87.28 95.35 - - - - - Ghidra 0.01 3.13 0.00 96.53 0.33
Angr
O3 99.90 99.59 88.29 96.21 - - - - - Angr 0.00 100 0.00 0.00 0.00
Os 99.81 99.99 86.57 98.96 - - - - -
Of 99.84 99.99 88.05 98.89 - - - - -
positives in ANGR and G HIDRA. Heuristic 14 introduces most
O0 99.98 95.96 97.17 36.96 Od 99.99 95.76 99.90 82.51
false negatives in G HIDRA, although removing a small set of
O2 99.93 94.32 86.77 31.96 O1 99.99 94.11 99.88 72.65
false positives. Heuristic 15 produces over 3K false negatives
IDA
O3 99.94 95.06 88.38 31.96 O2 99.99 94.26 99.88 69.94

Os 99.93 93.81 85.59 25.26 Ox 99.99 94.39 99.89 70.48 in M C S EMA because the inference of strings is not accurate.
Of 99.98 95.63 97.52 32.02 - - - - - Understanding of Errors: Table VIII shows the statistics of
O0 99.99 79.89 99.81 17.29 Od 99.90 85.37 93.62 40.94
false positives. Using the assumption that pointers in data can
O2 99.99 81.08 97.59 19.01 O1 99.98 84.67 99.61 45.18
Ninja
O3 99.99 81.38 99.40 21.15 O2 99.98 84.54 99.63 50.17 be non-aligned, ANGR and G HIDRA trigger 59% and 0.3% of
Os 99.99 77.71 98.67 17.45 Ox 99.97 84.98 99.44 50.14 their false positives. They also produce false positives because
Of 99.99 82.35 99.52 21.15 - - - - - they enlarge the boundaries of data regions when checking the
legitimacy of xrefs’ targets. All the other false positives are
data, leads to the lowest precision in many tools). due to collisions between numeric values and pointers.
Use of Heuristics: As shown in Table III, symbolization in- As shown in Table IX, most false negatives by ANGR and
volves many heuristics, competing for a coverage-correctness M C S EMA are because they exclude pointers that overlap with
trade-off. Our evaluation shows that heuristic 8 (the brute-force the inferred strings. M C S EMA also misses 23.64% of the xrefs
based approach), 11 (the no-alignment assumption about pointers), and that point to locations outside of the data regions. G HIDRA,
13 (the enlargement of data regions) bring full coverage of xrefs in because of its assumption about the address table’s minimal
our benchmarks. The other heuristics are instead striving for size and that code pointers always point to function entries,
correctness. In the following, we discuss them in turn. respectively produces 96.53% and 0.33% of its false negatives.
Heuristic 8 (the brute-force based approach) is necessary for 3) Function Entry Identification : This evaluation measures
symbolization. This can be verified by comparing IDA P RO the identification of function entries. In this test, we further
and M C S EMA. M C S EMA adds a round of brute force to IDA considered N UCLEUS [6] and B YTE W EIGHT [9]. We re-
P RO, increasing the coverage from 95% to 98%. Heuristic trained B YTE W EIGHT with our benchmarks binaries.
9 (pointers in data have machine size) is principled if jump tables Overall Performance: Table X presents the overall results.
are not considered. From over 6 million xrefs, we observe no The key observation is that function entry identification re-
violations. Heuristic 10 (pointers in data are aligned) and 11 (pointers mains a challenge. 4 of the open source tools can only identify
in data may not be aligned) conflict with each other, and neither of less than 80% of the functions. In particular, R ADARE 2 has a
them is perfect. Heuristic 10 misses around 600 xrefs while recall lower than 66%. Such results indicate, even with heuris-
heuristic 11 introduces most false positives in ANGR (over tics, we have yet to develop better function identification. We
50K). Heuristic 12 (references to code point to function entries) can also observe that the results of function identification varies
reduce false positives but it misses thousands of xrefs to the across optimization levels and architectures. This is because
middle of functions (e.g., pointers for try-catch in exception the tools widely use signature-based function matching, which
handling). Heuristic 13 (enlarging boundaries of data regions) helps are specific to optimizations and architectures. Besides limited
recover 12K xrefs in data, but leads to 6 and 2K+ false coverage, existing tools also have lower precision in function
TABLE X: Evaluation results of function entry identification. TABLE XI: Statistics of false positives in function detection.
Mismatch, J-Tab, Scan, T-Call, Mis-disa respectively mean
Function Entry Function Entry
L Avg Min W Avg Min wrong pattern matching, wrong handling of jump tables,
Pre Rec Pre Rec Pre Rec Pre Rec wrong detection of tail calls, and wrong disassembly.
O0 99.65 97.11 82.76 2.07 - - - - -
Percentage of False Positives (%)
Ghidra Ghidra-NE Dyninst
O2 93.92 75.26 48.95 12.37 - - - - - Tools

O3 93.37 74.12 48.95 15.95 - - - - - Mismatch J-Tab Scan T-Call Mis-disa Other
Os 98.98 50.46 66.67 4.00 - - - - - Dyninst 80.80 0.00 0.00 19.20 0.00 0.00
Of 98.70 42.76 50.00 6.25 - - - - - Ghidra 0.01 0.00 0.00 0.05 0.00 99.94
O0 99.92 97.29 94.59 45.30 Od 98.68 83.41 66.42 40.58 Ghidra-NE 28.38 0.00 0.00 71.61 0.01 0.00
O2 99.40 48.46 88.89 10.00 O1 98.45 69.78 61.41 40.62 Angr 9.89 0.00 78.41 11.70 0.00 0.00
O3 99.30 47.52 88.89 10.26 O2 98.48 70.00 61.55 40.59 Angr-NS 92.98 0.00 0.00 7.01 0.01 0.00
Os 98.85 69.97 77.78 10.50 Ox 98.43 70.93 55.53 40.32
Bap 99.99 0.00 0.00 0.00 0.01 0.00
Of 98.32 49.23 81.82 11.02 - - - - -
Radare2 6.83 0.00 0.00 0.00 0.01 93.17
O0 99.94 99.66 86.82 55.11 Od 98.68 83.41 66.42 40.58
O2 97.25 87.92 71.92 16.94 O1 98.45 69.78 61.41 40.62
O3 96.48 87.59 71.30 20.00 O2 98.48 70.00 61.55 40.59
TABLE XII: Statistics of false negatives in function detection.
Os 99.22 88.90 90.00 12.60 Ox 98.43 70.93 55.53 40.32 J-Tab, T-Call, Non-Ret, and FP-Overlap respectively represent
Of 95.90 88.31 71.27 20.69 - - - - - missing jump table targets, missing tail calls, missing non-
O0 97.47 96.92 45.65 60.00 Od 79.19 78.18 27.07 10.63 returning functions, and side effects of other false positives.
Angr-NS
O2 86.37 68.08 44.74 17.01 O1 77.86 69.36 27.04 10.90

O3 85.23 68.35 42.15 17.00 O2 76.04 66.60 24.03 10.71 Percentage of False Negatives (%)
Os 89.59 72.36 41.59 19.55 Ox 75.87 66.23 27.56 10.71 Tools
J-Tab T-Call Non-Ret FP-Overlap No-Match
Of 84.98 69.99 41.59 16.99 - - - - -
Dyninst 0.03 1.68 2.07 10.16 86.06
O0 91.44 98.29 22.13 80.00 Od 20.31 99.47 7.01 94.83
Ghidra 0.02 0.05 1.48 12.42 86.03
O2 74.67 90.50 8.53 48.38 O1 44.99 99.23 7.01 55.04
Angr
O3 74.27 90.35 16.71 45.25 O2 23.21 99.07 10.79 50.78 Ghidra-NE 0.01 0.18 14.16 0.84 84.81
Os 87.30 97.18 12.45 70.80 Ox 21.05 99.11 6.94 50.77 Angr 0.08 13.05 8.25 73.61 5.01
Of 72.76 90.42 16.90 45.25 - - - - - Angr-NS 1.18 1.06 4.01 32.78 60.97
O0 96.36 85.55 33.59 33.43 Od 88.24 67.14 7.97 30.14 Bap 1.32 2.42 7.99 11.82 76.45
O2 78.62 57.64 3.30 24.47 O1 87.18 59.03 65.54 29.43 Radare2 1.12 2.78 3.64 5.52 86.94
Bap
O3 77.40 57.22 3.06 24.30 O2 90.68 57.85 74.56 29.94

Os 88.25 65.44 8.78 25.97 Ox 90.62 57.13 74.55 23.71
Of 77.63 58.11 4.75 24.33 - - - - -
ular, B INARY N INJA can identify over 97% of the functions
O0 98.71 97.01 80.01 60.02 Od 97.27 77.74 7.96 30.24 with a precision over 95%. Second, N UCLEUS achieves com-
Radare2
O2 97.39 51.90 20.86 23.99 O1 98.05 64.69 93.08 29.63 parable coverage and precision to B INARY N INJA, showing a
O3 97.12 49.66 21.58 23.81 O2 97.26 62.26 92.54 30.02 high promise of its CFG-connectivity based solution. Based on
Os 98.39 71.86 34.64 25.34 Ox 97.45 61.64 93.08 24.53
an official blog [84], B INARY N INJA incorporates N UCLEUS
Of 97.52 51.44 21.87 23.84 - - - - -
for CFG and function detection. Third, B YTEWEIGHT outper-
O0 99.47 92.64 87.34 36.28 Od 99.10 90.45 95.49 61.48
O2 98.50 71.50 53.25 29.74 O1 98.04 83.61 2.35 3.83
forms BAP despite BAP internally runs B YTEWEIGHT. This
IDA
O3 98.43 72.49 51.41 29.61 O2 98.85 83.97 95.52 61.47 is because BAP uses pre-trained signatures but B YTEWEIGHT
Os 98.35 81.97 74.39 31.49 Ox 98.86 83.02 95.48 57.79 uses signatures trained with our benchmark binaries.
Of 98.40 78.05 39.93 29.64 - - - - - Use of Heuristics: In function entry identification, two major
O0 97.24 99.78 38.75 85.88 Od 96.09 99.07 59.66 94.25
O2 93.83 94.55 32.68 69.33 O1 96.29 98.02 2.20 3.83
heuristics are used. The first heuristic searches function entries
Ninja
O3 93.85 95.20 34.65 68.82 O2 96.00 98.45 63.50 82.55 using common prologues/data-mining models. We summarize
Os 96.40 98.63 39.64 73.38 Ox 95.99 98.83 55.89 94.25 the contribution and accuracy of the heuristic in Table XIX in
Of 93.74 95.30 36.66 68.83 - - - - - Appendix D. Without counting functions recursively reachable
O0 97.41 98.29 53.58 80.00 Od 88.33 97.63 40.78 94.44 from the matched ones, this heuristic recovers 17.36% more
ByteWeight Nucleus
O2 94.99 90.49 48.11 66.79 O1 88.53 96.60 40.78 93.46

functions with an average precision of 77.53%. Also observ-
O3 95.15 90.76 48.11 66.21 O2 85.34 96.48 40.73 94.00
Os 96.80 91.13 49.86 69.20 Ox 85.53 96.79 40.78 94.26
able is that utility of this heuristic varies across optimization
Of 94.31 90.44 46.53 66.14 - - - - - levels and architectures. Moreover, existing tools use different
O0 99.73 99.35 97.33 97.99 Od 95.62 92.25 49.58 52.43 patterns or data-mining models, competing for a coverage-
O2 97.54 96.09 78.15 83.61 O1 95.87 95.78 73.99 63.50 accuracy trade-off. Comparing to G HIDRA-NE and R ADARE 2,
O3 97.94 97.33 77.67 89.91 O2 97.98 96.34 74.81 80.68
ANGR -NS and DYNINST use more aggressive patterns/models,
Os 97.83 96.86 85.38 76.83 Ox 97.99 95.33 74.23 72.32
Of 97.74 96.82 87.12 84.47 - - - - - producing higher coverage (21.3%/24.02 v.s. 18.24%/7.93%) but
lower precision (56.61%/85.37% v.s. 98.42%/87.29%).
The other heuristic, used by ANGR, takes the begin of
identification, comparing to instruction recovery and symbol- each code region detected by linear scan as a function entry.
ization. For instance, ANGR and BAP respectively have an The heuristic recovers 23% more functions but reduces the
average precision of 56.67% and 86.11%. precision by 26.96% because it often considers the begin of
We also notice three other interesting observations. First, padding or data-in-code as a function entry.
commercial tools do better than open source ones. In partic- Understanding of Errors: Table XI and XII present the
TABLE XIII: Results of CFG reconstruction on Linux. Edge,
analysis of false positives and false negative (the analysis of CG, T-Call, N-Ret, and J-Tab respectively mean edges, call
G HIDRA is only done on binaries with exception information). graphs, tail calls, non-returning functions, and jump tables.
For false positives, there are three common causes. First,
the signature-based detection wrongly matches function entries Edge CG T-Call N-Ret J-Tab
L Pre Rec Pre Rec Pre Rec Pre Rec Pre Rec
(FP ratio - DYNINST: 80.80%, G HIDRA-NE: 28.38%, ANGR-NS: 92.98%,
O0 97.28 97.10 99.99 99.01 81.91 71.15 100.0 86.89 99.97 98.82
BAP: 99.99%). Second, inaccurate tail call detection takes target
Dyninst
O2 98.61 92.41 99.99 91.17 68.75 78.84 100.0 86.04 99.87 99.05
of a regular jump as a function entry (FP ratio - DYNINST: 19.20%, O3 98.88 91.58 99.99 90.34 59.64 71.12 100.0 86.20 99.81 99.10
G HIDRA-NE: 71.61, ANGR: 11.70%). Third, incorrect disassembly re- Os 98.93 90.14 99.99 86.92 67.74 68.79 100.0 86.33 99.89 98.50
sults in erroneous call instructions to incorrect target functions Of 98.87 91.33 99.99 90.26 58.93 68.60 100.0 85.08 99.93 98.75
O0 99.63 97.41 99.99 96.66 90.98 79.42 100.0 50.19 98.49 75.58
(FP ratio - G HIDRA-NE: 0.01%, ANGR-NS: 0.01%, R ADARE 2: 0.01%).
Ghidra
O2 99.89 89.61 99.99 88.91 95.93 97.91 99.98 71.21 97.46 97.55
Beyond the three causes, ANGR produces 78.41% of its false O3 99.89 89.30 99.99 88.59 92.75 96.10 100.0 67.15 91.44 97.71
positives because it considers code discovered by its linear Os 99.88 90.12 99.99 88.88 99.01 99.02 100.0 59.74 99.03 91.53
scan as a function entry; G HIDRA generates 99.94% of its false Of 99.90 89.47 99.99 88.37 92.65 94.74 100.0 69.59 91.70 97.81
positives because exception information also carries pointers O0 98.41 98.13 99.99 99.93 34.09 57.80 97.62 81.25 99.80 87.34
O2 94.88 94.95 99.99 99.98 91.20 77.46 89.57 80.48 89.21 70.07
to middle of functions; R ADARE 2, aggressively inferring code
Angr
O3 95.23 95.16 99.99 99.98 87.46 80.97 90.48 80.72 89.12 73.42
pointers based on xrefs, brings 93.17% of its false positives. Os 95.53 96.44 99.99 99.95 84.37 96.05 93.04 83.05 98.62 81.07
False negatives also have a group of similar causes. First, Of 95.12 95.10 99.99 99.95 87.34 79.13 88.98 78.28 90.23 73.22
tools can miss targets of jump tables and fail to identify O0 95.31 76.12 99.99 85.00 - - 100.0 75.61 - -
O2 91.76 55.48 99.99 78.05 - - 99.18 76.66 - -
functions called by the target code or their successors (FN
Bap
O3 92.00 55.97 99.99 77.69 - - 99.12 71.40 - -
ratio - ANGR-NS: 1.18%, BAP: 1.32%, R ADARE 2: 1.12%). Jump tables Os 94.03 72.00 99.99 80.42 - - 100.0 79.37 - -
have smaller impacts to ANGR and DYNINST because ANGR Of 91.95 56.54 99.99 77.84 - - 99.24 68.66 - -
uses linear scan to compensate jump tables and DYNINST has O0 98.47 83.29 99.99 92.67 - - 83.45 83.51 55.74 43.14
high coverage of jump tables. Second, tools cannot recognize Radare2
O2 98.07 85.75 99.99 81.02 - - 96.93 79.80 98.11 97.78
O3 98.23 83.09 99.99 76.96 - - 96.85 79.25 98.52 98.11
many tail calls and hence, miss the functions indicated by their
Os 98.67 80.01 99.99 86.99 - - 96.63 79.56 82.53 76.24
targets (FN ratio - ANGR: 13.05%, BAP: 2.42%, R ADARE 2: 2.78%). Of 98.42 86.97 99.99 77.79 - - 97.03 76.25 98.51 97.26
Third, missed non-returning functions prevent the detection of O0 99.38 97.39 99.99 97.60 93.05 90.82 100.0 89.18 99.95 99.99
many function entries in cases like Listing 8 in Appendix G O2 99.32 92.74 99.99 90.35 72.57 87.79 100.0 89.32 99.88 99.59
IDA
(FN ratio - G HIDRA-NE: 14.16%, ANGR: 8.25%, ANGR-NS: 4.01%, BAP: O3 99.36 92.88 99.99 89.89 93.65 92.74 100.0 89.11 99.89 99.71
7.99%). Fourth, wrongly identified functions can over-lap with Os 99.34 95.78 99.99 95.12 75.62 85.83 99.98 90.67 99.61 99.92
Of 99.33 93.99 99.99 92.07 73.62 81.82 100.0 87.86 99.68 99.75
true functions, making the true ones un-recognizable (FN ratio O0 99.99 99.46 99.99 99.75 54.98 87.69 100.0 86.28 99.60 99.30
- DYNINST: 10.16%, G HIDRA: 12.42%, ANGR: 73.61%, ANGR-NS: 32.78%). O2 99.32 98.27 99.99 99.27 83.14 94.85 100.0 86.78 98.91 95.81
Ninja
All the other functions are missed because they are neither O3 99.20 94.97 99.99 98.89 84.18 93.37 100.0 86.84 98.67 89.87
reached by recursive descent nor matched by patterns (FN Os 99.58 99.05 99.99 99.55 87.24 96.91 100.0 90.13 99.02 96.36
Of 99.39 95.87 99.99 98.85 84.79 90.93 100.0 86.65 98.77 90.26
ratio - ANGR-NS: 60.97%, G HIDRA: 86.03%, R ADARE 2: 86.94%).
4) CFG Reconstruction : In this part, we measure 5 targets:
(1) intra-procedure edges between basic blocks; (2) call graphs comparison to open-source tools, commercial tools have both
for direct calls; (3) indirect jumps and indirect calls; (4) tails higher coverage (96.5% v.s. 84.8%) and accuracy (99% v.s. 92.96%).
calls; (5) non-returning functions. For task (1), we exclude the Besides jump tables, there are two more types of indirect
edge between a call and the fall through code. For jump tables jumps: handle-written assembly code (336 cases like [48]) and
in task (3), we consider a case with no targets resolved as a indirect tail calls (35,087 cases). For the first type, G HIDRA,
false negative. All other cases are counted as false positives. by analysing constant propagation, resolves 96 cases but only
For task (4), we exclude recursive calls and we count unique report one target in each case. B INARY N INJA generates results
target functions instead of jumps. for 120 cases, however, with incorrect targets.
Overall Performance: Table XIII and XIV show the results of Third, ANGR, G HIDRA, IDA P RO, and B INARY N INJA have
CFG reconstruction on Linux binaries and Windows binaries. (limited) supports of indirect calls. ANGR resolves 43 cases
First, the tools can recover most of the edges with high but only reports one target in each case. Our manual analysis
accuracy. DYNINST, G HIDRA, and ANGR find over 90% of verified the targets are correct, all following the pattern of
the edges with an accuracy higher than 95%. Moreover, the [mov/lea CONST, reg; ...; call reg]. G HIDRA
recovery of edges highly correlates to the recovery of instruc- finds an incomplete set of targets for 88,078 indirect calls (only
tions: precision and recall in the two tasks are consistent. The one target for 87,947 cases). IDA P RO reports results for 15,325
results of call graphs are similar to, so we omit the details. indirect calls (only one target for 15,267 cases). The other 58 cases
Second, tools have different capabilities of handling jump follow the format of Listing 9 in Appendix G, which are fully
tables. On average, G HIDRA and DYNINST can resolve over solved by IDA P RO. B INARY N INJA reports targets for 3,410
93% of the jump tables with an accuracy of around 90%. indirect calls (1 target for 92 cases; 2 targets for 1,865 cases; 3 targets for
R ADARE 2 and ANGR have a similar coverage rate (around 75%) 469 cases; 4 targets for 889 targets; 4+ targets for 95 cases).
while ANGR has much higher accuracy (96.27% v.s. 90%). In Fourth, existing tools are not perfect with detecting tail
TABLE XIV: Results of CFG reconstruction on Windows. TABLE XV: Statistics of false positives in CFG-edge recovery.
Non-Ret, Func, T-Call, Inst, No-split, Jump-Tab respectively
Edge CG T-Call N-Ret J-Tab
W Pre Rec Pre Rec Pre Rec Pre Rec Pre Rec
represent undetected non-returning functions, unrecognized
Od 96.66 87.09 99.88 90.84 88.50 80.92 100.0 11.23 70.84 95.46 functions, unidentified tail calls, wrong instructions, missing
Ghidra
O1 98.15 82.78 99.91 85.24 85.64 88.41 100.0 7.37 77.30 97.79 edges to middle of basic blocks, and wrong jump table targets.
O2 96.58 83.80 99.86 87.61 83.03 85.65 99.44 7.56 72.89 94.62
Ox 96.56 84.19 99.83 88.48 83.44 87.85 99.44 7.51 72.91 94.96 Percentage of False Positives (%)
Tools
Od 95.33 96.86 99.87 99.99 55.33 80.21 93.14 16.40 99.65 71.22 Non-Ret Func T-Call Inst No-split J-Tab
Angr
O1 96.43 97.80 99.89 99.99 97.30 91.87 96.67 35.85 99.82 73.65 Dyninst 23.09 12.25 23.28 4.95 35.24 1.19
O2 95.72 96.13 99.88 99.98 97.48 92.52 97.04 41.41 99.99 59.71 Ghidra 8.76 1.88 13.00 0.11 75.13 1.12
Ox 95.58 96.12 99.86 99.98 96.61 92.48 97.11 42.05 99.99 60.03 Angr 16.69 38.42 10.82 4.73 27.02 2.32
Od 96.05 70.89 99.29 73.92 - - 0.00 0.00 - -
Bap 37.95 22.30 1.34 0.34 38.06 0
O1 97.00 80.00 99.99 80.87 - - 100.0 1.52 - -
Bap
Radare2 73.32 3.90 15.55 0.67 4.23 2.34

O2 96.43 70.22 99.99 72.59 - - 100.0 6.64 - -
Ox 96.40 72.77 99.99 69.23 - - 100.0 6.16 - -
Od 97.74 85.12 99.11 86.83 - - 100.0 6.30 96.24 84.29 parts of non-continuous functions, producing over 70K false
Radare2
O1 98.97 80.03 99.90 81.14 - - 100.0 6.43 88.89 71.56

O2 98.07 79.59 99.91 80.79 - - 100.0 6.56 95.29 64.49
positives. ANGR and DYNINST use control-flow and stack-
Ox 98.03 80.71 99.90 81.26 - - 100.0 6.55 95.16 64.03 height based heuristics, detecting 4.24% and 6.99% more tail
Od 96.98 95.77 99.87 98.00 96.25 77.06 100.0 53.89 99.43 100.0 calls. However, the heuristics are neither sound nor complete,
O1 98.36 95.79 99.87 97.41 90.95 81.80 100.0 58.09 99.66 99.81 producing both false negatives and false positives.
IDA
O2 97.12 93.92 99.85 97.07 88.18 83.34 100.0 63.38 99.91 99.85 Understanding of Errors: Table XV presents the analysis
Ox 97.11 93.73 99.84 96.88 88.32 81.99 100.0 63.19 99.66 99.81
Od 98.63 97.39 99.90 99.75 84.89 80.37 100.0 38.66 96.97 93.09
of false positives in edge recovery. The false negatives of
edge recovery are mostly caused by non-recovered instructions
Ninja
O1 99.20 98.60 99.80 99.83 95.31 91.89 100.0 47.30 99.61 90.22
O2 98.59 96.63 99.88 98.93 95.49 92.76 100.0 48.67 96.49 91.49 (see § IV-B1). We omit the details. For call graphs, the false
Ox 98.59 96.82 99.86 99.10 95.61 92.83 100.0 48.43 96.50 91.55 positives and false negatives are generally side effects of errors
in disassembly. Details are also discussed in § IV-B1.
calls. The recovery rate ranges from 71.7% (DYNINST) to The errors related to jump tables are case specific and
91.28% (B INARY N INJA); the precision varies from 67.39% tool specific. We randomly sample 10% false negatives and
(DYNINST) to 90.21% (G HIDRA). We observe that many tools 10% false positives from each tool and do manual analysis
(ANGR, G HIDRA, B INARY N INJA) has the lowest precision to understand the reasons. The false negatives vary across
with optimization disabled. This is because less tail calls are tools. R ADARE 2, G HIDRA, and DYNINST rely on patterns
emitted at lower optimization levels and a few false positives to detect jump tables, leading to 64.1%, 31.4%, and 37.72%
can result in a low precision. of their false negatives. Further, R ADARE 2 incurs 35.9% of
Finally, tools can detect non-returning functions with very its false negatives because it discards jump tables with over
high precision (nearly 100% for many tools). However, they 512 entries; G HIDRA produces 62.63%, 5.50%, and 0.47% of
have limited coverage, especially on Windows binaries. Based its false negatives, respectively because it does not consider
on our analysis, the root cause of the low coverage is incom- sub instructions in VSA analysis, does not capture the correct
plete collection of non-returning library functions. restrictions to the index, and discards jump tables with more
Use of Heuristics: The recovery of edges and call graphs than 1,024 entries; ANGR generates false negatives because of
share heuristics in disassembly so we omit the details. no VSA results (67.86%), wrong VSA results leading to over
The resolving of jump tables uses three heuristics. First, 100,000 entries (14.29%), and no handling of sbb instructions
R ADARE 2, G HIDRA, and DYNINST use patterns to detect (17.86%); DYNINST’s remaining false negatives are caused
jump tables, identifying 85.53%, 98.01%, and 99.56% of all by its restriction of slicing scope (2.1%) and no handling
jump tables. Second, DYNINST and ANGR restrict the scope of get_pc_thunk (60.18%). Beyond the above reasons, we
of slicing in VSA, missing the index bound in many jump also observe 6 cases where the index is unrestricted (e.g.,
tables. By enlarging the scope of slicing to 500 assignments Listing 10), leading to false negatives in many tools. The
in DYNINST and 10 basic blocks in ANGR, the two tools find causes of false positives are also diverse. R ADARE 2’s false
the index bound in 2.1% and 19.8% more jump tables. Third, positives are mostly because it matches the wrong cmp/sub
G HIDRA and R ADARE 2 discard jump tables with index over instruction (54.29%) or the restriction to the index is not by
1,024 and 512, respectively missing 51 and 2,435 jump tables. cmp/sub (45.71%). DYNINST produces false positives because
To detect tail calls, R ADARE 2 relies on the distance between it does not handle special aliases of the index (78.96%), does not
jumps and their targets. It is hard to pick a proper threshold: consider restrictions by type (17.95%), and restricts the scope
in our benchmark binaries, the distance for regular jumps (min: of slicing (3.09%). ANGR incurs false positives due to incorrect
0; max: 0xb5867) and tail calls (min: 0; max: 0xb507c5) are largely VSA solving (78.26%) and ignoring certain paths in slicing
overlapped. G HIDRA uses the heuristic that a tail call and (21.74%). With G HIDRA, the false positives are because of
its target spans multiple functions. This heuristic is general wrongly considering indirect jumps as indirect calls (55.32%),
to capture tail calls, producing a high coverage of 91.29%. no consideration of sub instruction in VSA analysis (34.04%),
However, it cannot recognize regular jumps between two identifying extra/wrong restrictions to the index (4.26%), and
TABLE XVI: Statistics of false negatives in detection of
non-returning functions. Lib, Cond-NonRet, Exit-Inst, Fake- 1998; the other one is because Reassembler tries to parses the
Ret, Propa respectively stand for unrecognized non-returning PLT section that does not exist in Windows binaries). Validity
library functions, argument-dependent non-returning functions wise, implementation issues in our evaluation and analysis
(e.g., error), instructions that exit (e.g., ud2), functions code may have slipped into the results, leading to inaccurate
that contain ret instructions but do not actually return, and reports of the tools’ performance. To mitigate this threat, we
propagation of other types of false negatives. checked if the evaluation results match our understanding of
the tools. In particular, we checked if the false positives and
Percentage of False Negatives (%) false negatives were explainable by the tool’s strategies.
Tools
Lib Cond-NonRet Exit-Inst Fake-Ret Propa
Dyninst 12.30 43.86 0.30 27.64 15.90 V. F INDINGS
Ghidra 3.97 13.01 0.04 9.54 73.44 This study uncovered the following major findings.
Angr 19.08 21.70 0.00 22.29 36.93 F-1: Complex constructs are common and heuristics are
Bap 24.45 22.73 0.10 28.90 23.82
indispensable to handle them. We observe a large amount of
Radare2 19.89 26.04 0.08 13.07 40.92
complex constructs in our benchmarks (Table XVII). In partic-
including the default case as an entry (6.38%). ular, we found 295 instances of data-in-code, complementary
In detecting tail calls, false negatives are largely caused to previous research [5]. Heuristics are indispensable to handle
by excluding conditional jumps (FN ratio - G HIDRA: 21.6%; the constructs: they are responsible for 49% of the instructions,
ANGR 17.5%), missing boundaries between adjacent functions
17% of the function entries, and most xrefs.
(FN ratio - G HIDRA: 78.4%), excluding targets reached by both F-2: Heuristics inherently introduce coverage-correctness
conditional jumps and unconditional jumps (FN ratio - ANGR: trade-offs. Heuristics significantly increase coverage, but con-
81.6%), incorrect calculation of stack height (FN ratio - ANGR:
currently induce new errors. As a counter-movement, new
0.9%), unrecognized functions and no stack adjustment patterns
heuristics are devised to reduce the errors, leading to coverage-
before the jumps (FN ratio - DYNINST:100%). For false posi- correctness trade-offs in nearly every disassembly phase.
tives, the most common cause is wrong function entries (FP F-3: Tools complement each other. As shown in Table III, ex-
ratio - G HIDRA: 5.9%; ANGR 19.7%; DYNINST: 2.4%). Other reasons isting tools use heuristics and algorithms that have overlaps but
include non-continuous functions (FP ratio - G HIDRA: 94.1%; also many differences. This features tools different strengths,
ANGR 0.3%; DYNINST: 0.6%), instructions follow the format of
indicating that tool selection should be demand-specific.
add esp/rsp, CONST but do not tear down the stack (FP F-4: Broader and deeper evaluation is needed for future
ratio - DYNINST:97%), unchanged stack height before the jump improvements. We envision that the community may have
(FP ratio - ANGR: nearly 90%). We also notice that ANGR’s false insufficiently evaluated existing tools. This prevents compre-
positives contain a group of conditional jumps, which are hensive understanding about the limitations of the tools. We
caused by a defect in its code. hope our study will bring a piece of basis to broaden and
Table XVI shows the false negatives in detecting non- deepen the evaluation towards better binary disassembly.
returning functions. A special category is functions that have VI. C ONCLUSION
ret instructions but do not return (e.g., _Unwind_Resume
We present a systematization of binary disassembly with
in Glibc). Such functions alter the stack to use a self-prepared
a thorough study and comprehensive evaluation, centering
return address, avoiding returning to the parent. We observe
around the perspective of algorithms and heuristics. Our study,
no false positives with DYNINST. R ADARE 2 generates false
via the comprehension on the source code of 9 disassembly
positives because of un-handled jump tables (23.40%), wrong
tools, presents in-depth understanding of their strategies. Our
function boundaries (4.26%), and propagation of other false
evaluation separately measures the tools on each disassem-
positives (72.34%). ANGR produces false positives due to un-
bly phase and on individual strategies, which fully unveils
handled jump tables/tail calls (29.73%), wrong function bound-
how much the heuristics are used, how much the heuristics
aries (54.05%), and propagation of other false positives (16.22%).
contribute to disassembly, and what errors the heuristics can
G HIDRA incurs all its false positives due to heuristic 29 . False
introduce. Throughout the study, we derive a group of new
positives by BAP are due to implementation issues.
observations that can amend/complement previous understand-
C. Threats to Completeness and Validity ings and also inspire future directions of binary disassembly.
Despite our best efforts to carefully evaluate the public ACKNOWLEDGMENTS
versions of the selected tools, there could still be threats to the
We would like to thank our shepherd Yan Shoshitaishvili
completeness and validity of our results. Completeness wise,
and the anonymous reviewers for their feedback. This project
we may have missed reporting the results of some tools in
was supported by the Office of Naval Research (Grant#:
certain tasks due to implementation issues of the tools (other
N00014-16-1-2261, N00014-17-1-2788, and N00014-17-1-
than fundamental limitations). For instance, we could not
2787). Any opinions, findings, and conclusions or recommen-
report Angr’s results of symbolization with Windows binaries
dations expressed in this paper are those of the authors and
because of two implementation issues in Angr’s Reassembler
do not necessarily reflect the views of the funding agency.
module (one is reported at https://github.com/angr/angr/issues/
R EFERENCES [30] Y. David and E. Yahav, “Tracelet-based code search in executables,” in Acm
Sigplan Notices, vol. 49, no. 6. ACM, 2014, pp. 349–360.
[1] “Ida help: Kernel analysis options,” https://www.hex-rays.com/products/ida/
[31] L. De La Rosa, S. Kilgallon, T. Vanderbruggen, and J. Cavazos, “Efficient
support/idadoc/621.shtml.
characterization and classification of malware using deep learning,” in 2018
[2] M. Abadi, M. Budiu, Ú. Erlingsson, and J. Ligatti, “Control-flow integrity
Resilience Week (RWS). IEEE, 2018, pp. 77–83.
principles, implementations, and applications,” ACM Transactions on Information
[32] R. Decompiler, “A retargetable machine-code decompiler based on llvm,” https:
and System Security (TISSEC), vol. 13, no. 1, p. 4, 2009.
//retdec.com/.
[3] N. S. Agency, “Ghidra,” https://www.nsa.gov/resources/everyone/ghidra/, 2019.
[33] A. Dinaburg and A. Ruef, “Mcsema: Static translation of x86 instructions to llvm,”
[4] H. Alasmary, A. Anwar, J. Park, J. Choi, D. Nyang, and A. Mohaisen, “Graph-
in ReCon 2014 Conference, Montreal, Canada, 2014.
based comparison of iot and android malware,” in International Conference on
[34] S. Dinesh, N. Burow, D. Xu, and M. Payer, “Retrowrite: Statically instrumenting
Computational Social Networks. Springer, 2018, pp. 259–272.
cots binaries for fuzzing and sanitization,” in 2020 IEEE Symposium on Security
[5] D. Andriesse, X. Chen, V. Van Der Veen, A. Slowinska, and H. Bos, “An in-depth
and Privacy (SP). IEEE Computer Society, 2020, pp. 9–9.
analysis of disassembly on full-scale x86/x64 binaries,” in 25th USENIX Security
[35] DynInst, “Dataflowapi programmers guide,” https://dyninst.org/sites/default/files/
Symposium, 2016, pp. 583–600.
manuals/dyninst/dataflowAPI.pdf, 2019.
[6] D. Andriesse, A. Slowinska, and H. Bos, “Compiler-agnostic function detection
[36] C. Eagle, The IDA pro book. No Starch Press, 2011.
in binaries,” in 2017 IEEE European Symposium on Security and Privacy
[37] M. Elsabagh, D. Fleck, and A. Stavrou, “Strict virtual call integrity checking for
(EuroS&P). IEEE, 2017, pp. 177–189.
c++ binaries,” in 2017 ACM on Asia Conference on Computer and Communica-
[7] W. Armstrong, P. Christen, E. McCreath, and A. P. Rendell, “Dynamic algorithm
tions Security. ACM, 2017, pp. 140–154.
selection using reinforcement learning,” in 2006 International Workshop on
[38] Ú. Erlingsson, M. Abadi, M. Vrable, M. Budiu, and G. C. Necula, “Xfi: Software
Integrating AI and Data Mining. IEEE, 2006, pp. 18–25.
guards for system address spaces,” in 7th USENIX Security Symposium, 2006, pp.
[8] C. S. L. at UC Santa Barbara, “angr github repo,” https://github.com/angr/angr/
75–88.
tree/76da434f, 2019.
[39] U. Erlingsson and F. B. Schneider, “Sasi enforcement of security policies: A
[9] T. Bao, J. Burket, M. Woo, R. Turner, and D. Brumley, “Byteweight: Learning to
retrospective,” in Proceedings DARPA Information Survivability Conference and
recognize functions in binary code,” in 23rd USENIX Security Symposium, 2014,
Exposition. DISCEX’00, vol. 2. IEEE, 2000, pp. 287–295.
pp. 845–860.
[40] S. Eschweiler, K. Yakdan, and E. Gerhards-Padilla, “discovre: Efficient cross-
[10] T. Barabosch, “cwe checker finds vulnerable patterns in binary executables,”
architecture identification of bugs in binary code.” in NDSS, 2016.
https://github.com/fkie-cad/cwe checker, 2018.
[41] Q. Feng, M. Wang, M. Zhang, R. Zhou, A. Henderson, and H. Yin, “Extracting
[11] A. R. Bernat and B. P. Miller, “Anywhere, any-time binary instrumentation,” in
conditional formulas for cross-platform bug search,” in 2017 ACM on Asia
10th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools.
Conference on Computer and Communications Security. ACM, 2017, pp. 346–
ACM, 2011, pp. 9–16.
359.
[12] P. Biondi, R. Rigo, S. Zennou, and X. Mehrenberger, “Bincat: purrfecting binary
[42] Q. Feng, R. Zhou, C. Xu, Y. Cheng, B. Testa, and H. Yin, “Scalable graph-
static analysis,” in Symposium sur la sécurité des technologies de linformation et
based bug search for firmware images,” in the 2016 ACM SIGSAC Conference on
des communications, 2017.
Computer and Communications Security. ACM, 2016, pp. 480–491.
[13] L. Bits, “Mcsema github repo,” https://github.com/lifting-bits/mcsema/tree/
[43] H. Flake, “Structural comparison of executable objects,” in DIMVA, vol. 46.
62a1319e, 2019.
Citeseer, 2004, pp. 161–173.
[14] M. Bourquin, A. King, and E. Robbins, “Binslayer: accurate comparison of
[44] P. Garba and M. Favaro, “Saturn-software deobfuscation framework based on
binary executables,” in the 2nd ACM SIGPLAN Program Protection and Reverse
llvm,” in 3rd ACM Workshop on Software Protection. ACM, 2019, pp. 27–38.
Engineering Workshop. ACM, 2013, p. 4.
[45] A. Gazet, “ret-sync is a set of plugins that helps to synchronize a debugging
[15] D. Bruschi, L. Martignoni, and M. Monga, “Detecting self-mutating malware
session,” https://github.com/bootleg/ret-sync, 2016.
using control-flow graph matching,” in International conference on detection of
[46] M. Ghaffarinia and K. W. Hamlen, “Binary control-flow trimming,” in cnum26th
intrusions and malware, and vulnerability assessment. Springer, 2006, pp. 129–
ACM Conference on Computer and Communications Security nymCCS. forthcom-
143.
ing, 2019.
[16] S. K. Cha, T. Avgerinos, A. Rebert, and D. Brumley, “Unleashing mayhem on
[47] GNU, “Index of /gnu/binutils,” https://ftp.gnu.org/gnu/binutils/, 2019.
binary code,” in 2012 IEEE Symposium on Security and Privacy (SP). IEEE,
[48] GNU, “Mirror of sourceware glibc repository,” https://github.com/bminor/glibc/
2012, pp. 380–394.
blob/master/sysdeps/x86 64/multiarch/memcpy-ssse3.S#L441, 2019.
[17] M. Chandramohan, Y. Xue, Z. Xu, Y. Liu, C. Y. Cho, and H. B. K. Tan, “Bingo:
[49] I. Guilfanov, “Jump tables,” https://www.hex-rays.com/blog/jump-tables/.
Cross-architecture cross-os binary search,” in the 2016 24th ACM SIGSOFT
[50] N. Hasabnis and R. Sekar, “Lifting assembly to intermediate representation: A
International Symposium on Foundations of Software Engineering. ACM, 2016,
novel approach leveraging compilers,” in ACM SIGARCH Computer Architecture
pp. 678–689.
News, vol. 44, no. 2. ACM, 2016, pp. 311–324.
[18] P. Chen, J. Xu, Z. Hu, X. Xing, M. Zhu, B. Mao, and P. Liu, “What you see is not
[51] W. He, S. Das, W. Zhang, and Y. Liu, “No-jump-into-basic-block: Enforce basic
what you get thwarting just-in-time rop with chameleon,” in 2017 47th Annual
block cfi on the fly for real-world binaries,” in 54th Annual Design Automation
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
Conference 2017. ACM, 2017, p. 23.
IEEE, 2017, pp. 451–462.
[52] G. Hernandez, F. Fowze, D. J. Tian, T. Yavuz, and K. R. Butler, “Firmusb: Vetting
[19] X. Chen, A. Slowinska, D. sse, H. Bos, and C. Giuffrida, “Stackarmor: Compre-
usb device firmware using domain informed symbolic execution,” in 2017 ACM
hensive protection from stack-based memory error vulnerabilities for binaries.” in
SIGSAC Conference on Computer and Communications Security. ACM, 2017,
NDSS, 2015.
pp. 2245–2262.
[20] Y. Chen, D. Mu, J. Xu, Z. Sun, W. Shen, X. Xing, L. Lu, and B. Mao,
[53] J. Hiser, A. Nguyen-Tuong, M. Co, M. Hall, and J. W. Davidson, “Ilr: Where’d
“Ptrix: Efficient hardware-assisted fuzzing for cots binary,” in 2019 ACM on Asia
my gadgets go?” in 2012 IEEE Symposium on Security and Privacy (SP). IEEE,
Conference on Computer and Communications Security. ACM, 2019, pp. 1032–
2012, pp. 571–585.
1043.
[54] X. Hu and K. G. Shin, “DUET: Integration of dynamic and static analyses for
[21] Y. Chen, D. Zhang, R. Wang, R. Qiao, A. M. Azab, L. Lu, H. Vijayakumar, and
malware clustering with cluster ensembles,” in 29th Annual Computer Security
W. Shen, “Norax: Enabling execute-only memory for cots binaries on aarch64,” in
Applications Conference (ACSAC’13), 2013, pp. 79–88.
2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 304–319.
[55] Y. Hu, Y. Zhang, J. Li, and D. Gu, “Binary code clone detection across
[22] V. Chipounov and G. Candea, “Enabling sophisticated analyses of⇥ 86 binaries
architectures and compiling configurations,” in 25th International Conference on
with revgen,” in 2011 IEEE/IFIP 41st International Conference on Dependable
Program Comprehension. IEEE Press, 2017, pp. 88–98.
Systems and Networks Workshops (DSN-W). IEEE, 2011, pp. 211–216.
[56] M. Jiang, Y. Zhou, X. Luo, R. Wang, Y. Liu, and K. Ren, “An empirical study on
[23] C. Cifuentes and M. Van Emmerik, “Recovery of jump table case statements from
arm disassembly tools,” in ACM SIGSOFT International Symposium on Software
binary code,” Science of Computer Programming, vol. 40, no. 2-3, pp. 171–188,
Testing and Analysis (ISSTA’20), 2020.
2001.
[57] W. M. Khoo, A. Mycroft, and R. Anderson, “Rendezvous: A search engine for
[24] coreboot, “Ghidra firmware utilities, weeks 1-2,” https://blogs.coreboot.org/blog/
binary code,” in the 10th Working Conference on Mining Software Repositories.
2019/06/04/gsoc-ghidra-firmware-utilities-weeks-1-2/, 2019.
IEEE Press, 2013, pp. 329–338.
[25] M. Cova, V. Felmetsger, G. Banks, and G. Vigna, “Static detection of vulnerabil-
[58] S. Kilgallon, L. De La Rosa, and J. Cavazos, “Improving the effectiveness and
ities in x86 executables,” in 2006 22nd Annual Computer Security Applications
efficiency of dynamic malware analysis with machine learning,” in 2017 Resilience
Conference (ACSAC’06). IEEE, 2006, pp. 269–278.
Week (RWS). IEEE, 2017, pp. 30–36.
[26] C. CyLab, “Bap github repo,” https://github.com/BinaryAnalysisPlatform/bap/tree/
[59] J. Kinder, “The jakstab static analysis platform for binaries,” http://www.jakstab.
cfeacbfc, 2020.
org/.
[27] C. CyLab, “Bap-toolkit github repo,” https://github.com/BinaryAnalysisPlatform/
[60] H. Koo, Y. Chen, L. Lu, V. P. Kemerlis, and M. Polychronakis, “Compiler-assisted
bap-toolkit/tree/7b7744dc3/with-no-return, 2020.
code randomization,” in 2018 IEEE Symposium on Security and Privacy (SP).
[28] L. Davi, C. Liebchen, A.-R. Sadeghi, K. Z. Snow, and F. Monrose, “Isomeron:
IEEE, 2018, pp. 461–477.
Code randomization resilient to (just-in-time) return-oriented programming.” in
[61] H. Koo and M. Polychronakis, “Juggling the gadgets: Binary-level code ran-
NDSS, 2015.
domization using instruction displacement,” in 11th ACM on Asia Conference
[29] Y. David, N. Partush, and E. Yahav, “Firmup: Precise static detection of common
on Computer and Communications Security. ACM, 2016, pp. 23–34.
vulnerabilities in firmware,” in ACM SIGPLAN Notices, vol. 53, no. 2. ACM,
[62] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna, “Polymorphic worm
2018, pp. 392–404.
detection using structural information of executables,” in International Workshop 986353c0, 2019.
on Recent Advances in Intrusion Detection. Springer, 2005, pp. 207–226. [94] M. Santosa, “Understanding elf using readelf and objdump,” Linux Forms article,
[63] S. S. Lab, “Psi: A platform for secure static binary instrumentation,” http://www. pp. 1–6, 2006.
seclab.cs.sunysb.edu/seclab/psi/, 2019. [95] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher,
[64] T. C. S. D. Laboratory, “Program analysis tools developed at draper on the cbat J. Grosen, S. Feng, C. Hauser, C. Kruegel et al., “Sok:(state of) the art of war:
project,” https://github.com/draperlaboratory/cbat tools/, 2018. Offensive techniques in binary analysis,” in 2016 IEEE Symposium on Security
[65] L. Li, J. E. Just, and R. Sekar, “Address-space randomization for windows and Privacy (SP). IEEE, 2016, pp. 138–157.
systems,” in 2006 22nd Annual Computer Security Applications Conference [96] S. Sidiroglou, O. Laadan, C. Perez, N. Viennot, J. Nieh, and A. D. Keromytis,
(ACSAC’06). IEEE, 2006, pp. 329–338. “Assure: automatic software self-healing using rescue points,” ACM SIGARCH
[66] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, Computer Architecture News, vol. 37, no. 1, pp. 37–48, 2009.
“Vuldeepecker: A deep learning-based system for vulnerability detection,” arXiv [97] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. G. Kang, Z. Liang,
preprint arXiv:1801.01681, 2018. J. Newsome, P. Poosankam, and P. Saxena, “Bitblaze: A new approach to computer
[67] S. McCamant and G. Morrisett, “Evaluating sfi for a cisc architecture.” in 7th security via binary analysis,” in International Conference on Information Systems
USENIX Security Symposium, 2006. Security. Springer, 2008, pp. 1–25.
[68] X. Meng and B. P. Miller, “Binary code is not easy,” in 25th International [98] N. Stephens, J. Grosen, C. Salls, A. Dutcher, R. Wang, J. Corbetta, Y. Shoshi-
Symposium on Software Testing and Analysis. ACM, 2016, pp. 24–35. taishvili, C. Kruegel, and G. Vigna, “Driller: Augmenting fuzzing through
[69] B. P. Miller, M. Christodorescu, R. Iverson, T. Kosar, A. Mirgorodskii, and selective symbolic execution.” in NDSS, vol. 16, no. 2016, 2016, pp. 1–16.
F. Popovici, “Playing inside the black box: Using dynamic instrumentation to [99] V. Van Der Veen, E. Göktas, M. Contag, A. Pawoloski, X. Chen, S. Rawat, H. Bos,
create security holes,” Parallel Processing Letters, vol. 11, no. 02n03, pp. 267– T. Holz, E. Athanasopoulos, and C. Giuffrida, “A tough call: Mitigating advanced
280, 2001. code-reuse attacks at the binary level,” in 2016 IEEE Symposium on Security and
[70] J. Ming, M. Pan, and D. Gao, “ibinhunt: Binary hunting with inter-procedural con- Privacy (SP). IEEE, 2016, pp. 934–953.
trol flow,” in International Conference on Information Security and Cryptology. [100] M. Wang, H. Yin, A. V. Bhaskar, P. Su, and D. Feng, “Binary code continent:
Springer, 2012, pp. 92–109. Finer-grained control flow integrity for stripped binaries,” in 31st Annual Com-
[71] M. Muench, D. Nisi, A. Francillon, and D. Balzarotti, “Avatar 2: A multi-target puter Security Applications Conference (ACSAC’15). ACM, 2015, pp. 331–340.
orchestration platform,” in Workshop on Binary Analysis Research (colocated with [101] R. Wang, Y. Shoshitaishvili, A. Bianchi, A. Machiry, J. Grosen, P. Grosen,
NDSS Symposium)(February 2018), BAR, vol. 18, 2018. C. Kruegel, and G. Vigna, “Ramblr: Making reassembly great again.” in NDSS,
[72] P. Muntean, M. Fischer, G. Tan, Z. Lin, J. Grossklags, and C. Eckert, “⌧ cfi: Type- 2017.
assisted control flow integrity for x86-64 binaries,” in International Symposium [102] S. Wang, P. Wang, and D. Wu, “Reassembleable disassembling,” in 24th USENIX
on Research in Attacks, Intrusions, and Defenses. Springer, 2018, pp. 423–444. Security Symposium, 2015, pp. 627–642.
[73] J. Mußler, D. Lorenz, and F. Wolf, “Reducing the overhead of direct application [103] S. Wang, P. Wang, and D. Wu, “Uroboros: Instrumenting stripped binaries with
instrumentation using prior static analysis,” in European Conference on Parallel static reassembling,” in 2016 IEEE 23rd International Conference on Software
Processing. Springer, 2011, pp. 65–76. Analysis, Evolution, and Reengineering (SANER). IEEE, 2016.
[74] B. Ninja, “binary.ninja : a reverse engineering platform,” https://binary.ninja/, [104] R. Wartell, V. Mohan, K. W. Hamlen, and Z. Lin, “Binary stirring: Self-
2019. randomizing instruction addresses of legacy x86 binary code,” in 2012 ACM
[75] NSA, “Ghidra github repo,” https://github.com/NationalSecurityAgency/ghidra/ conference on Computer and communications security. ACM, 2012, pp. 157–168.
tree/Ghidra 9.0.4 build, 2019. [105] M. Wenzl, G. Merzdovnik, J. Ullrich, and E. Weippl, “From hack to elaborate
[76] T. of Bits, “Vulnerability modeling with binary ninja,” https://blog.trailofbits.com/ techniquea survey on binary rewriting,” ACM Computing Surveys (CSUR), vol. 52,
2018/04/04/vulnerability-modeling-with-binary-ninja/, 2018. no. 3, p. 49, 2019.
[77] R. Paleari, L. Martignoni, G. Fresi Roglia, and D. Bruschi, “N-version disassem- [106] D. Williams-King, G. Gobieski, K. Williams-King, J. P. Blake, X. Yuan, P. Colp,
bly: differential testing of x86 disassemblers,” in 19th international symposium M. Zheng, V. P. Kemerlis, J. Yang, and W. Aiello, “Shuffler: Fast and deployable
on Software testing and analysis. ACM, 2010, pp. 265–274. continuous code re-randomization,” in 12th USENIX Symposium on Operating
[78] V. Pappas, M. Polychronakis, and A. D. Keromytis, “Smashing the gadgets: Systems Design and Implementation (OSDI 16), 2016, pp. 367–382.
Hindering return-oriented programming using in-place code randomization,” in [107] B. Yee, D. Sehr, G. Dardyk, J. B. Chen, R. Muth, T. Ormandy, S. Okasaka,
2012 IEEE Symposium on Security and Privacy (SP). IEEE, 2012, pp. 601–615. N. Narula, and N. Fullagar, “Native client: A sandbox for portable, untrusted x86
[79] Paradyn, “Dyninst github repo,” https://github.com/dyninst/dyninst/tree/5d2ddacb, native code,” in 2009 30th IEEE Symposium on Security and Privacy (SP). IEEE,
2019. 2009, pp. 79–93.
[80] M. Payer, A. Barresi, and T. R. Gross, “Fine-grained control-flow integrity through [108] C. Zhang, T. Wei, Z. Chen, L. Duan, S. McCamant, and L. Szekeres, “Protecting
binary hardening,” in International Conference on Detection of Intrusions and function pointers in binary,” in 8th ACM SIGSAC symposium on Information,
Malware, and Vulnerability Assessment. Springer, 2015, pp. 144–164. computer and communications security. ACM, 2013, pp. 487–492.
[81] H. Peng, Y. Shoshitaishvili, and M. Payer, “T-fuzz: fuzzing by program transfor- [109] C. Zhang, T. Wei, Z. Chen, L. Duan, L. Szekeres, S. McCamant, D. Song,
mation,” in 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 2018, and W. Zou, “Practical control flow integrity and randomization for binary
pp. 697–710. executables,” in 2013 IEEE Symposium on Security and Privacy (SP). IEEE,
[82] J. Pewny, B. Garmany, R. Gawlik, C. Rossow, and T. Holz, “Cross-architecture 2013, pp. 559–573.
bug search in binary executables,” in 2015 IEEE Symposium on Security and [110] M. Zhang, M. Polychronakis, and R. Sekar, “Protecting cots binaries from
Privacy (SP). IEEE, 2015, pp. 709–724. disclosure-guided code reuse attacks,” in 33rd Annual Computer Security Ap-
[83] J. Pewny, F. Schuster, L. Bernhard, T. Holz, and C. Rossow, “Leveraging semantic plications Conference (ACSAC’17). ACM, 2017, pp. 128–140.
signatures for bug search in binary programs,” in the 30th Annual Computer [111] M. Zhang and R. Sekar, “Control flow integrity for cots binaries,” in 22nd USENIX
Security Applications Conference (ACSAC’14). ACM, 2014, pp. 406–415. Security Symposium, 2013, pp. 337–352.
[84] B. Potchik, “Binary ninja deep thoughts,” https://binary.ninja/2017/11/06/ [112] L. Zhao, Y. Duan, H. Yin, and J. Xuan, “Send hardest problems my way:
architecture-agnostic-function-detection-in-binaries.html, 2017. Probabilistic path prioritization for hybrid fuzzing,” in NDSS, 2019.
[85] A. Prakash, X. Hu, and H. Yin, “vfguard: Strict protection for virtual function
calls in cots c++ binaries.” in NDSS, 2015.
[86] C. Qian, H. Hu, M. Alharthi, P. H. Chung, T. Kim, and W. Lee, “Razor: A A PPENDIX
framework for post-deployment software debloating,” in 28th USENIX Security
Symposium, 2019, pp. 1733–1750. TABLE XVII: Statistics of complex constructs in our bench-
[87] W. Qiang, Y. Huang, D. Zou, H. Jin, S. Wang, and G. Sun, “Fully context-sensitive
cfi for cots binaries,” in Australasian Conference on Information Security and
mark binaries.
Privacy. Springer, 2017, pp. 435–442.
[88] R. Qiao, M. Zhang, and R. Sekar, “A principled approach for rop defense,” in Complex Constructs Types Cases / Prog / Bin
31st Annual Computer Security Applications Conference (ACSAC’15). ACM,
2015, pp. 101–110. Padding bytes 6,445,649 / 236 / 3,788
[89] radreorg, “Radare2 github repo,” https://github.com/radareorg/radare2/tree/ Data in code Hard-coded bytes 295 / 3 / 18
5a1df188, 2020.
[90] G. Ravipati, A. R. Bernat, N. Rosenblum, B. P. Miller, and J. K. Hollingsworth,
Jump tables 21,586 / 57 / 426
“Toward the deconstruction of dyninst,” Univ. of Wisconsin, technical report, p. 32, Jump tables 118,616 / 225 / 3,594
2007. Indirect Jumps Indirect tail-calls 35,087 / 26 / 465
[91] S. Rawat, V. Jain, A. Kumar, L. Cojocar, C. Giuffrida, and H. Bos, “Vuzzer:
Application-aware evolutionary fuzzing,” in NDSS, 2017. Handle-written ones 336 / 1 / 4
[92] N. Redini, R. Wang, A. Machiry, Y. Shoshitaishvili, G. Vigna, and C. Kruegel, Overlapping functions 2,050 / 6 / 49
“B in t rimmer: Towards static binary debloating through abstract interpretation,” Special functions Multi-entry functions 103 / 3 / 11
in International Conference on Detection of Intrusions and Malware, and Vulner-
ability Assessment. Springer, 2019, pp. 482–501. Non-return functions 29,668 / 228 / 3,712
[93] s3team, “Uroboros github repo,” https://github.com/s3team/uroboros/tree/ (Direct) Tail-calls N/A 503,527 / 236 / 3,781
TABLE XVIII: Results of disassembly without heuristics.
A. Complex Constructs
Table XVII presents the statistics of complex constructs Instructions Instructions
L Avg Min W Avg Min
from our benchmark binaries. Pre Rec Pre Rec Pre Rec Pre Rec
O0 99.99 66.30 99.97 1.20 - - - - -
B. Building Ground Truth for Windows Binaries
Ghidra Ghidra-NE Dyninst

O2 99.99 66.44 99.91 1.19 - - - - -
First, we follow a similar approach as described in [5] to O3 99.99 63.79 99.91 1.11 - - - - -
gather instructions. We use the line table in debug information Os 99.99 68.58 99.96 1.31 - - - - -
Of 99.99 66.35 99.78 1.11 - - - - -
to find the first instruction of each source-code/assembly state- O0 99.99 64.53 99.90 1.22 Od 99.87 53.34 93.66 2.26
ment, followed by recursive disassembly (skipping indirect O2 99.99 69.97 99.62 1.13 O1 99.87 53.58 94.66 2.61
control transfers) to find instructions not covered by the line O3 99.99 63.28 99.63 1.21 O2 99.87 53.24 95.32 2.15
table. We also found that Visual Studio considers hard-coded Os 99.99 68.53 99.77 1.35 Ox 99.86 52.91 93.56 2.15
Of 99.99 64.67 99.64 1.27 - - - - -
bytes as code in the line table. We manually exclude such
O0 99.99 91.03 99.96 14.08 Od 99.87 53.34 93.66 2.26
cases. Second, we rely on symbols to identify functions. The O2 99.99 91.75 99.68 1.80 O1 99.87 53.58 94.66 2.61
symbol of a function also carries a flag indicating whether it O3 99.99 90.18 99.72 1.95 O2 99.87 53.24 95.32 2.15
returns, aiding us to gather non-returning functions. Third, we Os 99.99 93.59 99.94 19.49 Ox 99.86 52.91 93.56 2.15
enable the DEBUGTYPE:FIXUP option in Visual Studio such Of 99.99 88.93 99.73 21.16 - - - - -
O0 99.99 65.78 99.99 1.21 Od 99.99 47.21 99.99 2.16
that all xrefs are preserved in the linking process. Finally, we
Angr-NS
O2 99.99 58.13 99.74 0.61 O1 99.99 51.38 99.98 2.51
identify jump tables based on xrefs (which contain the base O3 99.99 58.62 99.94 0.54 O2 99.99 46.32 99.99 2.06
address and entries of a jump table). For each xref, if its target Os 99.99 64.14 96.46 0.80 Ox 99.99 45.98 99.98 2.06
location contains a list of other xrefs pointing to basic blocks Of 99.99 60.14 97.25 0.55 - - - - -
in the same function, we consider the xref refers to a jump O0 99.99 6.37 99.99 0.02 Od 99.99 2.88 99.96 0.03
Radare2
O2 99.99 11.87 99.99 0.02 O1 99.99 3.69 99.99 0.04
table and deem the list of xrefs at its target location as entries. O3 99.99 12.03 99.99 0.01 O2 99.99 4.49 99.92 0.03
For correctness, we further verify if the jump table entries Os 99.86 7.29 83.70 0.01 Ox 99.99 4.14 99.83 0.03
correspond to switch-cases in the source code. Otherwise, Of 99.87 7.94 79.25 0.01 - - - - -
we manually verify the correctness (since some jump tables
are hand-crafted or compiled from if-else statements). TABLE XIX: Results of function-matching. The baseline of
Our initial ground truth of non-returning functions (for both Rec is the total number of true functions.
Linux and Windows binaries) can miss propagated cases. To Function Matching Function Matching
this end, we expand our ground truth by running algorithms L Avg Min W Avg Min
14 and 15 with recursive updates. Pre Rec Pre Rec Pre Rec Pre Rec
O0 97.56 45.64 50.00 0.99 - - - - -
Angr-NS Ghidra-NE Dyninst
C. Configurations of Disassembly Tools O2 83.84 33.30 17.24 0.72 - - - - -

O3 83.08 33.88 19.35 0.64 - - - - -
OBJDUMP, PSI, UROBOROS: We use the recommended Os 89.11 4.88 0.00 0.00 - - - - -
options to run them. We parse their outputs for results. Of 73.24 2.39 0.00 0.00 - - - - -
DYNINST: We use its ParseAPI interface to perform recursive O0 97.26 42.43 0.00 0.00 Od 99.78 30.32 94.42 3.00
disassembly and we further enable IdiomMatching to include O2 97.29 3.48 0.00 0.00 O1 99.62 18.38 86.78 1.78
O3 97.09 4.42 0.00 0.00 O2 99.68 20.98 93.58 0.92
the decision-tree based function matching (which is excluded
Os 97.69 16.55 0.00 0.00 Ox 99.69 23.08 93.58 0.89
when we do testing without heuristics). We parse the returned Of 97.72 4.49 0.00 0.00 - - - - -
structure of ParseAPI to get the disassembly results. O0 91.04 36.93 0.00 0.00 Od 45.65 17.35 1.04 0.16
ANGR: We use its CFGFast interface for disassembly. O2 53.02 19.20 0.00 0.00 O1 51.90 12.95 2.60 0.16
When using this interface, we enable the normalize and O3 54.97 21.12 0.00 0.00 O2 47.56 12.60 6.34 0.50
Os 59.15 26.16 0.00 0.00 Ox 47.93 13.36 6.86 0.50
detect tail calls arguments so that a basic block is split by Of 58.26 22.03 0.00 0.00 - - - - -
a control transfer to the middle and tail calls are detected. O0 89.10 31.67 5.68 1.78 Od 57.94 12.78 1.16 0.23
When testing disassembly without heuristics, we disable the O2 54.21 16.23 0.00 0.00 O1 50.45 7.57 1.11 0.23
Bap
force complete scan and function prologues arguments to O3 55.03 18.21 0.00 0.00 O2 55.54 8.97 1.28 0.25
Os 66.83 16.18 0.00 0.00 Ox 55.58 8.40 1.16 0.23
CFGFast, preventing linear scan and function matching. To
Of 54.75 17.91 1.41 1.03 - - - - -
obtain the results, we interpret the CFG returned by CFGFast O0 99.86 30.66 66.66 0.00 Od 77.36 14.83 0.00 0.00
Radare2
for results. We also excluded functions marked as “alignment”. O2 97.19 0.46 50.00 0.00 O1 74.53 7.30 0.00 0.00
We use the Reassembler interface for symbolization. O3 97.15 0.37 50.00 0.00 O2 71.80 5.24 0.00 0.00
GHIDRA: Besides default settings, we enable Assume Con- Os 97.86 6.30 50.00 0.00 Ox 71.80 5.58 0.00 0.00
Of 98.05 0.60 58.33 0.00 - - - - -
tiguous Functions Only and Allow Conditional Jumps to per-
form tail call detection. We disable the X86 Constant Reference
Analyzer in symbolization as it introduces many dummy RADARE2: We run R ADARE 2 with four options: aa for
xrefs. Finally, to test G HIDRA without heuristics, we disable default recursive disassembly, aanr for non-return function
Function Start Search and xrefs related options, to prevent the detection, aac and aap respectively enable xref based disas-
signature-based matching and xref based disassembly. sembly and signature-based function matching. With aac and
aap disabled, we can test R ADARE 2 without heuristics. We Disassembly: Both IDA P RO and B INARY N INJA perform
use APIs provided by R ADARE 2 to obtain the results. recursive descent to recover instructions. We inferred this
BAP: We run BAP with -dasm and -drcfg to do disassembly based on their correct results of handling Listing 2 and [84].
and reconstruct the CFG. Note the plugin for -drcfg needs extra They also take other approaches to handling code gaps.
installation. We use the with-no-return [27] pass to detect B INARY N INJA follows a heuristic as described in [84] to
non-returning function. We parse BAP’s outputs for results. deal with code gaps. It linearly scans non-disassembled code
IDA PRO, BINARY NINJA: We run the two tools with regions and aggregates call targets. Once done, B INARY N INJA
default settings and use their public APIs to get the results. sorts all the targets based on the times of being referenced and
D. Breakdown Analysis of Heuristics (in order) hands them off to further recursive descent.
IDA P RO at least uses four strategies to handle code gaps:
Table XVIII and XIX respectively show the results of
(1) it searches for common code sequences (e.g., [push bp;
instruction recovery without heuristics and the results of
mov bp, sp]). We inferred this from the kernel option of
pattern-based function matching.
mark typical code sequences as code [1] and the test-case
E. Overlap of False Positives and False Negatives in Listing 11; (2) it considers addresses in the .eh_frame
TABLE XX: Overlap of FP and FN in disassembly. sections for recursive descent. This is inferred from the enable
EH analysis kernel option, confirmed by comparing the results
Number of Appearance (%) with and without the .eh_frame section; (3) it also performs
Type 1 2 3 4 5 6 7 8 recursive descent at the targets of d2c xrefs. This is inferred
FP 92.05 4.11 2.52 0.85 0.31 0.14 0.01 0.001 from the create function if data xref data!code32 exists kernel
FN 83.08 14.16 2.36 0.34 0.04 0.01 0.0005 0
option, verified by comparing the results with and without
certain xrefs; (4) it coagulates the remaining bytes in .text
TABLE XXI: Overlap of FP and FN in symbolization.
as code or data (the make final analysis pass kernel option).
Number of Appearance (%) Symbolization: It is in general hard to infer how exactly IDA
Type 1 2 3 4 5 6 P RO and B INARY N INJA do symbolization. We empirically
FP 95.30 4.70 0.03 0.001 0.00 0.00 learned the following strategies: (1) The two tools do not use
FN 85.60 5.40 9.00 0.02 0.00 0.00 heuristics 8 , 10 , and 12 (2) B INARY N INJA rarely considers
d2c and d2d xrefs. (3) The majority of c2d xrefs identified by
TABLE XXII: Overlap of FP and FN in function detection. B INARY N INJA are constant address operands. (4) IDA P RO
will seek to symbolize the data unit at the target of a c2d xref.
Number of Appearance (%)
Type 1 2 3 4 5 6 7 8 9 10
It will further seek to symbolize the neighbors of the data unit
FP 70.4 19.2 2.5 8.6 1.5 0.2 0.05 0.007 0.01 0.007 at the target location.
FN 32.9 8.1 4.1 7.3 14.7 14.5 7.8 5.6 3.2 1.7 Function Entry Identification: Both IDA P RO and B INARY
N INJA consider the targets of direct/indirect calls as function
TABLE XXIII: Overlap of FP and FN in CFG recovery. entries. They further apply some other approaches.
B INARY N INJA adopts the idea from [6] to facilitate iden-
Number of Appearance (%) tification of function entries. It traverse the inter-procedural
Type 1 2 3 4 5 6 7
FP (EDGE) 75.2 14.8 4.6 2.5 2.2 0.4 0.2 CFG and groups all the connected basic blocks as a function.
FN (EDGE) 43.0 22.3 14.3 10.1 4.8 3.7 1.7 Similar to many open source tools, B INARY N INJA also
FP (CG) 99.9 0.01 0.01 0 0.01 0 0 considers targets of tail calls as function entries.
FN (CG) 54.4 20.5 14.8 7.9 2.1 0.2 0.001 IDA P RO leverages at least two other strategies to identify
FP (T-Call) 91.3 7.9 0.7 0.03 0.0001 0 0
function entries: (1) it considers certain (but not all) addresses
FN (T-Call) 61.1 29.5 7.1 0.9 1.4 0 0
FP (Non-Ret) 98.0 2.0 0 0 0 0 0
in the .eh_frame section as function entries. This is con-
FN (Non-Ret) 48.5 20.7 8.3 7.0 6.0 3.4 6.2 firmed by comparing the results before and after removing the
FP (J-Tab) 74.8 17.1 6.1 1.9 0 0 0 .eh_frame section; (2) it considers the targets of certain
FN (J-Tab) 44.8 36.7 12.9 4.5 1.1 0.01 0 d2c xrefs as function entries. We inferred this by comparing
In Table XX, XXI, XXII, and XXIII, we present the the results before and after we intentionally destroy some d2c
overlap of false positives (FP) and false negatives (FN) in xrefs. However, thus far we are not fully aware of how IDA
different tasks. Each cell indicates the percentage of FPs/FNs P RO selects .eh_frame items and the xrefs.
produced by the number of tools specified by Number of Indirect Jumps: Based on [84], B INARY N INJA implements
Appearance (e.g., the value (2.36%) in the cell [FN, 3] in VSA to handle jump tables. As discussed in § IV-B4, B INARY
Table XX means 2.36% of the FNs are produced by 3 tools). N INJA also resolves 120 hand-crafted indirect jumps, however,
with wrong targets. It remains unclear how B INARY N INJA
F. Understanding of Commercial Tools internally handles the 120 cases.
We attempted to infer how IDA P RO and B INARY N INJA According to [1], IDA P RO [49] relies on patterns to detect
operate, based on empirical experiments, blogs [49, 76, 84], jump tables. We further crafted test-cases (e.g., Listing 12) to
documentations [1], and communications with the developers. demonstrate that IDA P RO does not use VSA analysis .
Indirect Calls: We infer how B INARY N INJA and IDA P RO 1 mov $0xe,%ebx
handle indirect calls by examining their results with our 2 movzwl 0x4e33be(%rbx),%eax; operand = 0x4e33be
benchmark binaries (recall § IV-B3).
1 0x4e33b9: 00 00 00 00; end of .fini
All the targets found by B INARY N INJA are constants 2 0x4e33be: (bad); invalid memory address
propagated to the call sites. According to [76], this is achieved 3 0x4e33c0: 04 e3 3b 00; begin of .rodata
by path-sensitive data-flow analysis to calculate the ranges Listing 6: A xref in Busybox. At line 2 (upper part),
or disjoint sets of values (or VSA in general). Based on the constant operand points to an address (0x4e33be) in-
our further communication with the developers, the scope of between .finit section and .rodata section (lower part).
analysis is intra-procedural.
1 804f16c: 61 00 00 00 f7 c1 04 08; Unicode
IDA P RO found two types of targets. The first type is
propagated from constants in the current function. The second Listing 7: Xref missed by ANGR in addr2line. At
type all follows the format in Listing 9. We envision that IDA 0x804f16c, ANGR detects a Unicode and jumps to
P RO uses data-flow like constant propagation to handle indirect 0x804f171, missing the pointer at 0x804f170.
calls and applies patterns to find function tables. 1 callq 42ae30; non-returning
Tail Calls: B INARY N INJA considers a jump as a tail call if 2
3
nopw %cs:0x0(%rax,%rax,1)
00 00 00; padding
the target is outside of the current function and the stack has 4 cmpq $0x0,0x68(%rsi); start of a function
a zero offset [84]. IDA P RO does not particularly handle tail 5 je 406f38
calls, as clarified by their technical support team. Listing 8: A missed function entry (line 4). The disassembler
Non-returning Functions: Our analysis shows that IDA P RO assumes code after line 1 falls through and includes code at
and B INARY N INJA are detecting a similar group of non- line 4 to the preceding function.
returning functions as DYNINST. The difference is mainly
1 app_main = { 1 void run_app(int app){
caused by the recognition of non-returning library functions. 2 test_main, 2 ...
Further, the three tools have comparable precision and recalls. 3 acpid_main, 3 app_main[app](argc);
4 ... 4 ...
This indicates that IDA P RO and B INARY N INJA are using 5 } 5 }
similar recursive strategies to DYNINST.
Listing 9: An indirect call that can be handled by IDA P RO.
This indirect call uses a target from a function table.
G. Interesting Cases and Test Cases
1 switch(which){ 1 ; no restriction on %al
1 popfq 1 popfq 2 case 't': ... 2 sub $0x64,%eax
2 .byte 0xf3,0xc3 2 repz retq 3 ... 3 movzbl %al,%eax
3 .size AES_cbc_encrypt 3 nop 4 default: 4 movslq (%r10,%rax,4),%rax
4 .align 64 4 nop 5 undefined(); 5 add %r10,%rax
5 .LAES_Te:; data in code 5 (bad);disassembly error 6 } 6 jmpq *%rax
6 .long 0xa56363c6 6 movslq -0x5b(%rbx),%esp
Listing 10: A jump table with unrestricted index in dwp. The
Listing 2: An example of data-in-code. This example comes default case in source code transfers to undefined behaviors,
from Openssl, where hand-crafted data are appended after which is compiled into jump tables without index restriction.
code (left). Both OBJDUMP and ANGR incur errors in this case.
1 pop %ebp 1 pop %ebp 1 pop %ebp
2 jmp 8048430 2 jmp 8048430 2 jmp 8048430
1 mov 0x6ab8a0, %esi 1 ; wrong decoding 3 ;new function 3 ;new function 3 ;new function
2 mov %rbx, %rdi 2 640069: add [%rax],%eax 4 push %ebp 4 push %eax 4 db 50h ; P
3 64006b: add [%rax],%cl 5 mov %ebp, %esp 5 mov %ebp, %esp 5 db 89h
1 ; data; 4 ...
2 0x6ab8a0: 69 00 64 00 5 640126: call unwind
3 ... 6 ; non-return Listing 11: Test-case to infer IDA P RO’s disassembly. The
code has a function (line 4, left part) that carrys a common
Listing 3: A false postive of xref in Xalan_base. G HIDRA prologue but is never referenced. IDA P RO correctly disassem-
wrongly identifies the operand 0x6ab8a0 (line 1 left-upper) bles the code. After altering the instruction at line 4 (middle
as a pointer and makes erroneous disassembly. part), IDA P RO considers the code as data (right part).
1 ; load base address 1 ; jump table 1 cmp $4, %rax
2 1b: lea 0x8a(%rip),%r8 2 ac: 84 cb ec ff 2 ja .Ldefault
3 22: movzbl %cl,%ecx 3 a0: dc cb ec ff 3 ...
4 25: movslq (%r8,%rcx,4),%rax 4 b4: 04 cc ec ff 4 cmp $0, %rax
5 29: add %r8,%rax 5 b8: 14 cc ec ff 5 jle .Ldefault
6 2c: jmpq *%rax 6 ... 6 ...
7 jmp *branch_tbl(,%rax,8)
Listing 4: Jump table with 4-byte entries in 64-bit Gold Linker. Listing 12: Hand-crafted jump table with rax as the
1 004286ab: add 0xfb49eb,%rax index. Tools with VSA analysis, like B INARY N INJA and
2 ...
3 00fb49eb: undefined1 [11760811]; invalid address
DYNINST, figure out rax ranges from 1 to 4. IDA P RO
wrongly considers rax ranges from 0 to 4.
Listing 5: A xref error in zeusmp_base. Operand fb49eb
points to invalid data but G HIDRA symbolizes it.

Binsok

Uploaded by

Copyright:

Available Formats

Binsok

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Binsok

Uploaded by

Copyright:

Available Formats

SoK: All You Ever Wanted to Know About x86/x64

Binary Disassembly But Were Afraid to Ask

TABLE I: The group of open-source tools that our study covers

III. A NALYSIS OF T OOLS A. Algorithms and Heuristics in Disassembly

( 6 ). (Only) G HIDRA considers the .eh_frame section to Memoryload:

identify functions that have unwinding information ( 7 ). As Jmp

Ghidra Ghidra-NE McSema

O2 99.99 78.80 99.29 8.55 O1 99.82 78.60 92.87 46.21

O2 100 100 100 100 - - - - -

O2 99.99 99.96 98.78 97.87 O1 99.96 96.08 98.00 84.80

O2 99.91 95.53 87.35 37.71 O1 99.63 95.02 95.59 77.86

O3 99.94 95.06 88.38 31.96 O2 99.99 94.26 99.88 69.94

O2 93.92 75.26 48.95 12.37 - - - - - Tools

O2 86.37 68.08 44.74 17.01 O1 77.86 69.36 27.04 10.90

O3 77.40 57.22 3.06 24.30 O2 90.68 57.85 74.56 29.94

O2 94.99 90.49 48.11 66.79 O1 88.53 96.60 40.78 93.46

Radare2 73.32 3.90 15.55 0.67 4.23 2.34

O1 98.97 80.03 99.90 81.14 - - 100.0 6.43 88.89 71.56

Ghidra Ghidra-NE Dyninst

C. Configurations of Disassembly Tools O2 83.84 33.30 17.24 0.72 - - - - -

You might also like