PS 30 - XianDu 2020
PS 30 - XianDu 2020
PS 30 - XianDu 2020
Abstract—Checkers design is a main step for static analysis Code property graph [2] is a widely used intermediate rep-
of different vulnerabilities. This paper focuses on static analysis resentation of a program in static analysis. It combines abstract
on code property graph, which combines abstract syntax tree, syntax tree, control flow graph, and program dependency graph
control flow graph, and program dependence graph. Developing
checkers on code property graph directly is usually complex and together, which are the three most commonly used information
difficult. In this paper, we have analyzed a large number of in program analysis [23]. Thus, code property graph can
checkers of different vulnerabilities, and extracted those most provide a wealth of information for vulnerability description.
commonly used operations as a set of interfaces. We have Code property graphs are usually stored in a graph database
implemented these interfaces and developed a set of vulnerability and accessed by graph query languages. However, graph
checkers based on them. The practical efficacy of these checkers
are evaluated on the Linux kernel source code. Experimental query languages [20]–[22] are not suitable for vulnerability
results show that our proposed interfaces are strong enough to description. Hence, we still lack an efficient way to access data
express most vulnerabilities and our implementation is effective items in code property graph for vulnerability description.
for vulnerabilities detection. In this paper, we design a set of interfaces on code property
Keywords—Static analysis, code property graph, checker de- graphs to describe program vulnerabilities. Our approach
sign, interface
reconstructs the data structure of code property graph and
provides analysts with various powerful interfaces. These
I. I NTRODUCTION interfaces have been developed for each of the three repre-
sentations contained in the code property graph. Besides, we
Vulnerabilities in computer software can have a huge impact have also designed some interfaces that make use of all three
on software security. Many users are at risk of being attacked representations at the same time to improve the ability of de-
by hackers exploiting flaws in code. Due to the growing scription. These interfaces allow us to describe vulnerabilities
scale of software and the complexity of vulnerabilities, how properly and briefly. The well-designed checkers based on the
to detect security flaws efficiently is becoming a growing interfaces can identify the same type of potential vulnerable
concern. Many techniques have been proposed to detect vul- statements within a few minutes even in a large-scale code
nerabilities in source code. Existing techniques can be divided base such as the Linux kernel.
into dynamic program analysis and static program analysis. We have implemented the proposed interfaces in a popular
Dynamic analysis techniques such as software testing have high-level programming language. In order to demonstrate
the benefit of analyzing the properties of a running program practical effect of our approach, we have developed several
[1], but cannot guarantee analyze all program paths. On the checkers for analysis of vulnerability types such as buffer
other hand, static analysis has a high false positive rate due to overflows and memory leaks based on our interfaces. These
the approximate nature of analysis. However, static analysis checkers are then used to analyze the Linux kernel. Experi-
techniques can achieve a higher path coverage. mental results show that many flaws can be described using
This paper focuses on static analysis for vulnerability detec- the combination of interfaces. Our approach enables us to spot
tion. In general, static analysis methods first obtain an inter- well-known vulnerability types founded in the Linux kernel,
mediate representation of a program, such as abstract syntax such as buffer overflow, memory leak, integer overflow and
tree and control flow graph. Then to analyze different kinds of use-after-free.
vulnerabilities, one should design a checker for each kind of In summary, the main contributions of our work are as
vulnerability. However, designing these checkers directly on follows:
the intermediate representation are usually complex. We have • We design a set of vulnerability description interfaces
analyzed a large number of different vulnerability checkers. on code property graphs. The information in the code
We found that there are many commonly used operations property graph can easily be accessed with the interfaces.
among these checkers. To improve the efficiency of checker • We implement the interfaces and some checkers based on
design, an intuitive idea is to extract those most commonly the interfaces.
used operations as a set of interfaces, and then to design • We perform an experiment on source code of the Linux
checkers based on them. kernel to demonstrate the effectiveness of our approach.
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on May 22,2021 at 14:50:34 UTC from IEEE Xplore. Restrictions apply.
The rest of this paper is organized as follows. Section II
introduces preliminaries of our method. Section III presents
a motivating example. Section IV describes our approach in
detail. Section V shows the experimental evaluation. Section
VI summarizes related work, and Section VII concludes the
paper.
II. P RELIMINARIES
Intermediate representations are primarily designed to opti-
mize and translate code. Among them, we find that some rep-
resentations of code are suitable for describing vulnerabilities.
In this paper, we focus on one representation, namely code
property graph, which combines abstract syntax tree, control
flow graph and program dependence graph into a joint data
structure. In addition, we introduce some concepts related to
Fig. 1. An example of a code property graph
vulnerability descriptions and graph databases.
A. Code Property Graph where K and S are a set of property keys and a set of propert
Code property graph is based on three classic representa- values respectively.
tions. To understand how code property graph contributes to As an example, a code property graph is shown in Figure
detect vulnerabilities, we will introduce the three representa- 1. For the sake of simplicity, AST edges are not marked. Note
tions first. that the statement nodes on the AST act as the skeleton of
Abstract syntax trees(AST) encode the syntactic structure the code property graph, which means the vertices V in a
and relations between statements, expressions and variables. code property graph G are all from AST. The edges E of G
The internal nodes in these trees represent operators and the have three kinds of labels, representing syntax information,
leaf nodes represent operands. An AST does not contain control flow, or dependencies. In this way, the code property
inessential elements in source code such as punctuation and graph combines the abstract syntax tree, control flow graph and
delimiters. Since key information such as variable types and program dependency graph into a combined data structure.
the orders of executable statements are preserved, it is easy
to locate code snippets in the original program. The AST also B. Graph Databases
serves as the basis for other code representations. After generating a code property graph, an essential ques-
Control Flow Graphs(CFG) faithfully describe all paths tion is how to store and manage the data effectively. Graph
that might be traversed through and indicate the order in databases use graph structures with nodes, edges and proper-
which statements are executed. Based on the abstract syntax ties to store data. Compared with relational databases, graph
tree, the control flow graph extracts the conditional branching databases perform better at storage of relational model and
statements and builds edges between nodes of abstract syntax are a natural fit for graph-based representations of programs.
tree. The CFG is essential to many static analysis tools and In our work, a popular database called Neo4j [25] is used to
indispensable for descriptions of many vulnerabilities. store the code property graph.
Program dependence graphs(PDG) represent data depen- It is important to choose an appropriate query language
dencies and control dependencies in source code. In a program other than SQL to retrieve data from a graph database. Typical
dependence graph, each edge can be determined as uses or query langauages designed for graph databases include Cypher
defines or reaching definitions by the set of variables and [22], PGQL [20] and Gremlin [5]. However, these graph query
statements. languages are not suitable for describing vulnerabilities. At
All three representations provide partial attributes of source present, how to efficiently use the code property graph stored
code. Each of them, when used alone, is limited by its per- in graphs to describe vulnerabilities is still a problem.
spective. In order to describe vulnerabilities more concretely,
we need to combine them into a comprehensive representation, C. Vulnerability Descriptions
namely code property graph. A code property graph is defined A series of security researches show that some vulnerabil-
formally as follows. ities have the same code characteristics from the perspective
Definition 1. A Code Property Graphs is a direct multigraph of source code attributes. Therefore, it is an effective method
G = (V, E, λ, μ) where V is a set of nodes same as the nodes to discover potential vulnerabilities by analyzing the features
in AST and E is a set of directed edges containing the edges in known vulnerabilities. The Common Weakness Enumera-
from AST, CFG and PDG. λ denotes an edge labeling function tion(CWE) [3] is a category system for software weaknesses
assigning a label from the label sets in AST, CFG and PDG and vulnerabilities. CWE develop a set of classification rules
to each edge. μ is a function defined as (V ∪ E) × K → S for all vulnerabilities reported in CVE [4]. In order to capture
47
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on May 22,2021 at 14:50:34 UTC from IEEE Xplore. Restrictions apply.
the characteristics from numerous vulnerabilities, our research to a function or nested in statements as shown in the
follows the vulnerability classification standards presented by example, and thus the properties provided by the AST
CWE. The analysis finds that vulnerabilities in the same CWE are nessary.
classification often have similar characteristics. Nonetheless, 2) Sanitization. Inadequate sanitization leads to many vul-
the complexity and variety of software flaws require the nerabilities in programs. In the example in Figure 2,
methods of vulnerability detection be extensible and flexible. if the variable info.rule cnt has been checked before
Therefore, we develop a method to provide some interfaces allocation, the vulnerability could be avoided. Therefore,
for vulnerability descriptions. the order in which statements are executed encoded in
CFG is important.
III. M OTIVATION
3) Source of variables. In order to figure out the value
Consider the example of a buffer overflow founded in the of variables, we need to track the data dependencies
Linux kernel. The vulnerable code is reported as CVE-2010- of variables. Some assignment operations are risky, so
2478 shown in Figure 2 [24]. The vulnerable statement on it is necessary for the detection of many types of
line 4 allocates memory for buffer rule buf using the function vulnerabilities. This information is present in the PDG.
kmalloc. The first one of the two parameters received by kmal-
The features above are fully illustrated in the code prop-
loc represents the size of allocated memory. Unfortunately,
erty graph. However, we find that in practice it is diffcult
the variable info.rule cnt is controlled by users and has not
to effectively access the code property graph stored in the
been sanitized before use. And thus it might be exploited by
graph database. Traversals of code property graphs for graph
attackers. On a 32-bit machine, if info.rule cnt is chosen to
databases have several limitations and are not suitable for
be larger than 0x40000000, the product will overflow and the
describing complicated vulnerabilities.
buffer may be smaller than needed. Finally, a buffer overflow
We collect large amounts of security vulnerabilties of the
occurs when more bytes are copied into the buffer.
Linux kernel on CVE. These vulnerabilities are analyzed man-
1 [...] ually and some common operations in them are identified. In
2 i f ( i n f o . cmd == ETHTOOL GRXCLSRLALL) {
3 i f ( info . r u l e c n t > 0) { order to accurately and elegantly describe these vulnerabilities,
4 r u l e b u f = k m a l l o c ( i n f o . r u l e c n t * s i z e o f ( u32 ) , we have designed a set of efficient and powerful interfaces to
5 GFP USER ) ; access nodes in a code property graph. In the next section,
6 if (! rule buf )
7 r e t u r n −ENOMEM; we explain our approach in detail. And the experimental
8 } evaluation is shown in Section V.
9 }
10 [...]
IV. A PPROACH
Fig. 2. A motivational example of buffer overflow
The code property graph provides a comprehensive view of
Yamaguchi designed some graph traversals [2] using Grem-
the code that we can use to model vulnerabilities. However,
lin to describe well-known types of vulnerabilities. He used
it is not immediately clear how code property graphs can
the following traversal to detect the vulnerable statement on
be employed to describe vulnerabilities. To make use of the
line 4:
graphs effectively, we reconstruct the data structure in code
getCallsTo(’kmalloc’).ithArguments(’0’).astNodes()
property graphs. By analyzing a large number of instances
.filter{ it.type == ’MultiplicativeExpression’}
on CVE, we have summarized common vulnerable operations
Unfortunately, the traversal only describes the product inside
and find the code locality of vulnerabilities. Code involved in a
the allocation function kmalloc. The biggest drawback of
single vulnerability is generally within the scope of a function.
the traversal is that it cannot describe whether the variable
To retain the complete structure of the functions in a program,
info.rule cnt has been sanitized. If the variable had been
we design a series of common interfaces based on functions.
checked, the vulnerability would not exist. In fact, we find
These interfaces are so essential and flexible that it is easy
that we could not use traversals to describe this vulnerability
to combine them into vulnerability descriptions. We traverse
precisely. The reason is that traversals based on Gremlin
from the root node of a function and obtain all the relevant
are not good at expressing the control flow and program
nodes. Our approach bridges the gap between vulnerability
dependencies of source code.
mining and code property graphs and improves the effciency
This simple example illustrates some common characteris-
of analysts. Next, we introduce some of the interfaces briefly.
tics of vulnerable code. To describe the vulnerability precisely,
we need to understand the following aspects of code:
A. Proposed Interfaces
1) Vulnerable operations. Vulerable operations are key to
locating vulnerabilities. These operations would be at There are some basic functions for preprocessing source
risk if they are not regulated properly. For example, if a code, including getEntry, FuncTraverse and getSymbols.
memory allocation function returns an undersized buffer, These functions return the basic nodes of the program and
a buffer overflow may occur during copying into the construct the code property graph. These functions need to be
buffer. These operations may be passed as parameters executed before the following interfaces are imported.
48
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on May 22,2021 at 14:50:34 UTC from IEEE Xplore. Restrictions apply.
1) Interfaces based-on AST: As the skeleton of the code getNodesWithType returns all variable nodes with the type.
property graph, the AST nodes are the base nodes in addition Besides, we provide nodesTypeFilter to filter these nodes. In
to encoding syntactic information. The statement nodes in the practice, we find that these functions are helpful. We also
AST are also nodes that make up CFG and PDG. As shown in design the interface Node.getCallsTo for the function node
Figure 1, a code property graph consists of the trees whose root class. This interface traverses from the given node to the exit
nodes are statement nodes. We design the following functions node of the graph and finds the statement nodes that call the
for AST. function whose name is the parameter.
statements: The interface is designed to reach the nearest sanitized is a function that check if there is one path in the
statement node. This function is the basis for switching from control flow graph from data sources to a destination node
AST to CFG or PDG. where no node on the path redefines the produced symbol and
getParameters: The interface takes an index as an argument no node on the path matches a sanitizer description. If the path
and return the index-th parameter of the function call. Many exists, it could be vulnerable. We use this function to check
vulnerability types are involved in parameters taken by a whether all the variables in a node have been checked.
function, and thus it is necessary to provide a function to To show the effects of these interfaces, we use these
obtain the parameters. interfaces to describe some common types of vulnerabilities
searchAstChild: The interface traverses from a given node as follows.
to all the child nodes and returns a collection of all nodes that
B. Interface-based Vulnerability Checker Design
satisfy the condition. More generally, we often filter nodes in
a syntax tree that meet certain criteria. This is a foundational We investigate numerous anomalys and find some common
interface that is the basis of a variety of interfaces, such as vulnerability operations. Our approach can characterize many
searching for certain types of variables. vulnerability types and narrow the search for them. Some
The information contained in the syntax tree is limited, checkers are designed based on the interfaces to detect 8
and the use of grammar information alone may lead to a vulnerability types. Now we take buffer overflows, memory
large number of missed and false positives. The code property leaks, use-after-free erros and uninitialized structs as the
graph also provides control flow information and dependency examples to introduce the checkers.
information.They play an important role in many more types • Buffer overflow. A buffer overflow is a very common and
of vulnerabilities. dangerous vulnerability due to copying data blocks into a
2) Interfaces based-on CFG: We design an interface to get undersized memory. Such vulnerabilities usually involve
the control flow information. memory allocation functions. In the source code of the
flowsTo: The function takes two statement nodes as argu- Linux kernel, there are a series of memory allocation
ments and returns True if the first argument is executed before functions, including kmalloc, realloc and calloc. Due to
the second one in order to judge the order of two nodes. The the variety of memory allocation functions, we provide
code execution sequence described in the control flow graph a matching rule like regular expressions for strings. We
is commonly used in vulnerability descriptions. begin with the simple example in Figure 2. It can be
3) Interfaces based-on PDG: The program dependence described as follows:
graph contains two kinds of information, namely data de-
pendence and control dependence. We declare a class called Algorithm 1 Buffer overflow checker
Symbol for variable nodes. Type: Buffer overflows
usedBy: The function takes a symbol and returns all the Output: Potential vulnerabilities
successor nodes of the symbol in PDG. Extracting the set of allocation function nodes using get-
uses: The function takes a statement node and returns all CallsTo
the predecessor nodes of the statement node in PDG. for funcNode in the node set do
defBy and reachingDefBy: Similarly, we have also defined Use Graph Build the code property graph of funcNode
two interfaces to return the definition and reaching relations. allocnode = funcNode.getCallsTo the allocation function
Besides, we also design functions for the statement nodes to Extracting the arguments of allocation function
retrieve the symbol involved. if allocnode is not sanitized then
In addition, we design several methods which combine Add the allocnode to the result set
the three representations and make use of the information end if
in the code property graph. Since our interfaces are based end for
on functions in souce code, the first step is to retrieve the Return the result set
relevant function nodes. We support a variety of methods to
access function nodes, such as access by function name with • Memory leak. A memory leak is a type of resource
getFunctionsByName. Considering that the function name is leak. The reason is that the program manages memory
unknown at first, we also support methods that access function allocation incorrectly. For example, a memory leak may
nodes from child nodes. getCallsTo returns all function nodes occur when a program allocates memory without freeing
that call the function indicated by the function argument and it in time. To model it, we need to find memory that has
49
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on May 22,2021 at 14:50:34 UTC from IEEE Xplore. Restrictions apply.
been allocated but not released. We use the Algorithm 2 Algorithm 4 Uninitialized struct checker
to search the vulnerabilities. Type: Uninitialied structs
• Use-after-free. A use-after-free occurs when a program Output: Potential vulnerabilities
continues to use a pointer after it has been freed. The Collecting the set of struct variable using getN-
consequences of use-after-free errors range from the odesWithType
corruption of valid data to the execution of arbitrary code. for funcNode in the node set do
As a memory related vulnerability, it can be detected as Build the code property graph of funcNode
Algorithm 3 shows. structs = nodesTypeFilter struct variable nodes in the
• Uninitialized struct. An uninitialized struct is a common code property graph of the function
source of vulnerabilities. A number of flaws have been prenodes = defBy and usedBy struct nodes
found in the Linux kernel because the structs are not fully for node in prenode do
initialized. It is observed that struct initialzation usually if node without complete initialization using
uses memset. The checker of uninitialized structs is shown node.getCallsTo then
in Algorithm 4. flag = true
break
Algorithm 2 Memory leak checker
end if
Type: Memory leaks end for
Output: Potential vulnerabilities if flag == true then
Extracting the set of allocation function nodes using get- Add the allocnode to the result set
CallsTo end if
for funcNode in the node set do end for
Build the code property graph of funcNode Return the result set
allocnode = funcNode.getCallsTo the allocation function
if node releases the memory using node.getCallsTo then
flag = true can control them with a high-level language rather than use
break traversals with limited ability of description. With algorithms
end if described by these interfaces, analysts can find out potential
if flag == false then vulnerabilities and narrow their audit scope. Another benefit of
Add the allocnode to the result set these interfaces is that they can be combined to detect multiple
end if kinds of vulnerabilities at the time or individually. In the next
end for section, we evaluate our approach experimentally, focusing on
Return the result set real-world security vulnerabilities.
V. E XPERIMENTAL R ESULTS
Algorithm 3 Use-after-free checker
To demonstrate the effectiveness, we evaluate the proposed
Type: Use-after-free
approach on the source code of the Linux kernel version
Output: Potential vulnerabilities
3.12.1. We compare the coverage between our approach and
Extracting the set of release function nodes using getCallsTo
traversals based on Gremlin. Besides, we also study the ability
for funcNode in the node set do
of our approach to detect vulnerabilities.
Build the code property graph of funcNode
We use the tool Joern [28] to import the souce code of the
freenode = funcNode.getCallsTo the release function
Linux kernel. The resulting code property graph takes up 15.4
subnodes = uses freenode
GB of disk space. We conduct the experiment on a laptop
for node in subnodes do
computer with a 2.80 GHz Intel Core i7 CPU and 16 GB of
if node is not sanitized then
main memory. The code property graph consists of 42 million
flag = true
nodes and 71 million edges. We implement our approach in
break
Python, as it is a flexible and commonly used programming
end if
language. We find it easy to convert the algorithms in the
end for
previous section into program code.
if flag == true then
All the algorithms presented in the paper end up within a
Add the allocnode to the result set
few minutes, even in a large code base like the Linux kernel.
end if
It proves that the proposed approach is feasible for real-world
end for
software system.
Return the result set
A. Ability of Description
By describing the typical vulnerability types above, we find We manually analyze the major security vulnerabilities in
these interfaces to be very practical and effective. Analysts the Linux kernel since 2010 and classify them according to
50
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on May 22,2021 at 14:50:34 UTC from IEEE Xplore. Restrictions apply.
CWE classification rules. A total of 164 vulnerabilities can be TABLE II
divided into 10 common types. We model these vulnerability B UFFER OVERFLOW
types and find that many of the vulnerability types could be
Filename Function
represented by our approach. To show the expressing ability arch/m68k/emu/nfblock.c nfhd init one
of our approach, we use traversals to describe vulnerabilities arch/m68k/kernel/dma.c dma alloc coherent
and compare them with our approach. The comparison results arch/m68k/mm/kmap.c get io area
arch/metag/kernel/dma.c metag vm region alloc
are presented in Table I. arch/metag/kernel/traps.c dspram save
arch/microblaze/mm/init.c alloc maybe bootmem
arch/mips/alchemy/common/dbdma.c au1xxx dbdma chan alloc
TABLE I arch/metag/kernel/process.c metag elf map
C OMPARISON OF E XPRESSING A BILITY arch/metag/kernel/tcm.c tcm add region
arch/cris/arch-v32/drivers/axisflashmap.c init axis flash
Vulnerability Vulnerability Type Descriptions arch/cris/arch-v32/drivers/cryptocop.c alloc cdesc
types Traversals Interfaces arch/cris/arch-v32/drivers/cryptocop.c cryptocop job setup
√ √ arch/cris/arch-v32/drivers/cryptocop.c cryptocop setup dma list
Buffer Overflow √ arch/cris/arch-v32/drivers/cryptocop.c cryptocop ioctl process
Memory Leak √ block/blk-cgroup.c blkg create
Use After Free √ √
Integer Overflow
Race Condition √
Struct Uninitialized √
Insecure Object References VI. R ELATED W ORK
Logic Error √
Code Injection √ There are many vulnerability scanners used in practice
Null Pointer Dereference
based on static analysis techniques, such as FindBugs [8],
PMD [9] and ASTREE [10]. While these tools provide a way
to detect vulnerabilities, they can be imprecise because it make
As can be seen from Table I, our approach has obvious approximations about run-time behavior of a program [11].
advantages over the traversals. Traversals can only describe As a remedy, some researchers begin to consider to apply
two of the 10 vulnerability types, while the interfaces we expert knowledge to vulnerability detection. Jia and Dong [27]
design can describe all the types of vulnerabilities except race present a graph traversal method based on code property graph
condition and logic error. Note that these two types of vul- to find bugs. Evans and Larochelle [12] implement a tool
nerability require runtime information to be identified, static to detect buffer overflows and format string vulnerabilities
analysis techniques are not suitable for them. Therefore, we with annotations in different programs. Besides, with the
see it as the inherent limitations of static analysis techniques. development of machine learning, some researchers attempt
to use machine learning to find patterns in vulnerabilities
B. Vulnerability Analysis automatically. For example, Yamaguchi and Golde [6] propose
a method which aims at discovering vulnerabilities using
The practical efficacy of our method in vulnerability detec- clustering. Grieco and Grinblat [7] implement VDiscover,
tion remains to be proved. To this end, We have designed a tool that uses machine learning techniques to discover
checkers with the proposed interfaces to exploit potential vulnerabilities in test cases. While these approaches present
vulnerabilities in the Linux kernel. In addition to the four the idea of automated extraction of vulnerability patterns, they
algorithms presented in Section IV, we also implement four still face many challenges.
other checkers for vulnerability types that can be described as In recent years, many dynamic analysis techniques emerge
shown in Table I. to detect some subtle flaws. Dynamic analysis tools have been
A single type of vulnerability may have different causes. widely used, such as Daikon [13], Valgrind [14] and Pin
For example, buffer overflows are usually due to undersized [15]. However, dynamic analysis techniques are limited by test
allocated memory, but sometimes caused by unsafe function coverage of the source code. Therefore, some researchers try
calls like strcpy. For simplicity, we focus on one common to combine dynamic and static analysis. Artho and Biere [16]
cause of a vulnerability type. develop a method to retain information from static analysis
To demonstrate the effectiveness of the proposed method, for run-time verification. In addition, as important methods of
we analyze the results of the buffer overflow checker as an program testing and verification, fuzzing and model checking
example. We implement the algorithm for buffer overflow remain highly popular. Xu and Yin [26] propose a fuzzing
showns in Section IV and run it on the Linux kernel source framework for binary vulnerability mining.
code. By running the checker, a total of 49 results are Some of research works are devoted to detect certain types
identified, 15 of which are shown in Table II. The vulnerable of vulnerabilities. Bican and Deaconescu [17] present an
functions in the results are memory allocation functions whose implementation as an extension of HIP/SLEEK, an automated
parameters may overflow under attack. Manual analysis shows verification system, to detect buffer overflows in C. Ghanavati
that they match the descriptions and are considered as potential and Costa [18] conduct an empirical study on resource and
vulnerabilities. memory leak issues in Java projects. The detection of single
51
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on May 22,2021 at 14:50:34 UTC from IEEE Xplore. Restrictions apply.
vulnerability type has high accuracy rate. However, these [9] PMD/Java. http://pmd.sourceforge.net
[10] Cousot P, Cousot R, Feret J, et al. Varieties of static analyzers: A
methods are not extensible and universal. comparison with ASTRÉE[C]//First Joint IEEE/IFIP Symposium on
Theoretical Aspects of Software Engineering (TASE’07). IEEE, 2007:
VII. C ONCLUSION 3-20.
In this paper, we propose an approach to effectively exploit [11] Gosain A, Sharma G. Static analysis: A survey of techniques and
tools[M]//Intelligent Computing and Applications. Springer, New Delhi,
information in code property graphs to discover potential 2015: 581-591.
vulnerabilities. The proposed appoach helps describe many [12] Evans D, Larochelle D. Improving security using extensible lightweight
common vulnerability types and find code snippets that fit the static analysis[J]. IEEE software, 2002, 19(1): 42-51.
[13] Dufour B, Hendren L, Verbrugge C. * J: a tool for dynamic analysis
description, meaning they may have vulnerabilities. We apply of Java programs[C]//Companion of the 18th annual ACM SIGPLAN
our method to the Linux kernel and identify some vulnerable conference on Object-oriented programming, systems, languages, and
code. Moreover, the description of these vulnerabilities is applications. 2003: 306-307.
[14] Nethercote N, Seward J. Valgrind: a framework for heavyweight dy-
completely controlled by the analysts, and thus the results can namic binary instrumentation[J]. ACM Sigplan notices, 2007, 42(6):
range from fuzzy matching to precise matching. Analysts can 89-100.
revise these descriptions to find snippets of code that interest [15] Skaletsky A, Devor T, Chachmon N, et al. Dynamic program analysis
of microsoft windows applications[C]//2010 IEEE International Sympo-
them. It helps greatly improve work efficiency and detect the sium on Performance Analysis of Systems & Software (ISPASS). IEEE,
potential vulnerabilities. However, there are some limitations 2010: 2-12.
to our apporach. As a static analysis technique, it is restricted [16] Artho C, Biere A. Combined static and dynamic analysis[J]. Electronic
Notes in Theoretical Computer Science, 2005, 131: 3-14.
in discovering all dependencies in the program. There are [17] Bican A, Deaconescu R, Chin W N, et al. Verification of C Buffer Over-
also some types of vulnerability that cannot be described by flows in C Programs[C]//2018 17th RoEduNet Conference: Networking
our approach. Further improvements in the algorithms and in Education and Research (RoEduNet). IEEE, 2018: 1-6.
[18] Ghanavati M, Costa D, Andrzejak A, et al. Memory and resource
interfaces for describing more types of vulnerabilities are leak defects in java projects: An empirical study[C]//Proceedings of
envisaged as part of future work. the 40th International Conference on Software Engineering: Companion
Proceeedings. 2018: 410-411.
ACKNOWLEDGMENT [19] Heelan S. Vulnerability detection systems: Think cyborg, not robot[J].
IEEE Security & Privacy, 2011, 9(3): 74-77.
This work was funded by the National Nature Science [20] van Rest O, Hong S, Kim J, et al. PGQL: a property graph query lan-
Foundation of China (No. 61802415, No. 61690203, No. guage[C]//Proceedings of the Fourth International Workshop on Graph
61532007). Data Management Experiences and Systems. 2016: 1-6.
[21] Rodriguez M A. The gremlin graph traversal machine and language
(invited talk)[C]//Proceedings of the 15th Symposium on Database
R EFERENCES Programming Languages. 2015: 1-10.
[1] Gosain A, Sharma G. A survey of dynamic program analysis techniques [22] Francis N, Green A, Guagliardo P, et al. Cypher: An evolving query
and tools[C]//Proceedings of the 3rd International Conference on Fron- language for property graphs[C]//Proceedings of the 2018 International
tiers of Intelligent Computing: Theory and Applications (FICTA) 2014. Conference on Management of Data. 2018: 1433-1445.
Springer, Cham, 2015: 113-122. [23] Stanier J, Watson D. Intermediate representations in imperative compil-
[2] Yamaguchi F, Golde N, Arp D, et al. Modeling and discovering ers: A survey[J]. ACM Computing Surveys (CSUR), 2013, 45(3): 1-27.
vulnerabilities with code property graphs[C]//2014 IEEE Symposium [24] CVE-2010-2478, http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-
on Security and Privacy. IEEE, 2014: 590-604. 2010-2478
[3] CWE, https://cwe.mitre.org/ [25] Neo4j, https://neo4j.com/
[4] CVE, http://cve.mitre.org/ [26] Xu L , Yin L , Dong W , et al. Expediting Binary Fuzzing with
[5] Gremlin, https://tinkerpop.apache.org/gremlin.html Symbolic Analysis[J]. International Journal of Software Engineering and
[6] Yamaguchi F. Pattern-Based Vulnerability Discovery[D]. University of Knowledge Engineering, 2019. DOI:10.1142/S0218194018400247
Göttingen, 2015. [27] L. Jia, W. Dong and B. Lu, Bug Finder Evaluation Guided Program
[7] Grieco G, Grinblat G L, Uzal L, et al. Toward large-scale vulnerability Analysis Improvement, 2019 IEEE 7th International Conference on
discovery using machine learning[C]//Proceedings of the Sixth ACM Computer Science and Network Technology (ICCSNT), Dalian, China,
Conference on Data and Application Security and Privacy. 2016: 85-96. 2019, pp. 122-125, doi: 10.1109/ICCSNT47585.2019.8962414.
[8] Hovemeyer D, Pugh W. Finding bugs is easy[J]. Acm sigplan notices, [28] Joern, https://github.com/octopus-platform/joern
2004, 39(12): 92-106.
52
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on May 22,2021 at 14:50:34 UTC from IEEE Xplore. Restrictions apply.