Survey On The Usage of Machine Learning Techniques For Malware Analysis
Abstract
Coping with malware is getting more and more challenging, given their relentless growth in complexity and volume. One of the most common approaches in the literature is the use of machine learning techniques, to automatically learn models and patterns behind such complexity, and to develop technologies to keep pace with malware evolution. This survey aims at providing an overview of the way machine learning has been used so far in the context of malware analysis in Windows environments, i.e. for the analysis of Portable Executables. We systematize the surveyed papers according to their objectives (i.e., the expected output), the information about malware they specifically use (i.e., the features), and the machine learning techniques they employ (i.e., what algorithm is used to process the input and produce the output). We also outline a number of issues and challenges, including those concerning the used datasets, and identify the main current topical trends and how to possibly advance them. In particular, we introduce the novel concept of malware analysis economics, regarding the study of existing trade-offs among key metrics, such as analysis accuracy and economic costs.
Keywords: portable executable, malware analysis, machine learning,
benchmark, malware analysis economics
techniques for PEs are slightly different from those for Android apps because there are significant dissimilarities in how the operating system and applications work. As a matter of fact, papers on malware analysis commonly point out which specific platform they target, so we specifically focus on works that consider the analysis of PEs. 64 recent papers have been selected on the basis of their bibliographic significance, reviewed and systematised according to a taxonomy with three fundamental dimensions: (i) the specific objective of the analysis, (ii) what types of features extracted from PEs they consider and (iii) what machine learning algorithms they use. We distinguish three main objectives: malware detection, malware similarity analysis and malware category detection. PE features have been grouped into eight types: byte sequences, APIs/system calls, opcodes, network, file system, CPU registers, PE file characteristics and strings. Machine learning algorithms have been categorized depending on whether the learning is supervised, unsupervised or semi-supervised. The characterisation of surveyed papers according to such a taxonomy allows us to spot research directions that have not been investigated yet, such as the impact of particular combinations of features on analysis accuracy. The analysis of such a large literature leads us to single out three main issues to address. The first concerns overcoming modern anti-analysis techniques such as encryption. The second regards the inaccuracy of malware behaviour modelling due to the choice of which operations of the sample are considered for the analysis. The third is about the obsolescence and unavailability of the datasets used in the evaluations, which affect the significance of the obtained results and their reproducibility. In this respect, we propose a few guidelines to prepare suitable benchmarks for malware analysis through machine learning. We also identify a number of topical trends that we consider worth investigating in more detail, such as malware attribution and triage. Furthermore, we introduce the novel concept of malware analysis economics, regarding the existing trade-offs between analysis accuracy, time and cost, which should be taken into account when designing a malware analysis environment.
The novel contributions of this work are:
• the definition of a taxonomy to synthesise the state of the art on machine
learning for malware analysis of PEs;
The rest of the paper is structured as follows. Related work is described in Section 2. Section 3 presents the taxonomy we propose to organise the reviewed malware analysis approaches based on machine learning, which are then characterised according to such a taxonomy in Section 4. From this characterisation, current issues and challenges are pointed out in Section 5. Section 6 highlights topical trends and how to advance them. Malware analysis economics is introduced in Section 7. Finally, conclusions and future work are presented in Section 8.
2. Related Work
Other academic works have already addressed the problem of surveying contributions on the usage of machine learning techniques for malware analysis. The survey written by Shabtai et al. [3] is the first one on this topic. It specifically deals with how classifiers are used on static features to detect malware. As with most of the other surveys mentioned in this section, the main difference with our work is that our scope is wider, as we target other objectives besides malware detection, such as similarity analysis and category detection. Furthermore, a novel contribution we provide is the idea of malware analysis economics, which is not
mentioned by any related work. Also in [4], the authors provide a comparative study on papers using pattern matching to detect malware, by reporting their advantages, disadvantages and problems. Souri and Hosseini [5] propose a taxonomy of malware detection approaches based on machine learning. In addition to considering detection only, their work differs from ours in that they do not investigate what features are taken into account. LeDoux and Lakhotia [6] describe how machine learning is used for malware analysis, whose end goal is defined there as “automatically detect malware as soon as possible, remove it, and repair any damage it has done”.
Bazrafshan et al. [7] focus on malware detection and identify three main methods for detecting malicious software, i.e. based on signatures, behaviours and heuristics, with the latter also using machine learning techniques. They also identify what classes of features are used by the reviewed heuristics for malware detection, i.e. API calls, control flow graphs, n-grams, opcodes and hybrid features. In addition to going beyond malware detection, we propose a larger number of feature types, which reflects the wider breadth of our research.
Basu et al. [8] examine different works relying on data mining and machine
learning techniques for the detection of malware. They identify five types of
features: API call graph, byte sequence, PE header and sections, opcode se-
quence frequency and kernel, i.e. system calls. In our survey we establish more
feature types, such as strings, file system and CPU registers. They also compare
surveyed papers by used features, used dataset and mining method.
Ye et al. [1] examine different aspects of malware detection processes, focusing on feature extraction/selection and classification/clustering algorithms. Also in this case, our survey looks at a larger range of papers, including many works on similarity analysis and category detection. They also highlight a
number of issues, mainly dealing with machine learning aspects (i.e. incremental
learning, active learning and adversarial learning). We instead look at current
issues and limitations from a distinct angle, indeed coming to a different set of
identified problems that complement theirs. Furthermore, they outline several
trends on malware development, while we rather report on trends about machine
learning for malware analysis, again complementing their contributions.
Barriga and Yoo [9] briefly survey literature on malware detection and mal-
ware evasion techniques, to discuss how machine learning can be used by mal-
ware to bypass current detection mechanisms. Our survey focuses instead on
how machine learning can support malware analysis, even when evasion tech-
niques are used. Gardiner and Nagaraja [10] concentrate their survey on the
detection of command and control centres through machine learning.
3. Taxonomy

This section introduces the taxonomy on how machine learning is used for malware analysis in the reviewed papers. We identify three major dimensions along which surveyed works can be conveniently organised. The first one characterises the final objective of the analysis, e.g. malware detection. The second dimension describes the features that the analysis is based on, in terms of how they are extracted, e.g. through dynamic analysis, and what features are considered, e.g. CPU registers. Finally, the third dimension defines what type of machine learning algorithm is used for the analysis, e.g. supervised learning. Figure 1 shows a graphical representation of the taxonomy. The rest of this section is structured according to the taxonomy. Subsection 3.1 describes in detail the objective dimension, features are pointed out in subsection 3.2 and machine learning algorithms are reported in subsection 3.3.
Figure 1: Taxonomy of machine learning techniques for malware analysis
Variants Detection. Developing variants is one of the most effective and cheapest strategies for an attacker to evade detection mechanisms, while reusing already available code and resources as much as possible. Recognising that a sample is actually a variant of a known malware prevents such a strategy from succeeding, and paves the way to understanding how malware evolves over time through the development of new variants. This objective has also been studied in depth in the literature, and several reviewed papers target the detection of variants. Given a malicious sample m, variants detection consists in selecting from the available knowledge base the samples that are variants of m [37, 30, 38, 39, 40, 41]. Considering the huge number of malicious samples received daily by major security firms, recognising variants of already known malware is crucial to reduce the workload of human analysts.
3.1.3. Malware Category Detection
Malware can be categorized according to their prominent behaviours and objectives. They can be aimed at spying on users’ activities and stealing their sensitive information (i.e., spyware), encrypting documents and asking for a ransom (i.e., ransomware), or gaining remote control of an infected machine (i.e., remote access toolkits). Using these categories is a coarse-grained yet significant way of describing malicious samples [63, 64, 65, 66, 32, 67]. Although cyber security firms have not yet agreed upon a standardized taxonomy of malware categories, effectively recognising the category of a sample can add valuable information to the analysis.
3.2. Features

This subsection deals with the features of samples that are considered for the analysis. How features are extracted from executables is reported in subsection 3.2.1, while subsection 3.2.2 details which specific features are taken into account.
using either sandboxes [42, 56, 68, 15, 16, 60, 57, 58, 69, 51, 52, 33] or emulators [70, 40]. Program analysis tools and techniques can also be useful in the feature extraction process by providing, for example, disassembled code and control- and data-flow graphs. An accurate disassembly is important for obtaining correct Byte sequences and Opcodes features (§ 3.2.2), while control- and data-flow graphs can be employed in the extraction of API and System Calls (§ 3.2.2). For an extensive treatment of dynamic analysis, refer to [71].

Among the reviewed works, the majority relies on dynamic analyses [42, 55, 56, 15, 44, 16, 60, 57, 66, 46, 50, 58, 24, 26, 28, 30, 51, 52, 53, 35, 40], while the others use, in equal proportions, either static analyses alone [11, 12, 63, 64, 72, 17, 65, 19, 47, 49, 61, 22, 23, 25, 31, 73, 27, 29, 37, 38, 54, 67, 74, 39] or a combination of static and dynamic techniques [75, 18, 21, 48, 20, 59, 69, 62, 41]. Depending on the specific features, extraction processes can be performed by applying either static, dynamic, or hybrid analysis.
trigrams) [74, 52, 31, 16, 18, 67, 46, 48]. Indeed, the number of features to
consider grows exponentially with n.
API and System Calls. Similarly to opcodes, APIs and system calls enable the analysis of a sample’s behaviour, but at a higher level. They can be extracted either statically or dynamically, by analysing the disassembled code (to get the list of all calls that can potentially be executed) or the execution traces (for the list of calls actually invoked). While APIs allow characterising what actions are executed by a sample [13, 48, 49, 23, 59, 31, 51, 40], looking at system call invocations provides a view on the interaction of the PE with the operating system [42, 56, 44, 57, 18, 46, 58, 20, 59, 26, 70, 28, 33]. The data extracted by observing APIs and system calls can be very large, and many works carry out additional processing to reduce the feature space by using convenient data structures. One of the most popular data structures to represent PE behaviour and extract program structure is the control flow graph, which models control-flow relationships and also allows compilers to produce optimized versions of a program [76]. Several works employ control flow graphs and their extensions for sample analysis, in combination with other feature classes [18, 21, 69, 62, 35].
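As a rough illustration of how a dynamically recorded trace can be turned into inputs for a learning algorithm, the sketch below maps a call trace to a fixed-length frequency vector; the monitored API vocabulary and the trace format are hypothetical.

```python
from collections import Counter

# Hypothetical vocabulary of monitored API calls; real systems track hundreds.
API_VOCAB = ["CreateFileW", "WriteFile", "RegSetValueExW",
             "CreateRemoteThread", "InternetOpenUrlA"]

def api_frequency_vector(trace: list[str]) -> list[int]:
    """Map a recorded call trace to a fixed-length frequency vector."""
    counts = Counter(trace)
    return [counts.get(api, 0) for api in API_VOCAB]

# Example trace as it might be emitted by a sandbox run (illustrative only).
trace = ["CreateFileW", "WriteFile", "WriteFile", "CreateRemoteThread"]
print(api_frequency_vector(trace))   # [1, 2, 0, 1, 0]
```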
on used protocols, TCP/UDP ports, HTTP requests, DNS-level interactions. Many surveyed works require dynamic analysis to extract this kind of information [42, 55, 56, 60, 50, 69, 32, 53, 40]. Other papers extract network-related inputs by monitoring the network and analysing incoming and outgoing traffic [66, 24, 41]. A complementary approach consists in analysing download patterns of network users in a monitored network [22]. It does not require sample execution and focuses on network features related to the download of a sample, such as the website from which the file has been downloaded.
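A minimal sketch of how such network observations might be aggregated into per-sample features is shown below; the event record format and the chosen counters are illustrative assumptions rather than the features used by the cited works.

```python
def network_features(events: list[dict]) -> dict:
    """Aggregate simple counts from a list of network events
    recorded during a sandbox run (event format is illustrative)."""
    return {
        "distinct_remote_ports": len({e["dst_port"] for e in events}),
        "dns_queries": sum(1 for e in events if e["proto"] == "dns"),
        "http_requests": sum(1 for e in events if e["proto"] == "http"),
        "total_bytes_out": sum(e.get("bytes_out", 0) for e in events),
    }

events = [
    {"proto": "dns", "dst_port": 53, "bytes_out": 64},
    {"proto": "http", "dst_port": 80, "bytes_out": 512},
    {"proto": "tcp", "dst_port": 4444, "bytes_out": 2048},
]
print(network_features(events))
```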
CPU Registers. The way CPU registers are used can also be a valuable indication, including whether any hidden register is used, and what values are stored in the registers, especially in the FLAGS register [49, 31, 59, 30].
77, 23, 70, 34].
This subsection reports what machine learning algorithms are used in the surveyed works, organising them on the basis of whether the learning is supervised (§ 3.3.1), unsupervised (§ 3.3.2) or semi-supervised (§ 3.3.3).
Density-Based Spatial Clustering of Applications with Noise [41], Hierarchical Clustering [75, 53], Prototype-based Clustering [57], Self-Organizing Maps [65].
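As a toy illustration of the unsupervised case, the following sketch clusters synthetic per-sample feature vectors with DBSCAN, one of the algorithms listed above; the feature values and parameters are purely illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: one row per sample (e.g., API-call frequencies).
X = np.array([
    [12, 0, 3], [11, 1, 2], [13, 0, 4],   # one behavioural group
    [0, 25, 0], [1, 27, 1],               # another group
    [5, 5, 30],                           # an outlier / possible new family
])

labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(
    StandardScaler().fit_transform(X))
print(labels)   # samples labelled -1 are treated as noise by DBSCAN
```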
in general, do not take into account the interactions with the hosting system,
while those on similarities detection do.
Malware category detection. These articles focus on the identification of specific threats and, thus, on particular features such as byte sequences, opcodes, function lengths and network activity. Table 6 reports the works whose objective is malware category detection.
By reasoning on what algorithms and features have been used, and which have not, for specific objectives, the provided characterisation allows us to easily identify gaps in the literature and, thus, possible research directions to investigate. For instance, all works on differences detection (see Table 5) but [61] rely on dynamically extracted APIs and system calls for building their machine learning models. Novel approaches can be explored by taking into account other features that capture malware interactions with the environment (e.g., memory, file system, CPU registers and Windows Registry).
Table 1: Characterization of surveyed papers having malware detection as objective.
Firdausi et al. [15]. Algorithms: Decision Tree, Naïve Bayes, SVM, k-NN, Multilayer Perceptron Neural Network. Features: APIs/System calls, file system, and Windows Registry. Limitations: the dataset used in the experimental evaluations is very small. Public: ✓; Source: Windows XP SP2; Available: ✗; Labeling: −; Samples: 250 benign, 220 malicious, 470 total.

Anderson et al. [16]. Algorithms: SVM. Features: byte sequences and APIs/system calls. Limitations: the dataset used in the evaluations is small. Public: −; Source: −; Available: ✗; Labeling: −; Samples: 615 benign, 1,615 malicious, 2,230 total.

Santos et al. [17]. Algorithms: Learning with Local and Global Consistency. Features: byte sequences. Limitations: the proposed approach is not effective against packed malware and requires manual labeling of a portion of the small dataset; in particular, the dataset used in the experimental evaluations is small. Public: ✓; Source: own machines and VX Heavens; Available: ✗; Labeling: −; Samples: 1,000 benign, 1,000 malicious, 2,000 total.

Anderson et al. [18]. Algorithms: Multiple Kernel Learning. Features: byte sequences, opcodes, and APIs/system calls. Limitations: instruction categorization is not optimal. Public: ✗; Source: Offensive Computing; Available: ✗; Labeling: −; Samples: 776 benign, 21,716 malicious, 22,492 total.

Yonts [19]. Algorithms: rule-based classifier. Features: PE file characteristics. Limitations: only a subset of all the potential low-level attributes is considered. Public: ✗; Source: SANS Institute; Available: ✗; Labeling: −; Samples: 65,000 benign, 25·10⁵ malicious, 25.65·10⁵ total.
Bai et al. [23]. Algorithms: Decision Tree, Random Forest. Features: PE file characteristics. Limitations: … packed, and malware authors can properly modify the PE header to remain undetected. Public: ✓; Source: Program Files folders and VX Heavens; Available: ✗; Labeling: automated; Samples: 8,592 benign, 10,521 malicious, 19,113 total.

Kruczkowski and Szynkiewicz [24]. Algorithms: SVM. Features: network. Limitations: the dataset used in the experimental evaluations is small. Public: ✗; Source: N6 Platform; Available: ✗; Labeling: −; Samples: ? benign, ? malicious, 1,015 total.

Tamersoy et al. [25]. Algorithms: clustering with locality sensitive hashing. Features: file system. Limitations: rare and new files cannot be accurately classified as benign or malicious. Public: ✓; Source: Symantec's Norton Community Watch; Available: ✗; Labeling: −; Samples: 1,663,506 benign, 47,956 malicious, 4,970,865 total.

Uppal et al. [26]. Algorithms: Decision Tree, Random Forest, Naïve Bayes, SVM. Features: byte sequences and APIs/system calls. Limitations: the dataset used in the experimental evaluations is very small. Public: ✓; Source: legitimate apps and VX Heavens; Available: ✗; Labeling: −; Samples: 150 benign, 120 malicious, 270 total.

Chen et al. [27]. Algorithms: belief propagation. Features: file system. Limitations: rare and new files cannot be accurately classified as benign or malicious. Public: ✗; Source: Comodo Cloud Security Center; Available: ✗; Labeling: −; Samples: 19,142 benign, 2,883 malicious, 69,165 total.

Elhadi et al. [28]. Algorithms: malicious graph matching. Features: APIs/System calls. Limitations: the dataset used in the experimental evaluations is extremely small. Public: ✓; Source: VX Heavens; Available: ✓; Labeling: −; Samples: 10 benign, 75 malicious, 85 total.
Mao et al. [33]. Algorithms: Random Forest. Features: … file system, and Windows Registry. Limitations: … the accuracy of the proposed approach; the dataset is small. Public: ✓; Source: Windows XP SP3 and VX Heavens; Available: ✗; Labeling: −; Samples: 534 benign, 7,257 malicious, 7,791 total.

Saxe and Berlin [34]. Algorithms: Neural Networks. Features: strings and PE file characteristics. Limitations: labels assigned to the training set may be inaccurate and the accuracy of the proposed approach decreases substantially when samples are obfuscated. Public: ✗; Source: legitimate apps and own malware database; Available: ✗; Labeling: automated; Samples: 81,910 benign, 350,016 malicious, 431,926 total.

Srakaew et al. [74]. Algorithms: Decision Tree. Features: byte sequences and opcodes. Limitations: obfuscation techniques reduce detection accuracy. Public: ✗; Source: legitimate files and apps and CWSandbox; Available: ✗; Labeling: −; Samples: 600 benign, 3,851 malicious, 69,165 total.

Wüchner et al. [35]. Algorithms: Naïve Bayes, Random Forest, SVM. Features: byte sequences, APIs/system calls, file system, and Windows Registry. Limitations: the obfuscation techniques applied by the authors may not reflect the ones of real-world samples; the dataset is small. Public: ✗; Source: legitimate app downloads and Malicia; Available: ✗; Labeling: −; Samples: 513 benign, 6,994 malicious, 7,507 total.

Raff and Nicholas [36]. Algorithms: k-NN with Lempel-Ziv Jaccard distance. Features: byte sequences. Limitations: obfuscation techniques reduce detection accuracy. Public: ✗; Source: industry partner; Available: ✗; Labeling: −; Samples: 240,000 benign, 237,349 malicious, 477,349 total.
Table 2: Characterization of surveyed papers having malware variants selection as objective. 1 Instead of using machine learning techniques, Gharacheh
et al. rely on Hidden Markov Models to detect variants of the same malicious sample [37].
Khodamoradi et al. [38]. Algorithms: Decision Tree, Random Forest. Features: opcodes. Limitations: the opcode sequence is not optimal and the dataset size is very small. Public: ✓; Source: system and Program Files folders and self-generated metamorphic malware; Available: ✗; Labeling: −; Samples: 550 benign, 280 malicious, 830 total.

Upchurch and Zhou [39]. Algorithms: clustering with locality sensitive hashing. Features: byte sequences. Limitations: the dataset size is extremely small. Public: ✗; Source: sampled from security incidents; Available: ✓; Labeling: manual; Samples: 0 benign, 85 malicious, 85 total.

Liang et al. [40]. Algorithms: rule-based classifier. Features: APIs/System calls, file system, Windows Registry, and network. Limitations: the monitored API/system call set could be not optimal and the dataset size is small. Public: ✗; Source: Anubis website; Available: ✗; Labeling: −; Samples: 0 benign, 330,248 malicious, 330,248 total.
Table 3: Characterization of surveyed papers having malware families selection as objective. 2 Asquith describes aggregation overlay graphs for
storing PE metadata, without further discussing any machine learning technique that could be applied on top of these new data structures.
Ahmadi et al. [31]. Algorithms: …, Gradient Boosting Decision Tree. Features: … Registry, CPU registers, and PE file characteristics. Limitations: … clearer view of the reasons behind sample classification. Public: ✓; Source: … classification challenge; Available: ✗; Labeling: −; Samples: 0 benign, 21,741 malicious, 21,741 total.

Asquith [70]. Algorithms: −². Features: APIs/System calls, memory, file system, PE file characteristics, and raised exceptions. Limitations: −. Public: −; Source: −; Available: −; Labeling: −; Samples: −.

Lin et al. [52]. Algorithms: SVM. Features: byte sequences, APIs/system calls, file system, and CPU registers. Limitations: the selected API/system call set could be not optimal; evasion techniques and samples requiring user interactions reduce the accuracy of the proposed approach; the dataset is small. Public: ✗; Source: own sandbox; Available: ✗; Labeling: −; Samples: 389 benign, 3,899 malicious, 4,288 total.
Raff and Nicholas [36]. Algorithms: k-NN with Lempel-Ziv Jaccard distance. Features: byte sequences. Limitations: obfuscation techniques reduce detection accuracy. Public: ✗; Source: industry partner; Available: ✗; Labeling: −; Samples: 240,000 benign, 237,349 malicious, 477,349 total.
Table 4: Characterization of surveyed papers having malware similarities detection as objective. 3 SVM is used only for computing the optimal values
of weight factors associated to each feature chosen to detect similarities among malicious samples.
… clustering with locality sensitive hashing; … interactions reduce the approach accuracy; Labeling: automated.

Rieck et al. [57]. Algorithms: prototype-based classification and clustering with Euclidean distance. Features: byte sequences and APIs/system calls. Limitations: evasion techniques and samples requiring user interactions reduce the accuracy of the proposed framework. Public: ✗; Source: CWSandbox and Sunbelt Software; Available: ✗; Labeling: automated; Samples: 0 benign, 36,831 malicious, 36,831 total.
Egele et al. [59]. Algorithms: SVM³. Features: APIs/System calls, memory, and CPU registers. Limitations: the accuracy of computed PE function similarities drops when different compiler toolchains or aggressive optimization levels are used; the dataset is small. Public: ✓; Source: coreutils-8.13 program suite; Available: ✗; Labeling: −; Samples: 1,140 benign, 0 malicious, 1,140 total.
Table 5: Characterization of surveyed papers having malware differences detection as objective.
… the dataset used in the experimental evaluations is small.

Rieck et al. [57]. Algorithms: prototype-based classification and clustering with Euclidean distance. Features: byte sequences and APIs/system calls. Limitations: evasion techniques and samples requiring user interactions reduce the accuracy of the proposed framework. Public: ✗; Source: CWSandbox and Sunbelt Software; Available: ✗; Labeling: automated; Samples: 0 benign, 36,831 malicious, 36,831 total.
Santos et al. [61]. Algorithms: Decision Tree, k-NN, Bayesian Network, Random Forest. Features: opcodes. Limitations: the opcode sequence is not optimal and the dataset size is small; the proposed method is not effective against packed malware. Public: ✓; Source: own machines and VX Heavens; Available: ✗; Labeling: automated; Samples: 1,000 benign, 1,000 malicious, 2,000 total.
Polino et al. [62]. Algorithms: clustering with Jaccard similarity. Features: APIs/System calls. Limitations: evasion techniques, packed malware, and samples requiring user interactions reduce the accuracy of the proposed framework; the API call sequence used to identify sample behaviours is not optimal; the dataset size is small. Public: −; Source: −; Available: −; Labeling: −; Samples: ? benign, ? malicious, 2,136 total.
Table 6: Characterization of surveyed papers having malware category detection as objective. 4 Instead of using machine learning techniques, these
articles rely on Hidden Markov Models to detect metamorphic viruses [63, 64].
Siddiqui et al. [72]. Algorithms: Decision Tree, Random Forest. Features: opcodes. Limitations: advanced packing techniques could reduce detection accuracy; the dataset used in the experimental evaluations is small. Public: ✓; Source: Windows XP and VX Heavens; Available: ✗; Labeling: −; Samples: 1,444 benign, 1,330 malicious, 2,774 total.

Chen et al. [65]. Algorithms: Random Forest, SVM. Features: byte sequences. Limitations: the proposed framework heavily relies on security companies' encyclopedias. Public: ✗; Source: Trend Micro; Available: ✗; Labeling: −; Samples: 0 benign, 330,248 malicious, 330,248 total.
Sexton et al. [67]. Algorithms: Regression, Naïve Bayes, SVM. Features: … opcodes. Limitations: … accuracy; the dataset used in the experimental evaluations is small. Public: −; Source: −; Available: −; Labeling: −; Samples: 4,622 benign, 197 malicious, 4,819 total.
5. Issues and Challenges
require or wait for some user interaction to start their intended activity, in
order to make any kind of automatic analysis infeasible.
Identifying and overcoming these anti-analysis techniques is an important direction to investigate in order to improve the effectiveness of malware analysis. Recent academic and non-academic literature is aligned on this aspect. Karpin and Dorfman [80] highlight the need to address very current problems such as discovering where malware configuration files are stored and whether standard or custom obfuscation/packing/encryption algorithms are employed. De-obfuscation [81, 82] and other operations aimed at supporting binary reverse engineering, such as function similarity identification [83], are still very active research directions. Symbolic execution techniques [84] are promising means to understand what execution paths trigger the launch of the malicious payload.
5.3. Datasets
More than 72% of surveyed works use datasets with both malicious and benign samples, while about 28% rely on datasets with malware only. Just two works rely on benign datasets only [59, 73], because their objectives are identifying sample similarities and attributing the ownership of source code under analysis, respectively.
Figure 2 shows the dataset sources for malicious and benign samples. It is worth noting that most benign datasets consist of legitimate applications (e.g. software contained in “Program Files” or “system” folders), while most malware samples have been obtained from public repositories, security vendors and popular sandboxed analysis services. The most popular public repository in the examined works is VX Heavens [88], followed by Offensive Computing [89] and the Malicia Project [90]. The first two repositories are still actively maintained at the time of writing, while the Malicia Project has been permanently shut down due to dataset ageing and lack of maintainers.
Figure 2: Frequency histogram showing how many reviewed papers use each type of source
(e.g. public repositories, honeypot) to collect their datasets, and whether it is used to gather
malware or benign samples.
Security vendors, popular sandboxed analysis services, and AV companies have access to a huge number of samples. Surveyed works rely on CWSandbox, developed by ThreatTrack Security [91], and Anubis [92]. As can be observed from Figure 2, these sandboxes are mainly used for obtaining malicious samples. Internet Service Providers (ISPs), honeypots and Computer Emergency Response Teams (CERTs) share both benign and malicious datasets with researchers. A few works use malware developed by the authors [37, 38], created using malware toolkits [63] such as the Next Generation Virus Construction Kit, Virus Creation Lab, Mass Code Generator and Second Generation Virus Generator, all available on VX Heavens [88]. A minority of the analysed papers do not mention the source of their datasets.
Among surveyed papers, a recurring issue is the size of the used datasets. Many works, including [12, 13, 15], carry out evaluations on less than 1,000 samples. Just 39% of reviewed studies test their approaches on a population greater than 10,000 samples.
When both malicious and benign samples are used for the evaluation, it is crucial to reflect their real distribution [11, 12, 13, 72, 15, 44, 16, 17, 18, 93, 19, 46, 21, 48, 77, 58, 61, 20, 23, 26, 28, 29, 30, 51, 52, 33, 54, 34, 74, 35, 36]. Indeed, there needs to be a huge imbalance, because non-malware executables are the overwhelming majority. 48% of surveyed works do not take care of this aspect and use datasets that either are balanced between malware and non-malicious software or, even, contain more of the former than the latter. In [19], Yonts supports his choice of using a smaller benign dataset by pointing out that changes in standard system files and legitimate applications are small. 38% of the examined papers instead employ datasets having a proper distribution of malware and non-malware: they are either unbalanced towards benign samples or use exclusively benign or malicious software. As an example, the majority of surveyed papers having malware similarities detection as objective (see Table 4) use exclusively either malware or legitimate applications [55, 56, 57, 59]. The remaining 14% do not describe how their datasets are composed.
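The sketch below illustrates the point on synthetic data: with a realistic 99:1 benign-to-malicious ratio, plain accuracy looks excellent even when precision and recall on the malicious class tell a different story. The data and the classifier choice are illustrative assumptions, not drawn from any surveyed work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for extracted PE features: 1 = malicious, 0 = benign.
# weights=[0.99, 0.01] mimics a realistic prevalence of malware; a balanced
# dataset would hide the effect shown below.
X, y = make_classification(n_samples=50_000, n_features=30,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# With 99% benign samples, plain accuracy is dominated by the majority class;
# precision and recall on the malicious class are far more informative.
print("accuracy :", (pred == y_te).mean())
print("precision:", precision_score(y_te, pred))
print("recall   :", recall_score(y_te, pred))
```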
Differently from other research fields, no reference benchmark is available for malware analysis to compare accuracy and performance across works. Furthermore, published results are known to be biased towards good results [94]. In addition, since the datasets used for evaluations are rarely shared, it is nearly impossible to compare works. Only two surveyed works have shared their dataset [11, 39], while a third one plans to share it in the future [53]. It is worth mentioning that one of the shared datasets is from 2001, hence almost useless today. Indeed, temporal information is crucial to evaluate malware analysis results [95] and determine whether machine learning models have become obsolete [96, 97].
Given such a lack of reference datasets, we propose three desiderata for malware analysis benchmarks.
The datasets used in [11] and [39] are correctly labeled according to the malware detection and malware variants selection objectives, respectively. Neither dataset is balanced. In Schultz et al. [11], the described dataset is biased towards malicious programs, while in [39] diverse groups of variants contain a different number of samples, ranging from 3 to 20. Finally, the analysed datasets are not actively maintained and do not contain any temporal information (in [11], the authors do not mention whether such information has been included in the dataset).
6. Topical Trends
This section outlines a list of topical trends in malware analysis, i.e. topics that are currently being investigated but have not yet reached the same level of maturity as the other areas described in previous sections.
Malware developers can use online public services like VirusTotal [98] and
Malwr [99] to test the effectiveness of their samples in evading most common
antiviruses. Malware analysts can leverage such behaviour by querying these
online services to obtain additional information useful for the analysis, such as
submission time and how many online antiviruses classify a sample as malicious.
Graziano et al. [69] leverage submissions to an online sandbox for identifying
cases where new samples are being tested, with the final aim to detect novel
malware during their development process. Surprisingly, it turned out that
samples used in infamous targeted campaigns had been submitted to public
sandboxes months or years before.
With reference to the proposed taxonomy, advances in the state of the art in
malware analysis could be obtained by analysing submissions to online malware
analysis services, to extract additional machine learning features and gather
intelligence on what next malware are likely to be.
by the same person or group. In [73], the author’s coding style of generic software (i.e. not necessarily malicious) is accurately profiled through syntactic, lexical, and layout features. Unfortunately, this approach requires the availability of source code, which happens only occasionally, e.g. in case of leaks and/or public disclosures.
Malware attribution can be seen as an additional analysis objective, according to the proposed taxonomy. Progress in this direction through machine learning techniques is currently hindered by the lack of ground truth on malware authors, which proves to be really hard to provision. Recent approaches leverage public reports referring to APT groups and detailing what malware they are supposed to have developed: those reports are parsed to mine the relationships between malicious samples and the corresponding APT group authors [101]. The state of the art in malware attribution through machine learning can be advanced by researching alternative methods to generate reliable ground truth on malware developers, or on what malware have been developed by the same actor.
Given the huge amount of new malware that needs to be analysed, fast and accurate prioritisation is required to identify what samples deserve more in-depth analysis. This can be decided on the basis of the level of similarity with already known samples. If a new malware resembles very closely other binaries that have been analysed before, then its examination is not a priority. Otherwise, further analyses can be advised if a new malware really looks different from everything else observed so far. This process is referred to as malware triage and shares some aspects with malware similarity analysis, as they both provide key information to support malware analysis prioritisation. However, they are different because triage requires faster results at the cost of worse accuracy, hence different techniques are usually employed [75, 77, 101, 86].
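A minimal sketch of this idea follows, under the assumption that each sample is represented by a set of features (e.g. byte n-grams or imported APIs) and that Jaccard similarity against already analysed samples drives the prioritisation; this is not the method of any specific cited work.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def triage(new_sample: set, known_samples: dict, threshold: float = 0.6) -> str:
    """Flag a sample for in-depth analysis if it is not similar enough
    to anything already in the knowledge base."""
    best = max((jaccard(new_sample, feats) for feats in known_samples.values()),
               default=0.0)
    return "low priority (close to known samples)" if best >= threshold \
           else "high priority (novel, analyse in depth)"

# Feature sets (e.g. byte n-grams or imported APIs) are illustrative.
known = {"fam_a_v1": {"ng1", "ng2", "ng3", "ng4"},
         "fam_b_v1": {"ng9", "ng10"}}
print(triage({"ng1", "ng2", "ng3", "ng5"}, known))   # resembles fam_a_v1
print(triage({"ng20", "ng21", "ng22"}, known))       # nothing similar
```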
Like attribution, triage can be considered another malware analysis objective. One important challenge of malware triage is finding the proper
trade-off between accuracy and performance, which fits with the problems we
address in the context of malware analysis economics (see Section 7).
This section describes features different from those analysed in Section 3.2.2, which have been used by just a few papers so far. In view of advancing the state of the art in malware analysis, additional research is required on the effectiveness of using such features to improve the accuracy of machine learning techniques.
6.5.2. Function Length
Another characterising feature is the function length, measured as the num-
ber of bytes contained in a function. This input alone is not sufficient to discrim-
inate malicious executables from benign software, indeed it is usually combined
with other features. This idea, formulated in [78], is adopted in [48], where
function length frequencies, extracted through static analysis, are used together
with other static and dynamic features.
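A minimal sketch of how per-function lengths reported by a disassembler could be turned into a fixed-length, frequency-style feature vector is given below; the bin edges and input values are illustrative assumptions.

```python
import numpy as np

def function_length_histogram(func_lengths: list[int]) -> np.ndarray:
    """Bucket function lengths (in bytes) into a fixed histogram and normalise
    it, so executables with different function counts remain comparable."""
    bins = [0, 16, 64, 256, 1024, 4096, np.inf]   # illustrative bin edges
    hist, _ = np.histogram(func_lengths, bins=bins)
    return hist / max(len(func_lengths), 1)

# Lengths as they might be reported by a disassembler for one PE (illustrative).
print(function_length_histogram([12, 40, 44, 300, 310, 2048, 9000]))
```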
on one hand, and supplying the required equipment on the other. We refer to
the study of these trade-offs as malware analysis economics, and in this section
we provide some initial qualitative discussions on this novel topic.
Table 7: Type of analysis required for extracting the inputs presented in Sections 3.2.2 and 6.5:
strings, byte sequences, opcodes, APIs/system calls, file system, CPU registers, PE file char-
acteristics, network, AV/Sandbox submissions, code stylometry, memory accesses, function
length, and raised exceptions.
Static analysis suffices for nine of these thirteen feature types, while dynamic analysis covers seven of them (some feature types can be extracted by both kinds of analysis).
of the APIs that the sample has actually invoked, thus simplifying the identification of the suspicious ones. As a consequence, in this case dynamic analysis is likely to generate more valuable features compared to static analysis. MazeWalker [106] is a typical example of how dynamic information can integrate static analysis.
Although choosing dynamic analysis over, or in addition to, static analysis seems obvious, its inherently higher time complexity constitutes a potential performance bottleneck for the whole malware analysis process, which can undermine the possibility of keeping pace with malware evolution speed. The natural solution is to provision more computational resources to parallelise analysis tasks and thus remove bottlenecks. In turn, such a solution has a cost to be taken into account when designing a malware analysis environment, such as the one presented by Laurenza et al. [107].
The qualitative trade-offs we have identified are between accuracy and time complexity (i.e., higher accuracy requires larger times), between time complexity and analysis pace (i.e., larger times imply a slower pace), between analysis pace and computational resources (faster analysis demands using more resources), and between computational resources and economic cost (obviously, additional equipment has a cost). Similar trade-offs also hold for space complexity. As an example, when using n-grams as features, it has been shown that larger values of n lead to more accurate analysis, at the cost of having the feature space grow exponentially with n [26, 52]. As another example, using larger datasets in general enables more accurate machine learning models and thus better accuracy, provided that enough space is available to store all the samples of the dataset and the related analysis reports.
n feature count
1 187
2 6740
3 46216
4 130671
5 342663
Figure 3: Relationship between execution time (in logarithmic scale) and detection accuracy
as n varies. The target accuracy of 86% is also reported.
Figure 4: Relationship between machine count and malware throughput (in logarithmic scale) for different n-gram sizes. The target load of one million malware per day is also reported.
creases almost linearly while the execution time has an exponential rise, which translates to an exponential decrease of how many malware per second can be processed. It can be noted that the minimum n-gram size to meet the accuracy requirement of 86% is 3. The trade-off between analysis pace and cost can be observed in Figure 4 where, by leveraging the assumption of ideal scalability of the detection algorithm, it is shown that the sustainable malware throughput (in logarithmic scale) increases linearly as the algorithm is parallelised on more machines. 4-grams and 5-grams cannot be used to cope with the expected malware load of one million per day, at least when considering up to five machines. On the other hand, by using four machines and 3-grams, we can sustain the target load and at the same time meet the constraint on detection accuracy.
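A rough sketch of this kind of capacity planning is given below, assuming ideal scalability; the per-sample execution times are hypothetical placeholders rather than the measurements behind Figures 3 and 4.

```python
SECONDS_PER_DAY = 86_400
TARGET_LOAD = 1_000_000          # malware samples per day to sustain

# Hypothetical per-sample execution time (seconds) as a function of n-gram size.
exec_time = {1: 0.05, 2: 0.1, 3: 0.3, 4: 1.2, 5: 4.0}

for n, t in exec_time.items():
    throughput_per_machine = SECONDS_PER_DAY / t
    machines_needed = -(-TARGET_LOAD // int(throughput_per_machine))  # ceiling
    print(f"{n}-grams: {throughput_per_machine:,.0f} samples/day per machine, "
          f"{machines_needed} machine(s) for the target load")
```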
The presented toy example is just meant to better explain how malware analysis economics can be used in practical scenarios. We claim that these trade-offs deserve to be investigated in more detail, with the aim of outlining proper guidelines and strategies to design a malware analysis environment in compliance with requirements on analysis accuracy and pace, while respecting budget
constraints.
8. Conclusion
can be provided to balance competing metrics (e.g. accuracy and cost) when
designing a malware analysis environment.
Acknowledgment
This work has been partially supported by a grant of the Italian Presidency of the Council of Ministers and by the Laboratorio Nazionale of Cyber Security of the CINI (Consorzio Interuniversitario Nazionale per l'Informatica).
References
[7] Z. Bazrafshan, H. Hashemi, S. M. H. Fard, A. Hamzeh, A survey on
heuristic malware detection techniques, in: Information and Knowledge
Technology (IKT), 2013 5th Conference on, IEEE, 2013, pp. 113–120.
[8] I. Basu, Malware detection based on source data using data mining: A
survey, American Journal Of Advanced Computing 3 (1).
[17] I. Santos, J. Nieves, P. G. Bringas, International Symposium on Distributed Computing and Artificial Intelligence, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, Ch. Semi-supervised Learning for Unknown Malware Detection, pp. 415–422.
[19] J. Yonts, Attributes of malicious files, Tech. rep., The SANS Institute
(2012).
[26] D. Uppal, R. Sinha, V. Mehra, V. Jain, Malware detection and classifica-
tion based on extraction of api sequences, in: ICACCI, IEEE, 2014, pp.
2337–2342.
[30] M. Ghiasi, A. Sami, Z. Salehi, Dynamic VSA: a framework for malware detection based on register contents, Engineering Applications of Artificial Intelligence 44 (2015) 111–122.
[34] J. Saxe, K. Berlin, Deep neural network based malware detection using
two dimensional binary program features, in: Malicious and Unwanted
Software (MALWARE), 2015 10th International Conference on, IEEE,
2015, pp. 11–20.
[42] T. Lee, J. J. Mody, Behavioral classification, in: EICAR Conference, 2006,
pp. 1–17.
[51] N. Kawaguchi, K. Omote, Malware function classification using apis in
initial behavior, in: Information Security (AsiaJCIS), 2015 10th Asia Joint
Conference on, IEEE, 2015, pp. 138–144.
[52] C.-T. Lin, N.-J. Wang, H. Xiao, C. Eckert, Feature selection and ex-
traction for malware classification, Journal of Information Science and
Engineering 31 (3) (2015) 965–992.
[61] I. Santos, F. Brezo, X. Ugarte-Pedrero, P. G. Bringas, Opcode sequences
as representation of executables for data-mining-based unknown malware
detection, Information Sciences 231 (2013) 64–82.
[68] H.-S. Park, C.-H. Jun, A simple and fast algorithm for k-medoids cluster-
ing, Expert Systems with Applications 36 (2009) 3336 – 3341.
[70] M. Asquith, Extremely scalable storage and clustering of malware meta-
data, Journal of Computer Virology and Hacking Techniques (2015) 1–10.
artifacts, in: 2017 IEEE Symposium on Security and Privacy (SP), 2017,
pp. 1009–1024.
[86] A. Rosseau, R. Seymour, Finding Xori: Malware Analysis Triage with Automated Disassembly, https://i.blackhat.com/us-18/Wed-August-8/us-18-Rousseau-Finding-Xori-Malware-Analysis-Triage-With-Automated-Disassembly.pdf, last accessed: 2018-10-14 (2018).
[90] Malicia Project, http://malicia-project.com, accessed: 2018-06-03.
[99] Malwr, https://malwr.com, accessed: 2018-06-03.