Analysis of Malware Behavior: Type Classification Using Machine Learning
Abstract: Malicious software has become a major threat to modern society, not only due to the increased complexity of the malware itself but also due to the exponential increase of new malware each day. This study tackles the problem of analyzing and classifying a large amount of malware in a scalable and automated manner. We have developed a distributed malware testing environment by extending Cuckoo Sandbox, which was used to test an extensive number of malware samples and trace their behavioral data. The extracted data was used for the development of a novel type classification approach based on supervised machine learning. The proposed classification approach employs a novel combination of features that achieves a high classification rate with a weighted average AUC value of 0.98 using a Random Forests classifier. The approach has been extensively tested on a total of 42,000 malware samples. Based on the above results it is believed that the developed system can be used to pre-filter novel from known malware in a future malware analysis system.

Keywords: Malware, type-classification, dynamic analysis, scalability, Cuckoo sandbox, Random Forests, API call, feature selection, supervised machine learning.

February 25, 2015

I. INTRODUCTION

Internet usage has grown exponentially in the past years as modern society is becoming more and more dependent on global communication. At the same time, the Internet is increasingly used by criminals, and a large black market has emerged where hackers or others with criminal intent can purchase malware or use malicious services for a renting fee. This provides a strong incentive for hackers to modify and increase the complexity of the malicious code in order to improve the obfuscation and decrease the chances of being detected by anti-virus programs. This leads to multiple forks or new implementations of the same type of malicious software that can propagate out of control. According to AV-Test, approximately 390,000 new malware samples are registered every day, which gives rise to the problem of processing the huge amount of unstructured data obtained from malware analysis [2]. This makes it challenging for anti-virus vendors to detect zero-day attacks and release updates in a reasonable time-frame to prevent infection and propagation.

To meet this problem, researchers and anti-virus vendors seek a faster alternative method of detection that can overcome the limitations imposed by static analysis, which is the classical approach. Analyzing the malicious code can yield inaccurate information when polymorphic, metamorphic and obfuscating methods are used. When the aforementioned methods are applied the complexity increases even more, making it hard to determine which type of malware it is. An alternative to this approach is performing dynamic analysis on the behavior of the malicious software, which can also be a troublesome task when having to analyze an extensive and increasing number of new malware. Due to these problems it is favorable to develop a scalable setup where several malware samples can be dynamically analyzed in parallel. Compared to the past research articles used for this study, a large number of malware samples has been utilized. Having a large sample set adds to the predictive power and reliability of the built classifier, which provides satisfactory results. In this study, a system has been developed which could be used as a pre-filtering application, where all known types can be filtered from the novel malware. This leaves the opportunity to skip static analysis on known malware and focus only on analyzing the novel malware, thus drastically increasing the detection and analysis rate of anti-virus programs. New malware that arise each day are believed to be mostly modified versions of previous malware using sophisticated reproduction techniques. Given this, it is assumed in this study that malware, even though it is new, can exhibit similar behavior to earlier versions from a dynamic analysis point of view [12].

This study is based on a university report written by this team [3]. In section II the background and a discussion of improvements over related work are presented, followed by the methodology in section III, which proposes a solution for the problems presented in the introduction. Finally, the results and conclusion are presented in section IV and section V, respectively.

II. RELATED WORK

When classifying malware types it is essential to find parameters that can distinguish between their behavior; commonly used parameters on Windows platforms are the Windows API calls. These are commonly used because they provide a solid and understandable form of behavioral information, since an API call states an exact action performed on the computer, e.g. creation, access, modification and deletion of files or registry keys. In [10] they use hooking
of the system services and creation or modification of files. Additionally they use logs from various API calls to differentiate malware from cleanware as well as to perform malware family classification. They include a sample set of 1,368 malware and 456 cleanware where they use a frequency representation of the features. The limitation, also emphasized in their future work, is that they need to expand their sample set and explore new features. In [16] they made a scalable approach using the API names and their input arguments, after which they applied feature selection techniques to reduce the number of features for a binary classifier that separates malware and cleanware. The features used in their setup are limited to the API system calls during run-time. Here they have a sample set of 826 malware and 385 cleanware. Additionally they apply a frequency representation, as in [10], but also include a binary representation. Furthermore, in [5] they use CWsandbox, which applies a technique called API hooking to catch the behavior of the malware, but in this paper they strive to classify malware into known families. They use a total sample set of 10,072 malware and utilize a frequency representation of their features. In terms of automatic analysis, [6] has created a framework able to perform thousands of tests on malware binaries each day. Here they use a sample set of 3,133 malware and a sequence representation of their features, which here are the Windows API calls, applied for both clustering and classification. To understand how API calls are used by malicious programs, [8] have made a grouping of features in relation to their purpose, which can be helpful to understand the malware behavior. In terms of classification approaches, a wide range of machine learning algorithms are used, such as J48, Random Forests and Support Vector Machine. The weakness of the related work is the limited number of samples used to build their classifiers. Furthermore, this study proposes a feature representation that combines several of the aforementioned representations to achieve a greater behavioral picture of the malware.

Given that the labels for malware types are provided by anti-virus vendors and based on the related work, it is found that supervised machine learning is a valid choice for this study. Based on a dataset generated from around 80,000 malware samples, a feature selection has been performed after analyzing the data. In the mentioned research articles, API calls are the mainly used parameter for creating features. In this study several parameters were chosen as features in addition to API calls. The additional parameters are: mutexes, registry keys/files accessed and DNS requests. In the related work, different feature representations were used, i.e. sequence, binary and frequency. The contribution of this study is the unique combination of different feature representations and parameters that also apply feature reduction strategies. Furthermore, our study includes a great amount of malware samples and behavioral data collected using our setup. This

supervised machine learning, data generation, data extraction and classification.

A. Dynamic Analysis

As mentioned earlier, a large amount of malware is released on the Internet every day, which makes it more and more suitable to use a dynamic approach in contrast to static analysis. Dynamic analysis is performed by executing malware in a sandbox environment in which it is assumed that the malware believes it is on a normal machine. Here, all actions performed at run-time are recorded and saved in a database. This is different from the classical signature-based approach, also used in the context of static analysis, that is commonly applied by anti-virus vendors. In this study, Cuckoo Sandbox has been chosen as the sandbox environment in which the malware will be injected, see [4]. Since Cuckoo is open source, the software can be freely modified, which means it is possible to change the code to fit the needs of this study. One of the requirements is to make the system distributed and scalable, such that it can be controlled from one central unit and new virtual machines or physical machines can easily be added in order to improve the efficiency of the overall analysis.

B. Supervised Machine Learning

Using dynamic analysis to gather behavioral data, it is possible to perform malware type classification using supervised machine learning. We have chosen Random Forests with 160 trees, which is a decision tree based algorithm that makes use of random sub-sampling, or tree bagging, of the sample space; each subset is then used to create a tree [7]. Individual decision making is utilized at each tree for each classification of malware, where the results are then averaged. This reduces the risk of over-fitting, as the variance of the classification model decreases when averaged over a suitable number of trees. In this study the machine learning tool WEKA has been used, which can be run through a Java-based GUI or directly in the terminal [15].

In Figure 1 an overview of the system is depicted as a flowchart. It includes modules for each of the groups: Data Generation, Data Extraction and Malware Classification. Each group will be explained in the following subsections.

[Figure 1: Flowchart of the system with the groups Data Generation, Data Extraction and Malware Classification. Block labels: Cuckoo Sandbox, InetSim, MongoDB, Type Filtering (ML Labels), Extraction (Parameters), Begin: Malware Classification, Feature Representation, Feature Reduction, Weka: Random Forests, End: Malware Classification.]
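The bagging and averaging idea behind Random Forests, as described in Section B, can be illustrated with a minimal sketch. This is not the WEKA implementation used in the study: one-split trees (decision stumps) stand in for full decision trees, the random feature selection of a full Random Forest is omitted, and the two features (registry writes, DNS requests), the type labels and all function names are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def bootstrap(rows, labels, rng):
    """Draw len(rows) samples with replacement (one 'bag')."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(rows, labels):
    """Fit a one-split tree: keep the (feature, threshold) with fewest training errors."""
    fallback = majority(labels, None)
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            l_lab, r_lab = majority(left, fallback), majority(right, fallback)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]

def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab

def fit_forest(rows, labels, n_trees=160, seed=0):
    """Train one stump per bootstrap sample (160 mirrors the tree count in the text)."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap(rows, labels, rng)) for _ in range(n_trees)]

def predict_proba(forest, row):
    """Average the per-tree votes into a class distribution."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return {label: n / len(forest) for label, n in votes.items()}

def predict(forest, row):
    proba = predict_proba(forest, row)
    return max(proba, key=proba.get)

# Hypothetical toy data: [registry writes, DNS requests] per sample.
rows = [[9, 0], [8, 1], [7, 0], [9, 1], [1, 6], [0, 7], [2, 8], [1, 9]]
labels = ["trojan"] * 4 + ["worm"] * 4
forest = fit_forest(rows, labels)
print(predict(forest, [8, 0]), predict_proba(forest, [8, 0]))
```

Each stump sees a different bootstrap sample, so individual stumps may disagree; averaging their votes is what lowers the variance of the combined model.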
TPR to 0.858. The AUC value is 0.955 which is also the lowest 0.3