Malware Characterization Using Windows API Call Sequences
1 SAG, DRDO, Delhi, India
2 NIIT University, Neemrana, India
∗ Corresponding Author
Abstract
In this research we use Windows API (Win-API) call sequences to capture
the behaviour of malicious applications. The Detours library by Microsoft
is used to hook the Win-API calls. To obtain a higher level of abstraction,
related Win-APIs are mapped to a single category. In total, 534 important
Win-APIs are hooked and mapped to 26 categories (A...Z). The behaviour of
a malicious application is captured as a sequence of these 26 API
categories. Five classes of malware are analyzed: Worm, Trojan-Downloader,
Trojan-Spy, Trojan-Dropper and Backdoor. For each class, 400 samples were
taken for experimentation, giving a training set of 2000 samples whose API
call sequences were analyzed. For testing, 120 samples were taken for each
class. The fuzzy hashing algorithm ssdeep was applied to generate
fuzzy-hash-based signatures. These signatures were matched to quantify the
API call sequence homologies between test samples and training samples.
Encouraging results were obtained in classifying these samples into the
five classes above. Further, n-gram analysis was carried out to extract
API call sequence patterns specific to each of the five classes of malware.
1 Introduction
In today’s world everyone is connected and uses the internet for most
things. This not only creates dependency on the internet but also increases
the possibility of exploitation through it. Besides computers, smartphones
are also a major source of connectivity. Managing the ever-evolving malware
that targets these devices is critical for their proper functioning and
security. Despite the use of anti-virus software, new malware and their
variants continue to spread. Worms, Backdoors and Trojans are growing at a
tremendous rate, affecting the secrecy, integrity and functionality of
systems. Researchers and anti-malware vendors are therefore continually
developing new solutions to counter the effect of malware.
Various approaches, such as static analysis and dynamic analysis, have been
proposed for malware analysis. In static analysis the binary code is
analyzed without executing it, whereas in dynamic analysis the code is
executed and its behaviour is monitored. The advantage of dynamic analysis
is that it works even for sophisticated obfuscated binaries, where static
analysis is challenging and time-consuming. Static features such as opcode
n-grams and byte-code n-grams have nevertheless been used as features in
malware detection systems [1–4].
New malware can easily evade traditional hash-based signature detection
by introducing a slight modification in the code or by applying obfuscation
techniques. Signatures based on dynamic analysis provide a better detection
rate, as they capture the behaviour of the malware, which remains unaltered
even after obfuscation. Further, to categorize malware into different
classes, the behaviour specific to each class needs to be identified.
The main advantage of dynamic analysis is that the run-time behaviour of
an executable is difficult to obfuscate. Dynamic malware analysis can also
be easily automated, making analysis at large scale possible. Its
disadvantage is that it captures only one execution trace of the whole
program. The program must also be run in a secure run-time environment to
avoid the danger of infection during analysis. Both of these limitations
can be addressed by using good test vectors for maximum code coverage and
by setting up a safe virtual environment. Egele et al. [5] have given an
extensive survey of dynamic malware analysis techniques. We have used a
dynamic analysis technique to analyze samples of different classes, wherein
API call sequences are extracted by running the samples.
Malware Characterization Using Windows API Call Sequences 365
for malware classification and further extracting signatures for all these five
classes based on API call sequences.
2 Methodology
2.1 Overall Malware Classification and Characterization
Framework
The proposed malware characterization framework mainly uses the Win-API
hooking technique for API call sequence extraction and the fuzzy hashing
technique for signature generation, matching and classification. To carry
this out, we downloaded malware samples from available internet resources
[17–19]. This malware dataset was then tagged according to Kaspersky’s
antivirus classification through the free VirusTotal [20] scanning engine.
In this work we selected five classes of malware: Worm, Backdoor,
Trojan-Downloader, Trojan-Dropper and Trojan-Spy. These five classes were
selected because we were able to obtain a sufficient number of tagged
samples for them.
Modules for API hooking and DLL injection were implemented in the
C language to extract the Win-API call sequences. In all, a set of 534
Win-APIs was hooked. All the samples were run and their API call sequences
were recorded. Repeated consecutive API calls were removed while generating
signatures to eliminate redundancy. To obtain a higher level of
abstraction, we bundled similar API calls into one category. In all, 26
such categories (A...Z) were created, and every API call was replaced with
its corresponding category. We defined these 26 categories based on the
functionality of the Win-API calls.
We selected 26 categories to categorize the set of 534 Win-API calls, as
we observed that these are sufficient to capture a higher-level description
of the functionality of any application. For example, category ‘I’ covers
registry write operations and includes Win-APIs such as RegSetValueA,
RegSetValueW, RegSetValueExA, RegSetValueExW, RegCreateKeyA and
RegCreateKeyW. In addition, many text mining tools support the alphabet
domain (A...Z). We also observed that increasing the number of
categories does not increase the accuracy of results.
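As an illustration, the collapsing of repeated consecutive calls and the mapping of calls to category letters can be sketched in Python as follows. This is a minimal sketch, not the C implementation used in this work. Only category ‘I’ and its member APIs come from the text; the letters ‘F’ and ‘N’, the APIs assigned to them, and the sample trace are assumptions made for the example.

```python
from itertools import groupby

# Hypothetical excerpt of the API-to-category table. Only the registry-write
# category 'I' and its member APIs come from the text; 'F' and 'N' are
# invented here purely for illustration.
API_CATEGORY = {
    "RegSetValueA": "I", "RegSetValueW": "I",
    "RegSetValueExA": "I", "RegSetValueExW": "I",
    "RegCreateKeyA": "I", "RegCreateKeyW": "I",
    "CreateFileA": "F", "WriteFile": "F",   # assumed category letter
    "connect": "N", "send": "N",            # assumed category letter
}

def categorize(trace):
    """Collapse consecutive repeated calls, then map each call to its letter."""
    # groupby yields one key per run of identical consecutive API names.
    collapsed = [name for name, _ in groupby(trace)]
    # Calls outside the hooked set are simply skipped.
    return "".join(API_CATEGORY[name] for name in collapsed if name in API_CATEGORY)

# Invented trace: three WriteFile calls and two send calls occur back to back.
trace = ["CreateFileA", "WriteFile", "WriteFile", "WriteFile",
         "RegCreateKeyA", "RegSetValueExA", "connect", "send", "send"]
print(categorize(trace))  # prints FFIINN
```

Repeats are collapsed at the API level before mapping, so two different APIs from the same category still contribute two letters.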
The fuzzy hashing algorithm ssdeep [21] has been applied to the categorized
API call sequences to obtain the fuzzy hash signature of each malware sample.
Thus, a fuzzy hash signature repository has been created for all the samples of
the different classes. For a given test sample, we use the same procedure to extract
its fuzzy hash signature. We then apply the fuzzy hash signature matching
algorithm [21] between the given test sample and all the samples in the
signature repository. The resulting match scores are averaged for each of
the five classes, and the sample is assigned to the class with the highest
average score. We have also extracted unique Win-API call patterns for each
category of malware, expressed as sequences of these API calls. The
schematic diagram of the malware classification framework is shown in
Figure 1. It shows two phases: the Online Phase and the Offline Phase. The
Offline Phase is the learning phase of our classifier, in which a database
of fuzzy hashes is prepared from the known malware samples. In the Online
Phase a new malware sample is subjected to the same procedure and its fuzzy
hash is calculated. This fuzzy hash is then compared with the existing
database of fuzzy hashes of known malware, and the closest matching class
of malware is assigned to the sample. Implementation details of all the
above-mentioned steps are given in Sections 2.2 to 2.5.
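The averaging and class assignment step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the similarity function is a stand-in based on the standard-library `difflib.SequenceMatcher`, not the ssdeep matching algorithm used in this work, and the repository sequences are invented rather than taken from real traces.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in score in 0..100; the paper uses ssdeep signature matching instead."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def classify(test_seq, repository, score=similarity):
    """Average the match scores per class and return the best class.

    repository: list of (class_label, category_sequence) pairs built offline.
    """
    per_class = defaultdict(list)
    for label, seq in repository:
        per_class[label].append(score(test_seq, seq))
    averages = {label: sum(v) / len(v) for label, v in per_class.items()}
    best = max(averages, key=averages.get)
    return best, averages

# Toy repository with invented category sequences (not real malware traces).
repository = [
    ("Worm", "ANNANNFNN"), ("Worm", "ANNFNNANN"),
    ("Trojan-Spy", "IIKFKFII"), ("Trojan-Spy", "IKFKFIIK"),
]
label, averages = classify("ANNANNFNA", repository)
print(label, averages)
```

The `score` parameter is left pluggable so that a real fuzzy hash comparison can replace the stand-in without changing the averaging logic.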
based on Unicode and ANSI types. For our research we have selected 534
Win-API calls which seem relevant for behaviour analysis of malware.
A C program was written for user-level inline API hooking, which uses the
Detours [22] library to extract the Win-API calls. Hooking is a technique
used to modify the behaviour of API calls in an operating system,
application or other software by intercepting function calls or messages
passed between the user/application layer and the kernel layer. Hooks are
used for debugging, monitoring and intercepting messages, and to extend the
functionality of a given binary.
Rootkits, as a malware class, mainly employ hooking to hide themselves in
the system. Hooking can be done either in user mode or in the OS kernel.
Kernel hooking requires valid signed drivers and in-depth knowledge of OS
kernel internals. User-mode hooking is relatively simple and is achieved by
hooking Windows APIs or third-party libraries. There are two main
techniques for user-land hooking: inline hooking and Import Address Table
(IAT) hooking. IAT hooking patches the import address table of the PE file
so that the application calls a replacement function (in the case of
malware, one that carries out malicious functionality) instead of the
intended one. Inline hooking is achieved by overwriting the beginning of
the target function in the loaded DLL with a jump to the replacement
function. Inline hooking is considered more robust than IAT hooking, as it
does not have problems related to DLL binding at run time. It can also be
used to hook other function calls, not only system calls. This
technique is widely used by professionals.
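The table-patching idea behind IAT hooking can be illustrated with a toy Python model. This is only an analogy, and everything in it is hypothetical: a dictionary stands in for the import address table, and `reg_set_value` stands in for a real API. Inline hooking, which this work performs through Detours, rewrites the first instructions of the target function and has no direct pure-Python equivalent.

```python
# Toy model only: a dict plays the role of the import address table (IAT),
# mapping an imported name to the function that will actually be called.
def reg_set_value(key, value):            # hypothetical stand-in for a real API
    return f"{key}={value}"

import_table = {"RegSetValueA": reg_set_value}
call_log = []

def install_hook(table, name, log):
    """Replace the table entry with a wrapper that logs and then forwards."""
    original = table[name]

    def hooked(*args, **kwargs):
        log.append(name)                   # record the call, as a monitor would
        return original(*args, **kwargs)   # preserve the original behaviour

    table[name] = hooked
    return original                        # kept so the hook can be removed later

def application():
    # The "application" always calls through the table, as code does via the IAT.
    return import_table["RegSetValueA"]("Run", "sample.exe")

original = install_hook(import_table, "RegSetValueA", call_log)
result = application()
print(result, call_log)  # prints Run=sample.exe ['RegSetValueA']
```

The caller is unchanged and still receives the original return value, which is what makes this kind of interception suitable for monitoring.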
The Win-API call sequences were extracted by running every sample for
30 seconds in a virtual environment on Windows XP. Consecutive identical
API calls were collapsed into a single call to remove redundant information
from the API call sequences.
hash is computed with ‘b’ and other hash with ‘2b’. With two hashes in a single
signature, one can compare two different signatures with block sizes $b_x$ and
$b_y$ if $b_x = b_y$, $b_x = 2b_y$, or $b_y = 2b_x$.
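The comparability condition on block sizes can be checked with a few lines of Python. This is a minimal sketch that assumes signatures are written in the ssdeep text form `blocksize:hash1:hash2`; the two helper names and the example signatures are invented for illustration.

```python
def block_size(signature):
    """Return the leading block size of an ssdeep-style 'b:hash1:hash2' string."""
    return int(signature.split(":", 1)[0])

def comparable(sig_x, sig_y):
    """True when the block sizes are equal or differ by exactly a factor of two."""
    bx, by = block_size(sig_x), block_size(sig_y)
    return bx == by or bx == 2 * by or by == 2 * bx

# Invented signatures; only the leading block size matters here.
print(comparable("48:abcd:ef", "96:ghij:kl"))   # block sizes 48 and 96  -> True
print(comparable("48:abcd:ef", "192:ghij:kl"))  # block sizes 48 and 192 -> False
```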
Table 3 Average matching score (0–100) of fuzzy hashes between different classes of test
samples and training samples

                       Dataset of 2000 Signatures (400*5)
Test Samples (# 120)   Worm    Backdoor   Trojan-Dropper   Trojan-Downloader   Trojan-Spy
Worm                   25.28   5.42       3.16             7.6                 1.74
Backdoor               5.42    22.14      1.31             5.3                 3.5
Trojan-Dropper         3.16    1.31       24.77            5.45                10.55
Trojan-Downloader      7.6     5.3        5.45             27.73               7.01
Trojan-Spy             1.74    3.5        10.55            7.01                26.63
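The class-level reading of Table 3 can be reproduced with a short Python check that finds, for each test class, the training class with the highest average score and the margin over the runner-up. The scores are copied from Table 3; the variable names are ours, and the check works on class averages only, not on per-sample decisions.

```python
CLASSES = ["Worm", "Backdoor", "Trojan-Dropper", "Trojan-Downloader", "Trojan-Spy"]

# Rows: test class; columns: training class, in the order of CLASSES (Table 3).
SCORES = {
    "Worm":              [25.28, 5.42, 3.16, 7.6, 1.74],
    "Backdoor":          [5.42, 22.14, 1.31, 5.3, 3.5],
    "Trojan-Dropper":    [3.16, 1.31, 24.77, 5.45, 10.55],
    "Trojan-Downloader": [7.6, 5.3, 5.45, 27.73, 7.01],
    "Trojan-Spy":        [1.74, 3.5, 10.55, 7.01, 26.63],
}

best_match = {}
margins = {}
for test_class, row in SCORES.items():
    # Sort (score, training class) pairs from highest to lowest score.
    ranked = sorted(zip(row, CLASSES), reverse=True)
    (best, best_cls), (second, _) = ranked[0], ranked[1]
    best_match[test_class] = best_cls
    margins[test_class] = best - second
    print(f"{test_class}: best={best_cls} ({best}), margin={best - second:.2f}")
```

With these values the row maximum falls on the matching class in all five rows. The narrowest margin belongs to Trojan-Dropper, whose runner-up is Trojan-Spy at 10.55.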
For this we have divided our five-class problem into five two-class problems,
namely: Worm vs rest, Backdoor vs rest, Trojan-Dropper vs rest,
Trojan-Downloader vs rest and Trojan-Spy vs rest. Table 4 gives the
classification accuracy and false positive rate (FPR) for these five
two-class problems, and Table 5 gives the accuracy and FPR for two-class
classifier problems (Malware vs Benign) [11, 12].
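For a one-vs-rest problem, accuracy and FPR can be computed from true and predicted labels as sketched below. The sketch assumes the standard definitions, accuracy = (TP + TN) / total and FPR = FP / (FP + TN), which the text does not spell out. The label lists are invented and do not correspond to the results in Table 4.

```python
def one_vs_rest_metrics(true_labels, predicted_labels, positive):
    """Return (accuracy, false positive rate) for `positive` vs rest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1            # a "rest" sample wrongly assigned to the class
        elif truth == positive:
            fn += 1            # a class sample wrongly assigned to "rest"
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Invented labels for six test samples.
true_labels = ["Worm", "Worm", "Backdoor", "Trojan-Spy", "Backdoor", "Worm"]
predicted   = ["Worm", "Backdoor", "Backdoor", "Worm", "Backdoor", "Worm"]

accuracy, fpr = one_vs_rest_metrics(true_labels, predicted, "Worm")
print(f"Worm vs rest: accuracy={accuracy:.3f}, FPR={fpr:.3f}")
```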
These classification results indicate that class-specific signatures exist
for every class and can be extracted manually by thorough inspection.
Accordingly, some class-specific signatures were extracted in the form of
patterns. Table 6 gives a few of the distinctive patterns extracted for each
category. The table also shows the presence of these patterns in the other
classes. These patterns were extracted using basic n-gram analysis, which is
based on exact matching. Many more patterns could be considered if
approximate matching algorithms were used. It is the presence of these
Win-API patterns that aids the fuzzy-hash-based classification of the five
classes of malware. Our categorization of Win-API calls also made this task
easier and more effective, as it gives an abstract and simplified data
representation to work on.
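Exact-match n-gram analysis over categorized sequences can be sketched in Python as follows. This is a minimal sketch, not the procedure used to produce Table 6: the selection rule (present in a given share of one class's samples and absent from all others), the function names and the toy sequences are assumptions made for the example.

```python
from collections import Counter

def ngrams(seq, n):
    """All contiguous substrings of length n, in order of appearance."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def class_presence(samples_by_class, n):
    """For each class, count how many of its samples contain each n-gram."""
    presence = {}
    for label, sequences in samples_by_class.items():
        counts = Counter()
        for seq in sequences:
            counts.update(set(ngrams(seq, n)))   # count each sample at most once
        presence[label] = counts
    return presence

def distinctive(samples_by_class, label, n, min_share=1.0):
    """n-grams in at least min_share of `label` samples and in no other class."""
    presence = class_presence(samples_by_class, n)
    others = [other for other in presence if other != label]
    size = len(samples_by_class[label])
    return sorted(
        gram for gram, count in presence[label].items()
        if count / size >= min_share
        and all(presence[other][gram] == 0 for other in others)
    )

# Toy categorized sequences (invented, not real traces).
samples = {
    "Worm": ["ANNFANN", "FANNAN"],
    "Trojan-Spy": ["IKFIKF", "KFIKFA"],
}
print(distinctive(samples, "Worm", 3))        # prints ['ANN', 'FAN']
print(distinctive(samples, "Trojan-Spy", 3))  # prints ['FIK', 'IKF', 'KFI']
```

Lowering `min_share` admits patterns that occur in only part of a class, which is one simple way to widen the candidate set without moving to approximate matching.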
374 S. Gupta et al.
4 Conclusion
Classification based on the fuzzy hash matching score of Win-API call
sequences gives good results in classifying different kinds of malware. Five
different classes of malware were analyzed: Worm, Backdoor,
Trojan-Downloader, Trojan-Dropper and Trojan-Spy. The Win-APIs were grouped
into 26 categories based on their functionality, and further analysis was
carried out on these categorized sequences. With n-gram analysis of the
categorized sequences, we were able to extract class-specific patterns for
all five classes of malware. Fuzzy hashes of the categorized sequences were
calculated with the ssdeep algorithm, and the fuzzy hash matching score was
calculated between different categorized sequences. A high fuzzy hash
matching score was observed between samples belonging to the same class. It
was established that the fuzzy hash matching score can be used as a
classification criterion, as it successfully captures the homologies in the
behaviour of malware samples belonging to the same class.
5 Future Work
The proposed malware classification system will be extended to other malware
classes. The fuzzy hash matching scheme can be replaced with more
sophisticated text pattern matching techniques. The extracted unique
subsequences can also be considered as features for classification. The
number of samples in each category will be increased to improve accuracy. We
propose to integrate all the activities into a single automated system which
will check all running programs for malicious behaviour. At present API
hooking is done at user level; it will be extended to kernel level, if
possible. A similar approach will be used to capture the behaviour of
applications on other operating systems such as Linux and Android.
References
[1] Shafiq, M. Z., Tabish, S. M., Mirza, F., and Farooq, M. (2009). PE-Miner:
Mining structural information to detect malicious executables in real
time. In 12th International Symposium on Recent Advances in Intrusion
Detection.
Biographies
Sarvjeet Kaur received her M.Sc (Computer Science) from DAVV, Indore, and
her M.S. (Software System) from BITS, Pilani in 2010. She joined the
Scientific Analysis Group (SAG) in 1991 and is presently working as Scientist
‘F’, heading the Software Security testing Group. Her areas of interest are
Software Security and Malware Analysis.