
Malware Classification Methodologies

2020, Master's Thesis


UNIVERSITY OF PATRAS
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
DIVISION: ELECTRONICS AND COMPUTERS

Thesis of Vasileios Tsouvalas, son of Konstantinos, student of the Department of Electrical and Computer Engineering of the Polytechnic School of the University of Patras
Registry Number: 227950
Subject: "Malware Classification Methodologies"
Supervisor: Dimitrios Serpanos
Thesis Number: 227950/2020
Patras, October 2020

CERTIFICATION

It is hereby certified that the Thesis with subject "Malware Classification Methodologies" of the Department of Electrical and Computer Engineering student Vasileios Tsouvalas (R.N.: 227950) was publicly presented and examined at the Department of Electrical and Computer Engineering on ..../..../....

Supervisor: Dimitrios Serpanos, Professor
Head of Division: Vasilis Paliouras, Professor

Abstract

Malware detection refers to the classification of software as malicious or benign. Many approaches, employing diverse techniques, have been proposed to tackle this issue. In the present thesis, we present a graph-based solution to the malware detection problem, which extracts resources from executable samples and applies machine learning algorithms to those resources so as to decide on the nature of the executable (malicious or benign). Given an unknown Windows executable sample, we first extract the calls that the sample makes to the Windows Application Programming Interface (API) and arrange them in the form of an API Call Graph, from which an Abstract API Call Graph is constructed. Subsequently, using a Random Walk Graph Kernel, we quantify the similarity between the graph of the unknown sample and the corresponding graphs from a labeled dataset of known samples (benign and malicious Windows executables), in order to carry out binary classification using Support Vector Machines. Following this process, we achieve accuracy levels of up to 98.25%, using a substantially smaller dataset than those used in similar efforts, while being considerably more efficient in time and computational power.

Keywords: Malware Detection, Machine Learning, SVM, Graphs, Graph Kernel

To my parents, Konstantinos and Panagiota

The present study is written in English. In the last pages, a condensed version of the study is provided in Greek.

Contents

1 Introduction
2 The Malware Issue
3 State of the Art
  3.1 Static Analysis
  3.2 Dynamic Analysis
  3.3 Machine Learning
4 Proposed Solution
5 Tools and Dataset
  5.1 Tools
    5.1.1 Disassembly
    5.1.2 API Call Graph
    5.1.3 Classification
    5.1.4 Server
  5.2 Dataset
    5.2.1 Benign
    5.2.2 Malware
6 Implementation
  6.1 Ghidra Analysis
    6.1.1 Codeblocks
    6.1.2 API Calls
  6.2 Resources Extraction
  6.3 Control Flow Graph
  6.4 API Call Graph
  6.5 Abstract API Call Graph
  6.6 Graph Comparison: Random Walk Graph Kernels
    6.6.1 Random Walk
    6.6.2 Graph Kernels
    6.6.3 Direct Product Graphs
    6.6.4 Random Walk Graph Kernel (RWGK)
    6.6.5 Generalized Definition of the Random Walk Graph Kernel for Labeled Graphs
    6.6.6 Present Implementation of the RWGK
    6.6.7 Kernel Normalization
  6.7 Support Vector Machines
    6.7.1 Training and Testing Splits, Number of Repetitions and Input
    6.7.2 Training
    6.7.3 Testing
7 Experiments
  7.1 Sample Size and API Calls
  7.2 Sparsity Metrics
  7.3 API Call Frequency
    7.3.1 Distinct API Calls
    7.3.2 Common API Calls
8 Classification Results
  8.1 Kernel Measurements
    8.1.1 Benign-Benign
    8.1.2 Malware-Malware
    8.1.3 Benign-Malware
  8.2 SVM Results
    8.2.1 Accuracy
    8.2.2 ROC Curve-AUC
    8.2.3 Precision and Recall
9 Conclusion
List of Figures
References

1 Introduction

Computer security or cybersecurity describes the field that encompasses all the efforts pertaining to the protection of computers and networks. Given the magnitude of the issue and the potential risks, the field of computer security is an area where the scientific community, the industry and governmental agencies very often have convergent goals. One of the most significant and prolific challenges in computer security is malware and its detection. In the present thesis, we tackle the problem of malware detection, in the sense that we construct a pipeline in order to classify software as malicious or benign. The algorithm constructed features a graph-based approach to the malware detection problem, which has been attempted in similar efforts[1], and employs Machine Learning techniques for the classification.

2 The Malware Issue

It has been estimated that the total value at risk globally due to cyberattacks may reach up to 5.2 trillion USD by 2023[2].
Cyberattacks comprise hacking, malware, phishing techniques and social engineering, used in order to gain unauthorized access to a system. While hacking, phishing and social engineering involve actively working to bypass a system in pursuit of a malicious goal, malware is a more passive approach to the same objective. Malware is defined as software that has been designed to cause damage, or a program that performs an action it should not, with or without malicious intent. Malware-related cybercrime and breaches account for 28% of all cyberattacks[3], thus making malware detection an essential tool in the fight against cybercrime.

3 State of the Art

Malware analysis is divided into two categories: static and dynamic analysis. Static analysis involves examining the malware without actually running it, whilst in dynamic analysis the malware is run in a secure environment (usually a virtual machine) in order to gather information about its execution. Increasingly, over the last decade, Machine Learning and other data mining techniques have been utilized in malware detection schemes[4].

3.1 Static Analysis

The most common methods employed for static analysis of malware involve signature-based detection, where a predefined database of unique digital malware signatures allows the system to recognize threats, and code analysis, where the malware is reverse-engineered, using a disassembler or a decompiler, in order to examine the code and conclude on the nature of the program.

3.2 Dynamic Analysis

Dynamic analysis, on the other hand, allows for more information to be extracted, since the runtime behavior of the malware can be explored. As such, dynamic analysis offers a superior alternative to static analysis, since the malware is observed during execution and more aspects of its nature may be evaluated. However, it is also the more costly and power-demanding choice.
3.3 Machine Learning

Machine Learning approaches have been implemented in malware detection, at the level of classification, providing accuracy figures ranging from 80% to 99%, depending on the specific techniques and datasets used. Support Vector Machines account for 29% of the learning schemes employed[4].

4 Proposed Solution

Following the state-of-the-art approaches to the malware detection problem[5], as presented in the relevant literature, our study proposes a graph-based detection scheme combining static analysis of executables[6][7] with well-known machine learning tools such as Support Vector Machines (SVMs). We begin with a dataset of acknowledged benign and malicious MS Windows executable files, which we analyze one by one. For each executable, the analysis involves the disassembly of the executable file, which allows for the extraction of the API Calls that the executable makes to the Windows Operating System (OS). The API Calls, as well as information regarding their connectivity, are subsequently used in modeling the API Call Graph of the executable. After a graph-theoretic manipulation of the API Call Graph, we obtain an Abstract API Call Graph, and with the use of an appropriately defined graph kernel[8] for the purpose of graph comparison, we arrive at the detection step of the algorithm. It is important to remark here that, in our study, we treat the malware detection problem as a binary classification one, meaning that, given an executable of unknown "intent", the goal is to classify it as being "malicious" or "benign". To this end, a Support Vector Machine scheme is implemented in order to achieve the classification goal.
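The end-to-end flow just described can be sketched in miniature. Everything below is a hypothetical toy: the hard-coded API-call edge lists stand in for Ghidra's static extraction, an edge-overlap ratio stands in for the Random Walk Graph Kernel, and a nearest-profile decision stands in for the SVM step.

```python
# Toy end-to-end sketch of the proposed pipeline, with stand-in data and
# stand-in similarity/decision steps (NOT the thesis implementation).

def abstract_graph(edges):
    """Merge duplicate API-call nodes: one undirected edge per label pair."""
    return {frozenset(e) for e in edges if e[0] != e[1]}

def kernel(g1, g2):
    """Crude similarity stand-in: shared edges over the larger edge set."""
    a, b = abstract_graph(g1), abstract_graph(g2)
    return len(a & b) / max(len(a), len(b), 1)

# Labeled "training" samples (hypothetical API-call edge lists).
benign  = [("CreateFileW", "ReadFile"), ("ReadFile", "CloseHandle")]
malware = [("CreateRemoteThread", "WriteProcessMemory"),
           ("WriteProcessMemory", "VirtualAllocEx")]

# Unknown sample: its call pattern is closer to the malware profile.
unknown = [("CreateRemoteThread", "WriteProcessMemory"),
           ("WriteProcessMemory", "VirtualAllocEx"),
           ("VirtualAllocEx", "OpenProcess")]

# Nearest-profile decision as a stand-in for the SVM classification step.
label = "malicious" if kernel(unknown, malware) > kernel(unknown, benign) else "benign"
print(label)  # malicious
```

In the actual pipeline, the kernel values for all sample pairs feed an SVM rather than this nearest-profile rule.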
Figure 1: Bird's eye view of the proposed solution pipeline

5 Tools and Dataset

In the following sections, we elaborate on the tools and the data used in the present study, and we explicitly discuss each part of the pipeline of the proposed solution.

5.1 Tools

The toolbox that has been assembled for the graph-based malware detection application of the proposed solution consists of tools aimed at tackling the three main parts of the pipeline: disassembly of the executables, API Call Graph modeling and handling, and classification. We discuss the tools employed in each of those pipeline compartments below.

5.1.1 Disassembly

In this first part of the pipeline, we wish to disassemble the executable sample so as to perform a code analysis and extract certain information crucial to the algorithm. The tool used for the disassembly procedure is the open source program Ghidra[9].

Ghidra

Ghidra is an open source software analysis tool developed by the National Security Agency and released on April 4, 2019. Ghidra is a powerful code analysis tool enabling a wide range of functionalities such as disassembly and decompilation, as well as allowing for scripting and graph representation of data referring to the code analysis. One of the most important capabilities of Ghidra is the fact that its API can be used to develop custom code for performing tasks tailored to the specific user's requirements. It is this particular aspect that allows for coding scripts in order to extract the data and information required in this study, and more specifically, the API Calls that the executable makes and their interconnectivity.

Programming Language

The code that has been developed in this part of the pipeline engages in the extraction of specific resources from Ghidra, after the disassembly process.
Ghidra allows for scripting on top of its API, and thus we are able to write code in Java that allows us to manipulate the results of the analysis and extract the important information. We remark that external libraries have been employed in the Java code to account for the transfer of data and the connection to the Ghidra API.

5.1.2 API Call Graph

Moving on to the second component of the pipeline, we discuss the tools used in the resources extraction following the disassembly, and the handling of said resources in the modeling and manipulation required to reach the API Call Graph, as well as other necessary features. We discuss below the programming languages employed and the theoretical elements that significant parts of the study are based upon.

Programming Languages

The programming language utilized in this compartment is Python, and its use takes place in several parts of the algorithm. Code has been written in Python for the handling of this data, the modeling of it into graphs, as well as the representation of results. We use Python libraries to accommodate several goals such as interconnectivity of the code, handling of information, and representation of results.

Graph Theory

After the extraction of resources from the executable sample, we produce graph representations of the different elements extracted. We manipulate and model these representations using Graph Theory principles. For example, in order to reach an Abstract API Call Graph from an API Call Graph, we employ Graph Theory practices such as bipartite graph abstraction[10] and network component projection[11].

5.1.3 Classification

In the classification module of the proposed solution, we use Support Vector Machines (SVMs), treating the malware detection problem as a binary classification one. SVMs allow us to classify the samples as malicious or benign based on a specific aspect of their nature; in our case, the API Call Graph.
We remark that the code written for the SVMs is in Python and the corresponding classification libraries have been included.

5.1.4 Server

Certain modules of the pipeline are very demanding in computational power, and require a high performance CPU and ample RAM for their execution. In our case, we use a server, which allows for the more burdensome procedures of code analysis and resources extraction to be performed on a more powerful machine. The given server contains an 8-core CPU and has 16 GB of RAM.

5.2 Dataset

In this section, we discuss the executable samples that were gathered for this study, which make up our dataset. The dataset contains labeled executables that are acknowledged and recognized to be either malicious or benign. We remark that one of the most fundamental issues facing any attempt to tackle the malware detection problem is the lack of freely available and agreed-upon datasets. Great efforts have been made in this study to assemble a sufficiently large dataset for the training of the SVM classifier.

5.2.1 Benign

The benign samples collected are mainly installation and support files from an array of trusted sources such as Windows[12], Git[13], Cygwin[14], Codeblocks[15] and others. The amount gathered is 567 executables, ranging in size from several hundred kB to several MB.

5.2.2 Malware

The gathering of malware samples is an easier task, since there are numerous dedicated websites such as VirusShare[16] and VirusTotal[17] which provide malware samples for academic and scientific purposes. We collected 997 malicious executables from VirusShare, again ranging in size from several hundred kB to several MB.

6 Implementation

In this section, we discuss in detail every step of the pipeline and explicitly elaborate on the inner mechanisms of the proposed solution.
6.1 Ghidra Analysis

In this first part of the algorithm, we use the malicious and benign samples as input to the Ghidra platform. Ghidra performs a code analysis and disassembly on all of the dataset samples, one by one. A successful Ghidra analysis establishes that the sample can be correctly disassembled and its code can be represented in an assembly language. We note that out of the 567 benign samples, all are correctly analyzed and disassembled, whereas only 827 of the total 997 malware samples are correctly analyzed. This may occur due to corruption or other flaws in the integrity of the problematic files. We remark that in all cases the analysis is static and the executable is never run. Consequently, any reference made to actions following its execution refers to the possible execution actions based on information that has been statically gathered by Ghidra. After the Ghidra analysis and disassembly, the executables take the form of assembly code, which provides us with information on the data that the executable needs from the platform it runs on (Windows OS in our case) and other aspects such as the registers that it uses, the functions that it calls, etc. In each executable's assembly code, we notice the existence of Codeblocks. Codeblocks are an essential part of this study and their functionality is at the center of it.

6.1.1 Codeblocks

Codeblocks are bundles of code which, after the Ghidra analysis, are revealed to perform one specific action each. In that sense, Codeblocks could be characterized as internal functions. Codeblocks are also internal nodes of the executable and their directed connections represent possible execution paths. In this sense, they can also be characterized as Control Points of the system.

6.1.2 API Calls

A Codeblock may lead to another Codeblock, or it may lead to an API Call, which is a function call to the Windows API.
All API Calls are provided by the Windows API[18], which is the native platform for Windows applications; the executable uses them to ask the OS for certain resources or for certain actions to be performed, e.g. user input and messaging, data access and storage, system services, etc. Each API Call has a unique name and a unique ordinal number. In our study, we use the name of an API Call as its label. It is important to mention that API Calls cannot lead to other API Calls. The process by which a call takes place is the following:

1. A Codeblock reaches an API Call
2. The program "branches" to execute the API Call
3. After the API Call has finished its execution, the program returns to the original Codeblock and resumes its path

This means that only Codeblocks are connected, and only through their connections can API Calls be considered to take place in sequence.

Concluding on the analysis and disassembly, we have for each sample:

1. Code in assembly
2. Codeblocks - Control Points
3. All the possible execution paths
4. The API Calls it makes

6.2 Resources Extraction

After the analysis, we extract the following information from the disassembled executable:

• Codeblock names and addresses
• Incoming and outgoing Codeblocks or API Calls for each Codeblock
• Addresses of those incoming and outgoing Codeblocks or API Calls
• The size of the executable

Using the above information, we are able to model the Control Flow Graph.

6.3 Control Flow Graph

The Control Flow Graph (CFG) is simply the directed graph whose nodes are the executable's Codeblocks and whose edges are the connections with other Codeblocks (based on the incoming and outgoing Codeblock information). It can be represented as a tuple G = (N, E), where N is a finite set of nodes and E is a finite set of edges. Following this, (n1, n2) ∈ E represents that Codeblock n1 can lead to Codeblock n2.
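As a toy illustration of this representation, a CFG can be held as a node set and a directed edge set. The Codeblock names and edges below are hypothetical examples, not drawn from a real executable.

```python
# A minimal sketch of the CFG representation G = (N, E) described above,
# with hypothetical Codeblock identifiers.

nodes = {"block_401000", "block_401020", "block_401050"}
edges = {("block_401000", "block_401020"),   # block_401000 can lead to block_401020
         ("block_401000", "block_401050"),
         ("block_401020", "block_401050")}

def successors(n):
    """Codeblocks reachable in one step from Codeblock n."""
    return {dst for (src, dst) in edges if src == n}

print(sorted(successors("block_401000")))  # ['block_401020', 'block_401050']
```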
6.4 API Call Graph

Going from a CFG to an API Call Graph, we have to note that each Codeblock may lead to more than one API Call, multiple Codeblocks may lead to the same API Call, and a CFG may contain loops. An API Call Graph is a directed graph that is extracted from the CFG of the sample, by replacing each node with the API Calls that the specific Codeblock leads to. Following the example from before, let us assume that Codeblock n1 leads to API Call f1 and Codeblock n2 leads to API Call f2. In this sense, the Codeblock connection (n1, n2) is mapped to the API Call connection (f1, f2). For the API Call Graph, we have G_APICall = (F_API, E_API), where F_API is a finite set of nodes representing the API Calls reachable from the Codeblocks, and E_API is a finite set of edges representing the connections between those API Calls, derived from the Codeblock connectivity. It becomes apparent that the size of the API Call Graph is much greater than the size of the Control Flow Graph.

6.5 Abstract API Call Graph

We wish to compare graphs with one another so as to conclude on their classification, but the size of the API Call Graph poses a problem. As the API Call Graph is very large, we perform an abstraction in order to make it more manageable computationally. We address the API Call Graph as a series of bipartite graphs and perform an abstraction on those to reach an Abstract API Call Graph. The Abstract API Call Graph merges all the nodes referring to the same API Call into one node. Therefore, there are as many nodes in the Abstract API Call Graph as there are distinct API Calls in the executable. The new merged nodes are connected to each other following the connections of every instance of the API Call in the previous API Call Graph.
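A minimal sketch of this merging step, assuming the API Call Graph is given as a list of directed label pairs; the `abstract_api_call_graph` helper is a hypothetical illustration, not the thesis implementation.

```python
# Merge all nodes with the same API-Call label and drop edge direction,
# keeping one edge per label pair (self-loops are discarded).

def abstract_api_call_graph(directed_edges):
    return {frozenset(e) for e in directed_edges if e[0] != e[1]}

# The directed edges (f1, f2), (f1, f3) and (f3, f1) collapse to the two
# undirected edges {f1, f2} and {f1, f3}.
edges = [("f1", "f2"), ("f1", "f3"), ("f3", "f1")]
print(sorted(sorted(e) for e in abstract_api_call_graph(edges)))
# [['f1', 'f2'], ['f1', 'f3']]
```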
If (f1, f2), (f1, f3) and (f3, f1) described the API Call Graph, then the Abstract API Call Graph will be represented by (f1, f2) and (f1, f3). Following this abstraction, we lose the directionality of the graph model of API Calls, but we gain in computational cost, which allows us to perform comparisons among graphs and eventually be able to classify them as belonging to malicious or benign executables. By performing this abstraction, we considerably reduce the size of the graph in question.

6.6 Graph Comparison: Random Walk Graph Kernels

6.6.1 Random Walk

Random Walk (RW) is a well known and widely used algorithm that provides random paths on a graph. Its basic function involves beginning at a node and choosing at random a neighboring node to visit. Repeating this procedure in succession traces a path on the graph that has been acquired in a random fashion.

6.6.2 Graph Kernels

A graph kernel is a kernel function that calculates the inner product of two graphs. It provides a measure of similarity between the two graphs[19][20] and is especially useful in kernelized learning and classification algorithms such as Support Vector Machines, since it allows us to operate directly on the graphs. We use a special case of graph kernels, namely Random Walk Graph Kernels, in order to compare all possible pairs of Abstract API Call Graphs, so as to use Support Vector Machines to classify them as hailing from malicious or benign executables. The idea behind Random Walk Graph Kernels is extensively discussed in the relevant literature[21]. A Random Walk Graph Kernel between two graphs performs random walks on both of the graphs and counts the number of matching paths. Performing a random walk on the direct product graph of a pair of graphs is equivalent to performing random walks simultaneously on that pair of graphs[22].
This is achieved through the use of direct product graphs, which we define as follows:

6.6.3 Direct Product Graphs

Given two graphs G1 = (N1, E1) and G2 = (N2, E2), their direct product graph G× = (N×, E×) is defined as a graph over pairs of nodes from G1 and G2, where two nodes of G× are neighboring if and only if the corresponding nodes in G1 and G2 are both neighbors. Let A1 and A2 be the adjacency matrices of G1 and G2 respectively. Following this, the adjacency matrix of G× is

A× = A1 ⊗ A2    (1)

Figure 2: Graph G1 (on the top left), Graph G2 (on the top right), Direct product graph G× of G1 and G2 (bottom)

6.6.4 Random Walk Graph Kernel (RWGK)

Following the relevant literature, let us define the initial distributions of probabilities over the nodes in G1 and G2 as p1 and p2 respectively, and the stopping probabilities (meaning the probability that a random walk ends at the given node) as q1 and q2[1][21][23]. Considering these probabilities, we calculate the equivalent probabilities of the direct product graph:

p× = p1 ⊗ p2 and q× = q1 ⊗ q2    (2)

where ⊗ denotes the Kronecker product. We define the Random Walk Graph Kernel as

κ(G1, G2) = Σ_{k=0}^{T} μ(k) q×ᵀ A×^k p×    (3)

where:

– A× is the adjacency matrix of the direct product graph as defined in (1), and consequently A×^k represents the probability of simultaneous length-k random walks on G1 and G2.
– p× and q× represent the initial probability distribution and stopping probabilities of the direct product graph, as defined in (2).
– T is the maximum length of the random walks.
– μ(k) = λ^k ∈ [0, 1] is a coefficient that controls the importance of length in random walks, which we use to ensure that the sum converges and the kernel value is well defined.

The binary classification task at hand requires comparing each graph against every other.
To this end, we choose all possible pairs of graphs in order to compute the kernel value of each pair. The kernel value is a measure of the similarity of the two graphs, and by gathering all the kernel values we turn our problem from a graph (i.e. non-linear) classification problem into a linear one, for which a linear classifier may be used. More specifically, given the kernel values of all possible pairs, we can directly use these kernel values as input to the SVM classifier. The principle of kernelizing the data, so as to transform a non-linear problem into a linear one, is also referred to as the kernel trick[24][25].

6.6.5 Generalized Definition of the Random Walk Graph Kernel for Labeled Graphs

Every continuous, symmetric, positive definite kernel κ : X × X → R has a corresponding Hilbert space H, called the Reproducing Kernel Hilbert Space (RKHS)[21]. A different representation of the RKHS involves feature maps, where a feature map is a mapping function such that

φ : X → H

A feature map also defines a kernel as:

κ(x, y) = ⟨φ(x), φ(y)⟩_H    (4)

We note that every positive definite function that corresponds to an RKHS has infinitely many feature maps associated with it. Extending the feature maps to matrices, we have

Φ : X^(n×m) → H^(n×m) and [Φ(A)]_ij := φ(A_ij)    (5)

The generalized definition of the RWGK includes a weight matrix W×:

κ(G1, G2) = Σ_{k=0}^{T} μ(k) q×ᵀ W×^k p×    (6)

where W× is a weight matrix for which we have

W× = Φ(X1) ⊗ Φ(X2)    (7)

Given the definition of Φ(X1) and Φ(X2), and understanding that in the present study the nodes of the graphs are labeled (the node labels of the Abstract API Call Graph are the names of distinct API Calls), we are able to deduce that W× = A×. This is also recognized in the bibliography[21][23][26].
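As a toy illustration of the truncated sum in (3)/(6) with W× = A× built only over pairs of equally-labeled nodes, the function below computes a kernel value for two small label-keyed adjacency maps. It assumes uniform start and stop probabilities, whereas the thesis implementation neglects them; all names and data are hypothetical.

```python
# Truncated random-walk kernel sketch for node-labeled graphs whose labels
# are unique per graph (as in the Abstract API Call Graph). A sketch under
# the stated assumptions, not the thesis implementation.

def rwgk(adj1, adj2, lam=0.1, T=5):
    # The direct product keeps only labels common to both graphs
    # (differing labels would only insert zeros into A_x).
    common = sorted(set(adj1) & set(adj2))
    n = len(common)
    if n == 0:
        return 0.0
    # A_x[i][j] = 1 iff the edge (common[i], common[j]) exists in both graphs.
    A = [[1.0 if common[j] in adj1[common[i]] and common[j] in adj2[common[i]] else 0.0
          for j in range(n)] for i in range(n)]
    p = [1.0 / n] * n                        # uniform initial distribution
    k_val, vec = 0.0, p[:]                   # vec holds A^k p, starting at k = 0
    for k in range(T + 1):
        k_val += (lam ** k) * sum(vec) / n   # mu(k) * q^T A^k p, uniform q
        vec = [sum(A[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return k_val

# Hypothetical abstract graphs as adjacency sets keyed by API-Call label.
g1 = {"f1": {"f2", "f3"}, "f2": {"f1"}, "f3": {"f1"}}
g2 = {"f1": {"f2"}, "f2": {"f1"}, "f4": set()}
print(round(rwgk(g1, g2), 4))  # 0.5556
```

Only the labels f1 and f2 survive into the product graph here; the remaining 2×2 walk contributes 0.5·Σ λ^k over k = 0..5.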
Following this and utilizing Sylvester equation methods, the kernel may be computed as

κ(G1, G2) = q×⊺ (I − λA×)⁻¹ p×    (8)

By solving the Sylvester equation associated with the kernel definition above, we reach O(n³) time, whereas directly inverting (I − λA×) would take O(n⁶).

6.6.6 Present Implementation of the RWGK

We remark that the RWGK is an adjustable quantity and allows us to regulate several parameters in order to fit each classification scheme. In the present implementation we exploit two aspects of the graphs we wish to classify:

• The initial probability distributions and stopping probabilities are neglected [19], since we steer the random walk only through the importance of its depth, namely λ. This allows for more efficient computation and faster random walks.

• When computing the adjacency matrix A× of a pair of graphs, we would nominally account for all the nodes of both graphs. However, given that the node labels are API Calls (uniquely named) and that the Abstract API Call Graph contains each API Call only once, we can eliminate all the nodes that differ between the two graphs. Only common API Calls matter when comparing a pair of graphs, since any differing API Call nodes would merely enlarge A× by inserting zeros. This reduces computation time, since we avoid an enormous number of element-wise operations that evaluate to zero and carry no information, and it lets us use our resources (server RAM) more efficiently.

6.6.7 Kernel Normalization

Even without a classification scheme, the kernel values allow for the extraction of conclusions, since they provide, after all, a measure of similarity between graphs [19][20]. In this sense, we normalize the kernel values acquired from the aforementioned procedure so as to examine the results at this level as well.
The normalization of the kernel values [20][27][28][29] allows for the correct examination of similarity, or lack thereof, since it ensures that a graph always matches itself with a normalized kernel value of 1 and every other graph with a value in (0, 1). The normalized kernel value is given by

κ̂(G1, G2) = κ(G1, G2) / max(κ(G1, G1), κ(G2, G2))    (9)

Details on the Implementation

It is important to remark that, after an extensive search of the existing Python libraries that allow for graph kernel calculation [30][31], none were found to match the criteria set out above; thus the implementation presented in this study is a Random Walk Graph Kernel for labeled graphs that was written and compiled specifically for the problem at hand.

6.7 Support Vector Machines

Having computed the kernel values of all possible pairs of graphs, we can use them in a Support Vector Machine (SVM) classification scheme. Given that SVMs are kernelized learning algorithms, the calculation of the kernel values has transformed a graph classification problem into a linear one. The objective of the SVM is to classify the linearized labeled data in a way that allows for accurate predictions on new data. In this section, we discuss certain basic principles of SVMs and describe the implementation details. Since we have already linearized the problem via the kernel trick, we proceed with the description of a linear SVM.

6.7.1 Training and Testing Splits, Number of Repetitions, and Input

We train and test the SVM classifier using 9 different training-to-testing set splits: we begin with a certain number of kernel values (an equal number of benign-originating and malware-originating kernel values) and split them at random, starting from a 10%-90% train-to-test split and moving in 10% increments up to a 90%-10% train-to-test split.
The two sets, training and testing, are complementary and together sum to the total number of kernel values. The whole procedure is repeated 100 times, so as to smooth out the randomness of the train-to-test split and to allow for a larger number of predictions, minimizing the effect of random spikes or drops in the measurements. The kernel values are split into groups containing an equal number of malware-originating and benign-originating kernel values; those groups contain 50, 100, 150, 200, 250, and 300 benign-benign and malware-malware kernel values.

6.7.2 Training

Given a set of training data of n samples (x1, y1), ..., (xn, yn), xi represents a feature and yi its class label, which takes values in {−1, 1}. The training of the SVM is the procedure during which the algorithm finds the optimal classifier that separates the labeled data while maximizing the margin between them. This optimal classifier is a hyperplane and, letting w be the normal to that hyperplane, we arrive at the SVM decision function

h(x) = w⊺x + b, where w = Σ_{i=1}^{n} αi yi xi    (10)

In the hyperplane definition above, b represents the offset of the hyperplane from the origin and the αi are the SVM training parameters. Having a linearized problem, we can substitute the kernel values for the xi. The kernel values we use form two sets: those computed between graphs originating from benign executables, and those computed between graphs originating from malware. The class label yi takes the value 1 if the corresponding xi is a benign-benign graph kernel value, and −1 if the corresponding xi is a malware-malware graph kernel value.
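The repeated split-train-test loop of Sections 6.7.1 and 6.7.2 can be sketched as follows. The kernel values here are synthetic stand-ins (benign-benign values labeled +1, malware-malware values labeled −1), and the tiny hinge-loss subgradient trainer is a simple stand-in for the actual SVM solver used in the thesis:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.001, epochs=300, lr=0.5):
    """Minimal full-batch subgradient descent on the regularized hinge loss
    (a simple stand-in for a proper SVM solver)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1              # margin violators
        gw, gb = lam * w, 0.0
        if viol.any():
            gw = gw - (y[viol, None] * X[viol]).sum(axis=0) / len(y)
            gb = -y[viol].sum() / len(y)
        w -= lr * gw
        b -= lr * gb
    return w, b

rng = np.random.default_rng(0)
# Synthetic 1-D "kernel value" features: benign-benign (+1) vs malware-malware (-1).
X = np.vstack([rng.normal(0.25, 0.03, (300, 1)),
               rng.normal(0.10, 0.03, (300, 1))])
y = np.array([1] * 300 + [-1] * 300)
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize the feature

accs = []
for repeat in range(100):                       # 100 repetitions of a 90%/10% split
    perm = rng.permutation(len(y))
    train, test = perm[:540], perm[540:]
    w, b = train_linear_svm(X[train], y[train])
    pred = np.where(X[test] @ w + b >= 0, 1, -1)
    accs.append(float((pred == y[test]).mean()))
print(round(float(np.mean(accs)), 3))
```

Averaging the per-repeat accuracies, as done here, is what smooths out the randomness of any single split.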
6.7.3 Testing

Having trained the classifier, we proceed to the testing procedure of the SVM, during which the classifier takes as input never-before-seen kernel values and predicts their class based on the hyperplane developed on the training set. We measure the accuracy, precision, recall, and ROC Area Under Curve of the SVM and accordingly assess the predictions it makes on the testing set.

7 Experiments

In this part, we present measurements performed on the dataset after the analysis procedure. These measurements indicate differences in essential characteristics of the malware and the benign executables and emphasize divergent qualities of the two types of samples. In the effort to classify executables, accumulating knowledge of irregularities, even small ones, allows us to refine the already established criteria. Accumulating enough discrepancies from different measurements provides a more in-depth comprehension of the importance and weight of certain features of the malware detection problem. More specifically, we examine the size of the samples against the number of distinct API Calls in each sample, we look at the sparsity of the Abstract API Call Graphs, and we assess the differences in the appearance frequency of specific API Calls.

7.1 Sample Size and API Calls

The sizes of both the benign executables and the malware samples range from several KB to several MB. In Figure 3, we compare the size of each executable sample against the number of API Calls it makes. Examining Figure 3, we quickly notice the 3 distinct spikes produced by the malware at around 0, 180, and 520 API Calls, whereas the benign executables do not show any irregular behavior in the given plot. Analyzing these irregularities, it becomes apparent that some malware samples do not follow the reasonable assumption that a larger number of API Calls correlates with a larger size.
Certain malware samples are very big in size yet have the same number of API Calls as other, much smaller samples. The reason behind this phenomenon is that most malware samples are newer, updated versions of older ones. Considering this, along with the fact that obfuscation techniques and insertions of "dead" code are often employed in malware development, it is understandable why such large malware samples present so few API Calls.

Figure 3: Benign-Malware API Calls - Executable Size

7.2 Sparsity Metrics

Measurements were performed at different stages of the pipeline. Two sparsity metrics of the Abstract API Call Graphs were computed:

Sparsity Metric 1: SM1 := (number of edges) / (number of nodes)    (11)

Sparsity Metric 2: SM2 := (number of nodes) / (number of edges)    (12)

The sparsity measurements give us an understanding of how dense or sparse the Abstract API Call Graphs are, i.e. how heavily interconnected the API Calls are. The number of connections between the distinct API Call nodes of the Abstract API Call Graph informs us about the cyclomatic complexity of the executable. The cyclomatic complexity of an executable is a measure of the linearly independent execution paths through the code and is a software metric indicating the complexity of the program. As such, it can also be used to recognize maliciousness [32][33]. In Figures 4 and 5, we present the aforementioned sparsity measurements plotted against the number of edges and the number of nodes of the Abstract API Call Graphs of the samples.

Figure 4: Benign-Malware Sparsity Metric 1 - Number of Edges (top), Number of Nodes (bottom)

On all four plots (Figures 4, 5), we remark that the malware samples exhibit a slightly divergent behavior compared to the benign executables.
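The two metrics in (11) and (12) are reciprocals of one another and are straightforward to compute from a graph's node and edge counts; the counts below are arbitrary examples:

```python
def sparsity_metrics(num_nodes, num_edges):
    """SM1 = edges/nodes and SM2 = nodes/edges of an Abstract API Call Graph."""
    return num_edges / num_nodes, num_nodes / num_edges

# Hypothetical graph with 120 distinct API-call nodes and 300 edges.
sm1, sm2 = sparsity_metrics(num_nodes=120, num_edges=300)
print(sm1, sm2)  # 2.5 0.4
```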
Although there is a region in which the two behave similarly, there are malware samples with very large sparsity values, meaning that they are very lightly interconnected in comparison with the benign executables, which follow a more predictable trend.

Figure 5: Benign-Malware Sparsity Metric 2 - Number of Edges (top), Number of Nodes (bottom)

In Figure 6, we show both sparsity metrics plotted against the size of the samples. Observing the sparsity-metric-to-executable-size plots, we notice that certain spikes again appear in the size of the malware samples, which reveal an equivalence in sparsity of the Abstract API Call Graph for samples with vastly different sizes. This, together with the similar behavior observed for malware in the size-to-API-Calls comparison (Figure 3), supports the argument that some malware samples are equivalent to others in API Call metrics even though they differ in size.

Figure 6: Benign-Malware Sparsity Metrics - Executable Size

As mentioned before, one explanation is that malware is usually developed on top of older versions of similar malware, with the addition of code that does not change the behavior examined by the static analysis.

7.3 API Call Frequency

A measurement of interest in the context of malware detection is the API Calls themselves and the frequency with which they appear in the dataset. We perform two types of measurements: API Call frequency for API Calls appearing only in benign or only in malware samples, and API Call frequency for API Calls appearing in both sets of data (common API Calls).

7.3.1 Distinct API Calls

In Figure 7, we plot the frequency of appearance of API Calls unique to benign executables and to malware.
Let N be the number of executable samples in a dataset and nᵢ the number of appearances of a single distinct API Call i across all the executable samples; then the frequency of appearance of API Call i is given by

frequencyᵢ = nᵢ / N    (13)

Examining the top plot of Figure 7, we notice that the frequency of API Calls appearing only in benign executables or only in malware is generally very low. In the bottom plot of Figure 7, we display the same plot restricted to API Calls with a frequency higher than 15%. In this plot, we obtain a clearer picture of the importance of API Call frequency, since we can single out certain API Calls that appear at a distinctly high rate in benign executables or in malware. We identify 4 such API Calls for malware and 5 for benign executables. This metric allows us to form a hypothesis relating the presence of these specific API Calls to the classification of a sample as malicious or benign, which suggests that extracting such a feature may be significant.

Figure 7: Appearing only in Malware or Benign - API Call Frequency: all (top), frequency above 15% (bottom)

7.3.2 Common API Calls

In this part of the API Call frequency measurements, we examine the frequency of the API Calls common to the benign executables and the malware. In Table 1, we present the number of common API Calls in malware and benign samples in relation to the difference in the frequency with which they appear in the malware and the benign samples.

Cutoff Appearance Frequency (%) | Number of API Calls
 0 | 3031
 1 | 1251
 2 |  994
 5 |  761
10 |  492
15 |  345
20 |  274
25 |  218
30 |  122
40 |   33
50 |    9
60 |    3
70 |    1
80 |    0

Table 1: Malware (827) and Benign (567) Samples - Common API Calls by Appearance Frequency Difference

In Table 1, the cutoff appearance frequency is the frequency difference above which there is a given number of common API Calls appearing; i.e.
at a 10% cutoff appearance frequency, there are 492 distinct API Calls, among those common to benign and malware samples, with an appearance frequency difference equal to or greater than 10%. This measurement provides insight into two important aspects:

• For a large difference in appearance frequency among common API Calls, we can conclude that some API Calls are favored by benign samples and others by malware.

• The choice of frequency difference is important, since it establishes how strongly an API Call tends to appear in benign samples or in malware. If the frequency of a common API Call is comparable to the frequencies of API Calls appearing exclusively in malware or in benign executables (see Figure 7), then this API Call is inherently important and its presence can constitute a feature for classification.

In Figure 8, we plot the API Calls that appear in both malware and benign samples and whose appearance frequency difference between malware and benign is greater than 35%.

Figure 8: Common API Calls Appearance Frequency - Cutoff Appearance Frequency 35%

Based on the analysis of the common API Call frequency as a classification feature, we notice that there are certain common API Calls whose absolute frequency is large enough and whose frequency difference between malware and benign appearances is also substantial. These API Calls are memset, RtlLookupFunctionEntry, _initterm, exit, and __set_app_type.

Conclusion on API Call Frequency

In this section, the effort has been focused on the extraction of possible classification features, using the information acquired on the appearance frequency of API Calls in malware and benign executables. Although in isolation these features may not suffice for the correct classification of a sample, they could play a significant role as weights in a classification scheme.
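The frequency of (13) and the cutoff counting behind Table 1 can be sketched together. One reading of (13) is assumed here, counting each API Call at most once per sample; the sample contents and frequencies below are invented:

```python
from collections import Counter

def api_call_frequencies(samples):
    """frequency_i = n_i / N, reading n_i as the number of samples in which
    API call i appears (an assumption of this sketch)."""
    counts = Counter()
    for calls in samples:
        counts.update(set(calls))        # count each call once per sample
    return {api: n / len(samples) for api, n in counts.items()}

def common_calls_above_cutoff(freq_a, freq_b, cutoff):
    """Common API calls whose appearance-frequency difference is at least
    `cutoff` (frequencies here are fractions, not percentages)."""
    return {api for api in set(freq_a) & set(freq_b)
            if abs(freq_a[api] - freq_b[api]) >= cutoff}

# Invented toy samples, each a list of the API calls it makes.
malware = [["memset", "exit"], ["memset"], ["memset", "ReadFile"]]
benign = [["memset", "exit"], ["exit", "WriteFile"]]
fm, fb = api_call_frequencies(malware), api_call_frequencies(benign)
print(sorted(common_calls_above_cutoff(fm, fb, cutoff=0.35)))
```

Sweeping `cutoff` over a grid of values reproduces the kind of count-versus-cutoff relationship shown in Table 1.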
8 Classification Results

In this section, we provide the results attained by the malware detection pipeline of the proposed solution. We divide the results into preliminary ones, namely the kernel values for benign-benign, malware-malware, and benign-malware pairs, and the final Support Vector Machine prediction results on accuracy, precision, recall, and ROC AUC.

8.1 Kernel Measurements

As discussed before, the kernel is a measure of similarity between graphs [19][20]. The normalized kernels of different dataset sizes and combinations are presented in the following sections.

8.1.1 Benign-Benign

We calculate the kernel values among benign executables for 50, 100, 150, 200, 250, and 300 benign samples. The resulting pairwise similarities are plotted in Figure 9, where each vertical sequence of dots represents the kernel values of a given sample against all the other samples, i.e. the similarity it shares with each of them. The kernel values are normalized so as to allow for correct cross-examination with the corresponding plots concerning the malware. On each plot we also give a histogram of the distribution of the kernel values (on the right side of the plots), in order to conclude how similar the samples are overall. From the plots of Figure 9, we can easily see that, for all the different dataset sizes, the average similarity value is around 20%. This demonstrates experimentally that the chosen datasets are valid and do not contain benign samples that are overly similar to each other (as far as Abstract API Call Graphs go).
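The per-sample similarity columns and histograms described above come from the matrix of all pairwise normalized kernel values; a sketch, with integers and a toy kernel standing in for the graphs and the RWGK:

```python
import numpy as np

def pairwise_normalized_kernels(graphs, kappa):
    """All pairwise normalized kernel values kappa_hat (Equation 9) and
    their off-diagonal mean, as summarized by the similarity histograms."""
    n = len(graphs)
    K = np.array([[kappa(g, h) for h in graphs] for g in graphs])
    self_sim = np.diag(K)
    K_hat = K / np.maximum.outer(self_sim, self_sim)   # Eq. (9) normalization
    off_diag = K_hat[~np.eye(n, dtype=bool)]           # drop self-similarities
    return K_hat, off_diag.mean()

# Toy "graphs" represented as integers with a toy kernel, for illustration.
graphs = [1, 2, 3, 4]
K_hat, mean_sim = pairwise_normalized_kernels(graphs, lambda a, b: float(a * b))
print(np.allclose(np.diag(K_hat), 1.0), round(mean_sim, 3))
```

The diagonal is identically 1 by construction, so only the off-diagonal values carry information about how similar the dataset's samples are overall.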
Figure 9: Benign-Benign Kernel Values for Dataset Size 50 (top-left), 100 (top-right), 150 (middle-left), 200 (middle-right), 250 (bottom-left), and 300 (bottom-right)

8.1.2 Malware-Malware

As with the benign samples, we compute the kernel values among malware for 50, 100, 150, 200, 250, and 300 malware samples each time and plot the resulting kernel values in Figure 10.

Figure 10: Malware-Malware Kernel Values for Dataset Size 50 (top-left), 100 (top-right), 150 (middle-left), 200 (middle-right), 250 (bottom-left), and 300 (bottom-right)

We notice that the similarity, in the case of the malware, centers around 20% for the smaller datasets, but as the dataset size grows the similarity tends to become smaller. We also notice a very large spike in the histogram around zero, which indicates that a large number of samples are very different from one another (with respect to Abstract API Call Graphs).

8.1.3 Benign-Malware

Following the same principle as before, but this time examining benign over malware Abstract API Call Graph kernel values, we plot the results in Figure 11. Note that this type of combination is not required for the SVM classification.

Figure 11: Benign-Malware Kernel Values for Dataset Size 50-50 (top-left), 100-100 (top-right), 150-150 (middle-left), 200-200 (middle-right), 250-250 (bottom-left), and 300-300 (bottom-right)

Nevertheless, it provides very significant information. Examining the benign over malware kernel value plots of Figure 11, we notice that the similarity centers around 10% to 15%, and it becomes possible to define an upper bound, with a reasonable degree of confidence, around the 35% mark.
This allows us to better understand the inner workings of a classification scheme, since with this cross-examination of benign-to-malware kernel values we observe similarity bounds that can work as detection mechanisms.

8.2 SVM Results

After building the SVM classifier for the different training-to-testing splits, 100 times over, as described in Section 6.7.1, we proceed with the graphical representation of the results. The best accuracy achieved by the SVM classifier is 98.25% with the dataset of 300 benign and 300 malware samples, meaning that the maximum error is 1.75%, which is in line with the relevant bibliography [1].

8.2.1 Accuracy

The accuracy of a classification scheme is defined as the number of correct predictions over the total number of predictions. In the case of a binary classifier, such as the one implemented in the present study, we have

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (14)

where TP is the number of True Positives, TN the number of True Negatives, FP the number of False Positives, and FN the number of False Negatives. Examining the SVM accuracy plot of Figure 12, we notice that, with the exception of the 50-50 dataset, the achieved performance lies generally above 95%. We also notice that, as the training set becomes a larger percentage of the dataset, the accuracy increases, as expected. The highest accuracy occurs for the 300-300 dataset (the largest one) and for the most favorable training-to-testing split (90% of the dataset used for training).

Figure 12: SVM Accuracy for the Different Dataset Sizes and Training to Testing Splits

8.2.2 ROC Curve-AUC

A Receiver Operating Characteristic (ROC) curve shows the performance of a classifier at all classification thresholds.
We introduce the notions of True Positive Rate (TPR) and False Positive Rate (FPR) as

TPR = TP / (TP + FN)    (15)

FPR = FP / (FP + TN)    (16)

The ROC curve plots the TPR against the FPR at different classification thresholds. The Area Under the ROC Curve (AUC) measures the area covered by the ROC curve, i.e. the accumulated performance across all classification thresholds. The ROC-AUC is therefore a measure of the quality of the classifier regardless of the classification threshold chosen. In Figure 13, we present the ROC-AUC of the SVM for the 6 aforementioned datasets (50, 100, 150, 200, 250, 300).

Figure 13: SVM ROC-AUC for the Different Dataset Sizes and Training to Testing Splits

In the plot of Figure 13, we notice that, apart from the smallest dataset (50-50), the performance only improves as the datasets increase in size and as the training-to-testing split becomes more favorable. We note that the ROC-AUC of the 50-50 dataset degrades more quickly than its accuracy in Figure 12, which shows the qualitative difference between the two measurements.

8.2.3 Precision and Recall

Precision, also known as positive predictive value, and recall, or sensitivity, are measurements usually examined together owing to the proximity of their definitions. The precision of a classifier is defined as the ratio of positive identifications that were actually correct, whereas recall is defined as the ratio of actual positives that were correctly identified. We express this as

Precision = TP / (TP + FP)    (17)

Recall = TP / (TP + FN)    (18)

In Figures 14 and 15, we present the precision and recall measurements of the SVM classifier, which show very high values of precision and recall as the training setup becomes more favorable (larger dataset and training split).
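Accuracy, precision, recall (identical to the TPR), and the FPR of (14)-(18) all derive from the same four confusion-matrix counts; the counts below are invented for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and FPR from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (14)
    precision = tp / (tp + fp)                   # Eq. (17)
    recall = tp / (tp + fn)                      # Eq. (18), equal to the TPR
    fpr = fp / (fp + tn)                         # Eq. (16)
    return accuracy, precision, recall, fpr

acc, prec, rec, fpr = classification_metrics(tp=95, tn=90, fp=10, fn=5)
print(acc, rec, fpr)  # 0.925 0.95 0.1
```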
It is important to note that a high precision relates to a low false positive rate and a high recall to a low false negative rate. Examining both plots, we conclude that the classifier returns accurate results and captures the large majority of positives.

Figure 14: SVM Precision for the Different Dataset Sizes and Training to Testing Splits

Figure 15: SVM Recall for the Different Dataset Sizes and Training to Testing Splits

9 Conclusion

The goal of the present study was to tackle the malware detection problem in an efficient and accurate way. The proposed solution utilized static analysis techniques to disassemble benign and malicious executable samples, study them, and extract important characteristics and features. Based on those, we developed ad hoc graph-based mechanisms to overcome computational limits and to build problem-specific solutions for the kernelization of non-linear data. The classification scheme employed as a malware detection configuration yielded very promising results. The importance of those results is only amplified when we consider that the dataset used is objectively small for such endeavors and that the computational time is much smaller compared to related studies. We remark that the results attained, in combination with the measurements collected, allow for solid hypotheses on the basis of the proposed solution, compounded with the extracted features, for future research.

List of Figures

1 Bird's eye view of the proposed solution pipeline . . . 8
2 Graph G1 (on the top left), Graph G2 (on the top right), Direct product graph G× of G1 and G2 (bottom) . . . 19
3 Benign-Malware API Calls - Executable Size . . . 28
4 Benign-Malware Sparsity Metric 1 - Number of Edges (top), Number of Nodes (bottom) . . . 29
5 Benign-Malware Sparsity Metric 2 - Number of Edges (top), Number of Nodes (bottom) . . . 30
6 Benign-Malware Sparsity Metrics - Executable Size . . . 31
7 Appearing only in Malware or Benign - API Call Frequency: all (top), frequency above 15% (bottom) . . . 33
8 Common API Calls Appearance Frequency - Cutoff Appearance Frequency 35% . . . 35
9 Benign-Benign Kernel Values for Dataset Size 50 (top-left), 100 (top-right), 150 (middle-left), 200 (middle-right), 250 (bottom-left), and 300 (bottom-right) . . . 38
10 Malware-Malware Kernel Values for Dataset Size 50 (top-left), 100 (top-right), 150 (middle-left), 200 (middle-right), 250 (bottom-left), and 300 (bottom-right) . . . 39
11 Benign-Malware Kernel Values for Dataset Size 50-50 (top-left), 100-100 (top-right), 150-150 (middle-left), 200-200 (middle-right), 250-250 (bottom-left), and 300-300 (bottom-right) . . . 41
12 SVM Accuracy for the Different Dataset Sizes and Training to Testing Splits . . . 43
13 SVM ROC-AUC for the Different Dataset Sizes and Training to Testing Splits . . . 44
14 SVM Precision for the Different Dataset Sizes and Training to Testing Splits . . . 46
15 SVM Recall for the Different Dataset Sizes and Training to Testing Splits . . . 46

References

[1] K.-H.-T. Dam and T.
Touili, "Malware detection based on graph classification," in Proceedings of the 3rd International Conference on Information Systems Security and Privacy, SCITEPRESS - Science and Technology Publications, 2017.

[2] K. Bissell, R. LaSalle, and P. Dal Cin, Ninth Annual Cost of Cybercrime Study, Ponemon Institute LLC, Accenture plc, 2019.

[3] Verizon, "2020 Data Breach Investigations Report." https://enterprise.verizon.com/resources/reports/dbir/, 2020. Online; accessed 27 September 2020.

[4] A. Souri and R. Hosseini, "A state-of-the-art survey of malware detection approaches using data mining techniques," Human-centric Computing and Information Sciences, vol. 8, Jan. 2018.

[5] H. E. Merabet and A. Hajraoui, "A survey of malware detection techniques based on machine learning," International Journal of Advanced Computer Science and Applications, vol. 10, no. 1, 2019.

[6] Y. Ding, S. Zhu, and X. Xia, "Android malware detection method based on function call graphs," in Neural Information Processing, pp. 70-77, Springer International Publishing, 2016.

[7] Y. Ye, S. Hou, L. Chen, J. Lei, W. Wan, J. Wang, Q. Xiong, and F. Shao, "Out-of-sample node representation learning for heterogeneous graph in real-time android malware detection," in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Aug. 2019.

[8] N. M. Kriege, F. D. Johansson, and C. Morris, "A survey on graph kernels," Applied Network Science, vol. 5, Jan. 2020.

[9] NSA Ghidra. https://www.nsa.gov/resources/everyone/ghidra/. Released on April 4th, 2019.

[10] I. Boneva, A. Rensink, M. Kurbán, and J. Bauer, "Graph abstraction and abstract graph transformation," Measurement Science Review, Jan. 2007.

[11] P. Moriano, J. Pendleton, S. Rich, and L.
Camp, "Stopping the insider at the gates: Protecting organizational assets through graph mining," Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications, vol. 9, pp. 4-29, Mar. 2018.

[12] Windows 10 ISO download. https://www.microsoft.com/el-gr/software-download/windows10ISO.

[13] Git downloads. https://git-scm.com/downloads.

[14] NSA Ghidra. https://www.nsa.gov/resources/everyone/ghidra/.

[15] Code::Blocks downloads. http://www.codeblocks.org/downloads/26.

[16] VirusShare. https://virusshare.com/.

[17] VirusTotal. https://www.virustotal.com/gui/.

[18] Windows API index. https://docs.microsoft.com/en-us/windows/win32/apiindex/api-index-portal.

[19] K. Avrachenkov, P. Chebotarev, and D. Rubanov, "Kernels on graphs as proximity measures," vol. 10519, pp. 27-41, Sep. 2017.

[20] J. Ah-Pine, "Normalized kernels as similarity indices," pp. 362-373, Jun. 2010.

[21] S. V. N. Vishwanathan, K. M. Borgwardt, I. R. Kondor, and N. N. Schraudolph, "Graph kernels," CoRR, vol. abs/0807.0093, 2008.

[22] W. Imrich and S. Klavžar, Product Graphs: Structure and Recognition. 2000.

[23] T. Gärtner, P. Flach, and S. Wrobel, "On graph kernels: Hardness results and efficient alternatives," pp. 129-143, 2003.

[24] B. Schölkopf, "The kernel trick for distances," vol. 13, pp. 301-307, 2000.

[25] T. Hofmann, B. Schölkopf, and A. Smola, "Kernel methods in machine learning," The Annals of Statistics, vol. 36, 2007.

[26] M. Sugiyama and K. Borgwardt, "Halting in random walk kernels," in NIPS, 2015.

[27] M. Fisher, M. Savva, and P. Hanrahan, "Characterizing structural relationships in scenes using graph kernels," ACM Trans. Graph., vol. 30, p. 34, Jul. 2011.

[28] L. Jia, B. Gaüzère, and P. Honeine, "Graph kernels based on linear patterns: Theoretical and experimental comparisons," 2019.

[29] G. Simões, H. Galhardas, and D. Martins de Matos, "A labeled graph kernel for relationship extraction," Feb. 2013.

[30] M. Sugiyama, M. E. Ghisu, F. Llinares-López, and K.
Borgwardt, "graphkernels: R and Python packages for graph comparison," Bioinformatics, vol. 34, no. 3, pp. 530-532, 2017.

[31] G. Siglidis, G. Nikolentzos, S. Limnios, C. Giatsidis, K. Skianis, and M. Vazirgiannis, "GraKeL: A graph kernel library in Python," Journal of Machine Learning Research, vol. 21, no. 54, pp. 1-5, 2020.

[32] A. Calleja, J. Tapiador, and J. Caballero, "The MalSource dataset: Quantifying complexity and code reuse in malware development," IEEE Transactions on Information Forensics and Security, vol. 14, pp. 3175-3190, 2019.

[33] M. Protsenko and T. Müller, "Android malware detection based on software complexity metrics," Sep. 2014.

Polytechnic School
Department of Electrical and Computer Engineering

Diploma Thesis
Malware Classification Methodologies

Vasileios Tsouvalas
Supervising Professor: Dimitrios Serpanos
Patras, October 2020

Contents

1 Introduction . . . 2
1.1 The Malware Problem . . . 2
1.2 Current Solutions . . . 2
2 Proposed Solution . . . 3
3 Tools and Samples . . . 4
3.1 Tools . . . 4
3.2 Samples . . . 4
4 Implementation . . . 5
4.1 Code Analysis . . . 5
4.2 API Call Graphs . . . 6
4.3 Classification . . . 7
4.3.1 Graph Comparison - Random Walk Graph Kernel (RWGK) . . . 7
4.4 Support Vector Machines . . . 9
5 Measurements - Experiments . . . 10
6 Results . . . 11
6.1 Kernel . . . 11
6.2 Support Vector Machine . . .
7 Conclusions
Bibliography

1 Introduction

1.1 The Malware Problem

It is estimated that, by 2023, the assets at risk due to global cybercrime may reach USD 5.2 trillion [1]. Given that 28% of all cyberattacks involve breaches carried out with malware [2], it becomes evident that malware detection is one of the most important chapters of cybersecurity worldwide.

1.2 Current Solutions

Malware detection is performed through two basic types of analysis: static and dynamic. In static analysis, information is collected by analyzing the code without executing the program, whereas in dynamic analysis, the program is executed in a safe environment and information is collected during that execution, allowing further elements of the executable's interaction with the computer to be observed. Over the last decade, machine learning and data mining techniques have been employed increasingly in malware detection efforts [3].

2 Proposed Solution

In the present work, following modern attempts at solving the problem [4], we apply a solution based on static analysis and graph modeling [5][6], which, combined with well-known classification techniques, enables malware detection. We start with a set of malicious and benign MS Windows executables. For each program, we perform its disassembly and extract information that allows us to model the calls the executable sample makes to the Windows API as a Control Flow Graph.
Based on the Control Flow Graph and other information extracted through static analysis, we obtain the API Call Graph. Subsequently, using basic principles of Graph Theory, we arrive at an abstract form of the API Call Graph, the Abstract API Call Graph. The process proposed as a solution is based on the comparison of these graphs. The vectorization of the comparison of Abstract API Call Graphs is achieved through a specific graph kernel, whose results are fed for classification into a Support Vector Machine (SVM), where the executables are classified as benign or malicious. We note that, in the present work, we treat malware detection as a binary classification problem, so the goal is to categorize a program as malicious or benign.

3 Tools and Samples

3.1 Tools

We present the toolkit assembled for the proposed solution and, in particular, the tools used in its three basic parts:

• Analysis of executable samples: NSA Ghidra [7] for analysis and disassembly; Java code for resource extraction
• API Call Graphs: Python code for graph modeling and for applying basic principles of graph theory
• Classification: SVMs implemented in Python

We note that a server (8-core CPU, 16 GB RAM) was used for a large part of the process, due to the need for greater computational power.

3.2 Samples

Benign and malicious executable samples for Windows were collected. Specifically, 997 malicious samples were gathered from VirusShare [8], along with 567 benign samples consisting of installation files of valid and trusted software such as the Windows operating system [9], Git [10], Cygwin [11], Code::Blocks [12], etc.
4 Implementation

Below we describe the basic parts of the proposed solution in more detail.

4.1 Code Analysis

During the code analysis of each executable, using Ghidra, we obtain the disassembled code of the specific executable sample. The basic elements resulting from the analysis of each sample are:

• Codeblocks: control points of the program, whose interconnection gives all the possible execution paths of the program
• API calls: calls to functions of the Windows API [13] made by the executable sample

From the Codeblocks and the information about their interconnectivity, we obtain the Control Flow Graph of the executable. We note that each Codeblock may lead (in the sense of an execution path) to another Codeblock or to an API call. The Control Flow Graph is defined as the directed graph G = (N, E), where N is a finite set of nodes and E a finite set of edges. Under this definition, an edge (n1, n2) means that Codeblock n1 leads to Codeblock n2.

4.2 API Call Graphs

Given the Control Flow Graph of the executable sample, and knowing both the API calls the executable makes and the specific Codeblock from which each of them is called, we are able to model the API Call Graph. The logic followed consists in replacing the nodes of the Control Flow Graph with nodes representing the API calls performed by the Codeblock of the corresponding node. It is easy to see that the size of the API Call Graph (number of nodes and edges) is much larger than that of the Control Flow Graph.
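The node-replacement step described above can be illustrated with a minimal sketch. All function names and the toy Codeblock/API data below are hypothetical placeholders, not the thesis implementation: nodes are tagged by (Codeblock, API call) pairs, which is why the resulting graph grows larger than the CFG.

```python
# Illustrative sketch: expand each CFG node into the API calls its
# Codeblock makes; CFG edges become edges between API-call nodes.

def build_api_call_graph(cfg_edges, calls_per_block):
    """cfg_edges: iterable of (block_src, block_dst) pairs.
    calls_per_block: dict mapping a Codeblock id to the list of API
    calls it performs. Returns edges between (block, api_call) nodes."""
    api_edges = set()
    for src, dst in cfg_edges:
        for call_src in calls_per_block.get(src, []):
            for call_dst in calls_per_block.get(dst, []):
                api_edges.add(((src, call_src), (dst, call_dst)))
    return api_edges

# Toy example: one Codeblock branching to two others.
cfg = [("B0", "B1"), ("B0", "B2")]
calls = {"B0": ["CreateFileA"], "B1": ["WriteFile"], "B2": ["CloseHandle"]}
edges = build_api_call_graph(cfg, calls)
# B0's call now fans out to the API calls reachable in B1 and in B2.
```

Tagging each node with its Codeblock keeps repeated calls to the same API function distinct, mirroring how the API Call Graph blows up in size relative to the CFG.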
The API Call Graph is a directed graph G_APICall = (F_API, E_API), where F_API is a finite set of nodes representing the API calls the program can reach starting from the initial Codeblock, and E_API is a finite set of edges representing the API calls that can be reached through the destination Codeblock. The size of the API Call Graph is very large and does not permit the comparison of two such graphs. For this reason, using basic principles of graph theory (we treat the API Call Graph as many layers of bipartite graphs), we transform the API Call Graph into an Abstract API Call Graph. The Abstract API Call Graph has as many nodes as there are API calls made by the executable (one node for each unique call), and its edges express the existence of a connection between those calls in the API Call Graph. With this conversion, directionality is lost, but all connections between API calls are preserved and the graph now has a manageable size.

4.3 Classification

The classification process is divided into two parts. The first describes the transition from the comparison of two Abstract API Call Graphs to a linear quantity, the kernel, which is fed into the second and final part of the classification, concerning Support Vector Machines (SVM).

4.3.1 Graph Comparison - Random Walk Graph Kernel (RDWGK)

Having extracted the Abstract API Call Graph for each of the executable samples, we proceed to the comparison of the graphs. The Random Walk algorithm is a widely used algorithm that returns random paths of a graph; its logic is based on starting from a node of the graph and randomly choosing neighboring nodes to visit.
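The abstraction step of Section 4.2 can be sketched as follows; names and toy data are illustrative assumptions, not the thesis code. The sketch collapses (Codeblock, API call) nodes into one node per unique API call and discards edge direction, as described above:

```python
# Illustrative sketch: collapse an API Call Graph into an Abstract API
# Call Graph with one node per unique API call and undirected edges.

def abstract_graph(api_edges):
    """api_edges: iterable of ((block, call), (block, call)) directed edges.
    Returns (nodes, undirected_edges) over unique API-call names."""
    nodes, edges = set(), set()
    for (_, call_src), (_, call_dst) in api_edges:
        nodes.update((call_src, call_dst))
        if call_src != call_dst:
            # frozenset discards direction: {a, b} == {b, a}
            edges.add(frozenset((call_src, call_dst)))
    return nodes, edges

# Two directed edges between the same pair of API calls, from different
# Codeblocks, collapse into a single undirected connection.
api_edges = [(("B0", "CreateFileA"), ("B1", "WriteFile")),
             (("B2", "WriteFile"), ("B3", "CreateFileA"))]
nodes, edges = abstract_graph(api_edges)
```

Direction is lost, exactly as noted in the text, but every connection between API calls survives and the node count drops to the number of unique calls.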
A graph kernel is a function that computes the inner product of two graphs, thereby providing a measurement of their similarity [14][15]. The usefulness of a graph kernel lies in the fact that it allows a non-linear problem [16][17], such as graph comparison, to be linearized, and subsequently allows classification algorithms to operate on the provided result. In the present work, a specific graph kernel is used, the Random Walk Graph Kernel (RDWGK) [18], in order to compare all Abstract API Call Graphs pairwise and then feed the kernel values into the SVM for classification.

For two graphs G1 = (N1, E1) and G2 = (N2, E2), the Product Graph G× = (N×, E×) is defined as a graph over the pairs of nodes of G1 and G2, where two nodes of G× are connected if and only if the corresponding nodes of G1 and G2 are adjacent. The reason the Product Graph is used is that the RDWGK between two graphs equals the RDWGK of their Product Graph [19]. We note that if A1 and A2 are the adjacency matrices of G1 and G2 respectively, then the adjacency matrix of the Product Graph G× is the Kronecker product A× = A1 ⊗ A2. The RDWGK of the Product Graph is defined as:

κ(G1, G2) = Σ_{k=0}^{T} μ(k) · q×ᵀ A×ᵏ p×

where:

– A× is the adjacency matrix of the Product Graph and, by extension, A×ᵏ represents the probability of simultaneous random walks of length k in G1 and G2.
– p× and q× represent the initial probability distribution (starting node) and the probability of the walk stopping at the specific node of the Product Graph, respectively. These probabilities are not taken into account in the present work, since they are unknown [14].
– T is the maximum length of the random walks.
– μ(k) = λᵏ ∈ [0, 1] is a factor that controls the importance of the random-walk length, through which convergence of the sum is always achieved and the RDWGK is well defined.

Given certain special characteristics of the Abstract API Call Graphs (each node is unique and represents an API call; each API call has a unique name), the required computational cost and the execution time of the operations for computing the RDWGK values are significantly reduced.

4.4 Support Vector Machines

Once the RDWGK values have been computed for all possible pairs of graphs, we use them in the classification procedure of the Support Vector Machines (SVM). Since the SVM is a kernel-based learning algorithm, having already linearized the non-linear problem of graph comparison allows the use of a linear classifier. The goal of the SVM is to classify the linearized information in a way that allows high accuracy when classifying new data.

We train and test the SVM classifier using 9 different training-testing splits of the dataset, in the sense that, starting from a specific number of kernel values (an equal number of kernel values from benign and malicious executables), we randomly split those values 10%-90% for training-testing and, in 10% steps, reach a 90%-10% training-testing split of the kernel values.
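The kernel of Section 4.3.1 can be sketched under simplifying assumptions: the sum is truncated at walk length T, μ(k) = λᵏ, and p× and q× are taken uniform (the thesis notes these distributions are unknown and disregards them, so the uniform choice here is an illustrative assumption, as are all names and the toy graphs).

```python
import numpy as np

# Hedged sketch of the Random Walk Graph Kernel via the Kronecker
# product of the two adjacency matrices (the Product Graph).

def rdwgk(A1, A2, T=10, lam=0.1):
    Ax = np.kron(A1, A2)        # adjacency matrix of the Product Graph
    n = Ax.shape[0]
    p = np.full(n, 1.0 / n)     # uniform starting distribution p_x
    q = np.full(n, 1.0 / n)     # uniform stopping distribution q_x
    value, Ak = 0.0, np.eye(n)
    for k in range(T + 1):
        value += (lam ** k) * (q @ Ak @ p)  # mu(k) * q^T A^k p term
        Ak = Ak @ Ax                        # advance to walks of length k+1
    return value

# Toy "Abstract API Call Graphs": a 3-node path versus a triangle.
A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
A_tri  = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)

# Normalized similarity in [0, 1]: identical graphs score exactly 1.
sim = rdwgk(A_path, A_tri) / np.sqrt(rdwgk(A_path, A_path) * rdwgk(A_tri, A_tri))
```

The pairwise matrix of such values is what Section 4.4 feeds to the SVM; with scikit-learn this would correspond to an `SVC(kernel="precomputed")` setup, though that pairing is an assumption here rather than the thesis's documented configuration.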
The entire procedure above is repeated 100 times, and as SVM results we take the mean values of accuracy, precision, recall and ROC-AUC. Each classification scheme is run on kernel values containing an equal number originating from benign and from malicious executable samples, and the experiment is performed for datasets of 50, 100, 150, 200, 250 and 300 benign and malicious samples.

5 Measurements - Experiments

From the information extracted through static analysis, we present the following measurements:

• Number of API calls in relation to sample size
• Sparsity measurements of the Abstract API Call Graphs in relation to the number of nodes, the number of edges and the sample size
• Frequency of occurrence of API calls appearing exclusively in benign or in malicious samples, as well as the frequency of occurrence of the common API calls

Based on the above experimental measurements, we draw conclusions from the deviations exhibited between the benign and the malicious samples. Specifically:

– It is observed that a significant number of malicious samples have the same number of API calls but a large deviation in size, in contrast to the benign samples, which exhibit the expected relationship in this measurement (sample size grows with the number of API calls).
– Large deviations are observed in the sparsity measurements (given by the ratios nodes-to-edges and edges-to-nodes) in relation to the number of edges and the number of nodes, as the malicious samples appear less interconnected than the benign ones with respect to their Abstract API Call Graphs.
– Specific API calls are identified that either appear frequently only in the benign or only in the malicious samples, or exhibit a large difference in their frequency of occurrence between benign and malicious samples.

The above observations can serve as preliminary classification criteria, and they can also be used to improve the performance of a classifier.

6 Results

6.1 Kernel

As mentioned above, the kernel value is a measurement of the similarity of two graphs. Figure 1 presents the kernel values for the different dataset sizes; note that these values are normalized to the unit interval (a zero value means zero similarity, and a value equal to 1 means absolute similarity, i.e. identical graphs). It is observed that the kernel values concentrate around low similarity percentages, which indicates that the Abstract API Call Graph of the samples is a valid classification criterion.

Figure 1: Benign-Malicious kernel values for 50 (top left), 100 (top right), 150 (middle left), 200 (middle right), 250 (bottom left) and 300 (bottom right) samples

6.2 Support Vector Machine

The best results of the SVM classifier are observed for the most favorable classification case, which occurs for the largest possible dataset, 300 benign and 300 malicious samples, and for the 90%-10% split of the samples between training and testing respectively. An accuracy of 98.25% is achieved (Figure 2), which means that the maximum margin of error is 1.75%.
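The metrics reported here (accuracy, precision, recall) can be made concrete with a small pure-Python sketch; the toy predictions and the helper name are hypothetical, with "malicious" treated as the positive class:

```python
# Toy sketch of the evaluation metrics averaged over the SVM runs.

def evaluate(y_true, y_pred, positive="malicious"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp)   # of flagged samples, how many are malware
    recall = tp / (tp + fn)      # of actual malware, how many were caught
    return accuracy, precision, recall

y_true = ["malicious", "malicious", "benign", "benign"]
y_pred = ["malicious", "benign", "benign", "benign"]
acc, prec, rec = evaluate(y_true, y_pred)
# One missed malware sample: acc = 0.75, prec = 1.0, rec = 0.5
```

In a malware-detection setting, recall is the cost of missed malware and precision the cost of false alarms, which is why the thesis reports all of them alongside accuracy and ROC-AUC.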
Figure 2: SVM accuracy for different dataset sizes and different training-testing splits

We note that the performance measurements of the SVM classifier with respect to Precision, Recall and ROC AUC exhibit similar characteristics: as the classification data become more favorable, the performance of the classifier improves at all levels.

7 Conclusions

The goal of the present work was to propose a solution to the malware detection problem. The proposed solution uses static analysis techniques to disassemble benign and malicious samples so that their study becomes possible. Based on that study, ad hoc graph modeling mechanisms were developed to overcome obstacles such as the requirement for great computational power and the linearization of non-linear elements, achieving more efficient handling of the information. The classification scheme used yielded very promising results, and the further experimental measurements carried out allow well-founded hypotheses to be formulated for future research.

Bibliography

[1] K. Bissell, R. LaSalle, and P. Dal Cin, Ninth Annual Cost of Cybercrime Study, Ponemon Institute LLC and Accenture plc, 2019.

[2] Verizon, “2020 Data Breach Investigations Report,” https://enterprise.verizon.com/resources/reports/dbir/, 2020. Online; accessed 27-September-2020.

[3] A. Souri and R. Hosseini, “A state-of-the-art survey of malware detection approaches using data mining techniques,” Human-centric Computing and Information Sciences, vol. 8, Jan. 2018.

[4] H. E. Merabet and A. Hajraoui, “A survey of malware detection techniques based on machine learning,” International Journal of Advanced Computer Science and Applications, vol. 10, no. 1, 2019.

[5] Y. Ding, S. Zhu, and X.
Xia, “Android malware detection method based on function call graphs,” in Neural Information Processing, pp. 70–77, Springer International Publishing, 2016.

[6] Y. Ye, S. Hou, L. Chen, J. Lei, W. Wan, J. Wang, Q. Xiong, and F. Shao, “Out-of-sample node representation learning for heterogeneous graph in real-time android malware detection,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Aug. 2019.

[7] https://www.nsa.gov/resources/everyone/ghidra/, released on April 4th, 2019.

[8] https://virusshare.com/.

[9] https://www.microsoft.com/el-gr/software-download/windows10ISO.

[10] https://git-scm.com/downloads.

[11] https://www.cygwin.com/.

[12] http://www.codeblocks.org/downloads/26.

[13] https://docs.microsoft.com/en-us/windows/win32/apiindex/api-index-portal.

[14] K. Avrachenkov, P. Chebotarev, and D. Rubanov, “Kernels on graphs as proximity measures,” vol. 10519, pp. 27–41, 09 2017.

[15] J. Ah-Pine, “Normalized kernels as similarity indices,” pp. 362–373, 06 2010.

[16] B. Schölkopf, “The kernel trick for distances,” vol. 13, pp. 301–307, 01 2000.

[17] T. Hofmann, B. Schölkopf, and A. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36, 01 2007.

[18] S. V. N. Vishwanathan, K. M. Borgwardt, I. R. Kondor, and N. N. Schraudolph, “Graph kernels,” CoRR, vol. abs/0807.0093, 2008.

[19] W. Imrich and S. Klavzar, Product Graphs, Structure and Recognition. 01 2000.