
Deep Android Malware Detection

2017

McLaughlin, N., Martinez del Rincon, J., Kang, B., Yerima, S., Miller, P., Sezer, S., Safaei, Y., Trickel, E., Zhao, Z., Doupe, A., and Joon Ahn, G. (2017). Deep Android Malware Detection. In Proceedings of the ACM Conference on Data and Applications Security and Privacy (CODASPY) 2017. Association for Computing Machinery (ACM). https://doi.org/10.1145/3029806.3029823 (peer reviewed version).

Niall McLaughlin, Jesus Martinez del Rincon, BooJoong Kang, Suleiman Yerima, Paul Miller, Sakir Sezer
Centre for Secure Information Technologies (CSIT), Queen's University Belfast, UK

Yeganeh Safaei, Erik Trickel, Ziming Zhao, Adam Doupe, Gail Joon Ahn
Center for Cybersecurity and Digital Forensics, Arizona State University, USA

ABSTRACT

In this paper, we propose a novel Android malware detection system that uses a deep convolutional neural network (CNN). Malware classification is performed based on static analysis of the raw opcode sequence from a disassembled program. Features indicative of malware are automatically learned by the network from the raw opcode sequence, thus removing the need for hand-engineered malware features. The training pipeline of our proposed system is much simpler than that of existing n-gram based malware detection methods, as the network is trained end-to-end to jointly learn appropriate features and to perform classification, thus removing the need to explicitly enumerate millions of n-grams during training. The network design also allows the use of long n-gram-like features, which are not computationally feasible with existing methods. Once trained, the network can be efficiently executed on a GPU, allowing a very large number of files to be scanned quickly.

CCS Concepts

• Security and privacy → Malware and its mitigation; Software and application security; • Computing methodologies → Neural networks

Keywords

Malware Detection, Android, Deep Learning

1. INTRODUCTION

Malware detection is a growing problem, especially in mobile platforms. Given the proliferation of mobile devices and their associated app-stores, the volume of new applications is too large to manually examine each application for malicious behavior.
Malware detection has traditionally been based on manually examining the behavior and/or de-compiled code of known malware programs in order to design malware signatures by hand. This process does not easily scale to large numbers of applications, especially given the static nature of signature-based malware detection, meaning that new malware can be designed to evade existing signatures. Consequently, there has recently been a large volume of work on automatic malware detection using ideas from machine learning. Various methods have been proposed based on examining the dynamic application behavior [18, 21], requested permissions [14, 16, 19] and the n-grams present in the application byte-code [7, 11, 10]. However, many of these methods rely on expert analysis to design the discriminative features that are passed to the machine learning system used to make the final classification decision.

Recently, convolutional networks have been shown to perform well on a variety of tasks related to natural language processing [12, 26]. In this work we investigate the application of convolutional networks to malware detection by treating the disassembled byte-code of an application as a text to be analyzed. This approach has the advantage that features are automatically learned from raw data, and hence removes the need for malware signatures to be designed by hand. Our proposed malware detection method is computationally efficient, as training and testing time is linearly proportional to the number of malware examples. The detection network can be run on a GPU, which is now a standard component of many mobile devices, meaning a large number of malware files can be scanned per second. In addition, we expect that the accuracy of malware detection will improve as more training data is provided, because neural networks have been shown to have a very high learning capacity and hence can benefit from very large training-sets [20].

Our proposed malware detection method takes inspiration from existing n-gram based methods [7, 11, 10], but unlike existing methods there is no need to exhaustively enumerate a large number of n-grams during training. This is because the convolutional network can intrinsically learn to detect n-gram-like signatures by learning to detect sequences of opcodes that are indicative of malware. In addition, our proposed method allows very long n-gram type signatures to be discovered, which would be impractical if explicit enumeration of all n-grams were required. The malware signatures found by the proposed method may be complementary to those discovered by hand, as the automated system will have different strengths and biases from human analysts; they could therefore be valuable for use in conjunction with conventional malware signature databases.
Once our system has been trained, large numbers of files can be efficiently scanned using a GPU implementation. Given that new malware is constantly appearing, a useful feature of our proposed method is that it can be re-trained with new malware samples to adapt to the changing malware environment.

2. RELATED WORK

2.1 Malware Detection

Learning-based approaches using hand-designed features have been applied extensively to both dynamic [18, 21] and static [23, 22, 25] malware detection. A variety of similar approaches to static malware detection have used manually derived features, such as API calls, intents, permissions and commands, with different classifiers such as support vector machines (SVM) [5], Naive Bayes, and k-Nearest Neighbor [19]. Malware detection approaches have also been proposed that use static features derived exclusively from the permissions requested by the application [14, 16].

In contrast with approaches using high-level hand-designed features, n-gram based malware detection uses sequences of low-level opcodes as features. The n-gram features can be used to train a classifier to distinguish between malware and benign software [10]. Perhaps surprisingly, even a 1-gram based feature, which is simply a histogram of the number of times each opcode is used, can distinguish malware from benign software [7]. The length of the n-gram used [10] and the number of n-gram sequences used in classification [7] can both have an effect on the accuracy of the classifier. However, increasing either parameter can massively increase the computational resources needed [7], which is clearly a disadvantage of standard n-gram based malware detection approaches. N-gram methods also require feature selection to reduce the length of the feature vector, which would otherwise be millions of elements long in the case of long n-grams. In this work we propose a method, based on neural networks, that allows very long n-gram features to be used and allows such a classifier to be trained in a much more efficient manner.

2.2 Neural Networks

Recently, convolutional neural networks (CNNs) have shown state-of-the-art performance for object recognition in images [20] and natural language processing (NLP) [12]. In NLP, local patterns of symbols, known as n-grams, have been used as features for a variety of tasks [27]. It has recently been shown that if sufficient training data is available, very deep CNNs can outperform traditional NLP methods [26] across a range of text classification tasks. We postulate that static malware analysis has much in common with NLP, as the analysis of the disassembled source code of a given program can be understood as a form of textual processing. Therefore, techniques such as CNNs have huge potential to be applied in the field of malware detection.

A variety of approaches to malware detection using other neural network architectures have been proposed. Several of the proposed methods are based on learning which sequences of operating system calls or API calls are indicative of malware [15, 9, 8] during dynamic analysis. The existing neural network based approaches to malware detection differ from our proposed method as they make use of a virtual machine to capture dynamic behavioural features [15, 9, 8]. This may prove problematic given that malware is often designed to detect when it is being run in a virtual environment in order to evade detection.
Other existing neural network based malware detection methods use hand-designed features, which may not be the optimal way to detect malware [17]. We attempt to address the limitations of existing neural network based malware detection methods by using a novel static analysis method based on a CNN architecture that automatically learns an appropriate feature representation from raw data.

In this work we apply convolutional neural networks to the problem of malware detection. The CNN learns to detect patterns in the disassembled byte-code of applications that are indicative of malware. Our approach has several advantages over existing methods of malware detection, such as those based on high-level hand-designed features and those based on detection of n-grams. Scalability and performance are major drawbacks of existing n-gram based approaches, as the length of the feature vector grows rapidly when increasing the n-gram length. In contrast, our approach eliminates the need for counting and storing millions of n-grams during training and can learn longer n-grams than conventional methods used for malware detection. The improved efficiency makes it possible to use our proposed method with larger datasets, where the use of traditional methods would be intractable. Our whole system is jointly optimized to perform feature extraction and classification simultaneously by showing the system a large number of labeled samples. This removes the need for hand-designed features, as features are automatically learned during supervised network training, and removes the need for an ad-hoc pipeline consisting of feature extraction, feature selection and classification, as feature extraction and classification are optimized together. The existence of a fully end-to-end system also saves time when the system is presented with new malware to be recognized, as the network can easily be updated by simply increasing the size of the training-set, which may also improve its overall accuracy. Finally, the features discovered by our method may be different from, and complementary to, those discovered by manual analysis.

3. METHOD

In this work we propose a malware detection method that uses a convolutional network to process the raw Dalvik bytecode of an Android application. The overall structure of the malware detection network is shown in Fig. 2. In the following section we first explain how an Android application is disassembled to give a sequence of raw Dalvik byte-codes, and then explain how this byte-code sequence is processed by the convolutional network.

3.1 Disassembly of Android Application

In our system, the preprocessing of an application consists of disassembling the application and extracting opcode sequences for static malware analysis, as shown in Fig. 1. An Android application is an apk file, which is a compressed file containing the code files, the AndroidManifest.xml file, and the application resource files. A code file is a dex file that can be transformed into smali files, where each smali file represents a single class and contains the methods of that class. Each method contains instructions, and each instruction consists of a single opcode and multiple operands. We disassemble each application using baksmali [1] to obtain the smali files that contain the human-readable Dalvik byte-code of the application, then extract the opcode sequence from each method, discarding the operands. As the result of the preprocessing we obtain the opcode sequences of all the classes of the application. The opcode sequences from all classes are then concatenated to give a single sequence of opcodes representing the whole application.

Figure 1: Work-flow of how an Android application is disassembled to produce an opcode sequence.
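As a concrete illustration, the following Python sketch shows one possible way to extract an opcode sequence from the smali files produced by baksmali. The directory layout, the regular expression used to recognise opcode mnemonics, and the function name are illustrative assumptions for this sketch rather than the authors' actual preprocessing code.

```python
import re
from pathlib import Path

# Lines inside a smali method body start with an opcode mnemonic, e.g.
# "invoke-virtual {v0}, Ljava/lang/Object;->toString()Ljava/lang/String;"
# We keep only the mnemonic and discard operands, labels and directives.
OPCODE_RE = re.compile(r"^\s*([a-z][a-z0-9/-]*)\b")


def opcode_sequence(smali_dir):
    """Concatenate the opcode sequences of every method in every class."""
    opcodes = []
    for smali_file in sorted(Path(smali_dir).rglob("*.smali")):
        for line in smali_file.read_text(errors="ignore").splitlines():
            line = line.strip()
            if not line or line.startswith((".", ":", "#")):
                continue  # skip directives, labels and comments
            match = OPCODE_RE.match(line)
            if match:
                opcodes.append(match.group(1))
    return opcodes


# Example: sequence = opcode_sequence("app_disassembled/smali")
```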
3.2 Network Architecture

3.2.1 Opcode Embedding Layer

Let X = {x_1, ..., x_n} be a sequence of opcode instructions encoded as one-hot vectors, where x_n is the one-hot vector for the n'th opcode in the sequence. To form a one-hot vector we associate each opcode with a number in the range 1 to D. In the case of Dalvik, where there are currently 218 defined opcodes, D = 218 [2]. The one-hot vector x_n is a vector of zeros, of length D, with a '1' in the position corresponding to the n'th opcode's integer mapping. Any operands associated with the opcodes were discarded during disassembly and preprocessing, meaning malware classification is based only on patterns in the sequence of opcodes. Opcodes in X are projected into an embedding space by multiplying each one-hot vector by a weight matrix, W_E ∈ R^{D×k}, where k is the dimensionality of the embedding space, as follows:

$$p_i = x_i W_E \qquad (1)$$

After projection of all opcodes in X, the program is represented by a matrix, P, of size n × k, where each row, p_i, corresponds to the representation of opcode x_i. The weights in W_E, and hence the representation for each opcode, are initialized randomly and then updated by back-propagation during training along with the rest of the network's parameters.

The purpose of representing the program as a list of one-hot vectors and then projecting into an embedding space is that it allows the network to learn an appropriate representation for each opcode as a vector in a k-dimensional continuous vector space, R^k, where relationships between opcodes can be represented. The embedding space may encode semantic information; for example, during training the network may discover that certain opcodes have similar meanings or perform equivalent operations, and hence should be treated similarly by deeper network layers for classification purposes. This can be achieved by projecting those opcodes to nearby points in the embedding space, while very different opcodes will be projected to distant points. The number of dimensions used in the embedding space may influence the network's ability to perform such semantic mapping, hence using more dimensions may, up to a point, give the network greater flexibility in learning the expected highly non-linear mapping from sequences of opcodes to classification decisions.
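For readers who prefer code, the projection in Eq. 1 is what deep learning frameworks call an embedding lookup. The following PyTorch-style sketch is an illustrative assumption of how the opcode embedding layer could be written; it follows the dimensions described above but is not the authors' original Torch implementation.

```python
import torch
import torch.nn as nn

D = 218  # number of defined Dalvik opcodes
k = 8    # embedding dimensionality (the value used later in Section 4)

# nn.Embedding stores W_E (D x k) and returns the row x_i W_E for each opcode
# index, which is equivalent to multiplying a one-hot vector by W_E (Eq. 1).
embedding = nn.Embedding(num_embeddings=D, embedding_dim=k)

# A program is a sequence of n opcode indices in the range [0, D-1].
opcode_ids = torch.tensor([[12, 54, 54, 7, 180]])  # shape: (batch=1, n=5)
P = embedding(opcode_ids)                           # shape: (1, n, k)
```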
3.2.2 Convolutional Layers

In our proposed network we use one or more convolutional layers, numbered from 1 to L, where l refers to the l'th convolutional layer. The first convolutional layer receives the n × k program embedding matrix P as input, while deeper convolutional layers receive the output of the previous convolutional layer as input. Each convolutional layer has m_l filters, which are of size s_1 × k in the first layer, and of size s_l × m_{l-1} in deeper layers. This means filters in the first layer can potentially detect sequences of up to s_1 opcodes. During the forward pass of an example through a convolutional layer, each of the m_l convolutional filters produces an activation map a_{l,m} of size n × 1, and these maps are stacked together to produce a matrix, A_l, of size n × m_l.

Note that before applying the convolutional filters we zero-pad the start and end of the input by s_l / 2 to ensure that the length of the output matrix from the convolutional layer is the same as the length of its input. The convolution of the first-layer filters with the program embedding matrix P can be denoted as follows:

$$a_{l,m} = \mathrm{relu}(\mathrm{Conv}(P; W_{l,m}, b_{l,m})) \qquad (2)$$

$$A_l = [\, a_{l,1} \mid a_{l,2} \mid \ldots \mid a_{l,m_l} \,] \qquad (3)$$

where W_{l,m} and b_{l,m} are the respective weight and bias parameters of the m'th convolutional filter of convolutional layer l, where Conv represents the mathematical operation of convolution of the filter with the input, and where the rectified linear activation function, relu(x) = max{0, x}, is used. In deeper layers the convolution operation is similar, except that the input matrix P in Eq. 2 is replaced by the output matrix of the previous convolutional layer, A_{l-1}. Given the output matrix A_L from the final convolutional layer, max-pooling [27] is then used over the program-length dimension as follows:

$$f = [\, \max(a_{L,1}) \mid \max(a_{L,2}) \mid \ldots \mid \max(a_{L,m_L}) \,] \qquad (4)$$

to give a vector f of length m_L, which contains the maximum activation of each convolutional filter over the program length. Using max-pooling over the length of the opcode sequence allows a program of arbitrary length to be represented by a fixed-length feature vector. Moreover, selecting the maximum activation of each convolutional filter using max-pooling also focuses the attention of the classification layer on the parts of the opcode sequence that are most relevant to the classification task.

Figure 2: Malware Detection Network Architecture.

3.2.3 Classification Layers

Finally, the resulting vector f is passed to a multi-layer perceptron (MLP), which consists of a fully-connected hidden layer and a fully-connected output layer. The purpose of the MLP is to output the probability that the current example is malware. The use of an MLP with a hidden layer allows high-order relationships between the features extracted by the convolutional layer to be detected [6] and used for classification. We can write the hidden layer as follows:

$$z = \mathrm{relu}(W_h f + b_h) \qquad (5)$$

where W_h and b_h are the parameters of the fully-connected hidden layer, and where the rectified linear activation function is used. Finally, the output, z, from the MLP is passed to a soft-max classifier function, which gives the probability that program X is malware, denoted as follows:

$$p(y = i \mid z) = \frac{\exp(w_i^{T} z + b_i)}{\sum_{i'=1}^{I} \exp(w_{i'}^{T} z + b_{i'})} \qquad (6)$$

where w_i and b_i denote the parameters of the classifier for class i ∈ I, and the label y indicates whether the current sample is malware or benign. The softmax classifier outputs the normalized probability of the current sample belonging to each class. As malware classification is a two-class problem (benign/malware), I = 2 and z is a two-element vector. Other applications, such as the problem of malware family classification, could be targeted by increasing the number of classes, I, to be equal to the number of malware families to be classified.
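To make the architecture concrete, below is a PyTorch-style sketch of a single-convolutional-layer variant of the network described above, using the configuration evaluated in Section 4 (an 8-dimensional embedding, 64 filters of length 8, and a 16-unit hidden layer). It is an illustrative reconstruction under those assumptions, not the authors' original Torch code.

```python
import torch
import torch.nn as nn


class MalwareConvNet(nn.Module):
    """Embedding -> 1-D convolution -> max-over-length pooling -> MLP -> softmax."""

    def __init__(self, vocab_size=218, embed_dim=8, num_filters=64,
                 filter_len=8, hidden_dim=16, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)           # W_E
        self.conv = nn.Conv1d(embed_dim, num_filters, filter_len,
                              padding=filter_len // 2)                 # W_{1,m}, b_{1,m}
        self.hidden = nn.Linear(num_filters, hidden_dim)               # W_h, b_h
        self.out = nn.Linear(hidden_dim, num_classes)                  # w_i, b_i

    def forward(self, opcode_ids):                      # (batch, n) opcode indices
        P = self.embedding(opcode_ids)                  # (batch, n, k)
        A = torch.relu(self.conv(P.transpose(1, 2)))    # (batch, m_1, n), Eq. 2-3
        f = A.max(dim=2).values                         # max over program length, Eq. 4
        z = torch.relu(self.hidden(f))                  # Eq. 5
        return self.out(z)                              # class logits; softmax as in Eq. 6


# Example usage:
# model = MalwareConvNet()
# probs = torch.softmax(model(torch.randint(0, 218, (4, 500))), dim=1)
```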
3.3 Learning Process

Given the above definitions, the cost function to be minimized during training for a batch of b training samples, {X^{(1)}, ..., X^{(b)}}, can be written as follows:

$$C = -\frac{1}{b} \sum_{j=1}^{b} \sum_{i=1}^{I} 1\{y^{(j)} = i\} \log p(y^{(j)} = i \mid z^{(j)}) \qquad (7)$$

where z^{(j)} is the vector output after applying the neural network to training example X^{(j)}, where y^{(j)} is the provided correct label for example X^{(j)}, and where 1{x} is an indicator function that is 1 if its argument x is true and 0 otherwise. The cost depends both on the parameters of the neural network, Θ, i.e. the weights and biases across all layers (W_E, W_{l,m}, b_{l,m}, W_h, b_h, w_i, and b_i), and on the current training samples. The objective during training is to update the network's parameters, which are initialized randomly before training begins, so as to reduce the cost. This update is performed stochastically by computing the gradient of the cost function with respect to the parameters, ∂C/∂Θ, given the current batch of samples, and using this gradient to update the parameters after every batch to reduce the cost as follows:

$$\Theta^{(t+1)} = \Theta^{(t)} - \alpha \frac{\partial C}{\partial \Theta} \qquad (8)$$

where α is a small positive real number called the learning rate. During training the network is repeatedly presented with batches of training samples in randomized order until the parameters converge.

To deal with the imbalance in the number of training samples available for the malware and benign classes, the gradients used to update the network parameters are weighted depending on the label of the current training sample. This helps to reduce classifier bias towards predicting the more populous class. Let the number of malware samples in the training-set be M and the number of benign samples be B. Assuming there are more samples of benign software than malware, the weight for malware samples is 1 − M/(M + B) and the weight for benign samples is M/(M + B), i.e. the gradients are weighted in inverse proportion to the number of samples for each class.

Note that a consideration when designing our proposed architectures was to keep the number of parameters relatively low, in order to help prevent over-fitting given the relatively small number of training samples usually available. A typical deep network may have millions of parameters [20], while our malware detection network has only tens of thousands of parameters, which drastically reduces the need for large numbers of training samples.
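As an illustration of the class weighting and update rule above, the sketch below trains the MalwareConvNet sketch from Section 3.2 with a class-weighted cross-entropy loss and RMSProp (the optimizer reported in Section 4). The label encoding, function name and batch handling are assumptions made for this example rather than the authors' exact training code.

```python
import torch
import torch.nn as nn

# Suppose the training set contains M malware and B benign samples (B > M).
M, B = 2475, 3627
class_weight = torch.tensor([M / (M + B),         # weight for the benign class
                             1.0 - M / (M + B)])  # weight for the malware class

model = MalwareConvNet()                              # sketch from Section 3.2
criterion = nn.CrossEntropyLoss(weight=class_weight)  # weighted form of Eq. 7
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-2)


def train_step(opcode_ids, labels):
    """One stochastic update (Eq. 8) on a mini-batch; labels: 0 = benign, 1 = malware."""
    optimizer.zero_grad()
    loss = criterion(model(opcode_ids), labels)
    loss.backward()       # compute dC/dTheta for the current batch
    optimizer.step()      # Theta <- Theta - alpha * dC/dTheta (RMSProp-scaled)
    return loss.item()
```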
4. RESULTS

In order to evaluate the performance of our approach a set of experiments was designed. The architecture used in all experiments had only a single convolutional layer. This architecture was used because the available datasets have a relatively small number of training samples, which means that networks with large numbers of parameters could be prone to over-fitting. Convolutional networks with only a single convolutional layer have been shown to perform well on natural language text classification tasks [27]. In this architecture, the remaining hyperparameters, such as the dimension of the embedding space and the length and number of convolutional filters, were set empirically using 10-fold cross validation on the validation-sets of the small and large datasets. The resulting values are an 8-dimensional embedding space, 64 convolutional filters of length 8, and 16 neurons in the hidden fully-connected layer.

Our experiments were carried out on three different datasets. The first dataset consists of malware from the Android Malware Genome project [28] and has been widely used [10, 11]. This dataset has a total of 2123 applications, of which 863 are benign and 1260 are malware from 49 different malware families. Labels are provided for the malware family of each sample. The benign samples in this dataset were collected from the Google Play store and have been checked using VirusTotal to ascertain that they were highly probable to be malware free. We refer to this dataset as the 'Small Dataset'.

The second dataset was provided by McAfee Labs (now Intel Security) and comes from the vendor's internal repository of Android malware. After discarding empty files or files that are less than 8 opcodes long, the dataset contains 2475 malware samples and 3627 benign samples. This dataset does not include malware family labels and may include malware and/or benign applications present in the small dataset. Hence, to ensure training hygiene, i.e. that we do not train on the testing-set, the network is trained and tested on each dataset separately without cross-contamination. We refer to this dataset as the 'Large Dataset'.

We also have an additional dataset provided by McAfee Labs containing approximately 18,000 Android programs, which was collected more recently than the first two datasets. This was used for testing the final system after setting the hyper-parameters using the smaller datasets. After discarding short files, the dataset contains 9268 benign files and 9902 malware files. We refer to this dataset as the 'V. Large Dataset'.

Each dataset was split into 90% for training and validation, with the remaining 10% held out for testing. Care was taken to ensure that the ratio of positive to negative samples in the validation and testing sets was the same as in the dataset as a whole. Results are reported using the mean of the classification accuracy, precision, recall and f-score. The key indicator of performance is f-score, because the number of samples in the malware and benign classes is not equal. In this situation, classification accuracy is overly influenced by the number of samples in each class. For example, if the majority of samples were of class x, and the classifier simply reported x in all cases, the classification accuracy would be high, although the classifier would not be useful. However, given the same conditions, the f-score, which is based on precision and recall, would be low.

Our neural network software was developed using the Torch scientific computing environment [4]. During training the network parameters were optimized using RMSProp [3] with a learning rate of 1e-2, for 10 epochs, using a mini-batch size of 16. The network weights were randomly initialized using the default Torch initialization. We used an Nvidia GTX 980 GPU for development of the network, and training the network to perform malware classification takes around 25 minutes on the large dataset (which contains approximately 6000 example programs). Once the network has been trained, our implementation can classify approximately 3000 files per second on the GPU.
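Since f-score is used as the key performance indicator, a short sketch of how precision, recall and f-score relate to the confusion-matrix counts may be helpful; the function below is a generic illustration, not the evaluation code used in the paper.

```python
def precision_recall_fscore(y_true, y_pred, positive=1):
    """Compute precision, recall and F1, treating `positive` (malware) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, fscore


# Example: precision_recall_fscore([1, 0, 1, 1], [1, 0, 0, 1]) -> (1.0, 0.667, 0.8)
```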
4.1 Computational Efficiency

In this experiment we compare the computational efficiency of our proposed malware classification system with our implementation of a conventional n-gram based malware classification system [10]. Note that when reporting the results we do not include the time taken to disassemble the malware files, as this is constant for both systems. The results in Table 2 are presented in terms of both the average time to reach a classification decision for a single malware file, and the corresponding average number of programs that can be classified per second.

System    Time per program (s)    Programs per second
Ours      0.000329                3039.8
1-gram    0.000569                1758.3
2-gram    0.010711                93.4
3-gram    0.172749                5.8

Table 2: Comparison of the time taken to reach a classification decision, and the number of programs that can be classified per second, for our proposed neural network system and a conventional n-gram based system.

It can be seen from Table 2 that our system can produce a much higher number of malware classification decisions per second than the n-gram based system. The n-gram based system also experiences exponential slow-down as the length of the n-gram features is increased. This severely limits the use of longer n-grams, which are necessary for improved classification accuracy. Our proposed system is not limited in the same way; in fact, the features extracted by the first layer of the CNN can be thought of as n-grams where n = 8. Use of such features with a conventional n-gram based system would be much too computationally expensive. Our proposed neural network system is implemented on a desktop GPU, specifically an Nvidia GTX 980; however, it could easily be moved to the GPU of a mobile device, allowing for fast and efficient malware classification of Android applications.

Finally, the memory usage required to execute the trained neural network is constant. Increasing the length or number of convolutional filters, or increasing the number of training examples, increases memory usage only linearly, whereas with n-gram based systems, increasing the training-set size dramatically increases the number of unique n-grams and hence the memory usage. For instance, with the small dataset there are 213 unique 1-grams, 1891 unique 2-grams, and 286471 unique 3-grams. This means our proposed neural network based system is also more efficient in terms of memory usage during training.

4.2 Classification Accuracy

Classification System    Feature Types                        Benign   Malware   Acc.   Prec.   Recall   F-score
Ours (Small DS)          CNN applied to raw opcodes           863      1260      0.98   0.99    0.95     0.97
Ours (Large DS)          CNN applied to raw opcodes           3627     2475      0.80   0.72    0.85     0.78
Ours (V. Large DS)       CNN applied to raw opcodes           9268     9902      0.87   0.87    0.85     0.86
n-grams (Small DS)       opcode n-grams (n=1)                 863      1260      0.95   0.95    0.95     0.95
n-grams (Small DS)       opcode n-grams (n=2)                 863      1260      0.98   0.98    0.98     0.98
n-grams (Small DS)       opcode n-grams (n=3)                 863      1260      0.98   0.98    0.98     0.98
n-grams (Large DS)       opcode n-grams (n=1)                 3627     2475      0.80   0.81    0.80     0.80
n-grams (Large DS)       opcode n-grams (n=2)                 3627     2475      0.81   0.83    0.82     0.82
n-grams (Large DS)       opcode n-grams (n=3)                 3627     2475      0.82   0.83    0.82     0.82
DroidDetective [13]      Perms. combination                   741      1260      0.96   0.89    0.96     0.92
Yerima [23]              API calls, Perms., intents, cmnds    1000     1000      0.91   0.94    0.91     0.92
Jerome [10]              opcode n-grams                       1260     1246      -      -       -        0.98
Yerima [25]*             API calls, Perms., intents, cmnds    2925     3938      0.97   0.98    0.97     0.97
Yerima (2) [24]*         API calls, Perms., intents, cmnds    2925     3938      0.96   0.96    0.96     0.96

Table 1: Malware classification results for our system on the small, large and v. large datasets, compared with results from the literature. Results from the literature marked with a (*) use malware from the McAfee Labs dataset, i.e. our large dataset, while all others use malware sampled from the Android Malware Genome project [28] dataset, i.e. our small dataset.

In this experiment, the network's performance is measured in terms of accuracy. The network was trained using the complete training and validation set, then tested on the held-out test-set that was not seen during hyper-parameter tuning. We compare the performance of our proposed system with our own implementation of an n-gram based malware detection method [10].
For both datasets we measured the performance of this system using 1, 2 and 3-gram features. The same training and testing samples were used for both systems in order to allow for direct comparison of their performance. The results for the small, large and v. large datasets are shown in Table 1. We have endeavored to select papers from the literature that use similar Android malware datasets, to give as fair a comparison as possible.

On the small dataset our proposed method clearly achieves state-of-the-art performance, and is comparable to methods such as [10] and [23]. It achieves better performance than our baseline n-gram system with 1-gram features and near-identical performance to the baseline with 2- and 3-gram features. The large dataset is more challenging due to the greater variability of malware present. Our system achieves similar performance to the baseline n-gram system, while having far greater computational efficiency (see Section 4.1). Although other methods have achieved better performance on similar tests, they make use of additional outside information such as the application's requested permissions or API calls [25]. In contrast, our proposed method needs only the raw opcodes, which avoids the need for features manually designed by domain experts. Moreover, our proposed method has the advantage over existing methods of being very computationally efficient, as it is capable of classifying approximately 3000 files per second. The results on the v. large dataset, which was obtained from the same source as the large dataset and hence likely shares similar characteristics, show that our system's performance improves as more training data is provided. This phenomenon has been observed when training neural networks in other domains, where performance is highly correlated with the number of training samples. We expect that these results can be further improved given greater quantities of training data, which will also allow more complex network architectures to be explored. Unfortunately, comparisons with the baseline n-gram system on the v. large dataset were not possible due to the computational cost associated with the n-gram method.

4.3 Learning Curves

In this experiment we aim to understand the system's performance as a function of the quantity of training data, with the aim of predicting how its performance is likely to change if more training data were to be made available. This experiment was performed on the V. Large dataset. As in previous experiments, the dataset is split into training and validation sets. Throughout the experiment the validation-set remains fixed. An artificially reduced-size training-set is constructed by randomly sub-sampling from the complete set of training examples. The network is then trained from scratch on this reduced-size training-set, and the system's performance is measured on both the training and validation sets. This process is repeated for several different sizes of training-set, ranging from a small number of examples up to the complete set of all training-examples. The system's performance on the validation-set and training-set is then plotted as a function of the training-set size. Performance is recorded in terms of 1 − f-score, meaning that perfect performance would produce a value of zero.
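A sketch of this learning-curve procedure is given below. The subsample sizes and the train_model and f_score callables are illustrative assumptions standing in for the training and evaluation routines described earlier, not the exact experimental protocol code.

```python
import random


def learning_curve(train_set, val_set, train_model, f_score,
                   sizes=(100, 300, 1000, 3000, 10000)):
    """Record 1 - F-score on the training subset and on the fixed validation set.

    train_model: callable fitting the CNN from scratch on a list of (opcodes, label) pairs.
    f_score: callable evaluating a trained model on a dataset and returning its F-score.
    """
    curve = []
    for size in sizes:
        subset = random.sample(train_set, min(size, len(train_set)))
        model = train_model(subset)                      # retrain from scratch each time
        curve.append((size,
                      1.0 - f_score(model, subset),      # training-set error
                      1.0 - f_score(model, val_set)))    # validation-set error
    return curve
```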
In Figure 3, we can see that when only a small number of training-examples are provided, training-set performance is perfect, while validation-set performance is very poor. This is to be expected, as with such a small number of training-examples the system will over-fit to the training-set and the learned parameters will not generalize to the unseen validation-set. However, as more training-examples are provided the validation-set error decreases, showing that the system has learned to generalize from the training-set. We can predict from the learning curves in Figure 3 that if more training-examples were to be provided, the validation-set error would continue to decrease. These results suggest that our system benefits from larger quantities of training data, as expected with neural networks [20]. They also show that the poor performance on the 'Large Dataset', which was obtained from the same source as the 'V. Large Dataset' and hence shares similar characteristics, is caused by a lack of data. This is indicated by the gap between the validation and training-set errors when only approximately 6000 training examples are provided.

Figure 3: Learning curves for the validation-set and training-set as the number of training examples is varied (error is reported as 1 − F-score). Note the log-scale on the x-axis.

4.4 Realistic Testing

In order to assess the potential of our proposed classification technique in realistic environments we apply our trained network to a completely new dataset. This allows us to demonstrate the real-world potential of our classification technique when applied to an unknown and realistic dataset at a larger scale. The network used in this experiment was trained on the V. Large dataset, introduced in Section 4. Our new dataset consists of 96,412 benign apps and 24,103 malware apps. The benign apps were randomly selected from the Google Play store, and were collected during July and August 2016. To represent a distinct set of malicious apps, we used another dataset containing known malware apps, including those from the Android Malware Genome project [28], but removing the ones overlapping with the training set of the network. Approximately 1 TB of APKs were used in this experiment. The APKs were converted to opcode sequences using a cloud architecture consisting of 29 machines running in parallel, in a process which took around 11 hours. Classification of the opcode sequences was performed using an Nvidia GTX 1080 GPU, and took an hour to complete. Note that for this experiment we assume that all APKs in the Google Play dataset are benign, and all the APKs in the malicious dataset are malicious. Of course, this may be a naive assumption, as it is possible for malicious apps to exist on Google Play.
Cross-validation testing was performed on our new dataset. In each cross-validation fold approximately 24,000 malware applications and 24,000 benign applications were used; therefore, in order to present all applications to the network, four-fold cross-validation was used. The results of this experiment are reported in Table 3.

Classification System    Acc.   Prec.   Recall   F-score
Ours                     0.69   0.67    0.74     0.71

Table 3: Malware classification results of our system tested on an independent dataset of benign and malware Android applications.

We can see from the results in Table 3 that, although the f-score is lower than in previous experiments, our system has the potential to work in realistic environments. This is because our new testing dataset is much larger than the one used for training the network and contains greater variability of applications. The results of this experiment show that the network has learned features with the ability to generalise to realistic data. In future work we hope to take advantage of our new dataset to explore more complex network architectures that can be learned given more training data.

5. CONCLUSIONS

In this paper we have presented a novel Android malware detection system based on deep neural networks. This innovative application of deep learning to the field of malware analysis has shown good performance and potential in comparison with other state-of-the-art techniques, and has been validated on four different Android malware datasets. Our system is capable of simultaneously learning to perform feature extraction and malware classification given only the raw opcode sequences of a large number of labeled malware samples. The main advantages of our system are that it removes the need for hand-engineered malware features, it is much more computationally efficient than existing n-gram based malware classification systems, and it can be implemented to run on the GPU of mobile devices.

As future work, we would like to extend our methodology to both dynamic and static malware analysis on different platforms. Our proposed method is general enough that it could be applied to other types of malware analysis with only minor changes to the network architecture. For instance, the network could process sequences of instructions produced by dynamic analysis software. Similarly, by changing the disassembly preprocessing step the same network architecture could be applied to malware analysis on different platforms. Another open problem for malware classification, which may allow networks with more parameters, and hence greater discriminative power, to be used, is data augmentation. Data augmentation is a way to artificially increase the size of the training-set by slightly modifying existing training-examples. The transformations used in data augmentation are usually chosen to simulate variations that occur in real-world data, but which may not be extensively covered by the available training-set. We would like to investigate the design of data-augmentation schemes appropriate to malware detection.

6. REFERENCES

[1] Baksmali. https://github.com/JesusFreke/smali. Accessed: 2015-02-15.
[2] Dalvik bytecode. https://source.android.com/devices/tech/dalvik/dalvik-bytecode.html. Accessed: 2015-02-01.
[3] RMSProp. www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. Slide 29.
[4] Torch. http://torch.ch/.
[5] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck. Drebin: Effective and explainable detection of Android malware in your pocket. In NDSS, 2014.
[6] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[7] G. Canfora, F. Mercaldo, and C. A. Visaggio. Mobile malware detection using op-code frequency histograms. In Proc. of Int. Conf. on Security and Cryptography (SECRYPT), 2015.
[8] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu. Large-scale malware classification using random projections and neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE Int. Conf. on, pages 3422-3426, 2013.
[9] O. E. David and N. S. Netanyahu. DeepSign: Deep learning for automatic malware signature generation and classification. In Neural Networks (IJCNN), 2015 Int. Joint Conf. on, pages 1-8, 2015.
[10] Q. Jerome, K. Allix, R. State, and T. Engel. Using opcode-sequences to detect malicious Android applications. In Communications (ICC), 2014 IEEE Int. Conf. on, pages 914-919, 2014.
[11] B. Kang, B. Kang, J. Kim, and E. G. Im. Android malware classification method: Dalvik bytecode frequency analysis. In Proc. of the 2013 Research in Adaptive and Convergent Systems, pages 349-350, 2013.
[12] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[13] S. Liang and X. Du. Permission-combination-based scheme for Android mobile malware detection. In Communications (ICC), 2014 IEEE Int. Conf. on, pages 2301-2306, 2014.
[14] X. Liu and J. Liu. A two-layered permission-based Android malware detection scheme. In Mobile Cloud Computing, Services and Engineering (MobileCloud), 2014 2nd IEEE Int. Conf. on, pages 142-148, 2014.
[15] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas. Malware classification with recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE Int. Conf. on, pages 1916-1920, 2015.
[16] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, P. G. Bringas, and G. Álvarez. PUMA: Permission usage to detect malware in Android. In Int. Joint Conf. CISIS'12-ICEUTE'12-SOCO'12, pages 289-298, 2013.
[17] J. Saxe and K. Berlin. Deep neural network based malware detection using two dimensional binary program features. In 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), pages 11-20, Oct 2015.
[18] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss. "Andromaly": a behavioral malware detection framework for Android devices. Journal of Intelligent Information Systems, 38(1):161-190, 2012.
[19] A. Sharma and S. K. Dash. Mining API calls and permissions for Android malware detection. In Cryptology and Network Security, pages 191-205, 2014.
[20] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[21] X. Su, M. C. Chuah, and G. Tan. Smartphone dual defense protection framework: Detecting malicious applications in Android markets. In Mobile Ad-hoc and Sensor Networks (MSN), 2012 Eighth Int. Conf. on, pages 153-160, 2012.
[22] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu. DroidMat: Android malware detection through manifest and API calls tracing. In Information Security (Asia JCIS), 2012 7th Asia Joint Conf. on, pages 62-69, 2012.
[23] S. Y. Yerima, S. Sezer, G. McWilliams, and I. Muttik. A new Android malware detection approach using Bayesian classification. In Advanced Information Networking and Applications (AINA), 2013 IEEE 27th Int. Conf. on, pages 121-128, 2013.
[24] S. Y. Yerima, S. Sezer, and I. Muttik. Android malware detection: An eigenspace analysis approach. In Science and Information Conference (SAI), 2015, pages 1236-1242, 2015.
[25] S. Y. Yerima, S. Sezer, and I. Muttik. High accuracy Android malware detection using ensemble learning. Information Security, IET, 9(6):313-320, 2015.
[26] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649-657, 2015.
[27] Y. Zhang and B. Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.
[28] Y. Zhou and X. Jiang. Dissecting Android malware: Characterization and evolution. In Security and Privacy (SP), 2012 IEEE Symp. on, pages 95-109, 2012.