
Machine Learning for Malware Evolution Detection

2021, ArXiv

Malware evolves over time and antivirus must adapt to such evolution. Hence, it is critical to detect those points in time where malware has evolved so that appropriate countermeasures can be undertaken. In this research, we perform a variety of experiments on a significant number of malware families to determine when malware evolution is likely to have occurred. All of the evolution detection techniques that we consider are based on machine learning and can be fully automated—in particular, no reverse engineering or other labor-intensive manual analysis is required. Specifically, we consider analysis based on hidden Markov models (HMM) and the word embedding techniques HMM2Vec and Word2Vec.

Lolitha Sresta Tupadha∗ Mark Stamp∗†

arXiv:2107.01627v1 [cs.CR] 4 Jul 2021

December 3, 2021

1 Introduction

Malware is software that is intended to be malicious in its effect [1]. By one recent estimate, there are more than one billion malware programs in existence, with 560,000 new malware samples discovered every day [12]. Clearly, malware is a major cybersecurity threat, if not the most serious security threat today.

Since the creation of the ARPANET in 1969, there has been an exponential growth in the number of users of the Internet. The widespread use of computer systems along with continuous Internet connectivity of the “always on” paradigm makes modern computer systems prime targets for malware attacks. Malware comes in many forms, including viruses, worms, backdoors, trojans, adware, ransomware, and so on.

Malware is a continuously evolving threat to information security. In the field of malware detection, a signature typically consists of a string of bits that is present in a malware executable. Signature-based detection is the most popular method of malware detection used by anti-virus (AV) software [1]. But malware has become increasingly difficult to detect with standard signature-based approaches [34].
Virus writers have developed advanced metamorphic generators and obfuscation techniques that enable their malware to easily evade signature detection. For example, in [3], the authors prove that carefully constructed metamorphic malware can successfully evade signature detection.

∗ Department of Computer Science, San Jose State University
† [email protected]

Koobface is a recent example of an advanced form of malware. This malware was designed to target the users of social media, and its infection is spread via spam that is sent through social networking websites. Once a system is infected, Koobface gathers a user’s sensitive information such as banking credentials, and it blocks the user from accessing anti-virus or other security-related websites [11].

Malware writers modify their code to deal with advances in detection, as well as to add new features to existing malware [2]. Hence, malware can be perceived as evolving over time. To date, most research into malware evolution has relied on software reverse engineering [7], which is labor intensive. Our goal is to detect malware evolution automatically, using machine learning techniques. We want to find points in time where it is likely that significant evolution has occurred within a given malware family. It is important to detect such evolution, as these points are precisely where modifications to existing detection strategies are urgently needed.

We note in passing that malware evolution detection can play an additional crucial role in malware research, beyond updating existing detection strategies to deal with new variants. Generally, in malware research, we consider samples from a specific family, without regard to any evolutionary changes that may have occurred over time. An adverse side effect of such an approach is that, with respect to any specific point in time, we are mixing together past, present, and future samples.
Relying on training based on future samples to detect past (or present) samples is an impossibility in any real-world setting, yet it is seldom accounted for in research. By including an accurate evolutionary timeline, we can conduct far more realistic research. Thus, accurate information regarding malware evolution will also serve to make research results more realistic and trustworthy.

We consider several machine learning techniques to identify potential malware evolution, and our experiments are conducted using a significant number of malware families containing a large number of samples collected over an extended period of time. We extract the opcode sequence from each malware sample, and these sequences are used as features in our experiments. We group the available samples based on time periods and we train machine learning models on time windows. We compare the models to determine likely evolutionary points: substantial differences in models across a time boundary indicate significant change in the code base of the malware family under consideration. Specifically, we experiment with hidden Markov models (HMM) and word embedding techniques (Word2Vec and HMM2Vec). For comparison, we also consider logistic regression.

The remainder of this paper is organized as follows. In Section 2, we discuss a range of relevant background topics, including malware, related work, and our dataset, and we introduce the learning techniques that we employ in our experiments. Section 3 contains our experimental results, while Section 4 gives our conclusions along with a discussion of a few potential avenues for future work.

2 Background

In this section, we first give a brief introduction to malware. Then we consider related work in the area of malware evolution detection.

2.1 Malware

A computer worm is a kind of malware that spreads by itself over a network [1]. Examples of famous worms include Code Red, Blaster, Stuxnet, Santy, and, of course, the Morris Worm [33].
Viruses are the most common form of malware, and the word “virus” is often used interchangeably with “malware.” A computer virus is similar to a worm but it requires outside assistance to transmit its infection from one system to another. Viruses are often considered to be parasitic, in the sense that they embed themselves in benign code. More advanced forms of viruses (and malware, in general) often use encryption, polymorphism, or metamorphism as means to evade detection [1]. These techniques are primarily aimed at defeating signature-based detection, although they can also be effective against more advanced detection strategies.

A trojan horse, or simply a trojan, is malicious software that appears to be innocent but carries a malicious payload. Trojans are particularly popular today, with the vast majority of Android malware, for example, being trojans.

A trapdoor or backdoor is malware that allows unauthorized access to an infected system [33]. Such access allows an attacker to use the system in a denial of service (DoS) attack, for example.

Traditionally, malware detection has relied on static signatures, which typically consist of strings of bits found in specific malware samples. While effective, signatures can be defeated by a wide variety of obfuscation and morphing techniques, and the sheer number of malware samples today can make signature scanning infeasible. Recently, machine learning and deep learning techniques have become the tools of choice for malware detection, classification, and analysis.

We would argue that it is also critical to detect malware evolution, since we need to know when a malware family has evolved in a significant way so that we can update our detection techniques to account for such changes. As we see in the next section, this aspect of malware analysis has, thus far, received only limited attention from the research community.
2.2 Related Work

While there is a great deal of research involving applications of machine learning to malware detection, classification, and analysis, there are very few articles that consider malware evolution.

In [10], analysis of malware based on code injection is considered. This work deals with shell code extracted from malware samples. The researchers used clustering techniques to analyze shell code to determine relationships between various samples. This work was successful in determining the similarities between samples, showing that a significant amount of code sharing had occurred. A drawback to the approach in this paper is that the authors only considered analysis of shell code. While shell code often serves as the attack vector for malware, other attack vectors are possible, and malware evolution is not restricted to the attack portion of the code. For example, a malware family might evolve to be more stealthy or obfuscated, without affecting the attack payload. Another limitation of this research is that it only considers software similarity, and not malware evolution, per se.

Malware evolution research is considered in [8]. One positive aspect of this research is that it considers a large dataset that spans two decades. The authors use techniques based on graph pruning and they claim to show that specific properties of various families are inherited from other families. However, it is not clear whether these properties are inherited from other families, or were developed independently. In addition, this work relies on manual investigation. A primary goal of our research is to eliminate the need for such manual intervention.

The research presented in [29] is focused on detecting malware variants, which can be considered as a form of evolution detection. The authors apply semi-supervised learning techniques to malware samples that have been shown to evade machine learning based detection.
In contrast, in our research, we use unsupervised learning techniques to detect significant evolutionary points in time which, again, serves to minimize the need for manual intervention.

The authors of [12] extract a variety of features from Android malware samples, and then determine various trends based on standard software quality metrics. These results are then compared to trends present in Android goodware. This work shows that the trends in the Android malware and goodware are similar, with changes in malware following a similar path as goodware. These results are not surprising, given that Android malware largely consists of trojans that, by necessity, would tend to have a great deal of overlap with goodware.

The work presented in [5] is focused on malware taxonomy, which provides some insights into malware evolution, in the form of genealogical trajectories. This research is based on features extracted from malware encyclopedia entries, which have been developed by antivirus software vendors, such as TrendMicro. The authors use SVMs and language processing techniques to extract features on which their results are based.

In general, the features used in malware analysis can be considered to be either static or dynamic. Static features are those that can be collected without executing the code, whereas dynamic features require code execution or emulation. In general, static features are easier to collect, while dynamic features are more robust with respect to common obfuscation techniques [6].

The authors of [32] use multiple static features to perform malware classification among various families. The static features that are considered are byte 𝑛-grams, entropy, and image representations. In addition, hex-dump based features are also used, along with features extracted from disassembled files, including opcodes, API calls, and sectional information from portable executable (PE) files.
This work provides interesting insights on a wide variety of static features.

The research that we present in this paper can be viewed as a continuation of work that originated in [36], where static PE file features of malware samples are used as the basis for malware evolution detection. This previous research employed linear support vector machine (SVM) techniques to train on samples from a specific family over sliding windows of time. The resulting SVM weights are compared based on a 𝜒² measure, and observed differences in model weights are used to indicate potential evolutionary points in time.

The work in [30], which employs opcode sequences from malware samples to analyze malware evolution, is related to the research presented in [36]. In [30], the data is again divided into time windows, and SVM techniques are used to observe evolutionary points in the malware samples. In addition, hidden Markov model (HMM) techniques are used as a secondary test to confirm suspected evolutionary points in time.

Our research in this paper is a further extension to this previous work. We perform extensive experiments with HMMs and the word embedding techniques of Word2Vec and HMM2Vec to analyze malware evolution. We find that we can automatically detect significant evolution in malware families using these techniques.

2.3 Dataset

The dataset we use in this research consists of Windows portable executable files belonging to 15 malware families. Two families (Winwebsec and Zbot) are from the Malicia dataset [28], while the remaining families are from a larger dataset that was constructed using VirusShare [9]. Each malware family contains a number of samples from an extended period of time. Samples belonging to a malware family are assumed to have similar characteristics and to share a code base.
However, samples within the same family differ, as malware writers regularly modify successful malware to perform slightly different functions, to make it harder to detect, or for other purposes.

The number of samples in each family in our dataset is given in Table 1. The table also includes the time range over which the samples were produced. The malware families in our dataset encompass a wide variety of types, including viruses, trojans, backdoors, worms, and so on. Some of the families use encryption and other obfuscation techniques in an effort to evade detection. Next, we briefly discuss each of the malware families listed in Table 1.

Bifrose is a backdoor trojan [25]. As mentioned above, a trojan poses as innocent software to trick the user into installing it, while a backdoor serves to give an attacker unauthorized access to an infected system.

CeeInject performs various malicious operations. CeeInject uses obfuscation techniques to evade signature detection [16].

Table 1: Number of samples used in experiments

Family        Samples   Years
Adload            791   2009–2011
Bho              1116   2007–2011
Bifrose           577   2009–2011
CeeInject         742   2009–2012
DelfInject        401   2009–2012
Dorkbot           222   2005–2012
Hupigon           449   2009–2011
Ircbot             59   2009–2012
Obfuscator        670   2004–2017
Rbot              127   2001–2012
Vbinject         2331   2009–2018
Vobfus            700   2009–2011
Winwebsec        1511   2008–2012
Zbot              835   2009–2012
Zegost            506   2008–2011
Total          11,037   2001–2018

DelfInject is a worm that resides on websites and is downloaded to a user’s machine when visiting an infected site. This malware is executed whenever the system is restarted [17].

Dorkbot is a worm that is used to steal credentials of users on an infected system. It performs a denial of service (DoS) attack, and it is spread via messaging applications [24].

Hotbar is an adware virus that resides on websites and is downloaded onto a user’s system when visiting a site that hosts the malware.
Hotbar is more annoying than harmful, as it displays advertisements when the user browses the Internet [14].

Hupigon is also a backdoor trojan, similar to Bifrose [15].

Obfuscator evades signature detection using sophisticated obfuscation techniques. It can perform a variety of malicious activities [21].

Rbot is a backdoor trojan that allows attackers into the system through an IRC channel. This is a relatively advanced malware that is typically used to launch denial of service (DoS) attacks [13].

VbInject uses encryption techniques to evade signature detection. Its primary purpose is to disguise other malware that can be hidden inside of it. Its payload can vary from harmless to severe [18].

Vobfus is a trapdoor that lets other malware into the system. It exploits the vulnerabilities of the Windows operating system autorun feature to spread on a network. This malware makes changes to the system configuration that cannot be easily undone [19].

Winwebsec is a trojan that attempts to trick a user into paying money by portraying itself as anti-virus software. It gives deceptive messages claiming that the system has been infected [20].

Zbot is a trojan that steals private user information from an infected system. It can target information such as system data and banking details, and it can be easily modified to acquire other types of data. This trojan is generally spread via spam [22].

Zegost is another backdoor trojan that gives an attacker access to a compromised system [23].

We obtain Windows PE files for each sample in the families discussed above. All of our analysis is based on opcodes, so we first disassemble the files and extract the mnemonic opcode sequence from each, discarding labels, directives, and so on. Since opcodes encapsulate the function of the program, we can expect opcode sequences to be useful in detecting code evolution. The resulting opcode sequence will serve as input to our machine learning techniques.
In addition, we segregate the samples from each family according to their creation date.

Next, we briefly describe each of the learning techniques considered in this paper.

2.4 Learning Techniques

In this section, we discuss the learning techniques that are used in our experiments. Specifically, we introduce hidden Markov models, HMM2Vec, Word2Vec, and logistic regression.

2.4.1 Hidden Markov Models

As the name suggests, a hidden Markov model (HMM) includes a Markov process that is “hidden” in the sense that it cannot be directly observed. We do have access to a series of observations that are probabilistically related to the underlying (hidden) Markov process. We can train a model to fit a given observation sequence and, given a model, we can score an observation sequence to determine how closely it fits the model. A generic HMM is illustrated in Figure 1.

The number of hidden states in an HMM is denoted as $N$, and hence $A$ in Figure 1 is an $N \times N$ row stochastic matrix that drives the hidden Markov process. The number of distinct observation symbols is denoted as $M$. The $B$ matrix in Figure 1 is $N \times M$, with each row representing a discrete probability distribution on the symbols, relative to a given (hidden) state. The $B$ matrix serves to (probabilistically) relate the hidden states to the observations. Note that the $B$ matrix is also row stochastic. An HMM is specified as $\lambda = (A, B, \pi)$, where $\pi$ is a $1 \times N$ initial state distribution matrix.

[Figure 1: Hidden Markov model [35]. The hidden states $X_0, X_1, \ldots, X_{T-1}$ are linked by $A$, and each observation $\mathcal{O}_t$ is linked to the hidden state $X_t$ by $B$.]

2.4.2 Word2Vec

Word2Vec is a technique for embedding terms in a high-dimensional space, where the term embeddings are obtained by training a shallow neural network. After the training process, words that are more similar in context will tend to be closer together in the Word2Vec space. Perhaps surprisingly, meaningful algebraic properties hold for Word2Vec embeddings.
For example, according to [26], if we let $w_0$ = “king”, $w_1$ = “man”, $w_2$ = “woman”, $w_3$ = “queen”, and $V(w_i)$ is the Word2Vec embedding of word $w_i$, then $V(w_3)$ is the vector that is closest, in terms of cosine similarity, to

\[ V(w_0) - V(w_1) + V(w_2). \]

Suppose that we have a vocabulary of size $M$. We can encode each word as a “one-hot” vector of length $M$. For example, suppose that our vocabulary consists of the set of $M = 8$ words

\[ W = (w_0, w_1, w_2, w_3, w_4, w_5, w_6, w_7) = (\text{“for”}, \text{“giant”}, \text{“leap”}, \text{“man”}, \text{“mankind”}, \text{“one”}, \text{“small”}, \text{“step”}). \]

Then we encode “for” and “man” as

\[ E(w_0) = E(\text{“for”}) = 10000000 \quad\text{and}\quad E(w_3) = E(\text{“man”}) = 00010000, \]

respectively.

Now, suppose that our training data consists of the phrase

\[ \text{“one small step for man one giant leap for mankind”} \tag{1} \]

To obtain training samples, we specify a window size, and for each offset we use all pairs of words within the specified window. For example, if we select a window size of two, then from (1), we obtain the training pairs in Table 2.

Table 2: Training data (the word at the current offset is shown in brackets)

Offset                                 Training pairs
“[one] small step ...”                 (one,small), (one,step)
“one [small] step for ...”             (small,one), (small,step), (small,for)
“one small [step] for man ...”         (step,one), (step,small), (step,for), (step,man)
“... small step [for] man one ...”     (for,small), (for,step), (for,man), (for,one)
“... step for [man] one giant ...”     (man,step), (man,for), (man,one), (man,giant)
“... for man [one] giant leap ...”     (one,for), (one,man), (one,giant), (one,leap)
“... man one [giant] leap for ...”     (giant,man), (giant,one), (giant,leap), (giant,for)
“... one giant [leap] for mankind”     (leap,one), (leap,giant), (leap,for), (leap,mankind)
“... giant leap [for] mankind”         (for,giant), (for,leap), (for,mankind)
“... leap for [mankind]”               (mankind,leap), (mankind,for)

Consider the pair “(for,man)” from the fourth row in Table 2. As one-hot vectors, this training pair corresponds to input 10000000 and output 00010000.
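The construction of Table 2 and the one-hot encoding can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `training_pairs` and `one_hot` are hypothetical, and the vocabulary is assumed to be sorted alphabetically, which matches the ordering of $W$ given above.

```python
def training_pairs(words, window=2):
    """Pair each word with every other word at most `window` positions away."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


def one_hot(word, vocab):
    """Encode `word` as a 0/1 string of length len(vocab)."""
    return "".join("1" if w == word else "0" for w in vocab)


phrase = "one small step for man one giant leap for mankind".split()
vocab = sorted(set(phrase))  # alphabetical order, as in the vocabulary W

pairs = training_pairs(phrase, window=2)
print(pairs[:2])              # [('one', 'small'), ('one', 'step')]
print(one_hot("for", vocab))  # 10000000
print(one_hot("man", vocab))  # 00010000
```

Each word in the phrase contributes one row of Table 2, and words near either end of the phrase yield fewer pairs because the window is truncated there.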
A neural network similar to that in Figure 2 is used to generate Word2Vec embeddings. The input is a one-hot vector of length $M$ representing the first element of a training pair, such as those in Table 2, and the network is trained to output the second element of the ordered pair. The hidden layer consists of $N$ linear neurons and the output layer uses a softmax function to generate $M$ probabilities, where $p_i$ is the probability of the output vector corresponding to $w_i$ for the given input.

[Figure 2: Neural network for generating Word2Vec embeddings. The input vector feeds a hidden layer of neurons $\ell_0, \ldots, \ell_{N-1}$, which feeds an output layer of nodes $\omega_0, \ldots, \omega_{M-1}$ that produce the probabilities $p_0, \ldots, p_{M-1}$.]

Observe that the Word2Vec network in Figure 2 has $NM$ weights that are to be determined, as represented by the blue lines from the hidden layer to the output layer. For each output node $\omega_i$, there are $N$ edges (i.e., weights) from the hidden layer. The $N$ weights that connect to output node $\omega_i$ form the Word2Vec embedding $V(w_i)$ of the word $w_i$.

A Word2Vec model can be trained using either a continuous bag-of-words (CBOW) or a skip-gram model. The model discussed in this section uses the CBOW approach, and that is what we employ in our experiments in this paper. Note that in our implementation, we use opcodes as the “words.”

Several tricks are used to speed up the training of Word2Vec models. Such details are beyond the scope of this paper; see [27] for more information.

2.4.3 HMM2Vec

Analogous to Word2Vec, we can use the $B$ matrix of a trained HMM to specify vector embeddings corresponding to the observations. More precisely, each column of the $B$ matrix is associated with a specific observation, and hence we obtain vector embeddings of length $N$ directly from the $B$ matrix; we refer to the resulting embedding as HMM2Vec. Since HMM2Vec is not a standard vector embedding technique, in this section, we illustrate the process using a simple English text example.
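Before turning to that example, the correspondence between columns of $B$ and embedding vectors can be sketched in code. This is a minimal illustration under stated assumptions: the function name `hmm2vec_embeddings` is hypothetical, the matrix is stored as a list of rows, and the toy values and opcode names are made up rather than taken from any trained model.

```python
def hmm2vec_embeddings(B, symbols):
    """Map each observation symbol to its column of the N x M matrix B."""
    n_states = len(B)
    return {
        sym: [B[i][j] for i in range(n_states)]
        for j, sym in enumerate(symbols)
    }


# Toy B matrix with N = 2 hidden states and M = 3 symbols (made-up values).
# Each row is a probability distribution over the symbols.
B = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
symbols = ["mov", "push", "call"]

embeddings = hmm2vec_embeddings(B, symbols)
print(embeddings["push"])  # [0.2, 0.3]
```

Each embedding has length $N$, so the embedding dimension is controlled entirely by the number of hidden states chosen when training the HMM.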
Recall that an HMM is defined by the three matrices $A$, $B$, and $\pi$, and is denoted as $\lambda = (A, B, \pi)$. The $\pi$ matrix contains the initial state probabilities, $A$ contains the hidden state transition probabilities, and $B$ consists of the observation probability distributions corresponding to the hidden states. Each of these matrices is row stochastic, that is, each row satisfies the requirements of a discrete probability distribution. Notation-wise, we let $N$ be the number of hidden states, $M$ be the number of distinct observation symbols, and $T$ be the length of the observation (i.e., training) sequence. Note that $M$ and $T$ are determined by the training data, while $N$ is a user-defined parameter. For more details on HMMs, see [35] or Rabiner’s fine tutorial [31].

Suppose that we train an HMM on a sequence of letters extracted from English text, where we convert all upper-case letters to lower-case, and we discard any character that is not an alphabetic letter or word-space. Then $M = 27$, and we select $N = 2$ hidden states, and suppose we use $T = 50{,}000$ observations for training. Note that each observation is one of the $M = 27$ symbols (letters, together with word-space). For the example discussed below, the sequence of $T = 50{,}000$ observations was obtained from the Brown corpus of English [4], but any source of English text could be used.

For one specific case, an HMM trained with the parameters listed in the previous paragraph yields the $B$ matrix in Table 3. Observe that this $B$ matrix gives us two probability distributions over the observation symbols, one for each of the hidden states. We observe that one hidden state essentially corresponds to vowels, while the other corresponds to consonants. This simple example nicely illustrates the machine learning aspect of HMMs, as no a priori assumption was made concerning consonants and vowels, and the only parameter we selected was the number of hidden states $N$.
The training process enabled the model to learn a crucial aspect of English directly from the data.

Table 3: Final $B^{\top}$ for HMM

Letter    State 0    State 1
a         0.13537    0.00364
b         0.00023    0.02307
c         0.00039    0.05605
d         0.00025    0.06873
e         0.21176    0.00223
f         0.00018    0.03556
g         0.00041    0.02751
h         0.00526    0.06808
i         0.12193    0.00077
j         0.00014    0.00326
k         0.00112    0.00759
l         0.00143    0.07227
m         0.00027    0.03897
n         0.00035    0.11429
o         0.13081    0.00143
p         0.00073    0.03637
q         0.00019    0.00134
r         0.00041    0.10128
s         0.00032    0.11069
t         0.00158    0.15238
u         0.04352    0.00098
v         0.00019    0.01608
w         0.00017    0.02301
x         0.00030    0.00426
y         0.00028    0.02542
z         0.00017    0.00100
space     0.34226    0.00375

Suppose that for a given letter $\ell$, we define its HMM2Vec representation $V(\ell)$ to be the corresponding row of the matrix $B^{\top}$ in Table 3. Then, for example,

\[
\begin{aligned}
V(\text{a}) &= \begin{pmatrix} 0.13537 & 0.00364 \end{pmatrix} & V(\text{e}) &= \begin{pmatrix} 0.21176 & 0.00223 \end{pmatrix} \\
V(\text{s}) &= \begin{pmatrix} 0.00032 & 0.11069 \end{pmatrix} & V(\text{t}) &= \begin{pmatrix} 0.00158 & 0.15238 \end{pmatrix}
\end{aligned}
\tag{2}
\]

Next, we consider the distance between these HMM2Vec representations. Instead of using Euclidean distance, we measure the cosine similarity.¹ The cosine similarity of vectors $X$ and $Y$ is the cosine of the angle between the two vectors. Let $S(X, Y)$ denote the cosine similarity between vectors $X$ and $Y$. Then for $X = (X_0, X_1, \ldots, X_{n-1})$ and $Y = (Y_0, Y_1, \ldots, Y_{n-1})$,

\[
S(X, Y) = \frac{\displaystyle\sum_{i=0}^{n-1} X_i Y_i}{\sqrt{\displaystyle\sum_{i=0}^{n-1} X_i^2}\;\sqrt{\displaystyle\sum_{i=0}^{n-1} Y_i^2}}
\]

¹ Cosine similarity is not a true metric, since it does not, in general, satisfy the triangle inequality.

In general, we have $-1 \le S(X, Y) \le 1$, but since our HMM2Vec encoding vectors consist of probabilities (and hence are non-negative values), in this case, we always have $0 \le S(X, Y) \le 1$.

When considering cosine similarity, the length of the vectors is irrelevant, as we are only considering the angle between vectors.
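As a quick numerical check of the definition, the following sketch computes the cosine similarity for the vectors in (2), and also confirms that scaling vectors to unit length reduces the computation to a dot product. The helper names `cosine_similarity` and `normalize` are hypothetical; the vector values are copied from Table 3.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def normalize(x):
    """Scale x to have length one."""
    length = math.sqrt(sum(a * a for a in x))
    return [a / length for a in x]


# HMM2Vec vectors from (2): rows of the transposed B matrix in Table 3.
V = {
    "a": (0.13537, 0.00364),
    "e": (0.21176, 0.00223),
    "s": (0.00032, 0.11069),
    "t": (0.00158, 0.15238),
}

print(round(cosine_similarity(V["a"], V["e"]), 4))  # 0.9999 (vowel vs. vowel)
print(round(cosine_similarity(V["a"], V["t"]), 4))  # 0.0372 (vowel vs. consonant)

# For unit-length vectors, cosine similarity is just the dot product.
a_unit, t_unit = normalize(V["a"]), normalize(V["t"])
dot_of_units = sum(p * q for p, q in zip(a_unit, t_unit))
```

Two vowels are nearly parallel in this two-dimensional space, while a vowel and a consonant are nearly orthogonal.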
Consequently, we might want to normalize all vectors to be of length one, say, $\tilde{X} = X/\|X\|$ and $\tilde{Y} = Y/\|Y\|$, in which case the cosine similarity simplifies to the dot product

\[
S(X, Y) = S(\tilde{X}, \tilde{Y}) = \sum_{i=0}^{n-1} \tilde{X}_i \tilde{Y}_i
\]

Henceforth, we use the notation $\tilde{X}$ to indicate a vector $X$ that has been normalized to be of length one.

For the vector encodings in (2), we find that for the vowels “a” and “e”, the cosine similarity is $S(V(\text{a}), V(\text{e})) = 0.9999$. In contrast, the cosine similarity of the vowel “a” and the consonant “t” is $S(V(\text{a}), V(\text{t})) = 0.0372$. The normalized vectors $\tilde{V}(\text{a})$ and $\tilde{V}(\text{t})$ are illustrated in Figure 3. Using the notation in this figure, cosine similarity is $S(V(\text{a}), V(\text{t})) = \cos(\theta)$.

[Figure 3: Normalized vectors $\tilde{V}(\text{a})$ and $\tilde{V}(\text{t})$, separated by the angle $\theta$.]

These results indicate that our HMM2Vec encodings, which are derived from a trained HMM, provide useful information on the similarity (or not) of pairs of letters. Note that we could obtain a vector encoding of any dimension by simply training an HMM with the number of hidden states $N$ equal to the desired vector length.

In our experiments below, we consider HMM2Vec embeddings. However, in this research, models are trained on opcodes instead of letters, and hence the embeddings are relative to individual opcodes.

2.4.4 Logistic Regression

Logistic regression is used widely for classification problems. This relatively simple technique relies on the sigmoid function, which is also known as the logistic function, and hence the name. The sigmoid function is defined as

\[
S(x) = \frac{1}{1 + e^{-x}}.
\]

Logistic regression can be viewed as a modification of linear regression. Whereas linear regression makes unbounded predictions, logistic regression models the probability that an observation takes one of two (binary) values, and the sigmoid function maps its output into the range 0 to 1.
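The bounded range of the sigmoid function is easy to confirm numerically. The sketch below is illustrative only; the function name `sigmoid` and the two-branch form, which avoids floating-point overflow for large negative inputs, are implementation choices rather than anything specified in the text.

```python
import math


def sigmoid(x):
    """Logistic function S(x) = 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)


for x in (-8, -2, 0, 2, 8):
    print(x, round(sigmoid(x), 4))
```

The function is increasing, equals 0.5 at zero, and satisfies $S(-x) = 1 - S(x)$, so large positive or negative inputs map to values near 1 or 0, respectively.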
The graph of the sigmoid function is given in Figure 4, from which we can see that the output must be between 0 and 1.

[Figure 4: Graph of sigmoid function, plotted for inputs from −8 to 8 with outputs between 0 and 1.]

3 Experiments and Results

In this section, we discuss our evolution detection experiments and results. We divide this section into four subsections, one for each technique considered, namely, logistic regression, HMM, HMM2Vec and Word2Vec.

3.1 Logistic Regression Experiments

As mentioned above, in [30] the authors use linear SVMs to detect potential malware evolution. Logistic regression is a simpler technique that, like SVM, is widely used for classification. Hence, we train logistic regression models over time-windows, analogous to the SVM approach in [30].

Specifically, we divide our data into overlapping time windows of one year, with a slide length of one month. All of the samples from the most recent one-year time window are taken as the +1 class, while samples from the current month are considered as the −1 class, and we train our logistic regression models on the resulting data. Each such model is represented by its weights, and we calculate the Euclidean distances between these weight vectors to measure the similarity of the models. We then plot these distances on a timeline; spikes in the graph indicate that the model has changed and hence evolution may have occurred. Figure 5 shows the results of our logistic regression experiments for Winwebsec and Zegost.

[Figure 5: Logistic regression results. Panels: (a) Winwebsec, (b) Zegost.]

The results in Figure 5 indicate that we have random fluctuations in the graphs, rather than significant spikes that would indicate evolution. Although our logistic regression model achieves high accuracy in classifying samples, the model weights do not appear to provide useful information regarding changes in the malware samples. Apparently, the noise inherent in these weights overwhelms the relevant information.
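The windowing and distance steps described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function names are hypothetical, the month labels are invented, and the weight vectors are made-up placeholders standing in for the weights of trained logistic regression models.

```python
import math


def sliding_windows(months, width=12, slide=1):
    """Return overlapping windows of `width` consecutive months, advancing by `slide`."""
    return [
        months[start:start + width]
        for start in range(0, len(months) - width + 1, slide)
    ]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def consecutive_distances(weight_vectors):
    """Distance between each model's weights and those of the next window."""
    return [
        euclidean(w_prev, w_next)
        for w_prev, w_next in zip(weight_vectors, weight_vectors[1:])
    ]


# Hypothetical month labels covering two years.
months = [f"{year}-{month:02d}" for year in (2010, 2011) for month in range(1, 13)]
windows = sliding_windows(months)
print(len(windows), windows[0][0], windows[0][-1])  # 13 2010-01 2010-12

# Made-up weight vectors, one per window, only to show the distance step.
weights = [[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
print(consecutive_distances(weights))  # [5.0, 0.0]
```

Plotting the output of `consecutive_distances` against time gives a timeline of the kind shown in Figure 5, where a spike would correspond to a large jump between the weights of adjacent windows.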
3.2 Hidden Markov Model Experiments

All experiments in this section are based on the top thirty most frequent opcodes per family, with all other opcodes grouped into a single “other” category. Thus, our HMMs are all based on $M = 31$ distinct symbols. We use $N = 2$ hidden states in all experiments.

We conduct two sets of experiments based on hidden Markov models (HMM). In both of these approaches, we train models, and we then score samples with the resulting models.

For our first set of experiments, we reserve the data from the first one-month time period to test our models, and hence we do not train a model on this data. For each subsequent one-month time window, we train a model, and then score the samples from the first one-month time period versus each of these models. We refer to this as HMM approach 1.

Consider two distinct one-month time periods, say time period $X$ and $Y$. Suppose that we train an HMM on the data from time period $X$ and another on the data from time period $Y$, which we denote as $\lambda_X$ and $\lambda_Y$, respectively. If the samples from $X$ and $Y$ are similar, then we expect the HMMs $\lambda_X$ and $\lambda_Y$ to be similar, and hence they should produce similar scores on the reserved (first month) data. On the other hand, if the samples from time periods $X$ and $Y$ differ significantly, then we expect the models $\lambda_X$ and $\lambda_Y$ to differ, and hence the scores on the reserved first-month test set should differ significantly. Figure 6 shows results for three families based on this HMM approach 1.

[Figure 6: HMM approach 1 results for three families. Panels: (a) Winwebsec, (b) CeeInject, (c) Zegost.]

In Figure 6, we observe spikes in the graphs at various points in time, with relative stability over extended periods of time. Thus, this approach seems to have the potential to detect malware evolution.

Next, we consider another application of HMMs to our data. In this case, for each one-month time window, we use 75% of the available samples for training and reserve 25% for testing.
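The symbol alphabet described at the start of this section (the most frequent opcodes of a family, plus a single “other” category) can be sketched as follows. This is an illustration under stated assumptions: the function names `build_symbol_map` and `encode` are hypothetical, the opcode lists are toy data, and the demonstration uses `top_k=2` only to keep the output small, whereas `top_k=30` would give the $M = 31$ symbols used here.

```python
from collections import Counter


def build_symbol_map(opcode_sequences, top_k=30):
    """Assign indices 0..top_k-1 to the most frequent opcodes in a family."""
    counts = Counter(op for seq in opcode_sequences for op in seq)
    return {op: idx for idx, (op, _) in enumerate(counts.most_common(top_k))}


def encode(sequence, symbol_map):
    """Convert an opcode sequence to symbol indices; unknown opcodes map to 'other'."""
    other = len(symbol_map)  # index reserved for the "other" category
    return [symbol_map.get(op, other) for op in sequence]


# Toy family of two samples (made-up opcode sequences).
family = [
    ["mov", "push", "mov", "call"],
    ["mov", "push", "ret"],
]
symbol_map = build_symbol_map(family, top_k=2)
print(symbol_map)                                          # {'mov': 0, 'push': 1}
print(encode(["mov", "ret", "push", "xor"], symbol_map))   # [0, 2, 1, 2]
```

Building the map per family, as the text specifies, means the same index can refer to different opcodes in different families, so encoded sequences are only comparable within a family.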
Next, we train an HMM for each month; as above, we use $N = 2$, and we have $M = 31$ in each case. Suppose we have data from consecutive months that we label as $X$ and $Y$. We train a model $\lambda_X$ on the training data from time period $X$ and we train a model $\lambda_Y$ on the training data from time period $Y$. We then score each test sample from $X$ with both $\lambda_X$ and $\lambda_Y$, giving us two score vectors. Since an HMM score depends on the length of the observation sequence, and since the observation sequence lengths vary between malware samples, each score is normalized by dividing by the length of the observation sequence. As a result, each score is in the form of a log likelihood per opcode (LLPO). Note that if we have, say, $m$ test samples in $X$, the score vector obtained from $\lambda_X$ and the score vector obtained from $\lambda_Y$ will both be of length $m$. Once we generate these two vectors, we compute the Euclidean distance between the vectors, which we denote as $d_X$. We repeat this scoring process using the test samples from $Y$ to obtain a distance $d_Y$, and we define the distance between time windows $X$ and $Y$ to be the average, that is,

$$d = \frac{d_X + d_Y}{2}.$$

We plot the graph of these distances. Small changes in the distance from one month to the next suggest minimal change, whereas larger distances indicate potential evolution points. Figure 7 gives results for four malware families using this HMM-based technique, which we refer to as HMM approach 2.

Figure 7: HMM approach 2 results for four families: (a) VbInject, (b) Vobfus, (c) Winwebsec, (d) Zegost

The results in Figure 7 indicate that we see significant evolutionary change points when considering this second HMM technique. Together with the results for HMM approach 1 in Figure 6, these results provide strong evidence that HMM-based techniques are a powerful tool for malware evolution detection.

3.3 HMM2Vec Experiments

In this section we present our experimental results using HMM2Vec, which is discussed in Section 2.4.3.
In these experiments, we select $N = 2$ and we have $M = 31$. Recall that the HMM2Vec embeddings are determined by the columns of the $B$ matrix from our trained HMM, and that each embedding vector is of length $N$.

A technical difficulty arises when considering HMM2Vec embeddings. That is, the order of the hidden states can vary between models. Even when training on the same data, different random initializations can cause the hidden states to differ in the resulting trained models. Since we only consider models with $N = 2$ hidden states, we account for this possibility in our HMM2Vec experiments by computing the distance between $B$ matrices twice, once with the order of the rows flipped in one of the models.

More precisely, suppose that we want to compare the two HMMs $\lambda = (A, B, \pi)$ and $\tilde{\lambda} = (\tilde{A}, \tilde{B}, \tilde{\pi})$, where $N = \tilde{N} = 2$ and $M = \tilde{M}$. We first compute the distance based on the HMM2Vec embeddings determined by the matrices $B$ and $\tilde{B}$ (we ignore $A$ and $\tilde{A}$, as well as $\pi$ and $\tilde{\pi}$). Denote the rows of $B$ as $B_1$ and $B_2$ and, similarly, let $\tilde{B}_1$ and $\tilde{B}_2$ be the rows of $\tilde{B}$. Compute

$$d_1 = d(B_1 \| B_2,\ \tilde{B}_1 \| \tilde{B}_2) \quad \text{and} \quad d_2 = d(B_1 \| B_2,\ \tilde{B}_2 \| \tilde{B}_1)$$

where “$\|$” is the concatenation operator, and $d(x, y)$ is the Euclidean distance between vectors $x$ and $y$. We define the HMM2Vec distance between $\lambda$ and $\tilde{\lambda}$ as

$$d(\lambda, \tilde{\lambda}) = \min\{d_1, d_2\}.$$

We divide the dataset into overlapping windows of one year, with a slide length of one month, and we train an HMM (with $N = 2$ and $M = 31$) on each window. We compute the distance between adjacent windows using the method described in the previous paragraph, and we graph the resulting distances. The graphs obtained for three families are given in Figure 8.

The results in Figure 8 indicate that HMM2Vec is successful in identifying potential evolution in these particular families. We observe significant spikes (i.e., evolutionary points) in most families using this technique.
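The HMM2Vec distance defined above, with its row-flip comparison, can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `hmm2vec_distance` is hypothetical, each $B$ matrix is assumed to be given as two rows of equal length (one row per hidden state), and training the HMMs themselves is outside the scope of the sketch.

```python
import math


def hmm2vec_distance(B, B_tilde):
    """HMM2Vec distance between two 2-state HMMs, given their B matrices.

    Each argument is a sequence of two rows, one per hidden state. The rows are
    concatenated and compared twice, once with the rows of the second matrix
    swapped, and the smaller Euclidean distance is returned.
    """
    if len(B) != 2 or len(B_tilde) != 2:
        raise ValueError("this sketch only handles N = 2 hidden states")
    x = list(B[0]) + list(B[1])
    d1 = math.dist(x, list(B_tilde[0]) + list(B_tilde[1]))  # same state order
    d2 = math.dist(x, list(B_tilde[1]) + list(B_tilde[0]))  # rows flipped
    return min(d1, d2)
```

Taking the minimum makes the distance insensitive to which hidden state happens to be labeled first, so two models that differ only in state order are at distance zero. For more than two hidden states one would have to consider all row permutations, which this sketch does not attempt.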
3.4 Word2Vec Experiments

In this set of experiments, we use Word2Vec to generate vector embeddings of opcodes. We compare the resulting models by concatenating the embedding vectors, and computing the distance between the resulting vectors. As above, we divide the dataset into overlapping time windows of one year, with a slide length of one month. The Word2Vec models are trained as outlined in Section 2.4.2.

When training Word2Vec, the window size $W$ refers to the length of the window used to determine training pairs, while the vector length $V$ is the number of components in each embedding vector. We experimented with different window sizes and found that $W = 5$ works best. We also experimented with different vector sizes; in Figure 9, we give results for the Zbot family for $V = 2$, $V = 3$, and $V = 5$. In general, we do not find any improvement for larger values of $V$, and hence we use $V = 2$ in all of our subsequent Word2Vec experiments.

Results from our Word2Vec experiments for three families are given in Figure 10. These results show potential evolutionary points in almost all the malware families and we conclude that Word2Vec is also a useful technique for detecting potential malware evolution points.

Figure 8: HMM2Vec results for three malware families: (a) Bho, (b) CeeInject, (c) Winwebsec

3.5 Discussion

Here, we first discuss the results given by each technique considered in this section. Then we compare our results to the most closely related previous work.

The two HMM scoring techniques that we first considered provide different models and scores, yet the results are similar. This provides evidence of correctness and consistency, and also some evidence of actual evolution.

Our HMM2Vec and Word2Vec experiments were somewhat different, since they focus on longer time windows of one year, whereas the HMM techniques are both based on one-month time intervals. In any case, both HMM2Vec and Word2Vec performed well and consistently with each other.
Again, this consistency is evidence of correctness of the implementations, and of evolution detection.

Combining either HMM2Vec or Word2Vec with either of the HMM scoring techniques provides a two-step strategy for detecting evolution. That is, we can use a year-based technique (either HMM2Vec or Word2Vec) to see if there is any indication of evolution over such a time window. If so, we can then use one of the HMM scoring techniques to determine where within that one-year window the strongest evolutionary points occur. In this way, we could rapidly filter out time periods that are unlikely to be of interest, and then in the secondary phase, detect precise times at which interesting evolutionary changes have most likely occurred. For example, both Word2Vec and HMM2Vec indicate that evolution in the Winwebsec family took place during the time period November 2010–June 2011. Then experimenting with the first HMM scoring approach on the Winwebsec family during the November 2010–June 2011 time period indicates that the precise point of evolution was June 2011.

Figure 9: Word2Vec with different vector sizes (Zbot family): (a) $V = 2$, (b) $V = 3$, (c) $V = 5$

Next, we compare our work to that in [30], which considered the same malware evolution problem and used the same dataset as in our research. In [30], linear SVM models are trained over one-year time windows, with a slide of one month. The resulting linear SVM model weights are compared using a $\chi^2$ distance computation. Furthermore, Word2Vec feature vectors (derived from opcode sequences) were used as input features to their SVM models. Once a $\chi^2$ similarity graph has been generated, an HMM-based approach is used on either side of a spike to confirm that evolution has occurred.

Comparing our results with those given in [30], our techniques are more efficient, as we omit the SVM training and our work factor is less than their secondary test.
In spite of these simplifications, we find that our detection strategy is at least as sensitive as that in [30]. For example, we see clear spikes in some families (e.g., DelfInject, Dorkbot, and Zbot) for which previous work found, at best, ambiguous results.

Figure 10: Word2Vec results for three malware families: (a) Vobfus, (b) Winwebsec, (c) Zegost

Next we briefly summarize our results per family. We refer to previous work in [30] in some of these cases.

Adload: For this family, neither HMM2Vec nor Word2Vec resulted in any significant spikes in the graphs, and hence we do not see indications of evolutionary change. On the other hand, the results given by our HMM techniques show significant spikes for this family.

Bho: The results generated by Word2Vec for this family indicate that malware evolution occurred during the September 2009–December 2010 timeframe. Using our HMM approach, we are able to see that malware evolution happened during October 2010, which is consistent with the Word2Vec results. This is also consistent with results given in [30], based on SVM analysis.

Bifrose: The results generated by HMM2Vec did not provide any major spikes in the graph, but we can see indications of slower change over time. The graph generated by Word2Vec gives us a better understanding of changes in this malware family, since we could see that significant evolution occurred during the November 2009–March 2011 time period. Again, the results given by the HMM approaches narrow down the evolution point, in this case to March 2011. A similar graph is given in [30], indicating evolution during November 2010–May 2011, which is consistent with our results.

CeeInject: For CeeInject, we obtain clear results from all experiments we performed. The results given by HMM2Vec and Word2Vec show significant evolution during the August 2010–July 2011 time window, and we identify the month of clearest change as November 2010 based on our HMM approaches.
The results for this family given in [30] show similar evolution during September 2010–May 2011.

DelfInject: We obtained significant results for this family using our HMM-based approaches. This family shows evolution occurring during January 2011. In this case, we do not observe significant spikes for Word2Vec or HMM2Vec with their longer time windows. In [30], no evolutionary points are detected for this family.

Dorkbot: Similar to CeeInject, we obtain strong results on Dorkbot from all of our experiments. Specifically, the evidence strongly points to malware evolution during 2011.

Hupigon: The results obtained from the Word2Vec technique show that significant malware evolution in this family happened during the July 2010–April 2011 period. Results from the HMM approaches narrow the time period to February 2011. Results given by the SVM approach in [30] are consistent with these results.

Ircbot: The results generated by Word2Vec indicate that malware evolution occurred in this family slowly throughout 2011. That is, there are no major spikes observed, but the graph shows a slowly changing trend.

Obfuscator: We could not derive significant information from this family. Graphs plotted on this family had many spikes which we could not interpret regarding malware evolution.

Rbot: Graphs generated based on Word2Vec show significant evolution in this malware family. Significant results were not observed for this family in any previous research.

VbInject: We could not observe a significant spike in this malware family in any of our experiments.

Vobfus: The results generated by our experiments show that evolution in this family occurred during the December 2009–January 2011 timeframe. The results given in [30] indicate evolution during November 2010–May 2011.

Winwebsec: We observe evolution in this malware family using Word2Vec, where a spike appears in December 2010–July 2011. The previous research in [30] did not indicate evolution for this family.
Zbot: Experiments conducted on this family indicate significant changes. Specifically, we observe a spike between April 2011 and November 2011.

Zegost: From our Word2Vec experiments, we see significant spikes in the August 2010–September 2011 and July 2010–July 2011 timeframes.

Our experiments indicate significant evolution in almost all the malware families considered. By comparing the results given by our two HMM techniques, HMM2Vec, and Word2Vec, we can see that there are clear similarities in the results for most families. When we observe such similar evolution points across different experiments, it increases our confidence in the results. As further evidence, we found that the evolutionary points generated in previous research in [30] match those in our experiments, and we detect additional points of interest, as compared to previous research, indicating that our techniques may be somewhat more sensitive.

In some cases, we found potential evolutionary points with the HMM techniques, but not with HMM2Vec or Word2Vec. We conjecture that this is a result of the longer time windows (one year) used in the latter two approaches, while the HMM techniques are based on monthly time windows. These longer time windows may not be as sensitive in cases where changes are less pronounced or transient.

4 Conclusion and Future Work

In previous research, first in [36] and subsequently in [30], it has been shown that malware evolution can be detected using machine learning techniques. In this paper, we extend this previous work by exploring additional learning techniques. We find that various HMM-based techniques and Word2Vec provide powerful tools for automatically detecting malware evolution. Here, we conducted all of our experiments based on mnemonic opcodes derived from the malware samples.

For future work, it would be useful to consider experiments with other features extracted from the malware samples.
While mnemonic opcodes perform well, extracting such opcodes is relatively expensive. It is possible that other, less costly features can be used. Also, by considering dynamic features, we might gain more information about evolution within a malware family. Finally, the use of additional neural network and deep learning techniques should be considered. Word2Vec performed well, and it is likely that more sophisticated techniques would result in more discriminative ability, which would enable more fine-grained analysis of evolutionary trends.

References

[1] John Aycock. Computer Viruses and Malware. Springer, 2006.
[2] Marius Barat, Dumitru-Bogdan Prelipcean, and Dragoş Gavriluţ. A study on common malware families evolution in 2012. Journal of Computer Virology and Hacking Techniques, 9(4):171–178, 2013.
[3] Jean-Marie Borello and Ludovic Mé. Code obfuscation techniques for metamorphic viruses. Journal in Computer Virology, 4(3):211–220, 2008.
[4] The Brown corpus of standard American English. http://www.cs.toronto.edu/~gpenn/csc401/a1res.html.
[5] Zhongqiang Chen, Mema Roussopoulos, Zhanyan Liang, Yuan Zhang, Zhongrong Chen, and Alex Delis. Malware characteristics and threats on the Internet ecosystem. The Journal of Systems & Software, 85(7):1650–1672, 2012.
[6] Anusha Damodaran, Fabio Troia, Corrado Visaggio, Thomas Austin, and Mark Stamp. A comparison of static, dynamic, and hybrid analysis for malware detection. Journal of Computer Virology and Hacking Techniques, 13(1):1–12, 2017.
[7] A Gupta, P Kuppili, A Akella, and P Barford. An empirical study of malware evolution. In First International Communication Systems and Networks and Workshops, pages 1–10, 2009.
[8] Archit Gupta, Pavan Kuppili, Aditya Akella, and Paul Barford. An empirical study of malware evolution. In First International Communication Systems and Networks and Workshops, pages 1–10, 2009.
[9] Samuel Kim. PE header analysis for malware detection.
Master’s thesis, San Jose State University, Department of Computer Science, 2018.
[10] Justin Ma, John Dunagan, Helen J Wang, Stefan Savage, and Geoffrey M Voelker. Finding diversity in remote code injection exploits. In Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pages 53–64, 2006.
[11] Robert McMillan. COMPUTERWORLD: Researchers take down Koobface servers: Criminals behind the botnet made more than $2 million in one year, 2010. https://www.computerworld.com/article/2750985/researchers-take-down-koobface-servers.html.
[12] Francesco Mercaldo, Andrea Di Sorbo, Corrado Aaron Visaggio, Aniello Cimitile, and Fabio Martinelli. An exploratory study on the evolution of Android malware quality. Journal of Software: Evolution and Process, 30(11), 2018.
[13] Microsoft. Win32 Rbot detected with Windows Defender antivirus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Win32%2FRbot, 2005.
[14] Microsoft. Adware: Win32 Hotbar detected with Windows Defender antivirus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Adware%3AWin32%2FHotbar, 2006.
[15] Microsoft. Backdoor: Win32 Hupigon detected with Windows Defender antivirus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Backdoor%3AWin32%2FHupigon, 2006.
[16] Microsoft. Virtool: Win32 CeeInject detected with Windows Defender antivirus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=VirTool%3AWin32%2FCeeInject, 2007.
[17] Microsoft. Virtool: Win32 DelfInject detected with Windows Defender antivirus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=VirTool:Win32/DelfInject&ThreatID=-2147369465, 2007.
[18] Microsoft. Virtool: Win32 VBInject detected with Windows Defender antivirus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=VirTool:Win32/VBInject&ThreatID=2147367171, 2010.
[19] Microsoft.
Win32 Vobfus detected with Windows Defender antivirus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?name=win32%2Fvobfus, 2010.
[20] Microsoft. Win32 Winwebsec detected with Windows Defender antivirus. https://www.microsoft.com/security/portal/threat/encyclopedia/entry.aspx?Name=Win32%2fWinwebsec, 2010.
[21] Microsoft. Win32 Obfuscator detected with Windows Defender antivirus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Win32%2FObfuscator, 2011.
[22] Microsoft. Win32 Zbot detected with Windows Defender antivirus. http://www.symantec.com/securityresponse/writeup.jsp?docid=2010-011016-3514-99, 2011.
[23] Microsoft. Win32 Zegost detected with Windows Defender antivirus. https://www.symantec.com/security-center/writeup/2011-060215-2826-99, 2011.
[24] Microsoft. Worm: Win32 Dorkbot detected with Windows Defender antivirus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-description?Name=Worm%3AWin32/Dorkbot, 2011.
[25] Microsoft. Win32 Bifrose detected with Windows Defender antivirus. https://www.trendmicro.com/vinfo/us/threat-encyclopedia/malware/bifrose, 2012.
[26] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. https://arxiv.org/abs/1301.3781, 2013.
[27] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf, 2013.
[28] Antonio Nappa, M Zubair Rafique, and Juan Caballero. The MALICIA dataset: Identification and analysis of drive-by download operations. International Journal of Information Security, 14(1):15–33, 2015.
[29] Jacob Ouellette, Avi Pfeffer, and Arun Lakhotia. Countering malware evolution using cloud-based learning.
In 8th International Conference on Malicious and Unwanted Software, pages 85–94, 2013.
[30] Sunhera Paul and Mark Stamp. Word embedding techniques for malware evolution detection. In Malware Analysis Using Artificial Intelligence and Deep Learning, pages 321–343. Springer, 2021.
[31] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. https://www.cs.sjsu.edu/~stamp/RUA/Rabiner.pdf.
[32] Saeid Rezaei, Ali Afraz, Fereidoon Rezaei, and Mohammad Reza Shamani. Malware detection using opcodes statistical features. In 8th International Symposium on Telecommunications, IST, pages 151–155, 2016.
[33] Mark Stamp. Information Security: Principles and Practice. Wiley, 2011.
[34] Mark Stamp. Introduction to Machine Learning with Applications in Information Security. CRC Press, Boca Raton, 2018.
[35] Mark Stamp. A revealing introduction to hidden Markov models. https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf, 2018.
[36] Mayuri Wadkar, Fabio Di Troia, and Mark Stamp. Detecting malware evolution using support vector machines. Expert Systems with Applications, 143, 2020.