Analyzing Evolving Trends of Vulnerabilities in National Vulnerability Database

Conference Paper · December 2018


DOI: 10.1109/BigData.2018.8622299


All content following this page was uploaded by Mark Williams on 14 April 2020.



IEEE Big Data 2018, International Workshop on Big Data Analytics for Cyber Intelligence and Defense.
December 10-13, 2018, Seattle, WA, USA

Analyzing Evolving Trends of Vulnerabilities in National Vulnerability Database

Mark A. Williams¹, Sumi Dey², Roberto Camacho Barranco¹, Sheikh Motahar Naim¹, M. Shahriar Hossain¹, and Monika Akbar¹
¹ Department of Computer Science, ² Department of Computational Science
University of Texas at El Paso
El Paso, Texas, USA
Emails: {mawilliams7, sdey2}@miners.utep.edu, [email protected],
[email protected], [email protected], [email protected]

Abstract—As the world approaches a state of greater dependence on technology, many products face increasing threats from malicious attackers who are attempting to take advantage of vulnerabilities in software design. Most of the known vulnerabilities are already aggregated, stored in text format, and readily accessible to the public, making such an aggregated database a prime corpus for analysis using data mining methods. A multitude of research efforts have been deployed to analyze individual aspects of such cyber-security corpora to create taxonomies, assess vulnerability impact, and even predict future vulnerabilities. However, minimal effort has been committed to analyzing cyber-security corpora to explore correlations between vulnerabilities and study the evolution of a vulnerability from its genesis. In this paper, we propose an integrated data mining framework to automatically lay out how vulnerabilities develop over time and detect the evolution of a specific cyber-security threat. We use (1) a Supervised Topical Evolution Model (STEM), which discovers temporal themes from a text corpus, and (2) a diffusion-based storytelling technique that sifts through past vulnerability reports to describe how a current threat evolved. STEM gives a holistic evolution structure of the vulnerabilities, while diffusion-based storytelling provides the precise genealogy of a specific threat. A considerable series of experiments demonstrates that the proposed framework can discover evolutionary patterns in today's most pressing vulnerabilities with a high degree of precision. As case studies, we explore the development of vulnerabilities in certain products, providing a unique insight into the correspondence between seemingly unrelated vulnerabilities and the impact of that correspondence on overall software security.

Index Terms—cyber-security, vulnerabilities, temporal topic modeling, storytelling

I. INTRODUCTION

A high degree of dependence on software products in all sectors (government, private, and academia) necessitates acquiring a great deal of knowledge regarding software vulnerabilities. Fortunately, the growing volume of vulnerabilities is meticulously recorded by many software development companies, including operating system providers like Microsoft and Apple. The U.S. Government has an ongoing effort to consolidate all the reported software vulnerabilities in a standard database called the National Vulnerability Database (NVD)¹.

¹ National Vulnerability Database (NVD), https://nvd.nist.gov

Despite containing years' worth of indispensable vulnerability information on every major software product, such a large corpus of structured data remains largely neglected due to a lack of tools and algorithms to support extensive analysis of vulnerabilities. Consequently, there is a considerable gap in knowledge of vulnerability interactions, trends, and evolution, as well as overall product susceptibility to vulnerabilities. Understanding the vulnerability trends would pave the way for researchers and industry experts to develop more secure systems, mitigate the impact of existing vulnerabilities, and guide emerging research in the field of cyber-security.

Consider a team of industry experts trying to determine the proportion of company resources that should be allocated to cyber-defense by analyzing how susceptible their products are to vulnerabilities. In this case, a concise theme-based model built on an aggregate of vulnerability data would provide the experts with ample information to aid in making a decision. Figure 1 describes the product susceptibility of ten major software products based on the probability of topics over time in the National Vulnerability Database. A brief analysis of this figure reveals, among other trends, the drastic decline of the susceptibility of Internet Explorer in recent years, but a sudden rise in susceptibility for Windows 10 and Edge about three years ago. The team of experts will be able to recommend an area of focus based on a figure similar to Figure 1.

[Figure 1: The product susceptibility of ten major software products since 2000 based on the STEM model. Labeled products include Mac OS, JDK, Acrobat Reader, iPhone OS, Internet Explorer, Edge, Chrome, and Microsoft Office.]

Now, consider the scenario where the team is performing a detailed analysis of a Windows 10 vulnerability, e.g., the operating system mishandles library loading, allowing users to gain privileges to execute arbitrary code. Understanding how such a vulnerability evolved despite monitoring of multiple earlier versions of the product would allow the team to resolve the issue in the current version as well as in any other related versions or related products. Figure 2 shows how the vulnerability described evolved over time from another issue involving kernel-mode drivers that affected an earlier version of Windows. We explain Figure 2 later as a case study in the Experimental Results (Section V-B).

CVE-2013-1300 | Published: 2013-02-13 | Impact Score: 10
    Race condition in win32k.sys in the kernel-mode drivers in Microsoft Windows and Windows RT allows local users to gain privileges via a crafted application.
CVE-2014-0318 | Published: 2014-08-12 | Impact Score: 10
    win32k.sys in the kernel-mode drivers of Microsoft Windows does not properly control access to objects, which allows local users to gain privileges via a crafted application.
CVE-2015-1724 | Published: 2015-06-10 | Impact Score: 10
    Use-after-free vulnerability in the kernel-mode drivers in Microsoft Windows allows local users to gain privileges via a crafted application.
CVE-2015-6100 | Published: 2015-11-11 | Impact Score: 10
    The kernel in Microsoft Windows allows local users to gain privileges and execute arbitrary code via a crafted application.
CVE-2015-6128 | Published: 2015-12-09 | Impact Score: 10
    Microsoft Windows mishandles library loading, which allows local users to gain privileges and execute arbitrary code via a crafted application.

Figure 2: How a security threat in Windows, related to library-mishandling giving all users the capability of executing arbitrary code, evolved over time. The document ID in NVD is CVE-2015-6128.

In this paper, we use two analytical approaches to aid decision making using vulnerability corpora. The first approach utilizes a temporal topical model that builds on top of our previous effort called the Supervised Topical Evolution Model (STEM) [10]. Outcomes such as Figure 1 can be generated using STEM. STEM provides a holistic idea about the vulnerability topics in the dataset. The second approach leverages another of our efforts, called diffusion-based storytelling [1], which provides the ability to describe how a specific threat evolved historically. Overall, STEM is the first stage of the analytic process and helps to better understand the trends in vulnerabilities. Diffusion-based storytelling is the second stage, which provides finer details of the evolution of a specific threat.

In summary, the contributions of this paper are:
1) We provide a high-level holistic view of cyber-security vulnerabilities using a probabilistic graphical model, STEM, that integrates timestamps, annotations, latent themes, and textual data.
2) Through a variety of experiments on a cyber-security corpus, we demonstrate that diffusion-based storytelling can establish evolutionary narratives describing the propagation of vulnerabilities over time from one product to another and between different versions of the same product.
3) We determine the overall susceptibility of major software products to vulnerabilities within a specified time frame, and explore the distribution of susceptibility within distinct product domains.

II. RELATED WORK

There are existing efforts to apply broader data mining algorithms to cyber-security corpora. In fact, data mining algorithms have been increasingly utilized to facilitate knowledge gathering in the cyber-security domain in recent years. Vulnerability prediction has been addressed in many efforts. For instance, Scandariato et al. [15] apply text mining techniques to predict software component vulnerabilities. Han et al. [5] develop a method to predict the severity of a vulnerability by using a one-layer convolutional neural network. Zhang et al. [18] use data mining techniques to try to predict zero-day software vulnerabilities using the NVD.

Researchers have also targeted analysis of vulnerabilities through detection of relationships between vulnerabilities [14], [20]. Lin et al. [9] present an association rule discovery algorithm to find correlations between certain aspects of a cyber-security corpus, the Common Weakness Enumeration database². Shin et al. [16] discover correlations between vulnerabilities by analyzing code complexity, code churn, and developer activity metrics. Such correlations are extended beyond code content to a vulnerability life cycle in [8]. Topic models are used extensively in [11] to discover frequent vulnerability types and new patterns in cyber-security corpora.

² Common Weakness Enumeration (CWE), https://cwe.mitre.org

One significant limitation of these works is that they do not take into account the temporal aspect of vulnerabilities. This makes it difficult to determine if the relationship discovered by the work is relevant in the present. Our proposed framework is distinct in this manner because we discover trends in vulnerabilities based on the latent structure of the corpus and by incorporating the temporal feature of vulnerability reports. We then go one step further and study the evolution
of a vulnerability as a series of connected vulnerability reports.

III. PROBLEM AND SOLUTION STATEMENT

We have a set of vulnerability reports $D = \{d_1, d_2, \ldots, d_{|D|}\}$, which contain entities or words from $E = \{e_1, e_2, \ldots, e_E\}$. Each document $d$ has a publication date $t_d$ and a binary distribution of vulnerability labels specific to the document $d$ from $\Lambda_d = [l_1, l_2, \cdots, l_K]$, where $K$ is the number of vulnerability labels. Additionally, each document $d$ has a list of affected products distinct to the document $d$ that is a subset of $\Lambda_d = [a_1, a_2, \cdots, a_P]$, where $P$ is the number of products referenced in the NVD dataset.

The proposed framework consists of three stages:
1) Given the vulnerability reports $D$, compute the appearance probability (i.e., topic probability) of each of the $K$ vulnerability labels in each timestamp $t$.
2) Given the vulnerability reports $D$, compute the product recurrence (i.e., product susceptibility) of each of the $P$ product labels in each timestamp $t$.
3) Given a specific vulnerability report $d$, create a chain of relevant vulnerability reports from the past such that the chain reflects the evolution of the vulnerability reported in $d$.

IV. METHODOLOGY

The National Vulnerability Database (NVD) that we extensively use in this paper contains 111,060 unique reports. The reports describe vulnerabilities and exposures in software products from as early as 1996. The database contains links to the report documents published by software development companies such as Apple and Microsoft.

Each document in the NVD consists of a Common Vulnerabilities and Exposures³ (CVE) entry, which is a high-level vulnerability description. Each CVE entry is associated with products that are affected by the vulnerability described in the entry. CVE entries in the NVD vary in length but they are usually short paragraphs. We leverage the links provided with each entry to enrich the data by augmenting NVD reports with descriptions from the original source.

³ Common Vulnerabilities and Exposures (CVE), https://cve.mitre.org

A. Dataset Preparation

To prepare the dataset for our experiments, we used distinct attributes present in each NVD document (publication date, CVE entry description, etc.) and labeled documents as a collection of these attributes. We then extracted entities from each document description in the dataset by using standard entity detection approaches [1], [7]. We ignored documents with fewer than four entities and documents that were published before 2000 because they do not contain enough information to be discriminative.

Each NVD document is also associated with a label that serves as a classifier for the vulnerability described in the document and a set of products that are affected by that vulnerability. We have selected the 50 most frequent vulnerability labels and the 50 most affected products in our experiments with STEM. All vulnerability labels and affected products are utilized in the diffusion-based storytelling framework.

B. STEM

Supervised Topical Evolution Model (STEM) is a probabilistic graphical approach for modeling dynamic document collections with explicit labels. Similar to traditional topic models like LDA [2], STEM views each document as a mixture of underlying topics, and each topic as a distribution over the words. Unlike traditional topic models, STEM learns the topics and their evolution over time by guiding its inference procedure through the incorporation of timestamps and label information of the documents. The generative process for the model is as follows.

1) For $k = 1$ to $K$:
   a) $\phi_k \sim \mathrm{Dirichlet}(\beta)$
2) For each document $d \in D$:
   a) For $k = 1$ to $K$:
      i) $\Lambda_{dk} \sim \mathrm{Bernoulli}(\gamma_k)$
   b) Compute $L_d$ from $\Lambda_d$
   c) $\alpha_d = L_d \times \alpha$
   d) $\theta_d \sim \mathrm{Dirichlet}(\alpha_d)$
   e) For each entity $e_{di} \in d$:
      i) $z_{di} \sim \mathrm{Multinomial}(\theta_d)$
      ii) $e_{di} \sim \mathrm{Multinomial}(\phi_{z_{di}})$
      iii) $t_{di} \sim \mathrm{Beta}(\psi_{z_{di}})$

A list of all the symbols used in the generative process is provided in Table I.

Table I: List of symbols

Symbol              Description
$D$                 Set of all the documents
$V$                 Number of unique entities
$K$                 Number of unique labels
$N_d$               Number of entities in document $d$
$\alpha$, $\beta$   Dirichlet priors for the document-topic and topic-word distributions, respectively
$\gamma$            Prior probabilities for the Bernoulli distribution of labels
$\theta_d$          Multinomial distribution of topics specific to the document $d$
$\Lambda_d$         Binary distribution of labels specific to the document $d$
$L_d$               Projection matrix for labels in document $d$
$\phi_z$            Multinomial distribution of words specific to topic $z$
$\psi_z$            Beta distribution of time specific to topic $z$
$z_{di}$            The topic associated with the $i$-th token in the document $d$
$e_{di}$            The $i$-th entity in the document $d$
$t_{di}$            The timestamp associated with the $i$-th token in the document $d$

STEM approximates the posterior probabilities of the hidden variables using the Gibbs sampling method [4] as follows.

$$p(z_{di} \mid z_{-di}, e, t, \alpha, \beta, \psi) \propto (n_{d z_{di}} + \alpha - 1) \, \frac{n_{z_{di} e_{di}} + \beta - 1}{\sum_{v=1}^{V} (n_{z_{di} v} + \beta) - 1} \times \frac{t_{di}^{\psi_{z_{di} 1} - 1} \, (1 - t_{di})^{\psi_{z_{di} 2} - 1}}{B(\psi_{z_{di} 1}, \psi_{z_{di} 2})} \tag{1}$$

For the sake of speed and simplicity, $\psi$ is updated after each Gibbs sample using the method of moments as follows:

$$\hat{\psi}_{z1} = \bar{t}_z \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right), \qquad \hat{\psi}_{z2} = (1 - \bar{t}_z) \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right) \tag{2}$$

where $\bar{t}_z$ and $s_z$ indicate the mean and standard deviation of all the timestamps belonging to topic $z$, respectively. Refer to [10] for a more detailed derivation.

C. Diffusion-based story generation

In [1], we presented a diffusion-based framework that mines a large corpus of documents to generate smooth evolving stories given a seed document published recently. The framework runs several preprocessing tasks, including candidate generation, which reduces the search space to a few hundred documents. Then, via the optimization of an objective function, a storyline is generated that consists of several story segments of coherent documents delimited by "turning points". The main objective is to help the user understand the succession of events or articles that lead to the seed document.

We assume that we have a set of candidate documents $D = \{d_1, d_2, \ldots, d_{|D|}\}$, which contain entities from $E = \{e_1, e_2, \ldots, e_E\}$ and a publication date $t_d$. A turning point in the story $\tau \in T$ delimits a set of segments $S$ of coherent documents such that $S = [(\tau_1, \tau_2), (\tau_2, \tau_3), \cdots, (\tau_{|T|-1}, \tau_{|T|})]$. An optimization routine finds the values for every $\tau$ as well as the probability that a candidate document is relevant to the segment that includes it.

We formulate the objective function for evolution using four terms: incoherence, similarity, overlap, and uniformity. First, we have two penalties that aim to minimize the incoherence of documents within a segment, while also minimizing the similarity across segments. The expectation is that these terms will promote having highly coherent segments that are independent of each other.

$$\mathrm{incoherence}(s) = \frac{\sum_{i,j}^{|D| \times |D|} w_i * w_j * \rho * \mathrm{soergel}(d_i, d_j) * |t_i - t_j|}{\sum_{i,j}^{|D| \times |D|} w_i * w_j * \rho} \tag{3}$$

$$\mathrm{similarity}(s) = \frac{\sum_{i,j}^{|D| \times |D|} w_i * w_j * \nu * e^{-\mathrm{soergel}(d_i, d_j)}}{\sum_{i,j}^{|D| \times |D|} w_i * w_j * \nu} \tag{4}$$

where $w_i \in W$ is the weight for document $d_i$ and

$$\rho = \mu(t_i, ts_L, ts_H) * \mu(t_j, ts_L, ts_H) \tag{5}$$

$$\nu = \mu(t_i, ts_L, ts_H) * (1 - \mu(t_j, ts_L, ts_H)) \tag{6}$$

where $ts_{L,H}$ are the lower/upper turning points for a particular segment and $\mu$ is defined as:

$$\mu(t, ts_L, ts_H) = \begin{cases} \dfrac{1}{\sqrt{2\pi\hat{\sigma}^2}} \, e^{-\frac{(t - ts_L - \hat{\sigma}^2)^2}{2\hat{\sigma}^2}} & \text{if } t \le ts_L \\[2ex] \dfrac{1}{\sqrt{2\pi\hat{\sigma}^2}} & \text{if } ts_L < t < ts_H \\[2ex] \dfrac{1}{\sqrt{2\pi\hat{\sigma}^2}} \, e^{-\frac{(t - ts_H + \hat{\sigma}^2)^2}{2\hat{\sigma}^2}} & \text{if } ts_H \le t \end{cases} \tag{7}$$

where $\hat{\sigma}$ is the standard deviation of a Gaussian distribution that indicates the degree of membership of a timestamp in a segment.

The third term in the objective function is the overlap penalty, which prevents having two or more turning points very close to each other.

$$\mathrm{overlap} = \sum_{i,j,\, i<j}^{|S|-1 \times |S|-1} \left( 1 + e^{-\frac{(t_i - t_j)^2}{2\sigma^2}} \right) \tag{8}$$

where $\sigma$ is the standard deviation of a Gaussian distribution that defines the sensitivity of the overlap penalty.

A final term is added to avoid having high uniformity in the distribution of the weight probabilities that indicate if the documents are relevant or not. This penalty will make sure that not all of the weights are set to 0 or 1.

$$\mathrm{uniformity} = \sum_{s=1}^{|S|} \left( 1 + \left( 1 - \frac{\sqrt{|W_s|} - \dfrac{\sum W_s * \mu_s^{\top}}{\sqrt{\sum (W_s * \mu_s^{\top})^2}}}{\sqrt{|W_s|} - 1} \right) \right) \tag{9}$$

where $\mu_s$ is a vector of values returned by the membership function (Eq. 7) for the documents that fall in segment $s$.

The final objective function (Eq. 10) is minimized for two vectors: $S$ and $W$. The elements $s \in S$ are bounded by the range of the turning points and the elements $w \in W$ are bounded between $[0, 1]$. We use the limited-memory quasi-Newton algorithm for bound-constrained optimization (L-BFGS-B) [19].

$$F(T, W) = \sum_{s=1}^{|S|} \big( \mathrm{incoherence}(s) * \mathrm{similarity}(s) \big) * \mathrm{overlap} * \mathrm{uniformity} \tag{10}$$

Given a seed document, Eq. 10 results in a set of documents with a vector of importance values $W$, which helps in identifying documents from the past that are highly relevant to the seed document and at the same time represent the evolution of the content of the seed document.
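To make the incoherence penalty (Eq. 3) more concrete, the following Python sketch computes it for a single segment. It is an illustration under stated assumptions, not the implementation used in the paper: the function names (`soergel`, `membership`, `incoherence`) and the toy entity vectors are hypothetical, the Soergel distance follows its standard definition (sum of absolute differences divided by the sum of element-wise maxima), and the membership function is simplified to a flat top with unnormalized Gaussian tails instead of the exact form of Eq. 7.

```python
import math


def soergel(a, b):
    """Soergel distance between two non-negative entity-weight vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den > 0 else 0.0


def membership(t, ts_low, ts_high, sigma):
    """Assumed simplification of Eq. 7: 1 inside the segment, Gaussian decay outside."""
    if t < ts_low:
        return math.exp(-((t - ts_low) ** 2) / (2 * sigma ** 2))
    if t > ts_high:
        return math.exp(-((t - ts_high) ** 2) / (2 * sigma ** 2))
    return 1.0


def incoherence(vectors, times, weights, ts_low, ts_high, sigma):
    """Weighted average of soergel(d_i, d_j) * |t_i - t_j| over all document pairs."""
    num = 0.0
    den = 0.0
    for i in range(len(vectors)):
        for j in range(len(vectors)):
            # Pair weight: both document weights times both segment memberships.
            pair = (weights[i] * weights[j]
                    * membership(times[i], ts_low, ts_high, sigma)
                    * membership(times[j], ts_low, ts_high, sigma))
            num += pair * soergel(vectors[i], vectors[j]) * abs(times[i] - times[j])
            den += pair
    return num / den if den > 0 else 0.0


# Toy example (hypothetical data): two reports with disjoint entities, two time units apart.
vectors = [[1.0, 0.0], [0.0, 1.0]]
times = [1.0, 3.0]
weights = [1.0, 1.0]
score = incoherence(vectors, times, weights, ts_low=0.0, ts_high=10.0, sigma=1.0)
```

Because each document weight multiplies both the numerator and the denominator, driving a weight toward 0 removes that document's pairs from the average, which is how an optimizer over the weights can discard candidates that do not fit a segment.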
V. EXPERIMENTAL RESULTS

In this section we seek to answer the following questions:
1) How does the Supervised Topical Evolution Model (STEM) perform compared to other topic modeling approaches in explaining trends in vulnerabilities? (Section V-A)
2) How can an expert analyze the evolution of a specific vulnerability of interest? (Section V-B)
3) How can shifts in product susceptibility to vulnerabilities be detected and explained? (Section V-C)
4) How well does STEM perform on the NVD dataset in terms of quality and coherence of topics? (Section V-D)
5) How well does STEM scale in its parallel execution? (Section V-E)

We implemented two versions of STEM: a sequential version and a parallel one. The parallel version of the STEM model uses 12 processors and is denoted as STEM-P for the rest of this section. The L-BFGS-B optimization algorithm [19] minimizes a GPU-oriented parallel version of the objective function for the diffusion-based storytelling framework.

We conducted a series of quantitative and qualitative experiments using these algorithms to answer the preceding questions. To determine the accuracy of our methods, we compared our model with two contemporary graphical models: Topics Over Time (TOT) [17], which models topical evolution over continuous time, and Labeled LDA (LLDA) [13], which uses labels to guide the model inference process. We utilized a real-world dataset, a cyber-security corpus known as the National Vulnerability Database (NVD), for our experiments.

It is important to note that in our experiments with STEM using products as topics, we refer to topic probability as product susceptibility. Topic probability and product susceptibility are interchangeable in the context of this paper because they reflect the same concept. For example, if the topic Microsoft Edge had high topic probability in 2011, then the topic was highly susceptible in that year because it appeared frequently in the data at that time.

A. Trends in Vulnerabilities

One of the goals of this paper is to study how we can discover connections and trends in vulnerabilities that will help create more effective mitigation strategies and cyber-defense implementations. Usually, this discovery is achieved by analyzing individual aspects of a vulnerability with limited consideration of temporal factors. In the following experiments, we utilize STEM to discover novel correlations in vulnerabilities, with a particular emphasis on both temporal factors and the interplay between vulnerability dependencies.

We also compare STEM results with results obtained by LLDA and TOT. Capability-wise, STEM covers both aspects of LLDA and TOT. LLDA is capable of handling labels in topic modeling but does not include time. TOT models topics over time but does not take labels into consideration. STEM combines the advantages of both algorithms: labeled topics over time for a massive text corpus.

In Figure 3, we display the ten most probable terms discovered by STEM, LLDA, and TOT for vulnerability labels in the NVD dataset. Although we performed this experiment on the ten most prominent vulnerability labels in the dataset, we display only three in the figure due to space considerations. The complete results for this experiment and the source code for this paper can be accessed on a website⁴ we created. The three listed vulnerability labels are Permissions and Privileges, Path Traversal, and Resources Errors. Overall, the terms inferred by STEM and LLDA are moderately similar, reflecting the fact that STEM subsumes the capability of LLDA.

⁴ https://analyzing-evolving-trends-of-vulnerabilities-in-nvd-database.weebly.com/

However, the terms discovered by TOT differ significantly from the other models. The reason for this difference is that TOT does not consider vulnerability labels as its topics while the other models do. The three topics listed in Figure 3 (right), Topic 0, Topic 14, and Topic 13, are based on mere overlaps of top terms with the vulnerability labels Permissions and Privileges, Path Traversal, and Resources Errors as detected by STEM in Figure 3 (left). Overall, 50 distinct topics were found by TOT. Topics 0, 14, and 13 are chosen for TOT because they have the most resemblance with the three columns of STEM in Figure 3.

STEM
    Permissions and Privileges: users, access, privileges, windows, android, allows, application, server, vectors, file
    Path Traversal: xml, entity, xxe, external, files, attacker, service, information, denial, allows
    Resources Error: service, denial, allows, consumption, memory, crash, code, function, vectors, corruption
LLDA
    Permissions and Privileges: users, access, windows, privileges, android, application, allows, server, vectors, file
    Path Traversal: traversal, files, parameter, xxe, allows, directory, file, users, path, external
    Resources Error: service, denial, allows, consumption, code, memory, crash, function, vectors, cisco
TOT
    Topic 0: oracle, attacker, access, allows, server, data, user, parameter, windows, users
    Topic 14: service, code, denial, windows, allows, vectors, server, application, file, parameter
    Topic 13: service, denial, allows, cisco, users, bug, code, software, application, attacker

Figure 3: Terms discovered by STEM, LLDA, and TOT for three vulnerability labels in NVD.

Figure 3 offers considerable insight into the impact of vulnerabilities as well as how they can be exploited in the wild. For example, the Permissions and Privileges vulnerability appears to be common for Android products, since android is related to Permissions and Privileges according to the STEM and LLDA models. Another trend can be found in the Path Traversal vulnerability, which is related to the terms xml and xxe, implying that the validation issues associated with this vulnerability are usually the result of XML factors. The final trend we discover is seen in the Resources Errors vulnerability, which is related to the terms denial, service, and crash. The relations given by the models suggest that a Resources Errors vulnerability causes a denial of service, and potentially a crash, when it is exploited.

To discover trends in affected products, we performed a similar analysis with STEM and the other models using affected products as topics. Figure 4 shows the top ten terms discovered by STEM, LLDA, and TOT for three susceptible software products in the NVD dataset: Linux Kernel, iPhone OS, and Android. This experiment was originally performed on the ten most affected software products of the dataset, but only three are displayed due to space limitations. The terms inferred by STEM and LLDA are very similar, with an overlap of 94 percent between the two models. The terms discovered by TOT are once again slightly different compared to the terms discovered by the other models. Topics 13, 9, and 10 are chosen for TOT because they have the most resemblance with the three columns of STEM in Figure 4.

STEM
    Linux Kernel: denial, service, memory, crash, kernel, function, system, allows, users, linux
    iPhone OS: component, corruption, denial, service, code, apple, issue, application, applets, vectors
    Android: kernel, bug, android, application, id, product, qualcomm, driver, linux, versions
LLDA
    Linux Kernel: kernel, linux, users, service, denial, function, system, allows, crash, memory
    iPhone OS: apple, ios, issue, service, denial, code, corruption, component, application, products
    Android: android, bug, application, kernel, id, product, versions, qualcomm, driver, code
TOT
    Topic 13: service, denial, users, linux, kernel, function, allows, crash, code, memory
    Topic 9: os, mac, users, allows, x, service, denial, code, linux, apple
    Topic 10: android, versions, code, issue, kernel, id, application, product, data, access

Figure 4: Terms discovered by STEM, LLDA, and TOT for three products in NVD.

We discover equally interesting trends when we use affected products as topic labels. For instance, in looking at the terms generated by the models for the topic Linux Kernel, we discover that memory and crash are related and appear high in the list, indicating that vulnerabilities that affect Linux Kernel cause memory issues. Another unique relationship can be seen for Android, whose terms include kernel, qualcomm, and driver for both STEM and LLDA. Unlike for Linux Kernel, these terms describe the individual software components that are affected most by the vulnerabilities. Such information is invaluable for software developers and researchers in narrowing down the scope for system vulnerability analysis. The final relationship we discuss pertaining to Figure 4 relates to the topic iPhone OS, which is related to the terms component and corruption, indicating that the result of an exploit in the product iPhone OS is component corruption.

[Figure 5: The probability of ten vulnerability labels since 2010 based on the STEM model. Plotted labels: Untrusted Search Path, Session Fixation, Cryptographic Issues, Code, Configuration, Incorrect Conversion of Cast, OS Command Injections, Resource Errors, Race Conditions, Link Following.]

Vulnerabilities are not conventional topics. They evolve and adapt over time, so we further utilized our STEM framework to determine the distribution of vulnerabilities with respect to time. Figure 5 shows the topic evolution of ten vulnerability labels starting in 2010. Note that a plot like Figure 5 cannot
be generated using LLDA and TOT because, as stated earlier, LLDA does not support time and TOT does not support labels.

An analysis of Figure 5 reveals some interesting characteristics of vulnerabilities in this decade. First, the vulnerabilities appear to have fewer sudden fluctuations in topic probability compared to affected products as seen in Figure 1 of Section I. However, there are exceptions, such as the vulnerability Race Conditions, which has an abrupt growth in probability followed by stagnation. Figure 5 also reveals that most vulnerabilities had a relatively equal topic probability at the beginning of the decade, but in recent years OS Command Injections, Code, and Session Fixation have become the most prominent vulnerabilities under this topic selection. Meanwhile, Cryptographic Issues, Configuration, and Incorrect Conversion and Cast have had dramatic reductions in probability, implying that software products are becoming less susceptible to these issues.

B. Vulnerability Evolution

In this experiment, we utilize our diffusion-based storytelling framework on individual documents of the NVD dataset. We use the knowledge gained by using STEM on the NVD dataset to choose seed documents for a case study on vulnerability evolution. In Section I, we introduce Figure 1, which chronicles the susceptibility of software products in the 21st century. The figure shows, among other correlations, that there has been a substantial increase in susceptibility for [...] analyze three vulnerabilities using diffusion-based storytelling to establish an evolutionary narrative of vulnerabilities that impact Windows 10. These narratives allow us to determine the factors that contribute to the susceptibility of this product. Similar studies could be implemented for all affected products in the NVD dataset, but for simplicity we chose to only run this experiment on one product. For readability, the document descriptions provided in each figure have been reduced to one sentence.

We select documents over a five-year time segment preceding the publication of the seed document, and evaluate the quality of a chain using the concept of dispersion coefficient introduced by Hossain et al. in [6]. The dispersion coefficient of a chain of documents containing $n$ reports is described as:

$$\vartheta = 1 - \frac{1}{n - 2} \sum_{i=0}^{n-3} \sum_{j=i+2}^{n-1} \mathrm{disp}(d_i, d_j) \tag{11}$$

where

$$\mathrm{disp}(d_i, d_j) = \begin{cases} \dfrac{1}{n + i - j}, & \text{if } \mathrm{soergel}(d_i, d_j) < \theta \\[1.5ex] 0, & \text{otherwise} \end{cases} \tag{12}$$

The dispersion coefficient of a chain of documents is highest only if consecutive pairs meet a distance threshold, $\theta$. For our
although CVE-2016-7216 fixed the mishandling of permissions in the kernel, that the kernel did
Windows 10 over the past few years. In Figures 2, 6,and 7, we chains, we only choose documents with the highest possible
not enforce these new permissions leading to a denial of service.

CVE-2015-0060 CVE-2015-0075 CVE-2015-1721 CVE-2016-7216 CVE-2017-0050


Published: 2015-02-11 Published: 2015-03-11 Published: 2015-06-10 Published: 2016-11-10 Published: 2017-03-17
disclosure. It is important to note that all the CVE’s in this chain describe vulnerabilities caused
by inspecially crafted
The font mapper applications.
The kernel in Microsoft Additionally, the products
The kernel-mode drivers affected byAPIthese
The kernel in vulnerabilities arein Microsoft
The kernel
win32k.sys inall
the the same andWindows
kernel- they aredoes Microsoft
not Windows Server
in Microsoft Windows 2003 SP2 Microsoft
and R2Windows
SP2, Windows Vista Windows SP2,
does not
mode drivers in Microsoft properly constrain allow local users to gain mishandles permissions, properly enforce
Windows Server
Windows does not
2008 SP2 and R2 SP1,privileges
impersonation levels,
Windows 7 SP1, Windows
or cause a
8, Windows 8.1, and
which allows local users permissions, which
properly scaleWindows
fonts, Server 2012
which allowsGold. For this chain,
local users although
denial of the correlation
service (system between
to gain privileges via aaffected products
allows users is
to cause a
clear and present,
which allows users to
it is hard to find a solid
to gain privileges via a
correlation
crash) via a crafted between the vulnerabilities
crafted application. themselves.
denial of For
service via a
cause a denial of service. crafted application. application. crafted application.
example, some are security feature bypass, or buffer overflow, while the seed document is
permissionsBelowenforcement
are the vulnerability. However, allwhich the vulnerabilities have the potential to a
Figure 6: Evolutionary narrative of results from CVE-2016-3346,
CVE-2017-0050, which is related has a vulnerability
to a kernel score
permission enforcement of 7.8 on
vulnerability in many
cause information disclosure or remote code execution.
scale of 1-10, making it a moderately critical security vulnerability. CVE-2016-3346 describes
versions of Microsoft Windows. The first four vulnerability reports from left in the chain represent how the last vulnerability
report in theanchain,
errorCVE-2017-0050,
in the loading ofevolved
a DLL over
(dynamic
time. link library), which is a library that contains code and
data that can be used by multiple programs at the same time. This error leads to elevation of
privilege, which in turn can lead to information disclosure or remote code execution. The first
CVE-2015-1679 CVE-2015-1727 CVE-2015-6098 CVE-2016-3225 CVE-2016-3346
matched bulletin
Published: 2015-05-13 in this2015-06-10
Published: chain describes Published:
a security feature bypassPublished:
2015-11-11 which 2016-06-16
leads to memoryPublished: 2016-09-14

The kernel-mode drivers Buffer overflow in the Buffer overflow in the The SMB server Microsoft Windows does
in Microsoft Windows kernel-mode drivers in Network Driver Interface component in Microsoft not properly enforce
allow local users to Microsoft Windows Standard (NDIS) Windows allows local permissions, which
bypass the ASLR allows local users to gain implementation in users to gain privileges allows local users to
protection mechanism privileges via a crafted Microsoft Windows via a crafted application obtain administrator
via a crafted function call, application. allows users to gain that forwards a request access via a crafted DLL
causing memory privileges via a crafted to an unintended service. (dynamic link library).
disclosure. application.

Figure 7: Evolutionary narrative of CVE-2016-3346, which describes a Windows Permissions Enforcement Elevation of
Below are the results from CVE-2012-0175, which has a vulnerability score of 9.3 on a
Privilege Vulnerability.
scale of 1-10, making it one of the most critical security vulnerabilities. CVE-2012-0175
describes a remote code execution vulnerability (“Command Injection Vulnerability”) caused in
the shell of Microsoft Windows when a user opens a specially crafted file or directory. The first
CVE in this chain describes how an integer overflow vulnerability caused by a crafted AVI
(Audio Video Interleaved) file leads to arbitrary code execution. The subsequent CVE bulletins
dispersion coefficient of 1.0 to ensure that there is a smooth transition of topics.

In Figure 6, we display the evolutionary narrative of CVE-2017-0050, a document in the NVD dataset that describes an elevation of privilege vulnerability caused by faulty input validation in Microsoft Windows. The first document in the narrative describes the likely origin of this vulnerability as being a scaling error in Windows kernel-mode drivers that occurred in 2015. The following documents in the narrative describe similar flaws in the Microsoft Windows kernel or its components. The relations between the previous CVEs and the seed CVE, CVE-2017-0050, can be clearly seen. For instance, the document labeled CVE-2016-7216 (the fourth document in Figure 6) describes how the kernel API mishandles permissions, which in turn allows for escalation of privilege errors. In the seed document, the kernel does not properly enforce permissions, so although the mishandling of permissions was patched in the kernel, the kernel itself did not enforce these new permissions, leading to a denial of service. Note that the documents in the narrative are all related to kernel vulnerabilities, suggesting that vulnerabilities affecting Microsoft Windows components are heavily dependent on previous vulnerabilities that affect the same components.

Figure 7 illustrates the evolutionary narrative of CVE-2016-3346. This particular CVE describes a bug in the loading of a dynamic link library (DLL). The bug leads to elevation of privilege, which causes information disclosure or remote code execution. A possible origin of the vulnerability can be pinpointed to security feature bypass and buffer overflow vulnerabilities in the kernel mode drivers of Microsoft Windows. This implies that the vulnerability documented in CVE-2016-3346 is kernel-based, rather than an inherent DLL issue. Additional documents in the narrative provide further background on the adaptation of the vulnerability, offering a fascinating example that displays the progression of one vulnerability to another. As a whole, the narrative reinforces the concept that vulnerabilities are multifaceted, and they often have dependencies that are not easily visible to experts.

In Figure 2, we present the evolutionary narrative for the document labeled CVE-2015-6218, which describes a high impact vulnerability in Microsoft Windows 10 that causes elevation of privilege and possibly remote code execution because of improper input validation before loading libraries. The first document found by the storytelling algorithm describes a race condition vulnerability in the kernel mode drivers of an earlier version of Microsoft Windows, as Microsoft Windows 10 was released in 2015. The next documents in the narrative describe similar vulnerabilities in the kernel mode drivers of an earlier version of Microsoft Windows. Note that all vulnerabilities in the chain can be exploited via a crafted application. In the final two documents of the narrative, a shift from driver exploitation to library exploitation is seen, but the vulnerability still has the same impact on product security. In summary, this narrative explores the impact of a single vulnerability on the security of multiple products, with a focus on how vulnerabilities in one product contribute to vulnerabilities in others.

Overall, our case study into the susceptibility of Microsoft Windows 10 suggests that vulnerabilities do evolve over time. It also suggests that vulnerability development is based on several factors, such as the dependence of vulnerabilities on previous issues and the cross-product nature of vulnerability adaptation.

C. Shifts in Product Susceptibility

In order to explain shifts in product susceptibility, we perform an empirical comparison test. Figure 8 (left) shows the topic probability of ten vulnerability labels since 2000 and Figure 8 (middle) shows the product susceptibility of ten products since 2000. Comparing these two figures explains some of the shifts in susceptibility. However, other factors such as product use and product deprecation that have an undeniable impact on susceptibility are not considered here. For instance, Figure 8 (middle) shows that since 2000, certain products, such as Edge and Windows Server, have appeared that completely changed the susceptibility dynamic. However, Safari remains an exception. It becomes susceptible around 2003, when it was released to the public, and it has a high susceptibility until about 2014 when its susceptibility abruptly falls. Similarly, the topic Numeric Errors in Figure 8 (left) grows especially probable in the time range 2003-2014, before abruptly declining as well. Of course, there are multiple reasons for the shift in susceptibility for Safari, but Numeric Errors vulnerabilities undoubtedly play a part as they are the only vulnerabilities with a similar probability tendency out of all the given topics.

[Figure 8 plots omitted. Labels legible in the panels include the vulnerability topics Authentication Issues, Information Leak, Input Validation, Null Pointer, Numeric Errors, SQL Injection, and Cross-site Scripting (XSS), and the products ImageMagick, Mac OS, Mac OS Server, Ubuntu Linux, Seamonkey, Edge, Adobe Flash Player, Safari, Windows Server (2012), MySQL, JRE, iTunes, Apple TV, and iPhone OS.]

Figure 8: (left) Topic probability of ten vulnerability labels since 2000 (middle) Product susceptibility of ten products since 2000 (right) Susceptibility of five Apple products over the last ten years.
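The visual comparison between the topic curves and the product curves in Figure 8 could also be approximated numerically. The sketch below is only an illustration under stated assumptions: it uses Pearson correlation as the similarity measure, which is our choice and not a method described in this paper, the function names `pearson` and `rank_topics` are hypothetical, and the yearly values are invented for the example rather than read from the figure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length yearly series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat curve carries no trend information
    return cov / sqrt(var_x * var_y)


def rank_topics(product_curve, topic_curves):
    """Sort topics by how closely their curve tracks the product curve."""
    scores = {name: pearson(product_curve, curve)
              for name, curve in topic_curves.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented yearly values, for illustration only (not taken from the paper).
safari = [0.1, 0.4, 0.6, 0.7, 0.2]
topics = {
    "Numeric Errors": [0.2, 0.5, 0.7, 0.8, 0.3],
    "SQL Injection": [0.6, 0.5, 0.4, 0.3, 0.2],
}

ranking = rank_topics(safari, topics)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

With these invented values the topic that rises and falls together with the product curve is ranked first. A high correlation only indicates a shared tendency; as noted above, factors such as product use and deprecation are not captured by this kind of comparison.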
Another shift that can be partially explained using this
pragmatic method is the sudden susceptibility of both Mi-
crosoft Edge and Windows Server 2012. Although the shifts
in susceptibility for these products can be partly credited to
the time of their introduction into the product ecosystem,
the shift in susceptibility is especially drastic and unlike the
natural growth of susceptibility observed thus far. However,
an analysis of overall vulnerability distribution offers some
explanation for this phenomenon. Both the topics NULL Pointer
and Data Handling in Figure 8 (left) show similar growth
patterns suggesting that a proliferation of these vulnerabilities
contributes to an equal if not greater increase in susceptibility
for the products Edge and Windows Server 2012. Once again,
this assertion stands for these particular vulnerabilities because
they are the only ones that have a comparable probability
variation during this time period.

Figure 9: Coherence of topics generated by STEM, LDA, LLDA, and TOT on the NVD dataset. STEM outperforms all.

Our final experiment using STEM in the realm of product
susceptibility is an analysis of shifts that occur when a new
software product is introduced to the public. In Figure 8
(right), we assess product susceptibility over ten years since
the introduction of iPhone OS. For this experiment, we chose
to only study products that were created by Apple, the com-
pany that developed iPhone OS, to display the impact of
product introduction on a company’s overall product suscep-
tibility. An analysis of this experiment reveals that since the
introduction of iPhone OS there has been a steady decline
in susceptibility for iTunes, Apple TV, and Safari, but there
has been an aggressive growth in susceptibility for iPhone
OS. Now, iPhone OS is the most susceptible product created
by Apple, but the susceptibility of Mac OS, which remained
steady for most of the period, is on the rise as well. This
relationship suggests that substantial shifts occur in the product
susceptibility domain once a new product enters the market,
but it also demonstrates that product susceptibility is volatile
even if a product introduction has not occurred, such as in the
case of Mac OS.

Figure 10: Perplexity of STEM, LLDA, STEM-P and TOT models on the NVD dataset.

D. Timestamp Prediction, Coherence, and Perplexity

STEM models the temporal evolution of topics along with word co-occurrence probabilities. To illustrate that the topics generated by STEM can capture time more accurately than other models, we approximated the timestamp of each NVD document from its distribution of topics according to four topic models (STEM, LDA, LLDA, and TOT) as well as a baseline model that predicted timestamps randomly. Support Vector Machine regression was used to predict timestamps for the four topic models, where the feature vectors for regression were the topic distributions in NVD documents. For more information on Support Vector regression refer to [3]. Root Mean Squared Error (RMSE) was used for error measurement and the models were evaluated at various folds (training/test splits). We found that STEM performed as well as TOT, resulting in lower error (RMSE). STEM also had a lower error than LDA and LLDA. The resulting figure is omitted here because of space considerations, but it can be found at the website mentioned in Section V-A.

We use Topic Coherence [12] as a second method to evaluate the quality of the topics generated by STEM. Each generated topic consists of words, and topic coherence is applied to the top N words from each topic. Higher topic coherence is better. Figure 9 shows the coherence of topics generated by STEM, LLDA, TOT, and LDA for different numbers of topics in the NVD dataset. The figure shows that STEM outperforms all models in terms of topic coherence.

Perplexity is another widely used metric of convergence in topic modeling. It is measured as the inverse of the geometric mean of the per-word likelihood. A lower perplexity is an indicator of a better fit to the data. Figure 10 shows the perplexity for STEM, STEM-P, LLDA and TOT on the
NVD dataset. STEM and the parallel version of STEM (called STEM-P) converge in fewer iterations than TOT and compete well with LLDA by being very close to it.

E. Parallel Inference

In this paper, we draw conclusions on the nature of vulnerabilities that affect software products based on only one cyber-security corpus. In order to discover more intricate relationships in this domain, other corpora may be necessary. Considering this, the scalability of STEM becomes of crucial importance as the model needs to be trained for each new dataset, a process that can be computationally intensive and time-consuming. In this subsection, we perform an experiment on the scalability of STEM, focusing on how the model converges during the inference process. In Figure 11, we compare the convergence times of STEM with its parallel version, STEM-P, over ten iterations for datasets with 25, 50, 75, and 100 topics. Our analysis reveals that STEM-P is able to converge 30 to 40 percent faster than the regular STEM model, suggesting that the STEM model has high scalability when implemented using a parallel approach.

[Figure 11 plot omitted: convergence time (in minutes) versus number of topics (25, 50, 75, 100) for STEM and STEM-P.]

Figure 11: Convergence time of STEM and STEM-P for different numbers of topics.

VI. CONCLUSION

This paper proposes an analytic framework that uses a Supervised Topical Evolution Model (STEM) and a diffusion-based storytelling algorithm to discover and study hidden themes and trends in the National Vulnerability Database, a cyber-security corpus describing software vulnerabilities since 1996. A variety of experimental results indicate that our overall framework reveals notable relationships between vulnerabilities, vulnerability interactions, and the susceptibility of software products. We reinforce our results by comparing different models. In the future, we will use STEM in conjunction with diffusion-based storytelling to predict vulnerabilities in software before they occur. We will also use STEM to estimate the future appearance probabilities of vulnerabilities in the NVD.

VII. ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation under Grant No. HRD-1242122.

REFERENCES

[1] R. C. Barranco, A. P. Boedihardjo, and M. S. Hossain, “Analyzing evolving stories in news articles,” International Journal of Data Science and Analytics, 2017.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, pp. 993–1022, 2003.
[3] C. Cortes and V. N. Vapnik, “Support-vector networks,” Machine Learning, pp. 273–297, 1995.
[4] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 721–741, 1984.
[5] Z. Han, X. Li, Z. Xing, H. Liu, and Z. Feng, “Learning to predict severity of software vulnerability using only vulnerability description,” 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017.
[6] M. Hossain, P. Butler, S. Boedihardjo, and N. Ramakrishnan, “Storytelling in entity networks to support intelligence analysts,” KDD ’12, pp. 1375–1383, 2012.
[7] M. A. Kader, A. P. Boedihardjo, S. M. Naim, and M. S. Hossain, “Contextual embedding for distributed representations of entities in a text corpus,” in Proceedings of the 5th International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications at KDD 2016, ser. Proceedings of Machine Learning Research, W. Fan, A. Bifet, J. Read, Q. Yang, and P. S. Yu, Eds., vol. 53, 14 Aug 2016, pp. 35–50.
[8] S. Kamara, S. Fahmy, E. Schultz, F. Kerschbaum, and M. Frantzen, “Analysis of vulnerabilities in internet firewalls,” Computers and Security, pp. 214–232, 2003.
[9] Z. Lin, X. Li, and X. Kuang, “Machine learning in vulnerability databases,” 2017 10th International Symposium on Computational Intelligence and Design (ISCID), 2017.
[10] S. M. Naim, A. P. Boedihardjo, and M. S. Hossain, “A scalable model for tracking topical evolution in large document collections,” 2017 IEEE International Conference on Big Data (Big Data), 2017.
[11] S. Neuhaus and T. Zimmermann, “Security trend analysis with CVE topic models,” 2010 IEEE 21st International Symposium on Software Reliability Engineering, 2010.
[12] D. Newman, J. H. Lau, K. Grieser, and T. Baldwin, “Automatic evaluation of topic coherence,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ser. HLT ’10, 2010, pp. 100–108.
[13] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, “Labeled LDA,” Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing Volume 1 - EMNLP 09, 2009.
[14] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, “Predicting vulnerable software components,” in Proceedings of the 14th ACM Conference on Computer and Communications Security, 2007.
[15] R. Scandariato, J. Walden, A. Hovsepyan, and W. Joosen, “Predicting vulnerable software components via text mining,” IEEE Transactions on Software Engineering, vol. 40, no. 10, pp. 993–1006, Jan 2014.
[16] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities,” IEEE Transactions on Software Engineering, vol. 37, no. 6, pp. 772–787, 2011.
[17] X. Wang and A. McCallum, “Topics over time,” Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD 06, 2006.
[18] S. Zhang, D. Caragea, and X. Ou, “An empirical study on using the national vulnerability database to predict software vulnerabilities,” Lecture Notes in Computer Science Database and Expert Systems Applications, pp. 217–231, 2011.
[19] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization,” ACM Transactions on Mathematical Software, vol. 23, no. 4, pp. 550–560, Jan 1997.
[20] T. Zimmermann, N. Nagappan, and L. Williams, “Searching for a needle in a haystack: Predicting security vulnerabilities for Windows Vista,” 2010 Third International Conference on Software Testing, Verification and Validation, 2010.
