Extracting Information About Security Vulnerabilities From Web Text
Varish Mulwad, Wenjia Li, Anupam Joshi, Tim Finin and Krishnamurthy Viswanathan
Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Baltimore, MD, USA
{varish1,wenjia1,joshi,finin,krishna3}@cs.umbc.edu
Abstract—The Web is an important source of information about computer security threats, vulnerabilities and cyber-attacks. We present initial work on developing a framework to detect and extract information about vulnerabilities and attacks from Web text. Our prototype system uses Wikitology, a general purpose knowledge base derived from Wikipedia, to extract concepts that describe specific vulnerabilities and attacks, map them to related concepts from DBpedia and generate machine understandable assertions. Such a framework will be useful in adding structure to already existing vulnerability descriptions as well as detecting new ones. We evaluate our approach against vulnerability descriptions from the National Vulnerability Database. Our results suggest that it can be useful in monitoring streams of text from social media or chat rooms to identify potential new attacks and vulnerabilities or to collect data on the spread and volume of existing ones.

Keywords-security, vulnerability, information extraction, entity linking

I. INTRODUCTION

The Web has become a primary source of knowledge and information, largely replacing encyclopedias and reference books. This is especially true for dynamic topics such as computer security threats, vulnerabilities and cyber-attacks. Detailed information on these topics is found in Web-accessible repositories of structured and semi-structured information, including the National Vulnerability Database (NVD) [1], IBM's X-Force [2], the US-CERT Vulnerability Notes Database [3], and other security advisory sources. Various informal sources complement these curated repositories, such as computer help forums, hacker blogs and forums, chat rooms and social media streams. Even though these are noisy, redundant and often contain misinformation, they can be mined and aggregated to provide early warnings of new vulnerabilities and attacks, track the evolution of existing ones, produce evidence for attribution, and estimate the prevalence and geographical distribution of known problems. The integration of knowledge and data from these two very different domains has great potential, but also offers significant challenges.

We present a framework that analyzes text snippets found on the Web to identify and generate assertions about vulnerabilities, threats and attacks. Given a text description, we use the Wikitology [4] knowledge base along with a taxonomy of computer security exploits to decide whether it is relevant to computer security and, if so, identify the set of security concepts it evokes. Once the concepts are identified and linked to related objects in our knowledge base, we generate assertions about the text description.

II. MOTIVATION AND BACKGROUND

A system that extracts vulnerabilities, threats and attacks from unstructured text will be useful to many applications. Sources such as NVD provide XML and ATOM feeds of the latest vulnerabilities. Although these contain some structured information, such as vendor, software name, version, and severity, important information such as the exploit type (e.g., cross-site scripting) and attack mode (e.g., ping of death) is mentioned, if at all, only in the unstructured text. Using our techniques, we can detect concepts such as exploits and attacks in the free text and thus add more structure to the XML feed, adding to its value and utility.

We see our framework as part of a larger system that scans Web resources for descriptions of new vulnerabilities, threats and zero-day attacks. Such a system can monitor and digest information from a set of sources, such as vulnerability description feeds as well as hacker forums, chat rooms and hacker blogs. The system can process a stream of text to detect potential vulnerability descriptions, extract concepts and topics of interest as well as associated entities such as software products, and notify security experts for further action.

Our framework will also be useful in creating a knowledge base of vulnerabilities and threats. Using it, existing semi-structured and unstructured vulnerability descriptions can be transformed into machine understandable assertions in the RDF/OWL Semantic Web language [5] and linked to appropriate entities and concepts in the linked data cloud [6].

A key component of our framework is Wikitology, a general purpose, hybrid knowledge base containing both structured and unstructured information extracted from Wikipedia, DBpedia [7], Freebase [8], WordNet [9] and Yago [10]. Wikitology's interface is based on a specialized information retrieval index, implemented using the Lucene information retrieval system, that supports complex queries with structured and unstructured components and constraints. Wikitology provides various fields to query against
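The hybrid structured/unstructured querying supported by a Lucene-style index can be illustrated with a small self-contained sketch. The field names and documents below are hypothetical illustrations, not Wikitology's actual schema:

```python
# Minimal sketch of a hybrid query over an index whose documents mix a
# structured field ("types") with free text, in the spirit of a
# Lucene-backed index. Field names and documents are made-up examples.

def hybrid_query(index, text_terms, required_type):
    """Rank documents by free-text term overlap, keeping only those that
    satisfy the structured constraint on the 'types' field."""
    results = []
    for doc in index:
        if required_type not in doc["types"]:      # structured constraint
            continue
        words = doc["text"].lower().split()
        score = sum(words.count(t.lower()) for t in text_terms)  # unstructured match
        if score > 0:
            results.append((score, doc["title"]))
    return [title for score, title in sorted(results, reverse=True)]

index = [
    {"title": "Buffer overflow", "types": ["Computer security exploits"],
     "text": "A buffer overflow occurs when a program writes past a buffer"},
    {"title": "Denial-of-service attack", "types": ["Computer security exploits"],
     "text": "An attack that makes a service unavailable to its users"},
    {"title": "Buffer (disambiguation)", "types": ["Disambiguation pages"],
     "text": "Buffer may refer to a buffer overflow or a chemical buffer"},
]

print(hybrid_query(index, ["buffer", "overflow"], "Computer security exploits"))
# → ['Buffer overflow']
```

The disambiguation page is excluded by the structured constraint even though it matches the free-text terms, which is the point of combining both query components.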
Wikitology to some super categories from Wikipedia. To overcome these issues, we extracted the taxonomy under the Wikipedia category Computer security exploits [13]. This taxonomy allows us to filter and select concepts that belong to this category or fall under it. In our evaluation section we present results for concepts related to computer security exploits extracted from the top five and top ten Wikipedia concepts returned by Wikitology for every query.

Once the security exploit concepts are extracted from text snippets using the concept extraction algorithm, we generate machine-understandable assertions from them. We use the IDS OWL ontology [14] [15] [16] to represent and reason about intrusion detection concepts and events and to encode the resulting inferred facts. This ontology provides classes to describe different aspects of an attack. For example, the Means class is a super class of concepts representing methods to conduct an attack. The Consequence class subsumes classes representing attack outcomes, and the System class covers classes for systems under various attacks. The ontology also has properties, such as hasMeans and hasConsequence, to describe the means and consequence of a given vulnerability. We use the knowledge from Wikitology to map the extracted concepts to the corresponding DBpedia concepts. Figure 3 shows a text description from the NIST NVD/CVE data feed and the assertions extracted from it, encoded in the Semantic Web language OWL and serialized in N3.

Buffer overflow in Fax4Decode in LibTIFF 3.9.4 and possibly other versions, as used in ImageIO in Apple iTunes before 10.2 on Windows and other products, allows remote attackers to execute arbitrary code or cause a denial of service (application crash) via a crafted TIFF Internet Fax image file that has been compressed using CCITT Group 4 encoding, related to the EXPAND2D macro in libtiff/tif_fax3.h.

@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix ids: <http://ebiquity.org/ontologies/cybersecurity/ids/v2.0/ids> .

[a ids:Vulnerability;
 ids:hasMeans dbpedia:Buffer_overflow;
 ids:hasConsequence dbpedia:Denial-of-service_attack].

Figure 3. The top box is an example text description of a vulnerability from the NIST NVD data feed. The second shows extracted information encoded as OWL assertions serialized using N3.

Such assertions can be added to a knowledge base of computer security exploits and used when reasoning about and detecting new vulnerabilities and threats. In future work we will also explore generating more complex assertions that capture the software products under attack, the vendors of those products, etc.
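The assertion-generation step can be sketched in a few lines of plain Python. The concept-to-property mapping table below is a hypothetical stand-in for the Wikitology-to-DBpedia lookup and the IDS ontology's role assignment, not our actual implementation:

```python
# Sketch: turn extracted exploit concepts into N3 assertions using the
# IDS ontology's hasMeans/hasConsequence properties. The PROPERTY_FOR
# table is an assumed simplification of the real concept-role mapping.

PROPERTY_FOR = {
    "Buffer overflow": "ids:hasMeans",
    "Denial-of-service attack": "ids:hasConsequence",
}

def to_dbpedia(concept):
    """Map a concept label to a DBpedia resource QName (spaces -> underscores)."""
    return "dbpedia:" + concept.replace(" ", "_")

def make_assertions(concepts):
    """Emit an N3 blank-node description of one vulnerability."""
    lines = ["[a ids:Vulnerability;"]
    for c in concepts:
        prop = PROPERTY_FOR.get(c)
        if prop:  # skip concepts with no known role in the ontology
            lines.append(f"  {prop} {to_dbpedia(c)};")
    lines[-1] = lines[-1].rstrip(";") + "]."   # close the blank node
    return "\n".join(lines)

print(make_assertions(["Buffer overflow", "Denial-of-service attack"]))
```

Run on the two concepts extracted from the Figure 3 description, this prints an N3 fragment equivalent to the one shown in the figure.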
IV. DISCUSSION AND EVALUATION

Our work is broadly related to the problems of concept spotting and named entity recognition. While named entity recognition is a well known and well explored problem (see [17]), it has focused on extracting people, places and organizations from free text. To the best of our knowledge, no effort has focused on extracting computer security exploits and associated entities, relations and events from free text. Portions of the NVD database have been mapped into RDF [18] using a schema-based approach [19], but much of the information remained in strings rather than RDF instances.
We evaluated our prototype system against a collection of vulnerability text descriptions from NVD. For every text description, the knowledge base returned a ranked list of the top N Wikipedia concepts associated with the text. The concept extraction algorithm then applies the filtering mechanism described above to produce a ranked list of computer security exploits from these.

In the first evaluation, we checked whether the top N concepts returned by the algorithm include the correct exploit and at what rank the correct exploit is predicted. For each text description, one of the authors of the paper (who has a background in computer security and networks) went through the ranked list of computer security exploit concepts returned by the algorithm and identified whether a correct exploit was predicted and at what rank. Out of 107 vulnerability text descriptions with N = 5, the algorithm identified one or more security exploits for 76 text descriptions and failed for 31. Of the 76 descriptions in which a security exploit concept was detected, our algorithm detected the correct concept for 68 text descriptions (89.47%). Of those 68 text descriptions, the correct concept was at rank 1 in 66 cases. Overall, the average rank of the correct concept was 1.0588.

With N = 10, out of 107 vulnerability text descriptions the algorithm identified one or more security exploits for 80 text descriptions and did not detect any security exploit concept for 27. Of the 80 descriptions in which a security exploit concept was detected, our algorithm detected the correct concept for 72 text descriptions (90%), slightly better than for N = 5. Of those 72 text descriptions, the correct concept was at rank 1 in 69 cases. Overall, the average rank of the correct concept was 1.125. From the first graph in Figure 4, it is evident that the concept extraction algorithm yields high accuracy for both N = 5 and N = 10.

Every concept on Wikipedia is associated with multiple labels (or categories): a set of specific categories as well as a set of general categories and labels. Thus, to evaluate the ranked order of security exploit concepts generated by the algorithm, we performed a second evaluation. We compared the ranked list of computer security exploits generated by the algorithm against the ranked list of computer security exploits created by a human expert (the same author as before).

Figure 4. Accuracies for concepts returned by Wikitology, and the mean average precision between the ranked list of concepts extracted by our algorithm and the list generated by our human expert, for the top five and top ten results returned by Wikitology.

We used the average precision [20] measure to compare the two ranked lists. Mean average precision (MAP) gives the average precision over a set of queries. We calculated MAP for N = 5 over a smaller subset of 17 queries (i.e., 17 text descriptions) and for N = 10 over a subset of 19 queries. MAP for both N = 5 and N = 10 is greater than 0.8 (see Figure 4), which indicates that the concept extraction algorithm is not only producing a list of correct entities but also producing it in the desired order.
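As a small illustration of the metric used above, average precision over a single ranked list can be computed as follows. The lists here are made-up examples, not our evaluation data, and we treat the expert's list as the set of relevant concepts (the standard simplification):

```python
# Average precision for a ranked list: at each rank where a relevant item
# appears, take precision@rank, then average over all relevant items.

def average_precision(ranked, relevant):
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / i   # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP over (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Made-up example: relevant items at ranks 1 and 3.
algo = ["Buffer overflow", "SQL injection", "Denial-of-service attack"]
expert = {"Buffer overflow", "Denial-of-service attack"}
print(average_precision(algo, expert))  # (1/1 + 2/3) / 2 ≈ 0.8333
```

A list that places every relevant concept ahead of every irrelevant one scores 1.0, so values above 0.8 indicate rankings close to the expert's ordering.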
V. CONCLUSION AND FUTURE WORK

We described a prototype system to identify vulnerabilities, threats and attacks in Web text, from which machine understandable OWL assertions can be generated. Evaluations showed promising results for our framework. We plan to focus on developing stronger reasoning algorithms and a more principled security exploits ontology. Many difficult challenges will need to be addressed, including representing uncertainty, reasoning with both logical and probabilistic knowledge, and modeling and reasoning about the temporal aspects of the data.

ACKNOWLEDGMENT

This work was partially supported by a grant from the Air Force Office of Scientific Research (MURI FA9550-08-1-0265) and a gift from the Northrop Grumman Corporation.

REFERENCES

[1] "National vulnerability database," http://nvd.nist.gov.
[2] "Internet security systems X-Force security threats," http://xforce.iss.net.
[3] US-CERT, "Vulnerability notes database," http://www.kb.cert.org/vuls/.
[4] T. Finin and Z. Syed, "Creating and exploiting a web of semantic data," in Proc. 2nd Int. Conf. on Agents and Artificial Intelligence. Springer, January 2010.
[5] O. Lassila and R. Swick, "Resource description framework (RDF): Model and syntax specification. Recommendation," W3C, Tech. Rep., 1999.
[6] C. Bizer, "The emerging web of linked data," IEEE Intelligent Systems, vol. 24, no. 5, pp. 87–92, 2009.
[7] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann, "DBpedia - a crystallization point for the web of data," Journal of Web Semantics, vol. 7, no. 3, pp. 154–165, 2009.
[8] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proc. ACM Int. Conf. on Management of Data. New York, NY: ACM, 2008, pp. 1247–1250.
[9] G. A. Miller, "WordNet: a lexical database for English," Commun. ACM, vol. 38, pp. 39–41, November 1995.
[10] F. M. Suchanek, G. Kasneci, and G. Weikum, "Yago: A core of semantic knowledge," in Proc. 16th Int. World Wide Web Conf. New York: ACM Press, 2007.
[11] CNET, http://www.cnet.com/.
[12] "OpenCalais," http://opencalais.com/.
[13] http://en.wikipedia.org/w/index.php?title=Special:CategoryTree.
[14] J. Undercoffer, A. Joshi, T. Finin, and J. Pinkston, "Using DAML+OIL to classify intrusive behaviours," The Knowledge Engineering Review, vol. 18, pp. 221–241, 2003.
[15] J. Undercoffer, T. Finin, A. Joshi, and J. Pinkston, "A target-centric ontology for intrusion detection," in Proc. 18th Int. Joint Conf. on Artificial Intelligence, 2004.
[16] http://ebiquity.umbc.edu/ontologies/cybersecurity/ids/.
[17] D. Nadeau and S. Sekine, "A survey of named entity recognition and classification," Linguisticae Investigationes, vol. 30, no. 1, pp. 3–26, January 2007.
[18] V. Khadilkar, J. Rachapalli, and B. Thuraisingham, "Semantic web implementation scheme for national vulnerability database," Univ. of Texas at Dallas, Tech. Rep. UTDCS-01-10, 2010.
[19] C. Bizer and A. Seaborne, "D2RQ - treating non-RDF databases as virtual RDF graphs," in Proc. 3rd International Semantic Web Conference, 2004.
[20] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, 1st ed. Cambridge University Press, July 2008.