Notos: Building a Dynamic Reputation System for DNS
Manos Antonakakis, Roberto Perdisci, David Dagon, Wenke Lee, and Nick Feamster
College of Computing, Georgia Institute of Technology,
{manos,rperdisc,dagon,wenke,feamster}@cc.gatech.edu
Figure 3. Overview of Notos: DNS traffic from ISP recursive DNS servers (subnets in Atlanta and San Jose) feeds network-based, zone-based, and evidence-based feature extraction; the resulting feature vectors (F1 ... F18, F1 ... F17, and F1 ... F6) are passed to the reputation engine.
We call A(D) the set of IP addresses ever pointed to by any domain name d ∈ D. Given an IP address a, we define BGP(a) to be the set of all IPs within the BGP prefix of a, and AS(a) as the set of IPs located in the autonomous system in which a resides. In addition, we can extend these functions to take as input a set of IPs: given the IP set A = {a_1, a_2, ..., a_N}, BGP(A) = ∪_{k=1..N} BGP(a_k); AS(A) is similarly extended.
To assign a reputation score to a domain name d we proceed as follows. First, we consider the most current set A_c(d) = {a_i}_{i=1..m} of IP addresses to which d points. Then, we query our pDNS database to retrieve the following information:

• Related Historic IPs (RHIPs), which consist of the union of A(d), A(Zone(3LD(d))), and A(Zone(2LD(d))). In order to simplify the notation we will refer to A(Zone(3LD(d))) and A(Zone(2LD(d))) as A_3LD(d) and A_2LD(d), respectively.

• Related Historic Domains (RHDNs), which comprise the entire set of domain names that ever resolved to an IP address a ∈ AS(A(d)). In other words, RHDNs contain all the domains d_i for which A(d_i) ∩ AS(A(d)) ≠ ∅.
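To make these definitions concrete, the following sketch shows how RHIPs and RHDNs could be computed over a toy in-memory pDNS store. This is an illustration only: the PDNS class and the caller-supplied asn_of (IP-to-ASN) lookup are hypothetical stand-ins, not code from Notos.

```python
from collections import defaultdict

def zone_suffix(domain: str, level: int) -> str:
    """Return the last `level` labels of a domain (its 2LD or 3LD zone)."""
    return ".".join(domain.rstrip(".").lower().split(".")[-level:])

class PDNS:
    """Toy in-memory stand-in for the paper's passive DNS database."""

    def __init__(self, resolutions):
        # resolutions: iterable of (domain, ip) pairs from successful A answers
        self.dom_to_ips = defaultdict(set)
        self.ip_to_doms = defaultdict(set)
        self.zone_to_ips = defaultdict(set)  # keyed by 2LD and 3LD zones
        for dom, ip in resolutions:
            self.dom_to_ips[dom].add(ip)
            self.ip_to_doms[ip].add(dom)
            for lvl in (2, 3):
                self.zone_to_ips[zone_suffix(dom, lvl)].add(ip)

    def A(self, d):
        """A(d): every IP that domain d ever pointed to."""
        return set(self.dom_to_ips.get(d, ()))

    def rhips(self, d):
        """RHIPs: A(d) ∪ A_3LD(d) ∪ A_2LD(d)."""
        return (self.A(d)
                | self.zone_to_ips[zone_suffix(d, 3)]
                | self.zone_to_ips[zone_suffix(d, 2)])

    def rhdns(self, d, asn_of):
        """RHDNs: all domains that ever resolved into AS(A(d))."""
        target = {asn_of(ip) for ip in self.A(d)}
        return {dom for ip, doms in self.ip_to_doms.items()
                if asn_of(ip) in target
                for dom in doms}
```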
3.2 Statistical Features

Given a domain name d and its related RHIPs and RHDNs sets, we extract three groups of statistical features:

• Network-based features: quantities measured over the IPs in the RHIPs, such as the number of distinct IPs, BGP prefixes, autonomous systems, and registrars historically associated with d (detailed in Section 3.2.1).

• Zone-based features: quantities measured over the domain names in the RHDNs, such as the average length of domain names in RHDNs, the number of distinct TLDs, the occurrence frequency of different characters, etc.

• Evidence-based features: The last set of features includes the measurement of quantities such as the number of distinct malware samples that contacted the domain d, the number of malware samples that connected to any of the IPs pointed to by d, etc.

Once extracted, these statistical features are fed to the reputation engine. Notos' reputation engine operates in two modes: an off-line "training" mode and an on-line "classification" mode. During the off-line mode, Notos trains the reputation engine using the information gathered in our knowledge base, namely the set of known malicious and legitimate domain names and their related IP addresses. Afterwards, during the on-line mode, for each new domain d, Notos queries the trained reputation engine to compute a reputation score for d (see Figure 3). We now explain the details of the statistical features we measure, and how the reputation engine uses them during the off-line and on-line modes to compute a domain name's reputation score.
3.2.1 Network-based Features

Domain names and IPs that are used for malicious purposes are often short-lived and are characterized by a high churn rate. This agility avoids simple blacklisting or removals by law enforcement. In order to measure the level of agility of a domain name d, we extract eighteen statistical features that describe d's network profile. Our network features fall into the following three groups:

• BGP features. This subset consists of a total of nine features. We measure the number of distinct BGP prefixes related to BGP(A(d)), the number of countries in which these BGP prefixes reside, and the number of organizations that own these BGP prefixes; the number of distinct IP addresses in the sets A_3LD(d) and A_2LD(d); the number of distinct BGP prefixes related to BGP(A_3LD(d)) and BGP(A_2LD(d)), and the number of countries in which these two sets of prefixes reside.

• AS features. This subset consists of three features, namely the number of distinct autonomous systems related to AS(A(d)), AS(A_3LD(d)), and AS(A_2LD(d)).

• Registration features. This subset consists of six features. We measure the number of distinct registrars associated with the IPs in the A(d) set; the diversity in the registration dates related to the IPs in A(d); the number of distinct registrars associated with the IPs in the A_3LD(d) and A_2LD(d) sets; and the diversity in the registration dates for the IPs in A_3LD(d) and A_2LD(d).

While most legitimate, professionally run Internet services have a very stable network profile, which is reflected in low values of the network features described above, the profiles of malicious networks (e.g., fast-flux networks) usually change relatively frequently, thus causing their network features to be assigned higher values. We expect a domain name d from a legitimate zone to exhibit small values in its AS features, mainly because the IPs in the RHIPs should belong to the same organization or a small number of different organizations. On the other hand, if a domain name d participates in malicious activities (i.e., botnet activities, flux networks), then it could reside in a large number of different networks. The list of IPs in the RHIPs that correspond to the malicious domain name will produce AS features with higher values. In the same sense, we measure the homogeneity of the registration information for benign domains. Legitimate domains are typically linked to address space owned by organizations that acquire and announce network blocks in some order. This means that a legitimate domain name d owned by a single organization will produce a list of IPs in the RHIPs with small registration-feature values. If this set of IPs exhibits high registration-feature values, the IPs very likely reside with different registrars and were registered on different dates. Such registration-feature properties are typically linked with fraudulent domains.
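As an illustration of how mechanical these measurements are, the sketch below computes the three AS features and the nine BGP features from the RHIP sets. Here bgp_prefix, country_of, and org_of are assumed lookup helpers (e.g., backed by BGP and whois data), not anything specified in the paper; the six registration features would follow the same counting pattern over registrars and registration dates.

```python
def as_features(pdns, d, asn_of):
    """The three AS features: distinct ASNs behind A(d), A_3LD(d), A_2LD(d)."""
    ip_sets = (pdns.A(d),
               pdns.zone_to_ips[zone_suffix(d, 3)],
               pdns.zone_to_ips[zone_suffix(d, 2)])
    return [len({asn_of(ip) for ip in ips}) for ips in ip_sets]

def bgp_features(pdns, d, bgp_prefix, country_of, org_of):
    """The nine BGP features, in the order listed above."""
    a, a3, a2 = (pdns.A(d),
                 pdns.zone_to_ips[zone_suffix(d, 3)],
                 pdns.zone_to_ips[zone_suffix(d, 2)])
    prefixes = {bgp_prefix(ip) for ip in a}
    return [
        len(prefixes),                                   # prefixes in BGP(A(d))
        len({country_of(p) for p in prefixes}),          # their countries
        len({org_of(p) for p in prefixes}),              # their owning organizations
        len(a3),                                         # distinct IPs in A_3LD(d)
        len(a2),                                         # distinct IPs in A_2LD(d)
        len({bgp_prefix(ip) for ip in a3}),              # prefixes in BGP(A_3LD(d))
        len({bgp_prefix(ip) for ip in a2}),              # prefixes in BGP(A_2LD(d))
        len({country_of(bgp_prefix(ip)) for ip in a3}),  # countries of those prefixes
        len({country_of(bgp_prefix(ip)) for ip in a2}),  # countries of those prefixes
    ]
```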
3.2.2 Zone-based Features

The network-based features measure a number of characteristics of IP addresses historically related to a given domain name d. On the other hand, the zone-based features measure the characteristics of domain names historically associated with d. The intuition behind the zone-based features is that while legitimate Internet services may be associated with many different domain names, these domain names usually have strong similarities. For example, google.com, googlesyndication.com, googlewave.com, etc., are all related to Internet services provided by Google, and contain the string "google" in their name. On the other hand, malicious domain names related to the same spam campaign, for example, often look randomly generated and share few common characteristics.
Therefore, our zone-based features aim to measure the level of diversity across the domain names in the RHDNs set. Given a domain name d, we extract seventeen statistical features that describe the properties of the RHDNs set of domain names related to d. We divide these seventeen features into two groups:

• String features. This group consists of twelve features. We measure the number of distinct domain names in RHDNs, and the average and standard deviation of their length; the mean, median, and standard deviation of the occurrence frequency of each single character in the domain name strings in RHDNs; the mean, median and standard deviation of the distribution of 2-grams (i.e., pairs of characters); and the mean, median and standard deviation of the distribution of 3-grams.

• TLD features. This group consists of five features. For each domain d_i in the RHDNs set, we extract its top-level domain TLD(d_i) and we count the number of distinct TLD strings that we obtain; we measure the ratio between the number of domains d_i whose TLD(d_i) = ".com" and the total number of TLDs different from ".com"; also, we measure the mean, median, and standard deviation of the occurrence frequency of the TLD strings.

It is worth noting that whenever we measure the mean, median and standard deviation of a certain property, we do so in order to summarize the shape of its distribution. For example, by measuring the mean, median, and standard deviation of the occurrence frequency of each character in a set of domain name strings, we summarize what the distribution of the character frequency looks like.
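For instance, the character and n-gram statistics can be computed with nothing more than the standard library. The sketch below is illustrative and assumes RHDNs is simply a collection of domain-name strings.

```python
import statistics
from collections import Counter

def dist_stats(values):
    """Summarize the shape of a distribution by its mean, median, and stdev."""
    vals = list(values) or [0.0]
    return [statistics.mean(vals), statistics.median(vals), statistics.pstdev(vals)]

def ngram_freqs(domains, n):
    """Relative occurrence frequencies of all n-grams across the domain strings."""
    counts = Counter(d[i:i + n] for d in domains for i in range(len(d) - n + 1))
    total = sum(counts.values()) or 1
    return [c / total for c in counts.values()]

def string_features(rhdns):
    """The twelve string features over the RHDNs set."""
    rhdns = list(rhdns)
    lengths = [len(d) for d in rhdns] or [0]
    feats = [len(rhdns), statistics.mean(lengths), statistics.pstdev(lengths)]
    for n in (1, 2, 3):  # single characters, 2-grams, 3-grams
        feats += dist_stats(ngram_freqs(rhdns, n))
    return feats
```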
3.2.3 Evidence-based Features

We use the evidence-based features to determine to what extent a given domain d is associated with other known malicious domain names or IP addresses. As mentioned above, Notos collects a knowledge base of known suspicious, malicious, and legitimate domain names and IPs from public sources. For example, we collect malware-related domain names by executing large numbers of malware samples in a controlled environment. Also, we check IP addresses against a number of public IP blacklists. We elaborate on how we build Notos' knowledge base in Section 4. Given a domain name d, we measure six statistical features using the information in the knowledge base. We divide these features into two groups:

• Honeypot features. We measure three features, namely the number of distinct malware samples that, when executed, try to contact d or any IP address in A(d); the number of malware samples that contact any IP address in BGP(A(d)); and the number of samples that contact any IP address in AS(A(d)).

• Blacklist features. We measure three features, namely the number of IP addresses in A(d) that are listed in public IP blacklists; the number of IPs in BGP(A(d)) that are listed in IP blacklists; and the number of IPs in AS(A(d)) that are listed in IP blacklists.

Notos uses the blacklist features from the evidence vector so it can identify the re-use of known malicious network resources like IPs, BGP prefixes or even ASs. Domain names are significantly cheaper than IPv4 addresses, so malicious users tend to reuse address space with new domain names. We should note that the evidence-based features represent only part of the information we use to compute the reputation scores. The fact that a domain name was queried by malware does not automatically mean that the domain will receive a low reputation score.
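A sketch of the six evidence features follows; malware_contacts and ip_blacklist are hypothetical inputs standing in for our honeypot logs and the public blacklists, and asn_of/bgp_prefix are the same assumed lookups as in the earlier sketches.

```python
def evidence_features(d, a_of_d, malware_contacts, ip_blacklist, asn_of, bgp_prefix):
    """The six evidence features for domain d.

    a_of_d:           A(d), the IPs historically pointed to by d
    malware_contacts: (sample_id, domain, ip) triples from the honeypot
    ip_blacklist:     set of blacklisted IPs
    """
    prefixes = {bgp_prefix(ip) for ip in a_of_d}
    asns = {asn_of(ip) for ip in a_of_d}
    # Honeypot features: distinct samples contacting d, BGP(A(d)), AS(A(d)).
    hp_dom = {s for s, dom, ip in malware_contacts if dom == d or ip in a_of_d}
    hp_bgp = {s for s, _, ip in malware_contacts if bgp_prefix(ip) in prefixes}
    hp_as  = {s for s, _, ip in malware_contacts if asn_of(ip) in asns}
    # Blacklist features: blacklisted IPs falling in A(d), BGP(A(d)), AS(A(d)).
    bl_a   = sum(ip in ip_blacklist for ip in a_of_d)
    bl_bgp = sum(bgp_prefix(ip) in prefixes for ip in ip_blacklist)
    bl_as  = sum(asn_of(ip) in asns for ip in ip_blacklist)
    return [len(hp_dom), len(hp_bgp), len(hp_as), bl_a, bl_bgp, bl_as]
```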
3.3 Reputation Engine

Notos' reputation engine is responsible for deciding whether a domain name d has characteristics that are similar to either legitimate or malicious domain names. In order to achieve this goal, we first need to train the engine to recognize whether d belongs (or is "close") to a known class of domains. This training can be repeated periodically, in an off-line fashion, using historical information collected in Notos' knowledge base (see Section 4). Once the engine has been trained, it can be used in on-line mode to assign a reputation score to each new domain name d.

In this section, we first explain how the reputation engine is trained, and then we explain how a trained engine is used to assign reputation scores.

3.3.1 Off-Line Training Mode

During off-line training (Figure 3), the reputation engine builds three different modules. We briefly introduce each module and then elaborate on the details.

• Network Profiles Model: a model of how well-known networks behave. For example, we model the network characteristics of popular content delivery networks (e.g., Akamai, Amazon CloudFront) and large popular websites (e.g., google.com, yahoo.com). During the on-line mode, we compare each new domain name d to these models of well-known network profiles, and use this information to compute the final reputation score, as explained below.

• Domain Name Clusters: we group domain names into clusters sharing similar characteristics. We create these clusters of domains to identify groups of domains that contain mostly malicious domains, and groups that contain mostly legitimate domains. In the on-line mode, given a new domain d, if d (more precisely, d's projection into a statistical feature space) falls within (or close to) a cluster of domains containing mostly malicious domains, for example, this gives us a hint that d should be assigned a low reputation score.

• Reputation Function: for each domain name d_i, i = 1..n, in Notos' knowledge base, we test it against the trained network profiles model and domain name clusters. Let NM(d_i) and DC(d_i) be the outputs of the Network Profiles (NP) module and the Domain Clusters (DC) module, respectively. The reputation function takes as input NM(d_i), DC(d_i), and information about whether d_i and its resolved IPs A(d_i) are known to be legitimate, suspicious, or malicious (i.e., whether they appeared in a domain name or IP blacklist), and builds a model that can assign a reputation score between zero and one to d. A reputation score close to zero signifies that d is a malicious domain name, while a score close to one signifies that d is benign.

We now describe each module in detail.
3.3.2 Modeling Network Profiles

During the off-line training mode, the reputation engine builds a model of well-known network behaviors. An overview of the network profile modeling module can be seen in Figure 4(a). In practice we select five sets of domain names that share similar characteristics, and learn their network profiles. For example, we identify a set of domain names related to very popular websites (e.g., google.com, yahoo.com, amazon.com) and for each of the related domain names we extract their network features, as explained in Section 3.2.1. We then use the extracted feature vectors to train a statistical classifier that will be able to recognize whether a new domain name d has network characteristics similar to the popular websites we modeled.

In our current implementation of Notos we model the following classes of domain names:

• Popular Domains. This class consists of a large set of domain names under the following DNS zones: google.com, yahoo.com, amazon.com, ebay.com, msn.com, live.com, myspace.com, and facebook.com.
• Common Domains. This class consists of a large set of common domain names that do not belong to the most popular websites above.

• Akamai Domains. In this class we include domain names related to the Akamai CDN, under the following zones: akafms.net, akamai.net, akamaiedge.net, akadns.net, and akamai.com.

• CDN Domains. In this class we include domain names related to CDNs other than Akamai. For example, we collect domain names under the following zones: panthercdn.com, llnwd.net, cloudfront.net, nyud.net, nyucd.net and redcondor.net. We chose not to aggregate these CDN domains and Akamai's domains in one class, since we observed that Akamai's domains have a very unique network profile, as we discuss in Section 4. Therefore, learning two separate models for the classes of Akamai Domains and CDN Domains allows us to achieve better classification accuracy during the on-line mode, compared to learning only one model for both classes (see Section 3.3.5).

• Dynamic DNS Domains. This class includes a large set of domain names registered under two of the largest dynamic DNS providers, namely No-IP (no-ip.com) and DynDNS (dyndns.com).

For each class of domains, we train a statistical classifier to distinguish between one of the classes and all the others. Therefore, we train five different classifiers. For example, we train a classifier that can distinguish between the class of Popular Domains and all other classes of domains. That is, given a new domain name d, this classifier is able to recognize whether d's network profile looks like the profile of a well-known popular domain or not. Following the same logic, we can recognize network profiles for the other classes of domains.
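The paper does not pin down the base learner at this point, so the sketch below uses scikit-learn's logistic regression purely to illustrate the one-versus-rest structure of the five classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

CLASSES = ["Popular", "Common", "Akamai", "CDN", "DynDNS"]

def train_network_profile_models(X, labels):
    """One binary classifier per class (that class vs. all the others).

    X: (n_samples, 18) matrix of network features; labels: class name per row.
    """
    y = np.asarray(labels)
    return {c: LogisticRegression(max_iter=1000).fit(X, y == c) for c in CLASSES}

def nm_vector(models, x):
    """NM(d): the five per-class membership probabilities for one vector x."""
    x = np.asarray(x).reshape(1, -1)
    return [models[c].predict_proba(x)[0, 1] for c in CLASSES]
```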
3.3.3 Building Domain Name Clusters

In this phase, the reputation engine takes the domain names collected in our pDNS database during a training period, and builds clusters of domains that share similar network- and zone-based features. An overview of this module can be seen in Figure 4(b). We perform clustering in two steps. In the first step we only use the network-based features to create coarse-grained clusters. Then, in the second step, we split each coarse-grained cluster into finer clusters using only the zone-based features, as shown in Figure 5.
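Schematically, the two clustering steps compose as follows. Notos uses X-Means, which picks the number of clusters automatically; fixed-k K-Means stands in here for brevity, so this is only an illustrative sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_step_clusters(net_X, zone_X, k1=10, k2=5):
    """Coarse clusters on network features, then split each on zone features.

    net_X, zone_X: row-aligned (n_samples, d) arrays of the two feature sets.
    Returns a (coarse, fine) cluster label pair per sample.
    """
    coarse = KMeans(n_clusters=k1, n_init=10).fit_predict(net_X)
    fine = np.zeros_like(coarse)
    for c in np.unique(coarse):
        idx = np.where(coarse == c)[0]
        if len(idx) >= k2:  # enough members to split further
            fine[idx] = KMeans(n_clusters=k2, n_init=10).fit_predict(zone_X[idx])
    return list(zip(coarse, fine))
```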
Figure. The reputation function within the reputation engine: a sixteen-dimensional feature vector (F1 ... F16) for a domain d is mapped via f(d) to a reputation score S.
Domains that end up in the same zone-based cluster therefore belong to the same network-based cluster and also share similar zone-based features. To better understand how the zone-based clustering works, consider the following examples of zone-based clusters:

Cluster 1:
..., 72.247.176.81 e55.g.akamaiedge.net, 72.247.176.94 e68.g.akamaiedge.net, 72.247.176.146 e120.g.akamaiedge.net, 72.247.176.65 e39.na.akamaiedge.net, 72.247.176.242 e216.g.akamaiedge.net, 72.247.176.33 e7.g.akamaiedge.net, 72.247.176.156 e130.g.akamaiedge.net, 72.247.176.208 e182.g.akamaiedge.net, 72.247.176.198 e172.g.akamaiedge.net, 72.247.176.217 e191.g.akamaiedge.net, 72.247.176.200 e174.g.akamaiedge.net, 72.247.176.99 e73.g.akamaiedge.net, 72.247.176.103 e77.g.akamaiedge.net, 72.247.176.59 e33.c.akamaiedge.net, 72.247.176.68 e42.gb.akamaiedge.net, 72.247.176.237 e211.g.akamaiedge.net, 72.247.176.71 e45.g.akamaiedge.net, 72.247.176.239 e213.na.akamaiedge.net, 72.247.176.120 e94.g.akamaiedge.net, ...

Cluster 2:
..., 90.156.145.198 spzr.in, 90.156.145.198 vwui.in, 90.156.145.198 x9e.ru, 90.156.145.50 v2802.vps.masterhost.ru, 90.156.145.167 www.inshaker.ru, 90.156.145.198 x7l.ru, 90.156.145.198 c3q.at, 90.156.145.198 ltkq.in, 90.156.145.198 x7d.ru, 90.156.145.198 zdlz.in, 90.156.145.159 www.designcollector.ru, 90.156.145.198 x7o.ru, 90.156.145.198 q5c.ru, 90.156.145.159 designtwitters.com, 90.156.145.198 u5d.ru, 90.156.145.198 x9d.ru, 90.156.145.198 xb8.ru, 90.156.145.198 xg8.ru, 90.156.145.198 x8m.ru, 90.156.145.198 shopfilmworld.cn, 90.156.145.198 bigappletopworld.cn, 90.156.145.198 uppd.in, ...

Each element of the cluster is a domain name and IP address pair. These two groups of domains belonged to the same network cluster, but were separated into two different clusters by the zone-based clustering phase. Cluster 1 contains domain names belonging to Akamai's CDN, while the domains in Cluster 2 are all related to malicious websites that distribute malicious software. The two clusters of domains share similar network characteristics, but have significantly different zone-based features. For example, consider domain names d_1 = "e55.g.akamaiedge.net" from the first cluster, and d_2 = "spzr.in" from the second cluster. The reason why d_1 and d_2 were clustered in the same network-based cluster is that the sets of RHIPs (see Section 3.1) for d_1 and d_2 have similar characteristics. In particular, the network agility properties of d_2 make it look as if it were part of a large CDN. However, when we consider the sets of RHDNs for d_1 and d_2, we can notice that the zone-based features of d_1 are much more "stable" than the zone-based features of d_2. In other words, while the RHDNs of d_1 share strong domain name similarities (e.g., they all share the substring "akamai") and have low variance of the string features (see Section 3.2.2), the strong zone agility properties of d_2 affect the zone-based features measured on d_2's RHDNs and make d_2 look very different from d_1.

One of the main advantages of Notos is the reliable assignment of low reputation scores to domain names participating in "agile" malicious campaigns. Less agile malicious campaigns, e.g., fake AV campaigns, may use domain names structured to resemble CDN-related domains. Such strategies would not be beneficial for the fake AV campaign, since domains like virus-scan1.com, virus-scan2.com, etc., can be trivially blocked by using simple regular expressions [16]. In other words, the attackers need to introduce more "agility" at both the network and domain name level in order to avoid simple domain name blacklisting. Notos would only require a few labeled domain names belonging to the malicious campaign for training purposes, and the reputation engine would then generalize to assign a low reputation score to the remaining (previously unknown) domain names that belong to the same malicious campaign.

3.3.4 Building the Reputation Function

Once we build the model of well-known network profiles (see Section 3.3.2) and the domain clusters (see Section 3.3.3), we can build the reputation function. The reputation function will assign a reputation score in the interval [0, 1] to domain names, with 0 meaning low reputation (i.e., likely malicious) and 1 meaning high reputation (i.e., likely legitimate). We implement our reputation function as a statistical classifier. In order to train the reputation function, we consider all the domain
names d_i, i = 1..n, in Notos' knowledge base, and we feed each domain d_i to the network profiles module and to the domain clusters module to compute two output vectors NM(d_i) and DC(d_i), respectively. We explain the details of how NM(d_i) and DC(d_i) are computed later in Section 3.3.5. For now it is sufficient to consider NM(d_i) and DC(d_i) as two feature vectors. For each d_i we also compute an evidence features vector EV(d_i), as described in Section 3.2.3. Let v(d_i) be a feature vector that combines the NM(d_i), DC(d_i), and EV(d_i) feature vectors. We train the reputation function using the labeled dataset L = {(v(d_i), y_i)}_{i=1..n}, where y_i = 0 if d_i is a known malicious domain name, otherwise y_i = 1.
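In code, the training step amounts to concatenating the three module outputs and fitting a binary classifier on L. The sketch below uses scikit-learn's gradient boosting as a rough, illustrative stand-in for the LogitBoost-based decision tree (LAD) that Section 5.3 reports using.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_reputation_function(domains, NM, DC, EV, is_malicious):
    """Fit the reputation function on v(d_i) = NM(d_i) ++ DC(d_i) ++ EV(d_i).

    NM, DC, EV map a domain to its module output vector; is_malicious gives
    the knowledge-base label. y_i = 0 for known malicious, 1 otherwise.
    """
    X = np.array([np.concatenate([NM(d), DC(d), EV(d)]) for d in domains])
    y = np.array([0 if is_malicious(d) else 1 for d in domains])
    return GradientBoostingClassifier().fit(X, y)
```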
3.3.5 On-Line Mode

After training is complete, the reputation engine can be used in on-line mode (Figure 3) to assign a reputation score to new domain names. For example, given an input domain name d, the reputation engine computes a score S ∈ [0, 1]. Values of S close to zero mean that d appears to be related to malicious activities and therefore has a low reputation. On the other hand, values of S close to one signify that d appears to be associated with benign Internet services, and therefore has a high reputation. The reputation score is computed as follows. First, d is fed into the network profiles module, which consists of five statistical classifiers, as discussed in Section 3.3.2. The output of the network profiles module is a vector NM(d) = {c_1, c_2, ..., c_5}, where c_1 is the output of the first classifier, and can be viewed as the probability that d belongs to the class of Popular Domains, c_2 is the probability that d belongs to the class of Common Domains, etc. At the same time, d is fed into the domain clusters module, which computes a vector DC(d) = {l_1, l_2, ..., l_5}. The elements l_i of this vector are computed as follows. Given d, we first extract its network-based features and identify the network-based cluster closest to d, among the network-based clusters computed by the domain clusters module during the off-line mode (see Section 3.3.3). Then, we extract the zone-based statistical features and identify the zone-based cluster closest to d. Let this closest domain cluster be C_d. At this point, we consider all the zone-based feature vectors v_j ∈ C_d, and we select the subset of vectors V_d ⊆ C_d for which the two following conditions are verified: i) dist(z_d, v_j) < R, where z_d is the zone-based feature vector for d, and R is a predefined radius; ii) v_j ∈ KNN(z_d), where KNN(z_d) is the set of k nearest neighbors of z_d.

The feature vectors in V_d are related to domain names extracted from Notos' knowledge base. Therefore, we can assign a label to each vector v_i ∈ V_d, according to the nature of the domain name from which v_i was computed. The domains in Notos' knowledge base belong to different classes. In particular, we distinguish between eight different classes of domains, namely Popular Domains, Common Domains, Akamai, CDN, and Dynamic DNS, which have the same meaning as explained in Section 3.3.2, and Spam Domains, Flux Domains, and Malware Domains.

In order to compute the output vector DC(d), we compute the following five statistical features: the majority class label L (e.g., L may be equal to Malware Domain), i.e., the label that appears most often among the vectors v_i ∈ V_d; the standard deviation of the label frequencies, i.e., given the occurrence frequency of each label among the vectors v_i ∈ V_d, we compute their standard deviation; and, given the subset V_d^(L) ⊆ V_d of vectors in V_d that are associated with label L, the mean, median and standard deviation of the distribution of distances between z_d and the vectors v_j ∈ V_d^(L).
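A sketch of these five statistics, assuming dist is the distance function over zone-based feature vectors and using the R and k values later chosen in Section 5.2:

```python
import statistics
from collections import Counter

def dc_statistics(z_d, cluster_vectors, dist, R=100, k=50):
    """The five DC(d) statistics from the closest zone-based cluster.

    cluster_vectors: (vector, label) pairs for the labeled domains in C_d.
    """
    # Conditions i) and ii): within radius R and among the k nearest neighbors.
    nearest = sorted(cluster_vectors, key=lambda vl: dist(z_d, vl[0]))[:k]
    V_d = [(v, lab) for v, lab in nearest if dist(z_d, v) < R]
    if not V_d:
        return None  # no labeled evidence near d in this cluster
    freqs = Counter(lab for _, lab in V_d)
    majority, _ = freqs.most_common(1)[0]          # majority class label L
    label_sd = statistics.pstdev(freqs.values())   # spread of label frequencies
    d_L = [dist(z_d, v) for v, lab in V_d if lab == majority]
    return [majority, label_sd,                    # L would be numerically encoded
            statistics.mean(d_L), statistics.median(d_L), statistics.pstdev(d_L)]
```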
Given a domain d, once we compute the vectors NM(d) and DC(d) as explained above, we also compute the evidence vector EV(d) as explained in Section 3.2.3. At this point, we concatenate these three feature vectors into a sixteen-dimensional feature vector v(d), and we feed v(d) as input to our trained reputation function (see Section 3.3.4). The reputation function computes a score S = 1 − f(d), where f(d) can be interpreted as the probability that d is a malicious domain name. S varies in the [0, 1] interval, and the lower the value of S, the lower d's reputation.
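Putting the on-line mode together (illustrative only, matching the training sketch above where class 0 denotes malicious):

```python
import numpy as np

def reputation_score(d, NM, DC, EV, f):
    """On-line mode: S = 1 - f(v(d)); low S means low reputation.

    f is the trained reputation function from the earlier sketch, so the
    probability column for class 0 plays the role of f(d).
    """
    v = np.concatenate([NM(d), DC(d), EV(d)]).reshape(1, -1)
    p_malicious = f.predict_proba(v)[0, 0]  # P(y = 0), i.e., f(d)
    return 1.0 - p_malicious
```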
4 Data Collection and Analysis

This section summarizes observations from passive DNS measurements, and how professional, legitimate DNS services are distinguished from malicious services. These observations provided the ground truth for our dynamic domain name reputation system. We also provide an intuitive example to illustrate these properties, using a few major Internet zones like Akamai and Google.

4.1 Data Collection

The basic building block for our dynamic reputation rating system is the historical or "passive" information from successful A-type DNS resolutions. We use the DNS traffic from two ISP-based sensors, one located on the US east coast (Atlanta) and one located on the US west coast (San Jose). Additionally, we use the aggregated DNS traffic from the different networks covered by the SIE [3]. In total, our database collected 27,377,461 unique resolutions from all these sources over a period of 68 days, from the 19th of July 2009 to the 24th of September 2009.

Simple measurements performed on this large data set demonstrate a few important properties leveraged by our selected features. After just a few days, the rate of new, unique pDNS entries leveled off.
Figure 7. Various RR growth trends observed in the pDNS DB over a period of 68 days: (a) unique RRs at the two ISP sensors (per day); (b) new RR growth in the pDNS DB for all zones; (c) Akamai class growth over time (days); (d) CDN class growth over time (days); (e) Popular class growth over time (days); (f) dynamic DNS class growth over time (days); (g) Common class growth over time (days); (h) CDF of RR growth for all classes.
The graph in Figure 7(b) shows only about 100,000 to 150,000 new domains/day (with a brief outage issue on the 53rd day), despite very large numbers of RRs arriving each day (shown in Figure 7(a)). This suggests that most RRs are duplicates: approximately after the first few days, 94.7% (on average) of the unique RRs observed daily at the sensor level are already recorded in the passive DNS database. Therefore, even a relatively small pDNS database may be used to deploy Notos. In Section 5, we measure the sensitivity of our system to traffic collected from smaller networks.

The remaining plots in Figure 7 show the daily growth of our passive DNS database, from the point of view of five different zone classes. Figures 7(c) and (d) show the growth rate associated with CDN networks (Akamai, and all other CDNs). The number of unique IPs stays nearly constant with the number of unique domains (meaning that each new RR is a new IP and a new child domain of the CDN). In a few weeks, most of the IPs became known, suggesting that one can fully map CDNs with a modest training set. This is because CDNs, although large, always have a fixed number of IP addresses used for hosting their high-availability services. Intuitively, we believe this would not be the case with malicious CDNs (e.g., flux networks), which use randomly spreading infections to continually recruit new IPs.

The ratio of new IPs to domains diverges in Figure 7(e), a plot of the rate of newly discovered RRs for popular websites (e.g., Google, Facebook). Facebook notably uses unique child domains for their Web-based chat client, and other top Internet sites use similar strategies (encoding information in the domain, instead of the URI), which explains the growth in domains shown in Figure 7(e). These popular sites use a very small number of IPs, however, and after a few weeks of training our pDNS database identified all of them. Since these popular domains make up a large portion of traffic in any trace, our intuition is that simple whitelisting would significantly reduce the workload of a classifier.

Figure 7(f) shows the rate of pDNS growth for zones in dynamic DNS providers. These services, sometimes used by botmasters, demonstrate a nearly matched ratio of new IPs to new domains. The data excludes non-routable answers (e.g., dynamic DNS domains pointing to 127.0.0.1), since these contain no unique network information. Intuitively, one can think of dynamic DNS as a nearly complete bijection of domains to IPs. Figure 7(g) shows the growth of RRs for the alexa.com top 100 domains. Unlike dynamic DNS domains, these point to a small set of unique addresses, and most can be identified in a few weeks' worth of training.

A comparison of all the zone classes appears in Figure 7(h), which shows the cumulative distribution of the unique RRs detailed in Figures 7(c) through (g). The different rates of change illustrate how each zone class has a distinct pattern of RR use: some have a small IP space and highly variable domain names; some pair nearly every new domain with a new IP. Learning approximately 90% of all the unique RRs in each zone class, however, only requires (at most) tens of thousands of distinct RRs. The intuition from this plot is that, despite the very large
data set we used in our study, Notos could potentially work with data observed from much smaller networks.

4.2 Building The Ground Truth

To establish ground truth, we use two different labeling processes. First, we assigned labels to RRs at the time of their discovery. This provided an initial static label for many domains. Blacklists, of course, are never complete and always dynamic. So our second labeling process took place during evaluation, and monitored several well-known domain blacklists and whitelists.

The data we used for labeling came from several sources. Our primary source of blacklisting came from services such as malwaredomainlist.com and malwaredomains.com. In order to label IP addresses in our pDNS database we also used the Spamhaus Block List (SBL) from Spamhaus [18]. Such IPs are either known to send spam or distribute malware. We also collected domain name and IP blacklisting information from the Zeus tracker [30]. All this blacklisting information was gathered before the first day of August 2009 (during the 15 days in which we collected passive DNS data). Since blacklists traditionally lag behind the active threat, we continued to collect all new data until the end of our experiments.

Our limited whitelisting was derived from the top 500 alexa.com domain names, as of the 1st of August 2009. We reasoned that, although some malicious domains become popular, they do not stay popular (because of remediation), and never break into the top tier of domain rankings. Likewise, we used a list of the 18 most common 2LDs from various CDNs, which composed the main corpus of our CDN-labeled RRs. Finally, a list of 464 dynamic DNS second-level domains allowed us to identify and label domain names and IPs coming from zones under dynamic DNS providers. We label our evaluation (or testing) dataset by aggregating updated blacklist information for new malicious domain names and IPs from the same lists.
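In the spirit of the above, the initial static labeling reduces to set lookups. The sketch reuses zone_suffix from the earlier pDNS sketch, and the input sets are placeholders for the feeds named above.

```python
def initial_labels(pdns_rrs, bl_domains, bl_ips, alexa_top500, cdn_2lds, dyn_2lds):
    """Assign static labels to (domain, ip) RRs where any source applies."""
    labels = {}
    for dom, ip in pdns_rrs:
        twold = zone_suffix(dom, 2)
        if dom in bl_domains or ip in bl_ips:
            labels[(dom, ip)] = "malicious"
        elif twold in alexa_top500:
            labels[(dom, ip)] = "popular"
        elif twold in cdn_2lds:
            labels[(dom, ip)] = "CDN"
        elif twold in dyn_2lds:
            labels[(dom, ip)] = "dynamic-DNS"
    return labels
```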
To compute the honeypot features (presented in Section 3.2.3) we need a malware analysis infrastructure that can process as many "new" malware samples as possible. Our honeypot infrastructure is similar to "Ether" [4] and is capable of processing malware samples in a queue. Every malware sample was analyzed in a controlled environment for a time period of five minutes. This process was repeated during the last 15 days of July 2009. After 15 days of executions we obtained the set of successful DNS resolutions (domain names and IPs) that each malware sample looked up. We chose to execute malware and collect DNS evidence over the same period of time in which we aggregated the passive DNS database. Our virtual machines are equipped with five popular commercial anti-virus engines. If one of the engines identifies an executable as malicious, we capture all domain names and the corresponding IP mappings that the malware used during execution. After excluding all domain names that belong to the top 500 most popular alexa.com zones, we assemble the main corpus of our "honeypot data". We automated the crawling and collection of blacklist information and honeypot execution.
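The honeypot data then reduces to (sample, domain, IP) triples, schematically as below; the five-minute sandbox execution itself is outside this sketch, and the inputs are hypothetical.

```python
def honeypot_contacts(dns_logs, av_flagged, popular_2lds):
    """Keep the resolutions of AV-flagged samples, minus top-Alexa zones.

    dns_logs:   (sample_id, domain, ip) triples captured during execution
    av_flagged: set of sample_ids flagged by at least one AV engine
    """
    return [(s, dom, ip) for s, dom, ip in dns_logs
            if s in av_flagged and zone_suffix(dom, 2) not in popular_2lds]
```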
The reader should note that we chose to label our data in as transparent a way as possible. We used public blacklisting information to label our training dataset before building our models and training the reputation function. Then we assigned the reputation scores and validated the results again using the same publicly available blacklist sources. It is safe to assume that private IP and DNS blacklists will contain significantly more complete information, with lower FP rates, than the public blacklists. By using this type of private blacklist, the accuracy of Notos' reputation function should improve significantly.

5 Results

In this section, we present the experimental results of our evaluation. We show that Notos can identify malicious domain names sooner than public blacklists, with a low false positive rate (FP%) of 0.38% and a high true positive rate (TP%) of 96.8%. As a first step, we computed vectors based on the statistical features (described in Section 3.2) from 250,000 unique RRs. This volume corresponds to the average volume of new, previously unseen RRs observed at two recursive DNS servers in a major ISP in one day, as noted in Section 4, Figure 7(b). These vectors were computed based on historic passive DNS information from the last two weeks of DNS traffic observed on the same two ISP recursive resolvers in Atlanta and San Jose.

5.1 Accuracy of Network Profile Modeling

The accuracy of the Meta-Classification system (Figure 4(a)) in the network profile module is critical for the overall performance of Notos. This is because, in the on-line mode, Notos will receive unlabeled vectors which must be classified and correlated with what is already present in our knowledge base. For example, if the classifier receives a new RR and assigns to it the label Akamai with very high confidence, that implies the RR which produced this vector will be part of a network similar to Akamai. However, this does not necessarily mean that it is part of the actual Akamai CDN. We will see in the next section how we can draw conclusions based on the proximity between labeled and unlabeled RRs within the same zone-based clusters. Furthermore, we discuss the accuracy of the Meta-Classifier when modeling each different network profile class (profile classes are described in Section 3.3.2).
Figure 8. ROC curves for all network profile classes show the Meta-Classifier's accuracy.
Figure 9. The ROC curve of the reputation function, indicating the high accuracy of Notos.
Our Meta-Classifier consists of five different classifiers, one for each class of domains we model. We chose to use a Meta-Classification system instead of a traditional single-classification approach because Meta-Classification systems typically perform better than a single statistical classifier [11, 2]. Throughout our experiments this also proved to be true. The ROC curve in Figure 8 shows that the Meta-Classifier can accurately classify RRs for all the different network profile classes.

The training dataset for the Meta-Classifier is composed of sets of 2,000 vectors from each of the five network profile classes. The evaluation dataset is composed of 10,000 vectors, 2,000 from each of the five network profile classes. The classification results for the domains in the Akamai, CDN, dynamic DNS and Popular classes showed that the supervised learning process in Notos is accurate, with the exception of a small number of false positives related to the Common class (3.8%). After manually analyzing these false positives, we concluded that some level of confusion between the vectors produced by dynamic DNS domain names and the vectors produced by domain names in the Common class still remains. However, this minor misclassification between network profiles does not significantly affect the reputation function. This is because the zone profiles of the Common and dynamic DNS domain names are significantly different. This difference in the zone profiles will drive the network-based and zone-based clustering steps to group the RRs from the dynamic DNS class and the Common class into different zone-based clusters.

Although the network profile modeling process provides accurate results, this step cannot independently designate a domain as benign or malicious. The clustering steps assist Notos in grouping vectors not only based on their network profiles but also based on their zone properties. In the following section we show how the network and zone profile clustering modules can better associate similar vectors, due to properties of their domain name structure.

5.2 Network and Zone-Based Clustering Results

In the domain name clustering process (Section 3.3.3, Figure 4(b)) we used X-Means clustering in series, once for the network-based clustering and again for the zone-based clustering. In both steps we set the minimum and maximum number of clusters to one and to the total number of vectors in our dataset, respectively. We ran these two steps using different numbers of zone and network vectors. Figure 11 shows that after the first 100,000 vectors are used, the number of network and zone clusters remains fairly stable. This means that by computing at least 100,000 network and zone vectors (using a 15-day-old passive DNS database) we can obtain a stable population of zone- and network-based clusters for the monitored network. We should note that reaching this network and cluster equilibrium does not imply that we do not expect to see any new types of domain names at the ISP's DNS recursive. It just denotes that, based on the RRs present in our passive DNS database and the daily traffic at the ISP's recursive, 100,000 vectors are enough to reflect the major network profile trends in the monitored networks. Figure 11 indicates that a sample set of 100,000 vectors may represent the major trends in a DNS sensor. It is hard to safely estimate the exact minimum number of unique RRs that is sufficient to identify all major DNS trends. An answer to this should be based upon the type, size and utilization of the monitored network. Without data from smaller corporate networks it is difficult for us to make a safe assessment about the minimum number of RRs necessary for reliably training Notos.
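Scikit-learn has no X-Means implementation; a crude, illustrative approximation is to search the number of mixture components by BIC, which captures the spirit (though not the local split tests) of X-Means.

```python
from sklearn.mixture import GaussianMixture

def pick_clusters_bic(X, k_min=1, k_max=30):
    """Choose the number of clusters by BIC, in the spirit of X-Means."""
    best_bic, best_model = float("inf"), None
    for k in range(k_min, k_max + 1):
        gm = GaussianMixture(n_components=k, covariance_type="diag",
                             random_state=0).fit(X)
        if gm.bic(X) < best_bic:
            best_bic, best_model = gm.bic(X), gm
    return best_model.predict(X)
```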
The evaluation dataset we used consisted of 250,000 unique domain names and IPs. The cluster overview is shown in Figure 10.
Figure 11. Growth of the 1st-level (network-based) and 2nd-level (zone-based) clusters as a function of the number of vectors used.
In the following paragraphs we discuss some interesting observations that can be made from these network-based and zone-based cluster assignments. As an example, network clusters 0 and 1 are predominantly composed of zones participating in fraudulent activities like spam campaigns (yellow) and malware dropping or C&C zones (red). On the other hand, network clusters 2 to 5 contain Akamai, dynamic DNS, and popular zones like Google, all labeled as benign (green). We included the unlabeled vectors (blue) based on which we evaluated the accuracy of our reputation function. We have a sample of unlabeled vectors in almost all network and zone clusters. We will see how already-labeled vectors assist us in characterizing the unlabeled vectors in close proximity.

Before we describe two sample cases of dynamic characterization within zone-based clusters, we need to discuss our radius R and k value selection (see Section 3.3.5). In Section 3.3.5, we discussed how we build domain name clusters. At that point we introduced the dynamic characterization process that gives Notos the ability to utilize already-labeled vectors in order to characterize a newly obtained unlabeled vector by leveraging our prior knowledge. After looking into the distribution of Euclidean distances between unlabeled and labeled vectors within the same zone clusters, we concluded that in the majority of these cases the distances were between 0 and 1000. We tested different values of the radius R and of the value k for the k-nearest-neighbors (KNN) algorithm. We observed that the experiments with radius values between 50 and 200 provided the most accurate reputation rating results, which we describe in the following sections. We also observed that for k > 25 the accuracy of the reputation function is not affected, for all radius values between 50 and 200. Based on the results of these pilot experiments, we decided to set k equal to 50 and the radius distance equal to 100.
Figures 12 and 13 show the effect of this radius selection on two different types of clustering problems. In Figure 12, unknown RRs for akamaitech.net are clustered with a labeled vector akamai.net. As noted in Section 4, CDNs such as Akamai tended to have new domain names with each RR, but also to reuse their IPs. By training with only a small set of labeled akamai.net RRs, our classifier put the new, unknown RRs for akamaitech.net into the existing Akamai class. IP-specific features therefore brought the new RRs close to the existing labeled class. Figure 12 compresses all of the dimensions into a two-dimensional plot (for easier visual representation), but it is clear the unknown RRs were all within a distance of 100 of the labeled set.

This result validates the design used in Section 4, where just a few weeks' worth of labeled data was necessary for training. Thus, one does not have to exhaustively discover all whitelisted domains. Notos is resilient to changes in the zone classes we selected. Services like CDNs and major web sites can add new IPs or adjust domain formats, and these will be automatically associated with a known labeled class.

The ability of Notos to associate new RRs based on limited labeled inputs is demonstrated again in Figure 13. In this case, labeled Zeus domains (approximately 2,900 RRs from three different Zeus-related BLs) were used to classify new RRs. Figure 13 plots the distance between the labeled Zeus-related RRs and new (previously unknown) RRs that are also related to Zeus botnets. As we can see from Figure 13, most of the new (unlabeled) Zeus RRs lay very
Figure 12. An example of characterizing the akamaitech.net unknown vectors as benign based on the already labeled vectors (akamai.net) present in the same cluster.
Figure 13. An example of how the Zeus botnet clusters during our experiments. All vectors are in the same network cluster and in two different zone clusters.
close, and often even overlap, to known Zeus RRs. This is a good result, because Zeus botnets are notoriously hard to track, given the botnet's extreme agility. Tracking systems such as zeustracker.abuse.ch and malwaredomainlist.com have limited visibility into the botnet, and often produce disjoint blacklists. Notos addresses this problem by leveraging a limited amount of training data to correctly classify new RRs. During our evaluation, Notos correctly detected 685 new (previously unknown) Zeus RRs.

5.3 Accuracy of the Reputation Function

The first thing that we address in this section is our decision to use a Decision Tree with the Logit-Boost strategy (LAD) as the reputation function. Our decision is motivated by the time complexity, the detection results, and the precision (true positives over all positives) of the classifier. We compared the LAD classifier to several other statistical classifiers using a typical model selection procedure [6]. LAD was found to provide the most accurate results in the shortest training time for building the reputation function. As we can see from the ROC curve in Figure 9, the LAD classifier exhibits a low false positive rate (FP%) of 0.38% and a true positive rate (TP%) of 96.8%. It is worth noting that these results were obtained using 10-fold cross-validation, and the detection threshold was set to 0.5. The dataset used for the evaluation contained 10,719 RRs related to 9,530 known bad domains. The list of known good domains consisted of the top 500 most popular domains according to Alexa.

We also benchmarked the reputation function on two other datasets containing a larger number of known good domain names. We experimented with both the top 10,000 and the top 100,000 Alexa domain names. The detection results for these experiments are as follows. When using the top 10,000 Alexa domains, we obtained a true positive rate of 93.6% and a false positive rate of 0.4% (again using 10-fold cross-validation and a detection threshold equal to 0.5). As we can see, these results are not very different from the ones we obtained using only the top 500 Alexa domains. However, when we extended our list of known good domains to include the top 100,000 Alexa domain names, we observed a significant decrease of the true positive rate and an increase in the false positives. Specifically, we obtained a TP% of 80.6% and an FP% of 0.6%. We believe this degradation in accuracy may be due to the fact that the top 100,000 Alexa domains include not only professionally run domains and network infrastructures, but also less good domain names, such as file-sharing and porn-related websites, etc., most of which are not run in a professional way and have disputable reputations.¹

We also wanted to evaluate how well Notos performs compared to static blacklists. To this end, we performed a number of experiments as follows. Given an instance of Notos trained with data collected up to July 31, 2009, we fed Notos with 250,000 distinct RRs found in DNS traffic we collected on August 1, 2009. We then computed the reputation score for each of these RRs. First, we set the detection threshold to 0.5, and with this threshold we identified 54,790 RRs that had a low reputation (lower than the threshold). These RRs were

¹ A quick analysis of the top 100,000 Alexa domains reported that about 5% of the domains appeared in the SURBL (www.surbl.org) blacklist at a certain point in time. A more rigorous evaluation of these results is left to future work.
Figure. New malicious RRs identified over the days after training: (a) overall volume of malicious RRs; (b) flux and spam RRs identified; (c) malware/trojan, exploit, and rogue AV RRs identified; (d) botnet RRs identified.

domain names with very little historic (passive DNS) information. Sufficient time and a relatively large passive DNS collection are required to create an accurate passive DNS database. Therefore, if an attacker always buys new domain names and new address space, and never reuses either resource for any other malicious purposes, Notos will not be able to accurately assign a reputation score to the new domains. In the IPv4 space, this is very unlikely to happen due to the impending exhaustion of the available address space. Once IPv6 becomes the predominant protocol, however, this may represent a prob-