$$\sum_{r_k \in S} \frac{1}{p_k},$$
where $r_k$ is a rule in $S$ and $p_k$ is the $p$ value of rule $r_k$. The reciprocal of $p_k$ reflects a surprise factor that is large when an anomaly has a low likelihood (small $p_k$).
The $p$ estimate is an aggregate over a stationary training period; however, recent events can greatly influence current events. Bursty network traffic or OS activities are common, and in intrusion detection we observe that attacks cause bursty behavior as well. In order to incorporate recent novel events into our scoring mechanism, we introduce $t_k$, the duration since the last novel value was observed in the consequent of anomaly rule $r_k$ (or since $r_k$ was satisfied). The smaller $t_k$ is, the higher the likelihood that we will see another novel value; intuitively, we are less surprised if we have observed a novel value in the more recent past. Hence, we calculate the anomaly score as:
$$\mathrm{AnomalyScore}(x) = \sum_{r_k \in S} \frac{t_k}{p_k}. \qquad (3)$$
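To make the scoring concrete, the following minimal Python sketch computes Equation (3) for an event represented as an attribute dictionary. The Rule structure and its fields are illustrative assumptions for exposition, not the actual LERAD implementation:

    from dataclasses import dataclass

    @dataclass
    class Rule:
        """A hypothetical LERAD-style anomaly rule (illustrative fields)."""
        antecedent: dict       # attribute -> required value
        consequent_attr: str   # attribute constrained by the rule
        allowed_values: set    # consequent values observed in training
        p: float               # p_k: estimated probability of a novel value
        t: float               # t_k: time since last novel consequent value

    def matches(rule, event):
        """True if the event satisfies the rule's antecedent."""
        return all(event.get(a) == v for a, v in rule.antecedent.items())

    def anomaly_score(event, rules):
        """Equation (3): sum t_k / p_k over rules whose antecedent holds
        but whose consequent value is novel."""
        return sum(r.t / r.p
                   for r in rules
                   if matches(r, event)
                   and event.get(r.consequent_attr) not in r.allowed_values)

A rule contributes to the score only when its antecedent holds and the event's consequent value was never seen in training, matching the semantics of the anomaly rules described above.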
3.2 Summary of Current Results
To evaluate LERAD, we use network traffic recorded in tcpdump format provided by the 1999 DARPA evaluation [20, 16]. Week 3 was used for training ($D$) and Weeks 4 and 5 ($D_E$) were used for testing. The size of the validation set ($|D_V|$) was set to 10% of the training set ($D$). LERAD was run five times, each with a different random seed. Attributes used in our data sets include values from the protocol header and application payload. LERAD is evaluated based on the number of detected attacks with at most 10 false alarms per day.
In our experiments the resulting rule set usually contains 50 to 75 rules. Though the rule set is relatively small, LERAD, on average, detects about 117 of 201 attacks with at most 10 false alarms per day. Under a blind evaluation (the test set was not available a priori), the original DARPA participant with the most detections detected 85 attacks [20]. This indicates that LERAD is quite successful in finding highly predictive normal patterns. More importantly, LERAD detects about 58% of the attacks poorly detected by the original participants [20]; that is, LERAD increases the overall coverage of detectable attacks. The total computational overhead is about 30 minutes for three weeks of training and test data. Much of the overhead is in preprocessing the raw data to generate feature values for training and testing; training and testing themselves take less than two minutes. We also analyzed and categorized why our detected anomalies were successful in detecting attacks. The more common categories (covering about 70% of the detected attacks) are unexpected user behavior (e.g., unusual client addresses for servers) and learned (partial) attack signatures (e.g., unusual input that exploits bugs in software). Details of our findings are described in [21]. Furthermore, details of the results of our earlier and simpler algorithms, PHAD and ALAD, are in [22]. Our PHAD algorithm was adapted to detect attacks by modeling accesses to the Registry in the Windows OS [4].
4 Clustering for Anomaly Detection (CLAD)
LERAD assumes the training data are free of attacks; however, making sure the data are clean can be time consuming. We propose a clustering approach that identifies outliers as anomalous. Our clustering method, CLAD, is inspired by the work of [11, 29] and is related to k-NN. CLAD locates anomalies by finding local and global outliers with some restrictions, whereas k-NN and LOF [6] concentrate mainly on local outliers. One key difference between CLAD and other clustering algorithms is that clusters are of fixed width (radius) and are allowed to overlap (i.e., the clusters are not mutually exclusive). This difference permits CLAD to process large amounts of data efficiently.
CLAD has two phases: Phase 1 creates the clusters and Phase 2 assigns data points to additional clusters. Fig. 1 illustrates the steps of the two phases. Given a dataset, $D$, Phase 1 creates clusters of fixed width, $W$ (discussed later), and assigns data points, $d \in D$, to the created clusters. If a data point is farther away than width $W$ from every existing cluster, the data point becomes the centroid of a new cluster; otherwise it is assigned to all existing clusters that are not farther away than $W$. Since data points in Phase 1 can only be assigned to existing clusters, some data points might miss assignment to clusters that are subsequently created; Phase 2 assigns these data points to those additional clusters. Up to this point our CLAD algorithm is essentially the clustering algorithm proposed in [11, 29]; however, the methods diverge significantly on how data points are represented for calculating distance, how the cluster width is determined, and how the properties of outliers are decided.
Input: Dataset D
Output: Set of clusters C
1. initialize the set of clusters, C, to the empty set
Phase 1: Creating clusters
2. for each d ∈ D
3.   for each c ∈ C
4.     if distance(d, c) ≤ W, assign d to c
5.   if d is not assigned
6.     create a new cluster c′ with centroid d and add c′ to C
Phase 2: Assigning data points to additional clusters
7. for each d ∈ D
8.   for each c ∈ C
9.     if distance(d, c) ≤ W and d is not assigned to c
10.      assign d to c

Figure 1: Overall CLAD Algorithm
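The following Python sketch is one possible rendering of Figure 1, assuming data points are numeric feature vectors (as produced in Section 4.1); it is an illustration of the two-phase procedure, not the original implementation:

    import math

    def distance(y1, y2):
        """Euclidean distance between two feature vectors (Equation 4)."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)))

    def clad_cluster(points, width):
        """Fixed-width, overlapping clustering from Figure 1.
        Each cluster is a pair (centroid, set of member indices)."""
        clusters = []
        # Phase 1: create clusters and make initial assignments.
        for i, d in enumerate(points):
            assigned = False
            for centroid, members in clusters:
                if distance(d, centroid) <= width:
                    members.add(i)
                    assigned = True
            if not assigned:
                clusters.append((d, {i}))  # d becomes a new centroid
        # Phase 2: assign points to clusters created after they were seen.
        for i, d in enumerate(points):
            for centroid, members in clusters:
                if i not in members and distance(d, centroid) <= width:
                    members.add(i)
        return clusters

Because a point within $W$ of several centroids joins all of them, clusters overlap, and each point is visited only twice, which is what lets CLAD process large amounts of data efficiently.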
4.1 Feature Vectors and Distance Function
Each data point, d, is represented by a feature vector, and a cluster, c, is represented by its centroid, which
is a data point. We use the Euclidean distance as our distance function:
$$\mathrm{distance}(Y_1, Y_2) = \sqrt{\sum_{j=1}^{|Y_1|} (Y_{1j} - Y_{2j})^2}, \qquad (4)$$
where $Y_1$ and $Y_2$ are two feature vectors, $Y_{ij}$ denotes the $j$th component of $Y_i$, and $|Y_i|$ denotes the length of vector $Y_i$.
To obtain a feature vector for a data point, we transform the data points represented in the input attribute vectors ($X_i$) into our feature vectors ($Y_i$). We have two types of transformation, depending on whether the input attribute is continuous or discrete. Discrete attributes are usually problematic for distance functions. In anomaly detection, values that are observed more frequently are less likely to be anomalous, and we want distance to indicate the difference in the degree of normalcy (separating normal from abnormal behavior); hence, we represent a discrete value by its frequency. That is, discrete values of similar frequency are close to each other, but values of very different frequency are far apart. As a result, discrete attributes are transformed to continuous attributes.
In our domain continuous attributes, including those transformed from discrete attributes, usually exhibit a power-law distribution: smaller values are much more frequent than larger values. Distances involving the infrequent large values are large and drown out the distances involving only small values. To reduce this problem, we use a logarithmic scale. In addition, to discount variance among values, we quantize the values using the floor operation after taking the logarithm. Furthermore, in order to consider each attribute equally, the values of each attribute are normalized to the range [0,1]. Formally, an input attribute value, $X_{ij}$, is transformed to a feature value, $Y_{ij}$, as follows:
$$Y_{ij} = \mathrm{normalize}(\lfloor \ln(X_{ij} + 1) \rfloor), \qquad (5)$$
where $\mathrm{normalize}(v_j) = (v_j - Min_j)/(Max_j - Min_j)$, $v_j$ is a value from vector component $j$, and $Min_j$ ($Max_j$) is the minimum (maximum) value of component $j$. To avoid negative and undefined values (when $0 \le X_{ij} < 1$), we add 1 to $X_{ij}$ before taking $\ln$.
For normalization, we also considered using the number of standard deviations (SD) away from the average. However, power-law distributions are one-sided and heavy-tailed, so standard deviations are not very appropriate for our purpose; using SD for normalization resulted in noticeable degradation in performance in our experiments. Therefore, we revert to simple scaling as our means of normalization.
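A minimal sketch of this transformation in Python, assuming records are rows of attribute values, discrete attributes are identified by column index, and continuous inputs are nonnegative; the guard against constant columns is our addition:

    import math
    from collections import Counter

    def to_feature_matrix(records, discrete_attrs):
        """Transform input attribute vectors X into feature vectors Y:
        discrete values -> their frequencies, then floor(ln(x + 1)),
        then min-max scaling to [0, 1] per attribute (Equation 5)."""
        columns = [list(col) for col in zip(*records)]
        for j, col in enumerate(columns):
            if j in discrete_attrs:
                freq = Counter(col)               # value -> frequency
                col = [freq[v] for v in col]      # frequency encoding
            col = [math.floor(math.log(x + 1)) for x in col]
            lo, hi = min(col), max(col)
            rng = (hi - lo) or 1                  # guard constant columns
            columns[j] = [(x - lo) / rng for x in col]
        return [list(row) for row in zip(*columns)]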
4.2 Cluster Width
The cluster width, $W$, specifies the local neighborhood of clusters that are considered close. In [29] the width is specified by the user; CLAD instead derives the width from the smallest distances between pairs of data points. To calculate the width efficiently, CLAD randomly draws a sample of size $s = 1\% \times |D|$ from the entire dataset, $D$, and calculates the pairwise distances. The bottom 1% of the pairwise distances (i.e., $1\% \times s(s-1)/2$ pairs) are considered the smallest, and their average is the cluster width. That is, CLAD samples pairwise distances and uses the average distance of the closest neighbors as $W$. Though CLAD has a fixed parameter of 1% for deriving $W$, this is much less ad hoc than asking the user to specify $W$ itself as a parameter. Our parameter is similar to specifying $k$ in k-NN methods, but ours is a relative percentage rather than an absolute count, which is conceptually easier to specify and understand.
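A sketch of this width derivation, reusing the distance function from the earlier clustering sketch; the sample and bottom fractions default to the 1% described above:

    import random

    def derive_width(points, sample_frac=0.01, bottom_frac=0.01):
        """Estimate W: sample about 1% of the data, compute all pairwise
        distances, and average the smallest 1% of them."""
        s = max(2, int(sample_frac * len(points)))
        sample = random.sample(points, s)
        dists = sorted(distance(a, b)
                       for i, a in enumerate(sample)
                       for b in sample[i + 1:])
        k = max(1, int(bottom_frac * len(dists)))
        return sum(dists[:k]) / k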
4.3 Density, Inter-cluster Distance, and Anomalies
To determine whether a cluster is an outlier, CLAD relies on two properties of a cluster: its density and its distance from the other clusters. Since each cluster has the same $W$ (and hence area), we define the density of cluster $c_i$ as the number of data points, $Count_i$, in $c_i$. For the distance from the other clusters, we calculate the average inter-cluster distance (ICD) between $c_i$ and the other clusters. Formally, we denote the ICD of cluster $c_i$ as $ICD_i$ and define it as:
$$ICD_i = \frac{1}{|C| - 1} \sum_{j=1, j \neq i}^{|C|} \mathrm{distance}(c_i, c_j), \qquad (6)$$
where $C$, as defined before, is the set of clusters.
Clusters that are distant and sparse are considered outliers and anomalous. A cluster $c_i$ is considered distant if $ICD_i$ is more than one standard deviation away from the average ICD. From our initial experiments, we observed that the distribution of $Count$ exhibits a power law; when we use the average and SD for $Count$, the average is very small and few or no clusters have $Count_i$ one SD smaller than the average. Hence, instead of the average we use the median: a cluster $c_i$ is considered sparse when $Count_i$ is more than one median absolute deviation (MAD) [15] smaller than the median $Count$. Interestingly, in our domain an attack can be composed of many data points (e.g., flooding attacks), and hence dense regions can be attacks as well; we discuss this issue further in the next section when we evaluate CLAD. Accordingly, we also define dense clusters, whose $Count_i$ is more than one MAD larger than the median $Count$. More formally, the set of distant clusters $C_{distant}$, sparse clusters $C_{sparse}$, and dense clusters $C_{dense}$ are defined as:
$$C_{distant} = \{c_i \in C \mid ICD_i > AVG(ICD) + SD(ICD)\}, \qquad (7)$$
$$C_{sparse} = \{c_i \in C \mid Count_i < Median(Count) - MAD(Count)\}, \qquad (8)$$
$$C_{dense} = \{c_i \in C \mid Count_i > Median(Count) + MAD(Count)\}, \qquad (9)$$
where $AVG$, $SD$, $Median$, and $MAD$ denote the average, standard deviation, median, and median absolute deviation, respectively. CLAD generates alerts for clusters that are sparse or dense and distant. Each cluster is represented by its centroid.
A sparse cluster/region is essentially a local outlier; it reflects how many neighbors are within $W$. This is similar to k-NN, which computes the distance to the closest $k$ neighbors, as discussed previously. Labeling a region distant is equivalent to saying that the region is a global outlier.
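The following sketch computes ICD (Equation 6) and the three outlier sets (Equations 7-9) from the clusters produced by the earlier clustering sketch; it assumes at least two clusters and uses the median and MAD for $Count$ as described above:

    import statistics

    def flag_outliers(clusters):
        """Label clusters distant, sparse, or dense (Equations 7-9).
        ICD_i is the average distance to the other centroids (Equation 6)."""
        centroids = [c for c, _ in clusters]
        counts = [len(m) for _, m in clusters]
        n = len(clusters)
        icd = [sum(distance(ci, cj)
                   for j, cj in enumerate(centroids) if j != i) / (n - 1)
               for i, ci in enumerate(centroids)]
        avg_icd, sd_icd = statistics.mean(icd), statistics.stdev(icd)
        med = statistics.median(counts)
        mad = statistics.median([abs(c - med) for c in counts])
        distant = {i for i in range(n) if icd[i] > avg_icd + sd_icd}
        sparse = {i for i in range(n) if counts[i] < med - mad}
        dense = {i for i in range(n) if counts[i] > med + mad}
        return distant, sparse, dense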
4.4 Summary of Current Results
As with the evaluation of LERAD, we use the same DARPA 99 dataset to evaluate CLAD. Connections are similarly reassembled, and the first 10 bytes of the application payload are included in the input data. Since CLAD, unlike LERAD, does not require an explicit training phase, we combine the normal training data (Weeks 1 and 3) with the test data (Weeks 4 and 5); the additional normal data also help reduce the unusually high rate of attacks in the test data.
Figure 2: Count and ICD of clusters for port 25 with CD (a) < 20%, (b) > 80%
Figure 3: Count and ICD of clusters for port 80 with CD (a) < 20%, (b) > 80%
To improve effectiveness and efficiency, CLAD learns a model for each port (application protocol). Ports that are rarely used (< 1% of the dataset) are lumped into one model: Other. Only clusters that are sparse or dense and distant trigger alerts. To make anomaly scores comparable across models, anomaly scores are normalized to the number of SDs away from the average ICD.
Density is not used in the anomaly score because it is not as reliable as ICD. This conclusion results from our analysis of how attacks are distributed over density and ICD on ports 25 and 80, which carry the most traffic. Since we do not have exact labels (attack or normal) for each data point, we rely on how DARPA/LL counts an alert as a detection of an attack [20]. We define the CD (counted as detection) of a cluster as the percentage of its data points that, when used to trigger an alert, are counted as a detection of an attack. This is an indirect, rough approximation of the likelihood of an attack being present in the cluster. We plot clusters with CD < 20% (unlikely anomalies) against Count and ICD in Fig. 2a, and similarly for CD > 80% (likely anomalies) in Fig. 2b. Both Count and ICD are in log scale. Comparing the two plots, we observe that the likely anomalies occur more often in regions with larger ICD, and the unlikely anomalies in regions with smaller ICD. The same observation cannot be made for Count; this is related to the fact that some attacks occur in dense clusters, as explained previously. For port 80 in Fig. 3, similar observations can be made. The figures also indicate that sparse or dense and distant clusters, which we use to trigger alerts, are likely to detect attacks. Furthermore, for port 80, 96% of the clusters have CD = 100% or CD < 9% (similarly for port 25). This indicates that most of the clusters are nearly homogeneous and hence our combination of feature vectors, distance function, and cluster width can sufficiently characterize the data.
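For illustration, a hypothetical helper for the CD measure; is_detection is an assumed per-point boolean list encoding whether an alert on that point would be counted as a detection under the DARPA/LL scoring procedure [20]:

    def cluster_cd(clusters, is_detection):
        """CD of each cluster: percentage of its member points that, if
        used to trigger an alert, would be counted as detecting an attack.
        is_detection is an assumed per-point boolean list derived from
        the DARPA/LL scoring procedure [20]."""
        return [100.0 * sum(is_detection[i] for i in members) / len(members)
                for _, members in clusters]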
Tbl. 3 shows the number of attacks detected by the model learned for each port, with at most 100 false alarms over the 10-day attack period in Weeks 4 and 5. The combined model detected 76 attacks after removing duplicate detections from the individual models. As mentioned previously, the original DARPA participant with the most detections detected 85 attacks [20], which was achieved by a hand-built signature detector; CLAD, in contrast, is an anomaly detector with no a priori knowledge of attacks. Compared to
Table 3: Number of detections by CLAD (duplicates are removed in Combined)

Port         20  21  23  25  53  79  80  110  111  143  Other  Combined
Detections    3  14  17  33   5   8  37    2    1    3     14        76
LERAD, CLAD detected fewer attacks, but CLAD is handicapped by not assuming the availability of attack-free training data. However, we seem to detect more attacks than similar techniques [11, 29] that make similar assumptions, though we cannot claim this conclusively since the datasets are different. Further experimentation would help reduce the uncertainty.
5 Concluding Remarks
We motivated the significance of a machine learning approach to anomaly detection and proposed two machine learning methods for constructing anomaly detectors. LERAD is a learning algorithm that characterizes normal behavior in logical rules. CLAD is a clustering algorithm that identifies outliers from normal clusters. We evaluated both methods with the DARPA 99 dataset and showed that our methods can detect more attacks than similar existing techniques.
LERAD and CLAD have different strengths and weaknesses; we would like to investigate how the strengths of one can benefit the other. Unlike CLAD, LERAD assumes the training data are free of attacks. This assumption can be relaxed by assigning scores to events that have been observed during training; these scores can be related to the estimated probability of observing the seen events. Unlike CLAD, LERAD is an offline algorithm; an online LERAD could update the random sample used in the rule generation phase with new data via a replacement strategy, and additional rules could be constructed that consider both new and old data. Unlike LERAD, CLAD does not aim to generate a concise model, which can affect efficiency during detection; we plan to explore merging similar clusters in a hierarchical manner and dynamically determining the appropriate number of clusters according to the L method [33]. Also unlike LERAD, CLAD does not explain alerts well; we plan to use the notion of a near miss to explain an alert by identifying centroids of normal clusters where few attributes contribute much of the distance between the alert and the normal centroid. We are also investigating extracting features from the payload, as well as applying our methods to host-based data. Lastly, we plan to evaluate our methods on data collected from live traffic on our main departmental server.
Acknowledgments
This research is partially supported by DARPA (F30602-00-1-0603).
References
[1] C. Aggarwal and P. Yu. Outlier detection for high dimensional data. In Proc. SIGMOD, 2001.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large
databases. In Proc. ACM SIGMOD Conf., pages 207-216, 1993.
[3] D. Anderson, T. Lunt, H. Javitz, A. Tamaru, and A. Valdes. Detecting unusual program behavior using
the statistical component of the next-generation intrusion detection expert system (NIDES). Technical
Report SRI-CSL-95-06, SRI, 1995.
[4] F. Apap, A. Honig, S. Hershkop, E. Eskin, and S. Stolfo. Detecting malicious software by monitoring
anomalous Windows registry accesses. In Proc. Fifth Intl. Symp. Recent Advances in Intrusion Detection
(RAID), 2002.
[5] D. Barbara, N. Wu, and S. Jajodia. Detecting novel network intrusions using Bayes estimators. In Proc.
SIAM Intl. Conf. Data Mining, 2001.
[6] M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proc.
SIGMOD, 2000.
[7] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261-285, 1989.
[8] Silicon Defense. SPADE, 2001. http://www.silicondefense.com/software/spice/.
[9] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130, 1997.
[10] R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, New York, NY, 1973.
[11] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised
anomaly detection: Detecting intrusions in unlabeled data. In D. Barbara and S. Jajodia, editors,
Applications of Data Mining in Computer Security. Kluwer, 2002.
[12] S. Forrest, S. Hofmeyr, and A. Somayaji. Computer immunology. Comm. ACM, 40(10):88-96, 1997.
[13] S. Forrest, S. Hofmeyr, A. Somayaji, and T. Longstaff. A sense of self for unix processes. In Proc. of
1996 IEEE Symp. on Security and Privacy, 1996.
[14] A. Ghosh, A. Schwartzbard, and M. Schatz. Learning program behavior profiles for intrusion detection.
In Proc. 1st USENIX Workshop on Intrusion Detection and Network Monitoring, 1999.
[15] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[16] K. Kendall. A database of computer attacks for the evaluation of intrusion detection systems. Master's thesis, EECS Dept., MIT, 1999.
[17] E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. VLDB,
1998.
[18] C. Kruegel, T. Toth, and E. Kirda. Service specific anomaly detection for network intrusion detection.
In Proc. ACM Symp. on Applied Computing, 2002.
[19] T. Lane and C. Brodley. Temporal sequence learning and data reduction for anomaly detection. ACM
Trans. Information and System Security, 1999.
[20] R. Lippmann, J. Haines, D. Fried, J. Korba, and K. Das. The 1999 DARPA off-line intrusion detection evaluation. Computer Networks, 34:579-595, 2000.
[21] M. Mahoney and P. Chan. Learning models of network traffic for detecting novel attacks. Technical Report CS-2002-08, Florida Inst. of Tech., Melbourne, FL, 2002. http://www.cs.fit.edu/~pkc/papers/cs-2002-08.pdf.
[22] M. Mahoney and P. Chan. Learning nonstationary models of normal network traffic for detecting novel attacks. In Proc. Eighth Intl. Conf. on Knowledge Discovery and Data Mining, pages 376-385, 2002.
[23] T. Mitchell. Machine Learning. McGraw Hill, 1997.
[24] P. Neumann and P. Porras. Experience with EMERALD to date. In Proc. 1st USENIX Workshop on
Intrusion Detection and Network Monitoring, pages 73-80, 1999.
[25] T. Niblett. Constructing decision trees in noisy domains. In Proc. 2nd European Working Session on Learning, pages 67-78, 1987.
[26] V. Paxson. Bro: A system for detecting network intruders in real-time. In Proc. 7th USENIX Security
Symp., 1998.
[27] V. Paxson and S. Floyd. Wide-area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3(3):226-244, 1995.
[28] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan
Kaufmann, 1988.
[29] L. Portnoy. Intrusion detection with unlabeled data using clustering. Undergraduate Thesis, Columbia
University, 2000.
[30] F. Provost and P. Domingos. Tree induction for probability-based rankings. Machine Learning, 2002.
[31] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets.
In Proc. SIGMOD, 2000.
[32] M. Roesch. Snort: Lightweight intrusion detection for networks. In USENIX LISA, 1999.
[33] S. Salvador and P. Chan. Learning states and rules for time-series anomaly detection. Technical Report
CS-2003-05, Florida Inst. of Tech., Melbourne, FL, 2003. http://www.cs.fit.edu/~pkc/papers/cs-2003-05.pdf.
[34] R. Sekar, M. Bendre, D. Dhurjati, and P. Bollineni. A fast automaton-based method for detecting
anomalous program behaviors. In Proc. IEEE Symp. Security and Privacy, 2001.
[35] K. Sequeira and M. Zaki. ADMIT: Anomaly-based data mining for intrusions. In Proc. KDD, 2002.
[36] S. Staniford, J. Hoagland, and J. McAlerney. Practical automated detection of stealthy portscans. J.
Computer Security, 2002.
[37] I. Witten and T. Bell. The zero-frequency problem: estimating the probabilities of novel events in
adaptive text compression. IEEE Trans. on Information Theory, 37(4):1085-1094, 1991.