Attribute Reduction for Effective Intrusion Detection⋆

Fernando Godínez¹, Dieter Hutter², and Raúl Monroy³

¹ Centre for Intelligent Systems, ITESM–Monterrey
  Eugenio Garza Sada 2501, Monterrey, 64849, Mexico
  [email protected]
² DFKI, Saarbrücken University
  Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany
  [email protected]
³ Department of Computer Science, ITESM–Estado de México
  Carr. Lago de Guadalupe, Km. 3.5, Estado de México, 52926, Mexico
  [email protected]
Abstract. Computer intrusion detection is concerned with identifying computer activities that may compromise the integrity, confidentiality or availability of an IT system. Anomaly Intrusion Detection Systems (IDSs) aim at distinguishing abnormal activity from ordinary activity. However, even at a moderate site, computer activity very quickly yields gigabytes of information, overwhelming current IDSs. To make anomaly intrusion detection feasible, this paper advocates the use of rough sets prior to the intrusion detector, in order to filter out redundant, spurious information. Using rough sets, we have been able to successfully identify pieces of information that succinctly characterise computer activity without missing chief details. The results are very promising: we were able to reduce the number of attributes by a factor of 3, resulting in a 66% data reduction. We have tested our approach using BSM log files borrowed from the DARPA repository.
1 Introduction
Computer intrusion detection is concerned with studying how to detect computer attacks. A computer attack, or attack for short, is any activity that jeopardises the integrity, confidentiality or availability of an IT system. An attack may be either physical or logical. There exist a number of Intrusion Detection Systems (IDSs); based on their detection scheme, they belong to either of two main categories: misuse-detection and anomaly-detection IDSs.

Misuse-detection IDSs aim to detect the appearance of the signature of a known attack in network traffic. While simple, misuse-detection IDSs do not scale up, both because they are useless against unknown attacks and because they are easily cheated by an attack spread over several sessions.

⋆ This research is supported by three research grants: CONACyT 33337-A, CONACyT-DLR J200.1442/2002 and ITESM CCEM-0302-05.
To get around this situation, anomaly-detection IDSs count on a characterisation of ordinary activity and use it to distinguish ordinary activity from abnormal activity [1]. However, normal and abnormal behaviour differ quite subtly and hence are difficult to tell apart. To compound the problem, there is an infinite number of instances of both normal and abnormal activity. Even at a moderate site, computer activity very quickly yields gigabytes of information, overwhelming current IDSs.
This paper aims to make anomaly-based intrusion detection feasible. It addresses the problem of dimensionality reduction using an attribute relevance analyser. We aim to filter out redundant, spurious information and significantly reduce the computer resources, both memory and CPU time, required to detect an attack. We work under the assumption that intrusion detection is approached at the level of execution of operating system calls, rather than network traffic, so our input data is noiseless and less subject to encrypted attacks. In our reduction experiments, we use BSM log files [2], borrowed from the DARPA repository [3].
A BSM log file consists of a sequence of system calls. Roughly, Sun Solaris involves about 240 different system calls, each of which takes a different number of attributes. The same system call may even take distinct attributes in separate entries. This diversity in the structure of a BSM log file gets in the way of a successful application of data mining techniques: entries cannot easily be standardised to hold the same number of attributes unless the extra attributes are given no value. This lack of information rules out the possibility of using classical data mining techniques, such as ID3 [4] or GINI [5]. Fortunately, rough sets [6] work well under diverse, incomplete information.
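To make the shape of the input concrete, the following sketch (ours, not the actual BSM parser) shows how heterogeneous system-call entries can be normalised into a fixed-width attribute table with explicit missing values, which is exactly the kind of incomplete table rough sets can handle. The attribute names are illustrative and taken from the attribute list reported later in the paper; the record contents are invented for the example.

```python
# Minimal sketch: normalise heterogeneous BSM-style entries into a fixed
# set of attribute columns, keeping attributes a call does not carry as
# explicit missing values (None) instead of dropping them.

from typing import Dict, List, Optional

# Illustrative attribute universe; the real BSM records expose 51 attributes.
ATTRIBUTES = ["System-Call", "Audit-ID", "Process-ID", "Access-Mode", "arg-value-1"]

def normalise(entry: Dict[str, str]) -> List[Optional[str]]:
    """Map one parsed log entry onto the full attribute list, padding with None."""
    return [entry.get(attr) for attr in ATTRIBUTES]

raw_entries = [
    {"System-Call": "open", "Audit-ID": "2104", "Process-ID": "183",
     "Access-Mode": "r", "arg-value-1": "/etc/passwd"},
    {"System-Call": "fork", "Audit-ID": "2104", "Process-ID": "183"},  # fewer attributes
]

table = [normalise(e) for e in raw_entries]
for row in table:
    print(row)
```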
Using rough sets, we have been able to successfully identify pieces of information that succinctly characterise computer activity without missing chief details. We have tested our approach using various BSM log files of the DARPA repository; more precisely, we used 8 days out of the 25 available from 1998 as our training data. The results we obtained show that we need less than a third of the 51 identified attributes to represent the log files with minimum loss of information.
This paper is organized as follows. Section 2 briefly overviews the importance of Intrusion Detection Systems and the problem of information management. Section 3 is a brief introduction to rough set theory, especially the parts used for the problem of dimensionality reduction. Section 4 describes the methodology of our experiments. Section 5 describes our experimental results. Finally, some conclusions are discussed in Section 6.
2 Information Management in Intrusion Detection Systems
An anomaly-based IDS relies on some kind of statistical profile abstracting out normal user behaviour. User actions may be observed at different levels, ranging from system commands to system calls. Any deviation from a behaviour profile is taken as an anomaly and therefore an intrusion. Building a reliable user profile requires a number of observations that can easily overwhelm any IDS [7], making it necessary to narrow the input data without losing important information. Narrowing the input data yields the additional benefit of alleviating intrusion detection. Current reduction methods, as shown below, eliminate important information, getting in the way of effective intrusion detection.
2.1 Attribute Reduction Methods
Log files are naturally represented as a table, a two-dimensional array, where rows stand for objects (in our case system calls) and columns for their attributes. These tables may be unnecessarily redundant. We call the problem of reducing the rows and the columns of a table object reduction and attribute reduction, respectively. To the best of the authors' knowledge, attempts to reduce the number of attributes prior to clustering have been few, even though this remains a major concern, as there may be unnecessary attributes in any given source of information.

Lane et al. have suggested an instance-based learning (IBL) technique for modelling user behaviour [8, 9] that works at the level of Unix user commands. Their technique relies upon an arbitrary reduction mechanism, which replaces all the attributes of a command with an integer representing the number of that command's attributes (e.g. cat /etc/password /etc/shadow > /home/mypasswords is replaced by cat <3>). According to [8, 9], this reduction mechanism narrows the alphabet by a factor of 14, but certainly at the cost of losing chief information, because the arguments cannot be discarded if a proper distinction between normal and abnormal behaviour is to be made. For example, copying the password files may in general denote abnormal activity. By contrast, our method keeps all these attributes, as they are the main discriminants between objects and thus an important source of information.
Knop et al. have suggested the use of correlation elimination to achieve attribute reduction [10]. Their mechanism uses a correlation coefficient matrix to compute statistical relations between system calls, in a way that resembles Principal Component Analysis [11]. These relations are then used to identify chief attributes. Knop et al.'s mechanism relies upon a numerical representation of system call attributes to capture object correlation. Since a log file consists mainly of a sequence of strings, this representation is unnatural and a source of noise; it may incorrectly relate two syntactically similar system calls with different meanings. By comparison, our method does not rely on correlation measures but on data frequency, which is less susceptible to representation problems.
2.2 Object Reduction Methods
Most object reduction methods rely on grouping similar objects according to their attributes. Examples of such methods are the Expectation-Maximization procedure used by Lane et al. [8, 9] and an expert-system-based grouping method developed by Marin et al. [12]. The more information we have about the objects, the more accurate the grouping will be. This comes with a computational overhead: the more attributes we use to make the grouping, the more computationally intense the process becomes. All these methods can benefit from our work, since a reduction prior to clustering can reduce the number of attributes needed to represent an object, and therefore the time needed to group similar ones.
In general, attribute reduction methods are not directly capable of dealing with incomplete data: they rely on the objects having a value assigned for every attribute and a fixed number of attributes. To overcome these problems, we propose the use of another method, Rough Set Theory. Even though rough set theory has not been widely used for dimensionality reduction of security logs, its ability to deal with incomplete information (one of the key features of rough sets) makes it very suitable for our needs, since entries in a log do not always have the same attributes.

This completes our review of related work. Attention is now given to describing rough sets.
3 Dealing with Data Imperfection
In production environments, output data are often vague, incomplete, inconsistent and of a great variety, getting in the way of their sound analysis. Data imperfections rule out the possibility of using conventional data mining techniques, such as ID3, C5 or GINI. Fortunately, the theory of rough sets [6] has been specially designed to handle these kinds of scenarios. As in fuzzy logic, in rough sets every object of interest is associated with a piece of knowledge indicating relative membership. This knowledge is used to drive data classification and is the key issue of any reasoning, learning and decision making [13].

Knowledge, acquired from human or machine experience, is represented as a set of examples describing attributes of either of two types, condition and decision. Condition attributes and decision attributes respectively represent a priori and a posteriori knowledge. Thus, learning in rough sets is supervised.

Rough set theory removes superfluous information by examining attribute dependencies. It deals with inconsistencies, uncertainty and incompleteness by imposing an upper and a lower approximation to set membership. It estimates the relevance of an attribute by using attribute dependencies with respect to a given decision class, and achieves attribute set covering by imposing a discernibility relation. The output of rough set analysis, purged and consistent data, can be used to define decision rules. A brief introduction to rough set theory, mostly based on [6], follows.
3.1 Rough Sets
Knowledge is represented by means of a table, called an information system, where rows and columns respectively denote objects and attributes. An information system A is a pair A = (U, A), where U is a non-empty finite set of objects, the universe, and A is a non-empty finite set of attributes.

A decision system is an information system that involves at least (and usually exactly) one decision attribute. It is given by A = (U, A ∪ {d}), where d ∉ A is the decision attribute. Decision attributes are often two-valued; the input set of examples is then split into two disjoint subsets, positive and negative. An element is in the positive subset if it belongs to the associated decision class, and in the negative one otherwise. Multi-valued decision attributes give rise to multiple, pairwise disjoint decision classes. A decision system expresses our knowledge about a model and may be unnecessarily redundant. To remove redundancies, rough sets define an equivalence relation up to indiscernibility.
Let A = (U, A) be an information system. Then every B ⊆ A yields an equivalence relation up to indiscernibility, IND_A(B) ⊆ U × U, given by:

    IND_A(B) = {(x, x′) : ∀a ∈ B. a(x) = a(x′)}

A reduct of A is a minimal B ⊆ A that is equivalent to A up to indiscernibility; in symbols, IND_A(B) = IND_A(A). The attributes in A − B are then considered expendable. An information system typically has many such subsets B. The set of all reducts of A is denoted RED(A).
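The following Python sketch (ours, not part of Rosetta) illustrates the two definitions on a toy table: IND(B) is computed as the partition of the universe induced by the attribute subset B, and a subset preserves discernibility exactly when its partition coincides with that of the full attribute set. Function and variable names are our own.

```python
# Sketch of the indiscernibility relation IND_A(B): objects are grouped into
# equivalence classes by their values on the attribute subset B. A subset B
# is a reduct candidate when its partition equals that of the full set A.

from collections import defaultdict
from typing import Dict, FrozenSet, Hashable, Sequence, Set

def partition(universe: Sequence[Dict[str, Hashable]],
              attrs: Sequence[str]) -> Set[FrozenSet[int]]:
    """Return the partition of object indices induced by IND(attrs)."""
    classes = defaultdict(set)
    for idx, obj in enumerate(universe):
        signature = tuple(obj.get(a) for a in attrs)
        classes[signature].add(idx)
    return {frozenset(c) for c in classes.values()}

def preserves_discernibility(universe, all_attrs, subset) -> bool:
    """True if IND(subset) == IND(all_attrs), i.e. no discernibility is lost."""
    return partition(universe, subset) == partition(universe, all_attrs)
```

A reduct is then a subset for which preserves_discernibility holds and from which no attribute can be removed without breaking it.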
An equivalence relation splits the universe, allowing us to create new classes, also called concepts. A concept that cannot be completely characterised gives rise to a rough set. A rough set is used to hold elements for which it cannot be definitely said whether or not they belong to a given concept. For the purpose of this paper, reducts are the only concept that we need to understand thoroughly. We now explore two of the most widely used reduct algorithms.
3.2 Reduct Algorithms
First we note that the algorithms currently supplied by the Rosetta library (described in Section 3.3) support two types of discernibility: i) full, where reducts are extracted relative to the system as a whole, so that the resulting reduct set discerns between all relevant objects; and ii) object, where reducts are extracted relative to a single object, yielding a set of reducts for each object in A. We are mainly interested in two reduct extraction algorithms supplied by the Rosetta library: Johnson's algorithm and a genetic algorithm.
Johnson’s algorithm implements a variation of a simple greedy search
algorithm as described in [14]. This algorithm extracts a single reduct and not a
set like other algorithms. The reduct B can be found by executing the following
algorithm where S is a superset of the sets corresponding to the discernibility
function gA (U ) and w(S) is a weight function assigned to S ∈ S (Unless stated
otherwise the function
P w(S) denotes cardinality): i)B = ∅, ii)Select an attribute
a that maximizes
w(S)|∀S, a ∈ S, iii)Add a to B, iv)Remove S|a ∈ S from
S, v)When S = ∅ return B otherwise repeat from step ii.
Approximate solutions can be provided by leaving the execution when an arbitrary number of sets have been removed from S. The support count associated
with the extracted reduct is the percentage of S ∈ S : B ∩ S 6= ∅, when computing the reducts a minimum support value can be provided. This algorithm has
the advantage of returning a single reduct but depending on the desired value
for the minimal support count some attributes might be eliminated. If we use a
support of 0% then all attributes are included in the reduct, if we use a value of
100% then the algorithm executes until S = ∅.
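As an illustration, the greedy loop i)-v) can be sketched in Python as follows. This is a simplified rendering of the procedure described above, not Rosetta's implementation; the helper names and the 100%-support behaviour (run until S is empty) are ours.

```python
# Greedy single-reduct extraction over a discernibility function, following
# steps i)-v): pick the attribute with the largest total weight over the
# discernibility sets containing it, discard the sets it hits, repeat.

from typing import Iterable, List, Set

def johnson_reduct(disc_sets: Iterable[Set[str]], weight=len) -> List[str]:
    """disc_sets: the sets S produced by the discernibility function g_A(U);
    weight: w(S), by default the cardinality of S as in the paper."""
    remaining = [set(s) for s in disc_sets if s]
    reduct: List[str] = []
    while remaining:
        # score every attribute by the summed weight of the sets containing it
        scores = {}
        for s in remaining:
            for a in s:
                scores[a] = scores.get(a, 0) + weight(s)
        best = max(scores, key=scores.get)
        reduct.append(best)
        remaining = [s for s in remaining if best not in s]
    return reduct
```

For example, johnson_reduct([{"Owner", "System-Call"}, {"System-Call"}, {"Audit-ID", "Owner"}]) first picks System-Call and then Owner, covering all three discernibility sets with two attributes.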
The genetic algorithm described by Vinterbo and Øhrn in [15] is used to find minimal hitting sets. The algorithm's fitness function is presented below, where S is the multi-set given by the discernibility function, α is a weighting between the subset cost and the hitting fraction, and ε is used for approximate solutions:

    f(B) = (1 − α) × (cost(A) − cost(B)) / cost(A) + α × min(ε, |{S ∈ S : S ∩ B ≠ ∅}| / |S|)

The subsets B of A are found by an evolutionary search guided by f(B); when a subset B has a hitting fraction of at least ε it is saved in a list, whose size is arbitrary. The function cost specifies a penalty for an attribute (some attributes may be harder to collect) but defaults to cost(B) = |B|. If ε = 1 then the minimal hitting set is returned. In this algorithm the support count is defined as in Johnson's algorithm.
3.3 Rosetta Software
The Rosetta system is a toolkit developed by Alexander Øhrn [16] for data analysis using rough set theory. The Rosetta toolkit consists of a computational kernel and a GUI. Our main interest is the kernel, a general C++ class library implementing various methods of the rough set framework. The library is open source and thus modifiable by the user. Anyone interested in the library should refer to [16].

In the next section we describe the application of rough sets to the problem of dimensionality reduction, in particular attribute reduction, and our results in applying this technique.
4 Rough Set Application to Attribute Reduction
This section aims to show how rough sets can be used to find the chief attributes which ought to be considered for session analysis. By the equivalence up to indiscernibility (see Section 3.1), this attribute reduction is minimal with respect to content of information.

Ideally, to find the reduct, we would just need to collect together as many session logs as possible and make Rosetta process them. However, this is computationally prohibitive, since it would require an unlimited amount of resources, both memory and CPU time. To get around this situation, we ran a number of separate analyses, each of which considers a session segment, and then collected the associated reducts. Then, to find the minimum common reduct, we performed a statistical analysis which removes those attributes that appeared least frequently. In what follows, we elaborate on our methodology for reduct extraction, which closely follows that outlined by Komorowski [6].
4.1 Reduct Extraction
To approach reduct extraction, considering the information provided by the DARPA repository, we randomly chose 8 logs (out of 25) for the year 1998 and put them together. The combined log was then evenly divided into segments of 25,000 objects, yielding 365 partial log files. For each partial log file, we made Rosetta extract the associated reduct using Johnson's algorithm with a 100% support count (see Section 3.2).

After this extraction process, we sampled the resulting reducts and, using a frequency-based discriminant, constructed a minimum common reduct (MCR). The MCR keeps most of the information of the original data while minimising the number of attributes. The largest reduct in the original set has 15 attributes and our minimum common reduct has 18 attributes. This is still a 66.66% reduction in the number of attributes. The 18 chief attributes are shown below:
Access-Mode          Owner              Owner-Group
File-System-ID       inode-ID           device-ID
arg-value-1          arg-string-1       arg-value-2
arg-value-3          exec-arg-1         Socket-Type
Remote-IP            Audit-ID           Effective-User-ID
Effective-Group-ID   Process-ID         System-Call
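A minimal sketch of the frequency-based merging step described above follows. The paper does not state the exact frequency threshold used to build the MCR, so min_freq below is an assumed, illustrative parameter, and the function names are ours.

```python
# Merge per-segment reducts into a minimum common reduct (MCR): count how
# often each attribute occurs across the partial reducts and keep those
# appearing at least min_freq of the time.

from collections import Counter
from typing import Iterable, List, Sequence

def minimum_common_reduct(reducts: Iterable[Sequence[str]],
                          min_freq: float = 0.5) -> List[str]:
    """Keep attributes appearing in at least min_freq of the partial reducts."""
    reducts = list(reducts)
    counts = Counter(a for r in reducts for a in set(r))
    threshold = min_freq * len(reducts)
    return sorted(a for a, c in counts.items() if c >= threshold)
```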
Before concluding this section, we report on our observations of the performance of the two algorithms found in Rosetta, namely Johnson's algorithm and the genetic-algorithm-based reduction mechanism.
4.2 Algorithm Selection
Prior to reduct extraction, we tested the performance of the two Rosetta reduction algorithms in order to find which is the most suitable for our work. Both reduction algorithms (see Section 3.2) were made to find a reduction over 25 log files. The log files were selected considering different sizes and types of sessions. A minimum common reduct set containing 14 attributes was obtained after the 25 extraction processes. This amounts to a reduction of 72.5% in the number of attributes. In general, both algorithms yielded similar total elapsed times; sometimes, however, Johnson's algorithm was faster. As expected, the total elapsed time involved in reduct extraction grows exponentially with the number of objects to be processed: for a 1,000-object log file the time needed to extract the reduct is 3 seconds, while for a 570,000-object one it is 22 hours. The size of the reduct also increases with the diversity of the log: for a 1,000-object log we found a reduct of 8 attributes, while for a 570,000-object one we found a reduct of 14 attributes.

However, for longer log files, the instability of the genetic-algorithm-based mechanism became apparent. Our experiments show that this algorithm is unable to handle log files containing more than 22,000 objects, each with 51 attributes. This explains why our experiments only consider Johnson's algorithm.
Even though the algorithms accept indiscernibility decision graphs (that is, relations between objects), we did not use them, both because we wanted to keep the process as unsupervised as possible and because building the graphs requires knowing in advance the relations between the objects, which is quite difficult even with the smaller logs of 60,000 objects.
In the next section we will review our experimental results to validate the
resulting reducts.
5 Reduct Validation—Experimental Results
This section describes the methodology used to validate our output reduct, the main contribution of this paper. The validation methodology appeals to so-called association patterns.

An association pattern is basically a pattern that, with the help of wildcards, matches part of an example log file. Given both a reduct and a log file, the corresponding association patterns are extracted by overlaying the reduct over that log file and reading off the values [16]. The association patterns are then compared against another log file to compute how well they cover that log file's information.
Thus, our validation test consists of checking the quality of the association patterns generated by our output reduct, considering two log files. The rationale behind it is that the more information about the system the reduct comprehends, the higher the matching ratio its association patterns will have.
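The covering computation can be sketched as follows. This mirrors the idea of overlaying the reduct and matching with implicit wildcards on the non-reduct attributes; it is our own simplification, not Rosetta's pattern-matching code.

```python
# Extract association patterns from a log by reading off the reduct-attribute
# values, then compute the covering of a pattern set over another log as the
# fraction of its objects matched by at least one pattern. Attributes outside
# the reduct are implicitly wildcarded.

def extract_patterns(log, reduct):
    """Overlay the reduct on a log and read off the value combinations."""
    return {tuple(obj.get(a) for a in reduct) for obj in log}

def covering(patterns, log, reduct):
    """Fraction of objects in `log` matched by some pattern."""
    matched = sum(1 for obj in log
                  if tuple(obj.get(a) for a in reduct) in patterns)
    return matched / len(log) if log else 0.0
```

In these terms, the entry for training log A and testing log B in Table 1 corresponds to covering(extract_patterns(A, reduct), B, reduct).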
To validate our reduct, we conducted the following three-step approach. For each one of the 8 log files considered throughout our experiments: i) use the output reduct to compute the association patterns; ii) cross-validate the association patterns against all of the log files, including the one used to generate them; and iii) collect the results. Our validation results are summarized in Table 1. The patterns contain only relations between attributes contained in the final reduct set. A set of patterns was generated from each log file.

These association patterns are able to describe the information system to some extent; that extent is the covering percentage of the patterns over the information system. In Table 1, the first column indicates the log file used to generate the association patterns, and the first row indicates the log file used to test the covering of the patterns. Had we generated a set of patterns using all attributes, those patterns would cover 100% of the objects in the table that generated them; the reported result is the percentage of objects we were able to identify using only the attributes in our reduct.
These experiments were conducted on an Ultra Sparc 60 with two processors running Solaris 7. Even though it is a fast workstation, the reduct algorithms are the most computationally demanding in all of rough set theory.
Training                            Testing log
Log        A      B      C      D      E      F      G      H
A         93.7   89.7   90.8   90.3   90.9   91.1   89.9   90.9
B         90.9   93.1   91.2   91.3   90.9   92.4   91.4   90.1
C         90.2   90.8   92.8   90.7   92.1   90.1   90.6   91.0
D         90.3   89.3   91.3   93.1   91.5   90.5   92.9   91.1
E         89.8   89.1   92.2   92.3   93.4   89.9   92.1   90.1
F         89.7   89.3   91.4   92.8   90.7   92.9   92.3   90.8
G         89.2   90.3   91.5   92.6   91.2   90.7   93.1   90.6
H         90.1   89.5   91.2   91.3   90.8   90.2   90.9   92.5
Size   744,085 2,114,283 1,093,140 1,121,967 1,095,935 815,236 1,210,358 927,456

Table 1. Covering % of the Association Patterns
The running time of the reduction algorithms is O(n²), with n being the number of objects, so with logs of more than 700,000 objects an overhead in time was to be expected.
Calculating the quality of the rules generated with the extracted reducts is also time consuming. Generating the rules took 27 hours (we used the larger tables to generate them) and another 60 hours were needed to calculate the quality of the generated rules. In order to test the rules with another table we had to extend the Rosetta library, because the internal representation of data depends on the order in which objects are imported and saved in the table's internal dictionary (every value is translated to a numerical form). The library has no method for importing a table using the dictionary of an already loaded table. To overcome this deficiency, we extended the Rosetta library with an algorithm capable of importing a table using the dictionary of an already loaded table. This way we were able to test the rules generated from a training set over a different testing set. We also tested the rules upon the training set.
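The sketch below illustrates the shared-dictionary import problem in a library-neutral way; class and function names are hypothetical and do not correspond to Rosetta's API. The point is simply that the value-to-integer dictionary built while importing the training table must be reused, unchanged, when importing the test table, so that equal values receive equal codes in both tables.

```python
# Hypothetical sketch of importing two tables through one shared value
# dictionary (not Rosetta's API): values are mapped to integers in order of
# first appearance, and the test table reuses the training table's mapping.

class ValueDictionary:
    def __init__(self):
        self._codes = {}

    def encode(self, value, extend=True):
        """Return the integer code for value; optionally register unseen values."""
        if value not in self._codes:
            if not extend:
                return -1  # unseen value in the test table
            self._codes[value] = len(self._codes)
        return self._codes[value]

def import_table(rows, dictionary, extend=True):
    return [[dictionary.encode(v, extend) for v in row] for row in rows]

# The training table builds the dictionary; the test table reuses it unchanged.
shared = ValueDictionary()
train_encoded = import_table([["open", "r"], ["fork", ""]], shared)
test_encoded = import_table([["open", "w"]], shared, extend=False)
```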
In the end, the quality of the reduced attribute set is measured in terms of the discernibility between objects: if we can still discern between two different objects with the reduced set of attributes, then the loss of information is said to be minimal. Our goal was to reduce the number of attributes without losing the discernibility between objects, so that more precise IDSs can be designed, and that goal was achieved.
Even though the entire reduct extraction process is time consuming, it only needs to be done once and it can be done off-line. There is no need for live data; the analysis can be done with stock data. Once the reduct has been calculated and its quality verified, we can keep only the attributes we need, thus reducing both the space required to hold the logs and the time spent processing them for proper intrusion detection.

The last section of this paper presents our conclusions and projected future work.
6 Conclusions and Future Work
Based on our results, we identified the chief attributes of a BSM log file without sacrificing discernibility-relevant information. With the growing amount of information flowing in an IT system, there was a need for an effective method capable of identifying the key elements in the data, in order to reduce the amount of memory and time used in the detection process. We think our results show that rough sets provide such a method. As future work we are planning to experiment with object reduction to facilitate the detection task even further. The reduction obtained so far should be sufficient to explore intrusion detection methods that are computationally intensive and were previously prohibitive.
References

1. Kim, J., Bentley, P.: The Human Immune System and Network Intrusion Detection. In: Proceedings of the 7th European Conference on Intelligent Techniques and Soft Computing (EUFIT'99), Aachen, Germany, ELITE Foundation (1999)
2. Sun MicroSystems: SunSHIELD Basic Security Module Guide. Part number 806-1789-10 edn. (2000)
3. Haines, J.W., Lippmann, R.P., Fried, D.J., Tran, E., Boswell, S., Zissman, M.A.: 1999 DARPA intrusion detection system evaluation: Design and procedures. Technical Report 1062, Lincoln Laboratory, Massachusetts Institute of Technology (2001)
4. Quinlan, J.R.: Learning efficient classification procedures and their application to chess end games. In: Machine Learning: An Artificial Intelligence Approach. Springer, Palo Alto, CA (1983)
5. Breiman, L., Stone, C.J., Olshen, R.A., Friedman, J.H.: Classification and Regression Trees. Statistics-Probability Series. Brooks/Cole (1984)
6. Komorowski, J., Polkowski, L., Skowron, A.: Rough sets: A tutorial. In: Rough-Fuzzy Hybridization: A New Method for Decision Making. Springer-Verlag (1998)
7. Axelsson, S.: Aspects of the modelling and performance of intrusion detection. Department of Computer Engineering, Chalmers University of Technology (2000). Thesis for the degree of Licentiate of Engineering.
8. Lane, T., Brodley, C.E.: Temporal Sequence Learning and Data Reduction for Anomaly Detection. ACM Transactions on Information and System Security 2 (1999) 295–331
9. Lane, T., Brodley, C.E.: Data Reduction Techniques for Instance-Based Learning from Human/Computer Interface Data. In: Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann (2000) 519–526
10. Knop, M.W., Schopf, J.M., Dinda, P.A.: Windows performance monitoring and data reduction using WatchTower and Argus. Technical Report NWU-CS-01-6, Department of Computer Science, Northwestern University (2001)
11. Rencher, A.: Methods of Multivariate Analysis. Wiley & Sons, New York (1995)
12. Marin, J.A., Ragsdale, D., Surdu, J.: A hybrid approach to profile creation and intrusion detection. In: Proc. of DARPA Information Survivability Conference and Exposition, IEEE Computer Society (2001)
13. Félix, R., Ushio, T.: Binary Encoding of Discernibility Patterns to Find Minimal Coverings. International Journal of Software Engineering and Knowledge Engineering 12 (2002) 1–18
14. Johnson, D.S.: Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences 9 (1974) 256–278
15. Vinterbo, S., Øhrn, A.: Minimal approximate hitting sets and rule templates. International Journal of Approximate Reasoning 25 (2000) 123–143
16. Øhrn, A., Komorowski, J.: ROSETTA: A Rough Set Toolkit for Analysis of Data. In Wong, P., ed.: Proceedings of the Third International Joint Conference on Information Sciences. Volume 3, Durham, NC, USA, Department of Electrical and Computer Engineering, Duke University (1997) 403–407