
Attribute Reduction for Effective Intrusion Detection

2004, Lecture Notes in Computer Science


Attribute Reduction for Effective Intrusion Detection*

Fernando Godínez (1), Dieter Hutter (2) and Raúl Monroy (3)

(1) Centre for Intelligent Systems, ITESM-Monterrey, Eugenio Garza Sada 2501, Monterrey, 64849, Mexico. [email protected]
(2) DFKI, Saarbrücken University, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany. [email protected]
(3) Department of Computer Science, ITESM-Estado de México, Carr. Lago de Guadalupe, Km. 3.5, Estado de México, 52926, Mexico. [email protected]

* This research is supported by three research grants: CONACyT 33337-A, CONACyT-DLR J200.1442/2002 and ITESM CCEM-0302-05.

Abstract. Computer intrusion detection is concerned with identifying computer activities that may compromise the integrity, confidentiality or availability of an IT system. Anomaly Intrusion Detection Systems (IDSs) aim at distinguishing abnormal activity from ordinary activity. However, even at a moderate site, computer activity very quickly yields gigabytes of information, overwhelming current IDSs. To make anomaly intrusion detection feasible, this paper advocates the use of rough sets prior to the intrusion detector, in order to filter out redundant, spurious information. Using rough sets, we have been able to successfully identify pieces of information that succinctly characterise computer activity without missing chief details. The results are very promising: we were able to reduce the number of attributes by a factor of 3, resulting in a 66% data reduction. We have tested our approach using BSM log files borrowed from the DARPA repository.

1 Introduction

Computer intrusion detection is concerned with studying how to detect computer attacks. A computer attack, or attack for short, is any activity that jeopardises the integrity, confidentiality or availability of an IT system. An attack may be either physical or logical. There exist a number of Intrusion Detection Systems (IDSs); based on their detection scheme, they belong to either of two main categories: misuse-detection and anomaly-detection IDSs. Misuse-detection IDSs aim to detect the appearance of the signature of a known attack in network traffic. While simple, misuse-detection IDSs do not scale up, both because they are useless against unknown attacks and because they are easily cheated by an attack spread over several sessions.

To get around this situation, anomaly-detection IDSs count on a characterisation of ordinary activity and use it to distinguish ordinary activity from abnormal activity [1]. However, normal and abnormal behaviour differ quite subtly and hence are difficult to tell apart. To compound the problem, there is an infinite number of instances of both normal and abnormal activity. Even at a moderate site, computer activity very quickly yields gigabytes of information, overwhelming current IDSs.

This paper aims to make anomaly-based intrusion detection feasible. It addresses the problem of dimensionality reduction using an attribute relevance analyser. We aim to filter out redundant, spurious information and significantly reduce the amount of computer resources, both memory and CPU time, required to detect an attack. We work under the consideration that intrusion detection is approached at the level of execution of operating system calls, rather than network traffic, so our input data is noiseless and less subject to encrypted attacks. In our reduction experiments, we use BSM log files [2], borrowed from the DARPA repository [3]. A BSM log file consists of a sequence of system calls.
Roughly, Sun Solaris involves about 240 different system calls, each of which takes a different number of attributes. The same system call may even take distinct attributes in separate entries. This diversity in the structure of a BSM log file gets in the way of a successful application of data mining techniques. Entries cannot easily be standardised to hold the same number of attributes, unless extra attributes are given no value. This lack of information rules out the possibility of using classical data mining techniques, such as ID3 [4] or GINI [5]. Fortunately, rough sets [6] work well under diverse, incomplete information. Using rough sets, we have been able to successfully identify pieces of information that succinctly characterise computer activity without missing chief details.

We have tested our approach using various BSM log files of the DARPA repository. More precisely, we used 8 days out of the 25 available from 1998 as our training data. The results we obtained show that we need less than a third of the 51 identified attributes to represent the log files with minimum loss of information.

This paper is organised as follows: in Section 2 we briefly overview the importance of Intrusion Detection Systems and the problem of information management. Section 3 is a brief introduction to rough set theory, especially the parts used for the problem of dimensionality reduction. In Section 4 we describe the methodology of our experiments. Section 5 describes our experimental results. Finally, some conclusions are discussed in Section 6.

2 Information Management in Intrusion Detection Systems

An anomaly-based IDS relies on some kind of statistical profile abstracting out normal user behaviour. User actions may be observed at different levels, ranging from system commands to system calls. Any deviation from a behaviour profile is taken as an anomaly and therefore an intrusion. Building a reliable user profile requires a number of observations that can easily overwhelm any IDS [7], making it necessary to narrow the input data without losing important information. Narrowing the input data yields the additional benefit of alleviating intrusion detection. Current reduction methods, as shown below, eliminate important information, getting in the way of effective intrusion detection.

2.1 Attribute Reduction Methods

Log files are naturally represented as a table, a two-dimensional array, where rows stand for objects (in our case system calls) and columns for their attributes. These tables may be unnecessarily redundant. We call the problems of reducing the rows and columns of a table object reduction and attribute reduction, respectively. To the best of the authors' knowledge, there have been few attempts to reduce the number of attributes prior to clustering, even though this remains a big concern, as there might be unnecessary attributes in any given source of information.

Lane et al. have suggested an instance-based learning (IBL) technique for modelling user behaviour [8, 9] that works at the level of Unix user commands. Their technique relies upon an arbitrary reduction mechanism, which replaces all the attributes of a command with an integer representing the number of that command's attributes (e.g. cat /etc/password /etc/shadow > /home/mypasswords is replaced by cat <3>). According to [8, 9], this reduction mechanism narrows the alphabet by a factor of 14, but certainly at the cost of losing chief information.
This is because the arguments cannot be discarded if a proper distinction between normal and abnormal behaviour is to be made. For example, copying the password files may in general denote abnormal activity. By contrast, our method keeps all these attributes, as they are the main discriminants between objects and thus an important source of information.

Knop et al. have suggested the use of correlation elimination to achieve attribute reduction [10]. Their mechanism, which resembles Principal Component Analysis [11], uses a correlation coefficient matrix to compute statistical relations between system calls. These relations are then used to identify chief attributes. Knop et al.'s mechanism relies upon a numerical representation of system call attributes to capture object correlation. Since a log file consists mainly of a sequence of strings, this representation is unnatural and a source of noise; it may incorrectly relate two syntactically similar system calls with different meanings. By comparison, our method does not rely on correlation measures but on data frequency, which is less susceptible to representation problems.

2.2 Object Reduction Methods

Most object reduction methods rely on grouping similar objects according to their attributes. Examples of such methods are the Expectation-Maximisation procedure used by Lane et al. [8, 9] and an expert-system-based grouping method developed by Marin et al. [12]. The more information we have about the objects, the more accurate the grouping will be. This comes with a computational overhead: the more attributes we use to make the grouping, the more computationally intense the process becomes. All these methods can benefit from our work, since a reduction prior to clustering can reduce the number of attributes needed to represent an object, therefore reducing the time needed to group similar ones.

In general, attribute reduction methods are not directly capable of dealing with incomplete data. These methods rely on the objects having some value assigned for every attribute and a fixed number of attributes. To overcome these problems we propose the use of another method, rough set theory. Even though rough set theory has not been widely used for dimensionality reduction of security logs, its ability to deal with incomplete information (one of the key features of rough sets) makes it very suitable for our needs, since entries in a log do not have the same attributes all the time. This completes our review of related work. Attention is now given to describing rough sets.

3 Dealing with Data Imperfection

In production environments, output data are often vague, incomplete, inconsistent and of a great variety, getting in the way of their sound analysis. Data imperfections rule out the possibility of using conventional data mining techniques, such as ID3, C5 or GINI. Fortunately, the theory of rough sets [6] has been specially designed to handle these kinds of scenarios. As in fuzzy logic, in rough sets every object of interest is associated with a piece of knowledge indicating relative membership. This knowledge is used to drive data classification and is the key issue of any reasoning, learning and decision making [13]. Knowledge, acquired from human or machine experience, is represented as a set of examples describing attributes of either of two types, condition and decision. Condition attributes and decision attributes respectively represent a priori and a posteriori knowledge.
Thus, learning in rough sets is supervised. Rough set theory removes superfluous information by examining attribute dependencies. It deals with inconsistencies, uncertainty and incompleteness by imposing an upper and a lower approximation to set membership. It estimates the relevance of an attribute by using attribute dependencies with respect to a given decision class, and it achieves attribute set covering by imposing a discernibility relation. The output of rough set analysis, purged and consistent data, can be used to define decision rules. A brief introduction to rough set theory, mostly based on [6], follows.

3.1 Rough Sets

Knowledge is represented by means of a table, a so-called information system, where rows and columns respectively denote objects and attributes. An information system A is given as a pair A = (U, A), where U is a non-empty finite set of objects, the universe, and A is a non-empty finite set of attributes. A decision system is an information system that involves at least (and usually) one decision attribute. It is given by A = (U, A ∪ {d}), where d ∉ A is the decision attribute. Decision attributes are often two-valued; the input set of examples is then split into two disjoint subsets, positive and negative. An element is in positive if it belongs to the associated decision class and is in negative otherwise. Multi-valued decision attributes give rise to pairwise, multiple decision classes.

A decision system expresses our knowledge about a model. It may be unnecessarily redundant. To remove redundancies, rough sets define an equivalence relation up to indiscernibility. Let A = (U, A) be an information system. Then every B ⊆ A yields an equivalence relation up to indiscernibility, IND_A(B) ⊆ (U × U), given by:

    IND_A(B) = {(x, x') : ∀a ∈ B. a(x) = a(x')}

A reduct of A is a minimal B ⊆ A that is equivalent to A up to indiscernibility; in symbols, IND_A(B) = IND_A(A). The attributes in A − B are then considered expendable. An information system typically has many such subsets B; the set of all reducts of A is denoted RED(A). An equivalence relation partitions the universe, allowing us to create new classes, also called concepts. A concept that cannot be completely characterised gives rise to a rough set. A rough set is used to hold elements for which it cannot be definitely said whether or not they belong to a given concept. For the purposes of this paper, the reduct is the only concept that we need to understand thoroughly. We will now explore two of the most widely used reduct algorithms.
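As an illustration of these definitions (ours, not taken from the paper), the following Python sketch computes the partition induced by IND_A(B) on a toy table of hypothetical system-call-like objects and finds the reducts by brute force; the attribute names and values are invented purely for illustration.

    from itertools import combinations

    # A toy information system: each object is a dict of attribute values.
    # Attribute names and values are hypothetical.
    TABLE = [
        {"System-Call": "open",  "Owner": "root", "Access-Mode": "r"},
        {"System-Call": "open",  "Owner": "user", "Access-Mode": "w"},
        {"System-Call": "close", "Owner": "root", "Access-Mode": "r"},
        {"System-Call": "exec",  "Owner": "user", "Access-Mode": "w"},
    ]
    ATTRS = ["System-Call", "Owner", "Access-Mode"]

    def ind_partition(table, attrs):
        """Partition induced by IND_A(B): two objects fall in the same class
        iff they agree on every attribute in `attrs`."""
        classes = {}
        for i, obj in enumerate(table):
            key = tuple(obj[a] for a in attrs)
            classes.setdefault(key, set()).add(i)
        return set(frozenset(c) for c in classes.values())

    def reducts(table, attrs):
        """Brute-force search for minimal subsets B with IND(B) = IND(A).
        Exponential in |attrs|; only sensible for tiny toy examples."""
        full = ind_partition(table, attrs)
        found = []
        for size in range(1, len(attrs) + 1):
            for cand in combinations(attrs, size):
                if ind_partition(table, list(cand)) == full:
                    # keep only minimal candidates
                    if not any(set(r) <= set(cand) for r in found):
                        found.append(cand)
        return found

    if __name__ == "__main__":
        print(reducts(TABLE, ATTRS))

On this toy table the search returns two reducts, {System-Call, Owner} and {System-Call, Access-Mode}, showing that either Owner or Access-Mode can be dropped without losing the ability to discern between the four objects.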
3.2 Reduct Algorithms

First, note that the algorithms currently supplied by the Rosetta library (described in Section 3.3) support two types of discernibility: i) full, where reducts are extracted relative to the system as a whole, so that the resulting reduct set discerns between all relevant objects; and ii) object, where reducts are extracted relative to a single object, resulting in a set of reducts for each object in A. We are mainly interested in two reduct extraction algorithms supplied by the Rosetta library: Johnson's algorithm and a genetic algorithm. Johnson's algorithm implements a variation of a simple greedy search, as described in [14]. It extracts a single reduct, not a set of reducts like other algorithms.

The reduct B can be found by executing the following algorithm, where S is the collection of attribute sets given by the discernibility function g_A(U) and w(S) is a weight assigned to each S ∈ S (unless stated otherwise, w(S) denotes cardinality):

i) B = ∅;
ii) select an attribute a that maximises Σ{w(S) : S ∈ S, a ∈ S};
iii) add a to B;
iv) remove every S with a ∈ S from S;
v) if S = ∅, return B; otherwise repeat from step ii.

Approximate solutions can be obtained by stopping the execution once an arbitrary number of sets have been removed from S. The support count associated with the extracted reduct is the percentage of S ∈ S such that B ∩ S ≠ ∅; a minimum support value can be provided when computing the reducts. This algorithm has the advantage of returning a single reduct, but depending on the desired value for the minimum support count some attributes might be eliminated. If we use a support of 0% then all attributes are included in the reduct; if we use a value of 100% then the algorithm executes until S = ∅.

The genetic algorithm described by Øhrn and Viterbo in [15] is used to find minimal hitting sets. The algorithm's fitness function is presented below, where S is the multi-set given by the discernibility function, α is a weighting between subset cost and the hitting fraction, and ε is used for approximate solutions:

    f(B) = (1 − α) × (cost(A) − cost(B)) / cost(A) + α × min(ε, |{S ∈ S : S ∩ B ≠ ∅}| / |S|)

The subsets B of A are found by an evolutionary search measured by f(B); when a subset B has a hitting fraction of at least ε, it is saved in a list whose size is arbitrary. The function cost specifies a penalty for an attribute (some attributes may be harder to collect), but it defaults to cost(B) = |B|. If ε = 1 then the minimal hitting set is returned. In this algorithm the support count is the same as in Johnson's algorithm.

3.3 Rosetta Software

The Rosetta system is a toolkit developed by Alexander Øhrn [16] for data analysis based on rough set theory. Rosetta comprises a computational kernel and a GUI. Our main interest is the kernel, a general C++ class library implementing various methods of the rough set framework. The library is open source and thus modifiable by the user. Anyone interested in the library should refer to [16]. In the next section we describe the application of rough sets to the problem of dimensionality reduction, in particular attribute reduction, and our results in applying this technique.

4 Rough Set Application to Attribute Reduction

This section aims to show how rough sets can be used to find the chief attributes which ought to be considered for session analysis. By the equivalence up to indiscernibility (see Section 3.1), this attribute reduction is minimal with respect to information content. Ideally, to find the reduct, we would just need to collect together as many session logs as possible and make Rosetta process them. However, this is computationally prohibitive, since it would require an unlimited amount of resources, both memory and CPU time. To get around this situation, we ran a number of separate analyses, each of which considers a session segment, and then collected the associated reducts. Then, to find the minimum common reduct, we performed a statistical analysis which removes those attributes that appeared least frequently. A sketch of this segment-and-merge process is given below; in what follows, we elaborate on our methodology for reduct extraction, which closely follows that outlined by Komorowski [6].
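The following Python sketch (ours, independent of Rosetta) illustrates the methodology under simplifying assumptions: each segment is reduced with a greedy covering in the spirit of Johnson's algorithm from Section 3.2, and the per-segment reducts are merged by keeping the most frequent attributes. The segment size default, the frequency threshold and the helper names are assumptions made for illustration.

    from collections import Counter
    from itertools import combinations

    def discernibility_sets(segment, attrs):
        """For every pair of distinct objects, record the attributes on which
        they differ (a simplified full-discernibility function)."""
        sets = []
        for x, y in combinations(segment, 2):
            diff = {a for a in attrs if x.get(a) != y.get(a)}
            if diff:
                sets.append(diff)
        return sets

    def johnson_reduct(segment, attrs):
        """Greedy covering: repeatedly pick the attribute appearing in the
        largest number of remaining discernibility sets (w(S) = cardinality)."""
        remaining = discernibility_sets(segment, attrs)
        reduct = set()
        while remaining:
            best, _ = Counter(a for s in remaining for a in s).most_common(1)[0]
            reduct.add(best)
            remaining = [s for s in remaining if best not in s]
        return reduct

    def minimum_common_reduct(log, attrs, segment_size=25000, keep_fraction=0.5):
        """Split the log into segments, extract one reduct per segment, and
        keep the attributes appearing in at least `keep_fraction` of the
        per-segment reducts. The threshold is an assumption; the paper only
        states that the least frequent attributes are removed."""
        segments = [log[i:i + segment_size] for i in range(0, len(log), segment_size)]
        reducts = [johnson_reduct(seg, attrs) for seg in segments]
        freq = Counter(a for r in reducts for a in r)
        return {a for a, c in freq.items() if c / len(reducts) >= keep_fraction}

Building the discernibility sets pairwise is quadratic in the number of objects, which is consistent with the O(n²) cost discussed in Section 5 and is one reason for working on bounded segments rather than on the whole log at once.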
4.1 Reduct Extraction

To approach reduct extraction, considering the information provided by the DARPA repository, we randomly chose 8 logs (out of 25) for the year 1998 and put them together. The combined log was then evenly divided into segments of 25,000 objects, yielding 365 partial log files. For each partial log file, we made Rosetta extract the associated reduct using Johnson's algorithm with a 100% support count (see Section 3.2). After this extraction process, we sampled the resulting reducts and, using a frequency-based discriminant, constructed a minimum common reduct (MCR). The MCR keeps most of the information in the original data while minimising the number of attributes. The largest reduct in the original set has 15 attributes and our minimum common reduct has 18 attributes. This is still a 66.66% reduction in the number of attributes. The 18 chief attributes are: Access-Mode, File-System-ID, arg-value-1, arg-value-3, Remote-IP, Effective-Group-ID, Owner, inode-ID, arg-string-1, exec-arg-1, Audit-ID, Process-ID, Owner-Group, device-ID, arg-value-2, Socket-Type, Effective-User-ID and System-Call.

Before concluding this section, we report on our observations of the performance of the algorithms found in Rosetta, namely Johnson's algorithm and the genetic algorithm based reduction mechanism.

4.2 Algorithm Selection

Prior to reduct extraction, we tested the performance of the two Rosetta reduction algorithms, in order to find which is the most suitable for our work. Both reduction algorithms (see Section 3.2) were made to find a reduction using 25 log files, selected to cover different sizes and types of sessions. A minimum common reduct set containing 14 attributes was obtained after the 25 extraction processes. This amounts to a reduction of 72.5% in the number of attributes.

In general, both algorithms yielded similar total elapsed times; sometimes, however, Johnson's algorithm was faster. As expected, the total elapsed time involved in reduct extraction grows exponentially with the number of objects to be processed. For a 1,000-object log file the time needed to extract the reduct is 3 seconds, while for a 570,000-object log file it is 22 hours. The size of the reduct also increases with the diversity of the log: for a 1,000-object log we found a reduct of 8 attributes, while for a 570,000-object one we found a reduct of 14 attributes. However, for longer log files, the instability of the genetic algorithm based mechanism became apparent. Our experiments show that this algorithm is unable to handle log files containing more than 22,000 objects, each with 51 attributes. This explains why our experiments only consider Johnson's algorithm. Even though the algorithms accept indiscernibility decision graphs (that is, relations between objects), we did not use them, both because we wanted to keep the process as unsupervised as possible and because building the graphs requires knowing the relations between objects in advance, which is quite difficult even for the smaller logs of 60,000 objects. In the next section we review our experimental results to validate the resulting reducts.

5 Reduct Validation—Experimental Results

This section describes the methodology used to validate our output reduct, the main contribution of this paper. The validation methodology appeals to so-called association patterns. An association pattern is a pattern that, with the help of wildcards, matches part of an example log file. Given both a reduct and a log file, the corresponding association patterns are extracted by overlaying the reduct on that log file and reading off the values [16]. The association patterns are then compared against another log file to compute how well they cover that log file's information.
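To make this procedure concrete, here is a minimal Python sketch of the idea (ours, not Rosetta's pattern mechanism): it overlays the reduct on a training log to read off value patterns, then computes how many objects of a test log match some pattern. The min_support filter and all names are assumptions; Rosetta's actual pattern generation may differ.

    from collections import Counter

    def association_patterns(log, reduct, min_support=2):
        """Overlay the reduct on a log and read off value patterns: each
        pattern is a combination of values of the reduct attributes, with all
        other attributes acting as wildcards. Dropping patterns seen fewer
        than `min_support` times is an assumption, not Rosetta's exact rule."""
        counts = Counter(tuple(obj.get(a) for a in reduct) for obj in log)
        return {p for p, c in counts.items() if c >= min_support}

    def coverage(patterns, log, reduct):
        """Percentage of objects in `log` matched by at least one pattern."""
        if not log:
            return 0.0
        hits = sum(tuple(obj.get(a) for a in reduct) in patterns for obj in log)
        return 100.0 * hits / len(log)

    # Hypothetical usage, mirroring one cell of Table 1:
    # reduct = sorted(mcr_attributes)
    # patterns_a = association_patterns(log_a, reduct)
    # print(coverage(patterns_a, log_b, reduct))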
Our validation test thus consists of checking the quality of the association patterns generated by our output reduct, considering two log files. The rationale behind it is that the more information about the system the reduct captures, the higher the matching ratio its association patterns will have. To validate our reduct, we conducted the following three-step approach for each one of the 8 log files considered throughout our experiments: i) use the output reduct to compute the association patterns, ii) cross-validate the association patterns against all of the log files, including the one used to generate them, and iii) collect the results.

Our validation results are summarised in Table 1. The patterns contain only relations between attributes contained in the final reduct set. A set of patterns was generated from each log file. These association patterns are able to describe the information system to some extent; that extent is the covering percentage of the patterns over the information system. The first column indicates the log file used to generate the association patterns, and the first row indicates the log file used to test the covering of the patterns. By contrast, if we were to generate a set of patterns using all attributes, those patterns would cover 100% of the objects in the table that generated them. Each entry is therefore the percentage of objects we were able to identify using only the attributes in our reduct.

Table 1. Covering % of the association patterns

Training             Testing log
log         A         B         C         D         E         F         G         H
A        93.7      89.7      90.8      90.3      90.9      91.1      89.9      90.9
B        90.9      93.1      91.2      91.3      90.9      92.4      91.4      90.1
C        90.2      90.8      92.8      90.7      92.1      90.1      90.6      91.0
D        90.3      89.3      91.3      93.1      91.5      90.5      92.9      91.1
E        89.8      89.1      92.2      92.3      93.4      89.9      92.1      90.1
F        89.7      89.3      91.4      92.8      90.7      92.9      92.3      90.8
G        89.2      90.3      91.5      92.6      91.2      90.7      93.1      90.6
H        90.1      89.5      91.2      91.3      90.8      90.2      90.9      92.5
Size  744,085 2,114,283 1,093,140 1,121,967 1,095,935   815,236 1,210,358   927,456

These experiments were conducted on an Ultra SPARC 60 with two processors running Solaris 7. Even though it is a fast workstation, the reduct algorithms are the most computationally demanding part of rough set theory. The order of the reduction algorithms is O(n²), with n being the number of objects, so with logs of over 700,000 objects an overhead in time was to be expected. Calculating the quality of the rules generated with the extracted reducts is also time consuming: the generation of the rules took 27 hours (we used the larger tables to generate the rules) and another 60 hours were needed to calculate the quality of the generated rules.

In order to test the rules on another table we needed to extend the Rosetta library. This is because the internal representation of the data depends on the order in which objects are imported and saved in the table's internal dictionary (every value is translated to a numerical form), and the library has no method for importing a table using the dictionary of an already loaded table. To overcome this deficiency we extended the Rosetta library with an algorithm capable of importing a table using the dictionary of an already loaded table. This way we were able to test the rules generated on a training set over a different testing set.
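The essence of the extension is simply to encode the second table with the value dictionary built while importing the first one, so that equal symbolic values map to equal codes across both tables. The following Python sketch (ours, with invented names, not Rosetta's C++ API) illustrates the idea.

    def build_dictionary(table):
        """Map every symbolic value to an integer code, per attribute, in the
        order values are first seen (mimicking a per-table dictionary)."""
        dictionary = {}
        for obj in table:
            for attr, value in obj.items():
                codes = dictionary.setdefault(attr, {})
                codes.setdefault(value, len(codes))
        return dictionary

    def encode_with(dictionary, table, unknown=-1):
        """Encode a table reusing an existing dictionary, so the training and
        testing tables share the same numerical representation. Values never
        seen during training are mapped to `unknown`."""
        return [
            {attr: dictionary.get(attr, {}).get(value, unknown)
             for attr, value in obj.items()}
            for obj in table
        ]

    # Hypothetical usage:
    # train_dict = build_dictionary(training_table)
    # encoded_test = encode_with(train_dict, testing_table)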
We also tested the rules on the training set. In the end, the quality of the reduced set is measured in terms of the discernibility between objects: if we can still discern between two different objects with the reduced set of attributes, then the loss of information is said to be minimal. Our goal was to reduce the number of attributes without losing the discernibility between objects, so that more precise IDSs can be designed, and that goal was achieved. Even though the entire reduct extraction process is time consuming, it only needs to be done once and it can be done off-line; there is no need for live data, since the analysis can be done with stock data. Once the reduct is calculated and its quality verified, we can use only the attributes we need, thus reducing the space required to hold the logs and the time spent processing them for proper intrusion detection. The last section of this paper presents our conclusions and projected future work.

6 Conclusions and Future Work

Based on our results, we identified the chief attributes of a BSM log file without sacrificing information relevant to discernibility. With the growing amount of information flowing through an IT system, there was a need for an effective method capable of identifying key elements in the data in order to reduce the amount of memory and time used in the detection process. We think our results show that rough sets provide such a method. As future work we are planning to experiment with object reduction to facilitate the detection task even further. For now, the reduction obtained should be sufficient to explore intrusion detection methods that are computationally intensive and were previously prohibitive.

References

1. Kim, J., Bentley, P.: The Human Immune System and Network Intrusion Detection. In: Proceedings of the 7th European Conference on Intelligent Techniques and Soft Computing (EUFIT'99), Aachen, Germany, ELITE Foundation (1999)
2. Sun Microsystems: SunSHIELD Basic Security Module Guide. Part number 806-1789-10 (2000)
3. Haines, J.W., Lippmann, R.P., Fried, D.J., Tran, E., Boswell, S., Zissman, M.A.: 1999 DARPA intrusion detection system evaluation: Design and procedures. Technical Report 1062, Lincoln Laboratory, Massachusetts Institute of Technology (2001)
4. Quinlan, J.R.: Learning efficient classification procedures and their application to chess end games. In: Machine Learning: An Artificial Intelligence Approach. Springer, Palo Alto, CA (1983)
5. Breiman, L., Stone, C.J., Olshen, R.A., Friedman, J.H.: Classification and Regression Trees. Statistics-Probability Series. Brooks/Cole (1984)
6. Komorowski, J., Polkowski, L., Skowron, A.: Rough sets: A tutorial. In: Rough-Fuzzy Hybridization: A New Method for Decision Making. Springer-Verlag (1998)
7. Axelsson, S.: Aspects of the modelling and performance of intrusion detection. Thesis for the degree of Licentiate of Engineering, Department of Computer Engineering, Chalmers University of Technology (2000)
8. Lane, T., Brodley, C.E.: Temporal Sequence Learning and Data Reduction for Anomaly Detection. ACM Transactions on Information and System Security 2 (1999) 295–331
9. Lane, T., Brodley, C.E.: Data Reduction Techniques for Instance-Based Learning from Human/Computer Interface Data. In: Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann (2000) 519–526
10. Knop, M.W., Schopf, J.M., Dinda, P.A.: Windows performance monitoring and data reduction using Watchtower and Argus. Technical Report NWU-CS-01-6, Department of Computer Science, Northwestern University (2001)
11. Rencher, A.: Methods of Multivariate Analysis. Wiley & Sons, New York (1995)
12. Marin, J.A., Ragsdale, D., Surdu, J.: A hybrid approach to profile creation and intrusion detection. In: Proc. of the DARPA Information Survivability Conference and Exposition, IEEE Computer Society (2001)
13. Félix, R., Ushio, T.: Binary Encoding of Discernibility Patterns to Find Minimal Coverings. International Journal of Software Engineering and Knowledge Engineering 12 (2002) 1–18
14. Johnson, D.S.: Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences 9 (1974) 256–278
15. Viterbo, S., Øhrn, A.: Minimal approximate hitting sets and rule templates. International Journal of Approximate Reasoning 25 (2000) 123–143
16. Øhrn, A., Komorowski, J.: ROSETTA: A Rough Set Toolkit for Analysis of Data. In: Wong, P. (ed.): Proceedings of the Third International Joint Conference on Information Sciences, Volume 3, Durham, NC, USA, Department of Electrical and Computer Engineering, Duke University (1997) 403–407