Feature Selection For Intrusion Detection System
Jingping Song
Ph.D. Thesis
Department of Computer Science
Institute of Mathematics, Physics and Computer Science
Aberystwyth University
March 3, 2016
Declaration and Statement
DECLARATION
This work has not previously been accepted in substance for any degree and is not
being concurrently submitted in candidature for any degree.
Date ............................................................
STATEMENT 1
This thesis is the result of my own investigations, except where otherwise stated.
Where correction services¹ have been used, the extent and nature of the correction
is clearly marked in a footnote(s).
Date ............................................................
STATEMENT 2
I hereby give consent for my thesis, if accepted, to be available for photocopying and
for inter-library loan, and for the title and summary to be made available to outside
organisations.
Date ............................................................
¹ This refers to the extent to which the text has been corrected by others.
Abstract
Feature selection can be used to optimize the classifiers that identify attacks, removing redundant or irrelevant features while maintaining or improving classification quality. In this thesis, six feature selection algorithms are developed, and their application to intrusion detection is evaluated.
They are: Cascading Fuzzy C Means Clustering and C4.5 Decision Tree Classifica-
tion Algorithm, New Evidence Accumulation Ensemble with Hierarchical Clustering
Algorithm, Modified Mutual Information-based Feature Selection Algorithm, Mutual
Information-based Feature Grouping Algorithm, Feature Grouping by Agglomer-
ative Hierarchical Clustering Algorithm, and Online Streaming Feature Selection
Algorithm.
All algorithms are evaluated on the KDD 99 dataset, the most widely used data
set for the evaluation of anomaly detection methods, and are compared with other
algorithms. The potential application of these algorithms beyond intrusion detection
is also examined and discussed.
Acknowledgements
I would like to express my thanks for the great support from many people, without
whom this thesis would not have been possible.
Firstly, I would like to express my utmost gratitude to my supervisor Prof. Chris Price for his guidance, advice and effort, which have been essential at all stages of my research and have shaped the thesis.
I am also extremely grateful to my second supervisor Prof. Qiang Shen for his insightful advice and constant inspiration in this research. I would like to express my deepest appreciation to Dr Richard Jensen and Dr Neil S. Mac Parthaláin for the stimulating discussions and helpful advice. In addition, my sincere gratitude goes to my examiners for their time and effort, which will improve this thesis in various ways.
I would like to thank my fellow researchers in the Advanced Reasoning Group for the discussions, inspiration and teamwork: Dr Ren Diao, Dr Nitin Naik, Dr Chengyuan Chen, Dr Shangzhu Jin, Dr Pan Su, Liang Shen, Yongfeng Zhang, Tianhua Chen, Ling Zheng and Zhenpeng Li. My thanks also go to the Department of Computer Science for their assistance and for making me feel at home.
Finally and most importantly, I would like to thank my mother Wenhui Chen and
my wife Ni Zhu for their constant support, encouragement and motivation, which
enabled me to overcome any difficulties I encountered during my research.
Lastly, thanks go to my family members Jubilee and Millie, and their friends Miu Miu and Jasper, for their companionship and the great happiness they brought.
Contents
Contents i
List of Figures v
List of Tables ix
List of Algorithms xi
1 Introduction 1
1.1 Network Security Threat . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Intrusion Detection System and Feature Selection . . . . . . . . . . . . 4
1.2.1 Intrusion Detection System . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Aims of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 KDD 99 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.2 Statistical Observations . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Structure of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Background 19
2.1 Intrusion Detection System . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Brief history of IDS . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Classification of IDS . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.3 Approaches to Intrusion Detection . . . . . . . . . . . . . . . . . 23
2.1.4 Challenges in Intrusion Detection . . . . . . . . . . . . . . . . . 27
2.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Feature Selection evaluation measure . . . . . . . . . . . . . . . 30
2.2.2 Feature Selection Approaches . . . . . . . . . . . . . . . . . . . . 32
2.3 Network Anomaly Detection by Machine Learning . . . . . . . . . . . . 38
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8 Conclusions 107
8.1 Summary of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.1 Short Term Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.2 Long Term Developments . . . . . . . . . . . . . . . . . . . . . . 110
Bibliography 117
List of Figures
4.1 Selecting strategy example of feature grouping . . . . . . . . . . . . . . . . 57
4.2 Dendrogram of agglomerative hierarchical clustering on KDD99 by me-
dian distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Dendrogram of agglomerative hierarchical clustering on KDD99 by inner
squared distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 True positive rate comparison chart by different number of selected features 63
4.5 False positive rate comparison chart by different number of selected features 63
4.6 Precision comparison chart by different number of selected features . . . 64
4.7 Total Accuracy comparison chart by different number of selected features 64
4.8 F-measure comparison chart by different number of selected features . . 65
4.9 Time taken to build model comparison chart by different number of features 67
4.10 Precision comparison of FGMI-AHC by different number of selected features 69
4.11 F-measure comparison of FGMI-AHC by different number of selected
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.12 Total accuracy comparison of FGMI-AHC by different number of selected
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1 True positive rate comparison chart by different number of selected features 78
5.2 False positive rate comparison chart by different number of selected features 79
5.3 Precision comparison chart by different number of selected features . . . 79
5.4 F_Measure comparison chart by different number of selected features . . 80
5.5 Accuracy comparison chart by different number of selected features . . . 80
5.6 Precision of different feature streaming order test . . . . . . . . . . . . . . 82
5.7 F_Measure of different feature streaming order test . . . . . . . . . . . . . 82
List of Tables
1.1 Class label details that appear in "10% training KDD" and "Corrected
KDD" datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Basic features of individual TCP connections . . . . . . . . . . . . . . . . . 16
1.3 Content features within a connection suggested by domain knowledge . 16
1.4 Time-based traffic features computed using a two-second time window . 17
1.5 Connection-based traffic features . . . . . . . . . . . . . . . . . . . . . . . . 18
6.1 Results obtained by four feature selection methods over KDD99 training
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 Unbalanced continuous features of KDD Cup 99 dataset . . . . . . . . . . 86
6.3 Performance evaluation comparison . . . . . . . . . . . . . . . . . . . . . . . 90
6.4 Comparison Results Between Proposed algorithm and Other algorithms 92
7.4 Classification accuracy comparison by SVM . . . . . . . . . . . . . . . . . . 98
7.5 Classification accuracy comparison by Naive Bayes . . . . . . . . . . . . . . 99
7.6 Comparison Results Between OSFS2 and mRMR by different learning
algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.7 Datasets for test in experiments . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.8 Accuracy comparison between NEAC and Ada’s algorithm . . . . . . . . . 104
7.9 Accuracy comparison by NEAC on different parameters . . . . . . . . . . . 106
List of Algorithms
6.1.1 Cascading Fuzzy C Means clustering and C4.5 decision tree classi-
fication algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2.1 New Evidence Accumulation Clustering Algorithm . . . . . . . . . 89
Chapter 1
Introduction
With the development of computer technology and network communication technology, computer networks have spread rapidly in recent years. The Internet has become an important medium for information exchange and sharing in our society. It offers huge information capacity, high-speed transmission, worldwide coverage, and a high degree of openness and interactivity. It is thus profoundly changing the way people work and live, and influences political, economic, military, cultural and technological development[126].
Network information security has become an increasingly serious concern with the rapid development of computer networks, and it is also an important factor restricting the growth of the network. In recent years, network attacks and information security incidents have occurred frequently, covering ever wider areas and causing ever greater harm[164]. In October 2002, the 13 root domain name servers responsible for the operation of the global Internet were hit by DDoS (Distributed Denial of Service) attacks, and nine of them had their service interrupted[157]. In January 2003, the Internet suffered a massive "worm" virus outbreak; network services were severely affected worldwide, including the Americas, Europe, Asia and Australia[26]. From 2011, further new challenges appeared, such as unauthorized data disclosure, cell phone privacy and security problems, and Advanced Persistent Threat (APT) attacks. Network security issues have caused economic losses to companies, and users may panic and no longer trust the network[130, 5]. Network security is a process of defending against all forms of internal and external threats in order to ensure the security of communication networks and information [60]. Network security design usually includes safety equipment, firewalls, virtual private networks, intrusion detection systems, security servers and access control mechanisms. From the perspective of system security, research on network security divides into attack and defense. Attack techniques include computer network scanning, monitoring, stealth, invasion and backdoors. Defense technology mainly includes secure configuration of operating systems, firewall technology, information encryption technology and network intrusion detection[156].
using cryptographic digital signatures[171]. Besides the four elements in figure 1.1, availability, controllability and accountability are also elements of network security.
At present, there are many threats on the network, mainly in the following aspects.
4. Interference with the normal operation of the system. Constant interference with a network service system changes its normal operating state, slows its response or even stops it, affecting normal users.
Currently, computer viruses and network intrusion attacks are the two most common forms of network threat. Network attacks hide within large volumes of network activity and are not constrained by geographical or time limits.
An IDS can be classified in different ways[24]. There are network-based (NIDS) and host-based (HIDS) intrusion detection systems[185], and there are misuse-based and anomaly-based approaches to intrusion detection. These will be described in detail in chapter 2.
identify previously unknown attacks. The test dataset used here contains unknown attacks, and these will be introduced in detail in chapter 2.
On the other hand, in the real world an IDS usually runs day after day, and IDS datasets contain very large numbers of instances, so classification or clustering takes time. It can take several days to obtain classification results from a dataset with over 1 million instances, and a dataset with many instances and features also demands a great deal of memory and computational resources. Feature selection is therefore very necessary for IDS datasets, since they usually include a large number of instances and features.
The motivation of this thesis is to find the relationship between features and class labels in an IDS dataset and to design feature selection algorithms accordingly. Features are then selected from IDS datasets using these algorithms, and the performance of the selected features is analysed and compared against other feature selection algorithms. Finally, the proposed algorithms are tested on other datasets to assess their effectiveness.
A connection is defined as a sequence of TCP packets carrying data from a source IP address to a destination IP address under a predefined protocol (such as TCP or UDP), from beginning to end within a period of time. Each connection is classified as either attack or normal. Attacks are sub-classified into four categories covering 39 types. The seven-week training dataset contains only 22 attack types, while the test dataset includes 17 further unknown types[18]. It is notable that the probability distribution of the test data differs from that of the training data, and that the test data contains certain attack types which do not appear in the training data. It is believed by some intrusion experts that most novel attacks are variants of known attacks, whose signatures are sufficient to capture the novel variants[105, 153].
(1) Denial of Service Attack (DoS): Some computing or memory resources are made too busy or too full to accept legitimate requests, or to allow legitimate users to access a machine. E.g., ping-of-death, syn flood, smurf.
(2) User to Root Attack (U2R): An attacker gains access to a normal user account (perhaps by a dictionary attack, sniffing passwords or social engineering) and exploits a vulnerability in the system to gain root access to it. E.g., buffer overflow attacks.
(3) Remote to Local Attack (R2L): By sending packets to a machine over a network, an illegitimate user exploits a vulnerability of the machine to gain local access to it as a user. E.g., guessing passwords.
(1) Basic features: This category contains all the attributes extracted from a TCP/IP connection. Monitoring these features introduces a fixed delay in detection.
(2) Content features: Features describing suspicious behavior in the data portion of packets must be captured in order to detect certain attacks, e.g. the number of failed login attempts. These are called content features. R2L and U2R attacks normally do not appear in frequent sequential intrusion patterns, as they are embedded in the data portions of packets and request only a single connection, whereas DoS and Probing attacks involve many connections to hosts and do exhibit frequent sequential intrusion patterns.
(3) Time-based traffic features: Only the connections in the past 2 seconds are examined, namely those which have the same destination host/service as the current connection, and statistics related to protocol behavior, service, etc. are calculated over them.
(4) Connection-based traffic features: Some slow probing attacks scan the hosts/services at an interval much longer than 2 seconds, e.g. once every minute. These cannot be detected by the time-based traffic features, which only examine the connections of the past 2 seconds. In such cases, the features of connections to the same destination host/service can instead be re-calculated over a window of the last 100 connections rather than a time window[50].
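To make the distinction concrete, the following sketch (illustrative Python, not code from the thesis; the record format and function names are assumptions) derives a simple count feature under both windowing schemes, showing why a one-connection-per-minute probe is invisible to the 2-second window but visible in the 100-connection window.

```python
# Hypothetical connection records: (timestamp_seconds, destination_host).
def time_based_count(records, current, window=2.0):
    """Connections to the same destination host within the past `window` seconds."""
    t, host = current
    return sum(1 for ts, h in records if h == host and 0 <= t - ts <= window)

def connection_based_count(records, current, last_n=100):
    """Connections to the same destination host among the last `last_n` connections."""
    _, host = current
    return sum(1 for _, h in records[-last_n:] if h == host)

records = [(i * 60.0, "10.0.0.5") for i in range(10)]   # slow probe, once a minute
records.append((599.5, "10.0.0.9"))                     # unrelated traffic
probe = (600.0, "10.0.0.5")
print(time_based_count(records, probe))        # 0  -- nothing in the past 2 seconds
print(connection_based_count(records, probe))  # 10 -- the probe pattern emerges
```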
Figure 1.6: Basic characteristics of "10% KDD" training dataset
Figure 1.7: Basic characteristics of "whole KDD" training dataset
The "Corrected KDD" dataset is a test dataset with statistical distributions different from both "10% training KDD" and "Whole KDD". It contains 17 additional attacks[92]. The basic characteristics of the "Corrected KDD" dataset are shown in figure 1.8.
The list of class labels and their corresponding categories for the "10% training KDD" and "Corrected KDD" datasets is given in table 1.1.
There are several categories of derived features. Basic features of individual TCP connections are shown in table 1.2, and content features are illustrated in table 1.3. Time-based traffic features examine only the connections in the past two seconds which have the same destination host/service as the current connection, and over these the statistics related to protocol behavior, service, etc. are calculated. The "same host" and "same service" features are together called time-based traffic features, illustrated in table 1.4[167].
There are probing attacks which scan the hosts/ports at an interval much longer than two seconds, e.g. once every minute. Such connections cannot be detected by time-based traffic features. Therefore, connection-based traffic features[181] are introduced, which examine the connection records within a window of 100 connections for connections to the same host/service. These are shown in table 1.5.
Chapter 2: Background
This chapter provides a background introduction to intrusion detection systems and feature selection, together with a comprehensive review of the most recent feature selection methods.
Both of these algorithms use the same selection strategy for feature grouping: within each group, the feature with the largest mutual information with the class label is selected. The performance evaluation results show that better classification performance can be attained from the features selected in this way.
means clustering method is used to partition the training instances into clusters, and on each cluster a decision tree is built using the C4.5 algorithm. Experiments on the KDD CUP 99 dataset show that the proposed method achieves better intrusion detection performance while reducing the features used by more than 80%.
Chapter 8: Conclusion
This chapter summarises the key contributions made by the thesis, together with
a discussion of topics which form the basis for future research. Both immediately
achievable tasks and long-term projects are considered.
Table 1.1: Class label details that appear in the "10% training KDD" and "Corrected KDD" datasets

                     Training Data               Test Data
Category   Class labels(23)    Number   Class labels(38)    Number
Normal     normal              97278    normal              60593
Probe      ipsweep             1247     ipsweep             306
           nmap                231      nmap                84
           portsweep           1040     portsweep           354
           satan               1589     satan               1633
                                        saint               736
                                        mscan               1053
DoS        back                2203     back                1098
           land                21       land                9
           neptune             107201   neptune             58001
           pod                 264      pod                 87
           smurf               280790   smurf               164091
           teardrop            979      teardrop            12
                                        apache2             794
                                        mailbomb            5000
                                        udpstorm            2
                                        processtable        759
U2R        perl                3        perl                2
           rootkit             10       rootkit             13
           loadmodule          9        loadmodule          2
           buffer_overflow     30       buffer_overflow     22
                                        httptunnel          158
                                        ps                  16
                                        sqlattack           2
                                        xterm               13
R2L        ftp-write           8        ftp-write           3
           guess_passwd        53       guess_passwd        4367
           multihop            7        multihop            18
           phf                 4        phf                 2
           imap                12       imap                1
           spy                 2
           warezclient         1020
           warezmaster         20       warezmaster         1602
                                        named               17
                                        xsnoop              4
                                        xlock               9
                                        sendmail            17
                                        worm                2
                                        snmpgetattack       7741
                                        snmpguess           2406
Total                          494021                       311029
Table 1.4: Time-based traffic features computed using a two-second time window
Chapter 2
Background
The detection of network attacks is an important task for network operators in today's Internet. The principal challenge in automatically detecting network attacks is that they are a moving and ever-growing target[58, 169]. Intrusion detection systems usually detect anomalous attacks by monitoring packets, and machine learning technologies are often used to identify whether traffic data is normal or anomalous[54]. Two common machine learning approaches are classification-based and clustering-based methods. However, some of these methods lose effectiveness or even become invalid in this area, since the data volume is often very large. Moreover, network traffic data contains many features, some of which are irrelevant or redundant. Feature selection algorithms are therefore usually used to remove irrelevant and redundant features. This chapter introduces intrusion detection systems, feature selection, and machine learning methods.
work on how to construct an effective IDS model. Since an IDS needs to detect, defend against and respond to computer attacks, researchers have to consider many problems when constructing IDS models, such as data collection, intrusion identification, reporting and response. The structure of IDS construction is shown in figure 2.1. An IDS is composed of four parts, as follows[118, 131].
2. Data collection and storage. This part collects all data from every event and converts the data to a proper format for storage.
3. Data analysis and management. This is the core part of an IDS. It searches for suspect actions and generates a signal when it detects an attack. The IDS then deals with the attack itself or sends a signal to the network administrator to handle it.
Based on differences in the data analysis and processing unit, IDSs can be separated into misuse detection and anomaly detection, as shown in figure 2.3. Misuse detection analyses and detects intrusions by representing intrusion behavior as a pattern or signature, and establishes a database of intrusion characteristics based on known intrusion behavior patterns. Detection then consists of monitoring the actual behavior patterns of the system or the user and matching them against the database; according to the result of the matching, the system determines whether there is an intrusion [82]. Supervised machine-learning methods can help to compose signatures. Misuse detection systems are highly effective at detecting those attacks which they are programmed to alert on. However, they cannot detect new attacks, since they cannot recognize attacks which do not match their lists of signatures[145, 77]. Misuse detection based intrusion systems can be divided into stateless and stateful: stateless misuse detection systems use only the existing signatures, whereas stateful systems use not only existing signatures but also previous ones[140]. By contrast, anomaly detection creates a model of normal operation for users, and any operation that does not comply with the normal behavior is prevented. The principle of anomaly detection is to treat every exception as a possible attack; this detection method can therefore detect unknown attacks[22]. Anomaly detection based intrusion systems can be further classified into self-learning and rule-based. The difference is that in a rule-based intrusion detection system the rules describing normal behavior are fully specified in advance, whereas self-learning systems typically need a training process which lets the system learn what normal network behavior is.
In accordance with the response mechanism, IDSs fall into reactive response and passive response systems, and according to usage frequency, IDSs can be divided into online and offline systems.
well-known network intrusion detection tool Snort uses a simple pattern-matching method. It uses a rule base to describe known intrusion behavior, and the rule base is stored as text files; it is easily readable and can be modified[170, 15].
2. Expert system
The expert system is one of the earliest misuse detection schemes and has been adopted by many classical intrusion detection models. To operate it, the user has to input information about known intrusion behavior in the special format that the expert system requires, and the expert system constructs a rule base from this information. It then matches observed events against the rules in the rule base to determine whether an intrusion has occurred. For users, the expert system is an autonomous "black box"; they do not need to understand or interfere with its internal reasoning and decision-making process[62, 33]. The main problems with expert systems are that maintenance of the rule base is complex, since the relationships between rules must be considered whenever rules are changed, and that they are inefficient in dealing with massive data[14, 70].
the system behavior meets the requirements of the intrusion proposition before
it is checked and matched[68].
Anomaly detection judges that an intrusion has occurred when there is a certain difference between the actual behavior of the monitored system or user and normal behavior[103, 176]. The advantage of anomaly detection is that it needs little knowledge about system defects and has strong adaptability, so it can detect unknown intrusions or new intrusion patterns. The core problem of anomaly detection is how to represent the normal behavior of the system or the user. Current anomaly detection mainly uses the following three methods[98, 91].
1. Statistical method
Anomaly detection based on statistical methods uses a specific statistical model to learn the normal behavior of the system or the user, and identifies abnormal behavior as behavior that deviates from the normal, based on large amounts of statistical data. The key issues in the statistical method are the selection of the statistical object and statistical model, and the training of the statistical model. The following are some of the possible statistical objects[78].
An advantage of this method is that it is not very difficult to select an appropriate statistical model for a specific intrusion detection task. The weakness is that the threshold value is difficult to determine: a value that is too large or too small will affect the accuracy of detection. Moreover, much system or user behavior is very difficult to describe with a simple statistical model, while complex statistical models require heavy computation[120].
10. The interaction between intrusion detection systems, and between intrusion detection systems and other security components.
An intrusion detection system can be combined with other IDSs or security components by cascaded connection or integration.
effects[71]. From the definition of feature selection, it can be seen that evaluation criteria are very important given a specific learning algorithm and dataset[143, 83]. Generally, a feature selection algorithm includes four parts, shown in figure 2.4: generation, evaluation, stopping criterion and validation. The feature selection process can be seen as removing irrelevant and redundant features. Irrelevant features have little correlation with the class labels, while redundant features are strongly related to already selected features; neither kind helps classification.
The evaluation function is used to evaluate the merits of the candidate subset obtained by the search. The evaluation value is compared with the best value stored so far, and if the new value is higher, the previous candidate subset is replaced[168, 114].
Wrapper methods were first proposed by John in 1994, and their structure is shown in figure 2.7[75]. Wrapper methods optimize a classifier as part of the selection process and choose those features with high prediction performance as induced by specified learning algorithms [121, 154]. The selected feature subset will therefore vary with the learning algorithm, since the best evaluation criterion is the performance of the learning algorithm applied to the selected feature subset. Wrapper methods place no limitation on the learning algorithm: decision trees, KNN, Bayesian networks and SVMs can all be used[57, 84]. In general, wrapper methods obtain better subsets than filter methods, but they take long processing times, have low adaptability and need to be re-trained for different learning algorithms.
Hybrid methods combine the filter and wrapper approaches and take advantage of both [101, 134]. The hybrid mechanism typically has two steps. First, candidate features are preprocessed by a filter method and irrelevant features are removed, reducing the dimension of the dataset. Then, features are selected by a wrapper method, with a classification learning algorithm used to evaluate the selected subsets[63, 195].
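As an illustration of this two-step mechanism, the sketch below uses scikit-learn (an assumption; the thesis itself uses Weka): a mutual-information filter first prunes the candidate set, and recursive feature elimination with a decision tree then stands in for the wrapper search.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40, n_informative=8,
                           random_state=0)

# Step 1 (filter): cheaply drop clearly irrelevant features, reducing dimension.
filt = SelectKBest(score_func=mutual_info_classif, k=20).fit(X, y)
X_filtered = filt.transform(X)

# Step 2 (wrapper): search the remaining features, using a learning algorithm
# (here a decision tree) to evaluate candidate subsets.
wrapper = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=8)
wrapper.fit(X_filtered, y)
print("selected (indices into the filtered set):", wrapper.support_.nonzero()[0])
```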
2.2.2.1 DMIFS
dataset and it is described as D(F, C), where C represents the class labels, and S and F are the selected and candidate feature subsets, respectively. DMIFS borrows from semi-supervised learning, which combines supervised and unsupervised methods and takes advantage of both labeled and unlabeled instances for training and classification.
Normally, the instances in T = D(F, C) can be divided into two types: labeled and unlabeled. The stopping condition is set so that the selection procedure ceases when the selected features carry the same information as the original features, a condition frequently used in feature selection. While there are still unlabeled instances in D, the procedure continues to pick candidate features from F. Assume that S is the subset of selected features, and that the instances of D are classified into two categories according to the labels C: Du and Dl are the unlabeled and labeled instances, respectively.
re-calculated in the estimation of mutual information in the next round; the selected feature is kept aside and later discarded from Du, since new labeled instances Dl will be produced from Du. After that, the algorithm moves to the next round and picks further candidate features. The procedure continues until there are no candidate features left in F or the number of unlabeled instances equals the inconsistency count of T.
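The following sketch is one interpretation of this dynamic mechanism (not the authors' code; it assumes discretised features): mutual information is re-estimated on the still-unlabeled instances each round, and instances whose selected-feature pattern has become class-pure are moved to the labeled pool.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def dmifs_sketch(X, y, max_features):
    n_features = X.shape[1]
    selected, candidates = [], list(range(n_features))
    unlabeled = np.ones(len(y), dtype=bool)        # all instances start unlabeled
    while candidates and unlabeled.any() and len(selected) < max_features:
        # Pick the candidate with maximal MI with the class on unlabeled instances.
        best = max(candidates,
                   key=lambda f: mutual_info_score(y[unlabeled], X[unlabeled, f]))
        selected.append(best)
        candidates.remove(best)
        # Instances whose selected-feature pattern is class-pure become "labeled"
        # and are discarded from Du before the next round.
        patterns = [tuple(row) for row in X[:, selected]]
        for p in set(patterns):
            idx = [i for i, q in enumerate(patterns) if q == p]
            if len({y[i] for i in idx}) == 1:
                unlabeled[idx] = False
    return selected
```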
2.2.2.2 mRMR
$$\max D(S, C), \qquad D = \frac{1}{|S|} \sum_{f_i \in S} I(f_i; C) \qquad (2.2)$$

$$\min R(S), \qquad R = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j) \qquad (2.3)$$
$$\max_{f_j \in F - S_{m-1}} \left[ I(f_j; C) - \frac{1}{m-1} \sum_{f_i \in S_{m-1}} I(f_j; f_i) \right] \qquad (2.5)$$
Algorithm 2.2.2 describes the mRMR feature selection scheme. To select the candidate feature set, the algorithm computes the cross-validation classification error for a large number of features and finds a relatively stable range of small error, called Ω. The optimal number of features of the candidate set, denoted n*, is determined within Ω.
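A minimal sketch of the incremental criterion in equation 2.5 follows (illustrative Python assuming discretised features; the determination of n* within Ω by cross-validation is omitted).

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_sketch(X, y, n_select):
    n = X.shape[1]
    relevance = [mutual_info_score(y, X[:, f]) for f in range(n)]
    selected = [int(np.argmax(relevance))]          # start with the most relevant
    while len(selected) < n_select:
        rest = [f for f in range(n) if f not in selected]
        def score(f):                                # relevance minus mean redundancy
            redundancy = np.mean([mutual_info_score(X[:, f], X[:, s])
                                  for s in selected])
            return relevance[f] - redundancy
        selected.append(max(rest, key=score))
    return selected
```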
2.2.2.3 IG
Information gain (IG) uses Shannon’s entropy to measure feature set quality [113].
The information for D given class ci at the root amounts to
$$I(D) = -\sum_{i=1}^{d} P_D(c_i) \log_2 P_D(c_i) \qquad (2.6)$$

The information for $D_j^f$ due to partitioning $D$ at $f$ is

$$I(D_j^f) = -\sum_{i=1}^{d} P_{D_j^f}(c_i) \log_2 P_{D_j^f}(c_i) \qquad (2.7)$$
2.2.2.4 ReliefF
Relief is a family of algorithms comprising Relief, ReliefF and RReliefF. Relief, proposed by Kira, is a feature weighting algorithm; it is simple and efficient, but limited to two-class classification. In 1994, Kononenko extended it to the ReliefF algorithm, which can handle multi-class problems[88]. In 1997, ReliefF was further improved into RReliefF, which can handle continuous target attributes[89]. Here we introduce ReliefF, which is used in chapter 7 for comparison with the proposed algorithms. ReliefF's estimate W[A] of attribute A is an approximation of the following difference of probabilities:
$$W(A) := W(A) - \sum_{j=1}^{k} \frac{\mathit{diff}(A, R, H_j)}{m \cdot k} + \sum_{C \neq class(R)} \left[ \frac{p(C)}{1 - p(class(R))} \sum_{j=1}^{k} \mathit{diff}(A, R, M_j(C)) \right] \Big/ (m \cdot k) \qquad (2.9)$$
The key idea of ReliefF is to estimate attributes according to how well their values distinguish between instances that are near each other. For a given instance, ReliefF searches for its nearest neighbours: those from the same class (called nearest hits) and those from different classes (called nearest misses).
The ReliefF algorithm is shown in Algorithm 2.2.4. ReliefF is efficient and places no limits on data types, but it cannot remove redundant features.
8  for A = 1 to N do
9      if W(A) ≥ δ then
10         S = S ∪ {A};
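The weight update of equation 2.9 can be sketched as follows (an illustration under simplifying assumptions, not the thesis implementation: features are numeric and pre-scaled so that diff(A, x1, x2) = |x1[A] − x2[A]|).

```python
import numpy as np

def relieff_sketch(X, y, m=100, k=5, seed=0):
    rng = np.random.default_rng(seed)
    n_inst, n_feat = X.shape
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / len(y)))
    W = np.zeros(n_feat)
    for _ in range(m):
        i = rng.integers(n_inst)
        R = X[i]
        dists = np.abs(X - R).sum(axis=1)              # Manhattan distance to R
        dists[i] = np.inf                              # exclude R itself
        for c in classes:
            idx = np.where(y == c)[0]
            nearest = idx[np.argsort(dists[idx])][:k]  # k nearest within class c
            diff = np.abs(X[nearest] - R).mean(axis=0) # mean diff over the k
            if c == y[i]:
                W -= diff / m                          # nearest hits decrease W
            else:                                      # nearest misses increase W,
                W += prior[c] / (1 - prior[y[i]]) * diff / m  # weighted by prior
    return W
```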
2.3 Network Anomaly Detection by Machine Learning
Given a training dataset T = D(F, C), the task of a learning algorithm for classification is to induce a hypothesis h: F_i → C from T, where F_i is the value domain of f_i ∈ F. Since there is a limited number of instances in D, each classifier has a classification error ε_F(h) = |{(o, c) ∈ D | h(o) ≠ c}| / m, where h(o) is the class label of o predicted by the hypothesis h. Feature selection can change F, and so change ε_F(h). After the feature selection process, the reduced datasets are passed to learning methods for classification.
As stated above, there are two kinds of learning methods. Supervised methods are based on classifiers, such as C4.5 [133], Bayesian methods [2, 127], ID3, JRip, PART, SMO and IBK.
Unsupervised methods are based on clustering, such as Fuzzy C-Means, Sub-Space Clustering (SSC) [144], Density-based Clustering [48], and Evidence Accumulation Clustering (EAC) techniques [44]. One advantage of unsupervised methods is that they can detect unknown attacks[104, 150].
Kabir put forward an algorithm based on neural networks[79]. Inza incorporated Bayesian networks into wrapper algorithms and obtained improved classification performance[69].
The degree of relevance within a feature subset is very important to the performance of feature selection. Generally, there are two ways to measure relevance between features: linear correlation measurement and correlation measurement based on information theory. The BIF feature selection algorithm uses mutual information to measure the degree of relevance between the features and the class labels, and outputs the K most relevant features as the optimal feature subset [72]. This method can effectively eliminate irrelevant features, but the selected subset may still contain many redundant features. Battiti put forward a feature selection algorithm based on mutual information (MIFS) [13]. This algorithm uses mutual information to measure the relevance between features and class labels, and at the same time calculates the degree of relevance between a candidate feature and the selected feature set. However, MIFS is less robust when redundant features are highly correlated with the class labels. Kwak and Choi introduced the MIFS-U algorithm, which uses an uncertainty coefficient to represent the relevance degree of features [93]. Peng proposed the Max-Relevance and Min-Redundancy algorithm, which uses mutual information to evaluate features [148]. Lee introduced an information gain and divergence-based feature selection algorithm, which obtains a feature subset by deleting redundant features while maintaining the information gain [99]. The mutual information-based feature selection method proposed in [3] detects intrusions with higher accuracy.
2.4 Summary
In this chapter, intrusion detection and feature selection have been introduced. Intrusion detection is not a new research area, but it remains an important one, since new challenges and new attacks emerge continuously. Its history and challenges have been briefly described, and some solutions for IDS introduced.
Since IDS data usually contain very large numbers of both instances and features, feature selection methods are used to reduce dimensionality and remove irrelevant and redundant features; feature selection can also help to improve classification accuracy. This chapter has introduced feature selection models and several feature selection algorithms, which will be compared with the proposed algorithms in the following chapters. Machine learning methods used for network anomaly detection have also been briefly set out; some of the literature mentioned in this part applies classification or clustering to IDS data.
Chapter 3
Modified Mutual Information-based Feature Selection for Intrusion Detection Systems in Decision Tree Learning
This chapter proposes a modified mutual information-based feature selection algorithm (MMIFS) for intrusion detection on the KDD Cup 99 dataset. The C4.5 classification method is used with this feature selection method. In comparison with the dynamic mutual information feature selection algorithm (DMIFS), most performance measures are improved. Furthermore, this chapter shows the relationship between performance, efficiency and the number of features selected.
the information amount among them. Only finite random variables with discrete values are considered within this chapter [45].
Here p(x, y) is the joint probability density function and p(x|y) is the posterior probability of X given Y.
When X and Y are unrelated, the value of I(X; Y) is 0; when I(X; Y) is high, X and Y are closely related. Mutual information is applicable to the evaluation of arbitrary dependencies between random variables. Within this chapter, we only compute the mutual information between two variables, as a measure of the mutual dependence between them [163].
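As a concrete reference, the following sketch (illustrative, not thesis code) computes I(X; Y) for two finite discrete variables directly from their joint empirical distribution.

```python
import numpy as np

def mutual_information(x, y):
    n = len(x)
    joint = {}
    for pair in zip(x, y):                      # empirical joint distribution
        joint[pair] = joint.get(pair, 0) + 1
    px, py = {}, {}
    for (a, b), c in joint.items():             # marginal distributions
        px[a] = px.get(a, 0) + c / n
        py[b] = py.get(b, 0) + c / n
    return sum((c / n) * np.log2((c / n) / (px[a] * py[b]))
               for (a, b), c in joint.items())

x = [0, 0, 1, 1, 1, 0]
y = ['n', 'n', 'a', 'a', 'n', 'n']
print(round(mutual_information(x, y), 4))       # ~0.4591 bits: x and y are related
```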
To verify that there are irrelevant and redundant features in the KDD Cup 99 dataset, Correlation-based Feature Selection (CFS) was used in Weka to select 8 features. Two performance measures (precision and F-measure), discussed in detail in section 3.3.2, were calculated using four classification methods. Figure 3.2 shows the precision comparison between 41 features and 8 features, for the normal and anomaly types respectively. Similarly, figure 3.3 shows the F-measure comparison.
Figure 3.2: Precision comparison chart between all features and selected features
Figure 3.3: F-measure comparison chart between all features and selected features
The two figures show that, for each classification method, the performance with 8 features is quite close to that with all 41. For the J48 and PART methods, the performance with 8 features is actually improved. Another advantage of selecting features is that the running time is shorter than when using all features; the computation time comparison is shown in section 3.3.2.
Figure 3.4: Mutual information between each feature and the class label in the KDD99 dataset
The features can be ranked by mutual information using figure 3.4, but features should not be selected on that basis alone. Take features 5, 12 and 23 as an example, and let C represent the class label; the mutual information between the three features and C is I(f5; C) = 0.6424, I(f12; C) = 0.381 and I(f23; C) = 0.6179. In descending order, the three are sorted as f5, f23, f12. But after f5 is selected, the instances correctly determined by f5 should be removed. Battiti proposed an evaluation function considering the mutual information between features, shown in formula 3.5, along with the method called mutual information-based feature selection (MIFS). In this case, the mutual information between f5 and f23 is I(f5, f23) = 1.472 and that between f5 and f12 is I(f5, f12) = 0.5436; according to Battiti's evaluation function, f12 will be selected rather than f23. In 2009, Huawen Liu proposed a dynamic mutual information method called DMIFS, which improved on MIFS in several performance respects.
$$I(f_i; C) - \beta \sum_{f_s \in S} I(f_i; f_s) \qquad (3.5)$$
In formula 3.5, f_i represents each candidate feature and f_s denotes a feature in the selected feature set S. β is a parameter, and Battiti suggested it should be between 0.5 and 1. In this study, however, we argue that the parameter should be related to the mutual information between each feature and the class label, rather than being a fixed value. We therefore put forward an improved algorithm named MMIFS, as follows.
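For reference, a greedy selector driven by formula 3.5 might look as follows (an illustrative sketch assuming discretised features; β is Battiti's fixed weight, and the MMIFS modification replaces it with a value tied to the feature-class mutual information, whose exact form is given in the algorithm listing rather than reproduced here).

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mifs_sketch(X, y, n_select, beta=0.7):
    n = X.shape[1]
    rel = [mutual_info_score(y, X[:, f]) for f in range(n)]   # I(f_i; C)
    selected = [int(np.argmax(rel))]
    while len(selected) < n_select:
        rest = [f for f in range(n) if f not in selected]
        # Battiti's criterion: relevance penalised by MI with selected features.
        crit = lambda f: rel[f] - beta * sum(mutual_info_score(X[:, f], X[:, s])
                                             for s in selected)
        selected.append(max(rest, key=crit))
    return selected
```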
3.3 Experimental Results
Classification performance is evaluated using six measures: True Positive Rate (TPR), False Positive Rate (FPR), Precision, Total Accuracy, Recall and F-Measure. The six measures are calculated from the True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) counts, as follows.
True positive rate (TPR): TP/(TP+FN), also known as detection rate (DR) or
sensitivity or recall.
False positive rate (FPR): FP/(TN+FP), also known as the false alarm rate.
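These measures are straightforward arithmetic on the four counts; a small reference implementation (with illustrative counts) follows.

```python
def measures(TP, FP, TN, FN):
    tpr = TP / (TP + FN)                   # detection rate / sensitivity / recall
    fpr = FP / (TN + FP)                   # false alarm rate
    precision = TP / (TP + FP)
    accuracy = (TP + TN) / (TP + FP + TN + FN)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, precision, accuracy, f_measure

print(measures(TP=900, FP=50, TN=950, FN=100))
```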
The figure shows tests of up to 13 features obtained by MMIFS. The reason for testing 13 features is that, according to the feature selection algorithms used previously (e.g. CFS), 13 or fewer selected features of the KDD 99 dataset can achieve the best performance with relatively little computation time. Total accuracy does not keep increasing as the number of features rises, because there are many noisy and redundant features in the dataset.
From figure 3.5, the total accuracy is between 93% and 94% when 2 to 13 features are selected, and the accuracy is very close across different feature numbers. However, considering the large number of instances in the KDD 99 dataset, even a small improvement in accuracy means that many more instances are correctly classified.
Any number of features between 2 and 13 could be used for comparison, but when DMIFS was used to obtain the features, we observed that for small numbers most of the features are the same as those obtained by MMIFS. We therefore chose 10 features to compare the algorithms, since DMIFS, against which MMIFS is compared, achieves its best performance when it selects 10 features.
3.3.2 Results
As discussed in section 2.3, intrusion detection can be considered as a two-class or a multi-class problem. In this experiment, all attack types are regarded as the anomaly pattern and the other class as the normal pattern, addressing intrusion detection as a two-class problem.
In the following subsection, C4.5 is used to classify the dataset and to compare the performance of DMIFS and MMIFS. C4.5 outperforms several other classification
algorithms: a comparison between C4.5 and 3 other algorithms using 10 selected features is shown in table 3.1, from which it can be seen that C4.5 is much better than the other 3 algorithms. Some other methods, such as SVMs or neural networks, can achieve the same or even better performance than C4.5, but they take longer computation time. The experiments were conducted on the KDD 99 dataset on a Windows machine with an Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz and 4GB of RAM, running Microsoft Windows 7 Professional. The open source machine learning framework Weka 3.5.0 was used for the performance comparison of our algorithm with other classification algorithms. Table 3.2 shows the detailed comparison; it indicates that most performance measures, such as precision and F-measure, are improved by MMIFS compared to DMIFS and Battiti's method. The experiment was executed 10 times and the differences in performance between the methods were statistically evaluated using a paired t-test with two-tailed p = 0.01; MMIFS achieves statistically better results. The total accuracies for the three methods are 92.65%, 92.94% and 93.02% respectively.
Suppose there are m instances and n features in the training dataset. The time complexities of MMIFS, DMIFS and Battiti's algorithm are O(mn²), O(mn²) and O(mn) respectively.
The comparison results show that MMIFS gives better performance than the other two algorithms when all select 10 features. The first two rows show the performance of C4.5 with all 41 features, and the next four rows describe the results of Battiti's method and DMIFS, which are identical because the two methods select the same 10 features. MMIFS obtains the best performance, although the differences are not very large.
Figures 3.6 and 3.7 show that as the number of features increases, the calculation time increases significantly, indicating that computation time is greatly affected by the number of features.
3.4 Summary
This chapter proposed a new feature selection method; the main improvement of this work is that it modifies the mutual information feature selection algorithm by changing the weighting parameter. The method was tested on the KDD 99 dataset and the results compared with the DMIFS algorithm, showing that most performance indicators are improved. Future work will evaluate the algorithm on other datasets with less noise and fewer redundant features. The value of the weighting parameter may not be optimal, so further study will attempt to find values of the parameter that produce the best results. Finally, we will compare the method based on the correlation coefficient of features with the method based on mutual information.
Chapter 4
Feature Grouping for Intrusion Detection based on Mutual Information
Feature grouping uses the relationships among the features of a dataset to compose groups, together with a selection strategy that selects one or more features from each group[116]. Feature grouping can be regarded as a kind of feature selection. In particular, feature grouping that allows multiple features to be selected in one go is well suited to datasets of high dimensionality.
In this chapter, two algorithms are proposed, both using the feature grouping approach and based on mutual information. One is the mutual information-based feature grouping algorithm, introduced in section 4.1. The other is feature grouping by agglomerative hierarchical clustering based on mutual information, presented in section 4.2.
Formula 3.5 can be used to select the next feature. β is a parameter determined empirically, and Battiti proposed a value between 0.5 and 1. This algorithm indicates that feature selection should consider not only the mutual information between each feature and the class label but also the mutual information between each feature and the already selected features.
If there are n features in the dataset and f_i is feature i, then M_i(f_i; F) denotes the mutual information between f_i and all the other features, as shown in formula 4.1.
$$M_i(f_i; F) = \sum_{j=1, j \neq i}^{n} I(f_i; f_j) \qquad (4.1)$$
Feature grouping can be seen as a kind of feature selection. The relationship between one feature and the other features can be measured by some method and then used to compose groups, so that features with similar measures are put into the same group.
Clustering methods can be used to create the groups, since they gather data into clusters according to specific metrics. Different clustering methods and metrics produce different cluster structures, and the number of clusters determines how many features will be selected. For example, if 8 features are to be selected from a dataset, different strategies can be adopted: 8 groups could be created by a clustering method with 1 feature selected from each, or 4 groups could be constructed with 2 features selected per group. Figure 4.1 shows one example of a feature grouping strategy. Different numbers of features could also be selected from different groups. In this work, where a hierarchical clustering method is used to create the groups, we chose the strategy of selecting 1 feature from each group. This strategy is simple and easy to implement; a further reason is that, with agglomerative hierarchical clustering, a group might contain only one feature.
Figure 4.1 shows n features F1, F2, ..., Fn composing m groups G1, G2, ..., Gm by some specific method. From each group one feature is then selected, giving the selected feature set Fs1, Fs2, ..., Fsm.
From algorithm 4.1.1, it can be seen that the number of features selected depends on the number of groups. The mutual information between each pair of features is calculated, and the Fuzzy C-Means algorithm is used to compose the groups. Fuzzy C-Means (FCM) is a clustering method which allows one piece of data to belong to two or more clusters. The method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition and unsupervised classification.
First, the proposed algorithm calculates, for each feature, the mutual information between that feature and all the other features and adds these values together, denoted SUM_MI. It then clusters the SUM_MI values with the Fuzzy C-Means algorithm to obtain G groups. Within each group, the algorithm computes the mutual information between each feature and the class label, and finally selects the feature with the maximum value.
From the process of FGMI, it is clear that G decides how many features the algorithm selects: FGMI composes G groups and selects one feature from each, so it selects G features in total. FGMI requires users to input G at the start, as it does not attempt to decide by itself how many features should be selected. This might be seen as a disadvantage, but it makes FGMI very efficient in computation time, because algorithms that automatically determine the optimum number of selected features must add a performance evaluation or inference part. Performance evaluation of the selected features can nevertheless be used with FGMI to find the best G. FGMI is well suited to datasets with a large number of instances, such as KDD 99.
unbalanced, and nearly half of the values are quite low, so a clustering algorithm was chosen to compose the groups. Fuzzy C-Means is used to compose the G groups in FGMI. The algorithm divides a set X = {x1, x2, ..., xn} into C clusters while minimising an objective function; data are bound to each cluster by means of a membership function, which represents the fuzzy behavior of the algorithm. The data in SUM_MI are one-dimensional and unbalanced: after applying Fuzzy C-Means to SUM_MI, most of the low values go into one group, while the high values go into different groups.
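A compact sketch of FGMI is given below (an interpretation, not the thesis implementation; it assumes discretised features and includes a minimal one-dimensional fuzzy c-means, with each feature assigned to the cluster of its highest membership).

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def fcm_1d(values, c, m=2.0, iters=100, seed=0):
    x = np.asarray(values, dtype=float)
    centers = np.random.default_rng(seed).choice(x, size=c, replace=False)
    for _ in range(iters):
        d = np.abs(x[:, None] - centers[None, :]) + 1e-12
        u = 1.0 / d ** (2 / (m - 1))                  # membership degrees
        u /= u.sum(axis=1, keepdims=True)
        centers = (u ** m * x[:, None]).sum(axis=0) / (u ** m).sum(axis=0)
    return u.argmax(axis=1)                           # hard assignment

def fgmi_sketch(X, y, G):
    n = X.shape[1]
    mi = np.array([[mutual_info_score(X[:, i], X[:, j]) for j in range(n)]
                   for i in range(n)])
    sum_mi = mi.sum(axis=1) - mi.diagonal()           # SUM_MI over the other features
    groups = fcm_1d(sum_mi, G)
    selected = []
    for g in range(G):
        members = np.where(groups == g)[0]
        if members.size:                              # max I(f; C) within the group
            rel = [mutual_info_score(y, X[:, f]) for f in members]
            selected.append(int(members[np.argmax(rel)]))
    return selected
```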
There are several methods for computing the distance between clusters, such as average, centroid, complete, median, single, ward and weighted. Different methods result in different cluster trees, as in figures 4.2 and 4.3.
First of all, the algorithm sets its initialization parameters: F is the set of all features in the training dataset, C denotes the class labels, and n is the number of clusters. The algorithm then calculates the mutual information of every pair of features in F and composes a matrix from these values. Next, it creates a hierarchical cluster tree from the matrix using an agglomerative hierarchical clustering method, and constructs clusters from the tree for the given n; the n clusters form n groups of candidate features. Within each cluster, it calculates the mutual information between each feature and the class labels in C and finds the maximum value Mc. Finally, it selects the feature fs achieving Mc in each group and puts fs into S.
Several distance measures can be used between pairs of objects in AHC, such as Euclidean, Chebychev, cosine, correlation and Hamming. In this algorithm, the mutual information between each pair of features is taken as their distance. Consequently, some of the methods for computing distances between clusters cannot be used, since they are appropriate for Euclidean distances only: centroid, median and ward are thus excluded. Complete and single linkage use the furthest and shortest distances respectively. In this algorithm, the average method, meaning unweighted average distance, is used, because it is the least affected by abnormal data.
As stated in section 4.1.3, this algorithm also requires users to input the number of clusters n first, and one feature is selected from each cluster, so the number of clusters equals the number of features that will be selected. The algorithm constructs a maximum of n clusters by finding the smallest height at which a horizontal cut through the tree leaves n or fewer clusters.
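The grouping step can be sketched with SciPy's agglomerative clustering (an interpretation, not the thesis code; following the text, the pairwise mutual information itself is used as the inter-feature distance, average linkage builds the tree, and fcluster performs the maxclust cut described above).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import mutual_info_score

def fgmi_ahc_sketch(X, y, n_groups):
    n = X.shape[1]
    mi = np.array([[mutual_info_score(X[:, i], X[:, j]) for j in range(n)]
                   for i in range(n)])                 # pairwise MI as "distance"
    np.fill_diagonal(mi, 0.0)                          # zero self-distance
    Z = linkage(squareform(mi, checks=False), method='average')
    groups = fcluster(Z, t=n_groups, criterion='maxclust')  # cut into <= n clusters
    selected = []
    for g in np.unique(groups):
        members = np.where(groups == g)[0]
        rel = [mutual_info_score(y, X[:, f]) for f in members]
        selected.append(int(members[np.argmax(rel)])) # max I(f; C) per group
    return selected
```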
4.3 Experimental Evaluation - Comparison with Other Approaches
Figure 4.4: True positive rate comparison chart by different number of selected
features
Figure 4.5: False positive rate comparison chart by different number of selected
features
Figure 4.7: Total Accuracy comparison chart by different number of selected features
= 0.01. Table 4.1 shows the average comparison between DMIFS and FGMI. The first row shows C4.5 with all 41 features in the dataset. The second row represents the DMIFS algorithm proposed by Huawen, using 13 features. The last two rows describe the results of the proposed algorithm FGMI, tested by C4.5 with 13 features and 10 features respectively. The results show that the proposed algorithm improves performance on all the measures. Entries of table 4.1 highlighted in bold indicate statistically superior results in comparison to the rest.
Three further algorithms besides C4.5 were used for comparison, and table 4.2 shows
One advantage of applying feature selection to the KDD 99 dataset is the saving in computation time: more features mean more computation time. Figure 4.9 shows the time taken to build the C4.5 model for different numbers of features.
Figure 4.9: Time taken to build model comparison chart by different number of
features
Jingping in 2014. The comparison shows that FGMI-AHC obtains better performance in nearly all measures. The paired t-test is again employed to compare the differences of C4.5, DMIFS and FGMI against FGMI-AHC; table entries highlighted in bold indicate statistically superior results in comparison to the rest.
The purpose of tables 4.3 and 4.4 is to compare FGMI-AHC with the other algorithms
using the same number of selected features. Figure 4.10 illustrates the precision of
FGMI-AHC for different numbers of features, and figures 4.11 and 4.12 show the
corresponding F-measure and total accuracy comparisons respectively.
From figures 4.10 to 4.12, we can see that better performance is achieved when
FGMI-AHC selects 12 features. Table 4.5 gives a detailed comparison of FGMI-AHC
for different numbers of selected features: FGMI-AHC performs best with 12 selected
features, and for F-measure both the normal and anomaly classes achieve their
highest values with 12 selected features.
Suppose there are m instances and n features in the training dataset. The time
complexities of both FGMI and FGMI-AHC are O(mn²).
From the experimental results, we can see that FGMI and FGMI-AHC achieve better
performance, although most values are close to those of the other algorithms. On
one hand, the KDD 99 dataset has a large number of instances, so even small
improvements in performance correspond to a large number of instances being
classified differently. On the other hand, both algorithms select features for a given
number, rather than deducing the best number of features, so we can only observe
the performance after the feature selection process.
4.4 Summary
Section 4.2 has presented a feature grouping method based on agglomerative hier-
archical clustering. It described how to compose the groups from a hierarchical tree,
how to set the number of groups and how to select features in each group. First,
the mutual information between each pair of features is calculated and used to
construct the hierarchical tree. The proposed algorithm then creates a given number
of groups. Finally, the mutual information between a feature and the class labels is
used to select one feature from each group. Experimental results on the KDD 99
dataset indicate that the proposed approach generally outperforms the DMIFS,
MMIFS and FGMI algorithms. Furthermore, the comparison across different numbers
of features shows that 12 features give the best performance indicators.
Whilst promising, the presented work opens avenues for further investigation. For
instance, the mutual information between features and class labels could be used
to design new algorithms, and other clustering or classification algorithms could be
applied to compose the groups. Moreover, more than one feature could be selected
from a given group. In future work, the proposed algorithm will be tested on other
datasets, and more effective measures or methods than mutual information theory
will be sought.
Chapter 5

Online Streaming Feature Selection for IDS

In this chapter, an online streaming feature selection algorithm is proposed and
applied to the KDD 99 dataset. Traditional feature selection algorithms usually
require a high level of computational effort: all features must be input at the same
time before the learning process is carried out, and datasets with more features
consume more memory. An online feature selection method can instead integrate
new features as they arrive and carry out the computation incrementally. Specifically,
the goal of online streaming feature selection is to develop online classifiers that
involve only a small and fixed number of features for classification. This method
suits applications where not all features can be available in advance, or where the
feature volume is unknown or of infinite size.
Online streaming feature selection is well suited to sequential training data of
high dimensionality, such as an online intrusion detection system [32, 182]. The
major contribution of this chapter is a novel algorithm for solving real-world
problems in intrusion detection systems. This online streaming feature selection
algorithm can be applied to other datasets as well; its application to other datasets
is specifically described in chapter 7.
Xindong Wu's work is based on this framework, proposing OSFS and fast OSFS [182],
and he presented his own method to distinguish irrelevant and redundant features.
As stated in 3.1.2, he defined strongly relevant, weakly relevant, irrelevant and
redundant features. These definitions are based on the changes of an objective
function when a feature is streamed. This method performs well, but the deduction
during the feature selection process takes time; in other words, optimizing the
objective function occupies a long time.
The advantage of this algorithm is that computation time is saved and the relevance
and redundancy analyses are easy to implement. Its weakness is that users need to
understand the relevance and mutual information between the features in the
dataset and the class label, or between features; otherwise, users cannot give
reasonable values to r and mi.
The process of feature selection in this algorithm is based only on the relationships
between features and class labels; there is no objective function. The relationships
used in this algorithm are relevance and mutual information.
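The surviving description does not spell out the exact acceptance rule of algorithm 5.2.1, so the following Python sketch should be read as one plausible reading, assuming r gates relevance to the class label and mi gates redundancy between features; relevance and mutual_information are hypothetical helpers.

```python
def osfs1_step(bcf, f, r, mi, relevance, mutual_information):
    # Illustrative sketch only, not the thesis's exact rule
    if relevance(f) <= r:                      # insufficiently relevant to C
        return bcf
    if any(mutual_information(f, fj) >= mi     # redundant with a selected one
           for fj in bcf):
        return bcf
    bcf.append(f)                              # otherwise keep the feature
    return bcf
```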
Another algorithm is shown in algorithm 5.2.2. At first, the first feature is added
to BCF, and users input a desired number K of features to be selected. If the length
of BCF is smaller than K, the streaming feature is added to BCF. Otherwise, the
algorithm calculates the mean relevance between all features in BCF and the class
label, where S denotes the number of features in BCF. The redundancy analysis is
then calculated as $redundancy = \frac{1}{S} \, MI(f, C) \sum_{f_j \in S} MI(f, f_j)$;
that is, it calculates the mean mutual information between the streaming feature f
and all the other features in BCF, weighted by the mutual information between f
and the class label C.
Algorithm 5.2.2 computes, at steps 10 and 11,

$relevance = \frac{1}{S} \sum_{f_i \in S} Relevance(f_i, C)$

$redundancy = \frac{1}{S} \, MI(f, C) \sum_{f_j \in S} MI(f, f_j)$

and finally outputs BCF.
Compared to algorithm 5.2.1, this algorithm does not require thresholds to be input
before it starts. It takes relevance and redundancy as a criterion and uses it to judge
whether or not to select a streaming feature. The idea of the criterion is taken
from [148] and improved here; the improvement lies in the method of calculating
relevance and redundancy. The relevance is decided by the relevance between
features and class labels, and the redundancy is calculated by mutual information,
denoted as $redundancy = \frac{1}{S} \, MI(f, C) \sum_{f_j \in S} MI(f, f_j)$.
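As a minimal Python sketch of these two quantities, assuming discretised features so that scikit-learn's mutual_info_score can estimate MI, and treating Relevance(f_i, C) as MI with the class label (an assumption):

```python
from sklearn.metrics import mutual_info_score

def criterion(X, y, bcf, f):
    # X is the data matrix, y the labels, bcf the indices of the selected
    # features (non-empty by this point) and f the streaming feature's index
    S = len(bcf)
    # Mean relevance between the selected features and the class label
    relevance = sum(mutual_info_score(X[:, fi], y) for fi in bcf) / S
    # Mean MI between f and the selected features, weighted by MI(f, C)
    redundancy = (mutual_info_score(X[:, f], y) / S) * sum(
        mutual_info_score(X[:, f], X[:, fj]) for fj in bcf)
    return relevance, redundancy
```

The streaming feature is then accepted or rejected by comparing these two values, as in algorithm 5.2.2.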
5.3 Experimental Evaluation - Comparison with Traditional Feature Selection Algorithms
In the following subsection, C4.5 is used to classify the dataset and the performance
of OSFS is compared with the algorithms proposed in the other chapters. The
experiments were conducted using the KDD 99 dataset on a Windows machine with
an Intel(R) Core(TM) i5-4308U CPU @ 2.8 GHz and 8 GB of RAM, running Microsoft
Windows 7 Professional. We used the open source machine learning framework
Weka 3.6.0 for the performance comparison of our algorithm with the other
classification algorithms.
Table 5.1: Comparison Results Between OSFS1, DMIFS and FGMI

Algorithm            TP Rate  FP Rate  Precision  F-Measure  Class
C4.5                 0.994    0.09     0.728      0.841      Normal
                     0.91     0.006    0.999      0.952      Anomaly
C4.5 + DMIFS (13)    0.993    0.086    0.736      0.846      Normal
                     0.914    0.007    0.998      0.954      Anomaly
C4.5 + FGMI (13)     0.994    0.085    0.739      0.848      Normal
                     0.915    0.006    0.998      0.955      Anomaly
C4.5 + OSFS1 (15)    0.994    0.083    0.743      0.85       Normal
                     0.917    0.006    0.998      0.956      Anomaly
In this algorithm, the number of selected features is decided by the input thresholds
r and mi. For this experiment, we set r = 0.2 and mi = 0.05. These two values
can be decided by experience, by understanding of the dataset, or according to the
largest relevance and mutual information between features and class labels. The
number of selected features changes if the input values of r and mi are changed.
Figure 5.1: True positive rate comparison for different numbers of selected features
Algorithm OSFS2 will vary depending on the order in which features are streamed.
The results in table 5.2 use the original feature order of the dataset. Next, we
randomized the order of the streaming features and measured the performance. We
repeated this experiment 10 times, selecting 9 features each time.
Figure 5.2: False positive rate comparison for different numbers of selected features
The average performance of these 10 tests is shown in table 5.3, and the precision
and F-measure are shown in figures 5.6 and 5.7. From this test, we can see that the
streaming order affects the performance of the results; we also found that better
performance can be obtained by streaming first the features that have larger
relevance with the class labels.
Suppose there are m instances and n features in the training dataset. The time
complexities of OSFS1 and OSFS2 are O(mn) and O(mn²) respectively.
5.4 Summary
In this chapter, I introduced two online feature selection algorithms. The comparison
with the feature selection algorithms proposed earlier shows that the OSFS
algorithms achieve better performance. However, both algorithms have
disadvantages: they require a threshold or a desired number of features to be input.
They could be improved by revising parts of the algorithms; for example, a deeper
analysis of the relevance and redundancy relationships among the features in a
dataset might lead to an algorithm that removes the need for any threshold or
desired number.
Chapter 6

Unsupervised Network Intrusion Detection System

Most current network intrusion detection systems employ signature-based methods
or data mining-based methods, which rely on labelled training data [187, 49]. This
training data is typically expensive to produce. Moreover, these methods have
difficulty in detecting new types of attack [162, 21]. Using unsupervised anomaly
detection techniques, however, the system can be trained with unlabelled data and
is capable of detecting previously unseen attacks [27]. In this chapter, two
algorithms are proposed: the cascading fuzzy C means clustering and C4.5 decision
tree classification algorithm combines an unsupervised method with a supervised
method, while the cluster accumulation ensemble with hierarchical clustering
algorithm is fully unsupervised. Both are evaluated on the KDD 99 dataset.
6.1 Cascading Fuzzy C Means Clustering and C4.5 Decision Tree Classification Algorithm
The features in the KDD 99 dataset have different kinds of data structure, so we
normalize the data such that all attribute values lie between 0 and 1. Then, we use
the fuzzy C-means method to divide the training data into two clusters and obtain
two centres. As presented in 4.1.3, fuzzy C-means is a clustering method based on a
membership function. We calculate the membership of each test data instance in
each cluster, and each test instance is allocated to the cluster for which it has the
higher membership. Finally, fuzzy C-means is cascaded with C4.5 by building
decision trees using the instances in each cluster, and the C4.5 algorithm classifies
the test data as anomaly or normal instances. Cascading addresses the problem that
arises when most of the instances in a single cluster come from one class and very
few come from other classes: such clusters, dominated by a single class, show weak
association to the other classes. In the KDD 99 dataset, most of the instances are
DOS attacks, and cascading the two machine learning methods yields better results.
Each part of the process is now described in greater detail.
Table 6.1: Results obtained by four feature selection methods over the KDD 99 training dataset

Test No.  Search Method  Attribute Evaluator                 No. of Selected Features
1         BestFirst      CfsSubsetEval                       8
2         Ranker         ConsistencySubsetEval               7
3         FCBFSearch     SymmetricalUncertAttributeSetEval   6
4         Randomsearch   AttributeSubsetEvaluator            7
6.1.2 Normalization

The 8 features we used are of two types: protocol_type, service and flag are symbolic,
and the other five features are continuous. The protocol_type feature has 3 values,
service has 66 values, and flag has 11 values. Table 6.2 shows the minimum and
maximum values, the mean, the standard deviation and the number of distinct
examples for the five continuous features.
Normalization converts all the data in the dataset to values between 0 and 1. For a
particular continuous value $x_i$, normalization follows equation 6.1,

$x_i' = \frac{x_i - X_{min}}{X_{max} - X_{min}}$    (6.1)

where $X_{min}$ is the minimum value of variable X and $X_{max}$ is the maximum
value of variable X. For a symbolic feature, we assigned a discrete integer to each
value and then used equation 6.1 to normalize it.
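A minimal sketch of equation 6.1 applied column-wise; symbolic columns are assumed to have been replaced by discrete integer codes beforehand.

```python
import numpy as np

def min_max_normalise(X):
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
    return (X - x_min) / span
```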
First, the algorithm initialises U, the membership matrix of D_train, by setting U to
U^(0), where u_ij is the degree of membership of x_i in cluster j, x_i is the ith
instance and c_j is the centre of the jth cluster. Then, it iterates the fuzzy C-means
updates: the centres are computed by equation 6.2 and the memberships are
updated by equation 6.3 until convergence.
Algorithm 6.1.1: Cascading Fuzzy C Means clustering and C4.5 decision tree classification algorithm
Input: D_train, D_test (z_i ∈ D_test)
Output: Classified test instance z_i as normal or anomaly
1  Initialise the membership matrix of D_train, U = [u_ij] ← U^(0)
2  while ||U^(k+1) − U^(k)|| ≥ ε do
3      Calculate the centres C_1, C_2;
4      Update U^(k), U^(k+1);
5  for z_i ∈ D_test do
6      Compute u_ij, j = 1, 2;
7      Find the higher membership for z_i;
8      Assign z_i to the higher-membership cluster;
9  Classify each cluster by C4.5;
10 Return C_1, C_2 and the clusters.
$C_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} \, x_i}{\sum_{i=1}^{n} u_{ij}^{m}}$    (6.2)

$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\|x_i - c_j\|}{\|x_i - c_k\|} \right)^{\frac{2}{m-1}}}$    (6.3)
This algorithm has the limitation that users need to predefine the number of clusters.
In this case, all attacks are treated as abnormal data, so there are two clusters,
normal and anomaly. Clusters can reveal the internal structure of the data and help
to find unknown attacks, but in a multi-class scenario, if users do not know the
number of attack types in the test dataset, the algorithm loses effectiveness.
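For concreteness, the following is a compact Python sketch of the two-cluster cascade under stated assumptions: equations 6.2 and 6.3 give the fuzzy C-means updates, m = 2 and two clusters are used as in this chapter, and a scikit-learn decision tree stands in for C4.5; all names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5

def fcm(X, c=2, m=2.0, eps=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))            # U^(0)
    for _ in range(max_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]    # equation 6.2
        dist = np.linalg.norm(X[:, None, :] - centres[None], axis=2) + 1e-12
        U_new = 1.0 / ((dist[:, :, None] / dist[:, None, :])
                       ** (2.0 / (m - 1.0))).sum(axis=2)  # equation 6.3
        if np.linalg.norm(U_new - U) < eps:
            return centres, U_new
        U = U_new
    return centres, U

def cascade(X_train, y_train, X_test, m=2.0):
    centres, _ = fcm(X_train, m=m)

    def memberships(X):                                   # equation 6.3 again
        dist = np.linalg.norm(X[:, None, :] - centres[None], axis=2) + 1e-12
        return 1.0 / ((dist[:, :, None] / dist[:, None, :])
                      ** (2.0 / (m - 1.0))).sum(axis=2)

    tr = memberships(X_train).argmax(axis=1)              # higher membership
    te = memberships(X_test).argmax(axis=1)
    preds = np.empty(len(X_test), dtype=y_train.dtype)
    for j in (0, 1):                                      # one tree per cluster
        tree = DecisionTreeClassifier().fit(X_train[tr == j], y_train[tr == j])
        if np.any(te == j):
            preds[te == j] = tree.predict(X_test[te == j])
    return preds
```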
6.2 New Evidence Accumulation Clustering with Hierarchical Clustering Algorithm
Suppose each ensemble member m (m = 1, ..., M) produces K_m clusters. The clusters
generated by all ensemble members together form a set of n base clusters for the
ensemble, where $n = \sum_{m=1}^{M} K_m$. There are two procedures in a cluster
ensemble. First, the ensemble members are generated. Second, a consensus function
is applied to those ensemble members to generate the final clustering result.
From the algorithm description, we can see that it deals with the features of the
dataset one by one; in other words, it clusters on one feature at a time rather than
clustering on all features at the same time. First, the algorithm records the instance
number and feature number as n and d and sets co_assoc, the co-association matrix,
to a null n × n matrix. Then, it takes a feature from the dataset, runs the clustering
algorithm on this feature and produces a partition. It then updates the co-association
matrix: for each pattern pair (i, j) in the same cluster in E_i, it sets
co_assoc(i, j) = co_assoc(i, j) + 1/d, where d is the number of clusterings. Finally,
we use a hierarchical clustering algorithm with single linkage on the matrix to
obtain the final partition.
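The following Python sketch illustrates this per-feature evidence accumulation; the choice of k-means as the base clusterer on each single feature is an assumption (the text only says "the clustering algorithm"), and the method argument selects the linkage used on the co-association matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def neac(X, k=2, k_base=2, method='single'):
    n, d = X.shape
    co_assoc = np.zeros((n, n))
    for f in range(d):                        # cluster one feature at a time
        labels = KMeans(n_clusters=k_base, n_init=10).fit_predict(X[:, [f]])
        same = (labels[:, None] == labels[None, :]).astype(float)
        co_assoc += same / d                  # co_assoc(i, j) += 1/d
    dist = 1.0 - co_assoc                     # co-association -> distance
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method=method)
    return fcluster(tree, t=k, criterion='maxclust')
```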
Clustering on a single feature requires less computation than clustering on all
features in the dataset, so computation time is saved compared to similar algorithms
that use all features to obtain the ensembles. After d iterations, we obtain the
co-association matrix and then apply hierarchical clustering to it, where SL denotes
the single linkage used to measure the distance between clusters in the hierarchical
clustering algorithm.
The experimental results show that the proposed algorithm achieves better
performance, especially on the measures of precision, accuracy and F-measure.
Suppose there are m instances and n features in the training dataset. The time
complexity of the proposed algorithm is O(2mnt), where t is the number of iterations.
6.3 Experimental Results
The proposed algorithm achieves the same performance as Ana's algorithm, and
better performance than the simple K-means algorithm. Ana's algorithm runs N
clusterings; each time, it applies K-means to the whole dataset to produce a
partition and then constructs the co-association matrix. Rather than running
K-means N times, the number of clusterings in our proposed algorithm depends
on the number of features of the dataset. The experiment was executed 10 times,
and the differences between Ana's algorithm and the proposed algorithm were
statistically evaluated using a paired t-test with two-tailed p = 0.01; the two
algorithms achieve statistically the same result. Suppose there are m instances and
n features in the training dataset. The time complexity of the proposed algorithm
is O(m²n).
Table 6.4: Comparison Results Between the Proposed Algorithm and Other Algorithms

Algorithm            Total accuracy
K-Means              72.85%
Ana's algorithm      79.77%
Proposed algorithm   79.77%
From table 6.4 we can see that unsupervised methods like K-means cannot achieve
performance as high as supervised methods on the KDD 99 dataset. Although
evidence accumulation ensemble methods help to improve the performance, they
take more computation time, since they need to run N clusterings using all features.
In contrast, our proposed algorithm generates clusters from one feature at a time.
Although the result is the same as Ana's algorithm, the proposed algorithm greatly
reduces the computational complexity. More datasets are tested and compared with
Ana's algorithm in 7.3.
6.4 Summary
This chapter proposed two unsupervised algorithms. The first is based on the
combination of feature selection, fuzzy C means and the C4.5 algorithm, and
improves the performance of the classifier while using a reduced set of features. It
has been applied to the KDD Cup 99 dataset in the intrusion detection field. We
applied a normalization method to the KDD 99 training and test datasets before
applying the proposed scheme. The method improves the performance results
obtained by C4.5 while using only 19.5% of the total number of features.
Performance is assessed against six measures, and the method gives impressive
detection precision, accuracy and F-measure in the experimental results. An
additional advantage is the reduction in memory and time costs for the C4.5
classifier.
Chapter 7

Application to Other Datasets

In this chapter, we test the described algorithms on other datasets. Since the
algorithms described in chapters 3 to 6 are either classification or clustering
algorithms, they can all deal with datasets for classification or clustering. We test
three of the proposed algorithms in this chapter and compare their performance
with other algorithms.
Table 7.1 lists the datasets together with their instance number, feature number and
class number; detailed information on the datasets can be obtained from the UCI
website. From table 7.1, we can see that the datasets come from different areas,
with instance numbers from 187 to 11055 and feature dimensions from 9 to 279,
and include both two-class and multi-class labels. This diversity helps to verify the
performance of the algorithms under different conditions.
Since these datasets come from different areas, some of them have missing values
and some need their format changed, so the datasets must be preprocessed before
testing the algorithms. We fill missing values with the mean value, and we convert
discrete strings to discrete integers, since this is a requirement of our proposed
algorithms.
The highest value, shown in bold, represents the best performance of the three
algorithms on each dataset, and the values in parentheses are the numbers of
selected features. We selected the same number of features for each feature selection
algorithm to allow comparison. From tables 7.2 to 7.4, we can see that FGMI-AHC
gives the best performance on most datasets. The performance using all features of
each dataset is shown in each table as well. Figure 7.1 shows the classification
accuracy comparison across the different learning algorithms.
7.2 Test by OSFS
In this test, we use the same datasets as section 7.1 and compare the performance
with the mRMR algorithm used in chapter 5. Figures 7.2, 7.3, 7.4 and 7.5 show the
classification accuracy comparison between OSFS2 and mRMR using decision tree
(C4.5), 1-Nearest Neighbor (1-NN), Support Vector Machine (SVM) and Naive Bayes
classifiers respectively.
From the comparison we can see that OSFS2 achieves better performance than the
mRMR algorithm. Table 7.6 gives the comparison results between OSFS2 and
mRMR; the last row of the table shows the number of selected features for each
dataset.
Figure 7.5: Classification accuracy between OSFS2 and mRMR by Naive Bayes
Table 7.6: Comparison Results Between OSFS2 and mRMR by different learning algorithms

Classifier    FS Algorithm   1       2       3       4       5       6       7       8       9       10      11      12      13      14
C4.5          OSFS2          66.36%  94.94%  94.25%  92.61%  91.45%  86.95%  84.17%  86.17%  67.50%  73.75%  95.10%  80.74%  89.32%  66.37%
C4.5          mRMR           66.36%  93.72%  94.60%  91.98%  91.45%  86.36%  82.27%  85.11%  67.50%  73.75%  94.54%  78.70%  89.30%  65.49%
1-NN          OSFS2          66.36%  96.48%  92.84%  89.76%  90.88%  88.76%  83.89%  75.96%  57.50%  68.75%  94.73%  75.56%  96.42%  63.05%
1-NN          mRMR           66.36%  96.40%  92.81%  89.72%  90.88%  87.98%  83.41%  71.49%  55.00%  71.25%  94.35%  75.09%  96.42%  62.17%
SVM           OSFS2          57.94%  91.27%  91.98%  86.63%  89.74%  85.14%  84.17%  85.11%  67.50%  71.25%  95.48%  78.89%  94.61%  67.04%
SVM           mRMR           57.94%  91.22%  91.99%  86.33%  88.60%  85.10%  83.13%  85.11%  67.50%  72.50%  95.35%  78.70%  94.61%  67.04%
Naive Bayes   OSFS2          49.53%  83.49%  90.76%  87.89%  90.60%  78.44%  74.69%  84.68%  70.00%  71.25%  89.96%  80.19%  89.25%  67.48%
Naive Bayes   mRMR           49.53%  83.28%  90.76%  72.53%  91.74%  78.11%  74.22%  83.62%  78.75%  70.00%  91.90%  79.17%  89.25%  67.26%
Number of selected features  5       8       10      15      10      12      13      7       8       11      27      26      18      28
7.3 Test by NEAC
In this experiment, we test 8 of the datasets from section 7.1; they are shown in
table 7.7. All of the datasets in table 7.7 have two classes, because NEAC has the
limitation that it can only solve two-class problems.
In this test, we use the single linkage method over the similarity matrix to produce
the dendrogram in both algorithms. We set the number of clusters in the K-means
algorithm to 60, and in Ana's algorithm we run 60 clusterings to obtain the
co-association matrix for every dataset, since the feature numbers of most datasets
in table 7.7 are less than 60. From the accuracy comparison in table 7.7, we can see
that the values are very close across the eight datasets: three are equal, and NEAC
wins four and loses one.
To investigate the impact of the parameters and settings in NEAC, we varied the
between-cluster distance used in the hierarchical clustering, which was single
linkage in the tests above. We compared two further distances, complete and
average; single, complete and average are the shortest, furthest and unweighted
average distances between clusters respectively. The accuracy comparison of NEAC
under the different parameters is shown in figure 7.9. The last column shows
different k values in the NEAC algorithm; we varied k from 2 to 60, and the results
in figure 7.9 show that the value of k has little impact on the result, whereas the
between-cluster distance used in the hierarchical clustering affects the results
considerably.
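Reusing the neac sketch from chapter 6, such a comparison could be scripted as below; X and y are assumed to be a preprocessed two-class dataset with integer-coded labels, and cluster_accuracy is a simple majority-mapping helper introduced here for illustration.

```python
import numpy as np

def cluster_accuracy(labels, y):
    # Map each cluster to its majority class and count matching instances
    return sum(np.bincount(y[labels == c]).max()
               for c in np.unique(labels)) / len(y)

for method in ('single', 'complete', 'average'):
    print(method, cluster_accuracy(neac(X, k=2, method=method), y))
```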
Chapter 8
Conclusions
In this chapter, a high-level summary of the research detailed in the preceding
chapters is presented. The thesis has demonstrated several feature selection
algorithms on the KDD 99 dataset, after reviewing the relevant approaches in the
literature, and compared the results with either the original approaches or related
techniques in the literature. The most promising of the proposed algorithms has
also been tested on other datasets. This chapter also presents a number of initial
thoughts about directions for future research.
A survey of feature selection and intrusion detection systems was given in chapter 2,
which presented a brief history and classification of IDS and then stated some
approaches to, and challenges of, intrusion detection. Feature selection is central to
this thesis: chapter 2 described the feature selection process and four evaluation
measures, and also presented the four feature selection approaches that are
compared with my proposed algorithms in the other chapters.
Chapter 5 presented two online streaming feature selection algorithms, OSFS1 and
OSFS2, based on the analysis of feature relevance and redundancy. OSFS1 uses
thresholds to decide whether a feature is redundant or relevant with respect to the
labels and the other features, which limits its application, since users need to
understand the candidate dataset before using it. OSFS2 utilizes a criterion computed
from the relevance and redundancy values, and so has more universal applicability.
Of all the algorithms proposed, OSFS2 achieves the best performance on the KDD 99
dataset, and the cascading Fuzzy C Means Clustering and C4.5 Decision Tree
Classification Algorithm the worst. The results of MMIFS, FGMI, FGMI-AHC, OSFS
and the cascading Fuzzy C Means Clustering and C4.5 Decision Tree Classification
Algorithm reach the expected targets, but NEAC does not; the reason is that the
consensus function used in NEAC is the k-means clustering method, so the
performance is not significantly better than Ana's algorithm.

8.2 Future Works
I think future directions in this area should include the following aspects. Firstly,
feature selection is a kind of dimensionality reduction, and dimensionality reduction
applies not only to features but also to instances; like features, instances could be
selected, and this might be useful in the IDS area. Secondly, traditional feature
selection algorithms are not applicable to big data or to variable-dimension datasets
whose feature and instance numbers are not fixed; OSFS is one method that can
address this problem, and more such methods are needed. Finally, the relationships
between features are very important in this area, as they determine feature selection
strategies; designing more effective measurements of the relationships between
features is therefore a vital direction as well.
8.2.1.1 MMIFS

Most of the algorithms proposed in this thesis use mutual information to measure
the relationship between features, and the MMIFS algorithm is the most typical.
Other metrics or methods could be used in place of mutual information, and the
results compared across the different measurements. Any method that can measure
the relationship between features, such as correlation, could be used, and some of
these measurements could be used to produce a similarity matrix and create groups
for the feature grouping algorithms.
8.2.1.2 FGMI

The selection strategy used for feature grouping in this algorithm is to select one
feature from each group. As discussed in chapter 4, different selection strategies
could be designed for feature grouping; figure 8.1 outlines an improved selection
strategy. In the improved scheme, different numbers of features can be selected
from each group according to some selection strategy: for example, 2 or more
features could be selected from a group, and the number of selected features could
differ from group to group. The selection should help classification rather than
being random. As stated above, different measurements could also be used to create
the groups, and these could be combined with the selection strategies.
8.2.1.3 OSFS
Both the OSFS and NEAC algorithms deal with the dataset feature by feature: OSFS
streams one feature at a time and NEAC clusters one feature at a time. This
characteristic could be exploited in the big data area, which has recently attracted
much attention, but many problems still need to be solved, such as data storage,
and the storage and processing of intermediate results.
All the algorithms and methods proposed in this thesis can only be used on static
datasets for IDS; none of them can be applied to a real-time intrusion detection
system. A real-time IDS requires timely responses and usually deals with data
instance by instance. Machine learning methods can be used in real-time IDS, but
they need further improvement and practical testing.
Appendix A

List of Publications

A number of publications have been generated from the research carried out within
the PhD project. The publications closely relevant to the thesis are listed below.
1. Jingping Song, Zhiliang Zhu, Peter Scully and Chris Price, "Selecting Features
for Anomaly Intrusion Detection: A Novel Method using Fuzzy C Means and
Decision Tree Classification", Cyberspace Safety and Security. Springer Interna-
tional Publishing, 2013: 299-307.
2. Jingping Song, Zhiliang Zhu, Peter Scully and Chris Price, "Modified Mutual
Information-based Feature Selection for Intrusion Detection Systems in Decision
Tree Learning", Journal of Computers, 2014, 9(7): 1542-1546.
3. Jingping Song, Zhiliang Zhu and Chris Price, "Feature Grouping for Intrusion
Detection System Based on Hierarchical Clustering", Availability, Reliability,
and Security in Information Systems, Springer International Publishing, 2014:
270-280.
4. Jingping Song, Zhiliang Zhu and Chris Price, "Feature Grouping for Intrusion
Detection Based on Mutual Information," Journal of Communications, vol. 9,
no. 12, pp. 987-993, 2014.
Appendix B
List of Acronyms
DR Detection Rate
FN False Negative
FP False Positive
FS Feature Selection
IG Information Gain
P Precision
R Recall
TA Total Accuracy
TN True Negative
TP True Positive
Bibliography

[6] R. Anderson and A. Khattak, “The use of information retrieval techniques for
intrusion detection,” in Proceedings of RAID, 1998.
[10] S. Axelsson, “The base-rate fallacy and the difficulty of intrusion detection,”
ACM Transactions on Information and System Security (TISSEC), vol. 3, no. 3,
pp. 186–205, 2000.
[11] Y. Bai and H. Kobayashi, “Intrusion detection systems: technology and devel-
opment,” in Advanced Information Networking and Applications, 2003. AINA
2003. 17th International Conference on, 2003, pp. 710–715.
[14] D. S. Bauer and M. E. Koblentz, “Nidx- an expert system for real-time network
intrusion detection,” Proceedings of the Computer Networking Symposium, pp.
98 – 106, 1988.
[24] C. M. Chen, Y. L. Chen, and H. C. Lin, “An efficient network intrusion detection,”
Computer Communications, vol. 33, no. 4, pp. 477–484, 2010.
[26] W. Chen, “Design of a worm isolation and unknown worm monitoring system
based on honeypot,” 2014.
[28] J. Cho, C. Lee, S. Cho, J. H. Song, J. Lim, and J. Moon, “A statistical model
for network data analysis: Kdd cup 99 data evaluation and its comparing
with mit lincoln laboratory network data,” Simulation Modelling Practice and
Theory, vol. 18, no. 4, pp. 431–435, 2010.
[29] T. W. S. Chow and D. Huang, “Estimating optimal feature subsets using effi-
cient estimation of high-dimensional mutual information,” IEEE Transactions
on Neural Networks, vol. 16, no. 1, pp. 213–224, 2005.
[31] R. L. Cilibrasi and P. M. B. Vitanyi, “A fast quartet tree heuristic for hierarchical
clustering,” Pattern Recognition, vol. 44, no. 3, pp. 662–677, 2014.
[37] J. J. Davis and A. J. Clark, “Data preprocessing for anomaly based network
intrusion detection: A review,” Computers & Security, vol. 30, pp. 353–375,
2011.
[39] D. E. Denning and P. G. Neumann, “Requirements and model for ides a real-
time intrusion-detection expert system,” in Document A005, SRI International,
333 Ravenswood Avenue, Menlo Park, CA 94025, 1985.
[44] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for
discovering clusters in large spatial databases with noise.” in KDD, vol. 96,
1996, pp. 226–231.
[51] G. Giacinto, R. Perdisci, M. Del Rio, and F. Roli, “Intrusion detection in com-
puter networks by a modular ensemble of one-class classifiers,” Information
Fusion Special Issue on Applications of Ensemble Methods, vol. 9, no. 1, pp.
69–82, 2008.
[52] G. Giacinto, F. Roli, and L. Didaci, “Fusion of multiple classifiers for intrusion
detection in computer networks,” Pattern recognition letters, vol. 24, no. 12,
pp. 1795–1803, 2003.
[57] M. A. Hall and L. A. Smith, “Feature selection for machine learning: Comparing
a correlation-based filter approach to the wrapper,” in Proceedings of the Twelfth
International Florida Artificial Intelligence Research Society Conference, 1999,
pp. 235–239.
[62] Y. Hou and X. F. Zheng, “Expert system based intrusion detection system,”
International Conference on Information Management Innovation Management
& Industrial Engineering, vol. 4, pp. 404–407, 2010.
[63] H.-H. Hsu, C.-W. Hsieh, and M.-D. Lu, “Hybrid feature selection by combining
filters and wrappers,” Expert Systems with Applications, vol. 38, no. 7, pp.
8144–8150, 2011.
[64] Q. Hu, X. Che, L. Zhang, and D. Yu, “Feature evaluation and selection based
on neighborhood soft margin,” Neurocomputing, vol. 73, nos. 10–12, pp.
2114–2124, 2010.
[66] W. Hua-Liang and S. A. Billings, “Feature subset selection and ranking for data
dimensionality reduction.” IEEE Transactions on Pattern Analysis & Machine
Intelligence, vol. 29, no. 1, pp. 162–166, 2007.
[67] J. Huang, Y. Cai, and X. Xu, “A hybrid genetic algorithm for feature selection
wrapper based on mutual information,” Pattern Recognition Letters, vol. 28,
no. 13, pp. 1825–1844, 2007.
[73] R. Jensen, “Combining rough and fuzzy sets for feature selection,” Ph.D.
dissertation, University of Edinburgh, 2005.
[75] G. H. John, R. Kohavi, and K. Pfleger, “Irrelevant feature and the subset
selection problem,” in Rutgers University, 1994.
[78] M. E. Kabir and J. Hu, “A statistical framework for intrusion detection system,”
in Fuzzy Systems and Knowledge Discovery (FSKD), 2014 11th International
Conference on, 2014, pp. 941–946.
[82] G. Kim, S. Lee, and S. Kim, “A novel hybrid intrusion detection method
integrating anomaly detection with misuse detection,” Expert Systems with
Applications, vol. 41, no. 4, pp. 1690–1700, 2014.
[84] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial
intelligence, vol. 97, no. 1, pp. 273–324, 1997.
[87] I. Kononenko and I. Kononenko, “Analysis and extension of relief,” in The Euro-
pean Conference on Machine Learning and Principles and Practice of Knowledge
Discovery in Databases, 1994.
[91] C. Krügel, T. Toth, and E. Kirda, “Service specific anomaly detection for network
intrusion detection,” in Proceedings of the 2002 ACM symposium on Applied
computing. ACM, 2002, pp. 201–208.
[92] A. Kumarshrivas and A. Kumar Dewangan, “An ensemble model for classi-
fication of attacks with feature selection based on kdd99 and nsl-kdd data
set,” International Journal of Computer Applications, vol. 99, no. 15, pp. 8–13,
2014.
[94] L. Lan and S. Vucetic, “A multi-task feature selection filter for microarray
classification,” in 2009 IEEE International Conference on Bioinformatics and
Biomedicine, 2009, pp. 160–165.
[96] T. D. Lane, “Machine learning techniques for the computer security domain
of anomaly detection,” Ph.D. dissertation, Purdue University, 2000.
[99] C. Lee and G. G. Lee, “Information gain and divergence-based feature selection
for machine learning-based text categorization,” Information Processing &
Management, vol. 42, no. 1, pp. 155–165, 2006.
[101] M. C. Lee, “Using support vector machine with a hybrid feature selection
method to the stock trend prediction,” Expert Systems with Applications, vol. 36,
no. 8, pp. 10 896–10 904, 2009.
[102] W. Lee and S. J. Stolfo, “A framework for constructing features and models
for intrusion detection systems,” ACM transactions on Information and system
security (TiSSEC), vol. 3, no. 4, pp. 227–261, 2000.
[103] W. Lee, S. J. Stolfo, and K. W. Mok, “A data mining framework for building
intrusion detection models,” in Security and Privacy, 1999. Proceedings of the
1999 IEEE Symposium on. IEEE, 1999, pp. 120–132.
[105] I. Levin, “Kdd-99 classifier learning contest: Llsoft’s results overview,” SIGKDD
explorations, vol. 1, no. 2, pp. 67–75, 2000.
[107] H.-J. Liao, C.-H. R. Lin, Y.-C. Lin, and K.-Y. Tung, “Intrusion detection system: A
comprehensive review,” Journal of Network and Computer Applications, vol. 36,
no. 1, pp. 16–24, 2013.
[108] S. W. Lin, K. C. Ying, S. C. Chen, and Z. J. Lee, “Particle swarm optimization for
parameter determination and feature selection of support vector machines.”
Expert Systems with Applications, vol. 35, no. 4, pp. 1817–1824, 2008.
[109] S.-W. Lin, K.-C. Ying, C.-Y. Lee, and Z.-J. Lee, “An intelligent algorithm with
feature selection and decision rules applied to anomaly intrusion detection,”
Applied Soft Computing, vol. 12, no. 10, pp. 3285–3290, 2012.
[110] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das, “The 1999 darpa
off-line intrusion detection evaluation,” Computer networks, vol. 34, no. 4, pp.
579–595, 2000.
[113] H. Liu and H. Motoda, Feature selection for knowledge discovery and data
mining. Springer Science & Business Media, 2012, vol. 454.
[114] H. Liu and L. Yu, “Toward integrating feature selection algorithms for classifi-
cation and clustering,” Knowledge and Data Engineering, IEEE Transactions on,
vol. 17, no. 4, pp. 491–502, 2005.
[115] H. Liu, J. Sun, L. Liu, and H. Zhang, “Feature selection with dynamic mutual
information,” Pattern Recognition, vol. 42, no. 7, pp. 1330–1339, 2009.
[116] X. Liu, B. Lang, Y. Xu, and B. Cheng, “Feature grouping and local soft match for
mobile visual search,” Pattern Recognition Letters, vol. 33, no. 3, pp. 239–246,
2012.
[117] Z. Y. Liu and J. Wang, “A research into an intrusion detection system based on
immune principle and multi-agent in wsn,” Advanced Materials Research, vol.
433-440, pp. 5157–5161, 2012.
[121] S. Maldonado and R. Weber, “A wrapper method for feature selection using
support vector machines,” Information Sciences, vol. 179, no. 13, pp. 2208–
2217, 2009.
[122] ——, “Embedded feature selection for support vector machines: State-of-the-
art and future challenges,” Lecture Notes in Computer Science, pp. 304–311,
2011.
[124] J. McHugh, “Testing intrusion detection systems: a critique of the 1998 and
1999 darpa intrusion detection system evaluations as performed by lincoln
laboratory,” ACM transactions on Information and system Security, vol. 3, no. 4,
pp. 262–294, 2000.
[125] J. McHugh, A. Christie, and J. Allen, “Defending yourself: The role of intrusion
detection systems,” IEEE software, no. 5, pp. 42–51, 2000.
[126] G. Meera and S. K. Srivatsa, “Detecting and preventing attacks using network
intrusion detection systems,” International Journal of Computer Science and
Security, vol. 2, no. 1, pp. 50–60, 2008.
[129] M. Monirul Kabir, M. Monirul Islam, and K. Murase, “A new wrapper feature
selection approach using neural network,” Neurocomputing, vol. 73, pp. 3273–
3283, 2010.
[130] A. J. Mooij, “Constructing and reasoning about security protocols using invari-
ants,” Electronic Notes in Theoretical Computer Science, vol. 201, pp. 99–126,
2008.
[132] S. Mukkamala and A. H. Sung, “Feature ranking and selection for intrusion
detection systems using support vector machines,” in Proceedings of the Second
Digital Forensic Research Workshop. Citeseer, 2002, pp. 503–509.
[148] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information
criteria of max-dependency, max-relevance, and min-redundancy,” Pattern
Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 8, pp.
1226–1238, 2005.
[150] L. Portnoy, E. Eskin, and S. Stolfo, “Intrusion detection with unlabeled data
using clustering,” in In Proceedings of ACM CSS Workshop on Data Mining
Applied to Security (DMSA-2001). Citeseer, 2001.
[151] Z. Ran, Q. Yao, X. Wang, and Y. Zou, “Application of pattern matching algo-
rithm in intrusion detection technique,” Modern Electronics Technique, 2009.
[156] D. Schneider, “The state of network security,” Network Security, vol. 2012,
no. 2, pp. 14–20, 2012.
[157] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, “Data mining methods for
detection of new malicious executables,” in Security and Privacy, 2001. S&P
2001. Proceedings. 2001 IEEE Symposium on. IEEE, 2001, pp. 38–49.
[158] K. Shafi, “Online and adaptive signature learning for intrusion detection,”
http://www.vdm-verlag.de, 2009.
[159] S. Shan and V. Karthik, “An approach for automatic selection of relevance
features in intrusion detection systems,” 2011, pp. 215–219.
[168] D. Tian, J. Keane, and X. J. Zeng, “Evaluating the effect of rough set feature
selection on the performance of decision trees,” in Granular Computing, 2006
IEEE International Conference on, 2006, pp. 57–62.
[179] L. Wenke and S. Salvatore, “Data mining approaches for intrusion detection,”
Proceedings of Usenix Security Symposium, vol. 16, no. 4, pp. 18–20, 1998.
[181] D. Wilson and D. Kaur, “Knowledge extraction from kdd’99 intrusion data
using grammatical evolution,” Wseas Transactions on Information Science &
Applications, vol. 4, 2007.
[182] X. Wu, K. Yu, H. Wang, and W. Ding, “Online streaming feature selection,” in
Proceedings of the 27th international conference on machine learning (ICML-10),
2010, pp. 1159–1166.
[183] Y. L. Wu, C. Y. Tang, M. K. Hor, and P. F. Wu, “Feature selection using genetic
algorithm and cluster validation,” Expert Systems with Applications, vol. 38,
no. 3, pp. 2727–2732, 2011.
[185] T. Xia, G. Qu, S. Hariri, and M. S. Yousif, “An efficient network intrusion
detection method based on information theory and genetic algorithm,” in
Performance, Computing, and Communications Conference, 2005. IPCCC 2005.
24th IEEE International, 2005, pp. 11–17.
[191] D. Zhang, S. Chen, and Z. H. Zhou, “Constraint score: A new filter method
for feature selection with pairwise constraints,” Pattern Recognition, vol. 41,
no. 5, pp. 1440–1451, 2008.
[193] Z. Zhao and H. Liu, “Spectral feature selection for supervised and unsuper-
vised learning,” in Proceedings of the 24th international conference on Machine
learning, 2007, pp. 1151–1157.
[195] Z. Zhu, Y.-S. Ong, and M. Dash, “Wrapper–filter feature selection algorithm us-
ing a memetic framework,” Systems, Man, and Cybernetics, Part B: Cybernetics,
IEEE Transactions on, vol. 37, no. 1, pp. 70–76, 2007.