Ids Tool: Project Report
On
IDS TOOL
(Intrusion Detection and Analysis Tool)
Submitted By:
JAIPUR
Department of Computer Science and Engineering
JLN Marg, Jaipur, Rajasthan 302017, Tel: 0141 2713376, Fax: 25209154
CERTIFICATE
This is to certify that the project entitled “Intrusion Detection Tool” has
been completed and submitted by Mr. Akshay Nagar (2014UCP1524),
Mr. Saurabh Kumar (2014UCP1660), Mr. Sandeep Jatolia (2014UCP1650)
and Mr. Manish Kumar (2013UCP1483) under my guidance during the
academic session July 2017 to May 2018 in partial fulfilment of the
requirement for the award of the degree of Bachelor of Technology
(B.Tech) in Computer Science and Engineering, Malaviya National Institute
of Technology, Jaipur.
(Assistant Professor)
DECLARATION
ABSTRACT
ACKNOWLEDGEMENT
We would also like to thank our Head of Department, Dr. Girdhari Singh,
and our project coordinator, Dr. Yogesh Kumar Meena.
Last but not the least, we would like to thank the Department of Computer
Science and Engineering, Malaviya National Institute of Technology,
Jaipur, and all those people who are responsible for the successful
realization of this report.
CONTENTS
Declaration i
Abstract ii
Acknowledgement iii
1. Introduction
2. Related Work ............................................................ 9
3. Background ............................................................. 15
   3.1 Types of Attacks ................................................... 15
       3.1.1 DoS .......................................................... 15
       3.1.2 U2R .......................................................... 15
       3.1.3 R2L .......................................................... 15
       3.1.4 Probe ........................................................ 15
   3.3.2 J4.8 ............................................................. 17
   3.3.6 IBk .............................................................. 19
   3.4.2 Weka ............................................................. 20
4. Methodology ............................................................ 21
   4.1 Dataset ............................................................ 21
   4.2 Workflow ........................................................... 24
6. Result Analysis ........................................................ 36
8. References ............................................................. 43
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
Introduction
With the rapid advancement in network based services and information
transfer, it is highly imperative to provide security. When a system is
connected to a network, it is exposed not only to direct attacks but also to
crimes such as social engineering, in which people impersonate someone else to
steal credentials. Organizations generally have a certain amount of critical
data and applications that need to be secured to protect their confidentiality,
integrity and availability. Intruders can be illegitimate users trying to gain
access to a network, running harmful scripts or destroying important files, or
legitimate users trying to gain super-user privileges to perform malicious
activities. Intrusions can also be an attacker scanning open ports and IP
addresses for vulnerabilities, or denying a legitimate user access to a
particular service. Organizations employ firewalls to block traffic whose
patterns or signatures are known to be harmful, such as viruses, Trojans,
worms and other malicious software. Antivirus software and signature databases
need to be updated frequently so that new malware does not go undetected.
Depending upon the operating environment, the network infrastructure and the
resources available within an organization, it is necessary to evaluate the
deployment strategies for IDS sensor software. An IDS requires adequate and
appropriate human interaction at all times: the alarms generated by an IDS
must be monitored by a security administrator who takes appropriate action.
An IDS often works along with an IPS (Intrusion Prevention System), which
judges the severity of an alarm and ideally responds to the intrusion
(Intrusion Response System), taking into consideration the risk assessment
model developed by the organization. The security policies, plans and
procedures must be known to all personnel handling IDS output so that they
can take appropriate action. Based on the deployment of sensors, IDS can be
classified into Network based IDS and Host based IDS.
Deployment of Network Based IDS
IDS sensors monitor the network traffic to find malicious activities. Ingress
as well as egress network traffic is monitored. Sensors can be deployed at
various points along the network infrastructure: at the main switch, before
the router connecting two LANs, or in the DMZ (Demilitarized Zone).
A Host Based IDS typically monitors the activities of a single host. It acts
in conjunction with a firewall or anti-virus software to detect attacks. It
can be used to capture the ingress traffic of a standalone application server
as well as to examine the behaviour of a host, and it can enhance the level of
security of a system. However, it is extremely tedious, computationally
expensive and costly for organizations to deploy an IDS sensor for each host.
As a result, these are used only for select hosts and honeypot systems as a
tool to inspect the traffic.
Signature Detection is most commonly deployed along with firewalls and other
thwarting mechanisms to enhance their performance. Signature detectors were
initially rule-based expert systems; however, with the gradual increase in the
number of attacks, these detectors became unsuitable for real-time detection.
As the rule set grows larger, the comparison speed of these expert systems
decreases. Fuzzy rule based systems and other Artificial Intelligence
techniques have improved pattern matching by developing new rule sets based
on existing rules to detect newer attacks.
SNORT and P-BEST are the most commonly used misuse detectors. Figure 2 shows
the components of SNORT.
SNORT is a rule based Network Intrusion Detection System that uses fast
pattern matching against rules created with the help of human interaction and
stored in databases for comparison by the Snort detection engine. The
pre-processor component performs the initial packet decoding and separation of
the packet payload. The detection engine compares the packets with the
existing rule set and generates alert output into a log file or on the
console. SNORT can also be run in inline mode, where it can prevent attacks
(IPS). The main disadvantage of misuse detection is the constant need to
update the rule database.
In stark contrast to this approach is the Anomaly Detection (AD) method,
which assesses incoming packets for any abnormality. An anomaly is a deviation
from normal behaviour: each packet is compared with a predefined profile of
normal behaviour. For example, if the number of failed login attempts exceeds
a specified threshold, it can be termed an anomaly. The main advantage of this
form of detection is that it can detect previously unknown attacks.
The main problem in anomaly based detection is defining the normal behaviour
of the system. For example, consider an employee of an organization who works
five days a week from 9:00 am to 5:00 pm and logs into his system at 9:00 am
using a password known only to him. Here, an anomaly could be an ‘unknown’
user logging into his system at 8:00 pm, outside office hours, a user logging
in on a holiday, or a number of failed login attempts exceeding organizational
limits. Sometimes, the employee himself could have forgotten his password,
thereby producing a failed-login anomaly. However, since he is a legitimate
user, the system should grant him a means to recover his password instead of
raising an alarm of a suspicious intrusion.
Anomalies in behaviour can be identified by analysing audit data or by
conducting ‘forensics’ of the network and servers. Sometimes the data
collected by honeypot systems and historical data are used to create rules and
protocols that establish normal behaviour. Some processes monitor statistical
measures such as the rejection (error) rate, network bandwidth, CPU
utilization or the number of requests to an application server. These measures
are currently used in commercial systems because identifying deviations in
them is straightforward. Such detectors are threshold based anomaly detectors.
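Such threshold based detection can be sketched in a few lines. The following
is an illustrative sketch only; the event format and the threshold value of 5
are invented assumptions, not organizational policy:

```python
# Minimal threshold-based anomaly detector (illustrative sketch).
# The event format and threshold are assumptions, not real policy.
from collections import Counter

FAILED_LOGIN_THRESHOLD = 5  # assumed organizational limit

def detect_anomalies(events):
    """events: iterable of (user, outcome) pairs; returns users over threshold."""
    failures = Counter(user for user, outcome in events if outcome == "fail")
    return {user for user, n in failures.items() if n > FAILED_LOGIN_THRESHOLD}

events = [("alice", "fail")] * 6 + [("bob", "fail"), ("bob", "ok")]
print(detect_anomalies(events))  # {'alice'}
```

A real deployment would count events per time window rather than over the
whole log, but the thresholding principle is the same.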
Many attacks are due to improper exception handling by the protocols. Hence,
scrutinizing exceptions in the network protocols helps in detecting unknown
attacks.
Another important type of anomaly is the probe attack, i.e. stealth scanning
of the network for open IP addresses and ports. Usually, when an attacker
wants to carry out an illegal activity or hack into a system, he has to carry
out a reconnaissance of the network. He may find all the open network address
points through which he can attack, find out the OS running on the systems,
and run various tools to ‘probe’ the network. Nmap, TCP stealth port scanning,
the FTP Bounce attack and the SATAN tool exploit flaws in protocol design to
probe the network. While this attack type is not dangerous in itself, its
detection helps in building better network models and protocols and in
developing secure organizational rules.
Continuous efforts are made by the research community to improve the
efficiency of detection, reduce the false alarm rate, and automate the task of
alert classification and intrusion prevention.
Data Acquisition module: It is responsible for acquiring data from the network, either
from the upper layer protocols, or raw network frames or from audit data to analyse
patterns in behaviour.
Packet Decoder: Performs the initial pre-processing of the packet to obtain
initial features. It strips the headers off the payload and acquires the
features needed to perform detection. These features can relate to basic
packet information (such as the source bytes), derived traffic features (such
as the number of packets received from a particular source in one second), or
payload information (such as the number of TCP SYN packets observed).
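To make the decoder's output concrete, the following is a hedged sketch of the
kind of feature record such a component might emit; the field names are
assumptions loosely modelled on KDD-style connection features, not the tool's
actual implementation:

```python
# Illustrative feature record a packet decoder might emit (field names
# assumed, loosely following KDD-style connection features).
from dataclasses import dataclass

@dataclass
class ConnectionFeatures:
    duration: float   # length of the connection in seconds
    protocol: str     # e.g. "tcp"
    src_bytes: int    # bytes from source to destination
    syn_count: int    # TCP SYN packets seen in the last second

rec = ConnectionFeatures(duration=0.0, protocol="tcp", src_bytes=491, syn_count=1)
print(rec.protocol, rec.src_bytes)  # tcp 491
```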
1.4 Machine Learning in Anomaly Detection
Machine Learning is a vast field of Artificial Intelligence that builds models
based on existing input data to make predictions or decisions, instead of
following a standard set of instructions to achieve a certain output. While
expert systems are used in misuse detectors, anomaly based detection employs
machine learners as well as other artificial intelligence techniques to
estimate the class of an activity, i.e. whether an activity is an anomaly or
not. The main idea behind using machine learning in anomaly detection is to
develop predictive models based on a training dataset to perform the
detection. The model should generalize well enough to detect most novel
malicious activities with a reasonable error rate (low false alarm rate). One
such dataset is the DARPA dataset developed by the MIT Lincoln Laboratory,
consisting of several weeks of raw TCP dump data. They used Windows NT servers,
among other systems, and simulated various attacks to obtain raw TCP dump data.
Based on whether the classes of attacks are provided or not, learners are
classified as supervised or unsupervised. The most commonly used supervised
learning algorithms are classification and regression algorithms: the simple
Naïve Bayes classifier, Bayesian networks, Hidden Naïve Bayes, Logistic
Regression, and decision trees such as ID3, C4.5, Random Tree classifiers or
CART (Classification and Regression Tree). Similarly, backpropagation based
Neural Networks (ANN) and Support Vector Machines (SVM) classify an activity
as an anomaly or a normal one. Unsupervised learners include clustering
algorithms such as Simple K-Means and K-Nearest Neighbour (K-NN), as well as
Self-Organising Map (SOM) based Neural Networks, which find patterns in
training data to cluster it into useful classes. Depending upon the similarity
in functioning, learning algorithms can be classified into the following
categories:
Bayesian Classification: These are stochastic techniques that explicitly
employ Bayes’ theorem of conditional probability to perform classification,
e.g. Naïve Bayes, Bayesian Belief Networks and Hidden Naïve Bayes (HNB).
Bayesian networks are reliable classifiers in a multi-class setting. The
simple NB algorithm assumes mutual independence amongst the attributes;
several extensions have been proposed, ranging from techniques that calculate
the conditional probability between attributes to various complex Bayesian
networks.
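As a toy illustration of the conditional probability computation these
classifiers perform, consider estimating the probability that a record is an
attack given that it contains a failed login; all probabilities below are
invented for the example:

```python
# Toy Bayes' theorem computation (all probabilities invented).
# P(attack | failed_login) = P(failed_login | attack) * P(attack) / P(failed_login)
p_attack = 0.01            # prior probability of an attack record
p_fail_given_attack = 0.7  # likelihood of a failed login given an attack
p_fail = 0.05              # marginal probability of a failed login

posterior = p_fail_given_attack * p_attack / p_fail
print(round(posterior, 2))  # 0.14
```

A Naïve Bayes classifier repeats this computation per class, multiplying one
likelihood term per feature under the independence assumption.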
1.5 Report Organization
CHAPTER 2
Related Work
There is a large body of work related to the design and functioning of IDS.
Since 2000, a variety of machine learners, classifier ensembles, and search
and optimization techniques have been used by researchers to classify
activities as anomalies. A competition was conducted by Knowledge Discovery
and Data Mining (KDD) in 1999; its main task was to build a network intrusion
detector, a predictive model able to distinguish “bad” connections from
“normal” connections. Researchers use the datasets provided by DARPA, KDD99,
ISCX12 etc. to construct their predictive classifiers. It was necessary to
make an in-depth review of different techniques, understanding their
advantages and limitations; this was done by studying review papers [1-2, 4].
Hesham Altwaijry and Saeed Algarny in [9] developed a Bayesian filter to
improve detection rates of attacks in the KDD99 dataset. They calculated a
score for each feature based on its frequency of occurrence in “normal” and
in “attack” records, and selected the features with higher scores to test
their filter. They also tested the filter with all features selected, and
found that the results were worse in that case. Features 23, 24 and 31 were
found to be the most relevant, with the detection rate of R2L improving to
85.35% at a threshold value of 0.6. In [10], they proposed a multi-stage
Naïve Bayes filter with each stage trained to detect a different type of
attack. The multi-stage filter is shown in Figure 5.
They discovered that the threshold for each stage can be changed depending
upon the level of security required, and that the training model for each
stage can be exercised with feature selection. While their work was aimed at
reducing false positives, the misclassified instances (where the actual class
differs from the predicted class, for supervised learning) are not sent to the
next stage; instead, the normal records are sent to the next filtering stage.
Their main aim was to identify anomalies, not to classify attacks according to
type. This proposed method improved the overall detection rate to 96.85%.
In [3], the authors used multiple Bayesian classifiers, namely the Naïve
Bayes classifier, an expert-elicited Bayesian network (EEBN) and the Bayesian
Network (BN) classifier, with an optimum feature set constructed using
Correlation based Feature Selection (CFS) and the consistency subset
evaluator. As a single classifier, the NB classifier provided the best results
of the three, with an overall detection rate of 92.3%; for BN and EEBN, the
detection rates were 90.3% and 91.8% respectively. The main problem in their
work was that the detection rates of R2L and U2R were found to be very low,
with the NB classifier providing the best results: a dismal 12% for R2L and
19.7% for U2R. EEBN performed well in the U2R category with a 22.4% detection
rate.
Researchers prefer to work with Bayesian classifiers because of their
simplicity of construction and reasonable accuracy in detecting all attack
types. In [12], Hui Lu and Jinhua Xu compared an unsupervised clustering
method, Bayesian clustering, with the performance of Naïve Bayes and decision
trees. They developed a three stage hybrid detection model to segregate
attacks, shown in Figure 8. In the first stage, the dataset was separated into
three parts (DoS, probe and others) using the C4.5 decision tree algorithm;
the main rationale behind this was to separate the R2L, U2R and normal
records from the rest. In the second stage of filtering, a simple NB
classifier was used to separate U2R records from the R2L and normal records.
With this approach, the accuracy of R2L and U2R was still low, so a third
stage using a Bayesian clustering approach was used to improve the detection
rate of the R2L attack. With this three level hybrid classifier, the detection
rates of R2L and U2R reached 48.41% and 97.14% respectively on the KDD99
dataset. The training of U2R and R2L records is a problem because of the low
number of such records in the dataset. It was seen that using multiple or
hybrid classifiers in series or in parallel improved the detection rate
significantly.
In [5], the authors investigated the application of feature selection in IDS.
They employed rough set theory to find a feature subset, and then used a
Genetic Algorithm with a population clustering approach to find the optimal
subset from the remaining features. SVM was used as the classifier on the
KDD99 dataset. Population clustering was used to establish self-adaptive
crossover and mutation rates in the Genetic Algorithm, with the K-Means
algorithm and Euclidean distance used for this purpose. The feature selection
algorithm selects 11 features (features 1, 2, 4, 5, 6, 11, 22, 23, 31, 33 and
35) for all attack types. The results were compared with other algorithms, and
the proposed method was found to be more efficient: the detection rate
increased from 95.27% to 98.21% and the false positive rate reduced from
7.63% to 0.91%. They used an evolutionary technique to compute the optimal
feature subset, with rigorous mathematical formulas to perform population
clustering.
In [6] (2007), GA was used to calculate the best feature subset and SVM to
test the accuracy of the classifier. Some of the features are redundant, and
their removal in the data-preprocessing step greatly enhances the result. They
used a weight based parametric fitness function, with the weights calculated
from the frequency of occurrence in the dataset. The optimal fitness value
varied according to a parameter β. For β = 0.9, 5 features were selected,
namely features 23, 27, 31, 33 and 38, and the overall detection rate reached
97.14%. Based on trials, the optimal value of β was found to be 0.5, where 26
features were selected and the detection rate reached 98.46%. Though SVM is a
good classifier, its detection rate is significantly low for the NSL-KDD
dataset (69%) [13], and when the training set is large, it takes a long time
to train.
The C4.5 decision tree algorithm can detect U2R and R2L attacks better than
neural networks [4]. With respect to time, backpropagation based NN gave
better results since it undergoes training, validation and testing phases.
Researchers have used Linear Discriminant Analysis (LDA) [16] as a feature
selection model, as well as information gain or the gain ratio, to select
features. Fuzzy techniques are used along with Hidden Markov Models (HMM) to
efficiently classify abnormalities. In [16], GA was used to select features
2, 3, 4, 5, 6, 27, 33, 40 and 41. The modified J48 (C4.5 algorithm) uses an
intelligent agent to calculate the threshold for constructing the nodes of the
tree. This modified J48 classifier with the optimal GA based feature selection
model was better at detecting DoS and probe attacks, with shorter training and
testing times as well.
In [4], a soft computing based approach was used to create rules to detect
attacks. Information gain was used to select the 15 most important features
from the KDD99 Cup dataset. After normalization and fuzzification of the
selected features using the triangular function, initial rules were created
and stored as a decision tree. For example:

feature 36 > 0.005 | feature 23 > 0.01 : ipsweep
{normal=0, buffer_overflow=0, loadmodule=0, smurf=0, ipsweep=72, multihop=0,
ftp_write=0, nmap=0, neptune=0, warezclient=0, spy=0}

Within GA, a support-confidence based fitness function was introduced, and a
rule was selected if its value was greater than the threshold (0.5). Their
research was restricted to creating rules to detect anomalies, not to attack
classification. The detection rate of attacks was 91.88%, with a high false
positive rate of
18.06%. They also compared trends in feature selection, relating the number of
attributes selected to the accuracy of detection; the accuracy is maximum when
the number of features selected is around 20-25. This trend in feature
selection can be a topic for further research. They also compared their GA
based rule creation technique with ANN approaches, concluding that GA is a
form of unsupervised learning as compared to ANN. With neural networks we need
to assign weights to nodes and provide some learning, checking the output
regularly via the backpropagation algorithm; with GA, this is not needed, as
it searches for a globally optimal solution. The accuracy of GA, and thus of
the rule creation, depends upon the definition of the fitness function. In
[5], rules were created for detecting the warezclient and warezmaster attacks.
[16] used a GA based technique to create rules for attacks.
In [7], Self-Organising Maps (SOM) were applied to the KDD99 and NSL-KDD
(2010) datasets to provide an in-depth comparison of ANN based approaches.
With the KDD99 dataset, the proposed SOM based algorithm had an accuracy of
92.37%, whereas for the NSL-KDD dataset it was 75.49%. Panda (2010) had used
Multinomial Naïve Bayes + N2B to obtain a success rate of 38.89%. Because of
its high speed and faster convergence rate, SOM can be used in multi-stage
detection. SOMs are also more effective on static patterns than dynamic ones,
because dynamic networks have memory and can be trained to learn sequential
or time-varying patterns.
Deshmukh et al. [6] used preprocessing methods such as feature selection and
discretization and compared the efficiencies of the NB, HNB and NB Tree
classifiers. A fast correlation based feature selection technique (FCFS) was
implemented, and equal-width discretization converted the continuous values of
a feature to nominal values; discretization improves the accuracy of NB
classifiers. Using CFS, 12 features were selected. The detection rate of the
NB Tree was found to be the maximum at 94.6%, compared to 88.2% for NB and
93.4% for HNB; however, it takes a longer time to construct the model.
In [3], an ANN was used to evaluate the NSL-KDD dataset. The proposed method
uses Levenberg-Marquardt (LM) and BFGS quasi-Newton backpropagation for
updating the weights and biases. There are initially 29 neurons, corresponding
to 29 features in the set, with the hidden layer using 21 neurons; for binary
and 5-class classification, the output layer uses 2 and 5 neurons
respectively. Since there are very few U2R and R2L records in the dataset, it
was necessary to maintain class balance; to speed up training and maintain
this balance, only a subset of normal and other class records was selected.
With the proposed backpropagation approach, the overall accuracy achieved was
79.2%. The detection rates of DoS, probe, R2L and U2R were 77.77%, 76.6%,
34.6% and 10.5% respectively. While DoS and probe attacks are classified
reasonably well for the NSL-KDD dataset, R2L and U2R attack classification
remains the problem. For binary classification (normal and attack classes),
the detection accuracy was 81.2% with a false positive rate of 3.23%.
Based on this review of existing techniques, it can be seen that the false
positive rate can be reduced and the detection rate improved further by using
multiple or hybrid classifiers and by using feature selection techniques.
Feature selection is an important domain that can be applied in the field of
IDS. However, it was necessary to conduct preliminary research on different
feature selection techniques to understand their performance and determine
whether they were needed in this work. Megha Agarwal [11] compared the
performance of different selection methods for the four attack classes.
Removal of redundant and irrelevant features for a particular class is called
feature selection or dimensionality reduction; it differs from feature
extraction, where new features are derived from existing features. Feature
selection methods can be classified mainly into three types: filter methods,
wrapper methods and embedded methods.
Removal of these redundant features shortens training times, improves the
prediction accuracy of the classifier and simplifies classifier model
construction [14-16]. Each feature selection technique employs a search
algorithm to find candidate subsets and an evaluation function to give a score
to each subset. Filter methods find an approximate subset, whereas wrapper
methods use a classifier to evaluate the subset; wrapper methods give an
optimal solution but take longer to produce an output than filter methods.
Correlation based Feature Selection (CFS) was first discussed in [7]. It is
based on the notion that features with a high correlation to the class
attribute should be in the subset, while the features within the subset should
have low correlation amongst themselves. It gives each subset a score called
the merit of the subset.
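The merit score has a simple closed form: for a subset of k features with mean
feature-class correlation r_cf and mean feature-feature inter-correlation
r_ff, merit = k·r_cf / sqrt(k + k(k−1)·r_ff). A small sketch follows; the
correlation values are invented for illustration:

```python
# CFS merit of a feature subset (standard closed form); the correlation
# values used below are invented, not measured on any dataset.
import math

def cfs_merit(k, r_cf, r_ff):
    """k: subset size, r_cf: mean feature-class correlation,
    r_ff: mean feature-feature inter-correlation."""
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

print(round(cfs_merit(k=5, r_cf=0.6, r_ff=0.2), 3))  # 1.0
```

The formula rewards subsets whose features predict the class (large r_cf) and
penalises redundancy within the subset (large r_ff).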
Greedy search techniques such as hill climbing, forward selection, backward
selection, greedy stepwise search and best first search are used to find
feature subsets, and CFS is used to find the optimal subset amongst them.
Other important selection practices are the ChiSquare subset evaluator, the
filtered wrapper technique, Information Gain (IG), Gain Ratio, LDA and
Principal Component Analysis (PCA). In [11], the detection rate for the
GainRatioAttributeEval + Ranker algorithm was the maximum, at 99.61% for the
Naïve Bayes classifier, whereas the time for model construction was the least
for the InfoGainAttributeEval + Ranker algorithm.
CHAPTER 3
Background
There are various types of intrusion in which an attacker tries to imitate
the packets formed by a legitimate user and attack the loopholes in a network
device. The various types of attacks are listed below:

These (U2R) attacks are exploitations in which the hacker starts off on the
system with a normal user account and attempts to abuse vulnerabilities in the
system in order to gain super-user privileges.
3.1.4 Probe
3.2 Intrusion detection techniques
There are three main detection techniques: Misuse Detection, Anomaly
Detection and Specification-based Detection (semi-supervised learning).
3.2.1. Misuse Detection
This technique is also known as signature based detection. A packet is
detected as an attack if it matches a known sequence of instructions or
policies.
In the research by Feng et al., the authors presented Intrpopscopy, a
semantic language based signature scheme for detecting malicious packets. In
this approach, signatures are created for all attacks, and signature matching
is achieved using rules such as those in SNORT.
Semi-supervised learning is used when the amount of labelled data in the
training samples is low. To remedy the low accuracy we would otherwise get, we
first classify the dataset using supervised learning on the training samples
and then run anomaly detection for better accuracy. The main advantage of
this technique is that it can be used even when the number of training
samples is very low.
3.3 Machine Learning Algorithms
3.3.2 J48
In this algorithm we assume that all the features are independent of each
other. The classification is based on calculating the maximum probability of
the attributes belonging to a particular class.
The Naïve Bayes classifier is a probabilistic estimator based on Bayes’
theorem of conditional probability. Based on its usage, it can be a simple
Naïve Bayes classifier, Hidden Naïve Bayes, Naïve Bayes updatable or a
multinomial Naïve Bayes classifier. Henceforth the Naïve Bayes classifier
will be abbreviated as the NB classifier in this report. A simple NB
classifier assumes mutual independence amongst the features; in HNB,
correlated features are identified and each such feature is given a hidden
parent. The simple NB classifier is intuitive, easy to construct and gives
reasonably good accuracy, making it one of the most used machine learners
for IDS.
Bayes’ theorem was formulated on the basis of conditional probability [9]. It
states that given the prior probability of event A, the likelihood of event B
occurring when A has occurred, and the probability of event B, it is possible
to calculate the probability of the occurrence of event A when B has
occurred. This is given in formula (1):

P(A|B) = P(B|A) · P(A) / P(B)    (1)
The classifier examines the features of each TCP connection and predicts the
class of each instance. For a discrete attribute, it calculates the prior
probability based on the frequency of occurrence in the training set. For a
continuous attribute, it calculates the mean and standard deviation and then
uses a Gaussian distribution to find the probability.
Let r = (r1, r2, …, rn) be a record, let P(C) be the prior probability of
class C, and let P(ri|C) be the probability of feature value ri given class
C. Assuming independence amongst the features, the record is assigned to the
class C that maximises

P(C|r) ∝ P(C) · P(r1|C) · P(r2|C) · … · P(rn|C)
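A hedged sketch of Naïve Bayes scoring for a single continuous attribute
modelled with a Gaussian follows; the class priors, means and standard
deviations are invented for illustration, not learned from any dataset:

```python
# Sketch of Naive Bayes scoring with a Gaussian-modelled continuous
# attribute; all priors, means and deviations below are invented.
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def score(x, prior, mean, std):
    # P(C) * P(x | C), up to the normalising constant shared by all classes
    return prior * gaussian_pdf(x, mean, std)

# For src_bytes = 480, which class maximises P(C) * P(x|C)?
normal = score(480, prior=0.9, mean=500, std=100)
attack = score(480, prior=0.1, mean=5000, std=2000)
print("normal" if normal > attack else "attack")  # normal
```

With several features, the per-feature likelihoods would simply be multiplied
together before comparing classes.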
3.3.6 IBk
This instance based learner is a lazy algorithm: it saves all of the training
samples and compares each test sample to the members of the training set until
it finds the closest match. This algorithm is implemented in Weka as a
k-nearest neighbour classifier; the Weka implementation uses Euclidean
distance as the default distance function.
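A minimal sketch of the nearest-neighbour idea with Euclidean distance
follows; the training points and labels are invented, and this mirrors only
IBk's default behaviour (k = 1), not Weka's actual implementation:

```python
# Minimal 1-nearest-neighbour classifier with Euclidean distance
# (training points and labels are invented for illustration).
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(train, query):
    """train: list of (feature_vector, label) pairs; returns nearest label."""
    _, label = min(train, key=lambda item: euclidean(item[0], query))
    return label

train = [((0.0, 0.1), "normal"), ((0.9, 1.0), "attack")]
print(predict(train, (0.8, 0.9)))  # attack
```

For k > 1 one would take the majority label among the k closest training
points instead of the single nearest one.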
3.4.1 Metasploit
3.4.2 WEKA
CHAPTER 4
Methodology
This chapter describes the proposed methodology used in the project. The
first section deals with the dataset used to perform analysis in the project,
the next section details the methodology used to extract features from the
dataset, and the last section describes the implementation details of the
approach.
4.1 Dataset
Fig. 6: Screenshot of the KDD training dataset
The NSL-KDD dataset consists of 125973 records in the training set and 22544
records in the test set, with each record connection having 41 features (or
attributes) divided into four categories: Basic features (Features 1-9), Content
Based features (Features 10-22), Time-based Traffic features (Features 23-31) and
Host-based Traffic features (Features 32-41). The basic features are shown below:
Each connection has a class label representing its type, and there are five
classes: Normal, DoS, U2R, R2L and Probe.
The NSL-KDD (2010) dataset solved the shortcomings of its predecessor:
• It does not include redundant records, so that classifiers are not biased
towards the more frequent records.
• The detection of attacks in the test set is not biased towards certain
types of records, since duplicate records are removed from the test set.
• The numbers of records in the train and test sets are reasonable, hence
tests need not be conducted on a small subset of the dataset.
No.  Attribute Name   Attribute Description                                Sample Data
1    Duration         Length of time duration of the connection            0
2    Protocol_type    Protocol used in the connection                      tcp
3    Service          Destination network service used                     ftp_data
4    Flag             Status of the connection – Normal or Error           SF
5    Src_bytes        Number of data bytes transferred from source to      491
                      destination in a single connection
6    Dst_bytes        Number of data bytes transferred from destination    0
                      to source in a single connection
7    Land             1 if source and destination IP addresses and port    0
                      numbers are equal, else 0
8    Wrong_fragment   Total number of wrong fragments in this connection   0
9    Urgent           Number of urgent packets (packets with the urgent    0
                      bit activated) in this connection
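As a hedged sketch of working with such records programmatically, the
following parses an NSL-KDD-style CSV row into its 41 feature values and
class label; the sample row is invented, and the column order is assumed to
follow the feature description above:

```python
# Sketch: parse NSL-KDD-style CSV rows into (features, label) pairs.
# The column layout (41 features + class label) follows the dataset
# description; the sample row below is invented.
import csv
import io

SAMPLE = "0,tcp,ftp_data,SF,491,0," + "0," * 35 + "normal\n"

def load_records(fh):
    for row in csv.reader(fh):
        yield row[:-1], row[-1]  # 41 feature values, then the class label

features, label = next(load_records(io.StringIO(SAMPLE)))
print(len(features), label)  # 41 normal
```

A real run would open the dataset file instead of the in-memory sample and
would also convert the numeric columns from strings to numbers.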
4.2 Workflow
The proposed system to improve the detection rate of attacks is a two-stage
classifier with feature selection used in the second stage. A simple NB
classifier is used in both stages of classification because of its speed of
execution, ease of construction and reasonably accurate results. The second
stage classifier is run in parallel for each attack class, improving the true
positive rate. Reducing the false positive rate is an issue, and to address
this, the misclassified instances from the first stage are given as the input
to the second stage.
In the first stage, the classifier is trained with all features selected, and
the trained filter is then evaluated against the test set. The output of the
first stage consists of correctly identified instances and misclassified
instances, i.e. records whose predicted class value is not the same as the
actual class value. The classifier accuracy is calculated. In the second stage,
the training set is trained for R2L attacks and U2R attacks separately: the
Naïve Bayes model filters the records into one of two classes (R2L and others,
or U2R and others). The second-stage classifier is trained with the features
selected by the Genetic Algorithm, and the filter is evaluated against the
misclassified records obtained from stage 1.
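As a sketch of this hand-off, assuming the stage-1 actual and predicted labels are available as plain string arrays (the class and method names here are hypothetical), the misclassified records passed on to stage 2 can be collected as follows:

```java
import java.util.ArrayList;
import java.util.List;

public class StageOneOutput {
    // Collects the indices of records whose predicted class differs from the
    // actual class; these records are the input to the second-stage classifier.
    public static List<Integer> misclassified(String[] actual, String[] predicted) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < actual.length; i++) {
            if (!actual[i].equals(predicted[i])) {
                out.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] actual    = {"normal", "r2l", "u2r", "dos"};
        String[] predicted = {"normal", "normal", "u2r", "normal"};
        // Records 1 and 3 were misclassified and would be handed to stage 2.
        System.out.println(misclassified(actual, predicted)); // prints [1, 3]
    }
}
```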
The KDD99 cup dataset is a benchmark for testing many anomaly detection
algorithms. It was prepared by MIT Lincoln Labs, which acquired nine weeks of
raw TCP dump data from an environment they set up, bombarding Windows NT,
Solaris, SunOS and Linux systems with multiple attacks over that nine-week
period. The dataset is not fully representative of real-life attacks, since
most of the background traffic is synthetically generated and not the real
burst traffic that is available nowadays.
Four gigabytes of raw TCP data was processed into five million connection
records. A connection represents a sequence of TCP packets starting and ending
at well-defined times, during which data flows from the source to the
destination under some well-defined protocol.
Each connection is labelled as a ‘normal’ or an ‘attack’ record. Though the
dataset is widely used by researchers, it suffers from several problems:
• The number of DoS and Probe records is large while R2L and U2R records are
few, so the classifiers are biased towards the more frequent records.
• The existence of redundant and repeated records tends to bias the results
towards the more frequent records, favouring the DoS and Probe attacks.
• It is outdated, as many new attacks have been discovered since 1999.
The Genetic Algorithm requires a set of initial solutions to work with. These
solutions are arbitrary and provided through the training dataset. Since
features must be selected for each class, the initial solutions for the GA are
the sets of R2L and U2R attack instances. For calculating the feature subset
for R2L attacks, the initial solutions are the 995 connection records, whereas
for U2R attacks, the initial solutions are the 52 records. Since features 2, 3
and 4 (protocol, service and flag respectively) are nominal, they are removed
from the algorithmic process and added back to the feature subset later on.
An initial population size of about 100 chromosomes usually gives a good
solution to an optimization problem; increasing the population size only
increases the computational overhead. However, since there are 8 attacks in
R2L contributing 995 connection records, it was necessary to consider each
attack separately and find an approximate solution representing the aggregated
feature subset for the R2L attack class. Hence, using each connection record
as an initial solution is important.
“Random sources contain more information.” An event which occurs frequently is
easily predictable and hence provides little information when it occurs.
Conversely, an event which is random and less likely to occur carries more
information. This, in a gist, is the concept of information theory. The
information a source event provides is characterised not by the actual data
contained in that event but by its frequency of occurrence, i.e. by the
probability distribution of the samples drawn from it. According to Shannon’s
information theory, the probability of occurrence of an event, along with its
information content, gives the average amount of information (the entropy) of
that source.
The entropy of a feature X can be derived from Formula (3) to form Formula (4),
where n1 represents the number of 1’s in a chromosome and n0 represents the
number of 0’s in a chromosome. The more 1’s a particular feature has across all
chromosomes, the higher its weight. It can be seen that the curve of the
entropy of a feature versus the probability is strictly increasing until the
probability reaches 0.5, after which it becomes strictly decreasing.
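Formulas (3) and (4) are not reproduced in this copy; a standard form of the binary entropy consistent with the description above, with n1 ones and n0 zeros counted for a feature, would be:

```latex
p_1 = \frac{n_1}{n_1 + n_0}, \qquad p_0 = \frac{n_0}{n_1 + n_0}
```

```latex
H(X) = -\,p_1 \log_2 p_1 \;-\; p_0 \log_2 p_0
```

This function attains its maximum of 1 bit at p1 = 0.5, matching the increasing-then-decreasing curve described above.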
The weights indicate the importance of a feature to an attack class. The feature
selection algorithm will improve upon the weights to find the proper solution.
Population size for R2L attack = 995.
For each connection record, the value of each feature was converted to a binary
gene value, 1 or 0, indicating that the feature was selected or rejected
respectively [14]: if the value of the feature is non-zero, the gene value is 1,
otherwise it is 0. For example, the record
(0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,1,0,0,150,25,0.17, …)
is encoded in this way. The encoding is simple, and one can easily identify the
selected genes in a chromosome.
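A minimal sketch of this encoding, assuming the record is given as an array of field strings (the class and method names are hypothetical): the three nominal features are skipped and every remaining non-zero value becomes gene 1.

```java
public class ChromosomeEncoder {
    // Encodes one connection record as a binary chromosome: a gene is 1 if the
    // corresponding numeric feature is non-zero, otherwise 0. The nominal
    // features 2, 3 and 4 (protocol_type, service, flag) are skipped and
    // re-added to the feature subset later, as described in the text.
    public static String encode(String[] record) {
        StringBuilder genes = new StringBuilder();
        for (int i = 0; i < record.length; i++) {
            if (i == 1 || i == 2 || i == 3) continue; // 0-based indices of features 2, 3, 4
            genes.append(Double.parseDouble(record[i]) != 0.0 ? '1' : '0');
        }
        return genes.toString();
    }

    public static void main(String[] args) {
        // First nine fields of the example record from the text.
        String[] record = {"0", "tcp", "ftp_data", "SF", "491", "0", "0", "0", "0"};
        System.out.println(encode(record)); // prints 010000
    }
}
```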
It is obvious that the greater the number of 1s in a chromosome, the higher the
fitness of that chromosome. The proposed feature selection algorithm is defined
in such a way that, after a certain number of generations, the fittest
chromosome acts as a feature vector indicating which features are selected and
which are not.
In the second part of the fitness function, the previously calculated weight of
each feature is multiplied with the corresponding gene in a chromosome. The
genes of the chromosome are multiplied with the normalized weight vector, whose
values lie between 0 and 1, i.e. each weight divided by the sum of all weights.
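The exact fitness formula is not reproduced here, but a sketch consistent with the two parts described above (the count of selected genes plus the genes weighted by the normalized weight vector; the class name is hypothetical) could look like:

```java
public class FitnessSketch {
    // Part 1: the number of selected genes (1s) in the chromosome.
    // Part 2: each gene multiplied by its normalized weight (weight / sum of weights).
    public static double fitness(int[] genes, double[] weights) {
        double weightSum = 0.0;
        for (double w : weights) weightSum += w;
        double f = 0.0;
        for (int i = 0; i < genes.length; i++) {
            f += genes[i];                            // part 1
            f += genes[i] * (weights[i] / weightSum); // part 2
        }
        return f;
    }

    public static void main(String[] args) {
        int[] genes = {1, 0, 1};
        double[] weights = {2.0, 1.0, 1.0};
        // 2 selected genes + (2/4 + 1/4) normalized weight = 2.75
        System.out.println(fitness(genes, weights)); // prints 2.75
    }
}
```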
In the proposed algorithm, rank selection is used. The chromosomes are sorted
in descending order of fitness and ranks are assigned: the top chromosome is
given rank 1, the second best rank 2, and so on. Since the lowest-ranked
chromosomes do not contribute anything to the overall population, they can be
replaced with the highest-ranked chromosomes. This is called the elitism
strategy (survival of the fittest): the top 10% of the chromosomes replace the
bottom ones. After this is done, parents are selected as adjacent chromosomes.
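A sketch of this rank-selection step with elitism (the class name is hypothetical): chromosomes are sorted by descending fitness and the bottom 10% are overwritten with copies of the top 10%.

```java
import java.util.Arrays;
import java.util.Comparator;

public class RankSelection {
    // Returns the population sorted by descending fitness (rank 1 first), with
    // the bottom 10% replaced by copies of the top 10% (elitism).
    public static String[] rankAndReplace(String[] chromosomes, double[] fitness) {
        Integer[] order = new Integer[chromosomes.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> fitness[i]).reversed());
        String[] ranked = new String[chromosomes.length];
        for (int r = 0; r < order.length; r++) ranked[r] = chromosomes[order[r]];
        int elite = Math.max(1, chromosomes.length / 10);
        for (int k = 0; k < elite; k++) {
            ranked[ranked.length - 1 - k] = ranked[k]; // worst replaced by best
        }
        return ranked; // adjacent entries then serve as crossover parents
    }
}
```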
Crossover is the mating operation of chromosomes to create offspring.
Single-point crossover is chosen with a crossover probability of 1. The
crossover probability indicates the ratio with which parents are chosen for
crossover; with a probability of 1, all chromosomes are chosen as parents.
Retaining some of the top chromosomes of the population is done by elitism (as
explained in the selection process). Based on experimental evaluation, the best
results were obtained with the pivot at the 25th position. In essence, the
crossover operation creates new feature subsets to be evaluated by the fitness
function; it is responsible for the exploratory part of the search algorithm.
Mutation maintains genetic diversity in the search space. The mutation
probability is chosen to be 0.05, i.e. the probability with which a particular
gene is mutated is 0.05. For mutation, a random number generator is used for
each bit: if the random number generated is less than 0.05, the gene is
flipped; otherwise it is retained as it is.
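The crossover and mutation operators described above can be sketched as follows (the class name is hypothetical; the 38-bit chromosomes, the pivot at position 25 and the per-bit mutation probability of 0.05 are taken from the text):

```java
import java.util.Random;

public class GeneticOperators {
    static final int PIVOT = 25;           // single-point crossover pivot
    static final double MUTATION_P = 0.05; // per-bit mutation probability

    // Single-point crossover: the two offspring swap tails after the pivot.
    public static String[] crossover(String p1, String p2) {
        return new String[] {
            p1.substring(0, PIVOT) + p2.substring(PIVOT),
            p2.substring(0, PIVOT) + p1.substring(PIVOT)
        };
    }

    // Flip each bit with probability 0.05, otherwise retain it as it is.
    public static String mutate(String chromosome, Random rand) {
        StringBuilder sb = new StringBuilder(chromosome);
        for (int i = 0; i < sb.length(); i++) {
            if (rand.nextDouble() < MUTATION_P) {
                sb.setCharAt(i, sb.charAt(i) == '1' ? '0' : '1');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] children = crossover("1".repeat(38), "0".repeat(38));
        System.out.println(children[0]); // 25 ones followed by 13 zeros
    }
}
```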
4.4.3 Pseudo Code for feature selection
The pseudo-code for feature selection is given below. encode[ ] stores the
encoded initial solutions obtained from the dataset. chromosome[ ] is a vector
which stores the 38-bit chromosomes used by the GA. children[ ] is a vector
which stores the offspring resulting from the crossover operation.
Copy the top 10% of chromosome[ ] to the next generation
for each adjacent pair (chromosome[i], chromosome[i+1]) as parents do
//Single Point Crossover on Parents at pivot position 25
children[i] ← chromosome[i][1..25] + chromosome[i+1][26..38]
children[i+1] ← chromosome[i+1][1..25] + chromosome[i][26..38]
//Mutation
for each bit of children[i] and children[i+1] do
Generate random variable r, 0 ≤ r ≤ 1
if r ≤ 0.05 then
flip bit
else
retain bit
end if
end for
end for
end
The chromosome with highest fitness is the feature subset vector. In this vector, bit 1
indicates that a feature is selected, whereas bit 0 indicates that a feature is rejected.
Include features 2,3,4 in the feature subset.
Summary
A two-stage classifier is constructed with feature selection used in the second
stage of classification. The first stage uses a simple NB classifier trained
with all the features. The second stage uses an entropy-based fitness function
for assigning weights (importance) in the GA; the fitness function is
calculated as described above. The entire algorithm is explained with the help
of an example taking 4 initial solutions into consideration. The misclassified
instances are obtained by comparing the actual class (specified in the NSL-KDD
dataset) and the predicted class (as estimated by the classifier) and then
segregating only the R2L and U2R misclassified records.
CHAPTER 5
EXPERIMENTAL SETUP AND IMPLEMENTATION
The proposed algorithm is built in two stages with the Naïve Bayes classifier:
• All features (stage 1)
• Selected features (stage 2)
In the first stage, the filter is trained with all features present in the
NSL-KDD dataset. The trained filter is tested against the test dataset
consisting of 22544 records. The output of the stage 1 classifier consists of
correctly classified instances and misclassified instances.
In order to improve the detection rate of R2L and U2R attacks, the misclassified R2L and
U2R records are applied to the next stage of filtering. It is expected that at least some
incorrectly classified instances will be caught by the trained filter in the second stage.
The second stage filter is trained for only R2L and U2R attacks and is run in parallel.
1. For R2L attacks, training set is divided into classes: R2L and others.
2. For U2R attacks, training set is divided into classes: U2R and others.
The filter is trained with only the selected features from the proposed feature
selection algorithm.
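The two binary training splits above can be produced by a simple relabelling pass over the class labels; a sketch (the class and method names are hypothetical):

```java
public class OneVsRest {
    // Relabels the training set for one second-stage filter: records of the
    // target attack class keep their label, everything else becomes "others".
    public static String[] relabel(String[] labels, String target) {
        String[] out = new String[labels.length];
        for (int i = 0; i < labels.length; i++) {
            out[i] = labels[i].equals(target) ? target : "others";
        }
        return out;
    }

    public static void main(String[] args) {
        String[] labels = {"r2l", "dos", "u2r", "normal"};
        // The R2L split: r2l,others,others,others
        System.out.println(String.join(",", relabel(labels, "r2l")));
    }
}
```

The U2R split is obtained from the same pass with target "u2r", so the two second-stage filters can be trained independently and in parallel.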
The experiments were run on a Windows PC with an Intel Core i5-3210 processor
running at 2.50 GHz and 4 GB RAM. The classifications were performed in WEKA
(Waikato Environment for Knowledge Analysis) version 3.7.12, a standard tool
for running data mining tasks and machine learning algorithms.
import java.util.Random;
8. public static String crossover(String parents[ ], String children[ ])
// Crossover operation to form offspring stored in children[i] and children[i+1] using parents[i] and parents[i+1].
9. public static String mutation(String children[ ], java.util.Random rand)
// Perform mutation with a random number generator.
After testing with different values of parameters given below, the best
result was considered.
Crossover probability = 1
import weka.core.Instance;
import weka.core.Instances;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.output.prediction.PlainText;
import java.io.*;
3. Read .csv file, compare actual attack class with predicted class
4. Output misclassified instances to r2l_misclassified.arff and u2r_misclassified.arff
// For stage 2 training and testing of the NB classifier, the same functions
are called, except that the training dataset is trained with the features
selected by the proposed Genetic Algorithm. Features are removed using the
filtering techniques provided in WEKA Explorer or by calling the
deleteAttributeAt(int index) method provided in the weka.core.Instances class.
5.2.1 Sample code for obtaining misclassified instances
count++;
select_instances[i] = true;
}
System.out.println(i + "\t" + select_instances[i]);
}
/*
The next stage classifier is also a Naive Bayes classifier with only certain
features selected
*/
BufferedReader test = new BufferedReader(new FileReader("D:\\Actual training and testing ..."));
testing.setClassIndex(testing.numAttributes() - 1);
for (int i = testing.numInstances() - 1; i >= 0; i--)
{
    if (!select_instances[i])
        testing.remove(i);
CHAPTER 6
RESULT ANALYSIS
WEKA tool produces an output whose analysis can lead to important conclusions.
The following expressions are used to analyse classifier output.
True Positives (TP): Percentage of correctly detected R2L and U2R records. (Actual
class = Predicted class)
False Positive (FP): Percentage of records belonging to some other attack class that
are classified as either R2L or U2R records (False Alarm).
True Negative (TN): Percentage of records belonging to some other attack class that
are correctly detected.
False Negative (FN): Percentage of records belonging to R2L and U2R attack class
that are not detected.
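These quantities follow directly from the raw counts; for instance, detecting 2086 of 2466 misclassified R2L records corresponds to a TP rate of about 84.59%. A sketch (the class name is hypothetical):

```java
public class ClassifierMetrics {
    // Rates used in the result analysis, expressed as percentages.
    public static double tpRate(int tp, int fn)    { return 100.0 * tp / (tp + fn); } // detection rate / recall
    public static double fpRate(int fp, int tn)    { return 100.0 * fp / (fp + tn); } // false alarm rate
    public static double precision(int tp, int fp) { return 100.0 * tp / (tp + fp); }

    public static void main(String[] args) {
        // 2086 of the 2466 misclassified R2L records were detected in stage 2.
        System.out.printf("%.2f%n", tpRate(2086, 2466 - 2086)); // prints 84.59
    }
}
```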
For R2L attacks, 21 relevant features are selected, whereas for U2R attacks, 22
features are selected. Table 2 summarizes the results obtained by the feature
selection algorithm.
Attack Type | Features Selected | Number of features selected
R2L | 1,2,3,4,5,6,10,12,23,24,25,26,27,28,29,32,34,36,38,39,41 | 21
U2R | 1,2,3,4,5,6,10,11,12,14,15,17,18,19,23,24,29,31,32,33,34,36 | 22
Table 2. Features selected by the genetic algorithm
It can be seen that of the 2466 misclassified R2L records, 2086 were detected by
the stage 2 classifier with an accuracy of 84.59%. Of the 149 misclassified U2R
records, 143 were detected by the new algorithm with an accuracy of 95.97%. Out of
the 6 record connections that were not classified, 4 of them belonged to the
buffer_overflow attack, 1 belonged to the xterm attack and 1 belonged to the
httptunnel attack.
Class TP Rate (%) FP Rate (%) Precision (%) Recall (%)
The proposed algorithm produced its lowest detection rates of 28% for the
snmpguess attack and 82.58% for snmpgetattack. This can be attributed to the
fact that these attacks work over the UDP protocol, whereas all the attacks in
the training set were based on the TCP protocol. As a result, the classifier
was not trained to detect UDP-based attacks, which can be regarded as one of
the flaws in the dataset construction. The overall detection accuracy rate of
R2L and U2R attacks is calculated below.
The overall detection accuracy of the proposed approach is 86.2% for R2L
attacks and 97% for U2R attacks. This shows that the proposed two-stage
classifier gains a significant improvement in the detection rate, from 10.5%
to 86.2% for R2L attacks and from 25.5% to a substantial 97% for U2R attacks.
Comparing these results with those obtained by Ingre & Yadav, it can be seen
that the two-stage Naïve Bayes classifier provides better results in
classifying instances of the NSL-KDD dataset than Artificial Neural Networks.
High precision means that the algorithm returned more relevant results than
irrelevant ones for a particular class. Recall is the ratio of the attack
records detected correctly to the total number of records belonging to that
class (TP + FN); high recall means that the algorithm returned most of the
relevant results. The proposed approach significantly improves the detection
rate of R2L and U2R attacks.
Next, we characterise the attack types by, for example, their port numbers: FTP
Write tries to obtain an FTP session to send a file so that it can operate as a
local user, and thus port 20 and port 21 will be occupied. We then run WEKA
against these known attacks to obtain their classification accuracy:
The following WEKA output details the U2R and R2L attacks (records detected,
total records, accuracy %):

R2L Attacks
FTP Write | 3 | 3 | 100
Guess Password | 1176 | 1231 | 95.5
Imap | 0 | 1 | 0
Multihop | 14 | 18 | 77.77
Phf | 1 | 2 | 50
Warezmaster | 907 | 944 | 96
Sendmail | 13 | 14 | 92.85
Named | 13 | 17 | 76.47
Snmpgetattack | 147 | 178 | 82.58
Snmpguess | 93 | 331 | 28
Xlock | 8 | 9 | 88.88
Xsnoop | 3 | 4 | 75
Worm | 2 | 2 | 100

U2R Attacks
Buffer Overflow | 16 | 20 | 80
Loadmodule | 2 | 2 | 100
Perl | 1 | 1 | 100
Rootkit | 11 | 11 | 100
Httptunnel | 132 | 133 | 99.24
Ps | 15 | 15 | 100
Sqlattack | 2 | 2 | 100
Xterm | 12 | 13 | 92.3
6.3 Summary
As seen in Table 2, the numbers of features selected for R2L and U2R attacks
are 21 and 22 respectively. This shows that approximately 50% of the features
are relevant, and hence the resulting speed-up is almost 95%. As a result,
improved training times reduce the complexity of execution. To determine
whether the detection rates have improved, study Tables 2, 3 and 4. The number
of attack records detected for each attack type is given in Table 5. Analysis
of the table shows that snmpguess and snmpgetattack are difficult to detect
since the classifier is not trained properly to detect UDP attacks. The
protocol (feature no. 2 in Table 1) is biased towards TCP attacks, and hence
the classifier produces biased results towards TCP-based malicious records.
The detection rate of U2R attacks is high, with only 4 buffer overflow records
not classified properly. A comparison of different classifiers is given in
Tables 14 and 15, showing the variations of the others with respect to Naïve
Bayes.
Finally, the filter feature selection technique GainRatioAttributeEval, along
with a ranker search algorithm, was compared with the proposed algorithm.
Thorough analysis of GainRatioAttributeEval shows that the optimal subset
varies according to the number of features selected. The detection rate of this
FS method reaches 88.6% when 8 features are selected for R2L attacks and 93.95%
when 15 features are selected for U2R attacks.
CHAPTER 7
Conclusion and Future Work
A novel two-stage classifier was used to classify the R2L and U2R attacks. As
a result of the two-stage technique, 89% of U2R and 95% of R2L attacks were
detected, with the features selected by a genetic algorithm using a fitness
function that measured entropy.
The genetic algorithm approach works better than gain-ratio and
correlation-based feature selection. Snmpguess was very difficult to detect,
with a 28% detection rate: the training set consists of only TCP packets, so
UDP-based attacks are hard to detect. The NB tree is more accurate than simple
NB but slower, and thus was not used here. Feature number 23 was the most
important for R2L attacks. This is comparable to the backpropagation technique
in neural networks, which works similarly but takes longer to train.
In the future, other classification techniques could be used to improve the
accuracy further, and the approach could be tested on manually generated
datasets and more realistic traffic such as ISCX.
LIST OF REFERENCES
[1] Hung-Jen Liao, Chun-Hung Richard Lin et al., “Intrusion Detection System:
A comprehensive review”, Journal of Network and Computer Applications 36
(2013), pp. 16-24.
[5] Rebecca Bace and Peter Mell, NIST Special Publication on Intrusion
Detection Systems, National Institute of Standards and Technology, pp. 1-51.
[6] Nicholas Pappas, Network IDS & IPS Deployment Strategies, SANS Institute,
April 2, 2008, pp. 1-64.
[7] Zheng Zhang, Jun Li, C. N. Manikopoulos, Jay Jorgenson, Josh Ucles,
“HIDE: a hierarchical network intrusion detection system using statistical
preprocessing and neural network classification”, IEEE Workshop on Information
Assurance and Security, 2001.
[10] Hesham Altwaijry, Saeed Hui Lu, Jinhua Xu, “Three-level Hybrid Intrusion
Detection System”, International Conference on Information Engineering and
Computer Science (ICIECS 2009), pp. 1-4.
[11] Kok-Chin Khor, Choo-Yee Ting, Somnuk Phon-Amnuaisuk, “Comparing Single
and Multiple Bayesian Classifiers Approaches for Network Intrusion Detection”,
International Conference on Computer Engineering and Applications, 2010,
pp. 325-329.
[13] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of
the KDD CUP 99 Data Set”, Second IEEE Symposium on Computational Intelligence
for Security and Defense Applications (CISDA), 2009.
[14] Yuteng Guo, Beizhan Wang et al., “Feature Selection Based on Rough Set
and Modified Genetic Algorithm for Intrusion Detection”, 5th International
Conference on Computer Science & Education, Hefei, China, August 24-27, 2010,
pp. 1441-1446.
[15] Hua Zhou, Xiangru Meng, Li Zhang, “Application of Support Vector Machine
and Genetic Algorithm to Network Intrusion Detection”, Wireless Communications,
Networking and Mobile Computing 2007 (WiCom 2007), 21-25 Sept. 2007,
pp. 2267-2269.