

Project Report

On

IDS TOOL
(Intrusion Detection and Analysis Tool)

Submitted By:

Akshay Nagar 2014UCP1524


Saurabh Kumar 2014UCP1660
Sandeep Jatolia 2014UCP1650
Manish Kumar 2013UCP1483

Submitted in partial fulfilment of the degree of


Bachelor of Computer Science and Engineering

Under the guidance of

Mr. Dinesh Kumar Tyagi


Assistant Professor

MALAVIYA NATIONAL INSTITUTE OF TECHNOLOGY,

JAIPUR

Department of Computer Science and Engineering
JLN Marg, Jaipur, Rajasthan 302017, Tel: 0141 2713376, Fax: 25209154

CERTIFICATE

This is to certify that the project entitled “Intrusion Detection Tool” has
been completed and submitted by Mr. Akshay Nagar (2014UCP1524),
Mr. Saurabh Kumar (2014UCP1660), Mr. Sandeep Jatolia
(2014UCP1650) and Mr. Manish Kumar (2013UCP1483) under my guidance
during the academic session July 2017 to May 2018 in partial fulfilment
of the requirement for the award of the degree of Bachelor of Technology
(B.Tech) in Computer Science and Engineering, Malaviya National Institute
of Technology, Jaipur.

Date: / / Dinesh Kumar Tyagi

(Assistant Professor)

DECLARATION

We, Akshay Nagar, Saurabh Kumar, Sandeep Jatolia and Manish
Kumar, declare that this report, titled “Intrusion Detection Tool”, and the
work presented in it are our own. We confirm that:
• This work was done towards the partial fulfilment of the degree of
B.Tech at MNIT, Jaipur.
• Where any part of this report has previously been submitted for a
degree or any other qualification at MNIT or any other institution, this
has been clearly stated.
• Where we have consulted the published work of others, this is
always clearly attributed.
• Where we have quoted from the work of others, the source is always
given. With the exception of such quotations, the report is entirely
our own work.
• We have acknowledged all main sources of help.

Akshay Nagar Saurabh Kumar Sandeep Jatolia Manish Kumar

(2014UCP1524) (2014UCP1660) (2014UCP1650) (2013UCP1483)

Date: 22 May 2018

ABSTRACT

A Network Intrusion Detection System (NIDS) detects activity that is not
permitted inside a network by comparing anomalous behaviour with normal
behaviour. Machine learning techniques are used to classify attacks: a
classifier predicts the class label of each attack. R2L and U2R attacks are
currently not detected effectively, so a two-stage classifier with feature
selection in the second stage is used. In the first stage, a simple Naïve
Bayes (NB) classifier is run with all features; useful and relevant features
are then filtered using a genetic algorithm based on entropy. Experiments were
conducted on the NSL-KDD dataset using the WEKA machine learning tool. The
detection rate of R2L (remote to local) and U2R (user to root) attacks is
enhanced in the second stage of classification.

ACKNOWLEDGEMENT

We would like to express our sincere thanks and gratitude to our
esteemed supervisor, Mr. Dinesh Kumar Tyagi (Assistant Professor), for
providing invaluable guidance and sparing precious hours for our work.
His kind cooperation and ideas guided us throughout and made the
completion of this work possible. He also provided technical as well as
administrative support for pursuing it.

We would also like to thank our Head of Department, Dr. Girdhari Singh,
and our project coordinator, Dr. Yogesh Kumar Meena.

Last but not least, we would like to thank the Department of Computer
Science and Engineering, Malaviya National Institute of Technology,
Jaipur, and all those who contributed to the successful realization of
this report.

CONTENTS

Declaration i

Abstract ii

Acknowledgement iii

1. Introduction………………………………………………………………….1

1.1 Intrusion Detection….………………………………………………......1

1.2 Deployment of IDS…..…………………………………………….........1

1.3 Types of Intrusion Detection………………………………..…………...2

1.4 Machine learning in intrusion detection………………………………..7

1.5 Report Organisation……………………………………………………..8

2. Related Work. 9

3. Background……………………………….............................. …………15
3.1 Types of Attacks ………..............................................................15
3.1.1 DOS……………………………........................................15
3.1.2 U2R…………………………………..................................15
3.1.3 R2L…………………………………………………………..15
3.1.4 Probe………………………………………………………...15

3.2 Intrusion Detection techniques………………………......................16


3.2.1 Misuse Detection……………………………………..………16

3.2.2 Anomaly Detection…....……………………………………..16

3.2.3 Semi supervised learning……………………………………16

3.3 Machine Learning Algorithms……………...…………………………17


3.3.1 Random Forest…………………………………………….......17

3.3.2 J4.8 ………………………………………………………..…17

3.3.3 Naïve Bayes…………………….……..……………………17

3.3.4 Simple Logistics………………….......…………………….19

3.3.5 Sequential Minimal Optimization……………...................19

3.3.6 IBk……………………………………….............................19

3.4 Genetic Algorithm …………...…………........................................20

3.5 Weka………………………………………………………….............20

4. Methodology………………………………………………................... 21
4.1 Dataset……………………………………………….......................21
4.2 Workflow……………………....…………..................................... 24

4.3 Analysis of the NSL-KDD dataset…………………………..……..24

4.4 Feature Selection using Genetic Algorithm……………….……...25

4.4.1 Rationale behind Fitness Function…………………………27


4.4.2 Selection, Mutation and Crossover…………………………28
4.4.3 Pseudo code for Feature Selection......………...…………..29

5. Experimental setup and implementation……………………………....31

5.1 Implementation of the feature selection Algorithms.…………....32

5.2 Implementation of two stage classifiers……………………………33

5.3 Sample code………………………………………………………….34

6. Result Analysis.............…………………………..……………............36

6.1 Results of feature selection………………………………….……...36


6.1.1 Results of stage I classifier ……………………………………..37
6.1.2 Results of stage II classifier…………...………………………..37
6.2 Analysis of various attacks by WEKA………………………………..39
6.3 Summary………………………………………………………………. 41

7. Conclusion and Future works…………………………………………..42

8. References………………………………………………………………..43

LIST OF FIGURES

Figure no. Title Page

Figure 1 Deployment strategies for NIDS 2


Figure 2 Components of Snort 3
Figure 3 Architecture of an Intrusion Detection System 6
Figure 4 Various machine learning algorithms in anomaly detection 7
Figure 5 Improved Bayesian classification 10
Figure 6 Screenshot of the NSL-KDD dataset 22

Figure 7 Experimental Setup 30

LIST OF TABLES

Table no. Title Page no.

1 Description of the features of each network connection 23
2 Features selected 37
3 Results of stage I classifier 37
4 Results of stage II classifier for U2R (user to root) attacks 38
5 Results of stage II classifier for R2L (remote to local) attacks 38
6 Weka detailed report about each attack 40

CHAPTER 1
Introduction

1.1 Intrusion Detection

With the rapid advancement in network-based services and information
transfer, it is imperative to provide security. When a website is connected
to a network, it is exposed not only to direct attacks but also to crimes such
as social engineering, where people pretend to be someone else to steal
credentials.

Organizations generally hold a certain amount of critical data and applications
that need to be secured to protect their confidentiality, integrity and availability.
Intruders can be illegitimate users trying to gain access to a network, running
harmful scripts or destroying important files, or legitimate users trying to gain
super-user privileges to perform malicious activities. Intrusions can also take the
form of an attacker scanning open ports and IP addresses for vulnerabilities, or
denying a legitimate user access to a particular service. While organizations
employ firewalls and antivirus software to detect viruses, trojans, worms and
other harmful software, these tools rely on known patterns or signatures, and
must be updated frequently to prevent new malware from slipping through
undetected.

1.2 Deployment of IDS

Depending upon the operating environment, the network infrastructure and the
resources available within an organization, it is necessary to evaluate the
deployment strategies for IDS sensor software. An IDS requires adequate and
appropriate human interaction at all times: the alarms it generates must be
monitored by a security administrator who takes appropriate action. An IDS often
works along with an IPS (Intrusion Prevention System) or an Intrusion Response
System, which judges the severity of an alarm and ideally responds to the
intrusion, taking into consideration the risk assessment model developed by the
organization. The security policies, plans and procedures must be known to all
personnel handling IDS output so that they can act appropriately. Based on the
deployment of sensors, IDS can be classified into Network-based IDS and
Host-based IDS.

Deployment of Network-Based IDS

IDS sensors monitor the network traffic to find malicious activities; both
ingress and egress traffic is inspected. They can be deployed at various points
along the network infrastructure: at the main switch, before the router
connecting two LANs, or in the DMZ (Demilitarized Zone).

Deployment of Host-Based IDS

A Host-Based IDS typically monitors the activities of a single host. It acts in
conjunction with a firewall or anti-virus software to detect attacks. It can be
used to capture the ingress traffic of a standalone application server as well
as to examine the behaviour of a host, and it can raise the level of security of
a system. However, it is extremely tedious, computationally expensive and
costly for organizations to deploy an IDS sensor for each host. As a result,
host-based sensors are used only for select hosts and honeypot systems as a
tool to inspect traffic.

1.3 Types of Intrusion Detection

Signature Detection is most commonly deployed along with firewalls and other
thwarting mechanisms to enhance their performance. Signature detectors were
initially rule-based expert systems; however, with the gradual increase in the
number of attacks, these detectors became unsuitable for real-time detection:
as the rule set grows larger, the comparison speed of the expert system
decreases. Fuzzy rule-based systems and other Artificial Intelligence
techniques have improved pattern matching by deriving new rule sets from
existing rules to detect newer attacks. SNORT and P-BEST are the most
commonly used misuse detectors. Figure 2 shows the components of SNORT.

SNORT is a rule-based Network Intrusion Detection System that uses fast
pattern matching algorithms to detect known patterns in packets. These rules
are created with the help of human interaction and stored in databases for
comparison by the Snort detection engine. The pre-processor component
performs initial packet processing such as analysis of the IP header, feature
identification and separation of the packet payload. The detection engine
compares the packets with the existing rule set and writes alert output to a
log file or the console. SNORT can also be run in inline mode, where it can
prevent attacks (IPS). The main disadvantage of misuse detection is the
constant need to update the rule database.
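The pattern-matching core of such a detection engine can be sketched in a few lines of Python. The rules and packet payloads below are invented examples; real Snort rules also carry protocols, ports, actions and many more options.

```python
# Minimal sketch of misuse (signature) detection in the spirit of Snort's
# detection engine: each rule is a byte pattern plus an alert message.
# The rules and payloads are hypothetical illustrations.

RULES = [
    {"sid": 1000001, "msg": "Possible shellcode NOP sled", "pattern": b"\x90\x90\x90\x90"},
    {"sid": 1000002, "msg": "Anonymous FTP login", "pattern": b"USER anonymous"},
]

def match_rules(payload: bytes, rules=RULES):
    """Return the messages of every rule whose byte pattern occurs in the payload."""
    return [r["msg"] for r in rules if r["pattern"] in payload]

alerts = match_rules(b"220 ready\r\nUSER anonymous\r\n")  # -> ["Anonymous FTP login"]
```

As the prose above notes, this linear scan over the rule set is exactly what slows down as the rule database grows, which is why real engines use fast multi-pattern matching algorithms.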

In stark contrast to this approach is the Anomaly Detection (AD) method, which
assesses incoming packets for any abnormality. An anomaly is a deviation from
normal behaviour: each packet is compared against predefined normal
behaviour. For example, if the number of failed login attempts exceeds a
specified threshold, it can be termed an anomaly. The main advantage of this
kind of detection is that it can detect previously unknown attacks.
The main problem in anomaly-based detection is defining the normal behaviour
of the system. For example, consider an employee of an organization who works
five days a week from 9:00 am to 5:00 pm and logs into his system at 9:00 am
using a password known only to him. Here, an anomaly could be an ‘unknown’
user logging into his system at 8:00 pm, outside office hours, a user logging in
on a holiday, or a number of failed login attempts exceeding the threshold set
by organizational protocols. Sometimes, however, that same employee may
simply have forgotten his password, producing a failed-login anomaly; since he
is a legitimate user, the system should let him recover his password instead of
raising an alarm for a suspicious intrusion. Anomalies in behaviour can be
identified by analysing audit data or by conducting a ‘forensics’ of the network
and servers. Sometimes the data collected by honeypot systems and historical
data are used to create rules and protocols that establish normal behaviour.
Some processes monitor statistical measures such as the rejection (error) rate,
network bandwidth, CPU utilization or the number of requests to an application
server. These measures are used in commercial systems because identifying
deviations in them is straightforward; such systems are threshold-based
anomaly detectors.
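A threshold-based detector of the kind just described can be sketched as follows. The threshold and window size are illustrative choices, not values from any real deployment.

```python
from collections import defaultdict, deque

# Minimal threshold-based anomaly detector: failed logins per user are
# counted over a sliding one-hour window and compared against a cutoff.
# Both constants are invented for illustration.
WINDOW_SECONDS = 3600
THRESHOLD = 3

class FailedLoginMonitor:
    def __init__(self):
        self.events = defaultdict(deque)  # user -> timestamps of failures

    def record_failure(self, user, ts):
        """Record a failed login; return True if the user now exceeds the threshold."""
        q = self.events[user]
        q.append(ts)
        while q and ts - q[0] > WINDOW_SECONDS:  # drop events outside the window
            q.popleft()
        return len(q) > THRESHOLD

mon = FailedLoginMonitor()
# Four quick failures trip the alarm; a lone failure an hour later does not,
# which is how the forgotten-password case above avoids a false alarm.
flags = [mon.record_failure("alice", t) for t in (0, 10, 20, 30, 5000)]
```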

The real value of anomaly detectors lies in detecting unknown attacks.
According to [1], unknown attacks can be variations of previously known
attacks. Denial of Service (DoS) attacks prevent a legitimate user from
accessing a particular resource or service. For example, flooding the network
with unwanted packets clogs the bandwidth, thereby hindering the smooth flow
of traffic across the network. SYN flooding and ARP poisoning are variants of
the DoS attack. The ping of death (PoD) attack splits a packet whose length is
greater than 65,535 bytes into fragments whose reassembly overwhelms the
destination's limited buffer. Such attacks are the result of flaws in protocol
configuration. One solution is to design a protocol that is robust in terms of
security; however, this degrades network performance significantly. Most
attacks are due to improper exception handling by the protocols, so
scrutinizing exceptions in network protocols helps in detecting unknown
attacks.

Another important type of anomaly is the probe attack, i.e. stealthy scanning
of the network for open IP addresses and ports. Usually, when an attacker
wants to carry out an illegal activity or hack into a system, he first carries out
a reconnaissance of the network: finding open network address points through
which he can attack, identifying the OS running on the systems, and running
various tools to ‘probe’ the network. Nmap, TCP stealth port scanning, the
FTP Bounce attack and the SATAN tool exploit flaws in protocol design to probe
the network. While this attack type is not directly dangerous, its detection helps
in building better network models and protocols and in developing secure
organizational rules.

Sometimes, anomalies can be detected by analysing the number of files created
in the root directory, or the number of outbound commands in an FTP session.
For example, if the FTP server is not write-protected and the root directory is
owned by the FTP account, then the attacker can write malicious scripts from
the remote machine to gain local access to the system. To detect this attack, it
is necessary to check for anonymous FTP sessions and whether any files are
created in the root directory. Some FTP servers allow a guest to log in and
download files from the server. However, if the guest account is allowed to
write to the FTP server, then the attacker can upload “warez” copies of illegal
software, compromising the server and gaining local access; this is called the
warezmaster attack. Legitimate clients can then download the illegal “warez”
software onto their machines, which constitutes the warezclient attack.
Monitoring whether a file is being uploaded or downloaded in an FTP session
can thus help detect abnormalities in behaviour.
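A hypothetical check distilled from this description might look as follows; the session record format and field names are invented for illustration.

```python
# Toy detector for the warezmaster pattern described above: an anonymous
# FTP session that uploads (STOR) while the root directory is writable.
# The session dictionary layout is an invented stand-in for real audit data.

def suspicious_ftp(session: dict) -> bool:
    anonymous = session.get("user") == "anonymous"
    uploads = any(cmd.startswith("STOR") for cmd in session.get("commands", []))
    return anonymous and uploads and session.get("root_writable", False)

alert = suspicious_ftp({"user": "anonymous",
                        "commands": ["USER anonymous", "STOR warez.zip"],
                        "root_writable": True})
```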

Since signature-based detection involves comparison with known signatures, it
achieves better accuracy than anomaly detection. A large number of anomaly
detectors are based on probabilistic models, which can never give 100%
certainty that an activity is malicious or normal. Hence, anomaly detectors are
rarely deployed in real environments due to their non-robust performance. As a
result, the research community continues to work on improving detection
efficiency, reducing the false alarm rate, and automating the tasks of alert
classification and intrusion prevention.

With regard to what was discussed earlier, a general architecture of an IDS is
shown in Figure 3. The components of an IDS are discussed below.

Data Acquisition module: It is responsible for acquiring data from the network, either
from the upper layer protocols, or raw network frames or from audit data to analyse
patterns in behaviour.
Packet Decoder: Performs the initial pre-processing of the packet to obtain
initial features. It strips the headers off the payload and acquires the features
used for detection. These features can relate to basic packet information (the
source bytes, the destination address to which the packet is to be transported,
etc.), temporal features such as the number of packets received from a
particular source in one second, or payload information such as the number of
TCP SYN packets successfully received, as well as the content.

Intrusion Response System (IRS): The response management block, which
receives an alert, calculates its severity and takes an adequate response.
Knowledge Base: The expert system that takes reference data from the
detection engine together with expert knowledge on attacks to create new sets
of rules. It examines behavioural patterns in data to establish new models for
identifying attacks.
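A toy version of the decoder's job can be sketched as below. The record layout and field names are invented stand-ins; a real decoder parses raw network frames.

```python
# Hypothetical sketch of the Packet Decoder: strip a simplified header off
# a captured record and emit basic features of the kind listed above.

def decode(packet: dict) -> dict:
    payload = packet["payload"]
    return {
        "src_addr": packet["src"],                  # basic packet information
        "dst_addr": packet["dst"],                  # destination of the packet
        "src_bytes": len(payload),                  # size of the payload
        "syn_flag": packet.get("flags") == "SYN",   # TCP SYN indicator
    }

features = decode({"src": "10.0.0.5", "dst": "10.0.0.9",
                   "flags": "SYN", "payload": b"GET / HTTP/1.1"})
```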

1.4 Machine Learning in Anomaly Detection

Machine Learning is a vast field of Artificial Intelligence that builds models
from existing input data to make predictions or decisions, instead of following
a fixed set of instructions to achieve a certain output. While expert systems
are used in misuse detectors, anomaly-based detection employs machine
learners and other artificial intelligence techniques to estimate the class of an
activity, i.e. whether or not it is an anomaly. The main idea behind using
machine learning in anomaly detection is to develop predictive models, built
from a training dataset, that perform the detection; a good model estimates
most novel malicious activities with a reasonable error rate (low false alarm
rate). One such dataset is the DARPA dataset developed by MIT, consisting of
five weeks of raw TCP dump data; different Windows NT servers were used
and various attacks simulated to obtain it.

Based on whether the classes of attacks are provided or not, learners are
classified as supervised or unsupervised. The most commonly used supervised
learning algorithms are classification and regression algorithms: the simple
Naïve Bayes classifier, Bayesian networks, Hidden Naïve Bayes, Logistic
Regression, decision trees such as ID3, C4.5 and Random Tree classifiers,
CART (Classification and Regression Tree), etc. Similarly, backpropagation-
based Neural Networks (ANN) and Support Vector Machines (SVM) classify an
activity as an anomaly or a normal one. Unsupervised learners include
clustering algorithms such as Simple K-Means and K-Nearest Neighbour
(K-NN), as well as Self-Organising Map (SOM) based neural networks that find
patterns in the training data and cluster it into useful classes. Depending upon
the similarity in functioning, learning algorithms can be classified into the
following categories:

Bayesian Classification: These are stochastic techniques that explicitly employ
Bayes’ theorem of conditional probability to perform classification, e.g. Naïve
Bayes, Bayesian Belief Networks and Hidden Naïve Bayes (HNB). Bayesian
networks are reliable classifiers in a multi-class setting. The simple NB
algorithm assumes mutual independence among the attributes; several
techniques, ranging from ones that calculate conditional probabilities between
attributes to various complex Bayesian networks, have been proposed.
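To make the independence assumption concrete, here is a toy Naïve Bayes classifier over two nominal features. The training records (protocol, TCP flag, class) are invented miniature examples, not NSL-KDD data.

```python
from collections import Counter, defaultdict

# Toy Naïve Bayes: pick the label maximizing P(label) * prod_i P(x_i | label),
# treating the features as mutually independent given the class.
train = [
    ("tcp", "SF", "normal"), ("tcp", "SF", "normal"), ("udp", "SF", "normal"),
    ("tcp", "S0", "attack"), ("tcp", "S0", "attack"), ("icmp", "S0", "attack"),
]

priors = Counter(label for *_, label in train)
cond = defaultdict(Counter)            # (feature index, label) -> value counts
for *features, label in train:
    for i, v in enumerate(features):
        cond[(i, label)][v] += 1

def predict(features):
    def score(label):
        p = priors[label] / len(train)
        for i, v in enumerate(features):
            # simple add-one smoothing so unseen values never zero out a class
            p *= (cond[(i, label)][v] + 1) / (priors[label] + 2)
        return p
    return max(priors, key=score)
```

For example, `predict(("tcp", "S0"))` follows the S0-flag evidence towards the attack class, while `predict(("udp", "SF"))` comes out normal.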
1.5 Report Organization

This report is organized as follows.

Chapter 1 gives the introduction to the problem statement. It provides a
basic overview of the intrusion detection platform, the attacks, the algorithm
to be proposed, and a brief description of what our project is about.
In Chapter 2, we survey the various articles, notable books and research
papers to evaluate what researchers and scholars have written about our
topic.
In Chapter 3, we discuss relevant background related to Snort and Intrusion
Detection Systems. This chapter also includes a literature survey of various
machine learning algorithms and ways to detect intrusions.
In Chapter 4, we discuss the dataset used and the methodology used to
select features using a genetic algorithm.
In Chapter 5, we describe the experimental setup and implementation.
Chapter 6 presents the result analysis, and Chapter 7 concludes the report
with suggestions for future work.

CHAPTER 2
Related Work

There is a large body of work on the design and functioning of IDS. Since
2000, a variety of machine learners, classifier ensembles, and search and
optimization techniques have been used by researchers to classify activities as
anomalies. A competition was conducted by Knowledge Discovery and Data
Mining (KDD) in 1999, whose main task was to build a network intrusion
detector: a predictive model able to distinguish “bad” connections from
“normal” connections. Researchers use the datasets provided by DARPA,
KDD99, ISCX12, etc. to construct their predictive classifiers. It was necessary
to make an in-depth review of the different techniques, understanding their
advantages and limitations; this was done by studying the review papers
[1-2, 4].
Hesham Altwaijry and Saeed Algarny in [9] developed a Bayesian filter to
improve the detection rates of attacks in the KDD99 dataset. They calculated a
score for each feature based on its frequency of occurrence in “normal”
records and in “attack” records, and selected the features with higher scores
to test their filter. They also tested the filter with all features selected, and
discovered that the results were worse in that case. Features 23, 24 and 31
were found to be most relevant, with the detection rate of R2L improving to
85.35% at a threshold value of 0.6. In [10], they proposed a multi-stage Naïve
Bayes filter with each stage trained to detect a different type of attack; the
multi-stage filter is shown in Figure 5.

They discovered that the threshold for each stage can be changed depending
upon the level of security required, and that the training model for each stage
can be exercised with feature selection. While their work was aimed at
reducing false positives, the misclassified instances (where the actual class
differs from the predicted class) are not sent to the next stage; instead, the
normal records are sent to the next filtering stage. Their main aim was to
identify anomalies, not to classify attacks by type. The proposed method
improved the overall detection rate to 96.85%.
In [3], multiple Bayesian classifiers were used, namely the Naïve Bayes
classifier, the expert-elicited Bayesian network (EEBN) and the Bayesian
Network (BN) classifier, with an optimum feature set constructed using
Correlation-based Feature Selection (CFS) and the consistency subset
evaluator. As a single classifier, the NB classifier provided the best results of
the three, with an overall detection rate of 92.3%; for BN and EEBN, the
detection rates were 90.3% and 91.8% respectively. The main problem in their
work was that the detection rates of R2L and U2R were very low: even the
best-performing NB classifier achieved a dismal 12% for R2L and 19.7% for
U2R, while EEBN performed best in the U2R category with a 22.4% detection
rate.
Researchers prefer to work with Bayesian classifiers because of their simplicity
of construction and reasonable accuracy in detecting all attack types. In [12],
Hui Lu and Jinhua Xu compared an unsupervised clustering method, Bayesian
clustering, with the performance of Naïve Bayes and decision trees. They
developed a three-stage
hybrid detection model to segregate attacks, shown in Figure 8. In the first
stage, the dataset was separated into three parts (DoS, probe and others)
using the C4.5 decision tree algorithm; the main rationale was to separate the
R2L, U2R and normal records from the rest. In the second-stage filtering, a
simple NB classifier was used to separate the U2R records from the R2L and
normal records. With this approach the accuracy on R2L and U2R was still
low, so a third stage using a Bayesian clustering approach was added to
improve the detection rate of the R2L attack. With this three-level hybrid
classifier, the detection rates of R2L and U2R reach 48.41% and 97.14%
respectively on the KDD99 dataset. Training on U2R and R2L records is a
problem because of their low number in the dataset. It was seen that using
multiple or hybrid classifiers, in series or in parallel, improved the detection
rate significantly.

In [5], the authors investigated the application of feature selection in IDS. They
employed rough set theory to find a feature subset, and then used a Genetic
Algorithm with a population clustering approach to find the optimal subset from
the remaining features. SVM was used as the classifier on the KDD99 dataset.
Population clustering was used to establish self-adaptive crossover and
mutation rates in the Genetic Algorithm, with the K-Means algorithm using
Euclidean distance applied for this purpose. The feature selection algorithm
selects 11 features, nos. 1, 2, 4, 5, 6, 11, 22, 23, 31, 33 and 35, for all attack
types. The results were compared with other algorithms, and the proposed
method was found to be more efficient: the detection rate increased from
95.27% to 98.21% and the false positive rate reduced from 7.63% to 0.91%.
They used an evolutionary technique to compute the optimal feature subset,
with rigorous mathematical formulas for population clustering.
In [6] (2007), GA was used to calculate the best feature subset and SVM to
test the accuracy of the classifier. Some of the features are redundant, and
their removal in the data pre-processing step greatly enhances the result. The
authors used a weight-based parametric fitness function, with the weights
calculated from frequency of occurrence in the dataset. The optimal fitness
value varied according to a parameter β: for β = 0.9, 5 features were selected,
namely 23, 27, 31, 33 and 38, and the overall detection rate reached 97.14%.
Based on trials, the optimal value of β was found to be 0.5, where 26 features
were selected and the detection rate reached 98.46%. Though SVM is a good
classifier, its detection rate is significantly lower (69%) on the NSL-KDD
dataset [13], and when the training set is large it takes a long time to train.
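A GA-based feature selection of this flavour can be sketched as follows. The per-feature weights, the value of β, and the GA parameters here are invented stand-ins (in [6] the weights came from frequency of occurrence in the dataset); the fitness trades off feature usefulness against subset size.

```python
import random

random.seed(0)  # reproducible toy run

N_FEATURES = 10
# Hypothetical per-feature usefulness weights (features 1, 3, 6, 9 are "useful").
WEIGHTS = [0.9, 0.05, 0.8, 0.05, 0.05, 0.7, 0.05, 0.05, 0.6, 0.05]
BETA = 0.5      # trade-off between the accuracy term and subset size

def fitness(chrom):
    """Weight-based parametric fitness: reward useful features, penalize size."""
    gain = sum(w for bit, w in zip(chrom, WEIGHTS) if bit)
    return BETA * gain - (1 - BETA) * sum(chrom) / N_FEATURES

def evolve(pop_size=20, generations=40, p_mut=0.1):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    for _ in range(generations):
        # tournament selection of parents
        parents = [max(random.sample(pop, 3), key=fitness) for _ in range(pop_size)]
        pop = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = random.randrange(1, N_FEATURES)        # single-point crossover
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                if random.random() < p_mut:              # bit-flip mutation
                    child[random.randrange(N_FEATURES)] ^= 1
                pop.append(child)
    return max(pop, key=fitness)

best = evolve()
selected = [i + 1 for i, bit in enumerate(best) if bit]  # 1-indexed feature numbers
```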
The C4.5 decision tree algorithm can detect U2R and R2L attacks better than
neural networks [4], although with respect to time, backpropagation-based NN
gave better results since it undergoes training, validation and testing phases.
Researchers have used Linear Discriminant Analysis (LDA) [16] as a feature
selection model, and information gain or the gain ratio to select features. Fuzzy
techniques are used along with Hidden Markov Models (HMM) to efficiently
classify abnormalities. In [16], GA was used to select features 2, 3, 4, 5, 6, 27,
33, 40 and 41. The modified J48 (C4.5) algorithm uses an intelligent agent to
calculate the threshold for constructing the nodes of the tree. This modified
J48 classifier, with the optimal GA-based feature selection model, was better
at detecting DoS and probe attacks, with shorter training and testing times as
well.
In [4], a soft computing based approach was used to create rules to detect
attacks. Information gain was used to select the 15 most important features
from the KDD99 Cup dataset. After normalization and fuzzification of the
selected features using the triangular membership function, initial rules were
created and stored as a decision tree. For example,
feature 36 > 0.005 | feature 23 > 0.01 : ipsweep
{normal=0, buffer_overflow=0, loadmodule=0, smurf=0, ipsweep=72,
multihop=0, ftp_write=0, nmap=0, neptune=0, warezclient=0, spy=0}
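Information gain, the feature-ranking measure mentioned above, is the reduction in class entropy achieved by splitting the records on a feature. A toy sketch, on invented two-class records:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction achieved by splitting the records on one feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [l for fv, l in zip(feature_values, labels) if fv == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels  = ["normal", "normal", "attack", "attack"]
perfect = ["a", "a", "b", "b"]   # separates the classes completely -> gain 1 bit
useless = ["a", "b", "a", "b"]   # says nothing about the class -> gain 0
```

Ranking features by this score and keeping the top 15 is, in essence, the selection step described for [4].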

Within GA, a support-confidence based fitness function was introduced, and a
rule was selected if its fitness value was greater than the threshold (0.5). Their
research was restricted to creating rules to detect anomalies, not to attack
classification. The detection rate of attacks was 91.88%, with a high false
positive rate of 18.06%. They also examined the trends in feature selection,
comparing the number of attributes selected against the accuracy of detection:
the accuracy is maximum when the number of features selected is around
20-25. This trend in feature selection can be a topic for further research. They
also compared their GA-based rule creation technique with ANN approaches,
concluding that GA is a form of unsupervised learning compared to ANN: with
neural networks, we need to assign weights to nodes and provide learning,
checking the output regularly via the backpropagation algorithm, whereas GA
searches for a globally optimal solution without this supervision. The accuracy
of GA, and thus of the rule creation, depends upon the definition of the fitness
function. In [5], rules were created for detecting the warezclient and
warezmaster attacks, and [16] used a GA-based technique to create rules for
attacks.
In [7], Self-Organising Maps (SOM) were applied to the KDD99 and NSL-KDD
(2010) datasets to provide an in-depth comparison of ANN-based approaches.
On the KDD99 dataset, the proposed SOM-based algorithm had an accuracy of
92.37%, whereas on the NSL-KDD dataset it was 75.49%. Panda (2010) had
used Multinomial Naïve Bayes + N2B to obtain a success rate of 38.89%.
Because of its high speed and fast convergence rate, SOM can be used in
multi-stage detection. SOMs are also more powerful on static networks than on
dynamic ones, because dynamic networks have memory: they can be trained
to learn sequential or time-varying patterns.
Deshmukh et al. [6] used preprocessing methods such as feature selection and
discretization, and compared the efficiencies of the NB, HNB and NB Tree
classifiers. A Fast Correlation-based Feature Selection technique (FCFS) was
implemented, and equal-width discretization converted the continuous values
of a feature to nominal values; discretization improves the accuracy of NB
classifiers. Using FCFS, 12 features were selected. The detection rate of the
NB Tree was found to be the maximum at 94.6%, compared to 88.2% for NB
and 93.4% for HNB; however, it takes a longer time to construct the model.
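The equal-width discretization used there can be sketched in a few lines; the sample values and bin count below are illustrative.

```python
def equal_width_bins(values, k):
    """Map continuous values to k nominal bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0   # guard against the all-equal-values case
    return [min(int((v - lo) / width), k - 1) for v in values]

# Five readings spread over [0, 10] fall into four width-2.5 bins.
bins = equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], 4)
```

Replacing raw continuous attributes with such bin labels is what lets a nominal-attribute classifier like NB estimate per-bin conditional probabilities instead of assuming a distribution.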
In [3], ANN was used to evaluate the NSL-KDD dataset. The proposed method
uses Levenberg-Marquardt (LM) and BFGS quasi-Newton backpropagation for
updating the weights and biases. There are initially 29 input neurons,
corresponding to 29 features in the set, with the hidden layer using 21
neurons; for binary and 5-class classification, the output layer uses 2 and 5
neurons respectively. Since there are very few U2R and R2L records in the
dataset, it was necessary to balance the classes: to speed up training and
maintain balance, only a subset of normal and other-class records was
selected. With the proposed backpropagation approach, the overall accuracy
achieved was 79.2%; the detection rates of DoS, probe, R2L and U2R were
77.77%, 76.6%, 34.6% and 10.5% respectively. While DoS and probe attacks
are classified reasonably well on the NSL-KDD dataset, R2L and U2R
classification remains the problem. For binary classification (normal and attack
classes), the detection accuracy was 81.2% with a false positive rate of 3.23%.

It can be seen from this review of existing techniques that the false positive
rate can be reduced and the detection rate improved further by using multiple or
hybrid classifiers and by applying feature selection techniques.

13
Feature selection is an important domain that can be applied in the field of IDS.
However, it was necessary to conduct a preliminary study of different feature
selection techniques to understand their performance and determine whether they
were needed in this work. Megha Agarwal [11] compared the performance of
different selection methods for the four attack classes. Removal of redundant and
irrelevant features for a particular class is called feature selection or dimensionality
reduction. It is different from feature extraction, where new features are derived
from existing features. Feature selection methods can be classified mainly into
three types: filter methods, wrapper methods and embedded methods.
Removal of these redundant features shortens the training time, improves the
prediction accuracy of the classifier and simplifies the construction of the classifier
model [14-16]. Each feature selection technique employs a search algorithm for
finding the subset and an evaluation function to give a score to that subset. Filter
methods find an approximate subset, whereas wrapper methods use a classifier to
evaluate the subset. Wrapper methods give an optimal solution but take a longer
time to produce an output than the filter methods.
Correlation-based Feature Selection (CFS) was first discussed in [7]. It is based
on the notion that all features in the selected subset should have a high correlation
to the class attribute, while having low correlation amongst themselves. It gives
each subset a score called the merit of the subset.
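As an illustration, the merit score of a candidate subset of k features can be sketched as below. This uses Hall's CFS merit formula; the class and variable names are our own:

```java
// Sketch of the CFS "merit" score for a candidate feature subset:
// merit = k * rcf / sqrt(k + k*(k-1)*rff), where k is the subset size,
// rcf the mean feature-class correlation and rff the mean
// feature-feature correlation. Names are illustrative.
public class CfsMerit {
    public static double merit(int k, double meanFeatureClassCorr,
                               double meanFeatureFeatureCorr) {
        return (k * meanFeatureClassCorr)
                / Math.sqrt(k + k * (k - 1) * meanFeatureFeatureCorr);
    }

    public static void main(String[] args) {
        // A subset whose features correlate well with the class but little
        // with each other scores higher than a redundant subset.
        double good = merit(12, 0.6, 0.1);   // low feature-feature redundancy
        double bad  = merit(12, 0.6, 0.9);   // highly inter-correlated subset
        System.out.println(good + " > " + bad);
    }
}
```

A subset with high class correlation and low inter-feature correlation maximizes this score, which is exactly the intuition stated above.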
Greedy search techniques such as hill climbing, forward selection, backward
selection, greedy stepwise search, best-first search etc. are used to find feature
subsets, and CFS is used to find the optimal subset amongst them. Other important
selection practices are the Chi-Square subset evaluator, the filtered wrapper
technique, Information Gain (IG), Gain Ratio, LDA and Principal Component
Analysis (PCA). In [11], the detection rate for the GainRatioAttributeEval + Ranker
algorithm was the maximum, at 99.61% for the Naïve Bayes classifier, whereas the
time for model construction was the least for the InfoGainAttributeEval + Ranker
algorithm.

Based on this study of existing techniques, several shortcomings were identified
in the field of IDS; these are discussed in the next chapter.

14
CHAPTER 3

Background

There are various types of intrusion in which an attacker tries to imitate the
packets formed by a legitimate user and attack the loopholes in a network device.
The various types of attacks are listed below:

3.1 Types of attacks on a system


3.1.1 DOS

A DoS attack is a type of attack in which the hacker makes a computing or
memory resource too busy or too full to serve legitimate networking requests,
hence denying users access to a machine. Apache, smurf, Neptune, ping of
death, back, mail bomb, UDP storm etc. are all DoS attacks.

3.1.2 U2R(user to root)

These attacks are exploits in which the hacker starts off on the system with a
normal user account and attempts to abuse vulnerabilities in the system in order
to gain super-user privileges, e.g. perl, xterm.

3.1.3 R2L (remote to user)

A remote to user attack is an attack in which a user sends packets to a machine
over the internet to which s/he does not have access, in order to expose the
machine's vulnerabilities and exploit privileges which a local user would have on
the computer, e.g. xlock, guest etc.

3.1.4 Probe

Probing is an attack in which the hacker scans a machine or a networking
device in order to determine weaknesses or vulnerabilities that may later be
exploited so as to compromise the system. Such scanning techniques are also
commonly used in data mining, e.g. saint, portsweep, mscan, nmap etc.

15
3.2 Intrusion detection techniques

There are three main detection techniques: misuse detection, anomaly
detection and specification-based detection (semi-supervised learning).
3.2.1 Misuse Detection

This technique is also known as signature-based detection. A packet is
detected as an attack if it matches a known sequence of instructions or policies.
In the research by Feng et al., the authors presented Intrpopscopy, a signature
scheme based on a semantic language for detecting malicious packets. In this
approach, signatures are created for all attacks. Signature matching is achieved
using rules such as those in Snort.

3.2.2 Anomaly Detection

This technique relies on machine learning algorithms to detect malicious
behaviour. Features extracted from known malware are used to train the model
and to predict novel or unknown malware. Abah et al. propose a machine learning
approach that relies on a K-Nearest Neighbour classifier to train the model with
features such as bandwidth utilization, device status and running
applications/processes. In another research, Aung et al. propose a framework
which relies on machine learning algorithms for intrusion detection, using features
obtained from a genetic algorithm.

3.2.3 Semi Supervised learning

Semi-supervised learning is used when the amount of labelled training data is
low. To remedy the low accuracy we would otherwise get, we first classify the
dataset using supervised learning on the training samples and then run anomaly
detection for better accuracy. The main advantage of this technique is that it can
be used even when the number of training samples is very low.

16
3.3 Machine Learning Algorithms

3.3.1 Random Forest

This is an ensemble learning algorithm which classifies based on information
aggregated from individual learners. It relies on the bagging approach, where
each classifier is built individually by working with a bootstrap sample of the input
data. Normally, in a decision tree algorithm, the decision is made considering all
the features. In the Random Forest algorithm, however, each decision is made
over a randomly selected subset of the features. This random selection improves
scalability when there is a large number of features. In addition, it reduces the
interdependence between the feature attributes and makes the results less
susceptible to noise.

3.3.2 J48

J48 is based on the implementation of the decision tree algorithm C4.5. In this
algorithm, a node of the tree is created by splitting the dataset. The feature with
the highest information gain is chosen, as it most effectively splits the data into
the class variables. After choosing the feature, a decision node is created to split
on it. The subsets obtained by splitting are then recursed on and added as
children of the decision node.
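The split criterion described above can be sketched as follows. This is a minimal illustration of information gain over class-count distributions, not WEKA's actual J48 code; the class and method names are our own:

```java
import java.util.List;

// Minimal sketch of the information-gain criterion C4.5/J48 uses to pick
// a split: gain = entropy(parent) - weighted sum of child entropies.
public class InfoGain {
    // Shannon entropy (base 2) of a class distribution given by counts.
    public static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain of splitting `parent` into the given child partitions.
    public static double gain(int[] parent, List<int[]> children) {
        int total = 0;
        for (int c : parent) total += c;
        double weighted = 0.0;
        for (int[] child : children) {
            int n = 0;
            for (int c : child) n += c;
            weighted += ((double) n / total) * entropy(child);
        }
        return entropy(parent) - weighted;
    }

    public static void main(String[] args) {
        // Parent node: 5 normal, 5 attack. A perfect split yields 1 bit of gain.
        int[] parent = {5, 5};
        double g = gain(parent, List.of(new int[]{5, 0}, new int[]{0, 5}));
        System.out.println(g); // 1.0
    }
}
```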

3.3.3 Naive Bayes

In this algorithm we assume that all the features are independent of each
other. Classification is based on calculating the maximum probability of the
attributes belonging to a particular class.
The Naïve Bayes classifier is a probabilistic estimator based on the Bayes
theorem of conditional probability. Based on its usage, it can be a simple Naïve
Bayes classifier, Hidden Naïve Bayes, Naïve Bayes updatable or a multinomial
Naïve Bayes classifier. Henceforth the Naïve Bayes classifier will be abbreviated
as the NB classifier in this report. A simple NB classifier assumes mutual
independence amongst the features. In HNB, features that are correlated to each
other are identified, and each such feature is given a hidden parent. The simple
NB classifier is simple, intuitive, easy to construct and gives reasonably good
accuracy. Hence it is one of the most used machine learners for IDS.
Bayes' theorem was formulated on the basis of conditional probability [9]. It
states that given the prior probability of an event A, the likelihood of an event B
occurring when A has occurred, and the overall probability of B, it is possible to
calculate the probability of the occurrence of A when B has occurred. This is
given in formula (1):

P(A|B) = P(B|A) P(A) / P(B)                                   (1)

Consider a set of attributes X1, X2, X3, …, Xn, each contributing to the attack
class C.
Assuming mutual independence amongst the attributes, and calculating the
prior probabilities and the likelihood of the occurrence of an attack in the attack
class, we can estimate the probability of the occurrence of an anomaly (or attack)
when a particular feature is significant. This is shown in formula (2):

P(C | X1, …, Xn) = P(C) ∏i P(Xi | C) / P(X1, …, Xn)           (2)

The denominator can be removed since it is a constant. Hence the problem
reduces to finding the maximum posterior probability to estimate the class of an
attack.

The Bayesian classifier is a supervised learning algorithm based on Bayes'
theorem of conditional probability. It calculates the probabilities of the attack
classes and of normal TCP traffic for different features. The classifier is trained
by giving it labelled traffic, from which it updates the probabilities for each feature.
During the test phase, the classifier estimates the probabilities for each TCP
connection and predicts the class of each instance. For a discrete attribute, it
calculates the prior probability based on the frequency of occurrence in the
training set. For a continuous attribute, it calculates the mean and standard
deviation and then uses a Gaussian distribution to find the prior probability.

Let r = (r1, r2, …, rn) belong to class C, let P(C) be the probability of the class
and P(C|r) the probability of the class for a given feature vector; then a packet is
considered good if

P(C = normal | R = r) > P(C = attack | R = r)
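The decision rule above can be sketched in a few lines. The priors and per-feature likelihoods below are made-up numbers for illustration only; the shared denominator P(r) cancels, so only the unnormalized scores need comparing:

```java
// Minimal sketch of the Naive Bayes decision rule: pick the class
// maximizing P(C) * product of P(feature_i | C). All probabilities
// here are illustrative, not learned from any dataset.
public class NaiveBayesRule {
    // Unnormalized posterior: P(C) * prod_i P(x_i | C).
    public static double score(double prior, double[] likelihoods) {
        double s = prior;
        for (double l : likelihoods) s *= l;
        return s;
    }

    public static boolean isNormal(double[] likeNormal, double[] likeAttack,
                                   double priorNormal, double priorAttack) {
        // Packet is classified "normal" iff P(normal|r) > P(attack|r);
        // the shared denominator P(r) cancels out of the comparison.
        return score(priorNormal, likeNormal) > score(priorAttack, likeAttack);
    }

    public static void main(String[] args) {
        double[] likeNormal = {0.8, 0.7};  // P(x_i | normal), illustrative
        double[] likeAttack = {0.1, 0.2};  // P(x_i | attack), illustrative
        System.out.println(isNormal(likeNormal, likeAttack, 0.9, 0.1)); // true
    }
}
```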

3.3.4 Simple Logistic

This is an ensemble learning algorithm. To evaluate the base learners, this
approach applies logistic regression using simple regression functions. Similar to
linear regression, it tries to find a function that fits the training data well by
computing the weights that maximize the log-likelihood of the logistic regression
function. In this algorithm, the training phase is relatively longer than the testing
phase.

3.3.5 Sequential Minimal Optimization

This is the implementation of SVM in WEKA. In this algorithm, classification is
computed by finding a separator between two classes and then maximizing the
width of the margin. SMO performs the maximization by splitting the problem into
smaller sub-problems, each of which optimizes two multipliers. The algorithm
solves the smallest sub-problems first and adds their results to the overall
optimization. The classifier uses either a Gaussian or a polynomial kernel to map
the data.

3.3.6 IBk

This instance-based learner is a lazy algorithm. It saves all of the training
samples and compares each test sample to the members of the training set until
it finds the closest match. This algorithm is implemented in WEKA as a k-nearest
neighbour classifier. The WEKA implementation sets Euclidean distance as the
default distance function.
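The lookup IBk performs can be sketched as a 1-nearest-neighbour search over stored training vectors (class and method names are illustrative, not WEKA's implementation):

```java
// Minimal sketch of the 1-nearest-neighbour lookup an instance-based
// learner performs: keep all training vectors and return the label of
// the closest one by Euclidean distance.
public class NearestNeighbour {
    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the label of the training sample closest to `query`.
    public static String classify(double[][] train, String[] labels, double[] query) {
        int best = 0;
        for (int i = 1; i < train.length; i++) {
            if (euclidean(train[i], query) < euclidean(train[best], query)) best = i;
        }
        return labels[best];
    }

    public static void main(String[] args) {
        double[][] train = {{0, 0}, {10, 10}};
        String[] labels = {"normal", "attack"};
        System.out.println(classify(train, labels, new double[]{1, 1})); // normal
    }
}
```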

3.4 Tools used

3.4.1 Metasploit

Metasploit is a penetration testing tool that provides a public resource for
researching security vulnerabilities and developing code that allows a network
administrator to break into his own network to identify security risks and
document which vulnerabilities need to be addressed first.

3.4.2 WEKA

Waikato Environment for Knowledge Analysis (WEKA) is a suite of machine
learning software written in Java.
It is used here to test the application with the algorithms mentioned above and
to measure their accuracy. It comes with a graphical user interface, so it is easy
to use.

20
CHAPTER 4

Methodology

This chapter describes the proposed methodology used in the project. The first
section describes the dataset used for the analysis. The next section details the
methodology used to extract features from the dataset. The last section describes
the implementation details of the approach.

4.1 Dataset

The dataset used in the project is from the following source:

• https://github.com/defcom17/NSL_KDD

The dataset is explained in detail later, but it should be noted here that it does
not represent real-world traffic: real-world traffic can be quite biased towards
normal cases rather than attack samples, and the number of records can be so
huge that manual testing is impractical. Hence the data is cleaned first; the
NSL-KDD dataset is a cleaned, refined version of the raw KDD traffic that makes
it suitable for testing.
The screenshot below shows the NSL-KDD dataset with labels.

21
Fig. 6: Screenshot of the KDD training dataset

The NSL-KDD dataset consists of 125973 records in the training set and 22544
records in the test set, with each connection record having 41 features (or
attributes) divided into four categories: Basic features (features 1-9),
Content-based features (features 10-22), Time-based traffic features (features
23-31) and Host-based traffic features (features 32-41). The basic features are
shown below.
Each connection has a class label representing the attack type, and there are
five classes: Normal, DoS, U2R, R2L and Probe.

The NSL-KDD (2010) dataset solved the shortcomings of its predecessor:

• It does not include redundant records, so the classifiers are not biased
towards the more frequent records.
• The detection of attacks in the test set is not biased towards certain types
of records, since duplicate records are removed from the test set.
• The numbers of records in the train and test sets are reasonable, so tests
need not be conducted on a small subset of the dataset.

22
No.  Attribute Name   Description                                          Sample Data

1    Duration         Length of time duration of the connection            0
2    Protocol_type    Protocol used in the connection                      tcp
3    Service          Destination network service used                     ftp_data
4    Flag             Status of the connection - Normal or Error           SF
5    Src_bytes        Number of data bytes transferred from source to      491
                      destination in a single connection
6    Dst_bytes        Number of data bytes transferred from destination    0
                      to source in a single connection
7    Land             Takes the value 1 if source and destination IP       0
                      addresses and port numbers are equal, else 0
8    Wrong_fragment   Total number of wrong fragments in this connection   0
9    Urgent           Number of urgent packets (packets with the urgent    0
                      bit activated) in this connection
10   Hot              Number of "hot" indicators in the content, such as   0
                      entering a system

TABLE 1: CONTENT RELATED FEATURES OF EACH NETWORK CONNECTION VECTOR

23
4.2 Workflow
The proposed system to improve the detection rate of attacks is a two-stage
classifier with feature selection used in the second stage. A simple NB classifier
is used in both stages of classification because of its speed of execution, ease of
construction and reasonably accurate results. The second-stage classifier is run
separately for each attack class to achieve parallelism along with the
improvement in the true positive rate. Reduction of the false positive rate is an
issue, and to address this, the misclassified instances from the first stage are
given as input to the second stage.

In the first stage, the classifier is trained with all features selected and then the
trained filter is evaluated against the test set. The output obtained from the first
stage consists of correctly identified instances and misclassified instances i.e.
only those records whose predicted class value is not the same as the actual
class value. The classifier accuracy is calculated. In the second stage, the
training set is trained only for the R2L attacks and U2R attacks separately. The
Naïve Bayes model filters the records into either two classes (R2L and others or
U2R and others). In the second stage, the classifier is trained with the features
selected by the Genetic Algorithm. The filter is evaluated against the
misclassified records obtained from stage 1.

4.3 Analysis of the NSL-KDD dataset

The KDD99 cup dataset is a benchmark for testing many anomaly detection
algorithms. It was prepared by MIT Lincoln Labs, which acquired nine weeks of
raw TCP dump data from an environment they set up: they bombarded Windows
NT, Solaris, SunOS and Linux systems with multiple attacks over the nine-week
period. The data is not representative of real-life attacks, since most of the
background traffic is synthetically generated rather than the real bursty traffic
that is available nowadays.
Four gigabytes of raw TCP data were processed into five million connection
records. A connection represents a sequence of TCP packets starting and ending
at well-defined times, during which data flows from the source to the destination
under some well-defined protocol.
Each connection is labelled as a 'normal' or an 'attack' record. Though the
dataset is widely used by researchers, it suffers from several problems, given
below:
• The numbers of DoS and probe records are large while the R2L and
U2R attacks are few, so the classifiers are biased towards the more
frequent records.
• The existence of redundant and repeated records tends to bias the
results towards the more frequent records, favouring the DoS and
probe attacks.
• It is outdated, with many newer attacks discovered since 1999.

On account of these problems, KDD99 is not a suitable dataset for research


purposes.

4.4 Feature Selection using Genetic Algorithm


The proposed feature selection technique is based on evolutionary computing.
The Genetic Algorithm combines the search and evaluation of a feature subset,
thereby requiring no additional effort in finding the optimal subsets for the R2L
and U2R attack classes.

The Genetic Algorithm requires a set of initial solutions to work with. These
solutions are arbitrary and are provided through the training dataset. Since
features must be selected for each class, the initial solutions for the GA are the
sets of R2L and U2R attack instances. For the calculation of the feature subset
for R2L attacks, the initial solutions are the 995 connection records, whereas for
U2R attacks, the initial solutions are the 52 records. Since features 2, 3 and 4
(protocol, service and flag respectively) are nominal, they are removed from the
algorithmic process and considered later on in the feature subset.

An initial population size of about 100 chromosomes usually gives a good
solution to an optimization problem; increasing the population size only increases
the computational overhead. However, since there are 8 attacks in R2L
contributing 995 connection records, it was necessary to consider each attack
separately and find an approximate solution representing the aggregated feature
subset for the R2L attack class. Hence using each connection record as an initial
solution is important.
“Random sources contain more information.” An event which occurs frequently
is easily predictable and hence provides less information when it occurs.
Conversely, an event which is random and less likely to occur carries more
information. This is the concept of information theory in a gist. The information
that a source event provides is not the actual data contained in that event but
relates to its frequency of occurrence when some dependent event occurs. The
event is characterised by the probability distribution of the samples drawn from
it. According to Shannon's information theory, the probability of occurrence of an
event along with its information content gives the average amount of information
contained in that event.

Entropy is the average amount of information contained in each event. It gives
the uncertainty of the event and is higher for more random sources. The entropy
of a random event X is calculated in formula (3):

H(X) = − Σx p(x) log2 p(x)                                        (3)

The entropy of a feature X can be derived from formula (3) to form formula (4):

W(X) = − (n1/n) log2(n1/n) − (n0/n) log2(n0/n),  where n = n1 + n0    (4)

Here n1 represents the number of 1's and n0 the number of 0's that the feature
takes across the chromosomes. The more 1's a particular feature has across all
chromosomes, the higher its weight. The curve of the entropy of a feature versus
the probability distribution is strictly increasing until the probability reaches 0.5,
after which it becomes a strictly decreasing function.
The weights indicate the importance of a feature to an attack class. The feature
selection algorithm will improve upon the weights to find the proper solution.

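Assuming Formula (4) is the binary entropy of the proportion of 1's a feature takes across the chromosomes (which matches the curve described above, peaking at probability 0.5), the weight computation can be sketched as:

```java
// Sketch of the entropy-based weight for a feature: count how often the
// feature's gene is 1 (n1) versus 0 (n0) across all chromosomes and take
// the binary entropy of that proportion. Class name is illustrative.
public class FeatureWeight {
    public static double weight(int n1, int n0) {
        int n = n1 + n0;
        double w = 0.0;
        if (n1 > 0) { double p = (double) n1 / n; w -= p * (Math.log(p) / Math.log(2)); }
        if (n0 > 0) { double p = (double) n0 / n; w -= p * (Math.log(p) / Math.log(2)); }
        return w;
    }

    public static void main(String[] args) {
        // The weight peaks when the feature is 1 in half the chromosomes
        // (p = 0.5) and falls off on either side, as described in the text.
        System.out.println(weight(50, 50));  // 1.0
        System.out.println(weight(10, 90));  // smaller than weight(50, 50)
    }
}
```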

Population size for R2L attack = 995

Population size for U2R attack = 52

4.4.1 Theoretical Rationale behind the fitness function

Any chromosome is encoded in such a manner that when a feature is relevant
to the attack, it is selected and has the value 1. When it is redundant and does
not contribute any information to the attack class, it is removed and has the value
0. A statistical analysis of the dataset shows that most of the features that have
a non-zero value contribute some information to the attack, no matter how trivial
that information might be. However, since the dataset is in human-readable
format (for example, feature no. 1, duration, is measured in seconds, 0-∞), when
a feature has the value 0, its contribution to an attack is redundant.
The proposed fitness function will work on sparse data, i.e. data which has
many redundant values.
The first step in a Genetic Algorithm is the encoding. In the proposed approach,
for each connection record, the value of a feature is converted to a binary gene
value of 1 or 0, indicating that the feature is selected or rejected respectively [14].
If the value of the feature is non-zero, the gene value is 1; otherwise it is 0. For
example, a record with the values

(0,tcp,ftp_data,SF,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,1,0,0,150,25,0.17,
0.03,0.17,0,0,0,0.05,0) is encoded as

01000000000000000001100001001111100010

This encoding scheme is trivial and simple, and one can easily identify the
selected genes in a chromosome. Hence the feature vector is a 38-bit vector with
binary values.
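The encoding step can be sketched as follows (the class name and the truncated example record are illustrative; the nominal features are assumed to have been removed already):

```java
// Sketch of the encoding step: each numeric feature value maps to gene 1
// if non-zero and 0 otherwise, giving a binary chromosome string.
public class Encoder {
    public static String encode(double[] features) {
        StringBuilder chromosome = new StringBuilder();
        for (double f : features) chromosome.append(f != 0.0 ? '1' : '0');
        return chromosome.toString();
    }

    public static void main(String[] args) {
        // Truncated record for illustration; a full record yields 38 bits.
        double[] record = {0, 491, 0, 0, 2};
        System.out.println(encode(record)); // 01001
    }
}
```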

It is obvious that the greater the number of 1s in a chromosome, the higher the
fitness of that chromosome. The proposed feature selection algorithm is defined
in such a way that after a certain number of generations, the fittest chromosome
acts as a feature vector indicating which features are selected and which are not.

The first part of the fitness function, Lx, counts the number of 1s in the
chromosome.

In the second part of the fitness function, the weight of each feature, calculated
previously, is multiplied with the corresponding gene of the chromosome. The
genes are multiplied with the normalized weight vector, whose values lie between
0 and 1, i.e. each weight divided by the sum of all weights.

The subset converges to a solution when the average fitness of a generation
improves. After a certain number of generations, the fittest chromosome is
rewarded and selected as the feature subset.
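Since Formula (6) is not reproduced here, the sketch below assumes the fitness is the count of selected genes (Lx) plus the sum of the genes multiplied by their normalized weights, matching the two parts just described; class and variable names are our own:

```java
// Sketch of the fitness evaluation: Lx (number of 1s) plus the sum of
// each selected gene's normalized feature weight. The exact combination
// in Formula (6) is assumed, not reproduced from the report.
public class Fitness {
    public static double fitness(String chromosome, double[] weights) {
        double weightSum = 0.0;
        for (double w : weights) weightSum += w;
        double fit = 0.0;
        for (int i = 0; i < chromosome.length(); i++) {
            if (chromosome.charAt(i) == '1') {
                fit += 1.0;                      // Lx: count of selected genes
                fit += weights[i] / weightSum;   // normalized weight term
            }
        }
        return fit;
    }

    public static void main(String[] args) {
        double[] w = {1.0, 1.0, 2.0};
        System.out.println(fitness("101", w)); // 2 ones + (1 + 2)/4 = 2.75
    }
}
```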

4.4.2 Selection, Crossover and Mutation

A fitter chromosome has a better chance of survival in the next generation.
Chromosomes strive to survive and reach the top. If better chromosomes are
selected as parents, their offspring will also be fitter. This is the theory of
biological evolution as proposed by Darwin. Several selection techniques, such
as tournament selection, fitness-proportionate selection and rank selection, are
employed in practice.

In the proposed algorithm, rank selection is used. The chromosomes are sorted
in descending order of fitness and ranks are given to each of them. The top
chromosome is given rank 1, the second best rank 2, and so on. Since the
lower-ranked chromosomes do not contribute anything to the overall population,
they can be replaced with the higher-ranked chromosomes. This is called the
elitism strategy (survival of the fittest): the top 10% of the chromosomes replace
the lowest-ranked ones. After this is done, parents are selected as adjacent
chromosomes.
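The rank selection with 10% elitism can be sketched as below (class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of rank selection with a 10% elitism strategy: sort chromosomes
// by fitness in descending order (rank 1 = fittest), then overwrite the
// lowest-ranked chromosomes with copies of the top-ranked ones.
public class RankSelection {
    public static String[] select(String[] population, double[] fitness) {
        Integer[] idx = new Integer[population.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort indices so that the fittest chromosome comes first.
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> fitness[i]).reversed());
        String[] ranked = new String[population.length];
        for (int r = 0; r < idx.length; r++) ranked[r] = population[idx[r]];
        // Elitism: the top 10% (at least one) replace the bottom 10%.
        int elite = Math.max(1, population.length / 10);
        for (int i = 0; i < elite; i++) {
            ranked[ranked.length - 1 - i] = ranked[i];
        }
        return ranked;
    }

    public static void main(String[] args) {
        String[] pop = {"000", "111", "101"};
        double[] fit = {0.0, 3.0, 2.0};
        System.out.println(Arrays.toString(select(pop, fit)));
    }
}
```

After this step, crossover parents are simply taken as adjacent chromosomes in the ranked array, as described above.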

Crossover is the mating operation of chromosomes to create offspring.
Single-point crossover is chosen with a crossover probability of 1. The crossover
probability simply indicates the ratio with which the parents are chosen for
crossover; with a crossover probability of 1, all chromosomes are chosen as
parents. Retaining some top chromosomes of the population is done by elitism
(as explained in the selection process). Based on experimental evaluation, it was
found that the best results are obtained with the pivot at the 25th position. In
essence, the crossover operation creates new feature subsets to be evaluated
by the fitness function; it is responsible for the exploitative part of the search.

Mutation is responsible for exploration, i.e. genetic diversity in the search
space. The mutation probability is chosen to be 0.05, i.e. the probability with
which a particular gene is mutated is 0.05. For mutation, a random number is
generated for each bit. If the value of the random number is less than 0.05, the
gene is mutated; otherwise it is retained as it is.
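The two operators can be sketched as below; the pivot position (25) and mutation probability (0.05) are taken from the text, while the class names and the fixed Random seed are only illustrative:

```java
import java.util.Random;

// Sketch of single-point crossover (pivot at position 25) and bitwise
// mutation (probability 0.05) on binary chromosome strings.
public class GeneticOperators {
    static final int PIVOT = 25;
    static final double MUTATION_PROB = 0.05;

    // Swap the tails of the two parents after the pivot position.
    public static String[] crossover(String p1, String p2) {
        String c1 = p1.substring(0, PIVOT) + p2.substring(PIVOT);
        String c2 = p2.substring(0, PIVOT) + p1.substring(PIVOT);
        return new String[]{c1, c2};
    }

    // Flip each bit independently with probability MUTATION_PROB.
    public static String mutate(String chromosome, Random rng) {
        StringBuilder out = new StringBuilder(chromosome);
        for (int i = 0; i < out.length(); i++) {
            if (rng.nextDouble() < MUTATION_PROB) {
                out.setCharAt(i, out.charAt(i) == '1' ? '0' : '1');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String p1 = "1".repeat(38);
        String p2 = "0".repeat(38);
        String[] children = crossover(p1, p2);
        System.out.println(children[0]); // 25 ones followed by 13 zeros
        System.out.println(mutate(children[0], new Random(42)));
    }
}
```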
4.4.3 Pseudo Code for feature selection

The pseudo-code for feature selection is given below. encode[ ] stores the
encoded initial solutions obtained from the dataset, chromosome[ ] is a vector
which stores the 38-bit chromosomes used by the GA, and children[ ] is a vector
which stores the offspring resulting from the crossover operation.

Algorithm 1 Feature Selection

Separate R2L and U2R attack instances
Set population_size_r2l = 995, population_size_u2r = 52
pop[i][j] ← ith record and jth feature (here pop[ ] is a vector)
Remove features 2, 3, 4 from Table 1
for i = 0 to pop.length-1 do
    if value of jth feature > 0 then
        encode[i][j] ← 1 (here encode[ ] is a vector)
    else
        encode[i][j] ← 0
    end if
end for
for specific no. of generations do
    chromosome[i] ← encode[i][j] (here chromosome[ ] is a vector)
    Calculate fitness[i] using Formula (6) (here fitness[ ] is a vector)
    Sort chromosome[i] based on fitness[i]
    Assign ranks to chromosome[i] in descending order
    Remove 10% of least fit chromosome[i]
    Copy 10% of best chromosome[i] to next generation
    for all chromosome[i] as parent do
        // Single Point Crossover on Parents
        children[i] ← chromosome[i]
        children[i+1] ← chromosome[i+1]
        // Mutation
        for each bit do
            Generate random variable r, 0 ≤ r ≤ 1
            if r ≤ 0.05 then
                flip bit
            else
                retain bit
            end if
        end for
    end for
end for

The chromosome with the highest fitness is the feature subset vector. In this
vector, bit 1 indicates that a feature is selected, whereas bit 0 indicates that a
feature is rejected. Include features 2, 3, 4 in the feature subset.

Summary

A two-stage classifier is constructed with feature selection used in the second
stage of classification. The first stage uses a simple NB classifier trained with all
the features. The second stage uses an entropy-based fitness function for
assigning weights (importance) in the GA; the fitness function is calculated using
Formula (6). The entire algorithm is explained with the help of an example taking
4 initial solutions into consideration. The misclassified instances are obtained by
comparing the actual class (specified in the NSL-KDD dataset) with the predicted
class (as estimated by the classifier) and then segregating only the R2L and U2R
misclassified records. The experimental setup and result analysis are presented
in the next chapter.

30
CHAPTER 5
EXPERIMENTAL SETUP AND IMPLEMENTATION
The proposed algorithm is built in two stages with the Naïve Bayes classifier:

• Stage 1: running with all 41 features

• Stage 2: running with the selected features

In the first stage, the filter is trained with all features present in the NSL-KDD
dataset. The trained filter is tested against the test dataset consisting of 22544
records. The output of stage 1 classifier will be:

1. Records which are correctly classified. (actual class == predicted class).

2. Records which are misclassified. (actual class != predicted class)

In order to improve the detection rate of R2L and U2R attacks, the misclassified R2L and
U2R records are applied to the next stage of filtering. It is expected that at least some
incorrectly classified instances will be caught by the trained filter in the second stage.
The second stage filter is trained for only R2L and U2R attacks and is run in parallel.

Training is provided by dividing the training dataset into 2 classes:

1. For R2L attacks, training set is divided into classes: R2L and others.

2. For U2R attacks, training set is divided into classes: U2R and others.

The filter is trained with only the selected features from the proposed feature
selection algorithm.

31
The experiments were run on a Windows PC with an Intel Core i5-3210
processor running at 2.50 GHz and 4 GB RAM. The classifications were
performed in WEKA (Waikato Environment for Knowledge Analysis) version
3.7.12, a standard tool for running data mining tasks and machine learning
algorithms.

5.1 Implementation of the feature selection algorithm


The feature selection algorithm was coded in Java using the NetBeans IDE
version 8.0.2. WEKA classes were imported to fetch the dataset and segregate
the R2L and U2R records:

Create new project > include weka.jar in Libraries > FeatureSelection.java

// Import WEKA classes
import weka.core.Instances;
import weka.core.Instance;
import weka.core.Attribute;
import java.io.*;
import java.util.Random;

The algorithm has the following methods:

1. public void getDataSet(String path)
2. public void getParameters(Instances dataset) // Fetches parameters such as
   the class index, no_of_instances, instance_no, instances_attributes etc.
3. public static String encode(Instances instance) // Used for encoding the
   dataset into chromosomes with the encoding scheme specified in Section 5.3
4. public static double calculateWeight(String encode[ ]) // Calculates the
   weights of individual features, stored in weights_r2l[ ] and weights_u2r[ ]
5. public static double fitnessR2L(String encode[ ], weights_r2l[ ])
6. public static double fitnessU2R(String encode[ ], weights_u2r[ ]) // Calculates
   the fitness of a chromosome using the fitness function defined in formula (4)
7. public static void selection(String encode[ ], int rank) // Sorts the
   chromosomes and assigns each a rank
8. public static String crossover(String parents[ ], String children[ ]) // Crossover
   operation to form offspring stored in children[i] and children[i+1] using
   parent[i] and parent[i+1]
9. public static String mutation(String children[ ], java.util.Random rand)
   // Performs mutation with a random number generator

After testing with the different parameter values given below, the best result
was considered:

Mutation probability = 0.03, 0.05
Crossover probability = 1
Number of generations = 100, 300, 500

5.2 Implementation of two-stage classifier


// Importing WEKA classes in Java code

import weka.core.Instance;
import weka.core.Instances;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.output.prediction.PlainText;
import java.io.*;

//Training the stage 1 classifier

1. Read training set
2. Set class index ← attack class
3. Create model ← (Classifier) new NaiveBayes(); //NaiveBayes class in WEKA
4. Build classifier on training set
5. Save model to a .model file
6. Evaluate model on training set
7. Print summary
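The "save model as .model" step relies on standard Java object serialization, which is what WEKA's .model files use underneath. A minimal sketch follows (the helper class is hypothetical; a trained WEKA classifier object would take the place of the serializable placeholder used in testing):

```java
import java.io.*;

public class ModelIO {
    // Save any serializable model to a .model file. WEKA's .model files
    // are produced via standard Java object serialization.
    public static void saveModel(String path, Object model) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(model);
        }
    }

    // Load the model back; the caller casts it to the expected classifier type.
    public static Object loadModel(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return in.readObject();
        }
    }
}
```

Saving the trained classifier this way means stage 1 training runs once; the testing and misclassification steps below simply reload the .model file.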

//Testing the stage 1 classifier


1. Read testing set
2. Set class index ← attack class
3. Read model saved as .model
4. Re-evaluate model on test set
5. Print Confusion matrix, precision, recall of R2L and U2R attacks

//Obtain misclassified instances

1. Evaluate model on test set
2. Output predictions to .csv file
3. Read .csv file, compare actual attack class with predicted class
4. Output misclassified instances to r2l_misclassified.arff and u2r_misclassified.arff

For stage 2 training and testing of the NB classifier, the same functions are called, except that the classifier is trained on the features selected by the proposed genetic algorithm. Unselected features are removed either with the filtering techniques provided in WEKA Explorer or by calling the deleteAttributeAt(int index) method of the weka.core.Instances class.
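When unselected features are removed by index (e.g. via deleteAttributeAt), the deletions must proceed from the highest index downward, otherwise each removal shifts the positions of the attributes still to be deleted. A small self-contained sketch of this pattern (hypothetical names, with plain Java lists standing in for WEKA attributes):

```java
import java.util.ArrayList;
import java.util.List;

public class AttributeFilter {
    // Keep only the attributes at the selected indices. Deletion runs from
    // the last index to the first so that earlier removals do not shift the
    // positions of indices still pending -- the same reason the sample code
    // in Section 5.2.1 iterates over instances from last to first.
    public static List<String> keepSelected(List<String> attributes, int[] selected) {
        List<String> result = new ArrayList<>(attributes);
        List<Integer> keep = new ArrayList<>();
        for (int idx : selected) keep.add(idx);
        for (int i = result.size() - 1; i >= 0; i--) {
            if (!keep.contains(i)) result.remove(i);
        }
        return result;
    }
}
```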
5.2.1 Sample code for obtaining misclassified instances

BufferedReader opstage1reader = new BufferedReader(new FileReader(
        "D:\\Actual training and testing dataset\\Output models\\opstage1_predictions_preprocessed.arff"));
Instances opstage1 = new Instances(opstage1reader);
double[] actual = new double[opstage1.numInstances()];
double[] predicted = new double[opstage1.numInstances()];
boolean[] select_instances = new boolean[opstage1.numInstances()];
int count = 0;
for (int i = 0; i < opstage1.numInstances(); i++) {
    actual[i] = opstage1.instance(i).value(1);
    predicted[i] = opstage1.instance(i).value(2);
    System.out.println(opstage1.instance(i).value(0) + "\t" + actual[i] + "\t" + predicted[i]);
    if ((actual[i] != predicted[i]) && actual[i] == 2) {
        count++;
        select_instances[i] = true;
    }
    System.out.println(i + "\t" + select_instances[i]);
}
/*
The next-stage classifier is also a Naive Bayes classifier with only certain
features selected.
*/
BufferedReader test = new BufferedReader(new FileReader(
        "D:\\Actual training and testing dataset\\Training and testing with normal and attack classes\\Testing_nosuccpred.arff"));
Instances testing = new Instances(test);
testing.setClassIndex(testing.numAttributes() - 1);
//Iterate from the last instance so removals do not shift the indices still to be visited
for (int i = testing.numInstances() - 1; i >= 0; i--) {
    if (!select_instances[i])
        testing.remove(i);
}
CHAPTER 6
RESULT ANALYSIS
The WEKA tool produces output whose analysis can lead to important conclusions. The following measures are used to analyse the classifier output.

True Positives (TP): Percentage of correctly detected R2L and U2R records (actual class = predicted class).

False Positives (FP): Percentage of records belonging to some other attack class that are classified as either R2L or U2R records (false alarms).

True Negatives (TN): Percentage of records belonging to some other attack class that are correctly detected.

False Negatives (FN): Percentage of records belonging to the R2L and U2R attack classes that are not detected.

A confusion matrix visualizes the performance of a classification algorithm by displaying the number of true positives, false positives, true negatives and false negatives that the classifier achieves.
The Detection Accuracy Rate (DAR) refers to the total number of attack records belonging to a particular class that are detected successfully. The parameters used to measure the performance of the algorithm are Precision and Recall. Precision is the ratio of attack records detected correctly to the total number of records detected (TP + FP); high precision means that the algorithm returned more relevant results than irrelevant ones for a particular class. Recall is the ratio of attack records detected correctly to the total number of records belonging to that class (TP + FN); high recall means that the algorithm returned most of the relevant results.
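These two measures reduce to simple ratios over the confusion-matrix counts, as in the plain-Java sketch below (the class name is hypothetical):

```java
public class ClassifierMetrics {
    // Precision: fraction of records flagged as the attack class that truly
    // belong to it, TP / (TP + FP).
    public static double precision(int tp, int fp) {
        return (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Recall: fraction of records of the attack class that were detected,
    // TP / (TP + FN).
    public static double recall(int tp, int fn) {
        return (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
    }
}
```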

6.1 Results of feature selection

For R2L attacks, 21 relevant features are selected, whereas for U2R attacks, 22 features are selected. Table 2 summarizes the results obtained by the feature selection algorithm.

Attack Type   Features Selected                                                Number of features selected
R2L           1,2,3,4,5,6,10,12,23,24,25,26,27,28,29,32,34,36,38,39,41         21
U2R           1,2,3,4,5,6,10,11,12,14,15,17,18,19,23,24,29,31,32,33,34,36      22

Table 2. Features selected by genetic algorithm

6.1.1 Stage 1 Classifier Results

Class     TP Rate (%)   FP Rate (%)   Precision (%)   Recall (%)
Normal    86.9          23.9          73.3            86.9
DoS       71.1          4.1           89.6            71.1
U2R       25.5          8.5           2.6             25.5
R2L       10.5          1.3           52.2            10.5
Probe     81.6          3.2           75.4            81.6

Table 3: Stage 1 classifier without feature selection in WEKA


It was found that the classifier misclassifies 6489 of the total 22544 test records. Of these misclassified records, 149 were U2R attack instances and 2466 were R2L attack instances. They were stored in u2r_misclassified.arff and r2l_misclassified.arff respectively, so that they could be fed to stage 2 of the classification.

6.1.2 Stage 2 results

Of the 2466 misclassified R2L records, 2086 were detected by the stage 2 classifier, an accuracy of 84.59%. Of the 149 misclassified U2R records, 143 were detected by the new algorithm, an accuracy of 95.97%. Of the 6 records that were still not classified, 4 belonged to the buffer_overflow attack, 1 to the xterm attack and 1 to the httptunnel attack.
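The stage 2 accuracies quoted above follow directly from these counts; a small sketch (hypothetical class name) shows the arithmetic:

```java
public class DetectionRate {
    // Detection Accuracy Rate: detected records over total records of a
    // class, expressed as a percentage.
    public static double dar(int detected, int total) {
        return 100.0 * detected / total;
    }
}
```

For example, 2086 of 2466 misclassified R2L records detected gives 84.59%, and 143 of 149 misclassified U2R records gives 95.97%.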

Class     TP Rate (%)   FP Rate (%)   Precision (%)   Recall (%)
R2L       84.6          0             1               84.6
Others    0             15.4          0               0

Correctly Classified Instances – 2086 (84.59%)
Incorrectly Classified Instances – 380 (15.41%)
Classifier Accuracy – 84.59%

Table 4. Results of Stage 2 R2L classifier obtained by WEKA

Class     TP Rate (%)   FP Rate (%)   Precision (%)   Recall (%)
U2R       95.9          0             1               95.97
Others    0             4.02          0               0

Correctly Classified Instances – 143 (95.97%)
Incorrectly Classified Instances – 6 (4.02%)
Classifier Accuracy – 95.97%

Table 5. Results of Stage 2 U2R classifier obtained by WEKA


The overall detection accuracy rate of the proposed approach is 86.2% for R2L attacks and 97% for U2R attacks. This shows that the proposed two-stage classifier achieves a significant improvement in the detection rate, from 10.5% to 86.2% for R2L attacks and from 25.5% to a substantial 97% for U2R attacks. Comparing these results with those obtained by Ingre & Yadav, it can be seen that the two-stage Naïve Bayes classifier classifies instances of the NSL-KDD dataset better than Artificial Neural Networks. The proposed approach thus significantly improves the detection rate of R2L and U2R attacks.

6.2 Analysis of the detection rates of attacks

Next, the attack types are characterized by the ports and services they use. For example, FTP Write tries to open an FTP session to upload a file so that it can operate as a local user, thus occupying ports 20 and 21. WEKA is then run against these known attacks to obtain their per-attack classification accuracy:

The following detailed WEKA output covers the individual U2R and R2L attacks:

Attack Type        Number of detected records   Total number of attack records   Detection Rate (%)

R2L Attacks
FTP Write          3                            3                                100
Guess Password     1176                         1231                             95.5
Imap               0                            1                                0
Multihop           14                           18                               77.77
Phf                1                            2                                50
Warezmaster        907                          944                              96
Sendmail           13                           14                               92.85
Named              13                           17                               76.47
Snmpgetattack      147                          178                              82.58
Snmpguess          93                           331                              28
Xlock              8                            9                                88.88
Xsnoop             3                            4                                75
Worm               2                            2                                100

U2R Attacks
Buffer Overflow    16                           20                               80
Loadmodule         2                            2                                100
Perl               1                            1                                100
Rootkit            11                           11                               100
Httptunnel         132                          133                              99.24
Ps                 15                           15                               100
Sqlattack          2                            2                                100
Xterm              12                           13                               92.3

Table 6: WEKA detailed report

According to the results above, the proposed algorithm produced its lowest detection rates of 28% for the snmpguess attack and 82.58% for snmpgetattack. This can be attributed to the fact that these attacks work over the UDP protocol, whereas all the attacks in the training set were based on TCP. As a result, the classifier was not trained to detect UDP-based attacks, which can be considered one of the flaws in the dataset construction.

6.3 Summary

As seen in Table 2, the numbers of features selected for R2L and U2R attacks are 21 and 22 respectively. This shows that approximately 50% of the features are relevant, and the resulting speed-up is almost 95%; the improved training times reduce the complexity of execution. Tables 3, 4 and 5 show how the detection rates have improved. The number of attack records detected for each attack type is given in Table 6. Analysis of that table shows that snmpguess and snmpgetattack are difficult to detect, since the classifier is not trained to detect UDP attacks: the protocol attribute (feature no. 2 in Table 1) is biased towards TCP attacks, and hence the classifier produces results biased towards TCP-based malicious records. The detection rate of U2R attacks is high, with only 4 buffer overflow records not classified properly. Different classifiers are compared in Table 14 and Table 15, showing how the others vary with respect to Naïve Bayes.

Finally, the filter feature selection technique GainRatioAttributeEval, together with the Ranker search algorithm, was compared with the proposed algorithm. Thorough analysis of GainRatioAttributeEval shows that the optimal subset varies with the number of features selected. The detection rate of this FS method reaches 88.6% when 8 features are selected for R2L attacks and 93.95% when 15 features are selected for U2R attacks.

CHAPTER 7
Conclusion and Future Work

A novel two-stage classifier was used to classify R2L and U2R attacks. Using the two-stage technique with features selected by the genetic algorithm, 86.2% of R2L and 97% of U2R attacks were detected. The features were selected using a fitness function based on entropy.

The genetic algorithm approach works better than gain-ratio and correlation-based feature selection. Snmpguess remained very difficult to detect, with only a 28% detection rate: the training set consists of only TCP packets, so UDP-based attacks are hard to detect. The NB tree is more accurate than simple Naive Bayes but slower, and was therefore not used here. Feature number 23 was the most important for R2L attacks. The approach is comparable to the backpropagation technique in neural networks, which works similarly but takes longer to train.

In the future, other classification techniques could be used to improve the accuracy further, and the approach could be tested on manually generated datasets and more realistic traffic such as ISCX.

LIST OF REFERENCES

[1] Hung-Jen Liao, Chun-Hung Richard Lin, et al., “Intrusion Detection System: A Comprehensive Review”, Journal of Network and Computer Applications, Vol. 36 (2013), pp. 16-24.

[2] Monowar H. Bhuyan, D.K. Bhattacharyya, J.K. Kalita, “Network Anomaly Detection: Methods, Systems and Tools”, IEEE Communications Surveys & Tutorials, Vol. 16, No. 1, First Quarter 2014.

[3] DARPA dataset evaluation, MIT Lincoln Laboratories, 1999. Online: http://www.ll.mit.edu/ideval/docs/

[4] Carlos A. Catania, Carlos Garcia Garino, “Automatic network intrusion detection: Current techniques and open issues”, Journal of Computers and Electrical Engineering, Vol. 38, Issue 5, Sept 2012, pp. 1062-1072.

[5] Rebecca Bace and Peter Mell, NIST Special Publication on Intrusion Detection Systems, National Institute of Standards and Technology, pp. 1-51.

[6] Nicholas Pappas, Network IDS & IPS Deployment Strategies, SANS Institute, April 2, 2008, pp. 1-64.

[7] Zheng Zhang, Jun Li, C. N. Manikopoulos, Jay Jorgenson, Josh Ucles, “HIDE: a hierarchical network intrusion detection system using statistical preprocessing and neural network classification”, IEEE Workshop on Information Assurance and Security, 2001.

[8] Guiling Zhang, Yongzhen Ke, Zhichao Li, Mingjie E, “Improvements of Payload-based Intrusion Detection Models by using Noise against Fuzzy SVM”, Journal of Networks, Vol. 6, No. 2, Feb 2011, pp. 330-340.

[9] Hesham Altwaijry, Saeed Algarny, “Bayesian based intrusion detection system”, Journal of King Saud University – Computer and Information Sciences, Vol. 24 (2012), pp. 1-6.

[10] Hui Lu, Jinhua Xu, “Three-level Hybrid Intrusion Detection System”, International Conference on Information Engineering and Computer Science (ICIECS 2009), pp. 1-4.

[11] Kok-Chin Khor, Choo-Yee Ting, Somnuk Phon-Amnuaisuk, “Comparing Single and Multiple Bayesian Classifiers Approaches for Network Intrusion Detection”, International Conference on Computer Engineering and Applications, 2010, pp. 325-329.

[12] Datta H. Deshmukh, Tushar Ghorpade, Puja Padiya, “Intrusion Detection System by Improved Preprocessing Methods and Naïve Bayes Classifier using NSL-KDD 99 Dataset”, 2014 International Conference on Electronics and Communication Systems (ICECS), 13-14 Feb 2014, pp. 1-7.

[13] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set”, Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2009.

[14] Yuteng Guo, Beizhan Wang, et al., “Feature Selection Based on Rough Set and Modified Genetic Algorithm for Intrusion Detection”, 5th International Conference on Computer Science & Education, Hefei, China, August 24-27, 2010, pp. 1441-1446.

[15] Hua Zhou, Xiangru Meng, Li Zhang, “Application of Support Vector Machine and Genetic Algorithm to Network Intrusion Detection”, Wireless Communications, Networking and Mobile Computing 2007 (WiCom 2007), 21-25 Sept. 2007, pp. 2267-2269.

[16] Datta H. Deshmukh, Tushar Ghorpade, Puja Padiya, “Intrusion Detection System by Improved Preprocessing Methods and Naïve Bayes Classifier using NSL-KDD 99 Dataset”, 2014 International Conference on Electronics and Communication Systems (ICECS), 13-14 Feb 2014, pp. 1-7.
