Network Intrusion Detection System Using Single Level Multi-Model Decision Trees




J Component Final Report
Information Security Management (CSE3502)
Submitted in partial fulfilment of the requirements for the degree of

Bachelor of Technology
in

Vellore Institute of Technology


by

1. Adhitya Maniyan (18BCE2294)
2. B Sai Rohit (18BCE0680)
3. Lagisetty Pullaiah Sampath (18BCI0234)

Under the guidance of

Prof. Amutha Prabakar


School of Computer Science and Engineering
VIT, Vellore

DECLARATION

We hereby declare that the project entitled “Network Intrusion Detection System
Using Single Level Multi-Model Decision Trees”, submitted by our team for
the award of the degree of Bachelor of Technology in Information Security
Management to VIT, is a record of bona fide work carried out by our team under
the supervision of Prof. Amutha Prabakar.

We further declare that the work reported in this project has not been
submitted and will not be submitted, either in part or in full, for the
award of any other degree or diploma in this institute or any other
institute or university.

Place: Vellore

TABLE OF CONTENTS

S.No: Contents

1. Abstract
2. Introduction
3. Literature review
4. Technical Specifications
5. Attacks
6. Methodology
7. Architecture Diagram
8. Implementation
9. Conclusion
10. References

ABSTRACT
An intrusion detection system (IDS) is a software application that monitors a
network or system for malicious activity or policy violations. Any intrusion or
protocol violation is reported to an authorized person or administrator.
Intrusion detection helps to identify malicious activity that could harm or
damage important data or services.

We have built an intrusion detection system using machine learning that detects
attacks performed on a network or system. It analyses network parameters such as
flag and srv_count and, on the basis of these values, predicts whether an attack
is being performed. The number of false positives has been reduced to a large
extent, and the prediction accuracy is high.

INTRODUCTION
An intrusion detection system allows packets to pass and then, based on the
behaviour of those packets in the system, stops further incoming packets. An
advantage of an intrusion detection system is that it is fast, so only a small
number of infected packets are able to enter the system, and the entry of such
packets is stopped quickly, which saves the system from much damage.

Today almost everyone is connected to the Internet, most activities are carried
out online, and devices constantly exchange data packets with one another.
Several types of attack have evolved that can not only harm an organisation's
private data but also disrupt services, as in the case of a denial-of-service
attack. A system that can detect such malicious packets and protect against
these losses is therefore the need of the hour.

An intrusion detection system is a must for any company or organization that
needs to protect its internal systems and data from attackers. The IDS that we
have built protects systems from four different kinds of attack by detecting
them with good accuracy. We have focused in particular on decreasing the number
of false positives, since a high number of false positives continuously disrupts
the normal working of a company even though there is no danger or attack.
Keeping false positives low allows a company to carry on its routine work
without unnecessary disruptions or stoppages.

LITERATURE REVIEW
Title (Study): Decision Tree Techniques Applied on NSL-KDD Data and Its
Comparison with Various Feature Selection Techniques
Authors: Hota and Shrivas
Year of publication: 2014
Methodology: Proposed a model that used feature selection methods to eliminate
the unimportant features in the dataset and developed a classifier based on
different decision tree techniques such as ID3, CART, REP Tree and C4.5.

Title (Study): Intrusion detection system by improved preprocessing methods and
Naive Bayes classifier
Authors: Deshmukh
Year of publication: 2014
Methodology: Developed an IDS using a Naive Bayes classifier with various
preprocessing methods. The authors used the NSL-KDD dataset and WEKA for their
experimental analysis. They compared their results with other classification
algorithms such as NB Tree and AD Tree.

Title (Study): Accuracy of Machine Learning Algorithms in Detecting DoS Attacks
Types
Authors: Noureldien and Yousif
Year of publication: 2016
Methodology: Examined the performance of seven supervised machine learning
algorithms in detecting DoS attacks using the NSL-KDD dataset. They used 10-fold
cross-validation to test and evaluate the methods, to confirm that the methods
would perform well on unseen data. Their results showed that Random Committee
was the best algorithm for detecting the smurf attack, with an accuracy of
98.6161%.

Title (Study): Intelligent network intrusion detection using alternating
decision trees
Authors: Jabbar and Samreen
Year of publication: 2016
Methodology: Introduced a novel approach to intrusion detection that uses
alternating decision trees (ADT) to classify the different types of attacks,
although ADT is normally used for binary classification problems. The results
showed that their proposed model produced a higher detection rate and reduced
the false alarm rate in the classification of IDS attacks.

Title (Study): Analysis of data preprocessing influence on intrusion detection
Authors: Paulauskas and Auskalnis
Year of publication: 2017
Methodology: Analysed the effect of initial data preprocessing on attack
detection accuracy using an ensemble model, which relies on combining several
weaker learners to make a stronger learner, built from four different
classifiers: J48, C5.0, Naïve Bayes and PART.

Title (Study): An effective intrusion detection framework based on SVM with
feature augmentation
Authors: Wang
Year of publication: 2017
Methodology: Proposed an SVM-based intrusion detection technique that
preprocesses the data by transforming the original features with the logarithms
of the marginal density ratios, which exploits the classification information
contained in each feature. This resulted in high-quality, concise data, which in
turn achieved better detection performance and reduced the training time
required for the SVM detection model.

Title (Study): A Deep Learning Approach for Intrusion Detection Using Recurrent
Neural Networks
Authors: Yin, et al
Year of publication: 2017
Methodology: Explored how to model an IDS based on a deep learning approach
using recurrent neural networks (RNN-IDS), because of their potential for
extracting better representations of the data and building better models. They
preprocessed the dataset using a numericalization technique because the input to
RNN-IDS must be a numeric matrix. The results showed that RNN-IDS has a high
accuracy and detection rate with a low false positive rate compared with
traditional classification methods.

Title (Study): Intrusion detection model using fusion of chi-square feature
selection and multi class SVM
Authors: Ikram and Cherukuri
Year of publication: 2017
Methodology: Proposed an intrusion detection model using chi-square feature
selection and a multi-class support vector machine (SVM). The main idea behind
this model is to build a multi-class SVM, which had not been adopted for IDS
before, to reduce the training and testing time and increase the individual
classification accuracy for the network attacks.

TECHNICAL SPECIFICATIONS
In this project, an approach to intrusion detection/prevention systems based on
a multi-model decision tree classifier is proposed.

Language used: Python3

Technologies:

Jupyter notebook: An open document format that records a user's session and
lets the user execute code block by block.

The Python libraries and packages used in the project are:

Numpy: A Python library with many built-in functions for working on single- and
multi-dimensional arrays.

Onehotencoding: It makes the representation of categorical data more expressive
by turning each category into its own binary (0/1) column instead of a single
label.

Label Encoding: It converts labels into numeric form so that they become
machine-readable.

Pandas: A Python library used for analysing data in dataframes.

Sklearn: It provides many tools for statistical modelling, including
classification, regression and clustering.

Recursive Feature Elimination: A feature selection method that repeatedly
removes the weakest feature until a specified number of features is reached to
fit the model.

Attacks against which the IDS will provide protection:
1) U2R : A user-to-root (U2R) attack means unauthorized access to local root
privileges. The attacker first accesses a local machine as a normal user and
then exploits some vulnerability in the victim's system to illegally obtain
root privileges.

2) R2L : A remote-to-local (R2L) attack is launched by an attacker to gain
unauthorized access to a victim machine in the network. Here a remote attacker
who has no account on the victim's device gains local access to it. It is
similar to the U2R attack in that the attacker ends up with access that was
never granted.

3) DoS : A denial-of-service (DoS) attack is a cyber-attack in which the
attacker seeks to make a machine or network resource unavailable to its
intended users by disrupting the services of a host connected to the Internet,
for example by sending millions of packets to the server responsible for the
service and thus overwhelming or crashing it.

4) Probing : This is also a cyber-attack, in which the attacker scans the
victim's systems to gather information about them, such as open ports and
running services, typically in preparation for a later attack.

Many similar attacks have been grouped into these four categories so that the
project can detect a larger number of attacks.

METHODOLOGY

The primary objective is to design a system that detects intrusions within the
network using a reduced number of the features in the dataset. Based on
previously published papers, we know that only a subset of the features in the
dataset is relevant to the intrusion detection system. We need to reduce the
dimensionality of the dataset to build an improved classifier in a reasonable
amount of time. The methodology has a total of 4 stages. In the first stage, we
select the significant features for each class using feature selection. In the
second, we combine the different feature sets so that the final set of features
is optimal and relevant for each attack class. The third stage builds a
classifier: the optimal features found in the previous stage are passed as input
to the classifier. In the last stage, we test the model using a test dataset.

MODULES USED

Feature selection

Here we use Information Gain (IG) to select the relevant features from the
dataset. It is calculated for each class independently. The features are ranked
by their information gain, and any feature whose value is below a threshold is
eliminated. We partition the training dataset into 4 datasets, so that each
dataset consists of the records belonging to one attack class along with a
portion of the records of the original dataset. The dataset for each attack
class is then passed separately as input to the method that computes the
information gain for that attack class.
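As an illustration of the threshold-based selection described above, the following is a minimal sketch of information gain for discrete features. The function names, the toy data and the threshold of 0.5 are assumptions made for the example and are not taken from the project; continuous features would first need to be binned.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


def select_features(columns, labels, threshold):
    """Keep the features whose information gain reaches the (hypothetical) threshold."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    return [name for name, gain in gains.items() if gain >= threshold]


# Toy data: `flag` separates the two classes perfectly, `noise` carries no information.
labels = ["dos", "dos", "normal", "normal"]
columns = {
    "flag": ["S0", "S0", "SF", "SF"],
    "noise": ["a", "b", "a", "b"],
}
selected = select_features(columns, labels, threshold=0.5)
print(selected)
```

In this toy case only `flag` survives the threshold, because splitting on `noise` leaves the class mix unchanged.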

Obtaining the best features

Here we use the recursive feature elimination technique to eliminate the
lower-ranked features, so that only the more relevant, higher-ranked features
remain. This decreases the computation and increases the accuracy.
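A minimal sketch of recursive feature elimination with scikit-learn is given below. The synthetic data from `make_classification`, the 20 input columns and the target of 5 features are stand-ins chosen for the example; the project itself reports selecting 13 features from 122.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded feature matrix (assumption: not the real dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# RFE repeatedly fits the estimator and drops the least important feature
# until `n_features_to_select` features remain.
selector = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=5,
    step=1,
)
selector.fit(X, y)

# `support_` is a boolean mask over the original columns.
selected_indices = [i for i, keep in enumerate(selector.support_) if keep]
X_reduced = selector.transform(X)
print(selected_indices, X_reduced.shape)
```

Using a decision tree as the ranking estimator keeps the selection consistent with the classifier trained afterwards, although any estimator exposing feature importances could be substituted.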

Developing a classifier

A supervised machine learning model is used here to classify labelled data into
a particular class. The classification algorithm used in our project is the
decision tree classifier. A decision tree is an algorithm that makes a decision
at each node of the tree and is widely used for regression and classification.
We have chosen decision trees because they can be trained very easily. In most
cases they are also more efficient than many other classification algorithms in
ML, such as K-Nearest Neighbours.
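One possible reading of the "multi-model" idea is to fit one binary decision tree per attack class (normal versus that attack). The sketch below follows that reading as an assumption; the helper name `train_per_attack_models`, the numeric label codes and the random toy data are all hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: label 0 = normal, 1 = DoS, 2 = Probe; two numeric features.
labels = rng.integers(0, 3, size=300)
features = rng.normal(size=(300, 2)) + labels[:, None] * 3.0  # shift classes apart


def train_per_attack_models(features, labels, attack_ids):
    """Fit one binary decision tree per attack class (normal vs. that attack)."""
    models = {}
    for attack in attack_ids:
        mask = (labels == 0) | (labels == attack)
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(features[mask], labels[mask])
        models[attack] = tree
    return models


models = train_per_attack_models(features, labels, attack_ids=[1, 2])

# Accuracy on the data the DoS model was trained on; a real evaluation
# would use a held-out test set instead.
dos_mask = labels <= 1
dos_accuracy = models[1].score(features[dos_mask], labels[dos_mask])
print(dos_accuracy)
```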

ARCHITECTURE DIAGRAM

Implementation

Dataset Description

All experiments are carried out on the NSL-KDD dataset. NSL-KDD is a refined
version of the KDD'99 dataset that overcomes some inherent problems of the
original KDD dataset. Redundant records in the training set have been removed
so that the classifiers produce unbiased results.

The columns that we use from the dataset are as follows:
This is how the datasets look for training and testing:

The table dimensions are:

We calculated the value_counts of the label column of the dataset, which
categorizes each record as normal or as one of the various attack types. The
label distribution that we get is:

Training dataset:

Testing dataset

Here we separated the categorical columns, which are protocol_type, service,
flag and label, and counted the unique values of each for the training and
testing datasets.

Training dataset

The testing dataset has 6 fewer categories in the service feature than the
training dataset.

We then imported LabelEncoder and OneHotEncoder to transform the categorical
values into binary values. Here we took the 3 categorical feature columns,
formed a new data frame from them, and displayed it.

Next, we gave a unique name to each unique category of the 3 feature columns so
that each one is easy to identify.

We then label encoded the 3 columns by numbering their unique categories with
LabelEncoder, which transforms each value in a column into its corresponding
number, for both the training and test datasets.
After that we performed one-hot encoding on the label-encoded dataset to
transform it into binary form. One-hot encoding converts each category value
into a new column and assigns it a 1 or 0.
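The two encoding steps can be sketched on a single toy column as follows. The four sample values and the `protocol_type_` column-name prefix are assumptions for illustration; the project applies the same idea to protocol_type, service and flag.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy values for one categorical column; not the real dataset.
protocol_type = np.array(["tcp", "udp", "icmp", "tcp"])

# Step 1: label encoding maps each category to an integer (classes are sorted).
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(protocol_type)

# Step 2: one-hot encoding turns each integer code into its own 0/1 column.
# The encoder returns a sparse matrix by default, so convert it to a dense array.
onehot_encoder = OneHotEncoder()
onehot = onehot_encoder.fit_transform(codes.reshape(-1, 1)).toarray()

# Readable column names such as "protocol_type_icmp".
column_names = ["protocol_type_" + name for name in label_encoder.classes_]
print(codes, column_names)
print(onehot)
```

Each row of `onehot` contains exactly one 1, in the column of that row's category.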

The service column has a number of categories in the training dataset that do
not appear in the test dataset. Here we have fetched them. These are the 6
categories which are missing:
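When the test set lacks some categories, its one-hot columns no longer line up with the training columns. A minimal sketch of finding the missing columns and adding them as all-zero indicators is shown below, assuming pandas data frames; the column names in the toy frames are examples only.

```python
import pandas as pd

# Toy one-hot frames; the service names only illustrate the naming scheme.
train_encoded = pd.DataFrame(
    {"service_http": [1, 0], "service_aol": [0, 1], "service_smtp": [0, 0]}
)
test_encoded = pd.DataFrame({"service_http": [1, 1], "service_smtp": [0, 0]})

# Columns present in the training frame but absent from the test frame.
missing = [col for col in train_encoded.columns if col not in test_encoded.columns]

# Add the missing columns filled with zeros and match the training column order.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(missing)
print(test_aligned.shape)
```

After alignment both frames have the same columns in the same order, so a model trained on one can be applied to the other.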

After removing the categorical columns and adding the new data frame of
binary-encoded data back to the original dataframe, we get the dimensions as:

Renaming every attack label: 0=Normal, 1=DoS, 2=Probe, 3=R2L and 4=U2R, so that
similar kinds of attacks are grouped under the same category. The attack types
are labelled as follows:
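The relabelling can be sketched as a lookup table. The category codes follow the text; the mapping from raw labels to categories is a partial, illustrative subset of well-known NSL-KDD attack names, and the names `CATEGORY_CODES`, `ATTACK_CATEGORY` and `encode_label` are hypothetical.

```python
# Category codes as stated in the text: 0=Normal, 1=DoS, 2=Probe, 3=R2L, 4=U2R.
CATEGORY_CODES = {"normal": 0, "dos": 1, "probe": 2, "r2l": 3, "u2r": 4}

# Partial, illustrative mapping from raw labels to categories
# (the full dataset contains more attack labels than listed here).
ATTACK_CATEGORY = {
    "normal": "normal",
    "neptune": "dos", "smurf": "dos", "back": "dos", "teardrop": "dos",
    "ipsweep": "probe", "nmap": "probe", "portsweep": "probe", "satan": "probe",
    "guess_passwd": "r2l", "ftp_write": "r2l", "warezclient": "r2l",
    "buffer_overflow": "u2r", "rootkit": "u2r", "loadmodule": "u2r",
}


def encode_label(raw_label):
    """Return the numeric category code for a raw attack label."""
    return CATEGORY_CODES[ATTACK_CATEGORY[raw_label]]


encoded = [encode_label(label) for label in ["normal", "neptune", "nmap", "guess_passwd", "rootkit"]]
print(encoded)
```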

After that, feature selection is done using RFE to select the 13 best features
from the 122 features.

The selected features are:

We then built the model using a decision tree classifier and obtained the
following results:

Evaluation And Result:

A confusion matrix is calculated for each attack class. The matrices show that
the number of false positives is very low and the number of true positives is
very high, i.e., our system detects most of the attack attempts.

[Confusion matrices for the DoS, Probe, R2L and U2R attack classes]
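For reference, the four cells of a binary confusion matrix (attack versus normal) can be counted as in the sketch below. The function name and the example predictions are assumptions for illustration and do not reproduce the project's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # attack correctly flagged
        elif predicted == positive:
            fp += 1  # normal traffic flagged as an attack (false alarm)
        elif actual == positive:
            fn += 1  # attack that was missed
        else:
            tn += 1  # normal traffic correctly passed
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


# Made-up labels: 1 = attack, 0 = normal.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```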

We then calculated the accuracy, precision, recall and F-score for each attack
class.

[Accuracy, precision, recall and F-score for the DoS, Probe, R2L and U2R attack classes]
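The four evaluation measures follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name and the counts passed to it are made up for the example and are not the project's figures.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # flagged records that were attacks
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # attacks that were flagged
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f_score": f_score}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(metrics)
```

Precision is the measure most directly tied to the false-positive concern raised in the introduction, since every false alarm lowers it.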

Conclusion
We have built an intrusion detection system that uses a decision tree
classifier to detect attacks and classify them by type. The data was
preprocessed to decrease the computation time and to analyse only the relevant
features. Network parameters such as srv_count have been taken into account,
which helps our IDS detect attacks more easily and effectively. The confusion
matrices show a very low number of false positives and a very high number of
true positives. Evaluation measures such as accuracy and precision have also
been calculated, and the accuracy was found to be around 99%, which is better
than that of many other existing IDSs.

References
• H. S. Hota and A. K. Shrivas, “Decision Tree Techniques Applied on
NSL-KDD Data and Its Comparison with Various Feature Selection
Techniques,” in Advanced Computing, Networking and Informatics,
Volume 1: Advanced Computing and Informatics, Proceedings of the
Second International Conference on Advanced Computing, Networking and
Informatics (ICACNI-2014), 2014, pp. 205–211.

• D. H. Deshmukh, T. Ghorpade, and P. Padiya, “Intrusion detection system
by improved preprocessing methods and Naïve Bayes classifier using
NSL-KDD 99 Dataset,” in 2014 International Conference on Electronics
and Communication Systems (ICECS), 2014, pp. 1–7.

• N. A. Noureldien and I. M. Yousif, “Accuracy of Machine Learning
Algorithms in Detecting DoS Attacks Types,” Sci. Technol., vol. 6, no. 4,
pp. 89–92, 2016.

• M. A. Jabbar and S. Samreen, “Intelligent network intrusion detection
using alternating decision trees,” in 2016 International Conference on
Circuits, Controls, Communications and Computing (I4C), 2016, pp. 1–6.

• N. Paulauskas and J. Auskalnis, “Analysis of data pre-processing
influence on intrusion detection using NSL-KDD dataset,” in 2017 Open
Conference of Electrical, Electronic and Information Sciences (eStream),
2017, pp. 1–5.

• H. Wang, J. Gu, and S. Wang, “An effective intrusion detection framework
based on SVM with feature augmentation,” Knowledge-Based Syst., vol.
136, no. Supplement C, pp. 130–139, 2017.
