
CHAPTER 1

INTRODUCTION
1.1. OVERVIEW
Cloud computing technologies are evolving day by day with the incorporation of new business and operating models, and cloud computing has become one of the most powerful and innovative technologies around the globe. Cloud computing is a large-scale distributed computing paradigm in which highly scalable, internet-based computing resources are provided as services and offered to users on demand. One of its main features and benefits is the sharing of network resources, whereby multiple clients share the same hardware using logical isolation technologies. While cloud computing offers enormous advantages such as scalability, rapid elasticity and measured services, the most important of these is the potential for cost savings to enterprises through the use of on-demand services. Securing the resources in the cloud and the data communication between them is a most challenging task, because the number of intrusions is increasing year by year.
In recent years, statistical reports have revealed a tremendous number of security breaches occurring in the virtual network layer of cloud computing. A huge upsurge in network traffic has paved the way for security breaches that are more complicated and widespread in nature. Tackling these attacks with traditional network-based intrusion detection systems (IDS) has become inefficient. In today's scenario, a variety of IDS tools and techniques are available on cloud platforms for the discovery of attacks on cloud infrastructure, but traditional intrusion detection systems are turning out to be ineffective and inefficient due to heavy traffic and its dynamic behaviour. An intrusion detection system usually complements a firewall to form an effective security solution. One of the main challenges in securing cloud networks lies in the appropriate design and use of an intrusion detection system that can monitor network traffic and effectively identify network intrusions. This research focuses on security threats in the form of network security attacks on public clouds and proposes solutions and suggestions.
The literature reveals a large number of defences against network attacks. Despite all the efforts made by researchers over the last two decades, the network security problem is not completely solved. One reason for this is the rapid growth in computational power and resources available to attackers, which enables them to launch complex attacks (Wu et al., 2010). An intrusion detection system (IDS) is the most commonly used technique for the discovery of attacks on cloud infrastructure.
In today's scenario, various IDS tools and techniques are available on cloud platforms as the typical means of discovering attacks on cloud infrastructure. Traditional intrusion detection systems are ineffective for deployment in cloud environments because of their virtualized and distributed nature. To overcome the challenges faced by traditional intrusion detection systems, this research work investigates existing intrusion detection systems and proposes an efficient ensemble-based feature selection and classification technique for a network intrusion detection system in cloud computing, applying supervised learning for the accurate classification of known attacks in the cloud environment. The ensemble model is built with the aim of providing the highest accuracy and the lowest false alarm rate (FAR). A Honeypot and Honeynet are deployed to capture real attack traffic and attacker activity in a cloud environment.

1.2 CLOUD COMPUTING


Cloud computing offers on-demand resources that are reliable and scalable to individual and organizational users. The cloud has become a tempting target for intruders because its predominant technologies, such as virtualization and multitenancy, enable multiple users to employ the same physical resource. Network attacks target cloud users and providers in order to make the cloud inaccessible to both [1]. Cloud computing is a kind of large distributed computing system in which services are offered to users according to their demand.

Cloud computing uses a virtualized platform containing elastic resources such as software, hardware, services and data sets, allowing dynamic provisioning of a pool of virtual resources to cloud users. The notion of cloud computing involves a shift from traditional desktop computing towards service-oriented computing using large clusters and data centres. Cloud computing provides convenience and low cost to customers using the concept of virtualization. Cloud computing, also known as internet-based computing, has emerged from the concepts of distributed computing, utility computing and grid computing. It integrates techniques such as virtualization, multitenancy and service-oriented architectures for delivering infrastructure, platform and software as a service rather than a product. Cloud computing technology works on three service (SPI: Software, Platform, Infrastructure) models and four deployment models (public, private, hybrid and community). Consumers can use the services of the cloud as per their usage or requirements and choose how the cloud is deployed. Figure 1.1 depicts the cloud computing scenarios.
Figure 1.1 Cloud Computing Scenarios

The definition of cloud computing as stated by NIST, “Cloud computing is a model for
enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable
computing resources (e.g., networks, servers, storage, applications, and services) that can be
rapidly provisioned and released with minimal management effort or service provider
interaction” [2]. A cloud can be categorized as a private, public or hybrid cloud according to its deployment model.
Another interesting definition of cloud computing is, “Cloud computing refers to both
the applications delivered as services over the internet and the hardware and system software in
the data centres that provide those services.” There are many big vendors who provide cloud
resources and services with different pricing. The leading cloud vendors are Google, Microsoft,
Amazon, IBM, AT&T, and Salesforce.com.

1.2.1. FEATURES OF CLOUD COMPUTING


The main features of Cloud computing are listed as follows:
On demand, self-provisioning
Cloud Service Providers (CSPs) offer off-the-shelf cloud services to customers [2]. Cloud users can choose their required services from a list of offered services. Customers are charged only for the services they use and the amount of time for which they use them. Clients require internet connectivity to access any cloud service. A graphical dashboard is provided by CSPs for easy management of services, e.g. the Amazon AWS portal, the Microsoft Azure portal, etc.
Elasticity and scalability
The workload in the cloud environment is highly dynamic. As per the definition of cloud,
scalability is an important and desired feature. Cloud users feel that an infinite amount of
resources is available. These users can rapidly add or remove resources as per their
requirements. Applications hosted under a cloud environment can easily meet resource
requirements according to the changes in the workload.
Higher availability via Network access
Availability indicates the amount of time a particular service or resource is available for use
[3]. Cloud computing resources are offered as on-demand services over the internet. This
provides the user the flexibility to access services from anywhere and at any time. Service
providers take care of failure and recovery to achieve higher availability.
Application programming interface
CSPs allow users to make modifications to services as per their requirements. This can be done through the APIs provided by the service providers. Using this kind of interface, users can customize policies as per their requirements.
Pay per use model
The pay-per-use model is a very important feature of cloud computing: users have to pay only for the services they use and only for the amount of time they use those services. All CSPs provide their billing-related information. Public CSPs give several choices regarding their service offerings; Amazon, for example, provides EC2 services with different options such as on-demand, reservation-based or spot-based plans.
Performance monitoring and measuring
CSPs provide an interface through which users can view or monitor the performance of their resources or services. Monitoring helps cloud service users to optimize resource usage and also helps them in making scaling-related decisions.

1.2.2 SERVICE MODELS


There are three basic service delivery models in the cloud, which deliver different types of services for different types of cloud tenants. On the basis of the services offered, they are classified as follows [3]:
Infrastructure-as-a-Service (IaaS): The role of the IaaS provider lies in the ownership and management of all IT infrastructure. IaaS providers maintain a large pool of hardware resources in the form of server machines, storage devices, networking elements, etc. in the data centre. IaaS internally uses the virtualization concept: in the cloud environment, a software tool/framework called the Virtual Infrastructure Manager (VIM) is used for the management of a large amount of physical and virtual resources [4]. In this service delivery model, cloud providers provide hardware resources, databases, networking and computational services. This service model is based on virtualization technology, i.e. deploying special software, the “hypervisor”, on top of the infrastructure, which enables the creation of different virtual machines that share the same physical server. IaaS providers give users access to the physical infrastructure (via the hypervisor) in the form of Virtual Machines (VMs) [4]. The VMs are offered with their own operating system, or users may install an operating system of their own choice. Later, various runtime environments and applications are installed on the VMs. Sometimes, pre-configured, ready-to-use appliances are also offered by service providers in the form of VM images. Major IaaS offerings include Amazon EC2, Google Compute Engine, HP, etc. Some of the benefits of IaaS are:
• Reduction in initial expenditure on purchase of IT infrastructure.
• The total cost of ownership of IT infrastructure related resources is reduced.
• The maintenance of hardware resources is done by a service provider.
• Application scaling is faster due to the auto scaling mechanisms.
Platform-as-a-Service (PaaS): In this service delivery model, service providers provide programming languages, development environments, operating systems and other services that allow users to implement, host and administer their own applications [5]. Users do not need a separate local machine to run their applications. The PaaS model provides software environments with which a developer can build personalized solutions. PaaS also provides access to an integrated development environment (IDE) in which applications can be created, built and tested [1,5]. This middleware can also be installed in a private cloud environment. One of the most widely used PaaS offerings is the Google App Engine (GAE); developers may access the PaaS service of GAE by installing its API in their IDE. Another well-known example of PaaS is Force.com. Sometimes, an application may use specific facilities provided by the PaaS provider, which may result in vendor lock-in. PaaS solutions help reduce the cost involved in the development, deployment and management of applications. In some cases, IDEs are provided within the internet browser. PaaS also provides the facility to automatically deploy applications on cloud infrastructure. Leading offerings such as Google App Engine and Microsoft Windows Azure are the most widespread PaaS platforms.

Software-as-a-Service (SaaS): In the SaaS delivery model, service providers provide applications to customers as on-demand services, without installing the applications on the user's own systems [6]. This model may be deployed on top of PaaS or IaaS, or hosted directly on cloud infrastructure. SaaS-based software is developed using cloud services such as infrastructure and platform. Once a SaaS application is developed, it is given to the end user for access in the form of services. SaaS offers the complete infrastructure, software and solution as a service. Generally, an SOA (Service Oriented Architecture) based approach is used in the development of SaaS applications. Salesforce.com CRM is an example of a popular SaaS platform.

These service delivery models are based on existing technologies, including web applications, service-oriented architecture and virtualization, which already suffer from security problems. Of these service delivery models, SaaS is the weakest, and most attacks target breaches of its security [5]. SaaS-based solutions sometimes replace local software, e.g. Google Docs, Calendar, Zoho Office Suite, etc. Some SaaS-based solutions, such as Salesforce.com, also provide an API for developers through which some customization of the software may be done. The main features of SaaS applications are listed below:
• The software is available on demand and over the Internet through a browser.
• Reduction in software distribution and maintenance cost.
• Provision of features like automated upgrade, update and patch management.
• Software licensing is subscription based or usage based.

1.2.3 CLOUD DEPLOYMENT MODELS


Cloud computing has four different deployment models. Figure 1.2 depicts the different types of cloud deployment models:

Figure 1.2 Cloud Deployment Models

Private Cloud: A private cloud deployment model is dedicated to a particular enterprise or group of enterprises [6]. It is usually deployed within the enterprise network perimeter, and the enterprise IT department has full control of the cloud platform. A private cloud may be outsourced (hosted on a third party's premises) or in-house (on-site). An on-site/in-house private cloud can be accessed only within the organization and blocks access by outside users. An outsourced private cloud is accessible only by authorized users and is located on a third party's premises [7]. The main objective of this model is to remove customers' concerns about the security of their cloud-hosted assets, at the expense of increased IT infrastructure costs. It allows access only to authorized users. This model is suitable for organizations whose resource requirements are dynamic, critical, secured and customized. The private cloud model helps an organization make better use of its hardware resources. The main disadvantages of this model are poor scalability and higher initial cost [8]. Several software frameworks are available for creating a private cloud, such as OpenNebula, OpenStack, Azure and Eucalyptus.
Public Cloud: A public cloud is open to everyone; any user can access cloud resources from anywhere over the internet. Cloud Service Providers (CSPs) manage and own large amounts of physical infrastructure in their data centres [1,5]. These resources are publicly shared among users across the world as per their demand, and users may access them in the form of services. Public clouds can be composed of geographically distributed data centres [6]. Some fundamental requirements in a public cloud are proper isolation among users, guaranteed quality of service, and management/monitoring mechanisms. Examples of public CSPs are Amazon, Microsoft, Google, HP, etc. [7] The public cloud model has advantages such as low capital cost, flexibility, higher scalability and location independence. The possible disadvantages of this model are lower security and less customization.

Community cloud: A community cloud is infrastructure shared by a specific community; like the private cloud, it may be outsourced or on-site. It is a category of computing infrastructure that is mutually shared among several organizations having common computing requirements such as regulatory compliance, SLAs, security, jurisdiction and privacy. This model can be hosted and managed internally or externally by a third party, e.g. by banks or trading organizations.

Hybrid Cloud: A deployment that combines the private and public cloud deployment models is known as a hybrid cloud. It is an integration of two or more distinct clouds (private, public or community) that are bound together, combining the benefits of multiple clouds. The capacity of this cloud service can be extended by integration, aggregation and customization of cloud services.
Among these deployment models, the public cloud is the most vulnerable because of its open and distributed nature [11]. Many users join the cloud and access cloud resources, and some of those users or services may be malicious. In the hybrid
deployment model, cloud infrastructure is composed of different types of resources (private and
public). The private cloud is created using resources owned by organizations. However, in
some situations, these resources may not be enough for handling the workload. In this case,
private resources are combined along with resources provided by public CSPs. The cloud
formed by the combination of private and public infrastructure is called the hybrid cloud. The
hybrid cloud provides scalability with the use of services provided by a public cloud. This
model has benefits like a larger flexibility and scalability compared to the private cloud.
However, this model requires more complex network configurations compared to the private
cloud.
Challenges of Cloud Computing Paradigm
Some of the primary challenges of cloud computing are discussed as follows:
a. Network Latency
Cloud computing offers unlimited resources in the form of services over the Internet. This
increases availability, but the problem of network latency may occur. If the Internet speed of
the cloud user is slow, he or she may notice delays while accessing the service.
b. Privacy and Security
In a cloud environment, user data travels over a network and is stored in a system on which the
user has no direct access or control [12]. The protection of information in terms of privacy,
confidentiality, and loss of data must be handled properly by complying with the various
security criteria and protocols. Many researchers are working in this area.
c. Legal Issues
Cloud services are provided over the internet, and users of cloud services may be from any part of the world. In this scenario, if a cloud computing deployment spans several states and countries, it may end up having to comply with multiple regulatory regimes. The acts of most governing agencies place the entire compliance burden on the client. Thus, compliance with local regulations must be handled by cloud users.
d. Lack of Flexibility and Standardization
Sometimes, the services offered by CSP do not provide any option for customization which
reduces flexibility. There are several standards defined regarding cloud services. However, in
some areas standardization is still ongoing.
e. Dynamic Provisioning of Services and Resources
As per the cloud characteristics, services and resources provided by the cloud are highly
scalable and elastic. But the main issue is the dynamic provisioning of cloud resources on the
basis of requirements [13]. Improper provisioning may cause wastage of resources which
incurs more bills to the customer and poor resource utilization.

1.2.4 ATTACKS ON CLOUD


Cloud computing is a most powerful innovation that has caught the fancy of
technologists around the globe. It has enormous advantages that include scalability, rapid
elasticity, measured services, and most important of them, the potential of cost saving for
customers. The security risks emanate from the wide range of the vulnerabilities inherent in any
type of cloud computing system. Security and risk assessment would encompass analysis of the
impact of a variety of threats and attacks on various aspects of cloud computing, including adoption of cloud computing, maintenance of secrecy and privacy of personal data, access
and updating of data [13]. Therefore, identifying the most appropriate solution directives to strengthen security and privacy in the cloud environment has become paramount to all business operations in the cloud. This study explores and analyses the prominent network security attacks on cloud systems.
In order to alleviate these security threats in the cloud, it is important to impose security requirements on the data and services of the cloud. While cloud computing provides enormous advantages such as scalability, rapid elasticity, measured services and multitenancy to individuals and business enterprises through automation, virtual presence and the availability of services, resources and applications, many serious security threats have also emerged in the recent past, including network security, data security for maintaining the secrecy and privacy of personal data, accessing and maintaining data, application security, web security and virtualization security. The vulnerable nature of the cloud computing system has given rise to different security issues. Figure 1.3 depicts the various attacks on the cloud environment.
Figure. 1.3 Attacks on Cloud Computing

Therefore, there is an urgent and strong requirement to identify unique security threats
and vulnerabilities by evaluating cloud networks. In other words, for Cloud Computing
adoption to happen at large scale, an assessment of vulnerabilities and attacks, and
identification of relevant solution directives to strengthen security and privacy in the Cloud
environment is the need of the hour. Because cloud computing technology is a novel approach, solutions to its threats and vulnerabilities lag behind easily executable attacks such as malware, DoS, DDoS, man-in-the-middle, SQL injection, Cross-Site Scripting (XSS) and authentication attacks. Towards this, it becomes imperative to find time-bound
solutions to the threats and exploitation of cloud vulnerabilities. The proposed research study is
motivated by this research gap.

1.3. INTRUSION DETECTION SYSTEMS (IDS)


An Intrusion Detection System (IDS) is a security application that is employed to detect attempts by an attacker to gain unauthorised access to a system or system resource. IDSs are
usually deployed in a region known as the De-Militarised Zone (DMZ), a subnetwork
separating an internal network from untrusted external networks, providing an added layer of
isolation between internal and external systems. Most traditional IDSs work by comparing activities against a defined security policy and either permitting or denying the action on that basis.

Intrusion detection systems can be categorized into two main types based on where detection takes place: network-based intrusion detection systems and host-based intrusion detection systems. In practice, a network intrusion detection system (NIDS) is a device located
at a strategic point in a network for monitoring all network activity. A host-based intrusion
detection system (HIDS) is used for finding suspicious activity in a single computer system. In
practice, it is a software that runs on the operating system. It will take a snapshot of system files
and then compare it to the original version to detect suspicious activities.
Intrusion detection systems can also be divided into two further categories according to how the detection works [13]:

• Signature-based intrusion detection

• Anomaly-based intrusion detection

• Signature-based intrusion detection is a method whose goal is to detect attacks by known patterns. For example, anti-virus systems check the byte sequence in a file to verify whether the file contains a virus. The same method can also be used in network intrusion detection, where the goal of a NIDS is to search for known byte sequences in the network traffic.
• Anomaly-based intrusion detection systems are used to detect non-normal behaviour. Generally, machine learning methods are used for this type of detection: a model is trained to understand the normal behaviour of the traffic, and any deviation from this behaviour is then marked as an anomaly (a minimal sketch of this idea follows below).
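As a concrete illustration (not the detection system developed in this thesis), the following sketch shows the anomaly-detection idea using scikit-learn's IsolationForest: the model is fitted on normal traffic only, and flows that deviate from that learned profile are flagged. The feature names and values are hypothetical.

```python
# Illustrative sketch of anomaly-based detection: an IsolationForest is fitted on
# "normal" traffic features only, and flows that deviate from that profile are flagged.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical flow features, e.g. [duration, bytes_sent, bytes_received]
normal_traffic = rng.normal(loc=[1.0, 500.0, 800.0],
                            scale=[0.2, 50.0, 80.0],
                            size=(1000, 3))

detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(normal_traffic)            # learn the "normal" traffic profile

new_flows = np.array([
    [1.1, 510.0, 790.0],                # looks like normal traffic
    [30.0, 90000.0, 10.0],              # large upload with no response -> suspicious
])
print(detector.predict(new_flows))      # +1 = normal, -1 = anomaly
```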

1.3.1 The Role of Intrusion Detection System


A large number of defence approaches have been proposed in literature to provide different
functions in various environments [12]. An IDS aims to detect intrusions before they seriously damage the network. The term “intrusion” refers to any unauthorized attempt to access the elements of a network with the aim of making the system unreliable. Figure 1.4 depicts the
organization of a generalized IDS. The solid lines show the data/control flow and the dashed
lines indicate the responses to the intrusions.

Figure 1.4 Organization of a generalized intrusion detection system


Intrusion detection systems are generally categorized into signature-based or anomaly
based [14]. Signature-based IDSs use a database of rules or the so-called signatures to classify
network connections, whereas anomaly-based IDSs create a normal user profile and identify
anything that does not match this profile as an attack. In the former, known intrusions can be
detected efficiently with a low false alarm rate. Hence, this approach has been widely used in
most of the commercial systems.

One of the major challenges in signature-based intrusion detection is the detection of new
intrusions [15]. To address this issue, the database of rules requires regular updating through a manual or automatic process. A manual update is usually done by a network administrator by
finding new signatures and adding them to the database of rules, while an automatic process
can be done with the help of supervised learning algorithms. Another way to address this
problem is to switch to an anomaly detection system, which has the capability of detecting new
types of attacks [16]. However, the major problem in anomaly detection systems is discovering
the boundaries between normal and anomalous behaviour. In addition to that, as the normal
traffic pattern is also changing over time, an anomaly detection system needs to adapt itself to
this changing behaviour. Nowadays, due to the exponential growth of technologies and the
increased number of available hacking tools, in both approaches, adaptability should be
considered as a key requirement. To be able to react to network intrusions as fast as possible,
an automatic or at least semi-automatic detection phase is required. This will decrease the
amount of damage to legitimate users because an early detection system supplies more time for
a proper reaction.

1.4 DATA COLLECTION


One important contribution in this research is collecting representative network traffic
from a real cloud environment. Representative traffic should contain different scenarios that
occur in cloud environment. This includes user activities, such as transferring files, browsing
and streaming. The available public datasets for intrusion detection, such as KDD 99, DARPA,
ISCX and CAIDA etc., suffer from limitations, like outdated traffic, simulated traffic that does
not reflect real-world network traffic and lack of representative normal data. In addition to
representative normal data, a network attack benchmark should also contain a variety of attacks
generated by different tools and methods. Building and evaluating the detection models on
representative data provides realistic evaluation results, which can reduce the gap between
building detection mechanisms using machine learning methods and their actual deployment in a real cloud environment.

In this research work, for learning more about attackers, their motivations and techniques,
honeypots have been used. These systems allow monitoring of attacks by pretending to be real
machines with valuable data, such that attackers interact with them. In this work, honeypots
were set up on the AWS public cloud. Over a period of one month, log data was collected for further analysis, resulting in 5,195,499 attacker log entries.
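The sketch below is a hypothetical example of how such honeypot log data might be aggregated for a first analysis. It assumes JSON-lines logs with a src_ip field; the file name and field name are assumptions for illustration, not the actual schema of the logs collected in this work.

```python
# Hypothetical sketch: aggregating honeypot log entries (JSON lines) by source IP.
# The file name and the "src_ip" field are assumptions for illustration only.
import json
from collections import Counter

attacks_per_ip = Counter()
with open("honeypot.json", "r") as log_file:
    for line in log_file:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue                    # skip malformed entries
        attacks_per_ip[event.get("src_ip", "unknown")] += 1

# Top ten attacking IP addresses by number of log entries
for ip, count in attacks_per_ip.most_common(10):
    print(ip, count)
```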

1.4.1 Honeypots

Honeypots have been used for the past two decades as trap points that lure attackers in, so that they can be exposed and traced and their attack trails revealed to a dedicated system that captures their behaviour, techniques and any malicious payloads or exploits. This makes it possible to detect and prevent a zero-day exploit early, or to uncover a weak point in a production system that can be patched before it is approached by an attacker. Honeypots are designed to mimic a real computer, pretending to help the intruder break into the real system. The architecture of a honeypot is designed in such a way that it serves as a point of interaction with the black hat community.

Honeypots can be categorized into two types according to their level of interaction: 1) low-interaction honeypots, which allow only limited interaction for an attacker or a piece of malware; they are not vulnerable themselves and cannot be infected by an exploit attempt, but their rigid behaviour makes them easy to detect and avoid; and 2) high-interaction honeypots, which involve real operating systems and applications. Since nothing is emulated and everything is real, a high-interaction honeypot can provide more details of an attack or an intrusion, thereby helping to identify unknown vulnerabilities; however, this nature exposes the honeypot and increases the risk of an attacker using it to compromise production systems.

1.4.2 Honeynet

A honeynet is a set of several honeypots that is meant to deceive an attacker and capture their techniques for further analysis of the attacking session. Honeypots have evolved over the last two decades, but without keeping pace with continuously advancing adversary techniques. Recently, a research team at Arizona State University proposed a new honeynet architecture called HoneyProxy [18], which addresses several challenges in protecting the network through a centralized reverse proxy module that can multicast malicious traffic to honeypots and select the response that does not contain a fingerprint indication.

1.5 FEATURE SELECTION

Dimensionality Reduction (DR) is a standard technique used to reduce the dimensionality of data with minimal information loss. Various DR techniques are available, and among them feature selection has become one of the most popular. Feature selection is a pre-processing technique for obtaining the relevant features from a huge feature space. It focuses mainly on selecting relevant features using some predefined criterion, thereby increasing the efficiency of data mining algorithms [Shu, 2010].

The task of feature selection improves the classification accuracy and the understandability of the learning process. In the process of learning, a large number of features requires huge memory space and consumes a long processing time. Feature selection therefore reduces the cost of data acquisition and the cost of computation. Reducing the features by selecting the highly relevant ones leads to sounder conclusions; however, feature selection should not discard highly informative features. Feature selection techniques follow three major approaches: filter, wrapper and embedded methods.

In this research work, we use an ensemble of different filter feature selection methods to find the important features for the detection of attacks in intrusion datasets. A filter method utilizes a search criterion independent of any learning algorithm to find the appropriate feature subset before a machine learning algorithm is used. It uses a statistical measure to assess the importance of features and filters out irrelevant ones; the remaining features are then presented as input to the learning algorithm. The most widely used technique in filter methods is feature ranking, in which each feature is measured for its correlation with the class according to some criterion. Table 1.1 shows the comparison of various feature selection methods.

Table 1.1 Relationship of various feature selection methods

Feature selection method | Advantages | Disadvantages
Univariate Filter | Classifier independence, scalability and fast speed | Lack of feature dependencies and classifier interaction
Multivariate Filter | Includes feature dependencies; classifier independence; better computational complexity | Lack of classifier interaction; less scalable and slower
Deterministic Wrapper | Includes classifier interaction; presence of feature dependencies | Classifier dependence; risk of overfitting
Randomized Wrapper | Includes classifier interaction; presence of feature dependencies | Classifier dependence; high risk of overfitting; computationally intensive
Embedded | Better computational complexity; includes classifier interaction | Classifier dependence
Hybrid | More efficient than filter and less expensive than wrapper | Classifier dependence; computationally intensive
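To make the filter idea concrete, the following minimal sketch ranks features with a univariate statistical score (mutual information) using scikit-learn's SelectKBest and keeps only the top-ranked ones. It illustrates filter-style feature ranking in general, on synthetic data, and is not the UEFFS technique proposed in this thesis.

```python
# Minimal sketch of a univariate filter method: features are scored independently of
# any classifier and only the highest-scoring ones are retained.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=40, n_informative=8, random_state=1)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                     # (500, 10): only the 10 top-ranked features remain
print(selector.get_support(indices=True))  # indices of the selected features
```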

1.6 MACHINE LEARNING (ML)

Machine learning is “the field of study that gives computers the ability to learn without
being explicitly programmed” [20]. Machine learning methods may be divided into three categories according to how the learning works: supervised learning, unsupervised learning, and reinforcement learning.

Although there are many other algorithms used in different studies and applications, four algorithms, namely Logistic Regression, Decision Tree, Support Vector Machines and Naïve Bayes, are the main base algorithms used in this research work. Supervised learning methods deal with data given as (X, Y) pairs [21]. The goal of supervised learning is to learn a mapping from X to Y. If the variable Y is continuous (numeric), this type of supervised algorithm is called a regression algorithm; if Y is categorical, such problems are solved with classification algorithms. As an example, “predicting the price of a house” is a regression problem and “detecting whether an incoming e-mail is spam or not” is a classification problem.
Unlike supervised learning methods, unsupervised learning methods [21] do not deal with (X, Y) pairs and do not try to find a function mapping X to Y. The goal of unsupervised learning is to find patterns in the dataset; finding patterns in the purchases of a retail store's customers is an example of unsupervised learning. Another use case for unsupervised learning is clustering, where the goal is to group similar items in a dataset. Grouping similar customers by their recently purchased items is a typical example of clustering, as sketched below.
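For illustration only, the minimal sketch below groups synthetic customers into two clusters with scikit-learn's KMeans; the two features (visits per month, average basket value) and their values are made up for the example.

```python
# Minimal illustration of unsupervised learning: clustering customers described by
# two hypothetical features (visits per month, average basket value).
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [2, 15.0], [3, 18.0], [2, 14.0],          # occasional shoppers, small baskets
    [20, 120.0], [22, 150.0], [19, 110.0],    # frequent shoppers, large baskets
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)                         # cluster assignment for each customer
```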

Reinforcement learning [23] may be seen as a combination of supervised and unsupervised learning. Reinforcement learning algorithms work on reward systems: the algorithm's inputs consist of a task to solve and a reward function. If the algorithm makes a successful move, the reward function gives a positive score; otherwise it gives a negative score. The algorithm learns from these rewards. One field where reinforcement learning is used is robotics; a robot can learn to navigate inside a maze based on the scores produced by the reward function.

Figure 1.5 shows the categories of machine learning algorithms.
Figure 1.5: Categories of Machine Learning Algorithms

1.6.1 MACHINE LEARNING FOR INTRUSION DETECTION SYSTEM


Machine learning is the science of making the systems capable of obtaining and assimilating
the knowledge automatically. Systems that employ machine learning techniques usually have
the ability of self-improvement and exhibit a high degree of effectiveness and efficiency.
Initially, a machine learning system uses a little knowledge of the information system for
analysis, interpretation and examination of the knowledge acquired. Machine learning
techniques enable the systems to analyse the patterns to be categorized based on implicit and
explicit models. An IDS employing machine learning techniques has the ability to transform its execution plan as it acquires new information. The drawback of this feature is that such techniques are resource-expensive in nature; otherwise, they would be desirable in all situations.

1.6.2 TYPES OF SUPERVISED LEARNING ALGORITHMS


Supervised learning offers a broad range of classifiers, each with its own merits and demerits, and selecting an appropriate classifier for a specific problem is as much an art as a science. Some of the supervised learning methods used in this work are explained as follows.
Logistic Regression
Logistic regression is a function that calculates the probability of a discrete outcome based on the given data [21]. It is based on linear regression, and the logit function is applied to the output of the linear regression to find the probability. Linear regression is defined by the following equation:

$w_1 x_1 + w_2 x_2 + \dots = \sum_i w_i x_i$   (1.1)

In Equation (1.1), $x_i$ denotes the given data and $w_i$ denotes the weights. The logit function, formulated in Equation (1.2) where z is the independent variable, converts any real value to a value between 0 and 1; its shape over the range from -10 to 10 is shown in Figure 1.6.

$\mathrm{logit}(z) = \frac{1}{1 + e^{-z}}$   (1.2)
Figure 1.6 Logit Function
The equation for logistic regression is shown in Equation (1.3).

$f(x) = \frac{1}{1 + e^{-\sum_i w_i x_i}}$   (1.3)

where $x_i$ denotes the given data and $w_i$ denotes the weights.
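The following sketch, on synthetic data, shows how the learned weights relate to Equations (1.2) and (1.3) and to the probabilities produced by scikit-learn's LogisticRegression; the data and model here are illustrative only.

```python
# Sketch of the logistic function and a scikit-learn logistic regression classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def logistic(z):
    """Logit/sigmoid function of Equation (1.2): maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Positive-class probability computed manually from the learned weights,
# matching Equation (1.3) up to the intercept term b learned by the model.
z = X @ model.coef_.ravel() + model.intercept_[0]
print(np.allclose(logistic(z), model.predict_proba(X)[:, 1]))  # should print True
```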

Decision Tree
A decision tree is used for estimating discrete-valued functions; the learned function can be represented as a tree diagram or expressed as if-else rules, which makes it a human-readable solution. The decision tree (DT) algorithm is very popular and is used in various areas, including network intrusion detection. Bouzida & Cuppens (2006) highlight that rules can be extracted from decision trees to be used in expert systems. Although decision trees have advantages, they also have drawbacks: Chawla (2003) and Gharibian & Ghorbani (2007) highlight that they are very sensitive to the training data. A good example of decision tree usage can be seen in the results of the KDD Cup '99 competition, whose winner, Pfahringer (2000), used the C5.0 decision tree algorithm combined with ensemble methods. Gyanchandani, Yadav, & Rana (2010) also used the C4.5 decision tree algorithm in ensemble methods to improve the results of C4.5. There are several implementations of the decision tree algorithm, one of which is ID3 [], which starts constructing the tree by finding which attribute should be used as the root node and then continues finding the best attribute to split on at each node. ID3 uses the information gain criterion to decide the attribute. Knowledge of entropy is needed before continuing with information gain. Entropy is used to calculate the impurity of a set of examples and is formulated in Equation (1.4):
$\mathrm{Entropy}(x) = \sum_{i=1}^{n} -p_i \log_2 p_i$   (1.4)

Given the entropy formula, information gain can be explained: information gain is the difference between two entropies, and it is calculated by the formula given in Equation (1.5):

$\mathrm{InformationGain}(x) = \mathrm{InitialEntropy} - \sum_{i=1}^{v} p(x_i)\,\mathrm{Entropy}(x_i)$   (1.5)
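Equations (1.4) and (1.5) can be implemented directly; the short sketch below computes entropy and information gain for a toy labelled set, using made-up labels and attribute values.

```python
# Direct implementation of Equations (1.4) and (1.5): entropy of a labelled set and
# the information gain of splitting it on one attribute.
import math
from collections import Counter

def entropy(labels):
    """Entropy(x) = sum_i -p_i * log2(p_i) over the class proportions p_i."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Initial entropy minus the weighted entropy of each subset induced by the attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lab for lab, val in zip(labels, attribute_values) if val == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy example: the attribute separates the classes perfectly, so the gain equals
# the initial entropy (1 bit).
labels = ["attack", "attack", "normal", "normal"]
attribute = ["tcp", "tcp", "udp", "udp"]
print(entropy(labels), information_gain(labels, attribute))   # 1.0 1.0
```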

Naïve Bayes
The Naïve Bayes algorithm is based on Bayes' theorem, which is used to calculate the posterior probability of a hypothesis from its prior probability and the likelihood, i.e. the probability of observing the data given the hypothesis. Bayes' theorem is shown in Equation (1.6):

$P(Y|X) = \frac{P(X|Y)\,P(Y)}{P(X)}$   (1.6)
where P(Y) denotes the prior probability of the outcome, P(X) denotes the probability of the observed data, P(X|Y) denotes the probability of observing the data when the outcome is Y, and P(Y|X) defines the probability of the outcome given the observed data.

When Bayes' theorem is used to implement a Naïve Bayes classifier, the goal is to find the probability of each possible outcome given X. Since X is constant, the denominator P(X) may be ignored, so the final formula becomes Equation (1.7):

$f(X) = \arg\max_{y \in Y} P(X|y)\,P(y)$   (1.7)

where Y is the set of all possible outcomes.
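As an illustrative sketch on synthetic data, scikit-learn's GaussianNB applies the decision rule of Equation (1.7):

```python
# Sketch of the Naïve Bayes decision rule of Equation (1.7) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

model = GaussianNB().fit(X, y)
# For each instance the classifier returns argmax_y P(X|y)P(y); P(X) is ignored
# because it is constant across the candidate outcomes.
print(model.predict(X[:5]))
print(model.predict_proba(X[:5]))       # normalised posterior probabilities
```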

SVM
SVM is one of the most robust data classification techniques, introduced by Vapnik []. An SVM classification process works with training and testing data consisting of data instances, and it classifies data by means of a set of support vectors that represent the data items. Each instance in the training set includes one target value and several attributes. The aim of SVM is to generate a model that predicts the target value of the data in the testing set. There are several motivations for using SVM: it requires few initial assumptions about the input data, and it can operate on huge datasets by exploiting a nonlinear mapping from the input space into a high-dimensional feature space.

An SVM model represents the instances as points in space, mapped so that examples of separate classes are divided by a clear gap that is as wide as possible, found by maximizing the margin between the two classes. Finding the maximum-margin hyperplane offers the best generalization ability. Thus, the model consists of a linear classification function corresponding to a separating hyperplane f(x) that passes through the middle of the two classes, separating them. More formally, a new data instance xi is classified by simply testing the sign of the function f(xi): xi belongs to the positive class if f(xi) > 0. The search for the maximum-margin hyperplane is carried out through the following Lagrangian, optimized with respect to w and b:

$L_p = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{m} \alpha_i$   (1.8)

where m is the number of training examples, the $\alpha_i$ (i = 1, …, m) are non-negative Lagrange multipliers, $L_p$ is the so-called Lagrangian, and w and b are the weight vector and bias that describe the hyperplane; the solution is obtained by setting the derivatives of $L_p$ with respect to w and b to zero.
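A minimal sketch with scikit-learn's SVC (linear kernel) on synthetic data illustrates how the sign of f(x) determines the predicted class:

```python
# Sketch of a maximum-margin classifier: the sign of the decision function f(x_i)
# determines the predicted class. Data is synthetic and illustrative only.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = SVC(kernel="linear", C=1.0).fit(X, y)

scores = model.decision_function(X[:5])     # f(x_i) for the first five instances
print(scores)
print(model.predict(X[:5]))                 # positive class where f(x_i) > 0
print(len(model.support_vectors_))          # number of support vectors found
```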

1.7 ENSEMBLE METHODS

Ensemble methods are techniques in which more than one model is combined to produce a better model [34]. Ensemble methods usually produce more accurate results than a single model; in many competitions the winner has been an ensemble method. In the popular Netflix competition, the winner used an ensemble method to implement a recommendation system. The use of ensemble methods can also be seen on kaggle.com, a website that hosts machine learning competitions, where in almost every competition the winner is an ensemble method. Some of these methods are explained as follows.
1.7.1 Ensemble Classifiers

Bagging
Bagging is an effective ensemble technique for unstable learning algorithms, in which small changes in the training set result in large changes in predictions (e.g. decision trees, neural networks). In this method, the models are generated by training the same algorithm on random subsets of the original training dataset. These subsets are drawn from the original training dataset with the bootstrap sampling method. Bootstrap sampling is a method in which we draw examples from the dataset with replacement, i.e. a selected example is not removed from the original dataset. Bootstrap Aggregating (Bagging) combines voting with a method for producing the classifiers that cast the votes. Random Forest is a widely known algorithm that uses the bagging method.
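A minimal sketch of bagging with scikit-learn on synthetic data; the dataset and parameters are illustrative only.

```python
# Sketch of bagging: several decision trees are trained on bootstrap samples of the
# same training set and their votes are aggregated.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# BaggingClassifier uses a decision tree as its default base estimator.
bagged_trees = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Random Forest is a widely used bagging-based algorithm.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(bagged_trees.score(X, y), forest.score(X, y))
```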

Boosting
The term “boosting” describes machine learning algorithms that aim to evolve weak models into stronger models. A model is called weak if its error rate is substantial but its performance is better than random, which means an accuracy of more than 50% for binary classification. Boosting is built step by step and trains on the same dataset in each step, but it adjusts the weights of the examples based on the error of the last model's prediction. The main idea of boosting is to force the models to focus on the examples in the dataset that are hard to learn. The boosting method is used to improve the accuracy of the algorithm used; it is a general and effective method for making correct forecasts by combining rough and moderately inaccurate rules of thumb.
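A minimal sketch of boosting with scikit-learn's AdaBoostClassifier on synthetic data; weak learners (decision stumps by default) are added sequentially and example weights are adjusted after each step.

```python
# Sketch of boosting: models are added sequentially and the example weights are
# adjusted so later models focus on the instances misclassified so far.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

booster = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print(booster.score(X, y))
```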

Stacking
Stacking [28] is a general method in which an algorithm is trained to combine the outputs of other models. Stacking consists of two levels: in the first level, machine learning algorithms are trained using the original training dataset, and a new dataset is then generated from the predictions of each model. In the second level, the generated dataset is used as input to a machine learning algorithm that acts as a combiner. The original class labels used in the first level are used as the labels of the new training dataset.
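A minimal sketch of stacking with scikit-learn on synthetic data, with a decision tree and Naïve Bayes as first-level models and logistic regression as the second-level combiner; the choice of models here is illustrative only.

```python
# Sketch of stacking: level-one classifiers are trained on the original data and a
# level-two combiner (logistic regression) is trained on their predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
)
print(stack.fit(X, y).score(X, y))
```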

1.8 RESEARCH OBJECTIVES


The main objective of this research work is to design and develop an efficient intrusion
detection system for cloud environment by applying ensemble based feature selection and
classification techniques. In order to evaluate an IDS, it is necessary to create a dataset to serve
as a ground truth. The system is integrated with the existing Honeynet setup to demonstrate the
approach using data from real Honeypots deployed on the public cloud using Docker. Then the
next key step is to apply the proposed Univariate Ensemble Filter Feature Selection (UEFFS)
technique for the evaluation of a large intrusion dataset to select the relevant features and build
a robust classifier. Finally, supervised learning algorithms are used to accurately classify known attacks in the cloud environment by applying ensemble learning techniques. Ensemble learning is employed to achieve the best possible classification performance and to develop a robust classifier. The network attacks in the data are categorized into four classes, namely denial of service, user to root, remote to local and probe, which is the key motivation for building an efficient intrusion detection system. The proposed method is then evaluated using three intrusion datasets, namely Honeypot, NSL-KDD and Kyoto, and the classification performance is measured using the area under the receiver operating characteristic curve (ROC) metric across various classifiers. The results of the proposed system must meet the requirements of designing novel intrusion detection systems while maintaining high accuracy and low false positive rates.

The focus of most academic studies is on finding unknown attacks, because signature-based solutions already provide a way to find known attacks. These studies use the power of machine learning for this purpose, and classification methods can be used to find the attacks. However, attacks have a complicated nature, so single-classifier methods often do not provide high accuracy in detecting them. As an alternative, ensemble machine learning methods have started to play a role in intrusion detection studies to achieve higher accuracy. The goal of ensemble methods is to combine more than one machine learning model to create a new, stronger model.

The research objectives are listed as follows:

• To design and develop an efficient intrusion detection system using Honeypot and Honeynet for the cloud environment.

• To design and develop an effective ensemble feature selection approach to select a valuable reduced feature set.

• To incorporate an ensemble-based technique to detect and classify known threats using supervised learning.

• To classify the known threats using ensemble classifiers and evaluate the classification performance and detection abilities.

1.9 THESIS OUTLINE


The remainder of this thesis is organized as follows:

Chapter 2 affords a brief review of the related work. It elaborates on the widespread taxonomy of cloud computing and the intrusion detection system field, discusses the theoretical background and offers an account of the state of the art in intrusion detection technologies.

Chapter 3 elaborates on the Honeypots that capture attacker activity in a cloud environment. This is followed by a discussion of the results in the form of intrusion data collected from this setup and used for assessment of the intrusion detection.

Chapter 4 presents the methodology used for feature ranking to assess feature subsets and construct a valuable reduced feature set from a given intrusion dataset, explains how the experiments were conducted and discusses their results.
Chapter 5 presents the methodology used for detecting and classifying known malicious threats using supervised learning and an ensemble-based voting technique.

Chapter 6 discusses the evaluation of an ensemble-based approach for the intrusion detection system using bagging, boosting and stacking ensemble classifiers. To evaluate the proposed model, an evaluation strategy was developed to carry out a series of experiments.

Chapter 7 summarises the research work and provides future directions for further research in this area.

The overall thesis structure is depicted in Figure 1.7


Figure 1.7 Overall Thesis Structure
