Chapter 1
INTRODUCTION
1.1. OVERVIEW
Cloud computing technologies are evolving day by day with the incorporation of new business and operating models, and cloud computing has become one of the most influential innovations in computing worldwide. Cloud computing is a large-scale distributed computing paradigm in which highly scalable, Internet-based computing resources are provided as services to users on demand. One of its main features and benefits is the sharing of network resources, whereby multiple clients share the same hardware through logical isolation technologies. While cloud computing offers enormous advantages such as scalability, rapid elasticity, and measured services, its most important benefit is the cost saving it brings to enterprises through on-demand services. Securing the resources in the cloud and the data communication between them are among the most challenging tasks, because the number of intrusions is increasing year by year.
In recent years, statistical reports have revealed a tremendous number of security breaches occurring in the virtual network layer of cloud computing. A huge upsurge in network traffic has paved the way for security breaches that are more complicated and widespread in nature. Tackling these attacks with traditional network-based intrusion detection systems (IDS) has become inefficient. In today's scenario, varied IDS tools and techniques are available on cloud platforms for the discovery of attacks on cloud infrastructure, but traditional intrusion detection systems are turning out to be ineffective and inefficient due to heavy traffic and its dynamic behaviour. An intrusion detection system usually complements a firewall to form an effective security solution. One of the main challenges in securing cloud networks is the appropriate design and use of an intrusion detection system that can monitor network traffic and effectively identify network intrusions. This research focuses on security threats in the form of network attacks on public clouds and proposes solutions and suggestions. The literature reports a large number of defences against network attacks, yet despite all the efforts made by researchers over the last two decades, the network security problem is not completely solved. One reason for this is the rapid growth in computational power and resources available to attackers, which enables them to launch complex attacks (Wu et al., 2010). An intrusion detection system (IDS) is the most commonly used technique for discovering attacks on cloud infrastructure.
Traditional intrusion detection systems, however, are ineffective for deployment in cloud environments because of the virtualized and distributed nature of the cloud. To overcome the challenges faced by traditional intrusion detection systems, this research work investigates existing intrusion detection systems and proposes efficient ensemble-based feature selection and classification techniques for a network intrusion detection system in cloud computing. The proposed approach applies an efficient feature selection technique together with an ensemble-learning classifier, trained with supervised learning, for accurate classification of known attacks in the cloud environment. The idea of building an ensemble model is to obtain the highest accuracy with the lowest false alarm rate (FAR). A Honeypot and a Honeynet are deployed to capture real attack traffic and attacker activity in a cloud environment.
Cloud computing uses a virtualized platform containing elastic resources such as software, hardware, services, and data sets, and allows dynamic provisioning of a pool of virtual resources to cloud users. The notion of cloud computing marks a shift from traditional desktop computing towards service-oriented computing using large clusters and data centres. Cloud computing provides ease of use and low cost to cloud customers through the concept of virtualization. Cloud computing, also known as Internet-based computing, has emerged from the concepts of distributed computing, utility computing, and grid computing. It integrates techniques such as virtualization, multitenancy, and service-oriented architectures to deliver infrastructure, platform, and software as a service rather than as a product. Cloud computing technology is based on three service models, the SPI (Software, Platform, Infrastructure) models, and four deployment models (public, private, hybrid, and community). Consumers can use the service(s) of the cloud according to their usage or requirements and choose the deployment model accordingly. Figure 1.1 depicts the cloud computing scenarios.
Figure 1.1 Cloud Computing Scenarios
The definition of cloud computing as stated by NIST, “Cloud computing is a model for
enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable
computing resources (e.g., networks, servers, storage, applications, and services) that can be
rapidly provisioned and released with minimal management effort or service provider
interaction” [2]. A cloud can be categorized as a private, public, community, or hybrid cloud as per its deployment model.
Another interesting definition of cloud computing is, “Cloud computing refers to both
the applications delivered as services over the internet and the hardware and system software in
the data centres that provide those services.” There are many big vendors who provide cloud
resources and services with different pricing. The leading cloud vendors are Google, Microsoft,
Amazon, IBM, AT&T, and Salesforce.com.
These service delivery models are based on the existing technologies that include web
applications, service-oriented architecture, and virtualization technologies which already suffer
from security problems. Of these service delivery models, the SaaS model is the weakest, and most attacks target breaches of its security [5]. SaaS-based solutions sometimes become replacements for local software, e.g. Google Docs, Calendar, Zoho Office Suite, etc. Some SaaS solutions, such as Salesforce.com, also provide an API through which developers can customize the software. The main features of SaaS applications are listed below,
• The software is available on demand and over the Internet through a browser.
• Reduction in software distribution and maintenance cost.
• Provision of features like automated upgrade, update and patch management.
• Software licensing is subscription based or usage based.
Private Cloud: A private cloud deployment model is dedicated to a particular enterprise(s) [6].
This is usually deployed within the enterprise network perimeter and the enterprise IT
department has full control of the cloud platform. A private cloud may be outsourced (hosted on a third party's premises) or in-house (on-site). An on-site/in-house private cloud can be accessed only from within the organization and blocks access by outside users. The outsourced
private cloud is accessible only by an authorized user and the cloud is located on a third party’s
premises [7]. The main objective of this model is to remove customers’ concerns about the
security of their cloud-hosted assets at the expense of increasing IT infrastructure costs. It
allows access only to authorized users. This model is suitable for organizations whose resource
requirements are dynamic, critical, secured and customized. The private cloud model helps an
organization in better utilization of their hardware resources. The main disadvantages of this
model are poor scalability and higher initial cost [8]. Several software frameworks are available for building a private cloud, such as OpenNebula, OpenStack, Azure, Eucalyptus, etc.
Public Cloud: A public cloud is open to everyone; any user can access cloud resources from anywhere over the Internet. Cloud Service Providers (CSPs) own and manage large amounts of physical infrastructure in their data centres [1,5]. These resources are publicly shared, in the form of services, among users across the world according to their demand. Public clouds can be composed of geographically distributed data centres [6]. Some fundamental requirements in a public cloud are proper isolation among users, guaranteed quality of service, and management/monitoring mechanisms. Examples of public CSPs are Amazon, Microsoft, Google, HP, etc. [7] The public cloud model has advantages such as low capital cost, flexibility, higher scalability, and location independence. Its possible disadvantages are lower security and less customization.
Community Cloud: A community cloud shares infrastructure among a specific community; it may be outsourced or on-site. It is a category of computing infrastructure that is mutually shared among several organizations with common computing requirements such as regulatory compliance, SLAs, security, jurisdiction, and privacy. This model can be hosted and managed internally or externally by a third party, e.g. for banks or trading organizations.
Hybrid Cloud: A deployment that combines the private and public cloud models is known as a hybrid cloud. It is an integration of two or more distinct clouds (private, public, or community) that are bound together, combining the benefits of multiple clouds. The capacity of a hybrid cloud service can be extended by integration, aggregation, and customization of cloud services.
Among these deployment models, the public cloud is the most vulnerable because of its open and distributed nature [11]. Many users join the cloud and access cloud resources, and some of those users or services may be malicious. In the hybrid deployment model, the cloud infrastructure is composed of different types of resources (private and public). The private cloud is created using resources owned by the organization; however, in some situations these resources may not be enough to handle the workload. In that case, private resources are combined with resources provided by public CSPs. The cloud formed by this combination of private and public infrastructure is called the hybrid cloud. The hybrid cloud provides scalability through the services offered by a public cloud, and this model offers greater flexibility and scalability than the private cloud. However, it requires more complex network configurations than the private cloud.
Challenges of Cloud Computing Paradigm
Some of the primary challenges of cloud computing are discussed as follows:
a. Network Latency
Cloud computing offers unlimited resources in the form of services over the Internet. This
increases availability, but the problem of network latency may occur. If the Internet speed of
the cloud user is slow, he or she may notice delays while accessing the service.
b. Privacy and Security
In a cloud environment, user data travels over a network and is stored in a system on which the
user has no direct access or control [12]. The protection of information in terms of privacy,
confidentiality, and loss of data must be handled properly by complying with the various
security criteria and protocols. Many researchers are working in this area.
c. Legal Issues
Cloud services are provided over the Internet, and users of cloud services may be from any part of the world. If a cloud deployment spans several states and countries, its users may end up having to comply with multiple, differing regulations. The rules of most governing agencies place the entire compliance burden on the client; thus, compliance with local regulations must be handled by cloud users.
d. Lack of Flexibility and Standardization
Sometimes the services offered by a CSP do not provide any option for customization, which reduces flexibility. Several standards have been defined for cloud services; however, in some areas standardization is still ongoing.
e. Dynamic Provisioning of Services and Resources
In line with the cloud's characteristics, the services and resources provided by the cloud are highly scalable and elastic. The main issue, however, is the dynamic provisioning of cloud resources on the basis of actual requirements [13]. Improper provisioning wastes resources, leading to higher bills for the customer and poor resource utilization.
Therefore, there is an urgent and strong requirement to identify unique security threats
and vulnerabilities by evaluating cloud networks. In other words, for Cloud Computing
adoption to happen at large scale, an assessment of vulnerabilities and attacks, and
identification of relevant solution directives to strengthen security and privacy in the Cloud
environment is the need of the hour. Because cloud computing is a relatively new approach, solutions to its threats and vulnerabilities lag behind easily executable attacks such as malware, DoS, DDoS, man-in-the-middle, SQL injection, Cross-Site Scripting (XSS), and authentication attacks. It therefore becomes imperative to find timely solutions to these threats and to the exploitation of cloud vulnerabilities. The proposed research study is motivated by this research gap.
Intrusion detection systems can be categorized into two main types based on where detection takes place: network-based intrusion detection systems and host-based intrusion detection systems. In practice, a network intrusion detection system (NIDS) is a device located
at a strategic point in a network for monitoring all network activity. A host-based intrusion
detection system (HIDS) is used for finding suspicious activity in a single computer system. In
practice, it is a software that runs on the operating system. It will take a snapshot of system files
and then compare it to the original version to detect suspicious activities.
Intrusion detection systems can also be categorized by how the detection works [13]: signature-based (misuse) detection, which matches observed activity against a database of known attack signatures, and anomaly-based detection, which flags deviations from a model of normal behaviour.
One of the major challenges in signature-based intrusion detection is the detection of new intrusions [15]. To address this issue, the database of rules requires regular updates through either a manual or an automatic process. A manual update is usually done by a network administrator, who finds new signatures and adds them to the database of rules, while an automatic update
can be done with the help of supervised learning algorithms. Another way to address this
problem is to switch to an anomaly detection system, which has the capability of detecting new
types of attacks [16]. However, the major problem in anomaly detection systems is discovering
the boundaries between normal and anomalous behaviour. In addition to that, as the normal
traffic pattern is also changing over time, an anomaly detection system needs to adapt itself to
this changing behaviour. Nowadays, due to the exponential growth of technologies and the
increased number of available hacking tools, in both approaches, adaptability should be
considered as a key requirement. To be able to react to network intrusions as fast as possible,
an automatic or at least semi-automatic detection phase is required. This will decrease the
amount of damage to legitimate users because an early detection system supplies more time for
a proper reaction.
In this research work, honeypots have been used to learn more about attackers, their motivations, and their techniques. These systems allow monitoring of attacks by pretending to be real machines with valuable data, so that attackers interact with them. In this work, honeypots were set up on the AWS public cloud. Over a period of one month, log data was collected for further analysis, resulting in over 5,195,499 attacker log entries.
1.4.1 Honeypots
Honeypots have been used for the past two decades as trap points that lure attackers in, exposing and tracing them and revealing their attack attempts to a dedicated system that captures their behaviour and techniques and scans any malicious payloads or exploits. This makes it possible to detect and prevent a zero-day exploit early, or to uncover a weak point in a production system so that it can be patched before an attacker reaches it. Honeypots are designed to mimic a real computer: they pretend to help the intruder break into the real system. The architecture of a honeypot is designed so that it serves as a point of interaction with the black-hat community.
Honeypots can be categorized into two types by their level of interaction. 1) Low-interaction honeypots allow only limited interaction for an attacker or a piece of malware; they are therefore not vulnerable themselves and cannot be infected by an exploit attempt, but their rigid behaviour makes them easy to detect and avoid. 2) High-interaction honeypots involve real operating systems and applications; since nothing is emulated and everything is real, they can provide more details of an attack or intrusion and thereby help identify unknown vulnerabilities. However, this nature exposes the honeypot and increases the risk of an attacker using it to compromise production systems.
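To make the low-interaction idea concrete, the following minimal Python sketch implements a toy low-interaction honeypot: a TCP listener that presents a fixed banner, records connection attempts, and logs the first payload bytes sent by the client. The port number, banner, and log file path are assumptions for illustration only and are not the honeypot software deployed in this research.

# Minimal sketch of a low-interaction honeypot: a TCP listener that logs
# connection attempts and any bytes sent by the client, without executing them.
import socket
import datetime

LISTEN_PORT = 2222          # assumed port; a real deployment would mimic SSH, Telnet, etc.
LOG_FILE = "honeypot.log"   # assumed log location

def log_event(addr, data):
    """Append one attacker interaction to the log file."""
    with open(LOG_FILE, "a") as fh:
        fh.write(f"{datetime.datetime.utcnow().isoformat()} {addr[0]}:{addr[1]} {data!r}\n")

def run_honeypot():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", LISTEN_PORT))
    srv.listen(5)
    while True:
        conn, addr = srv.accept()
        conn.settimeout(5.0)
        conn.sendall(b"login: ")            # fixed banner: the rigid behaviour noted above
        try:
            data = conn.recv(1024)          # capture the attacker's first payload, if any
        except socket.timeout:
            data = b""
        log_event(addr, data)
        conn.close()

if __name__ == "__main__":
    run_honeypot()

Because the responses are fixed, such a listener cannot be compromised through the emulated service, but, as noted above, a careful attacker can fingerprint and avoid it.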
1.4.2 Honeynet
A honeynet is a set of several honeypots meant to deceive attackers and capture their techniques for further analysis of the attack session. Honeypots have evolved over the last two decades, but without fully keeping pace with increasingly advanced adversary techniques. Recently, a research team at Arizona State University proposed a new honeynet architecture called HoneyProxy [18], which addresses several challenges in protecting a network through a centralized reverse-proxy module that can multicast malicious traffic to honeypots and select the response that does not carry a fingerprinting indication.
The task of feature selection improves the classification accuracy and the understandability of the learning process. In the learning process, a large number of features requires huge memory space and long processing times. Feature selection therefore reduces the cost of data acquisition and the cost of computation. Reducing the features by selecting only the highly relevant ones leads to more reliable conclusions; however, feature selection should not sacrifice highly informative features. Feature selection techniques follow three major approaches: filter, wrapper, and embedded methods.
In this research work, we used an ensemble of different filter feature selection methods to find the important features for the detection of attacks in intrusion datasets. A filter method uses a search criterion that is independent of any learning algorithm to find an appropriate feature subset before a machine learning algorithm is applied. It uses a statistical measure to score the importance of features and filter out irrelevant ones; the selected features are then presented as input to the learning algorithm. The most widely used technique in filter methods is feature ranking, in which each feature's correlation with the class is measured according to some criterion. Table 1.1 shows the comparison of various feature selection methods.
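As a minimal sketch of an ensemble of filter-based rankers, the snippet below combines three common filter scores (mutual information, ANOVA F-score, and chi-squared) by averaging their ranks. The choice of scorers, the top-k value, and the input arrays X and y are assumptions for illustration, not the exact method evaluated in this thesis.

# Sketch of an ensemble of filter feature-ranking methods, assuming a numeric
# feature matrix X and class labels y (e.g. from an intrusion dataset).
import numpy as np
from sklearn.feature_selection import mutual_info_classif, f_classif, chi2
from sklearn.preprocessing import MinMaxScaler

def ensemble_filter_ranking(X, y, top_k=10):
    X_pos = MinMaxScaler().fit_transform(X)        # chi2 requires non-negative values
    scores = [
        mutual_info_classif(X, y, random_state=0), # information-theoretic filter
        f_classif(X, y)[0],                        # ANOVA F-score filter
        chi2(X_pos, y)[0],                         # chi-squared filter
    ]
    # Convert each score vector to ranks (0 = best) and average the ranks.
    ranks = [np.argsort(np.argsort(-s)) for s in scores]
    mean_rank = np.mean(ranks, axis=0)
    return np.argsort(mean_rank)[:top_k]           # indices of the top-k features

# Usage (assumed arrays): selected = ensemble_filter_ranking(X_train, y_train, top_k=15)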
Machine learning is “the field of study that gives computers the ability to learn without
being explicitly programmed” [20]. The machine learning methods may be divided into three
categories to explain how the learning works; supervised learning, unsupervised learning, and
reinforcement learning.
Although many other algorithms are used in different studies and applications, four algorithms, namely logistic regression, decision tree, support vector machine, and Naïve Bayes, are the main base algorithms used in this research work. Supervised learning
methods deal with the data given as X, Y pairs [21]. The goal of supervised learning is
mapping from X to Y. If the variable Y is a continuous (numeric) variable, this type of supervised algorithm is called a regression algorithm; if the variable Y is categorical, the problem is solved with classification algorithms. For example, “predicting the price of a house” is a regression problem and “detecting whether an incoming e-mail is spam or not” is a classification problem.
Unlike supervised learning methods, unsupervised learning methods [21] do not deal
with X, Y pair and they do not try to find a function to fit from X to Y. The goal of
unsupervised learning is to find the patterns in the dataset. Finding the patterns in the purchases
of a retail store’s customers is an example for unsupervised learning. Another use case for
unsupervised learning is clustering where the goal is to group similar items in a dataset.
Grouping similar customers by the recently purchased items is a perfect example for clustering.
Figure 1.5 shows the categories of machine learning algorithms.
Figure 1.5: Categories of Machine Learning Algorithms
Logistic Regression
Logistic regression is based on the logit (sigmoid) function given in Equation (1.2):

logit(z) = \frac{1}{1 + e^{-z}}   (1.2)

Figure 1.6 Logit Function
The equation for logistic regression is shown in Equation (1.3).
f(x) = \frac{1}{1 + e^{-\sum_{i} w_i x_i}}   (1.3)

where x_i denotes the given data and w_i denotes the weights.
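As a quick numerical illustration of Equation (1.3), the following minimal Python sketch evaluates the logistic function for a made-up weight vector and feature vector; the values are assumptions for illustration only.

# Minimal numeric sketch of the logistic (sigmoid) function in Equation (1.3).
import numpy as np

def logistic(x, w):
    """f(x) = 1 / (1 + exp(-sum_i w_i * x_i))"""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

w = np.array([0.5, -1.2, 0.3])   # assumed weights
x = np.array([1.0, 0.4, 2.0])    # assumed feature vector
print(logistic(x, w))            # value in (0, 1), interpreted as P(class = 1 | x)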
Decision Tree
A decision tree is used for estimating discrete-valued functions; the learned function can be represented as a tree diagram or expressed as if-else rules, which makes it a human-readable solution. The decision tree (DT) algorithm is very popular and is used in many areas, including network intrusion detection. Bouzida & Cuppens (2006) highlight that rules can be extracted from decision trees for use in expert systems. Although decision trees have advantages, they also have drawbacks: Chawla (2003) and Gharibian & Ghorbani (2007) note that they are very sensitive to the training data. A good example of decision tree usage can be seen in the results of KDD Cup’99, where the winner, Pfahringer (2000), used a C5.0 decision tree combined with ensemble methods. Gyanchandani, Yadav, & Rana (2010) also used the C4.5 decision tree algorithm in ensemble methods to improve the results of C4.5. There are several
implementations of the decision tree algorithm; one of them, ID3 [], starts constructing the tree by finding which attribute should be used as the root node and then continues finding the best attribute to split on at each node. ID3 uses the information gain criterion to decide the attribute, and knowledge of entropy is needed before information gain can be defined. Entropy is used for calculating the impurity of a set of examples and is formulated in Equation (1.4):

Entropy(S) = -\sum_{i=1}^{n} p_i \log_2 p_i   (1.4)

where p_i is the proportion of examples in S that belong to class i.
Given the entropy formula, information gain can be explained. Information gain is the difference between the entropy of the whole set and the weighted sum of the entropies of the subsets produced by splitting on an attribute, as given in Equation (1.5):

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)   (1.5)

where S_v is the subset of S for which attribute A takes the value v.
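The short Python sketch below mirrors Equations (1.4) and (1.5) directly; the toy labels and the hypothetical protocol attribute are illustrative assumptions, not data from this research.

# Sketch of the entropy and information-gain computations in Equations (1.4) and (1.5).
from collections import Counter
import math

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over class proportions p_i."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy example (assumed data): traffic labelled 'normal' or 'attack',
# split by a hypothetical protocol attribute.
labels = ["normal", "normal", "attack", "attack", "normal"]
protocol = ["tcp", "udp", "tcp", "tcp", "udp"]
print(entropy(labels), information_gain(labels, protocol))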
Naïve Bayes
The Naïve Bayes algorithm is based on Bayes' theorem, which is used to calculate the posterior probability of a hypothesis from its prior probability and the likelihood of the observed data. The prior probability is the probability of the hypothesis before any data is observed, while the likelihood is the probability of observing the data given the hypothesis. Bayes' theorem is shown in Equation (1.6):

P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}   (1.6)

where P(Y) denotes the prior probability of the outcome, P(X) denotes the probability of the observed data, P(X|Y) denotes the probability of observing the data X when the outcome is Y, and P(Y|X) denotes the probability of the outcome given the observed data.
When Naïve Bayes is built on Bayes' theorem to implement a classifier, the goal is to find the probability of each possible outcome given X. Since X is constant for all outcomes, the denominator P(X) can be ignored, so the final formula becomes the one shown in Equation (1.7):

f(X) = \arg\max_{y \in Y} P(X|y) P(y)   (1.7)
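A minimal, hedged sketch of Equation (1.7) in practice, using scikit-learn's GaussianNB; the tiny training arrays below are made-up values, not the intrusion data used in this work.

# Naive Bayes classification in the spirit of Equation (1.7).
import numpy as np
from sklearn.naive_bayes import GaussianNB

X_train = np.array([[0.1, 1.2], [0.3, 0.9], [2.1, 0.2], [1.9, 0.4]])  # assumed features
y_train = np.array([0, 0, 1, 1])                                       # assumed labels

model = GaussianNB().fit(X_train, y_train)
print(model.predict([[0.2, 1.0]]))        # arg max_y P(X | y) P(y) for the new sample
print(model.predict_proba([[0.2, 1.0]]))  # the posterior probabilities themselves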
SVM
SVM is one of the most robust data classification techniques, introduced by Vapnik [].
An SVM classification process typically consists of training and testing phases over data instances; it classifies data using a set of support vectors, which are the data items that define the decision boundary. Each instance in the training set contains one target value and several attributes. The aim of SVM is to build a model that predicts the target value of the instances in the testing set. There are several motivations for using SVM: it requires few prior assumptions about the input data, and it performs well on large data sets by exploiting a nonlinear mapping from the input space into a high-dimensional feature space.
L_p = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{m} \alpha_i   (1.8)

where m is the number of training examples, the \alpha_i (i = 1, ..., m) are non-negative Lagrange multipliers, and L_p is the so-called Lagrangian; w denotes the weight vector and b the constant that together describe the separating hyperplane.
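The following hedged scikit-learn sketch shows an SVM classifier with a nonlinear (RBF) kernel in the spirit of the description above; the synthetic dataset and hyperparameters are assumptions for illustration, not the configuration used in this research.

# SVM classification with a nonlinear kernel on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)
print("support vectors per class:", clf.n_support_)  # data items that define the hyperplane
print("test accuracy:", clf.score(X_te, y_te))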
Ensemble methods are techniques in which more than one model is combined to produce a better model [34]. Ensemble methods usually produce more accurate results than a single model, as can be seen in many competitions where the winner used an ensemble method. In the popular Netflix competition, the winner used an ensemble method to implement the recommendation system. The use of ensemble methods can also be seen on kaggle.com, a website that hosts machine learning competitions, where in almost every competition the winner is an ensemble method. Some of these methods are explained as follows.
1.7.1 Ensemble Classifiers
Bagging
Bagging is an effective ensemble technique for unstable learning algorithms, i.e. those for which small changes in the training set result in large changes in predictions, e.g. decision trees and neural networks. In this method, the models are generated by training the same algorithm on random subsets of the original training dataset. These subsets are drawn from the original training dataset with the bootstrap sampling method, in which examples are drawn from the dataset without removing the selected example, so the same example can be drawn again. Bootstrap aggregating (bagging) combines voting with a method for producing the classifiers that cast the votes. Random Forest is a widely known algorithm that uses the bagging method.
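As a sketch of the bagging idea, the snippet below trains a bagged ensemble of decision trees (and, for comparison, a Random Forest) on a synthetic dataset; the data generator and parameter values are assumptions for illustration only.

# Bagging with decision trees as the unstable base learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each tree is trained on a bootstrap sample drawn with replacement.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

print("bagging      :", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())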
Boosting
The term “boosting” describes machine learning algorithms that aim to evolve weak models into stronger ones. A model is called a weak model if its error rate is substantial but its performance is better than random, which for binary classification means an accuracy above 50%. A boosted model is built step by step: each step trains on the same dataset but adjusts the weights of the examples based on the errors of the previous model's predictions. The main idea of boosting is to force the models to focus on the examples in the dataset that are hard to learn. Boosting is used to improve the accuracy of the algorithm; it is a general and effective method for producing accurate predictions by combining rough and moderately inaccurate rules of thumb.
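A hedged sketch of boosting using AdaBoost, whose default weak learner is a one-level decision tree (a "decision stump"); the synthetic data and the number of estimators are illustrative assumptions.

# Boosting: each round re-weights the examples the previous weak learners got wrong.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

booster = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", booster.score(X_te, y_te))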
Stacking
Stacking [28] is a general method in which an algorithm is trained to combine the outputs of other models. Stacking consists of two levels: in the first level, machine learning algorithms are trained on the original training data set, and a new data set is then generated from the predictions of each model. In the second level, the generated dataset is used as the input to a machine learning algorithm that acts as the combiner. The original class labels used in the first level are also used as the labels of the new training dataset.
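The sketch below illustrates two-level stacking with scikit-learn's StackingClassifier, using a decision tree, Naïve Bayes, and an SVM as level-one models and logistic regression as the level-two combiner; the particular base models, data, and parameters are assumptions for illustration, not the exact configuration used in this work.

# Two-level stacking: base-model predictions feed a logistic-regression combiner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

base_models = [("dt", DecisionTreeClassifier(random_state=0)),
               ("nb", GaussianNB()),
               ("svm", SVC(probability=True, random_state=0))]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000))
print("stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())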
The focus of most academic studies is on finding unknown attacks, because signature-based solutions already provide a way to find known attacks. Those studies use the power of machine learning for this purpose, and classification methods can be applied to find attacks. Attacks have a complicated nature, so single-model methods often do not provide high accuracy in detecting them. As an alternative, ensemble machine learning methods have started to play a role in intrusion detection studies to achieve higher accuracy. The goal of ensemble methods is to combine more than one machine learning model to create a new model.
• To design and develop an efficient intrusion detection system using Honeypot and Honeynet for the cloud environment.
• To incorporate an ensemble-based technique to detect and classify known threats using supervised learning.
• To classify the known threats using ensemble classifiers and evaluate the classification performance and detection abilities.
Chapter 2. offers a brief review of the related work. It elaborates the widespread taxonomy of the cloud computing and intrusion detection system fields, discusses the theoretical background, and gives an account of the state of the art in intrusion detection technologies.
Chapter 3. elaborates on the honeypots that capture attacker activity in a cloud environment. This is followed by a discussion of the results in the form of intrusion data collected from this setup and used for assessing intrusion detection.
Chapter 4. presents the methodology used for feature ranking, for assessing a feature subset, and for constructing a valuable reduced feature set from a given intrusion dataset; it explains how the experiments were conducted and discusses their results.
Chapter 5. presents the methodology used for detecting and classifying known malicious threats using supervised learning and an ensemble-based voting technique.
Chapter 6. discusses the evaluation of an ensemble-based approach for the intrusion detection system using bagging, boosting, and stacking ensemble classifiers. To evaluate the proposed model, an evaluation strategy was developed and a series of experiments was carried out.
Chapter 7. summarises the research work and provides the future directions for further research
in this area.