Data Analytics
and Decision
Support for
Cybersecurity
Data Analytics
Series editors
Longbing Cao, Advanced Analytics Institute, University of Technology, Sydney,
Broadway, NSW, Australia
Philip S. Yu, University of Illinois at Chicago, Chicago, IL, USA
Building and promoting the field of data science and analytics in terms of
publishing work on theoretical foundations, algorithms and models, evaluation
and experiments, applications and systems, case studies, and applied analytics in
specific domains or on specific issues.
Editors
Iván Palomares Carrascosa, University of Bristol, Bristol, UK
Harsha Kumara Kalutarage, Centre for Secure Information Technologies,
Queen's University Belfast, Belfast, UK
Yan Huang, Queen's University Belfast, Belfast, UK
[Fig. 1, summarized here in outline: a three-layer pipeline in which (1) Data
Management (data fusion, feature selection) produces aggregated/pre-processed
information, (2) Analytics (incomplete information management, uncertainty
handling) produces domain-specific extracted knowledge, and (3) Decision Support
(visualisation and human review, multi-criteria decision making, monitoring, risk
analysis) builds on that knowledge; application sectors include government,
finance, insurance and law enforcement.]
Fig. 1 Overview of data analytics and decision support processes and techniques in cybersecurity
scenarios
Part I—Regular Chapters The first seven chapters present both theoretical
and practical-industrial contributions related to emergent cybersecurity research.
Particular emphasis is put on data analysis approaches and their relationship with
decision-making and visualization techniques to provide reliable decision support
tools.
In Chap. 1 [1], Markus Ring et al. present a novel toolset for anomaly-based
network intrusion detection. Motivated by challenges that frequently hinder the
applicability of anomaly-based intrusion detection systems in real-world settings,
the authors propose a flexible framework comprising diverse data mining
algorithms. Their approach applies online analysis upon flow-based data describing
meta-information about network communications, along with domain knowledge
extraction, to augment the value of network information before analysing it under
multiple perspectives. To overcome the problem of data availability, the framework
is also conceived to emulate realistic user activity—with a particular focus on the
insider threat problem—and generate readily available flow-based data.
Legg reflects in Chap. 2 [2] on the problem of detecting insider threats in
organizations and major challenges faced by insider threat detection systems, such
as the difficulty of reducing false alarms. The chapter investigates the importance of
combining visual analytics approaches with machine learning methods to enhance
an iterative process of mutual feedback between detection system and human
analyst, so as to rationally capture the dynamic—and continuously evolving—
boundaries between normal and insider behaviours and make informed decisions.
In this work, the author demonstrates how generated visual knowledge can signif-
icantly help analysts to reason and make optimal decisions. The chapter concludes
with a discussion that aligns current challenges and future directions of research on
the insider threat problem.
Malware detection is no longer a problem pertaining solely to desktop computer
systems. With the enormous rise of mobile device technologies, the presence
and impact of malware have rapidly expanded to the mobile computing panorama.
Măriuca et al. in Chap. 3 [3] report on Android mobile systems as a potentially major
target for collusion attacks, i.e. attacks that result from “combining” permissions of
multiple apps to pave the way for attackers to undertake serious threats. The authors
present two analysis methods to assess the potential danger of apps that may become
part of a collusion attack. Apps assessed as suspicious are subsequently analysed in
further detail to confirm whether an actual collusion exists. Previous work by the
authors is adopted as a guideline to provide a general overview of the state-of-the-
art research in app collusion analysis.
In Chap. 4 [4], Carlin et al. focus on a recent strategy to fight against the
devastating effects of malware: the dynamic analysis of run-time opcodes. An
opcode is a low-level and human-readable machine language instruction, and it
can be obtained by disassembling the software program being analysed. Carlin
et al. demonstrate in their work the benefits of dynamic opcode analysis for
detecting malicious software with significant accuracy in practical applications.
One such notable advantage is the ability of dynamic analysis techniques
to observe malware behaviour at runtime. The model presented by the authors,
which uses n-gram analysis on extracted opcodes, is validated through a large data
set.
(...)
and fuzzy reasoning process allows the system to make highly confident decisions
autonomously as to whether current levels of power load demand are legitimate
or manipulation by an attacker is taking place.
Garae and Ko provide in Chap. 9 [9] an insightful closure to this book, with
a comprehensive overview of analytics approaches based on data provenance,
effective security visualization techniques, cybersecurity standards and decision
support applications. In their study, Garae and Ko investigate the notion of data
provenance as the ability to track data from its creation to its deletion, and the
reconstruction of its provenance as a means of exploring cyber-attack patterns.
The authors argue for the potential benefits of integrating data provenance and
visualization techniques to analyse data and support decision-making in IT security
scenarios. They also present a novel security visualization standard, describing its
major guidelines and law enforcement implications in detail.
For ease of reference, the table below summarizes the techniques covered and
cybersecurity domains targeted by each one of the chapters comprising the book.
We would like to thank the Springer editorial assistants for the confidence placed
in this book and their continuous support in materializing it before, during and after
its elaboration. We are also very grateful to the authors of selected chapters for
their efforts during the preparation of the volume. Without their valuable ideas
and contributions, finishing this project would not have been possible. Likewise,
we acknowledge all the scientists and cybersecurity experts who generously vol-
unteered in reviewing the chapters included in the book. Finally, we would like
to express our special thanks to Dr. Robert McCausland, principal engineer and
R&D manager in the Centre for Secure Information Technologies (CSIT), (Queen’s
University Belfast), for firmly believing in our initiative and strongly supporting it
since its inception.
References
1. Markus Ring, Sarah Wunderlich, Dominik Grüdl, Dieter Landes, Andreas Hotho. A Toolset for
Intrusion and Insider Threat Detection.
2. P.A. Legg. Human-Machine Decision Support Systems for Insider Threat Detection.
3. I. Măriuca Asăvoae, J. Blasco, T.M. Chen, H.K. Kalutarage, I. Muttik, H.N. Nguyen,
M. Roggenbach, S.A. Shaikh. Detecting Malicious Collusion Between Mobile Software
Applications: The Android Case.
4. D. Carlin, P. O’Kane, S. Sezer. Dynamic Analysis of Malware using Run-Time Opcodes.
5. N. Moustafa, G. Creech, J. Slay. Big Data Analytics for Intrusion Detection Systems: Statistical
Decision-Making using Finite Dirichlet Mixture Models.
6. Y. Sabbah. Security of Online Examinations.
7. R. Indika P. Wickramasinghe. Attribute Noise, Classification Technique and Classification
Accuracy: A Comparative Study.
8. M. Alamaniotis, L.H. Tsoukalas. Learning from Loads: An Intelligent System for Decision
Support in Identifying Nodal Load Disturbances of Cyber-Attacks in Smart Power Systems using
Gaussian Processes and Fuzzy Inference.
9. J. Garae, R. Ko. Visualization and Data Provenance Trends in Decision Support for Cybersecu-
rity.
Contributors
Nour Moustafa The Australian Centre for Cyber Security, University of New
South Wales Canberra, Canberra, NSW, Australia
Igor Muttik Cyber Curio LLP, Berkhamsted, UK
Hoang Nga Nguyen Centre for Mobility and Transport, Coventry University,
Coventry, UK
Philip O’Kane Centre for Secure Information Technologies, Queen’s University,
Belfast, Northern Ireland, UK
Markus Ring Department of Electrical Engineering and Computer Science,
Coburg University of Applied Sciences and Arts, Coburg, Germany
Markus Roggenbach Department of Computer Science, Swansea University,
Swansea, UK
Yousef W. Sabbah Faculty of Technology and Applied Sciences, Quality Assur-
ance Department, Al-Quds Open University, Ramallah, Palestine
Sakir Sezer Centre for Secure Information Technologies, Queen’s University,
Belfast, Northern Ireland, UK
Siraj Ahmed Shaikh Centre for Mobility and Transport, Coventry University,
Coventry, UK
Jill Slay The Australian Centre for Cyber Security, University of New South Wales
Canberra, Canberra, NSW, Australia
Lefteri H. Tsoukalas Applied Intelligent Systems Laboratory, School of Nuclear
Engineering, Purdue University, West Lafayette, IN, USA
R. Indika P. Wickramasinghe Department of Mathematics, Prairie View A&M
University, Prairie View, TX, USA
Sarah Wunderlich Department of Electrical Engineering and Computer Science,
Coburg University of Applied Sciences and Arts, Coburg, Germany
Part I
Regular Chapters
A Toolset for Intrusion and Insider Threat Detection
Markus Ring, Sarah Wunderlich, Dominik Grüdl, Dieter Landes, and Andreas Hotho
Abstract Company data are a valuable asset and must be protected against
unauthorized access and manipulation. In this contribution, we report on our
ongoing work that aims to support IT security experts with identifying novel
or obfuscated attacks in company networks, irrespective of their origin inside or
outside the company network. A new toolset for anomaly based network intrusion
detection is proposed. This toolset uses flow-based data which can be easily
retrieved by central network components. We study the challenges of analysing
flow-based data streams using data mining algorithms and build an appropriate
approach step by step. In contrast to previous work, we collect flow-based data for
each host over a certain time window, include the knowledge of domain experts
and analyse the data from three different views. We argue that incorporating
expert knowledge and previous flows allows us to create more meaningful attributes
for subsequent analysis methods. This way, we try to detect novel attacks while
simultaneously limiting the number of false positives.
1 Introduction
Information security is a critical issue for many companies. The fast development of
network-based computer systems in modern society leads to an increasing number
of diverse and complex attacks on company data and services. Company data
are a valuable asset which must remain authentic and inaccessible to
unauthorized parties [27]. Therefore, it is necessary to find ways to protect
company networks against criminal activities, called intrusions. To reach that goal,
companies use various security systems like firewalls, security information and
event management systems (SIEM), host-based intrusion detection systems, or
network intrusion detection systems.
This chapter focuses on anomaly-based network intrusion detection systems.
Generally, network intrusion detection systems (NIDS) try to identify malicious
behaviour at the network level and can be categorized into misuse and anomaly
detection [25]. Misuse detection utilizes known attacks and tries to match incom-
ing network activities with predefined signatures of attacks and malware [14].
Consequently, only known attacks can be found and the list of signatures must
be constantly updated [18]. The increasing trend of insider attacks complicates
this challenge even more. It is harder to identify signatures that indicate unusual
behaviour as these behaviours may be perfectly normal under slightly different
circumstances. Anomaly detection systems on the other hand assume that normal
and malicious network activities differ [18]. Regular network activities are modelled
by using representative training data, whereas incoming network activities are
labelled as malicious if they deviate significantly [14]. Thus, anomaly detection
systems are able to detect novel or obfuscated attacks. However, operational
environments mainly apply misuse detection systems [52]. Sommer and Paxson [52]
identify the following reasons for the failure of anomaly-based intrusion detection
systems in real world settings:
1. high cost of false positives
2. lack of publicly available training and evaluation data sets
3. the semantic gap between results and their operational interpretation
4. variability of input data
5. fundamental evaluation difficulties
In this contribution, we propose a novel approach for anomaly based network
intrusion detection in which we try to consider the challenges identified by
Sommer and Paxson [52]. We report on our ongoing work that aims to develop
an interactive toolset which supports IT security experts by identifying malicious
network activities. The resulting toolset Coburg Utility Framework (CUF) is based
on a flexible architecture and offers a wide range of data mining algorithms. Our
work addresses various aspects that may contribute to mastering the challenge of
identifying significant incidents in network data streams, irrespective of their origin
from inside or outside of a company network. In particular, our approach builds
upon flow-based data. Flows are meta-information about network communications
between hosts and can be easily retrieved by central network components like
routers, switches or firewalls. This results in fewer privacy concerns compared to
packet-based approaches and the amount of flow data is considerably smaller in
contrast to the complete packet information.
The resulting approach is multistaged: We first propose an enrichment of flow-
based data. To this end, we collect all flows within a given time window for
each host and calculate additional attributes. Simultaneously, we use additional
domain knowledge to add further information to the flow-based data like the
origin of the Source IP Address. In order to detect malicious network traffic, the
enriched flow-based data is analysed from three different perspectives using data
mining algorithms. These views are adjusted to detect various phases of attacks
like Scanning or Gaining Access (see Sect. 3.1.1). We are confident that collecting
flows for each host separately for a certain time window and the inclusion of
domain knowledge allows us to calculate more meaningful attributes for subsequent
analysis methods. Further, the three different analysis views allow us to reduce the
complexity in each single view. We also describe our process to generate labelled
flow-based data sets using OpenStack in order to evaluate the proposed approach.
The chapter is organized as follows: The next section introduces the general
structure of our toolset Coburg Utility Framework (CUF). Section 3 proposes our
data mining approach to analyse flow-based data streams using CUF. Then, the
generation of labelled flow-based data sets for training and evaluation is described
in Sect. 4. Section 5 discusses related work on flow-based anomaly detection. The
last section summarizes the chapter and provides an outlook.
Filters are working steps that manipulate incoming data. Pipes connect pairs of filters and pass
data on to other filters. The architecture of CUF ensures optimal encapsulation of
pipes and filters, and both are realized as services with a common interface. Services
are implemented by a service provider and loaded dynamically by a service loader.
In addition, this architecture provides an easy way of integrating new filters, e.g.,
when integrating a new clustering algorithm as a filter, all other filters and pipes
stay unchanged.
As an enhancement to the pure pipes-and-filters pattern, CUF allows workflows to be
split and merged. Thus, different clustering filters or multiple instances of one filter
with different parameter settings can be combined to process data in a single run
and compare their results directly. Since each of these steps is implemented as an
individual filter in CUF, different workflows can easily be set up and executed.
Figure 1 shows a simple data mining workflow in CUF to cluster network data. At
first, an input filter reads the data from a database. Then, a preprocessing filter adds
additional information to each data point. In this case, the GeoIPFilter adds to each
data point the corresponding geographical coordinates of the Source IP Address and
Destination IP Address using an external database. The third filter in the processing
chain is the clustering algorithm k-Prototypes. This filter sorts the incoming data
points in a predefined number of k clusters (groups) according to their similarity. The
fourth filter visualizes the results in the form of Parallel Coordinates and the last filter
writes the results to disk. Input and output interfaces of the filters are represented
by the coloured rectangles in Fig. 1. Input interfaces are on the left side and output
interfaces are on the right side. Green rectangles transport data points, whereas blue
rectangles transport cluster objects. Figure 1 shows that the Parallel Coordinates
filter is able to process data points (green input rectangle) or cluster objects (blue
input rectangle). The ability of filters to read and generate different output formats
allows us to easily split and merge workflows.
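To make the pipes-and-filters pattern more tangible, the following minimal Python
sketch wires three filters into a chain, each running in its own thread; all class
names, stage functions and data fields are our own illustrative assumptions and do
not reproduce CUF's actual API.

    # Minimal pipes-and-filters sketch: queues act as pipes, threads as filters.
    import queue
    import threading

    class Filter(threading.Thread):
        """Reads items from an input pipe, transforms them and writes the
        results to an output pipe. One thread per filter mirrors the
        execution model described above."""
        def __init__(self, inbox, outbox, func):
            super().__init__(daemon=True)
            self.inbox, self.outbox, self.func = inbox, outbox, func

        def run(self):
            while True:
                item = self.inbox.get()
                if item is None:              # poison pill terminates the chain
                    self.outbox.put(None)
                    return
                self.outbox.put(self.func(item))

    # Wire up a three-stage chain: enrich -> cluster-assign -> format.
    p1, p2, p3, p4 = (queue.Queue() for _ in range(4))
    stages = [
        Filter(p1, p2, lambda f: {**f, "geo": "DE"}),             # stand-in GeoIP step
        Filter(p2, p3, lambda f: {**f, "cluster": f["bytes"] // 1000}),
        Filter(p3, p4, lambda f: str(f)),
    ]
    for s in stages:
        s.start()

    p1.put({"src_ip": "10.0.0.1", "bytes": 4711})
    p1.put(None)
    while (out := p4.get()) is not None:
        print(out)

Because each stage only touches its own queues, splitting or merging a workflow
amounts to connecting additional queues, which is the property exploited above.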
One of the major challenges in analysing network data is the large amount of
data generated by network devices. Company networks generate millions of network
flows per hour. Consequently, an efficient and fast processing chain is required.
Here, the pipes-and-filters architecture of CUF itself has a big advantage. Since
filters work independently, each of them is executed in its own thread such that
multicore architectures may easily be utilized to reduce execution time.
2.2.1 Input and Output Filters
As the name suggests, input and output filters are responsible for reading and writing
data. CUF offers various input filters which can read data from different sources like
text files, binary files, or databases. For each input filter, a corresponding output filter
is available for storing data in a particular format. Binary files are primarily used to
read and write temporary results from clusterings. For text files, CUF offers two
formats: CSV and HTML. The CSV format is most widely used. Yet, analysing CSV
files with many columns and lines can quickly become confusing. Therefore, CUF
may also write data in a formatted table and store it in HTML format.
We are primarily interested in analysing flow-based data streams. In experimental
settings, however, network data streams are often not available. Instead, flow-
based data streams are recorded from network devices and stored in CSV files.
Each recorded flow (data point) contains an attribute named Date first seen. This
attribute depicts the timestamp at which the corresponding network device created
the flow. For simulating live data streams from these stored CSV files, CUF offers
an additional stream simulation filter. The stream simulation filter emulates a data
stream by extracting the attribute Date first seen from each flow and calculates the
time difference between two successive flows. The filter forwards the flows with
respect to the calculated time differences in the processing chain to simulate real
traffic.
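A minimal Python sketch of such a stream simulation filter is given below. It
assumes flows stored as CSV rows with a Date first seen column; the timestamp
format string is an assumption, not a fixed property of the recorded data.

    # Replay flows from a CSV file with their original inter-arrival times.
    import csv
    import time
    from datetime import datetime

    def simulate_stream(path, emit, fmt="%Y-%m-%d %H:%M:%S.%f"):
        prev = None
        with open(path, newline="") as fh:
            for flow in csv.DictReader(fh):
                ts = datetime.strptime(flow["Date first seen"], fmt)
                if prev is not None:
                    # Sleep for the gap between two successive flows so that
                    # downstream filters see realistic traffic timing.
                    time.sleep(max((ts - prev).total_seconds(), 0.0))
                prev = ts
                emit(flow)

    # simulate_stream("flows.csv", emit=print)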
2.2.2 Preprocessing
CUF contains a wide range of preprocessing filters which, in most cases, are
not limited to network data. Preprocessing filters clean data (e.g. by removing
inconsistent data), transform attribute values, or add further information. Some
preprocessing filters use explicit domain knowledge to add further information to
data points.
1 http://www.cs.waikato.ac.nz/ml/index.html.
2.2.3 Clustering
In general, the first goal of any type of data analysis is a better understanding
of the data [27]. Clustering is an unsupervised technique, meaning it uses no a
priori knowledge about the data. Consequently, appropriate clustering techniques
can only be determined experimentally. In order to avoid restricting the range
of techniques, CUF integrates clustering algorithms from all categories, namely
partitioning (k-means and variants), hierarchical (Lance-Williams [33], ROCK
[20] and extensions), grid-based and density-based algorithms (CLIQUE [2] and
extensions) [27]. Any clustering algorithm may choose an appropriate distance
measure, ranging from Minkowski distances for continuous data over the Jaccard
Index for categorical attributes to ConDist [44] for heterogeneous data. The latter is
even capable of self-adapting to the underlying data set.
Further, CUF integrates the stream clustering algorithms CluStream [1] and
DenStream [6] both of which are based on the principle of micro and macro clusters.
An online component clusters incoming data points into micro clusters. The offline
component is triggered by the user, uses micro clusters as input data points, and
presents a clustering result for a certain timeframe to the user. CluStream [1]
and DenStream [6] can only process continuous attributes in their standard form.
Therefore, CUF also integrates HCluStream [64] and HDenStream [29] which have
been proposed as extensions in order to handle categorical attributes in addition to
numerical ones.
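To illustrate the micro-cluster principle these algorithms share, the sketch below
maintains the additive statistics (count, linear sum and squared sum per dimension)
from which centre and radius can be derived without storing raw flows. The
absorption threshold and the simple nearest-cluster assignment are simplifying
assumptions, not the exact CluStream or DenStream procedure.

    import math

    class MicroCluster:
        """Additive summary of the points absorbed so far."""
        def __init__(self, point):
            self.n = 1
            self.ls = list(point)                  # linear sums
            self.ss = [x * x for x in point]       # squared sums

        def absorb(self, point):
            self.n += 1
            for i, x in enumerate(point):
                self.ls[i] += x
                self.ss[i] += x * x

        def centre(self):
            return [s / self.n for s in self.ls]

        def radius(self):
            # Root-mean-square deviation from the centre across dimensions.
            var = sum(self.ss[i] / self.n - (self.ls[i] / self.n) ** 2
                      for i in range(len(self.ls)))
            return math.sqrt(max(var, 0.0))

    def online_step(micro_clusters, point, max_radius=2.0):
        """Absorb the point into the nearest micro cluster if it fits,
        otherwise open a new one. The offline (macro) phase would later
        cluster the micro-cluster centres on demand."""
        if micro_clusters:
            mc = min(micro_clusters, key=lambda m: math.dist(m.centre(), point))
            if math.dist(mc.centre(), point) <= max_radius:
                mc.absorb(point)
                return
        micro_clusters.append(MicroCluster(point))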
2.2.4 Classification
2.2.5 Evaluation
CUF provides a wide range of filters for result evaluation. First of all, filters
are integrated to calculate default evaluation measures like Accuracy, F1-Score,
Recall or Precision. However, these measures can only be calculated when a
ground truth is available, like the labels in the classification setting. If there is
no ground truth available, CUF offers filters which calculate intrinsic validation
measures. Regarding the evaluation of data stream clustering, Hassani and Seidl
examine the performance and properties of eleven internal clustering measures
in [21]. Since Calinski-Harabasz [5] emerges as the best internal evaluation measure,
it is implemented in CUF. Besides Calinski-Harabasz, other promising evaluation
methods are implemented as well (e.g. CPCQ-Index [30], CDbw-Index [9], or
Davies-Bouldin-Index [13]). These intrinsic cluster validation measures evaluate
the clustering results by their compactness and separability. However, these key
measurements can only provide an overall assessment of the results and give no
further insights.
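As a generic illustration of such an intrinsic measure (not CUF's own evaluation
filter), the Calinski-Harabasz index of a clustering result can be computed with
scikit-learn on synthetic stand-in data:

    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score
    import numpy as np

    X = np.random.rand(500, 4)                  # stand-in for preprocessed flows
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
    print(calinski_harabasz_score(X, labels))   # higher = more compact/separated

The result is a single number per clustering, which is precisely the limitation
that motivates the visualization filters described next.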
Therefore, CUF provides various visualization filters for deeper investigations.
Data plots (see Fig. 2) may give a first impression of the data. Bar diagrams
give an overview of the number of generated clusters and their number of data
points. Further, CUF may use parallel coordinates, pixel-based visualization, or
radial visualizations to display large amounts of multi-dimensional data to a human
security expert. Figure 3 shows the visualization of a network data set in parallel
coordinates as well as in radial coordinates. In parallel coordinates, each attribute
is displayed on an axis. All axes are displayed parallel to each other on the screen.
Each data point is represented as a line from left to right. The situation is slightly
Fig. 2 Data plot which represents the number of flows per time
Fig. 3 Representation of a network data set with a horizontal Port Scan in parallel coordinates
(a) and a vertical Port Scan in radial coordinates (b). The flows which belong to the Port Scan are
highlighted in red
different for radial coordinates where each attribute is displayed on an axis as well,
but data are arranged in a circular layout. However, it has to be taken into account
that opposing axes influence each other. Other visualizations may be used to
get an overview of or detailed insights into the overall result.
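For readers who want to reproduce a simple parallel coordinates view, the pandas
plotting helper offers one; this is a generic sketch with hypothetical flow
attributes, not CUF's visualization filter.

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates

    df = pd.DataFrame({
        "duration": [0.1, 0.2, 5.0], "bytes": [400, 380, 90000],
        "packets": [3, 4, 60], "label": ["normal", "normal", "portscan"],
    })
    parallel_coordinates(df, "label", color=["grey", "red"])
    plt.show()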
This way, we tackle the fundamental challenge of evaluation of anomaly based
intrusion detection systems by using a broad range of evaluation techniques.
Section 3.2 gives an overview of the proposed approach and provides the underlying ideas. Then, the integration of
additional domain knowledge is described in Sect. 3.3. Sections 3.5–3.7 describe
the three different views in which the incoming flow-based data stream is analysed.
As already mentioned above, we focus on flow-based data for intrusion and insider
threat detection. To reiterate, flows contain meta-information about connections
between two network components. A flow is identified by the default five tuple:
Source IP Address, Source Port, Destination IP Address, Destination Port and
Transport Protocol. We capture flows in unidirectional NetFlow format [10] which
typically contains the attributes shown in Table 1. These attributes are typical
in flow-based data and also available in other flow standards like IPFIX [11] or
sFlow [39]. NetFlow terminates a flow record in two cases: (1) a flow receives no
data within α seconds after the last packet arrived (inactive timeout) or (2) a flow
has been open for β seconds (active timeout). By default, NetFlow uses the values
α = 15 and β = 1800.

Table 1 Overview of the NetFlow attributes used in our approach. The third column gives a short
description of each attribute

Nr. Name Description
1 Src IP Source IP address
2 Src port Source port
3 Dest IP Destination IP address
4 Dest port Destination port
5 Proto Transport protocol (e.g. ICMP, TCP, or UDP)
6 Date first seen Start time of the flow (first packet seen)
7 Duration Duration of the flow
8 Bytes Number of transmitted bytes
9 Packets Number of transmitted packets
10 Flags OR concatenation of all TCP flags

2 https://nmap.org/.
NIDS usually operate either on flow-based data or on packet-based data.
We analyse flow-based data for several reasons. In contrast to packet-based data, the
amount of data can be reduced, fewer privacy concerns are raised, and the problem
of encrypted payloads is bypassed. Further, problems associated with the variability
of input data [52] are avoided. Flows have a standard definition and can be easily
retrieved by central network components. Another advantage of flows is that huge
parts of the company network can be observed. For example, firewall log files would
limit the analysed data to the traffic which passes the firewall. Consequently, insider
attacks do not appear in these log files as they usually do not pass the firewall. In
contrast to that, traffic between internal network components always passes switches
or backbones. In conclusion, the use of flows allows us to analyse the whole company
network traffic independently of its origin.
Flows come as either unidirectional or bidirectional flows. Unidirectional flows
aggregate all packets from host A to host B that have identical five tuples into one
flow. The packets from host B to host A are merged into another unidirectional flow.
In contrast, bidirectional flows contain the traffic from host A to host B as well as
vice versa. Consequently, bidirectional flows contain more information. However,
we decided to use unidirectional flows since company backbones often contain
asymmetric routing [23] which would distort the information in bidirectional flows.
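To make the flow abstraction concrete, the sketch below aggregates packets into
unidirectional flow records keyed by the five tuple and applies NetFlow-style
inactive (α) and active (β) timeouts; the packet fields and the print-based export
are simplifying assumptions.

    ALPHA, BETA = 15.0, 1800.0   # default NetFlow timeouts in seconds

    flows = {}   # five tuple -> flow record

    def expire(key, reason):
        print("export flow", key, "reason:", reason, flows.pop(key))

    def on_packet(ts, src_ip, src_port, dst_ip, dst_port, proto, nbytes):
        key = (src_ip, src_port, dst_ip, dst_port, proto)   # direction matters
        f = flows.get(key)
        if f is not None and ts - f["last"] > ALPHA:
            expire(key, "inactive timeout"); f = None
        if f is not None and ts - f["first"] > BETA:
            expire(key, "active timeout"); f = None
        if f is None:
            flows[key] = {"first": ts, "last": ts, "bytes": nbytes, "packets": 1}
        else:
            f["last"] = ts
            f["bytes"] += nbytes
            f["packets"] += 1

Note that packets in the reverse direction automatically form a separate record,
since the key preserves the direction of the communication.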
The individual phases of an attack (see Sect. 3.1.1) have different effects on
flow-based data. The Reconnaissance phase has no influence on the data since
methods like dumpster diving or social engineering generate no observable network
traffic within the company. In comparison, the other four phases generate observable
network traffic.
It should be noted that the detection of attacks using host-based log files could
sometimes be easier than the analysis of flows, e.g. failed SSH logins are stored in
the SSH log file. However, regarding phase five (Covering Tracks), attackers usually
manipulate the log files on the host to wipe their traces. Since we use flow-based
data of network components and no host-based log files, the covering of tracks fails
in our anomaly based intrusion detection system. This would only be possible if the
attacker hacks the network components and manipulates the flows.
3.1.4 Implications
Considering our above analysis of the general problem setting, where we
investigated attack scenarios, the underlying flow-based data and necessary data
preparation steps, we can draw several conclusions:
1. A basic assumption is that due to the large number of various applications
and services it is nearly impossible to decide whether a single flow is normal or
malicious traffic based on the available attributes (see Table 1). This assumption
is supported by the fact that normal user behaviour and malicious user behaviour
are characterized by sequences of flows. Let us illustrate this with an example.
Assume a Vertical Port Scan attack. Here, the attacker scans some or all open
ports on a target system [55]. Since Source Port and Destination Port are keys
of the default five tuple for creating flows, each scanned port generates a new
flow. Another example is the loading of a web page. Often different pictures are
reloaded or other web pages are included. In such cases, the client opens various
Source Ports or sends requests to different web servers (different Destination IP
Addresses). For these reasons, it makes more sense to collect multiple flows for
each host rather than to analyse each flow separately.
2. Different attack phases have different effects on flow-based data. In the second
phase (Scanning) different services and/or hosts are targeted, whereas in the
third phase (Gaining Access), a concrete service of a host is attacked. In the
fourth phase (Maintaining Access), flow characteristics like transmitted Bytes or
Packets seem to be normal; only the origin of the connections is suspicious.
Consequently, it makes more sense to analyse the flow-based data from different
views to detect different attack phases.
3. The information within flow-based data is limited. Therefore, flows should be
enriched with as much information about the network as possible using domain
knowledge.
This section provides an overview of the proposed approach and discusses the
underlying ideas. Figure 5 shows the essential components of our approach.
The IP Address Info Filter is the first filter in the processing chain. It receives
the flow-based data stream XS from central network components and incorporates
domain knowledge about the network (see Sect. 3.3). Data is passed through the
Service Detection Filter (Sect. 3.4) to identify the services of each flow (e.g.
SSH, DNS or HTTP). The Collecting Filter is the central component of our
approach. It also incorporates domain knowledge about the network and receives
the enriched flow-based data stream XS from the Service Detection Filter. Based on
the observations above, in its first step, the Collecting Filter collects all incoming
flows for each user separately. User identification is based on the Source IP Address
of the flow. A parameter δ controls the window size (in seconds) of flows which
are collected for each user. The larger the parameter δ, the more memory is
necessary, but in consequence the quality of the calculated summary of flows
increases. The Collecting Filter creates one Network data point for each user and
each time window. For each user and identified service within the time window,
a Service data point and a User data point are created. Each of these data points
is created for investigating the user flows from a specific view, namely Network
Behaviour Analyser, Service Behaviour Analyser and User Behaviour Analyser
in Fig. 5. The Network data point contains specific information about the users’
network behaviour and is described in Sect. 3.5. The Service data point contains
specific information about the usage of the concrete service and is described in
Sect. 3.6. The User data point contains specific information, e.g. if the behaviour
is typical for a user or not. It is described in Sect. 3.7. We argue that incorporating
domain knowledge from IT security experts and other flows allows us to create more
meaningful attributes for downstream analysis methods.
The IP Address Info Filter, Service Detection Filter, Collecting Filter, Network
Behaviour, Service Behaviour and User Behaviour are implemented as separate
filters in CUF which can be run independently and in parallel.
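The following Python sketch illustrates the windowing idea behind the Collecting
Filter: flows are grouped per Source IP Address over a window of δ seconds, and
one Network data point plus per-service Service and User data points are emitted
per window. All attribute names and the simple tumbling-window logic are our own
simplifications, not CUF's implementation.

    from collections import defaultdict

    DELTA = 300.0   # window size in seconds (the parameter delta)

    def collect(flow_stream):
        window_start, per_user = None, defaultdict(list)
        for flow in flow_stream:                      # flows ordered by time
            ts = flow["ts"]
            if window_start is None:
                window_start = ts
            if ts - window_start >= DELTA:            # close the current window
                for user, flows in per_user.items():
                    yield make_network_point(user, flows)
                    for svc in {f["service"] for f in flows}:
                        svc_flows = [f for f in flows if f["service"] == svc]
                        yield make_service_point(user, svc, svc_flows)
                        yield make_user_point(user, svc, svc_flows)
                window_start, per_user = ts, defaultdict(list)
            per_user[flow["src_ip"]].append(flow)
        # Note: a real filter would also flush the final partial window.

    def make_network_point(user, flows):
        return {"view": "network", "user": user, "num_flows": len(flows)}

    def make_service_point(user, svc, flows):
        return {"view": "service", "user": user, "service": svc,
                "bytes": sum(f["bytes"] for f in flows)}

    def make_user_point(user, svc, flows):
        return {"view": "user", "user": user, "service": svc}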
Arguably, the performance of a NIDS increases with the amount of domain-specific
information about the network. Therefore, we integrate more detailed information
about Source IP Address, Destination IP Address, Source Port and Destination Port
in our system.
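As an illustration of how such domain knowledge might be encoded, the sketch
below enriches an IP address with internal/external, server/client and
organizational-unit information; all subnet values and mappings are invented
placeholders for a company's own knowledge base.

    import ipaddress

    INTERNAL = ipaddress.ip_network("10.0.0.0/8")
    SERVERS = {"10.0.1.10", "10.0.1.11"}
    ORG = {"10.0.1.0/24": "data centre", "10.0.2.0/24": "office"}

    def ip_info(ip):
        addr = ipaddress.ip_address(ip)
        org = next((name for net, name in ORG.items()
                    if addr in ipaddress.ip_network(net)), "unknown")
        return {"internal": addr in INTERNAL,
                "is_server": ip in SERVERS,
                "org": org}

    # ip_info("10.0.1.10") -> {'internal': True, 'is_server': True,
    #                          'org': 'data centre'}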
The Service Detection Filter classifies each flow with respect to their services
(e.g. HTTP, SSH, DNS or FTP). Right now, this filter uses a common identification
method which is based on evaluating known port numbers assigned by the Internet
Assigned Numbers Authority (IANA).3
Unfortunately, this approach is no longer viable because many applications do
not use fixed port numbers [65]. Another problem when evaluating known port
numbers is that many applications tunnel their traffic through port 80 (e.g. Skype).
Therefore, we intend to integrate more sophisticated service detection algorithms
in the future. Nguyen et al. [37] and Valenti et al. [59] broadly review traffic
classification using data mining approaches which is not limited to flow-based
data. Several approaches for flow-based service classification have been published
in [32] and [65]. Moore and Zuev [32] show the effectiveness of a Naive Bayes
estimator for flow-based traffic classification. Zander et al. [65] use autoclass, an
unsupervised Bayesian classifier which learns classes inherent in a training data set
with unclassified objects. A more recent approach of service classification using
NetFlow data is given by Rossi and Valenti [46].
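For reference, the currently used port-based identification can be sketched in a
few lines. The port table is a small excerpt only, and matching the lower of the
two port numbers is our assumption about typical client/server behaviour rather
than the chapter's exact logic.

    WELL_KNOWN = {21: "FTP", 22: "SSH", 53: "DNS", 80: "HTTP", 443: "HTTPS"}

    def detect_service(src_port, dst_port):
        # The server side usually holds the well-known (lower) port number.
        for port in sorted((src_port, dst_port)):
            if port in WELL_KNOWN:
                return WELL_KNOWN[port]
        return "UNKNOWN"

    # detect_service(54321, 22) -> 'SSH'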
The Network Behaviour Analyser evaluates hosts with respect to their general
network behaviour. Consequently, this analyser primarily checks if the number and
the kind of connections are normal or suspicious for a specific host. The primary
goal is to identify activities in the Scanning phase (see Sect. 3.1.1). Therefore, the
main attacks in this scenario are IP Range Scans or Port Scans. For the detection
of Port Scans, a more detailed analysis is required. Port Scans can be grouped into
horizontal scans and vertical scans. In the case of the more common horizontal scans,
the attacker exploits a specific service and scans numerous hosts for the corresponding
port [55]. In contrast, vertical scans target some or all ports of a single host.
Since the scanning behaviour of attacker and victim hosts differs for
TCP and UDP, the two protocols need to be treated separately. The most common TCP scan
is the SYN-scan. In this case, the attacker sends the initialization request of the
3-Way-Handshake. If the port is open, a SYN-ACK-response is sent by the victim.
Otherwise, the victim host responds with a RST-Flag. It should be noted that there
are different approaches of TCP scans, e.g. sending a FIN flag instead of the
initialization SYN flag for bypassing firewall rules. These approaches are described
in more detail in the documentation of the popular nmap4 tool.
Scanning UDP ports differs fundamentally from scanning TCP ports. Successfully
addressing a UDP port does not necessarily produce a response; however, the same
behaviour can be observed if a firewall or another security mechanism blocks
the request. If the attacker addresses a closed UDP port, the victim sends an ICMP
unreachable message in return. However, most operating systems limit the number
of ICMP unreachable messages to one per second.

3 http://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml.
4 https://nmap.org/.
In consequence, it is easier to detect the targeted victim than the attacker,
since victims follow rules predefined by protocols. In contrast, attackers may vary
their behaviour regarding the protocol to trick the systems. Also, it is more likely to
detect the scanning of closed ports due to the atypical request-response behaviour.
Due to these observations, data points are calculated in the Collecting Filter with
corresponding attributes for this view. The calculated attributes contain values like
the number of flows within a time window, the number of sent or received RST-
Flags, the number of sent or received ICMP unreachable messages as well as the
number of requested ports per Destination IP Address. Since we consider both the
victim's and the attacker's view, it is also easier to detect distributed Port Scans.
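A sketch of how such scan-indicator attributes could be computed from one user's
flows in a time window is given below; the field names and the ICMP encoding are
assumptions for illustration.

    from collections import defaultdict

    def scan_features(flows):
        ports_per_dst = defaultdict(set)
        rst, icmp_unreach = 0, 0
        for f in flows:
            ports_per_dst[f["dst_ip"]].add(f["dst_port"])
            if "R" in f.get("flags", ""):          # RST flag seen
                rst += 1
            if f["proto"] == "ICMP" and f.get("icmp_type") == 3:
                icmp_unreach += 1                  # destination unreachable
        return {
            "num_flows": len(flows),
            "num_rst": rst,
            "num_icmp_unreachable": icmp_unreach,
            "max_ports_per_dst": max((len(p) for p in ports_per_dst.values()),
                                     default=0),
            "num_dst_ips": len(ports_per_dst),
        }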
In preliminary experiments, we applied various classifiers (J48 Decision Tree, k-
Nearest-Neighbour, Naive Bayes or SVM) for the detection of IP Range Scans and
Port Scans. The proposed approach seems to work on our emulated data sets. We
will describe the process of data emulation in Sect. 4.
Several other methods which use similar approaches for Port Scan detection are
discussed in the literature. Methods like Time-based Access Pattern Sequential
hypothesis testing (TAPS) [54] use the ratio of Destination IP Addresses and
Destination Ports to identify scanners. If the ratio exceeds a threshold, the host is
marked as scanner. Threshold Random Walk (TRW) [24] assumes that a scanner
has more failed connections than a legitimate client. For identification of failed
connections, TRW also evaluates the TCP-Flags.
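The core of TRW can be sketched as a sequential log-likelihood ratio test over
connection outcomes; the success probabilities and decision thresholds below are
example values, not those of [24].

    import math

    THETA0, THETA1 = 0.8, 0.2     # P(success | benign), P(success | scanner)
    ETA1, ETA0 = math.log(99), math.log(1 / 99)   # upper/lower decision bounds

    def trw(outcomes):
        """outcomes: iterable of booleans (True = connection succeeded).
        Returns 'scanner', 'benign' or 'undecided'."""
        llr = 0.0
        for ok in outcomes:
            p1 = THETA1 if ok else 1 - THETA1     # scanner hypothesis
            p0 = THETA0 if ok else 1 - THETA0     # benign hypothesis
            llr += math.log(p1 / p0)
            if llr >= ETA1:
                return "scanner"
            if llr <= ETA0:
                return "benign"
        return "undecided"

    # trw([False] * 10) -> 'scanner'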
The Service Behaviour Analyser evaluates hosts with respect to their correct
usage of services. Consequently, the main goal of this analyser is to check if the use
of the current service is normal or malicious for a host. The primary target is to recognize
the Gaining Access phase mentioned in Sect. 3.1.1. DoS or SSH Brute Force attacks
are typical representatives.
For the detection of misused services, it is necessary to collect all flows of this
service within the time window. Therefore, we use the service attribute added by
the Service Detection Filter (Sect. 3.4). All flows of the host within a time window
which share the same service and the same Destination IP Address are collected.
Based on these collected flows, the Collecting Filter calculates data points with
adjusted attributes for this view. The calculated attributes contain values like the
sum of transmitted Bytes and Packets, the duration of the flows, or the number of
flows. More useful attributes like the number of open connections or the number of
successfully closed connections can be derived from TCP Flags (if available). The
Collecting Filter also builds attributes which give additional information about the
source and destination using the domain knowledge of Sect. 3.3.
The User Behaviour Analyser filter evaluates hosts with respect to whether the
services they use are typical. The main goal of this analyser is to recognize if the current
connection is normal or malicious for this user and to identify already infected and
misused hosts. The primary target is to recognize the Maintaining Access and Covering
Tracks phases mentioned in Sect. 3.1.1.
Once a server is infected, the attacker knows a valid combination of username
and password and is able to start an SSH session to that server. In this case,
the traffic characteristics like transmitted bytes, packets or duration seem to be
legitimate. Therefore, the Collecting Filter calculates data points for this view which
strongly consider the source, destination and services of the flows. Here, the domain
knowledge of Sects. 3.3 and 3.4 is used. For example, the calculated attributes
describe if the Source (Destination) IP Address is internal or external, if the Source
(Destination) IP Address is a server or a client, and which organizations (see Fig. 6)
the Source (Destination) IP Address belongs to. Further, the identified service by
the Service Detection Filter is added as an additional attribute.
Right now, we use a simple rule learner which generates rules regarding normal
and malicious behaviour. If an unknown combination occurs, the corresponding
connection information is sent to the domain experts for further investigation. Rules
generated in this case would look like the following (in schematic form):

    IF service = SSH AND source = external AND destination = internal server
    THEN behaviour = unknown -> report to domain expert
This example would imply malicious behaviour, since we would not expect a valid
SSH connection to an internal server from outside the company network.
4 Data Generation
Labelled publicly available data sets are necessary for proper comparison and
evaluation of network based intrusion detection systems. However, evaluation of our
system proves to be difficult due to the lack of up-to-date flow-based data sets. Many
existing data sets are not publicly available due to privacy concerns. Those which
are publicly available often do not reflect current trends or lack certain statistical
characteristics [50]. Furthermore, correct labelling of real data proves to be difficult
due to the massive and non-transparent generation of traffic in networks. In order to
overcome these problems, we create labelled flow-based data sets through emulation
of user activities in a virtual environment using OpenStack [45].
In this section, some prominent existing data sets are presented as well as our
own approach to generate data sets for IDS evaluation.
DARPA98 and DARPA99 from the MIT Lincoln Laboratory were among the first
standard packet-based data sets published for evaluation purposes. Those data sets
were created by capturing simulated traffic of a small US Air Force base with limited
personnel via tcpdump [26]. The MIT Lincoln Laboratory also provided the KDD
CUP 99 data set, which is a modified version of the DARPA98 data set [17]. Each
data point of the KDD data set consists of 41 attributes. The KDD data set, however,
has a few problems, one being the huge number of redundant records. To overcome
these problems, the NSL-KDD data set was generated by Tavallaee et al. [57]. Since
publicly available data sets are sparse due to privacy concerns, a lot of today’s work
is based on the DARPA [40, 47] and KDD [7, 12, 41] data sets. As DARPA data sets
were created more than 17 years ago, it is questionable if they still reflect relevant
up-to-date scenarios appropriate for IDS evaluation [19, 62, 66].
Besides the data sets being outdated, conversion of packet-based to flow-based
data sets turns out to be tricky, if the data set is not available in a standard packet-
based format like pcap. Since we prefer to analyse network flows, naturally flow-
based data sets would be best. Sperotto et al. [53] created one of the first publicly
available flow-based labelled datasets by monitoring a single honeypot. Due to the
traffic being recorded by monitoring a honeypot, the data set mainly consists of
malicious data. Thus, the detection of false positives could not be determined during
the evaluation [53]. For a more comprehensive IDS evaluation, a more balanced data
set would be preferable.
In 2014, Wheelus et al. [62] presented the SANTA dataset which consists of
real traffic as well as penetration testing attack data. The attack data was labelled
via manual analysis [62]. In 2015, Zuech et al. [66] introduced a two-part dataset
named IRSC. The IRSC set comprises a NetFlow data set as well as a full packet
capture set. The data sets were labelled manually for uncontrolled attacks and via an
IP filter for controlled attacks [66]. Although the two flow-based data sets [62, 66]
are up-to-date, they are not publicly available as of now.
Another labelled flow-based data set is CTU-13. It was especially created for
training botnet detection algorithms. CTU-13 contains a variety of botnet traffic
mixed with background traffic coming from a real network [16].
Shiravi et al. [50] introduce a dynamic data set approach based on profiles
containing abstract representations of events and network behaviour. This allows
to generate reproducible, modified and extended data sets for better comparison of
different IDS. We use unidirectional flows whereas [50] contains bidirectional ones.
Converting flows might constitute a viable approach, but would require some effort
for re-labelling.
In conclusion, the validation of IDS through data sets in general seems to be
difficult due to few publicly available data sets. Some of the most widely used data
sets are outdated [17, 48, 57] or only contain malicious data which complicates the
attempt for comprehensive validation. Since we use unidirectional flows some of
the presented data sets would only be applicable after conversion and re-labelling
of the data [50]. Some data sets are only for specific attack scenarios [16] and other
promising approaches are not publicly available [62, 66].
Małowidzki et al. [31] define a list of characteristics for a good data set. A good data
set should contain recent, realistic and labelled data. It should be rich, containing all
the typical attacks met in the wild, as well as correct regarding operating cycles in
enterprises, e.g. working hours. We try to meet the requirements listed in [31]
in our approach using OpenStack.
OpenStack is an open source software platform which allows the creation of
virtual networks and virtual machines. This platform provides certain advantages
when generating flow-based data sets. A test environment can be easily scaled
by using virtual machines and therefore allows generating data sets of any size.
Furthermore, one has full control over the environment including the control of the
network traffic. This ensures the correct capturing and labelling of truly clean flow-
based data sets which do not contain any harmful scenarios. Conversely, it is also
possible to include a host as an attacker and clearly label the data generated from the
attacker as malicious. A data set acquired from a real enterprise network can never
be labelled with the same quality. However, it is of utmost importance to emulate
the network activity as authentically as possible to generate viable data sets, which is
difficult with synthetic data. To reach that goal, we emulate a sample small company
network with different subnets containing various servers and simulated clients and
record generated network traffic in unidirectional NetFlow format.
Fig. 7 An overview of a sample network in our OpenStack environment. The Router separates the
internal network from the internet and acts as firewall. The internal network structure with three
subnets containing several clients and servers is shown on the right
Malicious traffic is labelled via the
IP Address of the attacker and the exact timestamp of the attack. Again, Python
scripts are used to create malicious behaviour like DoS attacks, Port Scans, or SSH
brute force attacks. If malicious data is inserted via scripts, it is always possible to
include new attacks by simply writing new scripts. Thus, up-to-date data sets can be
generated at any time.
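An emulation script in this spirit might look as follows; the host addresses, the
working-hours rule and the request timing are invented placeholders, not the actual
scripts used in our environment.

    # Emulate a benign client that browses internal web servers during
    # working hours; runs until interrupted.
    import random
    import time
    from datetime import datetime
    from urllib.request import urlopen

    SERVERS = ["http://10.0.1.10", "http://10.0.1.11"]

    def working_hours(now):
        return now.weekday() < 5 and 8 <= now.hour < 18

    while True:
        if working_hours(datetime.now()):
            try:
                urlopen(random.choice(SERVERS), timeout=5).read()
            except OSError:
                pass                       # servers may be busy or rebooting
        time.sleep(random.expovariate(1 / 30))   # ~one request every 30 s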
5 Related Work
Work on network based anomaly detection methods for intrusion and insider threat
detection can be separated into packet-based and flow-based anomaly detection.
A comprehensive review of both methods is given in Bhuyan et al. [3]. A recent
survey of data mining and machine learning methods for cyber security intrusion
detection is published by Buczak and Guven [4]. Further, Weller-Fahy et al. [61]
published an overview of similarity measures which are used for anomaly based
network intrusion detection.
Since the proposed approach is based on flow-based data, the following review
does not consider packet-based methods. We categorize flow-based anomaly detec-
tion methods into (I) treating each flow separately, (II) aggregating all flows over
time windows and (III) aggregating flows of single hosts over time windows.
Category I
Winter et al. [63] propose a flow-based anomaly detection method of category (I).
The authors use a One-Class SVM and train their system with malicious flows
instead of benign flows since data mining methods are better at finding similarities
than outliers. For learning the One-Class SVM, the honeypot data set of [53] is
used. During the evaluation phase, each flow within the class is considered as
malicious and each outlier is considered as normal behaviour. Another approach
of this category is proposed by Tran et al. [58]. The basis of their system is a
block-based neural network (BBNN) integrated within an FPGA. They extract
four attributes (Packets, Bytes, Duration and Flags) from each flow as input for
their IDS. The authors compared their system against SVM and Naive Bayes
classifiers and outperformed both in an experimental evaluation. Najafabadi et al.
[35] use four different classification algorithms for SSH Brute Force detection.
The authors selected eight attributes from the flow-based data and were able to
detect the attacks. The detection of RUDY attacks using classification algorithms
is studied by Najafabadi et al. [36]. RUDY is an application layer DoS attack which
generates much less traffic than traditional DoS attacks. The authors use the enriched
flow-based SANTA dataset [62] for evaluation. The flows in this data set contain
additional attributes which are calculated based on full packet captures.
Category II
Approaches from category (II) aggregate all flows within a certain time window.
Wagner et al. [60] developed a special kernel function for anomaly detection. The
authors divide the data stream into equally sized time windows and consider each time
window as a data point for their kernel. The kernel function takes information about
the Source (Destination) IP Address and the transferred Bytes of all flows within
the time window. The authors integrate their kernel function into a One-Class SVM
and evaluate their approach in the context of an internet service provider (ISP). An
entropy based anomaly detection approach is presented in [38]. Here, the authors
divide the data stream into 5-min intervals and calculate for each interval seven
different distributions considering flow-header attributes and behavioural attributes.
Based on these distributions, entropy values are calculated which are used for
anomaly detection.
Category III
Approaches from category (III) use more preprocessing algorithms in the data
mining workflow and do not work directly on flow-based data. These approaches
aggregate for each host the flows over a time window and calculate new attributes
based on these aggregations. BClus [16] uses this approach for behavioural-based
botnet detection. At first, they divide the flow data stream in time windows. Then,
flows are aggregated by Source IP Address for each time window. For each aggre-
gation, new attributes (e.g. amount of unique destination IP addresses contacted by
this Source IP Address) are calculated and used for further analysis. The authors
evaluate their botnet detection approach using the CTU-13 Malware data set.
Another representative of this category is proposed by Najafabadi et al. [34]. The
authors aggregate all NetFlows with the same Source IP Address, Destination IP
Address and Destination Port in 5 min intervals. Based on these aggregations, new
attributes are calculated like the average transmitted bytes or the standard deviation
of the transmitted bytes. Then, Najafabadi et al. [34] train different classifiers and
use them for the detection of SSH Brute Force attacks.
Besides these three categories, we also want to mention the Apache Spot5
framework. Apache Spot is an open source framework for analysing packet- and
flow based network traffic on Hadoop. This framework allows to use machine
learning algorithms for identifying malicious network traffic. Another network
based anomaly detection system is proposed by Rehak et al. [42, 43]. Their system
Camnep uses various anomaly detectors and combines their results to decide if the
network traffic is normal or malicious. The individual filters use direct attributes
from NetFlow and additional context attributes. For calculating these attributes, the
anomaly detectors can access all flows of a 5 min window.
5 http://open-network-insight.org/.
Our proposed approach neither handles each flow separately (category I) nor
simply aggregates all flows within a time window (category II). Instead, it follows
the approach of the third category and collects all flows for each host within a
time window. However, in contrast to the third category, the proposed approach
generates more than one data point for each collection. The generation of multiple
data points for each collection allows us to calculate more adapted data points which
describe the network, service and user behaviour of the hosts. Further, we do not try
to recognize all attack types with a single classifier like [60] or [63]. Instead, we
analyse the calculated data points from different views and develop for each view a
separate detection engine.
Future activities of our research are directed towards refining the flow-based
analysis approach and providing appropriate visualization tools for data streams, e.g.
by extending well-known visualization approaches such as parallel coordinates to
data streams. In addition, the simulation environment needs to be expanded to allow
the generation of even more realistic data sets as a basis to validate and refine our
anomaly-based intrusion and insider threat detection approach more thoroughly.
Acknowledgements This work is funded by the Bavarian Ministry for Economic Affairs through
the WISENT project (grant no. IUK 452/002).
References
1. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving data streams.
In: International Conference on very large data bases (VLDB), pp. 81–92. Morgan Kaufmann
(2003)
2. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high
dimensional data for data mining applications. In: International Conference on Management
of Data, pp. 94–105. ACM Press (1998)
3. Bhuyan, M.H., Bhattacharyya, D.K., Kalita, J.K.: Network anomaly detection: Methods,
systems and tools. IEEE Communications Surveys & Tutorials 16(1), 303–336 (2014)
4. Buczak, A.L., Guven, E.: A survey of data mining and machine learning methods for cyber
security intrusion detection. IEEE Communications Surveys & Tutorials 18(2), 1153–1176
(2016)
5. Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Communications in
Statistics-theory and Methods 3(1), 1–27 (1974)
6. Cao, F., Ester, M., Qian, W., Zhou, A.: Density-based clustering over an evolving data stream
with noise. In: SIAM International Conference on Data Mining (SDM), vol. 6, pp. 328–339.
Society for Industrial and Applied Mathematics (2006)
7. Chae, H.S., Jo, B.O., Choi, S.H., Park, T.: Feature selection for intrusion detection using NSL-
KDD. Recent Advances in Computer Science, pp. 978–960 (2015)
8. Chen, E.Y.: Detecting DoS attacks on SIP systems. In: IEEE Workshop on VoIP Management
and Security, pp. 53–58. IEEE (2006)
9. Chou, C.H., Su, M.C., Lai, E.: A new cluster validity measure and its application to image
compression. Pattern Analysis and Applications 7(2), 205–220 (2004)
10. Claise, B.: Cisco Systems NetFlow services export version 9. RFC 3954 (2004)
11. Claise, B.: Specification of the IP flow information export (IPFIX) protocol for the exchange
of IP traffic flow information. RFC 5101 (2008)
12. Datti, R., Verma, B.: Feature reduction for intrusion detection using linear discriminant
analysis. International Journal on Engineering Science and Technology 1(2) (2010)
13. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE transactions on pattern
analysis and machine intelligence 1(2), 224–227 (1979)
14. Depren, O., Topallar, M., Anarim, E., Ciliz, M.K.: An intelligent intrusion detection system
(IDS) for anomaly and misuse detection in computer networks. Expert systems with
Applications 29(4), 713–722 (2005)
15. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for
classification learning. In: International Joint Conference on Artificial Intelligence (IJCAI),
pp. 1022–1029. Morgan Kaufmann (1993)
16. Garcia, S., Grill, M., Stiborek, J., Zunino, A.: An empirical comparison of botnet detection
methods. Computers & Security 45, 100–123 (2014)
17. Gharibian, F., Ghorbani, A.A.: Comparative study of supervised machine learning techniques
for intrusion detection. In: Annual Conference on Communication Networks and Services
Research (CNSR’07), pp. 350–358. IEEE (2007)
18. Giacinto, G., Perdisci, R., Del Rio, M., Roli, F.: Intrusion detection in computer networks by a
modular ensemble of one-class classifiers. Information Fusion 9(1), 69–82 (2008)
19. Goseva-Popstojanova, K., Anastasovski, G., Pantev, R.: Using multiclass machine learning
methods to classify malicious behaviors aimed at web systems. In: International Symposium
on Software Reliability Engineering, pp. 81–90. IEEE (2012)
20. Guha, S., Rastogi, R., Shim, K.: Rock: A robust clustering algorithm for categorical attributes.
In: International Conference on Data Engineering, pp. 512–521. IEEE (1999)
21. Hassani, M., Seidl, T.: Internal clustering evaluation of data streams. In: Trends and
Applications in Knowledge Discovery and Data Mining, pp. 198–209. Springer (2015)
22. Hellemons, L., Hendriks, L., Hofstede, R., Sperotto, A., Sadre, R., Pras, A.: SSHCure: a flow-
based SSH intrusion detection system. In: IFIP International Conference on Autonomous
Infrastructure, Management and Security, pp. 86–97. Springer (2012)
23. John, W., Dusi, M., Claffy, K.C.: Estimating routing symmetry on single links by passive
flow measurements. In: International Wireless Communications and Mobile Computing
Conference, pp. 473–478. ACM (2010)
24. Jung, J., Paxson, V., Berger, A.W., Balakrishnan, H.: Fast portscan detection using sequential
hypothesis testing. In: IEEE Symposium on Security and Privacy, pp. 211–225. IEEE (2004)
25. Kang, D.K., Fuller, D., Honavar, V.: Learning classifiers for misuse and anomaly detection
using a bag of system calls representation. In: Annual IEEE SMC Information Assurance
Workshop, pp. 118–125. IEEE (2005)
26. Kendall, K.: A database of computer attacks for the evaluation of intrusion detection systems.
Tech. rep., DTIC Document (1999)
27. Landes, D., Otto, F., Schumann, S., Schlottke, F.: Identifying suspicious activities in company
networks through data mining and visualization. In: P. Rausch, A.F. Sheta, A. Ayesh (eds.)
Business Intelligence and Performance Management, pp. 75–90. Springer (2013)
28. Lee, C.H.: A Hellinger-based discretization method for numeric attributes in classification
learning. Knowledge-Based Systems 20(4), 419–425 (2007)
29. Lin, J., Lin, H.: A density-based clustering over evolving heterogeneous data stream. In: ISECS
International Colloquium on Computing, Communication, Control, and Management, vol. 4,
pp. 275–277. IEEE (2009)
30. Liu, Q., Dong, G.: CPCQ: Contrast pattern based clustering quality index for categorical data.
Pattern Recognition 45(4), 1739–1748 (2012)
31. Małowidzki, M., Berezinski, P., Mazur, M.: Network intrusion detection: Half a kingdom for a
good dataset. In: NATO STO SAS-139 Workshop, Portugal (2015)
32. Moore, A.W., Zuev, D.: Internet traffic classification using Bayesian analysis techniques. In:
ACM SIGMETRICS International Conference on Measurement and Modeling of Computer
Systems, pp. 50–60. ACM, New York, USA (2005)
33. Murtagh, F., Contreras, P.: Algorithms for hierarchical clustering: an overview. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(1), 86–97 (2012)
34. Najafabadi, M.M., Khoshgoftaar, T.M., Calvert, C., Kemp, C.: Detection of SSH brute force
attacks using aggregated netflow data. In: International Conference on Machine Learning and
Applications (ICMLA), pp. 283–288. IEEE (2015)
35. Najafabadi, M.M., Khoshgoftaar, T.M., Kemp, C., Seliya, N., Zuech, R.: Machine learning
for detecting brute force attacks at the network level. In: International Conference on
Bioinformatics and Bioengineering (BIBE), pp. 379–385. IEEE (2014)
36. Najafabadi, M.M., Khoshgoftaar, T.M., Napolitano, A., Wheelus, C.: RUDY attack: Detection
at the network level and its important features. In: International Florida Artificial Intelligence
Research Society Conference (FLAIRS), pp. 288–293 (2016)
37. Nguyen, T.T., Armitage, G.: A survey of techniques for internet traffic classification using
machine learning. IEEE Communications Surveys & Tutorials 10(4), 56–76 (2008)
38. Nychis, G., Sekar, V., Andersen, D.G., Kim, H., Zhang, H.: An empirical evaluation of entropy-
based traffic anomaly detection. In: ACM SIGCOMM Conference on Internet measurement,
pp. 151–156. ACM (2008)
39. Phaal, P., Panchen, S., McKee, N.: InMon Corporation’s sFlow: A Method for Monitoring
Traffic in Switched and Routed Networks. RFC 3176 (2001)
40. Pramana, M.I.W., Purwanto, Y., Suratman, F.Y.: DDoS detection using modified k-means
clustering with chain initialization over landmark window. In: International Conference on
Control, Electronics, Renewable Energy and Communications (ICCEREC), pp. 7–11 (2015)
41. Rampure, V., Tiwari, A.: A rough set based feature selection on KDD CUP 99 data set.
International Journal of Database Theory and Application 8(1), 149–156 (2015)
42. Rehák, M., Pechoucek, M., Bartos, K., Grill, M., Celeda, P., Krmicek, V.: Camnep: An
intrusion detection system for high-speed networks. Progress in Informatics 5(5), 65–74 (2008)
43. Rehák, M., Pechoucek, M., Grill, M., Stiborek, J., Bartoš, K., Celeda, P.: Adaptive multiagent
system for network traffic monitoring. IEEE Intelligent Systems 24(3), 16–25 (2009)
44. Ring, M., Otto, F., Becker, M., Niebler, T., Landes, D., Hotho, A.: Condist: A context-driven
categorical distance measure. In: European Conference on Machine Learning and Knowledge
Discovery in Databases, pp. 251–266. Springer (2015)
45. Ring, M., Wunderlich, S., Grüdl, D., Landes, D., Hotho, A.: Flow-based benchmark data sets
for intrusion detection. In: Proceedings of the 16th European Conference on Cyber Warfare
and Security (ECCWS). ACPI (2017, to appear)
46. Rossi, D., Valenti, S.: Fine-grained traffic classification with netflow data. In: International
wireless communications and mobile computing conference, pp. 479–483. ACM (2010)
47. Rostamipour, M., Sadeghiyan, B.: An architecture for host-based intrusion detection systems
using fuzzy logic. Journal of Network and Information Security 2(2) (2015)
48. Shah, V.M., Agarwal, A.: Reliable alert fusion of multiple intrusion detection systems.
International Journal of Network Security 19(2), 182–192 (2017)
49. Shearer, C.: The CRISP-DM model: the new blueprint for data mining. Journal of data
warehousing 5(4), 13–22 (2000)
50. Shiravi, A., Shiravi, H., Tavallaee, M., Ghorbani, A.A.: Toward developing a systematic
approach to generate benchmark datasets for intrusion detection. Computers & Security 31(3),
357–374 (2012)
51. Skoudis, E., Liston, T.: Counter Hack Reloaded: A Step-by-step Guide to Computer Attacks
and Effective Defenses. Prentice Hall Series in Computer Networking and Distributed Systems.
Prentice Hall Professional Technical Reference (2006)
52. Sommer, R., Paxson, V.: Outside the closed world: On using machine learning for network
intrusion detection. In: IEEE Symposium on Security and Privacy, pp. 305–316. IEEE (2010)
53. Sperotto, A., Sadre, R., Van Vliet, F., Pras, A.: A labeled data set for flow-based intrusion
detection. In: IP Operations and Management, pp. 39–50. Springer (2009)
54. Sridharan, A., Ye, T., Bhattacharyya, S.: Connectionless port scan detection on the backbone.
In: IEEE International Performance Computing and Communications Conference, 10 pp.
IEEE (2006)
55. Staniford, S., Hoagland, J.A., McAlerney, J.M.: Practical automated detection of stealthy
portscans. Journal of Computer Security 10(1-2), 105–136 (2002)
56. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining, (First Edition). Addison-
Wesley Longman Publishing Co., Inc., Boston, MA, USA (2005)
57. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP
99 data set. In: IEEE Symposium on Computational Intelligence for Security and Defense
Applications, pp. 1–6 (2009)
58. Tran, Q.A., Jiang, F., Hu, J.: A real-time netflow-based intrusion detection system with
improved BBNN and high-frequency field programmable gate arrays. In: International
Conference on Trust, Security and Privacy in Computing and Communications, pp. 201–208.
IEEE (2012)
59. Valenti, S., Rossi, D., Dainotti, A., Pescapè, A., Finamore, A., Mellia, M.: Reviewing traffic
classification. In: Data Traffic Monitoring and Analysis, pp. 123–147. Springer (2013)
60. Wagner, C., François, J., Engel, T., et al.: Machine learning approach for ip-flow record
anomaly detection. In: International Conference on Research in Networking, pp. 28–39.
Springer (2011)
61. Weller-Fahy, D.J., Borghetti, B.J., Sodemann, A.A.: A survey of distance and similarity
measures used within network intrusion anomaly detection. IEEE Communications Surveys &
Tutorials 17(1), 70–91 (2015)
62. Wheelus, C., Khoshgoftaar, T.M., Zuech, R., Najafabadi, M.M.: A session based approach
for aggregating network traffic data - the SANTA dataset. In: International Conference on
Bioinformatics and Bioengineering (BIBE), pp. 369–378. IEEE (2014)
63. Winter, P., Hermann, E., Zeilinger, M.: Inductive intrusion detection in flow-based network data
using one-class support vector machines. In: International Conference on New Technologies,
Mobility and Security (NTMS), pp. 1–5. IEEE (2011)
64. Yang, C., Zhou, J.: HClustream: A novel approach for clustering evolving heterogeneous data
stream. In: International Conference on Data Mining Workshops (ICDMW'06), pp. 682–688.
IEEE (2006)
65. Zander, S., Nguyen, T., Armitage, G.: Automated traffic classification and application identifi-
cation using machine learning. In: The IEEE Conference on Local Computer Networks 30th
Anniversary (LCN'05), pp. 250–257. IEEE (2005)
66. Zuech, R., Khoshgoftaar, T.M., Seliya, N., Najafabadi, M.M., Kemp, C.: A new intrusion
detection benchmarking system. In: International Florida Artificial Intelligence Research
Society Conference (FLAIRS), pp. 252–256. AAAI Press (2015)
Human-Machine Decision Support Systems
for Insider Threat Detection
Philip A. Legg
Abstract Insider threats are recognised to be quite possibly the most damaging
attacks that an organisation could experience. Those on the inside, who have
privileged access and knowledge, are already in a position of great responsibility
for contributing towards the security and operations of the organisation. Should
an individual choose to exploit this privilege, perhaps due to disgruntlement or
external coercion from a competitor, then the potential impact to the organisation
can be extremely damaging. There have been many proposals for using machine
learning and anomaly detection techniques as a means of automated decision-making
about which insiders are acting in a suspicious or malicious manner, as a form of
large-scale data analytics. However, it is well recognised that this poses many
challenges, for example, how do we capture an accurate representation of normality
to assess insiders against, within a dynamic and ever-changing organisation? More
recently, there has been interest in how visual analytics can be incorporated with
machine-based approaches, to alleviate the data analytics challenges of anomaly
detection and to support human reasoning through visual interactive interfaces.
Furthermore, by combining visual analytics and active machine learning, there is
potential capability for the analysts to impart their domain expert knowledge back
to the system, so as to iteratively improve the machine-based decisions based on
the human analyst preferences. With this combined human-machine approach to
decision-making about potential threats, the system can begin to more accurately
capture human rationale for the decision process, and reduce the false positives
that are flagged by the system. In this work, I reflect on the challenges of insider
threat detection, and consider how human-machine decision support systems can
offer solutions to this problem.
1 Introduction
It is often said that for any organisation, “employees are the greatest asset, and
yet also the greatest threat”. The challenge of how to address this insider threat
is one that is of increasing concern for many organisations. In particular, as our
modern world rapidly evolves, so too do the ways in which we conduct business
and manage organisations, and so too do the ways in which those who choose to attack
can do so, and succeed. In recent times there have been many high-profile cases,
including Edward Snowden [1], Bradley Manning [2], and Robert Hanssen [3].
According to the 2011 CyberSecurity Watch Survey [4], whilst 58% of cyber-attacks
on organisations are attributed to outside threats, 21% of attacks are initiated by their
own employees or trusted third parties. The Kroll 2012 Global Fraud Survey [5]
reports that 60% of frauds are committed by insiders, up from 55% in the
previous year. Likewise, the 2012 Cybercrime report by PwC [6] states that the
most serious fraud cases were committed by insiders. Of course, in all of these
cases, these figures may not truly reflect the severity of the problem given that there
are most likely many more that are either not detected, or not reported publicly. In
defining what an 'insider' is, it is often agreed that this is somebody who, compared
to an outsider, has some level of knowledge of, and some level of access to, an
organisation. Whilst employees are often considered to be the main focal
point as insiders, by this definition there may be many others, such as contractors,
stakeholders, former employees, and management, who could also be considered
insiders.
Insider threat research has attracted a significant amount of attention in the
literature due to the severity of the problem within many organisations. Back in
2000, early workshops on insider threat highlighted the many different research
challenges surrounding the topic [7]. Since then, there have been a number of
proposals to address these challenges. For example, Greitzer et al. [8] discuss strate-
gies for combating the insider-threat problem, including raising staff awareness and
more effective methods for identifying potential risks. In their work, they define an
insider to be an individual who currently, or at one time, was authorised to access
an organisation’s information system, data, or network. Likewise, they refer to an
insider threat as a harmful act that trusted insiders might carry out, such as causing
harm to an organisation, or an unauthorised act that benefits the individual. Carnegie
Mellon University has conducted much foundational work surrounding the insider-
threat problem as part of their CERT (Computer Emergency Response Team),
resulting in over 700 case-studies that detail technical, behavioural, and organisa-
tional details of insider crimes [9]. They define a malicious insider to be a current
or former employee, contractor, or other business partner who has or had authorized
access to an organisation’s network, system, or data and intentionally exceeded
or misused that access in a manner that negatively affected the confidentiality,
integrity, or availability of the organisation’s information or information systems.
Spitzner [10] discusses early research on insider-threat detection using honeypots
(decoy machines that may lure an attack). However, as security awareness increases,
those choosing to commit insider attacks are finding more subtle methods to cause
harm or defraud their organisations, and so there is a need for more sophisticated
prevention and detection.
In this chapter, I discuss and reflect on my recent research that addresses the
issues that surround insider threat detection. Some of this work has been previously
published in various journals and conference venues. The contribution that this
chapter serves is to bring together previous work on developing automated machine-
based detection tools, and to reconsider the problem of insider threat detection with
regards to how the human and the machine can work in tandem to identify malicious
activity. Neither the human alone, nor the machine alone, is sufficient to address the
problem in a satisfactory manner.
2 Related Works
There are a variety of published works on the topic of insider threat detection that
range from theoretical frameworks for representing the problem domain, through
to practical implementations of detection systems. As a research area, it is multi-
disciplinary in nature, including computational design of detection algorithms,
human behavioural modelling, business operations management, and ethical and
legal implications of insider surveillance.
Legg et al. propose a conceptual model that can help organisations to begin thinking
about how to detect and prevent insider attacks [11]. The model is based on a
tiered approach that relates real-world activity, measurement of the activity, and
hypotheses about the current threat. The model is designed to capture a broad
range of attributes related to insider activity that could be characterised by some
means. The tiered approach aims to address how multiple attributes from the real-
world tier can contribute towards the collection of measurements that may prove
useful for forming hypotheses (e.g., heavy workload, working late, and a developing
disagreement with higher management, could result in a possible threat of sabotage).
Nurse et al. [12] also propose a framework, this time for characterising insider threat
activity. The framework is designed to help an analyst identify the various traits that
surround insider threats, including the precipitating events that then motivate an
attacker, and the identification of resources and assets that may be exploited as part
of an attack. By considering these attributes, analysts may be able to ensure a full
and comprehensive security coverage in their organisation.
Maybury et al. [13] developed a taxonomy for the analysis and detection of
insider threat that goes beyond only cyber actions, to also incorporate such measures
as physical access, violations, finances and social activity. Similarly, Colwill [14]
examines the human factors surrounding insider threat in the context of a large
telecommunications organisation, remarking that greater education and awareness
of the problem is required, whilst Greitzer et al. [15] focus on incorporating inferred
psychological factors into a modelling framework. The work by Brdiczka et al. [16]
combines such psychological profiling with structural anomaly detection, to develop
an architecture for insider-threat detection that demonstrates much potential for
solving the problem.
In terms of measuring behaviours that may indicate a threat, Roy et al. [17]
propose a series of metrics that could be used based on technical and behavioural
observations. Schultz [18] presents a framework for prediction and detection of
insider attacks. He acknowledges that no single behavioural clue is sufficient to
detect insider threat, and so suggests using a mathematical representation of multiple
indicators, each with a weighted contribution. Althebyan and Panda [19] present
a model for insider-threat prediction based on the insider's knowledge and the
dependency of objects within the organisation. In the work of Sasaki [20], a trigger
event is used to identify a change of behaviour that impels an insider to act in a
particular way (for instance, if the organisation announces an inspection, an insider
threat may begin deleting their tracks and other data records).
Bishop et al. [21] discuss the insider-threat problem, and note that the term
insider threat is ill-defined, and rightly recognise that there should be a degree
of “insiderness” rather than a simple binary classification of insider threat or not.
They propose the Attribute-Based Group Access Control (ABGAC) model, as a
generalisation of role-based access control, and show its application to three case
studies [22]: embezzlement, social engineering, and password alteration. Other
work such as Doss and Tejay [23] propose a model for insider-threat detection
that consists of four stages: monitoring, threat assessment, insider evaluation
and remediation. Liu et al. [24] propose a multilevel framework called SIDD
(Sensitive Information Dissemination Detection) that incorporates network-level
application identification, content signature generation and detection, and covert
communication detection. More recently, Bishop et al. [25] extend their work to
examine process modelling as a means for detecting insider attacks.
Agrafiotis et al. [26] explore the sequential nature of behavioural analysis for insider
threat detection. The sequence of events is a critical aspect of analysis, since a single
event in isolation may not be deemed as a threat, and yet in conjunction with other
events, this may have much greater significance. As an example, an employee who is
accessing sensitive company records would be of more concern if they had recently
been in contact with a rival organisation, compared to an employee who may be
acting as part of their job role requirement. They extend the work on sequential
analysis in [27], where this scheme is then applied to characterise a variety of insider
threat case studies that have been collated by the Carnegie Mellon University CERT.
Elmrabit et al. [28] study the categories and approaches of insider threat. They
categorise different types of insider attack (e.g., sabotage, fraud, IP theft) against
the CIA security principles (confidentiality, integrity, availability), and also against
human factors (motive, opportunity, capability). They discuss a variety of tools in
the context of insider threat detection, such as intrusion detection systems, honey-
tokens, access control systems, and security information and event management
systems. They also highlight the importance of psychological prediction models,
and security education and awareness, both of which are required by organisations
in order to tackle the insider threat problem effectively. It is clear that technical
measures alone are not sufficient, and that 'security as a culture' should be practised
by organisations wishing to address this issue successfully.
Parveen et al. [29] use stream mining and graph mining to detect insider activity
in large volumes of streaming data, based on ensemble-based methods, unsuper-
vised learning and graph-based anomaly detection. Building on this, Parveen and
Thuraisingham [30] propose an incremental learning algorithm for insider threat
detection that is based on maintaining repetitive sequences of events. They use trace
files collected from real users of the Unix C shell; however, this public dataset is
now relatively dated. Buford et al. [31] use situation-aware multi-agent systems as
part of a distributed architecture for insider threat detection. Garfinkel et al. [32]
propose tools for media forensics as a means of detecting insider threat behaviour.
Eldardiry et al. [33] also propose a system for insider threat detection based
on feature extraction from user activities, although they do not consider role-based
assessments as part of their system. Senator et al. [34] propose to combine structural
and semantic information on user behaviour to develop a real-world detection
system. They use a real corporate database, gathered as part of the Anomaly Detection
at Multiple Scales (ADAMS) program; however, due to confidentiality they cannot
disclose the full details, and so it is difficult to compare against their work.
McGough et al. [35] propose a 'beneficial' intelligent software system (Ben-ware)
for insider threat detection, based on anomaly detection against a user profile and
their job role profile. Their approach also aims to incorporate human resources
information, for which they describe a 'five states of happiness' approach to assess
the likelihood that a user may pose a threat. Nguyen and Reiher [36] propose a detection tool for
insider threat that monitors system call activity for unusual or suspicious behaviour.
Maloof and Stephens [37] propose a detection tool for when insiders violate need-
to-know restrictions that are in place within the organisation. Okolica et al. [38] use
Probabilistic Latent Semantic Indexing with Users to determine employee interests,
which are used to form social graphs that can highlight insiders.
With regards to insider threat visualization, the technical report by Harris [39]
discusses some of the issues related to visualizing insider threat activity. Nance
and Marty [40] propose using bipartite graphs to identify and visualize insider
threat activity where the nodes in the graph represent two distinct groups, such
as user nodes and activity nodes, and the edges represent that a particular user has
performed a particular activity. This approach is best suited for comparative analysis
once a small group of users and activities have been identified, as scalability issues
would soon arise in most real-world analysis tasks. Stoffel et al. [41] propose a
visual analytics application for identifying correlations between different networked
devices, based on time-series anomaly detection and similarity models. They focus
primarily on the network traffic level, and so do not currently consider other
attributes related to insider threat, such as file storage systems and USB-connected
devices. Kintzel et al. [42] use scalable glyph-based visualization using a clock
metaphor to present an overview of the activity over time of thousands of hosts
on a network. Zhao et al. [43] examine anomaly detection for social media data
and present their visualization tool FluxFlow. Again, they make use of the clock
metaphor as part of their visualization, which they combine with scaled circular
glyphs to represent anomalous data points. Walton et al. [44] proposed QCATs
(Multiple Queries with Conditional Attributes) as a technique for understanding and
visualizing conditional probabilities in the context of anomaly detection.
From the literature it is clear that the topic of insider threat has been
extensively studied from a variety of viewpoints. A number of models have been put
forward for how one could observe and detect signs that relate to whether an insider
is posing a threat, or has indeed already attacked. Likewise, a number of detection
techniques have been proposed. However, it is difficult to assess their true value
when some only consider a sub-set of activities, or do not provide validation in a
real-world context. In the following sections, I discuss work that has been conducted
in recent years on insider threat detection by colleagues and myself. In particular, I
address both machine-based and human-based approaches for decision-making on
the current threat posed by an individual. As part of this, I also describe the real-
world validation study of the machine-driven decision process that was performed,
and an active learning approach for combining human-machine decision-making
using visual analytic tools. These contributions set the work apart from the wider
body of research that exists on insider threat detection, by supporting both human
and machine in the process of identifying malicious insiders.
Instead, there is a need for the machine to make a well-informed decision about
the threat posed by an individual, based on their observed activity and how this
differs from what is deemed normal behaviour.
In the paper by Legg et al. [45], “Automated Insider Threat Detection System using
User and Role-based Profile Assessment”, an insider threat detection system is
proposed that is capable of identifying anomalous activity of users, in comparison
both to their previous activity and to that of their peers. The detection tool is
based upon the underlying principles of the conceptual model proposed in [11].
The paper demonstrates the detection tool using publicly-available insider threat
datasets provided by Carnegie Mellon University CERT, along with ten synthetic
scenarios that were generated by an independent team within the Oxford Cyber
Security group. In the work, the requirements of the detection system are given as
follows:
– The system should be able to determine a score for each user that relates to the
threat that they currently pose.
– The system should be able to deal with various forms of insider threat, including
sabotage, intellectual property theft, and data fraud.
– The system should also be able to deal with unknown cases of insider threat,
whereby the threat is deemed to be an anomaly for that user and for that role.
– The system should assess the threat that an individual poses based on how this
behaviour deviates from both their own previous behaviour, and the behaviour
exhibited by those in a similar job role.
The system comprises five key components: data input streams, user and role-
based profiling, feature extraction, threat assessment, and classification of threat.
From the data streams that were available for the CMU-CERT scenarios, and for
those developed by the Oxford team, the data typically represented the actions of
1000 employees over the period of 12 months, with data that captured login and
logout information for PC workstations, USB device insertion and removal, file
access, http access, and e-mail communications. Each user also has an assigned
job role (e.g., technician, receptionist, or director), where those in a similar role are
expected to share some commonality in their behaviour. The first stage of the system
is to connect to the available data streams, and to receive data from each stream in
the correct time sequence as given by the timestamp of each activity.
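To make the time-ordering step concrete, the following is a minimal Python sketch of how several per-source event streams might be replayed as a single, globally time-ordered stream; the Event fields here are illustrative assumptions, not the actual schema of the system.

import heapq
from typing import Iterable, Iterator, NamedTuple

class Event(NamedTuple):
    timestamp: str  # e.g. "2010-03-20 08:10:12"
    user: str
    device: str
    activity: str   # e.g. "logon", "usb_insert", "file_access"

def merged_stream(streams: Iterable[Iterator[Event]]) -> Iterator[Event]:
    # Each input stream is assumed to be sorted by timestamp already;
    # heapq.merge then yields all events in global time order.
    return heapq.merge(*streams, key=lambda e: e.timestamp)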
As data is received, this is utilised to populate a profile that represents each
individual user, as well as a combined profile that represents a single role. The
profiles are constructed in a consistent hierarchical fashion that denotes the devices
that have been accessed by the user, the actions performed on each of these devices,
and the attributes associated with these actions. At each of these nodes in the profile,
a time-series is constructed that denotes the occurrence of observations over a 24-h
period. Figure 1 shows an interactive tree view of an individual user's profile.
Fig. 1 Tree-structured profiles of user and role behaviours. The root node is the user ID, followed
by sub-branches for 'daily', 'normal', and 'attack' observations. The next level down shows
devices used, then activities performed, and finally, attributes for those activities. The probability
distribution for normal hourly usage is given in the top-right, and the distribution for the detected
attack is given in the bottom-right. Here it can be seen that the user has accessed a new set of file
resources late at night
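As an illustration of this structure, the following minimal Python sketch (an assumption about implementation, not the published system) builds a nested device, activity, and attribute profile, where each leaf holds a 24-bin histogram of hourly observation counts.

from collections import defaultdict

def new_profile():
    # device -> activity -> attribute -> 24-bin hourly observation counts
    return defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: [0] * 24)))

def observe(profile, device, activity, attribute, hour):
    # Record one observation at the given hour of day (0-23)
    profile[device][activity][attribute][hour] += 1

profile = new_profile()
observe(profile, "PC-01", "file_access", "/projects/report.doc", 23)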
Once the system has computed the current daily profile for each user and for
each role, the system can then extract features from the profile. Since the profile
structure is consistent and well-defined, it means that comparisons between users,
roles, or time steps can be easily made. In particular, the feature set consists of three
main categories: the user's daily observations, comparisons between the user's daily
activity and their previous activity, and comparisons between the user’s daily activity
and the previous activity of their role. The full set of features that are computed for
each user is provided in [45]. These features include a variety of measurements
that can be derived from the profiles, such as New device for user, New attribute
for activity for device for role, Hourly usage count for activity, USB duration for
user, and Earliest logon time for user. This set of features is intended to be widely
applicable to most organisations, although of course, there may be more bespoke
features relevant to specific organisations that could also be incorporated.
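By way of illustration, a small subset of such features could be computed as in the hedged sketch below, where profiles are flattened to a map from device to hourly counts; the feature names echo those above, but the implementation itself is an assumption.

def daily_features(today, user_hist, role_hist):
    # Each argument maps device -> list of 24 hourly observation counts.
    return {
        # 1. the user's daily observations
        "daily_total": sum(sum(hours) for hours in today.values()),
        # 2. comparison with the user's own previous activity
        "new_device_for_user": len(set(today) - set(user_hist)),
        # 3. comparison with the previous activity of the user's role
        "new_device_for_role": len(set(today) - set(role_hist)),
    }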
To perform the threat assessment, the system aims to identify variance between
related features that may be indicative of a particular anomaly. This is performed
using Principal Component Analysis (PCA) [46]. PCA performs a projection of the
features into a lower-dimensional space based on the amount of variance exhibited
by each feature. From the user profiles, an n × m matrix is constructed for each
user, where n is the total number of sessions (or days) being considered, and m
is the number of features that have been obtained from the profile. The bottom
row of the matrix represents the current daily observation, with the remainder of
the matrix being all previous observation features. Essentially, this process reduces
the m-dimensional feature space to m − 1 dimensions, along the vector of greatest
variance through the data; by performing this successively, we can reduce to 2 or 3
dimensions. Similar instances would be expected to group together, whilst instances
that exhibit significant variation would appear far from other points in the space,
where each point represents a single user on a single day. The system performs
PCA using a variety of different feature combinations that relate to a particular
area of concern (e.g., web activity). Figure 2 shows the PCA decomposition for a
detected insider threat. It can be seen that most of the activity clusters towards
the centre; however, over time, there are activities that diverge from this cluster,
representing daily observations where the user has performed significantly differently.
By considering the Euclidean distance of points from the centroid of the cluster, or
from the point given by the role average, a measure of anomaly or deviation can be
obtained for a given observation.
Fig. 2 Example of using PCA for assessing deviation in user activity. Each point represents a
single user for a single day (observation instance). Here, only a single user over time is shown
to preserve clarity. The majority of points form a cluster at the centre of the plot. There are five
observations that begin to move away from the general cluster. At the far right is a point observed
on 20th March 2010, which exhibits the most deviation in the user's behaviour
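A minimal sketch of this scoring step, assuming scikit-learn and the n × m matrix described above, might look as follows; the two-component projection and the Euclidean distance mirror the description, while the parameter choices are illustrative.

import numpy as np
from sklearn.decomposition import PCA

def anomaly_score(history, today):
    # history: (n-1) x m matrix of past daily feature vectors
    # today:   length-m feature vector for the current day
    X = np.vstack([history, today])           # the n x m matrix
    Z = PCA(n_components=2).fit_transform(X)  # project to 2 dimensions
    centroid = Z.mean(axis=0)
    return float(np.linalg.norm(Z[-1] - centroid))  # deviation of today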
The score for each anomaly metric can then be analysed, for each user, for each
day (e.g., file_anomaly, total_anomaly, role_anomaly). A parallel co-ordinate plot
is used (Fig. 3), where each polyline shows a single user for a single day, against
the various anomaly metrics (where each axis is a separate anomaly metric). In
the example shown in Fig. 3, there is an observation that appears separate on the
any_anomaly metric (this relates to activity that has been observed on any device,
rather than just the specific device on which it was observed). By brushing the
axis, the analyst can filter the view to show only this result. This reveals activity
performed by a particular user of interest, who was found to be the malicious insider
Fig. 3 Parallel Coordinates view to show the corresponding profile features. An interactive table
below the parallel co-ordinates view shows a numerical data view of the profile features that have
been selected. Here, a particular user scores significantly higher than other users on one metric.
Interactive brushing allows this to be examined in further detail
in the set of 1000 employees, accessing systems and using a USB storage
device early in the morning. This behaviour differed from that of the other users in
the role of ‘Director’, who did not use USB storage devices, and very rarely used
systems at this time of day.
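Such a view can be reproduced in a few lines; the sketch below, with invented users and metric values, uses pandas' built-in parallel-coordinates plot (the interactive brushing of the actual tool is beyond this illustration).

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# One row per user-day; metric names follow the text, values are invented.
df = pd.DataFrame({
    "user":          ["u1", "u2", "u3"],
    "file_anomaly":  [0.12, 0.09, 0.81],
    "role_anomaly":  [0.20, 0.15, 0.77],
    "total_anomaly": [0.18, 0.11, 0.92],
})
parallel_coordinates(df, class_column="user")
plt.show()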
This approach was found to be successful for the test scenarios from CMU-
CERT and from Oxford. Unlike supervised machine learning techniques,
this approach requires no labelling of instances, making it easier to deploy
quickly and effectively within an organisation. Given the variety of ways that a
user may exhibit activity that could be deemed a threat, classifying instances may
be quite difficult in any case. Classification also assumes that future instances of
a particular threat will closely match the currently-observed case, which may not
be the case (e.g., exfiltration of data could be performed in a variety of ways). The
challenge with the proposed approach is ensuring that the available data streams can
capture the occurrence of the activity that is required to identify the threat. It also
requires that the feature extraction supports all imaginable attack vectors that relate
to the available data. Whilst a comprehensive set of features is provided, organisations
may well find that they wish to incorporate additional features of their own.
Whilst the previous section describes the detection tool in detail, perhaps the
biggest challenge with developing insider threat tools is actually validating their
performance in the real-world. Previously, synthetic data scenarios were used for
developing and testing the tool. The paper by Agrafiotis et al. [47], “Validating an
insider threat detection system: A real scenario perspective” extends this to report
on the deployment of the detection tool in a real organisation.
The head of security for the particular organisation in question (not disclosed for
confidentiality purposes) indicated that there had recently been an incident, which
meant that there was a known insider that the system could be trialled against. From
discussions with the head of security, the detection system was modified to account
for three particular areas of interest: File-access logs, Patent DB interactions, and
Directory DB interactions. Compared to the previous work of [45], the real-world
organisation also presented scalability challenges. Here, file access logs provided
more than 750,000 data entries per day, compared to approximately 20,000 in
the synthetic examples. However, considering only authenticated data entries
resulted in a significant reduction in the amount of data, from 750,000 to 44,000
entries per day. This was deemed appropriate by the head of security, since these
entries provided user details, whereas the unauthenticated attempts were simply denied
access to the system. They use five anomaly metrics (which are anonymised in their
paper due to non-disclosure agreements), based on combinations of the features
derived from the user activity profile.
For testing the system, they deployed the detection system over two different time
periods (1 September to 31 October, and 1 December to 31 December), accounting
for 16,000 employees. The December period contained no known cases, and served
as a training period for establishing a baseline of normal activity to compare against.
The period in September and October contained one known case of insider threat
activity. When testing on this dataset, a number of false positives were generated as
either medium or high alert, for 4129 individuals. However, on closer inspection,
what the authors actually found was that the system produced approximately 0.5
alerts per employee per day. Yet, for one particular user, the system generated 12 alerts
in a single day. Sure enough, this particular user was the insider. Given the nature
of a multi-national organisation, working times are likely to change significantly,
and it is recognised by the head of security that users do not conform to strict
working patterns on a regular basis. However, the fact that the system is capable
of identifying the repeat occurrence of alerts for a user shows the strong potential
of this system. Further work aims to consider how combinations of alerts across
multiple days can be accumulated to better separate this particular individual from
the other alerts that were generated. Nevertheless, this study is crucial for the
continued development of insider threat detection tools, and demonstrates a
real-world validation of how a system can be deployed in a large
complex organisation.
In the paper by Legg [48], “Visualizing the Insider Threat: Challenges and tools for
identifying malicious user activity”, it is shown how visualization can be utilised to
better support the decision-making process of the detection tool. The system makes
use of a visual analytics dashboard, supported by a variety of linked views, including
an interactive PCA (iPCA) view (as originally proposed by Jeong et al. [49]). The
proposed dashboard, shown in Fig. 4, allows for overview summary statistics to be
viewed, based on selection of time, users, and job roles. The iPCA view shows the
measurement features on a parallel coordinates plot, and a scatter plot that represents
the 2-dimensional PCA. In particular, what this offers is the ability to observe how
the PCA space relates back to the original feature space. By dragging points in the
scatter plot, a temporary black polyline is displayed on the parallel co-ordinates
that shows the inverse PCA for the new dragged position, giving an interactive
indication of how the 2-dimensional space maps to the original feature space. For the
analyst, this can be particularly helpful to strengthen their reasoning for a particular
hypothesis, such as for understanding what a particular cluster of points may be
indicative of. The tool also features an activity view, where activities are plotted by
time in a radial view (Fig. 5). This can be particularly useful for examining the raw
activity for days where there is significant deviation. Again, this links with the PCA
view, so that when a user hovers on a point, the corresponding ring in the radial view
is highlighted, and vice versa.
Fig. 4 Layout of the visual analytics dashboard. The dashboard consists of four visualization
views: User Selection, Projection, Detail, and Feature. The dashboard also has two supporting
views for feature selection and configuration
Fig. 5 Two variants of the detail view for exploring user activity, using (a) a circular plot (where
time maps to angle and day maps to the radius), or (b) a rectangular grid plot (where time maps to
the x-axis and day maps to the y-axis). Colour denotes the observed activity, and the selection pane
provides detail of attributes. The role profile can be shown by the translucent coloured segments
Fig. 6 Assessment of 18 different user profiles within the same job role. Of the profiles, six
exhibit activity that occurs outside of the typical time period (marked by a circle top-left of the
profile). Two of the users also use USB devices (marked by a blue circle) during this non-typical
time period, which may be of potential interest to the analyst. This view provides a compact and
comparable overview of similar users
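The interactive inverse mapping can be approximated with scikit-learn's PCA, as in this sketch; the data and the dragged position are invented.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((200, 8))              # 200 user-days, 8 profile features
pca = PCA(n_components=2).fit(X)

dragged = np.array([[0.67, 0.32]])    # new 2-D position of a dragged point
polyline = pca.inverse_transform(dragged)
# 'polyline' is the feature-space vector to draw as the temporary black
# polyline on the parallel co-ordinates plot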
The detail view also forms the basis for a role overview mode, where the
analyst can inspect the detail view of all users that exist within the same role.
Figure 6 shows 18 users, where red indicates login/logout activity, blue indicates
USB insertion/removal, green indicates e-mail activity, and yellow indicates web
activity. As previously, a translucent background is used to represent the role profile,
so that comparisons can be made between how the user compares against this. From
this view, it can be seen that six of the users access resources outside of typical
working hours for that particular role (marked by a circle top-left of profile), and two
of these are making use of USB devices during these non-typical hours (marked by a
blue circle). Visualizing the activity by means of this overview allows analysts
to gain a clearer understanding of how the other users in the role perform, which
can help further support their decision-making about the threat that is posed.
Visual analytics provides an interface for analysts to visually explore the analytical
results produced by the machine decision process. The ability to utilise the machine-
based detection allows for an initial filtering process that alleviates the workload
for the analyst, whilst the visual analytics approach then enables analysts to obtain
a richer understanding of why the machine has given a particular result, without
obscuring the data should the analyst decide that further analytics of additional data
is required to fulfil their own decision process.
The visual analytics dashboard is a powerful interface that links the user to intuitive
visual representations of the underlying data. Through interaction, the analyst
can explore and delve deeper into this data, to support the development of their
hypotheses on the intentions of an insider who may pose a threat to the organisation.
By exploiting the concept of a visual analytics loop [50], the user interaction
can be utilised to inform the system, based on the new knowledge that they have
obtained from viewing the current result. From a machine learning viewpoint, this
is akin to the human providing online training labels, based on instances of particular
interest. This concept of training on a small sample of key instances, as determined
by the current result of the system (rather than providing a complete training set like
in supervised learning) is referred to as active learning [51].
In the paper by Legg et al. [52], “Caught in the act of an insider attack: detection
and assessment of insider threat”, an active learning approach is proposed for
refining the configuration of the detection system. Figure 7 shows the approach for
introducing active learning into a visual analytics tool. As seen previously, a parallel
co-ordinates plot is used to depict each user for each day. The plot can be configured
to show a historical time window (e.g., the last 30 days). The left side shows a user
alert list, where minor and severe alerts are shown as orange and red respectively, and
the date and user are given as the text label. If the analyst clicks on an alert, a tree-
structured profile is displayed (Fig. 1) that allows them to explore deeper into why a
particular observation has been flagged. In the tree profile, all previously-acceptable
activity is shown under the normal node, whilst the current attack is shown under
the attack node. In this example, it appears that the user has accessed a set of files
late at night that they would not typically work with, which is why they have been
flagged in this case.
Fig. 7 Detection system as a result of active learning. The analyst has rejected the alert on
mpowel1969 (shown by the removal of the accept option). This reconfigures the detection system
to downgrade the anomaly associated with this result—in this case insert_anomaly—which can be
observed by the circular dials by each anomaly metric. In addition to the alert list, the parallel co-
ordinates can be set to present only the ‘last 30 days’, which provides a clear view of the detected
insider lbegum1962
For the active learning component, the key here is that each label also has an
accept or reject option (shown by the green and red circles to the right of the label).
The user does not necessarily have to provide this information; however, if they do,
then the system is able to incorporate this knowledge back into the decision-making
process. This is done by taking a weighted contribution from each feature, so that if
a rejected result scores highly on a particular feature, then this feature can be down-
weighted for this particular user, or role, or entire group, for a particular period of
time. In this fashion, the burden of false positives can be alleviated for the analysts.
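A hedged sketch of such a feedback rule is given below; the halving rate, the floor, and the choice to penalise only the dominant feature are illustrative assumptions rather than the published scheme.

def reject_alert(weights, feature_scores, rate=0.5, floor=0.1):
    # Down-weight the feature that contributed most to a rejected alert,
    # so similar observations score lower in future.
    top = max(feature_scores, key=feature_scores.get)
    weights[top] = max(floor, weights[top] * rate)
    return weights

def alert_score(weights, feature_scores):
    # Weighted contribution from each feature, as described above.
    return sum(weights[f] * s for f, s in feature_scores.items())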
It should be apparent by now that there is substantial interest in the area of insider
threat detection as a research discipline. Yet despite many proposed solutions, the
problem continues to persist in many organisations. So why is this? Part of the
challenge is security awareness. Many organisations are simply ill-equipped to
gather and analyse such activity data. Others may choose not to invest in security
until it is too late. Part of the challenge here is also how to transition the work
from academic research into industrial practice. A number of spin-out companies
are beginning to emerge from insider-threat research, which may begin
to address this problem. So, what is it that organisations can be doing to protect
themselves?
Perhaps the key element is for organisations to identify their most precious
assets, and to identify features to represent them. Machine learning routines can
only form useful insight if working with appropriate data that is representative of
the problem domain. A system is unlikely to detect that a user is about to steal
many sensitive records if it knows nothing about a user’s access to such records.
Therefore, identifying which activity features an organisation is most concerned
about is a vital step in any application of insider threat detection. It is also vital that
appropriate visual analytic tools are in place for assessing the results of automated
detection routines, so that the analyst can fully understand the reasoning behind
why a particular individual has been flagged up as suspicious. Without this, humans
are merely taking a machine's word for whether an individual should be disciplined.
Given the severe consequence of false accusations, it is vital that the analyst has full
confidence in a given decision.
Another emerging area of interest in combating the insider threat is analysing
text communications. This raises many ethical and privacy concerns, although
in a corporate environment it could be argued that this is a requirement of the
role (e.g., employees working in national security would be expected to abide
by such regulations). One proposal to provide analytics on textual data without
exposing privacy concerns is to perform online linguistics analysis that can then
be used to characterise the communication, rather than the raw text alone. In
[53], the Linguistic Inquiry and Word Count (LIWC) tool was used as a means
of characterising psychological traits through use of language. The LIWC tool
essentially provides dictionaries that relate particular words (or parts of words),
to independent features (e.g., love, friend, hate, self). There has been much work
in the psychology domain of relating LIWC features to OCEAN characteristics
(Openness, Conscientiousness, Extroversion, Agreeableness, Neuroticism) [54],
and also the Dark Triad (Narcissism, Machiavellianism, Psychopathy) [55]. A visual
analytics dashboard was developed for analysing the communications of multiple
users against these features, which can be used to identify when there is significant
change in a user’s communication, and how this could imply a change in their
psychological characteristics. Whilst this initial study demonstrated potential in this
area, much work remains to be done to establish how feasible a solution this can
provide.
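The dictionary-matching idea can be illustrated with a toy sketch; the categories and word patterns below are invented stand-ins for the proprietary LIWC dictionaries.

import re
from collections import Counter

CATEGORIES = {
    "anger":  ["hate*", "annoy*", "kill*"],
    "social": ["friend*", "talk*", "we"],
}

def categorise(text):
    # Count tokens per category so a message can be characterised
    # without retaining the raw text itself.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for category, patterns in CATEGORIES.items():
        regexes = [re.compile(p.replace("*", r"\w*") + "$") for p in patterns]
        counts[category] = sum(any(r.match(t) for r in regexes) for t in tokens)
    return counts

print(categorise("I hate this; we should talk to a friend."))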
Another consideration to make is who should be responsible for decision-making
in insider threat: the human or the machine? Given the severity of disciplinary action,
it could be argued that a human should always need to intervene to inspect the
result to ensure that this is valid before any disciplinary or legal action is taken.
Then there is the issue of at what stage the system should intervene: should analysts
operate proactively or reactively? In a proactive environment, systems may attempt
to predict the likelihood that a user will become an attacker, rather than a reactive
environment that detects an already-conducted attack. Again, ethical concerns
are raised, such as whether a user would have conducted the attack had the system
not intervened at that time. Lab-based experimentation on such scenarios can only
take us so far in understanding these problems, and the concept that an employee
can be disciplined for an action that they are yet to perform is very much seen as
the work of science fiction (much like that of Minority Report). However, what is
required is the collaboration and cooperation between organisations, and also with
academia, to continue to experiment and continue to develop tools that can alleviate
and support the demands of the analyst in their decision-making process.
6 Summary
In this chapter, I have considered the scope of insider threat detection, and how
systems can be developed to enable human-machine decision support such that
well-informed decisions can be made about insider threats. Whilst a wide range
of work exists on the topic, there is still much work to be done to combat the
insider threat. How can detection systems be provided with accurate and complete
data? How can detection systems extend beyond ‘cyber’ data sources, to build a
more complete representation of the organisation? How should the psychology of
insiders be accounted for, to understand their motives and intentions in their normal
practice, and understand how these may change and why? Then there are the ethical
concerns that need to be addressed—if employees are being monitored, how will this
affect staff morale? Will they simply find alternative ways to circumvent protective
measures?
The research shows much potential in being able to combat this problem,
however, it also reveals the importance of the human aspects of security. As stated
earlier, "employees are the greatest asset, and yet also the greatest threat", and
it is fair to say that this has never been more true than in our modern society today.
Technology is enhancing how society operates, and yet it is also providing new means
for disgruntled insiders to attack. At the same time, insiders acting in physical space
are also becoming more creative in their planning. This begins to illustrate how
the boundaries between online and offline worlds are beginning to blur, and cyber
is just another factor in the much larger challenge of organisational security. Yet,
with continued efforts in the development of new security technologies, we can
better support the decision-making process between man and machine, to combat
the challenge of insider threat detection.
Acknowledgements Many thanks to my colleagues from Oxford Cyber Security, Dr. Ioannis
Agrafiotis, Dr. Jassim Happa, Dr. Jason Nurse, Dr. Oliver Buckley (now with Cranfield University),
Professor Michael Goldsmith, and Professor Sadie Creese, with whom my early work on insider
threat detection was carried out.
References
31. J. F. Buford, L. Lewis, and G. Jakobson. Insider threat detection using situation-aware MAS. In
Proc. of the 11th International Conference on Information Fusion, pages 1–8, 2008.
32. S. L. Garfinkel, N. Beebe, L. Liu, and M. Maasberg. Detecting threatening insiders with
lightweight media forensics. In Technologies for Homeland Security (HST), 2013 IEEE
International Conference on, pages 86–92, Nov 2013.
33. H. Eldardiry, E. Bart, J. Liu, J. Hanley, B. Price, and O. Brdiczka. Multi-domain information
fusion for insider threat detection. In Security and Privacy Workshops (SPW), 2013 IEEE,
pages 45–51, May 2013.
34. T. E. Senator, H. G. Goldberg, A. Memory, W. T. Young, B. Rees, R. Pierce, D. Huang,
M. Reardon, D. A. Bader, E. Chow, et al. Detecting insider threats in a real corporate database
of computer usage activity. In Proceedings of the 19th ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 1393–1401. ACM, 2013.
35. A. S. McGough, D. Wall, J. Brennan, G. Theodoropoulos, E. Ruck-Keene, B. Arief, C. Gamble,
J. Fitzgerald, A. van Moorsel, and S. Alwis. Insider threats: Identifying anomalous human
behaviour in heterogeneous systems using beneficial intelligent software (Ben-ware). In
Proceedings of the 7th ACM CCS International Workshop on Managing Insider Security
Threats, MIST ’15, pages 1–12, New York, NY, USA, 2015. ACM.
36. N. Nguyen and P. Reiher. Detecting insider threats by monitoring system call activity. In
Proceedings of the 2003 IEEE Workshop on Information Assurance, 2003.
37. M. A. Maloof and G. D. Stephens. Elicit: A system for detecting insiders who violate need-
to-know. In Christopher Kruegel, Richard Lippmann, and Andrew Clark, editors, Recent
Advances in Intrusion Detection, volume 4637 of Lecture Notes in Computer Science, pages
146–166. Springer Berlin Heidelberg, 2007.
38. J. S. Okolica, G. L. Peterson, and R. F. Mills. Using PLSI-U to detect insider threats by
data mining e-mail. International Journal of Security and Networks, 3(2):114–121, 2008.
39. M. Harris. Visualizing insider activity and uncovering insider threats. Technical report, 2015.
40. K. Nance and R. Marty. Identifying and visualizing the malicious insider threat using bipartite
graphs. In System Sciences (HICSS), 2011 44th Hawaii International Conference on, pages
1–9, Jan 2011.
41. F. Stoffel, F. Fischer, and D. Keim. Finding anomalies in time-series using visual correlation
for interactive root cause analysis. In Proceedings of the Tenth Workshop on Visualization for
Cyber Security, VizSec ’13, pages 65–72, New York, NY, USA, 2013. ACM.
42. C. Kintzel, J. Fuchs, and F. Mansmann. Monitoring large ip spaces with clockview. In
Proceedings of the 8th International Symposium on Visualization for Cyber Security, VizSec
’11, pages 2:1–2:10, New York, NY, USA, 2011. ACM.
43. J. Zhao, N. Cao, Z. Wen, Y. Song, Y. Lin, and C. Collins. Fluxflow: Visual analysis of
anomalous information spreading on social media. Visualization and Computer Graphics,
IEEE Transactions on, 20(12):1773–1782, Dec 2014.
44. S. Walton, E. Maguire, and M. Chen. Multiple queries with conditional attributes (QCATs) for
anomaly detection and visualization. In Proceedings of the Eleventh Workshop on Visualization
for Cyber Security, VizSec ’14, pages 17–24, New York, NY, USA, 2014. ACM.
45. P. A. Legg, O. Buckley, M. Goldsmith, and S. Creese. Automated insider threat detection
system using user and role-based profile assessment. IEEE Systems Journal, PP(99):1–10,
2015.
46. I. Jolliffe. Principal component analysis. Wiley Online Library, 2005.
47. I. Agrafiotis, A. Erola, J. Happa, M. Goldsmith, and S. Creese. Validating an insider threat
detection system: A real scenario perspective. In 2016 IEEE Security and Privacy Workshops
(SPW), pages 286–295, May 2016.
48. P. A. Legg. Visualizing the insider threat: challenges and tools for identifying malicious user
activity. In Visualization for Cyber Security (VizSec), 2015 IEEE Symposium on, pages 1–7,
Oct 2015.
49. D. H. Jeong, C. Ziemkiewicz, B. Fisher, W. Ribarsky, and R. Chang. ipca: An interactive
system for pca-based visual analytics. In Proceedings of the 11th Eurographics / IEEE -
VGTC Conference on Visualization, EuroVis’09, pages 767–774, Chichester, UK, 2009. The
Eurographs Association; John Wiley & Sons, Ltd.
Detecting Malicious Collusion Between Mobile Software Applications
Abstract Malware has been a major problem in desktop computing for decades.
With the recent trend towards mobile computing, malware is moving rapidly to
smartphone platforms. “Total mobile malware has grown 151% over the past year”,
according to McAfee®'s quarterly threat report in September 2016. By design,
AndroidTM is “open” to download apps from different sources. Its security depends
on restricting apps by combining digital signatures, sandboxing, and permissions.
Unfortunately, these restrictions can be bypassed, without the user noticing, by
colluding apps for which combined permissions allow them to carry out attacks.
In this chapter we report on recent and ongoing research results from our ACID
project which suggest a number of reliable means to detect collusion, tackling the
aforementioned problems. We present our conceptual work on the topic of collusion
and discuss a number of automated tools arising from it.

I.M. Asăvoae
INRIA, Paris, France
e-mail: [email protected]

J. Blasco (✉)
Information Security Group, Royal Holloway University of London, Egham, UK
e-mail: [email protected]

T.M. Chen
School of Mathematics, Computer Science & Engineering, City University of London, London, UK
e-mail: [email protected]

H.K. Kalutarage
Centre for Secure Information Technologies, Queen's University of Belfast, Belfast, UK
e-mail: [email protected]

I. Muttik
Cyber Curio LLP, Berkhamsted, UK
e-mail: [email protected]

H.N. Nguyen • S.A. Shaikh
Centre for Mobility and Transport, Coventry University, Coventry, UK
e-mail: [email protected]; [email protected]

M. Roggenbach
Department of Computer Science, Swansea University, Swansea, UK
e-mail: [email protected]
1 Introduction
radar for as long as possible. Unscrupulous advertisers could also benefit from
hiding privacy-invading functionality in multiple apps. These reasons led us to
believe that AndroidTM may be one of the first targets for collusion attacks. We also
realised that security practitioners who analyse threats for AndroidTM desperately
need tools which would help them uncover colluding apps. Such apps may be
outright malicious or they may be unwanted programs which often do aggressive
advertising coupled with disregard for users’ privacy (like those which would use
users’ contacts to expand their advertising further). Having a popular OS which
allowed (and to some extent even provides support to) colluding apps was a major
risk.
Before we started there were no tools or established methods to uncover these
attacks: discovering such behaviours is very tricky—two or more mobile apps,
when analysed independently, may not appear to be malicious. However, together
they could become harmful by exchanging information with one another. Multi-
app threats such as these were considered theoretical for some years, but as part of
this research we discovered colluding code embedded in many AndroidTM apps in
the wild [48]. Our goal was to find effective methods of detecting colluding apps
in AndroidTM [6–8, 11–13, 37]. This would potentially pave a way for spotting
collusions in many other environments that implement software sandboxing, from
other mobile operating systems to virtual machines in server farms.
1.1 Background
Malware has been a major problem in desktop computing for decades. With the
recent trend towards mobile computing, malware is moving rapidly to mobile
platforms. “Total mobile malware has grown 151% over the past year”, according
to McAfee®’s quarterly threat report from September 2016. Criminals are clearly
motivated by the opportunity—the number of smartphones in use is predicted to
grow from 2.6 billion in 2016 to 6.1 billion in 2020, predominantly AndroidTM ,
with more than 10 billion apps downloaded to date. Smartphones pose a particular
security risk because they hold personal details (accounts, locations, contacts, pho-
tos) and have potential capabilities for eavesdropping (with cameras/microphone,
wireless connections).
By design, AndroidTM is “open” to download apps from different sources.
Its security depends on restricting apps by combining digital signatures, sandboxing,
and permissions. Unfortunately, these restrictions can be bypassed, without the user
noticing, by colluding apps for which combined permissions allow them to carry
out attacks.
A basic example of collusion consists of one app permitted to access personal
data, which passes the data to a second app allowed to transmit data over the
network. While collusion is not a widespread threat today, it opens an avenue to
circumvent AndroidTM permission restrictions that could be easily exploited by
criminals to become a serious threat in the near future.
Almost all current research efforts focus on the detection of single malicious
apps. The threat of colluding apps is challenging to detect because of the myriad
and possibly stealthy ways in which apps might communicate with each other.
Existing Anti-Virus (AV) products are not designed to detect collusion. A review
of the literature shows that detecting application collusion introduces a new set of
challenges including: the detection of communication channels between apps, the
exponential number of app combinations, and the difficulty of actually proving that
two or more apps are really colluding.
1.2 Contribution
In this chapter we report on recent and ongoing research results from our ACID
project¹ which suggest a number of reliable means to detect collusion, tackling the
aforementioned problems. To this end we present our conceptual work on the topic
of collusion and discuss a number of automated tools arising from it.
We start with an overview of the AndroidTM Operating System, which introduces
the various security mechanisms built into it.
Then we give a definition for app collusion, and distinguish collusion from the
closely related phenomena of collaboration and confused deputy attacks.
Based on this we address the exponential complexity of the problem by
introducing a filtering phase. We develop two methods based on a lightweight
analysis to detect if a set of apps has any collusion potential. These methods extract
features through static analysis and use first order logic and machine learning to
assess whether an analysed app set has collusion potential. By developing two
methods to detect collusion potential we address the problem of collusion with two
distinct approaches.
The first order logic approach allows us to define collusion potential through
experts, who may identify attack vectors that have not yet been seen in the real
world, whereas the machine learning approach uses AndroidTM permissions to
systematically assign a degree of collusion potential to a set of apps. Such a mix
of techniques provides an insightful understanding of possibly colluding
behaviours and also adds confidence to the filtering.
Once we have reduced the search space, we use a more computationally intensive
approach, namely software model checking, to validate the actual existence of
collusion between the analysed apps. To this end, model checking provides dynamic
information on possible app executions that lead to collusion; counterexamples
(witness traces) are generated in such cases.
¹ http://acidproject.org.uk.
2 The AndroidTM Operating System
The AndroidTM operating system consists of three software layers above a Linux®
kernel as shown in Fig. 1. The Linux® kernel is slightly modified for an embedded
environment. It runs device-specific hardware drivers, and manages power and the
file system. AndroidTM is agnostic of the processor (ARM, x86, and MIPS) but
does take advantage of some hardware-specific security capabilities, e.g., the ARM
v6 eXecute-Never feature for designating a non-executable memory area.
Above the kernel, libraries of native machine code provide common services
to apps. Examples include Surface Manager for graphics; WebKit® for browser
rendering; and SQLite for basic datastore. In the same layer, each app runs in its
own instance of AndroidTM runtime (ART) except for some apps that are native,
e.g., core AndroidTM services. A replacement for the Dalvik virtual machine (VM)
since AndroidTM 4.4, the ART is designed to run Dalvik executable (DEX) byte-
code on resource-constrained mobile devices. It introduces ahead-of-time (AOT)
compilation converting bytecode to native code at installation time (in contrast to
the Dalvik VM which interpreted code at runtime).
Fig. 1 The AndroidTM software layers: the application framework, the libraries and the AndroidTM runtime (Dalvik VM), above the Linux® kernel
The application framework above the libraries offers packages and classes to
provide common services to apps, for example: the Activity Manager for starting
activities; the Package Manager for installing apps and maintaining information
about them; and the Notification Manager to give notifications about certain events
to interested apps.
The highest application layer consists of user-installed or pre-installed apps. Java
source code is compiled into JAR (Java archive) files composed of multiple Java
class files, associated metadata and resources, and an optional manifest. JAR files
can be translated into DEX bytecode and zipped into AndroidTM package (APK) files
for distribution and installation. APK files contain .dex files, resources, assets, lib
folder of processor-specific code, the META-INF folder containing the manifest
MANIFEST.MF and other files, and an additional AndroidManifest.xml file. The
AndroidManifest.xml file contains the necessary configuration information to install
the app, notably defining permissions to request from the user.
AndroidTM apps are composed of one or more components, which must be
declared in the manifest.
• Activities represent screens of the user interface and allow the user to interact
with the app. Activities run only in the foreground. Apps are generally composed
of a set of activities, such as a “main” activity launched when a user starts an app.
• Services operate in the background to carry out long-running tasks for other apps,
such as listening to incoming connections or downloading a file.
• Broadcast receivers respond to messages that are sent through Intent objects, by
the same or other apps.
• Content providers manage data shared across apps. Apps with content providers
enable other apps to read and write their local data.
Any component can be public or private. If a component is public, components
of other apps can interact with it, e.g., start the Activity or the Service. If a
component is private, only components from apps running with the same user
ID (UID) can interact with it.
2.2 Communications
AndroidTM allows any app to start another app's components, in order to avoid
duplicating code for the same function. However, this cannot be done directly,
because apps are separate processes. To activate a component in another app, an app
must deliver a message to the system that specifies its intent to start that component.
Intents are message objects that contain information about the operation to be
performed and the relevant data. Intents are delivered by various methods to
application components, depending on the type of component. Intents about certain
events, e.g., an incoming phone call, are broadcast. Intents can be explicit, for
specific recipients, or implicit, i.e., broadcast through the system to any components
listening. Components can provide Intent filters to specify which Intents they are
willing to handle.
Besides Intents, processes can communicate by standard Unix® communication
methods (files, sockets). In addition, AndroidTM offers three inter-process
communication (IPC) mechanisms:
• Binder: a remote procedure call mechanism implemented as a custom
Linux® driver;
• Services: interfaces directly accessible using Binder;
• Content Providers: provide access to data on the device.
AndroidTM apps can be downloaded from the official Google PlayTM market or from
many third party app stores. To prevent malicious apps from being distributed,
Google uses a variety of services including Bouncer, Verify Apps, and Safety
Net. Since 2012, the Bouncer service automatically scans the Google PlayTM
market for potentially malicious apps (known malware) and for apps with suspicious
behaviours. It does not examine apps installed on devices or apps in third party app
stores. Currently, however, none of these services looks for apps exhibiting collusion
behaviours.
The Verify Apps service scans apps upon installation on an AndroidTM device
and scans the device in the background periodically or when triggered by potentially
harmful behaviours, e.g., root access. It warns users of potentially harmful apps
(PHAs) which may be submitted online for analysis.
The Safety Net service looks for network-based security threats, e.g., SMS abuse,
by analysing hundreds of millions of network connections daily. Google has the
option to remotely remove malicious apps.
AndroidTM security aims to protect user data and system resources (including the
network), which are regarded as the valuable assets. Apps are assumed to be
untrusted by default and are therefore considered potential threats to the system and
to other apps. The primary method of protection is the isolation of apps from other
apps, of users from other users, and of apps from certain resources. IPC is possible
but mediated by the system.
3 App Collusion
ISO 27005 defines a threat as “A potential cause of an incident, that may result in
harm of systems and organisations.” For mobile devices, the range of such threats
includes [62]:
• Information theft happens when information is sent outside the device bound-
aries.
• Money theft happens, e.g., when an app makes money through sensitive API calls
(e.g. SMS).
• Service or resource misuse occurs, for example, when a device is remotely
controlled or some device function is affected.
As we have seen before, the AndroidTM OS runs apps in sandboxes, trying to
keep them separate from each other, and in particular to ensure that no information
can be exchanged between them. At the same time, however, AndroidTM has
communication channels between apps. These can be documented ones (overt
channels) or undocumented ones (covert channels). An example of an overt channel
would be a shared file or an intent; an example of a covert channel would be volume
manipulation (the volume is readable by all apps) in order to pass a message in a
special code.
Broadly speaking, app collusion occurs when several apps work together in
performing a threat, i.e., they exchange information which they could not obtain on
their own.
This informal definition is close to app collaboration, where several apps share
information (which they could not obtain on their own) in order to achieve a
documented objective.
A typical example of collusion is shown in Fig. 2, where two apps perform the
threat of information theft: the Contact_app reads the contacts database to pass
the data to the Weather_app, which sends the data outside the device boundaries.
The information between apps is exchanged through shared preferences.
In contrast, a typical example of collaboration would be the cooperation between
a picture app and an email app. Here, the user can choose a picture to be sent via
email. This requires the picture to be communicated over an overt channel from the
picture app to the email app. Here, the communication is performed via a shared
image file, to which both apps have access.
Fig. 2 A typical collusion example: the Contact_app, holding the READ_CONTACTS permission, passes contacts data via Shared Preferences to the Weather_app, which holds the INTERNET permission and sends the data outside the device
These examples show that the distinction between collusion and collaboration
actually lies in the notion of intention. In the case of the weather app, the intent
is malicious and undocumented; in the case of sending the email, the intent is
documented, visible to the user, and useful.
To sharpen the argument, it might be the case that the picture app actually makes
the pictures readable by all apps, so that harm can be caused by some malicious app
sending pictures without authorisation. This would be a situation where a bug
or a vulnerability of one app is abused by another app, a borderline case for
collusion. In this case one would speak of a "confused deputy" attack: the picture
app has a vulnerability, which is maliciously abused by the other app; however, the
picture app was—in the way we describe it here—not designed with the intention
to collude. An early reference on such attacks is the work by Hardy [34].
This discussion demonstrates that notions such as "malicious", intent, and
visibility (including app documentation, both external and built into the app) play a
role when one wants to distinguish between collusion, cooperation, and confused
deputy attacks. This is typical in cyber security, see e.g. Harley's book chapter
"Antimalware Evaluation and Testing", especially the section headed "Is It or Isn't
It?" [35, pp. 470–474]. It is often a challenge, especially for borderline cases, to
distinguish between benign and malicious application behaviours. One approach is
to use a pre-labelled "malicious" data set of APKs where all the aforementioned
factors have already been accounted for. Many security companies routinely classify
AndroidTM apps into clean and malicious categories to provide anti-malware
detection in their products, and we had access to such a set from Intel Security
(McAfee®). All apps classified as malicious fall into the three threat categories
mentioned above. Collusion can then be regarded as a camouflage mechanism
applied to conceal these basic threat behaviours: after splitting malicious actions
across multiple individual apps, they easily appear harmless when checked
individually. Indeed, even the permissions of each such app would indicate that it
cannot pose a threat in isolation. In combination, however, they may realise a threat.
Taking into account all the details contributing to "maliciousness"—deceitful
distribution, lack of documentation, hidden functionality, etc.—is practically
impossible to formalise.
Here, in our book chapter, we aim to apply purely technical methods to discover
collusion. Thus, we will leave out of our definition all aspects relating to psychology,
sociology, or documentation. In the light of the above discussion, our technical
definition of collusion applies to all three identified cases, namely collusion,
collaboration, and confused deputy attacks.
This data leakage example is in line with the collusion definitions given in most
existing work [5, 17, 40, 42, 46, 52], which regard collusion as the combination
of inter-app communication with information leakage. However, our definition of a
threat is broader, as it also includes financial and resource/service abuse.
² Concrete examples are available on request.
3.1 Collusion in the Wild
We present our analysis of a set of apps in the wild that use collusion to maximise
the effects of their malicious payloads [11]. To the best of our knowledge, this is
the first time that a large set of colluding apps has been identified in the wild. This
does not necessarily mean that there are no more colluding apps in the wild, as one
of the main problems (which we are addressing in our work) is the lack of tools to
identify colluding apps. We identified these sets of apps while looking for collusion
potential in a set of more than 40,000 apps downloaded from app markets. While
performing this analysis we found a group of apps that were communicating using
both intents and shared preference files. A manual review of the flagged apps
revealed that they were sharing information through shared preferences files to
synchronise the execution of a potentially harmful payload. Both the colluding
behaviour and the malicious payload were included inside a library, the MoPlus
SDK, embedded in all the apps. This library has been known to be malicious since
November 2015 [58]. However, the collusion behaviour of the SDK was hitherto
unknown. In the rest of this section, we briefly describe this colluding behaviour.
The detected colluding behaviour looks different from the behaviour predicted
by most app collusion research [47, 57] so far. In a nutshell, all apps embedding the
MoPlus SDK that are running on a device talk to each other to check which
of the apps has the most privileges. This app is then chosen to execute the
local HTTP server able to receive commands from the C&C server, maximising the
effects of the malicious payload.
The MoPlus SDK includes the MoPlusService and the MoPlusReceiver
components. In all analysed apps, the service is exported. In AndroidTM, this is
considered to be a dangerous practice, as other apps are then able to call and access
this service. However, in this case it is a feature used by the SDK to enable
communication between its apps.
The colluding behaviour is executed when the MoPlusService is created
(onCreate method). This behaviour is triggered by the MoPlus SDK of each app and
can be divided into two phases: establishing app priority and executing the malicious
payload. To establish the app priority—see Fig. 3—the MoPlus SDK executes
a number of checks, including verifying whether the app embedding the SDK has
been granted the INTERNET, READ_PHONE_STATE, ACCESS_NETWORK_STATE,
WRITE_CONTACTS, WRITE_EXTERNAL_STORAGE or GET_TASKS
permissions.
After the priority has been obtained and stored, each service inspects the contents
of the shared preference files to get the priorities, returning the package name of
the app with the highest priority. Then, each service cancels previously registered
intents (to avoid launching the service more than once) and sends an intent
targeting only the process with the highest previously saved priority—see Fig. 4.
Fig. 3 Phase 1 of the colluding behaviour execution. Each app saves a priority value that depends
on the amount of access it has to the system resources. Priority values are shown for the sake of
explanation
Fig. 4 Phase 2 of the colluding behaviour execution. Each app checks the WORLD_READABLE
SharedPreference files and sends an intent to the app with highest priority
3.1.1 Discussion
When looking for app collusion we must look not only at the specific features or
capabilities of an app, but also at how those capabilities work when the app is
executed alongside other apps. If we are considering collusion, it does not make
much sense to consider the capabilities of an app in isolation; we have to consider
the app executing in an environment where other apps are installed.
This set of apps found in the wild relates to our collusion definition in the following
way. Consider a set of apps S = {app1, app2, …, appn} that implement the
MoPlus SDK. As they embed the MoPlus SDK, the attacks that can be achieved
by them include writing into the contacts database, launching intents and installing
applications without user interaction, among others. This set of threats was identified
by TrendMicro researchers [58].
Consider now the installation of an application without user interaction
as a threat Tinstall. As all apps embed the MoPlus SDK, all apps include the
code to potentially execute this threat, but only apps that request the necessary
permissions are able to execute it. If appi is the only app installed on the device,
and it has the necessary permissions, executing Tinstall will require the following
actions: {Open_server_i, Receive_command_i, Install_app_i}, the subscript
denoting the app executing the action.
However, if another MoPlus SDK app, appj, is installed on the same device
but does not have the permissions required to achieve Tinstall, the threat may not be
realised because of concurrency problems: both apps share the port where they
receive the commands. To avoid these, the MoPlus SDK includes the previously
described leader selection mechanism that uses the SharedPreferences. In this
setting, we can describe the set of actions required by both apps to execute the
threat as ActMoPlus = {Check_permissions_i, Check_permissions_j, Save_priority_i^i,
Save_priority_j^j, Read_priority_i^j, Read_priority_j^i, Launch_service_i^j, Open_server_i,
Receive_command_i, Install_app_i}. Considering Read_priority_x^y and Save_priority_x^y
as actions that make use of the SharedPreferences as a communication channel, the
presented set of actions falls under our collusion definition, as (1) there is a sequence
of actions executing a threat carried out collectively by appi and appj, and (2) both
apps communicate with each other.
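The two phases can be made concrete with a small Python model. This is our own illustrative sketch, not the MoPlus code: the permission weights and function names are assumptions, and the WORLD_READABLE shared preference files are modelled by a plain dictionary.

PERMISSION_WEIGHTS = {
    "INTERNET": 32, "READ_PHONE_STATE": 16, "ACCESS_NETWORK_STATE": 8,
    "WRITE_CONTACTS": 4, "WRITE_EXTERNAL_STORAGE": 2, "GET_TASKS": 1,
}

shared_prefs = {}   # stands in for the WORLD_READABLE SharedPreferences files

def save_priority(package, granted_permissions):
    # Phase 1: each app derives a priority from the permissions it holds
    # and stores it where every other SDK instance can read it.
    priority = sum(PERMISSION_WEIGHTS.get(p, 0) for p in granted_permissions)
    shared_prefs[package] = priority

def elect_leader():
    # Phase 2: every app reads all stored priorities and targets an intent
    # at the single most-privileged app, which runs the local HTTP server.
    return max(shared_prefs, key=shared_prefs.get)

save_priority("com.baidu.searchbox", ["INTERNET", "READ_PHONE_STATE"])
save_priority("com.baidu.BaiduMap", ["INTERNET", "WRITE_CONTACTS", "GET_TASKS"])
save_priority("com.myapp", ["INTERNET"])
assert elect_leader() == "com.baidu.searchbox"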
4 Filtering
A frontal attack on detecting collusion, analysing all pairs, triplets and even larger
sets, is not practical given the size of the search space. An effective collusion-discovery
tool must therefore include an effective set of methods to isolate potential sets which
require further examination.
4.1 Rule Based Collusion Detection
Here, in a first step, we extract information about app communications and access to
protected resources. Using rules in first order logic codified in Prolog, the method
identifies sets of apps that might be colluding.
The goal is to serve as a fast, computationally cheap filter that detects
potentially colluding apps. For such a first filter it is enough to be based on
permissions. In practical work on real world apps this filter turns out to be effective
at detecting colluding apps in the wild.
Our filter (1) uses Androguard [20] to extract facts about the communication
channels and permissions of each single app in a given app set S; (2) this information
is then abstracted into an over-approximation of the actions and communication
channels that could be used by each app; (3) finally, the collusion rules fire if the
proper combinations of actions and communications are found in S.
4.1.1 Actions
We utilise an action set Act_prolog composed of four different high level actions:
accessing sensitive information, using an API that can directly cost money,
controlling device services (e.g. camera, etc.), and sending information to other
devices or the Internet. To find out which of these actions an app could carry out,
we extract its set of permissions pms_prolog with Androguard. For each permission
found, our tool creates a new Prolog fact of the form uses(app, permission). Each
extracted permission is then mapped to one of the four high level actions. This is
done with a set of previously defined Prolog rules. The mapping of all AndroidTM
permissions to the four high-level actions can be found in the project's GitHub
repository (https://github.com/acidrepo/collusion_potential_detector). As an
example, an app that declares the INTERNET permission will be capable of sending
information outside the device:
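    uses(App, INTERNET) → information_outside(App)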
4.1.2 Communications
The communication channels established by an app are characterised by its API calls
and the permissions declared in its manifest file. We cover communication actions
(com_prolog) that can be created as follows:
• Intents are messages used to request tasks from other application components
(activities, services or broadcast receivers). Activities, services and broadcast
receivers declare the intents they can handle by declaring a set of intent filters.
• External Storage is a storage space shared between all installed apps, without
restrictions. Apps accessing the external storage need to declare the
READ_EXTERNAL_STORAGE
permission. To enable writing, apps must declare
WRITE_EXTERNAL_STORAGE.
• Shared Preferences are an OS feature to store key-value pairs of data. Although
not intended for inter-app communication, apps can use key-value pairs to
exchange information if proper permissions are defined (before AndroidTM 4.4).
We map apps to sending and receiving actions by inspecting their code and
manifest files. When using intents and shared preferences, we are able to specify the
communication channel using the intent actions and the preference files and
packages, respectively. If an application sends a broadcast intent with the action
SEND_FILE, we consider the following:

    send_broadcast(App, Intent_SEND_FILE) → send(App, Intent_SEND_FILE)
We consider that two apps communicate if one of them is able to send and the other
to receive through the same channel. This allows us to detect communication paths
composed of an arbitrary number of apps:
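    send(Appa, Channel) ∧ receive(Appb, Channel) → communicate(Appa, Appb, Channel)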
To identify collusion potential in app sets, we combine the different communication
channels found in an app with the high-level actions identified by its
permissions. Then, using domain knowledge, we created a threat set that describes
some of the possible threats that could be achieved with a collusion attack. Our
threat set threat_prolog considers information theft, money theft and service misuse.
As our definition states, each of the threats is characterised by a sequence of actions.
In fact, each of our collusion rules gathers the two elements required by the collusion
definition explained in Sect. 3: (1) each app of the group must execute at least one
action of the threat, and (2) each app in S communicates with at least one other app
in S. The following rule provides the example of an information theft executed
through two colluding apps:
Fig. 5 General overview of the process followed in the rule based collusion detection approach
    sensitive_information(Appa) ∧ information_outside(Appb) ∧ communicate(Appa, Appb, Channel) → collusion(Appa, Appb)
Note that more apps could be involved in this same threat, acting simply as
forwarders of the information extracted by the first app until it arrives at the
exfiltration app. This case is also covered by Definition 1, as the forwarding apps
need to execute their communication operations for the attack to succeed (fulfilling
both of our definition conditions).
Finally, the Prolog rules defining collusion potential, the facts extracted from
the apps, and the rules mapping permissions to high level actions and communications
between apps are put into a single file. This file is then fed into Prolog so
that collusion queries can be made. The overall process is depicted in Fig. 5.
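4.2 Probabilistic Filtering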
Let X = [x1, …, xk] be a k-dimensional space with binary inputs, where k is the total
number of permissions in the AndroidTM OS and the xj ∈ {0, 1} are independent
Bernoulli random variables. A variable xj takes the value 1 if permission j is found
in the set S of apps under consideration, and 0 otherwise. Let Y = {m (malicious),
b (benign)} be a one-dimensional output space. The generative naive Bayes model
specifies a joint probability p(x, y) = p(x | y) · p(y) of the inputs x and the label y: the
probability p(x, y) of observing x and y at the same time is the same as the probability
p(x | y) of x happening when we know that y happens, multiplied by the probability
p(y) that y happens. This explicitly models the actual distribution of each class
(i.e. malicious and benign in our case), assuming that a stochastic process with some
parameters generates the observations, and hence estimates the model parameters
that best explain the training data. Once the model parameters are estimated (say θ̂),
we can compute p(ti | θ̂), which gives the class probability of the i-th test case. With
smoothing, the likelihoods are estimated as

    p̂(x = 0 | y = m) = (c(x = 0, y = m) + α) / (c(y = m) + 2α)    (1)

where c(·) denotes counts in the training data and the pseudo count α > 0 is the
smoothing parameter. If α = 0, i.e. taking the empirical estimates of the probabilities
without smoothing, then

    p̂(x = 0 | y = m) = c(x = 0, y = m) / c(y = m)    (2)

Equation (2) estimates the likelihoods using the training set R. Uninformative priors,
i.e. p̂(y = m), can also be estimated in the same way. Instead, we estimate the prior
distribution in an informative way in this work, as it helps us model knowledge not
available in the data (e.g. a permission's critical level). Informative prior estimation
is described in Sect. 4.2.3.
In order to classify the i-th test case ti, the model predicts ŷ = m if and
only if:

    p̂(x = ti, y = m) / p̂(x = ti, y = b) > 1    (3)
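To make Eqs. (1)–(3) concrete, here is a minimal Python sketch (ours, not the authors' implementation) that estimates the smoothed likelihoods and applies the decision rule. It uses plain empirical priors, whereas the chapter's filter uses the informative priors of Sect. 4.2.3.

def estimate_likelihoods(training, alpha=1.0):
    # training: list of (permission_vector, label), label in {"m", "b"}.
    k = len(training[0][0])
    counts = {"m": [0] * k, "b": [0] * k}
    totals = {"m": 0, "b": 0}
    for x, y in training:
        totals[y] += 1
        for j, xj in enumerate(x):
            counts[y][j] += xj
    # p_hat[y][j] = (c(x_j = 1, y) + alpha) / (c(y) + 2 * alpha), cf. Eq. (1)
    p_hat = {y: [(counts[y][j] + alpha) / (totals[y] + 2 * alpha)
                 for j in range(k)] for y in ("m", "b")}
    return p_hat, totals

def classify(x, p_hat, totals):
    # Eq. (3): predict "malicious" iff the joint-probability ratio exceeds 1;
    # empirical (uninformative) priors c(y)/N are used in this sketch.
    ratio = totals["m"] / totals["b"]
    for j, xj in enumerate(x):
        pm = p_hat["m"][j] if xj else 1.0 - p_hat["m"][j]
        pb = p_hat["b"][j] if xj else 1.0 - p_hat["b"][j]
        ratio *= pm / pb
    return "m" if ratio > 1.0 else "b"

training = [([1, 1, 0], "m"), ([1, 0, 0], "m"), ([0, 0, 1], "b"), ([0, 1, 1], "b")]
p_hat, totals = estimate_likelihoods(training)
assert classify([1, 1, 0], p_hat, totals) == "m"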
As per our collusion definition in Sect. 3, estimating the collusion threat likelihood
Lc(S) of a non-singleton set S of apps involves two likelihood components, Lτ(S) and
Lcom(S): Lτ(S) expresses how likely it is that the app set S can fulfil the sequence of
actions required to execute a threat; Lcom(S) expresses the ability of the apps
in S to communicate. Using the multiplication rule of the well-known basic
principles of counting:
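    Lc(S) = Lτ(S) × Lcom(S)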
4.2.3 Estimating Lτ
where N is the number of apps in the training data set and αj, βj are the penalty
effects. In this work we set αj = 1. The values for βj depend on the critical level of
the permissions as given in [50, 55]: βj can take either the value 2N (if permission j is
most critical), N (if permission j is critical) or 1 (if permission j is non-critical).
Our probabilistic filter consists of two sub filters: an inner and an outer one. The
inner filter applies on top of the outer filter. The outer filter is based on the Lτ
value, which we can compute using permissions only. Permissions are very easy and
cheap to extract from APKs: no decompilation, reverse engineering, or complex code
or data flow analysis is required. Hence the outer filter is computationally efficient.
The majority of non-colluding app pairs in an average app set can be treated using
this filter only (see Fig. 6). This avoids expensive static/dynamic analysis on
these pairs. The inner filter is based on the Lcom value, which we currently compute
using static code analysis. A third party research prototype tool, Didfail [15], was
employed to find intent based inter-app communications. A set of permission
based rules was defined to find communication using external storage. Algorithm 1
presents the proposed filter to find colluding candidate pairs of interest.
⁴ This assumption might produce false positives, but never false negatives. It is left as future
work to improve this.
Algorithm 1: Probabilistic filter. The outer filter is based on Lτ and the inner
filter is based on Lcom
Λ: set of individual apps;
Ω: set of pairs of colluding candidates of interest;
input : Λ = {app1, app2, app3, . . . , appn}
output: Ω = {pair1, pair2, pair3, . . . , pairm}
if |Λ| ≥ 2 then
  Let Π = the set of all possible app pairs in Λ;
  foreach pairj in Π do
    Compute Lτ as described in Sect. 4.2.3;
    /* outer filter */
    if Lτ ≥ threshold then
      Compute Lcom as described in Sect. 4.2.4;
      /* inner filter */
      if Lcom == 1 then
        Return (pairj);
      end
    end
  end
end
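Rendered compactly, Algorithm 1 is a guarded double filter over all pairs. The following Python sketch is ours; compute_l_tau and compute_l_com are hypothetical placeholders for the estimations of Sects. 4.2.3 and 4.2.4.

from itertools import combinations

def probabilistic_filter(apps, threshold, compute_l_tau, compute_l_com):
    # Returns the pairs of colluding candidates of interest (the set Ω).
    candidates = []
    if len(apps) >= 2:
        for pair in combinations(apps, 2):       # all possible app pairs
            if compute_l_tau(pair) >= threshold:     # outer filter
                if compute_l_com(pair) == 1:         # inner filter
                    candidates.append(pair)
    return candidates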
Algorithm 1 was automated using R⁵ and Bash scripts. As mentioned above, it also
includes calls to a third party research prototype [15] to find intent based
communications when computing Lcom. The model parameters in Eq. (5) were estimated
using training datasets produced from a 29k app sample provided by Intel Security.
Our validation data set consists of 240 app pairs, half (120) of which
are known colluding pairs while the other half are non-colluding pairs. In order to
prevent overfitting, app pairs in the validation and testing sets were not included in
the training set. As shown in Fig. 6, the proposed method assigns higher Lτ scores⁶
to colluding pairs than to clean pairs. Table 2 presents the confusion matrix obtained
for the proposed method by fitting a linear discriminant line (LDL), i.e. the blue
dotted line in Fig. 6 (sensitivity = 0.95, specificity = 0.94, precision = 0.94 and
F-score = 0.95).
However, as shown in Fig. 6, the colluding and non-colluding classes are not easily
separable by an LDL; there is some overlap between class elements. As a result,
false classifications appear in Table 2. It is possible to reduce false alarms by
changing the threshold: for example, setting the best possible discriminant line,
or its upper bound (or even higher, see Fig. 6), as the threshold will produce zero false
⁵ http://www.r-project.org/.
⁶ We plot Lτ values in Fig. 6 as the outer filter in Algorithm 1 depends on it, and to show that the
majority of non-colluding app pairs can be treated using Lτ only. Note, however, that Lc = Lτ for
colluding pairs, as Lcom = 1.
positives or vice versa in Table 2. But as a result it will increase the false negative
rate, which affects the F-score, the performance measure of the classifier. Hence
there is a trade-off between per-class accuracy and overall performance. Moreover,
since the base rate of colluding apps in the wild is close to zero as far as anyone
knows, the false positive rate of this method would have to be vanishingly small to
be useful in practice. Instead of an LDL, using a non-linear discriminator would be
another possibility for reducing false alarms; this is left as future work to investigate.
The average processing time per app pair was 80 s, of which 1 s was spent in
the outer filter and the rest in the inner filter. The average time was calculated
on a mobile workstation with an Intel Core i7-4810MQ 2.8 GHz CPU and 32 GB
of RAM.
We validate both our filtering methods against a known ground truth by applying
them to a set of artificially created apps. Furthermore, we report on managing
complexity by scaling up our rule based detection method to deal with 50,000+
real world applications.
The validation of the Prolog filter has been carried out with fourteen artificially
crafted apps that cover information theft, money theft and service misuse. The
created apps use Intents, Shared Preferences and External Storage as communication
channels. They are organised in four colluding sets:
• The Document Extractor set consists of one app (id 1) that looks for sensitive
documents on the external storage; the other app (id 2) sends the information
received (via SharedPreferences ) to a remote server.
• The Botnet set consists of four apps. One app (id 3) acts as a relay that receives
orders from the command and control center. The other colluding apps execute
commands (delivered via BroadcastIntents ) depending on their permissions:
sending SMS messages (id 4), stealing the user’s contacts (id 5) and starting
and stopping tasks (id 6).
• The Contact Extractor set consists of three apps. The first (id 7) reads contacts
from the address book, the second (id 8) forwards them via the external storage
to the third one (id 9), which sends them to the Internet. The first and second app
communicate via BroadcastIntents .
• The Location Stealing set consists of one app (id 12) that reads the user location
and shares it with the second app (id 13), which sends the information to the
Internet.
The three non-colluding apps are a document viewer (id 10), an information sharing
app (id 11) and a location viewer (id 14). The first app is able to display different file
types on the device screen and uses other apps (via broadcast intents) to share their
uniform resource identifiers (URIs). The second app receives text fragments from
other apps and sends them to a remote server. The third app receives a location from
another app (with the same intent used by apps 12 and 13) and shows it to the user
on the screen.
Table 3 shows the results obtained with our rule based approach. The entry
"dark red club" in row 1 and column 2 means: the program detects that app id 1
sends information to app id 2, and these two apps collude on an "information theft".
As we take the communication direction into consideration, the resulting matrix is
non-symmetric, e.g., there is no entry in row 2 and column 1. The entry "light red
club" in row 1 and column 10 means: the program flags collusion of type "information
theft" although the set {id 1, id 10} is clean. This provides further information about
the collusion attack; for instance, one can see the information leak in information
theft attacks. Additionally, the way we defined the communication rules makes it
possible to identify transitive collusion attacks (i.e. app 7 colluding with app 9
through app 8). The approach identifies all colluding app sets. It also flags eight
false positives due to over-approximation. Note that there are no false negatives
due to the nature of our test set: it utilises only those communication methods that
our Prolog approach is able to identify.
Our false positives happen mainly for two reasons. First, we do not consider
in our initial classification some of the communication channels that are already
widely used by apps in AndroidTM. For example, Intents with the action VIEW or
SEND are very common in AndroidTM applications; it is unlikely that apps would
use them for collusion, as other apps could have registered to receive the same
information. Second, in this approach we identify apps that are communicating
by sharing access to sensitive resources, but we do not look at how that access is
shared. It must be noted that the main aim of this approach is to reduce the number
of app combinations that are passed on to the data-flow analysis.
We tested the probabilistic filter with a different sample consisting of 91 app pairs.
Figure 7 presents the outcome for this set. Each cell in the table denotes the Lτ value
for the corresponding app pair. Note that though there are 196 possible pairs (i.e.
14 × 14 cells in the table), for readability we leave the lower half empty, since the
table is symmetric; pairs on the diagonal are also not of interest. To minimise false
negatives, we use the lower bound (= 0.50) gained from the validation dataset for
the discriminant line as the threshold for Lτ. We report possible collusion if Lτ ≥ 0.5
and Lcom = 1; otherwise we report non-collusion. In Fig. 7, dark red shows true
positives, light red shows false positives, dark green shows true negatives, and light
green shows false negatives.
Fig. 7 Lτ values for each of the 91 app pairs in the testing sample (app ids 1–14). For readability, the lower half of the table is left empty since it is symmetric, and pairs on the diagonal are not of interest. Dark red shows true positives, light red shows false positives, dark green shows true negatives, and light green shows false negatives
With regard to false alarms: app pair (1,2) was not detected by our analysis
because the third party tool does not detect communication using SharedPreferences.
Since we only perform pairwise analysis, app pair (7,9) was not reported; that pair
depends on transitive communication. Pair (12,13) was not reported since its Lτ is
less than the chosen threshold. As mentioned in Sect. 4.2.6, it would be possible
to reduce false alarms by changing the LDL threshold, but at the cost of degrading
the overall performance measure of the classifier.
A precise estimation of Lcom would help reduce false alarms in our analysis.
It should be noted, though, that the existence of a communication channel is a
necessary condition for collusion to happen, but not a sufficient condition to detect
it. In this context it is worth mentioning that a recent study [23] shows that 84.4%
of non-colluding apps in the market place can communicate with other apps using
either explicit (11.3%) or implicit (73.1%) intent calls. Therefore the threat element
(i.e. Lτ) is far more informative for collusion estimation than the communication
element (Lcom) in our model.
Both the validation and testing samples are blind samples, and we have not
systematically investigated how biased or realistic they are.
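5 Model Checking for Collusion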
Filtering is an effective method to isolate app sets. Using software model checking,
we provide a sound method for proving app sets to be clean, which also returns
example traces for potential collusion, based on the K framework [54], c.f. Fig. 8.
We start with a set of apps in the form of Application Package Files (APKs). The
DEX code in each APK file is disassembled into the Smali format with open source
tools. The Smali code of the apps is parsed by the K tool. Compilation in the K tool
translates the K representation of the AndroidTM apps into a rewrite theory in
Maude [18]. Finally, the Maude model checker searches the transition system
compiled by the K tool to answer whether the input set of AndroidTM apps colludes
or not. In the case when collusion is detected, the tool provides a readable
counterexample trace. In this section we focus on information theft only.
5.1.1 Challenges
In the following we explain how we define a transition system using K and
what abstractions we define in order to allow for an effective check for collusion.
Formalising Dalvik bytecode in K poses a number of challenges: there are about
220 instructions to be formalised; the code is object oriented; it is register based (in
contrast to stack based, like Java bytecode); and it utilises callbacks and intent based
communication, see [3]. We provide two different semantics for DEX code, namely
a concrete and an abstract one. While the concrete semantics has the benefit of being
intuitive and thus easy to validate, it is the abstract semantics that we employ
for app model checking. We see the step from the descriptive level provided by [3]
to the concrete, formal semantics as a 'controllable' one, where human intuition
is able to bridge the gap. In future work, we intend to justify the step from the
concrete semantics to the abstract one by a formal proof. Our implementation of
both AndroidTM semantics in K is freely available.⁷ The code of the colluding apps
discussed in this section is accessible via an encrypted web page; the password is
available on request.⁸
The K framework [54] proposes a methodology for the design and analysis of
programming languages; the framework comes with a rewriting-based specification
language and tool support for parsing, interpreting, model checking and deductive
formal verification. The ideal work-flow in the K framework starts with a formal and
executable language syntax and semantics, given as a K specification, which is then
tested on program examples in order to gain confidence in the language definition.
Here, the K framework offers model checking via compilation into Maude programs
(i.e., using the existing reachability tool and LTL Maude model checker).
A K specification consists of configurations, computations, and rules, using a
specialised notation to write semantic entities, i.e., K-cells. For example, the K-cell
representing the set of program variables as a mapping from identifiers Id to
values Val is given by ⟨Id ↦ Val⟩vars. Configurations in K are labelled and nested
K-cells, used to represent the structure of the program state. Rules in K are of
two types: computational and structural. Computational rules represent transitions
in a program execution and are specified as configuration updates. Structural rules
provide internal changes of the program state such that the configuration can enable
the application of computational rules.
The concrete semantics specifies system configurations and transition rules for all
Smali instructions and a number of AndroidTM API calls in K. Here, we strictly
follow the explanations given in [2].
⁷ http://www.cs.swan.ac.uk/~csmarkus/ProcessesAndData/androidsmali-semantics-k.
⁸ http://www.cs.swansea.ac.uk/~csmarkus/ProcessesAndData/sites/default/files/uploads/resources/code.zip.
Fig. 10 Structure of the K configuration: sandbox cells (one per app) containing thread and memory cells; broadcasts with their intents; classes with their methods; and memory objects with their registers
The semantics core module also defines several auxiliary functions which are used
later in other modules for semantic rules. For example, the function
"isKImplementedAPI" is defined to determine whether an API method has been
implemented within the K framework; if not, the interpreter will look for it within
the classes of the invoking application.

Fig. 12 Structure of the K modules for the concrete semantics: Syntax and Semantics core, together with the Loading, Starting, Invoke/return, Control, Read/write and Arithmetics modules

The "loading" module is responsible for constructing the initial configuration.
When running a Smali file in the K framework, it will parse the file according
to the defined syntax and place the entire resulting abstract syntax tree (AST) in
a cell. The rules defined in the loading module are used to take this AST and
distribute its elements to the right cells of the initial configuration. In particular, each
application is placed in a sandbox cell, its classes are placed in the classes cell, etc.
The “invoke/return” module defines the semantic rules for invoking methods and
return instructions. The “control” module specifies the semantics of instructions
such as “if-then” and “goto”, which may change the program counter in a non-
trivially way. The “read/write” module implements the semantics of instructions
for manipulating objects in the memory such as instantiating new objects or array,
initialising elements of an array, retrieving value of an object field and changing the
value of an object field. Finally, the “arithmetics” module specifies the semantics of
arithmetic instructions such as addition, subtraction, multiply, division and bit-wise
operations.
In some situations, our semantics has to deal with unknown values such as
the device’s location returned by AndroidTM OS. In K, unknown values can be
represented by the built-in constant K. To this end, we provide for each of the
“control”, “read/write”, “arithmetics” modules a counter-part that is responsible for
unknown values. For example, when the value to be compared with 0 in an ifz
Smali instruction is unknown, we assume that the result is either true or false,
thereby leading to a non-deterministic execution of the Smali program. Similarly,
arithmetical operations propagate unknown values.
Regarding the semantics of the AndroidTM APIs: these encompass a rich set of
predefined classes and methods, which usually come together with the AndroidTM
OS on an AndroidTM device and hence are not included in the DEX
code of an app. Obviously, one may obtain the Smali code of those API classes and
methods and include it in the analysis. However, this would significantly increase
the size of the Smali code to be analysed in K and consequently the state space of
the obtained models. Instead, we directly implement the semantics of some of these
classes and methods in K rules, based on their description [2]. While the first
approach appears to be more faithful, the second avoids the state-space blow-up
and lets one choose the abstraction level required for the analysis in question.
In Fig. 13 we show the structure of the K modules which implement the semantics
of some API methods.

Fig. 13 Structure of the K modules implementing the semantics of API methods: Location, Intent, Broadcast and Apache-Http on top of the semantics core

In particular, we have implemented a number of APIs, including the modules
Location, Intent, Broadcast, and Apache-Http. Other API classes and methods can
be implemented similarly. For those modules that are not (yet) implemented in K,
we provide a mechanism whereby a call to any of them returns an unknown result,
i.e., the "K" value.
A typical example is the Location module, which is responsible for implementing
the semantics of API methods relating to the Location Manager, such
as registering a callback function for when the device's location changes, i.e., the
requestLocationUpdates method from the LocationManager class. When
a registered callback method is called, it is provided with an input parameter
referring to a Location object. The creator of this object is set to the application
in the current sandbox via the object's created cell. Furthermore, the object is
marked as sensitive in its sensitive cell (see Fig. 10).
The abstract semantics lightens the configuration and the transitions in order to
gain efficiency for model checking, while maintaining enough information to verify
collusion. The abstract configuration has a cell structure similar to the concrete
configuration, except for the memory cell: instead of creating objects, in the abstract
semantics we record the information flow by propagating the object types and
the constants (either strings or numerical). Structurally, the K specification for the
abstract semantics is organised in the same way as the concrete one, c.f. Fig. 12. In
the following we describe the differences that render the abstraction.
In the “read/write” module the abstract semantics neglects the memory-related
details as described next: The abstract semantics for the instructions that create
new object instances (e.g., “new-instance Register, Type”) sets the register to
the type of the newly created object. The arithmetic instructions only define data
dependency between the source registers and the destination register. The move instruction, which copies one register into another, sets the contents of the source register into the destination. Finally, the load/store instructions, which copy from or into memory, are similarly abstracted into data dependences. We exemplify this last class of instructions with the abstract semantics of the iget instruction in Fig. 14.
The abstract semantics is field insensitive, e.g., the iget instruction only maintains the information collected in the object register, R2. In order to add field sensitivity to the abstraction, we only need to enqueue in R1 the field F, such that after the substitution we have R1 ↦ F ↷ L2.
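The corresponding update could be sketched as follows (illustrative Python, not the K rules of Fig. 14; the register names and the dependency-set representation are our own):

```python
# Sketch of the data-flow abstraction for load instructions.
# Each register maps to a set of abstract "sources" (types, constants,
# sensitivity labels); an iget-like load makes the destination depend
# on the object register.
def abstract_iget(regs, dst, obj_reg, field=None, field_sensitive=False):
    deps = set(regs.get(obj_reg, set()))
    if field_sensitive and field is not None:
        # Field sensitivity: additionally record the accessed field.
        deps.add(("field", field))
    regs[dst] = deps
    return regs

regs = {"R2": {"Landroid/location/Location;"}}
abstract_iget(regs, "R1", "R2")                 # field-insensitive
abstract_iget(regs, "R1", "R2", "mLat", True)   # field-sensitive variant
```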
The module “invoke/return” contains the most significant differences of the
abstract semantics w.r.t. the concrete semantics. The invoke instructions differentiate
the API calls from the app’s methods. The methods defined in the app are executed
upon invocation (using a call stack) while the API calls are further discriminated
into app-communication (i.e., “send” or “receive”), APIs which trigger callbacks,
APIs which access sensitive data, APIs which publish data, and ordinary APIs.
We currently consider only Intent-based inter-app communication. All invoke instructions add information to the data-flow as follows: the object on which the method is invoked depends on the parameters of the invoked method. Similarly, the move-result instruction defines a data dependence between the parameters of the most recently invoked method and the register where the result is written. The data-flow abstraction thus allows us to view an API call simply as an instruction producing additional data dependencies. Hence, unlike in the concrete semantics, we do not need to treat these APIs separately (by either executing their code or giving them special semantics). This keeps the abstract semantics lightweight while enlarging the class of apps that can be verified. The price paid is, of course, an over-approximation of app behaviours, which can induce false positive collusion results.
The rules producing transitions in the transition system are defined in the
“control” module. The rules for branching instructions, i.e., if-then instructions,
are always considered non-deterministic in the abstract semantics. The rules for the goto instruction check whether the goto destination has already been traversed in the current execution and, if so, the jump to the destination is replaced by a fall-through. As such, loops are traversed at most once, since the data-flow collection only requires one loop traversal.
Detecting collusion at the abstract semantics level works as follows: when an API accessing sensitive data is invoked in an app A1, the data-flow is augmented with a special label secret(A1). If, via the data-flow abstraction, the “secret” reaches the parameters of a publish invocation of a different app A2 (A1 ≠ A2), then we have discovered a collusion pattern for information theft. Note that the “secret” could be passed from A1 to A2 directly or via other apps A′.
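A minimal sketch of this label propagation over data-flow edges (our own illustrative encoding; the app names, edge representation and the secret(A) label format are stand-ins):

```python
# Propagate secret labels along data-flow edges to a fixpoint, then
# check whether a secret of one app reaches a publish call of another.
def propagate(labels, flows):
    # flows: iterable of (src, dst) data-flow edges
    # labels: node -> set of labels already known to reach it
    changed = True
    while changed:
        changed = False
        for src, dst in flows:
            new = labels.get(src, set()) - labels.get(dst, set())
            if new:
                labels.setdefault(dst, set()).update(new)
                changed = True
    return labels

labels = {("A1", "loc"): {"secret(A1)"}}        # A1 reads sensitive data
flows = [(("A1", "loc"), ("A2", "pub_arg"))]    # value flows to A2's publish call
labels = propagate(labels, flows)
colluding = "secret(A1)" in labels.get(("A2", "pub_arg"), set())
print(colluding)  # True: information-theft collusion pattern found
```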
The property detected in the abstract semantics is a safe over-approximation of Definition 1. Namely, (1) the set of colluding apps S includes two different apps A1 and A2, hence S is not a singleton set; (2) the apps A1 and A2 execute the beginning and the end of the threat (i.e. the extraction and the publication of the secret, respectively), while the apps A′ act as messengers; (3) all the discovered apps contribute to communicating the secret.
Note that we call the abstract collusion result an over-approximation because only “non-colluding” results can guarantee the non-existence of a set S with the characteristics given by Definition 1. If a colluding set S is reported by the abstract model checking, then this is either a true collusion, as argued in (1–3), or a false witness to collusion. A false witness (also named a “spurious counterexample” in abstract model checking) may appear due to the overly conservative nature of the data-flow abstraction. This abstraction assumes that any data “touching” the secret may take it and pass it on (e.g. when the secret is given as a parameter to an API call f, then any other parameter and the result of f are assumed to know the secret). Consequently, any collusion set S reported by the abstract model checking has to be verified (e.g. by exercising the concrete semantics over S).
5.5.1 Evaluation
Our experiments indicate that our approach works correctly: if there is collusion, it is either detected or the analysis times out; if there is no collusion, none is detected. In case of detection, we obtain a trace providing evidence of a run leading to information theft. The experiments further demonstrate the need for an abstract semantics beyond the obvious argument of speed: e.g., in the case of a loop whose number of iterations depends on an environmental parameter that cannot be determined, the concrete semantics yields a timeout, while the abstract semantics is still able to produce a result. Model checking with the abstract semantics is about twice as fast as with the concrete semantics. At least for such small examples, our approach appears to be feasible. (All experiments were carried out on a MacBook Pro with an Intel i7 2.2 GHz quad-core processor and 16 GB of memory.)
6 Related Work
In this section we review previous work that has addressed the identification and prevention of Android malicious software. We first review previous approaches to detecting and identifying Android malware (single apps) in general. Then, we address previous work on the detection and identification of colluding apps. Finally, we review works that focus on collusion prevention.

In general, techniques for detecting Android malware are categorised into two groups: static and dynamic. In static analysis, certain features of an app are extracted and analysed using different approaches, such as machine learning techniques. For example, Kirin [26] proposes a set of policies that allows matching the permissions requested by an app as an indication of potentially malicious behaviour. DREBIN [4] trained Support Vector Machines for classifying malware using a number of features: used hardware components, requested permissions, critical and suspicious API calls, and network addresses. Similar static techniques can be found in [16, 19, 38, 44, 63]. Conversely, dynamic analysis detects malware at run-time. It deploys suitable monitors on Android systems and constantly looks for malicious behaviours exhibited by software within the system. For example, [33] tracks the network traffic (DNS and HTTP requests in particular) in an Android system and then utilises a Naive Bayes classifier to detect malicious behaviours. Similarly, [39] collects information about the usage of the network (data sent and received), memory and CPU, and then uses multivariate time-series techniques to decide whether an app exhibited malicious behaviour. A different approach is to translate Android apps into formal specifications and then employ existing model checking techniques, which explore all possible runs of the apps in order to search for a matching malicious activity represented by formulas of some temporal logic, see, e.g., [10, 60].

In contrast to malware detection, detecting colluding apps involves not only identifying whether a security threat can be carried out by these apps but also revealing whether communication between them occurs during the attack. In other words, existing malware detection techniques are not directly applicable to detecting collusion.
We summarise the state of the art w.r.t. collusion prevention and point the reader to the current open research questions in the field.

A frontal approach to detecting collusion, i.e., analysing pairs, triplets and larger sets of apps, is not practical given the size of the search space. Thus, we consider a pre-filtering step essential for any collusion detection system that is to be used in practice. Even if we could find all collusions among all existing apps, new apps appear every day and could create new collusions with previously analysed ones. Continuously re-analysing the growing space of all Android apps is infeasible, so an effective collusion-discovery tool must include an effective set of methods to isolate potential sets that require further examination.
The best long-term solution would be to enforce more isolation in the Android OS itself. For example, apps could be required to explicitly declare all communications via their manifests (not only inter-app channels but also all Internet domains, ports and services they intend to use), and the OS would then block all undeclared communications. However, this would not work for existing apps (as well as the many apps created before such OS hardening is implemented), so in the meantime the best practical approach is to employ, enhance and expand the array of filtering mechanisms we developed to discover potentially colluding sets of apps.
A filter based on Android app permissions is the simplest one (a sketch follows below). Permissions are very easy and cheap to extract from APKs: no de-compilation, reverse engineering, or complex code or data-flow analysis is required. Alternatively (or additionally) to the two filters described in our chapter, imprecise heuristic methods to find “interesting” app sets may include: statistical code analysis of apps (e.g. to locate APIs potentially
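As a concrete illustration of such a permission-based filter, the following minimal sketch extracts permissions with the Androguard library [20] and flags app pairs whose combined permissions cover both a sensitive source and a plausible exfiltration channel. The permission sets, the pairing criterion and the Androguard import path (which varies across versions) are our own illustrative assumptions, not the chapter's actual filter.

```python
# Sketch of a permission-based pre-filter (illustrative permission sets;
# import path is for Androguard 3.x and may differ in other versions).
from itertools import combinations
from androguard.core.bytecodes.apk import APK

SENSITIVE = {"android.permission.ACCESS_FINE_LOCATION",
             "android.permission.READ_CONTACTS"}
EXFIL = {"android.permission.INTERNET",
         "android.permission.SEND_SMS"}

def permissions(apk_path):
    return set(APK(apk_path).get_permissions())

def candidate_pairs(apk_paths):
    perms = {p: permissions(p) for p in apk_paths}
    for a, b in combinations(apk_paths, 2):
        combined = perms[a] | perms[b]
        # Flag pairs that together can both read sensitive data and
        # communicate it off-device; these go on for deeper analysis.
        if combined & SENSITIVE and combined & EXFIL:
            yield (a, b)
```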
Acknowledgements This work has been supported by UK Engineering and Physical Sciences
Research Council (EPSRC) grant EP/L022699/1. The authors would like to thank the anonymous
reviewers for their helpful comments, and Erwin R. Catesbeiana (Jr) for pointing out the
importance of intention in malware analysis.
References
17. Chin, E., Felt, A.P., Greenwood, K., Wagner, D.: Analyzing inter-application communication
in AndroidTM . In: MobiSys’11, pp. 239–252 (2011)
18. Clavel, M., Durán, F., Eker, S., Lincoln, P., Martí-Oliet, N., Meseguer, J., Talcott, C.: All about Maude. LNCS 4350 (2007)
19. Dai, G., Ge, J., Cai, M., Xu, D., Li, W.: SVM-based malware detection for AndroidTM
applications. In: Proceedings of the 8th ACM Conference on Security & Privacy in Wireless
and Mobile Networks, New York, NY, USA, June 22–26, 2015, pp. 33:1–33:2 (2015). DOI
10.1145/2766498.2774991. URL http://doi.acm.org/10.1145/2766498.2774991
20. Desnos, A.: Androguard. https://github.com/androguard/androguard (2016)
21. Dubey, A., Misra, A.: AndroidTM Security: Attacks and Defenses. CRC Press (2013)
22. Elenkov, K.: AndroidTM Security Internals: An In-Depth Guide to AndroidTM ’s Security
Architecture. No Starch Press (2014)
23. Elish, K.O., Yao, D., Ryder, B.G.: On the need of precise inter-app ICC classification for
detecting AndroidTM malware collusions. In: MoST (2015)
24. Elish, K.O., Yao, D.D., Ryder, B.G.: On the need of precise inter-app icc classification
for detecting AndroidTM malware collusions. In: Proceedings of IEEE Mobile Security
Technologies (MoST), in conjunction with the IEEE Symposium on Security and Privacy
(2015)
25. Enck, W., Gilbert, P., Han, S., Tendulkar, V., Chun, B.G., Cox, L.P., Jung, J., McDaniel, P.,
Sheth, A.N.: Taintdroid: an information-flow tracking system for realtime privacy monitoring
on smartphones. ACM Transactions on Computer Systems (TOCS) 32(2), 5 (2014)
26. Enck, W., Ongtang, M., McDaniel, P.: On lightweight mobile phone application certification.
In: Proceedings of the 16th ACM conference on Computer and communications security,
pp. 235–245. ACM (2009)
27. Enck, W., Ongtang, M., McDaniel, P.: Understanding AndroidTM security. IEEE security &
privacy (1), 50–57 (2009)
28. Fritz, C., Arzt, S., Rasthofer, S., Bodden, E., Bartel, A., Klein, J., le Traon, Y., Octeau, D.,
McDaniel, P.: Highly precise taint analysis for AndroidTM applications. EC SPRIDE, TU
Darmstadt, Tech. Rep (2013)
29. Gasior, W., Yang, L.: Network covert channels on the AndroidTM platform. In: Proceedings of
the Seventh Annual Workshop on Cyber Security and Information Intelligence Research, p. 61.
ACM (2011)
30. Gasior, W., Yang, L.: Exploring covert channel in AndroidTM platform. In: Cyber Security
(CyberSecurity), 2012 International Conference on, pp. 173–177 (2012). DOI 10.1109/Cyber-
Security.2012.29
31. Gordon, M.I., Kim, D., Perkins, J.H., Gilham, L., Nguyen, N., Rinard, M.C.: Information flow
analysis of AndroidTM applications in droidsafe. In: NDSS (2015)
32. Gunasekera, S.: AndroidTM Apps Security. Apress (2012)
33. Han, H., Chen, Z., Yan, Q., Peng, L., Zhang, L.: A real-time AndroidTM malware detection
system based on network traffic analysis. In: Algorithms and Architectures for Parallel
Processing - 15th International Conference, ICA3PP 2015, Zhangjiajie, China, November 18–
20, 2015. Proceedings, Part III, pp. 504–516 (2015). DOI 10.1007/978-3-319-27137-8_37.
URL http://dx.doi.org/10.1007/978-3-319-27137-8_37
34. Hardy, N.: The confused deputy:(or why capabilities might have been invented). ACM SIGOPS
Operating Systems Review 22(4), 36–38 (1988)
35. Harley, D., Lee, A.: Antimalware evaluation and testing. In: AVIEN Malware Defense Guide.
Elsevier (2007)
36. Huskamp, J.C.: Covert communication channels in timesharing systems. Ph.D. thesis, Califor-
nia Univ., Berkeley (1978)
37. Kalutarage, H.K., Nguyen, H.N., Shaikh, S.A.: Towards a threat assessment for apps collusion. Telecommunication Systems 1–14 (2016). doi:10.1007/s11235-017-0296-1
38. Kate, P.M., Dhavale, S.V.: Two phase static analysis technique for AndroidTM malware
detection. In: Proceedings of the Third International Symposium on Women in Computing
and Informatics, WCI 2015, co-located with ICACCI 2015, Kochi, India, August 10–13,
2015, pp. 650–655 (2015). DOI 10.1145/2791405.2791558. URL http://doi.acm.org/10.1145/
2791405.2791558
39. Kim, K., Choi, M.: AndroidTM malware detection using multivariate time-series tech-
nique. In: 17th Asia-Pacific Network Operations and Management Symposium, APNOMS
2015, Busan, South Korea, August 19–21, 2015, pp. 198–202 (2015). DOI 10.1109/AP-
NOMS.2015.7275426. URL http://dx.doi.org/10.1109/APNOMS.2015.7275426
40. Klieber, W., Flynn, L., Bhosale, A., Jia, L., Bauer, L.: AndroidTM taint flow analysis for app
sets. In: Proceedings of the 3rd ACM SIGPLAN International Workshop on the State of the
Art in Java Program Analysis, pp. 1–6. ACM (2014)
41. Krishnamoorthy, K.: Handbook of statistical distributions with applications. CRC Press (2015)
42. Li, L., Bartel, A., Bissyand, T., Klein, J., Le Traon, Y., Arzt, S., Siegfried, R., Bodden, E.,
Octeau, D., Mcdaniel, P.: IccTA: Detecting Inter-Component Privacy Leaks in AndroidTM
Apps. In: Proceedings of the 37th International Conference on Software Engineering (ICSE
2015) (2015)
43. Li, L., Bartel, A., Bissyandé, T.F., Klein, J., Le Traon, Y.: ApkCombiner: Combining multiple
AndroidTM apps to support inter-app analysis. In: SEC’15, pp. 513–527. Springer (2015)
44. Li, Q., Li, X.: AndroidTM malware detection based on static analysis of characteristic tree.
In: 2015 International Conference on Cyber-Enabled Distributed Computing and Knowledge
Discovery, CyberC 2015, Xi’an, China, September 17–19, 2015, pp. 84–91 (2015). DOI
10.1109/CyberC.2015.88. URL http://dx.doi.org/10.1109/CyberC.2015.88
45. Maji, A.K., Arshad, F., Bagchi, S., Rellermeyer, J.S., et al.: An empirical study of the
robustness of inter-component communication in AndroidTM . In: Dependable Systems and
Networks (DSN), 2012 42nd Annual IEEE/IFIP International Conference on, pp. 1–12. IEEE
(2012)
46. Marforio, C., Francillon, A., Capkun, S.: Application collusion attack on the permission-based
security model and its implications for modern smartphone systems. technical report (2011)
47. Marforio, C., Ritzdorf, H., Francillon, A., Capkun, S.: Analysis of the communication between
colluding applications on modern smartphones. In: Proceedings of the 28th Annual Computer
Security Applications Conference, pp. 51–60. ACM (2012)
48. Muttik, I.: Partners in crime: Investigating mobile app collusion. In: McAfee® Threat Report
(2016)
49. Octeau, D., McDaniel, P., Jha, S., Bartel, A., Bodden, E., Klein, J., Le Traon, Y.: Effective
inter-component communication mapping in AndroidTM with epicc: An essential step towards
holistic security analysis. In: USENIX Security 2013 (2013)
50. Peng, H., Gates, C., Sarma, B., Li, N., Qi, Y., Potharaju, R., Nita-Rotaru, C., Molloy, I.: Using
probabilistic generative models for ranking risks of AndroidTM apps. In: Proceedings of the
2012 ACM conference on Computer and communications security, pp. 241–252. ACM (2012)
51. Rasthofer, S., Arzt, S., Lovat, E., Bodden, E.: Droidforce: enforcing complex, data-centric,
system-wide policies in AndroidTM . In: Availability, Reliability and Security (ARES), 2014
Ninth International Conference on, pp. 40–49. IEEE (2014)
52. Ravitch, T., Creswick, E.R., Tomb, A., Foltzer, A., Elliott, T., Casburn, L.: Multi-app security
analysis with fuse: Statically detecting AndroidTM app collusion. In: Proceedings of the 4th
Program Protection and Reverse Engineering Workshop, p. 4. ACM (2014)
53. Ritzdorf, H.: Analyzing covert channels on mobile devices. Ph.D. thesis, ETH Zürich,
Department of Computer Science (2012)
54. Roşu, G., Şerbănuţă, T.F.: An overview of the K semantic framework. Journal of Logic and
Algebraic Programming 79(6), 397–434 (2010)
55. Sarma, B.P., Li, N., Gates, C., Potharaju, R., Nita-Rotaru, C., Molloy, I.: AndroidTM permis-
sions: a perspective combining risks and benefits. In: Proceedings of the 17th ACM symposium
on Access Control Models and Technologies, pp. 13–22. ACM (2012)
56. Sbirlea, D., Burke, M., Guarnieri, S., Pistoia, M., Sarkar, V.: Automatic detection of inter-
application permission leaks in AndroidTM applications. IBM Journal of Research and
Development 57(6), 10:1–10:12 (2013). DOI 10.1147/JRD.2013.2284403
57. Schlegel, R., Zhang, K., Zhou, X.y., Intwala, M., Kapadia, A., Wang, X.: Soundcomber: A
stealthy and context-aware sound trojan for smartphones. In: NDSS’11, pp. 17–33 (2011)
58. Shen, S.: Setting the record straight on moplus sdk and the wormhole vulnerabil-
ity. http://blog.trendmicro.com/trendlabs-security-intelligence/setting-the-record-straight-on-
moplus-sdk-and-the-wormhole-vulnerability/. Accessed: 04/0/2016
59. Six, J.: Application Security for the AndroidTM Platform: Processes, Permissions, and Other
Safeguards. O’Reilly (2011)
60. Song, F., Touili, T.: Model-checking for AndroidTM malware detection. In: J. Garrigue (ed.)
Programming Languages and Systems - 12th Asian Symposium, APLAS 2014, Singapore,
November 17–19, 2014, Proceedings, Lecture Notes in Computer Science, vol. 8858, pp. 216–
235. Springer (2014). DOI 10.1007/978-3-319-12736-1_12. URL http://dx.doi.org/10.1007/
978-3-319-12736-1_12
61. Suarez-Tangil, G., Tapiador, J.E., Peris-Lopez, P.: Compartmentation policies for AndroidTM
apps: A combinatorial optimization approach. In: Network and System Security, pp. 63–77
(2015)
62. Suarez-Tangil, G., Tapiador, J.E., Peris-Lopez, P., Ribagorda, A.: Evolution, detection and
analysis of malware for smart devices. Comm. Surveys & Tutorials, IEEE 16(2), 961–987
(2014)
63. Wang, Z., Li, C., Guan, Y., Xue, Y.: Droidchain: A novel malware detection method for
AndroidTM based on behavior chain. In: 2015 IEEE Conference on Communications and
Network Security, CNS 2015, Florence, Italy, September 28–30, 2015, 727–728 (2015). DOI
10.1109/CNS.2015.7346906. URL http://dx.doi.org/10.1109/CNS.2015.7346906
Dynamic Analysis of Malware Using
Run-Time Opcodes
Abstract The continuing fight against intentionally malicious software has, to date,
favoured the proliferators of malware. Signature detection methods are increasingly impotent against rapidly evolving obfuscation techniques. Research has recently
focussed on the low-level opcode analysis of disassembled executable programs,
both statically and dynamically. While able to detect malware, static analysis often
still cannot unravel obfuscated code; dynamic approaches allow investigators to
reveal the run-time code. Old and inadequately sampled datasets have limited the
extrapolation potential of much of the body of research. This work presents a
dynamic opcode analysis approach to malware detection, applying machine learning
techniques to the largest dataset of its kind, both in terms of breadth (610–100k
features) and depth (48k samples). N-gram analysis of opcode sequences for n = 1 to 3 was applied as a means of enhancing the feature set. Feature selection was
then investigated to tackle the feature explosion which resulted in more than 100,000
features in some cases. As the earliest detection of malware is the most favourable,
run-length, i.e. the number of recorded opcodes in a trace, was examined to find the
optimal capture size. This research found that dynamic opcode analysis can distinguish malware from benignware with a 99.01% accuracy rate, using a sequence of only 32k opcodes and 50 features. This demonstrates that a dynamic opcode analysis
approach can compare with static analysis in terms of speed. Furthermore, it has a
very real potential application to the unending fight against malware, which is, by
definition, continuously on the back foot.
1 Introduction
Since the 1986 release of Brain, regarded as the first PC virus, both the incessant proliferation of malware (malicious software) and the failure of standard anti-virus (AV) software have posed significant challenges for all technology users. In the
second quarter of 2016, McAfee Labs reported that the level of previously unseen
instances of malware had exceeded 40 million for the preceding three months. The
total number of unique malware samples reached 600 million, a growth of almost
one-third on the previous year [4].
Malware was previously seen as the product of miscreant teenagers, whereas
it is now largely understood to be a highly effective tool for criminal gangs and
alleged national malfeasance. As a tool, malware facilitates a worldwide criminal
industry and enables theft and extortion on a global scale. One data breach in
a US-based company has been estimated to cost an average $5.85 million, or
$102 for each record compromised [1]. Estimates of the cost of cybercrime as a share of GDP in modern developed countries reach as much as 1.6% in Germany, with other technology-heavy nations showing alarming levels (USA 0.64% and China 0.63%) [4].
This chapter is presented in the following way: the remainder of the current
section describes malware, techniques for detecting malicious code and subsequent
counter-attacks. Section 2 presents related research in the field and the gap in the
current body of literature. Section 3 describes the methodology used in establishing
the dataset and Section 4 presents the results of machine learning analyses of the
data. The final section discusses the conclusions which can be reached about the
results presented, and describes potential future work.
e.g. Trojan.Win32.Crilock.B = a Windows-based Trojan, and the second variant found from the Crilock family
type of the malware using a traditional approach. However, referring to the family
variable enables a descriptive reference to the whole malware.
1.2.1 Signature-Based
1.2.2 Anomaly-Based
system. The monitoring phase then detects deviation from the baselines established
in training [16]. For example, a poisoned PDF may differ greatly from the typical
or expected structure of a clean PDF. Likewise, a one-hour TV show file may be
expected to be around 350 MB in size. If the file was a mere 120 KB, it would be
classed as suspicious. One of the major benefits of anomaly detection is the potential
to detect zero-day threats. As the system is trained on typical traits, it does not need
to know all of the atypical traits in advance of detection. Anomaly-based systems
are, however, limited by a propensity towards false positives and by the scale of the features required to adequately model the system under inspection [16].
1.2.3 Specification-Based
Analysis is considered static when the subject is not executed and can be conducted
on the source code or the compiled binary representation [13]. The analysis seeks
to determine the function of the program, by using the code and data structures
within [40]. MD5 hash representations of the executable can be checked against
databases of hashes from previously detected malware. Call graphs can demonstrate
the architecture of the software and the possible flows between functions. String analysis seeks out URLs, IP addresses, command line options, Windows Portable Executable (PE) files and passwords [29]. The main advantage of static analysis techniques is that there is no threat from the malware sample: as it is never executed, it can be safely analysed in detail. All code paths can potentially be examined when the whole of the code is visible, whereas dynamic analysis examines only the code actually executed. Examining all code paths is not always an advantage, as the information revealed may be redundant dead code inserted to mask the true intentions of the file. Static analysis techniques are increasingly rendered redundant by increasingly complex malware, which can obfuscate itself from analysis until the point of execution. As the code is never executed in static analysis, obfuscated code may never reveal its true form.
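As a small illustration of the hash-checking step mentioned above, the following sketch hashes a binary and looks it up in a set of known-malicious hashes (the set here is illustrative, seeded with the well-known EICAR test-file MD5; a real system would query a signature database):

```python
# Minimal sketch of an MD5-based signature check.
import hashlib

def md5_of(path, chunk=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Illustrative "database": the EICAR test file's MD5.
known_malicious = {"44d88612fea8a8f36de82e1278abb02f"}

if md5_of("sample.exe") in known_malicious:
    print("match: previously seen malware")
```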
Dynamic analysis involves the execution of the file under investigation. Techniques
mine information at the point of memory entry, during runtime and post-execution,
therefore bypassing many of the obfuscation methods which stymie static analysis
techniques. The dynamic environment can be: (a) native: where no separation
between the host and the analysis environment is provided, as analysis takes place
in the native host OS. With malware analysis, this obviously creates issues with
the effects of the malware on the host. Software such as DeepFreeze is available to provide a snapshot of the clean native environment, which can then be reloaded following each instance of malware execution. This can be quite a time-consuming process, however, as the entire system must be restored to its point of origin [23]; (b) emulated: where
the host controls the guest environment through software emulating hardware. As
this software is driven by the underlying host architecture, emulation can create
performance issues; (c) virtualised: a virtual machine provides an isolated guest
environment controlled by the host. In a perfect environment, the true nature of the
subject is revealed and examinable. However, in retaliation, malware writers have
created a battery of evasion techniques in order to detect an emulated, simulated or
virtualized environment, including anti-debugging, anti-instrumentation and anti-
Virtual Machine tactics. One major disadvantage of dynamic approaches is the runtime overhead of having to execute the program for a specified length of time. Static tools such as IDA Pro can disassemble an executable in a few seconds, whereas dynamic analysis can take a few hundred times longer.
1. Junk-insertion: inserts additional code into a block without altering its end behaviour. This can be dead code (such as NOP padding), irrelevant code (such as trivial mathematical functions), or complex code with no purpose;
2. Code transposition: reorders the instructions in the code so that the resultant
binary differs from the executable or from the predicted order of execution;
3. Register reassignment: substitutes the use of one register for another in a specific
range;
4. Instruction substitution: takes advantage of the concept that there are many
ways to do the same task in computer science. Using an alternative instruction
to achieve the same end result provides an effective method for altering code
appearance.
Packing software, or packers, is used for the obfuscation, compression and encryption of a PE file, primarily to evade static detection methods. When loaded into RAM, the original executable is restored from its packed on-disk state, ready to unleash the payload on the victim machine [5].
Polymorphic (many shapes) malware is comprised of both the payload and an
encryption/decryption engine. The polymorphic engine mutates the static code/payload, which is decrypted at runtime back to the original code. This can cause
problems for AV scanners, as the payload never appears the same in any descendant
of the original code. However, the encryption/decryption engine potentially remains
constant, and as such may be detected by signature- or pattern-matching scanners
[26]. A mutation engine may be employed to obfuscate the decryption routine. This
engine randomly generates a new decryption routine on each iteration, meaning the
payload is encrypted and the decryptor is altered to appear different to the previously
utilized decryptors [41].
Metamorphic (changes shapes) malware takes obfuscation a step further
than polymorphism. With metamorphic malware, the entire code is rewritten on
each iteration, meaning no two samples are exactly alike. As with polymorphic
malware, the code is semantically the same, despite being different in construction
or bytecode, with as few as six bytes of similarity between iterations [21].
Metamorphic malware can also use inert code excerpts from benignware to not only
cause a totally different representation, but to make the file more like benignware,
and thus evade detection [20].
1.4 Summary
The ongoing fight against malware seeks to find strategies to detect, prevent and
mitigate malicious code before it can harm targeted systems. Specification-based
and Anomaly-based detection methods model whole systems or networks for a
baseline, and report changes (anomalous network traffic, registry changes, etc.), but are prone to high levels of false detections. Signature detection is the most commonly used method among commercial AV applications [43]. With the ever-apparent evolution of malware, such approaches have been shown to be increasingly
2 Related Research
families and >100 variations per family. No controls were used in the sampling
across categories, with 497 worm, 3048 backdoor, and 3176 trojan variants, nor was
there a methodological statement as to how these categories were sampled.
Previous research in the field has typically used samples taken from the VxHeaven
website, which are old and outdated, having last been updated in 2010. There are
clear and notable limitations in the methodologies employed, with the majority
having datasets which were not adequately sized, structured or sampled. Datasets
which were generated statically failed to adequately address any serious form of
obfuscation, with some just initially excluding any packed sample. While dynamic
analysis does have disadvantages, it does offer a snapshot of the malware behaviour
at runtime, regardless of obfuscation, and has been recommended by researchers
who had previously focussed on static analysis [36]:
Indeed, broadly used static detection methods can deal with packed malware only by using
the signatures of the packers. As such, dynamic analysis seems like a more promising
solution to this problem ([18]) [36, p.226]
3 Methodology
Research in this field has typically used online repositories such as VxHeaven [2]
or VirusShare [33]. The corpus of the former is dated, with the last updates being in
2010. VirusShare, however, contains many more samples, is continuously updated,
and offers useful metadata. The site receives malware from researchers, security
teams and the public, and makes it available for research purposes on an invite-
only basis. The three most recent folders of malware, at the time of writing, were
downloaded, representing approximately 195,000 instances of malware, of varying
formats. The Malicia dataset harvested from drive-by-downloads in the wild in [30]
was also obtained and incorporated into the total malware corpus. Benign software
was harvested from Windows machines, representing standard executables which
would be encountered in a benchmark environment.
As the files in the malware corpus were listed by MD5 hash, with no file format
information, a bespoke system was built to generate a database of attribute metadata
for the samples. Consideration was given to using standard *nix commands such as file to yield the file formats. However, these did not provide all the necessary information, and so a custom solution was engineered. Figure 3 illustrates the system employed.
A full-feature API key was granted by VirusTotal for this pre-processing stage.
This allows a hash to be checked against the VirusTotal back-end database, which
includes file definitions and the results of 54 AV scanners on that file. The generated
report was parsed for the required features (Type, Format, First Seen, 54 AV scans, etc.) and written to a local database for storage. A final step in the attribution system
allows the querying of the local database for specific file types and for files to be
grouped accordingly. For the present research, we were interested in Windows PE
files and our initial dataset yielded approximately 90,000 samples. Other file types
(Android, PDF, JS etc) were categorized and stored for future research.
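The per-hash lookup step might resemble the following sketch; the endpoint and field names follow VirusTotal's public v2 API, but whether the chapter's tooling uses exactly this form is an assumption:

```python
# Sketch of the VirusTotal attribution step (v2 API; requires a key).
import requests

VT_URL = "https://www.virustotal.com/vtapi/v2/file/report"

def fetch_report(api_key, md5):
    resp = requests.get(VT_URL, params={"apikey": api_key, "resource": md5})
    resp.raise_for_status()
    report = resp.json()
    # Keep only the attributes needed for the local database.
    return {
        "md5": md5,
        "positives": report.get("positives"),  # engines that flagged the file
        "total": report.get("total"),          # engines that scanned it
        "scan_date": report.get("scan_date"),
        "scans": report.get("scans", {}),      # per-engine verdicts
    }
```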
On creation of the local attribute dataset, there was noticeable variation in whether each AV scanner deemed a specific file malicious, and also in the designated malware type and family. This presented a challenge to
our methodological stance. If a file is detected by very few of the 54 AV scanners,
it could be viewed as a potential false positive (i.e. detected as malicious, though
truly benign), or indeed that it is an instance of malware which can successfully
evade being detected by most of the AV scanners. False positive detections can
occur when a scanner mistakenly detects a file as malicious. This is exacerbated by
the use of fuzzy hashing, where hash representations of a file are seen as similar
to a malicious file, and the fact that scanning engines and signature databases are
often shared by more than one vendor. The false positive issue can have serious
consequences, such as rendering an OS unusable. It has become problematic enough
that VirusTotal has attempted to build a whitelist of applications from their trusted
sources, so that false positives can be detected and mitigated [24]. As such, we
(Fig. 3: overview of the attribution system. vtScanner.py queries VirusTotal with MD5 lists from VirusShare and Malicia and stores the reports in a local malware database; fileMover.py then queries the file type of each sample, extracts all Win32 PE files and appends .exe to enable execution.)
(Fig. 4: the distribution of the number of samples detected by each quantity of AV scanners (1–54, horizontal axis), with the cumulative percentage.)
A process for extracting opcode run-traces from executed malware was established
in [31, 32]. This generates the required traces but is manually set up and operated, and so is infeasible for generating a large dataset. As such, automated scalable processes were given consideration.
The work in [23] shows practical automated functions for use with VirtualBox
virtual machines, though the methods used are now deprecated. A virtualization
test-bed which resides outside of the host OS was developed by [12]. The authors
claimed the tool was undetectable when tested by 25,118 malware samples, though
no breakdown of the dataset is provided. However, the software was last updated
in October 2015, is only compatible with Windows XP and requires a deprecated
version of the Xen hypervisor, and so was not implemented.
The ubiquitous Cuckoo sandbox environment was also considered. While it is a valuable tool in malware and forensic analysis, we sought a system which provided the exact output of the method established in [31, 32]. Furthermore, we wished to develop an opcode-focussed malware analysis environment, and so chose to engineer a bespoke automated execution and tracing system.
VirtualBox was implemented as the virtualization platform. While other virtualization software exists, such as VirtualPC, Xen and VMware, the API for VirtualBox is feature-rich and particularly suited to the present research. VirtualBox
comprises multiple layers on top of the core hypervisor, which operates at the kernel
level and permits autonomy for each VM from the host and other VMs. The API
allows a full-feature implementation of the VirtualBox stack, both through the GUI
and programmatically, and as such is well-documented and extensively tested [3].
A baseline image was created with Windows 10 64-bit, 2 GB of RAM, VT-x acceleration, 2 cores of an Intel i5-5300 CPU, and the VBoxGuestAdditions v4.3.30 add-on installed. Windows 10 was selected as the guest OS due to its modernity and market share. VBoxGuestAdditions allows extra features to be enabled inside
the VM, including folder-sharing and application execution from the host, both of
which are key features of the implementation. Full internet access was granted to
the guest and traffic was partially captured for separate research, and monitored as
a further check for the liveness of the malware. Security features were minimised
within the guest OS to maximise the effects of the executed malware.
OllyDbg v2 was preinstalled in the VM snapshot to trace the runtime behaviour of each executed file. This is a freeware assembler-level debugger which can directly load both PE and DLL files and then debug them. Since the presence of a debugger can be detected by both malicious and benign files, which may then deploy masking techniques, we used StrongOD v0.4.8.892, as per [31], to cloak the presence of the debugger.
The guest OS was crafted to resemble a standard OS, with document and
web histories, Java, Flash, .Net, non-empty recycling bin etc. While anti-anti-
virtualization strategies were investigated, and implemented where appropriate, not
all could be achieved while maintaining essential functionality, such as remote
(Fig. 5: workflow of the automated implementation. A Python script on the host loads a clean snapshot of the virtualised guest OS, executes the next PE file from the list under the debugger, which captures the opcodes, tears the VM down, and stores the opcode-representation document in an archive, repeating until all files are processed.)
execution. However, while operating at the opcode level, we get the opportunity to
monitor anti-virtualization behaviour and can turn this into a feature for detection.
Furthermore, this research seeks to investigate the execution of malware as a
user would experience it, which would include cloud-based virtualized machine
instances. Figure 5 illustrates the workflow of the automated implementation used
for the present research. On starting, a user-specified shared folder is parsed for all
files which are listed for execution. The VM is spun up using the baseline snapshot, mitigating any potential hangovers from previously executed malware. OllyDbg
is loaded with the masking plug-in and the next file from the execution list as a
parameter. The debugger loads the PE into memory and pauses prior to execution.
The debugger is set to live overwrite an existing empty file with the run-trace of
opcode instructions which have been sent to the CPU, set at a trace-into level so
that every step is captured. This is in contrast to static analysis, which only yields
potentially executed code and code paths. Figure 6 shows an example of a run-trace file yielded by this process. After the set run-time of 9 min has elapsed, the VM is torn down and the process restarts with the next file to be executed until the list is exhausted. A 9-min execution time was selected, with 1 min for start-up and teardown, as we are focussed on detecting malware as accurately as possible, but within as short a runtime as possible. While a longer runtime may potentially yield richer trace files and more effectively trace malware with obfuscating sleep functions, our research concentrates on the initial execution stage for detection processes.
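The per-sample loop can be sketched as follows; this is a minimal illustration driving VirtualBox through its VBoxManage CLI rather than the API used by the actual system, and the VM name, snapshot, guest credentials, paths and guestcontrol syntax (which varies between VirtualBox versions) are placeholders:

```python
# Sketch of the automated execute-and-trace loop via the VBoxManage CLI.
import subprocess
import time

VM, SNAPSHOT, RUN_MINUTES = "win10-analysis", "clean-baseline", 9

def run_sample(sample_path):
    # Restore the clean baseline so no state leaks between samples.
    subprocess.run(["VBoxManage", "snapshot", VM, "restore", SNAPSHOT], check=True)
    subprocess.run(["VBoxManage", "startvm", VM, "--type", "headless"], check=True)
    # Launch the (cloaked) debugger in the guest with the sample as argument.
    subprocess.run(["VBoxManage", "guestcontrol", VM, "start",
                    "--username", "analyst", "--password", "secret",
                    "--exe", r"C:\Tools\ollydbg.exe",
                    "--", "ollydbg.exe", sample_path], check=False)
    time.sleep(RUN_MINUTES * 60)          # let the run-trace accumulate
    subprocess.run(["VBoxManage", "controlvm", VM, "poweroff"], check=True)

with open("execution_list.txt") as f:
    for sample in f.read().split():
        run_sample(sample)
```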
The bottleneck in dynamic execution is run-time, so any increase in capture rate
will be determined by parallelism while run-time is held constant. Two key benefits
of the implementation presented are the scalability and automated execution. The
initial execution phase was distributed across 14 physical machines, allowing the
dataset to be processed in parallel. This allowed the creation of such a large dataset,
in the context of the literature, within a practical period of time. Indeed, throughput was capped only by the number of physical machines available. It is entirely feasible to
have a cloud- or server- based implementation of this tracing system, with a virtually
unlimited number of nodes.
Prior to analysis, the corpus of trace files requires parsing to extract the opcodes in sequence. A bespoke parser was employed to scan through each run-trace file and keep a running count of the occurrence of each of the 610 opcodes listed in
the Intel x86/x64 architecture [22]. This list is more comprehensive than that of
[31, 32], as previous research has indicated that more rarely occurring opcodes
can provide better discrimination between malware and benignware than more
frequently occurring opcodes [7, 17].
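The counting step might look like the following minimal sketch; the trace-line format (mnemonic-first lines) is an assumption, whereas the chapter's parser handles OllyDbg's actual trace format and the full 610-opcode list [22]:

```python
# Sketch of the run-trace parser: count each opcode mnemonic in a trace.
from collections import Counter

def count_opcodes(trace_path, valid_opcodes):
    counts = Counter()
    with open(trace_path, errors="ignore") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            mnemonic = parts[0].lower()  # assumes mnemonic-first trace lines
            if mnemonic in valid_opcodes:
                counts[mnemonic] += 1
    return counts
```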
As the run-traces vary in length (i.e. number of lines), a consistent run-length is needed for the present research in order to investigate the effects of run-length on malware detection. A file slicer was used on the run-trace dataset to truncate all files to the maximum required length in order to control run-length. As partly per [31, 32],
(Figure residue: run-traces are parsed into count-based datasets and n-gram analysis is applied at run-lengths of 1k, 2k, 4k, 8k, 16k, 32k and 64k opcodes.)
From initial download through to execution, the number of samples was depleted by approximately 20%. A further 10% of files failed to provide a run-trace for a variety of reasons, such as missing DLLs or lock-outs (Table 1).
The quantity of samples and features in any large dataset can provide method-
ological challenges for analyses, which was pertinent in the present study. For
example, with the n = 3, 64k run-length dataset, there were 47,975 instances with 100,318 features, or 4.82 billion data points. When contained within a CSV format, the file size approaches 9.4 GB. All analyses were conducted on a server-
class machine with a 12-core Intel Xeon CPU and 96 GB of RAM. Despite this
computational power, datasets of such magnitude are problematic, and so a sparse-representation file format was chosen (i.e. zeros were removed from the dataset, their presence implied by omission). This reduced the dataset file sizes by as much
as 90%, indicating a large quantity of zero entries, and so feature reduction was
investigated, as discussed further below.
All malware types were merged into a single malware category and compared to the benign class. Due to the size imbalance, a hybrid over-/under-/sub-sampling approach was taken to balance the classes. The minority (benign) class was synthetically oversampled with the Synthetic Minority Oversampling Technique (SMOTE) [10]. This approach uses a reverse-KNN algorithm to generate new instances with values
(Figure residue: for each run-length dataset (1k–64k) and each n = 1..3, the majority class is randomly subsampled into nine subsets, each compared against the benign class, and the results averaged.)
conducted initially by a coarse grid search, followed by a fine grid search. For both parameters, higher is better, especially over large datasets, though classification performance will plateau while a greater computational overhead penalty is incurred.
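The preceding text breaks off, but it describes tuning two Random Forest parameters via a coarse-then-fine grid search. A scikit-learn sketch of that procedure, assuming the two parameters are the forest size and the number of features per split, with illustrative ranges:

```python
# Sketch of a coarse-then-fine grid search over two Random Forest
# parameters (assumed: n_estimators and max_features; ranges illustrative).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

coarse_grid = {"n_estimators": [50, 100, 200, 400],
               "max_features": [0.1, 0.3, 0.5, "sqrt"]}
coarse = GridSearchCV(RandomForestClassifier(random_state=0),
                      coarse_grid, cv=10, scoring="accuracy", n_jobs=-1)
coarse.fit(X_bal, y_bal)  # balanced data from the previous sketch

# The fine grid is then centred on the coarse optimum.
best_n = coarse.best_params_["n_estimators"]
fine_grid = {"n_estimators": [max(10, best_n - 50), best_n, best_n + 50],
             "max_features": [coarse.best_params_["max_features"]]}
fine = GridSearchCV(RandomForestClassifier(random_state=0),
                    fine_grid, cv=10, scoring="accuracy", n_jobs=-1)
fine.fit(X_bal, y_bal)
print(fine.best_params_)
```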
4 Results
Standard machine learning outputs were yielded in the evaluation of the classifier
for each dataset:
1. Accuracy, i.e. the percentage of correctly assigned classes (0% indicating a poor classifier, 100% showing perfect accuracy):

   $\mathrm{ACC} = \dfrac{TP + TN}{\mathit{Total\ Population}}$

2. F1 score (1 is the ideal):

   $F_1 = \dfrac{2TP}{2TP + FP + FN}$

3. Area Under Receiver Operating Characteristic curve, i.e. when the True Positive rate is plotted against the False Positive rate, the area underneath this curve (1 is the ideal).

4. Area Under Precision-Recall curve (1 is the ideal), where

   $\mathrm{Precision} = \dfrac{TP}{TP + FP}, \qquad \mathrm{Recall} = \dfrac{TP}{TP + FN}$
The evaluations of each classification, prior to any feature reduction, are depicted in
Fig. 10.
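These measures could be computed with scikit-learn as in the following sketch; the label and score vectors are illustrative stand-ins for the cross-validated outputs:

```python
# Sketch computing the four reported measures with scikit-learn.
from sklearn.metrics import (accuracy_score, f1_score,
                             roc_auc_score, average_precision_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth classes
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # predicted classes
y_score = [0.1, 0.6, 0.9, 0.8, 0.4, 0.2, 0.7, 0.3]   # malicious-class probability

print("ACC   ", accuracy_score(y_true, y_pred))
print("F1    ", f1_score(y_true, y_pred))
print("AU-ROC", roc_auc_score(y_true, y_score))
print("AU-PR ", average_precision_score(y_true, y_score))  # AU-PR approximation
```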
The overall accuracy across all 21 analyses had a mean of 95.87%, with a range of 93.01–99.05%. The accuracy level declined in 3 of the 7 categories between n = 2 and n = 3, though it increased in 3 and remained stable in one. This indicates that there is marginal gain (0.06% on average) in increasing n from 2 to 3. The n = 1 cohort did, however, show an overall higher level of accuracy in all but one run-length (16k) and a greater mean by 1.29% vs n = 2 and 1.23% vs n = 3. F-scores ranged from 0.93 to 0.991, with the 3 n-gram means again showing greater
(Fig. 10: classification accuracy (%) for each n-gram size (n = 1, 2, 3) across run-lengths 1k–64k.)
With the large quantity of features, attempts were made to reduce the number of attributes while maintaining the detection capability of the model. Furthermore, with such a large feature set, it would not be computationally possible to employ feature extraction algorithms within a feasible time period. As such, a feature selection strategy was employed in the first instance.
The Gain Ratio algorithm was employed to reduce the number of features in
the datasets. Gain Ratio attempts to address the biases which can be introduced
by Information Gain, when considering attributes with a large range of distinct
values, by considering the number of branches that would result due to a split.
In the ratio, the numerator represents the information yielded about the label and
the denominator represents the information learned about the feature. The ranker
search method was used to traverse the feature set. With respect to the current
problem, ranker is advantageous as it is linear and does not take more steps than
the number of features. With the breadth of the present datasets, this provides a
saving of computational overhead.
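A compact sketch of that ratio for a discrete feature follows; this is our own illustrative implementation of the standard C4.5-style Gain Ratio described above, not the toolkit code used in the chapter:

```python
# Sketch of the Gain Ratio computation: information gain (about the
# label) divided by split information (about the feature).
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels):
    h_labels = entropy(labels)                    # information about the label
    cond = 0.0
    vals, counts = np.unique(feature, return_counts=True)
    for v, c in zip(vals, counts):
        cond += (c / len(feature)) * entropy(labels[feature == v])
    info_gain = h_labels - cond                   # numerator
    split_info = entropy(feature)                 # denominator: feature information
    return info_gain / split_info if split_info > 0 else 0.0

f = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 1, 1, 1, 0])
print(gain_ratio(f, y))
```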
Table 2 shows the quantity of features in each cohort and run-length at each step. The number of features rises for n = 2 and n = 3, and with each increase in run-length, as the potential number of observed opcodes rises.
Four levels of feature selection using Gain Ratio were investigated: no feature selection, Gain Ratio merit score >= 0.01 (i.e. non-zero), top 20 features, and top 50 features. A further category (merit score >= 10) was originally investigated; however, this resulted in a model equivalent to random choice (i.e. all measures scored 50%). The same RF classifier was then presented with the reduced datasets for classification.
Removing features with a merit score of <0.01 had a negligible impact on the
accuracy rating (0.013–0.018%) compared to the full feature set, but removed up to
84% of features.
Selecting the top n features was subsequently investigated, as the removal of zero-merit features yields a feature count unique to each dataset, whereas top n provides uniformity. The n = 1 dataset proved more resilient across top 20 and top 50 features than the n = 2 and n = 3 cohorts. This may be due to the difference in quantity of features from no reduction to top n, e.g. n = 3 at 64k was reduced from 100,316 to 20 features in this approach. Figure 11 shows accuracy for n = 1.
With any malware detection model using machine learning, consideration should
be given to the false positive rates (FP) (i.e. detected incorrectly as a specific class).
This is particularly relevant with imbalanced data. Table 3 lists the false positive
ratings for both the benign and malicious classes, per run-length and n-gram size.
Table 3 False positive rates for each run-length for n = 1..3 using all features

             n = 1                       n = 2                       n = 3
        Benign  Malicious  Mean    Benign  Malicious  Mean    Benign  Malicious  Mean
1k      0.042   0.081      0.061   0.029   0.111      0.070   0.031   0.107      0.069
2k      0.040   0.048      0.044   0.036   0.079      0.057   0.040   0.075      0.058
4k      0.046   0.032      0.039   0.046   0.068      0.057   0.044   0.071      0.057
8k      0.022   0.028      0.025   0.023   0.059      0.041   0.024   0.059      0.041
16k     0.215   0.027      0.121   0.011   0.044      0.027   0.020   0.052      0.036
32k     0.008   0.011      0.009   0.008   0.047      0.027   0.010   0.045      0.027
64k     0.013   0.011      0.012   0.022   0.060      0.041   0.011   0.044      0.028
Mean    0.055   0.034      0.044   0.025   0.067      0.046   0.026   0.065      0.045
The ‘Benign’ columns quantify the rate at which the model labels a sample incorrectly as benign, and vice versa for the ‘Malicious’ columns. Overall FP rates are low, though a range of values is evident, consistent with the other machine learning measures. Again, the n = 1 data performs slightly better than the other n-gram sizes (mean = 0.044 vs means of 0.046 and 0.045 respectively). The 32k dataset shows the lowest false positive rates (benign 0.008 and malicious 0.011, giving a mean of <1%). This indicates the strength of the model for accurate detection while controlling for false positives. Furthermore, the sampling approach taken to balance the training data multiple times appears to have reduced the risk of the increased false positives associated with unbalanced datasets (Fig. 12).
(Fig. 12: mean accuracies for each run-length across FS approaches for all n-gram lengths.)

The further metrics for the model are presented in Table 4. As run-length increases, model performance increases for the No FS and merit > 0 cohorts, with the exception of the anomalous 16k set. The n = 2 and n = 3 datasets decline in performance as run-length increases when a set number of features is used (Top 20 and Top 50). In particular, the AU-ROC scores show a decline as run-length increases for n = 2 using 20 features. An intuitive explanation would be the ratio of original to selected feature numbers. However, of the three n-gram sizes, n = 2 had the lowest average percentage loss of features after the GR > 0 feature selection (n = 1: 80.19%, n = 2: 51.81%, n = 3: 80.67%). It had approximately 10 times fewer features than n = 3 in the original cohort, and so the reduction to a fixed 20 or 50 should have had more of an impact if feature ratio were the important factor. As
AU-ROC measures discrimination, it would appear that the information for correctly
discriminating between benign and malicious code is distributed differently when
bi-grams are inspected.
5 Conclusions
With accuracy of 99.05% across such a large dataset, and using tenfold cross-
validation, there is a clear indication that dynamic opcode analysis can be used
to detect malicious software in a practical application. While the time taken by
dynamic analysis can be a disadvantage, the work presented here shows that a trace with as few as 1,000 opcodes can correctly discriminate malware from benignware with 93.8% success. Increasing the run-trace to the first 32,000 opcodes increases the
accuracy to over 99%. In terms of execution time, this run-length would not be
dissimilar to static tools such as IDA Pro. Further reducing the feature set to the top
50 attributes and using a unigram representation upholds the classification accuracy
>99%, while reducing the parsing, processing, and classification overheads. In terms
of n-gram analysis, there is minimal gain when increasing between n D 2 and
n D 3. Furthermore, n>1 offers no increase in detection accuracy when runtime is
controlled for. Considering the computational overhead and feature explosion with
increasing levels of n, focus should be maintained on a unigram investigation.
The results presented concur with [32], who found a 32k run-length to provide the highest accuracy. However, the accuracy of that model peaked at 86.31%, with 13 features included. Furthermore, the present study employs a dataset
approximately 80 times the size of [32]. In the context of previous similar research,
the accuracy of the present model is superior while the dataset is significantly larger
and more representative of malware.
References
1. 2014 cost of data breach study. Tech. rep., Ponemon Inst, IBM (2014). URL http://public.dhe.
ibm.com/common/ssi/ecm/se/en/sel03027usen/SEL03027USEN.PDF
2. VX Heaven (2014). URL http://vxheaven.org/vl.php
3. Oracle VM VirtualBox programming guide and reference. Tech. rep., Oracle Corp. (2015). URL http://download.virtualbox.org/virtualbox/SDKRef.pdf
4. McAfee Labs threats report, September 2016. Tech. rep., McAfee Labs (2016). URL http://www.
5. Alazab, M., Venkatraman, S., Watters, P., Alazab, M., Alazab, A.: Cybercrime: The case of
obfuscated malware, pp. 204–211. Global Security, Safety and Sustainability & e-Democracy.
Springer (2012)
6. Anderson, B., Quist, D., Neil, J., Storlie, C., Lane, T.: Graph-based malware detection using
dynamic analysis. Journal in Computer Virology 7(4), 247–258 (2011)
7. Bilar, D.: Opcodes as predictor for malware. International Journal of Electronic Security and
Digital Forensics 1(2), 156–168 (2007)
8. Bontchev, V., Skulason, F., Solomon, A.: CARO virus naming convention (1991)
9. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
10. Chawla, N., Japkowicz, N., Kolcz, A.: Special issue on class imbalances. SIGKDD Explo-
rations 6(1), 1–6 (2004)
11. Christodorescu, M., Jha, S.: Static analysis of executables to detect malicious patterns. In:
Proceedings of the 12th Usenix Security Symposium, pp. 169–185. USENIX Association
(2006)
12. Dinaburg, A., Royal, P., Sharif, M., Lee, W.: Ether: malware analysis via hardware vir-
tualization extensions. In: Proceedings of the 15th ACM conference on Computer and
communications security, pp. 51–62. ACM (2008)
13. Egele, M., Scholte, T., Kirda, E., Kruegel, C.: A survey on automated dynamic malware-
analysis techniques and tools. ACM Computing Surveys 44(2), 6–6:42 (2012). DOI
35. Runwal, N., Low, R.M., Stamp, M.: Opcode graph similarity and metamorphic detection.
Journal in Computer Virology 8(1-2), 37–52 (2012)
36. Santos, I., Brezo, F., Sanz, B., Laorden, C., Bringas, P.G.: Using opcode sequences in single-
class learning to detect unknown malware. IET information security 5(4), 220–227 (2011)
37. Santos, I., Sanz, B., Laorden, C., Brezo, F., Bringas, P.G.: Opcode-sequence-based semi-
supervised unknown malware detection, pp. 50–57. Computational Intelligence in Security
for Information Systems. Springer (2011)
38. Schultz, M.G., Eskin, E., Zadok, E., Stolfo, S.J.: Data mining methods for detection of new
malicious executables. In: Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE
Symposium on, pp. 38–49. IEEE (2001)
39. Shabtai, A., Moskovitch, R., Feher, C., Dolev, S., Elovici, Y.: Detecting unknown malicious
code by applying classification techniques on opcode patterns. Security Informatics 1(1), 1–22
(2012)
40. Sikorski, M., Honig, A.: Practical Malware Analysis: The Hands-On Guide to Dissecting
Malicious Software. No Starch Press (2012)
41. Thunga, S.P., Neelisetti, R.K.: Identifying metamorphic virus using n-grams and hidden
markov model. In: Advances in Computing, Communications and Informatics (ICACCI), 2015
International Conference on, pp. 2016–2022. IEEE (2015)
42. Veluz, D.: Stuxnet malware targets scada systems (2010). https://www.trendmicro.com/vinfo/
us/threat-encyclopedia/web-attack/54/stuxnet-malware-targets-scada-systems
43. Vemparala, S., Troia, F.D., Corrado, V.A., Austin, T.H., Stamp, M.: Malware detection using
dynamic birthmarks. In: IWSPA 2016 - Proceedings of the 2016 ACM International Workshop
on Security and Privacy Analytics, co-located with CODASPY 2016, pp. 41–46 (2016). DOI
10.1145/2875475.2875476
Big Data Analytics for Intrusion Detection
System: Statistical Decision-Making Using
Finite Dirichlet Mixture Models
1 Introduction
In the cyber security field, an Intrusion Detection System (IDS) is essential for
achieving a solid line of defence against cyber intrusions. The digital world has
become the principal complement to the physical world because of the widespread
usage of computer networks and prevalence of programs and services which easily
accomplish users’ tasks in a short time at a low cost. A system is considered secure if
the three principles of computer security, Confidentiality, Integrity and Availability
(CIA), are successfully satisfied [38, 43]. Hackers constantly endeavour to violate these principles, with each attack type employing its own sophisticated methods and posing serious threats to computer networks.
An Anomaly-based Detection System (ADS), a specific methodology of IDS
discussed in Sect. 2, still faces two problems for implementation in large-scale
industrial applications [2, 7, 39], such as cloud computing [30] and SCADA
systems [16]. Firstly, and most importantly, the construction of a profile from
various legitimate patterns is extremely difficult because of the frequent changes
of the normal data [2, 34, 47]. Secondly, the process of building a scalable, adaptive
and lightweight detection method is an arduous task with the high speeds and large
sizes of current networks [39, 47].
ADS methodologies have been developed using approaches involving data
mining and machine learning, artificial intelligence, knowledge-based and statistical
models [7, 34, 39]. Nevertheless, these proposed techniques have usually produced high False Positive Rates (FPRs) because of the difficulty of designing a solution that addresses the above problems. Recent research studies [27, 34, 41, 44, 47] have focused on statistical models due to the ease of concurrently modelling and determining the potential properties of both normal and abnormal patterns of behaviour. Discovering these properties and characterising a suitable threshold that enables a detection method to correctly detect attacks requires accurate analysis.
Both network and host systems have multiple devices, software, sensors, plat-
forms and other sources connected together to deliver services to users and organ-
isations anytime and anywhere. In addition, these systems serve the demands of such organisations by using Big Data analytical techniques and tools to provide decision support for distinguishing between normal and anomalous instances. For these reasons, the capture and processing of these data are increasing dramatically in terms of 'volume', 'velocity' and 'variety', which together are referred to as the 'Big Data' phenomenon [53]. The Big Data paradigm poses a continuous challenge in the use of network or host data sources for the design of an effective and scalable ADS.
Monitoring and analysing network traffic have attained growing importance for
several reasons. Firstly, they increase visibility to the user, system and application
traffic by gathering and analysing network flow records which also helps to track
the bandwidth consumption of users and systems to ensure robust service delivery.
Secondly, they identify performance bottlenecks and minimise non-business bandwidth consumption. Thirdly, there are advantages related to IDS technology, namely tracking network traffic using protocol analysis to recognise potential attack profiles, such as UDP spikes. Finally, they monitor network traffic, peer-to-peer protocols and URLs for a specific device or network to determine suspicious activities and unauthorised access [5].
In the literature, when data do not fit a normal distribution, outliers/anomalies are better fitted and detected using mixture models, especially the Gaussian Mixture Model (GMM), Beta Mixture Model (BMM) or Dirichlet Mixture Model (DMM) [15, 17, 34, 50]. According to [17, 18, 21], the DMM can fit and define
the boundaries of data better than other mixture models because it consists of a set
of probability distributions. Moreover, the DMM is more suitable for modelling
streaming data, for example, data originating from videos, images, or network
traffic. The mathematical characteristics of the DMM also permit the representation
of samples in a transformed space in which features are independent and identically
distributed (i.i.d.). In the case of high dimensionality, the DMM for clustering data
provides higher accuracy than other mixture models [9]. Therefore, we use this model to properly fit network data, applying the lower-upper Interquartile Range (IQR) [40] as a threshold and treating any observation outside it as an anomaly.
In this chapter, we propose a scalable framework for building an effective and
lightweight ADS that can efficiently identify suspicious patterns over network
systems. The framework consists of a capturing and logging module to sniff and
record data, a pre-processing module to analyse and filter these data and the
proposed ADS statistical decision engine, based on the DMM, for recognising
abnormal behaviours in network systems. The DMM-based engine is a statistical anomaly detection technique that computes the density of Dirichlet distributions for the normal profile (i.e., in the training phase) and for the testing phase (using the parameters estimated from the training phase). The decision-making method for identifying known and new anomalies specifies a threshold of the lower-upper IQR for the normal profile and considers any deviation from it an attack.
The performance of this framework is evaluated on two well-known datasets: the NSL-KDD,1 which is an improved version of the KDD99 and the most popular dataset used for evaluating IDSs [48], and our UNSW-NB15,2 which involves a wide
variety of contemporary normal, security and malware events [34]. The Dirichlet
mixture model based anomaly detection technique is compared with three recent
techniques, namely the Triangle Area Nearest Neighbours (TANN) [49], Euclidean
Distance Map (EDM) [46] and Multivariate Correlation Analysis (MCA) [47].
These techniques were developed based on computing distances and correlations
between legitimate and malicious vectors, which cannot often find a clear difference
between these vectors, especially with modern attack styles that mimic normal ones.
1 The NSL-KDD dataset, https://web.archive.org/web/20150205070216/http://nsl.cs.unb.ca/NSL-KDD/, November 2016.
2 The UNSW-NB15 dataset, https://www.unsw.adfa.edu.au/australian-centre-for-cyber-security/cybersecurity/ADFA-NB15-Datasets/, November 2016.
In contrast, our technique combines the Dirichlet mixture model with the accurate boundaries of the interquartile range, which properly capture the small differences between these vectors and considerably improve detection accuracy.
The key contributions of this chapter are as follows.
1. We propose a new scalable framework for an anomaly detection system on large-scale networks. Within this framework, we also develop a novel decision engine based on the Dirichlet Mixture Model and the lower-upper Interquartile Range to efficiently detect malicious events.
2. We describe how statistical analysis can assess the normality of network data in order to choose the proper model that correctly fits the data, and how to build an intelligent decision-making method that discriminates between normal and abnormal observations.
3. A performance evaluation of this framework is conducted on two benchmark
datasets: the NSL-KDD which is the most common dataset and UNSW-NB15
which is the latest dataset used for assessing IDSs, as well as comparing this
technique with three existing techniques to assess its reliability for detecting
intrusions.
The rest of this chapter is organised as follows. The background on Intrusion and Anomaly Detection Systems is presented in Sect. 2. Section 3 describes related
work on decision engine approaches. The DMM-based technique is explained in
Sect. 4 and the proposed scalable framework discussed in Sect. 5. The experimental
results and discussions are provided in Sect. 6. Finally, concluding remarks are
presented in Sect. 7.
An ADS establishes a normal profile from host or network data and discovers any variation from it as an attack. Since it can detect existing and new attacks and does not require any effort to generate rules, an ADS has become a better solution than an MDS or SPA [13, 34, 38, 43]. However, it still has some challenges, as explained in Sect. 2.2, which we will try to mitigate.
An IDS’s deployment architecture is classified as either distributed or centralised.
A distributed IDS is a compound system involving several intrusion detection sub-
systems installed at different locations and connected in order to transfer relevant
information. In contrast, a centralised IDS is a non-compound system deployed
at only one location, with its architecture dependent on an organisation’s size and
sensitivity of its data which should be considered in terms of its deployment [28].
With the new era of the Internet of Things (IoT), which is the networked inter-
connections of everyday objects often associated with their ubiquitous use, many
applications and systems need to be protected against intrusive activities. As cloud
computing environments and Supervisory Control and Data Acquisition (SCADA)
systems are currently fully dependent on the Internet, they require an adaptable
and scalable ADS for identifying the malicious events they frequently face. Cloud
computing is a “network of networks” based on Internet services in which virtual
shared servers provide the software, platform, infrastructure and other resources
[3]. It consists of the three service models Software as a Service (SaaS), Platform
as a Service (PaaS) and Infrastructure as a Service (IaaS) [30]. To detect attacks, an
IDS can be installed on a virtual server as a host-based IDS or deployed across the
network as a network-based IDS or, by configuring both, provide a better defence.
SCADA systems monitor and control industrial and critical infrastructure func-
tions, for example, water, electricity, railway, gas and traffic [16]. Like the cloud
computing environment, with the rapid increase in the Internet and interconnected
networks, these systems face complicated attacks, such as DoS and DDoS, which
highlight the need for stronger SCADA security. However, designing the archi-
tecture and deployment of an adaptive and scalable ADS for these environments
has become a big challenge because the high speeds and large sizes of existing networks generate a massive number of packets at any given time, all of which should be inspected simultaneously in order to identify malicious activities.
Many researchers have investigated decision engine approaches, which can be cate-
gorised into five types: classification-based approaches [7, 10, 13]; clustering-based
approaches [2, 7, 13, 17]; knowledge-based approaches [7, 10, 13]; combination-
based approaches [7, 11, 13, 52]; and statistical-based approaches [7, 13, 34, 47], as
illustrated in Table 1. Firstly, classification is a way of categorising data observations into particular classes using a training set, with a testing set containing other instances for validating these classes; for instance, Horng et al. [23] proposed a Network-based
ADS which included a hierarchical clustering and support vector machine to reduce
the training time and improve detection accuracy. Ambusaidi et al. [4] developed a
least-square support vector machine for the design of a lightweight Network-based
ADS by selecting the significant features of network data and detecting anomalies.
Recently, Dubey et al. [14] developed a Network-based ADS based on a combination of an artificial neural network, k-means and Naïve Bayes to improve the detection of malicious activities. Overall, however, classification-based IDSs suffer from the need to adjust each classifier separately and always consume more resources than statistical techniques.
Secondly, clustering involves unsupervised machine-learning algorithms which
allocate a set of data points to groups based on the similar characteristics of the
points, for example, distance or probability measures. Nadiammai et al. [35] anal-
ysed and evaluated k-means, hierarchical and fuzzy c-means clustering techniques
for building a Network-based ADS and reported that the complexity and detection
accuracy of the fuzzy c-means algorithm were better than those of the others.
Jadhav et al. [25] proposed a Network-based ADS based on clustering network
packets and developed a new data pre-processing function using the fuzzy logic
technique for classifying the severity of attacks in network traffic data. Zainaddin
et al. [52] proposed a hybrid of fuzzy clustering and an artificial neural network
to construct a Network-based ADS which efficiently detected malicious events.
Clustering-based ADS techniques have several advantages. Firstly, they group data
points in an unsupervised manner, which means that they do not need to provide class labels for observations.

Table 1 (fragment) Decision engine approaches
Knowledge-based (e.g., Naldurg et al. [36], Hung et al. [24]). Advantages: discriminate existing attacks; provide a higher detection rate. Disadvantages: take too much time during processing; use static rules for defining malicious patterns.
Combination-based (e.g., Perdisci et al. [37], Aburomman et al. [1], Shifflet [45]). Advantages: achieve higher accuracy and detection; demand only a set of controlling parameters to be adjusted. Disadvantages: need a huge effort to integrate some techniques; take a longer processing time than other techniques.

Secondly, they are effective for grouping large datasets
into similar groups to detect network anomalies. On the other hand, clustering depends heavily on the efficacy of constructing a normal profile, which is difficult to update automatically.
Thirdly, knowledge-based methods establish a set of patterns from input data
to classify data points with respect to class labels, with the common knowledge-based ADSs being rule-based systems, expert systems and ontologies. Naldurg et al. [36] suggested a framework for intrusion detection using temporal logic specifications with intrusion
This section describes the mathematical aspects of estimating and modelling data
using the DMM, and discusses the proposed methodology for using this model to
build an effective ADS.
A finite mixture of $K$ Dirichlet distributions is defined as

$$p(X \mid \pi, \alpha) = \sum_{i=1}^{K} \pi_i \,\mathrm{Dir}(X \mid \alpha_i) \qquad (1)$$

where $\pi = (\pi_1, \ldots, \pi_K)$ refers to the mixing coefficients, which are positive with their summation equal to one, $\sum_{i=1}^{K} \pi_i = 1$; $\alpha = (\alpha_1, \ldots, \alpha_K)$; and $\mathrm{Dir}(X \mid \alpha_i)$ indicates the Dirichlet distribution of component $i$ with its own positive parameters $\alpha_i = (\alpha_{i1}, \ldots, \alpha_{iS})$, given as

$$\mathrm{Dir}(X \mid \alpha_i) = \frac{\Gamma\!\left(\sum_{s=1}^{S} \alpha_{is}\right)}{\prod_{s=1}^{S} \Gamma(\alpha_{is})} \prod_{s=1}^{S} X_s^{\alpha_{is}-1} \qquad (2)$$

where $X = (X_1, \ldots, X_S)$, $S$ is the dimensionality of $X$, $\sum_{s=1}^{S} X_s = 1$ and $0 \le X_s \le 1$ for $s = 1, \ldots, S$. It is worth noting that the Dirichlet distribution is used as a parent distribution to model the data directly rather than as a prior to the multinomial.

Considering a set of $N$ independent and identically distributed (i.i.d.) vectors $X = \{X_1, \ldots, X_N\}$ assumed to be produced from the mixture distribution in Eq. (1), the likelihood function of the DMM is

$$p(X \mid \pi, \alpha) = \prod_{l=1}^{N} \left\{ \sum_{i=1}^{K} \pi_i \,\mathrm{Dir}(X_l \mid \alpha_i) \right\} \qquad (3)$$

The finite mixture model in Eq. (1) is considered a latent variable model. Therefore, for each vector $X_l$, we introduce a $K$-dimensional binary random vector $Z_l = \{Z_{l1}, \ldots, Z_{lK}\}$, where $Z_{li} \in \{0, 1\}$, $\sum_{i=1}^{K} Z_{li} = 1$ and $Z_{li} = 1$ if $X_l$ belongs to component $i$, otherwise $0$. For the latent variables $Z = \{Z_1, \ldots, Z_N\}$, which are hidden ones that do not appear explicitly in the model, the conditional distribution of $Z$ given the mixing coefficients $\pi$ is defined as

$$p(Z \mid \pi) = \prod_{l=1}^{N} \prod_{i=1}^{K} \pi_i^{Z_{li}} \qquad (4)$$

Then, the likelihood function with the latent variables, which is actually the conditional distribution of the dataset $X$ given the class labels $Z$, can be written as

$$p(X \mid Z, \alpha) = \prod_{l=1}^{N} \prod_{i=1}^{K} \mathrm{Dir}(X_l \mid \alpha_i)^{Z_{li}} \qquad (5)$$
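The chapter's implementation uses R; as a language-neutral illustration, the following is a minimal Python sketch of evaluating the mixture density of Eqs. (1)-(2) with SciPy. The mixing weights and component parameters below are illustrative, not the values estimated by the authors.

```python
import numpy as np
from scipy.stats import dirichlet

def dmm_pdf(x, weights, alphas):
    """Evaluate the finite Dirichlet mixture density of Eq. (1):
    p(x) = sum_i pi_i * Dir(x | alpha_i), for one vector x on the simplex."""
    return sum(w * dirichlet.pdf(x, a) for w, a in zip(weights, alphas))

# Illustrative two-component mixture over 3-dimensional vectors (entries sum to 1).
weights = [0.6, 0.4]                    # mixing coefficients pi_i, summing to one
alphas = [np.array([2.0, 5.0, 3.0]),    # hypothetical component parameters alpha_i
          np.array([7.0, 1.5, 1.5])]
x = np.array([0.2, 0.5, 0.3])           # a normalised feature vector
print(dmm_pdf(x, weights, alphas))
```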
In the testing phase, the Dirichlet PDF ($PDF_{testing}$) of each observed record ($r_{testing}$) is computed using the same parameters estimated for the normal profile ($\pi_i$, $\alpha_i$, $Z_i$, lower, upper, IQR). Algorithm 2 lists the steps of the testing phase and the decision-making method for identifying the Dirichlet PDFs of attack records, with step 1 constructing the PDF of each observed record using the stored normal parameters ($\pi_i$, $\alpha_i$, $Z_i$).
Steps 2 to 6 define the decision-making process. The IQR of the normal instances is computed to find the outliers/anomalies among the observed instances ($r_{testing}$) in the testing phase, which are considered to be any observations falling below $(lower - w \cdot IQR)$ or above $(upper + w \cdot IQR)$, where $w$ is an interval value between 1.5 and 3 [40]. The detection decision treats any $PDF_{testing}$ falling outside this range as an attack record, and otherwise as normal.
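A hedged sketch of this decision rule, assuming the per-record Dirichlet PDFs of the normal profile are already available and reading "lower" and "upper" as the first and third quartiles; the data below are synthetic.

```python
import numpy as np

def iqr_decision(pdf_normal, pdf_testing, w=1.5):
    """Flag testing records whose Dirichlet PDF falls outside the
    lower-upper IQR range of the normal profile (w between 1.5 and 3)."""
    lower, upper = np.percentile(pdf_normal, [25, 75])  # quartiles of the normal profile
    iqr = upper - lower
    lo, hi = lower - w * iqr, upper + w * iqr
    # True marks an anomaly (attack record), False a normal record.
    return (pdf_testing < lo) | (pdf_testing > hi)

normal_pdfs = np.random.default_rng(0).normal(0.12, 0.02, 1000)  # toy normal-profile PDFs
test_pdfs = np.array([0.11, 0.02, 0.45])
print(iqr_decision(normal_pdfs, test_pdfs, w=2))  # [False  True  True]
```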
(Figure: the proposed framework comprising capturing and logging, pre-processing, and the decision-making method, in which the training phase generates the normal profile using the density of the DMM and its parameters with the Interquartile Range (IQR), and the testing phase evaluates each instance using the estimated parameters of the normal profile and the IQR threshold to label it normal or abnormal)

Mapping examples: Protocols TCP, UDP, ICMP to 1, 2, 3; Services HTTP, FTP, SMTP to 1, 2, 3; States INT, FIN, CON to 1, 2, 3
Fig. 4 Example of converting categorical features into numerical features using the UNSW-NB15 dataset
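A small sketch of the feature-conversion step illustrated in Fig. 4; the specific integer codes and field names are illustrative.

```python
# Map each categorical value to an integer code, as in Fig. 4.
PROTOCOL_MAP = {"TCP": 1, "UDP": 2, "ICMP": 3}
SERVICE_MAP = {"HTTP": 1, "FTP": 2, "SMTP": 3}
STATE_MAP = {"INT": 1, "FIN": 2, "CON": 3}

def convert_record(record):
    """Replace the symbolic features of a flow record with numeric codes;
    a previously unseen category receives the next free code."""
    for field, mapping in (("proto", PROTOCOL_MAP),
                           ("service", SERVICE_MAP),
                           ("state", STATE_MAP)):
        record[field] = mapping.setdefault(record[field], len(mapping) + 1)
    return record

print(convert_record({"proto": "TCP", "service": "SMTP", "state": "CON"}))
# {'proto': 1, 'service': 3, 'state': 3}
```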
This module sniffs network data and stores them to be processed by the decision engine, following the steps used to create our UNSW-NB15 dataset [33, 34]. An IXIA PerfectStorm tool,3 which can emulate a wide range of network segments and generate traffic for several web applications, such as Facebook, Skype, YouTube and Google, was used to mimic recent realistic normal and abnormal network traffic, as shown in Fig. 5. It could also simulate the majority of security events and malicious scripts, which is difficult to achieve using other tools. The configuration of the UNSW-NB15 testbed was used to simulate a large-scale network, with a Tcpdump tool sniffing packets from the network's interface while the Bro and Argus tools and other scripts extracted a set of features from the network flows.

3 The IXIA Tools, https://www.ixiacom.com/products/perfectstorm, November 2016.
In [33, 34], the UNSW-NB15 was created, which comprises a wide variety of
features. These features can be classified into packet-based and flow-based. The
packet based features help in examining the packet payload and headers while the
flow based features mine information from the packet headers, such as a packet
direction, an inter-arrival time of packets, the number of source/destination IPs for a
particular time window, and an inter-packet length. As depicted in Fig. 6, the pcap files4 of this dataset were processed by the BRO-IDS and Argus tools to mine the basic features. Then, we developed a new aggregator module to correlate its flows. These flows were aggregated for each 100 connections, where packets with the same source/destination IP addresses and ports, timestamp, and protocol were collected into a flow record [31, 32]. This module enables the establishment of monitoring applications for analysing network characteristics such as capacity, bandwidth, and rare and normal events.
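A rough sketch of the aggregation just described, assuming packets arrive as dictionaries with hypothetical field names; batching by packet index stands in for the per-100-connections windows used by the authors.

```python
from collections import defaultdict

def aggregate_flows(packets, batch_size=100):
    """Group packets sharing the flow identifiers (src/dst IP, src/dst port,
    protocol) within consecutive batches, accumulating simple flow statistics."""
    flows = defaultdict(lambda: {"pkts": 0, "bytes": 0})
    for i, pkt in enumerate(packets):
        key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"],
               pkt["dst_port"], pkt["proto"], i // batch_size)
        flows[key]["pkts"] += 1
        flows[key]["bytes"] += pkt["length"]
    return dict(flows)
```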
4 Pcap refers to packet capture, which provides an Application Programming Interface (API) for saving network data. UNIX operating systems implement the pcap format using the libpcap library, while Windows operating systems utilise a port of libpcap called WinPcap.
(Fig. 6 The feature extraction process: pcap files are processed to extract basic network attributes using the BRO-IDS and Argus tools; an aggregator module generates new attributes from the flow identifiers (source/destination IPs and ports, timestamps, and protocols) for a specific time interval and stores them in a MySQL database)
These features were recorded using the MySQL Cluster CGE technology,5 a highly scalable real-time database that supports read- and write-intensive workloads through a distributed architecture and is accessed via SQL or NoSQL APIs, as depicted in Fig. 7. It also supports memory-optimised and disk-based tables and automatic data partitioning with load balancing, and nodes can be added to a running cluster for handling online big data. Although this technology has an architecture similar to the Hadoop tools,6 which are the most popular for processing big data offline, an ADS has to detect malicious behaviours in real time, which motivated this choice. These features are then passed to the pre-processing module to be analysed and filtered.
5 The MySQL Cluster CGE technology, https://www.mysql.com/products/cluster/, November 2016.
6 The Hadoop technologies, http://hadoop.apache.org/, November 2016.
The pre-processing module determines and filters network data in four steps. Firstly, feature conversion replaces symbolic features with numeric ones because our DMM-based ADS technique handles only numeric attributes. Secondly, feature reduction uses the PCA technique to retain a small number of uncorrelated features. PCA is one of the best-known linear feature-reduction techniques: it requires less memory storage and has lower data-transfer and processing times, as well as better detection accuracy than alternatives [22, 26], so we chose it for this study.
Thirdly, feature normalisation maps the value of each feature into a specific interval to eliminate any bias from the raw data and to ease its visualisation and processing. We applied the z-score function, which scales each feature $x$ to a mean ($\mu$) of 0 and a standard deviation ($\delta$) of 1, as shown in Fig. 8, normalising the data using the formula

$$z = \frac{x - \mu}{\delta} \qquad (6)$$
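A compact sketch of the reduction and normalisation steps, using scikit-learn's PCA and the z-score of Eq. (6); keeping eight components follows the feature count reported later in the chapter, while the toy data are random.

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess(X, n_components=8):
    """Project onto the top principal components, then z-score each
    resulting feature as in Eq. (6): z = (x - mu) / delta."""
    reduced = PCA(n_components=n_components).fit_transform(X)
    mu, delta = reduced.mean(axis=0), reduced.std(axis=0)
    return (reduced - mu) / delta

X = np.random.default_rng(1).random((500, 40))  # toy feature matrix
print(preprocess(X).shape)                      # (500, 8)
```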
Another essential statistical measure is the normality test which is a way
of assessing whether particular data follow a normal distribution. We used the
Kolmogorov-Smirnov (K-S) test, which is one of the most popular, in our previous
work [34]. In it, if the data do not follow a normal distribution, mixture models,
such as the GMM, BMM and DMM, are used to efficiently define outliers. In
this chapter, we use Q-Q plots to show that the network data do not follow a
Gaussian distribution. A Q-Q plot is a graphical tool that draws two sets of quantiles against each other: if the sets come from the same distribution, the points form an almost straight line, and points lying far from it are treated as outliers [19]. It therefore helps to track network flows and to decide which decision engine (DE) model is best for identifying suspicious activities as outlier points, as shown in the results in Sect. 6.4. Overall, statistical analysis of network data is important for making decisions about detecting and preventing malicious events.
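A brief sketch of both normality checks mentioned here, using SciPy's Kolmogorov-Smirnov test and probability-plot helper on skewed toy data.

```python
import numpy as np
from scipy import stats

feature = np.random.default_rng(2).exponential(scale=2.0, size=1000)  # non-Gaussian toy data
z = (feature - feature.mean()) / feature.std()

# K-S test of the standardised sample against the normal distribution.
stat, p_value = stats.kstest(z, "norm")
print(f"K-S statistic={stat:.3f}, p={p_value:.3g}")  # a small p rejects normality

# Q-Q plot data: theoretical normal quantiles vs. ordered sample quantiles.
(theoretical_q, sample_q), (slope, intercept, r) = stats.probplot(z, dist="norm")
```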
This section discusses the datasets used for evaluating the proposed technique, and
then the evaluation metrics applied for assessing the performance of the proposed
technique compared with some peer techniques. Finally, the features selected from the NSL-KDD and UNSW-NB15 datasets, together with the statistical results for these features, are explained.
Despite the NSL-KDD and KDD CUP 99 datasets being outdated and having several problems, in particular duplicated records and an imbalance of normal and attack records [33, 47, 48], they are widely used to evaluate NIDSs due to the lack of available accurate datasets. As most state-of-the-art detection techniques have been applied to these datasets, which ultimately derive from the same network traffic, we adopted the NSL-KDD dataset and the recently released UNSW-NB15 dataset in order to provide a fair and reasonable evaluation of our proposed DMM-based ADS technique against related state-of-the-art detection approaches.
The NSL-KDD dataset is an improved version of the KDD CUP 99 dataset suggested by Tavallaee et al. [48]. It addresses some of the problems in the KDD CUP 99 dataset, for example by removing redundant records from the training and testing sets so that no classifier is biased towards the most repeated records. As in the original dataset, each NSL-KDD record has 41 features and a class label. It consists of five different classes, one normal and four attack types (i.e., DoS, Probe, U2R and R2L), and includes two sets, training ('KDDTrain+ FULL' and 'KDDTrain+ 20%') and testing ('KDDTest+ 20%' and 'KDDTest-21' with new attacks).
The UNSW-NB15 dataset has a hybrid of authentic contemporary normal and
attack records. The volume of its network packets is approximately 100 GB with
2,540,044 observations logged in four CSV files. Each observation has 47 features
and a class label, which demonstrates its variety in terms of high dimensionality. Its velocity is, on average, 5–10 MB/s between sources and destinations, which means high data-rate transmissions across the Ethernet links that closely mimic real network environments. The UNSW-NB15 dataset includes ten different classes,
one normal and nine security and malware types (i.e., Analysis, Backdoors, DoS,
Exploits, Generic, Fuzzers for anomalous behaviours, Reconnaissance, Shellcode
and Worms) [33, 34].
Several experiments were conducted on the two datasets to measure the performance
and effectiveness of the proposed DMM-based ADS technique using external
evaluation metrics, including the accuracy, DR and FPR which depend on the four
terms true positive (TP), true negative (TN), false negative (FN) and false positive
(FP). TP is the number of actual attack records classified as attacks, TN is the
number of actual normal records classified as normal, FN is the number of actual
attack records classified as normal and FP is the number of actual normal records
classified as attacks. These metrics are defined as follows.
• The accuracy is the percentage of all normal and attack records correctly classified, that is,

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7)$$

• The Detection Rate (DR) is the percentage of correctly detected attack records, that is,

$$\mathrm{DR} = \frac{TP}{TP + FN} \qquad (8)$$

• The False Positive Rate (FPR) is the percentage of normal records incorrectly detected as attacks, that is,

$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (9)$$
The DMM-based ADS technique was evaluated using the eight features from the NSL-KDD and UNSW-NB15 datasets selected via the PCA, as listed in Table 2.
The proposed DMM-ADS technique was developed using the ‘R language’ on
Linux Ubuntu 14.04 with 16 GB RAM and an i7 CPU processor. To conduct the
experiments on each dataset, we selected random samples from the ‘full’ NSL-KDD
dataset and the CSV files of the UNSW-NB15 dataset with various sample sizes
between 80,000 and 200,000. In each sample, normal instances were approximately
60–70% of the total size, with some used to create the normal profile and the rest
for the testing set.
Statistical analysis supports the decision about which type of model efficiently fits the data so that outliers can be recognised as attacks. As previously mentioned, the Q-Q plot is a graphical tool for checking whether a set of data comes from a normal theoretical distribution: features are considered normally distributed if their values fall on the same theoretical distribution line.
(Fig. 9 Normal Q-Q plots of the selected features, with sample quantiles plotted against theoretical quantiles from the normal distribution)
Figure 9 shows that the
selected features do not fall on the theoretical distribution lines (i.e., the red ones) and show much greater variation than those lines. We therefore chose the DMM, one of the best-fitting non-normal distributions, to model these features and build an ADS that detects feature values lying too far from the normal profile as anomalies.
The PDFs of the DMM are estimated for normal and abnormal instances of the NSL-KDD and UNSW-NB15 datasets to demonstrate to what extent these instances vary, as presented in Fig. 10. In the NSL-KDD dataset, the PDFs of the normal instances range between 0 and 0.20, over values between -50 and 0; in contrast, the PDFs of the abnormal instances fall between 0 and 0.5, over values between -30 and 0. The PDFs of the normal instances are thus clearly different from those of the attack instances. Likewise, in the UNSW-NB15 dataset, the PDFs of the normal instances are dissimilar to those of the attack instances. These results confirm that the proposed decision-making method in Algorithm 2 can effectively detect attacks owing to the differences in their PDFs.
(Fig. 10 Estimated Dirichlet mixture PDFs (density versus values) for normal and attack instances of the NSL-KDD and UNSW-NB15 datasets)
(Figure: detection rate (%) versus false positive rate (%) for IQR interval values w = 1.5, 2, 2.5 and 3 on the two datasets)
Exploits, and Worms, are lower, due to the slight similarities between these attack instances and normal ones. As the variances of the selected features for these instances are close, their PDFs overlap, complicating the decision-making.
The performance evaluation results of the DMM-based ADS technique on the NSL-KDD dataset were compared with those of three existing techniques, namely the Triangle Area Nearest Neighbours (TANN) [49], Euclidean Distance Map (EDM) [46] and Multivariate Correlation Analysis (MCA) [47], with their overall DRs and FPRs listed in Table 6. These techniques were chosen for comparison because they are recent and employ statistical measures similar to our DMM-based ADS. The DRs of TANN, EDM and MCA were 91.1%, 94.2% and 96.2%, respectively, and their FPRs 9.4%, 7.2% and 4.9%, respectively. In contrast, the DMM-based ADS achieved better results: a 97.2% DR and a 2.4% FPR.
The key reason for the DMM-based ADS technique performing better than the
other techniques was that the DMM fits the boundaries of each feature perfectly,
because it has a set of distributions for computing the PDF of each instance. More-
over, the lower-upper IQR method could effectively specify the boundary between
normal and outlier instances. However, despite the DMM-based ADS technique
achieving the highest DR and lowest FPR on the NSL-KDD dataset, its performance
on the UNSW-NB15 dataset was relatively lower due to slight variations between
the normal and abnormal instances. This indicated the complicated patterns of
contemporary attacks that almost mimic normal patterns.
The DMM-based ADS has several advantages. To start with, it is easily deployed on large-scale systems to detect malicious activity in real time, because its training and testing phases depend only on the DMM parameters of the normal profile. Since the decision-making method uses the lower-upper IQR rule as a threshold, it can identify the class label of each record with no dependency on other records. Moreover, the normal profile parameters are easy to update with respect to choosing the best threshold. On the other hand, higher similarities between features produce a higher FPR, so we applied the PCA to reduce the number of features, selecting those with the highest variation, in order to improve the performance of the proposed technique. Also, the DMM-based ADS cannot identify attack types, such as DoS and backdoors, as it was designed for binary classification (i.e., normal or attack). To address this limitation, we will design a new statistical function to identify the PDF values of each attack type.
7 Conclusion
filter, with the aim of integrating them with the Q-Q plots to design a visual
application for analysing and monitoring network data, and making decisions
regarding specific intrusions. We will also extend this study to apply the architecture
of the proposed framework in cloud computing and SCADA systems.
References
1. Aburomman, A.A., Reaz, M.B.I.: A novel svm-knn-pso ensemble method for intrusion
detection system. Applied Soft Computing 38, 360–372 (2016)
2. Ahmed, M., Mahmood, A.N., Hu, J.: A survey of network anomaly detection techniques.
Journal of Network and Computer Applications 60, 19–31 (2016)
3. Alqahtani, S.M., Al Balushi, M., John, R.: An intelligent intrusion detection system for cloud
computing (sidscc). In: Computational Science and Computational Intelligence (CSCI), 2014
International Conference on, vol. 2, pp. 135–141. IEEE (2014)
4. Ambusaidi, M., He, X., Nanda, P., Tan, Z.: Building an intrusion detection system using a
filter-based feature selection algorithm (2016)
5. Network traffic analysis (November 2016). https://www.ipswitch.com/solutions/network-traffic-analysis
6. Berthier, R., Sanders, W.H., Khurana, H.: Intrusion detection for advanced metering infras-
tructures: Requirements and architectural directions. In: Smart Grid Communications
(SmartGridComm), 2010 First IEEE International Conference on, pp. 350–355. IEEE (2010)
7. Bhuyan, M.H., Bhattacharyya, D.K., Kalita, J.K.: Network anomaly detection: methods,
systems and tools. IEEE Communications Surveys & Tutorials 16(1), 303–336 (2014)
8. Bouguila, N., Ziou, D., Vaillancourt, J.: Unsupervised learning of a finite mixture model based
on the dirichlet distribution and its application. IEEE Transactions on Image Processing 13(11),
1533–1543 (2004)
9. Boutemedjet, S., Bouguila, N., Ziou, D.: A hybrid feature extraction selection approach for
high-dimensional non-gaussian data clustering. IEEE Transactions on Pattern Analysis and
Machine Intelligence 31(8), 1429–1443 (2009)
10. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM computing surveys
(CSUR) 41(3), 15 (2009)
11. Corona, I., Giacinto, G., Roli, F.: Adversarial attacks against intrusion detection systems:
Taxonomy, solutions and open issues. Information Sciences 239, 201–225 (2013)
12. Ding, Q., Kolaczyk, E.D.: A compressed pca subspace method for anomaly detection in high-
dimensional data. IEEE Transactions on Information Theory 59(11), 7419–7433 (2013)
13. Dua, S., Du, X.: Data mining and machine learning in cybersecurity. CRC press (2016)
14. Dubey, S., Dubey, J.: Kbb: A hybrid method for intrusion detection. In: Computer, Communi-
cation and Control (IC4), 2015 International Conference on, pp. 1–6. IEEE (2015)
15. Escobar, M.D., West, M.: Bayesian density estimation and inference using mixtures. Journal
of the american statistical association 90(430), 577–588 (1995)
16. Fahad, A., Tari, Z., Almalawi, A., Goscinski, A., Khalil, I., Mahmood, A.: Ppfscada: Privacy
preserving framework for scada data publishing. Future Generation Computer Systems 37,
496–511 (2014)
17. Fan, W., Bouguila, N., Ziou, D.: Unsupervised anomaly intrusion detection via localized
bayesian feature selection. In: 2011 IEEE 11th International Conference on Data Mining,
pp. 1032–1037. IEEE (2011)
18. Fan, W., Bouguila, N., Ziou, D.: Variational learning for finite dirichlet mixture models and
applications. IEEE transactions on neural networks and learning systems 23(5), 762–774
(2012)
19. Ghasemi, A., Zahediasl, S., et al.: Normality tests for statistical analysis: a guide for non-
statisticians. International journal of endocrinology and metabolism 10(2), 486–489 (2012)
20. Giannetsos, T., Dimitriou, T.: Spy-sense: spyware tool for executing stealthy exploits against
sensor networks. In: Proceedings of the 2nd ACM workshop on Hot topics on wireless network
security and privacy, pp. 7–12. ACM (2013)
21. Greggio, N.: Learning anomalies in idss by means of multivariate finite mixture models. In:
Advanced Information Networking and Applications (AINA), 2013 IEEE 27th International
Conference on, pp. 251–258. IEEE (2013)
22. Harrou, F., Kadri, F., Chaabane, S., Tahon, C., Sun, Y.: Improved principal component analysis
for anomaly detection: Application to an emergency department. Computers & Industrial
Engineering 88, 63–77 (2015)
23. Horng, S.J., Su, M.Y., Chen, Y.H., Kao, T.W., Chen, R.J., Lai, J.L., Perkasa, C.D.: A novel
intrusion detection system based on hierarchical clustering and support vector machines.
Expert systems with Applications 38(1), 306–313 (2011)
24. Hung, S.S., Liu, D.S.M.: A user-oriented ontology-based approach for network intrusion
detection. Computer Standards & Interfaces 30(1), 78–88 (2008)
25. Jadhav, A., Jadhav, A., Jadhav, P., Kulkarni, P.: A novel approach for the design of network
intrusion detection system (nids). In: Sensor Network Security Technology and Privacy
Communication System (SNS & PCS), 2013 International Conference on, pp. 22–27. IEEE
(2013)
26. Lee, Y.J., Yeh, Y.R., Wang, Y.C.F.: Anomaly detection via online oversampling principal com-
ponent analysis. IEEE Transactions on Knowledge and Data Engineering 25(7), 1460–1470
(2013)
27. Li, W., Mahadevan, V., Vasconcelos, N.: Anomaly detection and localization in crowded
scenes. IEEE transactions on pattern analysis and machine intelligence 36(1), 18–32 (2014)
28. Milenkoski, A., Vieira, M., Kounev, S., Avritzer, A., Payne, B.D.: Evaluating computer
intrusion detection systems: A survey of common practices. ACM Computing Surveys (CSUR)
48(1), 12 (2015)
29. Minka, T.: Estimating a dirichlet distribution (2000)
30. Modi, C., Patel, D., Borisaniya, B., Patel, H., Patel, A., Rajarajan, M.: A survey of intrusion
detection techniques in cloud. Journal of Network and Computer Applications 36(1), 42–57
(2013)
31. Moustafa, N., Slay, J.: A hybrid feature selection for network intrusion detection systems:
Central points. In: the Proceedings of the 16th Australian Information Warfare Conference,
Edith Cowan University, Joondalup Campus, Perth, Western Australia, pp. 5–13. Security
Research Institute, Edith Cowan University (2015)
32. Moustafa, N., Slay, J.: The significant features of the unsw-nb15 and the kdd99 data sets for
network intrusion detection systems. In: Building Analysis Datasets and Gathering Experience
Returns for Security (BADGERS), 2015 4th International Workshop on, pp. 25–31. IEEE
(2015)
33. Moustafa, N., Slay, J.: Unsw-nb15: a comprehensive data set for network intrusion detection
systems (unsw-nb15 network data set). In: Military Communications and Information Systems
Conference (MilCIS), 2015, pp. 1–6. IEEE (2015)
34. Moustafa, N., Slay, J.: The evaluation of network anomaly detection systems: Statistical
analysis of the unsw-nb15 data set and the comparison with the kdd99 data set. Information
Security Journal: A Global Perspective (2016)
35. Nadiammai, G., Hemalatha, M.: An evaluation of clustering technique over intrusion detection
system. In: Proceedings of the International Conference on Advances in Computing,
Communications and Informatics, pp. 1054–1060. ACM (2012)
36. Naldurg, P., Sen, K., Thati, P.: A temporal logic based framework for intrusion detection.
In: International Conference on Formal Techniques for Networked and Distributed Systems,
pp. 359–376. Springer (2004)
37. Perdisci, R., Gu, G., Lee, W.: Using an ensemble of one-class svm classifiers to harden
payload-based anomaly detection systems. In: Sixth International Conference on Data Mining
(ICDM’06), pp. 488–498. IEEE (2006)
38. Pontarelli, S., Bianchi, G., Teofili, S.: Traffic-aware design of a high-speed fpga network
intrusion detection system. IEEE Transactions on Computers 62(11), 2322–2334 (2013)
39. Ranshous, S., Shen, S., Koutra, D., Harenberg, S., Faloutsos, C., Samatova, N.F.: Anomaly
detection in dynamic networks: a survey. Wiley Interdisciplinary Reviews: Computational
Statistics 7(3), 223–247 (2015)
40. Rousseeuw, P.J., Hubert, M.: Robust statistics for outlier detection. Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery 1(1), 73–79 (2011)
41. Saligrama, V., Chen, Z.: Video anomaly detection based on local statistical aggregates. In:
Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2112–2119.
IEEE (2012)
42. Seeberg, V.E., Petrovic, S.: A new classification scheme for anonymization of real data used
in ids benchmarking. In: Availability, Reliability and Security, 2007. ARES 2007. The Second
International Conference on, pp. 385–390. IEEE (2007)
43. Shameli-Sendi, A., Cheriet, M., Hamou-Lhadj, A.: Taxonomy of intrusion risk assessment and
response system. Computers & Security 45, 1–16 (2014)
44. Sheikhan, M., Jadidi, Z.: Flow-based anomaly detection in high-speed links using modified
gsa-optimized neural network. Neural Computing and Applications 24(3–4), 599–611 (2014)
45. Shifflet, J.: A technique independent fusion model for network intrusion detection. In:
Proceedings of the Midstates Conference on Undergraduate Research in Computer Science and Mathematics, vol. 3, pp. 1–3. Citeseer (2005)
46. Tan, Z., Jamdagni, A., He, X., Nanda, P., Liu, R.P.: Denial-of-service attack detection based
on multivariate correlation analysis. In: International Conference on Neural Information
Processing, pp. 756–765. Springer (2011)
47. Tan, Z., Jamdagni, A., He, X., Nanda, P., Liu, R.P.: A system for denial-of-service attack
detection based on multivariate correlation analysis. IEEE transactions on parallel and
distributed systems 25(2), 447–456 (2014)
48. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the kdd cup 99
data set. In: Proceedings of the Second IEEE Symposium on Computational Intelligence for
Security and Defence Applications 2009 (2009)
49. Tsai, C.F., Lin, C.Y.: A triangle area based nearest neighbors approach to intrusion detection.
Pattern recognition 43(1), 222–229 (2010)
50. Wagle, B.: Multivariate beta distribution and a test for multivariate normality. Journal of the
Royal Statistical Society. Series B (Methodological) pp. 511–516 (1968)
51. Wu, S.X., Banzhaf, W.: The use of computational intelligence in intrusion detection systems:
A review. Applied Soft Computing 10(1), 1–35 (2010)
52. Zainaddin, D.A.A., Hanapi, Z.M.: Hybrid of fuzzy clustering neural network over nsl dataset
for intrusion detection system. Journal of Computer Science 9(3), 391 (2013)
53. Zuech, R., Khoshgoftaar, T.M., Wald, R.: Intrusion detection and big heterogeneous data: a
survey. Journal of Big Data 2(1), 1 (2015)
Security of Online Examinations
Yousef W. Sabbah
1 Introduction
(Fig. 1 The e-Learning environment: students and lecturers connect through the Internet to the e-Learning environment and its local and remote databases, with possible hackers on the connections between them)
educational process itself [1–8]. As depicted in Fig. 1, the Internet connects a learner
with his lecturer and content regardless of place and time. Moreover,
many academic institutions consider e-Learning a vital element of their information
systems [1].
An e-Learning platform has different names, such as Learning and Course
Management System (LCMS), Virtual Learning Environment (VLE), e-Learning
Portal, etc. For instance, Moodle is an LCMS that supports social-constructivist pedagogy with an interactive style. Interactive material includes Assignments, Choices, Journals, Lessons, Quizzes and Surveys [9]. Accordingly, we have implemented our proposed model based on the Quiz module in Moodle.
For better performance, an e-Learning platform should be integrated with a university management information system (UMIS). This integrated solution combines all relevant technologies in a single educational environment (i.e. an e-University) that provides students, instructors and faculties with all required services [10].
In this chapter, we propose an integrated e-learning solution that consists of main
and complementary components, as shown in Fig. 2. The academic portal (AP), with
single sign-on (SSO), represents the core of this solution, and the authoring tools
are used to build its Content. The middle annulus represents the main components,
which are classified, in the inner annulus, into three groups: e-Content, delivery and assessment. The outer annulus illustrates the complementary components that
represent together a University Management Information System (UMIS). All
components are interconnected and can exchange data through the AP.
At the beginning, e-Learning systems were treated as research projects that concentrated on functionalities and management tools rather than security [11]. Nowadays, these systems are operational and heavily used worldwide. In addition,
possible hackers may be located in the connections between either users and the
system or the system and remote database [1], as shown in Fig. 1. Therefore,
e-Learning systems should be provided with sufficient security to ensure confiden-
tiality, integrity, availability and high performance.
Moreover, e-Examination security occupies the highest priority in e-Learning
solutions, since this module contains the most sensitive data. In addition, an
efficient authentication method is required to make sure that the right student
(e.g. examinee) is conducting an exam throughout the exam’s period. The absence
of trusted techniques for examinees’ authentication is a vital obstacle facing e-
Learning developers [12]. This is why opponents claim that e-Learning cannot
provide a comprehensive learning environment, especially cheating-free online
exams. Our contribution is to find a solution for this problem through a novel model
for continuous authentication. This chapter introduces our proposed model in two
schemes called ISEEU and SABBAH.
The chapter consists of four sections. The current section provides an overview
of the main concepts of e-Learning and e-Examination systems and possible attacks
that should be considered. The second section discusses the main security issues
and authentication methods in e-Examination systems, as well as classification of
the main existing authentication schemes. The third section describes our proposed
continuous-authentication schemes. It provides a proposed implementation envi-
ronment and settings and a comprehensive analysis, design and implementation
of the schemes. Finally, the fourth section provides a comprehensive risk-analysis
and evaluation to compare the suggested schemes with their predecessors, a full
discussion of the results and conclusion, as well as challenges and future work.
2 e-Examination Security
One reason for unsuccessful e-Learning is the lack of completely trustable, secured,
protected and cheating-free e-Examination [3, 21, 24–28]. Many studies report
that cheating is common in education [30–32]. Others report that around 70% of
American high-school students conduct cheating in at least one exam, where 95%
are never caught [33]. In addition, twelve studies report an average of 75% of
college students cheat [12, 30, 32].
(Fig. 3 Presence-Identity-Authentication (P-I-A) user security goals [25]: presence, identity and authentication combine for secure e-Exams)
The situation is worse in online exams, where
73.6% of examinees say that cheating is easier and can never be detected [25]. Impersonation is one of the cheating actions that should be prevented, or at least detected, in e-Examination systems.
Impersonation threats in e-Examination are categorized into three types [25]: Type A, Type B and Type C. Unfortunately, countering these three types alone cannot assure cheating-free e-Exam sessions; therefore, we propose a fourth, Type D. These types are defined as follows:
1. Type A (connived impersonation) supposes that a proctor is present. Impersonation might occur in two cases: the proctor could not detect it, or he allowed the impersonation by force, sympathy, or for monetary purposes.
2. Type B occurs when a student passes his security information to a fraudster who answers the exam. Username-password pairs fall into this type. However, a strong authentication method and the presence of a proctor reduce this threat.
3. Type C occurs when the real student just logs in, letting a fraudster continue the exam on his behalf. Non-shareable attributes using biometric approaches, such as fingerprint authentication, fall under this more security-challenging threat.
4. Type D might occur such that the real examinee is taking the exam, but another person assists him with the correct answers.
These factors require a user to know something unique (e.g. a password) that others
do not know. With a strong password policy, unauthorized parties cannot access
users’ information.
A user should possess some token that others do not have, such as keys or
cards. Unauthorized parties cannot access users’ information unless they obtain the
required tokens.
$$\mathrm{FAR} = \frac{IFA}{TNIT} \qquad (1)$$

$$\mathrm{FRR} = \frac{CFR}{TNCT} \qquad (2)$$
where FAR is the false acceptance rate, IFA is the number of impostors falsely accepted, TNIT is the total number of tested impostors, FRR is the false rejection rate, CFR is the number of clients falsely rejected, and TNCT is the total number of tested clients. FAR measures the probability that an impostor is falsely accepted, whereas FRR measures the probability that a valid user is rejected.
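A one-function sketch of Eqs. (1)-(2), with toy counts standing in for real verification trials.

```python
def far_frr(ifa, tnit, cfr, tnct):
    """FAR and FRR per Eqs. (1)-(2): falsely accepted impostors over all
    tested impostors, and falsely rejected clients over all tested clients."""
    return ifa / tnit, cfr / tnct

far, frr = far_frr(ifa=3, tnit=200, cfr=5, tnct=250)  # illustrative counts
print(f"FAR={far:.3f}, FRR={frr:.3f}")                # FAR=0.015, FRR=0.020
```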
KDA assumes that typing rhythm differs from one user to another. It was proposed with five metrics for user identity verification [24]:
• Typing speed, measured in characters per minute.
• Flight-time between two keys up, including the time a user holds on a key.
• Keystroke seek-time that is required to seek for a key before pressing it.
• Characteristic sequences of keystrokes, i.e. frequently typed sequences of keys.
• Characteristic errors, i.e. the common errors made by a user to be identified.
Correlation is used to measure similarity among the features of the saved
templates and the stroked keys, as shown in Eq. (3) [24].
$$r = \frac{\sum_{i=1}^{n} k_i t_i}{\sqrt{\sum_{i=1}^{n} k_i^2 \, \sum_{i=1}^{n} t_i^2}} \qquad (3)$$
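A short sketch of the correlation of Eq. (3), which amounts to the cosine similarity between the stored template features (k) and the freshly captured keystroke features (t); the feature vectors below are invented for illustration.

```python
import numpy as np

def keystroke_similarity(template, sample):
    """Correlation r of Eq. (3) between template features k and sample
    features t (equivalently, their cosine similarity)."""
    k, t = np.asarray(template, float), np.asarray(sample, float)
    return np.sum(k * t) / np.sqrt(np.sum(k**2) * np.sum(t**2))

# Toy vectors of KDA metrics (typing speed, flight time, seek time, ...).
print(keystroke_similarity([210, 0.14, 0.32], [195, 0.16, 0.30]))
```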
This algorithm is proposed originally for video search [36–38], but it can be used for
continuous authentication and auto-detection of cheating actions. The examinee’s
video is matched against his stored template using tree-matching, as shown in
Fig. 4. A video is divided into a number of scenes in a structured-tree; each consists
of groups of relevant shots. The matching process moves level-by-level in a top-
down manner, where similarity is calculated using color histogram and shot style.
The algorithm uses a maximum order sum function to compute similarity in four
steps [36]:
• Initialize a matrix D with zeros for all elements.
• Fill the matrix according to Eq. (4).
Where, D is the matrix, max() is the maximum function, and childSim() is the
child similarity function.
• Locate the sum of child similarity for the optimal match by Eq. (5).
Where, sum is the sum of child similarity, numRow is the number of rows, and
numCol is the number of columns.
(Figure: a generic biometric authentication pipeline, in which a sensor feeds feature extraction, whose output is compared by a matcher against a template database)
Existing solutions for e-Examination authentication are categorized into five main
categories [25]. The same categorization will be adopted in Sect. 4 to conduct a
comparison between these schemes and ours.
$$P = \frac{M^2}{N^2} \qquad (7)$$
Where, P is the expected overlap between questions in two random exam sets, M
is the exam size (i.e. number of questions), and N is the pool size (i.e. number of
questions in the pool). In other words, to raise the probability of choosing a distinct
set of questions for each student, at least, Eq. (8) should be satisfied [30].
$$N \ge S \cdot M \qquad (8)$$
where N is the pool size, S is the number of students sitting the exam, and M is the exam size. Proponents of this scheme consider proctor-based e-Assessment suitable, since it promotes identity and academic honesty [18, 30, 39].
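A tiny sketch of Eqs. (7)-(8) for sizing a question pool; the exam and pool sizes are examples.

```python
def expected_overlap(m, n):
    """Eq. (7): expected overlap P between two random exam sets of size m
    drawn from a pool of n questions."""
    return m**2 / n**2

def min_pool_size(s, m):
    """Eq. (8): smallest pool size N so that each of s students can receive
    a distinct set of m questions."""
    return s * m

print(expected_overlap(m=20, n=200))  # 0.01
print(min_pool_size(s=30, m=20))      # 600
```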
This scheme employs a single biometrics approach for authentication. For example,
web authentication, based on face recognition, is used for the verification of student
identity with BioTracker that can track students while doing their exams at home.
BioTracker can be integrated with LMS, where three concepts are investigated; non-
collaborative verification, collaborative verification and biometrics traces [40].
The handwriting approach is another example of this scheme, where a pen tablet is used for writing the most-used characters in multiple-choice questions [41]. The written characters are compared with templates taken before the exam [41]. Figure 6 depicts the structure of another similar approach that employs the Localized Arc Pattern (LAP) method [41]. It identifies a writer based on one letter
written on a piece of paper. It is adapted for multiple-choice e-Exams to recognize
an examinee by his handwritten letters [41].
Fig. 8 Bimodal biometrics approach using fingerprint and mouse dynamics [42]
(Figure: the security module, comprising encryption, anti-X and a device checker, guarding the academic portal and the UMIS components, such as the finance system, for availability and reliability)
The AP and other web applications can be reached via the Internet or the Intranet
through a security module shown in Fig. 13. It protects the proposed e-Examination
schemes against several vulnerabilities:
1. Encryption: Secure Socket Layer/Transport Layer Security (SSL/TLS) protocol
is employed to encrypt the data stream between the web-server and the browser.
2. Firewall/Access Control Lists (ACL): It blocks unauthorized parties from access
to the e-Examination system and prevents any possible connection to assistants.
3. Cross-Site Scripting (XSS) Detector: A web vulnerability scanner that indicates
vulnerable URLs/scripts and suggests remediation techniques to be fixed easily.
4. SQL injection Detector: A web vulnerability scanner that detects SQL injection.
5. Anti-X (X = virus, worm, spyware, spam, malware and bad content): A reliable firewall/anti-X is installed, and blacklists of incoming/outgoing traffic are defined. These lists and the virus and spyware definitions are kept up to date.
6. Device checker and data collector: Checks that the authentication devices are functioning properly, and ensures that only one of each is installed, by testing interrupt requests (IRQ) and address space. It collects data about the current user, detects violations and exceptions, and issues alerts to the e-Examination system.
The hardware and software required by the entire integrated e-Learning solution (i.e. the implementation environment) refer specifically to those required to develop the e-Examination scheme. Tables 2 and 3 illustrate the hardware and software requirements on both the server and client sides.
Server Side
Client Side
The hardware requirements for both the ISEEU and SABBAH schemes are a personal computer with an Internet connection of 1 Mbps or faster, a webcam with reasonable resolution and headphones. The software requirements include an operating system, a web browser, a media encoder and a Flash player plugin. Additionally, SABBAH requires a biometric mouse with a continuous fingerprint scanner and its suitable APIs.
In traditional exams, proctors control and manage exam sessions: they distribute exam papers, announce the start and termination times, monitor and terminate exams, and report cheating actions and violations. Similarly, in our proposed schemes, a proctor or the system itself manages exam sessions using a state diagram with four possible states (2 bits) and an initial Null state, as shown in Fig. 14. The states transition through eight steps throughout an e-Exam in both of our proposed schemes, except that SABBAH replaces the proctor with VFKPS, as follows:
1. Null-Ready: before starting an e-Exam, its state is Null, where all records,
including the number of attempts, are null. When the examinee opens the quiz block in
Moodle, he is asked to submit his ID, while the “Attempt now” button is dimmed.
At this point, the exam state changes to Ready ‘00’ and the proctor is notified.
2. Ready-Start: if the proctor validates the examinee’s identity, he clicks “Accept”
in his control toolbox to cause a transition from Ready to Start ‘01’. At this point,
the “Attempt now” button is enabled, and the examinee is notified to start.
3. Ready-Null: if the proctor suspects an examinee is not the correct one, he clicks
“Reject”, causing a transition from Ready to Null. So, the examinee should retry.
4. Start-Pause: on exceptions, such as webcam removal, the exam is paused for
exception handling, and its state is changed from Start ‘01’ to Pause ‘10’.
Fig. 14 State diagram of the proposed e-Examination models (ISEEU and SABBAH)
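The state machine of Fig. 14 can be sketched as a transition table. The first four transitions come from steps 1-4 above; the remaining entries are inferred from the state diagram and should be read as assumptions. All Python names below are illustrative.

from enum import Enum

class ExamState(Enum):
    NULL = None        # initial state; all records are null
    READY = 0b00
    START = 0b01       # also covers Resume
    PAUSE = 0b10
    TERMINATE = 0b11

# (current state, event) -> next state
TRANSITIONS = {
    (ExamState.NULL, "submit_id"): ExamState.READY,             # 1. Null-Ready
    (ExamState.READY, "accepted"): ExamState.START,             # 2. Ready-Start
    (ExamState.READY, "rejected"): ExamState.NULL,              # 3. Ready-Null
    (ExamState.START, "exception"): ExamState.PAUSE,            # 4. Start-Pause
    (ExamState.PAUSE, "exception_resolved"): ExamState.START,           # inferred
    (ExamState.PAUSE, "exception_not_resolved"): ExamState.TERMINATE,   # inferred
    (ExamState.START, "cheating_rate_exceeded"): ExamState.TERMINATE,   # inferred
    (ExamState.START, "time_over_or_submitted"): ExamState.TERMINATE,   # inferred
}

def step(state, event):
    """Apply one transition; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

state = step(ExamState.NULL, "submit_id")   # -> READY; the proctor is notified
state = step(state, "accepted")             # -> START; "Attempt now" is enabled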
In our schemes, we propose a list of cheating actions to measure the violation rate and decide penalties, such as deduction from the total score, as illustrated in Table 4. We conducted a survey with a sample of 50 experts in security, proctoring and education. Each expert was asked to assign a weight of 1–5 to each violation, where violations with higher risks are assigned higher weights. Then, the weights were averaged, approximated and normalized to obtain the final risk weights.
3. Initialize proctor’s interface: on the main interface of the e-Course, a user with the proctor role clicks a link to a monitoring interface, which consists of multiple video screens, one per examinee. A screen can be zoomed in/out when a violation action is suspected. The interface also contains a control toolbox and a dropdown violation list.
ISEEU employs a webcam attached to an examinee’s terminal (ET) that streams his
exam session to a media server (MS) through the Internet, as shown in Fig. 16.
The MS, in turn, forwards exam sessions to proctors through the e-Learning
server (ELS). Each examinee’s session is streamed through his channel, and all
the video streams appear on the proctor’s terminal (PT). Both examinee and proctor
are connected to the ELS and the MS through a security module that protects them
against possible attacks.
The flowchart of ISEEU is shown in Fig. 17. It describes its operation and the
procedure of conducting e-Exams. Moreover, the sequence diagram of Fig. 18
interactively clarifies its operation in 18 steps:
Fig. 17 Flowchart of ISEEU: the examinee attempts an e-Exam, a proctor is notified, the exam starts with the cheating indicator initialized (I = 0) and the screen locked full-screen, and proctoring begins; on each no show (NSH) or suspicious action (SA) the indicator accumulates (I += w), and when time is over (TO), the exam is submitted (SU) or I ≥ V, the cheating rate/penalty is computed and the webcam is disabled
Fig. 18 Sequence diagram of ISEEU in 18 steps: (1) attempt an e-Exam; (2) assign and notify a proctor; (3) initialize audio and video devices; (4) initialize video streaming; (5) start video streaming; (6) initialize the proctor’s control toolbox; (7) grant permission for the examinee; (8) notify the examinee of permission; (9) generate the e-Exam; (10) lock the screen; (11) initialize the cheating indicator bar; (12) start streaming and recording; (13) start the e-Exam; (14) terminate the e-Exam; (15) unlock the full-screen; (16) stop streaming and recording; (17) disable audio and video devices; (18) submit the proctor’s session report
3. Initialize audio and video devices: the examinee’s audio and video devices (e.g. headphones and webcam) are initialized, and he is asked to calibrate them.
4. Initialize video streaming: both the media encoder on the examinee’s terminal (ET) and the media server (MS) are initialized, and their connection is checked.
5. Start video streaming: a channel is assigned and a connection is established
between the ET and the MS. The examinee is asked to calibrate the webcam
until his face is top-centered, before his video starts streaming.
6. Initialize proctor’s control toolbox and notify him: initializes the control toolbox for both examinee and proctor and activates the required functions.
7. Grant permission to examinee: the examinee waits for permission from his proctor to start. While waiting, an examinee can play a demo of the instructions. At this step, all necessary functions are activated. Proctors interact with examinees via the chat option in the toolbox and follow a predefined procedure to verify their identity.
8. Notify examinee of permission: if an examinee is identified, the proctor
approves by clicking “Accept”. This enables the “Attempt now” button, and
an examinee is notified to start his e-Exam.
9. Generate e-Exam: this step randomly generates the questions from a question
bank on the ELS that guarantees different questions for each examinee.
10. Lock the examinee’s screen: immediately, the exam interface is locked full-screen, and all functions that allow access to resources are disabled.
11. Initialize cheating indicator bar: a graphical cheating indicator bar is initialized to 0% (I = 0) and appears on the exam’s interface.
12. Start streaming and recording: streaming and recording of the e-Exam session start. The recording is stored on the MS so it can be reviewed in case of uncertainty.
13. Start e-Exam: the timer is initialized, and the exam session starts and continues until time is over (TO), the exam is submitted (SU), or it is terminated because the violation limit is exceeded (I ≥ V). On exceptions, such as device failure or disconnection, the device checker handles the exception before the exam is resumed.
14. Terminate e-Exam: If the exam terminates, it is closed with a relevant warning
message. The questions are auto-scored and scores appear to the examinee.
15. Unlock full-screen: the examinee’s interface is unlocked to its normal state.
16. Stop streaming and recording: streaming is stopped and the video is saved.
17. Disable audio and video devices: the webcam goes off.
18. Submit session report: a session report that contains all violations is generated. The proctor reviews it, and the recorded session if necessary, and submits his report.
During an e-Exam, a proctor might pause/resume a session and can generate alerts and violations by choosing from a list. The cheating rate is recalculated each time and appears on a cheating indicator bar (I) on the examinee’s interface. This rate is accumulated on each issued violation (I += w), such as no show (NSH) or suspicious actions (SA). The exam is paused or terminated if the violation limit is exceeded (I ≥ V).
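A minimal sketch of this indicator logic, with illustrative violation codes and weights (the real weights come from the expert survey summarized in Table 4):

# Illustrative violation weights (w); the real values come from the survey.
WEIGHTS = {"NSH": 0.10, "SA": 0.05}
V = 0.50  # violation limit: the exam ends once I >= V

def issue_violation(indicator, code):
    """Accumulate I += w for the issued violation and test the limit."""
    indicator += WEIGHTS[code]
    return indicator, indicator >= V

I = 0.0
I, terminate = issue_violation(I, "NSH")   # no show
I, terminate = issue_violation(I, "SA")    # suspicious action
if terminate:
    print("Exam terminated: violation limit exceeded")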
The SABBAH scheme resolves the major challenges of ISEEU, especially in that it does not require manual intervention. The following subsections introduce its new features.
Fig. 19 SABBAH architecture: the examinee’s terminal (ET) connects through the Internet to the Video/FPA/KDA Processing Server (VFKPS) for continuous authentication
• The device checker ensures that the authentication devices are the only functional ones, by investigating ports, interrupt requests (IRQ) and memory addresses.
• The firewall/access-lists module rejects unknown in/out protocols by closing all ports except the required ones. This prevents communication with possible assistants.
• The VFKPS server automatically substitutes for proctors using video comparison.
Phase I: Enrollment
This phase starts when a student enrolls e-Learning courses, as shown in Fig. 20:
1. The student’s fingerprint is scanned at the registrar’s desk.
2. A still photo and a short video are captured by a high-resolution camera.
3. A training set of keystrokes is captured by typing a passage on a dedicated PC.
4. The VFKPS performs feature extraction and saves the extracted features on the ELS.
In this phase, an exam passes through four possible states in the VFKPS, as illustrated in Fig. 21. When an examinee opens his video interface, he is asked to check the video and audio settings. Then he submits his ID, while the “Attempt now” button is dimmed.
Initialization
In this phase, devices are enabled, checked, configured and calibrated, and login to the ELS and the e-Examination system takes place. It consists of three steps, FPA, KDA and video initialization, as shown in Fig. 21.
FPA Initialization When a student opens the login screen, the fingerprint scanner is
enabled, and he is asked to enter his imprint on the on-mouse scanner. The VFKPS
runs a matching algorithm with the saved imprint. If login succeeds, the examinee
moves to the next step. Otherwise, he should just retry.
KDA Initialization This starts if the FPA succeeds, where a multimedia demo
appears with instructions, and the examinee is asked to type a short paragraph.
Keystrokes are transferred to the VFKPS for feature extraction and matching with
the stored ones. If they match, he moves to the next step, otherwise, he should retry.
Video Initialization After KDA initialization completes, the examinee is moved to a blank video screen. The webcam is then enabled, his video appears, and the examinee is asked to calibrate the video and audio devices. He is reminded of the chat service with technical-support agents, online 24/7. The matching algorithm extracts the features and compares them with the stored ones. Finally, if they match, he moves to the next step; otherwise, he keeps trying.
Operation
On exam initialization, the system chooses questions randomly from a large pool to minimize the chance of examinees getting similar questions. The pool size should be at least equal to the number of examinees times the exam size. In this phase, the exam actually starts, and so does the timer’s countdown. A full screen locks the examinee’s desktop, and the security module closes the ports to prevent access to related resources from local disks, the Internet, remote desktops or remote assistants. The cheating indicator is initialized to zero (I = 0).
FPA Operation The fingerprint scanner captures the imprint in two modes: randomly or periodically. The imprints are transferred to the VFKPS for continuous matching. If matched, a new imprint is captured. Otherwise, it continues trying and moves to KDA matching.
KDA Operation The examinee’s activities on the keyboard are continuously
captured and sent to the VFKPS for matching. If matched, a new keystroke set is
captured. Otherwise, it continues trying and moves to video matching.
Fig. 21 Flowchart of the SABBAH scheme: (1) initialization, FPA (enable scanner, enter imprint, transfer to VFKPS, image processing, feature extraction, match), KDA (instructions, enter test set, data analysis, feature extraction, match) and video (enable webcam, capture video shot); (2) operation, continuous FPA, KDA and video matching with the indicator initialized (I = 0), full-screen lock and firewall rules applied; (3) violation, where on NI, NS, NSH or SA the indicator accumulates (I += w); (4) termination
Video Operation This step takes video shots randomly or periodically and sends them to the VFKPS for feature extraction and continuous matching. If matched, a new video shot is captured and the operation is repeated. Otherwise, it continues trying, repeats the operation cycle from the beginning, and moves to the next step to add a violation with a weight (w) from the list.
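The fallback chain of the three operation steps can be summarized as a polling loop. The matcher functions below merely stand in for the VFKPS calls and are purely illustrative:

import random
import time

def fpa_matches():    # stands in for on-mouse fingerprint matching on the VFKPS
    return random.random() > 0.05

def kda_matches():    # stands in for keystroke-dynamics matching on the VFKPS
    return random.random() > 0.05

def video_matches():  # stands in for video feature matching on the VFKPS
    return random.random() > 0.05

def continuous_authentication(issue_violation, cycles=100, period=30.0):
    """Each cycle tries FPA first, falls back to KDA, then to video.
    Only when all three fail is a violation weighted (I += w)."""
    for _ in range(cycles):
        if not fpa_matches() and not kda_matches() and not video_matches():
            issue_violation("SA")   # all devices failed to recognize the user
        time.sleep(period)          # random or periodic sampling interval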
Violation
A violation occurs when some rules are violated, either by the system or by the examinee. The exam saves its status, pauses if necessary, and exception handling follows before the exam can be resumed.
System’s Violation This occurs when the keyboard, the mouse or the webcam stops responding, or is turned off, removed or duplicated. The device checker can detect such violations. Power failure, Internet disconnection and application or operating-system errors are also considered. If the examinee did not cause them, these errors are not treated as cheating actions, and no penalty is applied. Following the state diagram, the exam saves its state, pauses and notifies an agent in the technical support unit (TSU). When the cause of the violation is detected and corrected, the exam resumes. Restart and shutdown of either hardware or software are allowed to fix the problem. For security reasons, the examinee cannot review the previously answered questions after the system is back in operation.
Examinee’s Violation This occurs when the examinee violates instructions, cheats or tries to cheat. Violations might be impersonation, getting assistance from others, or access to exam resources or material, etc. The violations are weighted, and the cheating rate is accumulated using the weights (w) and represented on the cheating indicator (I). A violation is weighted if all devices fail to recognize a user, or if a suspicious action (SA) is detected. The examinee’s violations in SABBAH can be:
• FPA violations include a fingerprint that does not match the stored one, and no imprint (NI) captured for a predefined period in questions that require a mouse.
• KDA violations include unmatched keystroke features and keystroke absence, i.e. no strokes (NS), for a predefined period in essay questions.
• Video violations include, but are not limited to: an unmatched face and/or head, specific parts of the body not showing (NSH) for a predetermined number of tries, suspicious actions (SA) and moves (looking around, sleeping, bending, etc.), and producing noise, speech or voice. On each violation, the system generates relevant warning messages, and the cheating indicator increments (I += w). If the resultant rate exceeds the violation limit (I ≥ V), the exam moves to termination.
Termination
In this phase, the examination session actually terminates, as follows:
Normal Termination This occurs either when the exam’s time is over (TO) or when an examinee submits all questions (SU). In both cases, the system saves the session’s status report and video recording. The report includes all violations and the total cheating rate. Finally, it moves to the finalization phase, which unlocks the full screen, turns off the authentication devices and applies penalties.
Abnormal (Cheating) Termination Each time the examinee commits a violation, it appears on his cheating indicator bar. When this rate exceeds a specific limit, say I ≥ 50%, the exam automatically terminates with a zero grade. In fact, this limit depends on the institution’s rules and can be configured in the system settings. After termination, the same procedure as in (1) is followed.
The flowchart of this phase is shown in Fig. 22. It includes grading, applying
penalties, reporting, and transfer of scores.
4.1 Results
For more accuracy in the risk analysis, each risk measures several threats. The
distribution of priority on security risks is illustrated in Table 5. It is shown that
Preventing access to resources achieved the highest priority with a weight of 0.708,
whereas e-Learning environment security achieved the lowest priority with a weight
of 0.533. Note that the 5th and the 6th risks have the same priority of 0.567.
The average score of each scheme is weighted by Eq. (9) [45, 46], where each score is multiplied by its corresponding weight and summed, and the sum is then divided by the sum of all weights:
$$T_s = \frac{\sum_{i=0}^{n} W_i S_i}{\sum_{i=0}^{n} W_i} \qquad (9)$$
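As a worked check of Eq. (9): only the weights 0.708, 0.567 and 0.533 are stated in the text, so the remaining weights and the per-risk scores below are hypothetical placeholders.

# Hypothetical inputs: only 0.708, 0.567 and 0.533 are stated in the text;
# the remaining weights and the per-risk scores are placeholders.
weights = [0.708, 0.667, 0.667, 0.633, 0.567, 0.567, 0.533]
scores = [100.0, 100.0, 100.0, 100.0, 100.0, 73.9, 100.0]

# Eq. (9): weighted-average score.
T_s = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
print(round(T_s, 1))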
Table 6 ISEEU and SABBAH compared with the previous e-Examination schemes against security risks
Previous scheme Proposed scheme
No. Security risk Proc-LAB BWA-FR eTest-LAP Theo-CFP FP-MD FP-HGD SIE-VM FP-VM ISEEU SABBAH
1 Preventing access to resources 54.5% 15.2% 15.2% 24.2% 24.2% 24.2% 63.6% 63.6% 100% 100%
2 Satisfying C-I-A goals 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 100%
3 Satisfying P-I-A goals 100% 33.3% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 66.7% 100%
4 Impersonation threats 100% 35.7% 71.4% 71.4% 71.4% 71.4% 71.4% 100% 100% 100%
5 Interaction and feedback 100% 30.8% 30.8% 30.8% 30.8% 30.8% 30.8% 30.8% 69.2% 100%
6 R/FT of auth. methods 100% 8.7% 26.1% 30.4% 47.8% 47.8% 21.7% 43.5% 43.5% 73.9%
7 Environment security 100% 20.0% 20.0% 20.0% 20.0% 20.0% 20.0% 20.0% 100% 100%
Average score 88.7% 30.0% 42.4% 44.3% 46.8% 46.8% 48.7% 55.9% 78.0% 96.3%
Weighted-average score 87.4% 30.5% 42.8% 44.9% 47.2% 47.2% 50.3% 57.3% 78.4% 96.5%
Table 7 Comparison of our proposed schemes (i.e., ISEEU and SABBAH) and the previous categories (i.e., schemes are combined into their categories)
against security risks
Fig. 23 ISEEU and SABBAH compared with the previous e-Examination Schemes against
security risks (Percentage Scores)
resources, C-I-A and P-I-A goals were not considered. Therefore, we present a more
accurate and comprehensive evaluation below.
Our evaluation and discussion assume that the proctor-based scheme (i.e.
traditional) has the highest security in the optimal case. In general, e-Examination
schemes compete with it to achieve, at most, the same security. The proctored-
only (Proc-LAB) scheme is the most similar to a traditional one, except that it is
computer-based and conducted in computer labs.
Nevertheless, the SABBAH scheme achieved the first rank and the best security, with an average score of 96.3%. The following justifies this distinct rank of SABBAH across the seven measured risks:
1. Access to exam resources (100%): Exam resources are protected by several
methods:
– The full-screen lock prevents access to local disks, storage media, Internet,
Intranet, LANs and WANs.
– Video matching detects cheating actions such as copy from textbooks, cheat
sheets and PDAs (e.g. Phone Calls/MMS/SMS).
– Exam questions are randomly generated from a large pool (e.g. a question
bank).
– The exam repository is protected with fingerprint authentication and the security module (e.g. SSL/TLS, firewall, anti-X, XSS/SQLI detection, etc.).
Fig. 24 Comparison of our proposed schemes (i.e. ISEEU and SABBAH) and the previous categories (i.e. combined schemes) against security risks (percentage scores). Categories compared: Proctored-Only (Proc-LAB), Unimodal Biometrics (Theo-CFP), Bimodal Biometrics (FP-HGD), Video Monitoring (SIE-VM), Biometrics with VM (FP-VM), Proposed Model I (ISEEU), Proposed Model II (SABBAH)
2. C-I-A goals (100%): strong authentication and the security module assure confidentiality, integrity and availability. Availability might be slightly degraded when the number of examinees exceeds some limit, if not scheduled carefully.
3. P-I-A goals (100%): continuous authentication using fingerprint, keystroke dynamics and video matching ensures presence, identification and authentication.
4. Impersonation threats (100%): all types of impersonation threats are solved:
– Type A: never occurs, since there is no manual intervention. Also, sessions are recorded for further processing in case of uncertainty.
– Type B: never occurs, since biometric authentication is used, so there is no way to pass security information to fraudulent persons.
– Type C: as with Type B, continuous authentication prevents a fraudulent person from taking control of the exam, since this scenario would be detected.
– Type D: video matching catches local assistants, while remote assistants are blocked by the security module, which closes all vulnerable ports.
5. Interaction and feedback (100%): two effective tools of interaction with examinees are provided: the cheating indicator and warning messages. The indicator measures the cheating rate and appears on the exam’s interface. At the same time, warning messages appear to examinees on each violation.
6. Redundancy/fault tolerance of authentication devices (73.9%): if one of the three authentication devices fails, another continues working. The device checker, in the security module, checks devices, detects failures and fixes them. It also ensures a single device only for each authentication method.
7. Environment security (100%): the security module provides a typical secure environment for the whole system. The LCMS (i.e. Moodle) is protected with SSL/TLS encryption. The firewall, the access lists, and the XSS and SQLI detectors reduce attacks. Moreover, attachments are checked for viruses, malware, spyware, etc. If any is detected, it is recovered, quarantined or deleted.
Proctored-only (Proc-LAB) comes next, achieving the second rank with an average score of 88.7%. It scored 54.5% and 66.7% in the 1st and 2nd risks respectively, and 100% in the last five risks. However, proctored-only cannot be considered a pure e-Examination scheme.
Our proposed scheme ISEEU ranks third, achieving an average score of 78%. Although it scored poorly in the 6th risk (43.5%), this risk has a lower priority and its impact is not considerable. The justification for this reasonable result is:
1. Access to exam resources (100%): The same as SABBAH, but video monitoring
replaces video matching for cheating prevention.
2. C-I-A goals (66.7%): The security module provides high confidentiality and
integrity. Availability is degraded when the number of examinees exceeds some
limit, since each session needs a new channel. This reserves more memory, CPU
and bandwidth. Therefore, it should be scalable and kept under monitoring.
3. P-I-A goals (66.7%): Continuous video monitoring guarantees presence, identifi-
cation and authentication, but using username and password to login is still weak
and vulnerable.
4. Impersonation threats (100%): All types of impersonation threats are solved:
– Type A: exam sessions are monitored and recorded for uncertainty cases.
Fig. 25 Weighted-average scores of ISEEU and SABBAH against the previous categories
4.3 Conclusion
It has been shown that security is vital for e-Learning systems, since they are operational and contain sensitive information and operations. One of these operations is e-Examination, which has received increased attention recently. If an institution decides to offer pure e-Courses, it faces the problem of untrusted e-Examination. Several efforts have been made to provide trustworthy e-Examination by proposing new schemes that improve security and minimize cheating. Even so, those schemes could not compete with traditional proctor-based examination.
This work contributes to solving the problem of cheating and other security issues, taking into account most of the possible vulnerabilities. We believe that our proposed e-Examination schemes, ISEEU and SABBAH, present a major contribution to the field. They are highly secure, simple and easy to implement. They are the strongest competitors to the traditional examination scheme, and may even surpass it thanks to their new features, as our results show.
The results show that our proposed schemes provide reasonable security compared with the previous ones: SABBAH ranks first and ISEEU third in the risk analysis, with the proctored-only scheme (i.e. traditional examination) ranked second, after SABBAH. This rank is satisfactory, since our objective was to propose a scheme that competes with the traditional one. The reason for SABBAH’s superiority is its higher security and its high ability to prevent and detect cheating actions.
However, the claim that traditional exams are 100% secure is theoretical: the literature shows that more than 70% of high-school students in the USA admit to cheating, and 95% of them are never caught. Traditional exams are also inflexible in place and time and economically infeasible. Moreover, all procedures, such as exam delivery, monitoring, grading, scoring and data entry, are manual, which is time-consuming and needs manpower.
Both SABBAH and ISEEU schemes satisfy the C-I-A (Confidentiality, Integrity
and Availability) and the P-I-A (Presence, Identity and Authentication) goals. Also,
they resolve security issues that were neither resolved nor mentioned previously:
• Access to exam resources, such as textbooks, worksheets, the local computer, the Internet and remote assistance. Both schemes prevent this through full-screen locks, interactive monitoring or video matching, and continuous authentication.
• Our schemes are interactive, based on the cheating indicator and warning messages, to stop cheating when detected, a feature the previous schemes lack.
• Webcam failure pauses the exam until it is fixed to resume, while this scenario
leads to system failure in the previous video monitoring schemes.
Regarding plagiarism and impersonation threats, our schemes resolve all types of threats. In addition, they have many advantages over the previous schemes. For instance, the SABBAH scheme is:
• Fully automated: a proctor is no longer needed. Also, grading, cheating penalties and data transfer are all executed automatically.
4.4 Challenges
Although they have many advantages and resolve several security issues, our schemes face some challenges, which can be summarized as:
• Internet speed/backbone bandwidth and robustness: a high-speed, stable Internet connection is required, especially at peak times.
• Performance and capacity: the schemes require servers with large memory and storage/disk space, and SABBAH requires a VFKPS with high processing power.
• Implementation complexity: implementing automatic video matching is not easy, and its algorithms are still under development. Performance and accuracy in feature extraction and matching remain challenging. Also, users on the client side require special hardware for fingerprint scanning.
• Failure penalties: on unrecoverable failures, the previously answered questions cannot be reviewed by an examinee after the session is restored, for security reasons.
However, anyone who keeps track of the fast advancement of information and communication technology (ICT) will find that all of these challenges are likely to be resolved soon. For instance, proponents claim that cloud computing fits this type of application and will overcome many of these challenges. Accordingly, this claim was tested on a small scale, where ISEEU was deployed on a cloud platform as a service (PaaS). The test was limited, since we used a trial version with limited services. Regarding the special hardware requirement, most modern computers have fingerprint scanners, and touch screens can provide more options for secure and interactive e-Examination schemes.
We emphasize that our proposed e-Examination schemes will not provide completely cheating-free e-Exams, but they do minimize cheating and impersonation threats. With well-developed functions, we believe that our schemes are more efficient and cost-effective than proctor-based in-classroom sessions.
References
31. A. Lathrop and K. Foss, “Student Cheating and Plagiarism in the Internet Era: A Wake-
Up Call”, Englewood: Libraries Unlimited 2000. Last access December 2016. http://
www.questia.com/PM.qst?a=o&d=111682012
32. M. Dick, J. Sheard, C. Bareiss, J. Carter, D. Joyce, T. Harding and C. Laxer, “Addressing Student Cheating: Definitions and Solutions”, ACM Special Interest Group on Computer Science Education Bulletin, vol. 35, no. 2, pp. 172-184, June 2003.
33. Murdock TB, Miller A, Kohlhardt J. Effects of Classroom Context Variables on High School
Students’ Judgments of the Acceptability and Likelihood of Cheating. Journal of Educational
Psychology. 2004 Dec;96(4):765.
34. A. Kapil and A. Garg, “Secure Web Access Model for Sensitive Data”, International Journal
of Computer Science and Communication, vol.1, no.1, pp.13-16, January-June 2010.
35. J. A. Hernández, A. O. Ortiz, J. Andaverde and G. Burlak, “Biometrics in Online Assessments:
A Study Case in High School Students”, Proceedings of the 18th International Conference on
Electronics, Communications and Computers, Puebla, Mexico, pp.111-116, March 2008.
36. C. W. Ng, I. King and M. R. Lyu, “Video Comparison Using Tree Matching Algorithms”,
Proceedings of the International Conference on Imaging Science, Systems and Technology,
Las Vegas, USA, pp.184-190, June 2001.
37. X. Chen, K. Jia and Z. Deng, “A Video Retrieval Algorithm Based on Spatio-temporal Feature
Curves and Key Frames”, Proceedings of the 5th International Conference on International
Information Hiding and Multimedia Signal Processing, pp.1078-1081, September 2009.
38. M. S. Ryoo and J. K. Aggarwal, “Spatio-Temporal Relationship Match: Video Structure
Comparison for Recognition of Complex Human Activities”, Proceedings of the IEEE 12th
International Conference on Computer Vision, pp.1593-1600, Kyoto, Japan, September-
October 2009.
39. G. Harrison, “Computer-Based Assessment Strategies in the Teaching of Databases at Honours
Degree Level 1”, In H. Williams and L. MacKinnon (Eds.), BNCOD, vol.3112, pp.257-264,
Springer 2004.
40. E. G. Agulla, L. A. Rifon, J. L. Alba Castro and C. G. Mateo, “Is My Student at the Other
Side? Applying Biometric Web Authentication to e-Learning Environments”, Proceedings of
the 8th IEEE International Conference on Advanced Learning Technologies, Santander, Spain,
pp.551-553, July 2008.
41. S. Kikuchi, T. Furuta and T. Akakura, “Periodical Examinees Identification in e-Test Systems
using the Localized Arc Pattern Method”, Proceedings of the Distance Learning and Internet
Conference, Tokyo, Japan, pp.213-220, November 2008.
42. S. Asha and C. Chellappan, “Authentication of e-Learners Using Multimodal Biometric Tech-
nology”, in International Symposium on Biometrics and Security Technologies, Islamabad,
Pakistan, pp.1-6, April 2008.
43. Y. Levy and M. Ramim, “Initial Development of a Learners’ Ratified Acceptance of Multi-
biometrics Intentions Model (RAMIM)”, Interdisciplinary Journal of e-Learning and Learning
Objects, vol.5, pp.379-397, 2009.
44. C. C. Ko and C. D. Cheng, “Secure Internet Examination System Based on Video Monitoring”
Internet Research: Electronic Networking Applications and Policy, vol.14, no.1, pp.48-61,
2004.
45. N-Calculators Official Website, “Weighted Mean Calculator”. Last access December 2016.
http://ncalculators.com/statistics/weighted-mean-calculator.htm
46. Wikipedia Website, “Mathematical Definition of Weighted Mean”. Last access December
2016. http://en.wikipedia.org/wiki/Average
Attribute Noise, Classification Technique,
and Classification Accuracy
R. Indika P. Wickramasinghe
1 Introduction
The quality of the data is of utmost importance in data analytics, irrespective of the type of application. In data classification, high-quality data are even more important, as the accuracy of the classification correlates with the quality of the data.
Real-world datasets can contain noise, which is one of the factors that lowers data quality [16, 38]. Shahi et al. [39] consider outliers and noise as the uncertainty of data. Handling data that are mixed with noise makes analysis difficult, and some organizations spend millions of dollars per year on detecting errors in their data [32]. When noise is mixed with cyber-security-related data, serious attention is needed to handle it, due to the sensitive nature of the data.
Attention to cyber-security and protective measures has grown rapidly with the expansion of cyberattacks. Stealing intellectual and financial resources is the main motive behind the deliberate breaching of computer systems. As mobile commerce expanded, a series of thefts crept up, the majority of them related to credit-card transactions. In one study, Hwang et al. [19] state that approximately 60% of American adults avoid doing business online due to their concern about misuse of personal information. Riem [33] reports an incident in which Halifax, a British bank, shut down its credit-card site due to the exposure of consumers’ details. In this regard, measures to minimize credit-card fraud are in greater demand at present than ever before.
Noise hinders classification [35, 46, 49] by decreasing performance accuracy and by increasing both the time needed to construct the classifier and the size of the classifier. This is why the identification of noise is an integral part of data classification. Noise can be introduced in many ways into online credit-card transactions. Besides the conventional ways that introduce noise, research indicates that magnetic card chips can be vulnerable at times and can introduce noise into a transaction. Therefore, identifying and isolating noise from the data before analysis is very important.
Apart from noise, the shape of the attributes’ distributions can affect the quality of data classification. Most natural continuous random variables adhere to some sort of Gaussian distribution; when the data show a departure from normality, they are considered skewed. Osborne [31] points out that mistakes in data entry, missing data values, the presence of outliers, and the nature of the variable itself are some of the reasons for skewness in data. Furthermore, Osborne [31] makes use of several data transformation techniques, such as square root, natural log and inverse transformations, to convert non-normal data into normal. There is a strong association between the existence of noise and the skewness of the distribution, since skewness of the data directly influences outliers. Hence, there should be an important connection between the skewness of the data and the classification accuracy.
Even if one removes the noise and the skewness of the data, classification would still not reach maximum accuracy. Selecting the correct classification technique for the available data, and using appropriate sample sizes for the training and test data, are two of the options one can consider for improving classification. Though there are several findings in the literature about identifying an appropriate classification technique, only a handful of findings exist in connection with the selection of suitable sample ratios, and none of them relate to cyber-security data mixed with noise. In this chapter we have two broad aims. First, we study the impact of noise on effective data classification. Secondly, we aim
to improve the classification of noisy data. To this end, we consider how skewness, appropriate sample ratios, and the classification technique impact classification. This study brings novelty in two ways. To the author’s knowledge, it is rare to find a study investigating the relationship between the selected features above and classification accuracy. Furthermore, we propose a novel, simple and, most importantly, effective noise detection algorithm to improve classification accuracy. The rest of the chapter is organized as follows: Sect. 2 discusses related work, and Sect. 3 provides the background to classification techniques. In Sect. 4, the dataset is described. Section 5 discusses the issue of attribute noise and its effect on classification accuracy. Section 6 investigates how skewness of the data influences classification accuracy. Section 7 attempts to improve the classification accuracy of data mixed with noise, by means of noise removal and the selection of appropriate sizes for the training and testing samples. Then Sect. 8 discusses the results of the study and Sect. 9 concludes the chapter.
2 Related Work
The use of SVM in data classification is not novel, and opinions are mixed regarding its accuracy. Sahin and Duman [36] incorporated decision trees and SVM in credit-card fraud analysis. Though they found that the model based on decision trees outperformed SVM, the difference in performance between the two methods decreased as the size of the training dataset increased. In another study, Wei and Yuan [45] used an optimized SVM model to detect online credit-card fraud and found that their proposed non-linear SVM model performed better than the others. Colas and Brazdil [8] and Fabrice and Villa [13] compared SVM with K-nearest neighbour (kNN) and naive Bayes in text classification. Though they expected that SVM would outperform its counterparts, the authors could not find SVM to be the clear winner. In addition, they pointed out that the performance of kNN continues to improve with the use of a suitable preprocessing technique. Scholkopf and Smola [37] and Mennatallah et al. [28] utilized SVM in anomaly detection. In addition to SVM-based techniques, other alternative techniques can be found in the literature.
Abu-Nimeh et al. [1] compared six classifiers on phishing data. In their study,
authors used Logistic Regression, Classification and Regression Trees, Bayesian
Additive Regression Trees, Support Vector Machines, Random Forests, and Neural
Networks. According to the outcomes, authors claimed that the performance of
Logistic Regression was better than the rest.
As previous research findings indicate, the nature of the data is imperative for the accuracy of classification. Bagging and boosting are popular ensemble approaches for classification trees, and random forest was proposed by adding an extra layer to bagging [25]. Díaz-Uriarte and Andres [12] incorporated random forest in gene selection and classification of microarray data.
3 Background
The aim of this section is to describe the theoretical foundation used in this study. First, it describes the four types of classification techniques used in this study. Secondly, the quantitative measurements used to compare the four classification techniques are discussed.
SVM is considered one of the most popular classification techniques [7]; it was introduced by Vapnik [43]. It is based on statistical learning theory, and it attempts to separate two types of data (Class A and Class B) using a hyper-plane that maximizes the margin of separation, as shown in Fig. 1.
Consider the n-tuple training dataset $S = \{(x_i, y_i)\}_{i=1}^{n}$. Here the feature vector space is M-dimensional and the class variable is binary, i.e., $x_i \in \mathbb{R}^M$ and $y_i \in \{-1, +1\}$. As mentioned before, the ultimate aim of the SVM is to find the optimal hyper-plane that splits the dataset into two categories.
If the feature space is completely linearly separable, then the optimal separating hyper-plane can be found by solving the following quadratic programming problem:
$$\min_{w, b} \; \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, n \qquad (2)$$
Unfortunately, not all datasets are linearly separable. Therefore, SVM uses a mapping called the kernel. The purpose of the kernel mapping $\Phi$ is to project the linearly inseparable feature space into a higher dimension, so that the data can be linearly separated in the higher-dimensional space. The separating hyper-plane in the transformed feature space is given by:
$$w \cdot \Phi(x) + b = 0 \qquad (3)$$
Fig. 1 The optimal separating hyper-plane maximizes the margin between Class A and Class B
where w represents the weight vector, which is normal to the hyper-plane. The following are the most popular kernel functions, which have been extensively studied in the literature (their practical use is illustrated in the sketch after this list):
• Linear kernel: $\Phi(x_i, x_j) = x_i^T x_j$
• Polynomial kernel: $\Phi(x_i, x_j) = (\gamma x_i^T x_j + 1)^d$, for $\gamma > 0$
• Radial kernel: $\Phi(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, for $\gamma > 0$
• Sigmoid kernel: $\Phi(x_i, x_j) = \tanh(\alpha x_i^T x_j + r)$
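The four kernels map directly onto the kernel option of scikit-learn's SVC, as the brief sketch below shows (synthetic data; scikit-learn assumed to be available):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The four kernel functions listed above correspond to SVC's `kernel` option.
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_tr, y_tr)
    print(kernel, round(clf.score(X_te, y_te), 3))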
PCA is considered not robust enough to work with data when outliers are present. Halouska and Powers [15] investigated the impact of noise on PCA applied to Nuclear Magnetic Resonance (NMR) data. They concluded that a very small oscillation in the noise of the NMR spectrum can cause a large variation in the PCA, which results in irrelevant clustering. Robust Principal Component Analysis (RPCA), a modified version of the popular PCA method, attempts to recover the low-rank matrix from corrupted measurements. Though there are numerous versions of RPCA, Hubert et al. [18] proposed a method that can be described briefly as follows (a simplified sketch appears after this list):
• First, singular value decomposition is used to reduce the data space.
• Next, a measure of outlyingness is calculated for each data point. The covariance matrix $\Sigma_m$ is computed from the m data points (out of all n) with the smallest outlyingness measurements. In addition, k principal components are chosen to retain.
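A much-simplified, illustrative sketch of these two steps follows. It is not Hubert et al.'s full ROBPCA: the outlyingness measure here is a plain distance from the median in the SVD-reduced space, chosen only to make the procedure concrete.

import numpy as np

def robust_pca_sketch(X, k=2, keep=0.75):
    """(1) Reduce the data space via SVD; (2) keep the m least-outlying points
    and retain k eigenvectors of their covariance matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt.T                                            # SVD-reduced data space
    out = np.linalg.norm(Z - np.median(Z, axis=0), axis=1)   # crude outlyingness
    m = int(keep * len(X))
    Z_m = Z[np.argsort(out)[:m]]                             # m least-outlying points
    cov_m = np.cov(Z_m, rowvar=False)                        # covariance matrix Sigma_m
    _, eigvecs = np.linalg.eigh(cov_m)
    return eigvecs[:, ::-1][:, :k]                           # k principal components

components = robust_pca_sketch(np.random.default_rng(0).normal(size=(100, 5)))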
Random Forest is considered one of the most popular and widely applied machine-learning algorithms in data classification. Though its main use is classification, it can also serve as a useful regression model. Random Forest is an ensemble learning technique proposed by Breiman [4]. Ensemble methods are learning algorithms that build a series of classifiers to classify a novel data point. This technique provides an answer to the overfitting problem present in individual decision trees (Fig. 2).
The entire classification process of Random Forest is achieved in a series of steps, as described below.
• Suppose the training dataset contains N cases. A subset of the N cases is taken at random with replacement; this is used as the training set to grow the tree.
• Assume there are M input variables. A number m, lower than M, is chosen, and m variables are selected at random from the collection of M. The best split on the selected m variables is then used to split the node. Throughout this process, m is kept constant.
• Without pruning, each tree is grown to the largest possible extent. The prediction for a new data point is obtained by aggregating the predictions of all trees.
where P represents the total number of positive (Class A) instances, while N represents the total number of Class B instances.
$$F\text{-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$
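As an illustration (synthetic data; scikit-learn assumed), the following sketch fits a Random Forest and computes the F-measure exactly as defined above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=29, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# max_features="sqrt" keeps m (the variables tried per split) constant and < M.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=1).fit(X_tr, y_tr)
y_hat = rf.predict(X_te)

precision = precision_score(y_te, y_hat)
sensitivity = recall_score(y_te, y_hat)   # recall is the sensitivity
f_measure = 2 * precision * sensitivity / (precision + sensitivity)
print(round(f_measure, 3))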
4 Dataset
In this study we utilize a dataset of online credit-card transactions. This secondary dataset was derived from an initial dataset containing transactions by European credit-card holders over two days in September 2013. The dataset includes 29 features, including the time, the amount, and the duration of the transaction in seconds. Though it would be interesting to know all of the attribute names in the dataset, due to confidentiality issues the data do not disclose all of the background information.
Data can be classified into attributes and class labels [49]. Noise may relate either to an independent variable (attribute) or to the dependent variable (class). Hence, noise in data can be classified as either attribute noise or class noise [23, 26, 42, 47, 49]. According to the literature, class noise has more adverse effects on classification than attribute noise. Despite this, it is believed that attribute noise is more complicated to handle. Therefore, we focus on studying attribute noise and its impact on data classification.
In the next phase, random noise was introduced into the dataset at the 5%, 10%, 15% and 20% levels, and the F-measure was measured accordingly. Using the F-measures in each case, the percentage change of the F-measure was calculated. Table 2 summarizes these findings. In the implementation of the algorithm, training and testing samples were generated according to a 70:30 ratio, and performance indicators were calculated based on 100 randomly selected samples.
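A sketch of this noise-injection protocol is given below. The corruption model, Gaussian perturbation of a randomly chosen fraction of attribute cells, is our assumption, since the chapter does not specify one.

import numpy as np

def add_attribute_noise(X, level, seed=None):
    """Corrupt `level` (e.g. 0.05 for 5%) of the attribute cells at random."""
    rng = np.random.default_rng(seed)
    X_noisy = X.copy()
    mask = rng.random(X.shape) < level                 # cells to corrupt
    noise = rng.normal(0.0, X.std(axis=0), X.shape)    # per-attribute noise scale
    X_noisy[mask] += noise[mask]
    return X_noisy

X = np.random.default_rng(0).normal(size=(200, 29))
for level in (0.05, 0.10, 0.15, 0.20):
    X_noisy = add_attribute_noise(X, level, seed=0)
    # ... 70:30 train/test split, fit the classifier, record the F-measure ...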
6.1 Skewness
Skewness provides a measure of the symmetry of a distribution. The measurement can be either positive or negative; in either case, skewness makes a symmetric distribution asymmetric. Transforming skewed data into symmetric data is common in data analysis. Though there is no strict set of guidelines regarding the type of transformation to use, the following are some of the popular transformations. As Howell [17] suggested, if data are positively skewed, sqrt(X) is used; if the skewness is moderate, then log(X) can be used. Brown [5] described the identification of skewness based on the Joanes and Gill [21] and Cramer [10] formulas, as described below.
Consider a sample of n data points, x_1, x_2, ..., x_n. The method-of-moments coefficient of skewness is computed according to Eq. (5):
$$g = \frac{m_3}{m_2^{3/2}}, \quad \text{where } m_3 = \frac{\sum (x - \bar{x})^3}{n} \text{ and } m_2 = \frac{\sum (x - \bar{x})^2}{n} \qquad (5)$$
Using the above g, the sample skewness G is calculated according to Eq. (6):
$$G = g \, \frac{\sqrt{n (n - 1)}}{n - 2} \qquad (6)$$
Finally, the skewness of the data is decided based on the value of $Z_g$, where
$$Z_g = \frac{G}{SES}, \quad SES = \sqrt{\frac{6 n (n - 1)}{(n - 2)(n + 1)(n + 3)}}$$
and SES denotes the standard error of skewness. Therefore, we can classify the data as negatively skewed if $Z_g < -2$. If $|Z_g| < 2$, then the distribution may be symmetric, negatively skewed or positively skewed, and no firm conclusion can be drawn. Finally, if $Z_g > 2$, we can classify the distribution as positively skewed.
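The complete test is a direct transcription of Eqs. (5) and (6) and the $Z_g$ rule, assuming a one-dimensional numeric sample:

import numpy as np

def skewness_class(x):
    n = len(x)
    d = x - x.mean()
    m2, m3 = (d ** 2).mean(), (d ** 3).mean()
    g = m3 / m2 ** 1.5                                  # Eq. (5)
    G = g * np.sqrt(n * (n - 1)) / (n - 2)              # Eq. (6)
    ses = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    Zg = G / ses
    if Zg < -2:
        return "negatively skewed"
    if Zg > 2:
        return "positively skewed"
    return "inconclusive (|Zg| < 2)"

print(skewness_class(np.random.default_rng(0).exponential(size=500)))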
presence of noise in the data, a noise removal algorithm may be critically important for classification accuracy. In this section, we aim to improve the classification accuracy by proposing a novel but simple noise removal algorithm. The algorithm is tested using the credit-card data and compared with a standard outlier detection method based on Cook’s [9] distance.
The algorithm is executed in several steps. First, the data in each attribute are binned according to the selected bin sizes (5, 10, 15, 25, and 50). Next, the data in each bin are converted into their corresponding sample z-scores, treating the entire bin as the sample. This process continues, one attribute at a time, until the algorithm has covered the entire dataset. After the z-score calculation is complete for each attribute, the standard outlier detection procedure explained in Sect. 6.1 is applied. In the last step, the outliers are removed from the dataset. The pseudo code of the proposed algorithm is displayed in Table 4; a minimal sketch follows.
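A minimal sketch consistent with this description is given below. Equal-frequency binning and a |z| > 3 cut-off for the "standard outlier detection" step are our assumptions:

import numpy as np

def remove_noise(X, bin_size=10, z_cut=3.0):
    """Bin each attribute, z-score within each bin (the bin is the sample),
    and drop every row holding at least one outlying cell."""
    outlier_row = np.zeros(len(X), dtype=bool)
    for j in range(X.shape[1]):                        # one attribute at a time
        order = np.argsort(X[:, j])
        for start in range(0, len(X), bin_size):
            idx = order[start:start + bin_size]        # one bin of the attribute
            col = X[idx, j]
            sd = col.std()
            if sd == 0.0:
                continue
            z = (col - col.mean()) / sd
            outlier_row[idx[np.abs(z) > z_cut]] = True
    return X[~outlier_row]

X_clean = remove_noise(np.random.default_rng(0).normal(size=(500, 5)))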
This algorithm is implemented under each classification method on data with
5%, 10%, 15%, and 20% noise levels. Performance indicators are recorded before
and after the implementation of the algorithm. Obtained results are shown in
Figs. 5 and 6.
The effectiveness of this algorithm is compared with the Cook’s distance
approach. Logistic regression is fitted on the above data and Cook’s distance is
calculated on each data point. Then the data points are declared as unusual if
the Cook’s distance is greater than 4/(total number of instances in the dataset). This is implemented for all the datasets at the 5%, 10%, 15% and 20% noise levels.
Fig. 5 Change (%) of F-measure, bin size, and classification method for noise level of 5%
Fig. 6 Change (%) of F-measure, bin size, and classification method for noise level of 15%
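The baseline can be reproduced as below, assuming statsmodels and its GLM influence API are available; logistic regression is fitted as a binomial GLM and the 4/n rule is applied as stated above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)

# Logistic regression as a binomial GLM; Cook's distance from its influence object.
fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
cooks_d = fit.get_influence().cooks_distance[0]

threshold = 4 / len(y)                     # the 4/n rule used in the chapter
unusual = np.flatnonzero(cooks_d > threshold)
print(f"{unusual.size} points flagged as unusual")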
identification of the appropriate balance between the two sample sizes is crucial. Therefore, it would be beneficial to understand how the ratio of the samples impacts classification accuracy when the data are mixed with noise.
We study this using nine training:test sample ratios: 90:10, 80:20, 70:30, 60:40, 50:50, 40:60, 30:70, 20:80, and 10:90. For each case, we consider datasets with 5%, 10%, 15% and 20% noise levels. After simulating 100 instances, the average and the standard deviation of each measurement indicator were calculated at each noise level. In addition, the standard deviation and the mean of the F-measure were calculated.
When comparing two statistics whose standard deviations and means follow different distributions, the ratio between the standard deviation (σ) and the mean (μ) is considered a better measurement. This measurement is the Coefficient of Variation (CV), which can be calculated according to Eq. (8); the CV quantifies the dispersion of a statistic relative to its mean value.
$$CV = \frac{\sigma}{\mu} \times 100 \qquad (8)$$
Fig. 7 Training: test ratio, noise levels and standard deviation of the F-measure
Fig. 8 Training: test ratio, noise levels and standard deviation of the F-measure
8 Results
Table 2 shows the average F-measure for each classification method at each noise level; smaller changes indicate robustness of the model to noise. According to Table 2 and the outcomes of an Analysis of Variance (ANOVA), there is a significant difference in the values for the Random Forest method compared to its counterparts [F(3, 9) = 4.064, p = 0.0442]. This means that the Random Forest method outperforms the rest of the alternatives across all noise levels. Though SVM works better when there is less noise, as the noise increases, RPCA works better than SVM.
Table 3 summarizes how skewness of the data influences classification accuracy in the presence of noise. Here too, the Random Forest method shows the smallest change in both sensitivity and the F-measure, indicating that it handles skewed data well. SVM is not a robust classification technique for skewed data in comparison to its counterparts, although its performance improves as the noise level increases. Of PCA and RPCA, the latter performs better than the former, which suggests that RPCA is the better classification technique for noisy and skewed data. This behaviour is clearly illustrated in Figs. 3 and 4.
Figures 5 and 6 display the relationship among bin size, classification technique, noise level and classification accuracy. At the 5% noise level, the highest change of F-measure across all bin levels is recorded by SVM; this is significantly different from the other methods. This indicates that classification accuracy can
9 Conclusion
In this empirical study we investigated the impact of attribute noise on classification accuracy. First, we studied the influence of attribute noise on classification accuracy. Secondly, we tried to enhance the power of data classification in the presence of noise. To this end, we used a cyber-security-related dataset of online credit-card transactions, with which we classify whether or not a transaction involved fraud. According to our findings, it is clear that classification accuracy is hindered by the presence of noise in the data. Furthermore, the Random Forest method outperforms the other three classification techniques even in the presence of noise. SVM seems better than the other two in the absence of Random Forest, but when the level of noise increases, RPCA performs better than SVM. When testing the influence of skewness on data classification, we found that skewness has a direct impact on classification accuracy, and that this influence affects each technique differently. Among the selected techniques, Random Forest is robust even when the data are skewed. Though SVM looks vulnerable when classifying skewed data, it performs better when the noise level of the data is higher. Further analysis shows that RPCA can classify data well when the data are noisy and skewed. As a means of improving the classification accuracy for noisy data, we proposed a simple noise removal algorithm. According to the obtained results, our algorithm significantly improves the classification accuracy and shows better performance than the Cook’s distance approach. Further analysis indicates that there is a strong relationship between the selected bin size and the classification accuracy. Though this influence does not impact all the classification techniques uniformly, bin sizes 10 and 20 record higher performance than other bin sizes. Finally, we studied the appropriate ratio of sample sizes for the training and test datasets. According to the outcomes of this study, there is a strong connection between the ratio of the datasets and the classification accuracy, so one needs to pay attention to this ratio when selecting sample sizes. As the results indicate, if the ratio is too small, the variability of the performance indicator inflates. In cyber-security-related data, enhancing the performance even by a small amount is advantageous. Though we have improved the classification accuracy significantly, the obtained outcomes motivate further research to explore the inclusion of novel classification techniques. In addition, future work will be conducted to study the influence of other factors on the accuracy of data classification. The current study was conducted on balanced data; it would therefore be interesting to extend it to unbalanced data as well. Furthermore, an extension of the noise removal algorithm to address class noise would also be beneficial.
References
1. Abu-Nimeh, S., Nappa, D., Wang, X., & Nair, S. A comparison of machine learning techniques
for phishing detection. In Proceedings of the anti-phishing working groups 2nd annual eCrime
researchers summit, pp. 60-69, ACM. (2007).
2. Akbani R., Kwek S., and Japkowicz N.: “Applying support vector machines to imbalanced
datasets,” in Proceedings of the 15th European Conference on Machine Learning, pp. 39–50,
(2004).
3. Beleites C., Neugebauer U., Bocklitz T., Krafft C., Popp J.: Sample size planning for
classification models. Anal Chim Acta. Vol. (760), pp. 25–33, (2013).
4. Breiman L.: Random forests. Machine Learning, Vol. 45(1), pp. 5–32, (2001).
5. Brown S., Measures of Shape: Skewness and Kurtosis, https://brownmath.com/stat/shape.htm,
(2008-2016)
6. Cao Y., Pan X., and Chen Y.: “SafePay: Protecting against Credit Card Forgery with
Existing Card Readers”, in Proc. IEEE Conference on Communications and Network Security,
pp. 164–172, (2015).
7. Carrizosa, E., Martin-Barragan, B., Morales, D. R.: Binarized support vector machines.
INFORMS Journal on Computing, Vol. 22(1), pp. 154–167, (2010).
8. Colas F., and Brazdil P., “Comparison of SVM and Some Older Classification Algorithms in Text
Classification Tasks”, “IFIP International Federation for Information Processing”, Springer
Boston Volume 217, Artificial Intelligence in Theory and Practice, pp. 169–178, (2006).
9. Cook, R. D.: “Influential Observations in Linear Regression”. Journal of the American
Statistical Association. Vol. 74 (365), pp. 169–174, (1979).
10. Cramer, Duncan Basic statistics for social research: step-by-step calculations and computer
techniques using Minitab. Routledge, London.; New York, (1997).
11. Cureton, Edward E, and Ralph B. D’Agostino. Factor Analysis, an Applied Approach.
Hillsdale, N.J: L. Erlbaum Associates, (1983).
Attribute Noise, Classification Technique, and Classification Accuracy 219
12. Díaz-Uriarte R., De Andres, S. A.: Gene selection and classification of microarray data using
random forest. BMC bioinformatics, 7(1), p. 3, (2006).
13. Fabrice, R, Villa, N.: Support vector machine for functional data classification. Neurocomput-
ing/EEG Neurocomputing, Elsevier, 69 (7–9), pp.730–742, (2006).
14. Guyon I.: A scaling law for the validation-set training-set size ratio, AT & T Bell Laboratories,
Berkeley, Calif, USA, (1997).
15. Halouska S., Powers R.: Negative impact of noise on the principal component analysis of NMR
data, Journal of Magnetic Resonance Vol. (178) (1), pp. 88–95, (2006).
16. Hickey R. J., “Noise modelling and evaluating learning from examples,” Artif. Intell., vol. 82,
nos. 1–2, pp. 157–179, (1996).
17. Howell, D. C. Statistical methods for psychology (6th ed.). Belmont, CA: Thomson
Wadsworth, (2007).
18. Hubert, M., Rousseeuw, P. J., Branden, K. V.: ROBPCA: a new approach to robust principal
components analysis, Technometrics, vol. 47, pp. 64–79, (2005).
19. Hwang, J. J., Yeh, T. C., Li, J. B.: Securing on-line credit card payments without disclosing
privacy information. Computer Standards & Interfaces, Vol. 25(2), pp. 119-129, (2003).
20. Jayavelu D., Bar N.: A Noise Removal Algorithm for Time Series Microarray Data. In: Correia
L, Reis L, Cascalho J, editors. Progress in Artificial Intelligence, vol. 8154. Berlin: Springer,
pp. 152–62, (2013).
21. Joanes, D. N., Gill C. A.: “Comparing Measures of Sample Skewness and Kurtosis”. The
Statistician Vol. 47(1), pp. 183–189, (1998).
22. Kathiresan K., Vasanthi N. A., Outlier Detection on Financial Card or Online Transaction data
using Manhattan Distance based Algorithm, International Journal of Contemporary Research
in Computer Science and Technology (IJCRCST) Vol. 2(12), (2016).
23. Khoshgoftaar T., Hulse J. V.: Identifying noise in an attribute of interest. In ICMLA ’05:
Proceedings of the Fourth International Conference on Machine Learning and Applications
(ICMLA’05), pp. 55–62, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-
2495-8. doi: 10.1109/ICMLA.2005.39, (2005).
24. Lee C. C., Yoon J. W.: “A data mining approach using transaction patterns for card fraud
detection”, Seoul, Republic of Korea, pp. 1-12, (2013).
25. Liaw A., Wiener M.: Classification and Regression by Random Forest, R News, Vol. 2(3),
(2002).
26. Liebchen G.: Data Cl eaning Techniques for Software Engineering Data Sets. Doctoral thesis,
Brunel University, (2011).
27. Maratea A., Petrosino, A.: Asymmetric kernel scaling for imbalanced data classification, in:
Proceedings of the 9th International Conference on Fuzzy Logic and Applications, Trani, Italy,
pp. 196–203, (2011).
28. Mennatallah A., Goldstein M., Abdennadher, S.: Enhancing oneclass support vector machines
for unsupervised anomaly detection. In Proceedings of the ACM SIGKDD Workshop on
Outlier Detection and Description pp. 8–15, (2013).
29. Miranda A. L., Garcia L. P., Carvalho A. C., Lorena A. C., “Use of classification algorithms in
noise detection and elimination”, Proc. 4th Int. Conf. Hybrid Artif. Intell. Syst., pp. 417–424,
(2009).
30. Oja, E.: Principal components, minor components, and linear neural networks. Neural Net-
works, pp. 927–935, (1992).
31. Osborne, J. Notes on the use of data transformations. Practical Assessment, Research &
Evaluation, 8(6). http://PAREonline.net/getvn.asp?v=8&n=6, (2002).
32. Redman, T.: Data Quality for the Information Age. Artech House, (1996).
33. Riem, A.: Cybercrimes of the 21st century: crimes against the individual—part 1, Computer
Fraud and Security. Vol 6, pp. 13–17, (2001).
34. Rosenberg A.: “Classifying Skewed Data: Importance Weighting to Optimize Average Recall,”
Proc. Conf. Int’l Speech Comm. Assoc. (InterSpeech ’12), (2012).
35. Sáez, J.A., Galar M., Luengo, J. et al. Analyzing the presence of noise in multi-class problems:
alleviating its influence with the One-vs-One decomposition. Knowl Inf Syst 38: 179. doi:
10.1007/s10115-012-0570-1, (2014).
220 R. Indika P. Wickramasinghe
36. Sahin Y., Duman E.: “Detecting Credit Card Fraud by Decision Trees and Support Vector
Machines”, International Multi-conference of Engineers and computer scientists, (2011).
37. Scholkopf, B., Smola A. J.: Support Vector Machines and Kernel Algorithms, The Handbook
of Brain Theory and Neural Networks. MIT Press, Cambridge, UK, (2002).
38. Seo S.: Masters thesis. University of Pittsburgh; Pennsylvania: A review and comparison of
methods for detecting outliers in univariate data sets, (2006).
39. Shahi A., Atan R. B., Sulaiman M. N.: Detecting effectiveness of outliers and noisy data on
fuzzy system using FCM. Eur J Sci Res 36: pp. 627–638, (2009).
40. Siddiqui F., and Ali, Q. M.: Performance of non-parametric classifiers on highly skewed data,
Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Vol. 12(2), pp. 1547–1565,
(2016).
41. Tang L., Liu H.: Bias analysis in text classification for highly skewed data. In ICDM ’05:
Proceedings of the Fifth IEEE International Conference on Data Mining, IEEE Computer
Society, pp. 781–784, (2005).
42. Teng M. C.: Combining noise correction with feature selection. pp. 340–349, (2003).
43. Vapnik V., The Nature of Statistical Learning Theory. Springer-Verlag, ISBN 0-387-98780-0,
(1995).
44. Wang, Bin, et al. “Distance-based outlier detection on uncertain data.” Ninth IEEE Interna-
tional Conference on Computer and Information Technology, 2009. CIT’09. Vol. 1. IEEE,
(2009).
45. Wei X., and Yuan L.: “An Optimized SVM Model for Detection of Fraudulent Online
Credit Card Transactions,” International Conference on Management of e-Commerce and
e-Government, 2012.
46. Xiong H., Pandey G., Steinbach M, Kumar V.: “Enhancing data analysis with noise removal,”
IEEE Trans. Knowl. Data Eng., Vol. 18( 3), pp. 304–319, (2006).
47. Yoon K., Bae D.: A pattern-based outlier detection method identifying abnormal attributes in
software project data. Inf. Softw. Technol., Vol. 52(2), pp. 137–151. ISSN 0950-5849. (2010).
48. Zhou X., Zhang Y., Hao S., Li S., “A new approach for noise data detection based on cluster and
information entropy.” The 5th Annual IEEE International Conference on Cyber Technology in
Automation, Control, and Intelligent Systems, (2015).
49. Zhu X., Wu X., Class noise vs. attribute noise: a quantitative study, Artificial Intelligence
Review Vol. 22 (3). pp.177–210, (2004).
50. Zhu, X., Wu X., Yang, Y.: Error Detection and Impact-sensitive Instance Ranking in Noisy
Datasets. In Proceedings of 19th National conference on Artificial Intelligence (AAAI-2004),
San Jose, CA. (2004).
Part II
Invited Chapters
Learning from Loads: An Intelligent System
for Decision Support in Identifying Nodal Load
Disturbances of Cyber-Attacks in Smart Power
Systems Using Gaussian Processes and Fuzzy
Inference
Abstract The future of electric power is associated with the use of information
technologies. The smart grid of the future will utilize communications and big data
to regulate power flow, shape demand with a wealth of information
and ensure reliability at all times. However, the extensive use of information
technologies in the power system may also form a Trojan horse for cyberattacks.
Smart power systems where information is utilized to predict load demand at the
nodal level are of interest in this work. Control of power grid nodes may constitute
an important tool in cyberattackers' hands for bringing chaos to the electric power
system. An intelligent system is proposed for analyzing loads at the nodal level in
order to detect whether a cyberattack has occurred in the node. The proposed system
integrates computational intelligence with kernel modeled Gaussian processes and
fuzzy logic. The overall goal of the intelligent system is to provide a degree of
possibility as to whether the load demand is legitimate or has been manipulated
in a way that is a threat to the safety of the node and that of the grid in general. The
proposed system is tested with real-world data.
1 Introduction
The application of digital control system technologies and sensor networks for
monitoring purposes in critical power facilities is a topic of high interest [1]. The
use of digital technologies offers significant advantages that include reduction in
the purchase and maintenance costs of facility components and equipment, and a
significant reduction in the volume of hardware deployed throughout the facility [2].
Fig. 1 The notion of smart power systems that are comprised of two layers
The basic mission of power systems is the nonstop delivery of generated elec-
tricity to connected consumers. Delivery systems have been equipped with several
monitoring, prognostics and diagnostics tools, as well as several redundant systems
and mechanisms [3]. It should be noted that redundancy refers to the availability of
more systems and mechanisms than required to perform the same operation.
Redundancy allows the power system to retain normal operation in case
a system or mechanism fails, given that the rest will compensate for it. Thus,
power systems exhibit high resiliency and fault tolerance behavior in fulfilling their
mission of 24/7 power delivery. The advent of computers and internet connectivity
added one more factor that should be taken into serious consideration: cyber
security [4]. The advantages offered by the ever-increasing use of information and
communication technologies (ICT) in power system operation come at the cost of
new vulnerabilities in system security. What was formerly a physically isolated
system is now open to unauthorized access and manipulation through
potential vulnerabilities posed by ICT use [5]. For instance, supervisory control
and data acquisition (SCADA) systems constitute the backbone of the overall power
system monitoring and subsequent decision-making processes. Hence, despite their
practical benefits, SCADA systems nowadays present serious targets for cyber-
attackers.
The advent of smart power technologies is expected to couple data technologies
with conventional power systems in ways that optimize energy consumption and
minimize losses [6]. The notion of smart power systems is visualized in Fig. 1,
where it can be observed that the smart power system is a form of a cyber-
physical system [7, 8]. The information (i.e., cyber) layer is comprised of two
modules. The first module includes the intelligent systems and signal processing
tools and the second module the databases. Those two modules are complementary;
the intelligent tools utilized stored information to make management decisions
pertained to power system operation [9]. The second layer is the physical layer that
contains the physical components of the power system.
Given that smart power systems are heavily dependent on data and information
processing tools as it is depicted in Fig. 1, cybersecurity is a major issue that
cannot be overlooked. Every asset of smart power systems may be compromised
and subsequently be transformed into a vehicle for conducting a cyber-attack [10].
The remainder of this chapter contains five sections. Section 2 briefly introduces
GPR and fuzzy logic inference, while Sect. 3 introduces smart power systems and,
in particular, the Energy Internet. Section 4 describes the proposed intelligent system.
Section 5 presents the results obtained by applying the intelligent system to a set of
load demand data. Finally, Sect. 6 concludes and summarizes the salient points
of our approach.
2 Background
Machine learning has been identified as one of the pillars in developing efficient
data analytics methods and decision support systems. One of the preeminent areas
of machine learning is the class of non-parametric methods called kernel machines.
A kernel machine is any analytical model that is a function of a kernel [23].
A kernel (a.k.a., kernel function) is any valid analytical function that takes the
following form:

k(x1, x2) = φ(x1)^T φ(x2)    (1)

where φ(x) is called the basis function, and x1, x2 are input values. The inputs to
a kernel may be either both scalars or both vectors of equal length. Their range of
values depends on the problem under study, while the kernel output represents the
similarity between the input values. In general, selection of the basis function falls
within the responsibilities of the modeler and depends on the specifics of the application
at hand [24]. For instance, the simplest basis function is φ(x) = x, and therefore the
respective kernel takes the form given below:

k(x1, x2) = x1^T x2    (2)
which is simply known as the linear kernel. It should be noted that formulation of
models using kernels whose form can be determined by the modeler is called the
kernel trick and finds wide use in data analytics and pattern recognition applications
[23].
The class of kernel machines contains the Gaussian processes (GP). Gaussian
processes may be modeled as a function of a kernel. In particular, the kernel
enters into the Gaussian process formulation through its covariance function. A
kernel-modeled Gaussian process can be used either in classification or regression
problems. In the latter case, it is identified as Gaussian process regression [23].
Like a Gaussian distribution, a Gaussian process is identified via its two
parameters, namely the mean function and the covariance function, denoted by m(x)
and C(x^T, x) respectively. Thus, we get [23]:

GP ~ N( m(x), C(x^T, x) )    (3)

Derivation of the GPR framework takes Eq. (3) as its starting point, where we set

m(x) = 0    (4)

and

C(x^T, x) = k(x^T, x).    (5)

The predictive mean and variance at a new input x_{N+1} are then given by [23]:

m(x_{N+1}) = k^T C_N^{-1} t_N    (6)

σ^2(x_{N+1}) = κ - k^T C_N^{-1} k    (7)

with C_N being the covariance matrix among the N training data, k being the vector of
covariances between the input x_{N+1} and the N training data, t_N the vector of
training targets, and κ the scalar value taken as k(x_{N+1}, x_{N+1}) [23].
Thus, kernel selection should be done carefully and with respect to the desired
GPR output. Overall, kernel-based GPR offers flexibility in prediction-making; the
modeler is able to select a kernel from among existing ones or compose a new kernel.
Hence, the kernel form can be tailored to the specifics of the prediction problem,
giving the modeler flexibility in how the prediction model is built.
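To make the preceding formulation concrete, the following minimal Python sketch implements Eqs. (2), (6) and (7) for an arbitrary kernel. It is illustrative only: the function names, the jitter term added to C_N for numerical stability, and the toy data are our own assumptions, not part of the chapter.

```python
import numpy as np

def linear_kernel(x1, x2):
    # Eq. (2): k(x1, x2) = x1^T x2, obtained from the basis phi(x) = x.
    return float(np.dot(x1, x2))

def gpr_predict(X_train, t_train, x_new, kernel, noise_var=1e-4):
    """Predictive mean and variance of Eqs. (6)-(7).

    C_N is the covariance matrix among the N training inputs, k the vector
    of covariances between x_new and the training inputs, and kappa the
    scalar k(x_new, x_new)."""
    N = len(X_train)
    C_N = np.array([[kernel(xi, xj) for xj in X_train] for xi in X_train])
    C_N += noise_var * np.eye(N)              # jitter/noise term (assumption)
    k = np.array([kernel(xi, x_new) for xi in X_train])
    kappa = kernel(x_new, x_new)
    C_inv = np.linalg.inv(C_N)
    mean = k @ C_inv @ np.asarray(t_train)    # Eq. (6)
    var = kappa - k @ C_inv @ k               # Eq. (7)
    return mean, var

# Toy regression: three scalar observations, prediction at x = 4.
m, v = gpr_predict([1.0, 2.0, 3.0], [1.1, 1.9, 3.2], 4.0, linear_kernel)
```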
IF x is A, THEN y is B
IF x is C, THEN y is D
where A, B, C and D are fuzzy sets. By using the sets in Fig. 2, an example of a
fuzzy controller for a heating system might have the following rules:
Fig. 3 General framework of Smart Power Systems implemented as an Energy Internet [30]
Having introduced the framework of smart power systems (the Energy Internet), this section
presents the "learning from loads" system for detecting nodal load disturbances
to enhance cybersecurity in this framework. In the following subsection, the
proposed methodology as well as the attack scenarios pertaining to load disturbances
are presented.
In this section, the intelligent system that makes decisions pertaining to nodal loads
with respect to cybersecurity is presented. It should be noted that the proposed
intelligent system exploits current as well as future information via the use of
anticipation functions. Anticipation allows the system to evaluate its future states
against what it has seen up to this point. To that end, a learning-from-load-demand
approach is adopted, which subsequently anticipates the future demand.
Fig. 4 Block diagram of the proposed learning from loads intelligent system
Anticipation is utilized to evaluate the actual demand and make inferences whether
data have been manipulated or not.
The block diagram of the proposed system is depicted in Fig. 4. We observe
that there are two paths. The first path analyzes recorded data and anticipates the future
load, while the second path utilizes the current load. In addition to the above paths,
we may observe in Fig. 4 that there is a subsidiary path that carries external
information pertaining to that specific node, for instance, the price of electricity at
the current time.
Load anticipation is performed using kernel modeled Gaussian process regres-
sion, which was introduced in the previous section. The GPR is equipped with a
Gaussian kernel whose analytical form is given below:
k(x1, x2) = exp( -||x1 - x2||^2 / (2σ^2) )    (9)

where σ^2 is a kernel parameter whose value is determined in the training phase
of the algorithm. The GPR model is trained on previously recorded load demands.
Once the training is finished, the GPR is utilized for making predictions of the future
demand. Therefore, we observe that this is the first point at which our system learns
from loads (i.e., past loads).
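As an illustration of this step, the sketch below trains a kernel-modeled GPR with the Gaussian kernel of Eq. (9) on a synthetic two-day hourly demand history and anticipates the next hour. The sinusoidal demand profile, the value of σ and the noise term are illustrative assumptions; the chapter determines σ during training on real recorded demands.

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=3.0):
    # Eq. (9): k(x1, x2) = exp(-||x1 - x2||^2 / (2 * sigma^2))
    d = np.atleast_1d(np.asarray(x1, float) - np.asarray(x2, float))
    return float(np.exp(-(d @ d) / (2.0 * sigma ** 2)))

# Illustrative hourly demands for the two days before the targeted day.
hours = np.arange(48.0)
demand = 100.0 + 20.0 * np.sin(2.0 * np.pi * hours / 24.0)

# Covariance among the training data (Eq. (5)) plus a small noise term.
C = np.array([[gaussian_kernel(a, b) for b in hours] for a in hours])
C += 1e-3 * np.eye(len(hours))

# Anticipate the demand of the next hour via Eq. (6).
x_next = 48.0
k_vec = np.array([gaussian_kernel(a, x_next) for a in hours])
anticipated = k_vec @ np.linalg.solve(C, demand)
print(f"Anticipated next-hour demand: {anticipated:.1f}")
```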
Further, we observe in Fig. 4 that the anticipated, previous and
current loads are fed to a fuzzy inference system [34]. In addition, the external
information is also fed directly to the fuzzy inference system. Overall, it should
be noted that all of the available information is forwarded to the fuzzy inference.
The fuzzy inference system is comprised of two parts, as Fig. 4 depicts. The
first part contains the fuzzy rules utilized for making inferences. The fuzzy rules are
predetermined by the system modeler and take the form of IF..THEN rules, as
shown in Sect. 2. The left-hand side of the rules, i.e., the conditions, may
refer to the anticipated load or the current load. In particular, the rules for anticipated load
have the following general form:
have the following general form:
where the fuzzy variable is Future Load and AF denotes the fuzzy values (see Fig. 5)
while the rules for current load have the form:
with Current Load being the fuzzy variable in that case (see Fig. 6).
In addition, we define a new fuzzy variable equal to the absolute
difference between the Anticipated Load and the Actual Load of the same time
interval, i.e.:

Difference = | Anticipated Load - Actual Load |

where the variable Difference is the fuzzified value obtained by subtracting the
anticipated load from the actual load. This fuzzy variable allows measuring the
inconsistency between the anticipated and the actual load over the same time
interval. It should be noted that the actual load is different from the current load: the
current load expresses the current load demand, while the actual load refers
to the actual demand for the time interval for which the anticipation was made. This is a
fundamental inference variable, given that the offset between anticipated and actual
demand may "reveal" the presence of a threat.
Lastly, we model the current prices of electricity as external information. The
presented intelligent system aspires to be deployed in a smart power environment
(Energy Internet), and therefore price will play an important role in justifying load
disturbances. The price reflects the willingness of the consumer to change their
consumption. For instance, an attacker will increase the load demand no matter how
high the prices are; that can be used as an indicator to identify potential attackers.
The rules pertaining to price have the following form:
The second part of the fuzzy inference system is the defuzzification stage.
The defuzzification uses the center of area method [27], whose form is given in
Eq. (8). The defuzzification provides a crisp output that is also the final output of
the system. The output value is the confidence of compromise pertaining to load
disturbances for that node. The value of the confidence may support power
system operators and security engineers in deciding whether there is a cyber-
attack pertaining to the load demand.
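A minimal sketch of such an inference step is given below. The triangular membership functions, the two rules and all numeric ranges are illustrative placeholders for the fuzzy sets of Figs. 5 and 6, which are not reproduced here; only the use of the Difference and price variables and the center of area defuzzification of Eq. (8) follow the text.

```python
import numpy as np

def tri(y, a, b, c):
    # Triangular membership function, vectorized over y (illustrative).
    y = np.asarray(y, dtype=float)
    left = (y - a) / (b - a) if b > a else np.ones_like(y)
    right = (c - y) / (c - b) if c > b else np.ones_like(y)
    return np.clip(np.minimum(left, right), 0.0, 1.0)

def confidence_of_compromise(difference, price):
    """Toy two-rule Mamdani inference on normalized inputs in [0, 1].

    Rule 1: IF Difference is LARGE AND Price is HIGH THEN Confidence is HIGH
    Rule 2: IF Difference is SMALL THEN Confidence is LOW"""
    w_high = min(float(tri(difference, 0.2, 0.6, 1.0)),   # Difference is LARGE
                 float(tri(price, 0.5, 1.0, 1.0)))        # Price is HIGH
    w_low = float(tri(difference, 0.0, 0.0, 0.4))         # Difference is SMALL

    # Clip the output sets by the rule strengths, aggregate with max,
    # then defuzzify with the center of area method (Eq. (8)).
    y = np.linspace(0.0, 1.0, 201)
    mu = np.maximum(np.minimum(w_high, tri(y, 0.5, 1.0, 1.0)),
                    np.minimum(w_low, tri(y, 0.0, 0.0, 0.5)))
    return float((y * mu).sum() / mu.sum()) if mu.sum() > 0.0 else 0.0

print(confidence_of_compromise(difference=0.8, price=0.9))  # high confidence
print(confidence_of_compromise(difference=0.1, price=0.4))  # low confidence
```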
planned over a long time. Furthermore, the attackers may show patience, performing
long-term reconnaissance and following a carefully planned series of actions. Given
that smart power systems are heavily dependent on the utilization of information,
cyberattacks may also cause a significant degree of grid congestion by increasing
the nodal load demand.
The scenario that we examine in this work is the following:
– A cyberattacker wants to cause a blackout in a part of the grid. One way to
achieve this is by destabilizing one of the grid nodes. Destabilization can be done
by increasing the demand beyond the capacity limits of the node. As a result, the
node will become nonoperational and power delivery will fail at that point.
In addition, if that node is on the backbone of the grid, it may propagate the
problem to the rest of the grid.
– A simple way to mount this type of attack is to compromise the intelligent meters
of several consumers connected to that node. As we noted, in the vision
of the Energy Internet the consumer negotiates with the retailer. Negotiations may
be undertaken by the attacker with the consumer having no knowledge of it.
Overall, by taking control of several of the consumers, the attacker may cause a
blackout in the node.
In the next section, the above scenario will be exercised on a real case.
Nodal demand from real-world datasets will be used as our test case. The
intelligent system will be utilized to analyze the load (current and anticipated) and
output a confidence value as to whether there is a manipulated increase in load demand.
5 Results
In this section, we test the proposed intelligent system on a set of real world data.
The datasets contain the hourly values of loads and energy prices within the smart
grid system, for one day before the targeted day. For visualization purposes the
actual demand signal is given in Fig. 10.
The goal is to detect whether the security status of a node in the grid has been
compromised. We assume that load demands are recorded every hour. Therefore,
the GPR model is utilized for anticipating the load for the next hour. The predicted
signal is depicted in Fig. 11. It should be noted that the training of the GPR was
performed utilizing the demand data of one and two days before the targeted day.
In our work, we tested our intelligent system on the following scenarios:
(1) No compromise has occurred.
(2) From 12:00 to 15:00, the node is compromised and a 10% increase in demand is present.
(3) From 12:00 to 15:00, the node is compromised and a 50% increase in demand is present.
(4) From 12:00 to 15:00, the node is compromised and a 75% increase in demand is present.
(5) From 12:00 to 15:00, the node is compromised and a 90% increase in demand is present.
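A simple way to emulate these scenarios on an hourly demand series is sketched below; the demand profile is synthetic and the function name is our own, but the 12:00 to 15:00 window and the percentage increases follow the scenarios above.

```python
import numpy as np

def inject_compromise(hourly_demand, pct_increase, start_hour=12, end_hour=15):
    # Scale the demand recorded between start_hour and end_hour (inclusive)
    # by the given percentage; pct_increase=0 reproduces scenario (1).
    demand = np.asarray(hourly_demand, dtype=float).copy()
    hour_of_day = np.arange(len(demand)) % 24
    window = (hour_of_day >= start_hour) & (hour_of_day <= end_hour)
    demand[window] *= 1.0 + pct_increase / 100.0
    return demand

# The five scenarios applied to an illustrative day of hourly demand.
day = 100.0 + 20.0 * np.sin(2.0 * np.pi * np.arange(24) / 24.0)
scenarios = {pct: inject_compromise(day, pct) for pct in (0, 10, 50, 75, 90)}
```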
The presented intelligent system is applied to the above scenarios and a degree
of confidence per hour is obtained. The respective degrees of confidence are given
in Table 1. We observe in Table 1 that in the majority of the cases the intelligent
system reports a low confidence of compromise. The value
is not zero, mainly because of the uncertainty in load prediction and the volatility of
prices, as well as the defuzzification method: the center of area method does not
attain the extreme values (0 and 1). However, a low confidence is an adequate sign for
the operator to believe that there is no cyberattack. Though there is no specific
decision threshold, any confidence value above 0.7 is taken to denote the occurrence of
a node compromise.
Regarding the scenarios in which we intentionally increased the demand in line with
the goal of the attack (to black out the node), we observe the results in the
shaded area at the center of Table 1. We observe in scenario 2 that the confidence
increases slightly. This increase is very small and the operator may therefore ignore
it. However, the amount of load that was compromised is only 10%, and thus we
assume that this is not enough to cause a serious problem for that node. Overall, this
compromise may not be detected; however, the goal of the system is to support
stakeholders in decision-making tasks, rather than to make the actual decisions.
With regard to scenarios 3, 4 and 5, the confidence increases above 0.7 and
therefore we can consider that a cyberattack has been detected with high confidence.
Such high confidence is a serious indicator that something is wrong and that the demand
has increased beyond the regular one. Therefore, it is safe to conclude that the presented
intelligent system detects the threat in those cases with high confidence and hence
supports the correct decision by the operator.
6 Conclusions
Smart power systems integrate power with information and, as a
result, may become targets for cyberattackers. The importance of the power grid
for innumerable everyday activities and modern life makes the defense of the
grid mandatory. In addition, recently recorded attacks showed that cyberattacks may
be carefully planned and not just opportunistic occasions for attackers to show off.
Therefore, every aspect of smart power systems should be secured.
Intelligent systems offer new opportunities for implementing new decision
support and data analysis methods that mimic the way of human system operators.
In this work, we examined the case in which the attacker plans to congest a power
grid node by increasing the demand and causing a blackout. An intelligent system
that learns from load signals by utilizing GPR and fuzzy inference was developed
and applied to a set of five different scenarios. The results were encouraging: the
higher the "compromised demand", the higher the degree of confidence of compromise
provided by our system. Therefore, our system shows high potential for deployment
in smart power systems, and in particular in an Energy Internet scenario.
Future work will follow two main directions. In the first direction, we will
explore the use of kernel functions beyond the Gaussian kernel for prediction-making
in the GPR model. In particular, we will apply a variety of kernels to load
data coming from different nodes and record their performance. Analysis of the
records will be used to develop a system for indicating the best kernel for each
node. In the second direction, we will extensively test our intelligent system on a
wider variety of data, including data from nodes in different geographical areas and
with different assemblies of customers.
References
1. Wood, A. J., & Wollenberg, B. F. (2012). Power generation, operation, and control. John Wiley
& Sons.
2. Amin, S. M., & Wollenberg, B. F. (2005). Toward a smart grid: power delivery for the 21st
century. IEEE power and energy magazine, 3(5), 34–41.
3. Han, Y., & Song, Y. H. (2003). Condition monitoring techniques for electrical equipment-a
literature survey. IEEE Transactions on Power delivery, 18(1), pp. 4–13.
4. Li, S., Li, C., Chen, G., Bourbakis, N. G., & Lo, K. T. (2008). A general quantitative crypt-
analysis of permutation-only multimedia ciphers against plaintext attacks. Signal Processing:
Image Communication, 23(3), pp. 212–223.
5. Ramsey, B. W., Stubbs, T. D., Mullins, B. E., Temple, M. A., & Buckner, M. A. (2015).
Wireless infrastructure protection using low-cost radio frequency fingerprinting receivers.
International Journal of Critical Infrastructure Protection, 8, 27–39.
6. Alamaniotis, M., Gao, R., & Tsoukalas, L.H., “Towards an Energy Internet: A Game-Theoretic
Approach to Price-Directed Energy Utilization,” in Proceedings of the 1st International ICST
Conference on E-Energy, Athens, Greece, October 2010, pp. 3–10.
7. Alamaniotis, M., Bargiotas, D., & Tsoukalas, L.H., “Towards Smart Energy Systems:
Application of Kernel Machine Regression for Medium Term Electricity Load Forecasting,”
SpringerPlus – Engineering, Springer, vol. 5 (1), 2016, pp. 1–15.
8. Karnouskos, S. (2011, July). Cyber-physical systems in the smartgrid. In Industrial Informatics
(INDIN), 2011 9th IEEE International Conference on (pp. 20-23). IEEE.
9. Alamaniotis, M., & Tsoukalas, L.H., “Implementing Smart Energy Systems: Integrating Load
and Price Forecasting for Single Parameter based Demand Response,” IEEE PES Innovative
Smart Grid Technologies, Europe (ISGT 2016), Ljubljana, Slovenia, October 9-12, 2016, pp.
1–6.
10. Beaver, J. M., Borges-Hink, R. C., & Buckner, M. A. (2013, December). An evaluation of
machine learning methods to detect malicious SCADA communications. In Machine Learning
and Applications (ICMLA), 2013 12th International Conference on (Vol. 2, pp. 54–59). IEEE.
11. Kesler, B. (2011). The vulnerability of nuclear facilities to cyber attack. Strategic Insights,
10(1), 15–25.
12. Lee, R. M., Assante, M. J., & Conway, T. (2016). Analysis of the cyber attack on the Ukrainian
power grid. SANS Industrial Control Systems.
13. NUREG/CR-6882, (2006). Assessment of wireless technologies and their application at
nuclear facilities. ORNL/TM-2004/317.
14. Song, J. G., Lee, J. W., Lee, C. K., Kwon, K. C., & Lee, D. Y. (2012). A cyber security risk
assessment for the design of I&C systems in nuclear power plants. Nuclear Engineering and
Technology, 44(8), 919–928.
15. Goel, S., Hong, Y., Papakonstantinou, V., & Kloza, D. (2015). Smart grid security. London:
Springer London.
16. Mo, Y., Kim, T. H. J., Brancik, K., Dickinson, D., Lee, H., Perrig, A., & Sinopoli, B. (2012).
Cyber–physical security of a smart grid infrastructure. Proceedings of the IEEE, 100(1),
195–209.
17. Lu, Z., Lu, X., Wang, W., & Wang, C. (2010, October). Review and evaluation of security
threats on the communication networks in the smart grid. In Military Communications
Conference, 2010-MILCOM 2010 (pp. 1830–1835). IEEE.
18. Dondossola, G., Szanto, J., Masera, M., & Nai Fovino, I. (2008). Effects of intentional threats
to power substation control systems. International Journal of Critical Infrastructures, 4(1-2),
129–143.
19. Taylor, C., Krings, A., & Alves-Foss, J. (2002, November). Risk analysis and probabilistic
survivability assessment (RAPSA): An assessment approach for power substation hardening.
In Proc. ACM Workshop on Scientific Aspects of Cyber Terrorism,(SACT), Washington DC
(Vol. 64).
20. Ward, S., O’Brien, J., Beresh, B., Benmouyal, G., Holstein, D., Tengdin, J.T., Fodero, K.,
Simon, M., Carden, M., Yalla, M.V. and Tibbals, T., 2007, June. Cyber Security Issues for
Protective Relays; C1 Working Group Members of Power System Relaying Committee. In
Power Engineering Society General Meeting, 2007. IEEE (pp. 1–8). IEEE.
21. Alamaniotis, M., Chatzidakis, S., & Tsoukalas, L.H., “Monthly Load Forecasting Using Gaus-
sian Process Regression,” 9th Mediterranean Conference on Power Generation, Transmission,
Distribution, and Energy Conversion: MEDPOWER 2014, November 2014, Athens, Greece,
pp. 1–7.
22. Qiu, M., Gao, W., Chen, M., Niu, J. W., & Zhang, L. (2011). Energy efficient security algorithm
for power grid wide area monitoring system. IEEE Transactions on Smart Grid, 2(4), 715–723.
23. Bishop, C.M. Pattern Recognition and Machine Learning, New York: Springer, 2006.
24. Alamaniotis, M., Ikonomopoulos, A., & Tsoukalas, L.H., “Probabilistic Kernel Approach to
Online Monitoring of Nuclear Power Plants,” Nuclear Technology, American Nuclear Society,
vol. 177 (1), January 2012, pp. 132–144.
25. C.E. Rasmussen, and C.K.I. Williams, Gaussian Processes for Machine Learning, Cambridge,
MA: MIT Press, 2006.
26. D.J.C. Mackay, Introduction to Gaussian Processes, in C. M. Bishop, editor, Neural Networks
and Machine Learning, Berlin: Springer-Verlag, 1998, vol. 168, pp. 133–155.
27. Tsoukalas, L.H., and R.E. Uhrig, Fuzzy and Neural Approaches in Engineering, Wiley and
Sons, New York, 1997.
28. Alamaniotis, M, & Agarwal, V., “Fuzzy Integration of Support Vector Regressor Models for
Anticipatory Control of Complex Energy Systems,” International Journal of Monitoring and
Surveillance Technologies Research, IGI Global Publications, vol. 2(2), April-June 2014, pp.
26–40.
29. Consortium for Intelligent Management of Electric Power Grid (CIMEG), http://
www.cimeg.com
30. Alamaniotis, M., & Tsoukalas, L., “Layered based Approach to Virtual Storage for Smart
Power Systems,” in Proceedings of the 4th International Conference on Information, Intelli-
gence, Systems and Applications, Piraeus, Greece, July 2013, pp. 22–27.
31. Alamaniotis, M., Tsoukalas, L. H., & Bourbakis, N. (2014, July). Virtual cost approach:
electricity consumption scheduling for smart grids/cities in price-directed electricity markets.
In Information, Intelligence, Systems and Applications, IISA 2014, The 5th International
Conference on (pp. 38–43). IEEE.
32. Tsoukalas, L. H., & Gao, R. (2008, April). From smart grids to an energy internet: Assump-
tions, architectures and requirements. In Electric Utility Deregulation and Restructuring and
Power Technologies, 2008. DRPT 2008. Third International Conference on (pp. 94–98). IEEE.
33. Tsoukalas, L. H., & Gao, R. (2008, August). Inventing energy internet: The role of anticipation
in human-centered energy distribution and utilization. In SICE Annual Conference, 2008 (pp.
399–403). IEEE.
34. Alamaniotis, M., & Tsoukalas, L. H. (2016, June). Multi-kernel anticipatory approach to
intelligent control with application to load management of electrical appliances. In Control
and Automation (MED), 2016 24th Mediterranean Conference on (pp. 1290–1295). IEEE.
Lefteri H. Tsoukalas is professor and former head of the School of Nuclear Engineering at
Purdue University and has held faculty appointments at the University of Tennessee, Aristotle
University, Hellenic University and the University of Thessaly. He has three decades of experience
in smart instrumentation and control techniques with over 200 peer-reviewed research publications
including the textbook “Fuzzy and Neural Approaches in Engineering” (Wiley, 1997). He directs
the Applied Intelligent Systems Laboratory, which pioneered research in the intelligent management of the electric power grid through price-directed, demand-side management approaches, and
anticipatory algorithms not constrained by a fixed future horizon but where the output of predictive
models is used over a range of possible futures for model selection and modification through
machine learning. Dr. Tsoukalas is a Fellow of the American Nuclear Society. In 2009 he was
recognized with the Humboldt Prize, Germany's highest honor for international scientists.
Visualization and Data Provenance Trends
in Decision Support for Cybersecurity
Abstract The vast amount of data collected daily from logging mechanisms on
web and mobile applications lacks effective analytic approaches to provide insights
for cybersecurity. The analytical time currently taken to identify zero-day attacks and
respond with a patch or detection mechanism is hard to measure. This is a current
challenge and struggle for cybersecurity researchers.
centric approaches are the growing trend in aiding defensive and offensive decisions
on cyber-attacks. In this chapter we introduce (1) our Security Visualization
Standard (SCeeL-VisT); (2) the Security Visualization Effectiveness Measurement
(SvEm) Theory; (3) the concept of Data Provenance as a Security Visualization
Service (DPaaSVS); and (4) highlight growing trends of using data provenance
methodologies and security visualization methods to aid data analytics and decision
support for cyber security. Security visualization showing provenance from a
spectrum of data samples on an attack helps researchers to reconstruct the attack
from source to destination. This helps identify possible attack patterns and behaviors,
which results in the creation of effective mechanisms to detect and counter cyber-attacks.
1 Introduction
While network logging and analytical methods support data analytics, the data collected
from modern threats and attacks are growing rapidly and new malicious attacks
are more sophisticated. This requires better security approaches, methods and
solutions to help understand them. Data provenance and security visualization are
the growing trend in cyber security solutions [5–7]. Data captured from cyber-attacks
of interest creates the ability to reconstruct malicious attack landscapes and
attribute an attack to its origin. In addition, the ability to track files end-to-end,
from creation to deletion, provides better decision support to cyber security experts
and researchers [6, 7]. This helps security experts to identify malicious patterns
and behaviors, as a result of which better conclusions are drawn and effective
security measures, such as security patches and defensive mechanisms, are implemented.
Unlike the authors of the "ILoveYou" worm in 2000, hackers and malware developers are getting
smarter, implementing cyber-attacks not only for fame or revenge, but also as
targeted attacks in the form of phishing attacks, organized cybercrime attacks and
nation-state attacks [8]. For example, the STUXNET (zero-day) attack on
the Natanz uranium enrichment plant in Iran was regarded as the first digital
weapon [9, 10]. Another example is the Sony Pictures data leak, which is believed
to have been instigated by a nation state [11, 12]. Such cyber-threats require urgent and
intelligent security methods to help protect systems and networks. However, once
attacks have penetrated systems and networks, intelligent data
analytics become highly desirable. Security visualization as a data analytics method
is a go-to technique whereby security experts not only investigate cyber-attacks
using visualization but also visually interpret how the attacks occur, thereby
reducing the time spent analyzing attack datasets [13–16]. This in turn
provides better decision support for the entire cyber security realm.
and applications that draw effective conclusions aiding decision support for cyber
security. For example, effective security technologies that have tracking, monitoring
and reporting capabilities. There is a need to harmonize threat intelligence technologies
with time-based provenance technologies to provide effective and precise
findings.
The focus of this chapter is on data provenance, security visualization techniques,
effective visualization measurement techniques, and cyber security standards
and applications to aid decision support for cyber security. Although data provenance
is defined in many ways depending on its niche area of interest, in this chapter data
provenance is defined as a series of chronicles and the derivation history of data on
meta-data [22, 23]. The ability to track data from the state of creation to deletion, and to
reconstruct its provenance to explore prominent cyber-attack patterns at any given
time, is the prime approach of this chapter. This is done using data collected from
logging applications and security visualization [6, 7, 22].
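To make the definition concrete, the following minimal sketch models a provenance chain as a series of chronicles over meta-data, tracking one object from creation to deletion; the schema and field names are our own illustrative assumptions, not a schema prescribed by this chapter.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class ProvenanceEvent:
    # One chronicle in the derivation history of a piece of data.
    timestamp: datetime
    actor: str      # user or process responsible for the event
    action: str     # e.g. "create", "modify", "copy", "delete"

@dataclass
class ProvenanceChain:
    # Creation-to-deletion history of one object; replaying the events
    # reconstructs an attack from source to destination.
    target: str
    events: List[ProvenanceEvent] = field(default_factory=list)

    def record(self, actor: str, action: str) -> None:
        self.events.append(
            ProvenanceEvent(datetime.now(timezone.utc), actor, action))

chain = ProvenanceChain("payroll.xlsx")
chain.record("alice", "create")
chain.record("malware.exe", "modify")   # anomalous actor stands out on replay
chain.record("malware.exe", "delete")
```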
This chapter elaborates on incorporating security visualization and data prove-
nance mechanisms aiding data analytics and decision support for cybersecurity. The
benefits of incorporating security visualization and data provenance mechanisms are
as follows:
• It shares precise insights drawn from visually analyzing collected data (systems,
networks and web data).
• It also provides a comparison between existing cyber security standards and
establishes a new security visualization standard to aid users.
Several use cases from the Tableau and Linkurious visualization platforms are used
in this chapter, in Sect. 4.1. In addition, we emphasize the inclusion
of security visualization in the law enforcement domain. We provide an overview
of our new security visualization standard and further discuss the benefits of threat
intelligence tools. Finally, security visualization provides a foolproof user-
centered reporting methodology for all levels of audience (CEOs, management and
ordinary employees).
This chapter is organized as follows: Sect. 2 offers background knowledge on
cyber security technologies; Sect. 3 identifies common areas where cyber security
technologies exist; Sect. 4 shows how cybersecurity technologies contribute to
'Decision Support' for cyber security; Sect. 5 provides our main research contribution,
which is the establishment of a new 'Security Visualization
Standard'; Sect. 6 proposes our 'Security Visualization Effectiveness Measurement
(SvEm)' theory, the concept of providing 'Data Provenance as a Security Visualization
Service (DPaaSVS)', and user-centric security visualization; and Sect. 7
provides the concluding remarks for this chapter.
2 Background
In general, data analytics for cyber security is widely used for exploring and report-
ing, particularly when analyzing threat landscapes, vulnerabilities, malware and
implementing better detection mechanisms. Situation awareness is a prime reporting
Data analytics is widely used alongside business intelligence (BI&A) and in big
data analytics [24, 25]. It is an important area of study for both researchers and
industries, with the intention of exploring and reporting data-related problems and
finding solutions. As the Internet grows, there is an exponential increase in the type
and frequency of cyber-attacks [27]. Sources ranging from data warehouses to
video streaming and tweets generate huge amounts of complex digital data. Cloud
technologies provide scalable solutions for big data analytics with efficient means
of information sharing, storage and smart applications for data processing [26, 28].
Gartner estimated that by 2016 more than 50% of large companies' data would be
stored in the cloud [27, 29]. Big data analytics using data mining algorithms that
require powerful processing and huge storage space is an increasingly
common trend. It has reporting mechanisms and often visual dashboards. However,
because CEOs and upper-level managers are not always tech-savvy, the lack of clarity
and the complexity of the information acquired make comprehensive reporting
on such analytics a difficult task for cyber security experts. This is a challenge
which often raises concerns in decision-making situations. For example, the magnitude
of a data breach and the assessment process are often underestimated and not reported
clearly. This affects the mitigation processes put in place to resolve the situation. As a result,
the organization's reputation can be at stake and such organizations are vulnerable
to cyber-attacks.
A major application in big data analytics is parallel and distributed systems. These
coexist as part of the entire cloud infrastructure to sustain data volumes exceeding
exabytes and the rapid rate of increase in data size [30]. The need to frequently
increase processing power and storage volumes is a critical factor for cloud
infrastructures. In addition, security, fault-tolerance and access control are
critical for many applications [30]. Continuous security techniques are built to
maintain these systems. This is yet another key decision factor for cyber security
frameworks in organizations and industries. Cloud technologies also provide scalable
software platforms for smart grid cyber-physical systems [31]. These
platforms provide adaptive information assimilation channels for ingesting dynamic
data and a secure data repository for industries and researchers [31, 32]. It is a trend
among power and energy companies, for which data has become valuable, and further
investigation of the data for insights is required and relevant for better service
delivery. While information sharing and visual data analytics are useful features of
these systems, data security is still a major concern. With current sophisticated cyber-
attacks involving phishing or social engineering elements, customer data and utility
data are the main targets [33, 34].
The technological shift from the common use of desktop computers to
mobile platforms and cloud technologies has expanded the cyber-threat landscape
[45–47]. New urgent needs have emerged as the digital economy has matured
over the past decade. Businesses and consumers are more dependent than ever on
information systems [48]. This has contributed to how cyber security has evolved
over the past decade. The cyber security focus of the past decade, for researchers and
industries alike, can be summed up with the list below [45–48]:
1. Endpoint Detection and Response: These technologies include intrusion
detection systems [36] and provide the ability to frequently analyze the network
and identify systems or applications that might be compromised. With
endpoint detection mechanisms, responsive steps can be taken to mitigate
cyber-attacks [35].
2. Network and Sensors for Vulnerabilities: Such technologies provide flexibility
to both users and network operators. Whether wireless or wired, the sensor
network has multiple preset functions, such as sensing and processing, to enable
multiple application goals [49]. Sensor nodes are capable of monitoring the
network area with the aim of identifying and detecting security events of interest
and reporting them to a base station deployed on the network.
3. Cloud Security: The emerging cloud technologies offer a wide range
of services including networks, servers, storage and a range of applications and
services [37]. However, they have also brought in a new range of security challenges,
which have contributed to how cloud technologies have transformed over the
past 10 years.
4. Cyber Threat Intelligence: Cyber threat intelligence technologies profile a
holistic approach to automated sharing, real-time monitoring, intelligence
gathering and data analytics [38]. Organizations are emphasizing cyber
threat intelligence capabilities and information sharing infrastructure to enable
communications and trading between partners [39].
5. Hardware & Software Security: This includes providing security for hardware
products and software products. Current trends indicate that hardware and
software technologies have added capabilities and functions which require
security components to safeguard networks and applications.
6. Security Tools: Security tools generally cover applications which are often
used for securing systems and networks, e.g. penetration testing tools, vulnerability
scanners and antivirus software.
7. Security Tracking and Digital Auditing: These technologies focus on auditing
purposes, especially observing and recording changes in systems and networks.
Tracking the configuration changes of a computerized device and recording
system modifications are some examples [40]. Other security tracking
purposes include geo-location tracking and monitoring the operational variables and
outputs of specific devices of interest [44].
8. User and Behavioral Analytics: These technologies emphasize understanding
user behaviors, behavioral profiles and end users. Security concerns over
targeting inappropriate audiences are some of the issues encountered with these
technologies [41].
9. Context-aware Behavioral Analytics: Context-aware technologies provide
application mechanisms that are able to adapt to changing contexts and
modify their behavior to suit the user's needs, e.g. smart homes built with a
context-aware application that can alert a hospital if a person urgently requires
medical attention [42].
10. Fraud Detection Services: As fraud cases become more common, computer-based
systems are designed to alert financial institutions based on set fraud
conditions used to analyze card-holder debits. These systems also identify 'at
risk' cards which are possessed by criminals [43].
11. Cyber Situational Awareness: Recent research and surveys have shown that the
rise of cyber criminal activities has triggered the need to implement situational
awareness technologies. These include the use of surveys, data analytics and
visualization technologies.
12. Risk Management Analytics: Risk management analytics are methodologies
used by organizations as part of their business strategies to measure how well
their business can handle risks. These technologies also allow organizations to
identify better approaches to mitigating risks.
Research and industry interests mainly target end-users, collected data,
security tools, threat intelligence and behavioural analytics [46–48, 50]. For example,
in context-aware behavioral analytics technologies we can witness techniques such
as mobile location tracking, behavioral profiling, third-party big data, external threat
intelligence and bio-printing technologies. These are founded on the principle of
unusual behavior or nefarious events [50]. Overall, popular cyber security technologies
are summarized in the categories shown in Fig. 1 [50]. Honeypots
are examples of active defense measures. Cloud-based applications and BYOD
technologies are far beyond the realm of traditional security and firewalls. They
are well suited for the cloud, and Security Assertion Markup Languages (SAML)
shown in Fig. 3. Tableau has capabilities for loading raw data, and dashboard display
columns that can accommodate different visual reports of a given collected dataset.
This is illustrated with the "Vis Area 1, 2, and 3" labels in Fig. 3.
Linkurious, a web-based graph data visualization platform, has capabilities to
visualize complex data from security logs (network, system and application logs) [56].
It is a graph data visualization platform for cyber security threat analysis with the aims of
(1) aiding decision support and (2) providing intelligence to support the decisions
made [63]. Several Linkurious threat visualization analyses are shown in
Figs. 4, 5 and 6.
Figure 4 shows a Linkurious visual analysis of "normal network behaviour."
With such known visual patterns of what normal network behaviour is, any future
change in the visual pattern will prompt security experts to investigate further.
For example, by comparing the visual analytics pattern in Fig. 5 (a UDP storm attack)
with Fig. 4, cyber security experts can effectively spot the difference and conclude
that Fig. 5 shows possibly malicious network traffic. A storm of incoming network
packets targeting the IP 172.16.0.11, as indicated by the direction of the arrows
in the visualization, clearly indicates a denial-of-service (DoS) attack.
Linkurious also has the capability of representing data relationships between
entities. In Fig. 6, visual representations of IP addresses are shown, indicating data
packet movement from the source IP address to the destination IP address.
Pentaho, a business analytics visualization platform, visualizes data across
multiple dimensions with the aim of minimizing dependence on IT [58]. It has
user-centric capabilities of drill through, lasso filtering, zooming and attribute
highlighting to enhance user experiences with reporting features. It provides user
interactions with the aim of allowing users to explore and analyze the desired
dataset.
Although there is a vast range of cyber security tools and software applications
publicly available, there is still a need for data analytics and decision support
mechanisms targeting specific cyber security landscapes. This is because users
(CEOs, management and analysts) may come from different security backgrounds
and exhibit different levels of knowledge. User-centric cyber security tools
and techniques targeting specific audiences require a specific cyber security
standard. Cyber security decision support standards are important to guide and
harbour all reporting processes. Security researchers and experts leverage such
standards in order to be on the same page regarding how a security event is
presented using visualization. This means creating a security visualization standard
would provide a scope and guideline to help reduce the time spent on producing and
presenting security visualizations. Therefore, less time will be spent on assessing
a given security visualization to gain the most insight into a security visual
event. Although current academic research centers and security enterprises have
implemented in-house cyber security standards that suit their research aims, projects
and enterprise operations, the challenge and complexity of security visualization
reports manifest in difficulties understanding visualization reports. In the
following subsection, we present various cyber security standards and assess them
with the purpose of highlighting how cyber security standards help
manage and reduce possible cyber-attacks.
On a broader coverage of the Standard, a way forward for cyber security researchers
and industry security experts is to establish security visualization policies and
guidelines. These guidelines are shown in Table 2.
The purpose of developing the SCeeL-VisT standard is to put emphasis on common
ground in security visualization development, presentation, and the understanding
of security events. Because data visualization is widely used across
many research domains and industries, complying with a security visualization
standard is of vital importance. This is due to the high frequency with which sensitive
data must be dealt with. In addition, current security visualizations are
tailored towards specific audiences, making it difficult for other interested users
to understand the insights presented in a given visualization. Finally, the most crucial
purpose of having a standard like SCeeL-VisT is its contribution to making effective
decisions in cyber security. Such standards create a clear and precise scope both for
cyber security experts and for what security visualizations should show in relation to
the raw data used.
cybercrimes. Therefore the cyber security trend for law enforcement can be scaled
down to three main technology categories: (1) Attribution Tools; (2) Cyber Threat
Intelligence; and (3) Secure Information Sharing Technologies (SIST). However, the
rate at which malware and ransomware are created versus the implementation of
new cybersecurity tools to combat these cyber-threats is far out of proportion. This
raises a concern for law enforcement and cyber security across international borders.
Existing cybercrime challenges for law enforcement range from malware attacks
and ransomware to terrorism [64]. Security approaches such as data analytics and
threat intelligence methodologies are some of the steps taken by law enforcement
research domains to help improve investigation technologies. Because
cybercrime has broadened its attack environment beyond the technology
domain, penetrating others such as the health and financial domains, the
fight against cybercrime has taken law enforcement in directions that
require external and international collaboration. This means information sharing
legislation has to be revisited, and policies and guidelines have to be implemented.
The rise of dark-market trading between cybercriminals on the dark web allows
them to produce and escalate cyber-attacks at a rate that leaves law enforcement,
academia and industry researchers facing serious cyber security challenges. However,
associating cybercriminals' use of digital currencies (Bitcoin) with cybercrimes
triggers the need for new threat intelligence tools, data analytics tools and
vulnerability scanners that allow the effective execution of investigations.
Threat intelligence tools and 24/7 live-feed monitoring applications allow law
enforcement agencies to carry out their investigations effectively, especially for
transnational cybercrimes where international collaboration is required. The
capability of sharing information between international law enforcement agencies
without sharing the underlying raw data is in high demand.
Bitcoin¹ [65] is a peer-to-peer distributed electronic cash system that utilizes the
blockchain² protocol [66, 67]. Because Bitcoin payment transactions operate
over the Internet in a decentralized, trustless system, law enforcement agencies
are seeking ways to aid their investigations, especially to track and monitor Bitcoin
movements involved in cybercrime events. The ability to provide visual
threat intelligence on Bitcoin transactions using blockchain tools is one of the
ways forward in fighting cybercrime. Knowing where the money flows to and from known
cybercriminals allows law enforcement and cyber security investigators to track
cybercrime events.
¹ Bitcoin is a distributed, decentralized crypto-currency which implicitly defined and implemented
the Nakamoto consensus.
² Blockchain is a public ledger of all Bitcoin transactions that have been executed, in a linear and
chronological order.
Prior to this section, cyber security technology trends and use cases, including
security visualization, were analyzed and discussed. These are part of the growing trend
of effective methodologies for addressing cyber security issues (cybercrime).
However, to improve security visualization technologies, we have to look into 'how'
and 'what' makes security visualization appealing to its viewers.
Measuring how effective, useful and appropriate security visualizations are to users
provides researchers and developers with ways of improving security visualization
technologies. Before we discuss our new 'effective visualization measurement technique',
a summary of existing effectiveness measurement techniques is outlined in
Table 3.
The SvEm³ theory aims to minimize the time (duration) spent on viewing a
visualization and making sense of the intended insights portrayed in it.
The components of the SvEm theory are:
1. Mobile platform screen surface area (w * h): This refers to the surface area
used to display a security visualization. Screen sizes have a great impact on how
visualizations appear.
2. Security Visual Nodes (Svf): These are known security attributes identified in a
visualization, e.g. a malicious or infected IP address.
3. N-dimensions: This refers to how many visual dimensions are used to
represent the visualization. A higher number of dimensions indicates a greater
depth of data that can be represented visually.
4. User Cognitive Load (Cl): This is based on how much (prior) knowledge
a user has of the expected visualization. It is prerequisite security
knowledge around expected security events, such as a malware cyber-attack.
5. Memory Efficiency (tme): This is a time-based attribute which measures how fast
one can recall security-related attributes.
6. Number of Clicks (nclicks): This refers to how many clicks one has to perform on
the mobile platform screen in order to view the required visualization.
3 The Security Visualization effectiveness Measurement (SvEm) theory, designed for mobile
platforms and measured in percent (%), provides a way to measure clarity and visibility in a given
security visualization.
SvEm = [((w * h) / Svf ) * dn ] / (Cl * tme * nclicks ) > 50% (Distortion)    (1)
Where:
w * h : Mobile Display Area (dimensions)
Svf : Security Visual Nodes (e.g. Infected-IP, Timestamp, etc.)
dn : n-dimensional view of the security visualization
Cl : Cognitive Load (identifiable attributes (quantity) based on prior knowledge)
tme : Memory Efficiency (effort based on working memory; time-based)
nclicks : Number of clicks on the visualization
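As a concrete, hypothetical reading of Eq. (1), the following Python sketch computes an SvEm score from the components above; the function name svem_score, the example values and their units are our illustrative assumptions, since the theory does not fix them here.

```python
def svem_score(w, h, svf, dn, cl, tme, nclicks):
    """Sketch of Eq. (1): SvEm = ((w*h)/Svf * dn) / (Cl * tme * nclicks).

    w, h    : mobile display dimensions (e.g. pixels)
    svf     : number of security visual nodes identified
    dn      : number of visual dimensions in the view
    cl      : cognitive load (identifiable attributes vs. prior knowledge)
    tme     : memory efficiency (time to recall security attributes)
    nclicks : clicks needed to reach the visualization
    """
    return ((w * h) / svf * dn) / (cl * tme * nclicks)

# Hypothetical example: a 1080x1920 display showing 500 visual nodes in a
# 3-dimensional view, with moderate cognitive load, 2 s recall, 2 clicks.
score = svem_score(1080, 1920, svf=500, dn=3, cl=50, tme=2, nclicks=2)
print(f"SvEm = {score:.1f}%")  # scores above 50% indicate low distortion
```

Under this reading, larger displays and higher-dimensional views raise the score, while higher cognitive load, slower recall and more clicks lower it.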
Based on this theory, we have observed that the factors contributing most to a
high SvEm value are: (1) w * h, the smartphone dimensions, and (2) dn , the
n-dimensional view of the security visualization; for example, a 3-dimensional
visualization view has proven less distorted than a single-dimensional one. More
data is visible in higher n-dimensional visualization views, which improves the
user's (viewer's) ability to provide a higher count for Svf . A lower nclicks value
indicates less overall time spent on viewing the visualization, which contributes
to a higher effectiveness measurement outcome. However, the current focus is on
the following:
Many cyber security researchers are investing their efforts into finding technological
solutions for understanding cyber-attack events, defending against them and finding
ways to mitigate such cyber-attacks. A prominent method for understanding cyber-
attack events is the concept of 'Data Provenance', defined as "a series of chronicles
and the derivation history of data on meta-data" [7, 22]. Incorporating data
provenance into security visualization allows cyber security researchers and experts
to analyze cyber-attacks from their origins through to their current state, i.e. from
the instant the cyber-attack was found in the systems to the state of mitigation, or
even further to the 'post-digital-forensic' state of the cyber-attack. With the ability
to apply Data Provenance as a Security Visualization Service (DPaaSVS),
cybersecurity experts can better understand cyber-attacks and attack landscapes,
and can visually observe attack patterns, relationships and behaviors in a near
real-time fashion [73]. For example, Fig. 7 presents a Deep Node visualization of
network nodes, with patterns highlighted by colors, ring-nodes and lines of nodes
captured every 5 s [73]. Although provenance has been used as an exploratory
feature in existing visualization platforms, we present the concept of 'Data
Provenance as a Security Visualization Service' (DPaaSVS) and its features.
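To fix ideas before introducing DPaaSVS features, here is a minimal sketch of a chronologically ordered provenance chain recording the derivation history of one data object, so an attack can be traced from origin to current state. The ProvenanceRecord and ProvenanceChain types and their fields are illustrative assumptions, not the schema of Progger [7] or any other cited system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """One entry in the derivation history of a data object."""
    timestamp: datetime
    actor: str       # process or user that acted on the data
    action: str      # e.g. "created", "read", "modified", "exfiltrated"
    target: str      # the data object affected

@dataclass
class ProvenanceChain:
    records: list[ProvenanceRecord] = field(default_factory=list)

    def append(self, actor, action, target):
        # Records are appended in chronological order, preserving history.
        self.records.append(ProvenanceRecord(
            datetime.now(timezone.utc), actor, action, target))

    def history(self, target):
        """Trace a data object from its origin to its current state."""
        return [r for r in self.records if r.target == target]

chain = ProvenanceChain()
chain.append("userA", "created", "report.docx")
chain.append("malware.exe", "modified", "report.docx")
for r in chain.history("report.docx"):
    print(r.timestamp.isoformat(), r.actor, r.action, r.target)
```

A security visualization service would then render such chains visually, rather than printing them, so that analysts can see each actor and action along the data object's timeline.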
Fig. 7 Security visualization with Deep Node Inc. provenance time-based patterns
For example, the Deep Node security visualization provides its security analysts
with the ability to use Data Provenance as a Security Visualization Service
(DPaaSVS) by understanding network traffic based on nodes, in terms of past and
present node states [73, 74]. Visualized nodes are monitored through network
traffic via their corresponding communication ports. See Fig. 8.
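As a rough sketch of this past-versus-present node idea (an illustrative assumption, not Deep Node's actual implementation), the snippet below aggregates traffic per node and communication port into 5-second capture windows, mirroring the 5 s captures in Fig. 7; the packet tuples are invented sample data.

```python
from collections import defaultdict

WINDOW = 5  # seconds per capture window, as in Fig. 7

def window_traffic(packets):
    """Group packets into (window, node, port) buckets for visualization.

    `packets` is assumed to be an iterable of (timestamp, node, port,
    nbytes) tuples; real data would come from a sniffer or flow exporter.
    """
    buckets = defaultdict(int)
    for ts, node, port, nbytes in packets:
        buckets[(int(ts) // WINDOW, node, port)] += nbytes
    return buckets

# Hypothetical capture: comparing a node's past and present windows can
# reveal behavioral change, e.g. a sudden spike on an unusual port.
packets = [(0.5, "10.0.0.5", 443, 900), (1.2, "10.0.0.5", 443, 700),
           (6.1, "10.0.0.5", 6667, 5000)]
for (window, node, port), nbytes in sorted(window_traffic(packets).items()):
    print(f"window {window}: node {node} port {port} -> {nbytes} bytes")
```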
Security visualizations, like any other general visualization, have a purpose. Data
analytics, data exploration and reporting are the most common areas for security
visualization. However, the art of presenting the visualization to the viewers
(users) defines how useful it is. With targeted audiences, presenting visual insights
enables users to make smart and effective decisions. While there are a lot of aspects
7 Concluding Remarks
In summary, cyber security technologies are driven by Internet users' and industry
demands to meet the security requirements for securing systems and networks. As
technologies evolve, existing cyber security technology trends are often dictated
by smart, sophisticated cyber-attacks, causing cyber security experts to step up
their game in security research. This motivates research opportunities for data
provenance and security visualization technologies in aid of understanding cyber-
attacks and attack landscapes.
Due to the increasing number of cyber-attacks penetrating networks, existing
cyber security technologies for 'decision support' are directed mainly at data
analytics and threat intelligence. Understanding how to prevent, protect and defend
systems and networks is the prime reason for data analytics and threat intelligence
technologies. However, security visualization in the context of data provenance
and user-centric approaches is increasingly common, and is being driven into cyber
security technologies for smarter and more effective decision support reporting.
Finally, with specifically directed cyber security standards, policies and guidelines
for security visualization, effective decisions and conclusions can be reached with
minimal time required to react to, defend against and mitigate cyber-attacks.
Acknowledgements The authors wish to thank the Cyber Security Researchers of Waikato
(CROW) and the Department of Computer Science of the University of Waikato. This research is
supported by STRATUS (Security Technologies Returning Accountability, Trust and User-Centric
Services in the Cloud) (https://stratus.org.nz), a science investment project funded by the New
Zealand Ministry of Business, Innovation and Employment (MBIE). The authors would also like
to thank the New Zealand and Pacific Foundation Scholarship for the continuous support towards
Cyber Security postgraduate studies at the University of Waikato.
References
1. Orebaugh, Angela, Gilbert Ramirez, and Jay Beale. Wireshark & Ethereal network protocol
analyzer toolkit. Syngress, 2006.
2. Wang, Shaoqiang, DongSheng Xu, and ShiLiang Yan. “Analysis and application of Wireshark
in TCP/IP protocol teaching.” In E-Health Networking, Digital Ecosystems and Technologies
(EDT), 2010 International Conference on, vol. 2, pp. 269–272. IEEE, 2010.
3. Patcha, Animesh, and Jung-Min Park. “An overview of anomaly detection techniques: Existing
solutions and latest technological trends.” Computer networks 51, no. 12 (2007): 3448–3470.
4. Yan, Ye, Yi Qian, Hamid Sharif, and David Tipper. “A Survey on Cyber Security for Smart Grid
Communications.” IEEE Communications Surveys and tutorials 14, no. 4 (2012): 998–1010.
5. Tan, Yu Shyang, Ryan KL Ko, and Geoff Holmes. “Security and data accountability in
distributed systems: A provenance survey.” In High Performance Computing and Commu-
nications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing
(HPCC_EUC), 2013 IEEE 10th International Conference on, pp. 1571–1578. IEEE, 2013.
6. Suen, Chun Hui, Ryan KL Ko, Yu Shyang Tan, Peter Jagadpramana, and Bu Sung Lee.
“S2logger: End-to-end data tracking mechanism for cloud data provenance.” In Trust, Security
and Privacy in Computing and Communications (TrustCom), 2013 12th IEEE International
Conference on, pp. 594–602. IEEE, 2013.
7. Ko, Ryan KL, and Mark A. Will. “Progger: an efficient, Tamper-evident Kernel-space
logger for cloud data provenance tracking.” In Cloud Computing (CLOUD), 2014 IEEE 7th
International Conference on, pp. 881–889. IEEE, 2014.
8. Bishop, Matt. “Analysis of the ILOVEYOU Worm.” Internet: http://nob.cs.ucdavis.edu/classes/
ecs155-2005-04/handouts/iloveyou.pdf (2000).
9. D. Kushner, The Real Story of Stuxnet, IEEE Spectrum: Technology, Engineering, and Science
News, 26-Feb-2013. [Online]. Available: http://spectrum.ieee.org/telecom/security/the-real-
story-of-stuxnet.
10. K. Zetter, An Unprecedented Look at Stuxnet, the World's First Digital
Weapon, WIRED. [Online]. Available: https://www.wired.com/2014/11/countdown-to-zero-
day-stuxnet/.
11. Rigby, Darrell, and Barbara Bilodeau. “Management tools & trends 2011.” Bain & Company
Inc (2011).
12. Bonner, Lance. “Cyber Risk: How the 2011 Sony Data Breach and the Need for Cyber Risk
Insurance Policies Should Direct the Federal Response to Rising Data Breaches.” Wash. UJL
& Pol’y 40 (2012): 257.
13. Siadati, Hossein, Bahador Saket, and Nasir Memon. “Detecting malicious logins in enterprise
networks using visualization.” In Visualization for Cyber Security (VizSec), 2016 IEEE
Symposium on, pp. 1–8. IEEE, 2016.
14. Gove, Robert. “V3SPA: A visual analysis, exploration, and diffing tool for SELinux and
SEAndroid security policies.” In Visualization for Cyber Security (VizSec), 2016 IEEE
Symposium on, pp. 1–8. IEEE, 2016.
15. Rees, Loren Paul, Jason K. Deane, Terry R. Rakes, and Wade H. Baker. “Decision support for
cybersecurity risk planning.” Decision Support Systems 51, no. 3 (2011): 493–505.
16. Teoh, Soon Tee, Kwan-Liu Ma, and S. Felix Wu. “A visual exploration process for the analysis
of internet routing data.” In Proceedings of the 14th IEEE Visualization 2003 (VIS’03), p. 69.
IEEE Computer Society, 2003.
17. Wang, Lingyu, Sushil Jajodia, Anoop Singhal, and Steven Noel. “k-zero day safety: Measuring
the security risk of networks against unknown attacks.” In European Symposium on Research
in Computer Security, pp. 573–587. Springer Berlin Heidelberg, 2010.
18. Mansfield-Devine, Steve. “Ransomware: taking businesses hostage.” Network Security 2016,
no. 10 (2016): 8–17.
19. Sgandurra, Daniele, Luis Muñoz-González, Rabih Mohsen, and Emil C. Lupu. “Automated
Dynamic Analysis of Ransomware: Benefits, Limitations and use for Detection.” arXiv preprint
arXiv:1609.03020 (2016).
20. Davis, Thad A., Michael Li-Ming Wong, and Nicola M. Paterson. “The Data Security
Governance Conundrum: Practical Solutions and Best Practices for the Boardroom and the
C-Suite.” Colum. Bus. L. Rev. (2015): 613.
21. L. Widmer, The 10 Most Expensive Data Breaches | Charles Leach, 23-Jun-2015. [Online].
Available: http://leachagency.com/the-10-most-expensive-data-breaches/.
22. J. Garae, R. K. L. Ko, and S. Chaisiri, UVisP: User-centric Visualization of Data Provenance
with Gestalt Principles, in 2016 IEEE Trustcom/BigDataSE/ISPA, Tianjin, China, August
23–26, 2016, 2016, pp. 1923–1930.
23. Zhang, Olive Qing, Markus Kirchberg, Ryan KL Ko, and Bu Sung Lee. “How to track your
data: The case for cloud computing provenance.” In Cloud Computing Technology and Science
(CloudCom), 2011 IEEE Third International Conference on, pp. 446–453. IEEE, 2011.
24. Microsoft, 2016 Trends in Cybersecurity: A quick Guide to the Most Important Insights in
Security, 2016. [Online]. Available: https://info.microsoft.com/rs/157-GQE-382/images/EN-
MSFT-SCRTY-CNTNT-eBook-cybersecurity.pdf.
25. Chen, Hsinchun, Roger HL Chiang, and Veda C. Storey. “Business intelligence and analytics:
From big data to big impact.” MIS quarterly 36, no. 4 (2012): 1165–1188.
26. Durumeric, Zakir, James Kasten, David Adrian, J. Alex Halderman, Michael Bailey, Frank Li,
Nicolas Weaver et al. “The matter of heartbleed.” In Proceedings of the 2014 Conference on
Internet Measurement Conference, pp. 475–488. ACM, 2014.
27. Mahmood, Tariq, and Uzma Afzal. “Security analytics: Big data analytics for cybersecurity:
A review of trends, techniques and tools.” In Information assurance (ncia), 2013 2nd national
conference on, pp. 129–134. IEEE, 2013.
28. Talia, Domenico. “Toward cloud-based big-data analytics.” IEEE Computer Science (2013):
98–101.
29. C. Pettey and R. Van der Meulen, Gartner Reveals Top Predictions for IT Organizations
and Users for 2012 and Beyond, 01-Dec-2011. [Online]. Available: http://www.gartner.com/
newsroom/id/1862714. [Accessed: 01-Feb-2017].
30. Kambatla, Karthik, Giorgos Kollias, Vipin Kumar, and Ananth Grama. “Trends in big data
analytics.” Journal of Parallel and Distributed Computing 74, no. 7 (2014): 2561–2573.
31. Simmhan, Yogesh, Saima Aman, Alok Kumbhare, Rongyang Liu, Sam Stevens, Qunzhi Zhou,
and Viktor Prasanna. “Cloud-based software platform for big data analytics in smart grids.”
Computing in Science & Engineering 15, no. 4 (2013): 38–47.
32. Cuzzocrea, Alfredo, Il-Yeol Song, and Karen C. Davis. “Analytics over large-scale multidimen-
sional data: the big data revolution!.” In Proceedings of the ACM 14th international workshop
on Data Warehousing and OLAP, pp. 101–104. ACM, 2011.
33. Ericsson, Göran N. "Cyber security and power system communication - essential parts of a smart
grid infrastructure." IEEE Transactions on Power Delivery 25, no. 3 (2010): 1501–1507.
34. Khurana, Himanshu, Mark Hadley, Ning Lu, and Deborah A. Frincke. “Smart-grid security
issues.” IEEE Security & Privacy 8, no. 1 (2010).
35. Bejtlich, Richard. The practice of network security monitoring: understanding incident detec-
tion and response. No Starch Press, 2013.
36. Desai, Anish, Yuan Jiang, William Tarkington, and Jeff Oliveto. “Multi-level and multi-
platform intrusion detection and response system.” U.S. Patent Application 10/106,387, filed
March 27, 2002.
37. Mell, Peter, and Tim Grance. “The NIST definition of cloud computing.” (2011).
38. Burger, Eric W., Michael D. Goodman, Panos Kampanakis, and Kevin A. Zhu. “Taxonomy
model for cyber threat intelligence information exchange technologies.” In Proceedings of the
2014 ACM Workshop on Information Sharing & Collaborative Security, pp. 51–60. ACM,
2014.
39. Barnum, Sean. “Standardizing cyber threat intelligence information with the Structured Threat
Information eXpression (STIX).” MITRE Corporation 11 (2012).
40. O’Toole Jr, James W. “Methods and apparatus for auditing and tracking changes to an existing
configuration of a computerized device.” U.S. Patent 7,024,548, issued April 4, 2006.
41. Gerace, Thomas A. “Method and apparatus for determining behavioral profile of a computer
user.” U.S. Patent 5,848,396, issued December 8, 1998.
42. Gu, Tao, Hung Keng Pung, and Da Qing Zhang. “Toward an OSGi-based infrastructure for
context-aware applications.” IEEE Pervasive Computing 3, no. 4 (2004): 66–74.
43. Anderson, Douglas D., Mary E. Anderson, Carol Oman Urban, and Richard H. Urban. “Debit
card fraud detection and control system.” U.S. Patent 5,884,289, issued March 16, 1999.
44. Camhi, Elie. “System for the security and auditing of persons and property.” U.S. Patent
5,825,283, issued October 20, 1998.
45. L. Widmer, The 10 Most Expensive Data Breaches | Charles Leach, 23-Jun-2015. [Online].
Available: http://leachagency.com/the-10-most-expensive-data-breaches/.
46. SINET Announces 16 Most Innovative Cybersecurity Technologies of 2016 | Business
Wire, 19-Sep-2016. [Online]. Available: http://www.businesswire.com/news/home/
20160919006353/en/SINET-Announces-16-Innovative-Cybersecurity-Technologies-2016.
47. C. Pettey and R. Van der Meulen, Gartner Reveals Top Predictions for IT Organizations
and Users for 2012 and Beyond, 01-Dec-2011. [Online]. Available: http://www.gartner.com/
newsroom/id/1862714.
48. C. Heinl and E. EG Tan, Cybersecurity: Emerging Issues, Trends, Technologies and Threats in
2015 and Beyond. [Online]. Available: https://www.rsis.edu.sg/wp-content/uploads/2016/04/
RSIS_Cybersecurity_EITTT2015.pdf.
49. Kavitha, T., and D. Sridharan. “Security vulnerabilities in wireless sensor networks: A survey.”
Journal of information Assurance and Security 5, no. 1 (2010): 31–44.
50. B. Donohue, Hot Technologies in Cyber Security, Cyber Degrees, 03-Dec-2014.
51. Jeong, Jongil, Dongkyoo Shin, Dongil Shin, and Kiyoung Moon. “Java-based single sign-
on library supporting SAML (Security Assertion Markup Language) for distributed Web
services.” In Asia-Pacific Web Conference, pp. 891–894. Springer Berlin Heidelberg, 2004.
52. Groß, Thomas. "Security analysis of the SAML single sign-on browser/artifact profile." In
Computer Security Applications Conference, 2003. Proceedings. 19th Annual, pp. 298–307.
IEEE, 2003.
53. Rees, Loren Paul, Jason K. Deane, Terry R. Rakes, and Wade H. Baker. “Decision support for
cybersecurity risk planning.” Decision Support Systems 51, no. 3 (2011): 493–505.
54. T. Reuille, OpenGraphiti: Data Visualization Framework, 05-Aug-2014. [Online]. Available:
http://www.opengraphiti.com/.
55. McKenna, S., Staheli, D., Fulcher, C. and Meyer, M. (2016), BubbleNet: A Cyber
Security Dashboard for Visualizing Patterns. Computer Graphics Forum, 35: 281–290.
doi:10.1111/cgf.12904
56. Linkurious, Linkurious - Understand the connections in your data, 2016. [Online].
Available: https://linkurio.us/.
57. Tableau Software, Business Intelligence and Analytics | Tableau Software, 2017. [Online]. Available:
https://www.tableau.com/.
58. Pentaho Corporation, Data Integration, Business Analytics and Big Data | Pentaho, 2017. [Online].
Available: http://www.pentaho.com/.
59. Norse Attack Map, 2017. [Online]. Available: http://map.norsecorp.com/#/.
60. Kaspersky Cyberthreat real-time map, 2017. [Online]. Available: https://cybermap.kaspersky.
com/.
61. FireEye Cyber Threat Map, 2017. [Online]. Available: https://www.fireeye.com/cyber-map/
threat-map.html.
62. Cyber Threat Map, FireEye, 2017. [Online]. Available: https://www.fireeye.com/cyber-map/
threat-map.html.
63. Linkurious SAS, data visualization Archives, Linkurious - Understand the connections in your
data, 2015.
64. Interpol, Cybercrime / Crime areas / Internet / Home - INTERPOL, 2017. [Online].
Available: https://www.interpol.int/Crime-areas/Cybercrime/Cybercrime.
65. Nakamoto, Satoshi. “Bitcoin: A peer-to-peer electronic cash system.” (2008): 28.
66. Barber, Simon, Xavier Boyen, Elaine Shi, and Ersin Uzun. “Bitter to better: how to make
bitcoin a better currency.” In International Conference on Financial Cryptography and Data
Security, pp. 399–414. Springer Berlin Heidelberg, 2012.
67. Swan, Melanie. Blockchain: Blueprint for a new economy. O'Reilly Media, Inc., 2015.
68. IsecT Ltd, ISO/IEC 27001 certification standard, 2016. [Online]. Available: http://www.
iso27001security.com/html/27001.html.
69. ISO, ISO/IEC 27001 - Information security management, ISO, 01-Feb-2015. [Online]. Avail-
able: http://www.iso.org/iso/iso27001.
70. IsecT Ltd, ISO/IEC 27032 cybersecurity guideline, 2016. [Online]. Available: http://
iso27001security.com/html/27032.html.
71. Ware, Colin. Information visualization: perception for design. Elsevier, 2012.
72. Ramanauskaitė, Simona, Dmitrij Olifer, Nikolaj Goranin, Antanas Čenys, and Lukas
Radvilavičius. "Visualization of mapped security standards for analysis and use optimisation."
Int. J. Comput. Theor. Eng 6, no. 5 (2014): 372–376.
73. Deep Node, Inc, Why Deep Node?, Deep Node, Inc., 2016. [Online]. Available: http://www.
deepnode.com/why-deep-node/.
74. Deep Node, Inc, The Concept Deep Node, Inc., 2016. [Online]. Available: http://www.
deepnode.com/the-concept/.
Jeffery Garae is a PhD research student with the Cyber Security Researchers of Waikato
(CROW). As a PhD candidate, his research focus is on security visualization for mobile
platforms and user-centric visualization techniques and methodologies. He is also interested in
data provenance, threat intelligence, attribution, digital forensics, post-data analytics and cyber
security situation awareness. He values the importance of security in ICT. He was the first recipient
of the University of Waikato's Master of Cyber Security (MCS) degree, in 2014. He is currently
the Doctoral Assistant for the Cyber Security course at the University of Waikato. In the ICT and
security industry, he has many years of experience with systems and networks. As part
of his voluntary contribution to the Pacific Island countries, he serves as a security advisor and an
advocate for cyber security situation awareness.
Ryan K.L. Ko is Head of the Cyber Security Researchers of Waikato (CROW) and Senior
Lecturer with the University of Waikato. With CROW, he established NZ’s first cyber security
lab and graduate research programme in 2012 and 2013 respectively. He is principal investigator
of the MBIE-funded (NZ$12.23 million) STRATUS project. Ko co-established the NZ Cyber Security
Challenge in 2014. His research focuses on returning data control to users, and on challenges
in cloud computing security and privacy, data provenance, and homomorphic encryption. He is
also interested in attribution and vulnerability detection, focusing on ransomware propagation.
With 50 publications including 3 international patents, he serves on 6 journal editorial boards,
and as series editor for Elsevier’s security books. He also serves as the editor of ISO/IEC 21878
Security guidelines in design and implementation of virtualized servers. A Fellow of Cloud
Security Alliance (CSA), he is a co-creator of the (ISC)2 CCSP certification—the gold-standard
international cloud security professional certification. Prior to academia, he was an HP Labs lead
computer scientist leading innovations in HP's global security products. He is technical adviser for
the Ministry of Justice’s Harmful Digital Communications Act, NZ Minister of Communications
Cyber Skills Taskforce, LIC, CSA and Interpol.