Abu Odeh 2021


The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

A Novel AI-based Methodology for Identifying Cyber Attacks in Honey Pots


Muhammed AbuOdeh¹, Christian Adkins¹, Omid Setayeshfar², Prashant Doshi¹, Kyu H. Lee²
¹THINC Lab, ²Institute for Cyber Security and Privacy
Department of Computer Science, University of Georgia, Athens GA 30606
[email protected]

Abstract

We present a novel AI-based methodology that identifies phases of a host-level cyber attack simply from system call logs. System calls emanating from cyber attacks on hosts such as honey pots are often recorded in audit logs. Our methodology first involves efficiently loading, caching, processing, and querying system events contained in audit logs in support of computer forensics. Output of queries remains at the system call level and is difficult to process. The next step is to infer a sequence of abstracted actions, which we colloquially call a storyline, from the system calls given as observations to a latent-state probabilistic model. These storylines are then accurately identified with class labels using a learned classifier. We qualitatively and quantitatively evaluate methods and models for each step of the methodology using 114 different attack phases collected by logging the attacks of a red team on a server, on some likely benign sequences containing regular user activities, and on traces from a recent DARPA project. The resulting end-to-end system, which we call Cyberian, identifies the attack phases with a high level of accuracy illustrating the benefit that this machine learning-based methodology brings to security forensics.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Automatically identifying the phases of a cyber attack on a host is of significant import. It facilitates automated forensics, which leads to faster attack discovery, damage assessment, and ultimately prevention. It also represents a key step toward better understanding the intent of the attacker. This paves the way for the defender to engage the attacker in more effective cyber-deception techniques on honey pots. However, this capability is made difficult by the fact that the individual attack steps are often common between different types of attacks, and the attack activity, if logged, is buried within massive system call logs that are hard to sift through.

A step toward addressing this challenge is development of host-based intrusion and anomaly detection systems (Chevalier 2019), which alert defenders about anomalies in system behavior that may indicate malicious activity. For example, a recent technique named DeepLog (Du et al. 2017) demonstrated feasibility of using deep learning to detect anomalous behavior based on low-level log data. However, DeepLog and similar systems generally lack the capability to identify attack phases, which is usually left to a human security analyst.

We present a novel machine learning (ML) based methodology that automatically identifies phases of cyber attacks on a single host system such as a honey pot. Our methodology involves the following general steps: (i) Efficiently load, cache, process, query, and display system events from audit logs to support computer forensics. Output of queries should be provenance graphs, which can be processed for further analysis. (ii) Resulting trace is often still hard to parse and in need of further abstraction to facilitate analyses. Utilize a latent-state probabilistic model, which allows us to infer the most likely sequence of higher-level actions, which we call an attack storyline, while modeling system calls as observations. (iii) Finally, we seek to identify the attack phase based on the sequence of high-level actions inferred in the previous step. We view this step as a multi-class classification problem where each label is a phase of an attack.

Classifying cyber attacks has been explored before. In early work, Lippmann et al. (2000) analyzed intrusion detection performances of various rule-based systems on network log data containing both benign and malicious activity, which included labeling segments corresponding to low-level attack types. The results highlight the limitations facing rule-based detection toward attack identification. Bolzoni, Etalle, and Hartel (2009) paired anomaly detection with machine learning to classify attack types based on n-gram analysis of attack payloads. These and other similar works classifying types of attacks tend to be rule based and ignore the significant overlap between various attacks, which often share steps. Many face the well-documented rule-based limitation of being inflexible to changes in both benign and malicious behaviors in the dynamic realm of cyber security. Others focus on classifying an entire attack campaign (instead of intermediate attack phases) where accurate identification of later-occurring phases relies on successful identification of earlier phases.

Another system named HOLMES (Milajerdi et al. 2019) focuses on detecting anomalies as phases of advanced and persistent threats. HOLMES also uses a rule-based system for classifying attack phases and relies on a large amount of benign log training data to reduce false positives in the testing data. In contrast to the rigidity of such rule-based systems, our ML-based methodology can learn and adapt to identify several attack phases as long as some system call logs for
these phases exist to use during training. This is enabled by a key innovation of the methodology: the inference of attack storylines as an intermediate step.

To translate our methodology to practice, we qualitatively or quantitatively evaluate each step with candidate methods or models. An independent red team simulated several types of attacks on a server acting as a honey pot. This yielded a total of 114 attack phases, which included asset discovery and data exfiltration, among others. The pipeline of selected models leads to a system, which we call Cyberian, that performs very well in identifying the attack phases as measured by an F1-score of 90.31 ± 8.44. Additionally, we test on a few attack sequences discovered in a recent data set released by DARPA. An out-of-class evaluation of the learned models to test how well Cyberian can distinguish between malicious sequences and those that are likely benign is also performed.

Attack Phase Abstraction and Classification

Our methodology is useful to areas of cybersecurity called cyber forensics or cyber crime investigation. Tools for forensics predominantly rely on log records to understand behavior of attackers, whether at application level (web server logs) or at system level (system call logs). A key challenge for analyzing logs is their enormity: a typical computer produces more than 10K low-level logged events each minute. While higher level logging may produce more understandable events, these sacrifice detail and introduce heterogeneity due to diverse events emitted by different programs in the system. On the other hand, lower level system logs provide detail and homogeneity, but extracting meaningful abstractions from them has been difficult. Our initial focus is to perform the forensics on honey pots, which are hosts masquerading as important systems in a subnet intended to deceive attackers by consuming their attack resources. As these hosts are not in regular use, there is less benign activity and the log entries largely capture unauthorized intrusions and subsequent activity.

Provenance of System Calls

The first objective is to extract an attack-focused sequence of system calls from low-level system call logs and generate the provenance graph. Each sequence represents an attack phase, such as data exfiltration or privilege escalation. Generating the provenance graphs in this step is crucial to the rest of the methodology. We identify several desiderata that contribute toward ways for successfully generating these graphs:
• Schema flexibility by allowing varied log formats
• Online pattern matching or live monitoring and tracking on streaming log data
• Backward and forward tracing on provenance, especially on long sequences, along with a graphical representation
• Support for a granularity smaller than the process
• An intuitive query language and interface that goes beyond simple keyword-based filtering to isolate key events in logs
• Automated anomaly detection capabilities
• Open source version available to facilitate wider adoption

We utilize these desiderata to qualitatively compare several more prominent log analysis systems that are currently available. These include systems such as Carbon Black's LiveOps that are proprietary and commercially available only as well as Plaso and Elastic (Project 2019), AIQL (Gao et al. 2018b), SAQL (Gao et al. 2018a), and SOF ELK (Lewes 2019). Table 1 shows this comparison in a concise format; more details about each of the systems are available in its reference.

Title                  S   PG  F   O   G   Q   A
AIQL                   ×   ✓   ×   ×   ×   ✓   ×
Carbon Black LiveOps   ✓   ✓   ×   ×   ×   ✓   ✓
GrAALF                 ✓   ✓   ✓   ✓   ✓   ✓   ×
NoDoze                 ×   ✓   ×   ×   ×   ×   ✓
Plaso + Elastic        ×   ×   ✓   ✓   ×   ✓   ✓
SAQL                   ✓   ✓   ×   ×   ×   ✓   ×
SOF ELK                ×   ×   ✓   ✓   ×   ✓   ✓

Table 1: We denote S for streaming analysis, PG for provenance graph forward and backward tracing, F for flexibility in schema, O for open-source, G for supporting granularity smaller than process, Q for querying ability of the graph, and A for intelligent anomaly detection. × shows lack of support and ✓ shows limited or full support.

GrAALF (Setayeshfar et al. 2019), a new general-purpose query system, offers good support on several desiderata but lacks advanced anomaly detection, which makes some human involvement necessary. Regardless, it may serve well to instantiate this step of the methodology using GrAALF.

Inferring Attack Storylines

System call logs, even extracted by GrAALF, tend to be extensive and incomprehensible. For example, some large attack phase files contain 10,154 records occupying up to 2.2MB. This makes manually parsing and understanding attack steps tedious. Resulting traces are often still hard to parse and need further abstraction to facilitate analysis. Logged system calls may be perceived as a sequence of observable signals noisily emitted by a dynamic system as an adversary performs an attack. Thus, we may infer the most likely sequence of higher-level actions given these observations. This suggests modeling the probabilistic action recognition problem using classical latent-state graphical models. The sequence of abstracted actions is the most-likely explanation (MLE) inferred by the model for the observed sequence of system calls.

We note that the same system calls often appear in different attack sequences and perform similar functions. For example, the sequence involving sh executing a temporary process which then executes another shell is present in attack phases such as system reconnaissance and persistence. This sequence is indicative of the attacker's shell launching a temporary process which then starts another shell. Furthermore, we may not distinguish between various temporary processes for the purpose of generally understanding and identifying attack phases. Consequently, a relatively small number of distinct abstracted actions suffice to comprehensibly explain various attack phases at a higher level. These abstracted actions form hidden states of latent models, and a categorized listing of some candidate states is given in Table 2. For a given (cleaned) sequence of system calls, we refer to the sequence of states inferred by the model as its storyline.
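
Inferring a storyline as the most likely explanation can be illustrated with a small Viterbi decoder over a toy latent-state model. The sketch below is only an assumption-laden illustration: it uses two of the candidate state names, three made-up observation triples, and invented probabilities, none of which are taken from the trained model or the data set.

```python
import math

# Hypothetical toy model: two of the candidate states and three made-up
# observation triples. Every probability here is an illustrative assumption.
STATES = ["Bash execute process", "System information"]
START = {"Bash execute process": 0.7, "System information": 0.3}
TRANS = {
    "Bash execute process": {"Bash execute process": 0.4, "System information": 0.6},
    "System information": {"Bash execute process": 0.3, "System information": 0.7},
}
EMIT = {
    "Bash execute process": {
        "bash->exec->perl": 0.8,
        "perl->read->/proc/version": 0.1,
        "perl->read->/etc/passwd": 0.1,
    },
    "System information": {
        "bash->exec->perl": 0.1,
        "perl->read->/proc/version": 0.5,
        "perl->read->/etc/passwd": 0.4,
    },
}


def viterbi(observations, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return (log probability, state sequence) of the most likely explanation."""
    # best[s] holds the best log probability and path among paths ending in s.
    best = {
        s: (math.log(start[s]) + math.log(emit[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev = max(states, key=lambda p: best[p][0] + math.log(trans[p][s]))
            score = best[prev][0] + math.log(trans[prev][s]) + math.log(emit[s][obs])
            step[s] = (score, best[prev][1] + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


calls = ["bash->exec->perl", "perl->read->/proc/version", "perl->read->/etc/passwd"]
log_prob, storyline = viterbi(calls)
print(storyline)
```

The returned state sequence plays the role of the storyline for the toy input. The sketch assumes a non-empty observation sequence in which every observation has a nonzero emission probability in every state.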

Category                  State
Start states              Bash execute process
                          Execute process (by non bash process)
                          Init server daemon
System operation states   System operation
                          Generic process read write
                          Generic process operation
                          Netapp operation
                          Bash operation
                          Server daemon operation
                          File operation
Information states        System information
                          Library state
Network states            Process upload download
                          Netapp upload download
                          Daemon upload download
                          Server download
                          Server upload

Table 2: Categorized list of candidate states for latent-state probabilistic model representing abstracted actions. Column four of Table 4 relates some states to system call events.

Probabilistic graphical models such as hidden Markov models (HMM) (Rabiner 1989) and conditional random fields (CRF) (Lafferty, McCallum, and Pereira 2001) can infer abstracted actions corresponding to system calls as these models allow a sequence to be tagged with labels based on context. While HMMs and linear-chain CRFs share many similarities, HMMs tend to be simpler to learn from data and do not require handcrafted feature functions characteristic of CRFs.

Prior use of probabilistic models such as HMMs in cybersecurity has mainly addressed intrusion detection (Liu et al. 2018). For example, Garcia et al. (2012) use an HMM and k-means for detecting malicious activity, while Wang, Guan, and Zhang (2004) use an HMM to determine whether a sequence exceeds a certain threshold, thus classifying it as an attack. In Radhakrishna, Kumar, and Janaki (2016), the aim is to identify intrusions using temporal pattern mining. Intrusion detection usually does not involve analyzing specific, actionable observations and identifying the attack phases.

Identifying Attack Phases

Each storyline details the logical steps comprising an attack phase. We may view storylines as time series and the problem of identifying the corresponding attack phase as multi-class classification. Subsequently, storylines serve as input for a machine learning classification model. Storylines are composed of sequential contexts with temporally extended dependencies. Such dependence is likely a factor within such complex, lengthy data as attack logs. Therefore, our primary model hypothesis is a Long Short-Term Memory (LSTM) neural network (Hochreiter and Schmidhuber 1997).

As alternative candidate classification models, a 1-dimensional convolutional neural network (CNN) and a simpler linear support vector machine (SVM) may be explored. Sequential contexts like storylines often contain very similar repeated sub-sequences common among instances within each class. 1D convolution is a common method in time-series data analysis and has frequently been used in various previous works related to classification (Lee, Yoon, and Cho 2017; Zhang, Zhao, and LeCun 2015), and serves as one performance baseline for the classification stage. Joachims (1998) discusses suitability of SVMs for text classification, for which they have been used extensively. SVMs, being simpler models, benefit from fewer parameters thereby requiring less manual design. Indeed, the only parameters that usually require testing are kernel (linear vs non-linear, with some variations) and maximum number of iterations for training.

A classification model trained to perform multi-class classification on each attack storyline represents the last step of the methodology. The resulting output from this step is the identified attack phase. This model's performance ultimately demonstrates the instantiated method's ability to extract high-level information from the granular and extensive logs.

Performance Evaluation

To realistically evaluate our methodology, an independent red team of cyber security researchers assisted by Metasploit (Rapid7 2020), a well-known penetration testing tool, engaged in several attacks on a Linux server over a few days. These attacks yielded 114 phases, each of which is one of six popular types (based on sequence of system calls executed by attacker in each phase). Attack phase types, informed in part by MITRE's ATT&CK matrix (Strom et al. 2018), are:
1. system reconnaissance: gather information about system, including OS version and user account information;
2. persistence: attacker implements measures to maintain access even after a system restart;
3. privilege escalation: attacker attempts to gain root or higher access on the victim machine;
4. asset discovery: search for assets such as sensitive files on a system;
5. data exfiltration: attacker transfers data out;
6. network discovery: attacker explores different connections made to/from the victim machine.
The distribution of these types is shown in Table 3.

Attack phase             Count
system reconnaissance    39
persistence              12
privilege escalation     14
asset discovery          16
data exfiltration        14
network discovery        19

Table 3: Distribution of various attack phases in our data set.

Each attack is logged using Sysdig on the honey pot server. GrAALF automates online monitoring by pre-defining a set of query templates similar to predefined rules. Once the user specifies the system and sensitive files to be monitored, a set of queries derived from templates is deployed by GrAALF. For instance, for monitoring a modification of a sensitive file, the following query is used by GrAALF: "back select write from file where name is X" where 'X' represents file name of interest. To monitor for potential access to IP ranges outside the local network "back select * from soc where not
name has 172.16." where a range of local IP is '172.16.*.*'. GrAALF provides further query templates for process-based monitoring. For example, 'nc' and 'scp' processes are frequently used by adversaries to plant a backdoor or to exfiltrate sensitive data. Query "back select * from * where name is nc or name is scp" can monitor detailed behavior of such processes, and their remote hosts can be identified by "forward select * from * where name is nc or name is scp". Output of GrAALF's templated queries is a focused provenance graph. The engagement and GrAALF's analysis yielded 25 sequences of system calls which did not belong to any of the six phases and may be viewed as benign. We utilize these for a deeper evaluation of learned models. All attack data including log files and GrAALF's output is available for download at https://tinyurl.com/yy9stomv.

Data cleaning. On receiving system call sequences from GrAALF, individual system calls, such as those shown in the first column in Table 4, are cleaned to be handled by the latent-state model. A simple pre-processing step removes some fields from each log entry as these fields do not contain information essential to understanding behavior. In particular, Linux audit logging includes fields such as sequence number, user name, and various IDs in addition to from name, evt type, and to name. Values in the first group typically do not speak about the action that was performed, and are dropped from further analysis. Remaining fields that contain valuable information for the HMM are from name, evt type, and to name (see second column of Table 4 for illustration).

We adopt straightforward rules to trim the set of distinct observations. For instance, diverse shell processes such as sh, bash, or zsh, are merged into bash as it is the most popular shell. In some sequences, we observe processes that create temporary child processes with randomly generated process names; we rename them to temp process. Table 4 shows a real audit log and processed logs.

Additionally, we filter out obviously irrelevant and unhelpful log events. For example, if the underlying audit system fails to extract system or process information, this appears as ⟨NA⟩, and we filter these out as they do not contain useful information. We also exclude routine cache and library file access events (i.e., accesses of .lib and .so files), and unnamed pipe accesses (i.e., "NULL" and "pipe").

To preserve readability, we provide the observation in the triple format from name → evt type → to name, where the from name in the triple is the parent process, the evt type is what the parent process performs, and to name is the child process on which the event is performed. For example, bash → exec → nc means that bash executes the process netcat. This is the third and final step in converting system call sequences into observations. The third column in Table 4 shows the observation for each system call, respectively.

Row 1
  Column 1, system call generated by GrAALF: { "sequence number": 12125, "user": "www-data", "from id": 2398, "from name": "sh", "evt type": "exec", "to name": "perl", "to id": 2399, "count": 1 }
  Column 2, from name, evt type, and to name extracted from the system call: from name: sh; evt type: exec; to name: perl
  Column 3, processed system call in the form from name → evt type → to name: bash → exec → perl
  Column 4, abstracted action (HMM state): Bash execute process
Row 2
  Column 1: { "sequence number": 12530, "user": "www-data", "from id": 2399, "from name": "perl", "evt type": "exec", "to name": "perl", "to id": 2400, "count": 1 }
  Column 2: from name: perl; evt type: exec; to name: perl
  Column 3: perl → exec → perl
  Column 4: Execute process
Row 3
  Column 1: { "sequence number": 12577, "user": "www-data", "from id": 1782, "from name": "apache2", "evt type": "shutdown", "to name": "10.0.2.8:44933 → 10.0.2.10:80", "to id": 13, "count": 2 }
  Column 2: from name: apache2; evt type: shutdown; to name: "10.0.2.8:44933 → 10.0.2.10:80"
  Column 3: apache2 → close → remote address
  Column 4: Server daemon operation
Row 4
  Column 1: { "sequence number": 12578, "user": "www-data", "from id": 1782, "from name": "apache2", "evt type": "read", "to name": "10.0.2.8:44933 → 10.0.2.10:80", "to id": 13, "count": 1 }
  Column 2: from name: apache2; evt type: read; to name: 10.0.2.8:44933 → 10.0.2.10:80
  Column 3: apache2 → read → remote address
  Column 4: Server download

Table 4: Raw system calls extracted from log files and processed. Column 1 shows system calls recorded by auditing software and output by GrAALF. Column 2 gives information extracted from system calls, and the third shows format of sequences as passed on to HMM. A latent-state model may view sequences as being emitted from corresponding states shown in fourth column.

HMM inference of attack storylines. We utilize a standard HMM for inference. Observations to the HMM are system calls in an attack phase sequence, cleaned and in triple format as described above. To infer attack storylines automatically, our aim is to learn transition and emission probability tables of the HMM from sequences of system calls in attacks annotated with abstracted actions using a learning algorithm such as Baum-Welch (Baum et al. 1970). This requires relating each state to observed system call triples emitted by the state. For example, the state Daemon upload download emits the system call nmbd → sendto → some socket and state Server daemon operation emits the system call proftpd → close → some socket. A storyline is then the most likely explanation inferred by the trained HMM for a given sequence of system call observations. An example short attack storyline inferred for the asset discovery phase is: (Execute process ircd, Server daemon operation ircd, Execute process perl, Execute process perl, System information, Execute process perl, Execute process temp process, Execute process temp process, System information). The lengths of
storylines vary from 1 step to more than 5,000.

Cleaned sequences of system calls from GrAALF for 114 attack phases yielded 1176 distinct observations, for which we utilized an HMM with 17 states as defined in Table 2. We implemented the HMM using the Pomegranate package (Schreiber 2017). We evaluate the HMM's performance using 5-fold cross validation on the annotated sequences in the 114 + 25 sequences. The HMM is trained using four folds, which involves learning the transition and emission probabilities using Baum-Welch with a count-based initialization (Laan, Pace, and Shatkay 2006) and pseudo counts. The HMM's fit and inference of the storyline is evaluated using the fifth fold.

We report log likelihoods of system call sequences in the five test folds. Each observed attack sequence is passed to the Viterbi algorithm, which yields the sequence of states that most likely explains observations. Likelihood is the probability that this sequence of states emits the observations. Table 5 shows mean log likelihood (divided by number of steps in the phase to account for highly varying lengths) for each fold. Some sequences of system calls could not be predicted with high probability, which leads to noticeable standard deviations. In some test folds, the learned HMM encountered some sequences with previously unseen system calls. However, small non-zero values at initialization of transition and emission probabilities and Baum-Welch's use of pseudocounts permitted generalization to these sequences.

Fold   Mean LL           Mean LL ratio
0      -4.909 ± 0.996    0.938 ± 0.084
1      -4.406 ± 1.566    0.956 ± 0.05
2      -4.586 ± 1.338    0.942 ± 0.071
3      -5.091 ± 0.974    0.95 ± 0.067
4      -4.579 ± 1.549    0.911 ± 0.097

Table 5: Mean and standard deviation of log likelihood and log likelihood ratio per test fold generated by HMM. Mean log likelihood across all folds is -4.714, mean ratio is 0.939.

Having evaluated the model's fit, next we evaluate correctness of the inferred storylines. We compare the previous mean log likelihoods with the mean of log probabilities of observed sequences in each test fold given the ground truth assignment of states for a sequence. In Table 5, we also report the means and standard deviations of likelihood ratio (LL of MLE) / (LL of ground truth) while noting that a ratio of 1 is desired. While no fold gave a perfect ratio of 1, the mean ratio for most folds is above 0.9 indicating that generated storylines were mostly correct. Table 6 contains likelihood ratios decomposed by attack phase. Storylines pertaining to network discovery yield lowest likelihood ratio because some system calls commonly seen in several network discovery sequences were consistently mispredicted by the HMM. As network discovery sequences are among the shortest, few errors have a significant impact. Finally, storylines are passed on for identification.

Model   Asset disc.   Sys. recon.   Exfil   N/w disc.   Persist   Priv. Escal.
HMM     0.957         0.973         0.873   0.846*      1.0       0.986

Table 6: Mean log likelihood ratio of the HMM decomposed by attack phases. The lowest performance is marked with *.

Classifier evaluation to identify attack phases. We design the recurrent neural network model to embed tokenized and padded input sequences (length of 5,465), pass resulting vectors through two LSTM layers (100 memory units, dropout of 0.05, recurrent dropout of 0.1), and finally use a softmax layer to arrive at a probability distribution over attack phase labels for each input sequence. The CNN takes encoded and padded storylines through two convolution blocks (32 and 64 filters respectively, kernel sizes of 3, max pooling of 2, followed by dropout of 0.05), followed by a fully-connected layer, and finally a softmax output layer producing classification probabilities. The support vector classifier uses a simple linear kernel. Various parameters for each model were explored to determine best configurations, such as different dropout rates and training epochs and use of class weights.

Our neural network models were implemented in Keras with TensorFlow backend, while scikit-learn's SVM implementation was used. Experiments were performed on a Linux system with 4 Intel Xeon Skylake processors, 32GB RAM, and 1 NVIDIA P100 GPU. To demonstrate utility of HMM's abstraction toward identifying high-level attack characteristics, we evaluate classifiers in two ways: The first experiment trains for up to 100 epochs (maximum iterations of 20K for the SVM) on ground truth sequences and evaluates each model's performance on storylines comprised of states that the HMM produced for each attack phase. A ground truth labeled sequence is the sequence of manually annotated states. In the second, input sequences are system calls directly coming from GrAALF without being processed by the HMM; as there are no storylines for this raw data, 5-fold cross validation was used to evaluate models on this data set (confusion matrices for each fold were combined to get final results). Relative results of these tests show how well HMM predictions reflect true characteristics of each attack phase.

         Weighted-mean F1
Model    with HMM         without HMM
SVM      84.14 ± 12.82    64.79 ± 15.75
CNN      86.32 ± 11.46    14.51 ± 16.30
LSTM     90.31 ± 8.44     47.55 ± 26.03

Table 7: Weighted-mean F1-score (%) and weighted standard deviation for models with and without storylines. Statistics obtained by weighting phases' F1-scores with class sizes.

Table 7 gives the weighted-mean F1-score of the classification by each of the models in both experiments. First, notice that the use of storylines as input to the classifiers improves their accuracy significantly compared to using sequences of low-level system calls; this improvement is especially large for the CNN. Among the various classifiers, the LSTM model achieves a better mean F1-score than the SVM or CNN. The LSTM operating on attack storylines is able to accurately identify the type of about 90% of the attack phases. However, the paired F1-score differences between the LSTM and the other methods are not statistically significant. Clearly,
Model Asset Sys. Exfil N/w Persist Priv. Model Asset Sys. Exfil N/w Persist Priv. Benign
disc. recon. disc. Escal. disc. recon. disc. Escal.
SVM 74(43) 90(61) 93(50) 62(58) 89(74) 97(36) SVM 35 84 44 41 100 0 38
CNN 64(90) 90(99) 100(98) 91(99) 89(100) 79(98) CNN 58 92 53 88 89 72 0
LSTM 73(91) 92(99) 100(99) 94(99) 86(99) 93(99) LSTM 69 92 53 94 86 93 0

Table 8: F1-score (%) of HMM-aided models by attack Table 11: F1-score (%) of each HMM-aided model by attack
phases, and mean confidence on correct classifications (true phase and benign class, based on confidence threshold of 0.5.
positives). Lowest performance for each model is highlighted.

Model Asset Sys. Exfil N/w Persist Priv. How do trained classifiers perform on out-of-class in-
disc. recon. disc. Escal. stances? 25 likely benign storylines were tested alongside 114
SVM 68 90 88 90 80 93
CNN 89 88 100 100 80 68
LSTM 79 92 100 100 75 93

Table 9: Precision (%) of HMM-aided classification models by attack phases. Lowest performance for each is highlighted.

the process of inferring the most likely explanation of the observed system calls is very valuable and identification of the attack phases benefits from reasoning about the context.

Precision and Recall We analyze models’ performances by reporting classification F1-scores by attack phase. Table 8 shows that LSTM’s and CNN’s weakest performance is on asset discovery sequences, where no model achieves a high score. This significantly lower F1-score is due, in part, to sequences of that class having low average lengths compared to others. Asset discovery sequences give the models less information to learn from and inform the classification.

As recall is the percentage of instances of a class that are correctly classified, in this multi-class setting, the per-class recall corresponds to the classification accuracy for each class. In Tables 9 and 10, we see that no model achieves high recall on the asset discovery phase, with the CNN’s recall being the highest score for that class (75%). The exfiltration phase is evidently the easiest to classify, with perfect recall and at least 82% precision for each model. The LSTM model achieves the worst single score in both precision and recall.

Out-of-Distribution Evaluation While our aim is to make the log analysis step highly precise in identifying attack provenance, we also consider the realistic scenario where it misidentifies possibly benign sequences of system calls as attack phases. As sequences are composed of system calls that are likely to be shared with attack phases, the HMM continues to generate storylines for likely benign sequences. […] attack phase storylines. As models are trained on instances belonging to six attack phases only, each prediction was initially for one of these classes. We simply rely on prediction confidences to discriminate benign sequences. Ideally, if benign instances are significantly different from attack phase instances, their predictions should have distinctly lower confidences. For these tests, any prediction made with a confidence value less than a threshold T was changed to a prediction of the benign class for the corresponding instance. This change would make the prediction correct if the instance is truly benign and incorrect for any of the original 114 instances. Table 11 shows results of this experiment with T = 0.5.

The SVM suffers most from this thresholding approach due to confidence values being relatively lower and more varied than those of the other models. It does achieve the highest F1-score on benign sequences out of the models (at T = 0.5 and at other values), but its overall performance on the attack phases is more severely diminished than those of the CNN or LSTM. The CNN and LSTM achieve high enough confidence values on the majority of their predictions so that this simple thresholding method did not change any predictions to benign. Thus, the models achieved an F1-score of 0% on that class.

Furthermore, isolating predictions by each model on the likely benign instances did not reveal significant differences in mean prediction confidence from that of true attack phases. On benign instances, mean confidence is near 51% for SVM, 88% for CNN, and 96% for LSTM. Comparatively, on the 114 attack phases, the SVM has the most varied confidence values with a mean confidence near 54%. The CNN and LSTM reach consistently high confidence values, with means of 92% and 97% respectively. As such, the mean confidences of the latter two models are similar. We did not find a value for T that could correctly classify a high percentage of benign instances without severely reducing the performance on the attack phases. This suggests that a more sophisticated approach is needed for out-of-class instance identification.
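The thresholding rule described in this subsection can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the toy labels, and the confidence values below are hypothetical and are not taken from the experiments or from Table 11.

```python
BENIGN = "benign"

def apply_threshold(labels, confidences, threshold=0.5):
    """Relabel every prediction whose confidence is below `threshold` as benign."""
    return [BENIGN if c < threshold else y for y, c in zip(labels, confidences)]

def f1_for_class(y_true, y_pred, cls):
    """Per-class F1 computed from true positives, false positives and false negatives."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data (made up): two attack-phase instances and two benign instances.
y_true = ["exfiltration", "benign", "asset discovery", "benign"]
raw_pred = ["exfiltration", "system reconnaissance", "asset discovery", "asset discovery"]

# A model with low, varied confidences versus one that is confident everywhere.
varied_conf = [0.62, 0.41, 0.48, 0.55]
high_conf = [0.97, 0.96, 0.99, 0.95]

varied_pred = apply_threshold(raw_pred, varied_conf, threshold=0.5)
high_pred = apply_threshold(raw_pred, high_conf, threshold=0.5)

print(varied_pred)                               # one benign hit, one attack instance lost
print(f1_for_class(y_true, varied_pred, BENIGN)) # 0.5
print(f1_for_class(y_true, high_pred, BENIGN))   # 0.0
```

In this toy setting the model with varied confidences recovers one benign instance at the cost of relabeling a correctly classified attack instance, while the uniformly confident model never emits the benign label and therefore scores an F1 of 0 on that class, which mirrors the qualitative behavior reported above.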
Model   Asset disc.   Sys. recon.   Exfil   N/w disc.   Persist   Priv. Escal.
SVM     81            90            100     47          100       100
CNN     50            92            100     84          100       93
LSTM    69            92            100     89          100       93

Table 10: Recall (%) of HMM-aided classification models by attack phases. Lowest performance for each is highlighted.

Confidence and Additional Phases Assessing the confidences each model achieves during classification can reveal attack phases lacking clear discriminating features. The CNN and LSTM have high average confidence levels among correct and incorrect classifications, with almost all confidences near 99% for the LSTM. The SVM has more dispersion in these scores, and has better distinction in confidences among right and wrong classifications. Interestingly, across all three models’ incorrect classifications, if the true class was privilege escalation then the predicted class was most likely to be

[Figure 1 (pipeline diagram). Panels, in order:
(1) "A snippet of a raw system call log of an exfiltration attack": the raw system call log file is loaded into GrAALF for processing.
(2) "Provenance graph generated by GrAALF" and "A snippet of the GrAALF output": GrAALF generates a focused sequence of system calls from the provenance graph.
(3) "The attack storyline provides an abstraction layer to perform causal inference": an HMM allows us to infer a sequence of system states given the emitted system calls (observations). The sequence of system calls is sent to the HMM, whose output, the most likely sequence of states, represents an attack storyline. States shown: server-daemon operation, bash execute process, generic process operation, process upload-download. Observations shown: vsftpd executes bash, bash executes cat, zip reads sensitive file, zip writes to IP.
(4) "Machine learning for classifying attacks": the abstracted behavior is sent to a machine learning model to determine the attack phase.
Example storyline: Init-server-daemon → execute-process → server-daemon-operation → bash-executes-process (x2) → system-information → execute-process (x6) → file-operation (x4) → daemon-upload-download → system-operation]

Figure 1: Component steps in Cyberian’s pipeline. System call logs containing possible attacks (here, we show a data exfiltration attack) are loaded into a graphical tool. Templated queries help identify possible attack phases, and HMM-based inference on these traces generates a comprehensible ‘storyline’, which facilitates identification of the attack phase using an LSTM classifier.
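The HMM inference step named in the caption of Figure 1, recovering the most likely sequence of hidden states from observed system calls, is commonly done with Viterbi decoding. The sketch below is a generic, self-contained illustration and not the authors' implementation: the state and observation names are loosely borrowed from Figure 1, and every probability is invented for the example rather than learned from data.

```python
import math

# Hidden states and observation symbols, loosely named after Figure 1 (assumed).
STATES = ["server-daemon-operation", "bash-executes-process",
          "file-operation", "daemon-upload-download"]
OBS = ["vsftpd_exec_bash", "bash_exec_cat", "zip_read_file", "zip_write_ip"]

# Illustrative parameters only; a real model would learn these (e.g. via Baum-Welch).
START = [0.7, 0.1, 0.1, 0.1]
TRANS = [[0.2, 0.6, 0.1, 0.1],
         [0.1, 0.4, 0.4, 0.1],
         [0.1, 0.1, 0.3, 0.5],
         [0.1, 0.1, 0.1, 0.7]]
EMIT = [[0.7, 0.1, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.7, 0.1],
        [0.1, 0.1, 0.1, 0.7]]

def viterbi(obs, start, trans, emit):
    """Return the most likely hidden-state index sequence for `obs` (log-space)."""
    n = len(start)
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n)]
    back = []  # back[t][cur] = best predecessor of state `cur` at step t + 1
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for cur in range(n):
            best = max(range(n), key=lambda p: prev[p] + math.log(trans[p][cur]))
            ptr.append(best)
            score.append(prev[best] + math.log(trans[best][cur])
                         + math.log(emit[cur][o]))
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

calls = ["vsftpd_exec_bash", "bash_exec_cat", "zip_read_file", "zip_write_ip"]
path = viterbi([OBS.index(c) for c in calls], START, TRANS, EMIT)
storyline = [STATES[i] for i in path]
print(" -> ".join(storyline))
```

With these made-up parameters the decoded storyline walks through the four states in order; with learned parameters, several consecutive system calls would typically map to the same state, which is what gives the storyline its abstraction over the raw log.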

asset discovery; the converse was also true. Notice that asset discovery is the worst classified attack phase overall. From the low F1-scores combined with the shared high confidence between these two classes, we may reason that the storylines for these two phases have high similarity. Misclassified instances of network discovery were always predicted to be asset discovery, although the converse was rarely true.

To test this methodology’s adaptability to honeypots in different settings, we evaluate system call logs from the DARPA Transparent Computing program (Computing 2020). GrAALF discovered six attack phases from Engagements 2 and 4 of this program, containing 380 and 625 million records respectively. Five represent activities falling within our chosen attack phases, while one (which we label as ‘drop file’ phase) does not. We create ground truths for the first 5 new sequences and train new instances of our classifiers with those and the 114 original traces. Then we test the learned models on storylines for all 6 DARPA traces. The SVM correctly classifies two traces, while the LSTM classifies one correctly. Confidences for both models’ incorrect predictions are significantly lower than their mean values. The exception is the LSTM’s prediction on the drop file phase, which is predicted as exfiltration with very high confidence. The CNN correctly classifies all but the drop file phase; notably, that trace is confidently predicted to be system reconnaissance, the class with the most instances and representing the broadest activity category. As the F1-scores of the models in Table 7 do not differ significantly and the CNN demonstrates better ability to incorporate new data in this methodology, the CNN may be the better approach for future deployment.

Concluding Remarks

Through raw system call analysis using GrAALF, storyline generation through an HMM, and CNN-based machine learning, this pipeline of candidate methods and models, which we name Cyberian, achieves an attack phase classification accuracy approaching 90% (Figure 1). Thus, Cyberian is effective in identifying attack behavior on a host based on distinct steps of various attacks. Our results demonstrate Cyberian’s ability to identify actionable phases of attacks from large system logs collected in honey pots, for further use.

One limitation of the evaluation is the relatively small number of distinct attack phases analyzed (though corresponding logs are extensive). However, attacks tend to be infrequent and results show that significant accuracy is attainable from the overall methodology even when relatively few phases are available for training. One benefit of this general methodology is that it can follow the same procedure to identify additional phases when system call logs for these other phases exist to use during training. The framework is not restricted to only using the phases in the reported experiments. A future direction of research is to experiment with more sophisticated attacks to increase dataset and observation diversity; it is likely that some of these additional logs will come from more sophisticated methods of privilege escalation and data exfiltration as well as new phases such as drop files.

In addition, a more sophisticated method is needed to handle the possibility of encountering storylines that do not belong to known attack phases (either benign or an unknown phase); open set recognition is a promising avenue of research for this purpose. This approach seeks to prepare models to effectively deal with classes unseen during training while accurately classifying seen classes (Geng, Huang, and Chen 2020). A more robust version of Cyberian with such abilities could identify mistakenly generated benign storylines or storylines representing attack phases beyond the known set.

Acknowledgments

This research was supported, in part, by a grant from the Army Research Office under grant number W911NF-18-1-0288. We also acknowledge discussions with Prof. Munindar Singh’s group at NCSU, which helped shape this research.

References

Baum, L. E.; Petrie, T.; Soules, G.; and Weiss, N. 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The annals of mathematical statistics 41(1): 164–171.

Bolzoni, D.; Etalle, S.; and Hartel, P. H. 2009. Panacea: Automating attack classification for anomaly-based network intrusion detection systems. In International Workshop on Recent Advances in Intrusion Detection, 1–20. Springer.

Chevalier, R. 2019. Detecting and Surviving Intrusions: Exploring New Host-Based Intrusion Detection, Recovery, and Response Approaches. Ph.D. thesis, CentraleSupélec.

Computing, D. T. 2020. Transparent Computing Engagement 5 Data Release. https://github.com/darpa-i2o/Transparent-Computing. [accessed 26 August 2020].

Du, M.; Li, F.; Zheng, G.; and Srikumar, V. 2017. DeepLog: Anomaly Detection and Diagnosis from System Logs Through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, 1285–1298. New York, NY, USA: ACM. ISBN 978-1-4503-4946-8. doi:10.1145/3133956.3134015. URL http://doi.acm.org/10.1145/3133956.3134015.

Gao, P.; Xiao, X.; Li, D.; Li, Z.; Jee, K.; Wu, Z.; Kim, C. H.; Kulkarni, S. R.; and Mittal, P. 2018a. SAQL: A Stream-based Query System for Real-Time Abnormal System Behavior Detection. In 27th USENIX Security, 639–656. USENIX.

Gao, P.; Xiao, X.; Li, Z.; Xu, F.; Kulkarni, S. R.; and Mittal, P. 2018b. AIQL: Enabling Efficient Attack Investigation from System Monitoring Data. In 2018 USENIX Annual Technical Conference.

Garcia, K. A.; Monroy, R.; Trejo, L. A.; Mex-Perera, C.; and Aguirre, E. 2012. Analyzing log files for postmortem intrusion detection. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(6): 1690–1704.

Geng, C.; Huang, S.-j.; and Chen, S. 2020. Recent advances in open set recognition: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8): 1735–1780.

Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning, 137–142. Springer.

Laan, N. C.; Pace, D. F.; and Shatkay, H. 2006. Initial model selection for the Baum-Welch algorithm as applied to HMMs of DNA sequences. Queen’s University, Kingston, ON, Canada.

Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning 2001.

Lee, S.-M.; Yoon, S. M.; and Cho, H. 2017. Human activity recognition from accelerometer data using Convolutional Neural Network. In 2017 IEEE International Conference on Big Data and Smart Computing (BigComp), 131–134. IEEE.

Lewes, T. C. 2019. SOF-ELK® Virtual Machine Distribution. https://github.com/philhagen/sof-elk/blob/master/VM_README.md. [Online; accessed 25 May 2019].

Lippmann, R. P.; Fried, D. J.; Graf, I.; Haines, J. W.; Kendall, K. R.; McClung, D.; Weber, D.; Webster, S. E.; Wyschogrod, D.; Cunningham, R. K.; et al. 2000. Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation. In Proceedings DARPA Information Survivability Conference and Exposition. DISCEX’00, volume 2, 12–26. IEEE.

Liu, M.; Xue, Z.; Xu, X.; Zhong, C.; and Chen, J. 2018. Host-Based Intrusion Detection System with System Calls: Review and Future Trends. ACM Computing Surveys (CSUR) 51(5): 98.

Milajerdi, S. M.; Gjomemo, R.; Eshete, B.; Sekar, R.; and Venkatakrishnan, V. 2019. Holmes: real-time apt detection through correlation of suspicious information flows. In 2019 IEEE Symposium on Security and Privacy (SP), 1137–1152. IEEE.

Project, P. 2019. Plaso (log2timeline). https://plaso.readthedocs.io/en/latest/. [Online; accessed 25 May 2019].

Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2): 257–286.

Radhakrishna, V.; Kumar, P. V.; and Janaki, V. 2016. A Novel Similar Temporal System Call Pattern Mining for Efficient Intrusion Detection. J. UCS 22(4): 475–493.

Rapid7. 2020. metasploit. https://metasploit.com/. [Online; accessed 20 Jan 2020].

Schreiber, J. 2017. Pomegranate: fast and flexible probabilistic modeling in python. The Journal of Machine Learning Research 18(1): 5992–5997.

Setayeshfar, O.; Adkins, C.; Jones, M.; Lee, K. H.; and Doshi, P. 2019. GrAALF: Supporting Graphical Analysis of Audit Logs for Forensics. arXiv preprint arXiv:1909.00902.

Strom, B. E.; Applebaum, A.; Miller, D. P.; Nickels, K. C.; Pennington, A. G.; and Thomas, C. B. 2018. Mitre Att&ck: Design and Philosophy. Technical report, MITRE Corp.

Wang, W.; Guan, X.; and Zhang, X. 2004. Modeling program behaviors by hidden Markov models for intrusion detection. In Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No.04EX826), volume 5, 2830–2835 vol.5. doi:10.1109/ICMLC.2004.1378514.

Zhang, X.; Zhao, J.; and LeCun, Y. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems, 649–657.
