
UTrack: Enterprise User Tracking Based on OS-Level Audit Logs

Yue Li, College of William and Mary
Zhenyu Wu, Google Inc.
Haining Wang, Virginia Tech
Kun Sun, George Mason University
Zhichun Li, Stellar Cyber
Kangkook Jee, University of Texas at Dallas
Junghwan Rhee, University of Central Oklahoma
Haifeng Chen, NEC Laboratories America

ABSTRACT
Tracking user activities inside an enterprise network has been a fundamental building block for today’s security infrastructure, as it provides accurate user profiling and helps security auditors make informed decisions based on insights derived from the abundant log data. Towards more accurate user tracking, we propose a novel paradigm named UTrack that leverages rich system-level audit logs. From a holistic perspective, we bridge the semantic gap between user accounts and real users, tracking a real user’s activities across different user accounts and different network hosts based on the causal relationships among processes. To achieve better scalability and a more salient view, we apply a variety of data reduction and compression techniques to process the large amount of data. We implement UTrack in a real enterprise environment consisting of 111 hosts, which generate more than 4 billion events in total during the one-month experiment. Through our evaluation, we demonstrate that UTrack is able to accurately identify the events that are relevant to user activities. Our data reduction and compression modules largely reduce the output data size, producing an overview of a user session profile that is both accurate and salient.

CCS CONCEPTS
• Security and privacy → Distributed systems security.

KEYWORDS
Audit Logs; Forensics Analysis; User Tracking

ACM Reference Format:
Yue Li, Zhenyu Wu, Haining Wang, Kun Sun, Zhichun Li, Kangkook Jee, Junghwan Rhee, and Haifeng Chen. 2021. UTrack: Enterprise User Tracking Based on OS-Level Audit Logs. In Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy (CODASPY ’21), April 26–28, 2021, Virtual Event, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3422337.3447831

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CODASPY ’21, April 26–28, 2021, Virtual Event, USA
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8143-7/21/04...$15.00
https://doi.org/10.1145/3422337.3447831

1 INTRODUCTION
Nowadays, cyber-attacks have become more sophisticated and stealthy. In an Advanced Persistent Threat (APT) attack, an attacker may lurk in the target network for more than half a year on average, escalating and maintaining the access privilege without being caught [40]. As a result, there is an increasing demand for user tracking inside an enterprise network, in order to improve the visibility of network monitoring and help security analysts make informed decisions on the detection of insider attacks and targeted APT attacks. A recently enabled paradigm in the security industry, called User Behavior Analytics (UBA) [38, 39], is built upon this foundation. UBA covers a range of techniques that continuously monitor user activities and identify those that deviate from normal user sessions. While UBA is a rather broad concept that can be applied to many scenarios at different levels, granularities, and scopes, its fundamental building block is to accurately identify and model user activities. Capturing user activities with an inaccurate or incomplete view could result in incorrect detection or analysis.

Towards more accurate user modeling and verification, contemporary UBA approaches attempt to fuse data from different data sources to create a more comprehensive risk profile [33, 41]. Though they are useful in many scenarios [33, 39, 41], an inherent limitation is that they all lack a holistic view of systems, since data are collected from only a couple of security-sensitive applications, such as firewalls and proxies. Under such a setting, many meaningful events could be missed, not to mention the difficulties of correlating data with different syntax and semantics from a variety of sources. A natural approach would be to leverage log data at the operating system (OS) level, which can record data for all applications under homogeneous syntax and comprehensible semantics. Such an audit log system is widely deployed in many security infrastructures [22–24, 26, 27], mainly for forensics purposes. SLEUTH [18] and HOLMES [29] leverage system logs to identify APT attacks based on abstracted security-sensitive activities: “tags” for SLEUTH and “Tactics, Techniques and Procedures” (TTPs) for HOLMES.
In this paper, we present a novel user tracking system, named UTrack, which leverages the rich system log data to universally monitor user session activities. In addition to focusing on identifying, consolidating, and scrutinizing security-sensitive events [18, 29], UTrack is user-centric. UTrack does not pay more attention to pre-defined “sensitive” reads or writes. Instead, the goal of UTrack is to present a user’s activity profile accurately and concisely, such that more domain-specific behavior can be audited. For example, an employee copying a large amount of digital assets from the company should be known by UTrack.

We identify and tackle two major challenges. The first is to bridge the semantic gap between user accounts and human users in both in-host and cross-host scenarios. This is done by tracking causal relationships among processes through the user session root and correlating network events to identify network control channels. The second challenge is to address the “needle in a haystack” problem stemming from the huge volume of log data, which we handle through a variety of data reduction techniques. Unlike many previous works on log data reduction [25, 46] that aim at information-lossless reduction, our data pruning approach prunes data that may carry meaningful information but are out of the scope of user activity tracking.

We deploy UTrack in an enterprise network that comprises more than 100 hosts running either Windows or Linux operating systems with real users. The users are well aware of the setup. This is a general setup among many enterprises, in which company devices are actively monitored. We manage to process log data from all the hosts on a single machine, and demonstrate that UTrack is able to accurately identify and concisely present the events that represent activities of a real user inside the network in a human-consumable fashion.

In summary, we make the following contributions.

(1) We develop a new universal user tracking mechanism (UTrack) based on OS-level audit logs. UTrack aims to bridge the semantic gap between human users and computer user accounts by identifying and associating system events that appear in different user accounts and different hosts but belong to a single user session.
(2) We apply effective data reduction methods on user session profiles to achieve a scalable and salient presentation. The reduction mainly involves detecting interactive processes and modeling common data patterns.
(3) We implement UTrack in a real enterprise environment, with data collected from more than 100 hosts. Our evaluation results show that UTrack is accurate and concise in presenting user activities. UTrack scales well with low resource consumption.

The rest of the paper is organized as follows. Section 2 describes the motivation and challenges of an OS-level log based universal user tracking system. Section 3 presents the system overview. Section 4 elaborates how UTrack tracks user sessions across different accounts and different hosts. Section 5 presents various techniques UTrack adopts to pinpoint relevant events. Section 6 details the implementation and evaluation of UTrack, and Section 7 discusses more use cases. Section 8 surveys related works, and finally, Section 9 concludes the paper.

2 MOTIVATIONS AND CHALLENGES
2.1 Motivations
Contemporary user behavior monitoring is mostly done on disparate applications and services. However, such a methodology has multiple drawbacks that limit the usability of the monitoring system. The first drawback is the lack of completeness. In these systems, only a small portion of user activities are recorded and analyzed, since the logs are only generated from applications that are usually perceived to be of strong security indication, for instance, a firewall, a web proxy, or a sensitive database service. All other user activities are not actively monitored. However, a successful attack, especially an APT attack, usually comprises many individual steps. The traces of each step may be buried in seemingly less interesting events that are not recorded by applications. By connecting these dots, one may detect an intrusion that cannot be identified by conventional user behavior analytics. In contemporary user tracking schemes, the auditor lacks this holistic view of the entire system.

The other limitation is the difficulty of correlating log data. Data collected from different services and applications may be of different formats, granularity, and semantic levels. Parsing and correlating data from different sources is very challenging. As a result, data from individual sources are independently handled and analyzed in many cases. Shashanka et al. [33] attempted to associate subjects from different data sources, such as different IP addresses and user accounts. However, the capability of such an association is limited to a small scope, where the subjects are tightly bounded. Therefore, the inspector lacks a view of the connections among critical pieces of the puzzle from all data sources.

System Opportunities: A universal user activity tracking system, which monitors activities of all users inside an enterprise network, is very useful to resolve or mitigate the aforementioned problems. However, recording all activities of individual users in the entire network may incur significant system overhead. To balance the trade-off between system overhead and data granularity, we leverage an OS-level log system to collect data from each host inside a network. The OS-level log system collects low-level system objects, such as processes, files, and network connections, which largely preserve the running states of a computer at a certain time. Thus, it can be used to accurately reconstruct the causality among objects with clean semantics. Meanwhile, the data volume is at a manageable level. Nowadays, many enterprises have deployed such a log system for forensics purposes [22–24, 26, 27].

2.2 Challenges
2.2.1 Accurate Modeling of User Behaviors. When processing audit logs, a user account is often considered equivalent to the user itself. This is mostly true in some high-level applications, such as Facebook and Twitter. However, the assumption no longer holds when it comes to low-level OS events.

Unlike application-specific logging that is clearly defined and has much higher semantic awareness, a generic OS-level log system monitors events with respect to individual user accounts. In an enterprise network, a user may have multiple user accounts, and a user account could be accessible by multiple users. For instance, a network administrator could access both its personal account and the root account on a web server. The web server may also
be managed by several system administrators. As observed in our network, the discrepancy mainly comes from the following three scenarios.

Figure 1: Users and User Accounts – an Example

Account Transition. Managing account privilege and ensuring proper isolation among different privilege levels are essential to an operating system. However, user accounts with lower privilege sometimes need higher privilege to accomplish certain tasks, which is mainly done in two ways: (1) setting the UID of a process (e.g., the “ping” command) or (2) having a higher-privilege account do the task (e.g., through the “sudo” or “su” commands). Thus, a simple task done by a single user may involve several user accounts, and the same account may also be involved in activities performed by different users. A more comprehensive example is shown in Figure 1. In this example, two users, Alice and Bob, access the same file “secret.txt.” However, from the perspective of the operating system, the secret file is accessed by two “vim” processes of the “root” account. Furthermore, whoever is granted the root privilege is able to interact with the system on any other account’s behalf. Although this does not usually happen in normal operations, it is possible that an attacker exploits this system trait to mask its malicious behaviors.

System Service. In a typical operating system, there are many applications and services running in the background. Meanwhile, many system accounts are created in the sense of security groups to achieve a finer granularity of access control for these services. Although these accounts do not represent any individual user, they are delegated certain tasks by other users. Two examples, the “sshd” and “postgres” processes, are shown in Host 2 in Figure 1. In this case, the PostgreSQL database server receives a request to access the database, and in response, the server daemon creates a child process to process the request. All these activities at the database server are recorded under the account “postgres”, regardless of the real user that is in fact accessing the database.

Credential Sharing. It is possible that the same account is shared among multiple real users. A typical scenario is the “root” account on a server, which may be managed by several developers or administrators. Therefore, it is important to find the real performer of an event, especially when multiple real users log into the system using the same account.

In general, there exists a semantic gap between user accounts and human users. Solely relying on the user accounts to track the behavior of a user is not reliable, as it lacks proper linkage of user account transition and service account delegation. We recognize this semantic gap, and discard this intuitive but invalid assumption in our user tracking system. To clearly set boundaries between the two concepts, hereafter, the term “user” always indicates a real human user, while the term “account” always indicates a user (or system) account in a computer system.

2.2.2 Identifying Data Triggered by Users. The other challenge we face is to sift out data that are directly related to user behaviors. We observe that only a small portion of the data are triggered by direct user interactions with the computer, and others are spontaneous or scheduled system events, such as automatic updaters and cron jobs. However, human interactions are the natural target of a user tracking system. In light of this, we attempt to identify system events that are triggered by users’ actual interactions, which achieves much higher scalability and cuts off unnecessary distractions for security auditors. However, for cleaner data semantics and resource conservation, OS-level log systems usually only record the causal relationships among primitive system objects, such as processes, files, and sockets. They do not usually keep track of the operations on I/O devices, such as a click on the mouse or a tap on the keyboard. Without such information, it becomes non-trivial to identify events stemming from users’ interactions. To achieve our goal, we follow the lineage of the UI management components to identify possible processes that have an open interface to the users and rely on many useful features to determine interactive processes.

We also need to address the semantic gap between the actual high-level user behaviors and low-level system interpretations. In many cases, a simple operation from a user may result in a large number of system events. Though these events are indeed triggered by the user’s behavior, they carry little information as most of the steps are highly repetitive and predictable. To eliminate this redundancy, we try to model file sets that are frequently accessed by processes, and only record the events that do not fit in the model. In addition, we also model repetitive execution “branches” of a process and compress these repetitive ones.

3 SYSTEM OVERVIEW
Nowadays, many organizations have started to deploy an agent on each host in their enterprise networks. UTrack works under the same context as these forensics-oriented OS-level log systems. Three types of system objects (processes, files, and network sockets) and their interaction events (e.g., a process creating a child process or reading from a file) are recorded. Each event also carries attributes that describe the activity, such as the event time, user account, file pathnames, and socket IP addresses.

UTrack aims to help a system auditor understand the activities of users inside an inter-connected enterprise network by associating both in-host and cross-host activities performed by the same user in a specific user session. The input to UTrack is a data stream collected from all hosts, and it can be either a real-time stream or an offline history database. Generally, it consumes the data from a start time (𝑇𝑠) to an end time (𝑇𝑒), and outputs user session profiles to describe the activities of a user session within the time period. If
the data is an online stream, the end time 𝑇𝑒 is set to a distant future time. The user session profile is represented in a forest structure, and the time is usually set in terms of weeks, days, or hours in different use cases.

Most forensics techniques consider a system object, such as an identified trojan process, as a Point of Interest (POI), and aim at understanding the provenance or impact of POIs. In contrast, UTrack attempts to understand the behavior of a user in a session. In other words, the user itself is the POI. In most cases, the behavior of a user is much more complex to describe than a single attack incident in the system. UTrack suffers from the same data explosion problem, possibly to an even greater degree, which makes it difficult to focus on the truly interesting events. Thus, it is imperative for UTrack to identify and keep the most relevant data, which can also help save system resources.

Figure 2 illustrates an overview of UTrack. UTrack consumes a data stream of log events from the aggregator, which receives, sorts, and sends out the data from the agent-enabled hosts. The first task of UTrack is to construct user session profiles by correlating both in-host and cross-host activities, mainly relying on the process lineage and network event matching. However, this session profile contains a large amount of system-generated data that may not be directly related to the user’s operations, but can easily overwhelm other interesting events. To mitigate this issue, we apply interactiveness detection on the user session profile to identify the processes that have actual interaction with the user. This step directs us to the events that are more relevant to user tracking. We also model the files, network connections, and “sub-branches” of the interactive processes to further compress the low-entropy events. The output of this step is a more salient session profile, which can be directly construed by human auditors or be fed into further security measures.

Figure 2: UTrack Overview (components shown: Aggregator, In-host Tracking, Cross-host Tracking, Network Matching, Interactiveness Detection, Data Pruning, Data Modeling, Presentation Abstraction, Session Profile)

UTrack works in an online fashion. It gradually builds the user session profiles while consuming system events on the fly. It is important for a UBA system to promptly analyze the data, so that anomalies can be identified in their early stage and triaged to prevent further damages. On the other hand, UTrack can also work offline in forensics analysis by reconstructing the data stream from the log database. In order to facilitate this feature, we build a data replayer to replay the history data from the database, which is detailed in Section 6. This data replaying tool is also useful for implementing, debugging, and evaluating our user tracking system.

Note that UTrack does not aim to replace conventional forensics techniques, such as backtracking or forward tracking. Instead, it is complementary to those techniques to better secure an enterprise network. Nowadays, it is a common case that people do not really make good use of big data, and a large amount of collected data remains in the warehouse without generating any useful insights. UTrack demonstrates a new perspective to better leverage the collected rich system data for system security and management purposes.

4 UTRACK EVENT ASSOCIATION
UTrack is capable of tracking users across an enterprise network by linking events from different hosts. Note that we no longer depend on the owner of the process (i.e., the user account) to determine the real performer of an event. Instead, the process owner is only used as side information to give a hint of who the performer might be. In the following, we first introduce the tracking mechanism on a single host and then extend it to the cross-host scenarios.

4.1 Tracking In-host User Activities
4.1.1 Process Lineage. Modern operating systems usually maintain all alive processes in a tree structure. For example, in Linux, every process, except the init process, is forked by a parent process. This parent-child relationship widely exists among processes and is useful to determine the performer of most system activities. Specifically, we consider the user of the child processes to be the same as that of the parent process, unless we have a special reason to cut the lineage and attribute the parent and children to different sessions. For instance, a user might open a Bash terminal and run the “ls” command in the terminal. Since the “ls” command is executed by a child process forked by the bash process, UTrack considers both processes to be performed by a single user.

This parent-child relationship is a fundamental building block of many forensics analysis techniques [22–24, 27], which usually expand the investigation from a POI, i.e., the detected point of an attack. Timing is also considered to mitigate the possible dependency explosion and find the most relevant events. In contrast, UTrack tracks all the parent-child relationships among processes for constructing a more complete user session model.

We do not keep track of file data control flows, which are usually considered in forensics techniques. This is because when a process has written to a file, the file has a causal relationship with all the processes that read the file afterwards; however, this causal relationship is out of the scope of user tracking where the user activities are the target. Furthermore, it introduces too many dependencies that may unnecessarily complicate the analysis.
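
The lineage rule of Section 4.1.1 can be illustrated with a minimal sketch. It is not UTrack's implementation: the function name `attribute_by_lineage`, the tuple-based fork events, the process labels, and the `cut_points` parameter are all hypothetical, and the sketch assumes fork events arrive in chronological order. A child inherits the performer of its parent unless the lineage is explicitly cut, and only process-creation edges are followed (file data flows are ignored, as described above).

```python
def attribute_by_lineage(fork_events, known_performers, cut_points=frozenset()):
    """Propagate the performer of each process to its children.

    fork_events      -- iterable of (parent, child) pairs in chronological order
    known_performers -- dict mapping a process to the user already attributed to it
    cut_points       -- processes whose lineage is cut (hypothetical parameter):
                        they do not inherit the performer of their parent
    """
    performer = dict(known_performers)
    for parent, child in fork_events:
        # Skip children that are cut from the lineage or already attributed.
        if child in cut_points or child in performer:
            continue
        # A child is performed by whoever performed its parent.
        if parent in performer:
            performer[child] = performer[parent]
    return performer


# Hypothetical events: a user's bash forks "ls" and "vim"; cron forks a backup job.
forks = [
    ("bash:100", "ls:101"),
    ("bash:100", "vim:102"),
    ("vim:102", "helper:103"),
    ("cron:7", "backup:8"),
]
result = attribute_by_lineage(forks, {"bash:100": "alice"}, cut_points={"helper:103"})
for proc in sorted(result):
    print(proc, "->", result[proc])
```

In this toy run, the bash process and the processes it forks are attributed to the same user, the process listed in `cut_points` is left unattributed, and the cron-spawned job never enters any user's profile because its ancestor has no known performer.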
4.1.2 User Log-on Sessions. Since one OS usually structures all its processes in a tree (or forest in Windows), when activities from multiple users are recorded in a host, it is far from adequate to solely rely on the process lineage for tracking each user. Thus, we must find a way to attribute related nodes to different users. A critical observation is that a user must have an interface to interact with the computer, and usually the first step is to log on to the computer for user authentication. In addition, the OS usually organizes the processes under one user log-on session in a tree structure, and normally there is a root node as the ancestor of all processes created in the user session. We call this node a session root. Figure 3 shows an example from a Linux instance. Each user logging in to the system has a corresponding session root (i.e., node 𝑠1 and node 𝑠2), which is usually a child process of a running service. The user’s activities (for instance, opening an application) are reflected in the subtrees under the session root. UTrack utilizes session roots to identify the activities of each user. This brings us two benefits. First, it helps to separate processes and activities triggered by users from those generated by the OS or system services. Second, it can differentiate activities among multiple users who have logged on to the host simultaneously.

Figure 3: Session Root Isolation (panels (a) and (b); legend: m = display manager, d = daemon, s = session, a = application)

UTrack identifies session roots from several known patterns. For a normal user, the most common way to interact with the computer is via a Graphical User Interface (GUI). Even command line interactions are included, since the terminal window itself is created in the desktop environment. For example, in Linux, the X display manager (a process usually named *dm) manages the login screen and organizes a user session in child processes. In our experimental environment, the most common display manager is lightdm, and thus the session root in this case is a lightdm session child with a session ID. When users log on to a server through virtual consoles or no X server is available on the server, the session root is /sbin/login, which is a child of the system init process. It is even easier for Windows, as it is a GUI-based OS and the user interaction with the OS is usually through the GUI. We determine the Windows process “winlogon” as the session root, since it initiates the user authentication process and becomes the root of the desktop environment when the login succeeds.

Remote logins, such as through ssh and telnet, are envisioned as user cross-host activities since events from multiple hosts need to be correlated to track the relations. We elaborate how we handle cross-host activities in Section 4.2. When logins are from hosts that do not have an agent installed or the login happens before the tracking start time 𝑇𝑠, we identify session roots based on the service pattern. For example, the ssh daemon creates a dedicated shell for whoever has successfully logged on to the computer via ssh. As such, the dedicated shell is considered as the session root.

4.2 Tracking Cross-host User Activities
In an enterprise network with many inter-connected hosts, one user may need to work on, or request resources and services from, servers. It is critical to track the cross-host user activities in order to achieve a better coverage than local-only tracking. A number of previous works have been developed to help understand how a request is processed in a complex distributed system using middleware or application-level instrumentation [5, 37], statistical inference [2, 31], or system call log and analysis techniques [32, 36]. However, none of them can work accurately with general-purpose OS-level logs.

We propose to track cross-host user activities based on one key observation: after receiving remote requests, a server will act on behalf of the requester. Most servers have a daemon listening to incoming requests and processing the requests accordingly. There are two types of server architectures, namely, event-driven servers and worker-based servers. For the event-driven servers, since a thread could handle multiple incoming requests in a non-blocking, interleaving manner, it is hard to correlate a remote request with the corresponding activities of the server without specific assistance from the server. Therefore, our main focus is on the worker-based servers, where a network request is solely handled by a worker.

Worker-based servers are popularly used in enterprise networks with a moderate number of users due to the ease of coding and maintenance. A worker-based server may support two working modes, namely, on-demand worker creation and a pre-allocated worker pool. As an example of the first mode, the sshd daemon accepts a remote network connection and creates an interactive command language interpreter process, such as a bash terminal. Thereafter, the newly created command interpreter process is controlled by the requester, and any activities performed by the process should be attributed to the requester, regardless of the user account that owns it on the server. We call this process a delegate of the remote user.

It is trickier to handle the worker pool mode. In UTrack, one process node in a user session completely belongs to the user. However, this does not fit well with the mode of a worker pool, where multiple long-living workers are pre-created and each worker only dedicates part of its lifetime to a network request. In order to accommodate such a case, we introduce a new notion, the virtual process, to model a span of the worker’s lifetime. When a worker begins to work for a requester, we create a new node (i.e., the virtual process) in the user’s model, and the new node records all the activities of the worker during this time. We illustrate this process in Figure 4, where two users make requests to a server at different times. The server dispatches the same worker to access two different files, 𝑓1 and 𝑓2. A virtual node is created to represent the time span during which a worker is processing each request. Eventually, the user model is constructed with each user associated with its own virtual process, which carries all data during the time when the real worker handles the individual request.

To track cross-host user activities, the first step is to find the communication channel between the server application (i.e., responder) and the user-controlled application (i.e., requester). Next, we try
to identify a worker for the request using a rule-based method. In both worker modes, we observe that after establishing a network channel as a connection acceptor, a child process or a sibling (when the listener and workers are siblings) of the server process immediately accesses the same network channel and generates a number of events. Based on this pattern, we can determine the worker and attribute all the activities of the worker to the remote requester.

Figure 4: Virtual Process (panels (a) and (b); nodes: bash, server, worker w, files f1 and f2; edges numbered [0] to [6])

4.3 System Cold Start
UTrack consumes data from agents during a pre-set time period to help understand the user sessions; however, it may encounter the cold start problem, namely, the history data is not available and the agents report events on scattered processes. If so, the linkage among processes may be missing. Since UTrack relies on the causal relations to identify user sessions, it needs to reconstruct these relations between processes. To address this problem, the agent periodically collects a system snapshot that stores the parent-child relationships among processes. We use this information to reconstruct the causality relations among stand-alone processes and further extract user sessions in the reconstructed process tree. Note that the parent-child relationship recorded in the snapshot may not be

5 PINPOINTING USER ACTIVITIES
After correlating activities of users regardless of the process owner, UTrack collects user session profiles that keep track of all processes, files, and sockets in the memory during the entire tracking period. As a result, the generated data can become very large, and interesting events may be buried in piles of less-relevant data. Therefore, it is essential for UTrack to identify and keep only relevant and useful events. Redundant data must be pruned to relieve the pressure on system resources and to keep the security auditor from unnecessary distractions.

When conducting data pruning, we stick to the user-centric mentality by sifting out the events that are directly related to the user’s interaction with the computer system. This is because a user session profile contains a collection of processes that are only used to facilitate user or system operations. For example, Ubuntu provides a number of tools and services, such as GNOME Virtual File System (gvfs) for I/O abstraction, update-notifier for newer version checking, and zeitgeist for logging user activities, which are less relevant to the user’s actual behaviors. In contrast, interactive processes are the processes that a user interacts with, such as a Bash shell or UI-based programs like Firefox and Notepad. The behavior of interactive processes is a genuine reflection of the user operations. However, it is a challenge to identify those interactive processes from our OS-level logging information, which does not include any user actions, such as mouse clicking or keyboard input. UTrack relies on passive observation and prediction to find interactive processes, and multiple features have been identified to help distinguish interactive processes from other processes.

In addition to the interaction-oriented sifting, the data can be further compressed due to the highly repetitive patterns found in processes and files. We observe that the interactive processes are prone to generate sub-process “branches” for different tasks. These branches could be similar to each other with regard to the executable names, arguments, and files that are read. In many cases, these monotonous data can easily dominate a session profile and account for over 90% of the data in the user profile. To address this issue, we
coherent with that generated by UTrack, due to possible process model both the activities of an interactive process and the common
delegation, user session identification and isolation, or the adoption files that are read by each executable, which significantly reduce
of orphaned processes, etc. This discrepancy is in fact beneficial the complexity of the user session profile.
to our user tracking scheme. For instance, if a user starts a system
service during the tracking period, the system service is regarded 5.1 Interactiveness Detection
spawned by the user, and the activities of the service can be attrib-
The purpose of interactiveness detection is to find user-triggered
uted to the user. However, if the tracking period is after the system
events. We consider user-triggered events to be events directly re-
start time and the service daemon is adopted by the “init” process,
sulting from a user action, such as opening a file using Notepad, etc.
then the service becomes a part of the operating system and cannot
It should be noted that technically, all events are results of human
represent any user. As such, we only reconstruct the parent-child
user activities, since background procedures and processes, even
relationship when the child process has no existing parent in the
the operating systems are installed by the user. However, since
UTrack model.
these processes are mostly regulated and expose behaviors dual to
bots, we consider them to be non-user-triggered. Conceptually, we
4.4 Scope envision this procedure being similar to find bots/crawlers in a net-
UTrack aims to connect events that are cross-host and cross-accounts. work, where the bots are essentially programmed by human users,
However, there are cases where UTrack cannot handle. For instance, but they expose very different behaviors and have little relation to
it could fail to identify causality among processes due to inability to active genuine users.
track IPC mechanisms, such as shared memory and shared files. It The interactiveness detection relies on passive observation of the
also cannot handle the event-driven servers, like those run NGINX. OS events, so it faces several noteworthy challenges. First, passive
Similar limitations can be found in previous works [6, 32, 36]. observation is believed to be less accurate than active detection
f4
solutions, such as Catpcha [34, 44]. Second, we do not have a spe- IP
f1 IP
cially tailored log system as those used in bot detection [13, 45], f3
or any side information such as social graph [7, 9, 13, 43]. Lastly, f2 e1 e1
system level events are low-level data whose semantic meanings e1 e1 f2
are harder to derive. Sometimes, we need to associate other related f3
f1 m1 m1 f4
events to truly understand the actual user operations.
To categorize unknown processes, we develop a machine learning
approach that uses a number of useful features to distinguish an
interactive process from other processes. Note that we need to keep
the actual activities that are usually represented by child processes
of an interactive process. For example, an interactive shell may run
many commands, which are executed transiently. These commands
are not considered interactive processes. However, they represent
the user’s activities and should be studied.
An important and new feature we use is the entropy of activity
batches. A fundamental observation is that interactive processes
have irregular activities, due to human involvement. Thus, a process
performing tasks at a fixed time interval is unlikely to be controlled
by a real human user. However, treating each individual event as
a task is problematic since a single task usually comprises many
steps and events. As a solution, we preprocess all the events that
are generated by the process to form a group of event batches, in
which each batch represents a high-level task or operation. A batch
consists of a group of events where each pair of adjacent events
has an inter-arrival time of less than a threshold 𝑇. 𝑇 should be
carefully selected: if it is too large, all the events fall into a
single huge batch; if it is too small, the causality among
events that are generated from a single task is lost.
We model the time intervals between consecutive activity
batches as a random process and decide whether the process is
regular by computing its entropy rate based on an empirically learned
probability distribution [8, 12]. UTrack computes only the first-order
and second-order entropy. It is expensive to calculate even higher
order entropy, which may need prior knowledge to determine a
probability distribution. Also, we observe that the first-order and
second-order entropy achieve a satisfactory result.

5.2 Non-interactive Process Pruning
Instead of aiming for lossless pruning [25, 46], we can
afford to remove less interesting data points when coping with our
specific goal of user activity tracking. However, this does not mean
we do not track other processes. In fact, we keep track of all live
processes that have any interaction with interactive processes or
become interactive processes. The processes we prune are those
that do not have any lineage with an interactive process. In general,
most processes that are not in a user session are pruned since they
are system-triggered.
For the processes in a user session, if they are not related to any
user interactions, they are also pruned. We develop an online algorithm
to prune those processes using a bottom-up, backward-propagation
method. The pruning starts from a leaf process when
the process ends. If the leaf process can be pruned, it is removed
from the child list of the parent process, and the parent process is
then checked to see if it can be pruned after the removal of
its child process.

Figure 5: Data Modeling (panels (a) to (d); e: mainexec, f: files, m: model)

5.3 Data Modeling
The essence of identifying interactive processes is to find the activities
of processes, since they are likely the direct results of the user’s
operations. Therefore, all activities of the interactive processes
are preserved in our user session profile. Due to its long-living
and interactive nature, an interactive process usually has many
sub-process branches representing user activities. However, these
branches could be highly repetitive for multiple reasons. First,
even interactive processes may have periodic routines for updating,
synchronization, etc. Second, user activities can be repeated. For
instance, a user may run the “ls” command many times in a terminal,
and many commands intrinsically invoke “ls”. Third, there is a
large gap between user operations and their interpretation by the
computer system. Therefore, a single, seemingly atomic user operation
may result in a large number of low-level events. For example,
when opening a Firefox browser, we observe that a significant portion
of events are repetitive and serve the same low-level purpose,
such as checking the system time or OS version.
There is much room for improving the salience of a user
session profile by modeling and compressing the files accessed by
processes and the branches of interactive processes, respectively.
Based on the observation that many processes with the same executable
name and same arguments (e.g., Chrome.exe type=renderer
. . . ; we call them “mainexec”) may access a similar set of files, we
are able to model commonly accessed files under a mainexec, and
record only the difference. An example is shown in Figures 5(a)
and 5(b). We notice that both mainexecs 𝑒1 and 𝑒2 access a common
set of files ({𝑓1, 𝑓2, 𝑓3}), which can be abstracted by a model (𝑚1).
This model-based technique has also been used in Arnold [11] to
reduce instrumentation overhead.
Figures 5(c) and 5(d) show that the session profile can be further
compressed if some branches are identical. Interactive processes
often have identical branches that could easily overwhelm the auditor.
Therefore, we can compress these identical branches by only
recording the timing information and the number of occurrences.
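
The two compression steps of Section 5.3 can be sketched as follows. This is only a minimal Python illustration under stated assumptions: the function names, the dictionary layout, and the idea of summarizing a branch by a hashable `signature` are hypothetical and are not taken from the UTrack implementation.

```python
from collections import OrderedDict


def diff_against_model(accessed_files, model_files):
    """Encode one process's file accesses as deviations from its mainexec model."""
    accessed, model = set(accessed_files), set(model_files)
    return {
        "extra": sorted(accessed - model),    # files read beyond the model
        "missing": sorted(model - accessed),  # model files this instance skipped
    }


def compress_branches(branches):
    """Merge identical branches, keeping only a count and the start times.

    `branches` is an iterable of (start_time, signature) pairs. The
    hypothetical `signature` is any hashable summary of a sub-process
    branch (here: a command string plus a tuple of file deviations).
    """
    merged = OrderedDict()
    for start_time, signature in branches:
        entry = merged.setdefault(
            signature, {"signature": signature, "count": 0, "times": []}
        )
        entry["count"] += 1
        entry["times"].append(start_time)
    return list(merged.values())


# Model m1 abstracts the common files {f1, f2, f3}; only f4 is recorded.
model_m1 = {"f1", "f2", "f3"}
delta = diff_against_model(["f1", "f2", "f3", "f4"], model_m1)

# Three identical "ls" branches collapse into one entry with three timestamps.
shell_branches = [
    (10.0, ("ls", ())),
    (12.5, ("ls", ())),
    (30.0, ("vi X.py", ("X.py",))),
    (41.0, ("ls", ())),
]
profile = compress_branches(shell_branches)
```

Keeping the start times alongside the count mirrors the text's point that timing information survives the merge, so an auditor can still see when each repeated branch ran.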
5.4 Presentation Simplification
Different from conventional backtracking or forward tracking of an
attack incident, the session profile produced by UTrack describes
a user session. Thus, the session profile becomes unavoidably larger
and cannot be further compressed since all data points carry meaningful
information. To better visualize the data for user tracking,
we use graphs to present all processes, files, and network connections
in a user session. We visualize the session profile generated
by UTrack using the dot language, and then apply different levels of
simplification to the graph.
A fundamental challenge of presenting the session profile on
a single graph is that the graph could be very large due to processes
accessing a large number of files or network connections in
a long session. To alleviate this issue, we aggregate similar files and
network connections when visualizing the session profile graph.
For instance, the activities of a process can be represented as in
Figure 6. For the files, we find common prefixes of the file names
and abstract them with the same prefix. The network connections
are aggregated using the host name of the IP address. The
details of the abstraction can be found in Section 6.5. Note that
the essential difference between the presentation abstraction and
the data reduction/compression techniques is that the actual data
model is not changed in the presentation simplification process.
Namely, the abstraction does not save any resources, but is only
used to give the security auditors a better view of the data.

Figure 6: File and IP Abstraction (example labels: (a) /etc/lib/lib* (8); (b) A.B.0.0/32 (5); (c) *.google.com (3))

6 IMPLEMENTATION AND EVALUATION

6.1 Experiment Environment
We deploy UTrack on 111 hosts of a real enterprise environment:
21 Linux hosts and 90 Windows hosts. An agent is installed on each
host to collect and report system events. UTrack itself is written
in Java and contains 8.3K LoC. We evaluate the performance of
UTrack based on one month of data. Within this period, more than
4 billion events are generated, where 1.65 billion events come from
Windows hosts and 2.41 billion events come from Linux hosts. To
facilitate the use of history data, we implement a data replayer
to replay the data recorded and stored in the database with their
original timestamps. With the assistance of the replayer, we are
able to replay the one-month data within 30 hours.

6.2 User Tracking
In our one-month experiment, we identify 507 user sessions across
111 hosts. Note that the login screen itself is counted as a user
session and excluded from our data. Among the total 507 user
sessions, only 61 are Linux sessions. One reason is that
Linux users are less likely to log off or restart their computers than
Windows users. Besides, there are 4 Linux hosts that do not have
any user sessions, which means that they are used as servers and
no one logs on to the hosts through the Linux desktop environment.
However, the activities on those servers may be correlated to user
sessions on other hosts. On average, each user session lasts 4.6 days.
We also observe that Linux sessions are significantly longer (9.1
days) than Windows sessions (3.9 days). More than 100 sessions
last beyond the one-month period, so they are excluded when we
compute the average session lifespans.
For cross-host tracking, we first identify the communication
channels. We correlate network events from all hosts by matching
5-tuple attributes, which include local IP, remote IP, local port, remote
port, and the network protocol. However, due to port or IP
recycling, two network events might be wrongly matched. To avoid
such a situation, we add a constraint that two matching events
should happen within a small time window. This small window
should account for the possible errors caused by asynchronous clocks
on different hosts and network resource recycling. In our implementation,
we set the time window to 60 seconds, and we discard
the unpaired events after this time window.
In our environment, the number of all ready-to-pair network
events stabilizes at around 20,000 to 25,000. We observe that only
around 12.3% of network events can be eventually paired, and most
of the matched network events (82.4%) are localhost channels. This
is reasonable because any communication to the outside world
cannot be paired. Even internal communication may not be
identified, since not all computers host an agent in our environment.
Another case is broadcast network events, which have multiple
receivers. When the server is working in the worker-pool mode, it
may take a non-negligible time to determine the delegated worker,
since it needs to go through a network channel matching process. If
a worker is found, a virtual process will be created for the requester.
However, before the virtual process is created, the network request
may have already been partially or entirely handled, because most
requests are handled very quickly. Thus, one should record the
mapping between the virtual node and the actual node, and migrate
the stand-out events to the virtual node once the delegation relation
is established.

Table 1: Servers with the Most Network Connections

Program Name   Number of Instances   User Instance   Mode          Host Type
sshd           134,492               671             Create New    Linux
smbd           8,120                 428             Create New    Linux
Postgres       5,152                 559             Create New    Windows & Linux
sendmail       1,218                 17              Create New    Linux
httpd          874                   841             Worker Pool   Linux

During the one-month experiment, we observe more than 186
programs that accept network connections, and the top 5 programs
are listed in Table 1. The “Number of Instances” column shows
the total number of request processing instances we observed. In
our environment, since a server frequently runs “ssh” to localhost
for system backup, we observe a large number of ssh events. We
also find a Postgres database that constantly stores new data from
network connections. An Apache server runs the default pre-fork
Multi-Processing Module (MPM) to support a worker pool. The
“User Instance” column indicates the instances that belong to a user
session. It shows that only a small portion of the cross-host activities
can be seen in a user session, since most of the virtual process nodes
are pruned due to their irrelevance to user activities.

Table 2: Classification Results

                                Interactive   Non-interactive   Total
Classified as Interactive       447 (TP)      181 (FP)          628
Classified as Non-interactive   5 (FN)        25,746 (TN)       25,751
Total                           452           25,927
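
The cross-host channel pairing described in Section 6.2 (matching mirrored 5-tuples within a 60-second window) can be sketched as follows. This is a simplified Python illustration, not UTrack's code: the `NetEvent` record, its field names, and the helper functions are hypothetical, and timestamps are assumed to be seconds on roughly synchronized clocks.

```python
from collections import namedtuple

# Hypothetical event record; field names are illustrative, not UTrack's schema.
NetEvent = namedtuple(
    "NetEvent", "host ts local_ip local_port remote_ip remote_port proto"
)

PAIR_WINDOW_SECONDS = 60.0  # the window size reported in Section 6.2


def channel_key(ev):
    """Direction-independent key: both endpoints in sorted order plus protocol."""
    a = (ev.local_ip, ev.local_port)
    b = (ev.remote_ip, ev.remote_port)
    return (min(a, b), max(a, b), ev.proto)


def is_mirror(a, b):
    """True when b describes the same channel as a, seen from the other end."""
    return (
        (a.local_ip, a.local_port) == (b.remote_ip, b.remote_port)
        and (a.remote_ip, a.remote_port) == (b.local_ip, b.local_port)
    )


def pair_events(events, window=PAIR_WINDOW_SECONDS):
    """Pair mirrored network events that occur within `window` seconds."""
    pending = {}  # channel key -> unpaired events still inside the window
    pairs = []
    for ev in sorted(events, key=lambda e: e.ts):
        key = channel_key(ev)
        # Drop candidates that are too old; this guards against port/IP reuse.
        waiting = [w for w in pending.get(key, []) if ev.ts - w.ts <= window]
        match = next((w for w in waiting if is_mirror(w, ev)), None)
        if match is None:
            waiting.append(ev)
        else:
            waiting.remove(match)
            pairs.append((match, ev))
        pending[key] = waiting
    return pairs


client = NetEvent("h1", 100.0, "10.0.0.1", 50000, "10.0.0.2", 22, "tcp")
server = NetEvent("h2", 100.4, "10.0.0.2", 22, "10.0.0.1", 50000, "tcp")
# The same port pair reappears much later; the two ends are 200 s apart.
late_server = NetEvent("h2", 500.0, "10.0.0.2", 22, "10.0.0.1", 50000, "tcp")
late_client = NetEvent("h1", 700.0, "10.0.0.1", 50000, "10.0.0.2", 22, "tcp")

pairs = pair_events([client, server, late_server, late_client])
```

In this example only the first two events are paired; the later pair shares the same 5-tuple but falls outside the window, which is the kind of wrong match the time constraint is meant to prevent.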

6.3 User-centric Activity Tracking
To detect interactive processes, we employ an important feature,
the “regularity” of activities, which is measured by the first-order and
second-order entropy rates of the inter-arrival time of activity
batches in a process. In our implementation, we empirically set the
batching threshold (𝑇) to 350 ms. Figure 7 illustrates the CDF of
the number of batches a process has at the time of interaction
detection, and the number of events a batch has. Both of them are
heavy-tailed. For clarity, we limit the 𝑥 axis to be within 100. We
observe that 70.7% of processes have only one batch, and 99.1% of
processes have fewer than 100 batches. Similarly, more than 78% of
batches have fewer than five events, and 97% of batches have fewer
than 100 events. When computing the entropy of the process, we
round the time interval to second granularity to mitigate noise.

Figure 7: Batches in Processes

Another important feature we use is the lifespan of a process,
which describes the time duration from the time the process is
created to the time it ends (or the current time if it is still alive at
the time of decision making). The lifespan is a strong indicator of an
interactive process. Due to their communicative nature, interactive
processes tend to live longer than other processes. Therefore, a large
number of transient processes, especially on Linux hosts, can be
filtered out by inspecting their millisecond-scale lifespans. Figure 8
illustrates the CDF of process lifespan, which indicates that around
90% of processes have a lifetime of less than 20 ms.

Figure 8: Lifespan of Processes

Our model also considers the context of a process, including the
parent process, the number of child processes, and the nature of
the parent, as a set of important features. If a process is created by
a window manager (e.g., compiz is the default window manager in
Ubuntu 16.04), the process is more likely to be an interactive process.
We also blacklist 16 types of commonly seen non-interactive
processes (e.g., “/etc/update” periodically runs on Ubuntu OSes) to
remove unnecessary distractions. It is hard to maintain a whitelist
since a process could be sometimes interactive and sometimes
non-interactive depending on user operations.
In total, we extract 11 features to build a random forest model
based on Weka [15] to predict if a process is interactive. In the
training stage, we manually examine and label processes in 50
user sessions (20 Linux sessions and 30 Windows sessions) in a
one-day period. An advantage of manual effort is that we can
deliberately search for the mainexec of a process online and better
understand what the process is used for. The machine learning
module is triggered when a process’s ending event is observed or
the process has a sufficient number of activities, including batches,
network connections, and child processes.
More than 26,000 processes are labeled after filtering out the
processes on the blacklist. We apply 10-fold cross-validation on all
the processes, and the evaluation results are shown in Table 2. Our
machine learning module has a high accuracy and recall of 99.3%.
However, the module has a fairly low precision, which is only 71.1%.
Therefore, in a user profile, there is a non-negligible portion of
processes that do not really interact with the user. However, even
with those false positives, the user profile has been largely reduced
since the dominant non-interactive processes are mostly
identified and pruned off. We argue that having some wrongly
classified processes in the profile is acceptable since the amount of
noise created is limited and can be easily identified by the security
auditors.

6.4 Data Modeling
We model files accessed by both processes and sub-process branches,
and compress them by only recording the deviations from the model.
This modeling process is done when the process ends. In most
cases, the interaction detection module also kicks in at this moment.
It produces the same results no matter which module runs first,
since the modeled processes will be pruned if they or their ancestors
are later decided to be non-interactive. On the other hand, pruned
processes do not go through the interaction detection stage. In fact,
most processes do not live more than 20 ms (as shown in Figure 8),
and will be immediately pruned or modeled. In our implementation,
we apply data pruning, if applicable, before modeling for higher
efficiency. As such, the evaluation results apply only to the interactive
processes and their offspring. In contrast, non-interactive
processes are pruned off before any modeling can be done.
We build an FP-Tree [16] to model the commonly accessed files
of the same mainexec on each host. The FP-Tree is frequently used
to mine association rules from growing data. We set the Minimum
Support Threshold (MST) to 0.3, so that files with frequency less
than 0.3 are discarded from the tree. The FP-Tree no longer changes
after the training period. At this stage, new processes with the same
mainexec can be modeled using the tree. Since many processes may
have the same process branches that exhibit exactly the same system
behaviors, we compress the same branches into one and record the
number of occurrences. Some processes may have a random token
in the mainexec, such as the Chrome renderer processes. We handle
them in a case-by-case manner. Similar to data pruning, our online
branch modeling algorithm adopts a bottom-up approach, which
starts modeling from the leaf processes and propagates back to the
parent process if no leaf process is alive. When a parent process
notices that multiple child processes have the same model, it merges
these child processes and records the occurrence of the model. The
model of each process is represented in an XML-styled structure,
which stores the information of files, remote IPs, and mainexec of
itself and its offspring. Note that the backward propagation in our
user session model stops at the interactive processes. This is because
the model becomes increasingly large in lower-depth process nodes
due to the large number of child processes. Also, it does not
help compress the data, because the process models at
these levels are rarely identical and thus hardly compressible.
In the 507 user sessions, 8,382 interactive processes and 176,822
other processes (i.e., the child processes of the interactive processes)
are identified. After modeling the branches, more than 71% of the
processes are compressed, leaving us 8,382 interactive processes
and 50,394 of their child processes. All 58,776 processes access
more than 1.2 million files. After data modeling, the number of files
reduces to 502,446 (around 60% reduction), where a file model is
counted as one file.
On average, each user session has about 116 processes, which
is a small number considering that a user session can last for several
days. However, the number of accessed files is large, almost
1,000 files per session profile. We observe that the majority of files
come from process initialization, since a new process can easily
read hundreds of files during initialization. Although this initialization
process can usually be modeled, our experiment period may
not cover enough instances of the processes, so all the files are
preserved. When UTrack runs for a sufficiently long time, these
processes can also be modeled. Our implementation of presentation
simplification, which is designed purely for presentation with
some information loss, can partially remedy this problem.

6.5 Presentation Simplification
We can further simplify the graph describing the user session profile
to provide a better view for the auditors, especially when a
process is opened and many configuration files are read at once.
Furthermore, there are cases in which a large number of temporary files
with random names are created, so these files cannot be modeled
at all. To tackle these problems, we abstract the files
and network channels into a few nodes while preserving sufficient
information for understanding the whole process. Note that the
complete data is not lost in this abstraction step, so if needed, an
auditor can look into each abstraction for complete information.
Similar endeavors have been made in [17] to model the behavior
of containers that run the same services. However, in their implementation,
except for some special types of files, all other files are
collapsed into one abstract file, which is too aggressive and tends to
lose important information. For example, when we have 10 files
prefixed with “/A/B/C/” and 1 file named “/A/D/EFG”, a preferable
way might be to preserve both “/A/B/C/*” and “/A/D/EFG” instead
of collapsing all files into “/A/*”.

6.5.1 File Abstraction. When there is a large number of file accesses
that cannot be modeled, we may need to sacrifice data accuracy in
file abstraction in exchange for a more succinct presentation. Toward
this end, we build a trie (a.k.a. prefix tree) for all the files that are
accessed by a process. In a trie, each node stores a single character,
and two file names with a common prefix share a single path until
the end of the common prefix is reached. By carefully trimming
the tail branches of the tree, we can achieve a higher degree of
condensation and keep the output graph concise. However, due to the
trade-off between the amount of preserved information and the
degree of abstraction, we can only benefit from the abstraction
when the gain is deemed larger than the cost. In our case, the
gain is defined by the number of files that can be abstracted
when folding all nodes under a single node in the trie, and the cost is
the loss of filename characters. Specifically, the gain is defined
as $\sum_{i=1}^{n} g_i$, where $n$ is the total number of files that can be abstracted,
and $g_i$ is the preserved information for each file, which is
set as $g_i = \frac{l}{L} + f$, where $l$ denotes the number of complete directory
levels of the abstract file name, $L$ denotes that of the complete file
name, and $f$ represents the fraction of the last level of the file name
that is preserved. For example, when abstracting the file
“/A/B/C/DEF” to “/A/B/C/D*”, it is $\frac{3}{4} + \frac{1}{4 \times 3} = 0.83$. Note that we stress
the directory depth instead of the length to mitigate the effect of
long file names. We develop a greedy algorithm to trim the trie in
an iterative manner. In each iteration, we abstract the files with
the maximum gain until the number of files is within a max_file
threshold, which specifies the number of files that we prefer each
process not to exceed in the presentation. The iteration also stops
when the gain of abstracting a set of files is less than
a threshold (𝜏). Note that 𝜏 cannot be less than 1, otherwise a single
path will be abstracted with no gain. Both max_file and 𝜏 can
be adjusted to balance the precision and size of the graph.

6.5.2 Network Abstraction. Network events are also hard to model
since user-driven network channels cannot be easily predicted, and
thus cannot be modeled accurately. Therefore, it is important that
we abstract the network behaviors. We choose to model the remote
IP addresses. First, we look up the domain name of a remote IP
address. If the domain name can be found, we abstract IP addresses
with the same top-level domain and second-level domain, since
these two levels of domains are usually sufficient to identify a
remote host. An abstraction may then look like “*.google.com” or
“*.Facebook.com.” In case the remote hostnames cannot be found by
a reverse DNS lookup, we adopt a network mask approach similar
to [17]. Specifically, we abstract the class B and class C subnets to
abstract most IP addresses in a similar manner to file abstraction.

6.6 Graph Presentation
Figure 9 depicts a simplified user session profile we identified from
our network environment. In Figure 9, the two different colors
indicate two hosts. Processes, files, and remote IPs are represented
by ovals, squares, and diamonds, respectively. The complete graph
has a total of 323 nodes, including processes, abstracted files, and
remote IP addresses. For simplicity, we omit most unimportant files
from the graph with the one exception of the “outlook.exe” process. We
illustrate the 5 abstracted files read by “outlook.exe” in the graph
to give a basic idea of what the files look like after the abstraction
step. For other processes, we put the number of abstracted files
and the number of total files inside the process node. The case of
a process without a number indicates that the files are completely
modeled. Meaningful files are preserved as nodes on the graph. The
timing information is not included in this figure due to the space
limit. Besides, timing is not critical in understanding the figure. The
number of abstracted branches is shown in brackets on edges.
From the figure, we can easily find the user’s activities inside
the enterprise network. The user logs into the system on a Windows
host and the session lasts for six hours. The session spans two hosts
through interactive ssh connections using putty. On the Windows
host, the user browses the Internet via Chrome and uses Outlook
for emailing. The user then logs on to a remote Linux host to edit and
run a python program “X.py” (the file name is anonymized), which
further runs the “ls” program many times. In general, a graph-based
session profile presentation can be easily understood by a human
auditor, and provides important insights into the activities of a user.

Figure 9: Example User Profile
Path1: C:/Users/X/appdata/local/microsoft/windows/temporaryInternetfiles/content.IE5
Path2: C:/Users/X/appdata/local/microsoft/outlook
Path3: C:/Users/X/appdata/local/TEMP
File1: C:/program files/common files/system/ado/msadox.dll

7 USE CASES
Many more UBA features can be directly applied to UTrack for
anomaly detection. For example, one can audit the roles (user accounts)
that a user has been playing in the network from the user

behaviors which are in fact part of the cyber kill chain. UTrack can
also be used to determine the value of files (by the amount of time
an employee spent on a file) for backing up digital assets. This is
particularly useful in fighting ransomware.

8 RELATED WORKS
User Tracking and UBA. User or user activity tracking has been
extensively studied in different contexts and various techniques
have been proposed. One typical scenario is web user tracking
through different measures [1, 3, 28]. User behavior tracking for
security purposes drives UBA, where user accounts are no longer
the single indicator of who performed an incident. Nowadays,
many security companies have announced UBA tool integration or
plan to develop UBA in their systems [4, 19, 20, 35, 42].
UBA consists of two steps. The first is to model normal user
behaviors, and the second is to detect abnormal users by examining
how much they deviate from normal users. Many
metrics, algorithms, or machine learning models can be used to
identify an abnormal user [33, 35]. Contemporary UBA mostly
models users based on basic patterns or statistics, for example,
the total upload bytes and total download
bytes of a user [33]. However, to detect more sophisticated attacks, it
is vital to ensure high accuracy and descriptiveness of user activities.
Log Audit. Log audit has been used in many fields of security
research, such as forensics analysis [22, 23, 27], intrusion recovery
[14, 21], and intrusion detection [10]. One of the most widely
adopted log levels is the OS level, where the basic units are processes,
files, sockets, etc. The reason is that the OS level maintains high
fidelity of the states of the entire system while incurring acceptable
CPU and storage overhead [22]. There are previous works focusing
on the reduction of the storage overhead while not losing much
information [25, 46]. Besides, there are also previous works that
attempt to increase data granularity based on OS-level logs [24, 26].
One important use of log audit is to understand an attack, especially
more sophisticated attacks (APT attacks) or unknown attacks.
Security experts rely on the logs to determine how an attack happens
[22, 24, 27], as well as its impact on the system [23]. They
capture the causal relationship among processes, files, or sockets,
and reconstruct the provenance of an attack and its ramifications.
HERCULE [30] leverages community discovery algorithms to identify
attacks based on the fact that the attack activities belong to
the same community in a graph. [6] logs events at the proxy and
focuses on parsing traffic from application protocols like SQL.
User Interaction Detection. Distinguishing bot-generated
data (system-triggered) from human-generated data (user-triggered)
is a long-studied subject that has applications in many fields. Generally
there are two types of detection. One is active detection,
such as CAPTCHA [44], which is easy to implement and (arguably)
more accurate, but intrusive. The other type is passive detection,
profiles and identify higher-level inconsistencies. For instance, one which relies on processing log events to detect abnormal behav-
cannot be both “Alice” and “Bob” in the same session profile. Besides iors. The related previous works include detecting game cheaters
providing a foundation of UBA systems, there are many other use through Human Observational Proof [13], bots in online social
cases that can be built on top of UTrack. For example, it can be used networks [7–9, 43], detecting malicious web bots/crawlers, Google
in forensics analysis to study the behavior of attackers (such that reCaptcha [34], and malicious crawler detection [45]. There are
the attacker becomes the POI) and reveal more seemingly benign some significant differences between these techniques and ours. A
major one is that they have specially tailored data input. For example, they rely on the user agent and cookie lifetime in Google’s reCaptcha [34], a user account favored access log system in [13, 45], or side information such as the social graph [7, 9, 13, 43].

9 CONCLUSION
This paper presents UTrack, a novel user tracking system that connects events under different user accounts and from different hosts to form a novel holistic user session profile. UTrack enables a system auditor to easily find out the activities of users inside enterprise networks. UTrack associates the activities of a user effectively by identifying a session root and then following both the local process lineage and the network control flow of the session root. To achieve scalability and salient description, UTrack employs an interaction detection module to sift out the most relevant events that result from users’ interactions, and models common file and activity patterns. Our evaluation in a real enterprise environment of 111 hosts shows UTrack’s effectiveness in producing accurate and concise user session profiles for system auditors to use.

REFERENCES
[1] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. 2014. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of the 2014 ACM CCS.
[2] Animashree Anandkumar, Chatschik Bisdikian, and Dakshi Agrawal. 2008. Tracking in a spaghetti bowl: monitoring transactions using footprints. In ACM SIGMETRICS Performance Evaluation Review, Vol. 36. 133–144.
[3] Richard Atterer, Monika Wnuk, and Albrecht Schmidt. 2006. Knowing the user’s every move: user activity tracking for website usability evaluation and implicit interaction. In WWW.
[4] BALABIT. 2015. Privileged Account Analytics - User Behavior Analytics Security Solution. https://www.balabit.com/privileged-account-analytics.
[5] Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. 2004. Using Magpie for Request Extraction and Workload Modelling. In USENIX OSDI.
[6] Adam Bates, Wajih Ul Hassan, Kevin Butler, Alin Dobra, Bradley Reaves, Patrick Cable, Thomas Moyer, and Nabil Schear. 2017. Transparent Web Service Auditing via Network Provenance Functions. In Proceedings of the 26th International Conference on World Wide Web. 887–895.
[7] Qiang Cao, Michael Sirivianos, Xiaowei Yang, and Tiago Pregueiro. 2012. Aiding the detection of fake accounts in large scale social online services. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. 15–15.
[8] Zi Chu, Steven Gianvecchio, Haining Wang, and Sushil Jajodia. 2010. Who is tweeting on Twitter: human, bot, or cyborg? In Proceedings of the 26th ACM Annual Computer Security Applications Conference. 21–30.
[9] George Danezis and Prateek Mittal. 2009. SybilInfer: Detecting Sybil Nodes using Social Networks. In NDSS. San Diego, CA.
[10] Dorothy E Denning. 1987. An intrusion-detection model. IEEE Transactions on software engineering 2 (1987), 222–232.
[11] David Devecsery, Michael Chow, Xianzheng Dou, Jason Flinn, and Peter M Chen. 2014. Eidetic Systems. In USENIX OSDI. 525–540.
[12] Steven Gianvecchio and Haining Wang. 2007. Detecting covert timing channels: an entropy-based approach. In Proceedings of the 14th ACM Conference on Computer and Communications Security. 307–316.
[13] Steven Gianvecchio, Zhenyu Wu, Mengjun Xie, and Haining Wang. 2009. Battle of botcraft: fighting bots in online games with human observational proofs. In Proceedings of the 16th ACM Conference on Computer and Communications Security. 256–268.
[14] Ashvin Goel, Kenneth Po, Kamran Farhadi, Zheng Li, and Eyal De Lara. 2005. The taser intrusion recovery system. In ACM SOSP. 163–176.
[15] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD explorations newsletter (2009).
[16] Jiawei Han, Jian Pei, and Yiwen Yin. 2000. Mining frequent patterns without candidate generation. In ACM SIGMOD Record, Vol. 29. 1–12.
[17] Wajih Ul Hassan, Mark Lemay, Nuraini Aguse, Adam Bates, and Thomas Moyer. 2018. Towards Scalable Cluster Auditing through Grammatical Inference over Provenance Graphs. In NDSS.
[18] Md Nahid Hossain, Sadegh M Milajerdi, Junao Wang, Birhanu Eshete, Rigel Gjomemo, R Sekar, Scott Stoller, and VN Venkatakrishnan. 2017. SLEUTH: Real-time attack scenario reconstruction from COTS audit data. In 26th USENIX Security Symposium. 487–504.
[19] IBM. 2016. IBM QRadar User Behavior Analytics. https://www.ibm.com/cz-en/marketplace/qradar-user-behavior-analytics.
[20] Johna Till Johnsons. 2015. User behavioral analytics tools can thwart security attacks. http://searchsecurity.techtarget.com/feature/User-behavioral-analytics-tools-can-thwart-security-attacks.
[21] Taesoo Kim, Xi Wang, Nickolai Zeldovich, and M Frans Kaashoek. 2010. Intrusion Recovery Using Selective Re-execution. In USENIX OSDI. 89–104.
[22] Samuel T King and Peter M Chen. 2003. Backtracking intrusions. ACM SOSP (2003), 223–236.
[23] Samuel T King, Zhuoqing Morley Mao, Dominic G Lucchetti, and Peter M Chen. 2005. Enriching Intrusion Alerts Through Multi-Host Causality. In NDSS.
[24] Kyu Hyung Lee, Xiangyu Zhang, and Dongyan Xu. 2013. High Accuracy Attack Provenance via Binary-based Execution Partition. In NDSS.
[25] Kyu Hyung Lee, Xiangyu Zhang, and Dongyan Xu. 2013. LogGC: garbage collecting audit log. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security. 1005–1016.
[26] Shiqing Ma, Kyu Hyung Lee, Chung Hwan Kim, Junghwan Rhee, Xiangyu Zhang, and Dongyan Xu. 2015. Accurate, low cost and instrumentation-free security audit logging for windows. In ACM ACSAC. 401–410.
[27] Shiqing Ma, Xiangyu Zhang, and Dongyan Xu. 2016. ProTracer: towards practical provenance tracing by alternating between logging and tainting. In Proceedings of NDSS, Vol. 16.
[28] Jonathan R Mayer and John C Mitchell. 2012. Third-party web tracking: Policy and technology. In IEEE Symposium on Security and Privacy 2012. 413–427.
[29] Sadegh M Milajerdi, Rigel Gjomemo, Birhanu Eshete, R Sekar, and VN Venkatakrishnan. 2019. Holmes: real-time apt detection through correlation of suspicious information flows. In 2019 IEEE Symposium on Security and Privacy. 1137–1152.
[30] Kexin Pei, Zhongshu Gu, Brendan Saltaformaggio, Shiqing Ma, Fei Wang, Zhiwei Zhang, Luo Si, Xiangyu Zhang, and Dongyan Xu. 2016. Hercule: Attack story reconstruction via community discovery on correlated log graph. In Proceedings of the 32nd Annual Conference on Computer Security Applications. ACM, 583–595.
[31] Patrick Reynolds, Janet L Wiener, Jeffrey C Mogul, Marcos K Aguilera, and Amin Vahdat. 2006. WAP5: black-box performance debugging for wide-area systems. In Proceedings of the 15th International Conference on World Wide Web. 347–356.
[32] Bo Sang, Jianfeng Zhan, Gang Lu, Haining Wang, Dongyan Xu, Lei Wang, Zhihong Zhang, and Zhen Jia. 2012. Precise, scalable, and online request tracing for multitier services of black boxes. IEEE Transactions on Parallel and Distributed Systems 23, 6 (2012), 1159–1167.
[33] Madhu Shashanka, Min-Yi Shen, and Jisheng Wang. 2016. User and entity behavior analytics for enterprise security. In 2016 IEEE Big Data. 1867–1874.
[34] Suphannee Sivakorn, Jason Polakis, and Angelos D Keromytis. 2016. I’m not a human: Breaking the Google reCAPTCHA. Black Hat,(i) (2016), 1–12.
[35] Splunk. 2015. Splunk User Behavior Analytics. https://www.splunk.com/en_us/products/premium-solutions/user-behavior-analytics.html.
[36] Byung-Chul Tak, Chunqiang Tang, Chun Zhang, Sriram Govindan, Bhuvan Urgaonkar, and Rong N Chang. 2009. vPath: Precise Discovery of Request Processing Paths from Black-Box Observations of Thread and Network Activities. In USENIX ATC.
[37] Eno Thereska, Brandon Salmon, John Strunk, Matthew Wachs, Michael Abd-El-Malek, Julio Lopez, and Gregory R Ganger. 2006. Stardust: tracking activity in a distributed storage system. In ACM SIGMETRICS Performance Evaluation Review, Vol. 34. 3–14.
[38] Mike Tierney. 2015. The Rise of User Behavior Analytics. http://www.veriato.com/company/blog/veriato-blog/2015/12/15/the-rise-of-user-behavior-analytics.
[39] Roy Hodgman Tod Beardsley. 2015. RAPID 7 Research Report: Understanding User Behavior Analytics.
[40] Trustwave. 2015. Trustwave global security report. https://www2.trustwave.com/rs/815-RFM-693/images/2015_TrustwaveGlobalSecurityReport.pdf.
[41] Melissa Turcotte and Juston Shane Moore. 2017. Technical Report LA-UR-17-21663: User Behavior Analytics.
[42] VARONIS. 2016. User Behavior Analytics. https://www.varonis.com/user-behavior-analytics/.
[43] Bimal Viswanath, Ansley Post, Krishna P Gummadi, and Alan Mislove. 2010. An analysis of social network-based sybil defenses. ACM SIGCOMM Computer Communication Review 40, 4 (2010), 363–374.
[44] Luis Von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. 2008. recaptcha: Human-based character recognition via web security measures. Science 321, 5895 (2008), 1465–1468.
[45] Shengye Wan, Yue Li, and Kun Sun. 2017. Protecting Web Contents against Persistent Distributed Crawlers. In IEEE ICC.
[46] Zhang Xu, Zhenyu Wu, Zhichun Li, Kangkook Jee, Junghwan Rhee, Xusheng Xiao, Fengyuan Xu, Haining Wang, and Guofei Jiang. 2016. High fidelity data reduction for big data security dependency analyses. In ACM CCS.
