Detecting Data Exfiltration by Integrating Information Across Layers
IEEE IRI 2013, August 14-16, 2013, San Francisco, California, USA
978-1-4799-1050-2/13/$31.00 ©2013 IEEE
list of operations producing expected outputs, it fails to take into consideration cases where the hardware has been tampered with so that it not only passes all the tests but also contains a malicious circuit that executes additional sabotaging functionality on top of the expected activities [25]. Since most intrusion detection and prevention software tries to protect users by actively monitoring inbound data from the network or by looking for known attack signatures, very little of it can detect the aforementioned attack scenarios.

We describe a novel detection system that monitors a set of system and network level features of a host system and flags alerts based on temporally related anomalous behavior detected in multiple monitored modules. It is well known that building a behavioral model of the system under normal usage and detecting deviations from this model can provide strong hints of an attack [3, 7, 26]. The individual alerts produced by each module are then expressed as Resource Description Framework (RDF) assertions. These assertions, when processed by semantic rules, produce highly effective intrusion alerts that have a low false positive rate.

2. Related work

Fisk et al. [2] propose a global vault to prevent unauthorized data breaches by separating the employee machines from the ones that contain sensitive information. They implement this strict isolation between the user machines and the servers by placing limits such as a whitelist of allowed inter-machine processes and a maximum allowed bandwidth. This approach is impractical for large organizations because it puts stringent conditions on what a user can or cannot do.

Liu et al. [9] describe a framework to actively monitor and react in cases of intrusions and their possible detection. Their proposed intrusion detection engine is placed at the network edge, scans outbound traffic, and decides whether to forward the data to the outside node. The main drawback of the system is that its live monitoring and intrusion prevention approach must mine a large amount of data and decide whether or not to forward it without affecting outbound bandwidth. Even for a medium-sized corporation, a single module deployed at the lone egress point of a corporate network would require tremendous processing power to monitor and analyze each outgoing packet at runtime.

Ramachandran et al. [23] claim that their behavior-based model can catch most network data exfiltration scenarios. They first learn the normal behavior of a system by using kernel density estimation methods on system features like memory consumption, CPU utilization and disk usage. The base values are used to derive correlation coefficients on test data which came from real attacks. This approach analyzes overall system features like the total memory and total CPU consumption level. A drawback of this detection technique is that it can be easily evaded by trojans with an extremely small memory and CPU footprint, which do not cause a significant deviation in the overall numbers for the host machine when observed as a whole. The false positive rate of their system is not documented.

There has also been some work building on the philosophy of using multiple sensing modules to detect attacks. However, these multiple sensors are typically all at the same level of the stack: just the host, or just the network. Such a narrow feature set can reduce the accuracy of the alerts by increasing the false positive rate of the intrusion detection system. Kerschbaum et al. [5] discuss the use of multiple sensors embedded into the operating system, but only describe in detail sensors that specifically pertain to network-based attacks.

Process profiling is proposed by Okazaki et al. [18], who derive a normal usage pattern based on system call sequences and compare this to the profile of a system under attack. A similar approach based on a system call profile is proposed by Eskin et al. [1]. Various machine learning approaches have been applied to selected system feature sets to classify attacks with good results, starting from the seminal work of Forrest et al. [3] and Lee et al. [7]. Undercoffer [26] created a model of a running system under normal usage and then used that model to detect later attacks using machine learning algorithms. Mathews et al. [10] also took a machine learning approach, identifying a network-based feature set that produced good classification results in identifying malicious network data.

3. System Design

We describe a prototype intrusion detection system (IDS) that is highly modular and has multiple sensing modules in place across multiple layers of the system. Each alert from an individual monitoring sensor is represented as a set of RDF assertions. Producing RDF assertions allows our system to fit into a larger semantic integration and reasoning framework being developed in our laboratory [10, 4] that uses traditional and non-traditional sensors to form a collaborative approach to cybersecurity. The assertions from our system can be integrated with other information and the results augmented using various reasoners, including description-
of our system.
In the remainder of this section we describe some of
the modules we have implemented and that are used
in the example exfiltration scenario.
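As an illustration of how a module alert might be expressed as RDF assertions, the following minimal sketch serializes one alert as N-Triples lines. The namespace `http://example.org/ids#` and the property names `raisedBy`, `aboutProcess` and `observedAt` are hypothetical placeholders chosen for this example; they are not the vocabulary used by our system.

```python
from datetime import datetime, timezone

# Hypothetical namespace and terms, used only for illustration.
IDS = "http://example.org/ids#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def alert_to_ntriples(alert_id, module, process, when):
    """Return the N-Triples lines that describe one alert from one module."""
    subject = f"<{IDS}alert/{alert_id}>"
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        f"{subject} <{RDF_TYPE}> <{IDS}Alert> .",
        f"{subject} <{IDS}raisedBy> <{IDS}module/{module}> .",
        f'{subject} <{IDS}aboutProcess> "{escape_literal(process)}" .',
        f'{subject} <{IDS}observedAt> "{stamp}"^^<{XSD_DATETIME}> .',
    ]


# Example: a memory-module alert about notepad.exe.
example_time = datetime(2013, 5, 29, 12, 0, 0, tzinfo=timezone.utc)
for line in alert_to_ntriples("42", "memory", "notepad.exe", example_time):
    print(line)
```

Carrying a timestamp on every alert is what would let a downstream rule relate alerts from different modules that occur close together in time.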
analyze TCP sessions is based on earlier work done in our group [10] that had shown good results in detecting malicious network traffic by analyzing TCP sessions on inbound data.

We use two features to model the outbound network flow characteristics: the mean inter-departure time of packets and the number of packets in a single TCP session. The first feature captures the rate at which packets leave the host and the second captures the sheer quantity of outbound data. Taken together, these two features give a good picture of whether a substantial amount of data is leaving a system in a short span of time. Both the packet rate and the packet count are expected to be high when an attacker infiltrates a victim and tries to maximize information theft by exfiltrating the data as quickly as possible.

3.4 Dynamic link libraries

We profile the list of dynamic link library (DLL) calls a process makes during its normal execution. It is a fair assumption that, for an extensively profiled process, one can gather a finite list of all DLL files that the process typically opens for its regular use. A process making a DLL call that is not among its normal set may have been compromised, so an alert is generated.

3.5 Registry keys

Similar to the list of DLLs, we maintain a list of all registry keys a Windows process usually accesses. Any access to a new registry key is another indicator that is flagged as a sign of a process possibly executing maliciously. A trojaned process can have multiple reasons to access registry keys it has never accessed before. A simple process like notepad, for example, should not have to access a network configuration registry entry. If it does, there is a high probability that a malicious process pretending to be notepad is accessing network information in order to connect to a remote server.

3.6 System calls

Past work [18, 1, 8] has shown that system call monitoring can produce good indicators of an attack. One of the process characteristics that we monitor to detect deviations from the norm is the set of system calls made by that process. We assume that a trojan hiding underneath an existing process is likely to make a distinct set of system calls which, if monitored, can be used to raise an alert. We use a fairly simple approach, essentially looking only at the number of system calls made, not their pattern.

4. Profile building and live monitoring

Most of our development and testing was on systems running the Windows operating system due to the high number of publicly available attacks specifically targeted at it. Our profiling and process monitoring module is currently limited to Windows-based processes, though similar routines can be easily written (and in some cases already exist) for Linux. We successfully profiled and monitored a list of nine common Windows processes: calc.exe, conhost.exe, explorer.exe, firefox.exe, msinfo32.exe, mspaint.exe, notepad.exe, powershell.exe and wmplayer.exe.

The decision to select these nine was based on three factors. First, we wanted a list of processes that are either pre-installed in a standard configuration or are part of very popular software packages. Second, we wanted a wide range of processes in terms of their memory consumption pattern to avoid biasing our results. The final selection criterion was the amount of user interaction each of the monitored processes sees in its lifetime. We wanted a broad variety of processes, ranging from background processes such as explorer.exe or conhost.exe that do not involve user interaction to processes like firefox.exe and wmplayer.exe that do.

We ran the profiling module for three to four days with intermittent use of each process to produce average values of memory consumed by the heap, stack and private data sections. We also calculated the standard deviation for each of these three memory types for each process. Once the profile was built, we monitored these processes live and raised alerts if the memory consumption for any of the three memory types went over three standard deviations from the averaged value.

We implemented a simple non-statistical approach to profile the list of DLLs, Windows registry keys and system calls that the processes used under normal usage. During the profiling phase of these processes, we prepared a whitelist of all DLLs, system calls and registry keys, which was essentially a list of everything the processes called or accessed under normal use. If any DLL, system call or registry key outside this previously built whitelist is used, an alert is raised.

For our networking module we used the libpcap [15] libraries to implement packet sniffing for all outbound traffic. The splitcap [14] tool was used to extract TCP session based information from the network packets being monitored. The system was run for a few days and all IP addresses that the host communicated with were logged. This list of IP addresses served as a whitelist of all destinations that were deemed safe to communicate with. A network packet sent to any IP
address outside this list would throw an alert. Packet sniffing sessions were initiated on five machines in our lab used by multiple users who had volunteered. The data collected from these volunteers was aggregated to produce overall network flow characteristics, namely an average value of the inter-packet departure time per TCP session and the average number of packets sent in a single TCP session.

The hardware monitoring module had a simple implementation. All connected hardware devices were profiled using their manufacturer UUID as their identification number and alerts were raised for any new hardware introduced in the system. In the case of USB flash drives, the produced alerts also record whether the USB device was seen in the past or not. This allows highly flexible rules to run on our RDF assertions, such as a sample rule that calls for no alert to be raised if the USB drive inserted in the system has been frequently used in the past. This approach can be extended to other devices, for instance by logging the MAC address of a network card or a disk serial number.

5. Testing our system

We used the Metasploit [17] open source penetration-testing framework to create and apply attacks in order to test our intrusion detection system. Within Metasploit, we extensively used the social engineering toolkit (SET) [22]. Social engineering based attacks are among the most common forms used today for data exfiltration. SET is popular, with over two million downloads, for two reasons: (1) it offers a large number of easy-to-run attacks that do not require much experience or background knowledge, and (2) it is tightly integrated with Metasploit, allowing pen-testers and white hat hackers to develop custom exploits by combining SET based attack options with custom payloads. The list of past attacks that used social engineering to infiltrate their victims includes highly sophisticated APTs like Stuxnet [6], which was spread using USB drives, and the Aurora attack on Google [19], which is believed to have been initiated by sending malicious URLs to Google employees. The social engineering toolkit under Metasploit allows us to test our system against similar attacks that can be launched by using malicious hardware to directly transfer the Trojan payloads on to a known system.

We ran the following five attacks available in Metasploit’s SET:
1. PowerShell attack using shellcode injection
2. Metasploit executable
3. Applet based attack
4. Remote Administration using HTTP tunneling (RATTE)
5. Tab nabbing attack

Once the victim’s machine was successfully compromised and complete access gained, we tried to mimic a real attack resulting in data exfiltration. The first step was to hide our malicious process behind an existing one using code injection. We then downloaded files from the victim’s machine, took screenshots of the victim’s screen, and captured keystrokes. We also executed remote processes and extracted network configuration information from the victim.

We ran the same set of attacks against six different commercially available security software systems. These covered traditional anti-virus systems, firewalls and pure intrusion detection systems. The list included Microsoft Forefront Endpoint, Spyware Terminator, Windows Defender, Snort, AVG and Comodo Firewall.

6. Results

Every time a new USB flash drive was inserted, our hardware monitoring module was able to produce an alert with the additional information of whether the flash drive had been seen before or not. Results from the memory monitoring module (Table 1) show that all three memory types can potentially be good features to monitor in order to detect an attack. For the nine sample processes, however, heap and stack turned out to be less accurate indicators than the private data memory type.

We observed that for most of the profiled processes, the private data memory type showed a significant jump whenever we tried to hide our malware behind a particular process using code injection. The three processes for which the jump was less than one standard deviation (<1σ) were Microsoft Paint (mspaint.exe), Windows Media Player (wmplayer.exe), and Firefox (firefox.exe). This was largely because these processes have a highly variable memory consumption pattern that depends on their usage, which leads to a high standard deviation value. Firefox, for example, can start as a small process with a memory footprint of a few hundred kilobytes, but can reach a value more than ten times that due to the heavy graphic content of the websites being viewed or simply the number of concurrent tabs opened by the user. In the case of Windows Media Player, we found surges in memory usage when the player was used to stream high definition videos when
Process          Priv data   Stack     Heap
calc.exe         554σ        11.14σ    3.72σ
conhost.exe      1964σ       32σ       428σ
explorer.exe     30.8σ       0.96σ     2.32σ
firefox.exe      0.47σ       2.1σ      15.6σ
msinfo32.exe     31σ         0.047σ    0.89σ
mspaint.exe      1.08σ       0.38σ     0.24σ
notepad.exe      42.58σ      0.01σ     2σ
powershell.exe   1972σ       21σ       15.9σ
wmplayer.exe     0.65σ       0.9σ      0.82σ
Table 1. Memory deviations for attacked processes

Process          DLL   Registry   System call
calc.exe         17    31         4
conhost.exe      27    233        3
explorer.exe     22    34         3
firefox.exe      5     40         0
msinfo32.exe     21    45         0
mspaint.exe      14    280        0
notepad.exe      16    31         0
powershell.exe   34    310        0
wmplayer.exe     84    2175       9
Table 2. The number of new DLL calls, registry keys accessed and system calls are indicators of compromised processes.
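The two kinds of check behind Tables 1 and 2 can be sketched as follows: a deviation of a live memory reading from the profiled mean, expressed in units of σ and compared against the three-standard-deviation threshold, and a whitelist comparison that reports items not seen during profiling. This is a minimal sketch, not our implementation. The function names and sample readings are made up for illustration, and treating deviations in either direction as anomalous is an assumption.

```python
from statistics import mean, stdev


def build_memory_profile(samples):
    """Return (mean, standard deviation) of memory readings from profiling."""
    return mean(samples), stdev(samples)


def deviation_in_sigma(observed, profile):
    """Distance of a live reading from the profiled mean, in units of sigma."""
    mu, sigma = profile
    if sigma == 0:
        # A perfectly constant profile: any change is an unbounded deviation.
        return 0.0 if observed == mu else float("inf")
    # Assumption: deviations above and below the mean are treated alike.
    return abs(observed - mu) / sigma


def memory_alert(observed, profile, threshold=3.0):
    """True when the reading is more than `threshold` sigma from the mean."""
    return deviation_in_sigma(observed, profile) > threshold


def new_items(observed, whitelist):
    """Items (DLLs, registry keys, system calls) absent from the whitelist."""
    return set(observed) - set(whitelist)


# Illustrative readings only; units are arbitrary.
profile = build_memory_profile([100, 102, 98, 101, 99])
print(memory_alert(110, profile))   # far from the profiled mean
print(memory_alert(101, profile))   # within normal variation

profiled_dlls = {"kernel32.dll", "user32.dll"}
live_dlls = {"kernel32.dll", "ws2_32.dll"}
print(new_items(live_dlls, profiled_dlls))
```

A process whose usage varies widely has a large σ, so the same absolute jump yields a small deviation in σ units. This is consistent with the low values reported for firefox.exe and wmplayer.exe in Table 1.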
Total TCP sessions monitored 1154
Malicious sessions 12
MIDPT module alerts 114
Packet count alerts 34
Combined alerts 4
True positives 3
False positives 1
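The relationship between the per-feature alerts and the combined alerts listed above can be illustrated with a small sketch. It computes the mean inter-departure packet time (MIDPT) and the packet count for one TCP session and raises a combined alert only when both features are anomalous. The comparison rule, the `factor` parameter and the baseline values are assumptions made for this example; the exact thresholds used by our networking module are not reproduced here.

```python
from statistics import mean


def session_features(timestamps):
    """Return (mean inter-departure time, packet count) for one TCP session.

    `timestamps` holds the departure times, in seconds, of outbound packets.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # A single-packet session has no gaps; treat it as infinitely slow.
    midpt = mean(gaps) if gaps else float("inf")
    return midpt, len(ordered)


def session_alerts(timestamps, baseline_midpt, baseline_count, factor=2.0):
    """Flag a session per feature and when both features agree.

    Assumed rule: packets leaving `factor` times faster than the baseline, or
    a packet count `factor` times larger than the baseline, is anomalous.
    """
    midpt, count = session_features(timestamps)
    fast = midpt < baseline_midpt / factor
    large = count > baseline_count * factor
    return {"midpt": fast, "packet_count": large, "combined": fast and large}


# Hypothetical baseline: 0.5 s between packets, 20 packets per session.
burst = [i * 0.01 for i in range(100)]      # many packets, sent quickly
slow = [float(i) for i in range(10)]        # few packets, sent slowly
short_burst = [i * 0.01 for i in range(5)]  # quick, but only a few packets

for name, session in [("burst", burst), ("slow", slow), ("short", short_burst)]:
    print(name, session_alerts(session, baseline_midpt=0.5, baseline_count=20))
```

Requiring both features to agree is what reduces the 114 MIDPT alerts and 34 packet count alerts to 4 combined alerts in the figures above.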
System                         PowerShell   Metasploit   Applet   HTTP tunneling   Tab nabbing
Microsoft Forefront Endpoint   Missed       Caught       Caught   Missed           Missed
Spyware Terminator             Missed       Missed       Missed   Missed           Missed
Windows Defender               Missed       Missed       Missed   Missed           Missed
Snort                          Caught       Caught       Missed   Missed           Missed
AVG                            Missed       Caught       Missed   Caught           Missed
Comodo Firewall                Missed       Caught       Missed   Caught           Missed
Our system                     Caught       Caught       Caught   Caught           Caught
Table 4. Our system performed well compared to others on experiments with several common types of attacks.
[4] A. Joshi, R. Lal, T. Finin, and A. Joshi. Extracting cybersecurity related linked data from text. In Seventh IEEE International Conference on Semantic Computing. IEEE Computer Society, September 2013.
[5] F. Kerschbaum, E. Spafford, and D. Zamboni. Using embedded sensors for detecting network attacks. In ACM Workshop on Intrusion Detection Systems, 2000.
[6] R. Langner. Stuxnet: Dissecting a cyberwarfare weapon. Security & Privacy, 9(3):49–51, 2011.
[7] W. Lee and S. J. Stolfo. A framework for constructing features and models for intrusion detection systems. ACM Transactions on Information and System Security, 3(4):227–261, Nov. 2000.
[8] W. Lee, S. J. Stolfo, and P. K. Chan. Learning patterns from unix process execution traces for intrusion detection. In AAAI Workshop on AI Approaches to Fraud Detection and Risk Management, 1997.
[9] Y. Liu, C. Corbett, K. Chiang, R. Archibald, B. Mukherjee, and D. Ghosal. Sidd: A framework for detecting sensitive data exfiltration by an insider attack. In 42nd Hawaii Int. Conf. on System Sciences, pages 1–10. IEEE, 2009.
[10] M. L. Mathews, P. Halvorsen, A. Joshi, and T. Finin. A collaborative approach to situational awareness for cybersecurity. In 8th Int. Conf. on Collaborative Computing: Networking, Applications and Worksharing, pages 216–222. IEEE, 2012.
[11] Metasploit Commands. http://hacking-tutorial.com/tips-and-trick/7-metasploit-meterpreter-core-commands-you-should-know/. (accessed 2013-05-29).
[12] Metasploit Tutorial. http://offensive-security.com/metasploit-unleashed/Meterpreter_Basics. (accessed 2013-05-29).
[13] MeterpreterClient. http://wikibooks.org/wiki/Metasploit/MeterpreterClient. (accessed 2013-05-29).
[14] SplitCap. http://netresec.com/?page=SplitCap. (accessed 2013-05-29).
[15] TcpDump and LibPcap. http://tcpdump.org/. (accessed 2013-05-29).
[16] Using Metasploit Meterpreter Keylogger. http://hacking-tutorial.com/hacking-tutorial/5-step-using-metasploit-meterpreter-keylogger-keylogging/. (accessed 2013-05-29).
[17] J. O’Gorman, D. Kearns, and M. Aharoni. Metasploit: The Penetration Tester’s Guide. No Starch Press, 2011.
[18] Y. Okazaki, I. Sato, and S. Goto. A new intrusion detection method based on process profiling. In Symposium on Applications and the Internet, pages 82–90. IEEE, 2002.
[19] Operation aurora. http://wikipedia.org/wiki/Operation_Aurora. (accessed 2013-05-29).
[20] IBM distributes infected USB drives at conference. http://scmagazine.com/ibm-distributed-infected-usb-drives-at-conference/article/170862/. (accessed 2013-05-29).
[21] Netbook comes with factory-sealed malware. http://scmagazine.com/netbook-comes-with-factory-sealed-malware/article/137147/. (accessed 2013-05-29).
[22] N. Pavkovic and L. Perkov. Social Engineering Toolkit: a systematic approach to social engineering. In MIPRO 2011, 34th International Convention, pages 1485–1489. IEEE, 2011.
[23] R. Ramachandran, S. Neelakantan, and A. Bidyarthy. Behavior model for detecting data exfiltration in network environment. In Conf. on Internet Multimedia Systems Architecture and Application. IEEE, 2011.
[24] P. Sharma. A multilayer framework to catch data exfiltration. Master’s thesis, University of Maryland, Baltimore County, August 2013.
[25] M. Tehranipoor and F. Koushanfar. A survey of hardware trojan taxonomy and detection. Design & Test of Computers, IEEE, 27(1):10–25, 2010.
[26] J. Undercoffer. Intrusion Detection: Modeling System State to Detect and Classify Aberrant Behavior. PhD thesis, University of Maryland, Baltimore County, Feb. 2004.
[27] J. Undercoffer, A. Joshi, T. Finin, and J. Pinkston. Using DAML+OIL to classify intrusive behaviours. Knowledge Engineering Review, 18(3):221–241, 2003.
[28] J. Undercoffer, A. Joshi, and J. Pinkston. Modeling computer attacks: An ontology for intrusion detection. In 6th Int. Symp. on Recent Advances in Intrusion Detection, pages 113–135. Springer, 2003.