Towards Generating Real-Life Datasets For Network Intrusion Detection


International Journal of Network Security, Vol.17, No.6, PP.683-701, Nov. 2015


Monowar H. Bhuyan1, Dhruba K. Bhattacharyya2, and Jugal K. Kalita3
(Corresponding author: Monowar H. Bhuyan)

1 Department of Computer Science and Engineering, Kaziranga University, Jorhat-785006, Assam, India (Email: [email protected])
2 Department of Computer Science and Engineering, Tezpur University, Tezpur-784028, Assam, India (Email: [email protected])
3 Department of Computer Science, University of Colorado at Colorado Springs, CO 80918, USA (Email: [email protected])

(Received February 5, 2015; revised and accepted Apr. 20 & May 9, 2015)

Abstract

With exponential growth in the number of computer applications and the sizes of networks, the potential damage that can be caused by attacks launched over the Internet keeps increasing dramatically. A number of network intrusion detection methods have been developed with respective strengths and weaknesses. The majority of network intrusion detection research and development is still based on simulated datasets due to the non-availability of real datasets. A simulated dataset cannot represent a real network intrusion scenario. It is important to generate real and timely datasets to ensure accurate and consistent evaluation of detection methods. In this paper, we propose a systematic approach to generate unbiased, full-feature, real-life network intrusion datasets to compensate for the crucial shortcomings of existing datasets. We establish the importance of an intrusion dataset in the development and validation process of detection mechanisms, identify a set of requirements for effective dataset generation, and discuss several attack scenarios and their incorporation in generating datasets. We also establish the effectiveness of the generated dataset in the context of several existing datasets.

Keywords: Dataset, intrusion detection, NetFlow, network traffic

1 Introduction

In network intrusion detection, particularly when using anomaly based detection, it is difficult to accurately evaluate, compare and deploy a system that is expected to detect novel attacks due to the scarcity of adequate datasets. Before deploying in any real world environment, an anomaly based network intrusion detection system (ANIDS) must be trained, tested and evaluated using real labelled network traffic traces with an intensive set of intrusions or attacks. This is a significant challenge, since not many such datasets are available. Therefore, detection methods and systems are evaluated only with a few publicly available datasets that lack comprehensiveness and completeness [2, 17] or are outdated. For example, the Cooperative Association for Internet Data Analysis (CAIDA) Distributed Denial of Service (DDoS) 2007, Lawrence Berkeley National Laboratory (LBNL), and ICSI datasets are heavily anonymized without payload information, decreasing their research utility. Researchers also frequently use a single NetFlow based intrusion dataset found at [25, 40] with a limited number of attacks.

1.1 Importance of Datasets

In network traffic anomaly detection, it is always important to test and evaluate detection methods and systems using datasets as network scenarios evolve. We enumerate the following reasons to justify the importance of a dataset.

• Repeatability of experiments: Researchers should be able to repeat experiments with the dataset and get similar results when using the same approach. This is important because the proposed method should cope with the evolving nature of attacks and network scenarios.

• Validation of new approaches: New methods and algorithms are being continuously developed to detect network anomalies. It is necessary that every new approach be validated.

• Comparison of different approaches: State-of-the-art network anomaly detection methods must not only be validated, but also show improvements over older methods in performance in a quantifiable manner. For example, the DARPA 1998 dataset [26] is commonly used for performance evaluation of anomaly detection systems [24], so that one method can be compared against others.

• Parameters tuning: To properly obtain a model that classifies normal from malicious traffic, it is necessary to tune model parameters. Network anomaly detection assumes a normality model to identify malicious traffic. For example, Cemerlic et al. [9] and Thomas et al. [44] use the attack-free part of the DARPA 1999 dataset for training to estimate parameter values.

• Dimensionality or the number of features: An optimal set of features or attributes should be used to represent normal as well as all possible attack instances.

1.2 Requirements

Although good datasets are necessary for validating and evaluating IDSs, generating such datasets is a time consuming task. A dataset generation approach should meet the following requirements.

• Real world: A dataset should be generated by monitoring the daily situation in a realistic way, such as the daily network traffic of an organization.

• Completeness in labelling: The labelling of traffic as benign or malicious must be backed by proper evidence for each instance. The aim these days should be to provide labelled datasets at both packet and flow levels for each piece of benign and malicious traffic.

• Correctness in labelling: Given a dataset, the labelling of each traffic instance must be correct. This means that our knowledge of the security events represented by the data has to be certain.

• Sufficient trace size: The generated dataset should be unbiased in terms of size in both benign and malicious traffic instances.

• Concrete feature extraction: Extraction of an optimal set of concrete features when generating a dataset is important because such features play an important role when validating a detection mechanism.

• Diverse attack scenarios: With the increasing frequency, size, variety and complexity of attacks, intrusion threats have become more complex, including in the selection of targeted services and applications. When contemplating attack scenarios for dataset generation, it is important to tilt toward a diverse set of multi-step attacks that are recent.

• Ratio between normal and attack traffic: Most benchmark datasets are biased because the proportions of normal and attack traffic are not the same. This is because normal traffic is usually much more common than anomalous traffic. However, an intrusion detection method or system evaluated using biased datasets may not be fit for real-time deployment in certain situations. Most existing datasets have been created based on the following assumptions.

  – Anomalous traffic is statistically different from normal traffic [13].
  – The majority of network traffic instances is normal [36].

  However, unlike most traditional intrusions, DDoS attacks do not follow these assumptions because they change the network traffic rate dynamically and employ multi-stage attacks. A DDoS dataset must reflect this fact.

1.3 Motivation and Contributions

By considering the aforementioned requirements, we propose a systematic approach for generating real-life network intrusion datasets at both packet and flow levels, with a view to analyzing, testing and evaluating network intrusion detection methods and systems with a clear focus on anomaly based detectors. The following are the major contributions of this paper.

• We present guidelines for real-life intrusion dataset generation.

• We discuss systematic generation of both normal and attack traffic.

• We extract features from the captured network traffic, such as basic, content-based, time-based, and connection-based features, using a distributed feature extraction framework.

• We generate three categories of real-life intrusion datasets, viz., (i) the TUIDS (Tezpur University Intrusion Detection System) intrusion dataset, (ii) the TUIDS coordinated scan dataset, and (iii) the TUIDS DDoS dataset. These datasets are available for the research community to download for free.

1.4 Organization of the Paper

The remainder of the paper is organized as follows. Section 2 discusses prior datasets and their characteristics. Section 3 is dedicated to the discussion of a systematic approach to generate real-life datasets for intrusion detection with a focus on network anomaly detectors. Finally, Section 4 presents observations and concluding remarks.

2 Existing Datasets

As discussed earlier, datasets play an important role in the testing and validation of network anomaly detection methods or systems. A good quality dataset not only allows us to identify the ability of a method or a system to detect anomalous behavior, but also allows us to gauge its potential effectiveness when deployed in real operating environments.
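The "ratio between normal and attack traffic" requirement in Section 1.2 is easy to check mechanically. As a minimal sketch (the label values below are hypothetical, not taken from any of the datasets discussed), the class proportions of a labelled trace can be computed as follows:

```python
from collections import Counter

def class_ratio(labels):
    """Return per-class proportions for a list of instance labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical labelled traffic instances, heavily skewed towards normal.
labels = ["normal"] * 9500 + ["dos"] * 400 + ["probe"] * 100
ratios = class_ratio(labels)
print(ratios["normal"])   # 0.95: a biased dataset in the sense above
```

A dataset generator can run such a check after capture and continue collecting attack traffic until the proportions reach the intended balance.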
Several datasets are publicly available for testing and evaluation of network anomaly detection methods and systems. A taxonomy of network intrusion datasets is shown in Figure 1. We briefly discuss each of them below.

Figure 1: A taxonomy of network intrusion datasets [2]

2.1 Synthetic Datasets

Synthetic datasets are generated to meet specific needs or certain conditions or tests that real data satisfy. Such datasets are useful when designing any prototype system for theoretical analysis, so that the design can be refined. A synthetic dataset can be used to test and create many different types of test scenarios. This enables designers to build realistic behavior profiles for normal users and attackers based on the dataset to test a proposed system. This provides initial validation of a specific method or system; if the results prove to be satisfactory, the developers then continue to evaluate the method or system with domain-specific real-life data.

2.2 Benchmark Datasets

We discuss seven publicly available benchmark datasets generated using simulated environments in large networks. Different attack scenarios were simulated during the generation of these datasets.

2.2.1 KDDcup99 Dataset

Since 1999, the KDDcup99 dataset [21] has been the most widely used dataset for evaluation of network based anomaly detection methods and systems. This dataset was prepared by Stolfo et al. [41] and is built upon the data captured in the DARPA98 IDS evaluation program. The KDD training dataset consists of approximately 4,900,000 single connection vectors, each of which contains 41 features and is labelled as either normal or as an attack of a specific type. The test dataset contains about 300,000 samples covering the 24 attack types of the training data, with an additional 14 attack types appearing in the test dataset only [14]. The represented attacks are mainly of four types: denial of service, remote-to-local, user-to-root, and surveillance or probing.

• Denial of Service (DoS): An attacker attempts to prevent valid users from using a service provided by a system. Examples include SYN flood, smurf and teardrop attacks.

• Remote to Local (r2l): Attackers try to gain entrance to a victim machine without having an account on it. An example is the password guessing attack.

• User to Root (u2r): Attackers have access to a local victim machine and attempt to gain the privileges of a superuser. Examples include buffer overflow attacks.

• Probe: Attackers attempt to acquire information about the target host. Some examples of probe attacks are port-scans and ping-sweep attacks.

Background traffic was simulated and the attacks were all known. The training set, consisting of seven weeks of labelled data, is available to the developers of intrusion detection systems. The testing set also consists of simulated background traffic and known attacks, including some attacks that are not present in the training set. The distribution of normal and attack traffic for this dataset is reported in Table 1. We also identify the services associated with each category of attacks [12, 22] and summarize them in Table 2.

2.2.2 NSL-KDD Dataset

Analysis of the KDD dataset showed that there were two important issues with the dataset which highly affect the performance of evaluated systems, often resulting in poor evaluation of anomaly detection methods [43]. To address these issues, a new dataset known as NSL-KDD [32], consisting of selected records of the complete KDD dataset, was introduced. This dataset is also publicly available for researchers1 and has the following advantages over the original KDD dataset.

• This dataset doesn't contain superfluous and repeated records in the training set, so classifiers or detection methods will not be biased towards more frequent records.

• There are no duplicate records in the test set. Therefore, the performance of learners is not biased by methods which have better detection rates on frequent records.

• The number of selected records from each difficulty level is inversely proportional to the percentage of records in the original KDD dataset. As a result, the classification rates of various machine learning methods vary in a wider range, which makes it more efficient to have an accurate evaluation of various learning techniques.

• The number of records in the training and testing sets is reasonable, which makes it practical to run experiments on the complete set without the need to randomly select a small portion. Consequently, the evaluation results of different research groups are consistent and comparable.

1 http://www.iscx.ca/NSL-KDD/
The NSL-KDD dataset consists of two parts: (i) KDDTrain+ and (ii) KDDTest+. The KDDTrain+ part of the NSL-KDD dataset is used to train a detection method or system to detect network intrusions. It contains four classes of attacks and a normal class. The KDDTest+ part of the NSL-KDD dataset is used for testing a detection method or a system when it is evaluated for performance. It also contains the same classes of traffic present in the training set. The distribution of attack and normal instances in the NSL-KDD dataset is shown in Table 3.

Table 1: Distribution of normal and attack traffic instances in the KDDcup99 dataset (Whole KDD, 10% KDD and Corrected KDD)

Table 2: List of attacks and corresponding services in the KDDcup99 dataset

Table 3: Distribution of normal and attack traffic instances in the NSL-KDD dataset

Dataset     DoS     u2r   r2l    Probe   Normal   Total
KDDTrain+   45927   52    995    11656   67343    125973
KDDTest+    7458    67    2887   2422    9710     22544

2.2.3 DARPA 2000 Dataset

A DARPA2 evaluation project [18] targeted the detection of complex attacks that contain multiple steps. Two attack scenarios were simulated in the DARPA 2000 evaluation contest, namely Lincoln Laboratory scenario DDoS (LLDOS) 1.0 and LLDOS 2.0. To achieve variation, these two attack scenarios were carried out over several network and audit scenarios. The sessions were grouped into four attack phases: (a) probing, (b) breaking into the system by exploiting a vulnerability, (c) installing DDoS software on the compromised system, and (d) launching a DDoS attack against another target. LLDOS 2.0 is different from LLDOS 1.0 in that its attacks are more stealthy and thus harder to detect. Since this dataset contains multi-stage attack scenarios, it is also commonly used for evaluation of alert correlation techniques.

2.2.4 DEFCON Dataset

The DEFCON3 dataset is another commonly used dataset for evaluation of IDSs [11]. It contains network traffic captured during a hacker competition called Capture The Flag (CTF), in which competing teams are divided into two groups: attackers and defenders. The traffic produced during CTF is very different from real world network traffic, since it contains only intrusive traffic without any normal background traffic. Due to this limitation, the DEFCON dataset has been found useful only in evaluating alert correlation techniques.

2.2.5 CAIDA Dataset

CAIDA4 collects many different types of data and makes them available to the research community. CAIDA datasets [8] are very specific to particular events or attacks.

2 http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/index.html
3 http://cctf.shmoo.com/data/
4 http://www.caida.org/home/
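Most of the captures discussed in this section are distributed as raw pcap (tcpdump) trace files, so even a basic inventory, such as counting packet records, requires parsing the fixed pcap layout (a 24-byte global header followed by 16-byte per-record headers). A minimal sketch for classic little-endian pcap files (not the newer pcapng format), using only the standard library:

```python
import struct

def count_packets(data: bytes) -> int:
    """Count packet records in a classic little-endian pcap byte string."""
    magic = struct.unpack("<I", data[:4])[0]
    assert magic == 0xA1B2C3D4, "not a little-endian classic pcap capture"
    offset, count = 24, 0          # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Record header fields: ts_sec, ts_usec, incl_len, orig_len.
        _, _, incl_len, _ = struct.unpack_from("<IIII", data, offset)
        offset += 16 + incl_len    # jump over header and captured bytes
        count += 1
    return count

# Build a minimal two-packet capture in memory to exercise the parser.
header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
pkt = struct.pack("<IIII", 0, 0, 6, 6) + b"\x00" * 6
print(count_packets(header + pkt + pkt))  # 2
```

Real traces would be read with `open(path, "rb").read()` or memory-mapped; the parsing logic is the same.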
Most of its longer traces are anonymized backbone traces without their payload. The CAIDA DDoS 2007 attack dataset contains one hour of anonymized traffic traces from DDoS attacks on August 4, 2007, which attempted to consume a large amount of network resources when connecting to Internet servers. The traffic traces contain only attack traffic to the victim and responses from the victim, split into 5-minute segments. All traffic traces are in pcap (tcpdump) format. The creators removed non-attack traffic as much as possible when creating the CAIDA DDoS 2007 dataset.

2.2.6 LBNL Dataset

LBNL's internal enterprise traffic traces are full header network traces without payload [23]. This dataset suffers from heavy anonymization, to the extent that scanning traffic was extracted and separately anonymized to remove any information which could identify individual IPs. The background and attack traffic in the LBNL dataset are described below.

• LBNL background traffic: This dataset can be obtained from the Lawrence Berkeley National Laboratory (LBNL) in the US. Traffic in this dataset is comprised of packet level incoming, outgoing and internally routed traffic streams at the LBNL edge routers. Traffic was anonymized using the tcpmkpub tool [35]. The main applications observed in the internal and external traffic are Web, email and name services. Other applications like Windows services, network file services and backup were used by internal hosts. The details of each service, information on each packet and other relevant descriptions are given in [34]. The background network traffic statistics of the LBNL dataset are given in Table 4.

• LBNL attack traffic: This dataset identifies attack traffic by isolating scans in aggregate traffic traces. Scans are identified by flagging those hosts which unsuccessfully probe more than 20 hosts, out of which 16 hosts are probed in ascending or descending IP order [35]. Malicious traffic mostly consists of failed incoming TCP SYN requests, i.e., TCP port scans targeted towards LBNL hosts. However, there are also some outgoing TCP scans in the dataset. Most UDP traffic observed in the data (incoming and outgoing) is comprised of successful connections, i.e., host replies for the received UDP flows. Clearly, the attack rate is significantly lower than the background traffic rate. Details of the attack traffic in this dataset are shown in Table 4.

Table 4: Background and attack traffic information for the LBNL datasets

Date        Duration  LBNL hosts  Remote hosts  Background traffic rate  Attack traffic rate
            (mins)                              (packet/sec)             (packet/sec)
10/04/2004  10 min    4,767       4,342         8.47                     0.41
12/15/2004  60 min    5,761       10,478        3.5                      0.061
12/16/2004  60 min    5,210       7,138         243.83                   72

Complexity and privacy were two main reservations of the participants of the endpoint data collection study. To address these reservations, the dataset creators developed a custom multi-threaded MS Windows tool using the Winpcap API [7] for data collection. To reduce packet logging complexity at the endpoints, they logged only very elementary session-level information (bidirectional communication between two IP addresses on different ports) for TCP and UDP packets. To ensure user privacy, an anonymization policy was used to anonymize all traffic instances.

2.2.7 Endpoint Dataset

The background and attack traffic for the endpoint datasets are described below.

• Endpoint background traffic: In the endpoint context, we see in Table 5 that home computers generate significantly higher traffic volumes than office and university computers because: (i) they are generally shared between multiple users, and (ii) they run peer-to-peer and multimedia applications. The large traffic volumes of home computers are also evident from their high mean number of sessions per second. To generate attack traffic, the developers infected Virtual Machines (VMs) at the endpoints with different malware, viz., Zotob.G, Forbot-FU, Sdbot-AFR, Dloader-NY, So-Big.E@mm, MyDoom.A@mm, Blaster, Rbot-AQJ, and RBOT.CCC. Details of the malware can be found in [42]. Characteristics of the attack traffic in this dataset are given in Table 6. These malwares have diverse scanning rates and attack ports or applications.

• Endpoint attack traffic: The attack traffic logged at the endpoints is mostly comprised of outgoing port scans. Note that this is the opposite of the LBNL dataset, in which most attack traffic is inbound. Moreover, the attack traffic rates at the endpoints are generally much higher than the background traffic rates of the LBNL datasets. This diversity in attack direction and rates provides a sound basis for performance comparison among scan detectors. For each malware, attack traffic of 15 minute duration was inserted in the background traffic of each endpoint at a random time instance. This operation was repeated to insert 100 non-overlapping attacks of each worm inside each endpoint's background traffic.

Table 5: Background traffic information for four endpoints with high and low rates

Endpoint ID  Endpoint type  Duration (months)  Total sessions  Mean session rate (/sec)
3            Home           3                  3,73,009        1.92
4            Home           2                  4,44,345        5.28
6            University     9                  60,979          0.19
10           University     13                 1,52,048        0.21

Table 6: Endpoint attack traffic for two high- and two low-rate worms

Malware     Release Date  Avg. scan rate (/sec)  Port(s) used
Dloader-NY  Jul 2005      46.84 sps              TCP 135, 139
Forbot-FU   Sept 2005     32.53 sps              TCP 445
Rbot-AQJ    Oct 2005      0.68 sps               TCP 139, 769
MyDoom-A    Jan 2006      0.14 sps               TCP 3127-3198

2.3 Real-life Datasets

We discuss three real-life datasets created by collecting network traffic on several consecutive days. The details include both normal as well as attack traffic in appropriate proportions in the authors' respective campus networks (i.e., testbeds).

2.3.1 UNIBS Dataset

The UNIBS packet traces [45] were collected on the edge router of the campus network of the University of Brescia in Italy, on three consecutive working days. The dataset includes traffic captured and stored using 20 workstations, each running the GT (Ground Truth) client daemon. The dataset creators collected the traffic by running tcpdump on the faculty router, which was a dual Xeon Linux box that connected the local network to the Internet through a dedicated 100Mb/s uplink. They captured and stored the traces on a dedicated disk of a workstation connected to the router through a dedicated ATA controller.

2.3.2 ISCX-UNB Dataset

The ISCX-UNB dataset [37] is built on the concept of profiles that include the details of intrusions. The datasets were collected using a real-time testbed by incorporating multi-stage attacks. It uses two profiles, α and β, during the generation of the datasets. α profiles are constructed using the knowledge of specific attacks, and β profiles are built using filtered traffic traces. Real packet traces were analyzed to create α and β profiles for agents that generate real-time traffic for the HTTP, SMTP, SSH, IMAP, POP3 and FTP protocols. Various multi-stage attack scenarios were explored to generate malicious traffic.

2.3.3 KU Dataset

The Kyoto University dataset5 is a collection of network traffic data obtained from honeypots. The raw dataset obtained from the honeypot system consisted of 24 statistical features, out of which 14 significant features were extracted [38]. The dataset developers extracted 10 additional features that could be used to investigate network events inside the university more effectively. The initial 14 features extracted are similar to those in the KDDcup99 dataset. Only the 14 conventional features were used during training and testing.

2.4 Discussion

The datasets described above are valuable assets for the intrusion detection community. However, the benchmark datasets suffer from the fact that they are not good representatives of real world traffic. For example, the DARPA dataset has been questioned about the realism of its background traffic [27, 29] because it is synthetically generated. In addition to the difficulty of simulating real time network traffic, there are additional challenges in IDS evaluation [30]. These include difficulties in collecting attack scripts and victim software, and differing requirements for testing signature based vs. anomaly based IDSs, and host-based vs. network based IDSs. In addition to these, we make the following observations based on our analysis.

• Most datasets are not labelled properly due to the non-availability of actual attack information. These include the KDDcup99, UNIBS, Endpoint and LBNL datasets.

• The proportions of normal and attack traffic are different in different datasets [21, 38, 45].

• Several existing datasets [21, 23, 38] have not been maintained or updated to reflect recent trends in network traffic by incorporating evolved network attacks.

• Most existing datasets are anonymized [8, 18] due to potential security risks to an organization. They do not share their raw data with researchers.

• Several datasets [8, 23, 18, 45] lack traffic features. They have only raw traffic traces, but it is important to extract relevant traffic features for individual attack identification.

3 Real-life Datasets Generation

As noted above, the generation of an unbiased real-life intrusion dataset incorporating a large number of real world attacks is important to evaluate network anomaly detection methods and systems. In this paper, we describe the generation of three real-life network intrusion datasets6, including (a) the TUIDS (Tezpur University Intrusion Detection System) intrusion dataset, (b) the TUIDS coordinated scan dataset, and (c) the TUIDS DDoS dataset, at both packet and flow levels [16]. The resulting details and supporting infrastructure are discussed in the following subsections.

3.1 Testbed Network Architecture

The TUIDS testbed network consists of 250 hosts, 15 L2 switches, 8 L3 switches, 3 wireless controllers, and 4 routers that compose 5 different networks inside the Tezpur University campus. The architectures of the TUIDS testbed and the TUIDS testbed for DDoS dataset generation are given in Figures 2 and 3, respectively.

5 http://www.takakura.com/kyoto data
6 http://agnigarh.tezu.ernet.in/∼dkb/resources.html
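Since the datasets above are provided at both packet and flow levels, the flow-level view can be illustrated by grouping packet records on the usual 5-tuple, in the spirit of NetFlow. A minimal sketch (the packet tuples below are hypothetical, not drawn from the TUIDS traces):

```python
from collections import defaultdict

def to_flows(packets):
    """Aggregate (src, dst, sport, dport, proto, nbytes) packet records
    into unidirectional flows keyed by the 5-tuple."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src, dst, sport, dport, proto, nbytes in packets:
        key = (src, dst, sport, dport, proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += nbytes
    return dict(flows)

# Hypothetical packets: three from one HTTP connection, one DNS query.
packets = [
    ("10.0.0.5", "10.0.0.1", 40211, 80, "tcp", 60),
    ("10.0.0.5", "10.0.0.1", 40211, 80, "tcp", 1500),
    ("10.0.0.5", "10.0.0.1", 40211, 80, "tcp", 1500),
    ("10.0.0.9", "10.0.0.2", 51622, 53, "udp", 74),
]
flows = to_flows(packets)
print(len(flows))  # 2 flows
```

Production flow exporters additionally apply inactivity and duration timeouts to split long connections; the fixed 5-tuple keying shown here is the common core.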
The hosts are divided into several VLANs, each VLAN belonging to an L3 switch or an L2 switch inside the network. All servers are installed inside a DMZ7 to provide an additional layer of protection in the security system of an organization.

Figure 2: Testbed network architecture used during TUIDS dataset generation

Figure 3: Testbed network architecture used during DDoS dataset generation

3.2 Network Traffic Generation

To generate real time normal and attack traffic, we configured several hosts, workstations, and servers in the TUIDS testbed network. The network consists of 6 interconnected Ubuntu 10.10 workstations. On each workstation, we installed several servers, including a network file server (Samba), a mail server (Dovecot), a telnet server, an FTP server, a Web server, and an SQL server with PHP compatibility. We also installed and configured 4 Windows Server 2003 machines to expose a diverse set of known vulnerabilities in the testbed environment. The servers and the services running on our testbed are summarized in Table 7.

Table 7: Servers and their services running on the testbed network

Server               Operating system     Services   Provider
Main Server          Ubuntu 10.10         Web, eMail Apache 2.4.3, Dovecot 2.1.14
Network File Server  Ubuntu 10.10         Samba      Samba 4.0.2
Telnet Server        Ubuntu 10.10         Telnet     telnet-0.17-36build1
FTP Server           Ubuntu 10.10         ftp        vsFTPd 2.3.0
Windows Server       Windows Server 2003  Web        IIS v7.5
MySQL Server         Ubuntu 10.10         database   MySQL 5.5.30

The normal network traffic is generated based on the day-to-day activities of users, especially traffic generated from the configured servers. It is important to generate different types of normal traffic, so we capture traffic from students, faculty members, system administrators, and office staff on different days within the University. The attack traffic is generated by launching attacks within the testbed network for three different subsets, viz., the TUIDS intrusion dataset, the coordinated scan dataset and the DDoS dataset. The attacks launched in the generation of these real-life datasets are summarized in Table 8. As seen in Table 8, 22 distinct attack types (1-22) were used to generate the attack traffic for the TUIDS intrusion dataset; six attacks (17-22) were used to generate the attack traffic for the coordinated scan dataset; and finally six attacks (23-28) were used to generate the attack traffic for the DDoS dataset with a combination of the TCP, UDP and ICMP protocols.

Table 8: List of real time attacks and their generation tools

Attack name      Generation tool   Attack name                   Generation tool
1. bonk          targa2.c          15. linux-icmp                linux-icmp.c
2. jolt          targa2.c          16. syn-flood                 synflood.c
3. land          targa2.c          17. window-scan               nmap/rnmap
4. saihyousen    targa2.c          18. syn-scan                  nmap/rnmap
5. teardrop      targa2.c          19. xmasstree-scan            nmap/rnmap
6. newtear       targa2.c          20. fin-scan                  nmap/rnmap
7. 1234          targa2.c          21. null-scan                 nmap/rnmap
8. winnuke       targa2.c          22. udp-scan                  nmap/rnmap
9. oshare        targa2.c          23. syn-flood (DDoS)          LOIC
10. nestea       targa2.c          24. rst-flood (DDoS)          Trinity v3
11. syndrop      targa2.c          25. udp-flood (DDoS)          LOIC
12. smurf        smurf4.c          26. ping-flood (DDoS)         DDoS ping v2.0
13. opentear     opentear.c        27. fraggle udp-flood (DDoS)  Trinoo
14. fraggle      fraggle.c         28. smurf icmp-flood (DDoS)   TFN2K

3.3 Attack Scenarios

The attack scenarios start with information gathering techniques: collecting target network IP ranges, identities of name servers, mail servers, user e-mail accounts, etc. This is achieved by querying the DNS for resource records using network administrative tools like nslookup and dig. We consider six attack scenarios when collecting real time network traffic for dataset generation.

3.3.1 Scenario 1: Denial of Service Using Targa

This attack scenario is designed to perform attacks on a target using the targa8 tool until it is successful. Targa is a very powerful tool that can quickly damage a particular network belonging to an organization. We ran targa by specifying different parameter values, such as IP ranges, the attacks to run, and the number of times to repeat the attack.

3.3.2 Scenario 2: Probing Using nmap

In this scenario, we attempt to acquire information about the target host and then launch an attack by exploiting the vulnerabilities found using the nmap9 tool. Examples of attacks that can be launched by this method are syn-scan and ping-sweep.

3.3.3 Scenario 3: Coordinated Scan Using rnmap

This scenario starts with a goal to perform coordinated port scans against single and multiple targets. Tasks are distributed among multiple hosts for individual actions which may be synchronized. We use the rnmap10 tool to launch coordinated scans in our testbed network during the collection of traffic.

3.3.4 Scenario 4: User to Root Using Brute Force ssh

These attacks are very common against networks as they tend to try different username and password combinations. This attack has been designed with the goal of acquiring an SSH account by running a brute force dictionary attack against our central server. We use the brutessh11 tool and a customized dictionary list. The dictionary consists of over 6100 alphanumeric entries of varying length. We executed the attack for 60 minutes, during which superuser credentials were returned from the server. This ID and password combination

7 Demilitarized zone: a network segment located between a secured local network and unsecured external networks (the Internet). A DMZ usually contains servers that provide services to users on the external network, such as Web, mail and DNS servers, which are hardened systems. Typically, two firewalls are installed to form the DMZ.
8 http://packetstormsecurity.com/
9 http://nmap.org/
to break into accounts with weak username and password com- was used to download other users’ credentials immediately.
10 http://rnmap.sourceforge.net/ 11 http://www.securitytube-tools.net/
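For illustration, the task distribution underlying Scenario 3 can be sketched in a few lines of Python. This is only an approximation of the idea (rnmap itself is a client/server wrapper around nmap); the host names and the round-robin policy are assumptions made for the example.

```python
def partition_ports(ports, scanners):
    """Assign target ports to scanning hosts round-robin, so that
    each host probes a disjoint slice of the target's port space."""
    tasks = {host: [] for host in scanners}
    for i, port in enumerate(ports):
        tasks[scanners[i % len(scanners)]].append(port)
    return tasks

# Three cooperating scanners share the first 1024 TCP ports.
tasks = partition_ports(list(range(1, 1025)), ["n1", "n2", "n3"])
```

Each host then scans only its own slice and a coordinator merges the results, which is precisely what makes such scans hard to detect from a single vantage point.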
3.3.5 Scenario 5: Distributed Denial of Service Using Agent-handler Network

This scenario exploits an agent-handler network to launch DDoS attacks in the TUIDS testbed network. The agent-handler network consists of clients, handlers and agents. The handlers are software packages that the attacker uses to communicate indirectly with the agents. The agent software resides in compromised systems that will eventually carry out the attack on the victim system. The attacker may communicate with any number of handlers, thus making sure that the agents are up and running. We use Trinity v3, TFN2K, Trinoo, and DDoS Ping 2.0 to launch the attacks in our testbed.

3.3.6 Scenario 6: Distributed Denial of Service Using IRC Botnet

Botnets are an emerging threat to all organizations because they can compromise a network, steal important information and distribute malware. Botnets combine individual malicious behaviors into a single platform by simplifying the actions users must perform to initiate sophisticated attacks against computers or networks around the world. These behaviors include coordinated scanning, DDoS activities, direct attacks, indirect attacks and other deceitful activities taking place across the Internet.

The main goal of this scenario is to perform distributed attacks using infected hosts on the testbed. An Internet Relay Chat (IRC) bot network allows users to create public, private and secret channels. For this, we use LOIC (http://sourceforge.net/projects/loic/), an IRC-based DDoS attack generation tool. IRC systems have several other significant advantages for launching DDoS attacks; three important benefits are that (i) they afford a high degree of anonymity, (ii) they are difficult to detect, and (iii) they provide a strong, guaranteed delivery system. Furthermore, the attacker no longer needs to maintain a list of agents, since he can simply log on to the IRC server and see a list of all available agents. The IRC channels receive communications from the agent software regarding the status of the agents (i.e., up or down) and notify the attackers accordingly.

3.4 Capturing Traffic

The key tasks in network traffic monitoring are lossless packet capture and precise timestamping, so software or hardware is required that guarantees that all traffic is captured and stored. The real network traffic is captured using the Libpcap [19, 20] library, an open-source C library offering an interface for capturing link-layer frames over a wide range of system architectures. It provides a high-level common Application Programming Interface (API) to the different packet capture frameworks of the various operating systems. The abstraction layer it offers allows programmers to rapidly develop highly portable applications. A hierarchy of network traffic capturing components is given in Figure 4 [10].

Figure 4: Hierarchy of network traffic capturing components

Libpcap defines a common standard format for files in which captured frames are stored, also known as the tcpdump format, currently a de facto standard widely used in public network traffic archives. Modern kernel-level capture frameworks on UNIX operating systems are mostly based on the BSD (or Berkeley) Packet Filter (BPF) [28]. The BPF is a software device that taps network interfaces, copying packets into kernel buffers and filtering out unwanted packets directly in interrupt context. Definitions of the packets to be filtered can be written in a simple human-readable format using Boolean operators, compiled into pseudo-code and passed to the BPF device driver by a system call. The pseudo-code is interpreted by the BPF pseudo-machine, a lightweight, high-performance state machine specifically designed for packet filtering. Libpcap also allows programmers to write applications that transparently support a rich set of constructs to build detailed filtering expressions for most network protocols. Filter expressions can be read directly from the user's command line, compiled into pseudo-code and passed to the Berkeley Packet Filter. Libpcap and the BPF interact to allow network packet data to traverse several layers, to finally be processed and transformed into capture files (i.e., the tcpdump format) or into samples for statistical analysis.

With the goal of preparing both packet and flow level datasets, we capture both packet and NetFlow traffic from different locations in the TUIDS testbed. The capture period started at 08:00:05 am on Monday, February 21, 2011 and ran continuously for an exact duration of seven days, ending at 08:00:05 am on Sunday, February 27. Attacks were executed during this period for the TUIDS intrusion and coordinated scan datasets. DDoS traffic was also collected for the same amount of time, but during October 2012, with several variations of real-time DDoS attacks. Figure 5 illustrates the protocol composition and the average throughput during the last hour of data capture for the TUIDS intrusion dataset.

Figure 5: (a) Composition of protocols and (b) average throughput during the last hour of data capture for the TUIDS intrusion dataset, as seen in our lab's traffic

We use a tool known as lossless gigabit remote packet capture with Linux (Gulp, http://staff.washington.edu/corey/gulp/) to capture packet-level traffic on a mirror port, as shown in the TUIDS testbed architecture. Gulp reads packets directly from the network card and writes them to disk at a high capture rate without dropping packets. For low-rate traffic, Gulp flushes the ring buffer if it has not written anything in the last second. Gulp writes on even block boundaries for excellent write performance when the data rate increases. On receiving an interrupt, it stops filling the ring buffer but writes whatever remains in the ring buffer to disk.
In the last few years, NetFlow has become the most popular approach to IP network monitoring, since it helps cope with the scalability issues introduced by increasing network speeds. Major vendors now offer flow-enabled devices, such as Cisco routers with NetFlow. A NetFlow is a stream of packets that arrives on a source interface with the key values shown in Figure 6. A key is an identified value for a field within the packet. Cisco routers have NetFlow features that can be enabled to generate NetFlow records. The principle of NetFlow is as follows: when the router receives a packet, its NetFlow module scans the source IP address, the destination IP address, the source port number, the destination port number, the protocol type, the type of service (ToS) bits in the IP header, and the input or output interface number on the router, to judge whether the packet belongs to a NetFlow record that already exists in the cache. If so, it updates that NetFlow record; otherwise, a new NetFlow record is generated in the cache. Expired NetFlow records in the cache are exported periodically to a destination IP address on a UDP port.

Figure 6: Common NetFlow parameters

To capture the NetFlow traffic, we need a NetFlow collector that listens on a specific UDP port. The NetFlow collector captures exported traffic from multiple routers and periodically stores it, in summarized or aggregated format, in a round-robin database (RRD). The following tools are used to capture and visualize the NetFlow traffic.

(a) NFDUMP: This tool captures and displays NetFlow traffic. All versions of nfdump support NetFlow v5, v7, and v9. nfcapd is a NetFlow capture daemon that reads the NetFlow data from the routers and periodically stores the data into files, automatically rotating files every n minutes (by default, 5 minutes). One nfcapd process is needed for each NetFlow stream. Nfdump reads the NetFlow data from the files stored by nfcapd; its syntax is similar to that of tcpdump. Nfdump displays NetFlow data and can create top-N statistics for flows based on the selected parameters. The main goal is to analyze NetFlow data from the past as well as to track interesting traffic patterns continuously from high-speed networks. The amount of past data available is limited only by the disk space allocated for the NetFlow data.

Nfdump has four fixed output formats: raw, line, long and extended. In addition, the user may specify any desired output format by customizing it. The default format is line, unless otherwise specified. The raw format displays each record over multiple lines and prints all available information in the traffic record.

(b) NFSEN: NfSen is a graphical Web-based front-end tool for visualizing NetFlow traffic. NfSen facilitates the visualization of several traffic statistics, e.g., flow-wise statistics for various features, navigation through the NetFlow traffic, processes within a time span, and continuous profiles. Users can also add their own plugins to process NetFlow traffic in a customized manner at regular time intervals.

Normal traffic is captured by restricting it to the internal networks, where 80% of the hosts, including wireless networks, are connected to the router. We assume that normal traffic follows the normal probability distribution. Attack traffic is captured as we launch various attacks in the testbed for a week. For DDoS attacks, we used packet-craft (http://www.packet-craft.net/) to generate customized packets. Figures 7 and 8 show the number of flows per second and the protocol-wise distribution of flows during the capture period, respectively.

Figure 7: Number of flows per second in the TUIDS intrusion datasets during the capture period

Figure 8: Protocol-wise distribution of flows per second in the TUIDS intrusion dataset during the capture period

3.5 Feature Extraction

We use wireshark and Java routines to filter unwanted packets (such as packets with routing protocols and packets with application layer protocols) as well as irrelevant information from the captured packets. Finally, we retrieve all relevant information from each packet using Java routines and store it in comma-separated form in a text file. The parameters identified for packet level data are shown in Table 9.

Table 9: Parameters identified for packet level data

  Sl. No.  Parameter     Description
  1        Time          Time since occurrence of first frame
  2        Frame No      Frame number
  3        Frame Len     Length of a frame
  4        Capture Len   Capture length
  5        TTL           Time to live
  6        Protocol      Protocol (such as TCP, UDP, ICMP, etc.)
  7        Src IP        Source IP address
  8        Dst IP        Destination IP address
  9        Src port      Source port
  10       Dst port      Destination port
  11       Len           Data length
  12       Seq No        Sequence number
  13       Header Len    Header length
  14       CWR           Congestion window record
  15       ECN           Explicit congestion notification
  16       URG           Urgent TCP flag
  17       ACK           Acknowledgement flag
  18       PSH           Push flag
  19       RST           Reset flag
  20       SYN           TCP SYN flag
  21       FIN           TCP FIN flag
  22       Win Size      Window size
  23       MSS           Maximum segment size

We developed several C routines and used them for filtering NetFlow data and for extracting features from the captured data. A detailed list of the parameters identified for flow level data is given in Table 10.
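For illustration, the flow abstraction of Section 3.4 and the comma-separated storage used here can be combined in a short Python sketch that aggregates packets into NetFlow-style records keyed by the fields of Figure 6 (interface numbers omitted) and writes one CSV row per flow. The field names are illustrative assumptions; this is not the C routines used in our pipeline.

```python
import csv, io

def aggregate_flows(packets):
    """Fold packets into flow records keyed by the NetFlow key fields;
    each record accumulates packet and byte counts (cf. Table 10)."""
    flows = {}
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"], p["tos"])
        rec = flows.setdefault(key, {"packets": 0, "bytes": 0})
        rec["packets"] += 1
        rec["bytes"] += p["len"]
    return flows

def flows_to_csv(flows):
    """Store the flow records in comma-separated form, one row per flow."""
    out = io.StringIO()
    w = csv.writer(out)
    w.writerow(["src", "dst", "sport", "dport", "proto", "tos", "packets", "bytes"])
    for key, rec in flows.items():
        w.writerow(list(key) + [rec["packets"], rec["bytes"]])
    return out.getvalue()

pkts = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 40000, "dport": 80, "proto": 6, "tos": 0, "len": 1500},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "sport": 40000, "dport": 80, "proto": 6, "tos": 0, "len": 40},
    {"src": "10.0.0.2", "dst": "10.0.0.1", "sport": 80, "dport": 40000, "proto": 6, "tos": 0, "len": 60},
]
flows = aggregate_flows(pkts)
text = flows_to_csv(flows)
```

Note that the two directions of a TCP connection form two distinct flows, as in real NetFlow exports.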
We capture, preprocess and extract various features from both packet and flow level network traffic. We introduce a framework for fast distributed feature extraction from raw network traffic, correlation computation and data labelling, as shown in Figure 9. We extract four types of features from the raw network traffic: basic, content based, time based and connection based. We use T = 5 seconds as the time window for the extraction of both time based and connection based traffic features. S1 and S2 are servers used for preprocessing, attack labelling and profile generation. WS1 and WS2 are high-end workstations used for basic feature extraction and for merging packet and NetFlow traffic. N1, N2, ..., N6 are independent nodes used for protocol-specific feature extraction. The lists of features extracted at packet and flow level for the intrusion datasets are presented in Tables 11 and 12, respectively. The list of features available in the KDDcup99 intrusion dataset is also shown in Table 13.

Table 10: Parameters identified for flow level data

  Sl. No.  Parameter   Description
  1        Flow-start  Starting time of the flow
  2        Duration    Total life time of a flow
  3        Proto       Protocol, i.e., TCP, UDP, ICMP, etc.
  4        Src-IP      Source IP address
  5        Src-port    Source port
  6        Dest-IP     Destination IP address
  7        Dest-port   Destination port
  8        Flags       TCP flags
  9        ToS         Type of service
  10       Packets     Packets per flow
  11       Bytes       Bytes per flow
  12       Pps         Packets per second
  13       Bps         Bits per second
  14       Bpp         Bytes per packet

3.6 Data Processing and Labelling

As reported in the previous section, traffic features are extracted separately (within a time interval). It is therefore important to correlate each feature type (i.e., basic, content based, time based, and connection based) to a time interval. Once correlation is performed for both packet and flow level traffic, each feature data instance must be labelled as normal or anomalous. The labelling process enriches the feature data with information such as (i) the type and structure of malicious or anomalous data, and (ii) dependencies among different isolated malicious activities. The correlation and labelling of each traffic feature instance as normal or anomalous is performed using Algorithm 1. F = {α, β, γ, δ} is the set of extracted features, where α is the set of basic features, β is the set of content-based features, γ is the set of time-based features and δ is the set of connection-based features. Both normal and anomalous traffic are collected separately in several sessions within a week. We remove normal traffic from anomalous traces as much as possible.

Algorithm 1: FC and labelling (F)
Input: extracted feature set, F = {α, β, γ, δ}
Output: correlated and labelled feature data, D
 1: initialize D
 2: call FeatureExtraction(), F ← {α, β, γ, δ}   ▷ the procedure FeatureExtraction() extracts the features separately for all cases
 3: for i ← 1 to |N| do   ▷ N is the total number of traffic instances
 4:   for j ← 1 to |F| do   ▷ F is the total number of traffic features
 5:     if (unique(src.ip ∧ dst.ip)) then
 6:       store D[ij] ← αij, βij
 7:     end if
 8:     if ((T == 5s) ∧ (LnP == 100)) then   ▷ T is the time window, LnP is the last n packets
 9:       store D[ij] ← γij, δij
10:     end if
11:   end for
12:   D[ij] ← {normal, attack}   ▷ label each traffic feature instance based on the duration of the collected traffic
13: end for

The overall traffic composition, with the protocol distribution in the generated datasets, is summarized in Table 14. The traffic includes the TUIDS intrusion dataset, the TUIDS coordinated scan dataset and the TUIDS DDoS dataset. The final labelled feature datasets for each category, with the distribution of normal and attack information, are summarized in Table 15. All datasets are prepared at both packet and flow levels and presented in terms of training and testing sets in Table 15.

3.7 Comparison with Other Public Datasets

Several real network traffic traces are readily available to the research community, as reported in Section 2. Although these traffic traces are invaluable to the research community, most if not all fail to satisfy one or more of the requirements described in Section 1. This paper is distinguished by the fact that the issue of data generation is approached from the perspective of what other datasets have been unable to provide for the network security community. It attempts to resolve the issues seen in other datasets by presenting a systematic approach to generate real-life network intrusion datasets. Table 16 summarizes a comparison between the prior datasets and the datasets generated through the application of our systematic approach, with respect to the principal objectives outlined for a qualifying dataset.

Most datasets are unlabelled, as labelling is labor-intensive and requires a comprehensive search to tag anomalous traffic. Although an IDS helps by reducing the work, there is no guarantee that all anomalous activity is labelled. This has been a major issue with all datasets and one of the reasons behind the post insertion of attack traffic in the DARPA 1999 dataset, so that anomalous traffic can be labelled in a deterministic manner. Having seen the inconsistencies produced by traffic merging, this paper adopts a different approach to provide the same level of deterministic behavior with respect to anomalous traffic: anomalous activity is conducted within the capture period using available network resources. Through the use of logging, all ill-intended activity can be effectively labelled.
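Algorithm 1 is essentially a merge of the four per-type feature sets into one labelled record per traffic instance. The Python sketch below illustrates that correlation-and-labelling pass under simplifying assumptions: features arrive pre-grouped per instance, and the label is assigned from known attack-session intervals rather than from the raw traces. All field names are illustrative.

```python
def correlate_and_label(basic, content, timeb, conn, attack_sessions):
    """Merge the basic (alpha), content (beta), time (gamma) and
    connection (delta) features of each traffic instance into one
    record, then label it from the capture session it fell in."""
    D = []
    for i, a in enumerate(basic):
        row = dict(a)               # alpha: includes timestamp, src, dst
        row.update(content[i])      # beta
        row.update(timeb[i])        # gamma (T = 5 s window)
        row.update(conn[i])         # delta (last N = 100 packets)
        row["label"] = ("attack"
                        if any(s <= row["t"] < e for s, e in attack_sessions)
                        else "normal")
        D.append(row)
    return D

basic   = [{"t": 10.0, "src": "10.0.0.9", "dst": "10.0.0.2"},
           {"t": 95.0, "src": "10.0.0.3", "dst": "10.0.0.2"}]
content = [{"conn_status": 1}, {"conn_status": 0}]
timeb   = [{"count_serv_src": 3}, {"count_serv_src": 40}]
conn    = [{"count_dest_conn": 2}, {"count_dest_conn": 90}]
# One attack session was run between t = 60 s and t = 120 s.
D = correlate_and_label(basic, content, timeb, conn, attack_sessions=[(60.0, 120.0)])
```

The session-interval rule mirrors the fact that, on the testbed, attacks were launched in known time windows, so ground-truth labels follow from the capture schedule.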
Figure 9: Fast distributed feature extraction, correlation and labelling framework

The extent and scope of network traffic capture become relevant where the information contained in the traces may breach the privacy of individuals or organizations. To prevent privacy issues, almost all publicly available datasets remove identifying information such as payload, protocol, destination and flags. In addition, the data is anonymized, with necessary header information cropped or flows simply summarized.

In addition to anomalous traffic, traces must contain background traffic. Most captured datasets offer little control over the anomalous activities included in the traces. However, a major concern when evaluating anomaly based detection approaches is the requirement that anomalous traffic be present at a certain scale. Anomalous traffic also tends to become outdated with the introduction of more sophisticated attacks. We have therefore generated more up-to-date datasets that reflect current trends and are tailored to evaluate particular characteristics of detection mechanisms.

As discussed earlier, several datasets are available for evaluating an IDS. Network intrusion detection researchers evaluate detection methods using intrusion datasets to demonstrate how their methods handle recent attacks and network environments. We have used our datasets to evaluate several network intrusion detection methods: an outlier-based network anomaly detection approach (NADO) [4], an unsupervised method [3, 6], an adaptive outlier-based coordinated scan detection approach (AOCD) [5], and a multi-level hybrid IDS (MLH-IDS) [15]. We found better results in almost all the experiments when we used the TUIDS datasets, in terms of false positive rate, true positive rate and F-measure.

3.8 Comparison with Other Relevant Work

Our approach differs from other work as follows.

• The NSL-KDD [32] dataset is an enhanced version of the KDDcup99 intrusion dataset, prepared by Tavallaee et al. [43]. It removes repeated traffic records from the old KDDcup99 dataset, but it is too old to evaluate a modern detection method or a recently developed system. In contrast, our datasets are prepared using diverse attack scenarios incorporating recent attacks, and they contain both packet and flow level information that helps detect attacks more effectively in high-speed networks.

• Song et al. [39] prepared the KU dataset and used it to evaluate an unsupervised network anomaly detection method. This dataset contains 17 different features at packet level only. In contrast, we present a systematic approach to generate real-life network intrusion datasets and have prepared three different categories of datasets at both packet and flow levels.

• Like Shiravi et al. [37], our approach considers recently developed attacks and attacks on network layers when generating the datasets. Shiravi et al. concentrate mostly on application-layer attacks; they build profiles for different real-world attack scenarios and use them to generate traffic that follows the same behavior, producing a dataset at packet level only. In comparison, we generate three different categories of datasets, at both packet and flow levels, for the research community to evaluate detection methods and systems. Since we have extracted a larger number of features at both packet and flow levels, our datasets will help to identify individual attacks more effectively in high-speed networks.

Table 11: List of packet level features in TUIDS intrusion dataset


Label/feature name Type Description
Basic features
1. Duration C Length (number of seconds) of the connection
2. Protocol-type D Type of protocol, e.g., tcp, udp, etc.
3. Src-ip C Source host IP address
4. Dest-ip C Destination IP address
5. Src-port C Source host port number
6. Dest-port C Destination host port number
7. Service D Network service at the destination, e.g., http, telnet, etc.
8. num-bytes-src-dst C The number of data bytes flowing from source to destination
9. num-bytes-dst-src C The number of data bytes flowing from destination to source
10. Fr-no C Frame number
11. Fr-len C Frame length
12. Cap-len C Captured frame length
13. Head-len C Header length of the packet
14. Frag-off D Fragment offset: ‘1’ if the second packet overwrites everything, ‘0’ otherwise
15. TTL C Time to live: ‘0’ discards the packet
16. Seq-no C Sequence number of the packet
17. CWR D Congestion window record
18. ECN D Explicit congestion notification
19. URG D Urgent TCP flag
20. ACK D Acknowledgement flag value
21. PSH D Push TCP flag
22. RST D Reset TCP flag
23. SYN D Syn TCP flag
24. FIN D Fin TCP flag
25. Land D 1 if connection is from/to the same host/port; 0 otherwise
Content-based features
26. Mss-src-dest-requested C Maximum segment size from source to destination requested
27. Mss-dest-src-requested C Maximum segment size from destination to source requested
28. Ttt-len-src-dst C Time to live length from source to destination
29. Ttt-len-dst-src C Time to live length from destination to source
30. Conn-status C Status of the connection (e.g., ‘1’ for complete, ‘0’ for reset)
Time-based features
31. count-fr-dest C Number of frames received by unique destinations in the last T seconds from the same source
32. count-fr-src C Number of frames received from unique sources in the last T seconds from the same destination
33. count-serv-src C Number of frames from the source to the same destination port in the last T seconds
34. count-serv-dest C Number of frames from destination to the same source port in the last T seconds
35. num-pushed-src-dst C The number of pushed packets flowing from source to destination
36. num-pushed-dst-src C The number of pushed packets flowing from destination to source
37. num-SYN-FIN-src-dst C The number of SYN/FIN packets flowing from source to destination
38. num-SYN-FIN-dst-src C The number of SYN/FIN packets flowing from destination to source
39. num-FIN-src-dst C The number of FIN packets flowing from source to destination
40. num-FIN-dst-src C The number of FIN packets flowing from destination to source
Connection-based features
41. count-dest-conn C Number of frames to unique destinations in the last N packets from the same source
42. count-src-conn C Number of frames from unique sources in the last N packets to the same destination
43. count-serv-srcconn C Number of frames from the source to the same destination port in the last N packets
44. count-serv-destconn C Number of frames from the destination to the same source port in the last N packets
45. num-packets-src-dst C The number of packets flowing from source to destination
46. num-packets-dst-src C The number of packets flowing from destination to source
47. num-acks-src-dst C The number of acknowledgement packets flowing from source to destination
48. num-acks-dst-src C The number of acknowledgement packets flowing from destination to source
49. num-retransmit-src-dst C The number of retransmitted packets flowing from source to destination
50. num-retransmit-dst-src C The number of retransmitted packets flowing from destination to source
C-Continuous, D-Discrete
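For illustration, a time based feature such as count-serv-src (feature 33 in Table 11) is a sliding-window count over the last T = 5 seconds. A minimal Python sketch, assuming frames are available as (time, source IP, destination port) records:

```python
T = 5.0  # time window in seconds, as used in Section 3.5

def count_serv_src(frames, i):
    """count-serv-src for frame i: frames seen in the last T seconds
    from the same source to the same destination port (Table 11, #33)."""
    t, src, dport = frames[i]
    return sum(1 for (t2, src2, dport2) in frames[:i]
               if t - t2 <= T and src2 == src and dport2 == dport)

frames = [(0.0, "10.0.0.9", 22), (1.0, "10.0.0.9", 22),
          (2.0, "10.0.0.9", 80), (9.0, "10.0.0.9", 22)]
```

The connection based variants (features 41-44) follow the same pattern but count over the last N frames instead of the last T seconds.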

4 Observations and Conclusion

Several questions may be raised about what constitutes a perfect dataset for the dataset generation task, including the quality of the normal, anomalous and realistic traffic included in the dataset. We provide a path and a template to generate a dataset that simultaneously exhibits the appropriate levels of normality, anomalousness and realism, while avoiding the various weak points of currently available datasets pointed out earlier. Quantitative measurements can be obtained only when specific methods are applied to the dataset.

The following are the major observations and requirements when generating an unbiased real-life dataset for intrusion detection.

• The dataset should not exhibit any unintended properties, in either normal or anomalous traffic.

• The dataset should be properly labelled.

• The dataset should cover all possible current network scenarios.

• The dataset should be entirely non-anonymized.

• In most benchmark datasets, the two basic assumptions described in Section 1 are valid, but this bias should be avoided as much as possible.

• Several datasets lack traffic features, although it is important to extract traffic features along with their relevance to a particular attack.

Table 12: List of flow level features in TUIDS intrusion dataset


Label/feature name Type Description
Basic features
1. Duration C Length (number of seconds) of the flow
2. Protocol-type D Type of protocol, e.g., TCP, UDP, ICMP
3. Src-ip C Source host IP address
4. Dest-ip C Destination IP address
5. Src-port C Source host port number
6. Dest-port C Destination host port number
7. ToS D Type of service
8. URG D TCP urgent flag
9. ACK D TCP acknowledgement flag
10. PSH D TCP push flag
11. RST D TCP reset flag
12. SYN D TCP SYN flag
13. FIN D TCP FIN flag
14. Src-bytes C Number of data bytes transferred from source to destination
15. Dest-bytes C Number of data bytes transferred from destination to source
16. Land D 1 if connection is from/to the same host/port; 0 otherwise
Content-based features
17. Conn-status C Status of the connection (e.g., ‘1’ for complete, ‘0’ for reset)
Time-based features
18. count-dest C Number of flows to unique destination IPs in the last T seconds from the same source
19. count-src C Number of flows from unique source IPs in the last T seconds to the same destination
20. count-serv-src C Number of flows from the source to the same destination port in the last T seconds
21. count-serv-dest C Number of flows from the destination to the same source port in the last T seconds
Connection-based features
22. count-dest-conn C Number of flows to unique destination IPs in the last N flows from the same source
23. count-src-conn C Number of flows from unique source IPs in the last N flows to the same destination
24. count-serv-srcconn C Number of flows from the source IP to the same destination port in the last N flows
25. count-serv-destconn C Number of flows to the destination IP to the same source port in the last N flows
C-Continuous, D-Discrete

Despite the effort needed to create unbiased datasets, there will always be deficiencies in any one particular dataset. It is therefore very important to generate dynamic datasets which not only reflect the traffic compositions and intrusion types of the time, but are also modifiable, extensible, and reproducible. New datasets must be generated from time to time for the analysis, testing and evaluation of network intrusion detection methods and systems from multiple perspectives.

In this paper, we provide a systematic approach to generating real-life network intrusion datasets using both packet and flow level traffic information. Three different types of datasets have been generated using the TUIDS testbed: (i) the TUIDS intrusion dataset, (ii) the TUIDS coordinated scan dataset, and (iii) the TUIDS DDoS dataset. We incorporate the maximum possible number of attacks and scenarios when generating the datasets on our testbed network.

Acknowledgments

This work is partially supported by the Department of Information Technology (DIT) and the Council of Scientific & Industrial Research (CSIR), Government of India. The authors are thankful to the funding agencies and also gratefully acknowledge the anonymous reviewers for their valuable comments.

References

[1] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “RODD: An effective reference-based outlier detection technique for large datasets,” in Proceedings of the First International Conference on Computer Science and Information Technology, pp. 76–84, Bangalore, India, 2011.
[2] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Network anomaly detection: Methods, systems and tools,” IEEE Communications Surveys and Tutorials, vol. 16, no. 1, pp. 303–336, 2014.
[3] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Towards an unsupervised method for network anomaly detection in large datasets,” Computing and Informatics, vol. 33, no. 1, pp. 1–34, 2014.
[4] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “NADO: Network anomaly detection using outlier approach,” in Proceedings of the ACM International Conference on Communication, Computing & Security, pp. 531–536, New York, USA, 2011.
[5] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “AOCD: An adaptive outlier based coordinated scan detection approach,” International Journal of Network Security, vol. 14, no. 6, pp. 339–351, 2012.
[6] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “An effective unsupervised network anomaly detection method,” in Proceedings of the ACM International Conference on Advances in Computing, Communications and Informatics, pp. 533–539, New York, USA, 2012.
[7] CACE Technologies, WinPcap, June 2015. (http://www.winpcap.org)
[8] CAIDA, The Cooperative Analysis for Internet Data Analysis, 2011. (http://www.caida.org)
Table 13: List of features in the KDDcup99 intrusion dataset
Label/feature name Type Description
Basic features
1. Duration C Length (number of seconds) of the connection
2. Protocol-type D Type of protocol, e.g., tcp, udp, etc.
3. Service D Network service at the destination, e.g., http, telnet, etc.
4. Flag D Normal or error status of the connection
5. Src-bytes C Number of data bytes from source to destination
6. Dst-bytes C Number of data bytes from destination to source
7. Land D 1 if connection is from/to the same host/port; 0 otherwise
8. Wrong-fragment C Number of “wrong” fragments
9. Urgent C Number of urgent packets
Content-based features
10. Hot C Number of “hot” indicators (hot: number of directory accesses, create and execute program)
11. Num-failed-logins C Number of failed login attempts
12. Logged-in D 1 if successfully logged-in; 0 otherwise
13. Num-compromised C Number of “compromised” conditions (compromised condition: number of file/path not found errors and jumping commands)
14. Root-shell D 1 if root-shell is obtained; 0 otherwise
15. Su-attempted D 1 if “su root” command attempted; 0 otherwise
16. Num-root C Number of “root” accesses
17. Num-file-creations C Number of file creation operations
18. Num-shells C Number of shell prompts
19. Num-access-files C Number of operations on access control files
20. Num-outbound-cmds C Number of outbound commands in an ftp session
21. Is-host-login D 1 if login belongs to the “hot” list; 0 otherwise
22. Is-guest-login D 1 if the login is a “guest” login; 0 otherwise
Time-based features
23. Count C Number of connections to the same host as the current connection in the past 2 seconds
24. Srv-count C Number of connections to the same service as the current connection in the past 2 seconds (same-host connections)
25. Serror-rate C % of connections that have “SYN” errors (same-host connections)
26. Srv-serror-rate C % of connections that have “SYN” errors (same-service connections)
27. Rerror-rate C % of connections that have “REJ” errors (same-host connections)
28. Srv-rerror-rate C % of connections that have “REJ” errors (same-service connections)
29. Same-srv-rate C % of connections to the same service (same-host connections)
30. Diff-srv-rate C % of connections to different services (same-host connections)
31. Srv-diff-host-rate C % of connections to different hosts (same-service connections)
Connection-based features
32. Dst-host-count C Count of destination hosts
33. Dst-host-srv-count C Srv count for destination host
34. Dst-host-same-srv-rate C Same srv rate for destination host
35. Dst-host-diff-srv-rate C Diff srv rate for destination host
36. Dst-host-same-src-port-rate C Same src port rate for destination host
37. Dst-host-srv-diff-host-rate C Diff host rate for destination host
38. Dst-host-serror-rate C Serror rate for destination host
39. Dst-host-srv-serror-rate C Srv serror rate for destination host
40. Dst-host-rerror-rate C Rerror rate for destination host
41. Dst-host-srv-rerror-rate C Srv rerror rate for destination host
C-Continuous, D-Discrete
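The time-based features above (e.g., Count and Srv-count) are derived over a two-second window of preceding connections. The sketch below illustrates the idea with hypothetical connection records and field names (`ts`, `dst_host`, `service`); it is only an illustration of the windowed counting, not the original KDD feature-extraction code:

```python
from dataclasses import dataclass

@dataclass
class Conn:
    ts: float      # connection start time (seconds)
    dst_host: str  # destination host
    service: str   # destination service, e.g., "http"

def time_based_features(conns, i, window=2.0):
    """KDD-style 'count' and 'srv-count' for connection i: number of
    connections in the past `window` seconds with the same destination
    host / same service as connection i."""
    cur = conns[i]
    recent = [c for c in conns[:i] if cur.ts - c.ts <= window]
    count = sum(1 for c in recent if c.dst_host == cur.dst_host)
    srv_count = sum(1 for c in recent if c.service == cur.service)
    return count, srv_count

conns = [
    Conn(0.0, "10.0.0.1", "http"),
    Conn(0.5, "10.0.0.1", "telnet"),
    Conn(1.0, "10.0.0.2", "http"),
    Conn(1.8, "10.0.0.1", "http"),
]
print(time_based_features(conns, 3))  # -> (2, 2)
```

The rate features (Serror-rate, Same-srv-rate, etc.) follow the same pattern, dividing such counts by the size of the window population.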
Table 14: TUIDS dataset traffic composition

(a) Total traffic composition
Protocol Size (MB) (%)
IP 66784.29 99.99
ARP 3.96 0.005
IPv6 0.00 0.00
IPX 0.00 0.00
STP 0.00 0.00
Other 0.00 0.00

(b) TCP/UDP/ICMP traffic composition
Protocol Size (MB) (%)
TCP 49049.29 73.44
UDP 14940.53 22.37
ICMP 2798.43 4.19
ICMPv6 0.00 0.00
Other 0.00 0.00

[9] A. Cemerlic, L. Yang, and J. M. Kizza, “Network intrusion detection based on bayesian networks,” in Proceedings of the 20th International Conference on Software Engineering and Knowledge Engineering, pp. 791–794, San Francisco, USA, 2008.
[10] A. Dainotti and A. Pescape, “PLAB: A packet capture and analysis architecture,” 2004. (http://traffic.comics.unina.it/software/ITG/D-ITGpublications/TR-DIS-122004.pdf)
[11] DEFCON, The SHMOO Group, 2011. (http://cctf.shmoo.com/)
[12] L. Delooze, Applying Soft-Computing Techniques to Intrusion Detection, Ph.D. Thesis, Computer Science Department, University of Colorado, Colorado Springs, 2005.
[13] D. E. Denning, “An intrusion-detection model,” IEEE Transactions on Software Engineering, vol. 13, pp. 222–232, Feb. 1987.
[14] A. A. Ghorbani, W. Lu, and M. Tavallaee, “Network attacks,” in Network Intrusion Detection and Prevention, pp. 1–25, Springer-Verlag, 2010.
[15] P. Gogoi, D. K. Bhattacharyya, B. Bora, and J. K. Kalita, “MLH-IDS: A multi-level hybrid intrusion detection method,” The Computer Journal, vol. 57, pp. 602–623, May 2014.
[16] P. Gogoi, M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Packet and flow-based network intrusion dataset,” in Proceedings of the 5th International Conference on Contemporary Computing, LNCS-CCIS 306, pp. 322–334, Springer, 2012.
Table 15: Distribution of normal and attack connection instances in real-time packet and flow level TUIDS datasets

Connection type Training dataset Testing dataset
(a) TUIDS intrusion dataset
Packet level
Normal 71785 58.87% 47895 55.52%
DoS 42592 34.93% 30613 35.49%
Probe 7550 6.19% 7757 8.99%
Total 121927 - 86265 -
Flow level
Normal 23120 43.75% 16770 41.17%
DoS 21441 40.57% 14475 35.54%
Probe 8282 15.67% 9480 23.28%
Total 52843 - 40725 -
(b) TUIDS coordinated scan dataset
Packet level
Normal 65285 90.14% 41095 84.95%
Probe 7140 9.86% 7283 15.05%
Total 72425 - 48378 -
Flow level
Normal 20180 73.44% 15853 65.52%
Probe 7297 26.56% 8357 34.52%
Total 27477 - 24210 -
(c) TUIDS DDoS dataset
Packet level
Normal 46513 68.62% 44328 60.50%
Flooding attacks 21273 31.38% 28936 39.49%
Total 67786 - 73264 -
Flow level
Normal 27411 57.67% 28841 61.38%
Flooding attacks 20117 42.33% 18150 38.62%
Total 47528 - 46991 -
Table 16: Comparison of existing datasets and their characteristics
Dataset u v w No. of instances No. of attributes x y z Some references
Synthetic No No Yes user dependent user dependent Not known any user dependent [4, 1]
KDDcup99 Yes No Yes 805050 41 BCTW P C1 [48, 33, 47, 31]
NSL-KDD Yes No Yes 148517 41 BCTW P C1 [43]
DARPA 2000 Yes No No Huge Not known Raw Raw C2 [37]
DEFCON No No No Huge Not known Raw P C2 [37]
CAIDA Yes Yes No Huge Not known Raw P C1 [37]
LBNL Yes Yes No Huge Not known Raw P C2 [46]
Endpoint Yes Yes No Huge Not known Raw P C2 , C3 [46]
UNIBS Yes Yes No Huge Not known Raw P C2 [46]
ISCX-UNB Yes Yes Yes Huge Not known Raw P A [37]
KU Yes Yes No Huge 24 BTW P C1 [39]
TUIDS Yes Yes Yes Huge 50,24 BCTW P,F C1 [4, 1]
u-realistic network configuration
v-indicates realistic traffic
w-describes the label information
x-types of features extracted as basic features (B), content-based features (C), time-based features (T),
and window-based features (W)
y-explains the types of data as packet based (P) or flow based (F) or hybrid (H) or others (O)
z-represents the attack category as C1 -all attacks, C2 -denial of service, C3 -probe, C4 -user to root,
C5 -remote to local, and A-application layer attacks
[17] D. Hoplaros, Z. Tari, and I. Khalil, “Data summarization for network traffic monitoring,” Journal of Network and Computer Applications, vol. 37, pp. 194–205, 2014.
[18] Information Systems Technology Group, MIT Lincoln Lab, DARPA Intrusion Detection Data Sets, Mar. 2000. (http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/2000data.html)
[19] V. Jacobson, C. Leres, and S. McCanne, “The tcpdump manual page,” Lawrence Berkeley Laboratory, Berkeley, CA, 1989.
[20] V. Jacobson, C. Leres, and S. McCanne, “Libpcap,” Lawrence Berkeley Laboratory, Berkeley, CA, initial public release, June 1994.
[21] KDDcup99, “Knowledge discovery in databases DARPA archive,” 1999. (https://archive.ics.uci.edu/ml/databases/kddcup99/)
[22] K. Kendall, A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems, Master’s Thesis, MIT, 1999.
[23] Lawrence Berkeley National Laboratory (LBNL), ICSI, LBNL/ICSI Enterprise Tracing Project, 2005. (http://www.icir.org/enterprise-tracing/)
[24] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, and J. Srivastava, “A comparative study of anomaly detection schemes in network intrusion detection,” in Proceedings of the 3rd SIAM International Conference on Data Mining, pp. 25–36, 2003.
[25] B. Li, J. Springer, G. Bebis, and M. H. Gunes, “A survey of network flow applications,” Journal of Network
and Computer Applications, vol. 36, no. 2, pp. 567–581, 2013.
[26] R. P. Lippmann, D. J. Fried, I. Graf, et al., “Evaluating intrusion detection systems: The 1998 DARPA offline intrusion detection evaluation,” in Proceedings of the DARPA Information Survivability Conference and Exposition, pp. 12–26, 2000.
[27] M. V. Mahoney and P. K. Chan, “An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection,” in Proceedings of the 6th International Symposium on Recent Advances in Intrusion Detection, pp. 220–237, 2003.
[28] S. McCanne and V. Jacobson, “The BSD packet filter: A new architecture for user level packet capture,” in Proceedings of the Winter 1993 USENIX Conference, pp. 259–269, 1993.
[29] J. McHugh, “Testing intrusion detection systems: A critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln Laboratory,” ACM Transactions on Information and System Security, vol. 3, pp. 262–294, Nov. 2000.
[30] P. Mell, V. Hu, R. Lippmann, J. Haines, and M. Zissman, An Overview of Issues in Testing Intrusion Detection Systems, 2003. (http://citeseer.ist.psu.edu/621355.html)
[31] Z. Muda, W. Yassin, M. N. Sulaiman, and N. I. Udzir, “A K-means and naive bayes learning approach for better intrusion detection,” Information Technology Journal, vol. 10, no. 3, pp. 648–655, 2011.
[32] NSL-KDD, NSL-KDD Data Set for Network-based Intrusion Detection Systems, Mar. 2009. (http://iscx.cs.unb.ca/NSL-KDD/)
[33] M. E. Otey, A. Ghoting, and S. Parthasarathy, “Fast distributed outlier detection in mixed-attribute data sets,” Data Mining and Knowledge Discovery, vol. 12, no. 2-3, pp. 203–228, 2006.
[34] R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, and B. Tierney, “A first look at modern enterprise traffic,” in Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement, pp. 2, Berkeley, USA, 2005.
[35] R. Pang, M. Allman, V. Paxson, and J. Lee, “The devil and packet trace anonymization,” SIGCOMM Computer Communication Review, vol. 36, no. 1, pp. 29–38, 2006.
[36] L. Portnoy, E. Eskin, and S. Stolfo, “Intrusion detection with unlabeled data using clustering,” in Proceedings of ACM CSS Workshop on Data Mining Applied to Security, pp. 5–8, 2001.
[37] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Towards developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers & Security, vol. 31, no. 3, pp. 357–374, 2012.
[38] J. Song, H. Takakura, and Y. Okabe, “Description of Kyoto University benchmark data,” pp. 1–3, 2006. (http://www.takakura.com/Kyoto_data/BenchmarkData-Description-v5.pdf)
[39] J. Song, H. Takakura, Y. Okabe, and K. Nakao, “Toward a more practical unsupervised anomaly detection system,” Information Sciences, vol. 231, pp. 4–14, Aug. 2013.
[40] A. Sperotto, R. Sadre, F. Vliet, and A. Pras, “A labeled data set for flow-based intrusion detection,” in Proceedings of the 9th IEEE International Workshop on IP Operations and Management, pp. 39–50, Venice, Italy, 2009.
[41] S. J. Stolfo, W. Fan, W. Lee, A. Prodromidis, and P. K. Chan, “Cost-based modeling for fraud and intrusion detection: Results from the JAM project,” in Proceedings of the IEEE DARPA Information Survivability Conference and Exposition, vol. 2, pp. 130–144, USA, 2000.
[42] symantec.com, Symantec Security Response, June 2015. (http://securityresponse.symantec.com/avcenter)
[43] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD CUP 99 data set,” in Proceedings of the 2nd IEEE International Conference on Computational Intelligence for Security and Defense Applications, pp. 53–58, USA, 2009.
[44] C. Thomas, V. Sharma, and N. Balakrishnan, “Usefulness of DARPA dataset for intrusion detection system evaluation,” in Proceedings of the Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security, SPIE 6973, Orlando, FL, 2008.
[45] UNIBS, University of Brescia Dataset, 2009. (http://www.ing.unibs.it/ntw/tools/traces/)
[46] J. Xu and C. R. Shelton, “Intrusion detection using continuous time bayesian networks,” Journal of Artificial Intelligence Research, vol. 39, pp. 745–774, 2010.
[47] G. Zhang, S. Jiang, G. Wei, and Q. Guan, “A prediction-based detection algorithm against distributed denial-of-service attacks,” in Proceedings of the ACM International Conference on Wireless Communications and Mobile Computing: Connecting the World Wirelessly, pp. 106–110, Leipzig, Germany, 2009.
[48] Y. F. Zhang, Z. Y. Xiong, and X. Q. Wang, “Distributed intrusion detection based on clustering,” in Proceeding of the International Conference on Machine Learning and Cybernetics, vol. 4, pp. 2379–2383, Aug. 2005.

Monowar H. Bhuyan is an assistant professor in the Department of Computer Science and Engineering at Kaziranga University, Jorhat, Assam, India. He received his Ph.D. in Computer Science & Engineering from Tezpur University (a Central University) in February 2014. He is a life member of IETE, India. His research areas include data mining, cloud security, and computer and network security. He has published 20 papers in international journals and refereed conference proceedings. He serves as a programme committee member and referee for several international conferences and journals.

Dhruba K. Bhattacharyya received his Ph.D. in Computer Science from Tezpur University in 1999. Currently, he is a Professor in the Computer Science & Engineering Department at Tezpur University. His research areas include data mining, network security, and bioinformatics. Prof. Bhattacharyya has published more than 220 research papers in leading international journals and conference proceedings and has also written/edited 10 books. He is on the editorial boards of several international journals and on the programme committees/advisory bodies of several international conferences/workshops.
Jugal K. Kalita is a professor of Computer Science at the University of Colorado at Colorado Springs. He received his Ph.D. from the University of Pennsylvania in 1990. His research interests are in natural language processing, machine learning, artificial intelligence, bioinformatics, and applications of AI techniques to computer and network security. He has published more than 150 papers in international journals and refereed conference proceedings and has written two technical books. Professor Kalita is a frequent visitor to Tezpur University, where he collaborates on research projects with faculty and students.