ITCM: A REAL TIME INTERNET TRAFFIC CLASSIFIER MONITOR
ABSTRACT
The continual growth of high-speed networks is a challenge for real-time network analysis systems.
Real-time traffic classification is an important issue for corporations and ISPs (Internet Service
Providers). This work presents the design and implementation of a real-time, flow-based network
traffic classification system. The classifier monitor acts as a pipeline consisting of three modules:
packet capture and pre-processing, flow reassembly, and classification with Machine Learning (ML).
The modules are built as concurrent processes with well-defined data interfaces between them, so
that any module can be improved and updated independently. In this pipeline, the flow reassembly
function is the performance bottleneck. This implementation uses an efficient reassembly method
that results in an average delivery delay of approximately 0.49 seconds. For the classification
module, the performance of the K-Nearest Neighbor (KNN), C4.5 Decision Tree, Naive Bayes (NB),
Flexible Naive Bayes (FNB) and AdaBoost Ensemble Learning algorithms is compared in order to
validate our approach.
KEYWORDS
Traffic Classification System, Pipeline, Flow Reassembly, Machine Learning.
1. INTRODUCTION
Internet traffic is changing continuously, which makes it difficult to characterize network
behaviour and structure. Massive games [1] and cloud and grid services increase their share of
total network traffic every day. Traffic monitoring systems generally make use of flow
information. Examples are NetFlow [2] and IETF IPFIX [3], which define standards for exporting
flow information from routers and switches. Such systems are widely used by network service
providers and corporations to gain knowledge about critical business applications, analyze
communication patterns prevalent in traffic, collect data for accounting, or detect anomalous
traffic patterns [4]. A vital issue for corporations and ISPs (Internet Service Providers) is to
identify the types of application traffic transmitted on their networks [5].
Semi-supervised learning has received significant attention in pattern recognition and machine
learning [6]. In network traffic analysis, encryption, processing restrictions, protocol
obfuscation and the use of ephemeral ports make the construction of classification models
difficult. The large amount of Internet traffic flowing through networks makes approaches that
combine labelled and unlabeled data suitable for constructing accurate classifiers.
There are a large number of papers in the traffic monitoring and traffic classification area. Most
papers usually focus on either traffic flow reassembly or traffic classification and identification,
but not on their combination. This paper describes the architecture of a real time Internet traffic
DOI:10.5121/ijcsit.2014.6602
International Journal of Computer Science & Information Technology (IJCSIT) Vol 6, No 6, December 2014
classifier monitor for use in corporate networks. It also evaluates different machine learning
methods for network traffic classification. The classifier monitor is based on the concept of
bidirectional flow. This means that the fundamental object to be classified into a given pattern
is the traffic flow, either complete or as a subflow. A flow is defined as one or more packets
between a host pair with the same quintuple: source and destination IP addresses, source and
destination ports, and protocol type (ICMP, TCP, UDP) [9].
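As an illustration of the bidirectional flow concept, the sketch below builds a lookup key from the quintuple so that packets travelling in either direction map to the same flow. The names `FlowKey` and `flow_key` are hypothetical and are not taken from the ITCM implementation; ordering the two endpoints is just one simple way to make the key direction-independent.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Direction-independent flow identifier (hypothetical helper type)."""
    ip_a: str
    port_a: int
    ip_b: str
    port_b: int
    protocol: str


def flow_key(src_ip, src_port, dst_ip, dst_port, protocol):
    """Return the same key for both directions of a flow.

    The two endpoints are sorted so that (A -> B) and (B -> A)
    produce an identical quintuple.
    """
    first, second = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return FlowKey(first[0], first[1], second[0], second[1], protocol.upper())
```

With such a key, a packet from client to server and the corresponding reply are stored under the same dictionary entry.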
The remainder of this paper is organized as follows. Section 2 reviews related work on flow
reassembly and traffic classification. Section 3 describes the design and implementation of the
classifier monitor. Section 4 details the data collection used to evaluate the ITCM and describes
how the experiments were performed. Section 5 presents and discusses the results of the
performance tests. Section 6 ends with some conclusions and future work.
2. RELATED WORK
Statistical classification is based on collecting statistical information on properties of the traffic
flow, and relies on the assumption that each category has a particular distribution of properties
which represents it and can be used to identify it [10]. Statistical traffic classification using
machine learning techniques has been widely explored in recent years.
There are a limited number of tools available in the literature for traffic classification [7]. The
NetAI tool is able to perform online and offline feature extraction, although it does not directly
perform traffic classification. FullStats is able to extract an extensive set of characteristics,
but only from an offline trace. GTVS, which is DPI (Deep Packet Inspection) based software, allows
the labelling of traces in a semi-automated manner. The only two traffic classification tools which
implement machine learning methods are Tstat 2.0 and TIE. Tstat makes use of packet size and
inter-packet time features in a Bayesian framework for the identification of Skype and obfuscated
P2P file sharing. Although the tool identifies a limited number of applications, it can extract a
large number of characteristics. The TIE software platform is available to the research community
and allows the development of classification methods. The framework provides traffic capture
and processing, feature extraction, and online classification. At the moment, only a small number
of features are available in TIE. Systems like Bro, which is able to collect flow statistics and
perform payload-based classification at high speeds, are limited when the traffic is encrypted [8].
Here, we briefly review some important approaches to stream reassembly (subsection 2.1) and
traffic classification (subsection 2.2).
system checks the packet sequence number and determines if this packet is the next expected
packet for the respective connection. If true, the packet is sent for signature detection.
In [14], to improve forensic analysis, the authors present a TCP stream evaluation methodology
which consists of estimating the TCP reassembly accuracy through the precise identification of
potential errors hidden in this process at the packet and stream levels. This approach can be used
for the derivation and computation of reassembly errors. In the proposed TCP reassembly model, a
session counting algorithm is presented, which defines a flow as a set of TCP packets with the
same values for source IP and port and destination IP and port; a flow can have multiple sessions
delimited by well-defined phases of connection establishment and termination. Using two traffic
captures obtained with the Tcpdump tool, the authors used a libpcap-based program [15] to read the
packet traces and evaluate the reassembly and error verification approaches, which were
experimentally validated with known analysis tools such as Tcptrace and Tcpflow.
classification is performed by the execution of the KNN (k=1) classification method over the
members of the closest cluster to an unseen flow.
Mallapragada et al. (2009) [6] propose a boosting framework for binary classification which
combines the advantages of graph-based approaches and ensemble methods. The strategy is to
improve the performance of a given supervised learning algorithm using unlabeled data. The
proposed algorithm, named SemiBoost, is a general and efficient approach that allows the choice
of a base classifier well suited to a specific task. As in other boosting algorithms, the
classification accuracy is improved iteratively, but SemiBoost selects unlabeled data along the
iterations. The proposed algorithm combines similarity information with classifier predictions to
obtain the most confident pseudo-labels. The authors used the WEKA [18] software in the
implementation of benchmark semi-supervised approaches. The evaluation and comparison of the
method with three state-of-the-art semi-supervised algorithms (TSVM, LDS and LapSVM) on 16
different datasets indicate a significant gain for the proposed approach when evaluated with the
base classifiers Decision Stump, J48 and SVM.
Szabó et al. (2011) [19] propose a framework for traffic classification which uses machine
learning techniques and only information from the packet headers. One component of the
proposed framework is the combination of classification and clustering algorithms to make the
identification system robust under different network conditions. The training and evaluation of
the classification system were performed with traffic flows obtained from measurements in
networks with different access technologies and at different locations, in order to vary the
traffic characteristics. The authors found that clustering and classification methods perform
differently when used to identify traffic from unknown networks. They also verified that
clustering algorithms are more robust to changes in network parameters, while classification
methods can learn a specific network more accurately. The authors present and evaluate two
different combinations of classification and clustering approaches that increase accuracy
compared to the standalone cases. In the first combination, named classification with clustering
information, each training flow receives its cluster number as a new attribute for supervised
classification. The cluster numbers of training flows are obtained from a previous clustering of
the training data. Since the clustering information attribute may be neglected or given low
importance by a supervised technique, this approach cannot always improve the global accuracy.
The second approach, named model refinement with per-cluster based classification, initially
applies unsupervised learning to generate clusters. A separate classification model is then built
for the set of flows of each cluster. In the evaluation phase, for an unseen flow, the
unsupervised method returns the number of the most similar cluster, whose associated model is
then used to evaluate the flow. This approach always gives high importance to the clustering
results, and the supervised techniques can build simple models since each group contains a
limited number of flow types. This implies that the impact of over-fitting of the classification
model is reduced. The per-cluster based classification scheme outperforms the standalone
supervised and unsupervised methods. The proposed methods obtained TP ratios of 93% and 75% for
the evaluation on the same network and the cross-checks on other networks, respectively.
Erman et al. (2007) [8] propose and evaluate a semi-supervised methodology for statistical
traffic classification that is able to accommodate known and unknown applications. The authors
mention three main advantages of their proposed semi-supervised method. First, fast and
accurate classifiers can be designed from a training dataset that consists of a small number of
labelled flows and a large number of unlabeled flows. Second, the approach is robust and can
handle both previously unknown applications and new patterns of existing applications.
Moreover, network operators can add unlabeled flows to improve the classifier performance,
allowing iterative development. Finally, the proposed approach can be integrated with
solutions that collect flow statistics. The semi-supervised model combines supervised and
unsupervised methods in two steps. The approach first uses the K-Means clustering technique to
partition a training dataset composed of a few labelled flows and abundant unlabeled flows. The
second step uses the available labelled flows to build a mapping from clusters to applications;
clusters without labelled flows remain unmapped and correspond to flows that possibly do not
belong to any known application. The authors found that the proposed model was able to identify a
variety of applications, such as Web, P2P, FTP and E-mail, with high accuracy. The flow and byte
accuracies were above 98% and 93%, respectively. Furthermore, the authors noted that datasets
with a large number of flows consistently achieve high classification accuracy. The authors
observe that, even with labelling tools, labelling a large dataset can be expensive and
difficult. In practice, labelling only a fraction of the training flows is sufficient to obtain
high levels of accuracy.
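The second step of this semi-supervised scheme can be sketched as follows. The sketch assumes that the clustering has already been done and that each flow carries a cluster number and an optional label (`None` for unlabeled flows); the function name `map_clusters` and the majority-vote rule are illustrative assumptions, not the exact procedure of [8].

```python
from collections import Counter, defaultdict


def map_clusters(cluster_ids, labels):
    """Map each cluster to the most frequent label among its labelled flows.

    Clusters that contain no labelled flow are mapped to None, i.e. they
    remain unmapped and may correspond to unknown applications.
    """
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counter = votes[cluster]  # registers the cluster even if unlabeled
        if label is not None:
            counter[label] += 1
    return {
        cluster: (counter.most_common(1)[0][0] if counter else None)
        for cluster, counter in votes.items()
    }
```

An unseen flow would then be assigned to its nearest cluster and receive the application mapped to that cluster, or be reported as unknown when the cluster is unmapped.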
3.1. Architecture
The monitor works as a three-stage pipeline, with a capture and pre-processing module, a flow
reassembly module, and an attribute extraction and classification module. For the purposes of the
pipeline, time is divided into intervals of 30 s. This value was chosen arbitrarily. Once the
monitor starts, three parallel processes are in execution during each interval: the packet
capture, the flow reassembly of the packets captured in the previous interval, and the flow
classification of the capture made two intervals earlier. Another parallel process is responsible
for closing old connections periodically, in order to reduce memory use and processing during the
reassembly process. This approach allows the classifier monitor to reach a response time of 30
seconds plus the time needed to reassemble the data captured in a given interval. In summary, the
monitor works with a quantum of 30 s of traffic plus an average delay for flow reassembly, feature
extraction, and classification, which was 0.49 seconds in the current implementation.
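The three-stage organization can be illustrated with a minimal sketch in which each stage runs concurrently and hands its result to the next one through a queue. This is not the ITCM code (which is written in C#); the names `stage` and `run_pipeline`, the use of threads, and the `None` end-of-stream marker are assumptions made only for illustration.

```python
import queue
import threading


def stage(work, inbox, outbox):
    """Apply `work` to every item from `inbox` and forward the result.

    A None item marks the end of the stream and is propagated downstream.
    """
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            return
        outbox.put(work(item))


def run_pipeline(intervals, capture, reassemble, classify):
    """Push each interval through capture -> reassembly -> classification."""
    q_in, q_packets, q_flows, q_out = (queue.Queue() for _ in range(4))
    workers = [
        threading.Thread(target=stage, args=(capture, q_in, q_packets)),
        threading.Thread(target=stage, args=(reassemble, q_packets, q_flows)),
        threading.Thread(target=stage, args=(classify, q_flows, q_out)),
    ]
    for worker in workers:
        worker.start()
    for interval in intervals:
        q_in.put(interval)
    q_in.put(None)  # end-of-stream marker
    for worker in workers:
        worker.join()
    results = []
    while (item := q_out.get()) is not None:
        results.append(item)
    return results
```

Because every stage only depends on the queue it reads from and the queue it writes to, one stage can be replaced without touching the others, which is the property the architecture relies on.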
Figure 1 shows the capture and processing environment of the monitoring and classification
system. Basically, we assume that the traffic is mirrored by a network border router to a network
interface monitored by the system. The system periodically processes and categorizes the captured
data, and presents the information obtained from the monitoring and classification process.
Figure 2 shows the layered structure of the implemented classifier monitor, whose module tasks
are online traffic collection from a network point, pre-processing for flow reassembly,
extraction and selection of statistical attributes, flow labelling through payload analysis or a
port-based method (only during the training step), training with a supervised machine learning
technique, and classification using the ML model built from the training data. The classifier
monitor performs packet capture continuously. In the training phase, the captured packets are
sent to a reassembly process, which associates each packet with its respective flow. A parallel
process extracts statistical information from the packet headers, selects the most relevant
attributes using an attribute selection algorithm, and labels the flows with the well-known ports
method [20]. The traffic flows, which are arranged in a spatial representation (each flow is an
instance with a set of characteristics), are used to train a selected supervised classification
method. In the evaluation phase, the unlabeled flows obtained through collection, reassembly, and
attribute extraction are finally evaluated by the classifier.
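A minimal sketch of labelling by well-known ports is shown below. The port table contains only a few IANA assignments chosen for illustration, and the names `WELL_KNOWN_PORTS` and `label_flow` are hypothetical; the actual table used by ITCM is not given in the text.

```python
# A small excerpt of IANA well-known port assignments (illustrative only).
WELL_KNOWN_PORTS = {
    21: "ftp",
    53: "domain",
    80: "www-http",
    443: "https",
}


def label_flow(src_port, dst_port):
    """Return the application label for a flow, or None if no port is known.

    The destination port is checked first because, for a flow opened by a
    client, the server side normally owns the well-known port.
    """
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return None
```

Flows for which the function returns `None` cannot be labelled this way, which is one reason why other labelling techniques may be needed for traffic on ephemeral or non-standard ports.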
Since the modules are implemented as concurrent processes, the data interfaces between the
modules allow each one to be improved and updated independently. For the packet capture and
pre-processing module, the input and output are, respectively, the TCP packet and a data structure
with only the packet information needed by the monitor (e.g. bytes, timestamp, flags, no payload).
For the flow reassembly module, the input and output are, respectively, a set of pre-processed
packets and the data structure which represents the reconstructed flow. For the classification
module, the feature vector is the input and the class is the output.
The monitor was developed in the C# .NET programming language using the Visual Studio Integrated
Development Environment. Online packet capture is based on the sequential reading and processing
of each packet from a network interface. The monitor also adopts a given timeout for packet
capture and presentation of results. The reason for implementing a flow reassembly algorithm,
despite the existence of several tools and libraries for this purpose, such as libNIDS [21],
TcpTrace [22], and WireShark [23], is the possibility of evaluating different approaches for
subflow classification, as mentioned in [24]. Moreover, it becomes possible to evaluate
approaches for real-time TCP stream reassembly, as in [25] and [26], which are fundamental in the
development of a high-speed traffic classification system.
packet capture process must capture completed packets so that the stream is reassembled
correctly.
For the task of packet capture, we use the "TCP Session Reconstruction Tool" [27], which is a C#
utility for packet capture and reconstruction of complete and incomplete TCP sessions. This tool
is available under the CPOL license, and is based on the libnids library [21] and Wireshark. It
uses a TCP reconstruction algorithm named TcpRecon. TcpRecon reconstructs a bidirectional flow,
holds each flow in a dictionary structure and recovers its payload. We reuse this software,
replacing the TcpRecon policy with our proposed reassembly scheme. The TCP flow reassembly is
explained in the next subsection.
for TCP session reassembly. The applied reassembly approach was validated experimentally with
the Tcptrace and Tcpflow tools. The reassembly approach is described in the algorithm in Figure 3
and works as follows: for each TCP packet received, the system searches for the corresponding
connection in the connection record list. If the record is valid, the packet is inserted into it.
If the record is invalid and the packet contains a SYN flag, a new connection is created for
the packet. If the record is invalid and the packet does not contain the SYN flag, the packet is
dropped. If the packet contains the FIN or RST flag, the connection is terminated.
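The decision rules just described can be sketched as follows. The sketch is a simplified illustration and not the algorithm of Figure 3: the `Packet` and `Reassembler` names are hypothetical, a connection is treated as terminated on the first FIN or RST as the text states, and timeouts, sequence numbers and the separate structures for established connections are left out.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Pre-processed packet: flow key and TCP flags only (hypothetical type)."""
    key: tuple
    syn: bool = False
    fin: bool = False
    rst: bool = False


class Reassembler:
    def __init__(self):
        self.active = {}    # flow key -> packets of an open connection
        self.finished = []  # packet lists of terminated connections
        self.dropped = 0    # packets that matched no connection

    def process(self, pkt):
        record = self.active.get(pkt.key)
        if record is None:
            if not pkt.syn:
                # No valid record and no SYN flag: drop the packet.
                self.dropped += 1
                return
            # No valid record but SYN flag set: open a new connection.
            record = self.active[pkt.key] = []
        record.append(pkt)
        if pkt.fin or pkt.rst:
            # FIN or RST terminates the connection.
            self.finished.append(self.active.pop(pkt.key))
```

Keeping the open connections in a dictionary keyed by the flow identifier makes the search for the corresponding record a constant-time lookup on average.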
implemented in the first prototyping of ITCM, other sophisticated labelling techniques can be
subsequently incorporated.
A common assumption, which is not inherent to the Naive Bayesian approach but is still often used,
is that, for each class, the values of numerical attributes are normally distributed. According to
[33], although this assumption does not reflect the reality of Internet traffic, such an approach
outperforms some more complex models.
3.5.3. Kernel Naive Bayes
The KNB learning technique is a generalization of NB and is algorithmically similar to it in
every aspect, except in the computation of the distribution function $P(X = x \mid C = c)$ for
continuous attributes, which can be replaced by a variety of non-parametric estimation methods.
Among them is kernel density estimation, which, as the name suggests, uses kernel estimation
methods instead of the simple Gaussian approach [34]. The distribution function is given by the
following expression, where $n_c$ represents the number of training instances that belong to
class $c$, $g$ is a Gaussian kernel with standard deviation $\sigma_c$, and $\mu_i = x_i$:

$$P(X = x \mid C = c) = \frac{1}{n_c} \sum_{i=1}^{n_c} g(x; \mu_i, \sigma_c)$$
The intention of the KNB method is that kernel estimation allows the technique to perform well in
domains that violate the normality assumption. According to [9], the simple assumption of
normality of the discriminators is inaccurate, and problems arise when the actual distribution is
multimodal; this situation may indicate that the considered class is too broad or that another
distribution must be used for the data analysis.
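The difference between the two estimators can be illustrated with a short sketch. It compares a single Gaussian fitted to the samples of one class with a Gaussian kernel density estimate that places one kernel on each training sample. The function names are hypothetical, and the default bandwidth of 1/sqrt(n) is an assumption taken from the flexible Bayes formulation rather than a value stated in this paper.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normal_estimate(x, samples):
    """Class-conditional density under the single-Gaussian assumption."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return gaussian_pdf(x, mean, math.sqrt(variance))


def kernel_estimate(x, samples, bandwidth=None):
    """Class-conditional density as an average of kernels, one per sample."""
    n = len(samples)
    h = bandwidth if bandwidth is not None else 1.0 / math.sqrt(n)
    return sum(gaussian_pdf(x, s, h) for s in samples) / n
```

For a bimodal attribute, for example values concentrated around -5 and +5, the single Gaussian places most of its density near 0 where no sample was observed, while the kernel estimate follows the two modes.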
3.5.4. C4.5 Decision Tree
The C4.5 method, which is used to generate univariate decision trees [35], is a statistical
classifier because its decision trees can be used for classification [36]. The classifier was
created by Ross Quinlan [37] and is an extension of the ID3 (Iterative Dichotomiser 3) algorithm,
which builds simple decision trees. C4.5 makes use of information entropy to build the decision
tree in the same way as ID3.
C4.5 makes its decisions based on the data, taking into account all available input features
[38]. The algorithm recursively chooses the feature with the highest normalized information gain
at each node of the tree. The chosen feature splits the data into subsets enriched in one class
or the other [36]. The information gain indicates how well a decision separates the output
classes [38]. We used the J48 learning method, which is an open source Java implementation of
the C4.5 algorithm.
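The split criterion can be made concrete with a small sketch that computes the normalized information gain (gain ratio) of a categorical attribute. It is a simplified illustration with hypothetical function names: continuous attributes, missing values and pruning, which C4.5 and J48 also handle, are not covered.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split entropy."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0
```

At each node, the attribute with the highest ratio would be selected, so an attribute that separates the classes perfectly scores 1 while an attribute unrelated to the class scores 0.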
3.5.5. AdaBoost Ensemble Learning Algorithm
AdaBoost is an ensemble learning algorithm which combines weak learning classifiers by assigning
weights, with the overall goal of correctly classifying the instances. Generic boosting
manipulates the training data in order to generate distinct classifiers and improves the
classification accuracy iteratively. At each iteration, the weights are adjusted so that the
weights of the misclassified examples are increased and those of correctly classified examples
are decreased. All instances are used in each iteration. Each classifier generated in a round of
AdaBoost has an associated weight, which means that a classifier that correctly identifies a high
proportion of instances will have more weight in the total ensemble. The final decision of the
ensemble is the weighted-majority vote of the hypotheses generated by the individual classifiers
[38]. See [39] for more details.
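One round of the weight adjustment can be sketched as below, using the update rule of the standard two-class (discrete) AdaBoost; whether the WEKA implementation used in this work follows exactly this variant is an assumption. The function name `adaboost_round` is hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math


def adaboost_round(weights, correct):
    """Perform one AdaBoost weight update.

    weights: current instance weights, summing to 1.
    correct: booleans, True where the weak classifier was right.
    Returns the classifier weight (alpha) and the new instance weights.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Decrease the weights of correct instances, increase the others.
    updated = [
        w * math.exp(-alpha if ok else alpha)
        for w, ok in zip(weights, correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]
```

With four equally weighted instances and a single misclassification, the misclassified instance ends the round with half of the total weight, which forces the next weak classifier to concentrate on it.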
4. METHODOLOGY
This section details the design and operation of our classifier monitor, and concludes with the
presentation of the modules which compose the system.
Characteristic          Description
Packets Number          614282
Capture Size            565.62 MB
Capture Duration        3516.79 s
Average Packet Size     920.78 Bytes
Average Capture Rate    1.28 Mbps
Table 2. Characteristics of the Trace
Characteristic          Description
Packets Number          1579921
Capture Size            1.88 GB
Capture Duration        1355.83 s
Average Packet Size     1195.16 Bytes
Average Capture Rate    11.14 Mbps
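The average capture rates in the two tables above can be checked from the other reported quantities. The sketch below assumes that the rate is the total number of captured bits divided by the capture duration, with 1 Mbps taken as 10^6 bits per second; the function name is hypothetical.

```python
def average_rate_mbps(packets, avg_packet_bytes, duration_s):
    """Average capture rate in Mbps from packet count, mean size and duration."""
    total_bits = packets * avg_packet_bytes * 8
    return total_bits / duration_s / 1e6


rate_small = average_rate_mbps(614282, 920.78, 3516.79)
rate_large = average_rate_mbps(1579921, 1195.16, 1355.83)
```

Under these assumptions the two results agree with the reported 1.28 Mbps and 11.14 Mbps to within 0.01 Mbps.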
Table 3 presents the summary of the application flows identified in the current traces, which
were: Www (World Wide Web), Https (Http protocol over TLS/SSL), Ftp (File Transfer
Protocol), Xvttp (Xvttp Protocol) and Isakmp (Isakmp Protocol). In one of the traffic traces, the
most representative categories are the Www and Ftp applications. In the other, the Https and
Isakmp applications have a greater number of flow instances. In our study, the classification and
training steps are performed at the end of the simulated packet capture and reassembly. This
methodology is necessary to evaluate the modules of our classifier monitor.
Category    Description
Www-http    World Wide Web HTTP
HTTPs       Http Protocol over TLS/SSL
FTP         File Transfer Protocol
Domain      Domain Name Server

Flows per category, first trace:   1022, 139, 1808 (Total 2969)
Flows per category, second trace:  337, 27, 1 (Total 365)
Metric                                 First trace     Second trace
TCP Connections Number                 2969            365
Max Capture & Reassembly Throughput    3.95 fps        2.89 fps
Max Reassembly Throughput              24997.25 fps    128.21 fps
Average Capture & Reassembly Rate      1014.73 Mbps    635.34 Mbps
Average Delivery Delay                 0.49 s          7.50 s
Total Monitor Time                     75.67 s         310.84 s
In Table 5, our system is compared with the Tcptrace, TcpFlow, TcpRecon and Wireshark tools.
TcpRecon was modified to use a flow timeout of 60 seconds. We can observe that the number of
flows is not the same across the tools, because of differences in the traffic flow concept used,
as explained previously. We can also observe that the execution time of our adopted reassembly
approach is lower than that of the other tools. Our reassembly scheme was implemented in the TCP
Session Reconstruction Tool, replacing the TcpRecon default policy. In summary, the difference
between these two policies is the adoption of the recently-accessed-first principle and the use
of different data structures to hold established and not yet established TCP connections.
Table 5. Comparison with External Tools
Approach/Tool        Flows Number    Reassembly Time
Proposed Approach    2969            75.67 s
TcpFlow              5894            118.87 s
Tcptrace             3044            612.95 s
Wireshark            3036            182.69 s
TcpRecon             3036            96.97 s
Since TcpRecon and our proposed scheme are written in the same language and use the same packet
capture libraries, we also compare the performance of these two policies against each other. The
confidence interval estimate for a population of events is more reliable if the event is
executed at least 30 times [40]. We executed and measured the elapsed times of the
aforementioned TCP reassembly policies. The policies were evaluated over the datasets already
presented. We computed the average execution time and the confidence interval for each TCP
policy, considering a confidence level of 95%. The resulting confidence intervals for TcpRecon
and our adopted reassembly scheme are presented in Table 6. For the first traffic trace, our
scheme reduces the average execution time by 20.31 seconds. For the second traffic trace, there
is also a reduction of 9.39 seconds with our approach.
Table 6. Performance Comparison of Reassembly Policies
Traffic Trace    TcpRecon                  Proposed Policy
First            91.85 ± 5.61 seconds      71.54 ± 3.57 seconds
Second           380.36 ± 14.82 seconds    370.97 ± 7.72 seconds
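An interval of the form "mean ± half-width", as reported in Table 6, can be computed from repeated measurements as sketched below. The sketch assumes the normal approximation (z = 1.96 for a 95% confidence level), which is commonly used when at least 30 runs are available; the paper does not state which formula was applied, and the function name is hypothetical.

```python
import math
import statistics


def confidence_interval_95(samples):
    """Return (mean, half_width) of a 95% confidence interval for the mean.

    Uses the normal approximation, so it is meant for 30 or more samples.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(n)
    return mean, half_width
```

Given the 30 elapsed times of one policy on one trace, the function returns the two numbers that would appear on either side of the ± sign.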
Table 7 presents the main results of the classification process. We can observe that the C4.5
Decision Tree was able to categorize on average 87.40% and 89.86% of the traffic correctly for
the two traffic traces. The AdaBoost ensemble algorithm, using the DecisionStump classifier, was
able to categorize on average 78.17% and 89.58% of the traffic correctly. The KNN technique,
with k=10, was able to categorize on average 86.25% and 91.50% of the traffic correctly, against
72.48% and 80.00% for the NB classifier. The classification phase took a few seconds, and the
results serve to validate the previous phases of the classifier monitor.
Table 7. Global Accuracy per Trace
Classifier                   First trace    Second trace
C4.5 Decision Tree           87.40%         89.86%
AdaBoost (DecisionStump)     78.17%         89.58%
K-Nearest Neighbor           86.25%         91.50%
Naive Bayes                  72.48%         80.00%
Flexible Naive Bayes         64.09%         88.76%
6. CONCLUSIONS
This paper presented the architecture, implementation, and performance of an Internet traffic
classifier monitor. The monitor is composed of three modules which were implemented as
concurrent processes: capture and pre-processing, flow reassembly, and classification. For the
first traffic trace, the maximum throughput of the reassembly module in the current
implementation is 24997.25 flows per second, and the average delivery delay is 0.49 seconds. For
the classification module, the C4.5 algorithm reached average accuracies of 87.40% and 89.86% on
the two traces, against 86.25% and 91.50% for KNN, 78.17% and 89.58% for AdaBoost, and 72.48% and
80.00% for NB.
Future directions for this research include incorporating subflow-based classification into ITCM
to reduce response time. Second, we aim to verify the performance of our classifier monitor on
gigabit links, which are becoming increasingly common in computer networks. Finally, we could
also prototype ITCM with NetFPGA hardware, since the implementation of network systems in
hardware is essential for any real-time application, particularly in gigabit networks [41].
REFERENCES
[1] Y.-T. Han and H.-S. Park, Game traffic classification using statistical characteristics at the
transport layer, ETRI Journal, vol. 32, no. 1, 2010.
[2] B. Claise, G. Sadasivan, V. Valluri, and M. Djernaes, Cisco systems netflow services export
version 9, RFC 3954, October, Tech. Rep., 2004.
[3] B. Claise, M. Fullmer, P. Calato, and R. Penno, Ipfix protocol specifications,
draft-ietf-ipfix-protocol-03.txt, 2004.
[4] P. Siska, M. Stoecklin, A. Kind, and T. Braun, A flow trace generator using graph-based traffic
classification techniques, in Proceedings of the 6th International Wireless Communications and
Mobile Computing Conference. ACM, 2010, pp. 457-462.
[5] H. Kim, K. Claffy, M. Fomenkov, D. Barman, M. Faloutsos, and K. Lee, Internet traffic
classification demystified: myths, caveats, and the best practices, in Proceedings of the 2008 ACM
CoNEXT conference. ACM, 2008, p. 11.
[6] P. Mallapragada, R. Jin, A. Jain, and Y. Liu, Semiboost: Boosting for semi-supervised learning,
Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 11, pp. 2000-2014,
2009.
[7] A. Dainotti, A. Pescapé, and K. Claffy, Issues and future directions in traffic classification,
Network, IEEE, vol. 26, no. 1, pp. 35-40, 2012.
[8] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, Offline/realtime traffic
classification using semi-supervised learning, Performance Evaluation, vol. 64, no. 9-12,
pp. 1194-1213, 2007.
[9] A. Moore and D. Zuev, Internet traffic classification using bayesian analysis techniques, in
Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and
modelling of computer systems. ACM, 2005, p. 60.
[10] R. Bar-Yanai, M. Langberg, D. Peleg, and L. Roditty, Realtime classification for encrypted
traffic, Experimental Algorithms, pp. 373-385, 2010.
[11] B. Xiong, C. Xiao-su, and C. Ning, A Real-Time TCP Stream Re-assembly Mechanism in High-Speed
Network, Journal of South-West Jiaotong University, vol. 17, no. 3, 2009.
[12] J. Postel, Rfc 793: Transmission control protocol, DARPA Internet Program Protocol
Specification, 1981.
[13] P. Agarwal, TCP Stream Reassembly and Web based GUI for Sachet IDS, Master's thesis, Indian
Institute of Technology Kanpur, Kanpur, India, 2007.
[14] G. Wagener, A. Dulaunoy, and T. Engel, Towards an estimation of the accuracy of tcp reassembly
in network forensics, in Future Generation Communication and Networking, 2008. FGCN'08. Second
International Conference on, vol. 2. IEEE, 2008, pp. 273-278.
[15] V. Jacobson and S. McCanne, libpcap: Packet capture library, Lawrence Berkeley
Laboratory, Berkeley, CA, 2009.
[16] Y. Wang and S. Yu, Machine Learned Real-Time Traffic Classifiers, in Intelligent Information
Technology Application, 2008. IITA08. Second International Symposium on, vol. 3. IEEE, 2009,
pp. 449454.
[17] L. Jun, Z. Shunyi, L. Yanqing, and Z. Zailong, Internet traffic classification using machine
learning, in Second International Conference on Communications and Networking in China, 2007.
CHINACOM07, 2007, pp. 239243.
[18] E. Frank, M. Hall, and L. Trigg, Weka 3-Data Mining with Open Source Machine Learning Software
in Java, The University of Waikato, 2000.
[19] G. Szab, J. Szle, Z. Turnyi, and G. Pongrcz, Multi-level machine learning traffic classification
system, in ICN 2012, The Eleventh International Conference on Networks, 2012, pp. 6977.
[20] IANA. (2014, May) Internet assigned numbers authority. [Online]. Available: http:/www.iana.org
[21] R.
Wojtczuk,
Libnids
homepage.
available
at
http://libnids.sourceforge.net/, 2012.
[22] S. Ostermann, Tcptrace homepage. Available at http://www.tcptrace.org/, 2012.
[23] A. Orebaugh, G. Ramirez, and J. Burke, Wireshark and Ethereal network protocol analyzer toolkit.
Syngress Media Inc, 2007.
[24] G. Maiolini, A. Baiocchi, A. Rizzi, and C. Di Iollo, Statistical classification of services tunneled into
SSH connections by a k-means based learning algorithm, in Proceedings of the 6th International
Wireless Communications and Mobile Computing Conference. ACM, 2010, pp. 742-746.
[25] S. Nor, Near Real Time Online Flow-Based Internet Traffic Classification Using Machine
Learning (C4.5), International Journal of Engineering (IJE), vol. 3, no. 4, p. 370, 2009.
[26] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, Traffic classification on the
fly, ACM SIGCOMM Computer Communication Review, vol. 36, no. 2, pp. 23-26, 2006.
[27] S. Yahalom, Available at http://www.codeproject.com/articles/20501/tcp-session-reconstruction-tool, 2012.
[28] C. Partridge, Jacobson on TCP in 30 instructions. Message-ID 1993Sep8, vol. 213239, 1993.
[29] T. Nguyen and G. Armitage, A survey of techniques for internet traffic classification using
machine learning, Communications Surveys & Tutorials, IEEE, vol. 10, no. 4, pp. 56-76, 2008.
[30] J. Frijters, IKVM, an implementation of Java for Mono and the .NET Framework [computer
software and documentation], 2004.
[31] M. J. Islam, Q. M. J. Wu, M. Ahmadi, and M. A. Sid-Ahmed, Investigating the performance
of naive-Bayes classifiers and k-nearest neighbor classifiers, Convergence Information
Technology, International Conference on, vol. 0, pp. 1541-1546, 2007.
[32] D. Zuev and A. Moore, Traffic classification using a statistical approach, Passive and Active
Network Measurement, pp. 321-324, 2005.
[33] I. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques.
Morgan Kaufmann Pub, 2005.
[34] G. John and P. Langley, Estimating continuous distributions in Bayesian classifiers, in Proceedings
of the eleventh conference on uncertainty in artificial intelligence, vol. 1. Citeseer, 1995, pp. 338-345.
[35] T. Korting, C4.5 algorithm and multivariate decision trees, Image Processing Division,
National Institute for Space Research (INPE), São José dos Campos, SP, Brazil.
[36] K. Singh and S. Agrawal, Comparative analysis of five machine learning algorithms for IP
traffic classification, in Emerging Trends in Networks and Computer Communications (ETNCC),
2011 International Conference on. IEEE, 2011, pp. 33-38.
[37] J. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993, vol. 1.
[38] C. McCarthy and A. Zincir-Heywood, An investigation on identifying SSL traffic, in
Computational Intelligence for Security and Defense Applications (CISDA), 2011 IEEE
Symposium on, April 2011, pp. 115-122.
[39] Y. Freund and R. Schapire, Experiments with a new boosting algorithm, in Machine
Learning: International Workshop then Conference. Morgan Kaufmann Publishers, Inc.,
1996, pp. 148-156.
[40] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design,
Measurement, Simulation, and Modeling. Wiley-Interscience, ISBN: 0471503361, New York, NY,
April 1991.
[41] J. Naous, G. Gibb, S. Bolouki, and N. McKeown, NetFPGA: reusable router architecture for
experimental research, in Proceedings of the ACM workshop on Programmable routers for
extensible services of tomorrow. ACM, 2008, pp. 1-7.
Authors
Silas Santiago Lopes Pereira received his B.S. degree in Computer Science from the State
University of Ceará, Brazil, in 2010, and his Master's degree in Computer Science from the State
University of Ceará, Brazil, in 2013. He is now a professor at the Federal Institute of Ceará.
His interests include machine learning applications and computer networks.
José Everardo Bessa Maia received his B.S. degree from the Federal University of Ceará,
Brazil, in 1980, and his M.S. degree from the State University of Campinas, Brazil, in 1989, both in
Electrical Engineering. He received his Ph.D. from the Federal University of Ceará, Brazil, in 2011. He is a
professor in the Statistics and Computer Science Department at the State University of
Ceará and at the University of Fortaleza.
Jorge Luiz de Castro e Silva completed his postdoctoral research in Computer Science
at the State University of Campinas, Brazil, in 2011. He received his Ph.D. in Computer Science from the
Federal University of Pernambuco, Brazil, in 2004, and his Master's degree in Computer Science
from the Federal University of Ceará, Brazil, in 1997. He received his B.S. degree in Data
Processing Technology from the Federal University of Ceará, Brazil, in 1978, and his B.S. degree in
Business Administration from the State University of Ceará, Brazil, in 1981. He is a
professor at the State University of Ceará and a teacher and advisor in the Master's Program in Computer
Science.