
Internet traffic modeling by means of Hidden Markov Models

2008, Computer Networks


Computer Networks 52 (2008) 2645–2662
Journal homepage: www.elsevier.com/locate/comnet

Alberto Dainotti (a), Antonio Pescapé (a), Pierluigi Salvo Rossi (b,c), Francesco Palmieri (c), Giorgio Ventre (a,*)

(a) Department of Computer Science and Systems, University of Naples "Federico II", Via Claudio 21, 80125 Napoli, Italy
(b) Department of Electronics and Telecommunications, Norwegian University of Science and Technology, O.S. Bragstads plass 2B, 7491 Trondheim, Norway
(c) Department of Information Engineering, Second University of Naples, Via Roma 29, 81031 Aversa (CE), Italy

Article history: Received 1 March 2007; received in revised form 12 December 2007; accepted 7 May 2008; available online 28 May 2008.
Responsible Editor: M. Smirnow

Abstract

In this work, we propose a Hidden Markov Model for Internet traffic sources at packet level, jointly analyzing Inter Packet Time and Packet Size. We give an analytical basis and the mathematical details regarding the model, and we test the flexibility of the proposed modeling approach with real traffic traces related to common Internet services with strong differences in terms of both applications/users and protocol behavior: SMTP, HTTP, a network game, and an instant messaging platform. The presented experimental analysis shows that, even while maintaining a simple structure, the model is able to achieve good results in terms of estimation of statistical parameters and synthetic series generation, taking into account marginal distributions, mutual dependencies, and temporal dependencies. Moreover, we show how, by exploiting such temporal dependencies, the model is able to perform short-term prediction by observing traffic from real sources.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Understanding and solving performance-related issues of current and future networks requires the availability of realistic, but still simple and manageable, traffic models. Therefore the modeling of Internet traffic represents a critical task in the study and in the design of Internet architectures. Many efforts have focused on modeling source traffic related to specific application-level protocols, also with the purpose of conducting realistic network traffic simulation and emulation experiments (i.e. generating synthetic traffic in real networks).

Here we present a source-based modeling approach relying on a packet-level view of Internet traffic. The analysis of network traffic, indeed, can be made at different abstraction levels, e.g. session, conversation, connection/flow, packet, byte. With the term packet-level we mean the characterization of traffic in terms of Inter Packet Time (IPT) and Packet Size (PS). Such an approach is particularly attractive because of its conciseness and flexibility, and because it allows us to look at traffic from the lowest-level point of view. With the term source-based approach, we mean the characterization and modeling of traffic generated by Internet applications running on single hosts.

☆ This work has been partially supported by PRIN 2007 RECIPE Project, by CONTENT EU Network of Excellence, and by the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 216585 (INTERSECTION Project).
☆☆ Preliminary results within the same framework of this work have been recently published in [1,2].
* Corresponding author. Tel.: +39 081 7683908; fax: +39 081 7682950. E-mail addresses: [email protected] (A. Dainotti), [email protected] (A. Pescapé), [email protected] (P. Salvo Rossi), francesco.palmieri@unina2.it (F. Palmieri), [email protected] (G. Ventre).
1389-1286/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.comnet.2008.05.004
As for the analytical approach, we adopted a specifically suited Hidden Markov Model (HMM). The idea is to keep the model analytically simple and tractable, but capable of capturing important joint dynamics (in terms of both marginal distributions and time dependencies) of IPT and PS. We evaluate the model capabilities (learning, generation, and prediction) in order to construct realistic packet-level models from the automated analysis of empirical traffic traces, by considering the marginal distributions and the auto- and cross-covariances of IPT and PS. In addition, we compare synthetically generated sequences of IPT–PS pairs against those from real traces. Also, we show the capability of the proposed model to predict the short-term future behavior of the analyzed traffic on the basis of a very small amount of monitored traffic.

An important objective of this work is the design of a model flexible enough to work with different kinds of Internet traffic sources. For this reason, we apply our approach to traffic traces of various application-layer protocols related to very different Internet services. More precisely, we separately consider traffic generated by (i) SMTP, (ii) HTTP, (iii) a network game, and (iv) an instant messaging application. The experimental investigation shows that, with very limited complexity, the proposed model achieves acceptable results. Moreover, the prediction capabilities of the model, tested here in an off-line fashion, suggest that this modeling approach can be usefully applied to resource reservation and admission control.

Finally, according to the source-based approach, we do not focus on aggregate link traffic; instead, we separately analyze several sessions of traffic exiting from single hosts and related to specific application-level protocols. The purpose is to model the average single session for each considered application.
The effect of superposition of multiple synthetic traffic sources, e.g. the presence of self-similarity and long-range dependence in the synthetic aggregated traffic, falls beyond the scope of this paper.

The rest of the paper is organized as follows. In Section 2, related works and the motivations at the basis of this work are given. Section 3 provides an introduction to HMMs and a description of their application to build the proposed analytical model, with details about model statistics and the learning stage. Section 4 describes the measurement approach, giving insights and motivations on the specific traffic taken into consideration. In Section 5, we show results of the model applied to SMTP, HTTP, Age of Mythology (AoM), and MSN Messenger. Section 6 ends the paper by discussing the presented results and giving concluding remarks.

2. Motivation and related work

Source traffic models are necessary to reproduce realistic user/application behavior in simulation environments or in network testbeds by injecting synthetic network traffic (e.g. traffic emulation). This allows one to study performance problems of network architectures by reconstructing the flows of packets generated by single sources. In the past, several source models related to HTTP traffic have been proposed [3,4], HTTP being the dominant Internet application, whereas only simple statistical characterizations of source traffic related to other applications like SMTP, network games, etc., have been presented [5,6]. Recent years, though, have seen a growing heterogeneity of Internet applications, which makes models for different kinds of applications necessary. Here, we explore the feasibility of a single modeling approach, flexible enough to work with different categories of sources and to be easily integrated into a traffic generation (or simulation) framework [7].
Moreover, although this flexibility was the main focus of the present work, we note that for some of the considered traffic categories, such as network games, this work represents one of the first attempts to build a thorough statistical model able to take into account multiple properties of the traffic. Indeed, while there exists a rich literature on traffic characterization, and sometimes modeling, of network games [6,8–11], it usually focuses on fitting the marginal distributions of IPT and PS, sometimes by arbitrarily splitting the fitting into different analytical distributions for different portions of the sample set; time dependence (a relevant property to consider when studying traffic modeling and simulation) and mutual dependence are usually not taken into account.

The proposed model relies on HMMs to reproduce traffic sources at packet level. We focused on a packet-level view of traffic because it provides the following benefits when compared with higher-level approaches: (i) we look at traffic at the deepest level of detail while basing the observations on just two variables; (ii) switching devices often operate on a packet-by-packet basis, therefore it is important to have realistic packet-level models to evaluate their performance; (iii) most network performance problems (e.g. loss, delay, jitter) happen at packet level; (iv) working at packet level makes our approach independent of protocol evolution and applicable to different applications/protocols; (v) this kind of model is usable in traffic generators and simulators; (vi) traffic at packet level remains observable after encryption performed by, for example, end-to-end cryptographic protocols such as SSL or IPSec; (vii) packet-level traffic models enable robust approaches to traffic profiling for anomaly detection.
As far as the analytical modeling approach is concerned, we had to face the trade-off among accuracy (the capability to capture as many statistical properties of traffic dynamics as possible), flexibility, and simplicity. The use of HMMs allowed us to build an easily tractable model, capable of jointly taking into account IPT and PS first-order statistics as well as temporal dynamics and correlation. In spite of the large number of references related to network traffic modeling, very few works aim at joint modeling of IPT and PS [12–14]. However, it has been demonstrated that neglecting aspects related to PS (e.g. assuming a constant value) significantly affects performance analysis [12]. Correlation structure is also a fundamental aspect that must be considered [15] when realistic replication of traffic is needed.

Recently the interest in HMM-based models has grown, and HMMs have been proposed as a tool for several network traffic related research problems. In [16,17] HMMs have been used to model the states of packet channels via corresponding loss probabilities and end-to-end delay distributions. Similar works have been proposed to model wired [18] and wireless [19] packet channels. To the best of our knowledge, few works using HMMs to model traffic sources at packet level are present in the literature. Specifically, we found approaches to Internet traffic modeling able to capture temporal structures based on MMPP (Markov Modulated Poisson Process) [20] and BMAP (Batch Markovian Arrival Process) [12,13]. In [20], the authors propose a layered model to replicate traffic at edge routers that takes into account hierarchical characteristics of Internet traffic as well as long-range dependence properties, while in [13] an efficient implementation of analytically tractable models for aggregate IP traffic is presented, focusing on burstiness and self-similarity properties.
The BMAP model proposed in [12], which considers both packet IPT and PS, is designed to capture the long-range dependence present in traces of aggregate link traffic, and it is evaluated in terms of queue analysis. In our work, instead, we concentrate on the traffic generated by single sources, which is then mixed in network links; therefore we do not consider queue analysis here. In [21], a Markov-based model has been proposed and applied to variable bitrate MPEG traffic; GOP-layer traffic characteristics for MPEG video traffic and sources are constructed from MPEG1-encoded video sequences. The same model has been applied to other traffic types (e.g. VoIP traffic). In [22], HMMs have been used to disjointly model IPT and PS of both aggregated and WWW traffic, comparing results against those from a stochastic generator based on a chaotic attractor. Finally, in [23], again considering IPT and PS disjointly, HMMs have been used to build traffic classifiers based on packet-level statistics related to some Internet applications.

It is worth noticing that the HMM-based modeling approach presented here is part of a more general framework that also includes packet-channel modeling [18,19]. The long-term objective is a powerful homogeneous analytical framework for effective modeling of packet-level environments in heterogeneous scenarios (both in terms of traffic sources and end-to-end network paths).
To highlight and summarize the significance of the approach proposed in this work, we underline that, to the best of our knowledge, it extends the results present in the literature in that
• it allows IPT/PS joint description;
• it allows synthetic series generation of both IPT and PS;
• it allows source state estimation with traffic prediction;
• it is derived from real traffic traces;
• it has been tested on different traffic types (quite different from each other in terms of both used protocols and user/application behavior), deriving analogies and differences on the equivalent traffic models;
• results obtained with the analyzed traffic categories show the flexibility of the proposed model, making it generalizable;
• as regards games traffic, this represents one of the first works to present a more complete model taking into account the aforementioned statistical properties.

Finally, we would like to underline that at [24] we make publicly available the open-source tool (called Plab) used for traffic analysis and measurement at packet level, the algorithms developed for the analytical model, and the large set of heterogeneous data/traffic traces used in this work.

3. The model

3.1. Hidden Markov Models

We propose a statistical model¹ for packet-level network traffic. More specifically, we model the single source of traffic as an HMM. Generally speaking, an HMM may be viewed as a probabilistic function of a (hidden) Markov chain [25], thus it is composed of 2 variables:
• the hidden-state variable, whose temporal evolution follows a Markov-chain behavior;
• the observable variable, that stochastically depends on the hidden state.

Its topology is shown in Fig. 1, where $x_n \in \{s_1, \ldots, s_N\}$ and $y_n \in \{o_1, \ldots, o_M\}$ represent the state and the observable at discrete time $n$, respectively, with $N$ and $M$ being the number of states and the number of observables, respectively.
An HMM is characterized by
• $\boldsymbol{\varphi}$ – the initial state distribution, where $\varphi_i = \Pr(x_1 = s_i)$;
• $\mathbf{A}$ – the $N \times N$ state transition matrix, where $A_{i,j} = \Pr(x_n = s_j \mid x_{n-1} = s_i)$;
• $\mathbf{B}$ – the $N \times M$ observable generation matrix,² where $B_{i,j} = \Pr(y_n = o_j \mid x_n = s_i)$.

We denote by $\lambda = \{\boldsymbol{\varphi}, \mathbf{A}, \mathbf{B}\}$ the complete set of parameters. The three fundamental problems of an HMM are
• evaluation – given a model $\lambda$ and a sequence of observations $\mathbf{y} = (y_1, \ldots, y_L)$, compute efficiently the probability of the sequence given the model, $\Pr(\mathbf{y} \mid \lambda)$. It is solved via the forward–backward algorithm.
• reconstruction – given a model $\lambda$ and a sequence of observations $\mathbf{y} = (y_1, \ldots, y_L)$, find the most likely corresponding sequence of states $\mathbf{x} = (x_1, \ldots, x_L)$. It is solved via the Viterbi algorithm, a dynamic programming technique performing computation of the best score and tracking variables.
• learning – given a sequence of observations $\mathbf{y} = (y_1, \ldots, y_L)$, find the set of parameters $\lambda$ such that the likelihood of the model $\mathcal{L}(\mathbf{y}; \lambda) = \Pr(\mathbf{y} \mid \lambda)$ is maximum. It is solved via the Baum–Welch algorithm, a special case of the Expectation–Maximization algorithm [26], that iteratively updates the parameters in order to find a local maximum of the likelihood.

It is worth noticing that the recursive computation of the forward and backward variables has complexity $O(LN^2)$, as opposed to the complexity $O(LN^L)$ of direct calculation, with $L$ being the length of the sequence of observations. For a more comprehensive discussion on HMMs refer to [25,27,28].

Fig. 1. Hidden Markov Model topology.

¹ Notation – Upper (resp. lower) case bold letters denote matrices (resp. column vectors), $A_{i,j}$ (resp. $a_i$) denotes the $(i,j)$th (resp. $i$th) element of matrix $\mathbf{A}$ (resp. column vector $\mathbf{a}$), $\mathbf{1}$ denotes a column vector whose elements are 1, $\delta_{i,j}$ denotes the Kronecker delta, $[\cdot]^T$ and $E\{\cdot\}$ denote the transpose and expectation operators, respectively, and the symbol $\sim$ means "distributed as".
² If the observable variable is continuous, the observable matrix is replaced with a set of $N$ conditional pdfs, say $\{B_1(y), B_2(y), \ldots, B_N(y)\}$.

3.2. A packet-level source model

Referring to a single source of traffic, we consider an HMM in which the state variable is discrete, $x_n \in \{s_1, \ldots, s_N\}$, and the observable variable is a continuous bi-dimensional vector, $\mathbf{y}_n = [d_n, b_n]^T$. The first and second components of $\mathbf{y}_n$ represent the IPT and the PS for the $n$th packet, respectively in dBμ (which we define as $10 \log_{10}(\mathrm{IPT} / 1\,\mu\mathrm{s})$) and in bytes.³ The state variable has been introduced to account for memory and correlation phenomena between IPT and PS. We assumed that IPT and PS are statistically independent given the state. Also, in order to reduce the number of parameters, we assume $\boldsymbol{\varphi} = \boldsymbol{\rho}$, where $\boldsymbol{\rho}$ is the steady-state distribution,⁴ given by $\mathbf{A}^T \boldsymbol{\rho} = \boldsymbol{\rho}$. $\Lambda = \{\mathbf{A}, \mathbf{g}^{(t)}, \mathbf{w}^{(t)}, \mathbf{g}^{(p)}, \mathbf{w}^{(p)}\}$ is the set of parameters characterizing the model, denoting the state transition matrix and the conditional IPT and PS distribution vectors, respectively, i.e.

$$A_{i,j} = \Pr(x_{n+1} = s_j \mid x_n = s_i), \qquad d_n \mid x_n = s_i \sim \Gamma\big(g_i^{(t)}, w_i^{(t)}\big), \qquad b_n \mid x_n = s_i \sim \Gamma\big(g_i^{(p)}, w_i^{(p)}\big), \tag{1}$$

then the conditional pdfs for IPT and PS are:

$$f_i^{(t)}(d) = \frac{\big(d / w_i^{(t)}\big)^{g_i^{(t)} - 1}\, e^{-d / w_i^{(t)}}}{w_i^{(t)}\, \Gamma\big(g_i^{(t)}\big)} \quad (d > 0), \qquad f_i^{(p)}(b) = \frac{\big(b / w_i^{(p)}\big)^{g_i^{(p)} - 1}\, e^{-b / w_i^{(p)}}}{w_i^{(p)}\, \Gamma\big(g_i^{(p)}\big)} \quad (b > 0).$$

Gamma distributions were chosen for IPT and PS because a mixture of normal distributions can easily approximate a general distribution, and the Gamma distribution is in practice very similar to a normal distribution while having the desirable property of being null for negative values (negative IPT and PS being meaningless). Summarizing, we have a model where $x_n$ is a discrete random variable whose dynamic behavior is governed by the transition matrix $\mathbf{A}$, with a Markovian assumption for the evolution, and $\mathbf{y}_n$ is a bi-dimensional continuous random variable describing IPT and PS as mixtures of conditionally independent (given the state) Gamma distributions, i.e.

$$f_i(\mathbf{y}_n) = f_i^{(t)}(d_n)\, f_i^{(p)}(b_n).$$

³ We measure IPT with a resolution of 1 μs (as explained in Section 4) and apply a logarithmic transformation because IPT values range over several orders of magnitude.
⁴ If $x_n$ is an irreducible and aperiodic process, the steady-state distribution equals the limit distribution, $\rho_i = \lim_{n \to \infty} \Pr(x_n = s_i)$, see [29].

3.2.1. Model statistics

The IPT and PS conditional means and standard deviations are

$$\mu_i^{(t)} = g_i^{(t)} w_i^{(t)}, \quad \sigma_i^{(t)} = \sqrt{g_i^{(t)}}\; w_i^{(t)}, \qquad \mu_i^{(p)} = g_i^{(p)} w_i^{(p)}, \quad \sigma_i^{(p)} = \sqrt{g_i^{(p)}}\; w_i^{(p)}, \tag{2}$$

respectively, due to the conditional Gamma distribution assumption. The IPT and PS global means and standard deviations of the model are then

$$\mu^{(t)} = \sum_{i=1}^{N} \rho_i\, \mu_i^{(t)}, \qquad \sigma^{(t)} = \sqrt{\sum_{i=1}^{N} \rho_i\, \mu_i^{(t)} \big(1 + g_i^{(t)}\big) w_i^{(t)} - \big(\mu^{(t)}\big)^2},$$

$$\mu^{(p)} = \sum_{i=1}^{N} \rho_i\, \mu_i^{(p)}, \qquad \sigma^{(p)} = \sqrt{\sum_{i=1}^{N} \rho_i\, \mu_i^{(p)} \big(1 + g_i^{(p)}\big) w_i^{(p)} - \big(\mu^{(p)}\big)^2}. \tag{3}$$

Also, the global IPT and PS pdfs are

$$f_{IPT}(d) = \sum_{i=1}^{N} \rho_i\, f_i^{(t)}(d), \qquad f_{PS}(b) = \sum_{i=1}^{N} \rho_i\, f_i^{(p)}(b).$$

The conditional (given the state) expected duration in the state $s_i$ is

$$\phi_i = \frac{1}{1 - A_{i,i}}. \tag{4}$$

The IPT and PS auto- and cross-correlations of the model are

$$R^{(t)}(m) = E\{d_n d_{n+m}\} = \begin{cases} \boldsymbol{\rho}^T \mathbf{E}^{II(t)} \mathbf{1}, & m = 0, \\ \boldsymbol{\rho}^T \mathbf{E}^{(t)} \mathbf{A}^{|m|-1} \mathbf{E}^{(t)} \mathbf{1}, & m \neq 0, \end{cases}$$

$$R^{(p)}(m) = E\{b_n b_{n+m}\} = \begin{cases} \boldsymbol{\rho}^T \mathbf{E}^{II(p)} \mathbf{1}, & m = 0, \\ \boldsymbol{\rho}^T \mathbf{E}^{(p)} \mathbf{A}^{|m|-1} \mathbf{E}^{(p)} \mathbf{1}, & m \neq 0, \end{cases}$$

$$R^{(tp)}(m) = E\{d_n b_{n+m}\} = \begin{cases} \boldsymbol{\rho}^T \mathbf{E}^{II(tp)} \mathbf{1}, & m = 0, \\ \boldsymbol{\rho}^T \mathbf{E}^{(t)} \mathbf{A}^{|m|-1} \mathbf{E}^{(p)} \mathbf{1}, & m \neq 0, \end{cases}$$

where

$$E^{II(t)}_{i,j} = A_{i,i} \big(1 + g_i^{(t)}\big) g_i^{(t)} \big(w_i^{(t)}\big)^2 \delta_{i,j}, \qquad E^{(t)}_{i,j} = A_{i,i}\, g_i^{(t)} w_i^{(t)}\, \delta_{i,j},$$

$$E^{II(p)}_{i,j} = A_{i,i} \big(1 + g_i^{(p)}\big) g_i^{(p)} \big(w_i^{(p)}\big)^2 \delta_{i,j}, \qquad E^{(p)}_{i,j} = A_{i,i}\, g_i^{(p)} w_i^{(p)}\, \delta_{i,j},$$

$$E^{II(tp)}_{i,j} = A_{i,i}\, g_i^{(t)} w_i^{(t)}\, g_i^{(p)} w_i^{(p)}\, \delta_{i,j}.$$

It is worth noticing that $\mathbf{E}$ and $\mathbf{E}^{II}$ are first-order and second-order statistics matrices. To show traffic dynamics without the biasing effects of the IPT and PS global means, in Section 5 covariances are taken into account instead of correlations.

3.2.2. Learning the model parameters

The Expectation–Maximization algorithm [26] is an optimization procedure that allows learning of a new set of parameters for a stochastic model according to improvements of the likelihood of a given sequence of observable variables. For structures like HMMs this optimization technique reduces to the Baum–Welch algorithm [25,27,28], studied for discrete and continuous observable variables with a broad class of allowed conditional pdfs. The Baum–Welch algorithm is an iterative procedure looking for a local maximum of the likelihood function, and the maximum it reaches typically depends on the starting point $\Lambda$. When necessary, multiple training runs with different initial conditions are used to approach the global solution.

More specifically, consider a set of observable sequences $\mathbf{Y} = \{\mathbf{Y}^{(1)}, \ldots, \mathbf{Y}^{(K)}\}$ referred to as the training set, where each sequence $\mathbf{Y}^{(k)} = [\mathbf{y}_1^{(k)}, \ldots, \mathbf{y}_{L_k}^{(k)}]$ represents IPT and PS from a single session.⁵ We want to find the set of parameters such that the likelihood $\mathcal{L}(\mathbf{Y}; \Lambda) = \Pr(\mathbf{Y} \mid \Lambda)$ of the training set is maximum.
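The source model of Section 3.2 and the statistics of Section 3.2.1 can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it assumes NumPy, reads $g_i$ and $w_i$ as the shape and scale of each conditional Gamma distribution (consistent with the conditional mean $g_i w_i$), and uses hypothetical function names and made-up two-state parameter values that are not fitted to any trace.

```python
import numpy as np

def steady_state(A):
    """Return rho with A^T rho = rho and sum(rho) = 1.

    Assumes the chain is irreducible and aperiodic, so rho is unique.
    """
    n = A.shape[0]
    # Stack the balance equations (A^T - I) rho = 0 with the normalization 1^T rho = 1.
    lhs = np.vstack([A.T - np.eye(n), np.ones((1, n))])
    rhs = np.append(np.zeros(n), 1.0)
    rho, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    rho = np.clip(rho, 0.0, None)  # guard against tiny negative round-off
    return rho / rho.sum()

def global_mean_std(rho, g, w):
    """Global mean and standard deviation of the Gamma mixture (Section 3.2.1)."""
    mu_i = g * w                                  # conditional means g_i * w_i
    mean = np.sum(rho * mu_i)
    second_moment = np.sum(rho * mu_i * (1.0 + g) * w)
    return mean, np.sqrt(second_moment - mean ** 2)

def generate(A, g_t, w_t, g_p, w_p, length, rng):
    """Draw `length` (IPT in dBmu, PS in bytes) pairs from the model."""
    x = rng.choice(len(A), p=steady_state(A))     # initial state drawn from rho
    out = np.empty((length, 2))
    for n in range(length):
        # IPT and PS are independent given the current state.
        out[n, 0] = rng.gamma(shape=g_t[x], scale=w_t[x])
        out[n, 1] = rng.gamma(shape=g_p[x], scale=w_p[x])
        x = rng.choice(len(A), p=A[x])            # Markov transition to the next state
    return out

# Illustrative two-state parameters (made up, not fitted to any trace).
A = np.array([[0.9, 0.1], [0.2, 0.8]])
g_t, w_t = np.array([16.0, 100.0]), np.array([1.0, 0.5])
g_p, w_p = np.array([25.0, 100.0]), np.array([4.0, 14.0])

rho = steady_state(A)
ipt_mean, ipt_std = global_mean_std(rho, g_t, w_t)
series = generate(A, g_t, w_t, g_p, w_p, 1000, np.random.default_rng(0))
```

Comparing the sample mean and standard deviation of each column of `series` with the values returned by `global_mean_std` is a simple sanity check of the closed-form statistics.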
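The Baum–Welch algorithm for the proposed source-traffic model is then based on the following equations:

$$\hat{A}_{i,j} = \frac{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k - 1} \alpha_n^{(k)}(i)\, A_{i,j}\, f_j\big(\mathbf{y}_{n+1}^{(k)}\big)\, \beta_{n+1}^{(k)}(j)}{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k - 1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)},$$

$$\hat{g}_i^{(t)} \hat{w}_i^{(t)} = \frac{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)\, d_n^{(k)}}{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k - 1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)},$$

$$\hat{g}_i^{(p)} \hat{w}_i^{(p)} = \frac{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)\, b_n^{(k)}}{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k - 1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)},$$

$$\hat{g}_i^{(t)} \big(\hat{w}_i^{(t)}\big)^2 = \frac{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)\, \big(d_n^{(k)} - \mu_i^{(t)}\big)^2}{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k - 1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)},$$

$$\hat{g}_i^{(p)} \big(\hat{w}_i^{(p)}\big)^2 = \frac{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)\, \big(b_n^{(k)} - \mu_i^{(p)}\big)^2}{\sum_{k=1}^{K} \frac{1}{L^{(k)}} \sum_{n=1}^{L_k - 1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)}, \tag{5}$$

where, referring to the $k$th sequence, the likelihood is

$$L^{(k)} = \Pr\big(\mathbf{Y}^{(k)} \mid \Lambda\big) = \sum_{i=1}^{N} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i),$$

and the forward and backward variables are computed according to the following recursions:

$$\alpha_n^{(k)}(j) = \begin{cases} \sum_{i=0}^{N-1} \alpha_{n-1}^{(k)}(i)\, A_{i,j}\, f_j\big(\mathbf{y}_n^{(k)}\big), & n = 1, \ldots, L_k, \\ \delta_{1,j}, & n = 0, \end{cases}$$

$$\beta_n^{(k)}(i) = \begin{cases} \sum_{j=0}^{N-1} A_{i,j}\, f_j\big(\mathbf{y}_{n+1}^{(k)}\big)\, \beta_{n+1}^{(k)}(j), & n = 0, \ldots, L_k - 1, \\ 1, & n = L_k. \end{cases}$$

In our experiments the parameter set $\Lambda$ has been initialized so that the conditional pdfs, for both IPT and PS, are uniformly spread over the whole observed range. More specifically, the state-transition matrix is given by

$$A_{i,j} = 1/N,$$

while, denoting

$$d_{\min} = \min_{k,n} d_n^{(k)}, \qquad d_{\max} = \max_{k,n} d_n^{(k)}, \qquad b_{\min} = \min_{k,n} b_n^{(k)}, \qquad b_{\max} = \max_{k,n} b_n^{(k)},$$

$\{\mathbf{g}^{(t)}, \mathbf{w}^{(t)}, \mathbf{g}^{(p)}, \mathbf{w}^{(p)}\}$ are chosen such that

$$\mu_{i+1}^{(t)} - \mu_i^{(t)} = \frac{d_{\max} - d_{\min}}{N + 1}, \qquad \sigma_i^{(t)} = \frac{d_{\max} - d_{\min}}{5(N + 1)},$$

$$\mu_{i+1}^{(p)} - \mu_i^{(p)} = \frac{b_{\max} - b_{\min}}{N + 1}, \qquad \sigma_i^{(p)} = \frac{b_{\max} - b_{\min}}{5(N + 1)}. \tag{6}$$

⁵ The meaning of "session" will be better defined in Section 4.

4. Traffic traces and measurement approach

4.1. Considered traffic

To verify its flexibility and general applicability, the proposed modeling approach has been tested with different categories of Internet traffic sources.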
The choice of such applications takes into account their level of novelty and popularity. Also, we considered applications that differ from several points of view, all of which are reflected in traffic peculiarities: human–computer interaction, transferred objects, underlying network protocols, etc. The list of considered Internet applications, along with details of the corresponding traffic traces we used,⁶ is reported in Table 1.

Firstly, we considered more traditional services such as the Web and email. Although HTTP and SMTP are two applications largely involving the whole Internet population (the most used by common users), they substantially differ in the kinds of treated objects as well as in the level of user interaction. The characteristics of traffic generated by HTTP clients can be heavily affected by the human factor, above all as regards timings [2,3], whereas SMTP client traffic is affected by users mostly in terms of the number and size of packets to be transferred.

Secondly, we considered applications which have become popular in recent years and currently represent an increasing portion of the overall Internet traffic: instant messaging and multi-player network games. They both present novel and interesting characteristics with respect to other applications. Due to these differences, for both games and instant messaging, the interest in the characterization and modeling of their traffic has increased in recent years [6,11,31,32]. Network games have strict latency requirements and traffic properties which substantially differ from those of more traditional Internet applications [6]. Moreover, while their traffic already represents a relevant percentage (in [33] it was reported that about 4% of all packets in a backbone could be associated with only six popular network games), it is constantly increasing. Thus, analysis of such traffic is crucial to properly design and provision networks for future needs. We studied traffic generated by Age of Mythology (AoM), a Microsoft Real Time Strategy Multiplayer Game [34]. As regards Instant Messengers, they are used by 50% of the Internet users all around the world [35], with MSN Messenger being the most popular application, followed by AOL and Yahoo Messenger. In this work we model the traffic generated by MSN Messenger (MSN in the following) clients [36]. The level of user interaction in this kind of application is obviously much higher. Moreover, because they represent a new vehicle for viruses, worms, and other kinds of malicious use, the study of instant messaging applications, besides email, also has interesting security implications. For example, the traffic behavior characterization and modeling of such applications could be exploited for security purposes (classification, detection, prevention).

⁶ Apart from the AoM traffic traces, available at [30], they are freely available at [24].

Table 1
Traffic traces details

Traffic  Link  Protocol  Port  Date    Size   Pkts   Sessions
SMTP     WAN   TCP       25    9/2005  3 GB   43 M   56 K
HTTP     WAN   TCP       80    7/2004  60 GB  830 M  1 M
AoM      LAN   UDP       2300  8/2003  12 MB  180 K  6
MSN      WAN   TCP       1863  4/2006  1 GB   9 M    1 M

In Table 1 details about the traffic traces that we analyzed are given. As regards SMTP, HTTP, and MSN, we captured traffic by passively monitoring the WAN access link of the University of Napoli "Federico II" network during the period January 2004–April 2006. The observed link represents the only connection of the University network to the Internet, and it has a maximum throughput equal to 200 Mbps. With the term "session", in the case of SMTP (resp. HTTP), we mean all the traffic exchanged between two hosts related to port TCP 25 (resp. TCP 80), with a timeout of 15 min. As regards SMTP, we present results from the sessions with less than 100 packets, which we defined as short-lived, and which account for about 97% of the SMTP sessions.
This is because we found that there are other sessions which exhibit extremely different statistical properties. This was confirmed by a K-means clustering we performed using a few features per session, e.g. number of packets, bytes, IPT and PS mean and variance. Note that considering only this class does not affect our approach, as we do not want to provide a comprehensive model for SMTP traffic. At this stage we want to show the applicability of the proposed approach also to this kind of traffic.

As regards MSN, the MSN protocol uses a client–server communication model in which user clients interact with Microsoft servers that belong to the MSN Messenger network and which accept connections on TCP port 1863 [37]. There are mainly two kinds of servers, which offer services of presence and instant messaging, respectively [38]. Analysis of both the communication protocol and real traffic traces allowed us to identify the subnets associated with each service. We collected traffic related to both services and both directions (inbound and outbound). In this work we report results related only to the outbound direction (i.e. from the clients to the servers) of instant messaging traffic. Each session is made of all the traffic exchanged between a single client–server pair (related to server port TCP 1863), with an inactivity timeout of 15 min.

The AoM traces, instead, have been provided, in Tcpdump format, by the Worcester Polytechnic Institute (WPI), MA (USA) [30]. They consist of packet sequences of complete gaming sessions, between two players, captured in a LAN environment. We consider an AoM session to be all the traffic exchanged from the beginning to the end of a match. Only six gaming sessions were studied, because packet-level traffic of RTS games has been shown to be very predictable and strongly dependent on the specific game application, whereas it is poorly dependent on user behavior [39].
Indeed, past works studying the statistical characterization of the traffic generated by this game have used only such traces. Such works show that this traffic is substantially different from the traffic of more classical network applications. Moreover, in [40] we reported and discussed results on the invariance of gaming traffic when observed under different conditions, which makes the use of a small number of traces reasonable. As regards SMTP, HTTP, and MSN traces, instead, we observed a much larger set of sessions. This is because of the more complex nature of such traffic [2] and also because we could gather our own traces.

4.2. Tools and issues with the data

Obtaining and making available traffic data useful for characterization and modeling is a complex task, which consists not only of traffic collection and selection of the appropriate traffic flows, but also involves activities such as data sanitization and anonymization (see Fig. 2).

Fig. 2. Life cycle of data analysis (stages: capture; trace inspection; trace sanitization & anonymization; measurements and preliminary analysis; data analysis).

We used Plab [2,24] to capture the traffic traces we collected and analyzed. Plab is an open-source software, partially based on the Libpcap library [41], that we developed for the analysis of live traffic and of file traces in tcpdump format, and that is focused on packet-level measurement and analysis. This platform, employed also in previous works on traffic analysis and modeling, is able to efficiently analyze very large traffic traces and to separate traffic into different sessions. Depending on user-specified parameters, a session can be identified by: (i) all packets sent and received by a host (host mode); (ii) all packets identified by source and destination IP and ports, with a default timeout of 60 s (flow mode); (iii) all packets exchanged by 2 hosts related to a specific service (e.g. TCP port 80), with a user-definable timeout (conversation mode). Given one of the above modes, sessions are assigned an ID, and for each session the IPTs between packets flowing in the same direction are calculated, along with the PS. We call such data packet-level data series. With Plab it is possible to specify command-line filters in tcpdump/Berkeley Packet Filter syntax to select the type of traffic to be captured or analyzed, e.g. layer 3 protocol, port, etc. Also, more intelligence was introduced into the software, such as the ability to decode TCP header options like the MSS, or to filter packets or entire sessions based on several other criteria. For example, we introduced some payload inspection capabilities which served for data sanitization.

As regards sanitization, here we report on how we removed, from the considered data, samples related to traffic which was not HTTP, but tried to masquerade as HTTP by running on TCP port 80. We instructed Plab to analyze the first 3 bytes of payload data exchanged between each host pair of a conversation related to TCP port 80. Under normal conditions, such bytes should correspond to the method invoked by the client in an HTTP request. As reported in Table 2, we observed that almost 94% of the sessions started with a GET request, 4% with a POST request, etc. Only a small fraction of the sessions presented packets starting with a byte not corresponding to an alphabetic character. Inside this category, 99% of the conversations started with the byte 0xe3, the first byte exchanged by peers opening a communication session based on the eDonkey2000 protocol [42], used by the eMule and eDonkey file-sharing applications. Also, 0.44% of the sessions were initiated by the host communicating from TCP port 80 (labeled as "downstream" in Table 2). Because our interest was in modeling traffic generated only by applications running over HTTP, we instructed Plab to recognize such sessions and to filter them out.
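The payload check described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Plab's actual code: the function names `label_start` and `start_shares` are hypothetical, and the labels simply mirror the categories of Table 2 (the first three payload bytes when they begin with a letter, `0xe3`, `Downstream`, or `other`).

```python
from collections import Counter

# Hypothetical helpers (not part of Plab): label a TCP port 80 conversation
# by the first payload bytes of the packet that opens it.
NON_HTTP_LABELS = {"Downstream", "0xe3", "other", "empty"}

def label_start(payload: bytes, from_port_80: bool = False) -> str:
    """Return a Table 2-style label for the first packet of a conversation."""
    if from_port_80:
        # Conversation opened by the host talking from port 80.
        return "Downstream"
    if not payload:
        return "empty"
    if payload[:1].isalpha():
        # ASCII letter first: report the first 3 bytes (e.g. GET, POS, HEA).
        return payload[:3].decode("ascii", errors="replace")
    if payload[0] == 0xE3:
        # First byte reported in the text for eDonkey2000-based sessions.
        return "0xe3"
    return "other"

def keep_session(label: str) -> bool:
    """True if the session is retained by the sanitization step."""
    return label not in NON_HTTP_LABELS

def start_shares(labels):
    """Percentage of conversations per label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

labels = [
    label_start(b"GET /index.html HTTP/1.1\r\n"),
    label_start(b"POST /form HTTP/1.1\r\n"),
    label_start(b"\xe3\x26\x00\x00\x00"),
    label_start(b"HTTP/1.1 200 OK\r\n", from_port_80=True),
]
kept = [name for name in labels if keep_session(name)]
```

In this toy input, two of the four conversations (`GET` and `POS`) would be kept; the other two fall into the categories that the text says were filtered out.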
By filtering our traces, we observed that 5.12% of the processed packets were discarded. Therefore, this non-HTTP traffic represents a non-negligible portion of the captured traffic. As regards the number of filtered sessions, they account for about 0.7% of the total. This suggests that the filtered sessions tend to generate more packets than authentic HTTP ones. By comparing the results obtained with and without filtering such sessions, we observed that the discarded traffic had a considerable impact in terms of payload size and inter-packet time. Comparisons of the obtained distributions for upstream traffic at the UNINA site are shown in Fig. 3.

Table 2
Payload inspection on the first packet opening a conversation related to TCP port 80

Conversation start   Percentage
GET                  93.94
POS                  4.23
HEA                  0.7
Downstream           0.44
0xe3                 0.27
PRO                  0.2

Fig. 3. Filtered UNINA upstream: IPT PDF (top; x-axis log10(x), x = 1E–5 s) and PS CDF (bottom; x-axis bytes), sanitized vs. unsanitized.

Observing the properties of such distributions, it is clear that the filtered sessions increase the portion of back-to-back packets with full payload, probably due to the presence of file transfers. As reported in [2], after filtering out such traffic from the traces captured at the UNINA site, we found packet-level profiles strongly similar to those obtained by observing traffic at another site in which no Peer-to-Peer applications were running. This not only confirms that the sanitization we performed was correct, but also reveals important invariants (with respect to space and time) of the characteristics of the studied traffic. The above example shows how acquiring realistic and reliable data to be used as a reference for traffic modeling is a delicate and sometimes not straightforward task, which requires attention and appropriate tools.
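The remark that filtered sessions generate more packets than authentic HTTP ones can be made quantitative with the two percentages quoted above (5.12% of packets, about 0.7% of sessions). The short sketch below is only a back-of-the-envelope check; the function name is hypothetical and the result inherits the rounding of the two input figures.

```python
def packets_per_session_ratio(filtered_pkt_share: float,
                              filtered_sess_share: float) -> float:
    """Mean packets per session of filtered sessions divided by that of
    retained sessions, given the filtered shares of packets and sessions."""
    filtered = filtered_pkt_share / filtered_sess_share
    retained = (1.0 - filtered_pkt_share) / (1.0 - filtered_sess_share)
    return filtered / retained

# Shares reported in the text: 5.12% of packets, about 0.7% of sessions.
ratio = packets_per_session_ratio(0.0512, 0.007)
print(f"filtered sessions carry about {ratio:.1f}x the packets of retained ones")
```

With these inputs the ratio comes out between 7 and 8, which is consistent with the presence of file transfers suggested by Fig. 3.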
Finally, as regards data anonymization, to preserve users' privacy we kept only the IP and TCP headers of each packet, and we scrambled IP addresses using the wide-tcpdpriv tool from the MAWI-WIDE project [43].

4.3. Measurement methodology and analyzed data

For each session between two hosts, depending on their direction, two separate flows of data can be identified, which we call upstream and downstream. In the case of SMTP, HTTP, and MSN, we identify as upstream the traffic flowing from a client to a server, that is, packets with destination port set to TCP 25, TCP 80, and TCP 1863, respectively. Downstream traffic, in turn, is the traffic flowing in the opposite direction. In the case of SMTP, as regards downstream traffic, it is worth mentioning that the vast majority of downstream flows (for each session) are made of only a few packets (about 5) of small size. Thus they represent a very small portion of SMTP traffic. This can be explained by the SMTP protocol specifications: the peer acting as a server usually answers requests and data transfers from the client with small messages that must have a numeric ID prepended. As for HTTP, instead, large volumes of traffic are generated in both directions, due to the intrinsic nature of Web traffic. In this paper we concentrate on the traffic sources represented by SMTP, HTTP, and MSN clients, and we therefore model only upstream traffic. We adopt the same approach for AoM, modeling the traffic flowing in the outbound direction when seen from the point of view of a specific peer (i.e. leaving the workstation of a gaming user). In any case, since the observed AoM traces relate to two-player matches, the traffic flowing in the other direction is almost symmetrical.

An important aspect of our methodology is that in the evaluation of IPT and PS distributions we did not take into account packets with empty payload at transport level. Since we wanted to characterize the traffic generated by the applications, as independently as possible of the transport-level protocol itself, we decided to drop all TCP-specific traffic, like connection establishment packets (SYN, SYN-ACK, ACK) and pure acknowledgment packets [44]. For the same reason, in the estimation of the packet size, we measured the byte length of the TCP payload or, in the case of AoM, of the UDP payload. These choices make our results usable for simulation purposes as an input for TCP state machines and UDP/IP stacks, like in D-ITG [24] and TCPlib [45].

As regards the time resolution of the measurements, the packet timestamping resolution provided by the Libpcap library (which is used both by Tcpdump and Plab), and by the kernel drivers that it links to, is 1 μs. Moreover, because of the wide range of the considered IPTs, as reported in Section 3, we applied a logarithmic transformation to the measured values, $10\log_{10}(\mathrm{IPT}/1\,\mu\mathrm{s})$, which we will refer to as dBμ.

Fig. 4. Log-likelihood (normalized with respect to the average length of the session) vs. iteration, for AoM, SMTP, HTTP, and MSN.

5. Experimental results

This section presents some results of our model when it is applied to SMTP, HTTP, AoM, and MSN traffic. We used the model with N = 4 states for AoM and MSN traffic and N = 5 states for SMTP and HTTP traffic, due to their more complex structure. The choices N = 4 and N = 5 have been found effective empirically, as they provided a sufficient number of modes to capture traffic behavior for the considered applications.
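Since all IPT values in the following tables and figures are expressed in dBμ, a small conversion sketch may help in reading them. It implements the logarithmic transformation defined in Section 4.3; the function names are hypothetical and chosen only for this example.

```python
import math

MICROSECOND = 1e-6  # reference value of the transformation, in seconds

def ipt_to_dbmu(ipt_seconds: float) -> float:
    """Map an inter-packet time in seconds to 10*log10(IPT / 1 us)."""
    return 10.0 * math.log10(ipt_seconds / MICROSECOND)

def dbmu_to_seconds(dbmu: float) -> float:
    """Inverse mapping, from dBmu back to seconds."""
    return MICROSECOND * 10.0 ** (dbmu / 10.0)

# Each factor of 10 in the IPT adds 10 dBmu:
# 1 us maps to 0, 10 ms to about 40, and 1 s to about 60.
for seconds in (1e-6, 1e-2, 1.0):
    print(f"{seconds:g} s -> {ipt_to_dbmu(seconds):.1f} dBmu")
```

Read this way, a difference of 30 dBμ between two modes corresponds to three orders of magnitude in the inter-packet time.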
In our experiments, models with a smaller number of states failed to capture the correct behavior, with some modes missing and/or correlation mismatches. Also, we do not explore models with a larger number of states, as increasing N has a twofold negative effect: (i) increased computational complexity, affecting learning, monitoring, and generation; (ii) more likely "overfitting" problems, affecting prediction. It is worth noticing that this section aims to show (i) the effectiveness of HMMs in traffic modeling at traffic level and to provide (ii) specific HMMs for the 4 considered traffic typologies.

For all traffic typologies the learning algorithm converged in terms of likelihood after a few iterations. Fig. 4 shows the log-likelihood normalized with respect to the average length of the sessions. The normalization is applied because $\log L(Y; K)$ decreases with the length of the sequence Y. In the following, we call "starting" model the uniformly distributed initialization of the parameters and "trained" model the parameters obtained after 10 iterations of the Baum–Welch algorithm; the training set will simply be referred to as "data".

Table 3
Model discrepancy: IPT-λ² and PS-λ²

Traffic  Model     IPT-λ²        PS-λ²
SMTP     Starting  2.1 × 10^8    3.3 × 10^46
SMTP     Trained   1.4           4.6
HTTP     Starting  1.7 × 10^6    1.1 × 10^44
HTTP     Trained   0.31          0.30
AoM      Starting  0.99 × 10^2   2.2 × 10^16
AoM      Trained   0.24          1.6
MSN      Starting  1.7 × 10^2    1.3 × 10^47
MSN      Trained   0.68          1.8

Table 4
Covariance EF: IPT-K, IPT-m, PS-K, PS-m, IPT/PS-K and IPT/PS-m

Traffic  Model     IPT-K  IPT-m  PS-K  PS-m  IPT/PS-K  IPT/PS-m
SMTP     Data      1.0    1.2    1.0   0.91  0.31      0.46
SMTP     Starting  1.0    50     1.0   50    1.0       50
SMTP     Trained   1.0    0.43   1.0   0.25  0.54      0.18
HTTP     Data      1.0    0.75   1.0   0.63  0.16      1.1
HTTP     Starting  1.0    50     1.0   50    1.0       50
HTTP     Trained   1.0    1.8    1.0   0.29  0.17      0.98
AoM      Data      1.0    42     1.0   42    0.10      0.086
AoM      Starting  1.0    50     1.0   50    1.0       50
AoM      Trained   1.0    47     1.0   47    0.17      2.4
MSN      Data      1.0    1.0    1.0   0.49  0.38      0.19
MSN      Starting  1.0    50     1.0   50    1.0       50
MSN      Trained   1.0    0.42   1.0   0.21  0.59      0.17

Table 5
Global statistics: traffic, model, global mean IPT μ(t) (dBμ), global mean PS μ(p) (bytes), IPT global st. dev. σ(t) (dBμ), PS global st. dev. σ(p) (bytes)

Traffic  Model     μ(t)  μ(p)  σ(t)  σ(p)
SMTP     Data      40    710   19    619
SMTP     Starting  48    731   20    347
SMTP     Trained   43    688   18    647
HTTP     Data      49    542   15    324
HTTP     Starting  49    731   19    347
HTTP     Trained   56    541   17    348
AoM      Data      47    12    8     4
AoM      Starting  42    69    11    30
AoM      Trained   47    12    8     4
MSN      Data      56    557   16    570
MSN      Starting  48    731   19    331
MSN      Trained   58    511   15    561

Table 6
SMTP conditional statistics: state, steady-state probability ρ_i, conditional mean IPT μ_i(t) (dBμ), conditional mean PS μ_i(p) (bytes), IPT conditional st. dev. σ_i(t) (dBμ), PS conditional st. dev. σ_i(p) (bytes), conditional duration φ_i

State  ρ_i    μ_i(t)  μ_i(p)  σ_i(t)  σ_i(p)  φ_i
s1     0.447  57      48      8       134     10
s2     0.109  18      542     14      269     1
s3     0.176  31      1344    10      140     4
s4     0.057  51      1107    7       204     1
s5     0.211  35      1458    13      7       2

Table 7
HTTP conditional statistics (same columns as Table 6)

State  ρ_i    μ_i(t)  μ_i(p)  σ_i(t)  σ_i(p)  φ_i
s1     0.153  41      314     23      344     3
s2     0.356  54      345     14      70      27
s3     0.297  64      530     13      104     19
s4     0.111  59      880     15      195     11
s5     0.083  60      1387    15      77      3

Table 8
AoM conditional statistics (same columns as Table 6)

State  ρ_i    μ_i(t)  μ_i(p)  σ_i(t)  σ_i(p)  φ_i
s1     0.881  49      12      3       2       8
s2     0.100  27      15      5       7       1
s3     0.013  48      31      3       9       1
s4     0.006  52      25      1       8       1

Table 9
MSN conditional statistics (same columns as Table 6)

State  ρ_i    μ_i(t)  μ_i(p)  σ_i(t)  σ_i(p)  φ_i
s1     0.578  66      104     8       73      7
s2     0.092  44      437     17      299     1
s3     0.060  61      655     5       193     1
s4     0.270  45      1378    14      62      4

Fig. 5. Histogram and pdf for IPT (left, in dBμ) and PS (right, in bytes); each panel compares data, starting, and trained models.

Fig. 6. Auto- and cross-covariance for IPT and PS (panels: between IPT and IPT, between PS and PS, between IPT and PS; covariance vs. lag, for data, starting, and trained models).

Table 3 compares the model discrepancy, for both the starting and the trained models, with respect to data via the parameter λ², commonly used to evaluate fitted distributions [48]. Table 4 compares the amplitude and decay parameters (K and m) for the covariances of data, starting model, and trained models when an exponential fitting (EF), with a minimum mean square error criterion, has been applied; that is, covariances are described in the form $K \exp(-m \cdot \mathrm{lag})$. More specifically, the amplitude parameter is considered fixed (K = 1) for the auto-covariances, whereas it is a free parameter for the cross-covariances. Both tables show that the model, even if working in a joint fashion, is able to fit both marginal distributions and covariances with good accuracy. Figs. 5 and 6, analyzed in the following, confirm graphically the results of Tables 3 and 4. More specifically, Table 5 compares the global means and standard deviations for the starting and trained models with the data: the trained models exhibit good results in terms of mean and standard deviation for each considered traffic. The starting values have been set via Eqs. (5) and (6) in Section 3, and are useful to show how, in a few iterations, the model converges to values close to the empirical data. Such global statistics are obtained as weighted averages of each state's conditional statistics, Eq. (3), whereas we will comment on the behavior of the single states in the following subsection, where different comparisons among data, starting and trained models, in terms of global pdfs (see Fig. 5) and auto- and cross-covariance (see Fig. 6), are made. Some more considerations on the single modes discovered by the model are made by looking at the conditional statistics (see Tables 6–9). In addition, we made the trained models generate output variables to compare the synthetically generated IPT–PS pairs with those from real data (see Fig. 7).
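The relation between conditional and global statistics can be checked numerically. The sketch below recomputes the global SMTP statistics from the per-state values of Table 6, weighting each state by its steady-state probability. The mean is a plain weighted average; for the standard deviation the sketch assumes the usual mixture (total variance) formula, which is an assumption of this example since Eq. (3) is not reproduced in this section. Function and variable names are hypothetical.

```python
import math

def global_mean_std(rho, mu, sigma):
    """Global mean and st. dev. of a mixture of per-state statistics.

    rho:   steady-state probabilities of the states
    mu:    conditional means
    sigma: conditional standard deviations
    """
    total = sum(rho)
    weights = [r / total for r in rho]  # guard against rounding in rho
    mean = sum(w * m for w, m in zip(weights, mu))
    # Assumed mixture formula: E[x^2] - E[x]^2 with E[x^2] = sum w (sigma^2 + mu^2).
    second_moment = sum(w * (s * s + m * m) for w, m, s in zip(weights, mu, sigma))
    return mean, math.sqrt(second_moment - mean * mean)

# SMTP conditional statistics as listed in Table 6 (states s1..s5).
rho     = [0.447, 0.109, 0.176, 0.057, 0.211]
mu_ipt  = [57, 18, 31, 51, 35]          # dBmu
sd_ipt  = [8, 14, 10, 7, 13]            # dBmu
mu_ps   = [48, 542, 1344, 1107, 1458]   # bytes
sd_ps   = [134, 269, 140, 204, 7]       # bytes

mean_ipt, std_ipt = global_mean_std(rho, mu_ipt, sd_ipt)
mean_ps, std_ps = global_mean_std(rho, mu_ps, sd_ps)
print(f"IPT: mean {mean_ipt:.1f} dBmu, st. dev. {std_ipt:.1f} dBmu")
print(f"PS:  mean {mean_ps:.1f} bytes, st. dev. {std_ps:.1f} bytes")
```

The recomputed means round to 43 dBμ and 688 bytes, matching the trained SMTP row of Table 5, and the PS standard deviation lands near the tabulated 647 bytes. The IPT standard deviation comes out slightly below the tabulated 18 dBμ, presumably because the per-state inputs are themselves rounded.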
Finally, the trained models are investigated in terms of prediction capabilities (see Figs. 8–11).

5.1. Model construction and validation

5.1.1. SMTP traffic

Fig. 5 shows how SMTP traffic presents two main modes in the IPT distribution, separated by three orders of magnitude, and two main modes in the PS distribution, essentially made of very small packets and full-payload packets, respectively. More specifically, it is apparent from Table 6 that s1 is responsible for transmission with large IPT and small PS, while s3 and s5 are responsible for transmission with small IPT and large PS. These three dominant modes account for approximately 85% (from the steady-state distributions) of the behavior of SMTP, which mainly alternates two phases: s1 with low bitrate, and s3 and s5 with high bitrate. States s2 and s4 can be viewed as transient states used to switch between these two modes.

Fig. 6 shows how the model captures temporal dynamics, showing that both IPT and PS have a significant memory, due to the non-negligible auto-covariances. More interesting is the negative cross-covariance, confirming that IPT and PS are usually not both large or both small at the same time. In Fig. 7 we show the capability of the model to jointly reproduce time series of both IPT and PS by synthetic generation of traffic patterns; thus it may be integrated into a traffic generation or simulation framework.

Fig. 7. Training (left) and synthetic (right) traces for IPT and PS (IPT in 10 log10(IPT/1 μs), PS in bytes, vs. sequence number).

Fig. 8. Monitoring and prediction for SMTP: IPT (top) and PS (bottom); data, monitoring, and prediction samples vs. sequence number.

5.1.2. HTTP traffic

In Fig. 5 the fitting of marginal distributions is shown. Note that IPTs are spread over eight orders of magnitude, but most of them are concentrated approximately between 10 ms and 1 s. These values are compatible with RTTs found in Wide Area Networks. Indeed, HTTP clients often perform many subsequent requests to the same server. In the case of the Web, for example, the first request for an HTML document is typically followed by more requests for the embedded objects. If such objects are small enough to be sent within one or a few packets (as is often the case [46]), the requests are sent with intervals close to the RTT from the client to the server.
As for the PS distribution, the HMM captures the characteristics of real data but seems to slightly underestimate small values in favor of larger ones. Unlike the previous traffic typology, HTTP does not present any dominant state, alternating instead among various behaviors. It also appears less regular because states with all combinations of small/large IPT and PS are present, as confirmed by the small cross-covariance. The correlation structures of HTTP, shown in Fig. 6, are very interesting. They present correlation at several lags with an oscillating behavior. The envelope decays faster for cross- than for auto-covariances, and it is accurately captured by the trained model. It is worth saying that the model trained with single sessions (not reported here) captured the oscillating behavior too, while for the whole traffic, where different kinds of sessions are considered, this is quite hard and was not possible. Fig. 7 shows the results of synthetically generated HTTP traffic patterns.

5.1.3. AoM traffic

N = 4 states proved sufficient to capture the behavior of the data. Fig. 5 shows how PS are usually smaller than 20 bytes and concentrated around a few close values, while, on the contrary, the IPT distribution presents a bi-modal behavior, with the 2 modes separated by 2–3 orders of magnitude. We found similar behaviors in other real-time strategy games, where stations typically send periodic update packets plus additional update packets when a user action must be immediately transmitted [47]. This evident link between user actions and packet-level traffic is probably one of the causes of the greater randomness that was found in AoM traffic, when compared to the other sources considered. Table 8 shows more specifically how both IPT modes are associated with the same range of PS, although the mode with lower IPTs (s2) is more spread in terms of PS than the mode with larger IPTs (s1).
In addition, looking at the steady-state probabilities and conditional durations in Table 8, we can affirm that s1 and s2 capture the 2 typical situations of real-time strategy games (periodic updates and user actions), while s3 and s4 are transient modes introduced by the model. Fig. 5 confirms the previous analysis.

Fig. 6 shows the results obtained for auto- and cross-covariance. We found that all three covariances rapidly decay, an aspect that is well captured by the trained model. Also note (see the cross-covariance at lag 0) the presence of a small dependence between IPT and PS of the single packet, well captured by the trained model. This denotes the dominance of the large IPT–small PS mode. Fig. 7 shows how the model is able to accurately reproduce the AoM traffic pattern.

5.1.4. MSN traffic

MSN presents some characteristics similar to SMTP: two main modes for both IPT and PS as shown in Fig. 5, captured by states s1 and s4 as shown in Table 9. The former accounts for low-bitrate behavior, with large IPT (60 dBμ) and small PS (0.1 KB), while the latter accounts for high-bitrate behavior, with small IPT (40 dBμ) and large PS (1.3 KB). Again the negative cross-covariance in Fig. 6 confirms this kind of coupling between IPT and PS, whereas the auto-covariance reveals the presence of memory for IPT and PS characteristics. Fig. 6 shows an exponentially decaying trend for the data that is well captured by the model, denoting the presence of a significant dependence between IPT–PS pairs of successive packets. In Fig. 7, it can be seen that the model is able to accurately reproduce the MSN Messenger traffic patterns, replicating the two main IPT and PS modes. Also in the case of this traffic category, these results look promising as regards the model suitability for synthetic traffic generation [7] and simulation.

5.2. Prediction: some preliminary results

After presenting the performance of the proposed HMM-based model, here we show some preliminary results on its prediction capability. Indeed, the correlation structure of the various traffic typologies suggests using the trained model for prediction purposes on a sample trace. The main objective is to show the capability of the model to provide the expected short-term future behavior of the traffic with sufficient accuracy. Such a characteristic is particularly appealing when thought of as part of a more complex network-sensing and adaptive-management system. In order to give an idea of what kind of information the proposed approach could provide to higher-level applications, we performed (off-line) the following basic steps on the traces previously described:

• Monitoring: W samples (in terms of IPT–PS pairs) are observed iteratively to obtain an estimate of the current state via the Viterbi algorithm [25];
• Prediction: on the basis of the current state estimate and of the trained model parameters, the traffic is assumed to remain in that state (thus keeping the conditional mean values for IPT and PS) for a number of samples proportional to the conditional duration (Eq. (4)).

Fig. 9. Monitoring and prediction for HTTP: IPT (top) and PS (bottom); data, monitoring, and prediction samples vs. sequence number.

Fig. 10. Monitoring and prediction for AoM: IPT (top) and PS (bottom); data, monitoring, and prediction samples vs. sequence number.

Fig. 11. Monitoring and prediction for MSN: IPT (top) and PS (bottom); data, monitoring, and prediction samples vs. sequence number.

Figs. 8, 9, 10, and 11 show results for SMTP, HTTP, AoM, and MSN traffic, respectively. We considered a W = 3-sample observation to obtain the current-state estimate. Also, we assume that the traffic holds the conditional mean values (see Eq. (2)) for a number of samples proportional to the conditional duration (see Eq. (4)) of the state. The values of the conditional statistics used for the shown results are those reported in Tables 6–9. Asterisks, circles, and diamonds represent data, monitored samples, and predicted samples, respectively. Also, for better precision, a confidence interval proportional to the conditional standard deviation (see Eq. (2)) is reported with a green segment. From the frequent superposition between asterisks and diamonds, it can be noticed how the model captures and predicts the traffic dynamics for all the considered traffic typologies. Such a result is quite interesting, especially when looking at the source behavior in the states with large duration, i.e. s1 for SMTP (whose conditional mean values are 57 dBμ and 48 bytes), s2 for HTTP (whose conditional mean values are 54 dBμ and 345 bytes), s1 for AoM (whose conditional mean values are 49 dBμ and 12 bytes), and s1 for MSN (whose conditional mean values are 66 dBμ and 104 bytes). Also, the very small ratio between diamonds and circles gives an idea of the small amount of sensing that is needed to infer quite reliable information about what we should expect in the short-term future behavior of the traffic.
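The monitoring and prediction steps described in this subsection can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: each state is assumed to emit IPT and PS as independent Gaussians with the conditional means and standard deviations of Table 9, the transition matrix `A` is a made-up placeholder (the trained matrix is not reported in this section), and the prediction horizon is simply the rounded conditional duration. The function names are hypothetical.

```python
import math

def log_gauss(x, mu, sd):
    """Log-density of a Gaussian with mean mu and st. dev. sd."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mu) ** 2 / (2.0 * sd * sd)

def viterbi_last_state(obs, pi, A, mu_t, sd_t, mu_p, sd_p):
    """Monitoring: final state of the most likely path for W (IPT, PS) pairs."""
    n = len(pi)

    def log_b(i, o):
        ipt, ps = o
        # Assumed emission model: independent Gaussians per state.
        return log_gauss(ipt, mu_t[i], sd_t[i]) + log_gauss(ps, mu_p[i], sd_p[i])

    delta = [math.log(pi[i]) + log_b(i, obs[0]) for i in range(n)]
    for o in obs[1:]:
        delta = [max(delta[j] + math.log(A[j][i]) for j in range(n)) + log_b(i, o)
                 for i in range(n)]
    return max(range(n), key=delta.__getitem__)

def predict(state, mu_t, mu_p, phi, k=1.0):
    """Prediction: hold the conditional means for about k * phi[state] samples."""
    horizon = max(1, round(k * phi[state]))
    return [(mu_t[state], mu_p[state])] * horizon

# MSN conditional statistics from Table 9 (states s1..s4).
pi   = [0.578, 0.092, 0.060, 0.270]   # steady-state probabilities as prior
mu_t = [66, 44, 61, 45]
sd_t = [8, 17, 5, 14]
mu_p = [104, 437, 655, 1378]
sd_p = [73, 299, 193, 62]
phi  = [7, 1, 1, 4]
# Placeholder transition matrix (hypothetical values, rows sum to 1).
A = [[0.86, 0.05, 0.04, 0.05],
     [0.40, 0.10, 0.10, 0.40],
     [0.40, 0.10, 0.10, 0.40],
     [0.15, 0.05, 0.05, 0.75]]

window = [(66, 100), (65, 110), (67, 95)]   # W = 3 monitored (dBmu, bytes) pairs
state = viterbi_last_state(window, pi, A, mu_t, sd_t, mu_p, sd_p)
forecast = predict(state, mu_t, mu_p, phi)
```

For this window of small packets with large IPT, the estimate falls on the first state (s1) and the forecast repeats its conditional means (66 dBμ, 104 bytes) for 7 samples. A full backtracking pass is not needed here, because only the last state of the Viterbi path is used by the prediction step.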
Note again that, due to the joint modeling we proposed, estimation of the state variable allows us to infer knowledge about the expected behavior of both IPT and PS simultaneously. Table 10 shows the relative mean square error (RMSE) and the percentage of monitoring data (MP) with respect to three different sizes of the monitoring window. It is apparent how, for all the considered traffic typologies, the RMSE is not significantly affected by the window size, both for IPT and PS, while obviously the MP increases with it. IPT prediction performs very well for all the considered traffic typologies, especially for AoM and MSN, while PS prediction is not very accurate for SMTP and MSN traffic. Indeed, the trained models for such applications present some discrepancy with the training data in the PS pdf as well as in the PS auto-covariance. It is worth noticing that the experiment we have performed is just a simple, non-optimal example of using the model. Another possibility could be building, from the state estimated via the Viterbi algorithm, an m-best list of the most likely n-transitions. The real value of the model, when used jointly with a monitoring algorithm, is the probabilistic representation, in terms of the state-transition matrix A, of the possible evolution of the traffic.

Table 10
RMSE and MP for prediction with different sizes for the monitoring window

Traffic  Window size  IPT RMSE  PS RMSE  MP (%)
SMTP     W = 3        0.12      0.44     37
SMTP     W = 5        0.16      0.58     48
SMTP     W = 7        0.15      0.52     56
HTTP     W = 3        0.13      0.30     24
HTTP     W = 5        0.12      0.27     33
HTTP     W = 7        0.12      0.27     40
AoM      W = 3        0.027     0.13     27
AoM      W = 5        0.027     0.13     38
AoM      W = 7        0.027     0.14     46
MSN      W = 3        0.024     0.43     30
MSN      W = 5        0.020     0.42     41
MSN      W = 7        0.024     0.41     50

6. Discussion and conclusion

In this work, we proposed an HMM-based model of traffic sources at packet level. It jointly models IPT and PS of Internet applications traffic.
It has been shown how the proposed HMM approach is able to capture the behavior of marginal distributions, mutual dependencies, and temporal structures of the traffic generated by a heterogeneous set of sources. The capability to accurately replicate and predict traffic makes the proposed approach quite promising. Results obtained from four kinds of traffic sources, related to totally different Internet applications, have been analyzed. Empirical data clearly show that this heterogeneity is also reflected in the traffic that they generate at packet level. Indeed, they differ in the behavior of the marginal distributions of IPT and PS, and also in their correlation structure. As for the last point, it is worth noting that we found larger autocorrelations with a slower decay for SMTP, HTTP, and MSN traffic when compared to AoM. Such behavior can be partially explained by the influence of TCP end-to-end flow control, which introduces dependencies between IPTs. Indeed, while SMTP, HTTP, and MSN run over TCP, AoM traffic is carried by UDP packets. Furthermore, the rigid application-level protocol rules of SMTP and HTTP induce more structure into their traffic patterns. On the other hand, as regards AoM, the interaction of the gaming user introduces more randomness into the traffic. Again, we underline that the paper aims at modeling the average behavior of a single session. The study of the superposition of several sessions generated by multiple sources may indeed lead to an aggregate traffic showing long-range dependence and self-similarity characteristics, but such an investigation falls beyond the scope of the present work. In all cases the level of computational and structural complexity associated with the model is quite low. Training models for SMTP, HTTP, AoM and MSN required few iterations, and though SMTP and HTTP traffic present a much more complex structure, they only required one more state (with respect to AoM and MSN) for effective modeling.
Then, the flexibility of an HMM approach, even when applied to a low-level traffic modeling, appears quite encouraging. Concluding, it is worth highlighting that the more exciting result of the proposed model is, in our opinion, the capability to fit at the same time both IPT and PS statistics and dynamics, even if not obtaining extreme accuracy, of four different traffic sources with a relative small set of parameters. Benefits and possible applications of such modeling approach include: (i) a better understanding of source traffic dynamics (taking into account also temporal structures) related to different Internet applications; (ii) A. Dainotti et al. / Computer Networks 52 (2008) 2645–2662 exploitation of the short-term prediction capabilities of the model; (iii) usage in traffic simulation and generation frameworks [7]; (iv) application for traffic classification purposes. Finally, we foresee the integration of the presented model within a larger analytical framework, based on HMMs, which includes modeling of heterogeneous packet channels [18,19]. References [1] A. Dainotti, A. Pescapé, P. Salvo Rossi, G. Iannello, F. Palmieri, G. Ventre, An HMM approach to internet traffic modeling, in: Proc. of IEEE GLOBECOM, November 2006, pp. 1–6. [2] A. Dainotti, A. Pescapé, G. Ventre, A packet-level characterization of network traffic, in: Proc. of IEEE CAMAD, June 2006, pp. 38–45. [3] B.A. Mah, An empirical model of HTTP network traffic, Proc. of IEEE INFOCOM, vol. 2, April 1997, pp. 592–600. [4] J. Cao, W.S. Cleveland, Y. Gao, K. Jeffay, F.D. Smith, M.C. Weigle, Stochastic models for generating synthetic http source traffic, in: Proc. of IEEE INFOCOM, vol. 3, 2004, pp. 1546–1557. [5] R. Ohri, E. Chlebus, Measurement based e-mail traffic characterization, in: Proc. of Int. Symp. on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), Philadelphia, PA, July 2005. [6] W. Feng, F. Chang, W. Feng, J. 
Walpole, A traffic characterization of popular on-line games, IEEE/ACM Transactions on Networking 13 (3) (2005) 488–500.
[7] <http://www.grid.unina.it/software/ITG>, September 2007.
[8] J. Farber, Network Game Traffic Modelling, NetGames2002, Braunschweig, Germany, 2002, pp. 53–57.
[9] R. Bangun, E. Dutkiewicz, Modelling multi-player games traffic, in: Proc. of International Conference on Information Technology: Coding and Computing, 27–29 March 2000, pp. 228–233.
[10] M. Borella, Source models of network game traffic, Computer Communications 23 (4) (2000) 403–410.
[11] T. Lang, G.J. Armitage, P. Branch, H. Choo, A synthetic traffic model for half-life, in: Proc. of Australian Telecommunication Networks and Application Conference 2003 (ATNAC 2003), Melbourne, Australia, December 2003.
[12] P. Salvador, A. Pacheco, R. Valadas, Modeling IP traffic: joint characterization of packet arrivals and packet sizes using BMAPs, Elsevier Computer Networks 44 (October) (2004) 335–352.
[13] A. Klemm, C. Lindemann, M. Lohmann, Modeling IP traffic using the batch Markovian arrival process, Performance Evaluation Journal 54 (2) (2003) 149–173.
[14] J. Gao, I. Rubin, Multifractal analysis and modeling of long-range-dependent traffic, in: Proc. of IEEE ICC, June 1999, pp. 382–386.
[15] A.T. Andersen, B.F. Nielsen, A Markovian approach for modeling packet traffic with long-range dependence, IEEE Journal on Selected Areas in Communications 16 (5) (1998) 719–732.
[16] K. Salamatian, S. Vaton, Hidden Markov Modeling for network communication channels, in: Proc. of ACM SIGMETRICS 2001, vol. 29, 2001, pp. 92–101.
[17] W. Wei, B. Wang, D. Towsley, Continuous-time Hidden Markov Models for network performance evaluation, Performance Evaluation 49 (1–4) (2002) 129–146.
[18] P. Salvo Rossi, G. Romano, F. Palmieri, G. Iannello, Joint end-to-end loss-delay Hidden Markov Model for periodic UDP traffic over the Internet, IEEE Transactions on Signal Processing 54 (2) (2006) 530–541.
[19] G.
Iannello, F. Palmieri, A. Pescapé, P. Salvo Rossi, End-to-end packet-channel Bayesian model applied to heterogeneous wireless networks, in: Proc. of IEEE GLOBECOM, November 2005, pp. 484–489.
[20] L. Muscariello, M. Mellia, M. Meo, M.A. Marsan, R. Lo Cigno, Markov models of internet traffic and a new hierarchical MMPP model, Computer Communications Journal 28 (16) (2005) 1835–1851.
[21] O. Rose, Simple and efficient models for variable bit rate MPEG video traffic, Performance Evaluation Journal 30 (1) (1997) 69–85.
[22] E. Costamagna, L. Favalli, F. Tarantola, Modeling and analysis of aggregate and single stream internet traffic, in: Proc. of IEEE GLOBECOM, December 2003, pp. 3830–3834.
[23] C. Wright, F. Monrose, G. Masson, HMM profiles for network traffic classification, in: Proc. of VizSEC/DMSEC, October 2004, pp. 9–15.
[24] <http://www.grid.unina.it/Traffic/>, September 2007.
[25] L.R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257–285.
[26] J.A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, University of Berkeley, CA, Technical Report ICSI-TR-97-021, 1998.
[27] L.A. Liporace, Maximum likelihood estimation for multivariate observations of Markov sources, IEEE Transactions on Information Theory IT-28 (5) (1982) 729–734.
[28] B.H. Juang, S.E. Levinson, M.M. Sondhi, Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory IT-32 (2) (1986) 307–309.
[29] E. Çinlar, Introduction to Stochastic Processes, Prentice Hall, 1975.
[30] <http://nile.wpi.edu/downloads>, September 2007.
[31] R.B. Jennings, E.M. Nahum, D.P. Olshefski, D. Saha, Zon-Yin Shae, C. Waters, A study of Internet instant messaging and chat protocols, IEEE Network 20 (4) (2006) 16–21.
[32] X. Zhen, G. Lei, J.
Tracey, Understanding instant messaging traffic characteristics, in: Proc. of the 27th International Conference on Distributed Computing Systems (IEEE ICDCS 2007), Toronto, Canada, 25–29 June 2007.
[33] S. McCreary, K. Claffy, Trends in wide area IP traffic patterns – a view from Ames Internet Exchange, in: Proc. of ITC Specialist Seminar of Measurement and Modeling of IP Traffic, September 2000, pp. 1–11.
[34] <http://www.microsoft.com/games/ageofmythology/>, September 2007.
[35] <http://www.comscore.com/>, September 2007.
[36] <http://join.msn.com/messenger/overview>, September 2007.
[37] <http://www.microsoft.com/technet/prodtechnol/isa/2000/maintain/isaimsec.mspx>, September 2007.
[38] <http://www.hypothetic.org/docs/msn/general/overview.php>, September 2007.
[39] M. Claypool, The effect of latency on user performance in real-time strategy games, Elsevier Computer Networks 49 (1) (2005) 52–70.
[40] A. Dainotti, A. Botta, A. Pescapé, G. Ventre, Searching for invariants in network games traffic, in: Proc. of ACM Co-Next 2006 Student Workshop, Lisboa, Portugal, December 2006.
[41] S. McCanne, V. Jacobson, The BSD packet filter: a new architecture for user level packet capture, in: Proc. of Winter 1993 USENIX, January 1993, pp. 259–269.
[42] T. Karagiannis, A. Broido, M. Faloutsos, K. Claffy, Transport layer identification of P2P traffic, in: Proc. of ACM SIGCOMM IMC, October 2004, pp. 121–134.
[43] <http://www.wide.ad.jp/wg/mawi/>, September 2007.
[44] R. Caceres, P. Danzig, S. Jamin, D. Mitzel, Characteristics of wide-area TCP/IP conversations, ACM SIGCOMM Computer Communication Review 21 (4) (1991) 101–112.
[45] P. Danzig, S. Jamin, R. Caceres, D. Mitzel, D. Estrin, An empirical workload model for driving wide-area TCP/IP network simulations, Journal of Internetworking: Research and Experience 3 (1) (1992) 1–26.
[46] F.D. Smith, F.H. Campos, K. Jeffay, D. Ott, What TCP/IP protocol headers can tell us about the Web, in: Proc. of ACM SIGMETRICS, June 2001, pp.
245–256.
[47] A. Dainotti, A. Pescapé, G. Ventre, A packet-level model of Starcraft traffic, in: Proc. of IEEE Hot-P2P, July 2005, pp. 33–42.
[48] S.P. Pederson, M.E. Johnson, Estimating model discrepancy, Technometrics 32 (3) (1990) 305–314.

Alberto Dainotti is a Ph.D. student in Computer Engineering and Systems at the Computer Science Department of the University of Napoli “Federico II”, Italy, where he received the M.S. Laurea Degree in Computer Engineering in 2004. His research interests fall in the areas of network measurements, traffic analysis, and network security.

Antonio Pescapé is Assistant Professor at the Department of Computer Engineering and Systems of the University of Napoli Federico II. He received the M.S. Laurea Degree in Computer Engineering and the Ph.D. in Computer Engineering and Systems at the University of Napoli Federico II. His research interests are in the networking field, with a focus on models and algorithms for Internet Traffic, Network Measurement and Management of heterogeneous IP networks, and Network Security. He has co-authored a large number of journal and conference publications. He is an IEEE member, has served and serves on several conference technical program committees (IEEE Globecom, IEEE ICC, IEEE WCNC, IEEE HPSR, etc.), and has served as Guest Editor of the Special Issue of Computer Networks on “Traffic classification and its applications to modern networks”.

Pierluigi Salvo Rossi was born in Naples, Italy, on April 26, 1977. He received the “Laurea” degree in Telecommunications Engineering (summa cum laude) in January 2002 and the Ph.D. in Computer Science in January 2005, both from the University of Naples “Federico II”, Naples, Italy. In 2002, he worked as a Research Engineer at the CIRASS (Interdepartmental Research Center for Signal Analysis and Synthesis), University of Naples “Federico II”, Naples, Italy.
In 2003, he worked as Research Engineer at the Department of Information Engineering, Second University of Naples, Aversa (CE), Italy. In 2004, he was Visiting Research Engineer at the CSPL (Communications and Signal Processing Laboratory), Electrical and Computer Engineering Department, Drexel University, Philadelphia, PA, US. In 2005, he worked as Postdoc Research Engineer at the ITeM (Multimedia Information and Telematic National Laboratory), CINI (Italian University Consortium for Computer Science and Engineering), Naples, Italy. In 2006, he worked as Postdoc Research Engineer at the CRdC–ICT (Regional Institute for Research on Information and Communication Technology), Benevento, Italy. He is currently a Postdoc Research Engineer at the Department of Electronics and Telecommunications, Norwegian University of Science and Technology, Trondheim, Norway. Since 2004, he has been Adjunct Professor at the Second University of Naples, Aversa (CE), Italy. His research interests fall within the areas of speech processing, communication system modeling, and wireless communications.

Francesco Palmieri received his Laurea in Ingegneria Elettronica cum laude from Università degli Studi di Napoli Federico II, Italy, in 1980. In 1983, he was awarded a Fulbright scholarship to conduct graduate studies at the University of Delaware, Newark, where he received an M.S. degree in applied sciences and a Ph.D. in electrical engineering in 1985 and 1987, respectively. In 1981, he served as a 2nd Lieutenant in the Italian Army in fulfillment of draft duties. In 1982 and 1983, he was with the ITT firms FACE SUD Selettronica in Salerno (currently Alcatel), Italy, and Bell Telephone Manufacturing Company in Antwerpen, Belgium, as a designer of digital telephone systems. He was appointed Assistant Professor in Electrical and Systems Engineering at the University of Connecticut, Storrs, in 1987, where he was awarded tenure and promotion to associate professor in 1993.
In the same year, after a national competition, he was awarded the position of Professore Associato at the Dipartimento di Ingegneria Elettronica e delle Telecomunicazioni at Università degli Studi di Napoli Federico II, Italy, where he remained until October 2000. In February 2000, he was nominated Professore Ordinario di Telecomunicazioni after a national competition and was appointed in November 2000 at the Dipartimento di Ingegneria dell’Informazione, Seconda Università di Napoli, Aversa, Italy. His research interests are in the areas of signal processing, communications, information theory, and neural networks.

Giorgio Ventre is Professor of Computer Networks in the Department of Computer Engineering and Systems of the University of Napoli Federico II, where he is leader of the COMICS team. COMICS stands for Computers for Interaction and Communications and is a research initiative in the areas of networking and multimedia communications. After starting ITEM, the first research laboratory of the Italian University Consortium for Informatics (CINI), he is now President and CEO of CRIAI, a research company active in the areas of Information Technologies. As leader of the networking research group at the University of Napoli Federico II, he is principal investigator for several national and international research projects. His research interests are in the area of network protocols and architectures. He has co-authored more than 150 publications and is a member of the IEEE and of the ACM.