

To appear in the Proceedings of the 10th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer Systems (MASCOTS), Fort Worth, Texas, Oct. 12-16, 2002.

Generation of High Bandwidth Network Traffic Traces

Purushotham Kamath, Kun-chan Lan, John Heidemann, Joe Bannister and Joe Touch
University of Southern California, Information Sciences Institute
Los Angeles, California, U.S.A.
{pkamath, kclan, johnh, joseph, touch}@isi.edu

Purushotham Kamath is supported, and Joe Bannister and Joe Touch are partially supported, as part of the NGI Multicast Applications and Architecture (NMAA-ADDON) project funded by the Defense Advanced Research Projects Agency Information Technology Office, DARPA Order No. H645, Program Code No. 9A20, issued by DARPA/CMD under Contract No. MDA972-99-C-0022. Kun-chan Lan and John Heidemann are partially supported in this work as part of the SAMAN project, funded by DARPA and the Space and Naval Warfare Systems Center San Diego (SPAWAR) under Contract No. N66001-00-C-8066. John Heidemann is also partially supported as part of the CONSER project, funded by NSF as grant number ANI-9986208. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, SPAWAR or NSF.

Abstract

High bandwidth network traffic traces are needed to understand the behavior of high speed networks (such as the Internet backbone). However, the implementation of a mechanism to collect such traces is difficult in practice. In the absence of real traces, tools to generate high bandwidth traces would aid the study of high speed network behavior. We describe three methods of generating high bandwidth network traces: scaling low bandwidth network traffic traces, merging multiple low bandwidth traces, and generating traces through simulation by scaling a structural model of real world traces. We evaluate the generated traces and discuss the advantages and disadvantages of each method. We also discuss some of the issues involved in generating traces by the structural model method.

1. Introduction

The behavior of a network depends to a large extent on the nature of the traffic generated by its users. Network protocols and switching mechanisms behave differently under different traffic patterns, and the protocols and switching mechanisms themselves may be the cause of different types of traffic patterns. Understanding network traffic patterns and their causes is central to network research on the Internet. This has led to several efforts to collect traffic traces on the Internet [16] and to analyze them.

Most real world network traces that are publicly available [16,17] are low bandwidth traces from OC3c or OC12c links, or from FDDI or Ethernet networks. In contrast to the wide availability of low bandwidth traces, traces from high bandwidth links (OC48 and OC192 links, such as those in the core of the Internet) are not widely available. The difficulty of implementing a mechanism for collecting traces at high speeds [11] is one factor that contributes to the lack of public availability of such traces. As link speeds increase in the future, the difficulties involved in obtaining high bandwidth traces will increase. In the absence of real high bandwidth traces, one option is to attempt to generate traces that are likely to resemble real traces. Applications of such traces include studies of the behavior of routers, switches and network protocols on high speed links.
Prior studies [21] have suggested that traffic characteristics differ widely depending on where and when the data was recorded. Given such differences in traffic characteristics, there is reason to believe that traffic seen in the core of the Internet (high bandwidth traces) may differ from traffic seen at the edges or on a local area network (low bandwidth traces). Earlier studies have indicated that traffic on an Ethernet network is self similar [13] in nature. Recent studies [3] indicate that Internet traffic tends toward Poisson and independent arrivals as the load increases, as it does in the core of the Internet. Thus, at this point in time, the characteristics of high bandwidth traffic are still under investigation.

In this paper we describe three methods of generating a high bandwidth trace and evaluate the generated traces. As implied above, the major hurdle that we face in validating this effort is the lack of a real world high bandwidth trace to compare our generated traces with. Nevertheless, it is worthwhile to compare real world low bandwidth traces with the high bandwidth traces generated by the three methods to determine how different high bandwidth traffic could be from low bandwidth traffic. The main contribution of this paper is the comparison of the network traffic traces generated by the three methods and a discussion of the issues involved in generation by the structural model method.

2. Challenges in generating high bandwidth traces

The trace files that we generate consist of the following fields for each packet:
• Timestamp (the time when the packet was received at the point where it is being traced)
• Source IP address, destination IP address and source and destination TCP/UDP port numbers
• Packet size (the total length field from the IP packet header)

Generating each of the above fields (for low bandwidth or high bandwidth traces) presents several challenges, which we discuss below.

2.1 Timestamp

The timestamp of a packet indicates how busy the link is. The timestamps can indicate the bursty nature of traffic on the link. There are several factors that affect the timestamp of a packet. The factors listed below affect the timestamp of a packet whether it traverses low or high bandwidth links or a combination of them. The challenge in generating timestamps for a high bandwidth trace is in deciding whether these factors will affect the traces and cause them to differ from traces on a low bandwidth link. Another challenge is how to incorporate these effects in the generation method to ensure that the generated traces are accurate.
• Link characteristics: The bandwidth of a link determines the transmission time of the packet. Both the transmission time and the propagation delay of the link contribute to the round trip time (RTT) of the connection. With a transport protocol such as TCP, the RTT determines when the next packet can be sent and hence its timestamp.
• Switch/router characteristics: The queuing delay at the network switches contributes to the RTT of the connection. The drop policy at the switch also influences packet transmissions. With a transport protocol such as TCP, the congestion control algorithm reacts to packet drops and hence is dependent on the switch drop policy.
• Transport protocol characteristics: The transport protocol used (TCP) influences packet transmission times through the congestion control algorithm [1].
• Application protocol characteristics: HTTP influences the packet transmission times through the number of web pages requested, the number of objects per page and the size of each object. The use of persistent and pipelined connections [10] also affects the timestamp of each packet.
• User characteristics: The user arrival rate and the user think times also affect the timestamps of the packets by controlling how often a user sends requests.

Figure 1 shows how the above factors can impact the timestamp of a packet.

Figure 1. Factors affecting the packet timestamps (an HTTP request/response exchange in which the round trip time is the sum of the transmission time, propagation delay and queuing delay, and successive page requests are separated by user think time)

2.2 Addresses

The source address and destination address of a packet depend on the number of clients and servers, the user arrival rate at each client and how busy a server is. A high bandwidth link may see more unique destination addresses than a low bandwidth link, but the increase in addresses seen may not be proportional to the increase in bandwidth. For example, it is known that for inter-AS traffic a small percentage of end host flows contribute a large percentage [8] of the traffic.

2.3 Packet size

The packet size distribution depends on the number of requests and the file sizes requested. With web traffic the packet sizes usually vary from 40 bytes (connection setup packets) to 1500 bytes (the path maximum transmission unit (MTU) is the minimum of the MTUs of the links on the path, and Ethernet is usually on at least one segment of most paths, so most data packets tend to be at most 1500 bytes long). The packet size distribution should remain the same on low and high bandwidth links.

3. Evaluating the quality of the traces

Network traffic characteristics may be described by several metrics. The characteristics that are measured in this study may be divided into two categories.

Packet characteristics:
• Packet interarrival time distribution
• Packet size distribution
• Distribution of the number of unique destination addresses seen in a time interval

Connection characteristics:
• TCP connection duration distribution
• Distribution of the number of bytes and packets in a TCP connection

These five metrics reasonably model IP traffic traces from the perspective of at least one application - that of studying network switch/router behavior. The packet interarrival time gives an indication of how fast a router needs to process packets. The packet size affects the buffer space and the packet transmission time at the output port. The destination address determines where the packet needs to be sent, and the distribution of the number of unique addresses seen in an interval affects mechanisms such as routing table caches [9] used in a switch or router. The connection characteristics can affect how often a particular routing table entry is used. The TCP connection duration and the number of bytes and packets sent determine the temporal locality of addresses seen.

4. Methods of generating high bandwidth traces

The simplest methods to generate a high bandwidth trace involve processing a low bandwidth trace. This is the methodology used in the first two methods that we describe. Another method of generating high bandwidth traces is to simulate a high speed link and extract traces during the simulation. The third method we describe is based on this methodology. In this section we describe these three methods and discuss how accurate we can expect the generated traces to be. We have used the Auckland II trace set [17] as the low bandwidth trace set from which to generate high bandwidth traces.
The Auckland II traces are from a wide area network link with a peak packet rate of 2 Mbps in each direction. Bit rates (calculated over 100 ms intervals) vary from 0.4 Mbps to 8 Mbps. The duration of the trace is around 11 hours. The trace consists of two trace files, one for traffic in each direction.

4.1 Scaling

A simple method to generate a high bandwidth trace is to scale a real world trace in time by dividing the timestamps in the trace by a constant, the scaling factor s. Most common real world traces are either bidirectional traces, such as an Ethernet trace, or a pair of unidirectional traces from a wide area link, such as the Auckland [17] data set. This method can be applied to either form of trace file. If the original traffic trace is from a 10 Mbps link and we want to generate a 1 Gbps trace file, then the scaling factor is s = 1 Gbps / 10 Mbps = 100. Each timestamp Torig in the original trace file is replaced in the generated trace file by a new timestamp Tnew = Torig / s. All other fields in the generated trace file (source and destination addresses, packet sizes) are the same as in the original trace file.

The strongest justification for applying this method is that high bandwidth network traffic is a multiplexing of traffic from several low bandwidth links. If the packet arrivals of n streams on a low bandwidth link are independent and the packet interarrival times are uniformly distributed, then the resulting aggregated trace should have an arrival rate approximately n times the arrival rate of the low bandwidth links. There are several flaws in this reasoning. Packets on a TCP connection will not have uniformly distributed interarrival times, due to structural effects such as the dependence of packet transmission on RTT and user think times. The presence of a faster link does not imply that the propagation delay would change or that a user would read a web page faster. In addition, the multiplexing of traffic from several links may have other effects such as
• queuing delays at switches/routers
• packet drops at switches/routers (which depend on drop policies such as Drop Tail, RED etc.), which in turn affect the TCP congestion control algorithm (resulting in window size changes, RTT reestimation and RTO recalculation).
It is unlikely that these effects will be accurately modeled by the scaling method. Another disadvantage of this method is that the low bandwidth trace used for generation needs to be a long duration trace (at least s times the duration of the desired high bandwidth trace). A possible solution to this is to concatenate several low bandwidth traces and then scale. However, this results in abrupt changes in traffic patterns at the points in time where the traces were concatenated.

4.2 Merging

In the merging method several real low bandwidth traffic traces are merged. The merging is done by simulating the multiplexing that happens at an output queued switch, using a switch simulator. The switch simulator consists of a single switch using a FIFO queue with a tail drop policy. The trace files are fed as input to the switch simulator. The timestamps in the input trace files determine the time when each packet is placed in the output queue. The switch simulator serves the queue using a FIFO queuing discipline and sends packets out on the output link. Each packet suffers a queuing delay that depends on the number of packets in the queue, and requires a transmission time that depends on the packet size and the bandwidth of the output link. The trace data (timestamp, source and destination addresses, packet size) for each packet is logged in a file before it is sent out. The number of files being merged is called the merging factor, m.
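As an illustration of the two trace-processing methods described above (a minimal sketch of ours, not the switch simulator used in this work; the record layout of (timestamp, source, destination, size) tuples and all function names are assumed), scaling divides every timestamp by s, while merging multiplexes time-sorted input traces through a single FIFO output queue:

    import heapq

    def scale_trace(records, s):
        # Scaling (Section 4.1): divide each timestamp by the scaling factor s;
        # addresses and packet sizes are copied unchanged.
        return [(t / s, src, dst, size) for (t, src, dst, size) in records]

    def merge_traces(traces, link_bps):
        # Merging (Section 4.2): multiplex m time-sorted traces through one
        # FIFO output queue. The buffer is treated as infinite here, so the
        # tail drop policy of the real switch simulator is not exercised.
        queue_free = 0.0                      # time the output link next goes idle
        merged = []
        for (t, src, dst, size) in heapq.merge(*traces):
            start = max(t, queue_free)        # queuing delay if the link is busy
            tx = size * 8.0 / link_bps        # transmission time on the output link
            queue_free = start + tx
            merged.append((start + tx, src, dst, size))   # log the departure time
        return merged

With 100 six-minute input traces and a 1 Gbps output link this would correspond to the configuration evaluated in Section 5; as noted above, neither operation models packet drops or their effect on TCP.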
This method attempts to correct a deficiency of the scaling method by introducing a queuing mechanism which simulates queuing delays. Note that it simulates the queuing delay at a single switch and not the queuing delay that would be seen in a network of switches. Unlike the scaling method, merging does not significantly distort structural effects such as user think times and RTTs. As with the scaling method, this method does not take into account the effect of packet drops or the TCP congestion control algorithm. Any effects due to TCP congestion control that were present in the original traces are retained in the generated trace. However, the effect of new packet drops at the switch on the TCP congestion control algorithm is not taken into account.

The method works with unidirectional traces only. Since it simulates a switch, bidirectional traces cannot be used as input. To generate bidirectional traffic, each direction needs to be generated separately. Doing this means that the traffic in opposite directions (data packets and their acks) may not be correlated. Because the timestamps of data packets and acks are changed independently, it is possible that the traffic generated in the two directions could not have been generated in reality.

To generate a trace file of bandwidth m times the original trace bandwidth, it is necessary to have m low bandwidth trace files. This may not always be feasible in practice. A solution is to split a long low bandwidth trace file into m shorter trace files and merge them. This may result in the distortion of TCP connections which span the points in time where the file was split: packets that are sent well into the connection may now appear at the start of the connection, or even before the connection was established. A solution to this problem is renumbering the destination addresses in the split trace files. This alleviates the distortion of the TCP connection statistics - packets on different input interfaces of the switch simulator all have unique destination addresses and hence belong to different TCP connections. However, this mechanism distorts the destination address statistics.

4.3 Scaling structural model parameters through simulation

To generate traffic traces through simulation we employ a structural modeling approach [22]. Traditional black box traffic modeling approaches focus on employing complex time-series analysis to model network traffic. These models ignore the underlying network structure and hence provide little or no insight into the observed characteristics of measured traffic and its underlying causes. Structural modeling, on the other hand, proposes that we explicitly take into account the complex hierarchical structure of applications and the intertwined networking mechanisms in order to accurately reproduce the traffic. Our structural model (based on a tool called RAMP) [12] attempts to model user and application characteristics [20] and network mechanisms by deriving information from low bandwidth TCP level network traces. The procedure is described below:

1. Cumulative distribution functions (CDFs) for different parameters are derived from a real world (low bandwidth) trace.
The parameters can be divided into three categories: user behavior (user interarrival times), web page characteristics (number of pages per session, number of objects per page) and object characteristics (object size, inter-object interval).

2. A structural model for the simulation is built from these parameters, as described below (a simplified sketch of this sampling step appears later in this subsection):
User behavior:
• User interarrival times for the simulation are chosen randomly based on the user interarrival CDF.
Web page:
• The number of pages per user session is chosen randomly based on its CDF.
• The sources of the pages are chosen based on a server popularity CDF.
Object:
• The number of objects within one page is chosen based on an object CDF.
• The sizes of the objects in a page are chosen based on an object size CDF.
• A TCP connection is used for multiple request/response exchanges or a single request/response exchange based on the probability of persistent connections (HTTP/1.1) versus non-persistent connections (HTTP/1.0) as computed from the trace. In persistent connection mode, all objects within the same page are sent via the same TCP connection [10].
• The TCP window sizes for both servers and clients are also randomly chosen from a CDF.

3. This structural model is used to drive a network simulation from which traces are gathered.

Around 50-70% of traffic on the Internet today [6] is web traffic. Based on this observation and on the data obtained from our real world trace files, only user characteristics of web traffic have been modeled. It is known that web traffic consists of mostly short lived flows; long lived flows such as multimedia streams represent a small percentage of the traffic. RAMP is being modified to extract characteristics from different types of traffic. This will allow a more accurate representation of traffic characteristics and the generation of more realistic traces.

In addition to the structural model of user behavior, a suitable network topology must also be chosen. Simulating a backbone network topology is a difficult task. Simulating the entire topology along with traffic sources at each node strains [18] the available computing resources. The topology should be chosen such that it is easy to increase the amount of traffic on the link on which the traffic trace is recorded. Hence, the backbone topology has been simplified to a dumbbell topology, as shown in Figure 2, with clients and servers on either side of the bottleneck link. A packet from a client to a server traverses four router nodes. The traffic traces are collected on the bottleneck link. To assign link latencies, the round trip times are determined from the low bandwidth traffic trace [12] and a cumulative distribution function is generated. Link latencies are assigned to client links from this distribution. Studies [4] have shown that the number of host pairs increases as the square root of the bit rate. Since the scaling factor of the simulation was 100, the number of servers and clients was chosen to be approximately 10 times the number of server and client IP addresses found in the Auck II trace. To obtain higher bandwidth traces, the link bandwidths were increased linearly by a factor called the simulation scaling factor and the user interarrival times were decreased linearly by the simulation scaling factor.

Figure 2. Simulation topology (clients and servers on either side of a bottleneck link)

In contrast to the other two methods, this method will accurately model the queuing delays of a network and mechanisms such as TCP congestion control behavior.
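As an illustration of step 2 of the procedure above (a simplified sketch of ours, not the RAMP implementation; the CDF table layout and function names are assumed), a single web user session can be generated by inverse-transform sampling from the empirical CDFs extracted from the low bandwidth trace:

    import bisect, random

    def sample(cdf):
        # Inverse-transform sampling from an empirical CDF stored as a list of
        # (value, cumulative_probability) pairs sorted by probability.
        u = random.random()
        i = bisect.bisect_left([p for (_, p) in cdf], u)
        return cdf[i][0]

    def generate_session(cdfs, p_persistent):
        # One user session: a sampled number of pages, each fetched from a
        # server drawn from the popularity CDF and containing a sampled
        # number of objects with sampled sizes.
        pages = []
        for _ in range(int(sample(cdfs["pages_per_session"]))):
            server = sample(cdfs["server_popularity"])
            sizes = [sample(cdfs["object_size"])
                     for _ in range(int(sample(cdfs["objects_per_page"])))]
            persistent = random.random() < p_persistent   # HTTP/1.1 vs HTTP/1.0
            pages.append((server, sizes, persistent))
        return pages

Successive sessions would be separated by gaps drawn from the user interarrival CDF, and the resulting request schedule drives the dumbbell simulation from which the high bandwidth trace is recorded.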
The main disadvantage of this method is that generating traces of higher bandwidth consumes an increasing amount of resources. In order to generate higher bandwidth traces using a simulator such as NS [2], the size of the simulation (the number of users and the amount of traffic) must be increased, which consumes a large amount of resources in the form of memory and processor cycles. The structural model method therefore represents a challenge in terms of the resources [19] required to run the simulations. Section 6 discusses this issue in greater detail.

5. Evaluation of the generated traces

As described in the introduction, the characteristics of high bandwidth traces are unknown. Therefore, evaluating each method to determine which one generates the most accurate trace is important. The traces generated by each method are evaluated by comparing the cumulative distribution functions (CDFs) of the generated traces with those of the original traces. Figures 3 through 8 compare the traces generated by the three methods with the original traces. The original (or real) trace file was from the Auck II trace set (as described in Section 4). The duration of the original trace used was around 11 hours. It is difficult to judge the accuracy of the generated traces since we do not have a real high bandwidth trace to compare with. The comparison of the generated traces with the low bandwidth trace is an attempt to observe how the characteristics of a trace may change from low to high bandwidths.

The three methods (scaling, merging and structural modeling) attempt to scale the traces by a factor of 100. The original trace was from a 10 Mbps link and the goal was to generate traffic on a 1 Gbps link. For the scaling method the scaling factor was 100. Since the original trace file was around 11 hours long, the generated trace file was around 6 minutes long.

The merging method used a merging factor of 100. Since it was not feasible to acquire 100 trace files of 6 minute duration from different sites, a single trace file of 11 hours was divided into 100 trace files of approximately 6 minutes each. The timestamps in each trace file were normalized to start at a time of zero seconds. The destination addresses in each trace file were renumbered to prevent distortion of the TCP connection statistics. While this prevented distortion in statistics such as connection duration, it may have contributed some distortion in the form of a larger number of destination addresses than would normally be seen.

In the case of the structural modeling method, the scaling factor was 100. User characteristics were extracted from 360 seconds of the trace. Around 1000 seconds of time was simulated (to allow the system to reach steady state) and 360 seconds of trace was extracted.

Figure 3. Packet interarrival time CDF
Figure 4. Packet size CDF
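The comparisons that follow all overlay empirical CDFs of a given metric for the real and generated traces. As a small illustration of this methodology (a sketch of ours; the trace parsing that produces the timestamp list is assumed), the packet interarrival time CDF can be computed as:

    def empirical_cdf(values):
        # Empirical CDF: each sorted value paired with its cumulative fraction.
        ordered = sorted(values)
        n = len(ordered)
        return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

    def interarrival_cdf(timestamps):
        # Packet interarrival times are the gaps between consecutive packets.
        return empirical_cdf([t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])])

The same routine applies to packet sizes, unique destinations per interval, connection durations, and bytes or packets per connection by substituting the values being compared.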
5.1 Packet interarrival time

Figure 3 shows the CDF of the packet interarrival time for the four traces. From the graph it can be observed that the packet interarrival time for the scaled trace has been reduced by the scaling factor, shifting the location of the CDF. The shape of the curve is identical to that of the low bandwidth trace. The location of the packet interarrival time CDF for the merged traces has been reduced by the merging factor, and has been shifted very close to the plot for the scaled traces. The shape of the graph is much smoother than for the low bandwidth traces, due to the effect of queuing delays while merging. The shift in the location of the CDF for the structural model generated traces is lower than that observed for the scaling and merging methods, which indicates that scaling the user interarrival times has not necessarily scaled the packet interarrival times proportionately. The shape of the graph is closer to the low bandwidth traces than the traces produced by merging. This indicates that the structural model method has represented packet interarrivals more accurately than the merging method when both are compared to the original traces.

5.2 Packet size

Figure 4 shows the CDFs of the packet size for the four traces. The low bandwidth trace and the scaled traces have identical CDFs, and are indistinguishable on the graph. The merging method gives a packet size distribution which very closely matches that of the low bandwidth traces. The structural model method, however, has a CDF which is considerably different from the low bandwidth traces. This occurs for several reasons, some relating to abstractions in the simulation model. First, it is well known that TCP packet size distributions are strongly bimodal, with a large number of ACK packets around 40 bytes and data packets around the MTU size, typically either 540 bytes or 1500 bytes. The simulator's default data packet size is 1000 bytes, halfway between the two MTUs observed in the trace. Second, the simulator does not try to model byte-level details of web traffic, but instead rounds transfers to an even number of whole data packets. Thus the traces show some variation in packet length (a curve in the 40-500 byte range and a slight slope in the 500-1500 byte range), while the simulation is completely bimodal (40 or 1000 bytes). Finally, while the simulation model only supports web traffic (which accounts for a large percentage of the total traffic in the trace), the trace contains a variety of protocols with various packet lengths.
A large percentage of small packets (those in the 40-500 byte range) are contributed by UDP traffic and by protocols such as telnet and ftp. Each of these differences could be rectified by a more detailed traffic model in the simulation. However, we found that these differences did not have much effect on aggregate traffic patterns. RAMP is being modified to use different packet sizes based on a CDF extracted from real traces. With that change, we expect that the packet size distribution in the generated trace will more closely resemble the real trace.

Figure 5. Unique destination addresses in an interval of 1 second (CDF)

5.3 Destination addresses

Figure 5 shows the CDF for the number of unique destination addresses seen in an interval of 1 second. Studies of traffic locality in the NSFNET backbone [5] indicate that a large percentage of the traffic is destined for a small percentage of hosts. Other studies [4] indicate that the number of host pairs increases as the square root of the bit rate. These effects should be considered when evaluating the generated traces.

The diversity of the destination addresses (the number of unique addresses) in the scaled traces has remained unchanged, but the distribution across time (the number of unique addresses seen in an interval) has changed. As seen from the graph, scaling the traces does not increase the number of destination addresses seen in an interval linearly by the scaling factor. The reason is that a small percentage of destination addresses account for a large percentage of the traffic. Therefore, although more packets are seen in an interval of time, the number of unique destination addresses seen does not increase by the same factor.

In the case of merging, the destination address distribution has scaled linearly. This effect is primarily due to how the files were merged. The trace was generated by merging 100 trace files. Since 100 trace files of suitably long duration (360 seconds) were unavailable, the 100 trace files were generated by splitting a long duration trace file (around 11 hours). To avoid distorting the TCP connection statistics, the destination addresses in each of the 100 trace files were renumbered. As a result, the destination address CDF has scaled linearly by a factor of 100.

From the structural model plot it can be seen that the number of destination addresses seen in an interval has been scaled by a factor of around 10. As described in Section 4.3, the number of destination hosts in the simulation was scaled by a factor of 10, to accurately model the system. Hence the address distribution has been scaled by about 10.

5.4 TCP connection packets/bytes

Figures 6 and 7 show the CDFs for TCP connection bytes and packets.
The TCP connection packet count is the number of packets sent during a TCP connection (including the connection setup and teardown packets). The TCP connection byte count is the total number of bytes sent during a TCP connection (including connection setup and teardown packets). For the scaling and merging methods, the TCP connection bytes and packets closely approximate the low bandwidth traces. The structural model method, however, has a much higher percentage of low byte connections. This is because the structural model is based on user characteristics derived from web traffic only. It has neglected all other types of traffic, such as telnet or ftp traffic, which may have higher byte connections. In addition, the simulator was run in a mode where it did not send connection setup and teardown packets.

Figure 6. TCP connection bytes CDF
Figure 7. TCP connection packets CDF

5.5 TCP connection duration

Figure 8 shows the TCP connection duration. The TCP connection duration can be influenced by several factors - the bottleneck bandwidth of the connection, the propagation delay, protocol characteristics (as in the case of web sessions with persistent connections (HTTP/1.1)), the number of objects in a page, and user characteristics. One of the effects of scaling is that these characteristics will not be retained. For example, a telnet session may last anywhere from a few minutes to a few hours, while a persistent web connection may last a few seconds. Scaling the traffic compresses these periods. As seen in the graph, the connection duration of the traces generated by scaling has been reduced by the scaling factor. For merging, the connection duration should be affected slightly by the added queuing delay. However, the graph indicates a close match in the flow duration of the merged trace and the real trace, indicating that the queuing delay was minimal. The structural model method has generated a large number of short duration flows. This is typical of web traffic and is due to extracting characteristics from only web traffic. In addition, the web traffic extracted from the real (low bandwidth) trace file had a large percentage of non persistent connections (80%), resulting in a large number of short flows.

Figure 8. TCP connection duration CDF

6. Discussion

There are several issues involved in the three methods that make it difficult to accurately generate traffic on a backbone network. Some of these issues are discussed below.

6.1 Choosing the duration of the input and generated traces

The duration of the input traces and the generated traces plays a large role in determining the accuracy of the generated traces. Input files that are too short and are used in the scaling or merging methods will not capture long tailed behavior. In particular, for the structural model method, input files that are too short do not allow the structural model to capture long tailed behavior in the CDFs that are generated.
The duration of the simulation also affects the characteristics of the trace generated. Simulations that are too short will not capture long tailed behavior. On the other hand, it may be computationally infeasible to simulate a long duration. Our experience was that if a comparison is to be made between the real and generated traces, then the simulation should be run for a duration at least equal to the duration of the real world trace from which the data to drive the simulation was extracted. For example, if a generated trace of 100 seconds is required, then data for the structural model should be extracted from a real world trace of duration 100 seconds and then 100 seconds of simulation should be run.

6.2 Increasing the bandwidth and determining steady state of the simulation

In the structural model method high bandwidth traces are obtained by varying two quantities:
• User interarrival times
• Link bandwidths
We scaled both by the same factor (increasing the link bandwidths and decreasing the user interarrival times) so as to maintain their ratio. Thus we have made an implicit assumption that the number of users increases proportionately to the link bandwidth.

Figure 9 shows the bit rate (bps) in the real (10 Mbps) and generated (1 Gbps and 10 Gbps) traces, averaged over intervals of 1 s, for three different scalings versus time. The graph indicates that increasing the link bandwidth and decreasing the user interarrival time by the simulation scaling factor s results in an increase in the traffic by an amount on the order of the simulation scaling factor. The lowest curve is the original trace file (on a 10 Mbps link). The middle curve is the simulation run with a simulation scaling factor of 100 (i.e. a 1 Gbps link) and the top curve has a simulation scaling factor of 1000 (i.e. a 10 Gbps link). A trace of 1000 seconds of a 1 Gbps simulation took about 3 hours and 50 minutes to run on a 1 GHz Pentium IV system with 1 GB RAM running Red Hat Linux 7.2. The 10 Gbps simulation took almost 24 hours and was stopped after 500 s of simulated time (hence the abruptly terminated curve for the 10 Gbps simulation) due to a limit in the environment used to run the simulation. As seen from the graph, it takes approximately 250 seconds of simulation time before the system reaches steady state and traces can be analyzed.

Figure 9. Relationship between the simulation scaling factor and the generated traffic as a function of simulation time (bit rate versus time for the real trace and for simulation scaling factors s = 100 and s = 1000)
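The steady state check behind Figure 9 can be approximated by binning a trace into 1 second intervals and computing the bit rate in each bin (a sketch of ours; each record is assumed to be reduced to a (timestamp, packet size) pair):

    def bitrate_per_second(samples):
        # samples: (timestamp, packet_size) pairs. Returns the bits carried in
        # each 1 second bin, as plotted in Figure 9; the curve levelling off
        # indicates that the simulation has reached steady state.
        bins = {}
        for (t, size) in samples:
            bins[int(t)] = bins.get(int(t), 0) + size * 8
        return [bins.get(sec, 0) for sec in range(max(bins) + 1)]

Applied to the generated traces, roughly the first 250 seconds would be discarded as warm-up before computing the CDFs of Section 5.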
7. Related work

SynRGen [7] is a file reference generator that is used to build test suites for file systems. Like the structural model method, it attempts to model the behavior of real users by tracing user behavior. A model of user behavior is built from the trace. Stress testing of the file system is performed by using this user model to simulate a large number of users on the system. This system is in many ways analogous to the structural model method, which also builds a user model and attempts to increase the number of users to generate higher bandwidth traces.

Lucas et al. [14,15] have characterized the nature of wide area IP traffic and have developed a traffic model called (M, P, S) to generate traffic for simulation. Their method of generating traffic traces consists of three steps: generating aggregate traffic (arrivals per 100 ms interval) using a self similar traffic model, partitioning the generated traffic by assigning destination addresses according to expected arrival distributions, and finally distributing the packet arrivals into arrivals per ms. Unlike the structural modeling method, which attempts to build a model of user behavior, their method assumes a particular model for the packet arrivals (self similar). However, it is likely that their model can generate high bandwidth traces much more efficiently than the structural model method.

8. Conclusions

We have compared three different methods of generating high bandwidth traces. Of these methods, scaling appears to generate the least accurate traces, distorting flow durations and destination address diversity. Merging with renumbering appears to be a viable method if the factors of interest are only connection characteristics and packet characteristics such as packet interarrival times and packet sizes. However, because of the renumbering, it will not accurately model the destination address characteristics. The structural model method generates accurate traces; however, the use of only web traffic to represent all Internet traffic causes the model to fall short of full accuracy. In particular, flow durations have not been replicated accurately. An extended model based on different types of Internet traffic would provide more accurate traces.

As a method of generating higher bandwidth traces, scaling user interarrival times gives a fairly proportional scaling in the offered load on a link. The simulation should be run for a time equal to the duration of the real world trace used to extract the statistics for the simulation. Ultimately, the limits of processor speed and memory limit the ability of the structural method to generate traces that exhibit long tailed behavior. However, this method can generate traces that help understand high bandwidth network behavior for short durations of time (seconds rather than hours).

References

1. Allman, M. and Paxson, V., "TCP congestion control; RFC 2581," Internet request for comments (Apr 1999).
2. Bajaj, S., Breslau, L., Estrin, D., Fall, K., Floyd, S., Haldar, P., Handley, M., Helmy, A., Heidemann, J., Huang, P., Kumar, S., McCanne, S., Rajaie, R., Sharma, P., Varadhan, K., Xu, Y., Yu, H., and Zappala, D., "Improving simulation for network research," Technical Report 99-702, University of Southern California, http://www.isi.edu/nsnam (Mar 1999).
3. Cao, J., Cleveland, W. S., Lin, D., and Sun, D. X., "On the nonstationarity of Internet traffic," Proceedings of ACM Sigmetrics, pp. 102-112, Cambridge, MA (Jun 2001).
4. Claffy, K., "Internet measurements: Myths about Internet data," Presentation at NANOG 2002, Miami (Feb 2002).
5. Claffy, K., Braun, H. W., and Polyzos, G., "Traffic characteristics of the T1 NSFNET backbone," Proceedings of IEEE Infocom, pp. 885-892, San Francisco (Aug 1993).
6. Claffy, K., Miller, G., and Thompson, K., "The nature of the beast: recent traffic measurements from an Internet backbone," Proceedings of Inet (Jul 1998).
7. Ebling, M. R. and Satyanarayanan, M., "SynRGen: An extensible file reference generator," ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 108-117 (May 1994).
8. Fang, W. and Peterson, L., "Inter-AS traffic patterns and their implications," Proceedings of the Global Telecommunications Conference, Vol. 3, pp. 1859-1868, Rio de Janeiro, Brazil (Dec 1999).
9. Feldmeier, D., "Improving gateway performance with a routing table cache," Proceedings of IEEE Infocom, pp. 298-307, New Orleans (Mar 1988).
10. Fielding, R., Mogul, J., Gettys, J., Frystyk, H., and Berners-Lee, T., "Hypertext Transfer Protocol -- HTTP/1.1; RFC 2068," Internet request for comments (Jan 1997).
11. Iannaccone, G., Diot, C., and McKeown, N., "Monitoring very high speed links," Proceedings of the ACM Sigcomm Internet Measurement Workshop, pp. 267-271, San Francisco (Nov 2001).
12. Lan, K.C. and Heidemann, J., "Rapid model generation from traffic measurement," ISI Technical Report ISI-TR-561 (Aug 2002).
13. Leland, W. E., Taqqu, M. S., Willinger, W., and Wilson, D. V., "On the self-similar nature of Ethernet traffic," IEEE/ACM Transactions on Networking, Vol. 2, No. 1, pp. 1-15 (Feb 1994).
14. Lucas, M. T., Dempsey, B. J., Wrege, D. E., and Weaver, A. C., "Statistical characterization of wide area IP traffic," Sixth International Conference on Computer Communications and Networks (IC3N'97), Las Vegas, NV (Sep 1997).
15. Lucas, M. T., Dempsey, B. J., Wrege, D. E., and Weaver, A. C., "(M, P, S) - An efficient background traffic model for wide area network simulation," 1997 IEEE Global Telecommunications Conference, Vol. 3, pp. 1572-1576 (1997).
16. McGregor, A., Braun, H. W., and Brown, J., "The NLANR Network Analysis Infrastructure," IEEE Communications Magazine, Vol. 38, No. 5, pp. 122-128, http://pma.nlanr.net/Traces/Traces (May 2000).
17. Micheel, J., Graham, I., and Brownlee, N., "The Auckland data set: an access link observed," Proceedings of the 14th ITC Specialists Seminar on Access Networks and Systems, Barcelona, http://pma.nlanr.net/Traces/long (Apr 2001).
18. Paxson, V. and Floyd, S., "Why we don't know how to simulate the Internet," Proceedings of the 1997 Winter Simulation Conference (Dec 1997).
19. Riley, G. and Ammar, M., "Simulating large networks: how big is big enough?," Proceedings of the First International Conference on Grand Challenges for Modelling and Simulation (Jan 2002).
20. Smith, F. D., Campos, F., Jeffay, K., and Ott, D., "What TCP/IP headers can tell us about the Web," Proceedings of ACM Sigmetrics, pp. 245-256, Cambridge, MA (Jun 2001).
21. Willinger, W. and Paxson, V., "Where mathematics meets the Internet," Notices of the American Mathematical Society, Vol. 45, No. 8, pp. 961-970 (Aug 1998).
22. Willinger, W., Paxson, V., and Taqqu, M. S., "Self similarity and heavy tails: Structural modeling of network traffic," in A Practical Guide to Heavy Tails: Statistical Techniques and Applications, ISBN 0-8176-3951-9, Birkhauser, Boston (1998).