
Internet Traffic Measurement

2001, IEEE Internet Computing


Carey Williamson • University of Calgary

The Internet's evolution over the past 30 years has been accompanied by the development of various network applications. These applications range from early text-based utilities such as file transfer and remote login to the more recent advent of the Web, electronic commerce, and multimedia streaming.

For most users, the Internet is simply a connection to these applications. They are shielded from the details of how the Internet works through the information-hiding principles of the Internet protocol stack, which dictates how user-level data is transformed into network packets for transport across the network and put back together for delivery at the receiving application.

For many networking researchers, however, the protocols themselves, rather than the information they carry, are of interest. Using specialized network measurement hardware or software, these researchers collect information about network packet transmissions, including their timing structure and contents. With detailed packet-level measurements and some knowledge of the IP stack, they can use reverse engineering to gather significant information about both application structure and user behavior, which can be applied to a variety of tasks, such as network troubleshooting, protocol debugging, workload characterization, and performance evaluation and improvement.

From humble beginnings in local area networks, traffic measurement technologies have scaled up over the past 15 years to provide insight into fundamental behavioral properties of the Internet, its protocols, and its users. In this overview, I introduce the tools and methods for measuring Internet traffic and offer highlights from research results.

Measurement Tools

Network measurement tools include hardware and software approaches. Software tools are typically much less expensive than hardware, but the latter usually offer better functionality and performance.

Hardware Approaches

Network traffic analyzers are special-purpose tools designed to collect and analyze network data. Such equipment is widely available and often expensive, with the cost depending on the number of network interfaces, the types of network cards, storage capacity, and protocol processing capabilities.

As an example of the hardware-based traffic analysis process, I'll use measurements we collected in 1998 at an ISP running IP over an Asynchronous Transfer Mode (ATM) backbone. To analyze network traffic, we used a NavTel IW95000.1 This ATM network analyzer provides nonintrusive capture of cell-level ATM traffic streams, including packet headers and payloads. The analyzer timestamps each ATM cell with 1-microsecond resolution and records the captured traffic into memory in a compressed, proprietary binary data format. The size of the memory capture buffer and the volume of network traffic determine the maximum time interval for trace collection (typically several seconds at 155-Mbps OC-3 rates, and several minutes at 1.5-Mbps T1 rates). Once the capture buffer is full, traces can be saved to disk or copied to another machine for offline trace analysis.
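To get a feel for these capture windows, divide the buffer size by the link's byte rate. The short Python sketch below does this arithmetic; the 256-Mbyte buffer is a hypothetical figure chosen for illustration, not an IW95000 specification, and the analyzer's compression would stretch the windows further.

```python
def capture_seconds(buffer_bytes: int, line_rate_bps: float) -> float:
    """Seconds of traffic a capture buffer holds at full line rate."""
    return buffer_bytes / (line_rate_bps / 8)

BUF = 256 * 2**20  # hypothetical 256-Mbyte capture buffer
for name, rate_bps in [("OC-3", 155e6), ("T1", 1.5e6)]:
    print(f"{name}: ~{capture_seconds(BUF, rate_bps):.0f} seconds")
# Prints roughly 14 seconds for OC-3 and roughly 1430 seconds
# (about 24 minutes) for T1, consistent with the ranges above.
```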
In this case, we used a custom C program to decode the NavTel IW95000's recorded data. The program converts the binary data file into an ASCII format with TCP/IP protocol information. Figure 1 shows an example of the human-readable trace format (I've "sanitized" IP addresses throughout to conceal user identities). The format includes a timestamp (in microseconds, relative to the trace's start time), recognized protocol types, and selected fields from the IP and TCP packet headers.

TIMESTAMP  PROTOCOL  SOURCE IP_ADDRESS  SRC PORT  DESTINATION IP_ADDRESS  DST PORT  IP_PKT SIZE  TCP SEQ   TCP ACK
0          IP TCP    307.246.129.64     1060      427.86.12.704             80        40          920641    412791
14966      IP TCP    561.877.104.57     7410      427.86.12.704             80       508          410104     32779
15015      IP TCP    391.82.374.90      1105      891.82.59.75              80        40         2816846      7726
22090      IP TCP    719.327.502.359    1140      526.837.913.44            80        40         1010185     14762
22126      IP TCP    582.127.755.91     1291      419.74.87.6               80        40         9557082     50482
29960      IP TCP    561.877.104.57     3741      427.86.12.704             80        40          985526     58006
29960      IP TCP    419.74.87.6          80      582.127.755.91          1291      1500          653402     57082
31724      IP TCP    419.74.87.6          80      582.127.755.91          1291      1500          654862     57082
36055      IP TCP    512.84.9.317       1125      419.74.87.628             80       311          857517     89873
36279      IP TCP    512.84.9.317       1126      419.74.87.628             80       271          857661      3293
37181      IP TCP    407.84.92.183      1207      398.54.73.39            5190        40           64202      9407
41731      IP TCP    399.81.77.33         80      342.406.374.91          1116        40         1062629     68778

Figure 1. TCP/IP packet trace file of ISP network measurements. IP and TCP packet headers include IP source and destination address, IP packet size, TCP source and destination port numbers, and TCP sequence-number information for data and acknowledgment packets.

Given this trace format, you can easily construct customized scripts to process a trace file and extract the desired information, such as timestamp, packet size, and IP and TCP protocol information. In this example, we used offline trace analyses to study ISP Web traffic characteristics.1
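To illustrate the kind of post-processing script this format invites, here is a minimal Python sketch that tallies per-host packet and byte counts plus the packet-size distribution. It is a sketch under stated assumptions: the whitespace-separated field order mirrors Figure 1, and trace.txt is a hypothetical file name; a given decoder's exact output layout may differ.

```python
from collections import Counter

# Minimal sketch: summarize a trace in the ASCII format of Figure 1.
# Assumes whitespace-separated fields in the order shown there:
# timestamp, two protocol tokens ("IP TCP"), source IP, source port,
# destination IP, destination port, IP packet size, TCP seq, TCP ack.
# "trace.txt" is a hypothetical file name.

pkts_by_host = Counter()    # packets sent, keyed by source IP
bytes_by_host = Counter()   # bytes sent, keyed by source IP
size_dist = Counter()       # packet-size distribution

with open("trace.txt") as trace:
    for line in trace:
        fields = line.split()
        if len(fields) != 10 or not fields[0].isdigit():
            continue        # skip the header row and malformed lines
        src, size = fields[3], int(fields[7])
        pkts_by_host[src] += 1
        bytes_by_host[src] += size
        size_dist[size] += 1

print("Busiest sources:", pkts_by_host.most_common(3))
print("Most common packet sizes:", size_dist.most_common(3))
```

Variations on this skeleton (keying on port numbers, or binning by timestamp) cover many of the offline analyses described in this article.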
Software Approaches

Software-based measurement tools typically modify the kernel of a commodity workstation's network interface to give it packet-capture capability. One widely used tool is tcpdump, which uses the Berkeley Packet Filter architecture to capture TCP/IP packets. Tcpdump lets you capture a network's IP packets and filter the captured traffic streams based on specific host addresses, port numbers, or protocol types. Tcpdump is widely used to study Internet applications and growth trends in Internet traffic over time.2 Figure 2 shows an example of a tcpdump trace, which includes a timestamp for each packet and the IP and TCP headers, which carry address and control information. In post-processing, you can extract application-level behaviors, such as the Web document transfer shown in the figure.

19:52.731470 406.17.8.12.64826 > 723.65.19.6.www:   S 4256930:4256930(0)
19:52.731889 723.65.19.6.www > 406.17.8.12.64826:   S 768500:768500(0) ack 4256931
19:52.732200 406.17.8.12.64826 > 723.65.19.6.www:   . ack 768501 win 17520
19:52.738205 406.17.8.12.64826 > 723.65.19.6.www:   P 4256931:4257101(170) ack 768501
19:52.743248 723.65.19.6.www > 406.17.8.12.64826:   P 768501:769840(1339) ack 4257101
19:52.758535 406.17.8.12.64826 > 723.65.19.6.www:   F 4257101:4257101(0) ack 769840
19:52.758862 723.65.19.6.www > 406.17.8.12.64826:   . ack 4257102
19:52.759700 723.65.19.6.www > 406.17.8.12.64826:   F 769840:769840(0) ack 4257102
19:52.759935 406.17.8.12.64826 > 723.65.19.6.www:   . ack 769841

Figure 2. A tcpdump packet trace file. This example shows a Web document transfer, including a timestamp for each packet and the IP and TCP headers.

Another software-based approach relies on the access logs recorded by Web servers and proxies. These logs record each client request for Web site content, including the time of day, client IP address, URL requested, and document size. Post-processing access logs can offer useful insight into Web server workloads3 without having to collect detailed network-level packet traces.
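As a sketch of this log-based approach, the following Python fragment tallies the most-requested URLs and the total bytes transferred from a server access log. It assumes the widely used Common Log Format; servers can be configured to log differently, and access.log is a hypothetical file name.

```python
from collections import Counter

# Minimal sketch: summarize a Web server access log in Common Log
# Format: host ident user [timestamp] "request" status bytes
# "access.log" is a hypothetical file name.

requests = Counter()   # request count per URL
total_bytes = 0

with open("access.log") as log:
    for line in log:
        try:
            request = line.split('"')[1]     # e.g. GET /index.html HTTP/1.0
            url = request.split()[1]
            size = line.rsplit(None, 1)[-1]  # last field: response bytes
        except IndexError:
            continue                         # skip malformed lines
        requests[url] += 1
        if size.isdigit():
            total_bytes += int(size)

print("Most requested URLs:", requests.most_common(5))
print("Total bytes transferred:", total_bytes)
```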
Measurement Methods

Four major axes characterize the measurement approaches that researchers use to study network behavior generally and the Internet specifically.

Passive versus Active Measurement

A passive network monitor records packet traffic on a network without creating additional traffic. Most network measurement tools fall into this category. An active approach uses packets that the measurement device itself generates to probe the Internet and measure its characteristics. Examples of this approach include

■ the ping utility, which estimates network latency to a particular Internet destination;
■ the traceroute utility, which determines Internet routing paths; and
■ the pathchar tool, which estimates link capacities and latencies along an Internet path.
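A minimal active-measurement sketch in the same spirit: timing a TCP connection handshake yields a rough round-trip latency estimate to a destination. This approximates what ping measures rather than reimplementing it (ICMP probes need raw sockets and privileges), and the target host is an arbitrary example.

```python
import socket
import time

# Active-measurement sketch: estimate round-trip latency by timing
# TCP connection setup (one handshake is roughly one round trip).
# An approximation in the spirit of ping, not ping itself.

def tcp_rtt(host: str, port: int = 80, timeout: float = 2.0) -> float:
    """Return seconds to complete a TCP handshake with host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start

if __name__ == "__main__":
    host = "example.com"   # arbitrary probe target
    samples = [tcp_rtt(host) for _ in range(5)]
    print(f"min/avg RTT to {host}: "
          f"{min(samples)*1e3:.1f}/{sum(samples)/len(samples)*1e3:.1f} ms")
```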
Online versus Offline Analysis

Some network traffic analyzers support real-time data collection and analysis, often with graphical displays of live traffic data. Most hardware-based analyzers support this feature. Other measurement tools, such as tcpdump, are intended only for real-time data collection and storage. Once the tool collects and stores the traffic data, you can analyze it offline.

LAN versus WAN Measurement

Early network traffic research focused on local-area network (LAN) environments, such as Ethernet LANs. LANs are easier to measure than wide-area networks (WANs) for two main reasons. First, a LAN is typically administered by a single well-known organization, so obtaining security clearance for traffic analysis is relatively straightforward. Second, an Ethernet LAN broadcasts so that all hosts see all packets. To measure traffic in this context, you simply configure a network interface into promiscuous mode; that is, the interface receives and records (rather than ignores) packets destined for other network hosts.

Researchers later extended measurement work to WAN environments.2,4,5 These environments present challenges in administrative control of the network, including security and privacy. Organizations with a single Internet access point can install measurement devices inline, on an Internet link near the organization's default router. Recently, Barford and Crovella discussed deploying a wide-area Web measurement infrastructure that collects simultaneous measurements of client, server, and network behaviors.6 By coordinating time between these measurements, it's possible to achieve a more complete picture of end-to-end network performance.

Protocol-Level Analysis

Measurement tools collect data and analyze traffic at different protocol levels. Many network traffic analyzers support multilayer protocol analysis, but require a specialized network card for each network type. For example, specialized network cards exist for Ethernet, Frame Relay, ATM, and wireless networks, but IP and higher-layer protocols can use the same back-end protocol analysis engines.

Research Highlights

The past 15 years of Internet traffic measurement have produced key observations; I've selected 10 that I think best summarize and highlight this research.

1. Internet traffic continues to change. Longitudinal studies show that Internet traffic continues to grow and change over relatively short time scales.2 This change is not simply one of traffic volume, but also of traffic mix, protocols, applications, and users. Despite the value of Internet traffic measurement as a research methodology, any data set collected from an operational network represents but one snapshot at one point in time in the Internet's evolution. Trying to identify invariants in traffic structure is one way to cope with the unending battle of measuring and understanding Internet traffic.

2. Aggregate network traffic is multifractal. Characterizing aggregate network traffic is difficult for many reasons:

■ the Internet's heterogeneous nature,
■ network application diversity,
■ variable link speeds and network-access technologies, and
■ changing user behaviors.

Nevertheless, networking researchers have identified a significant degree of long-range dependence (LRD) in network traffic, which they refer to as "self-similar," "fractal," or "multifractal" behavior.7,8 This LRD property appears to be ubiquitous; it is present in LAN, WAN, video, data, Web, ATM, Frame Relay, and SS7 signaling traffic. Researchers attribute this LRD property in part to users' heavy-tailed on-off behaviors, which are perhaps exacerbated by the Internet's TCP/IP protocols.9 More recent research addresses Internet traffic's "non-stationarity" and suggests that the multifractal traffic structure evident at a large network's edges diminishes within the core.10 Despite Internet traffic's complex multifractal structure, researchers have developed surprisingly concise mathematical models to characterize and analyze it, with the aim of improving Internet infrastructure design.
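One of the simpler diagnostics behind such findings is the variance-time method: aggregate a per-interval packet-count series over growing block sizes and fit the slope of log variance versus log block size, which equals 2H - 2 for Hurst parameter H. The Python sketch below runs on synthetic independent noise purely so it is self-contained; that input should yield H near 0.5, whereas self-similar traffic pushes H toward 1.

```python
import math
import random

# Variance-time sketch for detecting long-range dependence (LRD):
# aggregate the series over blocks of size m and watch how the
# variance of the block means decays. Short-range-dependent traffic
# gives a log-log slope near -1 (H = 0.5); shallower slopes mean LRD.

def block_means(xs, m):
    return [sum(xs[i:i + m]) / m for i in range(0, len(xs) - m + 1, m)]

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def hurst_estimate(xs, block_sizes=(1, 2, 4, 8, 16, 32, 64)):
    pts = [(math.log(m), math.log(variance(block_means(xs, m))))
           for m in block_sizes]
    # least-squares slope of log Var(X^(m)) versus log m
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    beta = (sum((x - mx) * (y - my) for x, y in pts)
            / sum((x - mx) ** 2 for x, _ in pts))
    return 1 + beta / 2          # slope beta = 2H - 2

series = [random.random() for _ in range(10000)]  # stand-in for counts
print(f"Estimated Hurst parameter: {hurst_estimate(series):.2f}")  # ~0.5
```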
3. Network traffic exhibits locality properties. Network traffic structure is far from random. Traffic structure is imposed implicitly by users' application-layer tasks (such as file transfers or Web page downloads), and is reinforced by the TCP/IP data transfer protocols. Packets are not independent and isolated entities; rather, they are part of a logical information flow in the higher protocol layers. This flow manifests at the network layer in recognizable, though not necessarily predictable, patterns of packet timing and source and destination addresses. This structure is often referred to in terms of temporal locality (time-based correlation of information) or spatial locality (geography-based correlation).

4. Packet traffic is distributed nonuniformly. Analysis of TCP/IP packets' source and destination addresses typically shows that the distribution of packet traffic among hosts is highly nonuniform. A common observation is that 10 percent of hosts account for 90 percent of traffic. In some sense, this observation is not surprising, given the client-server paradigm for many network applications. However, the presence of this property in many network measurement studies suggests a fundamental power-law structure in many aspects of Internet traffic3,9,11 and even in certain aspects of Internet topology.12

5. Packet sizes are distributed bimodally. The sizes (in bytes) of the network packets traversing the Internet have a "spiky" distribution.4 About half the packets carry the maximum number of data bytes permitted by the maximum transmission unit (MTU) parameter defined for a network interface. About 40 percent of packets are small (40 bytes) because of the prevalence of (header-only) TCP acknowledgment packets for data received. The remaining 10 percent of packets are somewhat randomly scattered between the two extremes, based on how much user data remains in the "last" packet of a multipacket transfer. Occasionally, secondary spikes occur in the distribution due to IP fragmentation between networks with different MTU sizes.

6. The packet arrival process is bursty. Much of the classical work in queuing theory and communication network design is based on the assumption that the packet arrival process is a Poisson process. In simple terms, a Poisson arrival process means that events (such as earthquakes, telephone calls, and, in this case, packet arrivals) occur independently at random times, with a well-defined average rate. More formally, the interarrival times between events in a Poisson process are exponentially distributed and independent, and no two events happen at exactly the same time. Poisson models are attractive mathematically because the exponential distribution has a "memoryless" property: even if we know the time elapsed since the last event, we have no hint when the next event will occur. Poisson models are often amenable to elegant mathematical analysis, leading, for example, to closed-form expressions for the mean waiting time (and variance) in queuing network models.

Detailed studies of Internet traffic show that the packet arrival process is bursty, rather than Poisson. That is, rather than having independent and exponentially distributed interarrival times, Internet packets arrive in clumps.13 This bursty structure is due in part to the data transmission protocols. The result is that queuing behavior can be much more variable than that predicted by a Poisson model. Given this finding, the value of the simple (Poisson) network traffic models used in network performance studies is doubtful. This realization has motivated recent research on network traffic modeling.5
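A quick empirical check follows from this definition: for exponential interarrival times, the coefficient of variation (standard deviation divided by mean) equals 1, so values well above 1 point to clumped, bursty arrivals. In the Python sketch below, the synthetic Poisson arrivals are stand-ins to keep the example self-contained; in practice you would feed it timestamps extracted from a packet trace.

```python
import math
import random

# Burstiness check: for a Poisson arrival process, interarrival times
# are exponential, so their coefficient of variation (CV) is 1.
# CV well above 1 suggests clumped, bursty arrivals.

def interarrival_cv(timestamps):
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return math.sqrt(var) / mean

# Stand-in data: synthetic Poisson arrivals at 100 packets/second.
t, arrivals = 0.0, []
for _ in range(10000):
    t += random.expovariate(100.0)
    arrivals.append(t)

print(f"Interarrival CV: {interarrival_cv(arrivals):.2f}")  # ~1.0 here
```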
7. The session arrival process is Poisson. Although the packet arrival process is not Poisson, there is strong evidence that the session arrival process is Poisson. That is, Internet users seem to operate independently and at random when initiating access to certain Internet resources. This observation has been noted for several network applications. For example, in their studies of telnet traffic, Paxson and Floyd found that a Poisson process effectively models the session arrival process when they use a time-varying rate (such as hourly).13 Similarly, we found that a Poisson arrival process is effective for modeling user requests for individual Web pages on a Web server.3

8. Most TCP conversations are brief. In a 1991 study, more than 90 percent of TCP conversations exchanged less than 10 Kbytes of data and lasted only two or three seconds at most.4 This prevalence of short-lived connections was somewhat surprising at the time, particularly for file transfer and remote login applications. However, the Web's advent has significantly reinforced this conversation paradigm. The literature suggests that approximately 80 percent of Web document transfers are less than 10 Kbytes, though the distribution has a significant heavy tail.3,9

9. Traffic flows are bi-directional, but often asymmetric. Many Internet applications generate a bi-directional data exchange, though the data volume sent in each direction often differs greatly. This observation was true in the early 1990s (for example, see Caceres et al.4 and Paxson5), and it is even truer today because of the Web's download-intensive nature. We don't yet know how large-scale peer-to-peer networking paradigms (such as Napster and grids) will impact Internet traffic asymmetry.

10. TCP accounts for most Internet traffic. Since the early 1990s, TCP has dominated Internet packet traffic,4,14 and it will likely continue to do so for the foreseeable future. The primary reason is the Web's advent. Because the Web relies on TCP for reliable data transfer, the growing number of Internet users, the widespread availability of easy-to-use Web browsers, and the proliferation of Web sites with rich multimedia content have combined to create exponential growth in TCP traffic. Although Web caching and content distribution networks have softened TCP's impact (for examples, see Breslau et al.11 and my forthcoming article15), its overall growth is still dramatic. That said, several recent (and popular) Internet applications, including video streaming, Napster, IP telephony, and multicast, rely predominantly on the user datagram protocol (UDP), and might gradually shift the traffic balance away from TCP.

Conclusion

Network measurement research has grown in scope and magnitude to match the Internet. Recent initiatives (see the "Traffic Measurement Resources" sidebar below) are striving to provide a practical and scalable infrastructure for wide-scale operational measurement of today's Internet. Among the challenges ahead are establishing an adequate measurement infrastructure across heterogeneous, multivendor networks and ensuring that the infrastructure scales with increasing traffic volumes and network speeds.

Traffic Measurement Resources

■ Internet Traffic Archive (ITA) is a public-domain repository of traces and data sets collected by networking researchers. http://ita.ee.lbl.gov
■ Internet Traffic Report (ITR) offers hourly statistics on global Internet traffic trends. http://www.InternetTrafficReport.com
■ National Laboratory for Applied Network Research (NLANR) is a U.S.-based initiative on high-performance networking. http://nlanr.net/
■ NLANR Measurement and Operations Analysis Team (MOAT) offers online Internet traffic statistics, traces, and tools from an NLANR subgroup specializing in Internet traffic measurement. http://moat.nlanr.net
■ National Internet Measurement Infrastructure (NIMI) is an NLANR initiative that provides ubiquitous measurement capability for Internet traffic, topology, routing, and quality of service. http://www.ncne.nlanr.net/nimi/
■ Tcpdump is public-domain software for collecting network-level packet traces. http://www.tcpdump.org/

(All current as of Oct. 2001)

References

1. R. Epsilon, J. Ke, and C. Williamson, "Analysis of ISP IP/ATM Network Traffic Measurements," ACM Performance Evaluation Review, vol. 27, no. 2, Sept. 1999, pp. 15-24.
2. V. Paxson, "Growth Trends in Wide Area TCP Connections," IEEE Network, vol. 8, no. 4, July-Aug. 1994, pp. 8-17.
3. M. Arlitt and C. Williamson, "Internet Web Servers: Workload Characterization and Performance Implications," IEEE/ACM Trans. Networking, vol. 5, no. 5, Oct. 1997, pp. 815-826.
4. R. Caceres et al., "Characteristics of Wide-Area TCP/IP Conversations," Proc. ACM Special Interest Group Data Comm. (SIGCOMM'91), ACM Press, New York, 1991, pp. 101-112.
5. V. Paxson, "Empirically-Derived Analytic Models of Wide-Area TCP Connections," IEEE/ACM Trans. Networking, vol. 2, no. 4, Aug. 1994, pp. 316-336.
6. P. Barford and M. Crovella, "Measuring Web Performance in the Wide Area," ACM Performance Evaluation Review, vol. 27, no. 2, Sept. 1999, pp. 37-48.
7. A. Feldmann, A. Gilbert, and W. Willinger, "Data Networks as Cascades: Explaining the Multi-Fractal Nature of Internet Traffic," Proc. ACM Special Interest Group Data Comm. (SIGCOMM'98), ACM Press, New York, 1998, pp. 42-55.
8. W. Leland et al., "On the Self-Similar Nature of Ethernet Traffic (Extended Version)," IEEE/ACM Trans. Networking, vol. 2, no. 1, Feb. 1994, pp. 1-15.
9. M. Crovella and A. Bestavros, "Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes," IEEE/ACM Trans. Networking, vol. 5, no. 6, Dec. 1997, pp. 835-846.
10. J. Cao et al., "On the Nonstationarity of Internet Traffic," Proc. ACM Special Interest Group Metrics (SIGMETRICS'01), ACM Press, New York, 2001, pp. 102-112.
11. L. Breslau et al., "Web Caching and Zipf-Like Distributions: Evidence and Implications," Proc. Int'l Joint Conf. IEEE Computer and Comm. Societies (IEEE Infocom 99), IEEE Computer Soc. Press, Los Alamitos, Calif., 1999, pp. 126-134.
12. M. Faloutsos, P. Faloutsos, and C. Faloutsos, "On Power-Law Relationships of the Internet Topology," Proc. ACM Special Interest Group Data Comm. (SIGCOMM'99), ACM Press, New York, 1999, pp. 251-262.
13. V. Paxson and S. Floyd, "Wide-Area Traffic: The Failure of Poisson Modeling," IEEE/ACM Trans. Networking, vol. 3, no. 3, June 1995, pp. 226-244.
14. K. Thompson, G. Miller, and R. Wilder, "Wide-Area Internet Traffic Patterns and Characteristics," IEEE Network, vol. 11, no. 6, Nov.-Dec. 1997, pp. 10-23.
15. C. Williamson, "On Filter Effects in Web Caching Hierarchies," to be published in ACM Trans. Internet Technology, vol. 2, no. 1, Feb. 2002.

Carey Williamson is a professor in the Department of Computer Science at the University of Calgary in Calgary, Alberta, Canada, where he holds an iCORE senior research fellowship in broadband wireless networks, applications, protocols, and performance. His research interests include Internet protocol performance, network traffic measurement, and network simulation. Readers can contact Williamson via e-mail at carey@cpsc.ucalgary.ca.