Journal of Systems Architecture 55 (2009) 25–42
Protocol offload analysis by simulation
Andrés Ortiz a,*, Julio Ortega b, Antonio F. Díaz b, Pablo Cascón b, Alberto Prieto b
a Department of Communications Engineering, University of Malaga, Spain
b Department of Computer Architecture and Technology, University of Granada, Spain
* Corresponding author. Tel.: +34 952 13 41 66; fax: +34 952 13 20 27. E-mail address: [email protected] (A. Ortiz).
Article info
Article history:
Received 27 May 2007
Received in revised form 24 March 2008
Accepted 17 July 2008
Available online 7 August 2008
Keywords:
Full-system simulation
HDL simulation
LAWS model
Protocol offloading
Network interfaces
Simics
Abstract
In recent years, diverse network interface designs have been proposed to cope with the link bandwidth increase that is shifting the communication bottleneck towards the nodes in the network. The main point behind some of these network interfaces is to reach an efficient distribution of the communication overheads among the different processing units of the node, thus leaving more host CPU cycles for the applications and other operating system tasks. Among these proposals, protocol offloading searches for an efficient use of the processing elements in the network interface card (NIC) to free the host CPU from network processing. The lack of both conclusive experimental results about the possible benefits and a deep understanding of the behavior of these alternatives in their different parameter spaces has caused some controversy about the usefulness of this technique.
The contributions of this paper deal with the implementation and evaluation of offloading strategies and with the need for accurate tools for researching computer system issues that, like networking, require the analysis of interactions among applications, operating system, and hardware. Thus, in this paper, a way to include timing models in a full-system simulator (Simics) to provide a suitable tool for network subsystem simulation is proposed.
Moreover, we compare two kinds of simulators, a hardware description language level simulator and a full-system simulator (including our proposed timing models), in the analysis of protocol offloading at different levels. We also explain the results obtained from the perspective of the previously described LAWS model and propose some changes in this model to get a more accurate fit to the experimental results. From these results, it is possible to conclude that offloading allows a relevant throughput improvement in some circumstances that can be qualitatively predicted by the LAWS model.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction
The rate of network bandwidth improvement seems to double every 9–12 months, as stated by Gilder's law [3]. This trend implies that network technologies have outstripped Moore's law, commonly used to predict the improvement in microprocessor performance (transistor density is usually correlated with processor performance). For example, from 1995 to 2002, Ethernet has shown a hundred-fold improvement, from 100 Mbps to 10 Gbps [6]. Through an OC-192 link, about 19,440 64-KB packets per second could be received, with 51.2 μs between consecutive packet arrivals. This implies that a (not yet available) 100–200 GIPS processor would be required, since approximately 5000–10,000 instructions are needed to process a packet [12]. In this way, the network nodes would become the main bottlenecks in the communication path. Moreover, communication processing includes I/O bus transfers, interrupts, cache misses, and other overheads that do not
scale well with faster processors [30]. Therefore, an adequate network interface (NI) implementation that reduces all those poorly scaling operations and other overheads related to context switching and multiple data copies is becoming decisive for the overall communication path performance.
Much research work has been carried out trying to improve the
communication performance in servers that use commodity networks and generic protocols such as TCP/IP. This research can be
classified into two complementary alternatives. One of these alternatives searches for the reduction of the software overhead in the processing of the communication protocols, either by optimizing the TCP/IP layers or by proposing new and lighter protocols. These new protocols usually fall into one of two types: protocols that optimize the operating system communication support, such as GAMMA [10] or CLIC [15], and user-level network interfaces [4], such as the VIA (virtual interface architecture) standard [46].
The other research alternative in this field tries to take advantage of other processors included in the system. For example, [39] proposes the use of one or more nodes of a cluster (the so-called TCP servers) for network processing, while the other nodes
run the application and the OS functions not related to network
processing. Moreover, the use of dedicated processors for network processing, either in an SMP [36,47] or in a multi-core microprocessor [40,51], has also been proposed. This last technique, also called
onloading, is one of the features of the Intel I/O acceleration technology [24].
The network interface card (NIC) is the hardware that provides
the physical access to the network by usually including a low-level
addressing system (the MAC addresses) and the functions of physical and data link layers. In the past 10 years, some network acceleration features have been included in the NICs (mainly in Ethernet
NICs) [17,25]. Thus, almost all NICs for Gigabit/s and 10 Gigabit/s
Ethernets can determine and check the TCP/IP checksums. Usually,
they also implement strategies to reduce the interrupt frequency
by generating one interrupt request for multiple packets sent or
received instead of one request per packet (interrupt coalescing)
[4]. Other common features of NICs are header splitting [37,50], which places protocol headers and payloads in separate buffers, and jumbo frames [2], which are frames larger than the standard Ethernet maximum of 1500 bytes (up to 9000 bytes), used to reduce the per-frame processing overhead. Besides these features,
currently, many NICs include programmable processors. These
Intelligent NICs (INICs) are frequent in the interconnection networks in current cluster-based computing systems, and much research has been done towards the use of these processors to
offload network processing from the host CPU [27]. This way, the CPU is freed from communication overhead and a faster implementation of more flexible communication systems is possible. The TCP
offload engines (TOEs) are examples of NICs following this alternative [6,8,9,13,18,32]. Moreover, inside this trend, it is possible to include the use of network processors (NP) [7,21,35,38,42],
programmable circuits specially suited for fast network processing.
Besides specific hardware for implementing operations, such as
CRC processing, that are frequent in the communication functions,
the NPs also include several processing elements that usually
implement multithreading to tolerate the memory access latencies
[14,19,20].
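For illustration, the following sketch shows the interrupt-coalescing policy mentioned above: the NIC raises a single interrupt either after a configurable number of packets has accumulated or after a timeout, instead of one interrupt per packet. This is our own simplified example; the class name, the packet threshold, and the timeout are hypothetical and do not correspond to any particular NIC or driver.

# Illustrative sketch of NIC interrupt coalescing (hypothetical parameters,
# not the behaviour of any specific NIC or driver).
class InterruptCoalescer:
    def __init__(self, max_packets=8, timeout_us=100.0):
        self.max_packets = max_packets    # raise an IRQ after this many packets...
        self.timeout_us = timeout_us      # ...or after this much time has elapsed
        self.pending = 0
        self.first_arrival = None
        self.interrupts_raised = 0

    def on_packet(self, now_us):
        """Called by the NIC receive path for every packet arrival."""
        if self.pending == 0:
            self.first_arrival = now_us
        self.pending += 1
        if (self.pending >= self.max_packets or
                now_us - self.first_arrival >= self.timeout_us):
            self._raise_irq()

    def _raise_irq(self):
        self.interrupts_raised += 1       # one IRQ covers all pending packets
        self.pending = 0
        self.first_arrival = None

# Example: 1000 packets arriving every 10 us generate far fewer interrupts.
coalescer = InterruptCoalescer()
for i in range(1000):
    coalescer.on_packet(now_us=i * 10.0)
print(coalescer.interrupts_raised)        # 125 instead of 1000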
There are some advantages that offloading the communication
functions could provide:
– As the CPU does not have to process the communication protocols, the availability of CPU cycles for the applications increases.
The overlap between communication and computation also
increases.
– As it implements the communication protocols, the network interface card can directly interact with the network without CPU involvement. This has two important consequences:
(a) the protocol latency can be reduced as short messages, such
as the ACKs, do not need to be transferred across the I/O bus
that connects the NIC to the main memory through the chipset;
(b) the CPU has to process fewer interrupts (and the associated context changes) to attend to the received messages.
– It is possible to improve the efficiency of the DMA transfers
from the NIC if the short messages are assembled to generate
fewer DMA transfers.
– As protocol offloading can contribute to reduce the traffic on the
I/O bus, the communication performance can be improved
because the bus contention is reduced: the I/O bus is used to
exchange commands between the CPU and the NICs and for
DMA data transfers between the main memory and the NIC.
– The use of a programmable NIC with specific resources to
exploit different levels of parallelism could improve the efficiency in the processing of the communication protocols. Thus,
it would make dynamic protocol management possible, in order to use the most adequate protocol (according to the data to be communicated and the destination) to build the message.
Thus, by offloading, a distribution of the communication tasks
among the different elements of the host, particularly between
the host CPU and the processor in the NIC, is provided. The communication tasks that imply interactions with the network can
be implemented in the NIC in order to leave more CPU cycles for
the computation work required by the applications. When the
CPU needs to send or receive data through the network, it writes them to or reads them from the main memory, where the NIC, in turn, reads or writes them. This way, protocol offloading should be seen
as a technique that enables both the parallelization of the network
communication work and the direct data placement on the main
memory, thus avoiding some communication overheads rather
than only shifting them to the NIC [42].
However, some works [10,12,31,37] criticize protocol offloading
and provide experimental results to argue that TCP offloading does
not clearly benefit the communication performance. Nevertheless,
there are other works that demonstrate the benefits of TCP offloading. For example, in [50] an experimental study is carried out based
on the emulation of a NIC connected to the I/O bus and controlled by
one of the CPUs in the SMP. The results show improvements from
600% to 900% in the TCP-emulated offload. Moreover, in [16], counterarguments to the TCP offloading criticism of [33] are provided.
On the one hand, the reasons for the scepticism about offloading benefits are the difficulties in the implementation, debugging,
quality assurance, and management of the offloaded protocols
[33]. The communication between the NIC (with the offloaded protocol) and the CPU and the API could be as complex as the protocol
to be offloaded [13] (cited in [33]). Protocol offloading requires the
coordination between the NIC and the OS for a correct management of resources such as buffers, port numbers, etc. In the case of protocols such as TCP, the control of the buffers is complicated and could hamper the offloading benefits (for example, the TCP buffers must be held until they are acknowledged or while they are pending reassembly). Moreover, the inefficiency of short TCP connections is due to the overhead of processing the events that are visible to the application, which cannot be avoided by protocol offloading [33]. These are not definitive arguments against the usefulness of offloading, but they counterbalance the possible benefits. In any case, this means that an efficient host/NIC interface for offloading is one of the main issues in taking advantage of this technique [16].
On the other hand, there are fundamental reasons that affect
the possible offloading advantages. One of them is the ratio of host
CPU speed to NIC processing speed. The CPU speed is usually higher than the speed of the processors in the NIC and, moreover, the
increase in CPU speed according to Moore's law tends to maintain or even to increase this ratio in the case of the special-purpose processors in the NIC. Thus, the part of the protocol that
is offloaded would require more execution time in the NIC than
in the CPU, and the NIC could become the communication bottleneck. The use of general-purpose processors in the NIC (with
speeds similar to the CPU) could represent a bad compromise between performance and cost [11]. Moreover, the limitations in
the resources (memory) which are available in the NIC could imply
restrictions in the system scalability (for example, limitations in
the size of the IP routing table). According to these arguments, it
is clear that NIC processing and memory capabilities are important
issues. Nevertheless, faster CPUs are not enough to avoid the effect of the operations that prevent performance from scaling with processor speed [15,22].
The problems of offloading are clearly apparent in the use of the TCP protocol either in WAN applications (such as FTP and e-mail) or in LAN applications that require low bandwidth (such as Telnet). In these cases, the overheads of connection management are the most important and the most difficult to avoid by protocol offloading. In this way, [33] concludes that offloading is more
adequate in applications requiring high bandwidths, low latencies, and long-term connections. RDMA (remote direct memory access) is an example where protocol offloading can be efficient. RDMA is a protocol that allows packet transfer to the right memory buffer, thus providing an adequate procedure for zero-copy operation. As the RDMA component DDP (direct data placement) requires an early de-multiplexing of the input packets, its implementation in the NIC (together with the TCP protocol below it) could be advantageous.
Thus, it should be understood why the benefits of protocol offload are so elusive and difficult to predict [42]. It is clear that the
system communication performance depends on many factors,
from the application computation/communication profile, to the
interactions between operating system, application, and hardware.
In particular, the detailed profile of memory accesses for a given
network application is difficult to evaluate and take into account
to predict performance. The goal of this paper, which is an extension of our conference papers [33,34], is to use simulation at different levels to get insight into the offloading effects.
Section 2 describes the LAWS model [42], which has been recently proposed to predict the offloading effects. Then, Section 3
uses an HDL (hardware description language) simulator to analyze
the behavior of the different elements in the communication path
in order to understand their role in the communication performance either without offloading or with different offloading alternatives. In this section, together with the CPU, the NIC, and the
network links, we also model the effect of the buses, the bridge,
and the main memory. Section 4 considers the use of Simics, a
full-system simulator that allows a detailed simulation of hardware, application software, and operating system, and Section 5
compares the experimental results obtained by Simics with those
predicted by LAWS and proposes some modifications to this model
in order to improve its accuracy. Finally, Section 6 provides the conclusions of the paper and states the questions that remain to be considered in future work.
[Fig. 1. A view of the LAWS model before offloading (a) and after offloading (b), and behavior of the peak throughput improvement according to the LAWS model (c). Panel (a): host (CPU + memory) and network stages, with THost = (aX + oX)m, TNetwork = m/B, and Bbefore = min{B, 1/(aX + oX)}. Panel (b): CPU, NIC, and network stages, with TCPU = (aX + (1 − p)oX)m, TNIC = poYβm, TNetwork = m/B, and Bafter = min{B, 1/(aX + (1 − p)oX), 1/(poYβ)}. Panel (c): % improvement on peak throughput vs. γ (compute intensity); the curve rises as ((γ + 1)/c) − 1, peaks at the best improvement point (c, 1/c), and falls as 1/γ, where c = max(αβ, σ).]
2. A model to estimate offloading performance
Some papers [17,39,44,49] have recently appeared that study the fundamental principles of offloading behind the experimental results. The paper [42] introduces the LAWS model to characterize the protocol offloading benefits in Internet services and streaming data applications. In [19], the EMO (extensible message-oriented offload) model is proposed to analyze the performance of various offload strategies for message-oriented protocols. In this paper, we
have used the LAWS model.
The LAWS model [42] gives an estimate of the peak throughput of the pipelined communication path according to the throughput provided by the corresponding bottleneck (the link, the NIC, or the host CPU). The model only covers applications that are throughput limited (such as Internet servers) and fully pipelined [23,46], for which the parameters used by the model (CPU occupancy for communication overhead and for application processing, occupancy scale factors for host and NIC processing, etc.) can be accurately known. The analysis provided in [42] considers that the performance is CPU limited before applying protocol offload, as offloading yields no improvement otherwise.
Fig. 1 explains how the LAWS model views the system before
and after offloading. The notation which is used is similar to that
of [42]. Before offloading (Fig. 1a), the system is considered as a
pipeline with two stages, the host and the network. In the host,
to transfer m bits, the application processing causes a CPU work
equal to aXm and the communication processing produces a CPU
work oXm. In these processing delays, a and o are the amount of
CPU work per data unit, and X is a scaling parameter used to take
into account variations in processing power with respect to a
reference host. Moreover, m/B is the latency to provide the m bits
through a network link with bandwidth B.
Thus, as the peak throughput provided before offloading is determined by the bottleneck stage, we have Bbefore = min(B, 1/(aX + oX)). After offloading, we have a pipeline with three stages (Fig. 1b), and a portion p of the communication overhead has been transferred to the NIC. In this way, the latencies in the stages for transferring m bits are m/B for the network link, aXm + (1 − p)oXm for the CPU stage, and poYβm for the NIC stage. In the expression for the NIC latency, Y is a scaling parameter to take into account the difference in processing power with respect to a reference, and β is a parameter that quantifies the improvement in the communication overhead that could be reached with offloading, i.e. βo is the normalized overhead that remains in the system after offloading when p = 1 (full offloading) [42].
In this way, after offloading, the peak throughput is Bafter = min(B, 1/(aX + (1 − p)oX), 1/(poYβ)) and the relative improvement in peak throughput is defined as δb = (Bafter − Bbefore)/Bbefore. The LAWS acronym comes from the parameters used to characterize the offloading benefits. Besides the parameter β (structural ratio), we have the parameters α = Y/X (lag ratio), which considers the ratio of the host CPU speed to the NIC computing speed; γ = a/o (application ratio), which, for the given application, measures the ratio of computation cost to communication cost; and σ = 1/(oXB) (wire ratio), which corresponds to the portion of the network bandwidth that the host can provide before offloading. In terms of the parameters
α, β, γ, and σ, the relative peak throughput improvement can be expressed as:

δb = [min(1/σ, 1/(γ + (1 − p)), 1/(pαβ)) − min(1/σ, 1/(1 + γ))] / min(1/σ, 1/(1 + γ))    (1)
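As a quick check of Eq. (1), the following short sketch (our own illustration, with arbitrary parameter values) computes the relative peak throughput improvement δb from the four LAWS ratios and the offloaded fraction p; it reproduces the behavior discussed below, peaking near γ = max(αβ, σ).

# Relative peak throughput improvement predicted by the LAWS model (Eq. (1)).
# alpha: lag ratio (host CPU speed / NIC speed), beta: structural ratio,
# gamma: application ratio (computation/communication), sigma: wire ratio.
def laws_improvement(alpha, beta, gamma, sigma, p=1.0):
    b_before = min(1.0 / sigma, 1.0 / (1.0 + gamma))
    b_after = min(1.0 / sigma,
                  1.0 / (gamma + (1.0 - p)),
                  1.0 / (p * alpha * beta))
    return (b_after - b_before) / b_before

# Illustrative values: a host-limited system (sigma < 1), a NIC as fast as the
# host (alpha = 1), and some structural improvement (beta = 0.5); the
# improvement peaks near gamma = max(alpha * beta, sigma).
for gamma in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(gamma, round(laws_improvement(alpha=1.0, beta=0.5, gamma=gamma, sigma=0.5), 3))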
From the LAWS model, some conclusions can be derived in terms of simple relationships among the LAWS ratios (see Fig. 1c, obtained from (1)):
– Protocol offloading provides an improvement that grows linearly in applications with a low computation/communication rate (low γ). This profile corresponds to streaming data processing applications, network storage servers with a large number of disks, etc. In the case of CPU-intensive applications, the throughput improvement reached by offloading is bounded by 1/γ and goes to zero as the computation cost increases (i.e. γ grows). The best improvement is obtained for γ = c = max(αβ, σ). Moreover, as the slope of the improvement function ((γ + 1)/c) − 1 is 1/c and c = max(αβ, σ), the throughput improvement grows faster as αβ and σ decrease.
– Protocol offloading may reduce the communication throughput (negative improvement, δb < 0) if the function ((γ + 1)/c) − 1 takes negative values. This means that γ < (c − 1) and, as γ > 0 and σ < 1, it must hold that c = αβ and αβ > 1. Thus, if the NIC speed is lower than the CPU speed (α > 1), offloading may reduce performance if the NIC saturates before the network link, i.e. αβ > σ, as the improvement is bounded by 1/α (whenever β = 1). Nevertheless, if an efficient offload implementation (for example, by using direct data placement techniques) allows structural improvements (such that β < 1) that make αβ < 1, it is possible to maintain the offloading usefulness even for α > 1.
– There is no improvement in slow networks (σ ≥ 1), where the host is able to assume the communication overhead without aid. The offloading technique is useful whenever the host is not able to communicate at link speed (σ < 1), but in these circumstances γ has to be low, as previously said. As there is a trend towards faster networks (σ decreases), offloading can be seen as a very useful technique. When σ is near one, the best improvement corresponds to cases where there is a balance between computation and communication costs before offloading (γ = σ = 1).
Nevertheless, as LAWS estimates peak throughput improvements in fully pipelined communication paths, it can only represent a first approximation, which may be far from the performance observed in real communication systems. In this paper, we have used two kinds of simulators to get experimental evidence about the usefulness of the LAWS model.
3. A first analysis of offloading through HDL simulation
LAWS and other models, such as EMO, provide good starting points to get a first insight into the conditions that make offloading useful, and help to guide the experimental work in the offloading design space. Nevertheless, a detailed experimental validation with wider sets of application domains and benchmarks is required.
Simulation is a good way to achieve this; it can be considered the most frequent technique used to evaluate computer architecture proposals. To be a useful tool, a simulator needs to simulate the target machine (and to drive it by using realistic workloads) with the detail required by the questions to be answered. Many realistic workloads, such as web servers, databases, and other network-based applications, use operating system services, and simulators that run only user-mode, single-threaded workloads are not adequate [1], because a network-oriented simulation requires timing models of the network DMA activity, coherent and accurate models of the system memory, and full-system simulation including the OS [5].
Although the main part of our analysis of offloading has been implemented by using the full-system simulator Simics [29], in this section we have first used an HDL (hardware description language) model of the communication path. This model allows us to simulate the hardware characteristics at an adequate level of detail to introduce and understand the influence of the different implementation characteristics: delays produced by the software and buses, contention to access memory and other shared hardware elements, register-level transactions, etc. Moreover, for real applications it is difficult to reproduce experimentally the LAWS curves corresponding to the peak throughput improvement against the application ratio, γ, when one of the other parameters, αβ or σ, changes while the other is kept constant (see the figures of [42] and Fig. 1c). For example, it is difficult to keep unchanged the application CPU work per data unit, a, before and after offloading, and it is also difficult to modify an application to obtain the values of a and o that allow a (more or less) complete set of values for γ. In this way, an HDL model makes these situations easier to reach, as the delays used to represent the different procedures (application or communication tasks) can be directly changed.
3.1. An HDL model for offloading
Fig. 2 gives the modules included in the HDL model, which we have written by using the Verilog hardware description language. Besides the NIC, the CPU and cache, the chipset, and the memory, the HDL model includes the delays and contention problems in the I/O bus and the memory bus, and makes it possible to inject packets at different speeds through the network module.
[Fig. 2. Modules of the host HDL model: CPU + cache, chipset, memory bus, memory, I/O bus, NIC, and network.]
[Fig. 3. Main elements of the NIC module: input/output data (dIn, dOut), queue pointers (start, end), control and handshake signals (cReady, pReady, rdack, wrack, read, write, state), and the interrupt signals (intr, intrack).]
As an example of the level of detail used in the different modules of our HDL model, Fig. 3 illustrates the elements included in the HDL description of the NIC. It includes a queue of buffers where the data coming from the network link are stored. There are two modules that control the way the data are, respectively, written to the queue or read from it. The other element in the NIC implements the communication protocol if it is offloaded, initializes the DMA, generates the interrupt requests to the CPU, etc. We have considered a programmable NIC with enough local memory to store data and code. It also includes faster memories to implement
the queues of buffers (in Fig. 3, we represent only one queue) and
other control registers. The CPU and the NIC processor interact by
reading from or writing to some shared registers. The role of the
NIC processor in the communication depends on the protocol
implementation to be simulated. Whenever offloading is considered, the behavior of the NIC is controlled by a program stored in
the NIC memory that runs in the NIC processor. The NIC is modeled
as a cut-through device rather than a store-and-forward one. It overlaps
the transfers across the I/O bus and the network links.
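As a simple illustration of the cut-through choice, the following sketch (ours, with arbitrary example values rather than parameters of the HDL model) compares the per-packet reception latency of a store-and-forward NIC, which must receive the whole packet before starting the I/O-bus transfer, with that of a cut-through NIC, which overlaps both transfers.

# Toy comparison of store-and-forward vs. cut-through NIC behaviour
# (arbitrary example values, not taken from the HDL model).
def store_and_forward_latency(packet_bytes, link_bps, io_bus_bps):
    t_link = packet_bytes * 8 / link_bps      # receive the whole packet first
    t_bus = packet_bytes * 8 / io_bus_bps     # then transfer it over the I/O bus
    return t_link + t_bus

def cut_through_latency(packet_bytes, link_bps, io_bus_bps, word_bytes=4):
    # The I/O-bus transfer starts as soon as the first word has arrived,
    # so the two transfers overlap and the slower one dominates.
    t_first_word = word_bytes * 8 / link_bps
    t_link = packet_bytes * 8 / link_bps
    t_bus = packet_bytes * 8 / io_bus_bps
    return t_first_word + max(t_link, t_bus)

pkt = 1500      # bytes
link = 1e9      # 1 Gbps link
bus = 2e9       # 2 Gbps effective I/O-bus bandwidth
print(store_and_forward_latency(pkt, link, bus) * 1e6, "us")   # about 18 us
print(cut_through_latency(pkt, link, bus) * 1e6, "us")         # about 12 us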
Fig. 4 compares non-offloading (Fig. 4a) with two offloading
alternatives (Figs. 4b and c) at the receive side. Whenever the
reception is done without offloading, the NIC, after getting some
information from the received packet, interrupts the CPU, which executes the driver to initialize the DMA operation between the NIC
and the main memory (for example, the memory address in which
to store the packet data has to be communicated to the NIC). The
NIC stores the received data in main memory through a DMA operation and informs the CPU at the end of this operation. Then, the
CPU starts the protocol processing of the packet in main memory.
In the alternative for offloading shown in Fig. 4b, the NIC is able
to start the processing of the protocol once the whole packet is received (or even after receiving only part of it). Then, the NIC interrupts the CPU, which executes the driver to initialize the DMA operation, as in the non-offloading alternative. After that, the NIC
starts the DMA operation to transfer the packet data to main memory and informs the CPU when data are available in its main memory at the end of the DMA transfer. The alternative shown in Fig. 4c
frees the CPU from almost all the communication overhead, as the
NIC is able to process the protocol and initializes the DMA to send
the payload of the received packet to memory. When these data
are stored in the corresponding addresses, the NIC informs the
CPU that the application can use them.
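To make the comparison among the three receive paths of Fig. 4 more concrete, the following back-of-the-envelope sketch (ours, with purely illustrative delay values) adds up the per-packet latency components and the host CPU busy time in each alternative; it is not the HDL model itself.

# Back-of-the-envelope comparison of the three receive paths of Fig. 4.
# All delays are illustrative (arbitrary time units), not measured values.
DELAYS = {
    "interrupt": 2,        # interrupt handling and context switch on the host CPU
    "dma_init": 3,         # driver programs the DMA engine
    "dma_transfer": 10,    # NIC -> main memory transfer
    "protocol_cpu": 20,    # protocol processing on the host CPU
    "protocol_nic": 30,    # protocol processing on the (slower) NIC processor
}

def no_offload(d=DELAYS):
    # Fig. 4a: interrupt, DMA init, DMA, then protocol processing on the CPU.
    latency = d["interrupt"] + d["dma_init"] + d["dma_transfer"] + d["protocol_cpu"]
    cpu_busy = d["interrupt"] + d["dma_init"] + d["protocol_cpu"]
    return latency, cpu_busy

def offload_dma_init_on_cpu(d=DELAYS):
    # Fig. 4b: protocol processed in the NIC, CPU still initializes the DMA.
    latency = d["protocol_nic"] + d["interrupt"] + d["dma_init"] + d["dma_transfer"]
    cpu_busy = d["interrupt"] + d["dma_init"]
    return latency, cpu_busy

def full_offload(d=DELAYS):
    # Fig. 4c: protocol processing and DMA initialization both done in the NIC.
    latency = d["protocol_nic"] + d["dma_init"] + d["dma_transfer"] + d["interrupt"]
    cpu_busy = d["interrupt"]   # the CPU is only notified that data are available
    return latency, cpu_busy

for name, fn in [("no offload", no_offload),
                 ("offload (Fig. 4b)", offload_dma_init_on_cpu),
                 ("offload (Fig. 4c)", full_offload)]:
    latency, cpu = fn()
    print(f"{name}: latency={latency}, host CPU busy time={cpu}")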
[Fig. 4. Reception without offloading (a) and with two offloading alternatives (b) and (c). The diagrams show, for the CPU and the NIC, the data arriving from the network, the protocol processing, the DMA initialization, the DMA transfer of the data to main memory, and the point at which the data become available to the application.]
[Fig. 5. Timing diagrams for packet reception: (a and b) no-offload; (c) offload. The signals shown are link, mem, intr, intack, dma, dmaend, protCPU/protNIC, and dmainitCPU.]
Fig. 5 shows some of the signals exchanged among the different modules in the HDL simulations. They correspond to packet reception with and without offloading. To get shorter timing diagrams, the size of the injected packet is small and only eight words have to be transferred to memory by a DMA operation. The timing diagrams of Figs. 5a and b correspond to the same reception without offloading, although they show different time scales. Fig. 5c corresponds to the offloaded reception.
The link signal corresponds to the data entering the NIC from the network link. The arrival of data to the main memory is indicated by the mem signal. The intr and intack signals are exchanged between the CPU and the NIC for, respectively, interrupt
requesting and acknowledgement. The dma signal indicates a DMA transfer in progress from the NIC to the main memory, whereas dmaend indicates a finished DMA transfer. The protCPU signal indicates that
the protocol is processed by the CPU and protNIC, that the protocol
processing is done in the NIC. In the three figures (Figs. 5a–c), the
DMA initialization is done by the CPU and it is indicated by using
dmainitCPU.
Figs. 5b and c correspond, respectively, to Figs. 4a and b, and illustrate the differences between reception without and with offloading. As can be seen from the link signal in both figures (Figs. 5b and c), the link bandwidth saturates the node: first the packets enter the NIC and are stored in the NIC queue (Fig. 3) at the link speed but, as the packets cannot be processed at the link speed, the queue gets full and the speed of packet arrival decreases. In the case of reception without offloading (Figs. 5a and b), the CPU is interrupted (intr and intack signals), it initializes the DMA operation for the incoming data (dmainitCPU) and, after the DMA transfer to memory (see the mem signal), the CPU processes the packet according to the communication protocol (protCPU signal). In the case of an
offloaded reception (Fig. 5c), the packet is firstly processed in the
NIC (protNIC) and the CPU is interrupted to start the transfer of
the payload (intr and intack). Then, the CPU initializes the DMA
operation (dmainitCPU) and the DMA transfer is done (mem).
As has been said, in Fig. 5 eight words have been transferred in both the offloading and non-offloading simulations. Nevertheless, in the case of offloading, the number of words to be transferred for a given packet is lower, as the packet is processed in the NIC and the header bits of the packet can be removed.
3.2. Offloading performance evaluation by HDL simulation
Our HDL model allows the simulation of the CPU, the NIC, and
the network link, besides the effects of the buses, the bridge, and
the main memory. Recently, Intel has launched its I/O acceleration technology (I/OAT) to keep pace with the emergence of multi-Gigabit/s links by allowing servers to take advantage of that high bandwidth in order to increase their throughput and quality of service [18]. Among the main bottlenecks referred to in the Intel whitepapers describing the main I/OAT characteristics are the system
overheads and the memory accesses [24]. The profile of memory
accesses for the application executed in the CPU, the protocol processing in the CPU or NIC, and the DMA accesses to read/store the
packets or payloads in the main memory have a significant influence in the communication performance and thus, determine the
improvement that can be obtained from offloading.
Fig. 6 shows the improvement in throughput against the memory accesses generated by the CPU running either the application or the communication protocol. The experiments have been done considering negligible delays in the I/O and memory buses, and in the chipset circuits. Small packets with eight words have been simulated. The 100% columns mean that all the words of a packet are transferred to the main memory by DMA, whilst the 75% columns correspond to a reduction in the number of words to be transferred (75% of the packet size is transferred to system memory), as the protocol has been processed in the NIC when offloading is applied and it is not necessary to store the packet header in the main memory. From Fig. 6a, it can be seen that as the application generates more memory accesses, the improvement obtained by offloading grows, since offloading allows concurrency between the NIC protocol processing and the execution of the application in the CPU. Nevertheless, the increment in the throughput performance goes down above a certain amount of accesses. This situation can be explained by taking into account that, as the number of memory accesses increases, the rate of communication overhead with respect to the application processing decreases (the parameter γ in the LAWS model grows).
[Fig. 6. Improvement in the throughput with offloading (for different numbers of memory accesses): (a) against the application memory accesses; (b) against the protocol memory accesses (0 accesses in the application); (c) against the protocol memory accesses (512 accesses in the application).]
Figs. 6b and c illustrate the performance improvement obtained by protocol offloading as the memory accesses generated by the communication protocol grow. This improvement comes from both the reduction in the main memory accesses obtained when the protocol is offloaded (thus the LAWS parameter β is less than one, β < 1, and decreases as the amount of memory accesses increases), and the increase of the protocol overhead with respect to the application processing requirements (the parameter γ in the LAWS model decreases).
Fig. 7 shows the percentage of throughput improvement against the application ratio parameter γ (the computation/communication rate) in the LAWS model for two values of the product αβ, where α is the ratio between the CPU and the NIC processor computing speeds (lag ratio in LAWS), and β quantifies the improvement in communication overhead after offloading (structural ratio in LAWS). In the experiments corresponding to Fig. 7a, no memory accesses are generated by the application, while in the case of Fig. 7b, a given maximum of memory accesses is generated by the application (in this case, approximately 5% of the application time corresponds to memory accesses). In our simulations, the system is host limited before offloading, since offload does not yield any benefit otherwise [42]. The packet size used in the experiments of Fig. 7 corresponds to DMA transfers of eight words between the NIC and the main memory. The characteristics and the behavior shown in the curves of Fig. 7 agree with the conclusions extracted from the LAWS model [42] in Section 2. Thus, in Fig. 7a, no improvement is obtained whenever αβ = 1 (it is not low enough). As αβ is decreased, a throughput improvement is obtained. This improvement decreases as the application ratio grows: the amount of communication overhead decreases with respect to the computation needs of the application (as the LAWS model predicts).
Fig. 7. Throughput improvement with offloading (p = 1) against application ratio:
(a) without application memory accesses; (b) with application memory accesses.
The behavior shown in the curves of Fig. 7b is also similar to that predicted by Eq. (1) (with p = 1) and shown in Fig. 1c. The throughput improvement grows with γ for γ < 1, and decreases as γ grows for γ > 1. Moreover, as the product αβ decreases, the maximum achieved for the throughput improvement grows. Nevertheless, although the qualitative behavior of the curves agrees with the LAWS model, there are important quantitative differences in the experimental results. In the case of αβ = 0.8, the differences lie between 47% and 68%, and between 62% and 82% for αβ = 1.0.
These differences between the simulation results and the LAWS
model predictions are not difficult to justify. First of all, LAWS
model predicts the improvement in peak throughputs. Moreover,
LAWS takes into account the distribution of the application and communication work before and after offloading, and introduces the possible effects of the implementation on the change in the total amount of work to be done through the parameter β. Nevertheless,
there are also effects on the CPU time which is consumed by the
applications due to the specific profiles of the use of the memory
hierarchy and other elements of the I/O subsystem [6]. These effects could decrease the throughputs with respect to their peak
values. In the following section, a full-system simulation of protocol offloading is considered in order to carry out experiments with realistic interactions among operating system, hardware, and applications, since these can be difficult to reconstruct by HDL simulations.
4. Offloading analysis through full-system simulation
The research in computer system design issues dealing with high-bandwidth networking requires an adequate simulation tool providing a computer model that makes it possible to run commercial OS kernels (as most of the network code runs at the system level), and other features for network-oriented simulation, such as a timing model of the network DMA activity and a coherent and accurate model of the system memory [5]. There are not many
simulators with these characteristics. Some examples are M5 [28],
SimOS [41], and some other simulators based on Simics [27,45],
such as GEMS [31] and TFsim [32].
Simics is a commercial full system simulator that allows engineers to have accurate hardware models in such a way that software cannot detect the difference between real hardware and the
provided virtual environment. This way, Simics allows the simulation of application code, operating system, device drivers, and protocol stacks running on the modeled hardware. Moreover, Simics is a fast functional simulator that makes it possible to simulate complex applications, operating systems, network protocol stacks,
and other real workloads.
Nevertheless, Simics is a functional simulator in itself and does
not provide an accurate timing model. In [47], some limitations are
reported with respect to the capabilities of Simics in the modeling of
x86 processors (out-of-order microarchitectural issues such as
branch prediction, reorder buffer, functional units, etc. are not
properly modeled) and in the simulation of cc-NUMA computers
with accurate cache miss models. In these cases, the functionality
of Simics should be extended to allow accurate evaluations of some
commercial workloads. This way, to use Simics for protocol offloading evaluation, we have needed to develop a network interface model that processes the protocol instead of the main CPU
of the system, and to overcome some Simics limitations in the simulation of protocol offloading or networks in general:
(a) Networks are simulated at packet level. The transactions are
performed as one event. Thus, the details of a network
packet transaction (by sending individual bytes) are not simulated. Instead, the complete transaction is simulated as one
action. In this way, the network and I/O devices are simulated in a transaction-based style. This constitutes an important drawback in the network-oriented system simulation, where detailed timing models of network DMA events are required [5].
(b) The simulated link bandwidth could potentially be infinite but, in practice, a very high bandwidth (e.g. 10 Gbps) requires a long simulation time and the results are not as expected
(although Simics is able to handle high bandwidths, the
NIC model does not).
(c) Packets are delivered to the network with a configurable
latency that depends on the length of a time slice. A time
slice in Simics is the minimum time that we could measure.
It can be modified, but the lower bound is determined by the
CPU speed. So, it is necessary to ensure that the minimum
latency that we are simulating is enough to allow the maximum bandwidth needed for our purposes.
(d) Using shorter time slices (lower latencies) in multi-machine
configurations slows down the simulation speed. So, this
latency could not be as low as one would like.
(e) In order to build simulation models at hardware level, Simics
provides the stall execution mode that allows us to simulate
latencies or time accesses, but only between the CPU and the
memory, and not for the buses.
Despite these limitations, we have preferred Simics instead of
simulators such as M5 [28] or SimOS [41] due to the ability to
change the simulation parameters and to create hardware models,
as well as to simulate many different CPU models with Simics. In fact, it provides DML (device modeling language).
This is not only a configuration language but also a hardware
description language for device modeling that can be connected
to our simulated architecture through a Simics connector. Furthermore, because of their C++ codebases, the debugging process in simulators such as M5 and SimOS is harder compared with that in Simics,
which also provides effective tools for debugging and profiling.
There are other simulators based on Simics, such as GEMS [31]
and TFSim [32], that provide accurate timing models, but they are focused on specific systems. For instance, GEMS is a Simics-based
simulator for Sparc-based computers.
Taking into account the above issues, in order to overcome the drawbacks related to the modeling of the time behavior of the simulated system that prevent the use of Simics for network system simulation, we propose extending the functionality of Simics by including more detailed timing models, as described in what follows.
By default, Simics relates the execution time with the number of
executed instructions. In this way, whenever the instructions are
executed in order, each instruction corresponds to a (configurable)
number of clock cycles. In a multiprocessor system, although there
is a time interval assigned to simulate the execution of instructions
corresponding to each processor, all the processors in the system
execute the same number of instructions after a given amount of
simulated time. In the case of out-of-order execution, there is no fixed correspondence between the number of cycles and the number of executed instructions. In Simics, the memory accesses can be
generated not only from the processors, but also from other devices. Given a memory physical address, to identify the object to
which that address belongs, Simics uses the memory-space concept.
This way, a memory-space maps a physical address to any object
that would be able to participate in a transfer (a RAM, a flash
memory, or a device). A memory-space can also include other memory-spaces, thus building a hierarchy. The possibility to define
memory-space hierarchies has allowed us to model latencies in
the transfers, as explained with the help of Fig. 8. Simics internally uses the so-called simulator translation cache (STC) to speed up the simulations by using a table of translated addresses that avoids having to go through the whole memory hierarchy. Nevertheless, this strategy implies that sometimes (whenever the memory address to be accessed is included in the STC) the timing models are not applied. Thus, to have more accurate timing models, the STC should be disabled, as shown in Fig. 8, although at the cost of slower simulations. The timing model can then be defined through the timing_model_operate() function of the Simics API by adequately setting its parameters. The timing model we have defined is applied to all the devices to which it is connected. Whenever one of these devices tries to generate a memory access, the system checks whether another device is using the bus. If the bus is busy, the contention is simulated by adding a given (and configurable) number of cycles to the memory access latency.
[Fig. 8. Memory transfers in Simics without using the STC: the CPU initiates a transaction on the memory space; if a timing model is connected, it is called and the CPU stalls before the transaction is reissued; otherwise there is no stall and the memory access is performed on the RAM, with the data returned to the CPU from memory.]
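The bus-contention behavior described above can be summarized with the following sketch. It is our own simplification and not the actual Simics API: the operate() callback below only plays the role that timing_model_operate() plays in our Simics timing models, returning the number of stall cycles for each memory transaction.

# Simplified sketch of the contention logic used in our timing models.
# This is NOT the Simics API; it only illustrates the stall-cycle computation.
class BusContentionTimingModel:
    def __init__(self, access_cycles=4, contention_penalty=6):
        self.access_cycles = access_cycles            # base memory access latency
        self.contention_penalty = contention_penalty  # extra cycles when the bus is busy
        self.bus_free_at = 0                          # cycle at which the bus becomes idle

    def operate(self, device, now_cycle):
        """Called for every memory transaction issued by a connected device.
        Returns the number of cycles the issuing device has to stall."""
        stall = self.access_cycles
        if now_cycle < self.bus_free_at:              # another device holds the bus
            stall += self.contention_penalty + (self.bus_free_at - now_cycle)
        self.bus_free_at = now_cycle + stall          # bus busy until this access ends
        return stall

# Example: the CPU and the NIC DMA engine hitting the same memory bus back to back.
bus = BusContentionTimingModel()
print(bus.operate("cpu0", now_cycle=100))     # 4 cycles (bus idle)
print(bus.operate("nic_dma", now_cycle=102))  # 4 + 6 + 2 = 12 cycles (contention)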
In [1], Simics is also extended with two timing models: a memory simulator that implements a cache hierarchy with cache controllers and an interconnection network for multicomputer systems, and a processor timing simulator for the Sparc V9 instruction set. In the approach proposed here, all the timing models present the same interface to set their parameters and are defined by taking advantage of the resources provided by the Simics environment; thus, it is possible to build timing models not only for the processors, the memory hierarchy, and the connection between the processors and the memory, but also for the NIC and the I/O buses.
We have built a Simics simulation model by defining two customized machines and a standard Ethernet network connecting
them in the same way as we could have in the real world. Simics even allows us to connect the simulated machine to a real network by using the Simics Central module. Nevertheless,
we have avoided the use of this Simics Central module in order to
reduce the simulation time and to increase the attainable maximum bandwidth: since Simics Central acts as a router, it limits
the simulated effective bandwidth. This way, we have connected
our two machines directly, something similar to using a crossover
Ethernet cable in the real world.
We have used two models in our simulations. The first one corresponds to a non-offloaded system, in which we have a Pentium 4-based machine running at 400 MHz (enough to reach up to 1 Gbps at network level and not to slow down the simulation speed). We have also used the Simics gigabit NIC model of the BCM5703 PCI-based Ethernet card included in our system. The model is shown
in Fig. 9.
[Fig. 9. Hardware model for a system without offloading: CPU (application + communication processing), north bridge, memory, PCI I/O and memory spaces mapped in Simics, PCI bus with the BCM5703C NIC, and Ethernet link.]
With this model, we have determined the maximum performance we can achieve using a simple machine with one processor,
and no offloading. This way, the CPU of the system executes
the application and processes the communication protocols. The
maximum throughputs and the CPU loads for this model are
shown in Section 5.
Fig. 10. Hardware model for offloading simulation.
In order to offload the protocols, and so remove the protocol
processing work from the CPU, we have used the model shown
in Fig. 10, corresponding to a system where one of the processors
has been isolated from the other and the NIC is directly connected
to this CPU in order to improve the parallelism between application and network processes. In Simics, by default, the bridges
merely act as connectors and, in this case, no contention is modeled at simulation time. The way to simulate contention effects is
through the use of the timing models we have previously described. Thus, a timing model is connected to each entry of the
bridge where access contention is possible. This is not an exact way to model contention, but it provides an adequate simulation of the contention behavior. Thus, in our model (Fig. 10), the north bridge and the buses use timing models and do not merely act as connectors. In the ideal case, where no timing models are used, transfers between the CPUs and the memory would not delay any other transfer.
On the other hand, in Simics, PCI I/O and memory spaces are
mapped in the main memory (MEM0). So, at the hardware level, transfers between these memory spaces would not necessarily require a bridge, because Simics allows us to define full-custom hardware architectures. We add a north bridge to our architecture in order to simulate a real and standard machine on which we can install a standard operating system (in our case, Linux).
[Fig. 11. CPU load comparison: % CPU load vs. message size (Kbytes), with and without offloading.]
[Fig. 12. Decrease in the number of interrupts: % CPU interrupt gain vs. message size (Kbytes) for TCP and UDP.]
The computer of Fig. 10 includes two Pentium 4 CPUs, a DRAM module of 128 MB, an APIC bus, a PCI bus with a Tigon-3 (BCM5703C) gigabit Ethernet card attached, and a text serial console. The use of a text serial console is due to a limitation in Simics that at the moment is not able to have more than one machine running over a single Simics instance with graphical consoles. It only
can simulate and communicate several Simics instances through
the Simics Central module. Furthermore, using text serial consoles
(thus avoiding the use of graphical consoles), we have reached a
faster simulation [48].
Once we have two machines defined and networked, Simics allows an operating system to be installed over them. For our purposes, we have used Debian Linux with a 2.6 kernel [26]. It
allows us the necessary support for our system architecture and
the implementation of the required changes. In the following section, we provide the experimental results obtained and analyze
them according to the LAWS model.
5. Experimental results
In order to evaluate protocol offloading, we have used several Simics and operating system features. Using the 2.6 kernel, it is possible to assign a CPU to the communication subsystem, isolating it from any other workload. This can be done with Linux cpusets [35], which prevent other processes from being attached to the isolated CPU. The cpusets are lightweight Linux objects that allow us to partition the machine. This partition makes it possible to assign memory nodes to each created cpuset object. Moreover, the memory assigned to a particular cpuset can be restricted to be exclusively used by that cpuset.
In this way, we have a system with a CPU, CPU0, for running
applications and the operating system processes and another processor, CPU1, for running the communication subsystem.
In order to test our model and evaluate offloading, we have used netpipe [45], which is a protocol-independent tool that measures the network performance in terms of the available throughput between two hosts. It consists of two parts: a protocol-independent driver and a protocol-specific communication section. The communication section depends on the specific protocol used, since it implements the connection and transfer functions, whereas the driver remains the same. For each measurement, netpipe increments the block size following its own algorithm.
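The following sketch is a strongly simplified, socket-based stand-in for the kind of measurement netpipe performs, not the actual netpipe code: for each block size, a number of blocks is streamed to a peer that acknowledges each one with a single byte, and the achieved throughput is derived from the elapsed time. The peer address, the acknowledgement scheme, and the block-size schedule are illustrative assumptions.

# Simplified netpipe-like throughput sweep over TCP (illustrative only;
# the real netpipe uses its own block-size schedule and measurement protocol).
# Assumes a peer that reads each block and replies with one byte.
import socket
import time

PEER = ("192.168.0.2", 5001)      # hypothetical receiver address
REPETITIONS = 50                  # blocks sent per measured size

def measure_throughput(block_size):
    payload = b"\0" * block_size
    with socket.create_connection(PEER) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(REPETITIONS):
            s.sendall(payload)
            s.recv(1)             # 1-byte acknowledgement from the peer
        elapsed = time.perf_counter() - start
    bits = block_size * REPETITIONS * 8
    return bits / elapsed / 1e6   # Mbps

if __name__ == "__main__":
    size = 64
    while size <= 8 * 1024 * 1024:                 # sweep block sizes 64 B .. 8 MB
        print(f"{size:>9} bytes: {measure_throughput(size):8.1f} Mbps")
        size *= 2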
In our experiments, we have used optimized network parameters in order to achieve the maximum throughput in every test,
with or without offloading. For instance, we have used 9000-byte MTUs (jumbo frames). We have also applied standard TCP windows, which sometimes produce oscillations in the throughput. This could be avoided by using oversized TCP windows (e.g. 256 Kbytes), but the maximum attainable throughput should not
be affected.
In the following graphs (Figs. 11 and 12), we illustrate some experimental results using the TCP stack as the transport-layer protocol [43] and a Gigabit Ethernet network. In Fig. 11, the loads of CPU0 in the non-offloaded and offloaded cases are compared.
When the protocol is offloaded, the load of CPU0 is lower, as it
only executes the application that generates data and the NIC driver. When CPU0 has to generate data and process the protocol, its
load grows up to 90% of the maximum: CPU0 is busy and there are
not many cycles available for other tasks.
The curves of Fig. 12 provide the percentages of the interrupts
requested to CPU0 in the non-offloading case that are avoided
when the communication protocol is offloaded. Thus, a 0% in the
figure means that with and without offloading, CPU0 receives the
same number of interrupts, and a 50%, that when the protocol is
offloaded, CPU0 receives half of the interrupts received in the
non-offloading case.
The decrease in the interrupts per second obtained with offloading is about 60% for TCP, and about 50% for UDP. So, with regard to the offloading effects on the overall performance, the more cycles are required for protocol processing, the higher is the improvement in the time spent in interrupt servicing (fewer interrupts and less CPU time spent in servicing them). Thus, as TCP requires more CPU cycles to be processed than UDP, the benefits are more apparent in the case of TCP. In our simulations, we have not used techniques, such as interrupt coalescing, that are common in present NICs, as they cannot be supported by our Simics model for the NIC. If they were simulated, a reduction in the interrupts without offloading should be observed, and probably the difference in the number of interrupts between offloading and non-offloading would be lower.
[Fig. 13. Peak throughput comparison: peak throughput vs. message size (bytes), with and without offloading.]
The results obtained with netpipe are shown in Figs. 13 and 14. These graphs provide the throughput for each transferred block size and the maximum attainable throughput. Fig. 13 shows the improvement that could be reached in the ideal case in which the NIC could communicate with the bus without latency: the
NIC that processes the protocols is connected to a bus without latency and it could access the onboard memory also without any latency. As the figure shows, the throughput obtained in this case
can be almost the bandwidth of the network.
To obtain the results shown in Fig. 14, we have modeled the effect of having a non-ideal connection between CPU0 and the processor of the NIC, CPU1. In order to simulate this, we have
introduced the corresponding timing models in the NIC bus and
in the memory accesses from the NIC processor.
In Fig. 14a, the throughput for different latency values in the NIC accesses is shown against the message sizes. In the legend 'Offload x', x is the relative delay of the memory accesses from CPU1 with respect to a reference value. So, Offload 2 means a memory access delay from CPU1 twice as large as that of Offload 1. We can see that the memory latency is decisive for the performance, as is mentioned in many previous papers (see for example [23]). The lower throughputs obtained in the case of small block sizes are due to the ACKs required by the TCP protocol to transfer a block.
[Fig. 14. Throughput of offloading vs. non-offloading: (a) for different memory latencies from the NIC (No Offload, Offload 10/1, Offload 2, Offload 4, Offload 6), peak throughput vs. message size (bytes); (b) for a limited host (30% of the link bandwidth), peak throughput vs. message size (bytes), with and without offloading.]
The curves in Fig. 14b correspond to a limited host (in this case,
it can only deliver about 30% of the link bandwidth) without any
other overhead source (i.e. without any application overhead)
and a NIC with the same memory latency as the host CPU. The LAWS model will be considered below to analyze these results in detail.
[Fig. 15. Throughput for different NIC processor speeds: peak throughput vs. message size (bytes) for a NIC processor running at 100%, 75%, 50%, and 25% of the host CPU speed, and without offloading.]
[Fig. 16. RTT on Gigabit Ethernet with and without offloading: RTT (µs) vs. message size (bytes) for No Offload and Offload 1, 2, 4, and 6.]
38
A. Ortiz et al. / Journal of Systems Architecture 55 (2009) 25–42
Table 1
Latencies for different offloading alternatives

Offloading       Latency (µs)
10/1             24.4
10/2             31.68
10/4             42.53
10/6             46.61
No-offloading    66.9
Fig. 15 shows the effect of the technology used to implement the NIC processor on the performance of protocol offloading. As this is one of the arguments used to question the benefits of protocol offloading, this analysis is important. In order to run the corresponding simulations, we have modified the step rate of the NIC processor. The curves in Fig. 15 correspond to the communication performance for a NIC processor running at 75%, 50%, and 25% of the host CPU speed.
Fig. 17. Saturation points for different offloading alternatives using different scales (block or message size in bytes against the RTT in µs).
As we can see from Fig. 15, the speed of the NIC processor (CPU1) decisively affects the throughput: the performance gets worse as the processor speed decreases. Moreover, in the case of a very slow NIC processor, the performance with protocol offloading is even worse than the performance without offloading. It is therefore clear that offloading improves the communication performance only if the processor included in the NIC is sufficiently fast compared with the host CPU (CPU0); otherwise, offloading can even diminish the performance.
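A rough way to read this observation in LAWS terms, i.e. using the numerator and denominator of expression (2) below with δa = δo = σ = 0, is the following back-of-envelope condition (our own sketch, assuming a host-limited baseline and, as in netpipe, a negligible application workload, γ ≈ 0): offloading degrades the peak throughput only when the NIC-side term becomes the bottleneck,

\[
\frac{1}{\beta\, o\, p\, Y} \;<\; \frac{1}{(a+o)X}
\;\Longleftrightarrow\;
\alpha \;>\; \frac{a+o}{\beta\, p\, o} \;=\; \frac{\gamma+1}{\beta\, p},
\qquad Y=\alpha X,\quad \gamma=\frac{a}{o}.
\]

With the illustrative parameter values used later for Fig. 18 (p = 0.75, β = 0.40) and γ ≈ 0, this gives α ≳ 3.3; that is, a NIC running at less than roughly one third of the host CPU speed can make offloading counterproductive, which is consistent with the behavior of the 25% curve in Fig. 15.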
Fig. 16 shows the round trip time (RTT) against the block size, and Table 1 provides the latency, both measured with netpipe. The meaning of Offload x in Fig. 16 is the same as in Fig. 14a. It is clear that high throughput does not imply low latency. In our experiments with TCP we have seen that the latency depends heavily on the TCP configuration. As the parameters used in the TCP simulations have been optimized, the latency improvement is less apparent than other effects, such as the throughput improvement, as shown in Fig. 16. Table 1 summarizes the latencies measured for the different offloading conditions.
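For instance, from the values in Table 1, the Offload 10/1 configuration reduces the measured latency with respect to non-offloading by roughly

\[
\frac{66.9 - 24.4}{66.9} \approx 0.64,
\]

that is, about a 64% latency reduction, while the slowest offloaded configuration (10/6) still reduces it by about 30%.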
Fig. 17a shows the saturation points for both the offloaded and the non-offloaded TCP experiments; the dependence of the saturation point on the offloading capabilities is clear. Fig. 17b also shows the latency improvement. Nevertheless, the latency and the location of the saturation point depend heavily on the TCP configuration, such as the TCP buffer sizes, the use of the Nagle algorithm, etc.
5.1. The LAWS model and the simulation results
To conclude this section, we compare the results obtained from our simulations with the predictions provided by the LAWS model. Fig. 18 shows two curves of the peak throughput improvement against the application ratio (γ): the one predicted by LAWS and the one obtained experimentally with Simics. These curves correspond to the following values of the other LAWS parameters: p = 0.75, α = 1, β = 0.40, and B = 1 Gbps. As can be seen, there are important differences between the two curves, not only in the amount of throughput improvement achieved, but also in the location of the maximum and in the rate at which the improvement decreases with the application ratio. To obtain a more accurate fit between the experimental and theoretical curves, and a deeper insight into the causes of the differences between the LAWS predictions and the experimental results, we have added three new parameters, δa, δo, and σ, to the expression of the peak throughput:
\[
\delta_b \;=\; \frac{\min\!\left[\,B,\ \dfrac{1}{\bigl(a(1+\delta_a)X+(1-p)\,o(1+\delta_o)X\bigr)(1+\sigma)},\ \dfrac{1}{\beta\, o(1+\delta_o)\,p\,Y}\,\right]}
{\min\!\left[\,B,\ \dfrac{1}{aX+oX}\,\right]}
\tag{2}
\]
The effects that we try to model through the parameters δa, δo, and σ can be understood from expression (2). The parameters δa and δo represent rates of change in the work per unit of data after offloading, whilst σ is a rate of change in the CPU workload after offloading due to the overheads of the communication between the CPU and the NIC through the I/O subsystem. Thus, in expression (2) the parameter a changes to a(1 + δa) = a + aδa, o changes to o(1 + δo) = o + oδo, and the CPU workload after offloading, W, changes to W(1 + σ) = W + σW.
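A minimal Python sketch of how expression (2) can be evaluated is given below. It assumes the reconstructed form of the expression above, an arbitrary normalisation oX = 1 (with B in the corresponding units), and reports the improvement as the ratio minus one, so the absolute percentages are illustrative only and do not reproduce the values measured in our simulations:

# Minimal sketch of expression (2): peak-throughput ratio of the modified
# LAWS model. The normalisation (o = X = 1, B = 1) and the mapping gamma = a/o
# are assumptions of this sketch, not values taken from the simulations.
def laws_mod_ratio(gamma, p=0.75, alpha=1.0, beta=0.40, B=1.0,
                   d_a=0.0, d_o=0.0, sigma=0.0, o=1.0, X=1.0):
    a = gamma * o              # application work per unit of data (gamma = a/o)
    Y = alpha * X              # NIC time scale (lag ratio alpha)
    before = min(B, 1.0 / ((a + o) * X))
    host_term = 1.0 / ((a * (1 + d_a) * X + (1 - p) * o * (1 + d_o) * X) * (1 + sigma))
    nic_term = 1.0 / (beta * o * (1 + d_o) * p * Y)
    after = min(B, host_term, nic_term)
    return after / before      # with d_a = d_o = sigma = 0 this is plain LAWS

if __name__ == "__main__":
    for gamma in (0.25, 0.5, 1.0, 2.0, 3.0):
        plain = laws_mod_ratio(gamma)                                    # original LAWS
        mod2 = laws_mod_ratio(gamma, d_a=5e-5, d_o=-26e-6, sigma=0.147)  # LAWSmod(2)
        print(f"gamma={gamma:4.2f}  LAWS: {100 * (plain - 1):6.2f}%  "
              f"LAWSmod(2): {100 * (mod2 - 1):6.2f}%")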
Fig. 19 shows that it is possible to approach the experimental results more closely by using expression (2) with adequate values for δa, δo, and σ. In the figure, the curve LAWSmod(1) corresponds to the values δa = 5 × 10⁻⁵, δo = 0 and σ = 5 × 10⁻², and the curve LAWSmod(2) to δa = 5 × 10⁻⁵, δo = −26 × 10⁻⁶ and σ = 0.147. It is clear that our modified LAWS model makes it possible to bring the performance predictions closer to the experimental results. In particular, it provides accurate information about the value of γ at which experimental improvements above zero start, and about the value of γ with the highest experimental improvement. In this way, according to the values of the parameters in the modified LAWS model that give the best fit to the experimental curve, it can be concluded that after offloading: (1) the CPU workload requires more execution time (σ > 0); (2) the application workload increases as compared to that used in the LAWS model (δa > 0); and (3) the communication overhead decreases (δo < 0).
Fig. 18. Comparison between the peak throughput improvement predicted by LAWS and the improvement obtained by simulation (p = 0.75, α = 1, β = 0.4, B = 1 Gbps).
Fig. 19. Approaching the experimental results with the modified LAWS model (LAWSmod) of expression (2).
It can also be seen that, in the experimental results, the decrease of the throughput improvement as γ increases is similar to the 1/γ rate predicted by the LAWS model. This means that the LAWS model, with the modifications we have included, provides enough flexibility to give an accurate explanation of the offload behavior. Nevertheless, other models should also be considered to allow accurate predictions in circumstances different from those considered by the LAWS model, which targets throughput-limited applications rather than message-oriented ones.
6. Conclusions
Scientific and engineering progress requires adequate tools to obtain experimental data. As has been previously pointed out [31], building a timing simulator for evaluating systems whose workloads require operating system support is difficult. This is the case of network-oriented simulations, as most of the network code runs at the operating system level. In this paper, we leverage the full-system simulator Simics as the basis for modeling the timing of the memory system, the CPU, and the I/O buses, in order to provide a suitable tool for research on computer system issues dealing with networking.
Moreover, in this paper we have compared HDL simulation and full-system simulation to analyze the protocol offloading technique. First of all, an HDL model has allowed us to study the offloading performance with easy control of the parameters that determine the behavior of the different hardware elements. The experimental results obtained by the HDL simulations are qualitatively similar to those predicted by the LAWS model, although there are important quantitative differences between the obtained and the predicted improvements achieved by offloading. Moreover, analyzing the system behavior under realistic workloads and traffic profiles (taking into account the interaction among operating system, hardware, and applications) requires full-system simulation. To do that, we have used Simics.
Although Simics presents some limitations and it would be possible to use other simulators for our purposes, the resources provided by Simics for device modeling and its debugging facilities make it an appropriate tool. Moreover, it allows a relatively fast simulation of the different models. Thus, we have developed timing models that have been included in Simics to overcome the aforementioned limitations of this full-system simulator (which does not provide either timing models or TOE models by itself). Thanks to the Simics models we have developed, it is possible to analyze the most important parameters and the conditions in which offloading yields greater improvements in the overall communication performance. The simulation results show the improvement provided by offloading heavy protocols like TCP, not only in the ideal case, in which we use ideal buses, but also in more realistic situations, in which memory latencies and non-ideal buses are modeled.
The results obtained in our experiments show that offloading provides throughput improvements in all the cases where the host and the NIC processors have similar speeds. Moreover, it is shown that offloading releases 40% of the system CPU cycles in applications with intensive processor utilization. On the other hand, we also present results that show how the technology of the processor included in the NIC affects the overall communication performance. The behavior observed in our experiments coincides with the analyses and conclusions reached from the LAWS model, which constitutes evidence of the correctness of our Simics model for protocol offloading. Nevertheless, as there are important quantitative differences between the LAWS predictions and the results of the Simics simulations, we have included some parameters in the LAWS model to take into account the effect of memory access contention and of the communication between the NIC and the CPU through the I/O subsystem.
In any case, the LAWS model can only be applied to environments in which throughput is limited either by the network bandwidth or by the processing overhead, rather than by latency. However, other performance models can be analyzed with our simulation methodology to offer a wider knowledge of the offloading behavior in other scenarios, with the corresponding benchmarks and real applications. One of these performance models is the EMO model [19], which is applicable to message-oriented environments.
Acknowledgements
This work has been funded by projects TIN2007-60587 (Ministerio de Ciencia y Tecnología, Spain) and TIC-1395 (Junta de Andalucía, Spain). The authors also thank the reviewers for their
suggestions.
References
[1] A.R. Alameldeen et al., Simulating a $2M Commercial Server on a $2K PC, IEEE
Computer (2003) 50–57.
[2] Alteon Websystems: Extended Frame Sizes for Next Generation Ethernets,
http://staff.psc.edu/mathis/MTU/AlteonExtendedFrames_W0601.pdf.
[3] J.S. Bayne, Unleashing the Power of Networks, http://www.johnsoncontrols.
com/Metasys/articles/article7.htm, 1998.
[4] R.A.F. Bhoedjang, T. Rühl, H.E. Bal, User-level network interface protocols, IEEE
Computer (1998) 53–60.
[5] N.L. Binkert, E.G. Hallnor, S.K. Reinhardt, Network-oriented full-system
simulation using M5, in: Sixth Workshop on Computer Architecture
Evaluation using Commercial Workloads (CECW), February 2003.
[6] N.L. Binkert et al., Performance analysis of system overheads in TCP/IP
workloads, in: Proceedings of the 14th International Conference on Parallel
Architectures and Compilation Techniques (PACT’05), 2005.
[7] Broadcom web page: http://www.broadcom.com/, 2007.
[8] S. Chakraborty et al., Performance evaluation of network processor
architectures: combining simulation with analytical estimation, Computer
Networks 41 (2003) 641–645.
[9] Chelsio web page: http://www.chelsio.com/, 2007.
[10] G. Ciaccio, Messaging on gigabit Ethernet: some experiments with GAMMA
and other systems, in: Workshop on Communication Architecture for Clusters,
IPDPS, 2001.
[11] D.D. Clark et al., An analysis of TCP processing overhead, IEEE Communications
Magazine 7 (6) (1989) 23–29.
[12] D.E. Comer, Network Systems Design using Network Processors, Prentice-Hall,
2004.
[13] M. O’Dell, Re: how bad an idea is this? Message on TSV mailing list, November
2002.
[14] Dell web page: www.dell.com (‘‘Boosting data transfer with TCP offload engine
technology” by P. Gupta, A. Light, and I. Hamerof, Dell Power Solutions, August,
2006).
[15] A.F. Díaz, J. Ortega, A. Cañas, F.J. Fernández, M. Anguita, A. Prieto, A light-weight protocol for gigabit Ethernet, in: Workshop on Communication
Architecture for Clusters (CAC’03) (IPDPS’03), April 2003.
[16] D. Freimuth, Server network scalability and TCP offload, in: USENIX Annual
Technical Conference, General Track, 2005, pp. 209–222.
[17] S. GadelRab, 10-Gigabit Ethernet connectivity for computer servers, IEEE Micro
(2007) 94–105.
[18] P. Gelsinger, H.G. Geyer, J. Rattner, Speeding up the network: a system
problem, a platform solution, Technology@Intel Magazine, March 2005.
[19] P. Gilfeather, A.B. Maccabe, Modeling protocol offload for message-oriented
communication, in: Proceedings of the 2005 IEEE International Conference on
Cluster Computing (Cluster 2005), 2005.
[20] Y. Hoskote et al., A TCP offload accelerator for 10 Gb/s Ethernet in 90 nm
CMOS, IEEE Journal of Solid-State Circuits 38 (11) (2003) 1866–1875.
[21] X. Hu, X. Tang, B. Hua, High-performance IPv6 forwarding algorithm for multi-core and multithreaded network processors, in: Proceedings of the 11th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2006,
pp. 168–177.
[22] Intel Product Line of Network Processors, http://www.intel.com/design/
network/products/npfamily/index.htm, 2007.
[23] Intel I/O Acceleration Technology (white paper ‘‘Accelerating High-Speed
Networking with Intel I/O Acceleration Technology”), http://www.intel.com/
technology/ioacceleration/306517.pdf, 2005.
[24] Intel I/O Acceleration Technology, http://www.intel.com/technology/
ioacceleration/index.htm, 2007.
[25] H.-W. Jin, P. Balaji, C. Yoo, J.-Y. Choi, D.K. Panda, Exploiting NIC
architectural support for enhancing IP-based protocols on highperformance networks, Journal of Parallel and Distributed Computing 65
(2005) 1348–1365.
[26] The Linux Kernel Archives web page, http://www.kernel.org, 2007.
[27] H.-Y. Kim, S. Rixner, TCP offload through connection handoff, in: EuroSys’06,
2006.
[28] M5 simulator system Source Forge page: http://sourceforge.net/projects/
m5sim, 2007.
[29] P.S. Magnusson et al., Simics: a full system simulation platform, IEEE Computer
(2002) 50–58.
[30] E.P. Markatos, Speeding up TCP/IP: faster processors are not enough, in: IEEE
21st International Performance, Computing, and Communications Conference,
2002, pp. 341–345.
[31] M.M. Martin et al., Multifacet’s General Execution-driven Multiprocessor
Simulator (GEMS) Toolset, Computer Architecture News (CAN), http://
www.cs.wisc.edu/multifacet/papers/can05_gems.pdf, 2005.
[32] C.J. Mauer et al., Full-system timing-first simulation, in: ACM Sigmetrics
Conference on Measurement and Modeling of Computer Systems, June
2002.
[33] J.C. Mogul, TCP offload is a dumb idea whose time has come, in: Ninth
Workshop on Hot Topics in Operating Systems (HotOS IX), 2003.
[34] Neterion web page: http://www.neterion.com/, 2007.
[35] Opensource Cpuset for Linux web page, http://www.bullopensource.org/
cpuset/, 2004.
[36] A. Ortiz, J. Ortega, A.F. Díaz, A. Prieto, Protocol offload evaluation using Simics,
IEEE Cluster Computing, Barcelona, 2006.
[37] A. Ortiz, J. Ortega, A.F. Díaz, A. Prieto, Analyzing the benefits of protocol offload
by full-system simulation, in: 15th Euromicro Conference on Parallel,
Distributed and Network-based Processing, PDP 2007.
[38] I. Papaefstathiou et al., Network processors for future high-end systems and
applications, IEEE Micro (2004).
[39] M. Rangarajan et al., TCP Servers: Offloading TCP processing in Internet
Servers, Design Implementation and Performance, Tech. Report, DCS-TR-481,
Rutgers Univ., 2002.
[40] G. Regnier et al., TCP onloading for data center servers, IEEE Computer (2004)
48–58.
[41] M. Rosenblum et al., Using the SimOS machine simulator to study complex
computer systems, ACM Transactions on Modeling and Computer Simulation 7
(1) (1997) 78–103.
[42] P. Shivam, J.S. Chase, On the elusive benefits of protocol offload, in:
SIGCOMM’03 Workshop on Network-I/O convergence: Experience, Lessons,
Implications (NICELI), August 2003.
[43] Transmission Control Protocol Specification. RFC793, http://rfc.net/
rfc793.html.
[44] L. Thiele, et al., Design space exploration of network processor architectures,
in: Proceedings of the First Workshop on Network Processors (at the Eighth
International Symposium on High Performance Computer Architecture),
February 2002.
[45] D. Turner, X. Chen, Protocol-dependent message-passing performance on
Linux clusters, in: IEEE International Conference on Cluster Computing, 2002
(Cluster 2002).
[46] Virtual Interface Developer Forum: http://www.vidf.org/, VIDF, 2001.
[47] F.J. Villa, M.E. Acacio, J.M. García, Evaluating IA-32 web servers through
Simics: a practical experience, Journal of Systems Architecture 51 (2005)
251–264.
[48] Virtutech web page: http://www.virtutech.com/, 2007.
[49] R.Y. Wang, A. Krishnamurthy, R.P. Martin, T.E. Anderson, D.E. Culler, Towards a
theory of optimal communication pipelines, Technical Report No. UCB/CSD98-981, EECS Department, University of California, Berkeley, 1998.
[50] R. Westrelin et al., Studying network protocol offload with emulation:
approach and preliminary results, 2004.
[51] B. Wun, P. Crowley, Network I/O acceleration in heterogeneous multicore
processors, in: Proceedings of the 14th IEEE Symposium on High-Performance
Interconnects (HOTI’06), 2006.
Andrés Ortiz received the Ing. degree in Electronics Engineering from the University of Granada in 2000. From 2000 to 2005 he worked as a Systems Engineer with Telefonica, Madrid, Spain, where his work areas were high performance computing and network performance analysis. Since 2004 he has been with the Department of Communications Engineering at the University of Malaga as an Assistant Professor. His research interests include high performance networks, mobile communications, RFID and embedded power-constrained communication devices.
Julio Ortega received the B.Sc. degree in electronic physics in 1985, the M.Sc. degree in electronics in 1986, and the Ph.D. degree in 1990, all from the University of Granada, Spain. His Ph.D. dissertation received the Ph.D. dissertation award of the University of Granada. He was an invited researcher at the Open University, U.K., and at the Department of Electronics of the University of Dortmund, Germany. He is currently a Full Professor at the Department of Computer Architecture and Technology of the University of Granada. His research interests are in the fields of parallel processing and parallel computer architectures, artificial neural networks, and evolutionary computation. He has led research projects in the areas of networks and parallel architectures, and parallel processing for optimization problems.
Antonio F. Díaz received the M.Sc. degree in electronic
physics in 1991 and the Ph.D. degree in 2001, both from
the University of Granada, Spain.
He is currently an Associate Professor in the
Department of Computer Architecture and Computer
Technology. His research interests are in the areas of
network protocols, distributed systems and network
area storage.
Pablo Cascón is a researcher and Ph.D. Student at the
University of Granada. He received the M.Sc. degree in
2004 in Computer Science from the University of Granada. His research interests are in the fields of protocol
offloading, network processors and high performance
communications.
Alberto Prieto earned his B.Sc. in Physics (Electronics)
in 1968 from the Complutense University in Madrid. In
1976, he completed a Ph.D. at the University of Granada.
From 1971 to 1984 he was founder and Head of the
Computing Centre, and he headed Computer Science
and Technology Studies at the University of Granada
from 1985 to 1990. He is currently a fulltime Professor
and Head of the Department of Computer Architecture
and Technology. He is the co-author of four textbooks published by McGraw-Hill and Thomson, has co-edited five volumes of the LNCS, and is co-author of more than 250 articles. His research primarily focuses on intelligent systems.