A Case for End System Multicast
Yang-hua Chu, Sanjay G. Rao, Srinivasan Seshan and Hui Zhang
Carnegie Mellon University
Email: {yhchu,sanjay,srini+,hzhang}@cs.cmu.edu
Abstract— The conventional wisdom has been that IP is the
natural protocol layer for implementing multicast related functionality. However, more than a decade after its initial proposal,
IP Multicast is still plagued with concerns pertaining to scalability, network management, deployment and support for higher
layer functionality such as error, flow and congestion control. In
this paper, we explore an alternative architecture that we term
End System Multicast, where end systems implement all multicast related functionality including membership management
and packet replication. This shifting of multicast support from
routers to end systems has the potential to address most problems associated with IP Multicast. However, the key concern is
the performance penalty associated with such a model. In particular, End System Multicast introduces duplicate packets on
physical links and incurs larger end-to-end delays than IP Multicast.
In this paper, we study these performance concerns in the
context of the Narada protocol. In Narada, end systems self-organize into an overlay structure using a fully distributed protocol. Further, end systems attempt to optimize the efficiency of
the overlay by adapting to network dynamics and by considering application level performance. We present details of Narada
and evaluate it using both simulation and Internet experiments.
Our results indicate that the performance penalties are low both
from the application and the network perspectives. We believe
the potential benefits of transferring multicast functionality from routers to end systems significantly outweigh the performance penalty incurred.
I. Introduction
Traditional network architectures distinguish between two
types of entities: end systems (hosts) and the network (routers
and switches). One of the most important architectural decisions is the division of functionality between end systems and
networks.
In the Internet architecture, the internetworking layer, or
IP, implements a minimal functionality — a best-effort unicast
datagram service, and end systems implement all other important functionality such as error, congestion, and flow control.
Such a minimalist approach is one of the most important technical reasons for the Internet’s growth from a small research
network into a global, commercial infrastructure with heterogeneous technologies, applications, and administrative authorities. The growth of this network has in turn unleashed the
development of new applications, which require richer network
functionality.
The key architectural question is: what new features should
be added to the IP layer? Multicast and QoS are the two most
important features that have been or are being added to the
IP layer. While QoS functionality cannot be provided by end
systems alone and thus has to be supported at the IP layer, this
is not the case for multicast. In particular, it is possible for
end systems to implement multicast services on top of the IP
unicast service.
This research was sponsored by DARPA under contract number F30602-99-1-0518, and by NSF under grant numbers Career Award NCR-9624979, ANI-9730105, ITR Award ANI-0085920, and ANI-9814929. Additional support was
provided by Intel. Views and conclusions contained in this document are those
of the authors and should not be interpreted as representing the official policies,
either expressed or implied, of DARPA, NSF, Intel or the U.S. government.
In deciding whether to implement multicast services at the
IP layer or at end systems, there are two conflicting considerations that we need to reconcile. According to the end-to-end
arguments [18], a functionality should be (i) pushed to higher layers if possible, unless (ii) implementing it at the lower layer
can achieve large performance benefits that outweigh the cost
of additional complexity at the lower layer.
In his seminal work in 1989 [5], Deering argues that this second consideration should prevail and multicast should be implemented at the IP layer. This view has been widely accepted
so far. IP Multicast is the first significant feature that has been
added to the IP layer since its original design and most routers
today implement IP Multicast. Despite this, IP Multicast has
several drawbacks that have so far prevented the service from
being widely deployed. First, IP Multicast requires routers to
maintain per group state, which not only violates the “stateless”
architectural principle of the original design, but also introduces
high complexity and serious scaling constraints at the IP layer.
Second, IP Multicast is a best effort service, and attempts to
conform to the traditional separation of routing and transport
that has worked well in the unicast context. However, providing higher level features such as reliability, congestion control,
flow control, and security has been shown to be more difficult
than in the unicast case. Finally, IP Multicast calls for changes
at the infrastructural level, and this slows down the pace of
deployment.
In this paper, we revisit the issue of whether multicast related functionality should be implemented at the IP layer or at
the end systems. In particular, we consider a model in which
multicast related features, such as group membership, multicast routing and packet duplication, are implemented at end
systems, assuming only unicast IP service. We call the scheme
End System Multicast. Here, end systems participating in the
multicast group communicate via an overlay structure. The
structure is an overlay in the sense that each of its edges corresponds to a unicast path between two end systems in the
underlying Internet.
We believe that End System Multicast has the potential to
address most problems associated with IP Multicast. Since all
packets are transmitted as unicast packets, deployment may be
accelerated. End System Multicast maintains the stateless nature of the network by requiring end systems, which subscribe
only to a small number of groups, to perform additional complex processing for any given group. In addition, we believe that
solutions for supporting higher layer features such as error, flow,
and congestion control can be significantly simplified by leveraging well understood unicast solutions for these problems, and
by exploiting application specific intelligence.
While End System Multicast has many advantages, several
issues need to be resolved before it becomes a practical alternative to IP Multicast. In particular, an overlay approach to
multicast, however efficient, cannot perform as well as IP Multicast. It is impossible to completely prevent multiple overlay
edges from traversing the same physical link and thus some redundant traffic on physical links is unavoidable. Further, communication between end systems involves traversing other end
systems, potentially increasing latency. We present an example
to illustrate these points in Section II. In this paper, we focus on
two fundamental questions pertaining to the End System Multicast architecture: (i) what are the performance implications
of using an overlay architecture for multicast? and (ii) how do
end systems with limited topological information cooperate to
construct good overlay structures?
In this paper, we seek to answer these questions in the context
of a protocol that we have developed called Narada. Narada
constructs an overlay structure among participating end systems in a self-organizing and fully distributed manner. Narada
is robust to the failure of end systems and to dynamic changes
in group membership. End systems gather information about network path characteristics using passive monitoring and active
measurements. Narada continually refines the overlay structure
as more network information is available. We present details of
Narada’s design in Section III.
We have conducted a detailed evaluation of the End System
Multicast architecture in the context of the Narada protocol using simulation and Internet experiments. Our evaluation considers both application and network level metrics as discussed
in Section IV. Our Internet experiments are conducted on a
wide-area test-bed of about twenty hosts as described in Section V. The results indicate that End System Multicast can
achieve bandwidth performance comparable to IP Multicast,
while at the same time achieving mean receiver latencies that
are about 1.3–1.5 times the latencies seen with IP Multicast. Results from our simulation experiments, presented in Section VI,
are consistent with our Internet results and indicate that the
promise of the End System Multicast architecture extends to
medium sized groups of hundreds of members. For example, for
groups of 128 members, the average receiver delay with Narada
is about 2.2–2.8 times the average receiver delay with IP Multicast, while the network resources consumed by Narada are about twice those of IP Multicast. Overall, our results demonstrate that
End System Multicast is a promising architecture for enabling
small and medium sized group communication applications on
the Internet.
II. End System Multicast

We illustrate the differences between IP Multicast, naive unicast and End System Multicast using Figure 1. Figure 1(a) depicts an example physical topology, where R1 and R2 are routers, while A, B, C and D are end systems. Link delays are as indicated. R1 − R2 represents a costly transcontinental link, while all other links are cheaper local links. Further, let us assume A wishes to send data to all other nodes.

Figure 1(b) depicts naive unicast transmission. Naive unicast results in significant redundancy on links near the source (for example, link A − R1 carries three copies of a transmission by A), and results in duplicate copies on costly links (for example, link R1 − R2 carries two copies of a transmission by A).

Figure 1(c) depicts the IP Multicast tree constructed by DVMRP [5]. DVMRP is the classical IP Multicast protocol, where data is delivered from the source to recipients using an IP Multicast tree composed of the unicast paths from each recipient to the source. Redundant transmission is avoided, and exactly one copy of the packet traverses any given physical link. Each recipient receives data with the same delay as though A were sending to it directly by unicast.

Figure 1(d) depicts an "intelligent" overlay tree that may be constructed using the End System Multicast architecture. The number of redundant copies of data near the source is reduced compared to naive unicast, and just one copy of the packet goes across the costly transcontinental link R1 − R2. Yet, this efficiency over naive unicast schemes has been obtained with absolutely no change to routers; all intelligence is implemented at the end systems. However, while intelligently constructed overlay trees can result in much better performance than naive unicast solutions, they still cannot perform as well as solutions with native IP Multicast support. For example, in Figure 1(d), link A − R1 carries a redundant copy of the data transmission, while the delay from source A to receiver D has increased.

Given that End System Multicast tries to push functionality to the edges, there are two very different ways this can be achieved: peer-to-peer architectures and proxy-based architectures. In a peer-to-peer architecture, functionality is pushed to the end hosts actually participating in the multicast group. Such architectures are thus completely distributed, with each end host maintaining state only for the groups it actually participates in. In a proxy-based architecture, on the other hand, an organization that provides value-added services deploys proxies at strategic locations on the Internet. End hosts attach themselves to nearby proxies and receive data using plain unicast or any available multicast media. While these architectures have important differences, fundamental to both are concerns regarding the performance penalty involved in disseminating data using overlays as compared to solutions with native multicast support. Thus, an end system in this paper refers to the entity that actually takes part in the self-organization protocol, and could be either an end host or a proxy.

Our evaluation of End System Multicast targets a wide range of group communication applications such as audio and video conferencing, virtual classrooms and multi-party network games. Such applications typically involve small (tens of members) and medium sized (hundreds of members) groups. While End System Multicast may be relevant even for applications that involve much larger group sizes, such as broadcasting and content distribution (particularly in the context of proxy-based architectures), such applications are outside the focus of this paper. We defer a detailed discussion to Section VII.
III. Narada Design
In this section, we present Narada, a protocol we designed
that implements End System Multicast. In designing Narada,
we had the following objectives in mind:
• Self-organizing: The construction of the end system overlay
must take place in a fully distributed fashion and must be robust
to dynamic changes in group membership.
• Overlay efficiency: The tree constructed must be efficient
both from the network and the application perspective. From
the network perspective, the constructed overlay must ensure
that redundant transmission on physical links is kept minimal.
However, different applications may require overlays with different notions of efficiency. While interactive applications like
audio conferencing and group messaging require low latencies,
applications like video conferencing simultaneously require high
bandwidth and low latencies.
• Self-improving: The overlay construction must include mechanisms by which end systems gather network information in a
scalable fashion. The protocol must allow for the overlay to incrementally evolve into a better structure as more information
becomes available.
• Adaptive to network dynamics: The overlay must adapt to
long-term variations in Internet path characteristics (such as
bandwidth and latency), while being resilient to inaccuracies inherent in the measurement of these quantities.
Fig. 1. Example to illustrate naive unicast, IP Multicast and End System Multicast
Fig. 2. Example to illustrate the mesh based approach in Narada.
The intuitive approach to constructing overlay spanning trees
is to construct them directly; that is, members explicitly select their parents from among the members they know [10].
Narada however constructs trees in a two-step process. First,
it constructs a richer connected graph that we term a mesh,
and tries to ensure that the mesh has desirable performance
properties that are discussed later. In the second step, Narada
constructs spanning trees of the mesh, each tree rooted at the
corresponding source, using well-known routing algorithms. Figure 2 presents an example mesh that Narada constructs for the
physical topology shown in Figure 1(a), along with a spanning
tree rooted at A.
This mesh-based approach is motivated by the need to support multi-source applications. Single shared trees are susceptible to a central point of failure, and are not optimized for
an individual source. Explicitly constructing multiple overlay
trees, one for each source, is a possible design alternative
but needs to deal with the overhead of maintaining and optimizing multiple overlays. In contrast, meshes allow us to construct trees optimized for the individual source, yet allow us
to abstract out group management functions at the mesh level
rather than replicating them across multiple trees. Further, we
may leverage standard routing algorithms for construction of
data delivery trees.
In our approach, trees for data delivery are constructed entirely from the overlay links present in the mesh. Hence, it becomes important to construct a good mesh so that good quality
trees may be produced. A good mesh has two properties. First,
the quality of the path between any pair of members is comparable to the quality of the unicast path between that pair of
members. Second, each member has a limited number of neighbors in the mesh. By path qualities, we refer to the metrics of
interest for the application, such as delay and bandwidth. Limiting the number of neighbors in the mesh controls the overhead
of running routing algorithms on the mesh.
We explain the distributed algorithms that Narada uses to
construct and maintain the mesh in Section III-A. We present
mechanisms Narada uses to improve mesh quality in Section III-B. Narada runs a variant of standard distance vector algorithms on top of the mesh and uses well-known algorithms to construct per-source (reverse) shortest path spanning trees for data delivery. We discuss this in Section III-C. While the Narada framework is generic and applicable to a range of applications, it may be customized to meet the requirements of a specific application. We discuss this in Section III-D with the example of conferencing applications.

A. Group Management
In this section, we present distributed heuristics Narada uses
to keep the mesh connected, to incorporate new members into
the mesh and to repair possible partitions that may be caused
by members leaving the group or by member failure.
As we do not wish to rely on a single non-failing entity to
keep track of group membership, the burden of group maintenance is shared jointly by all members. To achieve a high degree
of robustness, our approach is to have every member maintain
a list of all other members in the group. Since Narada is targeted towards medium sized groups, maintaining the complete
group membership list is not a major overhead. Every member’s list needs to be updated when a new member joins or an
existing member leaves. The challenge is to disseminate changes
in group membership efficiently, especially in the absence of a
multicast service provided by the lower layer. We tackle this by
exploiting the mesh to propagate such information. However,
this strategy is complicated by the fact that the mesh might
itself become partitioned when a member leaves. To handle
this, we require that each member periodically generate a refresh message with monotonically increasing sequence number,
which is disseminated along the mesh. Each member i keeps
track of the following information for every other member k in
the group: (i) the address of member k; (ii) the last sequence number s_ki that i knows k has issued; and (iii) the local time at i when i first received information that k issued s_ki. If member i has not received an update from member k for T_m time, then i assumes that k is either dead or potentially partitioned from i. Member i then initiates a set of actions to determine the existence of a partition and repair it if present, as discussed in Section III-A.3.
Propagation of refresh messages from every member along the
mesh could potentially be quite expensive. Instead, we require
that each member periodically exchange its knowledge of group
membership with its neighbors in the mesh. A message from
member i to a neighbor j contains a list of entries, one entry for
each member k that i knows is part of the group. Each entry has
the following fields: (i) the address of member k; and (ii) the last sequence number s_ki that i knows k has issued. On receiving a message
from a neighbor j, member i updates its table according to the
pseudo code presented in Figure 3.
Finally, given that a distance vector routing algorithm is run
on top of the mesh (Section III-C), routing update messages exchanged between neighbors can include member sequence number information with minimum extra overhead.
Let i receive a refresh message from neighbor j at i's local time t.
Let <k, s_kj> be an entry in j's refresh message.
• if i does not have an entry for k, then i inserts the entry <k, s_kj, t> into its table
• else if i's entry for k is <k, s_ki, t_ki>, then
    • if s_ki >= s_kj, i ignores the entry pertaining to k
    • else i updates its entry for k to <k, s_kj, t>
Fig. 3. Actions taken by a member i on receiving a refresh message from
member j.
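To make this concrete, the following is a minimal runnable sketch of the update rule of Figure 3 in Python. The names (Entry, membership_table, on_refresh) and the use of a monotonic local clock are our illustrative assumptions, not part of Narada's specification.

import time
from dataclasses import dataclass

@dataclass
class Entry:
    seq: int           # last sequence number we know this member issued
    first_seen: float  # local time when this sequence number was learned

membership_table: dict[str, Entry] = {}  # member address -> Entry

def on_refresh(entries: list[tuple[str, int]]) -> None:
    """Apply one refresh message, a list of (member address, seq) pairs."""
    now = time.monotonic()
    for addr, seq in entries:
        known = membership_table.get(addr)
        if known is None or seq > known.seq:
            # New member, or fresher sequence number: record it.
            membership_table[addr] = Entry(seq, now)
        # otherwise our information is at least as recent; ignore the entry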
Fig. 4. A sample overlay topology
Let Q be a queue of members for which i has stopped
receiving sequence number updates for at least T_m time.
Let T be the maximum time an entry may remain in Q.

while(1) begin
    Update Q;
    while( !Empty(Q) and
           Head(Q) has been present in Q for >= T time) begin
        j = Dequeue(Q);
        Initiate probe cycle to determine if j is dead
        or to add a link to it.
    end
    if( !Empty(Q) ) begin
        prob = Length(Q) / GroupSize;
        With probability prob begin
            j = Dequeue(Q);
            Initiate probe cycle to determine if j is dead
            or to add a link to it.
        end
    end
    sleep(P); // sleep for P seconds
end
Fig. 5. Scheduling algorithm used by member i to repair mesh partition
A.1 Member Join
When a member wishes to join a group, Narada assumes that
the member is able to get a list of group members by an out-of-band bootstrap mechanism. The list need be neither complete nor accurate, but must contain at least one currently active
group member. In this paper, we do not address the issue of
the bootstrap mechanism. We believe that such a mechanism
is application specific and our protocol is able to accommodate
different ways of obtaining the bootstrap information.
The joining member randomly selects a few group members
from the list available to it and sends them messages requesting
to be added as a neighbor. It repeats the process until it gets a response from some member, at which point it has successfully joined the group. Having joined, the member then starts exchanging
refresh messages with its neighbors. The mechanisms described
earlier will ensure that the newly joined member and the rest
of the group learn about each other quickly.
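A minimal sketch of this join loop follows, assuming a simple UDP request/response handshake. The message format and port are hypothetical, and a real implementation might contact several candidates at once rather than in sequence.

import random
import socket

def join_group(bootstrap_list: list[str], port: int = 9000) -> str:
    """Try random candidates until one accepts us as a mesh neighbor."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(2.0)
    candidates = bootstrap_list[:]
    random.shuffle(candidates)
    for host in candidates:
        try:
            sock.sendto(b"NEIGHBOR_REQUEST", (host, port))
            data, addr = sock.recvfrom(1024)
            if data == b"NEIGHBOR_ACCEPT":
                return host  # joined: now start exchanging refresh messages
        except socket.timeout:
            continue  # this candidate is unreachable; try the next one
    raise RuntimeError("no bootstrap member responded")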
A.2 Member Leave and Failure
When a member leaves a group, it notifies its neighbors, and
this information is propagated to the rest of the group members
along the mesh. In Section III-C, we will describe our enhancement to distance vector routing that requires a leaving member
to continue forwarding packets for some time to minimize transient packet loss.
We also need to consider the difficult case of abrupt failure. In
such a case, failure should be detected locally and propagated to
the rest of the group. In this paper, we assume a fail-stop failure
model [20], which means that once a member dies, it remains in
that state, and the fact that the member is dead is detectable
by other members. We explain the actions taken on member
death with respect to Figure 4. This example depicts the mesh
between group members at a given point in time. Assume that
member C dies. Its neighbors in the mesh, A and G, stop receiving refresh messages from C. Each of them independently sends redundant probe messages to C, such that the probability that every probe message (or its reply) is lost is very small. If C does not
respond to any probe message, then, A and G assume C to be
dead and propagate this information throughout the mesh.
Every member needs to retain entries in its group membership table for dead members. Otherwise, it is impossible to
distinguish between a refresh announcing a new member and
a refresh announcing stale information regarding a dead member. However, dead member information can be flushed after a sufficient amount of time.
A.3 Repairing Mesh Partitions
It is possible that member failure can cause the mesh to become partitioned. For example, in Figure 4, if member A dies,
the mesh becomes partitioned. In such a case, members must
first detect the existence of a partition, and then repair it by
adding at least one overlay link to reconnect the mesh. Members on each side of the partition stop receiving sequence number updates from members on the other side. This condition
is detected by a timeout of duration T_m.
Each member maintains a queue of members that it has
stopped receiving sequence number updates from for at least
T_m time. It runs a scheduling algorithm that periodically and
probabilistically deletes a member from the head of the queue.
The deleted member is probed and it is either determined to
be dead, or a link is added to it. The scheduling algorithm is
adjusted so that no entry remains in the queue for more than a
bounded period of time. Further, the probability value is chosen carefully so that, in spite of several members simultaneously attempting to repair the partition, only a small number of new links
are added. The algorithm is summarized in Figure 5.
B. Mesh Performance
The constructed mesh can be quite sub-optimal, because (i)
initial neighbor selection by a member joining the group is random, given the limited availability of topology information at bootstrap; (ii) partition repair might aggressively add edges that
are essential for the moment but not useful in the long run; (iii)
group membership may change due to dynamic join and leave;
and (iv) underlying network conditions, routing and load may
vary. Narada allows for incremental improvement of mesh quality by adding and dropping overlay links. Members probe
each other at random and new links may be added depending
on the perceived gain in utility in doing so. Further, members
continuously monitor the utility of existing links, and drop links
perceived as not useful. This dynamic adding and dropping of
links in the mesh distinguishes Narada from traditional routing
protocols.
The issue, then, is the design of a utility function that reflects mesh quality. A good quality mesh must ensure that for
any pair of members, there exist paths along the mesh that
EvaluateUtility(j) begin
    utility = 0
    for each member m (m != i) begin
        CL = current latency between i and m along mesh
        NL = new latency between i and m along mesh
             if edge i-j were added
        if (NL < CL) then begin
            utility += (CL - NL) / CL
        end
    end
    return utility
end
Fig. 6. Example algorithm that i uses in determining the utility of adding a link to j, when latency is the main metric of interest
EvaluateConsensusCost(j) begin
    Cost_ij = number of members for which i uses j as the
              next hop for forwarding packets.
    Cost_ji = number of members for which j uses i as the
              next hop for forwarding packets.
    return max(Cost_ij, Cost_ji)
end

Fig. 7. Algorithm that i uses to determine the consensus cost to a neighbor j
can provide performance comparable to the performance of the
unicast path between the members. A member i computes the
utility gain if a link is added to member j based on (i) the number of members for which performance at i improves if the link to j is added; and (ii) how significant this improvement in performance is.
The precise utility function depends on the performance metric
(or metrics) that the overlay is being optimized for. Figure 6
presents example pseudo code for a setting where latency is the
primary metric of interest. The utility can take a maximum
value of n, where n is the number of group members i is aware
of. Each member m can contribute a maximum of 1 to the utility, the actual contribution being i's relative decrease in delay
to m if the edge to j were added. Narada adds and removes
links from the mesh using the following heuristics:
• Addition of links: Every member periodically probes some
random member that is not a neighbor, and evaluates the utility
of adding a link to this member. When a member i probes a
member j, j returns to i a copy of its routing table. i uses
this information to compute the expected gain in utility if a
link to j is added as described in Figure 6. i decides to add a
link to j if the expected utility gain exceeds a given threshold.
The threshold is chosen to depend on the group size, and the
number of neighbors i and j have in the mesh. Finally, there
may be other metric-specific heuristics for link addition. For
example, when the overlay is optimized for latency, i may also
add a link to j if the physical delay between them is very low
and the current overlay delay between them very high.
• Dropping of links: Ideally, the loss in utility if a link were to
be dropped must exactly equal the gain in utility if the same
link were immediately re-added. However, this requires estimating the relative degradation in performance to a member if a link were dropped, and it is difficult to obtain such information. Instead, we overestimate the actual utility of a link by its cost. The cost of a link between i and j, in i's perception, is the number of group members for which i uses j as the next hop. Periodically, a member computes the consensus cost of its link to every neighbor using the algorithm shown in Figure 7. It then picks the neighbor with the lowest consensus cost and drops it if the consensus cost falls below a certain threshold. The threshold is again computed as a function of the member's estimate of the group size and its number of mesh neighbors. The consensus cost of a link is the maximum of the cost of the link in each neighbor's perception; yet it can be computed locally, as the mesh runs a distance vector algorithm with path information (see the sketch below).
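The sketch below ties the add and drop heuristics together for the latency-only case. The routing lookups (route_latency, route_latency_with) and the threshold functions are hypothetical placeholders for i's routing table, the table returned by j, and the group-size-dependent thresholds described above.

def evaluate_utility(i, j, members, route_latency, route_latency_with):
    """Expected utility gain for i if the overlay link i-j is added."""
    utility = 0.0
    for m in members:
        if m == i:
            continue
        cl = route_latency(i, m)          # current latency along the mesh
        nl = route_latency_with(i, m, j)  # latency if edge i-j were added
        if nl < cl:
            utility += (cl - nl) / cl     # relative improvement, at most 1
    return utility

def should_add(i, j, members, route_latency, route_latency_with,
               add_threshold):
    gain = evaluate_utility(i, j, members, route_latency,
                            route_latency_with)
    return gain > add_threshold(len(members))

def consensus_cost(cost_ij, cost_ji):
    """Cost of a link as perceived by both endpoints (Figure 7)."""
    return max(cost_ij, cost_ji)

def should_drop(cost_ij, cost_ji, drop_threshold, group_size):
    return consensus_cost(cost_ij, cost_ji) < drop_threshold(group_size)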
Our heuristics for link-dropping have the following desirable
properties:
• Stability: A link that Narada drops is unlikely to be added
again immediately. This is ensured by several factors: (i) the
threshold for dropping a link is less than or equal to the threshold for adding a link; (ii) the utility of an existing link is overestimated by the cost metric; and (iii) dropping of links is done
considering the perception that both members have regarding
link cost.
• Partition avoidance: We present an informal argument as
to why our link dropping algorithm does not cause a partition
assuming steady state conditions and assuming multiple links
are not dropped concurrently. Assume that member i drops
neighbor j. This could result in at most two partitions. Assume
the size of i's partition is S_i and the size of j's partition is S_j.
Further, assume both i and j know all members currently in
the group. Then, the sum of S_i and S_j is the size of the group. Thus Cost_ij must be at least S_j and Cost_ji must be at least S_i,
and at least one of these must exceed half the group size. As
long as the drop threshold is lower than half the group size, the
edge will not be dropped. Finally, we note that in situations
where a partition of the mesh is caused (for example, due to
multiple links being dropped simultaneously), our mechanisms
for partition detection and repair described in Section III-A
would handle the situation.
C. Data Delivery
Narada runs a distance vector protocol on top of the mesh.
In order to avoid the well-known count-to-infinity problems, it
employs a strategy similar to BGP [17]. Each member not only
maintains the routing cost to every other member, but also
maintains the path that leads to such a cost. Further, routing updates between neighbors contain both the cost to the
destination and the path that leads to such a cost. The per-source trees used for data delivery are constructed from the
reverse shortest path between each recipient and the source, in
identical fashion to DVMRP [5]. A member M that receives a
packet from source S through a neighbor N forwards the packet
only if N is the next hop on the shortest path from M to S.
Further, M forwards the packet to all its neighbors who use M
as the next hop to reach S.
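A minimal sketch of this forwarding rule follows, where next_hop and send are hypothetical helpers standing in for the member's routing table and the overlay transport. In Narada, M learns which neighbors route through it from their routing updates; modeling that as a direct next_hop query is a simplification.

def handle_packet(me, source, came_from, neighbors, next_hop, send):
    # Accept only packets arriving via our shortest path back to the source.
    if came_from != next_hop(me, source):
        return  # not on the reverse shortest path; drop the packet
    for n in neighbors:
        # Forward to each neighbor whose shortest path to the source
        # passes through us.
        if n != came_from and next_hop(n, source) == me:
            send(n)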
The routing metric used in the distance vector protocol depends on the metrics for which the overlay is being optimized, which in turn depend on the particular application. We
present an example in Section III-D.
A consequence of running a routing algorithm for data delivery is that there could be packet loss during transient conditions
when member routing tables have not yet converged. In particular, there could be packet loss when a member leaves the
group or when a link is dropped for performance reasons. To
avoid this, data continues to be forwarded along old routes for
enough time until routing tables converge. To achieve this, we
introduce a new routing cost called Transient Forward (TF).
TF is guaranteed to be larger than the cost of a path with a
valid route, but smaller than infinite cost. A member M that
leaves advertises a cost of TF for all members for which it had
a valid route. Normal distance vector operations lead to members choosing alternate valid routes not involving M (as TF is
guaranteed to be larger than the cost of any valid route). The
leaving member continues to forward packets until it is no longer
used by any neighbor as a next hop to reach any member, or
until a certain time period expires.
D. Application Specific Customizations
A key feature of End System Multicast is that it enables application customizable solutions. In this section, we will study
how we support an important, performance demanding class of
applications - video conferencing - within the Narada framework. Conferencing applications require overlay trees simultaneously optimized for both latency and available bandwidth.
Thus, this study allows us to illustrate how dynamic metrics
like bandwidth and latency are dealt with in the Narada framework. These ideas may be applied to other applications as well.
Conferencing applications deal with media streams that can
tolerate loss through a degradation in application quality. This
allows us to build a system that employs a hop-by-hop congestion control protocol. An overlay node adapts to a bandwidth
mismatch between the upstream and downstream links by dropping packets. We use TFRC [7] as the underlying transport protocol on each overlay link. TFRC is a rate-controlled UDP-based protocol that achieves TCP-friendly bandwidths. It does not suffer delays associated with TCP, such as retransmission delays and queueing delays at the sender buffer.
To construct overlay trees simultaneously optimized for bandwidth and latency, we have leveraged work done by Wang and
Crowcroft [21] in the context of routing on multiple metrics in
the Internet. A first choice is to use a single mixed routing metric which is a function of both bandwidth and latency. However,
it is unclear how this function can individually reflect the bandwidth and latency requirements of the application. Instead, we
use multiple routing metrics in the distance vector protocol: the latency between members and the available bandwidth. The
routing protocol uses a variant of the shortest widest path algorithm presented in [21]. Every member tries to pick the widest
(highest bandwidth) path to every other member. If there are
multiple paths with the same bandwidth, the member picks the
shortest (lowest latency) path among all these.
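A sketch of the resulting route comparison follows, assuming each candidate route is summarized by a (bandwidth, latency) pair; in the distance vector protocol this comparison decides whether a route advertised by a neighbor replaces the current one.

def better_route(candidate, current):
    """Each route is a (bandwidth, latency) tuple; True if candidate wins."""
    cand_bw, cand_lat = candidate
    cur_bw, cur_lat = current
    if cand_bw != cur_bw:
        return cand_bw > cur_bw   # prefer the widest path
    return cand_lat < cur_lat     # break ties with the shortest path

assert better_route((1024, 80), (512, 30))  # more bandwidth always wins
assert better_route((512, 30), (512, 80))   # equal bandwidth: lower latency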
We collect raw latency estimates of links in the mesh by having neighbors ping each other every 200 milliseconds. Raw estimates of available bandwidth are obtained by passively monitoring data flow along the links. Both available bandwidth and
latency are dynamic in nature, and using them as routing metrics leads to serious concerns of instability. We deal with the
stability concerns using techniques in the design of the routing
metrics described below:
• Latency: We filter raw estimates of the overlay link latency
using an exponential smoothing algorithm. The advertised link
latency is left unchanged until the smoothed estimate differs
from the currently advertised latency by a significant amount.
• Available bandwidth: We filter raw estimates of the available
bandwidth of an overlay link using an exponential smoothing algorithm, to produce a smoothed estimate. Next, instead of using
the smoothed estimate as a routing metric, we define discretized
bandwidth levels. The smoothed estimate is rounded down to
the nearest bandwidth level for routing purposes. Thus, a mesh
link with a smoothed estimate of 600 Kbps may be advertised
as having a bandwidth of 512 Kbps, in a system with levels corresponding to 512 Kbps and 1024 Kbps. To minimize possible
oscillations when the smoothed estimate is close to a bandwidth
level, we employ a simple hysteresis algorithm. Thus, while we
move down a level immediately when the smoothed estimate
falls below the current level, we move up a level only if the estimate significantly exceeds the bandwidth corresponding to the
next level.
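The following sketch illustrates the discretization with hysteresis. The level values and the 10% margin are illustrative assumptions; only the 600 Kbps example is taken from the text.

LEVELS = [256, 512, 1024]  # Kbps; largest level = bounded source rate
MARGIN = 1.10              # move up only on a clearly higher estimate

def advertised_level(smoothed_kbps, current_level):
    # Round the smoothed estimate down to the nearest bandwidth level.
    eligible = [l for l in LEVELS if l <= smoothed_kbps]
    target = max(eligible) if eligible else LEVELS[0]
    if target < current_level:
        return target   # move down a level immediately
    if target > current_level and smoothed_kbps >= target * MARGIN:
        return target   # move up only with a comfortable margin
    return current_level

# A smoothed estimate of 600 Kbps is advertised as 512 Kbps:
assert advertised_level(600, 512) == 512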
Given that conferencing applications often have a bounded
source rate, the largest level in the system is set to this maximum rate. Discretization of bandwidth and choice of a maximum bandwidth level ensure that all overlay links can fall in a
small set of equivalence classes with regard to bandwidth. This
discretized bandwidth metric not only enables greater stability
in routing on the overlays, but also allows latency to become
a determining factor when different links have similar but not
identical bandwidth.
Our discussion so far has described how we incorporate bandwidth into the routing protocol, given a mesh that has been
constructed. Incorporating bandwidth also requires changes to
the heuristics used when a mesh link is added. When a member
i probes a member j and has an estimate of the available bandwidth to j, a modified version of the utility function presented
in Figure 6 that considers bandwidth is used to determine if a
link to j must be added. However, if i does not have a bandwidth estimate to j, it first determines the available bandwidth
using active measurements involving transfer of data using the
underlying transport protocol for several seconds, but at a rate
bounded by the maximum source rate. To minimize the number
of active measurements, i probes j only if j is seeing better application-level performance to the sources than i is.
We are currently investigating whether we can minimize active
bandwidth measurements by using light-weight probing techniques such as RTT probes and 10 Kilobyte data transfers.
IV. Evaluation
The primary goal of our evaluation is to answer the following
question: what performance penalty does an End System Multicast architecture suffer as compared to IP Multicast with regard
to both application and network level metrics? To answer this
question, we compare the performance of a scheme for disseminating data under the IP Multicast architectural framework,
with the performance of various schemes for disseminating data
under the End System Multicast framework.
We have conducted our evaluation using both simulation and
Internet experiments. Internet experiments help us understand
how schemes for disseminating data behave in dynamic and unpredictable real-world environments, and give us an idea of the
end-to-end performance seen by actual applications. On the
other hand, simulations help analyze the scaling properties of
the End System Multicast architecture with larger group sizes.
Further, they help in understanding details of protocol behavior
under controlled and repeatable settings.
In the rest of this section, we present the schemes that we compare for disseminating data, along with our performance metrics. Sections V and VI present results for our Internet and simulation
experiments.
A. Schemes Compared
We compare the following schemes for disseminating data in
our simulation and Internet experiments.
• DVMRP: We assume that IP Multicast involves constructing
classical DVMRP-like trees [5], composed of the reverse paths
from the source to each receiver.
• Narada: This represents a scheme that constructs overlay
trees in an informed manner, making use of network metrics
like bandwidth and latency. It is indicative of the performance
one may expect with an End System Multicast architecture,
7
though an alternate protocol may potentially result in better
performance.
• Random: This represents a naive scheme that constructs
random but connected End System Multicast overlays.
• Naive Unicast: Here, the source simultaneously unicasts data to all receivers. Thus, in a group of size n, the source must send n − 1 copies of the same data.
We note that the network metric considered in Narada impacts overlay performance. We have evaluated Narada-based
schemes that consider: (i) static delay based metrics such as
propagation delay; (ii) latency alone; (iii) bandwidth alone; and
(iv) latency and bandwidth. We refer the reader to [3] for detailed results of this study.
B. Performance Metrics
To facilitate our comparison, we use several metrics that capture both application and network level performance.
• Latency: This metric measures the end-to-end delay from the
source to the receivers, as seen by the application.
• Bandwidth: This metric measures the application level
throughput at the receiver.
• Stress: We refer to the number of identical copies of a packet
carried by a physical link as the stress of a physical link. For
example, in Figure 1(b), links R1 − R2 and A − R1 have a stress
of 2 and 3 respectively, while in Figure 1(d), link R1 − R2 has
a stress of 1. In general, we would like to keep the stress on all
links as low as possible.
• Resource Usage: We define resource usage as Σ_{i=1}^{L} d_i · s_i, where L is the number of links active in data transmission, d_i is the delay of link i and s_i is the stress of link i. The resource usage is a metric of the network resources consumed in the process of data delivery to all receivers. Implicit here is the assumption that links with higher delay tend to be associated with higher cost. The resource usage is 30 for transmission by DVMRP, 57 for naive unicast and 32 for the smarter tree, shown in Figure 1(c), Figure 1(b) and Figure 1(d) respectively; the sketch following this list illustrates the computation. Finally, we compute the Normalized Resource Usage (NRU) of a scheme as the ratio of the resource usage with that scheme to the resource usage with DVMRP.
• Protocol Overhead: This metric is defined as the ratio of the
total bytes of non-data traffic that enters the network to the total bytes of data traffic that enters the network. The overhead
includes control traffic required to keep the overlay connected,
and the probe traffic and active bandwidth measurements involved in the self-organization process.
Latency and bandwidth are application level performance
metrics, while all other metrics measure network costs. Not all
applications care about both latency and bandwidth. Our evaluation thus considers the needs of applications with more stringent requirements (such as conferencing), which require both
high bandwidth and low latencies. An architecture that can
support such applications well can potentially also support applications that care about latency, or bandwidth alone.
C. Issues in Measuring Performance Metrics
Our Internet and simulation evaluations present several issues that we now discuss.
C.1 Internet Evaluation
The limited deployment of IP Multicast in the real world
makes it difficult to evaluate application-level performance using this architecture. Instead, we approximate this by the Sequential Unicast test. Here, we measure the bandwidth and
latency of the unicast path from the source to each recipient
independently (in the absence of other recipients). The above
technique provides an indication of the performance that applications would see with IP Multicast using DVMRP like trees.
While DVMRP actually results in trees composed of unicast
paths from the receiver to the source (reverse-path forwarding),
we do not expect this to affect our comparison results.
We compute the resource usage with DVMRP by deriving
the physical links of the tree, as well as the delays of these links,
by doing a traceroute from the source to each receiver.
Our Internet experiments currently do not measure stress.
Measuring this metric requires an accurate knowledge of the
physical paths between all pairs of members.
In our Internet experiments, the latency metric includes the
propagation and queuing delays of individual overlay links, as
well as queueing delay and processing overhead at end systems
along the path. We ideally wish to measure the latency of each
individual data packet. However, issues associated with time
synchronization of hosts and clock skew add noise to our measurements of one-way delay that is difficult to quantify. Therefore, we choose to estimate the round-trip time (RTT). By RTT,
we refer to the time it takes for a packet to move from the source
to a recipient along a set of overlay links, and back to the source,
using the same set of overlay links but in reverse order. Thus,
the RTT of an overlay path S-A-R is the time taken to traverse
S-A-R-A-S. The RTT measurements include all delays associated with one-way latencies, and are ideally twice the end-to-end
delay.
C.2 Simulation Evaluation
Our simulation experiments are conducted using a locally
written, packet-level, event-based simulator. The simulator assumes shortest delay routing between any two members. The
simulator models the propagation delay of physical links but
does not model bandwidth, queueing delay and packet losses.
This was done for two reasons. First, it is difficult to model
Internet dynamics in a reasonable way in a simulator. Second,
modeling of cross-traffic potentially restricts the scalability of
our simulations.
Given these restrictions, not all metrics have been evaluated
in our simulations. In particular, we do not consider the bandwidth between members. Second, we assume that delays between members remains constant, and the Latency metric is
used in a more static sense. Finally, the Protocol Overhead metric in our simulations does not consider the overhead involved
in members discovering bandwidth to each other.
D. Summarizing Performance of Schemes
The objective of our evaluation is to compare
the performance of various schemes for disseminating data with
respect to each of the performance metrics listed in Section IVB. For a metric such as resource usage, it is easy to summarize
the performance of a scheme. However, it is much more difficult to summarize the latency and bandwidth performance that
a number of different hosts observe with a particular scheme.
One approach is to present the mean bandwidth and latency,
averaged across all receivers. Indeed, we do use this technique
in Sections V-B and VI-C. However, this does not give us an
idea of the distribution of performance across different receivers.
A simple approach to summarizing an experiment is to explicitly specify the bandwidth and latencies that each individual
receiver sees. Although the set of hosts and source transmission rate are identical, a particular scheme may create a different overlay layout for each experimental run. While an individual host may observe vastly different performance across the runs, this does not imply that the various overlays are of different quality. Therefore, we need metrics that capture the performance of the overlay tree as a whole.

Let us consider how we summarize an experiment with regard to a particular metric such as bandwidth or latency. For a set of n receivers, we sort the average metric value of the various receivers in ascending order, and assign a rank to each receiver from 1 to n. The worst-performing receiver is assigned a rank of 1, and the best-performing receiver a rank of n. For every rank r, we gather the results for the receiver with rank r across all experiments, and compute the mean. Note that the receiver corresponding to a rank r could vary from experiment to experiment. For example, the result for rank 1 represents the performance that the worst-performing receiver would see on average in any experiment.
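A small sketch of this rank-based summary follows; the data is made up for illustration, and for latency-like metrics, where larger values are worse, the sort order would be reversed.

import statistics

def summarize_by_rank(experiments):
    """experiments: list of {receiver: mean metric value} dicts,
    one dict per experimental run (same group size in every run)."""
    per_rank = None
    for run in experiments:
        ordered = sorted(run.values())  # rank 1 = worst (lowest bandwidth)
        if per_rank is None:
            per_rank = [[] for _ in ordered]
        for r, value in enumerate(ordered):
            per_rank[r].append(value)
    # Mean over runs of the value observed at each rank.
    return [statistics.mean(values) for values in per_rank]

runs = [{"h1": 500, "h2": 1150, "h3": 1200},
        {"h1": 1190, "h2": 600, "h3": 1210}]
print(summarize_by_rank(runs))  # [550, 1170, 1205]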
V. Internet Evaluation
Our Internet experiments are conducted on a wide-area test-bed with about twenty hosts, including a machine behind
ADSL, and hosts in Asia and Europe. Our evaluation is conducted with our implementation of Narada that has been customized to conferencing applications, as described in Section
III-D.
An important factor that affects the performance of a scheme
for disseminating data is the degree of heterogeneity in the environment we consider. To study the performance of schemes in
environments with different degrees of heterogeneity, we create
two groupings of hosts: the Primary Set and the Extended Set. The Extended Set includes all hosts in our test-bed, while the
Primary Set consists of a subset of hosts located at university
sites in North America which are in general well-connected to
each other. There is greater variation in bandwidth and latencies of paths between nodes in the Extended Set as compared
to the Primary Set.
We begin by presenting our experimental methodology. We
then present results in a typical experiment run in Section V-B.
Section V-C provides a detailed comparison of various schemes
for constructing overlays with regard to application level performance, and Section V-D presents results related to network
costs.
A. Evaluation Methodology
The varying nature of Internet performance influences the
relative results of experiments done at different times. Characteristics may change at any time and affect the performance
of various experiments differently. Ideally, we should test all
schemes for disseminating data concurrently, so that they may
observe the exact same network conditions. However, this is not
possible, as the simultaneously operating schemes would interfere with each other. Therefore, we adopt the following strategy:
(i) we interleave experiments with the various protocol schemes
that we compare to eliminate biases due to changes that occur
at shorter time scales, and (ii) we run the same experiment at
different times of the day to eliminate biases due to changes that
occur at a longer time scale. We aggregate the results obtained
from several runs that have been conducted over a two week
period.
Every individual experiment is conducted in the following
fashion. All members join the group at approximately the same
time. The source multicasts data at a constant rate and after
four minutes, bandwidth and round-trip time measurements are collected. We vary the source rate to study the dependence of our results on the source rate. Each experiment lasts for 20
minutes. We adopt the above set-up for all schemes, except
Sequential Unicast. As described in Section IV-C, we approximate the performance of Sequential Unicast by determining the
bandwidth and latency information of the unicast path from the
source to each receiver. We do this by unicasting data from the
source to each receiver for two minutes in sequence.
B. Results with a Typical Run
The results in this section give us an idea of the dynamic nature of overlay construction, and how the quality of the overlay
varies with time. Our experiment was conducted on a weekday
afternoon, using the Primary Set of machines and at a source
rate of 1.2 Mbps. The source host is at UCSB.
Fig. 8. Mean bandwidth averaged over all receivers as a function of time.
Figure 8 plots the mean bandwidth seen by a receiver, averaged across all receivers, as a function of time. Each vertical
line denotes a change in the overlay tree for the source UCSB.
We observe that it takes about 150 seconds for the overlay to
improve, and for the hosts to start receiving good bandwidth.
After about 150 seconds, and for most of the session from this
time on, the mean bandwidth observed by a receiver is practically the source rate. This indicates that all receivers get nearly
the full source rate throughout the session.
Figure 9 plots the mean RTT to a receiver, averaged across
all receivers as a function of time. The mean RTT is about
100 ms after about 150 seconds, and remains lower than this
value almost throughout the session.
Figures 8 and 9 show that in the first few minutes of the session, the overlay makes many topology changes at very frequent
intervals. As members gather more network information, the
quality of the overlay improves over time, and there are fewer
topology changes. In most of our runs, we find that the overlay converges to a reasonably stable structure after about four
minutes. Given this, we gather bandwidth and RTT statistics
after four minutes for the rest of our experiments.
The figures above also highlight the adaptive nature of our
scheme. We note that there is a visible dip in bandwidth, and
a sharp peak in RTT at around 460 seconds. An analysis of our
logs indicates that this was because of congestion on a link in
the overlay tree. The overlay is able to adapt by making a set of
topology changes, as indicated by the vertical lines immediately
following the dip, and recovers in about 40 seconds.
We have also evaluated how the RTTs to individual receivers
vary during a session and results are presented in [3]. For all
receivers, over 94% of the RTT estimates are less than 200 ms,
while over 98% of the RTT estimates are less than 400 ms.
Fig. 9. Mean RTT averaged over all receivers as a function of time.
Fig. 10. Mean bandwidth versus rank at 1.2 Mbps source rate for the Primary Set of machines

Fig. 11. Mean RTT versus rank at 1.2 Mbps source rate for the Primary Set of machines
C. Application Level Performance
We present results comparing the performance of various schemes for disseminating data on the Internet in different environments. We consider two settings: (i) the Primary Set and a source rate of 1.2 Mbps; and (ii) the Extended
Set and a source rate of 2.4 Mbps. Most pairs of hosts in the
Primary Set can sustain throughputs of 1.2 Mbps and thus the
first scenario represents a relatively less heterogeneous environment where simpler schemes could potentially work reasonably
well. On the other hand, the Extended Set represents an environment with a much higher degree of heterogeneity. Increasing
the source rates to 2.4 Mbps stresses the schemes more, because
many Internet paths even between well connected university
machines cannot sustain this rate. Further, several hosts in our
test-bed are located behind 10 Mbps connections, and a poorly
constructed overlay can result in congestion near the host.
C.1 Primary Set at 1.2 Mbps Source Rate
Figure 10 plots the mean bandwidth against rank for three
different schemes. Each curve corresponds to one scheme, and
each point in the curve corresponds to the mean bandwidth
that a machine of that rank receives with a particular scheme,
averaged across all runs. The error bars show the standard deviation; they do not indicate confidence in the mean, but rather the degree of variability in performance that
a particular scheme for constructing overlays may involve. For
example, the worst-performing machine (rank 1) with the Random scheme receives a bandwidth slightly below 600 Kbps
on average. We use this method of presenting data in all our
comparison results.1
1. The curves are slightly offset from each other for clarity of presentation.
We wish to make several observations. First, the Sequential
Unicast curve indicates that all but one machine get close to
the source rate, as indicated by one of the top lines with a dip
at rank 1. Second, Narada is comparable to Sequential Unicast.
It is able to ensure that even the worst-performing machine in
any run receives 1150 Kbps on average. Interestingly, Narada
results in much better performance for the worst-performing
machine as compared to Sequential Unicast. It turns out this
is because of the existence of pathologies in Internet routing.
It has been observed that Internet routing is sub-optimal and
there often exist alternate paths between end systems that have
better bandwidth and latency properties than the default paths
[19]. Third, Narada results in consistently good performance, as
indicated by the small standard deviations. Fourth, the Random scheme is sub-optimal in bandwidth. On average, the
worst-performing machine with the Random scheme (rank 1)
gets a mean bandwidth of about 600 Kbps. Further, the performance of Random can be quite variable as indicated by the large
standard deviation. We believe that this poor performance with
Random is because of the inherent variability in Internet path
characteristics, even in relatively well connected settings.
Figure 11 plots mean RTT against rank for the same set of experiments. First, the RTT of the unicast paths from the source
to the recipients can be up to about 150 ms, as indicated by
the lowest line corresponding to Sequential Unicast. Second,
Narada is good at optimizing the overlay for delay. The worst
machine in any run has an RTT of about 160 ms on average.
Third, Random performs considerably worse with an RTT of
about 350 ms for the worst machine on average. Random can
have poor latencies because of suboptimal overlay topologies
that may involve criss-crossing the continent. In addition, Random is unable to avoid delays related to congestion, particularly
near the participating end hosts.
C.2 Extended Set at 2.4 Mbps Source Rate
We stress our scheme for constructing overlays by considering
extremely heterogeneous environments as represented by the
Extended Set. Given the poor performance of Random even in relatively less heterogeneous settings, we do not present results for it here. Figures 12 and 13 plot the bandwidth and RTT against host ranks for the two schemes of interest.
The Sequential Unicast curves show that there are quite a
few members that have low bandwidth and high latencies from
the source, which indicates the heterogeneity in the set we consider. Even in such a heterogeneous setting, Narada is able to
achieve a performance close to the Sequential Unicast. Apart
from the less well-connected hosts (ranks 1–5), all other members get bandwidths of at least 1.8 Mbps, and see RTTs of less than 250 ms on average. For ranks 1–5, Narada is able to exploit Internet routing pathologies and provide better performance than Sequential Unicast. A particularly striking example is two machines in Taiwan, only one of which has good performance to machines in North America. In our runs, the machine with poorer performance is able to achieve significantly better performance by connecting to the other machine in Taiwan.

TABLE I
Average normalized resource usage of different schemes

Experiment Setup      Naive Unicast   Random   Narada   Min-Span
Primary, 1.2 Mbps     2.62            2.24     1.49     0.85
Extended, 2.4 Mbps    1.83            1.97     1.31     0.83

TABLE II
Average overhead with Narada and a breakdown of the overhead

                       Average          % of overhead due to
Experiment Setup       Overhead (%)     Bandwidth Probes   Other
Primary, 1.2 Mbps      10.79            92.24              7.76
Extended, 2.4 Mbps     14.20            94.30              5.70

Fig. 12. Mean bandwidth versus rank at 2.4 Mbps source rate for the Extended Set of machines

Fig. 13. Mean RTT versus rank at 2.4 Mbps source rate for the Extended Set of machines
C.3 Choice of Network Metrics
In addition to the schemes listed here, we have evaluated
other schemes for constructing overlays in [3]. Overall our results indicate that it is important to explicitly optimize for both
latency and bandwidth while supporting applications such as
conferencing. Considering latency alone or bandwidth alone leads to degraded performance. Further, the performance with
static delay based metrics such as propagation delay is poor.
The reader is referred to [3] for further details.
D. Network Level Metrics
Table I compares the mean normalized resource usage (Section IV-B) of the overlay trees produced by the various schemes
for different environments and source rates. The values are normalized with respect to the resource usage with DVMRP. Thus,
we would like the normalized resource usage to be as small as
possible, with a value of 1.00 representing an overlay tree that
has the same resource usage as DVMRP. The trees constructed
by Narada can change over time; we consider the final tree produced at the end of an experiment. However, we observe that
the overlays produced by these schemes are reasonably stable
after about four minutes.
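Concretely, taking resource usage to be the Section IV-B quantity (the sum, over all physical links carrying data, of link delay weighted by link stress), the normalization can be sketched as follows; the helper names are ours:

    def resource_usage(link_delay, stress):
        # link_delay: physical link -> propagation delay (ms)
        # stress: physical link -> number of identical copies sent on it
        return sum(link_delay[l] * stress.get(l, 0) for l in link_delay)

    def normalized_resource_usage(scheme_stress, dvmrp_stress, link_delay):
        # A value of 1.00 means the tree uses resources exactly as DVMRP does.
        return (resource_usage(link_delay, scheme_stress)
                / resource_usage(link_delay, dvmrp_stress))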
We note from Table I that Narada can result in trees that use 30–50% more resources than DVMRP. Further, Naive Unicast trees, which have all recipients rooted at the source, and schemes such as Random that do not explicitly exploit network information, have a high resource usage. We have
also determined the resource usage of Min-Span, the minimum
spanning tree of the complete graph of all members, computed
by estimating the delays of all links of the complete graph. Minimum spanning trees are known to be optimal with respect to
resource usage, and as Table I shows, have lower resource usage
than DVMRP. This indicates that an End System Multicast architecture can indeed use network resources as efficiently as, if not more efficiently than, IP Multicast. However, while minimum
spanning trees are efficient from the network perspective, it is
not clear that they perform well from the application perspective.
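Min-Span itself can be obtained by running any standard minimum spanning tree algorithm over the complete graph of estimated delays; a minimal sketch using Prim's algorithm, where delay is assumed to map every unordered member pair to its estimated delay:

    import heapq

    def min_span(delay):
        # delay: (u, v) -> estimated delay, for every pair of members.
        members = {m for edge in delay for m in edge}
        root = next(iter(members))
        in_tree, tree, frontier = {root}, [], []
        for (u, v), d in delay.items():
            if root in (u, v):
                heapq.heappush(frontier, (d, u, v))
        while frontier and len(in_tree) < len(members):
            d, u, v = heapq.heappop(frontier)
            new = v if u in in_tree else u
            if new in in_tree:
                continue                     # edge is internal; skip it
            in_tree.add(new)
            tree.append((u, v, d))
            for (a, b), w in delay.items():  # grow the frontier from `new`
                if (a == new) ^ (b == new):
                    heapq.heappush(frontier, (w, a, b))
        return tree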
Table II summarizes the protocol overhead (Section IV-B)
involved in Narada, along with a breakdown of the main factors that contribute to the overhead. We find that the average
overhead is between 10 and 15% across all settings. This is an indication that even the simple heuristics that we have implemented can keep the overall overhead low. Further, more than 90% of the overhead is due to members probing each other for bandwidth. Other sources of overhead contribute just 6–8% of the
overhead. These include exchange of routing messages between
neighbors, group management algorithms to keep the overlay
connected, and probes to determine the delay and routing state
of remote members. Our current work is investigating the use of
light-weight probing techniques to further reduce the overhead
due to bandwidth measurements.
VI. Simulation
Section V demonstrates that an End System Multicast architecture can perform quite well in realistic Internet settings. In
this section, we study the performance issues with larger group
sizes using simulation experiments. We begin by presenting factors that affect the evaluation. We then present our simulation
setup, and our results.
A. Factors Affecting Performance
A key factor that affects our comparison results is the topology model used in our simulations. We used three different models to generate backbone topologies for our simulation. For each model of the backbone, we modeled members as being attached directly to the backbone topology. Each member was attached to a random router, and was assigned a random delay of 1 − 4 ms.
• Waxman: The model considers a set of n vertices on a square in the plane and places an edge between two points with a probability of αe^(−d/(βL)), where L is the length of the longest possible edge, d is a random variable between 0 and L, and α and β are parameters. We use the Georgia Tech. [22] random graph generators to generate topologies of this model; an illustrative sketch of this construction appears after this list.
• Mapnet: Backbone connectivity and delay are modeled after actual ISP backbones that could span multiple continents.
Connectivity information is obtained from the CAIDA Mapnet
project database [9]. Link delays are assigned based on geographical distance between nodes.
• Autonomous System map (ASMap): Backbone connectivity information is modeled after inter-domain Internet
connectivity.
This information is collected by a route
server from BGP routing tables of multiple geographically
distributed routers with BGP connections to the server [8]. This
data has been analyzed in [6] and has been shown to satisfy certain power laws. Random link delays of 8 − 12 ms were assigned
to each physical link.
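The Waxman construction referred to above can be illustrated with the following sketch (our topologies were produced by the Georgia Tech generators [22], not by this code):

    import math
    import random

    def waxman(n, alpha, beta, side=1.0):
        # Place n vertices uniformly at random on a square in the plane.
        pts = [(random.uniform(0, side), random.uniform(0, side))
               for _ in range(n)]
        L = side * math.sqrt(2)      # length of the longest possible edge
        edges = []
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(pts[i], pts[j])
                # Edge probability decays exponentially with distance.
                if random.random() < alpha * math.exp(-d / (beta * L)):
                    edges.append((i, j))
        return pts, edges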
In our simulations, we used backbone topologies consisting of up to 1070 routers and multicast groups of up to 256 members. We used a Waxman topology consisting of 1024
routers and 3145 links, an ASMap topology consisting of 1024
routers and 3037 links and a Mapnet topology consisting of
1070 routers and 3170 links. We have also studied the impact
of varying topology size for each topology model in [4].
With Narada, each member in the data delivery tree has a
degree that is dynamically configured based on the available
bandwidth near a member. If a member has too many children,
this could result in congestion near the member and a decrease
in the available bandwidth. Narada can adapt dynamically to
such a situation by detecting the fall in bandwidth and having
children move away. However, given that our simulator does
not consider Internet dynamics, we model the impact of this
artificially by imposing restrictions on the degree. We do this
using a parameter called the fanout range. The fanout range of
a member is the minimum and maximum number of neighbors
each member strives to maintain in the mesh. An increase of
the fanout range could decrease mesh diameter and result in
better delay performance. However, it could potentially result
in higher stress on links near members. All results presented
here assume a fanout range of < 3 – 6 >. We have investigated
the impact of varying fanout range and the reader is referred to
[4] for more details.
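A minimal sketch of how a member might enforce such a fanout range follows; the names and the utility estimate are hypothetical, and the actual Narada adaptation heuristics are more involved:

    FANOUT_MIN, FANOUT_MAX = 3, 6   # the < 3 - 6 > range used here

    def adjust_neighbors(neighbors, candidates, utility):
        # neighbors: this member's current mesh neighbors (a set)
        # candidates: other known members, not currently neighbors (a set)
        # utility: estimated usefulness of a link to a given peer
        while len(neighbors) < FANOUT_MIN and candidates:
            best = max(candidates, key=utility)
            candidates.discard(best)    # add links while below the minimum
            neighbors.add(best)
        while len(neighbors) > FANOUT_MAX:
            neighbors.discard(min(neighbors, key=utility))  # drop least useful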
B. Simulation Setup
All experiments we report here are conducted in the following manner. A fixed number of members join the group in
the first 100 seconds of the simulation in random sequence. A
member that joins is assumed to obtain a list of all members
that joined the group previously. After 100 seconds, there is
no further change in group membership. One sender is chosen at random to multicast data at a constant rate. We allow
the simulation to run for 40 minutes. In all experiments, neighbors exchange routing messages every 30 seconds. Each member
probes one random group member every 10 seconds to evaluate
performance.
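For reference, this schedule can be summarized by the following parameters (the names are ours, not the simulator's):

    SIM = {
        "join_window_s":    100,   # members join, in random sequence
        "run_length_s":     2400,  # 40-minute simulation
        "routing_period_s": 30,    # neighbors exchange routing messages
        "probe_period_s":   10,    # each member probes one random member
        "num_senders":      1,     # one randomly chosen constant-rate source
    }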
Fig. 14. No. of physical links with a given stress vs. stress for naive unicast, Narada and DVMRP

Fig. 15. Worst case physical link stress vs. group size for three topology models using Narada
C. Simulation Results
For all results in this section, we compute each data point
by conducting 25 independent simulation experiments and we
plot the mean with 95% confidence intervals. Due to space constraints, we present plots of selected experiments and summarize
results of other experiments.
C.1 Stress
To get a better understanding of the stress metric, we consider
the performance seen in a typical experiment conducted using a
topology generated by the Waxman model and a group size of
128 members. One of the members is picked as source at random, and we evaluate the stress of each physical link. We study
the variation of physical link stress under Narada and compare
the results we obtain with physical stress under DVMRP and
naive unicast in Figure 14. Here, the horizontal axis represents
stress and the vertical axis represents the number of physical
links with a given stress. The stress of any physical link is at
most 1 for DVMRP, indicated by a solitary dot. Under both
naive unicast and Narada, most links have a small stress; this is to be expected. However, the significance lies in the tail of
the plots. Under naive unicast, one link has a stress of 127 and
quite a few links have a stress above 16. This is unsurprising
considering that links near the source are likely to experience
high stress. Narada however distributes the stress more evenly
across physical links, and no physical link has a stress larger
than 9. While this is high compared to DVMRP, it is a 14-fold
improvement over naive unicast.
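Stress is straightforward to compute once each overlay edge is mapped onto the physical path it traverses; a minimal sketch, with physical_path as an assumed helper:

    from collections import Counter

    def link_stress(overlay_edges, physical_path):
        # overlay_edges: parent-child pairs of the data delivery tree
        # physical_path(u, v): physical links crossed by unicast u -> v
        stress = Counter()
        for u, v in overlay_edges:
            for link in physical_path(u, v):
                stress[link] += 1   # one identical copy per overlay edge
        return stress               # under DVMRP every used link shows 1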
Fig. 16. Mean receiver delay with Narada, Random and DVMRP as a function of group size for 3 topology models. The curves are bunched into three families depending on the scheme used. Within each family, the legend indicates the performance for a particular topology model.

Figure 15 plots the variation of worst case physical link stress against group size for three topologies with Narada. Each curve corresponds to one topology model. Each point corresponds to
the mean worst case stress for a particular group size, averaged
over 25 experiments, and plotted with 95% confidence intervals.
We observe that the curves are close to each other for small
group sizes but seem to diverge for larger group sizes. Further,
for all topologies, worst case stress increases with group size.
Thus, for a group size of 64, mean worst case stress is about 5−7
across the three topologies, while for a group size of 256, it is
about 8−14. We believe this increase of stress with group size is
an artifact of the small topologies in a simulation environment
relative to the actual Internet backbone. The reader is referred
to [4] for a further discussion.
Finally, we have also evaluated stress with Random. Our
results indicate that Random tends to result in slightly higher
stress than Narada across all topology models, and we omit the
results for clarity.
C.2 Delay Results
Figure 16 plots the mean delay experienced by a receiver using
Random, Narada and DVMRP, as a function of group size for
three different topology models. Each curve corresponds to a
particular scheme, and a particular topology model. Each point
represents the mean receiver delay for that group size averaged
over 25 experiments, plotted with 95% confidence intervals. For
example, the mean receiver delay with Narada using the ASMap
topology is about 50ms for a group size of 16.
The curves are bunched into three families: the topmost set
of curves correspond to Random, the lowest set corresponds to
DVMRP and the set in between corresponds to Narada. For a
range of group sizes and all topology models, Narada outperforms Random, but does not do as well as DVMRP. For example, for a group size of 16 members, the mean receiver delay with
Random varies between 70 − 80ms depending on the topology
model, while the mean delay with Narada is between 40−55ms,
and the mean delay with DVMRP is around 25 − 30 ms.
For all topology models, the mean delay with DVMRP is relatively independent of group size. However, the performance of
both Narada and Random tends to degrade with larger group
size. For a group size of 256 members, the mean delay with
Narada is about 70 – 105 ms and about 150 – 170 ms for Random, depending on the topology model.

Fig. 17. Effect of group size on NRU: Narada, Random and Naive Unicast
C.3 Resource Usage
Figure 17 plots the normalized resource usage (NRU)
against group size for the Waxman model alone. The results
are normalized with respect to IP Multicast. The lowest curve
corresponds to Narada, while the two upper curves correspond
to Random and Naive Unicast respectively. First, Narada consumes fewer network resources than Naive Unicast and Random,
across all group sizes. For a group size of 16, the NRU is about
1.3 for Narada, while the NRU is about 1.6 for Naive Unicast
and Random. Second, NRU tends to increase with group size
for all schemes. For a group size of 128, the NRU for Narada is about 1.9, while it is about 2.4 for Naive Unicast and Random. While these
results are reasonable, we believe the performance of Narada with regard to resource usage could be even better if
members are clustered. We have repeated this study with the
Mapnet and ASMap topologies and observe similar trends. For
all topologies, the NRU is at most 1.9 for a group size of 128.
C.4 Protocol Overhead
In our simulation experiments, protocol overhead does not measure the cost of bandwidth probes, which we found to be the chief source of overhead in our Internet results. Thus, this metric measures overhead mainly due to routing and group management associated with Narada. We find that the protocol overhead due to these factors increases linearly with group size; however, this is not significant for the group sizes we consider.
For a source data rate of 128 Kbps, the protocol overhead is
about 2% for a group size of 64 members, and 4% for a group
size of 128 members. Finally, we note that the control traffic
that Narada introduces is independent of source data rate and
thus the protocol overhead is even lower for higher source rates.
VII. Discussion
We begin by summarizing results from our simulation and
Internet experiments and then discuss some open issues.
A. Summary of Results
Our key results are as follows:
• Application Level Performance: Our Internet results demonstrate that End System Multicast can meet the bandwidth requirements of applications and at the same time achieve low latencies. In Internet experiments with the Primary Set, all hosts
sustain over 95% of the source rate and achieve latencies lower than 80 ms. With the Extended Set, the mean performance attained by each receiver is comparable to the performance of the
unicast path from the source to that receiver. Our simulation
results match these numbers, and indicate that the penalty in
delay is low even for medium size groups. For a range of topology models, the ratio of the mean delay with Narada relative to
the mean delay with DVMRP is less than 1.7 for groups of size
16, and less than 3.5 for groups of 256 members.
• Stress: Our simulation results demonstrate that Narada results in a low worst case stress for small group sizes. For example, for a group size of 16, the worst case stress is about 5.
While for larger group sizes, worst case stress may be higher,
it is still much lower than unicast. For example, for a group of
128 members, Narada reduces worst case stress by a factor of
14 compared to unicast.
• Resource Usage:
Our Internet results demonstrate that
Narada may incur a resource usage that is about 30 − 50%
higher than with DVMRP, while it can improve resource usage
by 30 − 45% compared to naive unicast. Again, our simulation
results are consistent with our Internet results, and indicate
that the performance with respect to this metric is good even
for medium sized groups. The resource usage is about 35 − 55%
higher than with DVMRP for group sizes of 16 members, and
about a factor of two higher for group sizes of 128 members.
Further, we believe that the performance in resource usage may
be even better if we consider clustered group members.
• Protocol Overhead: Our Internet experiments demonstrate
that Narada can have a protocol overhead of about 10 – 15 %
for groups up to 20 members. Over 90% of this overhead is due to members probing each other for bandwidth. To reduce the
cost of bandwidth probes further, we are currently exploring
light-weight probing techniques based on RTT measurements,
transfers of 10 kilobyte data chunks and bottleneck bandwidth
measurements. Our initial experience suggests that these lightweight probing techniques are promising and can be quite effective at controlling overhead. Our simulation experiments on
the other hand do not involve bandwidth probes. Our results
indicate that the overhead due to other factors (e.g. routing
and group management) is not significant for the group sizes
we consider.
B. Open Issues
Overall our results suggest that End System Multicast can
achieve good performance for small and medium sized groups
involving tens to hundreds of members. The question then is:
can an End System Multicast architecture scale to support much
larger group sizes? Based on our experience, we believe the following issues need to be addressed:
• As the group size increases, the number of overlay hops between any pair of members increases, and hence the delay between them potentially increases (e.g., Figure 16). A careful
analysis that investigates fundamental performance limits of an
overlay approach for large group sizes would provide valuable
insight on the feasibility of the End System Multicast architecture for large scale interactive applications.
• While we have demonstrated that End System Multicast can
ensure good application performance over longer time scales, we
have not investigated performance of applications over shorter
time scales. Events such as failure of members, members leaving
the group, or network congestion can potentially result in poor
transient performance, particularly for interactive applications.
While this is an issue that must be investigated even for smaller
groups, it could be a greater concern for larger group sizes, as
they could encounter a much higher frequency of such events.
• A self-improving overlay approach incurs overhead due to active measurements, and takes time to converge into an efficient
structure. As group size increases, it is not clear whether an
End System Multicast approach can keep probe overhead low
and construct efficient overlays quickly.
While the above issues need to be addressed to determine
the viability of an End System Multicast approach for larger
group sizes, certain design decisions taken in the current version of the Narada protocol may prevent it from scaling to larger
group sizes. In Narada, each member maintains information regarding all other group members. This is a deliberate design
choice that was made for two reasons. First, Narada
does not rely on external nodes for normal protocol operation.
While it does use an out-of-band mechanism for bootstrapping,
failure of this mechanism prevents new members from joining
the group, but existing group members may continue to communicate with each other. Second, Narada has been designed
with the objective of re-establishing connectivity among participating group members even under failure modes involving the
simultaneous death of a significant fraction of group members.
While maintaining full group membership information helps to
achieve these goals, it leads to the concern that the costs of
maintaining such information may be prohibitively expensive
for larger sized groups.
In this paper, we have made minimal assumptions regarding
support from network infrastructure, both in terms of the robustness properties of end systems, and the network information
available for overlay construction. We believe that scalability
can be achieved more easily by making additional assumptions
about the composition of the end systems, the failure models of
hosts and the availability of external mechanisms for collecting
network information. We describe recent efforts in this direction
in Section VIII. Further, we are currently exploring these issues
in the context of proxy-based End System Multicast architectures. Such architectures consist of a set of more robust or stable
nodes that are unlikely to all fail at the same time. This
can greatly simplify the design of self-organizing protocols, and
enable more scalable solutions. In addition, proxies are assumed
to be persistent with long-lived relationships among them. This
reduces the need for active measurements in creating overlays
and helps in quick instantiation of efficient overlays.
VIII. Related Work
Since the initial proposal of End System Multicast [4], several other researchers have begun advocating an overlay based
approach for multicast communication [2], [10], [13], [15]. Architecturally, proposals for overlay based multicast have primarily
differed on whether they assume a peer-to-peer architecture,
or a proxy (infrastructure) based architecture. Yoid [10], and
ALMI [15] emphasize peer-to-peer settings. In contrast, Scattercast [2], and Overcast [13] argue for infrastructure support.
We view both these architectures as interesting and plan to
look at the challenges and constraints specific to each architecture in the future. Further, ALMI [15] advocates a completely
centralized solution, and places all responsibility for group management and overlay computation with a central controller.
As the research community has begun to acknowledge the
importance of overlay based architectures, self-organizing protocols for constructing overlays have emerged as an important
field of study. Most of the earlier proposed protocols fall into two broad categories that we summarize below:
• Protocols like Yoid [10], BTP [11] and Overcast [13] construct
trees directly; that is, members explicitly select their parents
from among the members that they know. While Overcast targets single source broadcasting applications and constructs
trees rooted at the source, Yoid constructs a single shared tree
for all sources.
• Narada and Gossamer [2] construct trees in a two-step process: they first construct efficient meshes among participating
members, and in the second step construct spanning trees of the
mesh using well known routing algorithms. A mesh-based approach has been motivated by the need to support multi-source
applications, such as conferencing. Single shared trees are not optimized for the individual source and are susceptible to a single point of failure. An alternative to constructing shared trees
is explicitly constructing multiple overlay trees, one tree for each
source. However, this approach needs to deal with the overhead
of maintaining and optimizing multiple overlays.
Recently, researchers have begun designing self-organizing
protocols that can scale to very large group sizes. Again, these
newer protocols have taken two different approaches:
• Delaunay Triangulations [14], CAN [16] and Bayeux [23] assign to members addresses from some abstract coordinate space,
and neighbor mappings are based on these addresses. For example, CAN assigns logical addresses from cartesian coordinates
on an n-dimensional torus. [14] assigns points to a plane and
determines neighbor mappings corresponding to the Delaunay
triangulation of the set of points. Determining neighbor mappings based on member addresses enables routing of messages
based on the addresses, and full-fledged routing protocols such
as distance vector algorithms are not required. Each member
needs to maintain knowledge about only a small subset of members, enabling the protocols to scale better to larger group sizes.
However, in contrast to tree and mesh-based approaches, these
protocols impose rules on neighbor relationships that are dictated by addresses assigned to hosts rather than performance.
This may involve a performance penalty in constructed overlays and could complicate dealing with dynamic metrics such
as available bandwidth.
• The Nice project [1] and Kudos [12] achieve better scaling
properties than Narada by organizing members into hierarchies
of clusters. Kudos constructs a two level hierarchy with a
Narada like protocol at each level of the hierarchy. [1] constructs
a multi-level hierarchy, and does not involve use of a traditional
routing protocol. A concern with hierarchy-based approaches is
that they complicate group management, and need to rely on
external nodes to simplify failure recovery.
To our knowledge, ours is perhaps the first work that has conducted a detailed Internet evaluation to analyze the feasibility of an overlay based architecture. Our work has shown that it is important to dynamically adapt to bandwidth and latency [3], and we have incorporated techniques in Narada that help to achieve this goal. In contrast, most other works have considered delay based metrics, and not dealt with important issues pertaining to the dynamic nature of network metrics.

IX. Conclusion

We have made two contributions in this paper. First, we have shown that for small and medium sized multicast groups, it is feasible to use an end system overlay approach to efficiently support all multicast related functionality including membership management and packet replication. The shifting of multicast support from routers to end systems, while introducing some performance penalties, has the potential to address most problems associated with IP Multicast. We have shown, with both simulation and Internet experiments, that the performance penalties are low in the case of small and medium sized groups. We believe that the potential benefits of transferring multicast functionality from routers to end systems significantly outweigh the performance penalty incurred.

Second, we have proposed one of the first self-organizing and self-improving protocols that constructs an overlay network on top of a dynamic, unpredictable and heterogeneous Internet environment without relying on a native multicast medium. We also believe this is among the first works that attempts to systematically evaluate the performance of a self-organizing overlay network protocol and the tradeoffs in using overlay networks. Further, we believe that the techniques and insights developed in this paper are applicable to overlay networks in contexts other than multicast.

Our current work involves studying mechanisms that can ensure robust transient performance of applications in environments with highly dynamic group membership and highly variable network characteristics. Further, while in this work we have made conservative assumptions regarding the composition of end systems and their failure modes, we are currently investigating how we may take advantage of proxy-based End System Multicast architectures.
References
[1] S. Banerjee, B. Bhattacharjee, and S. Parthasarathy. A Protocol for Scalable Application Layer Multicast. Technical Report CS-TR 4278, University of Maryland, July 2001.
[2] Y. Chawathe. Scattercast: An Architecture for Internet Broadcast Distribution as an Infrastructure Service. Ph.D. thesis, Fall 2000.
[3] Y. Chu, S. G. Rao, S. Seshan, and H. Zhang. Enabling Conferencing Applications on the Internet using an Overlay Multicast Architecture. In Proceedings of ACM SIGCOMM, August 2001.
[4] Y. Chu, S. G. Rao, and H. Zhang. A Case for End System Multicast. In Proceedings of ACM SIGMETRICS, June 2000.
[5] S. Deering. Multicast Routing in Internetworks and Extended LANs. In Proceedings of ACM SIGCOMM, August 1988.
[6] C. Faloutsos, M. Faloutsos, and P. Faloutsos. On Power-law Relationships of the Internet Topology. In Proceedings of ACM SIGCOMM, August 1999.
[7] S. Floyd, M. Handley, J. Padhye, and J. Widmer. Equation-based Congestion Control for Unicast Applications. In Proceedings of ACM SIGCOMM, August 2000.
[8] National Laboratory for Applied Network Research. Routing data. http://moat.nlanr.net/Routing/rawdata/.
[9] Cooperative Association for Internet Data Analysis. Mapnet project. http://www.caida.org/Tools/Mapnet/Data/.
[10] P. Francis. Yoid: Your Own Internet Distribution. http://www.aciri.org/yoid/, April 2000.
[11] D. A. Helder and S. Jamin. End-host Multicast Communication Using Switch-tree Protocols. In Proceedings of the Workshop on Global and Peer-to-Peer Computing on Large Scale Distributed Systems (GP2PC), May 2002.
[12] S. Jain, R. Mahajan, D. Wetherall, G. Borriello, and S. D. Gribble. Scalable Self-Organizing Overlays. Technical Report UW-CSE 02-02-02, University of Washington, February 2002.
[13] J. Jannotti, D. Gifford, K. L. Johnson, M. F. Kaashoek, and J. W. O'Toole Jr. Overcast: Reliable Multicasting with an Overlay Network. In Proceedings of the Fourth Symposium on Operating System Design and Implementation (OSDI), October 2000.
[14] J. Liebeherr and M. Nahas. Application-layer Multicast with Delaunay Triangulations. In Proceedings of IEEE Globecom, November 2001.
[15] D. Pendarakis, S. Shi, D. Verma, and M. Waldvogel. ALMI: An Application Level Multicast Infrastructure. In Proceedings of the 3rd Usenix Symposium on Internet Technologies & Systems (USITS), March 2001.
[16] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker. Application-level Multicast using Content-Addressable Networks. In Proceedings of NGC, 2001.
[17] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4). RFC 1771, March 1995.
[18] J. Saltzer, D. Reed, and D. Clark. End-to-end Arguments in System Design. ACM Transactions on Computer Systems, 2(4):195–206, 1984.
[19] S. Savage, A. Collins, E. Hoffman, J. Snell, and T. Anderson. The End-to-end Effects of Internet Path Selection. In Proceedings of ACM SIGCOMM, August 1999.
[20] F. B. Schneider. Byzantine Generals in Action: Implementing Fail-stop Processors. ACM Transactions on Computer Systems, 2(2):145–154, 1984.
[21] Z. Wang and J. Crowcroft. Bandwidth-delay Based Routing Algorithms. In Proceedings of IEEE GlobeCom, November 1995.
[22] E. W. Zegura, K. L. Calvert, and S. Bhattacharjee. How to Model an Internetwork. In Proceedings of IEEE Infocom, March 1996.
[23] S. Q. Zhuang, B. Y. Zhao, A. D. Joseph, R. H. Katz, and J. D. Kubiatowicz. Bayeux: An Architecture for Scalable and Fault-tolerant Wide-Area Data Dissemination. In Proceedings of NOSSDAV, April 2001.
Yang-hua Chu is a Ph.D. student in Computer Science at Carnegie Mellon University. He received the
B.S. and M.Eng. degrees from the Massachusetts Institute of Technology in 1996 and 1997, respectively.
His research interests lie in overlay networks, with
an emphasis on building systems and services that
can be deployed on the Internet. His web page is at
http://www.cs.cmu.edu/˜yhchu.
Sanjay Rao is currently a doctoral candidate at the
School of Computer Science, Carnegie Mellon University. He received the B.Tech degree in Computer
Science and Engineering from the Indian Institute of
Technology, Madras in 1997, and the MS degree in
Computer Science from Carnegie Mellon University
in 2000. His research interests include overlay network design, peer-to-peer networks and multicast and
other technologies that enable efficient group communication over the Internet. His web page is at
http://www.cs.cmu.edu/˜sanjay.
Srinivasan Seshan is currently an Assistant Professor at Carnegie Mellon University’s Computer Science
Department. Dr. Seshan received his Ph.D. in 1995
from the Computer Science Department at University of California, Berkeley. From 1995 to 2000, Dr.
Seshan was a research staff member at IBM’s T.J.
Watson Research Center. Dr. Seshan’s primary interests are in the broad areas of network protocols
and distributed network applications. In the past, he
has worked on topics such as transport/routing protocol interactions with wireless networks, fast protocol
stack implementations, RAID system design, performance prediction for Internet transfers, firewall design, and improvements to
the TCP protocol. His current work includes overlay network design, Web server
benchmarking, mobile computing/networking, and new approaches to congestion control. His web page is at http://www.cs.cmu.edu/˜srini.
Hui Zhang (M’95 / ACM’95) is the Chief Technical Officer of Turin Networks and the Finmeccanica
Associate Professor at the School of Computer Science of Carnegie Mellon University. He received the
B.S. degree in Computer Science from Beijing University in 1988, the M.S. degree in Computer Engineering from Rensselaer Polytechnic Institute in 1989,
and the Ph.D. degree in Computer Science from the
University of California at Berkeley in 1993. Hui
Zhang’s research interests are in Internet, QoS, Multicast, peer-to-peer systems, and metro networks. He
received the National Science Foundation CAREER
Award in 1996 and the Alfred Sloan Fellowship in 2000. His homepage is at
http://www.cs.cmu.edu/˜hzhang.