Timer Interaction in Route Flap Damping

Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Timer Interaction in Route Flap Damping

Beichuan Zhang Dan Pei Daniel Massey Lixia Zhang


[email protected] [email protected] [email protected] [email protected]
UCLA UCLA Colorado State University UCLA

Abstract The goal of BGP damping is to allow updates of stable


routes to pass through but block updates generated by unsta-
Route Flap Damping is a mechanism generally used in ble routes. Briefly, route flap damping works as follows. A
network routing protocols. Its goal is to limit the global im- router associates a penalty value with each destination (i.e.,
pact of unstable routes by temporarily suppressing routes an IP prefix) advertised by a neighbor router. A route flaps
with rapid changes over short time periods. Although route whenever the neighbor router changes its route to the desti-
damping is a clearly defined and simple procedure at each nation. When it happens, the penalty value is increased. In
router, its effect in a large network setting is not well un- the absence of route changes, the penalty value decays over
derstood. We show that the current damping design leads time. When the penalty value exceeds a predefined cut-off
to the intended behavior only under persistent route flap- threshold, further updates from the same neighbor for the
ping. When the number of flaps is small, the global routing same destination will no longer be propagated. That is, the
dynamics deviates significantly from the expected behavior route is suppressed. When the penalty value drops below a
with a longer convergence delay. Previous work observed predefined reuse threshold, the router will start propagating
that a single route flap can falsely trigger route suppression updates for that destination again, i.e., the route is reused.
due to path exploration. However our simulations show that Throughout this paper we will use the word damping as an
this false suppression only accounts for 30% of the conver- abbreviation for “route flap damping” to refer to the whole
gence delay after a single route flap. Our study reveals pre- mechanism, and the word suppression to refer to the spe-
viously unknown interactions between reuse timers at differ- cific action of stopping propagating updates.
ent routers. Route suppression and reuse at different routers Despite its simple rules of operation at each router, the
are triggered at different times and thus affect the number overall effect of damping is not fully understood. A recent
of updates received by other routers. In turn, this impacts study [9] showed that, after a single route flap, path explo-
other routers’ damping behavior. We propose to use Root ration can falsely trigger route suppression and prolong the
Cause Notification to eliminate both false suppression and convergence time. Yet our simulations show that this false
undesirable timer interaction. suppression alone can only account for 30% of the conver-
gence delay after a single route flap, and cannot explain the
damping behavior after two or more route flaps. We will
1 Introduction show that the current BGP damping mechanism achieves
the intended behavior only under persistent route flapping.
It remains a challenge to design a responsive and effi- When the number of route flaps is small, the global routing
cient routing protocol for large scale networks. In this paper dynamics deviates significantly from the intended behavior.
we examine a specific problem caused by unexpected inter- Because the current damping implementation counts all re-
actions among multiple nodes in a large network. Since dy- ceived updates in calculating the penalty value, and because
namic routing protocols adapt to topological changes, a sin- route suppression and reuse at different routers happen at
gle unstable link can potentially cause a large number of up- different times, false damping can be triggered not only by
dates being propagated throughout the entire network, con- path exploration, but also by the updates due to route reuse
suming router CPU cycles and link bandwidth [8]. To limit at neighbor routers. We propose to add Root Cause Notifi-
the global impact of individual unstable routes, Route Flap cation (RCN) [11] to routing updates, in order to eliminate
Damping [14] was added to the Border Gateway Protocol both false suppression and undesirable timer interactions.
(BGP) [12] several years ago. It is commonly believed that The remainder of the paper is organized as follows. Sec-
damping has played an essential role in putting the global tion 2 describes BGP damping mechanism and previous
Internet routing update overhead under control [3]. work. Section 3 analyzes the intended damping behavior.
4000
Cut-off Threshold
Reuse Threshold
3500

3000

2500

Penalty
2000

1500

1000

500

0
Figure 1. Example Figure 2. A router’s RIBs 0 240 480 720 960 1200 1440 1680 1920 2160 2400 2640
Time (seconds)

Figure 3. Damping Penalty

Section 4 analyzes the actual damping behavior in detail, in- Damping Parameters Cisco Juniper
cluding how timer interactions shape routing dynamics dur- Withdrawal Penalty (PW ) 1000 1000
ing damping. Section 5 presents simulation results. Section Re-announcement Penalty (PA ) 0 1000
6 proposes the use of RCN to facilitate damping. Section Attributes Change Penalty 500 500
7 discusses the implication of routing policies on damping, Cut-off Threshold (Pcut ) 2000 3000
and Section 8 concludes the paper. Half Life (minute) (H) 15 15
Reuse Threshold (Preuse ) 750 750
2 Route Flap Damping and Previous Work Max Hold-down Time (minute) 60 60

Figure 1 shows a general network scenario that will be


used throughout our analysis and simulations in this paper. Table 1. Default Damping Parameters
A router in a customer network, the originAS, is connected
to a router in its provider network, the ispAS. When the link When the penalty value is greater than zero, it decays
[originAS, ispAS] comes up, the router in ispAS will an- exponentially over time. More formally, if the penalty is
nounce to the rest of the network the route to originAS; p(t0 ) at time t0 and becomes p(t) at time t, then
when the link goes down, the ispAS router will withdraw
the route.
p(t) = p(t0 ) e−λ(t−t0 ) (1)
Generally speaking, each BGP router peers with a num-
ber of neighboring routers and exchanges routing updates. where λ is often configured by the half-life H = ln 2/λ.
A router stores the routes received from each peer in the A suppressed route will be reused when the penalty drops
corresponding RIB-IN table (Figure 2). For each destina- below the reuse threshold. This is often implemented by
tion prefix, the router picks the best route among all the setting a reuse timer based on the current penalty value and
RIB-INs and stores this best route in the Local-RIB table. reusing the route when the reuse timer expires.
Depending on the routing policy, the router may announce Table 1 lists the default parameter settings from two ma-
all or part of its best routes to its peers. It stores the routes to jor router vendors, and Figure 3 shows an example of the
be announced to each peer in the corresponding RIB-OUT penalty value (with Cisco default parameters) changes over
table. time in response to a few route flaps.
Damping associates a penalty value with each entry in Damping was introduced into Internet inter-domain rout-
a RIB-IN. That is, there is a penalty value associated with ing in mid 1990s, and has been widely supported in com-
each peer and destination prefix pair. Whenever a new up- mercial routers. RFC 2439 [14] documents its design ra-
date message is received, the corresponding RIB-IN entry tionale, algorithm, and implementation strategy. RFC 3221
is updated and so is its penalty value. Different types of [3] states that damping is widely deployed and helps stabi-
updates are assigned different penalty increments. If the lize the routing infrastructure, but it is also well known that
penalty exceeds the cut-off threshold, the RIB-IN entry will different implementations use inconsistent parameters, and
no longer be used in selecting the best route. Note that dur- damping is not universally deployed everywhere.
ing this route suppression, new routing updates for the same The intended effect of damping is to allow occasional
entry may continue to arrive, and if so the penalty value routing changes to propagate without delay, while suppress
will continue to increase accordingly. Because a suppressed persistently changing routes until they become stable. How-
route does not enter Local-RIB, none of the new changes ever, as early as in 1998, Panigl [10] observed that a single
will be propagated any further. route withdrawal followed by a re-announcement in Europe
triggered route suppression in North America. The cause of The message count is the total number of updates observed
this behavior was not explained until 2002, when Mao et al. in the network starting from the first flap.
[9] showed that path exploration was the reason. Convergence time can be calculated given the flapping
BGP path exploration was first reported by intervals and damping parameters. For occasional flaps of
Labovitz et al. [6][7]. For example, in Figure 1, as- link [originAS, ispAS], route suppression should not be
sume that X can reach originAS via three peers. When link triggered, and the convergence time is the normal BGP con-
[originAS, ispAS] fails and ispAS sends a withdrawal to vergence time, usually between seconds and a few minutes
the rest of the network, X will receive the withdrawal from [6]. When link [originAS, ispAS] flaps persistently, the
one of its peers first. Not knowing link [originAS, ispAS] excessive routing updates will increase the penalty value at
has failed, X will switch to another peer to reach originAS, ispAS and cause ispAS to suppress its route to originAS.
thus it “explores” alternate paths. Every time X changes After the flapping stops, ispAS will wait for the penalty
its route, it will send an update to Y . Only after receiving value p to drop below the reuse threshold Preuse before
withdrawals from all its peers, will X finally send its re-announcing the route. The announcement will trigger a
own withdrawal to Y . This path exploration happens at BGP Tup event (i.e., a previously unreachable destination
every router in the network with alternative paths, and can becomes reachable), which takes time tup for the network
amount to a large number of updates. Depending on the to converge. Let r denote the time it takes for the penalty
timing of these updates, Y can receive multiple updates on value to drop below the reuse threshold, then the total con-
link [X, Y ] even though link [originAS, ispAS] only flaps vergence time should be:
once.
1 p
[9] is the first work to point out the interplay between t = r + tup  r = ln
path exploration and damping. It shows that path explo- λ Preuse
ration can amplify one single flap into many updates which From Table 1, we can see that with Cisco default setting, r is
falsely trigger suppression somewhere in the network. This at least 20 minutes and therefore r  tup . The actual value
unexpected interplay highlights the complexity introduced of r depends on the penalty value p, the reuse threshold, and
by the scale of a large network: one cannot easily predict the half-life. To calculate p, let w(i) be the time between the
the overall network behavior even if he knows exactly how ith flap and the (i − 1)th flap, f (i) be the penalty increment
each individual node works. caused by the ith flap, i = 1, 2, . . . , k − 1, k, and w(1) = 0.
However, false suppression caused by path exploration Right after the k th flap, the penalty value p(k) is
alone cannot fully explain the observed long convergence
delay. In [9], the simulation results show that convergence p(k) = p(k − 1) ∗ e−λw(k) + f (k)
delay can be as long as one hour. In section 5.2, we will k−1
 Pk
explain why it is unlikely to reach such a high penalty value = [f (i) ∗ e−λ j=i+1 w(j)
] + f (k)
by just path exploration. Moreover, [9] did not examine how i=1
damping would behave in response to more than one flap.
Later in the paper, Figure 8 shows the calculation results
In this paper, we will give a detailed analysis of the
of damping’s intended convergence delay under a varying
damping process under one or more flaps, and show that it
number of route flaps.
is the reuse timer interaction among multiple routers that
A precise message count generally cannot be obtained
stretches the convergence delay to be much longer than
analytically, since it depends on the network topology and
what path exploration alone could do. We will also show
timing of updates. Nevertheless, the general trend can be
how to enhance the damping mechanism with RCN to pre-
predicted. As the number of flaps increases, the number of
vent the undesirable behavior due to path exploration and
updates also increases since each new flap triggers some up-
reuse timer interaction.
dates in the network. After a certain number of flaps, how-
ever, the message count is expected to be almost constant,
3 The Intended Behavior of BGP Damping since new flaps are suppressed by ispAS and no update is
propagated beyond ispAS.
Before analyzing its actual behavior in a network, we Damping reduces the number of updates by suppress-
first quantify damping’s intended behavior. We are inter- ing routing updates but it also increases convergence time.
ested in how damping affects routing dynamics in response Our analysis suggests that ispAS can largely control the
to one or more route flaps. To quantify the effect, we use trade-off by setting appropriate penalty increments, cut-off
two metrics: convergence time and message count. Conver- threshold, and reuse threshold. The configuration can be
gence time is defined as the time from when the originAS tuned so that a small number of flaps does not trigger any
stops flapping (i.e., sends its final route announcement) to damping delay, while a large number of flaps is suppressed,
when the last update message is observed in the network. keeping the overall updates injected into the network at a
reasonable level. Therefore, the overall intended behavior penalty value at the receiving router, until eventually either
in a network relies only on how the unstable link flaps and the flapping stops or the flapping routes are suppressed. [9]
how the incident routers set their damping parameters, re- showed that path exploration can amplify a single flap dur-
gardless of the rest of the network. ing the charging period and falsely trigger route suppres-
sion. In the rest of this section, we will analyze other states
4 Damping Behavior in Distributed Systems and show that reuse timer interaction plays a major role in
these states.
The previous section describes damping’s intended be-
havior based on the rules applied to each individual router. 4.2 Secondary Charging Effect
However, the overall behavior of a network cannot be di-
rectly derived by examining individual routers separately. After charging ends, there is no update in flight or
As we will show in this section, the network damping be- queued for transmission. However, some routes may be
havior is largely driven by previously unknown reuse timer suppressed by some routers. In other words, some routes in
interactions among different routers. RIB-IN cannot be used in Local-RIB because their penalties
are over the threshold. This can occur in both the converged
4.1 Stages of Damping Behavior state and the suppressed state. To understand the difference
between these two states, one must determine whether the
Our simulation studies show that, when an unstable des- reuse timer will be silent or noisy when it expires.
tination exists and all the routers in a network perform BGP Figure 5 shows an example of a silent reuse timer.
damping, the whole network goes through different states Router A has received two routes, RB and RC , from neigh-
during damping. We will first give definitions to these bors B and C respectively. RB is the best path and is
states, then explain them in more detail, and discuss two currently installed in Local-RIB, while RC is suppressed
types of reuse timer interactions. and cannot be considered as a candidate for use in Local-
RIB. When the reuse timer for RC expires, RC will be-
• Charging: It starts with the first flap of the route to the come available and A will re-run its path selection algo-
unstable destination. During the charging period, rout- rithm. However in this case, RC is irrelevant and RB re-
ing updates are exchanged among routers and each up- mains as the best path. We say this reuse timer is silent
date increases (charges) the router’s damping penalty. since its expiration will have no effect on Local-RIB and
This charging ends when there is no update in transit will not trigger any update by A. The network is in a con-
or waiting to be sent in the whole network. verged state if there is no reuse timer at all or every reuse
timer is silent.
• Suppression: After charging, if there is at least one Figure 6 shows an example of a noisy reuse timer. Again
router whose best route is unavailable due to suppres- router A has received RB and RC from B and C respec-
sion, the network enters suppression state, which ends tively. But in this case, RB is currently suppressed and
when a reuse timer expires and triggers a new routing cannot be considered as a candidate for use in Local-RIB.
update. When its reuse timer expires, RB becomes available and A
• Releasing: This period follows the suppression period will select it as the new best path. A will update its Local-
and lasts until all the routing updates have been deliv- RIB and RIB-OUT, and announces this change to its neigh-
ered. bors. In turn this new message may cause A’s neighbors to
update their routes. The network is in the suppression state
• Converged: After releasing, the network enters con- if there is no pending update, and at least one router has a
verged state, where every route in each router’s Local- noisy reuse timer waiting to expire.
RIB is the best route from all its RIB-IN entries. Note When a noisy reuse timer expires, the network moves
that some RIB-IN entries might still be suppressed, but from the suppression state to the releasing state, during
they are not the best route and thus their unavailability which messages triggered by noisy route reuse can charge
makes no impact to Local-RIB. remaining reuse timers. For example, consider nodes X and
Y in Figure 1, and assume Y has suppressed link (X, Y ).
Figure 4 illustrates the transitions between different If a noisy reuse timer expires at X, it will trigger an update
states.1 Some routing flaps make the network move from sent to Y . Although this update was not directly caused by
the converged state to charging state, during which updates route flapping, Y will follow the damping rule and increase
are propagated in the network and each update increases the its penalty value, thus Y ’s reuse timer is charged again. We
1 In the real Internet, due to its large scale, different parts of the network call this type of interaction between X’s reuse timer and Y ’s
may be in different state and these four states may not be clearly separated. reuse timer the Secondary Charging effect. Combined with
3000
Cut-off Threshold
Reuse Threshold
2500

2000

Penalty
1500

1000

500

0
0 1000 2000 3000 4000 5000 6000
Time (seconds)

Figure 4. Four-state of a
damping process in a dis- Figure 5. Silent Reuse Figure 6. Noisy Reuse Figure 7. Damping
tributed system Penalty

path exploration, secondary charging will not only lengthen 4.4 Overall Damping Behavior
existing reuse timers, but can also lead to new route sup-
pressions sometimes. This drives the network to a new sup- The above discussion shows that there are two types of
pression period even though no new route flap has occurred. reuse timer interaction: secondary charging prolongs con-
The network can converge only when all noisy reuse timers vergence time, while muffling by ispAS’ reuse timer re-
have expired. duces secondary charging by making potentially noisy timer
Figure 7 shows an example of simulated route penalty expirations silent. These two types of timer interactions
over time after a single route flap. In this case, the router compete with each other, and the net result depends on the
computing the penalty is not adjacent to the flapping link; number of flaps sent by originAS.
more precisely it is 7 hops away from originAS. The Let RTh be ispAS’ reuse timer, and RTnet be the last
charging period happens within the first 100 seconds, dur- noisy reuse timer in the rest of the network. Initially RTh is
ing which path exploration amplifies one flap into several zero as route suppression is not triggered at ispAS. But once
updates and triggers route suppression, as described in [9]. it is triggered, any further flaps from originAS will increase
With path exploration alone, the network would converge RTh only and have no effect on RTnet at all. As the number
around 2000th second when the route is reused. However, of flaps increases, a critical point (Nh ) is reached when
due to secondary charging, the penalty value is pushed up
over the cut-off threshold again. Before the route is even- RTh > RTnet
tually reused after the 5000th second, secondary charging
That is, when the number of flaps is greater than a certain
pushes the penalty up three more times. In this case, sec-
number Nh , RTh will outlast all noisy reuse timers in the
ondary charging accounts for more than 60% of the total
network, making the muffling effect dominant. When RTh
convergence delay! We will discuss details of the simula-
expires, it is the only reuse timer in the entire network, and
tions in Section 5.
there will be no secondary charging at all. The convergence
time will be totally determined by when RTh expires, which
4.3 Muffling Effect brings the convergence time in line with the intended behav-
ior, as we described in Section 3. The overall results can be
For a single or a small number of route flaps, ispAS summarized as follows:
does not suppress or delay any update. After a number
of flaps, however, route suppression will be triggered at is- • After a small number of route flaps, due to path explo-
pAS, and further flaps will be blocked from entering the ration and secondary charging, a network with damp-
network. Since the link (originAS, ispAS) is suppressed, ing can have longer convergence time than the in-
ispAS has no route to reach originAS. As a result, ispAS tended behavior.
sends a route withdrawal to all its peers, which is then prop-
• When the number of flaps is greater than a certain
agated throughout the network. Note that when a router
number, due to muffling effect, a system with damp-
receives a withdrawal message, it removes the route and in-
creases the penalty value for that route. When a remote ing follows the intended behavior.
router’s reuse timer expires, it will find no route to the orig-
inAS, thus cannot trigger any update. Any reuse timer that 5 Simulation Results
expires before ispAS reuses link (originAS, ispAS) will
be silent. We call this effect the Muffling effect. The muf- Predicting the actual damping behavior in a network is
fling effect is removed after ispAS reuses its route to the difficult. It depends on the degree of path exploration, tim-
originAS and sends an announcement to the network. ing of updates, and order of reuse timer expirations. There-
7000 22000
No Damping (simulation, mesh) No Damping (simulation, mesh)
Full Damping (simulation, mesh) Full Damping (simulation, mesh)
Full Damping (simulation, Internet) 20000 Full Damping (simulation, Internet)
6000 Full Damping (calculation)
18000
Convergence Time (second) 5000 16000

Number of Updates
14000
4000
12000
3000
10000

2000 8000

6000
1000
4000

0 2000
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
Number of Pulses Number of Pulses

Figure 8. Convergence Time Figure 9. Message Count

fore we resort to simulations to verify our analysis and fur-


ther illustrate the timer interactions.
purpose, we also plotted the intended behavior of conver-
5.1 Simulation Methodology gence time based on the equations in Section 3. For in-
tended behavior, when the number of pulses n = 1 or 2,
We conducted BGP simulations using SSFNet [13]. Two route suppression is not triggered and the convergence time
types of network topologies are used: mesh topologies is the same as that of no damping; when n ≥ 3, route sup-
and Internet-derived topologies. A mesh topology is a pression is triggered and the convergence time goes up. This
2-dimensional grid in which nodes at opposite edges are added convergence delay is the price that damping is willing
connected, so that all nodes are topologically equal. An to pay for reducing message count in the network. It is de-
Internet-derived topology [1] is derived from the Internet termined by the originAS’ flapping pattern and the ispAS’
AS connectivity graph, and has long-tailed distribution of damping configuration, regardless of any other conditions
node degree. in the network.
Given a network topology, we randomly select a node to
be the ispAS and attach an originAS to it (Figure 1). Be- For the actual behavior, the results exhibit the same trend
fore the simulation starts, every node learns a stable route in both mesh topology and Internet-derived topology. For
to the originAS. We then repeatedly fail and recover link a small number of pulses, the damping dynamics deviates
[originAS, ispAS], causing originAS send alternate route from the intended behavior significantly with longer con-
withdrawals and announcements to ispAS. We call a pair of vergence time. But after the critical point (Nh = 5), the
a withdrawal and its following announcement a pulse. Af- convergence time matches the calculated values very well,
ter some number of n pulses, the link fully recovers and the which verifies our analysis in Section 4.
originAS stops flapping. Note the final update from origi-
nAS is always a route announcement. When n < 5, damping causes a long convergence de-
Results presented in this section are obtained from sim- lay which can be close to, or even more than one hour.
ulations with Cisco default parameters, flapping interval 60 Such a long delay cannot be explained by path exploration
seconds, topology size of 100 nodes, and damping enabled alone. Based on the damping parameter settings in Table
at all nodes. In [15], we report more simulation results 1, a suppression time of one hour corresponds to a penalty
from using different damping parameters, flapping inter- value of 12000. Since the penalty decays exponentially, the
vals, topology sizes, and partial deployment of damping. higher the value is, the faster it decreases. A penalty value
Though varying different factors results in different values of 12000 requires a large number of updates with very short
of convergence time and message count, the overall trend is inter-arrival time. However, path exploration cannot gen-
the same as the results presented here and can be explained erate updates in such rate, because it triggers route sup-
by our analysis in Section 4. pression, which will block the propagation of future up-
dates! In simulations we never observed any penalty value
5.2 Simulation Results close to 12000. However, the long convergence time can
be easily explained by secondary charging. As shown in
Figures 8 and 9 show the convergence time and message Figure 7, path exploration charges the initial penalty value
count versus the number of pulses. When there is no damp- to about 3000, but since secondary charging increases the
ing, convergence time is short and the message count in- reuse timer multiple times later, the total convergence time
creases linearly with the number of pulses. For comparison can be prolonged to more than one hour.
400 400 Np = 3 400 Np = 5
60
Np = 1

350 C 50 350 SC 350


R
300 40 300 300
Number of Updates

Number of Updates

Number of Updates
250 30 250 250

200 20 200 200

150 10 150 150

100 S 0 100 100


1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 M
50 50 50

0 0 0
0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000
Time (seconds) Time (seconds) Time (seconds)

(a) n = 1; C (charging), S (sup- (b) n = 3; M (muffling), SC (c) n = 5


pression), R (releasing) (strong secondary charging)

400 400 400


Np = 1 Np = 3 Np = 5

350 C S R 350 M 350

Number of Links being suppressed


Number of Links being suppressed

Number of Links being suppressed

300 300 300

250 250 250

SC
200 200 200

150 150 150

100 100 100

50 50 50

0 0 0
0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000
Time (seconds) Time (seconds) Time (seconds)

(d) n = 1; C (charging), S (sup- (e) n = 3; M (muffling), SC (f) n = 5


pression), R (releasing) (strong secondary charging)

Figure 10. Update Series and Damped Link Count

5.3 Charging, Suppression, and Releasing ger route suppression on the (originAS, ispAS) link, but
Figure 10(d) shows that it does trigger route suppression at
We now examine the simulation results with 100-node roughly 275 other links in the network.
mesh topology (Figure 10) in detail to illustrate how reuse After the 120th second, the update messages cease and
timer interactions influence damping dynamics and cause the network enters a suppression period that lasts from the
distinct charging, suppression, and releasing periods. 120th second to the 1574th second. During this time period,
there is no outstanding routing messages. However, many
The Effect of Single Pulse (n = 1) Figures 10(a) and preferred routes are marked as unavailable due to suppres-
10(d) plot update series and damped link count triggered by sion. Finally at the 1574th second, these reuse timers begin
a single pulse, respectively. The update series shows the to expire and the network enters the releasing period. As
number of update messages observed in the network in 5- the previously suppressed routes become available, new up-
second bins; the damped link count shows the total number dates are triggered and the damped link count decreases.
of links being suppressed at the moment.2 The releasing period lasts until the 5147th second when the
The originAS flaps by sending an initial withdrawal and last update is observed.3
then re-announces the route 60 seconds later. The first with- Note that although route suppressions happen in a rela-
drawal starts a charging period that lasts through the first tively short charging period, the release of all reuse timers is
120 seconds. Note that even though originAS sends only spread over a long period of time. The releasing period ac-
one withdrawal and one announcement, Figure 10(a) shows counts for about 70% of total convergence time and 30% of
that this single pulse is amplified to several hundred up- total message count. Path exploration is the cause of false
dates in the network. This one pulse is not enough to trig- suppression in the charging period, but it is the secondary
2 When a node suppresses routes from a neighbor node, we count it as
charging that is responsible for the extended releasing pe-
one “damped link.” Since there are 200 links in the topology, and each link
can be suppressed by either end, the upper bound on damped link count is 3 Some reuse timers expire after the 5147th second, but they are silent

400. and do not contribute to either convergence time or message count.


riod. Our further examination of the simulation results con- router estimates whether incoming updates are due to path
firms this claim. At first, a large number of reuse timers exploration or not, and if yes, the damping penalty will not
expire in roughly the same time period, as shown by the be increased. However, selective route flap damping does
rapid drop of damped link count between the 1574th and not detect all path exploration updates and does not address
the 2000th second. The expirations of these timers trigger the problem of secondary charging.
a new wave of updates, which increases damping penalty Secondary charging occurs when routers far away from
on other links. As a result, some reuse timers that have not the flapping origin suppress routes longer than routers close
expired are postponed. Figure 7 is a typical example. Over- to the flapping origin. In our simulations, this scenario is
all, secondary charging can occur multiple times, causing typically caused by path exploration. However, other fac-
some reuse timers to be postponed again and again, which tors, such as diverse damping parameter settings, can also
stretches the releasing period and exacerbates convergence lead to such scenario and cause secondary charging. For
time. example, assume router Y in Figure 1 has set more aggres-
sive damping parameters than router X, i.e., for the same
The Effect of Three Pulses (n = 3) Under our simula- sequence of updates, Y suppresses the route longer than X.
tion settings, the third pulse will trigger suppression on the After the originAS sends out a number of flaps, route sup-
[originAS, ispAS] link. As a result, the destination be- pression is triggered at both X and Y . Even if X and Y
comes unreachable. Comparing the Figures of n = 1 and receive exactly the same number of updates with same in-
n = 3, reuse timers that expire between the 1575th and the tervals, due to their different damping parameters, X will
1927th second are noisy timers in the case of single pulse, reuse its route to originAS earlier than Y . When X reuses
but are silent ones in the case of three pulses, which is the its route and sends it to Y , this announcement will re-charge
result of the muffling effect. Reuse timers that expire be- Y ’s reuse timer on link [X, Y ].
fore RTh are all muffled. The expiration of RTh triggers a The fundamental problem is that current damping in-
powerful secondary charging at the 1927th second. When creases the penalty value for all received updates, regard-
some other reuse timers expire shortly after the 2000th sec- less of their root cause. Routing updates can be triggered
ond, both the message count and damped link count surge by many different reasons, including route flapping, path
to a high level. The impact is so powerful that a new sup- exploration, route reuse, and so forth. When updates are
pression period is formed in Figure 10(e). produced by path exploration, false suppression can occur;
when updates are produced by route reuse, secondary charg-
ing can occur. The damping penalty should apply only to
The Effect of Five Pulses (n = 5) After five pulses, the updates caused by route flapping.
reuse timer at ispAS (RTh ) has been increased to the point We propose to use Root Cause Notification (RCN)
where it becomes the last timer to expire in the entire net- [11][2] to help guide damping decisions. RCN attaches
work. Beginning around the 1500th second, reuse timers in the root cause information to each update and thus allows
the rest of the network begin to expire. Due to the muffling routers to associate each update with a particular route flap
effect, all routers have declared the destination unreachable (or other cause). Each route flap (not each update) increases
and these reuse timers expire silently. The last timer to ex- the damping penalty. We first review the RCN concept and
pire is RTh . When this timer expires, it triggers a route then show RCN enables damping to behave as intended.
announcement sent to the network. As the route announce-
ment propagates throughout the network, it creates a small
6.1 Root Cause Notification (RCN)
surge of updates, but no secondary charging since there is
no other pending reuse timers. For any number of pulses
n ≥ 5, the convergence time is solely determined by when RCN attaches to each routing update the root cause that
RTh fires, exactly the intended behavior predicted from the triggers the update. A root cause is defined as RC =
single router view of damping algorithm. {[u v], status, seq num}, where [u v] is the root cause
link, status indicates whether the link is down or up, and
seq num is the sequence number associated with the link to
6 RCN-Enhanced Damping denote the order in which root causes are generated. A node
that detects the status change of an adjacent link sends out a
Our work and [9] clearly show that when the number of routing update with RC attached as an additional attribute.
flaps is small, damping can cause unintended long conver- When a node’s best path changes due to the receipt of an
gence delay. [9] proposes “selective route flap damping,” update message, this node will copy the root cause from the
in which a router attaches with each announcement a rel- incoming update into the outgoing update message. This
ative preference value compared with previous announce- ensures that any update that is triggered by the same link
ment. Based on this additional information, the receiving status change will carry the same root cause information.
7000
No Damping (simulation, mesh)
Full Damping (simulation, mesh)
Full Damping (simulation, Internet)
6000 Damping and RCN
Full Damping (calculation)

Convergence Time (second)


5000

4000

3000
Figure 11. Damping Without RCN
2000

1000

0
0 1 2 3 4 5 6 7 8 9 10
Number of Pulses

Figure 13. Convergence Time

22000
No Damping (simulation, mesh)
20000 Full Damping (simulation, mesh)
Full Damping (simulation, Internet)
18000 Damping and RCN

Figure 12. Damping With RCN 16000

Number of Updates
14000

12000

10000
For example, in Figure 1, assume the current se- 8000

quence number for link [ispAS originAS] maintained 6000

4000
at ispAS is 0. When ispAS first detects the failure 2000

of link [ispAS originAS], it will attach a root cause 0


0 1 2 3 4 5 6 7 8 9 10

of {[ispAS originAS], down, 1} to the withdrawal trig-


Number of Pulses

gered by this failure. All messages triggered by the


same link failure will carry exactly the same root cause Figure 14. Message Count
{[ispAS originAS], down, 1} when they are propagated
throughout the network. When link [ispAS originAS]
is up, ispAS will sends an announcement with root case damping mechanism. Even though a single route flap may
{[ispAS originAS], up, 2}. lead to multiple updates, only one update is passed through
RCN was originally developed to reduce BGP slow con- the filter to the damping algorithm. Note that the filter only
vergence. The details of the algorithm, message overhead, prevents some updates from reaching the damping algo-
and incremental deployment issues are addressed in [11]. rithm; all updates are still accepted and passed to the routing
In this paper, we make use of the RCN concept to improve decision process.
damping only. More specifically, we only assume that RCN Figures 11 and 12 illustrate how the RCN helps damp-
information is attached to the update messages; we do not ing. In Figure 11, a single route flap combined with path
assume that RCN is used to influence routing decisions or exploration results in several updates. Each of these updates
reduce convergence time. If RCN is used for both damping adds to the damping penalty and the single flap may result
and convergence improvement, our results should remain in false route suppression. In Figure 12, the same sequence
essentially the same, except that the overall message count of updates is received but each update carries an RCN that
associated with a flap should also be reduced as a result of identifies the single route flap. With this enhancement, false
RCN’s role in improving convergence. suppression and reuse timer interaction are prevented. Al-
though there are many updates during path exploration, they
6.2 Damping with RCN all carry the same root cause and thus only one of these up-
dates passes the filter and increases the damping penalty.
With root cause information attached to each routing up- When a suppressed route is reused, the RCN is attached to
date, we increase damping penalty only for updates caused the route announcement, which will not cause penalty in-
by route flaps. For each peer, a router maintains a recent his- crease at receiving routers since the root cause have been
tory of root causes that have been received from that peer. seen before.
When an update is received, its root cause is checked against Figures 13 and 14 show the simulation results of RCN-
the history list. If the root cause is already present in the enhanced damping in the 100-node mesh topology. With
history list, this update does not result in any penalty in- the help of RCN, routing convergence does not experience
crement. If this root cause has not been reported before, extra long delay when the number of pulses is small and it
a penalty increment is applied according to damping con- closely matches the calculated (intended) convergence time.
figuration, and the root cause is added into the history list. At the same time, damping achieves its goal of limiting the
In other words, our approach acts as a filter in front of the message count when the number of pulses is large. Inter-
4500
estingly, damping with RCN produces slightly more mes- With Policy
No policy
4000 Intended (calculation)
sages than damping without RCN. This is because when 3500

Convergence Time (second)


RCN is used, route suppression happens after three pluses, 3000

exactly as specified by the damping algorithm and param- 2500

2000
eters. Without RCN, false suppression happens earlier due 1500
to path exploration and reduces the number of messages. 1000

Overall, by making use of root cause information, we can 500

prevent complex interactions in the network from having 0


0 1 2 3 4 5 6 7 8 9 10
Number of Pulses
negative impact on damping, and make damping work as
the design intends.
Figure 15. Impact of Policy
7 Discussion
derived topology, in which every pair of connected nodes is
Our analysis and simulation show that the behavior of assigned a relationship as customer-provider or peer-peer.
a network is often influenced by unexpected interactions The policy regulates route announcement to ensure that a
among its components. In the case of BGP route damp- router does not transit data traffic for a third party. That is,
ing, path exploration results in false suppression, and unex- besides traffic originated by its own, a router will forward
pected timer interactions prolong the routing convergence. traffic only if the traffic is coming from its customers or des-
Our study discovered the previously unknown reuse timer tined to one of its customers. The simulation results show
interactions. that this policy greatly reduces the number of nodes that
Although our analysis and simulation are based on the turn on false suppression (not shown in the figure), thus re-
BGP damping mechanism, potential interactions among duces secondary charging and moves the convergence time
other components in BGP must be considered before one closer to the intended behavior. This unexpected reduction
applies our results to infer Internet routing performance. of slow convergence by routing policy can serve as another
For clarity of analysis, in this paper we only presented re- example of unexpected interactions among different com-
sults using a fixed flapping interval, full damping deploy- ponents in a large system. However routing policies do not
ment with consistent damping parameters at all the routers, eliminate all path explorations or muffle all noisy timers.
and shortest path routing policy. In reality unstable des- Our simulation results show that, although secondary charg-
tinations exhibit different flapping patterns and different ing now affects a small number of nodes, the convergence
routers have inconsistent damping parameter settings. Fur- delay for these affected nodes will still be much longer than
thermore, routing policies can have a big impact on path what the damping design intends.
exploration which in turn can affect the damping behavior. Overall, we make two observations. First, policies that
For example, routing policies other than shortest path are currently in place help reduce some of the undesired
are widely used to regulate BGP route selection and prop- damping behaviors. This is important to consider when
agation. BGP policies consider factors such as the com- translating simulation results into Internet results. Second,
mercial relationship between two networks and, as a result, routing policies change over time and other unexpected fac-
some physically plausible paths may be prohibited by pol- tors can influence routing behavior as well. Rather than
icy and not announced to peers. Routing policies lead to counting on routing policies to reduce slow convergence,
reduced number of alternate paths that can be explored dur- hence reduce the negative effect of damping, we believe the
ing convergence period, which in turn reduces the number RCN-enhanced damping provides the correct solution. It
of routers that turn on false suppression, the main factor that solves the fundamental problem by correctly identifying the
sets up secondary charging. Furthermore, the shortest path route flap that triggers a particular update and applying the
policy assumes that if a route is selected after a reuse timer damping penalty to the flap itself (as opposed to the per-
expires, this new route will be announced to all the neigh- ceived result of a flap).
bors. However under different routing policies, this route
may not be announced to some, or even all, the neighbors.
If a route is announced, the result is a noisy reuse timer; if 8 Conclusion
policy forbids the announcement, an otherwise noisy reuse
timer would be silenced. Route flap damping is a seemingly simple mechanism
To quantify the impact of routing policy on damping dy- to prevent the instability of any individual route from over-
namics, we have run simulations using the no-valley rout- loading the global system. If one examines the effect of
ing policy, which is widely adopted in practice [4][5]. Fig- BGP’s damping mechanism at a single router, the result is
ure 15 shows the simulation result on a 208-node Internet- simple: all updates of a route are propagated without delay
as long as the associated penalty value is below the thresh- Zhao for the discussion on early version of the paper, and
old; otherwise they are blocked. However when damping is anonymous reviewers for their comments.
applied to a network of routers, not only slow convergence
can falsely trigger suppression elsewhere other than at the References
router adjacent to the flapping origin, but route reuse timer
interactions can also lead to “after shock” effect, unsettling
[1] BJ Premore. Multi-as topologies from bgp routing tables.
routing changes long after the flapping origin has stabilized. http://www.ssfnet.org/Exchange/gallery/asgraph/index.html.
Our result explains how the number of flaps affects the
[2] J. Chandrashekar, Z. Duan, Z.-L. Zhang, and J. Krasky. Lim-
damping dynamics in a network of routers. For a persis-
iting path exploration in path vector protocols. In Proc. of
tently unstable route, the router closest to the flapping ori- IEEE INFOCOM, March 2005.
gin can damp the route and effectively isolate the instability
from the global system, achieving the goal of the damping [3] G. Huston. Commentary on inter-domain routing in the In-
ternet. RFC 3221, IETF, December 2001.
design. However when a route changes only a small number
of times, the combined results of path exploration and sec- [4] Geoff Huston. Interconnection, peering and settlements, part
ondary charging can lead to prolonged routing convergence i. Internet Protocol Journal, 2(1), 1999.
delay. We have proposed a simple solution that can effec- [5] Geoff Huston. Interconnection, peering and settlements, part
tively eliminate false suppression due to path exploration ii. Internet Protocol Journal, 2(2), 1999.
and reuse timer interactions. [6] C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian. Delayed
The global Internet routing infrastructure is a complex Internet Routing Convergence. In Proc. of ACM SIGCOMM,
system and it is difficult or even impossible to capture all the August 2000.
factors that may present in the Internet routing in simulation [7] C. Labovitz, G. Malan, and F. Jahanian. Origins of Inter-
experiments. Consequently, our simulation results may not net Routing Instability. In Proc. of IEEE INFOCOM, march
be directly applicable to predicting BGP damping dynamics 1999.
in the Internet. In particular, we believe that the commonly [8] Craig Labovitz, G. Robert Malan, and Farnam Jahanian. In-
used no-valley routing policies can reduce the number of ternet routing instability. ACM/IEEE Transactions on Net-
alternative paths, hence reducing false suppressions due to working, 6(5):515–528, October 1998.
path exploration, consequently the chances for reuse timer [9] Z. M. Mao, R. Govindan, G. Varghese, and R. Katz. Route
interactions. However, due to the Internet’s large scale and flap damping exacerbates Internet routing convergence. In
diversity in routing policy and damping deployment, path Proc. of ACM SIGCOMM, August 2002.
explorations of various degrees do exist in various parts of
[10] Christian Panigl, Joachim Schmitz, Philip Smith, and
the Internet, providing ready conditions for reuse timer in- Cristina Vistoli. Ripe routing-wg recommendations for co-
teractions, which can explain the known measurement re- ordinated route-flap damping parameters. RIPE 229, RIPE,
sults of prolonged routing convergence delay. October 2001.
Despite its unintended behavior under certain conditions,
[11] D. Pei, M. Azuma, D. Massey, and L. Zhang. BGP-RCN:
BGP damping serves as the last fence of defense against un- Improving BGP Convergence Through Root Cause Notifica-
stable routes when other mechanisms have failed. We firmly tion. Computer Networks, 2005. To appear.
believe that damping is a necessary mechanism to protect
[12] Y. Rekhter and T. Li. Border Gateway Protocol 4. RFC 1771,
the global Internet routing infrastructure from melting down
Internet Engineering Task Force, July 1995.
under high routing dynamics, and damping is equally essen-
tial to other distributed systems where resource constraints [13] SSF Research Network. Ssfnet. http://www.ssfnet.org.
such as power, bandwidth, and router resources are limited. [14] C. Villamizar, R. Chandra, and R. Govindan. Bgp route flap
The intriguing interplay of false damping and reuse timer dampening. RFC 2439, IETF, November 1998.
interactions discovered in this work serves as a good exam- [15] B. Zhang, D. Massey, and L. Zhang. Bgp dynamics during
ple of a more general challenge: one cannot predict the re- route flap damping. Technical Report 03-805, USC-CSD,
sulting behavior of a protocol in a large system by examin- November 2003.
ing its operation at a single component in isolation, because
such studies overlook essential (and often unexpected) in-
teractions that are inherent in a large system.

9 Acknowledgment

We would like to thank Morley Mao and BJ Premore


for the help on damping simulation in SSFNet, Xiaoliang

You might also like