bgp-4 Notes
You can obtain an HTML or OpenOffice version of this tutorial with the hypertext links
by sending an email to the author.
BGP basics
Case study
Pre-CIDR
rapid growth
Source :
http://bgp.potaroo.net/as1221/bgp-active.html
http://bgp.potaroo.net
http://www.cidr-report.org
The reasons for the recent growth
http://bgp.potaroo.net
UPDATE                             UPDATE
Prefix: 194.100.0.0/23             Prefix: 194.100.0.0/16
NextHop: 194.100.0.1               NextHop: 194.100.0.2
ASPath: AS65000                    ASPath: AS123
AS65000: R1 (194.100.0.1), 194.100.10.0/23
Link: 194.100.0.0/30
AS123: R2 (194.100.0.2), 194.100.0.0/16
its provider
No impact on BGP routing table size
BGP/2003.4.5 © O. Bonaventure, 2003
The private AS numbers (range 64512 through 65535) are reserved for
private use and should not be advertised on the global Internet. See
See also
J. Stewart, T. Bates, R. Chandra, E. Chen, Using a Dedicated AS for Sites
Homed to a Single Provider, RFC2270, January 1998
Evolution of typical stub AS (2)
Advantage
Another solution is to strip the AS number of the client network from the BGP
advertisement. Removing this information may prevent other domains
from detecting loops. For this reason, two new attributes need to be
added to the BGP advertisement:
ATOMIC_AGGREGATE indicates that path information has been lost
in the aggregation process
It also indicates that the prefix should not be deaggregated
further
AGGREGATOR contains information useful for debugging
UPDATE
Prefix:194.100.0.0/16
NextHop:194.100.0.2
ASPath: AS123
AGGREGATOR
AS123, 194.100.0.2
ATOMIC_AGGREGATE
In April 2003, a BGP table collected by the RIPE RIS project contained about
7% of routes with the ATOMIC_AGGREGATE attribute
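The aggregation with AS stripping described above can be sketched in Python. This is only an illustration under stated assumptions: the dictionary layout and the helper name `aggregate_and_strip` are hypothetical, while the attribute names and the addresses come from the example UPDATE.

```python
import ipaddress

def aggregate_and_strip(client_route, aggregate_prefix, provider_as, provider_router):
    """Replace a client route by the provider's aggregate (hypothetical helper).

    The client's AS number is dropped from the AS path, so the two
    attributes named in the notes are attached to the new advertisement.
    """
    client_net = ipaddress.ip_network(client_route["prefix"])
    aggregate_net = ipaddress.ip_network(aggregate_prefix)
    if not client_net.subnet_of(aggregate_net):
        raise ValueError("client prefix is not covered by the aggregate")
    return {
        "prefix": str(aggregate_net),
        "next_hop": provider_router,
        "as_path": [provider_as],                      # client AS stripped
        "atomic_aggregate": True,                      # path information was lost
        "aggregator": (provider_as, provider_router),  # debugging aid
    }

# Route announced by the client network, as in the stub AS example.
client = {"prefix": "194.100.10.0/23", "next_hop": "194.100.0.1", "as_path": [65000]}
update = aggregate_and_strip(client, "194.100.0.0/16", 123, "194.100.0.2")
print(update)
```

The resulting dictionary mirrors the UPDATE shown above: AS path AS123 only, with AGGREGATOR set to (AS123, 194.100.0.2) and ATOMIC_AGGREGATE present.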
A dual-homed stub ISP
200.0.0.2
UPDATE UPDATE
R3 Prefix:194.100.10.0/23 Prefix:200.00.0.0/23,
NextHop:200.0.0.2 NextHop:200.0.0.2
ASPath: AS789:AS4567 ASPath: AS789
200.0.0.0/16
AS789
A dual-homed stub ISP (2)
UPDATE                             UPDATE
Prefix: 194.100.0.0/16             Prefix: 194.100.10.0/23
ASPath: ASX:ASY:{AS123,AS4567}     ASPath: ASW:ASZ:AS789:AS4567
R
AS9999
Routing table
194.100.10.0/23 Path: ASW:ASZ:AS789:AS4567
Consequences
For more information on filtering based on the RIR allocation guidelines, see
Steve Bellovin, Randy Bush, Timothy G. Griffin, and Jennifer Rexford,
"Slowing routing table growth by filtering based on address allocation
policies," June 2001, available from http://www.research.att.com/~jrex
BGP basics
Case study
The BGP decision process also contains an additional step after the ASPath
step where the routes with the lowest ORIGIN attribute are preferred. We
ignore this step and this attribute in this tutorial.
The BGP decision process implemented by router vendors may differ from
this theoretical description. For real BGP decision processes, see:
http://www.riverstonenet.com/support/bgp/routing-model/index.htm#_Route_Selection_Process
http://www.juniper.net/techpubs/software/junos53/swconfig53-ipv6/html/routing-overview-ipv69.html
http://www.foundrynet.com/services/documentation/ecmg/BGP4.html
There have been some proposals to allow ISPs to change the BGP decision
process on their routers to gain better control over the selected routes.
A. Retana, R. White, BGP Custom Decision Process, Internet draft,
draft-retana-bgp-custom-decision-00.txt, work in progress, 2003
quality of routes
Not always a good indicator
Consequence
Internet paths tend to be short, 3-5 AS hops
AS1
R8
R6's routing table
1/8: AS2 via R2 (eBGP, best)
1/8: AS2 via R3 (iBGP)
R7's routing table
1/8: AS2 via R2 (iBGP)
1/8: AS2 via R3 (eBGP, best)
C=50                               C=1
R6                                 R7
UPDATE                             UPDATE
Prefix: 1.0.0.0/8                  Prefix: 1.0.0.0/8
ASPath: AS2                        ASPath: AS2
NextHop: R2                        NextHop: R3
C=1                                C=98
R2          R0          R3
Flow of IP packets towards 1.0.0.0/8
1.0.0.0/8 AS2
The lowest MED step in
the BGP decision process
Motivation: cold potato routing
In a multi-connected AS, indicate which entry
AS1
C=50                               C=1
R6                                 R7
UPDATE                             UPDATE
Prefix: 1.0.0.0/8                  Prefix: 1.0.0.0/8
ASPath: AS2                        ASPath: AS2
NextHop: R2                        NextHop: R3
MED: 1                             MED: 98
C=1                                C=98
R2          R0          R3
AS1
R0
1.0.0.0/8
AS2 R2                             R3 AS3
UPDATE                             UPDATE
Prefix: 1.0.0.0/8                  Prefix: 1.0.0.0/8
ASPath: AS2:AS1                    ASPath: AS3:AS1
R1
Consequence
A router with a low IP address will be preferred
Note that on some router implementations, the lowest router id step in the
BGP decision process is replaced by the selection of the oldest route. See
e.g.: http://www.cisco.com/warp/public/459/25.shtml
Preferring the oldest route when breaking ties favors stable
paths over unstable paths. However, a drawback of this approach is that the
selection of the BGP routes then depends on the arrival times of the
corresponding messages. This makes the BGP selection process non-deterministic
and can lead to problems that are difficult to debug.
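The difference between the two tie-breaking rules can be illustrated with a short Python sketch. The router ids below are hypothetical values, and both functions are simplified stand-ins for the last step of the decision process, not a vendor implementation.

```python
from itertools import permutations

# Two routes that are still tied when the last step is reached.
# The router ids are hypothetical values chosen for the illustration.
ROUTES = [
    {"via": "R2", "router_id": "10.0.0.2"},
    {"via": "R3", "router_id": "10.0.0.3"},
]

def router_id_key(route):
    # Compare router ids numerically, octet by octet.
    return tuple(int(octet) for octet in route["router_id"].split("."))

def best_by_router_id(arrivals):
    """Lowest router id wins, whatever the arrival order."""
    return min(arrivals, key=router_id_key)["via"]

def best_by_age(arrivals):
    """Oldest route wins: the first route received is kept."""
    return arrivals[0]["via"]

# Try every possible arrival order of the two routes.
orders = [list(order) for order in permutations(ROUTES)]
by_id = {best_by_router_id(order) for order in orders}
by_age = {best_by_age(order) for order in orders}
print(by_id)   # a single outcome, independent of arrival order
print(by_age)  # the outcome changes with the arrival order
```

With the router id rule every arrival order yields the same best route, while the oldest-route rule yields a different winner for each order.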
More on the MED step in the BGP
decision process
Unfortunately, the processing of the MED is
more complex than described earlier
Correct processing of the MED
MED values can only be compared between routes received
from the SAME neighboring AS
Routes which do not have the MED attribute are considered
to have the lowest possible MED value.
Selection of the routes containing MED values
Figure: flow of IP packets
Routers: R8, R6, R6b, R7b, R7, R4, R3, R2, R5, R0
ASes: AS1, AS2, AS3, AS0
Link costs: C=50, C=1, C=1, C=50, C=1, C=9
R0:AS2:AS0, MED=0
R0:AS2:AS0, MED=9
In the example above, assuming a full iBGP mesh inside AS1 and that all
routes have the same local-pref value, router R8 will receive four paths to
reach router R0:
One path going via R5 in AS2 and received with MED=9
One path going via R3 in AS3 and received with MED=20
One path going via R2 in AS2 and received with MED=0
One path going via R4 in AS3 and received with MED=21
The local-pref and AS-Path steps of the decision process will not remove
any path from consideration.
The MED step of the BGP decision process will select, from each
neighboring AS, the paths with the smallest MED, namely:
One path going via R2 in AS2 and received with MED=0
One path going via R3 in AS3 and received with MED=20
Then, the closest nexthop step of the BGP decision process will select as
best path the path that leaves AS1 via router R7, i.e.:
One path going via R3 in AS3 and received with MED=20
This is the standardized processing of the MED attribute in BGP4. As always
with BGP4, some implementations allow operators to:
Ignore the MED values from a given peer
Process all MED values without considering the AS from which the MED
value was learned
in this case, the path via R6 would be selected by R8
...
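The MED processing of this example can be reproduced with a small Python sketch. The four paths and their MED values come from the text. The IGP costs from R8 to each next hop are hypothetical, chosen so that R3 (reached via R7) is close and R2 (reached via R6) is far, and the function names are illustrative.

```python
from collections import defaultdict

# The four paths towards R0 known by R8, taken from the example above.
# igp_cost (distance from R8 to the next hop) is a hypothetical value.
PATHS = [
    {"via": "R5", "neighbor_as": "AS2", "med": 9, "igp_cost": 3},
    {"via": "R3", "neighbor_as": "AS3", "med": 20, "igp_cost": 2},
    {"via": "R2", "neighbor_as": "AS2", "med": 0, "igp_cost": 51},
    {"via": "R4", "neighbor_as": "AS3", "med": 21, "igp_cost": 51},
]

def med_step(paths, compare_across_as=False):
    """Keep the paths with the smallest MED, per neighboring AS by default."""
    if compare_across_as:
        lowest = min(p["med"] for p in paths)
        return [p for p in paths if p["med"] == lowest]
    groups = defaultdict(list)
    for p in paths:
        groups[p["neighbor_as"]].append(p)
    kept = []
    for group in groups.values():
        lowest = min(p["med"] for p in group)
        kept.extend(p for p in group if p["med"] == lowest)
    return kept

def closest_nexthop_step(paths):
    """Keep the paths whose next hop is closest according to the IGP."""
    lowest = min(p["igp_cost"] for p in paths)
    return [p for p in paths if p["igp_cost"] == lowest]

def select(paths, compare_across_as=False):
    # local-pref and AS-Path length are equal here, so only these two steps matter.
    return closest_nexthop_step(med_step(paths, compare_across_as))

standard = [p["via"] for p in select(PATHS)]
always_compare = [p["via"] for p in select(PATHS, compare_across_as=True)]
print(standard)
print(always_compare)
```

With the standardized processing the path via R3 is selected. When all MED values are compared regardless of the neighboring AS, only the MED=0 path via R2 survives the MED step, which matches the behaviour described for the non-standard option.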
Route oscillations with MED
Figure: RR1, RR3, R1, R2, R3, RX, RZ, R0
Legend: eBGP session, iBGP session, physical link
Link costs: C=1, C=1, C=2, C=4
Route oscillations with MED (3)
RR1's best path selection
Other problems with Route Reflectors
RR2
RX RY RZ
Ra, Rb, and Rc will all prefer their direct eBGP path
RR1, RR2 and RR3 will never reach an agreement
With an iBGP full mesh, all BGP routers would receive the three possible
paths and RR1 would prefer the path via R2, RR2 would prefer the path via
R3 and RR3 would prefer the path via R1.
With Route Reflectors, the situation is more complex because each RR
only knows some of the routes, since each RR only advertises its best path
on the iBGP full mesh with the other RRs.
RR1 will learn the path via RX from its client R1. RR2 learns the path via
RY from its client R2 and RR3 learns the path via RZ from its client R3.
Assume RR1 is the first to select its path. It selects the RX path since it
only knows this path and advertises it to RR2 and RR3. Upon reception of
this advertisement, RR3 compares the path via RZ and the path via RX and
prefers the path via RX. RR3 advertises its best path to R3, but R3 still
prefers its direct path to RZ. Note that RR3 does not advertise the path via
RZ to the other RRs since this is not its best path.
Now, assume that RR2 selects its best path. It knows the paths via RX
(learned from RR1) and RY (learned via R2). The current best path is clearly
the path via RY and RR2 advertises this path to RR1 and RR3. Upon
reception of this advertisement, RR1 will select its best path again. Now,
RR1's best path is clearly the path via RY. Unfortunately, the selection of
this path forces RR1 to withdraw the path via RX that it initially advertised.
Upon reception of the withdraw message, RR3 will need to select its best
path... The RRs will exchange BGP messages forever without reaching a
consensus.
Physical link
RX RY
Note that this forwarding problem does not occur if R1 and R2 use some
tunneling mechanism (e.g. MPLS) to send packets towards RX and RY via
RR1 and RR2
Outline
BGP basics
Case study
Optimize performance
Wai Sum Lai et al., A framework for internet traffic engineering measurement,
Internet draft, draft-ietf-tewg-measure-02.txt, March 2002
Traffic Matrix Estimation: Existing Techniques and New Directions. A.
Medina (Sprint Labs, Boston University), N. Taft (Sprint Labs), K.
Salamatian (University of Paris VI), S. Bhattacharyya, C. Diot (Sprint
Labs)
See also the papers presented at the ACM SIGCOMM Internet Measurement
Workshops and at PAM
Link-level traffic monitoring
Principle
rely on SNMP statistics maintained by each
frequently
Advantages
Simple to use and to deploy
collection/presentation
Rough information about network load
Drawbacks
No addressing information
congestion
Principle
Within IETF, the IPFIX working group is expected to develop a standard
alternative to Netflow. See http://www.ietf.org/html.charters/ipfix-charter.html
Open source tools can also be used to capture traffic in Netflow format, see
e.g. http://www.ntop.org
Flow level traffic capture (3)
Advantages
provides detailed information on the traffic
Drawbacks
flow information needs to be exported to
monitoring station
information about one flow is 30 - 50 bytes
average size of HTTP flow is 15 TCP packets
CPU load on high speed routers
monitoring workstation
Netflow v8
This figure is based on a study of all the interdomain traffic of three distinct
ISPs at different periods of time. The trace was collected during one week
for BELNET, the Belgian Research ISP, five days for YUCOM, a dialup ISP
based in Belgium and one day for PSC, a gigapop in the US. This figure is
analyzed in:
B. Quoitin, S. Uhlig, C. Pelsser, L. Swinnen and O. Bonaventure, Interdomain
traffic engineering with BGP, IEEE Communications Magazine, May 2003,
http://www.info.ucl.ac.be/people/OBO/biblio.html
A similar result concerning the traffic distribution was obtained by studying
the traffic of a tier-1 ISP, see
This paper analyses the stability of the traffic sent by the UCL network to the
Internet during one month. The figure above was drawn by computing during
each hour, the sorted list of active AS Paths during this period and then
counting how many of those top AS-Paths were required to capture a given
amount of traffic.
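The hourly computation described above can be sketched as follows. This is a minimal illustration, not the code used for the study: the function name `paths_needed` and the byte counts are hypothetical.

```python
def paths_needed(traffic_by_path, fraction):
    """Number of top AS paths needed to carry `fraction` of the traffic."""
    total = sum(traffic_by_path.values())
    if total == 0:
        return 0
    target = fraction * total
    carried = 0
    # Walk through the AS paths from the busiest to the least busy.
    volumes = sorted(traffic_by_path.values(), reverse=True)
    for count, volume in enumerate(volumes, start=1):
        carried += volume
        if carried >= target:
            return count
    return len(volumes)

# Hypothetical byte counts for one hour, keyed by AS path.
hour = {
    ("AS1", "AS2"): 600,
    ("AS1", "AS3"): 250,
    ("AS1", "AS4", "AS5"): 100,
    ("AS1", "AS6"): 50,
}
half = paths_needed(hour, 0.5)
ninety = paths_needed(hour, 0.9)
print(half, ninety)
```

Repeating this for each hour of the month gives one point per hour for each traffic percentage.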
Topological dynamics of the traffic
sent by a stub during one month
The figure above was drawn by counting the number of times each AS Path
appeared in the hourly top 90% figure and comparing this information
with the amount of traffic sent on those AS Paths. It shows that a small
number of AS Paths are always present, but that most AS Paths only appear
during small periods of time.
The provider selection problem
Economical criteria
Cost of link
Cost of traffic
provider
Number of routes announced by provider
Length of the routes announced by provider
providers
12 large providers peering with routeviews
providers
P1
ISP
P2
Figure: bar chart per AS peer (AS 3257, AS 1239, AS 2914, AS 7911, AS 1668, AS 3561, AS 7018, AS 5511, AS 3356, AS 3549, AS 293, AS 1); vertical axis from 0 to 100000; legend: non-deterministic choice
AS2914 : Verio
AS3257 : TISCALI
AS1239 : Sprint
AS7911 : Williams
AS3561 : C&W USA
AS1668 : AOL
AS7018 : ATT
AS5511 : FT Backbone
AS3549 : GLBIX
AS3356 : Level3
AS1 : Genuity
AS293 : ESnet
For these ISPs, most of which are tier-1, the figure shows that the number of
common routes is very high, varying between 96.9 and 98.1% of the full BGP
table, except for AS2914, which has on average 85% of the routes in common
with the 11 other peers. The figure also shows that between 56033 and
69735 routes are selected in a non-deterministic manner by the BGP
decision process of our stub AS. A closer look at those routes reveals that
80% of them have an AS-Path length of 3 to 4 AS hops. On average, for all
considered pairs, almost 62% of the routes are chosen in a non-deterministic
manner. This result implies that the length of the AS-Path is not always a
sufficient criterion to select BGP routes and that ISPs could easily influence
their outgoing traffic by defining additional criteria to prefer one provider
over the other.
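The comparison of two providers can be sketched in a few lines of Python. The sketch assumes that both providers receive the same local-pref, so that AS-Path length is the first step able to separate the routes. The tables and the function name `compare_providers` are hypothetical.

```python
def compare_providers(table_a, table_b):
    """Count common prefixes and those left undecided by AS-Path length.

    Each table maps a prefix to the AS path announced by one provider.
    """
    common = table_a.keys() & table_b.keys()
    undecided = [
        prefix for prefix in common
        if len(table_a[prefix]) == len(table_b[prefix])
    ]
    return len(common), len(undecided)

# Hypothetical routes announced by two providers P1 and P2.
p1 = {
    "1.0.0.0/8": ["P1", "AS10", "AS20"],
    "2.0.0.0/8": ["P1", "AS30"],
    "3.0.0.0/8": ["P1", "AS40", "AS50", "AS60"],
}
p2 = {
    "1.0.0.0/8": ["P2", "AS11", "AS20"],
    "2.0.0.0/8": ["P2", "AS31", "AS30"],
    "4.0.0.0/8": ["P2", "AS70"],
}
common, undecided = compare_providers(p1, p2)
print(common, undecided)
```

Here two prefixes are announced by both providers and one of them has AS paths of equal length, so its selection falls through to the later tie-breaking steps.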
Tuning BGP to ...
control the outgoing traffic
Principle
To control its outgoing traffic, a domain must tune
local-pref
D. Allen, NPN: Multihoming and Route Optimization: Finding the Best Way
Home, Network Magazine, Feb. 2002,
http://www.networkmagazine.com/article/NMG20020206S0004
Principle
Allow a BGP router to install several paths
AS3
AS1 AS2
Issues AS0
Which AS Path will be advertised by AS0
http://www.juniper.net/techpubs/software/junos53/swconfig53-ipv6/html/ipv6-bgp-config29.html
BGP equal cost multipath (2)
Physical link
RA C=1 RB
RX RZ RY
Consequences
Border router receiving only eBGP routes
Besides considering equal cost paths for load balancing, some vendors also
support unequal load balancing by relying on the link bandwidth extended
community that allows routers to determine the bandwidth of external links.
See:
S. Sangli, D. Tappan, Y. Rekhter, BGP Extended Communities Attribute,
Internet draft, work in progress, Nov. 2002
http://www.ietf.org/internet-drafts/draft-ietf-idr-bgp-ext-communities-05.txt
AS3 AS4
R31
R32
10/7:AS1 R41
10/7:AS1 4.0.0.0/8
R22
R11
10/7:AS1
R12
10.0.0.0/8
In this example, we assume that no filters are applied by AS2, AS3 and AS4
on the routes received from AS1.
Control of the incoming traffic
Selective announcements
Principle
Advertise some prefixes only on some links
R31
AS3 AS4
R32
10/8:AS1 R41
4.0.0.0/8
11/8:AS1
R22
R11
11/8:AS1
R12 10/8:AS1 AS2
10.0.0.0/8
AS1 11.0.0.0/8
Drawbacks
splitting a prefix increases the size of all BGP routing tables
Limited redundancy in case of link failure
In this example, AS1 forces AS3 to send the packets towards 10.0.0.0/8
on the R31-R11 link and the packets towards 11.0.0.0/8 on the R32-R12
link. This is a common method used to balance traffic over
external links, but an important drawback is that if the R11-R31 link
fails, AS3 would not be able to utilize the R12-R32 link to reach
10.0.0.0/8 and would be forced to use the path through AS2.
Remember
When forwarding an IP packet, a router will always
Principle
advertise different overlapping routes on all links
Compared with selective announcements, the main
advantage of using more specific prefixes is that if link R11-R31 fails,
then the packets towards 10.0.0.0/8 will still be sent by AS3 through
the R32-R12 link since they are covered by the 10.0.0.0/7 route learned
from R12.
Note that if AS1 wants to use the more specific prefixes only to control
the traffic on its links with AS3 and not beyond, then the more specific
prefixes should be advertised with the NO_EXPORT community while
10.0.0.0/7 would be advertised without community values. With this
community value, the two more specific prefixes will not be advertised by
AS3 and thus will not contribute to the growth of the global BGP routing table.
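The effect of more specific prefixes relies on longest prefix match, which can be sketched with Python's standard `ipaddress` module. The route list is an assumption based on the example: 10.0.0.0/7 on both links, 10.0.0.0/8 only on R31-R11 and 11.0.0.0/8 only on R32-R12. The destination addresses and helper names are hypothetical.

```python
import ipaddress

def net(prefix):
    return ipaddress.ip_network(prefix)

def lookup(routes, destination):
    """Return the link of the most specific route covering `destination`."""
    address = ipaddress.ip_address(destination)
    matches = [(network, link) for network, link in routes if address in network]
    if not matches:
        return None
    best_network, best_link = max(matches, key=lambda match: match[0].prefixlen)
    return best_link

# Routes assumed to be learned by AS3 from AS1 (see the lead-in).
routes = [
    (net("10.0.0.0/7"), "R31-R11"),
    (net("10.0.0.0/8"), "R31-R11"),
    (net("10.0.0.0/7"), "R32-R12"),
    (net("11.0.0.0/8"), "R32-R12"),
]

before_10 = lookup(routes, "10.1.1.1")
before_11 = lookup(routes, "11.1.1.1")

# Failure of the R11-R31 link: the routes learned over it disappear.
after_failure = [route for route in routes if route[1] != "R31-R11"]
after_10 = lookup(after_failure, "10.1.1.1")
print(before_10, before_11, after_10)
```

Before the failure each /8 attracts its traffic to its own link. After the failure, packets towards 10.0.0.0/8 fall back to the covering 10.0.0.0/7 route on the R32-R12 link.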
Control of the incoming traffic
AS-Path prepending
Principle
Artificially prepend own AS number on some routes
R22
R11
AS-Path prepending is a popular technique since in the BGP decision
process, the selection of the shortest AS-Path is one of the most important
criteria. In theory, the length of the AS-Path is not necessarily an indication
of the quality of a path, but some studies have shown that, on average, short
AS-Paths offered better performance than longer paths.
Due to the importance of the "shortest AS-Path" criterion in the BGP decision
process, most interdomain routes used in the Internet are relatively short (up
to 3-4 transit ASes between source and destination for most routes).
See
http://ipmon.sprintlabs.com/paccess/routestat/trends.php?type=addrReachability_trend
for some information on the addresses that are reachable at N AS hops from
a large ISP like Sprint.
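A minimal Python sketch of prepending, assuming AS1 announces the same prefix to AS2 and AS3 and that a remote AS compares only AS-Path length (local-pref and all later steps are ignored here). The helper names are hypothetical.

```python
def prepend(as_path, own_as, times):
    """Return a copy of `as_path` with `own_as` added `times` more times in front."""
    return [own_as] * times + list(as_path)

def received_via(provider_as, announced_path):
    """AS path seen by a remote AS once the provider has added its own AS."""
    return [provider_as] + list(announced_path)

# AS1 announces its prefix normally to AS3 and prepends twice towards AS2.
announced_to_as3 = ["AS1"]
announced_to_as2 = prepend(["AS1"], "AS1", 2)

candidates = {
    "via AS3": received_via("AS3", announced_to_as3),
    "via AS2": received_via("AS2", announced_to_as2),
}
# The remote AS prefers the shortest AS path.
preferred = min(candidates, key=lambda name: len(candidates[name]))
print(candidates["via AS2"])
print(preferred)
```

The path through AS2 becomes AS2:AS1:AS1:AS1, so a remote AS that relies on AS-Path length prefers the route through AS3. A remote AS that sets a higher local-pref on the AS2 route is not affected by the prepending.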
Traffic engineering with BGP communities
Principle
Attach special community value to request
Drawbacks of community-based TE
Requires error-prone manual configurations
Proposed solution
Utilize extended communities to encode TE
AS3
R31 10/7:AS3:AS1 AS4
R32
10/7:AS1 R41
10/7:AS2:AS1
10/7:AS1 4.0.0.0/8
R22
R11
10/7:AS1
R12 R21 NOT_Announce(AS4)
10.0.0.0/8 10/7:AS1
NOT_Announce(AS4)
AS1 11.0.0.0/8 AS2
AS3
R31 10/7:AS3:AS1 AS4
R32
10/7:AS1 R41
10/7:AS2:AS1
10/7:AS1 4.0.0.0/8
10/7:AS2:AS2:AS2:AS1
R22
R11
10/7:AS1
R12 R21 Prepend(2,AS4)
10/7:AS1
10.0.0.0/8 Prepend(2,AS4)
Useful for backup link, but besides that, the only method
to find the amount of prepending is trial and error...
Communities/redistribution communities
BGP basics
Case study
R
AS2111
This evaluation was carried out by Cristel Pelsser in March-April 2003. The
links with the two upstream providers were GRE tunnels. Those
measurements could not have been done without the help of Jan Torrele
(Belnet), Benoît Piret and Patrice Devemy. This evaluation should be
considered as an experiment and not as a “comparison” between Belnet and
the Belgian ISP.
Measurements with AS-Path prepending
Without prepending
68% received via Belnet, 32% received via BISP
When prepending was used on the BISP link, the following results were
obtained:
With prepending once on BISP link
80% received via Belnet, 20% received via BISP
With prepending twice on BISP link
80% received via Belnet, 20% received via BISP
With prepending three times on BISP link
All traffic was received via Belnet
How to better balance the incoming
traffic ?
AS Path prepending is clearly not sufficient
to another
Level3 Communities
65000:0     announce to customers but not to peers
65000:XXX   do not announce to peer ASXXX
65001:0     prepend once to all peers
65001:XXX   prepend once to peer ASXXX
Telia Communities
1299:2009   Do not announce to EU peers
1299:5009   Do not announce to US peers
1299:2609   Do not announce to Concert
1299:2601   Prepend once to Concert
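As an illustration of how such community values encode a request, the Level3 values listed above can be decoded with a short Python function. The function name is hypothetical, the peer AS number in the usage line is only an example value, and the mapping covers nothing beyond the four Level3 entries of the list.

```python
def describe_level3_community(community):
    """Translate a community of the Level3 list above into its action."""
    high, low = (int(part) for part in community.split(":"))
    if high == 65000:
        if low == 0:
            return "announce to customers but not to peers"
        return f"do not announce to peer AS{low}"
    if high == 65001:
        if low == 0:
            return "prepend once to all peers"
        return f"prepend once to peer AS{low}"
    return "unknown"

# 1299 is used here only as an example peer AS number.
print(describe_level3_community("65000:0"))
print(describe_level3_community("65001:1299"))
```

The first half of the value selects the action and the second half selects the target: 0 for all peers, or the AS number of one peer.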