Analyzing the Performance of an Anycast CDN

Matt Calder*,†, Ashley Flavel†, Ethan Katz-Bassett*, Ratul Mahajan†, and Jitendra Padhye†
†Microsoft   *University of Southern California

IMC'15, October 28–30, 2015, Tokyo, Japan.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3848-6/15/10. DOI: http://dx.doi.org/10.1145/2815675.2815717

ABSTRACT
Content delivery networks must balance a number of trade-offs when deciding how to direct a client to a CDN server. Whereas DNS-based redirection requires a complex global traffic manager, anycast depends on BGP to direct a client to a CDN front-end. Anycast is simple to operate, scalable, and naturally resilient to DDoS attacks. This simplicity, however, comes at the cost of precise control over client redirection. We examine the performance implications of using anycast in a global, latency-sensitive CDN. We analyze millions of client-side measurements from the Bing search service to capture anycast versus unicast performance to nearby front-ends. We find that anycast usually performs well despite the lack of precise control, but that it directs roughly 20% of clients to a suboptimal front-end. We also show that the performance of these clients can be improved through a simple history-based prediction scheme.

Categories and Subject Descriptors
C.2.5 [Computer-Communication Networks]: Local and Wide-Area Networks—Internet; C.4 [Performance of Systems]: Measurement techniques

Keywords
Anycast; CDN; Measurement

1. INTRODUCTION
Content delivery networks are a critical part of Internet infrastructure. CDNs deploy front-end servers around the world and direct clients to nearby, available front-ends to reduce bandwidth, improve performance, and maintain reliability. We focus on a CDN architecture which directs the client to a nearby front-end, which terminates the client's TCP connection and relays requests to a backend server in a data center. The key challenge for a CDN is to map each client to the right front-end. For latency-sensitive services such as search results, CDNs try to reduce the client-perceived latency by mapping the client to a nearby front-end.

CDNs can use several mechanisms to direct the client to a front-end. The two most popular mechanisms are DNS and anycast. DNS-based redirection was pioneered by Akamai. It offers fine-grained and near-real-time control over client-front-end mapping, but requires considerable investment in infrastructure and operations [35].

Some newer CDNs like CloudFlare rely on anycast [1], announcing the same IP address(es) from multiple locations, leaving the client-front-end mapping at the mercy of Internet routing protocols. Anycast offers only minimal control over client-front-end mapping and is performance agnostic by design. However, it is easy and cheap to deploy an anycast-based CDN – it requires no infrastructure investment beyond deploying the front-ends themselves. The anycast approach has been shown to be quite robust in practice [23].

In this paper, we aim to answer the questions: Does anycast direct clients to nearby front-ends? What is the performance impact of poor redirection, if any? To study these questions, we use data from Bing's anycast-based CDN [23]. We instrumented the search stack so that a small fraction of search response pages carry a JavaScript beacon. After the search results display, the JavaScript measures latency to four front-ends – one selected by anycast, and three nearby ones that the JavaScript targets. We compare these latencies to understand anycast performance and determine potential gains from deploying a DNS solution.

Our results paint a mixed picture of anycast performance. For most clients, anycast performs well despite the lack of centralized control. However, anycast directs around 20% of clients to a suboptimal front-end. When anycast does not direct a client to the best front-end, we find that the client usually still lands on a nearby alternative front-end. We demonstrate that the anycast inefficiencies are stable enough that we can use a simple prediction scheme to drive DNS redirection for clients underserved by anycast, improving the performance of 15%-20% of clients. Like any such study, our specific conclusions are closely tied to the current front-end deployment of the CDN we measure. However, as the first study of this kind that we are aware of, the results reveal important insights about CDN performance, demonstrating that anycast delivers optimal performance for most clients.

2. CLIENT REDIRECTION
A CDN can direct a client to a front-end in multiple ways.

DNS: The client will fetch a CDN-hosted resource via a hostname that belongs to the CDN. The client's local DNS resolver (LDNS), typically configured by the client's ISP, receives the DNS request to resolve the hostname and forwards it to the CDN's authoritative nameserver. The CDN makes a performance-based decision about what IP address to return based on which LDNS forwarded the request. DNS redirection allows relatively precise control over client redirection on small timescales by using small DNS cache TTL values.

Since a CDN must make decisions at the granularity of LDNS rather than client, DNS-based redirection faces some challenges. An LDNS may be distant from the clients that it serves, or may serve clients distributed over a large geographic region, such that there is no good single redirection choice an authoritative resolver can make. This situation is very common with public DNS resolvers such as Google Public DNS and OpenDNS, which serve large, geographically disparate sets of clients [17]. A proposed solution to this issue is the EDNS client-subnet-prefix standard (ECS), which allows a portion of the client's actual IP address to be forwarded to the authoritative resolver, allowing per-prefix redirection decisions [21].
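The LDNS- or ECS-based decision described above can be sketched as follows. This is a minimal illustration, not Bing's or any production resolver's logic; the front-end list, the coordinates, and the `geolocate` stub are hypothetical.

```python
import math

# Hypothetical front-end locations: name -> (lat, lon, unicast IP).
FRONT_ENDS = {
    "seattle":   (47.6, -122.3, "203.0.113.1"),
    "amsterdam": (52.4,    4.9, "203.0.113.2"),
    "singapore": ( 1.4,  103.8, "203.0.113.3"),
}

def geolocate(prefix):
    """Stand-in for a geolocation database lookup (hypothetical data)."""
    demo_db = {"198.51.100.0/24": (48.9, 2.4)}  # a client prefix near Paris
    return demo_db[prefix]

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def resolve(client_subnet):
    """Return the unicast IP of the geographically nearest front-end.

    With ECS, client_subnet is the client's own prefix; without ECS, the
    authoritative server can only use the LDNS address, which is the root
    of the mislocation problems described above.
    """
    where = geolocate(client_subnet)
    best = min(FRONT_ENDS, key=lambda fe: haversine_km(where, FRONT_ENDS[fe][:2]))
    return FRONT_ENDS[best][2]
```

For a prefix geolocated near Paris, `resolve` picks the Amsterdam front-end; a real deployment would rank candidates by measured latency history rather than raw distance.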
Anycast: Anycast is a routing strategy where the same IP address is announced from many locations throughout the world. BGP then routes clients to one front-end location based on BGP's notion of best path. Because anycast defers client redirection to Internet routing, it offers operational simplicity. Anycast has an advantage over DNS-based redirection in that each client redirection is handled independently, avoiding the LDNS problems described above.

Anycast has some well-known challenges. First, anycast is unaware of network performance, just as BGP is, so it does not react to changes in network quality along a path. Second, anycast is unaware of server load. If a particular front-end becomes overloaded, it is difficult to gradually direct traffic away from that front-end, although there has been recent progress in this area [23]. Simply withdrawing the route to take that front-end offline can lead to cascading overload of nearby front-ends. Third, anycast routing changes can cause ongoing TCP sessions to terminate and need to be restarted. In the context of the Web, which is dominated by short flows, this does not appear to be an issue in practice [31, 23]. Many companies, including CloudFlare, CacheFly, Edgecast, and Microsoft, run successful anycast-based CDNs.

Other Redirection Mechanisms: Whereas anycast and DNS direct a client to a front-end before the client initiates a request, the response from a front-end can also direct the client to a different server for other resources, using, for example, HTTP status code 3xx or the manifest-based redirection common for video [4]. These schemes add extra RTTs and thus are not suitable for latency-sensitive Web services such as search. We do not consider them further in this paper.

3. METHODOLOGY
Our goal is to answer two questions: 1) How effective is anycast in directing clients to nearby front-ends? And 2) How does anycast performance compare against the more traditional DNS-based unicast redirection scheme? We experiment with Bing's anycast-based CDN to answer these questions. The CDN has dozens of front-end locations around the world, all within the same Microsoft-operated autonomous system. We use measurements from real clients to Bing CDN front-ends using anycast and unicast. In § 4, we compare the size of this CDN to others and show how close clients are to the front-ends.

3.1 Routing Configuration
All test front-end locations have both anycast and unicast IP addresses.

Anycast: Bing is currently an anycast CDN. All production search traffic is currently served using anycast from all front-ends.

Unicast: We also assign each front-end location a unique /24 prefix which does not serve production traffic. Only the routers at the closest peering point to that front-end announce the prefix, forcing traffic to the prefix to ingress near the front-end rather than entering Microsoft's backbone at a different location and traversing the backbone to reach the front-end. This routing configuration allows the best head-to-head comparison between unicast and anycast redirection, as anycast traffic ingressing at a particular peering point will also go to the closest front-end.

3.2 Data Sets
We use both passive and active measurements in our study, as discussed below.

3.2.1 Passive Measurements
Bing server logs provide detailed information about client requests for each search query. For our analysis we use the client IP address, location, and which front-end was used during a particular request. This data set was collected during the first week of April 2015 and represents many millions of queries.

3.2.2 Active Measurements
To actively measure CDN performance from the client, we inject a JavaScript beacon into a small fraction of Bing Search results. After the results page has completely loaded, the beacon instructs the client to fetch four test URLs. These URLs trigger a set of DNS queries to our authoritative DNS infrastructure. The DNS query results are randomized front-end IPs for measurement diversity, which we discuss more in § 3.3.

The beacon measures the latency to these front-ends by downloading the resources pointed to by the URLs, and reports the results to a backend infrastructure. Our authoritative DNS servers also push their query logs to the backend storage. Each test URL has a globally unique identifier, allowing us to join HTTP results from the client side with DNS results from the server side [34].

The JavaScript beacon implements two techniques to improve the quality of the measurements. First, to remove the impact of DNS lookup from our measurements, we first issue a warm-up request so that the subsequent test will use the cached DNS response. While DNS latency may be responsible for some aspects of poor Web-browsing performance [5], in this work we focus on the performance of paths between client and front-ends. We set TTLs longer than the duration of the beacon. Second, using JavaScript to measure the elapsed time between the start and end of a fetch is known not to be a precise measurement of performance [32], whereas the W3C Resource Timing API [29] provides access to accurate resource download timing information from compliant Web browsers. The beacon first records latency using the primitive timings. Upon completion, if the browser supports the Resource Timing API, the beacon substitutes the more accurate values.

We study measurements collected from many millions of search queries over March and April 2015. We aggregated client IP addresses from measurements into /24 prefixes because they tend to be localized [27]. To reflect that the number of queries per /24 is heavily skewed across prefixes [35], for both the passive and active measurements, we present some of our results weighting the /24s by the number of queries from the prefix in our corresponding measurements.

3.3 Choice of Front-ends to Measure
The main goal of our measurements is to compare the performance achieved by anycast with the performance achieved by directing clients to their best performing front-end. Measuring from each client to every front-end would introduce too much overhead, but we cannot know a priori which front-end is the best choice for a given client at a given point in time.

We use three mechanisms to balance measurement overhead against measurement accuracy, in terms of uncovering the best performing choices and obtaining sufficient measurements to them. First, for each LDNS, we consider only the ten closest front-ends to the LDNS (based on geolocation data) as candidates to return to the clients of that LDNS. Recent work has shown that LDNS is a good approximation of client location: excluding 8% of demand from public resolvers, only 11-12% of demand comes from clients who are further than 500km from their LDNS [17]. In Figure 1, we will show that our geolocation data is sufficiently accurate that the best front-ends for the clients are generally within that set. Second, […]

Figure 1: Diminishing returns of measuring to additional front-ends. The close grouping of lines for the 5th+ closest front-ends suggests that measuring to additional front-ends provides negligible benefit.

Figure 2: Distances in kilometers (log scale) from volume-weighted clients to nearest front-ends.
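The first mechanism of § 3.3, restricting measurement to the ten front-ends closest to each LDNS and returning randomized targets from that set, can be sketched as below; the front-end codes and coordinates are invented for illustration.

```python
import math
import random

# Hypothetical front-end coordinates: code -> (lat, lon).
FRONT_ENDS = {
    "sea": (47.6, -122.3), "lax": (34.1, -118.2), "chi": (41.9, -87.6),
    "nyc": (40.7,  -74.0), "ams": (52.4,    4.9), "lhr": (51.5,   -0.1),
    "fra": (50.1,    8.7), "sin": ( 1.4,  103.8), "hkg": (22.3,  114.2),
    "syd": (-33.9, 151.2), "gru": (-23.5, -46.6), "bom": (19.1,   72.9),
}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def candidates_for_ldns(ldns_loc, n=10):
    """The n front-ends closest to the LDNS, by great-circle distance."""
    return sorted(FRONT_ENDS,
                  key=lambda fe: haversine_km(ldns_loc, FRONT_ENDS[fe]))[:n]

def beacon_targets(ldns_loc, k=3):
    """Pick k unicast front-ends at random from the candidate set; the
    beacon additionally always measures the anycast address."""
    return random.sample(candidates_for_ldns(ldns_loc), k)
```

Randomizing over the candidate set spreads measurement load while still, over many beacon runs, covering every plausible best front-end for clients behind that LDNS.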
Figure 3: The fraction of requests where the best of three different unicast front-ends out-performed anycast.

Figure 4: The distance in kilometers (log scale) between clients and the anycast front-ends they are directed to.
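"Past Closest" in Figure 4 is the extra distance anycast adds over a client's closest front-end. A minimal sketch of that metric and its volume-weighted summary, using made-up per-client data:

```python
# Hypothetical per-client data: (query volume, distance in km to the
# anycast-selected front-end, distance in km to the closest front-end).
CLIENTS = [
    (120,  150,  150),  # anycast picked the closest front-end (delta 0)
    ( 40,  900,  300),  # anycast overshot the closest by 600 km
    ( 10, 2500,  400),  # badly redirected client
]

def past_closest(clients):
    """Distance 'past closest': anycast distance minus closest distance."""
    return [(vol, d_anycast - d_closest) for vol, d_anycast, d_closest in clients]

def weighted_fraction_within(deltas, km):
    """Volume-weighted fraction of clients whose delta is <= km."""
    total = sum(vol for vol, _ in deltas)
    return sum(vol for vol, d in deltas if d <= km) / total

deltas = past_closest(CLIENTS)
# weighted_fraction_within(deltas, 0) gives the volume-weighted share of
# clients that anycast sent to their geographically closest front-end.
```

The unweighted variant simply counts clients instead of summing query volume, which is the difference between the paired lines in Figure 4.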
[…] chose to route towards router A. However, internally in our network, router B is very close to a front-end C, whereas router A has a longer intradomain route to the nearest front-end, front-end D. With anycast, there is no way to communicate [39] this internal topology […] front-end, but the ISP's internal policy chooses to hand off traffic at a distant peering point. Microsoft intradomain policy then directs the client's request to the front-end nearest to the peering point, not to the client. Examples we observed of this include an ISP carrying traffic from a client in Denver to Phoenix, and another carrying traffic from Moscow to Stockholm. In both cases, direct peering was present at each source city.

Intrigued by these sorts of case studies, we sought to understand anycast performance quantitatively. The first question we ask is whether anycast performance is poor simply because it occasionally directs clients to front-ends that are geographically far away, as was the case when clients in Moscow went to Stockholm.

Does anycast direct clients to nearby front-ends? In a large CDN with presence in major metro areas around the world, most ISPs will see BGP announcements for front-ends from a number of different locations. If peering among these points is uniform, then the ISP's least-cost path from a client to a front-end will often be the geographically closest. Since anycast is not load or latency aware, geographic proximity is a good indicator of expected performance.

Figure 4 shows the distribution of the distance from client to anycast front-end for all clients in one day of production Bing traffic. One line weights clients by query volume. Anycast is shown to perform 5-10% better at all percentiles when accounting for more active clients. We see that about 82% of clients are directed to a front-end within 2000 km, while 87% of client volume is within 2000 km.

The second pair of lines in Figure 4, labeled "Past Closest", shows the distribution of the difference between the distance from a client to its closest front-end and the distance from the client to the front-end anycast directs it to. About 55% of clients and weighted clients have distance 0, meaning they are directed to the nearest front-end. Further, 75% of clients are directed to a front-end within around 400 km of their closest, and 90% are within 1375 km of their closest. This supports the idea that, with a dense front-end deployment such as is achievable in North America and Europe, anycast directs most clients to a relatively nearby front-end that should be expected to deliver good performance, even if it is not the closest.

From a geographic view, we found that around 10-15% of /24s are directed to distant front-ends, a likely explanation for poor performance.¹ Next we examine how common these issues are from day to day and how long issues with individual networks persist.

Is anycast performance consistently poor? We first consider whether significant fractions of clients see consistently poor performance with anycast. At the end of each day, we analyzed all collected client measurements to find prefixes with room for improvement over anycast performance. For each client /24, we calculate the median latency between the prefix and each measured unicast front-end and anycast.

Figure 5: Daily poor-path prevalence during April 2015 showing what fraction of client /24s see different levels of latency improvement over anycast when directed to their best performing unicast front-end.

Figure 5 shows the prevalence of poor anycast performance each day during April 2015. Each line specifies a particular minimum latency improvement, and the figure shows the fraction of client /24s each day for which some unicast front-end yields at least that improvement over anycast. On average, we find that 19% of prefixes see some performance benefit from going to a specific unicast front-end instead of using anycast. We see 12% of clients with 10ms or more improvement, but only 4% see 50ms or more.

Poor performance is not limited to a few days – it is a daily concern. We next examine whether the same client networks experience recurring poor performance. How long does poor performance persist? Are the problems seen in Figure 5 always due to the same problematic clients?

Figure 6 shows the duration of poor anycast performance during April 2015. For the majority of /24s categorized as having poor-performing paths, those poor-performing paths are short-lived. Around 60% appear for only one day over the month. Around 10% of /24s show poor performance for 5 days or more, and these days are not necessarily consecutive: only 5% of /24s see continuous poor performance over 5 days or more.

These results show that while there is a persistent amount of poor anycast performance over time, the majority of problems only last […]

¹No geolocation database is perfect. A fraction of very long client-to-front-end distances may be attributable to bad client geolocation data.
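The daily analysis above (per-/24 median latencies to anycast and to each measured unicast front-end, thresholded on the improvement) can be sketched as follows; the measurement values and front-end names are made up.

```python
from statistics import median

# Hypothetical beacon measurements: prefix -> {target: [latencies in ms]}.
# "anycast" is the anycast address; the other keys are unicast front-ends.
MEASUREMENTS = {
    "192.0.2.0/24":    {"anycast": [80, 85, 90], "fe1": [30, 35, 40], "fe2": [95, 100]},
    "198.51.100.0/24": {"anycast": [25, 30],     "fe1": [60, 70],     "fe2": [28, 33]},
}

def improvement_ms(samples):
    """Median anycast latency minus the best (lowest) median unicast latency."""
    anycast_med = median(samples["anycast"])
    best_unicast = min(median(v) for k, v in samples.items() if k != "anycast")
    return anycast_med - best_unicast

def poor_prefixes(measurements, threshold_ms):
    """Prefixes where some unicast front-end improves on anycast by >= threshold."""
    return [prefix for prefix, samples in measurements.items()
            if improvement_ms(samples) >= threshold_ms]
```

Running this per day with thresholds of 10, 25, 50, and 100 ms would yield one point per line per day of a Figure 5-style plot.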
Figure 6: Poor path duration across April 2015. We consider poor anycast paths to be those with any latency inflation over a unicast front-end.

Figure 8: The distribution of change in client-to-front-end distance (log scale) when the front-end changes, for the 7% of clients that change front-end throughout a day.