Maglev: A Fast and Reliable Software Network Load Balancer

∗Work was done while at Google.
Abstract
Maglev is Google’s network load balancer. It is a large distributed software system that runs on commodity Linux servers. Unlike traditional hardware network load balancers, it does not require a specialized physical rack deployment, and its capacity can be easily adjusted by adding or removing servers. Network routers distribute packets evenly to the Maglev machines via Equal Cost Multipath (ECMP); each Maglev machine then matches the packets to their corresponding services and spreads them evenly to the service endpoints. To accommodate high and ever-increasing traffic, Maglev is specifically optimized for packet processing performance. A single Maglev machine is able to saturate a 10Gbps link with small packets. Maglev is also equipped with consistent hashing and connection tracking features, to minimize the negative impact of unexpected faults and failures on connection-oriented protocols. Maglev has been serving Google’s traffic since 2008. It has sustained the rapid global growth of Google services, and it also provides network load balancing for Google Cloud Platform.

1 Introduction

Google is a major source of global Internet traffic [29, 30]. It provides hundreds of user-facing services, in addition to many more services hosted on the rapidly growing Cloud Platform [6]. Popular Google services such as Google Search and Gmail receive millions of queries per second from around the globe, putting tremendous demand on the underlying serving infrastructure.

To meet such high demand at low latency, a Google service is hosted on a number of servers located in multiple clusters around the world. Within each cluster, it is essential to distribute traffic load evenly across these servers in order to utilize resources efficiently so that no single server gets overloaded. As a result, network load balancers form a critical component of Google’s production network infrastructure.

Figure 1: Hardware load balancer and Maglev.

A network load balancer is typically composed of multiple devices logically located between routers and service endpoints (generally TCP or UDP servers), as shown in Figure 1. The load balancer is responsible for matching each packet to its corresponding service and forwarding it to one of that service’s endpoints.

Network load balancers have traditionally been implemented as dedicated hardware devices [1, 2, 3, 5, 9, 12, 13], an approach that has several limitations. First, their scalability is generally constrained by the maximum capacity of a single unit, making it impossible to keep up with Google’s traffic growth. Second, they do not meet Google’s requirements for high availability. Though often deployed in pairs to avoid single points of failure, they only provide 1+1 redundancy. Third, they lack the flexibility and programmability needed for quick iteration, as it is usually difficult, if not impossible, to modify a hardware load balancer. Fourth, they are costly to upgrade. Augmenting the capacity of a hardware load balancer usually involves purchasing new hardware as well as physically deploying it. Because of all these limitations, we investigated and pursued alternative solutions.

With all services hosted in clusters full of commodity servers, we can instead build the network load balancer as a distributed software system running on these servers. A software load balancing system has many advantages over its hardware counterpart. We can address scalability by adopting the scale-out model, where the capacity of the load balancer can be improved by increasing the number of machines in the system: through ECMP forwarding, traffic can be evenly distributed across all machines. Availability and reliability are enhanced as the system provides N+1 redundancy. By controlling the entire system ourselves, we can quickly add, test, and deploy new features. Meanwhile, deployment of the load balancers themselves is greatly simplified: the system uses only existing servers inside the clusters. We can also divide services between multiple shards of load balancers in the same cluster in order to achieve performance isolation.
Despite all the benefits, the design and implementation of a software network load balancer are highly complex and challenging. First, each individual machine in the system must provide high throughput. Let N be the number of machines in the system and T be the maximum throughput of a single machine. The maximum capacity of the system is bounded by N × T. If T is not high enough, it will be uneconomical for the system to provide enough capacity for all services [22]. The system as a whole must also provide connection persistence: packets belonging to the same connection should always be directed to the same service endpoint. This ensures quality of service, as clusters are very dynamic and failures are quite common [23, 40].

This paper presents Maglev, a fast and reliable software network load balancing system. Maglev has been a critical component of Google’s frontend serving infrastructure since 2008, and currently serves almost all of Google’s incoming user traffic. By exploiting recent advances in high-speed server networking techniques [18, 41, 35, 31], each Maglev machine is able to achieve line-rate throughput with small packets. Through consistent hashing and connection tracking, Maglev provides reliable packet delivery despite frequent changes and unexpected failures. While some of the techniques described in this paper have existed for years, this paper shows how to build an operational system using these techniques. The major contributions of this paper are to: 1) present the design and implementation of Maglev, 2) share experiences of operating Maglev at a global scale, and 3) demonstrate the capability of Maglev through extensive evaluations.

2 System Overview

This section provides an overview of how Maglev works as a network load balancer. We give a brief introduction to Google’s frontend serving architecture, followed by a description of how the Maglev system is configured.

Figure 2: Maglev packet flow.

2.1 Frontend Serving Architecture

Maglev is deployed in Google’s frontend-serving locations, including clusters of varying sizes. For simplicity, we only focus on the setup in the smaller clusters in this paper, and briefly describe the larger cluster setup below. Figure 2 shows an overview of Google’s frontend serving architecture in the small cluster setup.

Every Google service has one or more Virtual IP addresses (VIPs). A VIP is different from a physical IP in that it is not assigned to a specific network interface, but rather served by multiple service endpoints behind Maglev. Maglev associates each VIP with a set of service endpoints and announces it to the router over BGP; the router in turn announces the VIP to Google’s backbone. Aggregations of the VIP networks are announced to the Internet to make them globally accessible. Maglev handles both IPv4 and IPv6 traffic, and all the discussion below applies equally to both.

When a user tries to access a Google service served on www.google.com, her browser first issues a DNS query, which gets a response (possibly cached) from one of Google’s authoritative DNS servers. The DNS server assigns the user to a nearby frontend location, taking into account both her geolocation and the current load at each location, and returns a VIP belonging to the selected location in response [16]. The browser will then try to establish a new connection with the VIP.

When the router receives a VIP packet, it forwards the packet to one of the Maglev machines in the cluster through ECMP, since all Maglev machines announce the VIP with the same cost. When the Maglev machine receives the packet, it selects an endpoint from the set of service endpoints associated with the VIP, and encapsulates the packet using Generic Routing Encapsulation (GRE) with the outer IP header destined to the endpoint.

When the packet arrives at the selected service endpoint, it is decapsulated and consumed. The response, when ready, is put into an IP packet with the source address being the VIP and the destination address being the user’s IP address.
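For concreteness, the GRE encapsulation step can be sketched as follows. This is an illustrative Python fragment, not Google’s implementation: the real forwarder rewrites headers in place over raw packet buffers, and the IP checksum computation is omitted here for brevity.

import struct
from socket import inet_aton

IPPROTO_GRE = 47
GRE_PROTO_IPV4 = 0x0800  # EtherType of the inner packet

def ipv4_header(src, dst, payload_len, proto):
    # Minimal 20-byte IPv4 header; checksum left as 0 for brevity.
    ver_ihl = (4 << 4) | 5  # version 4, header length 5 x 32-bit words
    return struct.pack("!BBHHHBBH4s4s",
                       ver_ihl, 0, 20 + payload_len,
                       0, 0,          # identification, flags/fragment offset
                       64, proto, 0,  # TTL, protocol, checksum
                       inet_aton(src), inet_aton(dst))

def gre_encapsulate(inner_packet, maglev_ip, endpoint_ip):
    # Wrap a received VIP packet for delivery to the selected endpoint.
    gre = struct.pack("!HH", 0, GRE_PROTO_IPV4)  # base 4-byte GRE header
    outer = ipv4_header(maglev_ip, endpoint_ip,
                        len(gre) + len(inner_packet), IPPROTO_GRE)
    return outer + gre + inner_packet

The endpoint decapsulates by stripping the outer IP and GRE headers, recovering the original packet with the VIP as its destination.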
Figure 4: Maglev forwarder structure.
The forwarder receives packets from the NIC (Network Interface Card), rewrites them with proper GRE/IP headers, and then sends them back to the NIC. The Linux kernel is not involved in this process.
Packets received by the NIC are first processed by the steering module of the forwarder, which calculates the 5-tuple hash¹ of the packets and assigns them to different receiving queues depending on the hash value. Each receiving queue is attached to a packet rewriter thread. The packet thread first tries to match each packet to a configured VIP. This step filters out unwanted packets not targeting any VIP. Then it recomputes the 5-tuple hash of the packet and looks up the hash value in the connection tracking table (covered in Section 3.3). We do not reuse the hash value from the steering module, to avoid cross-thread synchronization.

¹The 5-tuple of a packet refers to the source IP, source port, destination IP, destination port and IP protocol number.
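As a rough illustration of the steering step, consider the following Python sketch. The paper does not name the hash function, so BLAKE2b below is an arbitrary stand-in, and the queue count is an assumption.

import hashlib

NUM_QUEUES = 7  # assumed: one receiving queue per packet thread

def five_tuple_hash(src_ip, src_port, dst_ip, dst_port, proto):
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")

def steer(src_ip, src_port, dst_ip, dst_port, proto):
    # All packets of one connection land in the same receiving queue.
    return five_tuple_hash(src_ip, src_port, dst_ip, dst_port, proto) % NUM_QUEUES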
The connection table stores backend selection results for recent connections. If a match is found and the selected backend is still healthy, the result is simply reused. Otherwise the thread consults the consistent hashing module (covered in Section 3.4) and selects a new backend for the packet; it also adds an entry to the connection table for future packets with the same 5-tuple. A packet is dropped if no backend is available. The forwarder maintains one connection table per packet thread to avoid access contention. After a backend is selected, the packet thread encapsulates the packet with proper GRE/IP headers and sends it to the attached transmission queue. The muxing module then polls all transmission queues and passes the packets to the NIC.
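The per-thread connection table can be modeled roughly as below. This sketch assumes a simple direct-indexed fixed-size table; the paper does not describe the actual collision or expiration policy.

class ConnectionTable:
    # Per-packet-thread table mapping 5-tuple hashes to backends.
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size  # each slot: (tuple_hash, backend) or None

    def select(self, tuple_hash, consistent_hash_lookup, is_healthy):
        slot = tuple_hash % self.size
        entry = self.slots[slot]
        if entry and entry[0] == tuple_hash and is_healthy(entry[1]):
            return entry[1]                   # reuse the tracked selection
        backend = consistent_hash_lookup(tuple_hash)
        if backend is None:
            return None                       # no backend available: drop
        self.slots[slot] = (tuple_hash, backend)
        return backend

Because each packet thread owns its table exclusively, no locking is needed on this path.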
The steering module performs 5-tuple hashing instead of round-robin scheduling for two reasons. First, it helps lower the probability of packet reordering within a connection caused by varying processing speed of different packet threads. Second, with connection tracking, the forwarder only needs to perform backend selection once for each connection, saving clock cycles and eliminating the possibility of differing backend selection results caused by race conditions with backend health updates. In the rare cases where a given receiving queue fills up, the steering module falls back to round-robin scheduling and spreads packets to other available queues. This fallback mechanism is especially effective at handling large floods of packets with the same 5-tuple.
3.2 Fast Packet Processing

The Maglev forwarder needs to process packets as fast as possible in order to cost-effectively scale the serving capacity to the demands of Google’s traffic. We engineered it to forward packets at line rate – typically 10Gbps in Google’s clusters today. This translates to 813Kpps (packets per second) for 1500-byte IP packets. However, our requirements are much more stringent: we must handle very small packets effectively because incoming requests are typically small in size. Assuming an average IP packet size of 100 bytes, the forwarder must be able to process packets at 9.06Mpps. This subsection describes the key techniques we employed to reach and exceed this packet processing speed.

Maglev is a userspace application running on commodity Linux servers. Since the Linux kernel network stack is rather computationally expensive, and Maglev doesn’t require any of the Linux stack’s features, it is desirable to make Maglev bypass the kernel entirely for packet processing. With proper support from the NIC hardware, we have developed a mechanism to move packets between the forwarder and the NIC without any involvement of the kernel, as shown in Figure 5. When Maglev is started, it pre-allocates a packet pool that is shared between the NIC and the forwarder. Both the steering and muxing modules maintain a ring queue of pointers pointing to packets in the packet pool.

Figure 5: Packet movement into and out of the forwarder.

Both the steering and muxing modules maintain three pointers to the rings. On the receiving side, the NIC places newly received packets at the received pointer and advances it. The steering module distributes the received packets to packet threads and advances the processed pointer. It also reserves unused packets from the packet pool, places them into the ring, and advances the reserved pointer. The three pointers chase one another, as shown by the arrows in Figure 5. Similarly, on the sending side, the NIC sends packets pointed to by the sent pointer and advances it. The muxing module places packets rewritten by packet threads into the ring and advances the ready pointer. It also returns packets already sent by the NIC back to the packet pool and advances the recycled pointer. Note that the packets are not copied anywhere by the forwarder.
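The receive-side ring and its three chasing pointers can be modeled as in the following toy single-threaded sketch; in reality the ring is shared with the NIC and the pointers advance concurrently.

from types import SimpleNamespace

class RxRing:
    # received, processed and reserved increase monotonically; modulo the
    # ring size they index into the slot array.
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.reserved = 0   # slots [received, reserved) hold empty buffers
        self.received = 0   # slots [processed, received) hold new packets
        self.processed = 0  # everything before this was handed to a thread

    def reserve(self, buffer):          # steering module refills the ring
        if self.reserved - self.processed == self.size:
            return False                # ring full
        self.slots[self.reserved % self.size] = buffer
        self.reserved += 1
        return True

    def nic_deliver(self, payload):     # NIC fills a reserved buffer
        if self.received == self.reserved:
            return False                # no spare buffer: the NIC drops
        self.slots[self.received % self.size].data = payload
        self.received += 1
        return True

    def drain(self):                    # steering module consumes packets
        while self.processed < self.received:
            yield self.slots[self.processed % self.size]
            self.processed += 1

ring = RxRing(size=4)
ring.reserve(SimpleNamespace(data=None))  # buffers come from the packet pool
ring.nic_deliver(b"packet bytes")
for buf in ring.drain():
    pass  # hand buf to a packet thread; no packet data is ever copied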
To reduce the number of expensive boundary-crossing operations, we process packets in batches whenever possible. In addition, the packet threads do not share any data with each other, preventing contention between them. We pin each packet thread to a dedicated CPU core to ensure best performance. With all these optimizations, Maglev is able to achieve line rate with small packets, as shown in Section 5.2.
Further, the latency that Maglev adds to the path taken by each packet is small. Normally it takes the packet thread about 350ns to process each packet on our standard servers. There are two special cases in which packet processing may take longer. Since the forwarder processes packets in batches, each batch is processed when it grows large enough or when a periodic timer expires. In practice we set the timer to be 50µs. Therefore, if Maglev is significantly underloaded, a 50µs delay will be added to each packet in the worst case. One possible optimization for this case is to adjust batch sizes dynamically [32]. The other case where Maglev may add extra processing delay is when Maglev is overloaded. The maximum number of packets that Maglev can buffer is the size of the packet pool; beyond that the packets will be dropped by the NIC. Assuming the packet pool size is 3000 and the forwarder can process 10Mpps, it takes about 300µs to process all buffered packets. Hence a maximum of 300µs delay may be added to each packet if Maglev is heavily overloaded. Fortunately, this case can be avoided by proper capacity planning and adding Maglev machines as needed.
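The overload bound above is simple arithmetic, as the following back-of-envelope check shows:

POOL_SIZE = 3000         # packets the pool can buffer
RATE_PPS = 10_000_000    # forwarder throughput, 10Mpps

delay_us = POOL_SIZE / RATE_PPS * 1e6
print(delay_us)  # -> 300.0, matching the 300µs worst case quoted above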
3.3 Backend Selection

Once a packet is matched to a VIP, we need to choose a backend for the packet from the VIP’s backend pool. For connection-oriented protocols such as TCP, it is critical to send all packets of a connection to the same backend. We accomplish this with a two-part strategy. First, we select a backend using a new form of consistent hashing which distributes traffic very evenly. Then we record the selection in a local connection tracking table.

Maglev’s connection tracking table uses a fixed-size hash table mapping 5-tuple hash values of packets to backends. If the hash value of a packet does not exist in the table, Maglev will assign a backend to the packet and store the assignment in the table. Otherwise Maglev will simply reuse the previously assigned backend. This guarantees that packets belonging to the same connection are always sent to the same backend, as long as the backend is still able to serve them. Connection tracking comes in handy when the set of backends changes: for instance, when backends go up and down, are added or removed, or when the backend weights change.

However, per-Maglev connection tracking alone is insufficient in our distributed environment. First, it assumes all packets with the same 5-tuple are always sent to the same Maglev machine. Because the router in front of Maglev does not usually provide connection affinity, this assumption does not hold when the set of Maglev machines changes. Unfortunately, such changes are inevitable and may happen for various reasons. For example, when upgrading Maglevs in a cluster we do a rolling restart of machines, draining traffic from each one a few moments beforehand and restoring it once the Maglev starts serving again. This process may last over an hour, during which the set of Maglevs keeps changing. We also sometimes add, remove, or replace Maglev machines. All of these operations make standard ECMP implementations shuffle traffic on a large scale, leading to connections switching to different Maglevs in mid-stream. The new Maglevs will not have the correct connection table entries, so if backend changes occur at the same time, connections will break.

A second theoretical limitation is that the connection tracking table has finite space. The table may fill up under heavy load or SYN flood attacks. Since Maglev only evicts entries from the connection table when they are expired, once the table becomes full we will need to select a backend for each packet that doesn’t fit in the table. While in practice there is plenty of memory on a modern machine, in deployments where we share machines between Maglev and other services, we may need to sharply limit the connection table size.

If any of the above cases occur, we can no longer rely on connection tracking to handle backend changes. Thus Maglev also provides consistent hashing to ensure reliable packet delivery under such circumstances.

3.4 Consistent Hashing

One possible approach to address the limitations of connection tracking is to share connection state among all Maglev machines, for example in a distributed hash table as suggested in [34]. However, this would negatively affect forwarding performance – recall that connection states are not even shared among packet threads on the same Maglev machine, to avoid contention.

A better-performing solution is to use local consistent hashing. The concept of consistent hashing [28] or rendezvous hashing [38] was first introduced in the 1990s. The idea is to generate a large lookup table, with each backend taking a number of entries in the table. These methods provide two desirable properties that Maglev also needs for resilient backend selection:

• load balancing: each backend will receive an almost equal number of connections.

• minimal disruption: when the set of backends changes, a connection will likely be sent to the same backend as it was before.
Pseudocode 1 Populate Maglev hashing lookup table.
 1: function POPULATE
 2:   for each i < N do next[i] ← 0 end for
 3:   for each j < M do entry[j] ← −1 end for
 4:   n ← 0
 5:   while true do
 6:     for each i < N do
 7:       c ← permutation[i][next[i]]
 8:       while entry[c] ≥ 0 do
 9:         next[i] ← next[i] + 1
10:         c ← permutation[i][next[i]]
11:       end while
12:       entry[c] ← i
13:       next[i] ← next[i] + 1
14:       n ← n + 1
15:       if n = M then return end if
16:     end for
17:   end while
18: end function

Table 1: A sample consistent hash lookup table.

Permutation tables for the backends:

  j   B0  B1  B2
  0    3   0   3
  1    0   2   4
  2    4   4   5
  3    1   6   6
  4    5   1   0
  5    2   3   1
  6    6   5   2

Lookup table before and after B1 is removed:

  Entry  Before  After
  0      B1      B0
  1      B0      B0
  2      B1      B0
  3      B0      B0
  4      B2      B2
  5      B2      B2
  6      B0      B2

Both [28] and [38] prioritize minimal disruption over load balancing, as they were designed to optimize web caching on a small number of servers. However, Maglev takes the opposite approach for two reasons. First, it is critical for Maglev to balance load as evenly as possible among the backends; otherwise the backends must be aggressively overprovisioned in order to accommodate the peak traffic. Maglev may have hundreds of backends for certain VIPs, and our experiments show that both [28] and [38] would require a prohibitively large lookup table for each VIP to provide the level of load balancing that Maglev desires. Second, while minimizing lookup table disruptions is important, a small number of disruptions is tolerable by Maglev. In steady state, changes to the lookup table do not lead to connection resets because connections’ affinity to Maglev machines does not change at the same time. When connections’ affinity to Maglevs does change, resets are proportional to the number of lookup table disruptions.

With these considerations in mind, we developed a new consistent hashing algorithm, which we call Maglev hashing. The basic idea of Maglev hashing is to assign a preference list of all the lookup table positions to each backend. Then all the backends take turns filling their most-preferred table positions that are still empty, until the lookup table is completely filled in. Hence, Maglev hashing gives an almost equal share of the lookup table to each of the backends. Heterogeneous backend weights can be achieved by altering the relative frequency of the backends’ turns; the implementation details are not described in this paper.

Let M be the size of the lookup table. The preference list for backend i is stored in permutation[i], which is a random permutation of the array (0 .. M − 1). As an efficient way of generating permutation[i], each backend is assigned a unique name. We first hash the backend name using two different hashing functions to generate two numbers, offset and skip. Then we generate permutation[i] using these numbers as follows:

  offset ← h1(name[i]) mod M
  skip ← h2(name[i]) mod (M − 1) + 1
  permutation[i][j] ← (offset + j × skip) mod M

M must be a prime number so that all values of skip are relatively prime to it. Let N be the size of a VIP’s backend pool. Its lookup table is populated using Pseudocode 1. We use next[i] to track the next index in the permutation to be considered for backend i; the final lookup table is stored in the array entry. In the body of the outer while loop, we iterate through all the backends. For each backend i we find a candidate index c from permutation[i] which has not been filled yet, and fill it with the backend. The loop keeps going until all entries in the table have been filled.

The algorithm is guaranteed to finish. Its worst-case time complexity is O(M²), which only happens if there are as many backends as lookup table entries and all the backends hash to the same permutation. To avoid this happening we always choose M such that M ≫ N. The average time complexity is O(M log M): at step n we expect the algorithm to take M/(M − n) tries to find an empty candidate index, so the total number of steps is ∑_{n=1}^{M} M/n = O(M log M). Each backend will take either ⌊M/N⌋ or ⌈M/N⌉ entries in the lookup table, so the number of entries occupied by different backends will differ by at most 1. In practice, we choose M to be larger than 100 × N to ensure at most a 1% difference in hash space assigned to backends. Other methods of generating random permutations, such as the Fisher–Yates Shuffle [20], generate better quality permutations using more state, and would work fine here as well.

We use the example in Table 1 to illustrate how Maglev hashing works. Assume there are 3 backends, the
lookup table size is 7, and the (offset, skip) pairs of the
three backends are (3, 4), (0, 2) and (3, 1). The generated
permutation tables are shown in the left column, and the
lookup tables before and after backend B1 is removed are
presented in the right column. As the example shows, the
lookup tables are evenly balanced among the backends
both with and without B1. After B1 is removed, aside
from updating all of the entries that contained B1, only
one other entry (row 6) needs to be changed. In prac-
tice, with larger lookup tables, Maglev hashing is fairly
resilient to backend changes, as we show in Section 5.3.
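For concreteness, here is a direct Python transcription of the permutation formulas and Pseudocode 1. The concrete hash functions h1 and h2 are not specified in the paper, so the BLAKE2b-based stand-ins below are an assumption; any two independent hash functions work.

import hashlib

def _h(name, salt, mod):
    # Illustrative stand-in for the paper's h1/h2.
    d = hashlib.blake2b(f"{salt}:{name}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % mod

def populate(backends, M):
    # Pseudocode 1: fill a lookup table of prime size M (M >> N).
    # Returns entry, where entry[j] is the index of the backend owning
    # table position j. permutation[i][j] is computed on the fly as
    # (offset + j * skip) mod M rather than being materialized.
    N = len(backends)
    offset = [_h(name, "offset", M) for name in backends]
    skip = [_h(name, "skip", M - 1) + 1 for name in backends]
    next_idx = [0] * N
    entry = [-1] * M
    n = 0
    while True:
        for i in range(N):
            c = (offset[i] + next_idx[i] * skip[i]) % M
            while entry[c] >= 0:      # position taken: walk backend i's
                next_idx[i] += 1      # preference list further
                c = (offset[i] + next_idx[i] * skip[i]) % M
            entry[c] = i
            next_idx[i] += 1
            n += 1
            if n == M:
                return entry

table = populate(["backend-0", "backend-1", "backend-2"], M=7)
# A packet is then routed to backends[table[five_tuple_hash % M]].

Removing a backend and re-running populate changes only a small fraction of the entries, which is exactly the resilience measured in Section 5.3.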
4.1.1 Failover

Maglev machines were originally deployed in active-passive pairs to provide failure resilience, as were the hardware load balancers they replaced. Only active machines served traffic in normal situations. When an active machine became unhealthy, its passive counterpart would take over and start serving. Connections were usually uninterrupted during this process thanks to Maglev hashing, but there were some drawbacks to this setup. It used resources inefficiently, since half of the machines sat idle at all times. It also prevented us from scaling any VIP past the capacity of a single Maglev machine. Finally, coordination between active and passive machines was complex. In this setup, the machines’ announcers would monitor each other’s health and serving priority, escalating their own BGP priority if they lost sight of each other, with various tie-breaking mechanisms.

We gained a great deal of capacity, efficiency, and operational simplicity by moving to an ECMP model. While Maglev hashing continues to protect us against occasional ECMP flaps, we can multiply the capacity of a VIP by the maximum ECMP set size of the routers, and all machines can be fully utilized.

4.2 VIP Matching

In Google’s production networks, each cluster is assigned an external IP prefix that is globally routable. For example, cluster C1 in Figure 6 has prefix 74.125.137.0/24. The same service is configured as different VIPs in different clusters, and the user is directed to one of them by DNS. For instance, Service1 is configured as 74.125.137.1 in C1 and 173.194.71.1 in C2.

Google has several different classes of clusters, serving different sets of VIPs. External prefix lengths are the same for clusters of the same class, but may be different for different cluster classes. Sometimes, in emergencies, we need to redirect traffic to a different cluster via Maglev encapsulation. Therefore, we need the target Maglevs to be able to correctly classify traffic for arbitrary other clusters. One possible solution is to define all VIPs in all the clusters that may receive redirected traffic, but that would cause synchronization and scalability issues.

Instead, we implemented a special numbering rule and a novel VIP matching mechanism to cope with the problem. For each cluster class, we assign each VIP the same suffix across all clusters of that class. Then we use a prefix/suffix matching mechanism for VIP matching.
First, the incoming packet goes through longest prefix matching, to determine which cluster class it was destined for. Then it goes through longest suffix matching specific to that cluster class, to determine which backend pool it should be sent to. In order to reduce the need to keep configs globally in sync on a tight time scale, we preconfigure Maglevs with a large prefix group for each cluster class, from which prefixes for new clusters of the same class are allocated. This way a Maglev can correctly serve traffic originally destined for a cluster that it has never heard of.

As a result, each VIP is configured as a <Prefix Group, IP suffix, port, protocol> tuple. Take Figure 6 as an example. Assuming C2 and C3 are of the same class, if a packet towards 173.194.71.1 is received in C2 but Maglev determines that none of the endpoints in C2 can serve the packet, it will encapsulate and tunnel the packet towards the VIP address in C3 for the same service (173.194.72.1). Then a Maglev in C3 will decapsulate the packet and match the inner packet to Service1 using prefix/suffix matching, and the packet will be served by an endpoint in C3 instead.

This VIP matching mechanism is specific to Google’s production setup, but it provides a good example of the value of rapid prototyping and iteration that a software-based load balancer can offer.
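A simplified sketch of this prefix/suffix matching follows; the configuration values are entirely hypothetical, and a production implementation would use longest-match lookups over many prefix groups and cluster classes.

import ipaddress

CLUSTER_PREFIX_LEN = 24  # assumed fixed prefix length for this cluster class

# Hypothetical preconfigured prefix group covering present and future
# clusters of one class, plus the per-class VIP suffix map.
PREFIX_GROUPS = {"class-A": ipaddress.ip_network("74.125.136.0/21")}
VIP_SUFFIXES = {"class-A": {(1, 80, "tcp"): "Service1"}}

def match_vip(dst_ip, dst_port, proto):
    addr = ipaddress.ip_address(dst_ip)
    for cls, group in PREFIX_GROUPS.items():  # prefix match: cluster class
        if addr in group:
            host_bits = 32 - CLUSTER_PREFIX_LEN
            suffix = int(addr) & ((1 << host_bits) - 1)  # suffix match
            return VIP_SUFFIXES[cls].get((suffix, dst_port, proto))
    return None

print(match_vip("74.125.137.1", 80, "tcp"))  # -> Service1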
4.3 Fragment Handling

One special case that is not covered by the system described so far is IP fragmentation. Fragments require special treatment because Maglev performs 5-tuple hashing for most VIPs, but fragments do not all contain the full 5-tuple. For example, if a large datagram is split into two fragments, the first fragment will contain both L3 and L4 headers while the second will only contain the L3 header. Thus when Maglev receives a non-first fragment, it cannot make the correct forwarding decision based only on that packet’s headers.

Maglev must satisfy two requirements in order to handle fragments correctly. First, all fragments of the same datagram must be received by the same Maglev. Second, the Maglev must make consistent backend selection decisions for unfragmented packets, first fragments, and non-first fragments.

In general, we cannot rely on the hardware in front of Maglev to satisfy the first requirement on its own. For example, some routers use 5-tuple hashing for first fragments and 3-tuple hashing for non-first fragments. We therefore implemented a generic solution in Maglev to cope with any fragment hashing behavior. Each Maglev is configured with a special backend pool consisting of all Maglevs within the cluster. Upon receipt of a fragment, Maglev computes its 3-tuple hash using the L3 header and forwards it to a Maglev from the pool based on the hash value. Since all fragments belonging to the same datagram contain the same 3-tuple, they are guaranteed to be redirected to the same Maglev. We use the GRE recursion control field to ensure that fragments are only redirected once.

To meet the second requirement, Maglev uses the same backend selection algorithm to choose a backend for unfragmented packets and second-hop first fragments (usually on different Maglev instances). It maintains a fixed-size fragment table which records forwarding decisions for first fragments. When a second-hop non-first fragment is received by the same machine, Maglev looks it up in the fragment table and forwards it immediately if a match is found; otherwise it is cached in the fragment table until the first fragment is received or the entry expires.

This approach has two limitations: it introduces extra hops to fragmented packets, which can potentially lead to packet reordering, and it requires extra memory to buffer non-first fragments. Since packet reordering may happen anywhere in the network, we rely on the endpoints to handle out-of-order packets. In practice only a few VIPs are allowed to receive fragments, and we are easily able to provide a big enough fragment table to handle them.
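The fragment table can be modeled along these lines. This is a simplified sketch: the real table is fixed-size, and the paper does not detail its keying or expiration policy, so both are assumptions here (a natural key is the 3-tuple plus the IP identification field).

import time

class FragmentTable:
    TTL = 1.0  # seconds; an assumed expiry

    def __init__(self):
        self.decisions = {}  # datagram key -> (backend, timestamp)
        self.pending = {}    # datagram key -> fragments awaiting a decision

    def on_first_fragment(self, key, backend):
        # Record the decision and release fragments that arrived early.
        self.decisions[key] = (backend, time.time())
        return [(frag, backend) for frag in self.pending.pop(key, [])]

    def on_non_first_fragment(self, key, frag):
        entry = self.decisions.get(key)
        if entry and time.time() - entry[1] < self.TTL:
            return entry[0]  # decision known: forward immediately
        self.pending.setdefault(key, []).append(frag)
        return None          # buffered until the first fragment arrives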
4.4 Monitoring and Debugging

We consistently monitor the health and behavior of Maglev as we do any other production system – for example, we use both black box and white box monitoring. Our black box monitoring consists of agents all over the world which periodically check the reachability and latency of the configured VIPs. For our white box monitoring, we export various metrics from each Maglev machine via an HTTP server, and the monitoring system periodically queries each server to learn the latest Maglev serving status details. The system sends alerts when it observes abnormal behavior.

Due to Maglev’s distributed nature, multiple paths exist from the router through Maglev to the service endpoints. However, debugging is much easier when we are able to discern the exact path that a specific packet takes through the network. Thus we developed the packet-tracer tool, similar to X-trace [21]. Packet-tracer constructs and sends specially marked Maglev-recognizable payloads with specified L3 and L4 headers. The payloads contain receiver IP addresses to which Maglev sends debugging information. The packets usually target a specific VIP and are routed normally to our frontend locations. When a Maglev machine receives a packet-tracer packet, it forwards the packet as usual, while also sending debugging information, including its machine name and the selected backend, to the specified receiver.
Figure 7: Average, standard deviation and coefficient of variation of normalized load on all service endpoints in one cluster on a typical day.

Packet-tracer packets are rate-limited by Maglev, as they are expensive to process. This tool is extremely helpful in debugging production issues, especially when there is more than one Maglev machine on the path, as happens in the case of fragment redirection.

5 Evaluation

In this section we evaluate Maglev’s efficiency and performance. We present results from one of Google’s production clusters, as well as some microbenchmarks.

5.1 Load Balancing

As a network load balancer, Maglev’s major responsibility is to distribute traffic evenly across multiple service endpoints. To illustrate the load balancing performance of Maglev, we collected connections per second (cps) data from 458 endpoints in a cluster located in Europe. The data is aggregated from multiple HTTP services including Web Search. The granularity of data collection is 5 minutes, and the load is normalized by the average cps throughout the day. Figure 7 shows the average and standard deviation of the load across all endpoints on a typical day. The traffic load exhibits a clear diurnal pattern. The standard deviation is always small compared to the average load; the coefficient of variation is between 6% and 7% most of the time.

Figure 7 also presents the overprovision factor, computed as the maximum load over the average load at each time point. It is an important metric because we must ensure that even the busiest endpoints will always have enough capacity to serve all the traffic. The overprovision factor is less than 1.2 over 60% of the time. It is notably higher during off-peak hours, which is the expected behavior because it is harder to balance the load when there is less traffic. Besides, a higher overprovision factor during off-peak hours does not require the addition of Maglev machines. This provides a guideline of how much to overprovision at this specific location.

Figure 8: Throughput with and without kernel bypass.

5.2 Single Machine Throughput

Since each Maglev machine receives a roughly equal amount of traffic through ECMP, the overall throughput of Maglev can be estimated as the number of Maglev machines times the throughput of each single machine. The more traffic each machine can handle, the fewer machines will be required to provide the same frontend capacity. Thus single machine throughput is essential to the efficiency of the system.

The throughput of a Maglev machine is affected by many factors, including the number of packet threads, NIC speed, and traffic type. In this subsection we report results from a small testbed to evaluate the packet processing capability of a Maglev machine under various conditions. Unless otherwise specified, all experiments are conducted on servers equipped with two recent 8-core server-class CPUs, one 10Gbps NIC and 128GB of memory. We only use one CPU for Maglev; everything else, including the operating system, runs on the other CPU. The testbed consists of two senders, two receivers and one Maglev machine located in the same Ethernet domain. The senders slowly increase their sending rates, and the throughput of Maglev is recorded as the maximum number of packets per second (pps)² that Maglev can handle before starting to drop packets. We use two senders to ensure that Maglev eventually gets overloaded.

²Note that we report throughput in pps instead of bps because the effect of packet size on the pps throughput is negligible. Hence we measure the pps throughput using minimum-sized packets. The bps throughput is equal to min(pps × packet size, line rate bps).

5.2.1 Kernel Bypass

In this experiment, we run Maglev in both vanilla Linux network stack mode and kernel bypass mode to evaluate the impact of kernel bypass on the throughput of Maglev.
Figure 9: Throughput with different TCP packet types.

Figure 10: Throughput with different NIC speeds.

The senders are configured to send minimum-sized UDP packets from different source ports so that they are not assigned to the same packet thread by the steering module. Due to limitations of the test environment, the minimum size of UDP packets the senders can send is 52 bytes, slightly larger than the theoretical minimum for Ethernet. We vary the number of packet threads in each run of the experiment. Each packet thread is pinned to a dedicated CPU core (as we do in production) to ensure best performance. We use one core for steering and muxing, thus there can be at most 7 packet threads. We measure Maglev’s throughput with and without kernel bypass and present the results in Figure 8.

The figure shows the clear advantage of running Maglev in kernel bypass mode. There, Maglev is the bottleneck when there are no more than 4 packet threads; its throughput increases with the number of packet threads. When there are 5 or more packet threads, however, the NIC becomes the bottleneck. On the other hand, Maglev is always the bottleneck when using the vanilla Linux network stack, and the maximum throughput achieved is less than 30% of that of kernel bypass.

5.2.2 Traffic Type

Depending on the code execution paths within a packet thread, Maglev handles different types of traffic at different speeds. For example, a packet thread needs to select a backend for a TCP SYN packet and record it in the connection tracking table; it only needs to do a lookup in the connection tracking table for non-SYN packets. In this experiment we measure how fast Maglev handles different types of TCP packets.

Three traffic types are considered: SYN, non-SYN and constant-5-tuple. For the SYN and non-SYN experiments, only SYN and non-SYN TCP packets are sent, respectively. The SYN experiment shows how Maglev behaves during SYN flood attacks, while the non-SYN experiment shows how Maglev works with regular TCP traffic, performing backend selection once and using connection tracking afterwards. For the constant-5-tuple experiment, all packets contain the same L3 and L4 headers. This is a special case because the steering module generally tries to send packets with the same 5-tuple to the same packet thread, and only spreads them to other threads when the chosen one is full. The senders vary the source ports for the SYN and non-SYN experiments to generate different 5-tuples, but always use the same source port for the constant-5-tuple experiment. They always send minimum-sized TCP packets, which are 64 bytes in our test environment.

As in the previous experiment, Maglev reaches the NIC’s capacity with 5 packet threads in the non-SYN and constant-5-tuple experiments. However, for SYN packets, we see that Maglev needs 6 packet threads to saturate the NIC. This is because Maglev needs to perform backend selection for every SYN packet. Maglev performs best under constant-5-tuple traffic, showing that the steering module can effectively steer poorly-distributed packet patterns. Since all packets have the same 5-tuple, their connection tracking information always stays in the CPU cache, ensuring the highest throughput. For non-SYN packets, there are sporadic cache misses for connection tracking lookups, and so the throughput is slightly lower than that for constant-5-tuple traffic when there are fewer than 5 packet threads.

5.2.3 NIC Speed

In the previous experiments, the NIC is the bottleneck, as it is saturated by 5 packet threads. To understand Maglev’s full capability, this experiment evaluates its throughput using a faster NIC. Instead of the 10Gbps NIC, we install a 40Gbps NIC on the Maglev machine, and use the same setup as in Section 5.2.1. The results are illustrated in Figure 10. When there are no more than 5 packet threads, the 40Gbps NIC provides slightly higher throughput, as its chip is faster than the 10Gbps one. However, the throughput growth for the 40Gbps NIC does not slow down until 7 packet threads are used.
Figure 11: Load balancing efficiency of different hashing methods. M, K and R stand for Maglev, Karger and Rendezvous, respectively. Lookup table size is 65537 for small and 655373 for large.

Figure 12: Resilience of Maglev hashing to backend changes.

Because the NIC is no longer the bottleneck, this figure shows the upper bound of Maglev throughput with the current hardware, which is slightly higher than 15Mpps. In fact, the bottleneck here is the Maglev steering module, which will be our focus of optimization when we switch to 40Gbps NICs in the future.

5.3 Consistent Hashing

In this experiment we evaluate Maglev hashing and compare it against Karger [28] and Rendezvous [38] hashing. We are interested in two metrics: load balancing efficiency and resilience to backend changes.

To evaluate the load balancing efficiency of the methods, we populate one lookup table using each method and count the number of table entries assigned to each backend. We set the total number of backends to be 1000 and the lookup table size to be 65537 and 655373³. For Karger we set the number of views to be 1000. Figure 11 presents the maximum and minimum percent of entries per backend for each method and table size.

As expected, Maglev hashing provides almost perfect load balancing no matter what the table size is. When the table size is 65537, Karger and Rendezvous require backends to be overprovisioned by 29.7% and 49.5% respectively to accommodate the imbalanced traffic. The numbers drop to 10.3% and 12.3% as the table size grows to 655373. Since there is one lookup table per VIP, the table size must be limited in order to scale the number of VIPs. Thus Karger and Rendezvous are not suitable for Maglev’s load balancing needs.

³There is no special significance to these numbers except that they need to be prime.

Another important metric for consistent hashing is resilience to backend changes. Both Karger and Rendezvous guarantee that when some backends fail, the entries for the remaining backends will not be affected. Therefore we only evaluate this metric for Maglev. Figure 12 presents the percent of changed table entries as a function of the percent of concurrent backend failures. We set the number of backends to be 1000. For each failure number k, we randomly remove k backends from the pool, regenerate the lookup table, and compute the percent of changed entries. We repeat the experiment 200 times for each k value and report the average results.

Figure 12 shows that the ratio of changed entries increases with the number of concurrent failures. Maglev hashing is more resilient to backend changes when the table size is larger. In practice we use 65537 as the default table size because we expect concurrent backend failures to be rare, and we still have connection tracking as the primary means of protection. In addition, microbenchmarks show that the lookup table generation time increases from 1.8ms to 22.9ms as the table size grows from 65537 to 655373, which prevents us from increasing the table size indefinitely.

6 Related Work

Unlike traditional hardware load balancers [1, 2, 3, 5, 9, 12, 13], Maglev is a distributed software system which runs on commodity servers. Hardware load balancers are usually deployed as active-passive pairs. Maglev provides better efficiency and resiliency by running all servers in active mode. In addition, upgrading hardware load balancer capacity requires purchasing new hardware as well as physically deploying it, making on-demand capacity adjustment difficult. On the other hand, Maglev’s capacity can easily be adjusted up or down without causing any service disruption. Some hardware vendors also provide load balancing software that runs in virtualized environments. Maglev provides much higher throughput than these virtual load balancers.

Ananta [34] is a distributed software load balancer. Like Maglev, it employs ECMP to scale out the system and uses a flow table to achieve connection affinity.
However, it does not provide a concrete mechanism to handle changes to the load balancer pool gracefully, and it is not specially optimized for single machine performance. Maglev does not have a component similar to Ananta’s HostAgent, which provides NAT services, but there is an external system (not described here) that offers similar functionality. Ananta allows most internal VIP traffic to bypass the load balancer. Maglev does not provide a similar feature because it has enough capacity for the internal traffic. Embrane [4] is a similar system developed for virtual environments. However, its throughput optimization can be difficult due to the limitations of virtualization. Duet [22] is a hybrid hardware and software load balancer which aims to address the low throughput issue of pure software load balancers. Maglev is able to achieve sufficiently high throughput, thus a hybrid solution becomes unnecessary.

There are also many generic load balancing software packages, the most popular of which are NGINX [14], HAProxy [7], and Linux Virtual Server [11]. They usually run on single servers, but it is also possible to deploy multiple servers in an ECMP group behind a router to achieve the scale-out model. They all provide consistent hashing mechanisms. Compared to Maglev, they mostly prioritize minimum disruption over even load balancing, as is done by [28] and [38]. Because they are designed for portability, they are not aggressively optimized for performance.

Consistent hashing [28] and rendezvous hashing [38] were originally introduced for the purpose of distributed cache coordination. Both methods provide guaranteed resilience such that when some backends are removed, only table entries pointing to those backends are updated. However, they don’t provide good load balancing across backends, which is an essential requirement for load balancers. On the contrary, Maglev’s consistent hashing method achieves perfect balance across the backends at the cost of slightly reduced resilience, which works well in practice when paired with connection tracking. Another option for implementing consistent hashing is distributed hash tables such as Chord [37], but this would add extra latency and complexity to the system.

Some of the performance optimization techniques used in Maglev have been extensively studied since the 1990s. Smith et al. [36] suggested improving application throughput by reducing interrupts and memory copying. Mogul et al. [33] developed a polling-based mechanism to avoid receive livelock caused by interrupts. Edwards et al. [19] explored the idea of userspace networking but did not manage to bypass the kernel completely. Marinos et al. [31] showed that specialized userspace networking stacks with kernel bypass can significantly improve application throughput. Hanford et al. [25] suggested distributing packet processing tasks across multiple CPU cores to improve the CPU cache hit ratio. CuckooSwitch [41] is a high-performance software L2 switch. One of its key techniques is to mask memory access latency through batching and prefetching. RouteBricks [18] explained how to effectively utilize multi-core CPUs for parallel packet processing.

Several kernel bypass techniques have been developed recently, including DPDK [8], OpenOnload [15], netmap [35], and PF_RING [17]. A good summary of popular kernel bypass techniques is presented in [10]. These techniques can be used to effectively accelerate packet processing speed, but they all come with certain limitations. For example, DPDK and OpenOnload are tied to specific NIC vendors, while netmap and PF_RING both require a modified Linux kernel. In Maglev we implement a flexible I/O layer which does not require kernel modification and allows us to conveniently switch among different NICs. As with other techniques, Maglev takes over the NIC once started. It uses the TAP interface to inject kernel packets back to the kernel.

GPUs have recently started becoming popular for high-speed packet processing [24, 39]. However, Kalia et al. [27] recently showed that CPU-based solutions are able to achieve similar performance with more efficient resource utilization if implemented correctly.

7 Conclusion

This paper presents Maglev, a fast, reliable, scalable and flexible software network load balancer. We built Maglev to scale out via ECMP and to reliably serve at 10Gbps line rate on each machine, for cost-effective performance with rapidly increasing serving demands. We map connections consistently to the same backends with a combination of connection tracking and Maglev hashing. Running this software system at scale has let us operate our websites effectively for many years, reacting quickly to increased demand and new feature needs.

Acknowledgements

We are grateful to Adam Lazur, Alex Tumko, Amin Vahdat, Angus Lees, Aspi Siganporia, Ben Treynor, Bill Coughran, Brad Calder, Craig Bergstrom, Doug Orr, Dzevad Trumic, Elliott Karpilovsky, Jeff Mogul, John T. Reese, Kyle Moffett, Luca Bigliardi, Mahesh Kallahalla, Mario Fanelli, Mike Dalton, Mike Shields, Natalya Etina, Nori Heikkinen, Pierre Imai, Roberto Peon, Simon Newton, Tina Wong, Trisha Weir, Urs Hölzle, and many others for their significant contributions to this paper and the success of Maglev. We would also like to thank our shepherd Nathan Bronson and the anonymous reviewers for their insightful feedback.
References

[1] A10. http://www.a10networks.com.
[2] Array Networks. http://www.arraynetworks.com.
[3] Barracuda. http://www.barracuda.com.
[4] Embrane. http://www.embrane.com.
[5] F5. http://www.f5.com.
[6] Google Cloud Platform. http://cloud.google.com.
[7] HAProxy. http://www.haproxy.org.
[8] Intel DPDK. http://www.dpdk.org.
[9] Kemp. http://www.kemptechnologies.com.
[10] Kernel bypass. http://blog.cloudflare.com/kernel-bypass.
[11] Linux Virtual Server. http://www.linuxvirtualserver.org.
[12] Loadbalancer.org. http://www.loadbalancer.org.
[13] NetScaler. http://www.citrix.com.
[14] NGINX. http://www.nginx.org.
[15] OpenOnload. http://www.openonload.org.
[16] F. Chen, R. K. Sitaraman, and M. Torres. End-user mapping: Next generation request routing for content delivery. In Proceedings of SIGCOMM, 2015.
[17] L. Deri. Improving passive packet capture: Beyond device polling. In Proceedings of SANE, 2004.
[18] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, M. Manesh, and S. Ratnasamy. RouteBricks: Exploiting parallelism to scale software routers. In Proceedings of SOSP, 2009.
[19] A. Edwards and S. Muir. Experiences implementing a high performance TCP in user-space. In Proceedings of SIGCOMM, 1995.
[20] R. A. Fisher and F. Yates. Statistical Tables for Biological, Agricultural and Medical Research. Edinburgh: Oliver and Boyd, 1963.
[21] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica. X-Trace: A pervasive network tracing framework. In Proceedings of NSDI, 2007.
[22] R. Gandhi, H. H. Liu, Y. C. Hu, G. Lu, J. Padhye, L. Yuan, and M. Zhang. Duet: Cloud scale load balancing with hardware and software. In Proceedings of SIGCOMM, 2014.
[23] P. Gill, N. Jain, and N. Nagappan. Understanding network failures in data centers: Measurement, analysis, and implications. In Proceedings of SIGCOMM, 2011.
[24] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: A GPU-accelerated software router. In Proceedings of SIGCOMM, 2010.
[25] N. Hanford, V. Ahuja, M. Balman, M. K. Farrens, D. Ghosal, E. Pouyoul, and B. Tierney. Characterizing the impact of end-system affinities on the end-to-end performance of high-speed flows. In Proceedings of NDM, 2013.
[26] V. Jacobson and B. Felderman. Speeding up networking. http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf.
[27] A. Kalia, D. Zhou, M. Kaminsky, and D. G. Andersen. Raising the bar for using GPUs in software packet processing. In Proceedings of NSDI, 2015.
[28] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the ACM Symposium on Theory of Computing, 1997.
[29] C. Labovitz. Google sets new internet record. http://www.deepfield.com/2013/07/google-sets-new-internet-record/.
[30] C. Labovitz, S. Iekel-Johnson, D. McPherson, J. Oberheide, and F. Jahanian. Internet inter-domain traffic. In Proceedings of SIGCOMM, 2010.
[31] I. Marinos, R. N. Watson, and M. Handley. Network stack specialization for performance. In Proceedings of SIGCOMM, 2014.
[32] J. C. McCullough, J. Dunagan, A. Wolman, and A. C. Snoeren. Stout: An adaptive interface to scalable cloud storage. In Proceedings of USENIX ATC, 2010.
[33] J. C. Mogul and K. K. Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. In Proceedings of USENIX ATC, 1996.
[34] P. Patel, D. Bansal, L. Yuan, A. Murthy, A. Greenberg, D. A. Maltz, R. Kern, H. Kumar, M. Zikos, H. Wu, C. Kim, and N. Karri. Ananta: Cloud scale load balancing. In Proceedings of SIGCOMM, 2013.
[35] L. Rizzo. netmap: A novel framework for fast packet I/O. In Proceedings of USENIX Security, 2012.
[36] J. Smith and C. Traw. Giving applications access to Gb/s networking. IEEE Network, 7(4):44–52, 1993.
[37] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of SIGCOMM, 2001.
[38] D. G. Thaler and C. V. Ravishankar. Using name-based mappings to increase hit rates. IEEE/ACM Transactions on Networking, 6(1):1–14, 1998.
[39] M. Varvello, R. Laufer, F. Zhang, and T. Lakshman. Multi-layer packet classification with graphics processing units. In Proceedings of CoNEXT, 2014.
[40] K. V. Vishwanath and N. Nagappan. Characterizing cloud computing hardware reliability. In Proceedings of SoCC, 2010.
[41] D. Zhou, B. Fan, H. Lim, M. Kaminsky, and D. G. Andersen. Scalable, high performance Ethernet forwarding with CuckooSwitch. In Proceedings of CoNEXT, 2013.