When Cloud Storage Meets RDMA
https://www.usenix.org/conference/nsdi21/presentation/gao
978-1-939133-21-2
Yixiao Gao♠♥, Qiang Li♥, Lingbo Tang♥, Yongqing Xi♥, Pengcheng Zhang♥, Wenwen Peng♥, Bo Li♥, Yaohui Wu♥, Shaozong Liu♥, Lei Yan♥, Fei Feng♥, Yan Zhuang♥, Fan Liu♥, Pan Liu♥, Xingkui Liu♥, Zhongjie Wu♥, Junping Wu♥, Zheng Cao♥, Chen Tian♠, Jinbo Wu♥, Jiaji Zhu♥, Haiyong Wang♥, Dennis Cai♥, and Jiesheng Wu♥
♠ Nanjing University, ♥ Alibaba Group
…a number of challenges specifically related to cloud storage, with additional problems associated with RDMA. We have developed a number of solutions to allow RDMA to function in production-level cloud storage, several of which are engineering-level work-arounds. However, overcoming the aforementioned RDMA issues proves to be a complicated task. Here, we expose the practical limitations of production systems in order to facilitate innovative research and applications in this area.

[Figure 1: Pangu block storage service framework, shown alongside the Ceph block service for comparison: client nodes, the BlockMaster/PanguMaster master clusters, BlockServers, and ChunkServers, with the data and control flows and three-way block replication.]

In addition to the performance, availability, and SLA requirements, the deployment planning of Pangu at the production scale should consider storage volume and hardware costs. Following the availability-first principle, RDMA communication is enabled only inside each podset [13]. Such a podset contains a group of leaf switches and all Top-of-Rack (ToR) switches connected to these leaf switches. The podsets are connected via spine switches. This setting is currently the optimal balance between application demands, performance, and availability/SLA control. Storage node configurations are carefully planned to match the disk throughput with the network bandwidth. We adopt the hybrid deployment of RDMA/TCP in Pangu to exploit TCP as the last resort for the system (§3).

The performance optimization aims to minimize latency while maximizing throughput. We leverage software-hardware co-design to minimize performance overhead. We build a software framework in Pangu that integrates RDMA with Pangu's private user-space storage platform designed for new storage media. By eliminating data-copy operations, the latency of a typical block service request is reduced to tens of microseconds. We observed that the memory bandwidth becomes a bottleneck when upgrading Pangu to a 100Gbps network. By exploiting RDMA features and offloading critical computations, Pangu is able to saturate the underlying networks. Furthermore, we leverage a new thread communication mode in Pangu to reduce the performance pitfall caused by a large number of Queue Pairs (QPs, the RDMA connection abstraction) per node (§4).

Previous studies have reported the risks of large-scale RDMA deployment [13]. RDMA-enabled Pangu clusters do encounter such problems, including PFC deadlocks [13], PFC pause frame storms, and head-of-line blocking [27, 44]. We determined several PFC storms to be attributed to a previously unexplored source that consequently invalidates an earlier solution [13]. In order to guarantee availability, we apply the escape-as-fast-as-possible design principle to handle PFC storms. We bring up a fine-grained switching mechanism between RDMA and TCP traffic in Pangu, and it handles PFC storms regardless of their causes (§5).

In order to meet the SLA standards, we adopt the design principle of exploiting storage semantics whenever useful in Pangu. By taking advantage of its ability to control the application layer, Pangu performs real-time checking and alarming for a large number of storage service and network metrics. With the help of the dual-home topology feature, we optimize the fail-over performance of Pangu by reducing the connection recovery time. We also fix network problems by exploiting application controls, for example, blacklisting problematically connected nodes (§6).

We share our experience in adopting the RDMA-enabled Pangu system and discuss several potential research directions (§7). This system has successfully served numerous online mission-critical services under the scope of Alibaba over the past four years, including several important shopping festivals (e.g., Double-11 [8]). Sharing our experience in integrating RDMA into Pangu can be helpful for other RDMA-enabled systems.

2 Background

2.1 Pangu in Alibaba Cloud

Pangu Framework. Pangu is a distributed file system developed by Alibaba Cloud. Released in 2009, it plays a major role in the core Alibaba businesses (e.g., e-business and online payment, cloud computing, enhanced solid-state-drive-backed cloud disk, elastic compute service, MapReduce-like data processing, and distributed database). In this paper, we focus on the network features of Pangu.

Pangu provides numerous storage services, including elastic block service, object storage service, store service, etc. We take the block service as an example to demonstrate the system framework. Fig. 1 presents the I/O workflows of Pangu. Virtual block devices contain continuous address spaces that can be randomly accessed by applications. A Pangu client in a computing node organizes data into fixed-sized (e.g., 32 GB) segments, while the BlockServers and ChunkServers run on storage nodes. Each segment is assigned to a BlockServer for I/O processing. On the BlockServers, a segment is divided into blocks and replicated to the ChunkServers, which are in charge of the standalone back-end storage of the blocks and device management.

The BlockMasters manage metadata such as the mapping between a segment and its located BlockServer and the BlockServer's living states. The PanguMasters manage the states of the ChunkServers. These master nodes are synchronized using consensus protocols such as Raft [36].
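To make the division of labor in Fig. 1 concrete, the sketch below models the block-service metadata path in plain Python. The fixed 32 GB segment size and three-way replication follow the description above; the class and method names are illustrative stand-ins, not Pangu's actual interfaces.

    # Minimal sketch of the Pangu block-service mapping described above.
    # Names and helpers are illustrative only; they are not Pangu's real APIs.
    SEGMENT_SIZE = 32 << 30   # fixed-sized segments, e.g., 32 GB
    REPLICAS = 3              # one BlockServer write produces three chunk replicas

    class ChunkServer:
        """Stand-alone back-end storage for blocks on a storage node."""
        def __init__(self, name):
            self.name = name
            self.chunks = {}

        def store(self, segment_id, data):
            self.chunks.setdefault(segment_id, []).append(data)

    class BlockServer:
        """Splits a segment into blocks and replicates each block to ChunkServers."""
        def __init__(self, name, chunk_servers):
            self.name = name
            self.chunk_servers = chunk_servers

        def write_block(self, segment_id, block_data):
            targets = self.chunk_servers[:REPLICAS]
            for cs in targets:                      # replication to three ChunkServers
                cs.store(segment_id, block_data)
            return [cs.name for cs in targets]

    class BlockMaster:
        """Tracks which BlockServer serves each segment (segment -> BlockServer)."""
        def __init__(self, block_servers):
            self.block_servers = block_servers
            self.segment_map = {}

        def locate(self, volume_id, offset):
            segment_id = (volume_id, offset // SEGMENT_SIZE)
            if segment_id not in self.segment_map:  # simplified round-robin placement
                idx = len(self.segment_map) % len(self.block_servers)
                self.segment_map[segment_id] = self.block_servers[idx]
            return segment_id, self.segment_map[segment_id]

    chunk_servers = [ChunkServer(f"cs{i}") for i in range(4)]
    servers = [BlockServer(f"bs{i}", chunk_servers) for i in range(2)]
    master = BlockMaster(servers)
    seg, bs = master.locate(volume_id=7, offset=40 << 30)   # falls into the second 32 GB segment
    print(seg, bs.write_block(seg, b"4KB-block..."))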
All data communication in Pangu is in the form of Remote Procedure Calls (RPCs). Each ChunkServer initiates the RPC clients/servers, and storage operations are performed by issuing pre-registered RPCs. An RPC client can simultaneously use different RPC channels (i.e., connections via RDMA, kernel TCP, user-space TCP, or shared memory) according to the required RPCs.

Cloud Storage Requires RDMA. The principal performance metrics for storage services are read/write throughput and access latency. Low latency and high throughput prove to be advantageous for numerous application scenarios. Many customers expect the performance of cloud storage to be similar to that of local physical storage. For example, the Alibaba e-commerce database requires extremely low latency in order to ensure fast responses due to the potentially large peak number of transactions per second (e.g., 544,000 orders per second at peak hours [8]). Moreover, the enhanced SSD service promises 1 million IOPS, 4GB/s throughput, and 200µs latency for 4KB random writes [17].

The latency of the traditional network stack (e.g., TCP/IP) is generally within hundreds of microseconds [13]. The maximum achievable TCP bandwidth per kernel thread can reach tens of Gbps [51]. In contrast, the access latency of current NVMe SSDs is only at the microsecond level, while the read/write bandwidth of a single device is at the GB/s level [49]. The total throughput of each storage node (generally with 8-16 NVMe disks) can exceed 100Gbps, and the incoming Storage Class Memory (SCM, e.g., Intel 3D-XPoint) can even achieve nanosecond-level latency [35]. Thus, networking is currently the primary performance bottleneck for cloud storage.

RDMA is an alternative networking choice for cloud storage. By implementing its entire protocol stack on host NICs, RDMA provides both microsecond-level access latency and a per-connection throughput close to 100Gbps with almost no CPU consumption [23]. RDMA has successfully been integrated into numerous network-bottlenecked systems, for example, key-value stores [22, 33], distributed transactions [6, 24, 48], and graph queries [40], demonstrating improved performance compared with non-RDMA predecessors.

2.2 Challenges

Besides performance, availability and SLA are also critical for a successful cloud storage system.

Availability. System disruptions incur significant financial/reputation loss for both tenants and their cloud providers. In 2018, Amazon S3 experienced a system disruption that lasted for 4 hours [2], affecting Amazon Elastic Compute Cloud, Amazon Elastic Block Store volumes, and AWS Lambda [3]. This disruption also had an impact on tens of thousands of websites built on the Amazon storage service, including Netflix [34], Spotify [43], Pinterest [37], and Buzzfeed [5]. Similar events have occurred with Google Cloud and Microsoft Azure [4, 9].

Service-Level Agreement. Software and hardware failures are extremely common in distributed systems. A cloud storage system should exhibit graceful performance downgrade in the presence of various failures. Distributed storage systems include mature node monitoring and fail-over mechanisms. A single storage node failure has a minimal impact on the service quality. In our experience, the most challenging aspect of ensuring stable performance lies in the storage networks. Network failures generally result in a larger affected range compared to storage node failures.

In addition to its superior performance, customers of our RDMA-enabled Pangu require the same levels of availability and SLA standards as those of traditional TCP-backed versions.

2.3 State-of-the-Art Work Does Not Fit

Unknown PFC Storm Sources. PFC runs under a hop-by-hop mechanism, with the possibility of PFC storms spreading into the whole cluster. A PFC storm can seriously affect cluster availability and is the most well-known issue of RDMA. In 2016, Microsoft presented its experience in the deployment of RDMA [13], revealing that a bug in the receiving pipeline of RDMA-capable NICs (RNICs) causes PFC storms. The problem was fixed by building watchdogs on the NICs and switches. However, we identified an additional type of PFC storm that originates from switches, implying the complexity of PFC storms with multifarious sources. The Microsoft solution [13] fails to solve this new problem (§5).

Practical Concerns that Limit Design Options. We are not able to simply treat RDMA as a black box and wait for future research and technical advances to solve the current problems. Despite the large number of recent studies [10, 22, 29, 33, 40, 48], a production-level comprehensive PFC-free solution is still premature. The application of RDMA over lossy Ethernet has been explored in previous work [7, 11, 15, 26], allowing for the bypass of the PFC mechanism. However, such solutions rely on new hardware features.

The deployment of new hardware is a long process, with several months or even years of testing, followed by the subsequent introduction to business applications. For example, the process of testing Pangu with CX-4 RNICs, a joint collaboration with NIC providers, lasted for over two years. There is a tension between the fast growth of new RDMA demands and the long update cycles of new hardware. To date, these PFC-free proposals are not mature enough for large-scale business deployment, particularly for the availability and SLA standard requirements of cloud storage systems. Furthermore, large-scale industry deployment is generally associated with multiple generations of legacy RDMA NICs. For example, we have already deployed several Mellanox NIC generations (e.g., CX-4, CX-5), with the number of each reaching tens of thousands. It is operationally infeasible and costly to replace all legacy NICs in the running nodes, …
Total bandwidth | TCP bandwidth ratio | TX pauses
25Gbps          | 40%                 | 0
30Gbps          | 45%                 | 1Kpps
[Figure 3: RDMA/TCP hybrid deployment tests at different ratios (from 0% to 100% TCP). (a) Throughput of RDMA/TCP/BlockServer. (b) Latency of BlockServer requests. (c) Average TX pause duration.]
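The traffic mix driving the Figure 3 runs is described in §3.3 below; as a quick reference, the sketch here reproduces that schedule (10% more TCP every 5 minutes after a 100%-RDMA start, then 0%/100%/0% TCP at minutes 60/65/70). The function is purely illustrative, and the exact alignment of the 10% steps to minute marks is an assumption.

    def tcp_ratio(minute: int) -> float:
        # Offered TCP share of the Fig. 3 workload (illustrative reconstruction
        # of the schedule in Section 3.3; exact step alignment is assumed).
        if minute < 10:
            return 0.0                   # workload starts at minute 10, 100% RDMA
        if minute < 60:
            return min((minute - 10) // 5 * 0.10, 1.0)   # +10% TCP every 5 minutes
        if minute < 65:
            return 0.0                   # minute 60: switch back to pure RDMA
        if minute < 70:
            return 1.0                   # minute 65: pure TCP
        return 0.0                       # minute 70: pure RDMA again

    print([tcp_ratio(m) for m in (20, 40, 62, 67, 75)])   # [0.2, 0.6, 0.0, 1.0, 0.0]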
…performance and may result in PFC storms. We investigated this problem together with Mellanox and determined that the processing of TCP in the Linux kernel is highly I/O-intensive. Kernel TCP initiates too many partial writes on the NICs' PCIe bus. As the PCIe bandwidth is consumed, the receiving pipeline of a NIC is slowed down. The buffer overflows and the NIC subsequently begins to transmit PFC pause frames.

In order to optimize the memory access of TCP, we make several adjustments to the data access procedure. First, disabling Large Receive Offload (LRO) can reduce the memory bandwidth usage. This is attributed to the access of multiple cache lines when LRO is enabled. Furthermore, enabling NUMA also improves the efficiency of memory accesses, which subsequently aids in relieving the pressure on PCIe. We also allocate a larger buffer on the RNICs for RDMA traffic to prevent TX pauses. Finally, making application data cacheline-aligned is a common optimization practice that improves memory efficiency [23].

3.3 Evaluation

We test several RDMA/TCP traffic ratios to investigate the effects of RDMA/TCP hybrid deployment. Each computing node runs FIO with 8 depths (inflight I/O requests), 8 jobs (working threads), and a 16 KB block size in order to write virtual disks. Note that one write request on a BlockServer generates three data replicas. We enable all optimization approaches detailed in §3.2 for the TCP kernel.

Fig. 3(a) depicts the BlockServer bandwidth with varying RDMA/TCP ratios. The workload starts at 10 minutes with 100% RDMA traffic. Afterwards, every 5 minutes, the workload contains 10% more TCP traffic and 10% less RDMA traffic. At 60, 65, and 70 minutes we change the TCP traffic ratio to 0%, 100%, and 0%, respectively, in order to explore the performance of Pangu with quick traffic switching between RDMA and TCP. The average BlockServer throughput exhibits minimal reduction as the RDMA traffic ratio decreases.

Fig. 3(b) presents the BlockServers' average request latency for the same workload as that in Fig. 3(a). The average latency under 100% RDMA traffic is approximately half of the latency under 100% TCP traffic, while the tail latency under 100% TCP is more than 10× larger compared to 100% RDMA traffic. RDMA presents great latency advantages compared to TCP. Fig. 3(c) demonstrates the average TX pause duration per second for this workload. Only a limited number of TX pauses are observed. When the TCP bandwidth ratio is around 50% at 30 minutes, the pause duration reaches a peak value. Overall, these results demonstrate the stable performance of our RDMA/TCP hybrid mechanism.

4 Performance Optimization

4.1 Performance Hurdles

The performance optimization of Pangu aims to minimize latency while maximizing throughput.

RDMA-Storage Co-Design. Integrating the RDMA protocol stack with the storage backend is challenging. It must cover key performance points such as thread modeling, memory management, and data serialization. The thread model directly affects latency due to communication costs among threads. Well-designed memory management and data serialization are key to achieving zero-copy during data access. Here we present a brief introduction to the design of these components for storage purposes.

The User Space Storage Operating System (USSOS) is a unified user-space storage software platform that aims to support new storage media such as NVMe SSD and persistent memory. Its design principles (e.g., memory management, shared memory mechanism, and user-space drivers) are based on well-known user-space technologies (e.g., DPDK [19] and SPDK [42]). Related tests reveal that enabling USSOS in Pangu can improve CPU efficiency by more than 5× on average.

As a central part of USSOS, the User Space Storage File System (USSFS) is a high-performance local file system designed for SSDs. By running in the user space, USSFS is able to bypass the kernel to avoid user-/kernel-space-crossing overhead. USSOS divides disks into "chunks" which the ChunkServer uses via its APIs (e.g., create, seal, and delete).
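The chunk abstraction just mentioned (create, seal, delete) can be pictured with a small sketch. The interface below is a hypothetical stand-in written only for illustration; USSFS's real API, on-disk layout, and the `append` helper are assumptions, not Pangu's actual code.

    # Illustrative stand-in for the USSFS chunk interface described above
    # (create, append-then-seal, delete); not Pangu's actual API.
    class Chunk:
        def __init__(self, chunk_id: int):
            self.chunk_id = chunk_id
            self.data = bytearray()
            self.sealed = False

    class UssfsDisk:
        """A disk divided into chunks that a ChunkServer manages."""
        def __init__(self):
            self._chunks = {}
            self._next_id = 0

        def create(self) -> Chunk:
            chunk = Chunk(self._next_id)
            self._chunks[self._next_id] = chunk
            self._next_id += 1
            return chunk

        def append(self, chunk_id: int, data: bytes) -> None:
            chunk = self._chunks[chunk_id]
            if chunk.sealed:
                raise ValueError("cannot append to a sealed chunk")
            chunk.data.extend(data)

        def seal(self, chunk_id: int) -> None:
            self._chunks[chunk_id].sealed = True   # chunk becomes immutable

        def delete(self, chunk_id: int) -> None:
            del self._chunks[chunk_id]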
Components                                | Average Utilization | Peak Utilization    | Maximum Physical Capacity
Physical CPU utilization ratio            | 66%                 | 70%                 | 100%
Memory read/write throughput              | 28GB/s / 29GB/s     | 33GB/s / 32GB/s     | 61GB/s in total (1:1 read/write)
SSD PCIe throughput (socket 0 + socket 1) | 550MB/s + 550MB/s   | 1000MB/s + 1000MB/s | 3.938GB/s + 3.938GB/s
Network PCIe RX throughput                | 10GB/s              | 11GB/s              | 15.754GB/s
Network PCIe TX throughput                | 8GB/s               | 9GB/s               | 15.754GB/s

Table 3: Measured resource utilization of Pangu in a 100Gbps network with a 1:1 read/write ratio.
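Table 3 is the basis for the claim (made in the introduction and in this section) that memory bandwidth is the tightest resource at 100Gbps. The short calculation below only restates the table's average-case numbers; the helper itself is illustrative.

    # Back-of-the-envelope check on Table 3: average memory traffic vs. capacity.
    avg_read_gbps, avg_write_gbps = 28.0, 29.0    # GB/s, from Table 3 (average)
    capacity_gbps = 61.0                          # GB/s, total 1:1 read/write capacity

    utilization = (avg_read_gbps + avg_write_gbps) / capacity_gbps
    print(f"average memory-bandwidth utilization = {utilization:.0%}")   # about 93%

At roughly 93% of the measured capacity on average, there is little headroom left, whereas CPU (66%) and the network and SSD PCIe links sit well below their limits; this is consistent with memory bandwidth being the first resource to saturate.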
[Figure 4: Potential triggering of data copying by CRC; the data layout shows 4048B of data, a 4B CRC, and a 44B gap per 4KB unit.]

[Figure (caption not included in this excerpt): USSOS I/O path across user space (driver, I/O buffer), kernel, and hardware (RNIC, NVMe SSD) with DMA steps.]

USSOS directly writes data and metadata to disks and uses …
[Figure (caption not included in this excerpt): thread-to-thread QP connections between client and server threads (Thread 0/1/2 on each side), contrasting a full mesh of per-thread QPs with a mode that shares fewer connections.]
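As sketched in the figure above and discussed under "Shared Link" in the next section, the number of QPs per node pair depends on how threads share connections. The counting helpers below illustrate the argument only; the exact grouping policy in Pangu is not spelled out here, so the shared-group formula (one QP per group per destination node) is an assumption.

    def qps_full_mesh(client_threads: int, server_threads: int) -> int:
        """Every client thread keeps a private QP to every server thread."""
        return client_threads * server_threads

    def qps_shared_groups(client_threads: int, groups: int) -> int:
        """Assumed shared-link mode: threads are split into groups and each
        group shares a single QP to the destination node."""
        return min(groups, client_threads)

    # With 8 I/O threads per side, a full mesh needs 64 QPs per node pair,
    # while "All Shared" (1 group) needs 1 and "4 Shared Groups" needs 4.
    print(qps_full_mesh(8, 8), qps_shared_groups(8, groups=1), qps_shared_groups(8, groups=4))

Multiplying by the number of peer nodes in a cluster of hundreds of machines explains why the full-mesh mode inflates the per-NIC QP count, and why Figure 8 shows fewer PFC pauses as the number of shared groups shrinks.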
[Figure 7: Performance of UMR zero copy + CRC offloading. (a) FIO with 8 jobs, 8 depth; (b) FIO with 16 jobs, 64 depth. Y-axes: memory bandwidth (GB/s) and throughput (GB/s); bars compare "No Optimization" against "UMR + CRC offload" for 32KB, 64KB, and 128KB blocks.]

…the improvement in throughput following the optimization. The throughput of a single ChunkServer thread is improved by approximately 200% for a block size of 128KB.

Shared Link. We tested the shared link mode with several shared QP groups in a cluster of 198 computing nodes and 93 storage nodes. The background workload comprises 4KB random writes with 8 threads and 8 I/O depths. Fig. 8(a) presents the throughput in the All Shared, 2 Shared Groups, and 4 Shared Groups modes, whereby a performance trade-off can be observed. The All Shared mode exhibits slightly lower throughput but generates the lowest number of PFC pauses. Note that the reduction in bandwidth at 5 and 24 minutes in the All Shared mode is attributed to the garbage collection mechanism in Pangu. Fig. 8(b) presents the TX pause duration with 1, 2, and 4 Shared Groups, respectively. The lower the number of groups, the fewer PFC pauses are generated due to the reduction in QP number. We use the All Shared Group mode in our scale and configuration framework.

5 Availability Guarantee

5.1 PFC Storms

A New Type of PFC Storm. The PFC storm previously discussed in [13] originates from the NICs, with a bug in the receiving pipeline acting as the root cause. Fig. 9(a) depicts the phases of this PFC storm: (1) the bug slows down the NIC receive processing, filling its buffer; (2) the NIC transmits PFC pauses to its ToR switch in order to prevent packet drops; (3) the ToR switch pauses the transmission; (4) the ToR switch's buffer becomes full and starts to transmit PFC pauses; and (5) the victim ports are paused and are unable to transmit.

We encountered a different type of PFC storm when operating RDMA in Pangu. The root cause is a bug in the switch hardware of a specific vendor. The bug reduces the switching rate of the lossless priority queues to a very low rate. Fig. 9 compares the two types of PFC storms. As an example, we assume that the bug occurs in a down port of a ToR switch: (1) due to the low transmitting rate, the switch's buffer becomes full; (2) the switch transmits PFC pauses to the connected ports; and (3) the additional switches and NICs stop their transmissions. The leaf switches and NICs connected to this ToR switch receive continuous pause frames, and thus the storm spreads.

State-of-the-Art Solutions. Guo et al. [13] built a NIC-based watchdog to continuously monitor transmitted pause frames, disabling the PFC mechanism if necessary. In addition, watchdogs were also deployed on the switches for disabling the PFC mechanism when switches receive continuous pause frames and are unable to drain the queuing packets. The switches can subsequently re-enable PFC in the absence of pause frames over a specific period of time. Thus, PFC storms can be controlled via these two watchdogs during phase (2).

However, this solution is unable to completely solve the PFC storms originating from switches. In particular, the TX pause watchdogs on the NICs will not work since the NICs only receive PFC storms from the switches. Furthermore, current switch hardware does not support the monitoring of pause frame transmissions. If a storm occurs on a ToR switch, even though the watchdogs on other switches are able to stop its spread, the ToR switch will continue to send pauses to end-hosts in the rack. The RDMA traffic via this ToR is consequently blocked.

Challenges. This new type of PFC storm invalidates Guo et al.'s solution, which focuses on insulating the PFC pause sources to prevent the storm from spreading. This methodology fails when the source is a ToR switch, as all the hosts in the ToR are paused by the storm. Therefore, in order to achieve high availability, a general solution is required in Pangu to handle all PFC storm types, particularly those with unknown causes.

Ensuring the service quality of Pangu while simultaneously solving PFC storms is challenging. PFC storm detection must be timely and accurate to rapidly protect network traffic. In terms of availability, the overall convergence time of the PFC storm should be controlled to at most the minute level.

5.2 Design

Our design principle for handling PFC storms is escaping as fast as possible. Despite new PFC storm solutions [11, 21, 26], we still resort to engineering-level work-arounds due to practical considerations (§2.3).

In Pangu, each NIC monitors the received PFC pause frames. For continuous pause frames, the NIC determines the presence of a PFC storm. Two work-around solutions are available for administrators in the case of a PFC storm.

Workaround 1: Shutdown. This solution, denoted as the "shutdown" solution, shuts down NIC ports affected by PFC storms for several seconds. The dual-home topology provides an emergency escape for PFC storms, whereby QPs will disconnect and connect again via another port. This method works together with the optimization to reduce the length of the QP timeout. This optimization is discussed further in §6.2. Although this solution is simple and effective, it is sub-optimal due to the loss of half of the bandwidth.
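A compact way to view the §5.2 design is as a per-port state machine: count received pause frames, declare a storm when they are continuous, then apply one of the two work-arounds (shut the port down so QPs reconnect over the other dual-home port, or move the affected links from RDMA to TCP). The thresholds and helper calls below are illustrative placeholders, not Pangu's actual parameters or interfaces.

    # Illustrative thresholds; Pangu's real detection parameters are not given here.
    PAUSE_RATE_THRESHOLD = 1000   # pause frames per second considered "continuous"
    STORM_SECONDS = 3             # sustained seconds before declaring a storm

    class PortMonitor:
        def __init__(self, port):
            self.port = port
            self.stormy_since = None

        def on_sample(self, pause_frames_per_sec: float, now: float) -> bool:
            """Return True once a PFC storm is declared on this port."""
            if pause_frames_per_sec >= PAUSE_RATE_THRESHOLD:
                if self.stormy_since is None:
                    self.stormy_since = now
                return now - self.stormy_since >= STORM_SECONDS
            self.stormy_since = None
            return False

    def escape(port, mode: str) -> None:
        """Apply one of the two work-arounds described in Section 5.2
        (port methods are hypothetical stand-ins)."""
        if mode == "shutdown":
            port.shut_down(seconds=5)        # QPs reconnect via the other dual-home port
        elif mode == "switch_to_tcp":
            port.move_traffic_to_tcp()       # fine-grained RDMA -> TCP switching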
[Figure 8: Throughput and pause for different types of thread groups. (a) Throughput in different thread groups. (b) Pause time in different thread groups. Curves: All Shared (1 Shared Group), 2 Shared Groups, and 4 Shared Groups.]
[Figure 9 (caption not included in this excerpt): the two types of PFC storms compared, showing hosts and switches with TX/RX ports, PFC pause frames, paused links, and healthy links, with the numbered phases described in §5.1.]
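The hop-by-hop nature of PFC is what lets a single misbehaving ToR port freeze a whole rack (Figure 9). The toy propagation model below illustrates that worst-case reasoning on a made-up three-level topology; it is not a simulation of any real cluster or of actual queue dynamics.

    # Toy model of hop-by-hop PFC back-pressure on a tiny made-up topology:
    # once a device keeps emitting pause frames, its neighbors stop transmitting,
    # their queues fill, and the pauses propagate further (worst case: whole subtree).
    links = [("host1", "tor1"), ("host2", "tor1"), ("tor1", "leaf1"), ("leaf1", "spine1")]
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    def paused_by_storm(source: str) -> set:
        paused = {source}
        frontier = [source]
        while frontier:
            device = frontier.pop()
            for peer in neighbors[device]:
                if peer not in paused:
                    paused.add(peer)       # peer receives continuous pauses and stops sending
                    frontier.append(peer)  # ...which back-pressures the peer's own neighbors
        return paused - {source}

    print(sorted(paused_by_storm("tor1")))   # ['host1', 'host2', 'leaf1', 'spine1']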
[Figure 10: Performance of two different solutions for PFC storms. (a) Throughput of RDMA/TCP switching. (b) CNP of RDMA/TCP switching. (c) PFC pause of RDMA/TCP switching. (d) Throughput of shutdown. (e) CNP of shutdown. (f) PFC pause of shutdown. Curves distinguish in-ToR vs. out-ToR TCP and RDMA throughput, CNPs sent/handled, and TX/RX pauses.]
The RDMA links are then disconnected due to timeout. The QPs are separately distributed over the server's two NIC ports, so the RDMA links may need several attempts to reconnect successfully. Note that although the pause storm in the ToR is not terminated, it will not spread further, as the neighboring switch ports are transformed into the lossy mode via the RX pause watchdogs. The traffic throughput is not impacted during the migration to TCP, and I/O hangs are not present.

6 SLA Maintenance

6.1 SLA in Pangu

It is commonly known that network failures are hard to locate and repair. Network failure causes include mis-configuration, hardware/software bugs, and link errors. For example, a mis-configuration of the switch Access Control List (ACL) may only affect a specific flow while other flows behave normally [28, 45]. As a comparison, malfunctions occurring at storage devices or nodes can generally be easily located.

Sometimes network failures may not be explicit. Ideally, when a node breaks down, the heartbeat mechanism should ensure that the unavailability of the service daemons (BlockServers and ChunkServers) on the node is reported to their masters (BlockMasters and PanguMasters). However, real situations can be more complicated, failing to be detected with just heartbeats. Connections may suffer intermittent packet loss or long latency rather than simple breakdowns. We also identified an interesting failure type involving a small number of links that flap between up and down states for a short period of time (e.g., several seconds). This results in an extremely high tail latency for I/O requests, denoted as slow I/O (e.g., over 1 second for storage clients). Hundreds of slow I/Os are observed daily for numerous reasons. Root causes of link flapping include optical ports covered with dust, loose physical interfaces, aging hardware, etc.

Previous Research on Network Failures. The majority of previous studies focus on determining the location of network failures (e.g., Pingmesh [14] and 007 [1]). These solutions focus on the system network and can achieve the timely discovery of network errors. However, it may still take hours for engineers to manually check, fix, and replace the failed switches and links. Cloud storage calls for a methodology that integrates the storage and network function modules to ensure stable service quality in failure cases.

6.2 Design

Our SLA design principle aims to exploit storage semantics whenever useful. Distributed storage is designed with a redundancy mechanism, and its performance is measured via storage semantics. These semantics, such as data replicas and distributed placements, can be leveraged during a failure to improve system performance.

Network-Integrated Monitoring Service. Monitoring is a necessary component of distributed systems. A comprehensive monitoring system is key for the safe deployment of RDMA in production storage systems, as well as reliable performance.
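The network-integrated monitoring service described in this subsection boils down to collecting both storage-side indexes (tail latency, slow I/O) and NIC/network counters (CNPs, TX/RX pauses) and alarming when thresholds are crossed. The sketch below only shows that shape; the counter names and limits are invented for illustration and are not Pangu's real configuration.

    # Illustrative shape of the threshold-based alarming described in Section 6.2.
    THRESHOLDS = {
        "tx_pause_us_per_sec": 1000,     # NIC pause counters
        "rx_pause_us_per_sec": 1000,
        "cnp_handled_per_sec": 100000,   # congestion-control counters
        "slow_io_per_min": 50,           # storage-side index (e.g., I/O slower than 1 s)
        "p999_latency_us": 5000,
    }

    def check_node(node: str, counters: dict) -> list:
        """Compare one node's collected counters against thresholds; return alarms."""
        alarms = []
        for name, limit in THRESHOLDS.items():
            value = counters.get(name, 0)
            if value > limit:
                alarms.append(f"{node}: {name}={value} exceeds {limit}")
        return alarms

    # Example: a link-flapping node often shows up first in slow-I/O counts,
    # even when the pure network counters still look normal.
    print(check_node("storage-17", {"slow_io_per_min": 240, "tx_pause_us_per_sec": 12}))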
Both configurations and counters must be monitored. NIC configurations include system environments, PFC/ECN/QoS configurations, and several link configurations. Pangu automatically checks, collects, and reports suspicious terms. Inspecting potential mis-configurations can reduce configuration and environment errors. For example, accidental reboots and software upgrades may reset the QoS, PFC, and DCQCN [52] configurations and affect system performance. Monitoring can discover such cases and help fix them in advance.

Counters include storage system performance counters (e.g., IOPS/latency) and network counters (e.g., CNPs sent/handled, TX/RX pauses, and RDMA errors on NICs). Congestion control counters that exceed thresholds can result in monitor daemons sending alarms to engineers for diagnosis and repair. Monitoring both storage and network indexes is crucial for the diagnosis of errors and for predictable performance. Storage indexes such as tail latency, slow I/O, and queuing time can directly reflect the status of a system. Moreover, monitoring system performance also helps locate errors. For example, network features are unable to quickly reflect the flapping problem described in §6.1. However, this problem can be easily located by monitoring slow I/Os on the endpoints of storage applications.

Faster Timeout for Connection Failures. The basic solution to network failures is to reconnect through an alternative path. Since we use a dual-home topology, each single-point failure of the network can be bypassed using a different path. Thus, the timeout duration of the QPs is crucial in improving the system performance during a failure. In the NIC manual, the timeout duration of QPs is calculated as 4µs × 2^timeout × 2^retry_cnt, where timeout and retry_cnt denote the retransmission timeout value and the number of retransmission retries, respectively. Initially, this value was a constant (approximately 16 seconds) configured in the hardware and could not be changed. In a combined effort with the NIC providers, we were able to fix this bug. By using a smaller timeout value for QPs, the action time required for reconnecting during network failures was reduced by 4×.

An alternative work-around involves altering the connection path by modifying the source ports of the QPs (rather than a direct reconnection). This can accelerate the link recovery during a fail-over. However, effectively changing the QP source port requires a more recent NIC firmware (MLNX OFED 4.7 or newer) than what is currently deployed in Pangu. We leave this challenge to future work.

Blacklist. We adopt a blacklist in Pangu to further improve the service quality in fail-over cases. BlockMasters collect information on I/O errors (including timeouts) and slow I/Os from clients. If a BlockServer has a large number of slow/error I/Os from multiple clients, the masters add it to the blacklist for a short period of time. The number of clients and slow/error I/Os that triggers the blacklist is configured according to the scale of the cluster. In order to ensure reliability, the maximum number of BlockServers in the blacklist is usually small (e.g., 3 hosts). This blacklist mechanism temporarily isolates the BlockServers that provide a poor service. The system performance is not affected, and engineers have sufficient time to fix the problems.

6.3 Daily Operation Scheme of Pangu

The daily operations of Pangu rely on these modules working together. The monitoring system collects and reflects the status of Pangu. If abnormal network indicators or I/O metrics are observed, the monitoring system attempts to locate and report them to the administrators. For accidental failures such as link errors, the small QP timeout shortens the time required for failure recovery. The blacklist mechanism is able to determine and isolate nodes with poor service quality. By following these design and operator framework specifications, our RDMA-enabled Pangu system has not experienced any major faults in the last two years.

7 Experiences and Future Work

Monitoring NACK in Lossless RDMA. The operation of RDMA over a lossless fabric is difficult due to PFC risks. However, the lossless fabric increases the effectiveness of NACK events as indicators of the network error location, since NACKs are usually rare in a lossless fabric.

In order to detect and locate network problems, we build a subsystem based on packet loss in Pangu. In particular, Out-Of-Sequence (OOS) counters on RNICs and packet drop counters on switches are gathered. A packet loss is classified as either explicit or implicit based on whether it is recorded by switch counters. The former is easy to locate by checking the switch counters. However, determining the location of the latter is complex, as RNIC counters do not distinguish between flows. By monitoring NACKs in the network, we can extract flows' five-tuples and locate the root of a problem.

Building a System-Level Benchmark. To evaluate the system network performance and SLA, a representative benchmark must be constructed. Building a benchmark based on just the network metrics is simple. However, storage features such as replica completion time, slow I/O, and failure recovery should not be ignored. To measure the storage system performance and the SLA, we build a benchmark at the storage service level. The system evaluation indexes include FIO latency, IOPS, SLA with network errors, etc. Each upgrade in Pangu (for the network and other components) is evaluated with the benchmark, allowing us to measure the overall system performance.

Congestion Control for Fail-Over Scenarios. In §3, we introduced the dual-home topology adopted in Pangu. The dual-home topology is also crucial in fail-over cases, since it provides a backup path on the NICs. However, we encounter a problem when deploying the dual-home topology in practice.
According to our tests, when one ToR switch is down, the RX pause duration can increase to 200ms per second. This is due to the transfer of the traffic from the broken ToR switch to the other switch. DCQCN handles this traffic burst poorly under the resulting asymmetric topology. We adapt several DCQCN parameters as a temporary solution and leverage the fluid models [53] to analyze the available choices, including canceling the Fast Recovery stage and extending the rate-increase duration. When removing the Fast Recovery stage in DCQCN, the pauses can be eliminated, yet the flow tail latency increases due to the slow recovery of the flow rate. In contrast, extending the duration of the rate increase can result in a sharp reduction in pauses while only slightly increasing the flow tail latency. In our experience, extending the rate-increase duration in DCQCN is effective for storage traffic patterns. The bandwidth drops slightly while the number of RX pauses is dramatically reduced.

This problem of DCQCN in fail-over scenarios (e.g., asymmetric topology and traffic bursts) indicates their important role when designing congestion control. We adopt parameter tuning to fix this problem at the price of a slight performance loss. The storage network still requires a flexible, robust, and well-implemented congestion control mechanism that functions well in all scenarios. In 2019, Alibaba designed HPCC [30], a novel congestion control algorithm for RDMA networks. Adapting and integrating HPCC with the storage networks is left for future work.

Slow Processing of RDMA Read. The majority of large RPCs in Pangu are transferred via RDMA READ. We observed that when a NIC receives too many RDMA requests within a short period, it will send out many PFC frames. This is due to the slowed receiving process that results from cache misses. When a NIC is preparing for an RDMA READ, it accesses the QP context in its cache. Processing many RDMA READs consumes an excessive amount of cache resources. For slow RX rates, the NIC sends out PFC pause frames to prevent packet drops. We are currently working with the RNIC provider to solve this problem.

Lossy RDMA in Storage. Lossy RDMA is supported by Mellanox CX-5 and CX-6 RNICs. Note that CX-6 supports Selective Repeat (SR) retransmission. SR might be the ultimate step required to effectively eliminate PFC. The construction of lossy RDMA is a focal point for all RDMA-based systems. We tested lossy RDMA with Pangu over an extensive period and will deploy it for new clusters. However, enabling the lossy feature with early-generation RNICs (e.g., CX-4) that have limited hardware resources and do not support SR is hard, and many production RDMA-based systems still host early-generation RNICs.

NVMe-over-Fabrics. The ChunkServer data flow in Pangu is processed by CPUs. However, with NVMe-over-Fabrics, NICs can directly write the received data into NVMe SSDs. This CPU-bypass solution can save CPU costs and reduce latency. We are currently building our specialized storage protocol (and corresponding hardware) based on NVMe-over-Fabrics. A customized storage protocol for Pangu with hardware support can allow for more flexibility and control.

8 Related Work

PFC in RDMA. The PFC storm is the most well-known issue of RDMA. Several studies [11, 30, 52] focus on controlling network congestion to reduce the number of generated PFC pauses. DCQCN [52] is integrated in Mellanox RNICs. In Pangu, we tune several parameters in DCQCN to improve its performance in fail-over scenarios. However, PFC storms still occur due to hardware bugs [13]. In this paper, we present a different hardware bug that originates from switches. Existing solutions to remedy PFC storms include deadlock elimination [38] and performance optimization [46]. These solutions require switch modifications. In Pangu, we combat PFC storms by switching affected links from RDMA to TCP without the need for any switch changes.

System & Network Co-Design. Recently, there has been an increasing amount of work that adopts system and network co-design, including RPC systems [21, 47], distributed memory systems [39], key-value stores [22], distributed databases and transaction processing systems [6], and graph-processing systems [40]. We co-design our storage system and RDMA in Pangu. To the best of our knowledge, we are the first to share the experience of employing RDMA networks in large-scale distributed storage systems.

9 Conclusions

As a distributed storage system, Pangu has provided storage services to tenants both inside and outside of Alibaba for over a decade. In order to overcome the challenges of rising high-speed storage media and growing business requirements, we integrate RDMA into the storage network of Pangu, providing a common solution to different types of PFC storms. This allows for the safe deployment of RDMA. Pangu has successfully moved to a 100Gbps network by solving several new problems, such as the memory bandwidth bottleneck and QP number explosion. Furthermore, we improve the system performance of Pangu in fail-over cases.

Acknowledgment

We are really grateful to Yiying Zhang for shepherding our paper. We also thank the anonymous reviewers for their constructive suggestions. Yixiao Gao and Chen Tian are supported in part by the National Key R&D Program of China 2018YFB1003505, the National Natural Science Foundation of China under Grant Numbers 61772265, 61802172, and 62072228, the Fundamental Research Funds for the Central Universities under Grant Number 14380073, the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Jiangsu Innovation and Entrepreneurship (Shuangchuang) Program.
References

[1] Behnaz Arzani, Selim Ciraci, Luiz Chamon, Yibo Zhu, Hongqiang (Harry) Liu, Jitu Padhye, Boon Thau Loo, and Geoff Outhred. 007: Democratically finding the cause of packet drops. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 419–435, Renton, WA, April 2018. USENIX Association.

[2] Amazon AWS. Summary of the amazon s3 service disruption in the northern virginia (us-east-1) region. https://www.usatoday.com/story/tech/news/2017/02/28/amazons-cloud-service-goes-down-sites-scramble/98530914/, 2020.

[3] Amazon AWS. Summary of the amazon s3 service disruption in the northern virginia (us-east-1) region. https://aws.amazon.com/cn/message/41926/, 2020.

[4] Microsoft Azure. Azure status history. https://status.azure.com/status/history/, 2020.

[5] Buzzfeed. Buzzfeed website. https://www.buzzfeed.com/, 2020.

[6] Yanzhe Chen, Xingda Wei, Jiaxin Shi, Rong Chen, and Haibo Chen. Fast and general distributed transactions using rdma and htm. In Proceedings of the Eleventh European Conference on Computer Systems, page 26. ACM, 2016.

[7] Inho Cho, Keon Jang, and Dongsu Han. Credit-scheduled delay-bounded congestion control for datacenters. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 239–252. ACM, 2017.

[8] Alibaba Cloud. How does cloud empower double 11 shopping festival. https://resource.alibabacloud.com/event/detail?id=1281, 2020.

[9] Google Cloud. Google cloud networking incident no.19005. https://status.cloud.google.com/incident/cloud-networking/19005, 2020.

[10] Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. Farm: Fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, pages 401–414, 2014.

[11] Yixiao Gao, Yuchen Yang, Tian Chen, Jiaqi Zheng, Bing Mao, and Guihai Chen. Dcqcn+: Taming large-scale incast congestion in rdma over ethernet networks. In 2018 IEEE 26th International Conference on Network Protocols (ICNP), pages 110–120. IEEE, 2018.

[12] Alibaba Group. Alibaba group website. https://www.alibabagroup.com/en/global/home, 1999-2020.

[13] Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. Rdma over commodity ethernet at scale. In Proceedings of the 2016 conference on ACM SIGCOMM 2016 Conference, pages 202–215. ACM, 2016.

[14] Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen, et al. Pingmesh: A large-scale system for data center network latency measurement and analysis. In ACM SIGCOMM Computer Communication Review, volume 45, pages 139–152. ACM, 2015.

[15] Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W Moore, Gianni Antichi, and Marcin Wójcik. Re-architecting datacenter networks and stacks for low latency and high performance. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 29–42. ACM, 2017.

[16] Jaehyun Hwang, Qizhe Cai, Ao Tang, and Rachit Agarwal. TCP ≈ RDMA: Cpu-efficient remote storage access with i10. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 127–140, Santa Clara, CA, February 2020. USENIX Association.

[17] Alibaba Inc. Block storage performance. https://www.alibabacloud.com/help/doc-detail/25382.html?spm=a2c5t.10695662.1996646101.searchclickresult.458e478fYtRYOO, 2018.

[18] Alibaba Inc. Pangu, the high performance distributed file system by alibaba cloud. https://www.alibabacloud.com/blog/pangu-the-high-performance-distributed-file-system-by-alibaba-cloud_594059, 2018.

[19] Intel. Data plane development kit. https://www.dpdk.org/, 2011.

[20] Intel. Intel memory latency checker. https://software.intel.com/content/www/us/en/develop/articles/intelr-memory-latency-checker.html, 2020.

[21] Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter rpcs can be general and fast. In USENIX NSDI, pages 1–16, 2019.

[22] Anuj Kalia, Michael Kaminsky, and David G Andersen. Using rdma efficiently for key-value services. In ACM SIGCOMM Computer Communication Review, volume 44, pages 295–306. ACM, 2014.
[23] Anuj Kalia, Michael Kaminsky, and David G Andersen. Design guidelines for high performance rdma systems. In 2016 USENIX Annual Technical Conference, page 437, 2016.

[24] Anuj Kalia, Michael Kaminsky, and David G Andersen. Fasst: Fast, scalable and simple distributed transactions with two-sided (rdma) datagram rpcs. In OSDI, volume 16, pages 185–201, 2016.

[25] Ana Klimovic, Heiner Litz, and Christos Kozyrakis. Reflex: Remote flash ≈ local flash. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, pages 345–359, New York, NY, USA, 2017. Association for Computing Machinery.

[26] Yanfang Le, Brent Stephens, Arjun Singhvi, Aditya Akella, and Michael M Swift. Rogue: Rdma over generic unconverged ethernet. In SoCC, pages 225–236, 2018.

[32] Mellanox. Mellanox rdma programming manual. https://www.mellanox.com/sites/default/files/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf, 2015.

[33] Christopher Mitchell, Yifeng Geng, and Jinyang Li. Using one-sided rdma reads to build a fast, cpu-efficient key-value store. In USENIX Annual Technical Conference, pages 103–114, 2013.

[34] Netflix. Netflix website. https://www.netflix.com/, 2020.

[35] J. Niu, J. Xu, and L. Xie. Hybrid storage systems: A survey of architectures and algorithms. IEEE Access, 6:13385–13406, 2018.

[36] Diego Ongaro and John Ousterhout. In search of an understandable consensus algorithm. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 305–319, Philadelphia, PA, June 2014. USENIX Association.
[44] Brent Stephens, Alan L Cox, Ankit Singla, John Carter, Colin Dixon, and Wesley Felter. Practical dcb for improved data center networks. In IEEE INFOCOM 2014 - IEEE Conference on Computer Communications, pages 1824–1832. IEEE, 2014.

[…] … of hyperscale applications on nvme ssds. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '15, pages 473–474, New York, NY, USA, 2015. Association for Computing Machinery.