When Cloud Storage Meets RDMA

Yixiao Gao, Nanjing University and Alibaba Group; Qiang Li, Lingbo Tang,
Yongqing Xi, Pengcheng Zhang, Wenwen Peng, Bo Li, Yaohui Wu, Shaozong Liu,
Lei Yan, Fei Feng, Yan Zhuang, Fan Liu, Pan Liu, Xingkui Liu, Zhongjie Wu,
Junping Wu, and Zheng Cao, Alibaba Group; Chen Tian, Nanjing University;
Jinbo Wu, Jiaji Zhu, Haiyong Wang, Dennis Cai, and Jiesheng Wu, Alibaba Group

https://www.usenix.org/conference/nsdi21/presentation/gao

This paper is included in the Proceedings of the 18th USENIX Symposium on
Networked Systems Design and Implementation (NSDI ’21), April 12–14, 2021.
ISBN 978-1-939133-21-2

Abstract

A production-level cloud storage system must be high performing and readily available. It should also meet a Service-Level Agreement (SLA). The rapid advancement in storage media has left networking lagging behind, resulting in a major performance bottleneck for new cloud storage generations. Remote Direct Memory Access (RDMA) running on lossless fabrics can potentially overcome this bottleneck. In this paper, we present our experience in introducing RDMA into the storage networks of Pangu, a cloud storage system developed by Alibaba. Since its introduction in 2009, it has proven to be crucial for Alibaba’s core businesses. In addition to the performance, availability, and SLA requirements, the deployment planning of Pangu at the production scale should consider storage volume and hardware costs. We present an RDMA-enabled Pangu system that exhibits superior performance, with the availability and SLA standards matching those of traditional TCP-backed versions. RDMA-enabled Pangu has been demonstrated to successfully serve numerous online mission-critical services across four years, including several important shopping festivals.

1 Introduction

Alibaba Group [12] is a China-based multinational technology company specializing in e-commerce, e-finance, and cloud computing. Numerous companies, including Alibaba, have moved their core business systems onto clouds. As a fundamental part of information technology (IT) infrastructure, a cloud storage provides a storage service to tenants both inside and outside the cloud provider. In 2009, Alibaba introduced Pangu [18], a cloud storage system that has subsequently played a crucial role in many Alibaba core businesses. As of 2020, Pangu has been deployed in hundreds of clusters, and it has been managing hundreds of thousands of storage nodes. Furthermore, it supports the real-time access to exabyte-level data in numerous production environments.

In order to ensure comparability to local physical storage clusters, a cloud storage system must meet the following requirements:

(i) High performance: Small latency and high throughput provide competitive advantages across many scenarios.

(ii) High availability: System disruptions incur significant financial/reputation loss for both tenants and their cloud providers.

(iii) Service-Level Agreement (SLA): A cloud storage system must be resilient, and thus its performance should gracefully downgrade when various software/hardware failures happen.

The rapid advancement in storage media has left networking lagging behind, resulting in a major performance bottleneck for new cloud storage generations. Networking is not a problem for traditional storage systems built with Hard Disk Drives (HDDs). However, the access latency of current Non-Volatile Memory Express (NVMe) disks is at the microsecond level [50] and the total throughput of a storage node can exceed 100Gbps. In contrast, the latency of traditional network stacks (e.g., TCP/IP) can reach milliseconds [13], while the bandwidth per kernel TCP thread is only tens of Gbps at most [51].

Remote Direct Memory Access (RDMA) running on lossless fabrics offers a promising solution to the network bottleneck in cloud storage. By implementing its entire protocol stack on host NICs, RDMA is able to provide both microsecond level access latency and a per-connection throughput of approximately 100Gbps with almost zero CPU consumption [23]. The application of RDMA over Commodity Ethernet (RoCE) in data centers relies on the Priority Flow Control (PFC) mechanism to provide a lossless fabric.

In this paper, we present our experience in introducing RDMA into Pangu’s storage networks (i.e., the network among storage nodes). Our objective is to provide an RDMA-enabled Pangu system that exhibits superior performance, with availability and SLA standards equal to those of traditional TCP-backed versions. Our experience spans 4 years and will continue with the development of RDMA. We faced
a number of challenges specifically related to cloud storage, with additional problems associated with RDMA. We have developed a number of solutions to allow RDMA to function in a production-level cloud storage, several of which are engineering-level work-arounds. However, overcoming the aforementioned RDMA issues proves to be a complicated task. Here, we expose the practical limitations of the production systems in order to facilitate innovative research and applications in this area.

In addition to the performance, availability, and SLA requirements, the deployment planning of Pangu at the production scale should consider storage volume and hardware costs. Following the availability-first principle, RDMA communication is enabled only inside each podset [13]. Such a podset contains a group of leaf switches, and all Top-of-Rack (ToR) switches connected to these leaf switches. The podsets are connected via spine switches. This setting is currently the optimal balance between application demands, performance, and availability/SLA control. Storage node configurations are carefully planned to match the disk throughput with the network bandwidth. We adopt the hybrid deployment of RDMA/TCP in Pangu to exploit TCP as the last resort for the system (§3).

The performance optimization aims to minimize latency while maximizing throughput. We leverage software-hardware co-design to minimize performance overhead. We build a software framework in Pangu that integrates RDMA with Pangu’s private user-space storage platform designed for new storage media. By eliminating data-copy operations, the latency of a typical block service request is reduced to tens of microseconds. We observed that the memory bandwidth becomes a bottleneck when upgrading Pangu to a 100Gbps network. By exploiting the RDMA features and offloading critical computations, Pangu is able to saturate the underlying networks. Furthermore, we leverage a new thread communication mode in Pangu to reduce the performance pitfall caused by a large number of Queue Pairs (QPs, the RDMA connection abstraction) per node (§4).

Previous studies have reported the risks of large-scale RDMA deployment [13]. RDMA-enabled Pangu clusters do encounter such problems, including PFC deadlocks [13], PFC pause frame storms, and head-of-line blocking [27, 44]. We determined several PFC storms to be attributed to a previously unexplored source that consequently invalidates an earlier solution [13]. In order to guarantee availability, we apply the escape-as-fast-as-possible design principle to handle PFC storms. We bring up a fine-grained switching mechanism between RDMA/TCP traffic in Pangu, and it handles PFC storms regardless of their causes (§5).

In order to meet the SLA standards, we adopt the design principle of exploiting storage semantics whenever useful in Pangu. By taking advantage of its ability to control the application layer, Pangu performs real-time checking and alarming for a large number of storage service and network metrics. With the help of the dual-home topology feature, we optimize the fail-over performance of Pangu by reducing the connection recovery time. We also fix network problems by exploiting application controls, for example, blacklisting problematically connected nodes (§6).

We share our experience in adopting the RDMA-enabled Pangu system and discuss several potential research directions (§7). This system has successfully served numerous online mission-critical services under the scope of Alibaba over the past four years, including several important shopping festivals (e.g., Double-11 [8]). Sharing our experience in integrating RDMA into Pangu can be helpful for other RDMA-enabled systems.

2 Background

2.1 Pangu in Alibaba Cloud

Pangu Framework. Pangu is a distributed file system developed by Alibaba Cloud. Released in 2009, it plays a major role in the core Alibaba businesses (e.g., e-business and online payment, cloud computing, enhanced solid state drive backed cloud disk, elastic compute service, MapReduce-like data processing, and distributed database). In this paper, we focus on the network features of Pangu.

[Figure 1: Pangu block storage service framework.]

Pangu provides numerous storage services, including elastic block service, object storage service, store service, etc. We take the block service as an example to demonstrate the system framework. Fig. 1 presents the I/O workflows of Pangu. Virtual block devices contain continuous address spaces that can be randomly accessed by applications. A Pangu client in a computing node organizes data into fixed-sized (e.g., 32 GB) segments, while the BlockServers and ChunkServers run on storage nodes. Each segment is assigned to a BlockServer for I/O processing. On the BlockServers, a segment is divided into blocks and replicated to the ChunkServers, which are in charge of the standalone back-end storage of the blocks and device management.

The BlockMasters manage metadata such as the mapping between a segment and its located BlockServer and the BlockServer’s living states. The PanguMasters manage the states of the ChunkServers. These master nodes are synchronized
using consistent protocols, such as Raft [36].

All data communication in Pangu is in the form of Remote Procedure Calls (RPCs). Each ChunkServer initiates the RPC clients/servers, and storage operations are performed by issuing pre-registered RPCs. An RPC client can simultaneously use different RPC channels (i.e., connections via RDMA, kernel TCP, user-space TCP, or shared memory) according to the required RPCs.

Cloud Storage Requires RDMA. The principal performance metrics for storage services are read/write throughput and access latency. Low latency and high throughput prove to be advantageous for numerous application scenarios. Many customers expect similar performance of the cloud storage to that of the local physical storage. For example, the Alibaba e-commerce database requires extremely low latency in order to ensure fast responses due to the potentially large peak number of transactions per second (e.g., 544,000 orders per second at peak hours [8]). Moreover, the enhanced SSD service promises 1 million IOPS, 4GB/s throughput, and 200µs latency for 4KB random writes [17].

The latency of traditional network stacks (e.g., TCP/IP) is generally within hundreds of microseconds [13]. The maximum achievable TCP bandwidth per kernel thread can reach tens of Gbps [51]. In contrast, the access latency of current NVMe SSDs is only at the microsecond level, while the read/write bandwidth of a single device is at the GB/s level [49]. The total throughput of each storage node (generally with 8-16 NVMe disks) can exceed 100Gbps, and the incoming Storage Class Memory (SCM, e.g., Intel 3D-XPoint) can even achieve nanosecond-level latency [35]. Thus, networking is currently the primary performance bottleneck for cloud storage.

RDMA is an alternative networking choice for cloud storage. By implementing its entire protocol stack on host NICs, RDMA provides both microsecond level access latency and a per-connection throughput close to 100Gbps with almost no CPU consumption [23]. RDMA has successfully been integrated into numerous network-bottlenecked systems, for example, key-value stores [22, 33], distributed transactions [6, 24, 48], and graph queries [40], demonstrating an improved performance compared with non-RDMA predecessors.

2.2 Challenges

Besides performance, availability and SLA are also critical for a successful cloud storage system.

Availability. System disruptions incur significant financial/reputation loss for both tenants and their cloud providers. In 2018, Amazon S3 experienced a system disruption that lasted for 4 hours [2], affecting Amazon Elastic Compute Cloud, Amazon Elastic Block Store volumes, and AWS Lambda [3]. This disruption also had an impact on tens of thousands of websites built on the Amazon storage service, including Netflix [34], Spotify [43], Pinterest [37], and Buzzfeed [5]. Similar events have occurred with Google Cloud and Microsoft Azure [4, 9].

Service-Level Agreement. Software and hardware failures are extremely common in distributed systems. A cloud storage system should exhibit graceful performance downgrade with the occurrence of various failures. Distributed storage systems include mature node monitoring and fail-over mechanisms. A single storage node failure has a minimal impact on the service quality. In our experience, the most challenging aspect of ensuring a stable performance lies in the storage networks. Network failures generally result in a larger affected range compared to storage node failures.

In addition to its superior performance, customers of our RDMA-enabled Pangu require the same levels of availability and SLA standards as those of traditional TCP-backed versions.

2.3 State-of-the-Art Work Does Not Fit

Unknown PFC Storm Sources. PFC runs under a hop-by-hop mechanism, with the possibility of PFC storms spreading into the whole cluster. A PFC storm can seriously affect cluster availability and is the most well-known issue of RDMA. In 2016, Microsoft presented its experience in the deployment of RDMA [13], where they revealed that a bug in the receiving pipeline of RDMA-capable NICs (RNICs) causes PFC storms. The problem was fixed by building watchdogs on the NICs and switches. However, we identified an additional type of PFC storm that originates from switches, implying the complexity of PFC storms with multifarious sources. The Microsoft solution [13] fails to solve this new problem (§5).

Practical Concerns that Limit Design Options. We are not able to simply treat RDMA as a black box and wait for future research and technical advances to solve the current problems. Despite the large number of recent studies [10, 22, 29, 33, 40, 48], a production-level comprehensive PFC-free solution is still premature. The application of RDMA over lossy Ethernet has been explored in previous work [7, 11, 15, 26], allowing for the bypass of the PFC mechanism. However, such solutions rely on new hardware features.

The deployment of new hardware is a long process, with several months or even years of testing, followed by the subsequent introduction to business applications. For example, the process of testing Pangu with CX-4 RNICs, a joint collaboration with NIC providers, lasted for over two years. There is a tension between the fast growth of new RDMA demands and the long update cycles of new hardware. To date, these PFC-free proposals are not mature enough for large-scale business deployment, particularly for the availability and SLA standard requirements of cloud storage systems. Furthermore, large-scale industry deployment is generally associated with multiple generations of legacy RDMA NICs. For example, we have already deployed several Mellanox NIC generations (e.g., CX-4, CX-5), with the number of each reaching tens of thousands. It is operationally infeasible and costly to replace all legacy NICs in the running nodes,
while upgrading the firmware of tens of thousands of running servers is both time-consuming and error-prone. Thus, the need for new hardware features or firmware should be minimized.

Domain Knowledge of Distributed Storage Should Be Exploited. Existing work largely ignores potential help from the application layer. Storage service metrics, rather than networking metrics, are a key concern for cloud service applications. We take into account such storage semantics in the design of Pangu when improving the engineering trade-off and the decision making process for various networking problems.

3 RDMA Deployment

3.1 Considerations in Deployment Planning

The deployment planning of storage clusters governs the network topology, RDMA communication scope, storage node configurations, etc. Multiple factors must be considered, including matching the storage volume with demands, controlling hardware costs, optimizing performance, and minimizing availability and SLA risks. The final outcome is a trade-off among all these factors.

For example, Microsoft deploys RDMA at the scale of an entire Clos network [13]. Thus, if not prevented, PFC storms could spread across the whole network and bring down an entire cluster. This amount of risk is unacceptable in a production-level storage system.

3.2 Deployment Choices of Pangu

The key principle employed by our RDMA deployment is availability-first.

Network and Node Configurations. Fig. 2 displays the Clos-based network topology of Pangu. Consistent with the common dual-home practice, we deploy Mellanox CX series dual-port RNICs to connect a host with two distinct ToR switches. In particular, two physical ports are bonded to a single IP address. Network connections (e.g., QPs in RDMA) are balanced over two ports following a round-robin fashion. When one port is down, the connections on this port can be migrated to another port.

[Figure 2: Topology of Pangu. Computing and storage nodes connect to ToR switches; ToR and leaf switches form a podset; podsets are connected via spine switches.]

Table 1 reports typical hardware configurations for 25Gbps and 100Gbps RNIC storage nodes. The number of SSDs per node is determined by the total RNIC bandwidth versus the throughput of a single SSD, allowing the I/O throughput to match the network bandwidth. Note that the SSD types in the 25Gbps and 100Gbps configurations are distinct, resulting in disproportional numbers. Computing and storage nodes are deployed in different racks within a single podset. The numbers of computing and storage nodes are then calculated according to the computational demands.

Hardware | 25Gbps                | 100Gbps
CPU      | Xeon 2.5GHz, 64 cores | Xeon 2.5GHz, 96 cores
Memory   | DDR4-2400, 128GB      | DDR4-2666, 128GB ×3
Storage  | 1.92TB SSD ×12        | 3.84TB SSD ×14
Network  | CX-4 Lx Dual-port     | CX-5 Dual-port
PCIe     | PCIe Gen 3.0          | PCIe Gen 3.0

Table 1: Example configurations of 25/100Gbps nodes.

RDMA Scope. In order to minimize the failure domain, we only enable RDMA communication within each podset and among storage nodes. The communication between computing and storage nodes is performed via a private user-space TCP protocol (Fig. 1). This is attributed to the complex hardware configurations of computing nodes, which update rapidly. Thus, TCP can be effectively applied as a hardware-independent transport protocol. User-space TCP is more convenient for upgrade and management compared to kernel TCP, while kernel TCP is selected for cross-podset communication due to its generality.

The production deployment is an additional concern for podset-level RDMA. In many datacenters, podsets are located in different buildings. For cross-building RDMA links, the base link delay is much larger, while the PFC mechanism requires a much larger headroom buffer. In order to enable RDMA, the PFC/ECN thresholds located on the spine switches must be carefully adapted and tested. This is a tough task and, at present, does not result in sufficient gains.

RDMA/TCP Hybrid Service. To the best of our knowledge, previous research on RDMA deployment does not explore RDMA and TCP hybrid services. We keep TCP as the last resort in Pangu following the availability-first principle. Despite current progress, RDMA devices are far from flawless. Thus, when either availability or SLA is threatened, switching affected links from RDMA to TCP can maintain the available bandwidth. This escape plan does not impact the unaffected RDMA links.

However, during the hybrid deployment process, we determined that coexistent TCP traffic provoked a large number of TX pauses (i.e., PFC pause frames sent by NICs), even if RDMA/TCP traffic are isolated in two priority queues. Table 2 reports the TX pause generation rate in Pangu under different loads with approximately 50% TCP traffic. The tests are performed on Mellanox CX-4 25Gbps dual-port RNICs.

Total bandwidth | TCP bandwidth ratio | TX pauses
25Gbps          | 40%                 | 0
30Gbps          | 45%                 | 1Kpps
32Gbps          | 50%                 | 8Kpps
35Gbps          | 46%                 | 15Kpps

Table 2: TX pauses in hybrid RDMA/TCP traffic.

Such a large number of TX pauses are detrimental to the
performance and may result in PFC storms. We investigated this problem together with Mellanox and determined that the processing of TCP in the Linux kernel is highly I/O-intensive. Kernel TCP initiates too many partial writes on NICs’ PCIe bus. As the PCIe bandwidth is consumed, the receiving pipeline of a NIC is slowed down. The buffer overflows and the NIC subsequently begins to transmit PFC pause frames.

In order to optimize the memory access of TCP, we make several adjustments to the data access procedure. First, disabling Large Receive Offload (LRO) can reduce the memory bandwidth usage. This is attributed to the access of multiple cache lines when LRO is enabled. Furthermore, enabling NUMA also improves the efficiency of memory accesses, which subsequently aids in relieving the pressure on PCIe. We also allocate a larger buffer on the RNICs for RDMA traffic to prevent TX pauses. Finally, making application data cacheline-aligned is a common optimization practice that improves memory efficiency [23].

3.3 Evaluation

We test several RDMA/TCP traffic ratios to investigate the effects of RDMA/TCP hybrid deployment. Each computing node runs FIO with 8 depths (inflight I/O requests), 8 jobs (working threads), and 16 KB block size in order to write virtual disks. Note that one write request on a BlockServer generates three data replicas. We enable all optimization approaches detailed in §3.2 for kernel TCP.

[Figure 3: RDMA/TCP hybrid deployment tests at different ratios (from 0% to 100% TCP). (a) Throughput of RDMA/TCP/BlockServer. (b) Latency of BlockServer requests. (c) Average TX pause duration.]

Fig. 3(a) depicts the BlockServer bandwidth with varying RDMA/TCP ratios. The workload starts at 10 minutes with 100% RDMA traffic. Afterwards, every 5 minutes, the workload contains 10% more TCP traffic and 10% less RDMA traffic. At 60, 65, and 70 minutes we change the TCP traffic ratio to 0%, 100%, and 0%, respectively, in order to explore the performance of Pangu with quick traffic switching between RDMA and TCP. The average BlockServer throughput exhibits minimal reduction as the RDMA traffic ratio decreases.

Fig. 3(b) presents the BlockServers’ average request latency for the same workload as that in Fig. 3(a). The average latency under 100% RDMA traffic is approximately half of the latency under 100% TCP traffic, while the tail latency under 100% TCP is more than 10× larger compared to 100% RDMA traffic. RDMA presents great latency advantages compared to TCP. Fig. 3(c) demonstrates the average TX pause duration per second for this workload. Only a limited number of TX pauses are observed. When the TCP bandwidth ratio is around 50% at 30 minutes, the pause duration reaches a peak value. Overall, these results demonstrate the stable performance of our RDMA/TCP hybrid mechanism.

4 Performance Optimization

4.1 Performance Hurdles

The performance optimization of Pangu aims to minimize latency while maximizing throughput.

RDMA-Storage Co-Design. Integrating the RDMA protocol stack with the storage backend is challenging. It must cover key performance points such as thread modeling, memory management, and data serialization. The thread model directly affects latency due to communication costs among threads. Well-designed memory management and data serialization are key to achieving zero-copy during data access. Here we present a brief introduction on the design of these components for storage purposes.

The User Space Storage Operating System (USSOS) is a unified user-space storage software platform that aims to support new storage media such as NVMe SSD and persistent memory. Its design principles (e.g., memory management, shared memory mechanism, and user-space drivers) are based on well-known user-space technologies (e.g., DPDK [19] and SPDK [42]). Related tests reveal that enabling USSOS in Pangu can improve CPU efficiency by more than 5× on average.

As a central part of USSOS, the User Space Storage File System (USSFS) is a high-performance local file system designed for SSDs. By running in the user space, USSFS is able to bypass the kernel to avoid user-/kernel-space-crossing overhead. USSOS divides disks into “chunks” which ChunkServer uses in its APIs (e.g., create, seal, and delete).
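To make the chunk abstraction concrete, here is a minimal, hypothetical sketch of a create/seal/delete-style chunk store. Everything in it (the class name, the in-memory byte store, the chunk capacity, and the append method) is an illustrative assumption for exposition, not Pangu’s actual USSFS interface.

```python
class ChunkStore:
    """Toy model of a chunk-granularity storage API (create/seal/delete)."""

    def __init__(self, chunk_size=64 * 1024 * 1024):
        self.chunk_size = chunk_size  # fixed capacity per chunk (assumed value)
        self.chunks = {}              # chunk_id -> bytearray of appended data
        self.sealed = set()           # ids of chunks closed for writing
        self.next_id = 0

    def create(self):
        """Allocate a new writable chunk and return its id."""
        cid = self.next_id
        self.next_id += 1
        self.chunks[cid] = bytearray()
        return cid

    def append(self, cid, data):
        """Append data to an open chunk; rejected once sealed or full."""
        if cid in self.sealed:
            raise ValueError("chunk is sealed")
        if len(self.chunks[cid]) + len(data) > self.chunk_size:
            raise ValueError("chunk is full")
        self.chunks[cid] += data

    def seal(self, cid):
        """Make the chunk immutable; it may then only be read or deleted."""
        self.sealed.add(cid)

    def delete(self, cid):
        """Reclaim the chunk's space."""
        self.chunks.pop(cid)
        self.sealed.discard(cid)
```

Appending to a sealed chunk fails, which mirrors the append-once, immutable-after-seal discipline that keeps a user-space, kernel-bypass file system simple.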
USSOS directly writes data and metadata to disks and uses polling to perceive completion events. For different block sizes, USSFS is able to improve IOPS by 4-10× compared to the Ext4 file system.

A run-to-completion model is considered as the optimal approach for the integration of the RDMA network stack with the storage stack. This model has previously been explored in studies discussing disaggregated storage (e.g., ReFlex [25], i10 [16]). However, these studies were published after the introduction of RDMA to Pangu in 2017. ReFlex and i10 focus on remote direct I/O, while a ChunkServer in Pangu is applied as a local storage engine for distributed storage. Google’s Snap [31] leverages a separate network process to unify network functionalities and reduce the number of network connections.

Memory Bottleneck with 100Gbps Networks. Deploying 100Gbps networks can achieve lower latency and higher throughput. With the faster network, the memory throughput now becomes a bottleneck in our system.

In order to obtain the upper bounds of the memory access throughput, we test the memory throughput using the Intel Memory Latency Checker (MLC) tool [20]. Table 3 details the measured usage of the hardware resources. In our test, the maximal achievable memory bandwidth is 61GB/s with a 1:1 read/write ratio. However, the average memory throughput with Pangu’s workload is already 29GB/s + 28GB/s = 57GB/s. This indicates that the memory, rather than the network, is the bottleneck.

Components                                | Average Utilization | Peak Utilization    | Maximum Physical Capacity
Physical CPU utilization ratio            | 66%                 | 70%                 | 100%
Memory read/write throughput              | 28GB/s / 29GB/s     | 33GB/s / 32GB/s     | 61GB/s in total (1:1 read/write)
SSD PCIe throughput (socket 0 + socket 1) | 550MB/s + 550MB/s   | 1000MB/s + 1000MB/s | 3.938GB/s + 3.938GB/s
Network PCIe RX throughput                | 10GB/s              | 11GB/s              | 15.754GB/s
Network PCIe TX throughput                | 8GB/s               | 9GB/s               | 15.754GB/s

Table 3: Measured resource utilization of Pangu in 100Gbps network with 1:1 read/write ratio.

By monitoring the memory usage in Pangu, we determined that both the verification and data copy processes require optimization. Data integrity is one of the most significant features of distributed storage. We adopt Cyclic Redundancy Check (CRC) for application-level data verification in Pangu. As shown in Fig. 4, the received data is split into chunks of 4KB, with a 4B CRC value and a 44B gap added to each chunk. This is a memory- and computation-intensive operation as the calculations are applied to the entire dataset. The data are also copied when they are written into the disks in order to include CRC footers. Copying is not performed in other components due to the remote-memory access semantic of RDMA.

[Figure 4: Potential triggering of data copying by CRC. Data units (4048B) in the RDMA buffer are copied into the I/O buffer, with a 4B CRC and a 44B gap appended to each unit.]

[Figure 5: Integrated network/storage processing: a request (1) arrives from the network at the RNIC, is DMAed to user space (2), obtained by the RPC framework via polling (3), handed to the ChunkServer (4), passed through the USSFS API (5), and written to the NVMe SSD via DMA (6).]

Large Number of QPs. We used to adopt the full-mesh link mode among running threads in Pangu in order to maximize throughput and minimize latency (Fig. 6(a)). Assume that each ChunkServer has 14 threads, each BlockServer has 8 threads, and each node contains both ChunkServers and BlockServers. For the full-mesh mode in a cluster of 100 storage nodes, there could be 14 × 8 × 2 × 99 = 22,176 QPs in each node. RNICs’ performance drops dramatically for large numbers of QPs due to cache misses [21]. In particular, the number of RX pauses (i.e., PFC pause frames received) is very high.

Previous studies have demonstrated the same issue [10, 23, 47]. In order to solve this problem, FaSST [24] shares QPs among threads, which subsequently lowers the CPU efficiency and performance due to the lock contention of QPs between threads. An alternative heuristic is the inclusion of a dedicated proxy thread that manages all receive and send requests [41]. However, switching to/from a dedicated proxy thread increases latency. Furthermore, it is difficult to saturate the full network bandwidth with a single thread. Moreover, the proxy solution is not transparent to the underlying RDMA libraries.
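The QP blow-up described above is simple arithmetic and worth making explicit. The full-mesh count below follows the example in the text (14 ChunkServer threads, 8 BlockServer threads, both directions, 99 remote nodes); the shared-link estimate (one QP per local thread per remote node, via a correspondent thread) is our illustrative assumption of how pairing threads shrinks the count, not a formula given in the text.

```python
def qp_full_mesh(cs_threads, bs_threads, nodes):
    # Every ChunkServer thread pairs with every BlockServer thread,
    # in both directions, for each of the (nodes - 1) remote nodes.
    return cs_threads * bs_threads * 2 * (nodes - 1)

def qp_shared_link(threads_per_node, nodes):
    # Assumed model: each local thread keeps a single QP to its
    # correspondent thread on every remote node.
    return threads_per_node * (nodes - 1)

print(qp_full_mesh(14, 8, 100))     # 22176 QPs per node in a 100-node cluster
print(qp_shared_link(14 + 8, 100))  # 2178 under the shared-link assumption
```

The roughly order-of-magnitude reduction is what makes the per-QP state fit the RNIC cache again, which is the point of the shared-link design presented next.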
Client Server Client Server
0->0 0->1 0->2
Thread 0 Thread 0 Thread 0 Thread 0

0->1
1->0 1->1 1->2

0->2
Thread 1 Thread 1 Thread 1 Thread 1

1->2
2->0 2->1 2->2
Thread 2 Thread 2 Thread 2 Thread 2

(a) Full-mesh mode (b) Shared-link mode


Figure 6: Different link modes for send/receive RPC requests
4.2 Designs

The designs related to performance in Pangu are based on the principle of software-hardware co-design to minimize performance overhead.

Storage-RDMA Unified Run-to-Completion Stack. We adopt a run-to-completion thread model for both storage and network to achieve low latency. Fig. 5 demonstrates the procedure used to process requests. When a write RPC is received by a node, the RNIC posts it to the user space via DMA. The RPC framework obtains the request using polling and subsequently hands it over to a ChunkServer module for processing. The ChunkServer then informs USSFS to allocate a "chunk" resource for the request. Finally, a user-space driver interacts with NVMe SSDs to store the data. These operations are generally performed in a single server thread without thread switching. This run-to-completion model minimizes latency. In order to reduce the blocking time caused by large jobs, large I/O requests are split into smaller requests when submitted by applications. This optimization ensures a quick response to I/O signals. An additional optimization for large I/O requests involves passing auxiliary work (e.g., formatting and CRC calculation) to non-I/O threads, where it is subsequently processed. These optimizations reduce the average latency of a typical storage request (e.g., 4KB size) to less than 30µs.

The data formats are unified as I/O vectors. An I/O vector is transmitted without copying via a single RDMA verb using scatter-gather DMA (the transfer of discontinuous data through a single interrupt) in the network. Serialization is not necessary due to RDMA semantics.

Zero-Copy & CRC Offloading. As discussed in §4.1, data in Pangu has to be copied once on the I/O path, as each 4KB chunk is verified and attached with a CRC footer. Here, we leverage the User-Mode Memory Registration (UMR) [32] feature of RNICs to avoid this data copy. UMR can scatter RDMA data on the remote side through the definition of appropriate memory keys. Thus, data can be formatted and organized according to storage application formats. We use UMR to remap the continuous data from the sender into an I/O buffer at the receiver, which contains 4KB data, a 4B footer, and a 44B gap in each unit. Following the CRC calculation, the filled I/O buffer can be directly applied for disk writing. Moreover, the CRC calculation can be offloaded to capable RNICs (e.g., Mellanox CX-5), further lowering CPU and memory usage: the 4KB data are posted to the RNIC and the 4B CRC checksum is then generated.

Shared Link. We adopt the shared link mode, an effective solution for reducing the number of QPs in Pangu. The shared link mode is implemented in the application layer and leaves the RDMA libraries untouched. A correspondent thread in the destination node is assigned to each thread in the source node (Fig. 6(b)). The thread's requests to the node are sent to its correspondent thread, which subsequently dispatches the requests to the correct target threads.

Consider a daemon with N threads: each thread polls N request/response queues to obtain requests/responses. Note that there is only a single producer/consumer for each request/response queue. Thus, we use lock-free queues for each request/response queue to avoid contention. According to our tests, this design adds approximately 0.3µs of latency.

In the shared link mode, there is resource overhead at the correspondent thread during request dispatching when the source thread sends too many requests. Pangu therefore supports shared groups, where the threads in a node can be divided into several groups. A correspondent thread only relays requests for its group members. Returning to the previous example, the number of QPs in the All Shared mode is reduced to (8 + 8) × 99 = 1,584. If the threads are divided into 2 shared groups, the number of QPs will be (8 × 2 + 8 × 2) × 99 = 3,168.

4.3 Evaluation

Zero-Copy & CRC Offloading. We use FIO with 16 jobs and 64 I/O depth to test a virtual I/O block device on a single ChunkServer. Fig. 7(a) demonstrates the memory bandwidth usage (including read/write tasks) when UMR zero copy and CRC offloading are used. The memory bandwidth usage is reduced by more than 30%, revealing that these measures are able to relieve the pressure on memory. Fig. 7(b) depicts
Figure 7: Performance of UMR zero copy + CRC offloading. (a) Memory bandwidth usage. (b) Throughput. No Optimization vs. UMR + CRC offload at 32KB-128KB block sizes.

the improvement in throughput following the optimization. The throughput of a single ChunkServer thread is improved by approximately 200% for a block size of 128KB.

Shared Link. We tested the shared link mode with several shared QP groups in a cluster of 198 computing nodes and 93 storage nodes. The background workload comprises 4KB random writes with 8 threads and 8 I/O depths. Fig. 8(a) presents the throughput in the All Shared, 2 Shared Groups, and 4 Shared Groups modes, whereby a performance trade-off can be observed. The All Shared mode exhibits slightly lower throughput but generates the lowest number of PFC pauses. Note that the reduction in bandwidth at 5 and 24 minutes in the All Shared mode is attributed to the garbage collection mechanism in Pangu. Fig. 8(b) presents the TX pause duration with 1, 2, and 4 Shared Groups, respectively. The lower the number of groups, the fewer PFC pauses are generated, due to the reduction in QP number. We use the All Shared Group mode in our scale and configuration framework.

5 Availability Guarantee

5.1 PFC Storms

A New Type of PFC Storm. The PFC storm previously discussed in [13] originates from the NICs, with a bug in the receiving pipeline acting as the root cause. Fig. 9(a) depicts the phases of this PFC storm: (1) the bug slows down the NIC receive processing, filling its buffer; (2) the NIC transmits PFC pauses to its ToR switch in order to prevent packet drops; (3) the ToR switch pauses the transmission; (4) the ToR switch's buffer becomes full and starts to transmit PFC pauses; and (5) the victim ports are paused and are unable to transmit.

We encountered a different type of PFC storm when operating RDMA in Pangu. The root cause is a bug in the switch hardware of a specific vendor. The bug reduces the switching rate of the lossless priority queues to a very low rate. Fig. 9 compares the two types of PFC storms. As an example, we assume that the bug occurs in a down port of a ToR switch: (1) due to the low transmitting rate, the switch's buffer becomes full; (2) the switch transmits PFC pauses to the connected ports; and (3) the additional switches and NICs stop their transmissions. The leaf switches and NICs connected to this ToR switch receive continuous pause frames, and thus the storm spreads.

State-of-the-Art Solutions. Guo et al. [13] built a NIC-based watchdog to continuously monitor transmitted pause frames, disabling the PFC mechanism if necessary. In addition, watchdogs were also deployed on the switches for disabling the PFC mechanism when switches receive continuous pause frames and are unable to drain the queuing packets. The switches can subsequently re-enable PFC in the absence of pause frames over a specific period of time. Thus, PFC storms can be controlled via these two watchdogs during phase (2).

However, this solution is unable to completely solve the PFC storms originating from switches. In particular, the TX pause watchdogs on the NICs will not work, since the NICs only receive PFC storms from the switches. Furthermore, current switch hardware does not support the monitoring of pause frame transmissions. If a storm occurs on a ToR switch, even though the watchdogs on other switches are able to stop its spread, the ToR switch will continue to send pauses to end-hosts in the rack. The RDMA traffic via this ToR is consequently blocked.

Challenges. This new type of PFC storm invalidates Guo et al.'s solution, which focuses on insulating the PFC pause sources to prevent the storm from spreading. This methodology fails when the source is a ToR switch, as all the hosts in the ToR are paused by the storm. Therefore, in order to achieve high availability, a general solution is required in Pangu to handle all PFC storm types, particularly those with unknown causes.

Ensuring the service quality of Pangu while simultaneously solving PFC storms is challenging. PFC storm detection must be timely and accurate to rapidly protect network traffic. In terms of availability, the overall convergence time of the PFC storm should be controlled to at most the minute level.

5.2 Design

Our design principle for handling PFC storms is to escape as fast as possible. Despite new PFC storm solutions [11, 21, 26], we still resort to engineering-level work-arounds due to practical considerations (§2.3).

In Pangu, each NIC monitors the received PFC pause frames. Upon continuous pause frames, the NIC determines the presence of a PFC storm. Two work-around solutions are available to administrators in the case of a PFC storm.

Workaround 1: Shutdown. This solution, denoted as the "shutdown" solution, shuts down NIC ports affected by PFC storms for several seconds. The dual-home topology provides an emergency escape for PFC storms, whereby QPs will disconnect and connect again via another port. This method works together with the optimization that reduces the length of the QP timeout, discussed further in §6.2. Although this solution is simple and effective, it is sub-optimal due to the loss of half of the bandwidth.
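The NIC-side detection described above, watching for continuous received pause frames, is essentially a counter watchdog. The sketch below is our reading of that logic, with the sampling streak threshold as an illustrative assumption (the paper does not give concrete detection parameters):

```python
class PauseStormWatchdog:
    """Declare a PFC storm when the RX pause counter keeps growing
    across N consecutive samples of the NIC counter."""

    def __init__(self, consecutive_samples: int = 5):
        self.threshold = consecutive_samples
        self.last_counter = 0   # cumulative RX pause frames at last sample
        self.streak = 0         # consecutive samples with counter growth
        self.storm = False

    def sample(self, rx_pause_counter: int) -> bool:
        """Feed the current cumulative RX pause counter; True means storm."""
        if rx_pause_counter > self.last_counter:
            self.streak += 1
        else:
            self.streak = 0     # pauses stopped arriving: clear the storm
            self.storm = False
        self.last_counter = rx_pause_counter
        if self.streak >= self.threshold:
            self.storm = True
        return self.storm
```

Once a storm is declared, one of the two workarounds fires: shut the affected port down for several seconds, or switch the affected RDMA links to TCP.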
Figure 8: Throughput and pause for different types of thread groups. (a) Throughput in different thread groups. (b) Pause time in different thread groups. (All Shared / 2 Shared Groups / 4 Shared Groups.)

Figure 9: Different types of PFC storms. (a) PFC storm caused by a NIC in CLOS: the storm originates in NICs. (b) PFC storm caused by a switch in dual-home CLOS: the storm originates in switches.
Workaround 2: RDMA/TCP Switching. In this solution, the affected RDMA links in a PFC storm are switched to TCP links. It involves a more complex procedure compared to the shutdown solution, yet it is able to maintain the available bandwidth. We adopt a method similar to PingMesh [14] to detect the RDMA links affected in PFC storms. Every T ms, every worker thread picks a server and separately pings all its threads via the RDMA and TCP links. If the RDMA ping fails and the TCP ping succeeds for more than F times, the traffic on this RDMA link is switched to the TCP link. Once the RDMA ping has succeeded more than S times, the traffic on the switched TCP link is switched back to the RDMA link. For T = 10 ms and F = 3, bad RDMA links can be detected in approximately 10 seconds in a podset of 100 storage nodes. By switching the RDMA traffic to the TCP connections, the throughput can recover to more than 90% in less than 1 minute.

5.3 Evaluation

We simulate PFC storms by injecting the aforementioned bug into a switch for several cases, including the uplink/downlink ports on the ToR and Leaf switches. The RDMA/TCP switching solution exhibits strong performance for all cases. Fig. 10 displays the results for a PFC storm originating from a ToR switch downlink port. Note that the nodes inside the ToR behave differently from nodes outside the ToR. We choose two nodes (inside and outside the ToR) in order to demonstrate the difference. In such a case, the pause frames are transmitted to NICs and leaf switches directly connected to the given ToR switch.

The shutdown solution shuts down the NICs via the watchdogs in the occurrence of a fault due to excessive RX pauses. RDMA links subsequently reconnect through another NIC port, thus recovering traffic. Note that the counters of Congestion Notification Packet (CNP) and PFC frames gradually increase, since the system load (at 0 minutes) is larger than the available bandwidth of a single port (25Gbps). The system then reaches a new balance in approximately 30 minutes. However, the shutdown solution has several limitations. For example, computing node requests may not respond within 1 minute (known as an I/O hang, as sensed by applications). The downlink breakdown of a leaf or ToR switch can result in tens to hundreds of hung requests. Furthermore, shutting down ports is itself an aggressive action. Hundreds of ports may be shut down due to unexpected pauses. This risk may itself influence the availability of a large number of nodes.

The RDMA/TCP switching solution switches the RDMA traffic that passes through the broken-down switch to TCP.
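The ping-driven switching policy above (probe period T, failure threshold F, recovery threshold S) amounts to a small per-link state machine. The sketch below is our reading of it; the value of S is left as a parameter because the paper does not fix it, and the thresholds in the usage are illustrative.

```python
class LinkState:
    """Per-link failover state: RDMA ping failures while TCP stays alive
    switch the link to TCP; sustained RDMA ping successes switch it back."""

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 10):
        self.F = fail_threshold       # paper example: F = 3 with T = 10 ms
        self.S = recover_threshold    # S is not specified in the paper
        self.on_tcp = False
        self.fails = 0
        self.successes = 0

    def probe(self, rdma_ok: bool, tcp_ok: bool) -> str:
        """Record one ping round and return the transport to use."""
        if rdma_ok:
            self.fails = 0
            self.successes += 1
            if self.on_tcp and self.successes > self.S:
                self.on_tcp = False   # RDMA healthy again: switch back
        elif tcp_ok:
            self.successes = 0
            self.fails += 1           # RDMA dead but TCP alive: PFC suspected
            if self.fails > self.F:
                self.on_tcp = True    # fall back to TCP
        return "tcp" if self.on_tcp else "rdma"
```

With T = 10 ms, more than F = 3 consecutive failed rounds (i.e., roughly 40 ms of probing plus ping timeouts) flips a link to TCP, consistent with the ~10-second detection time reported for a 100-node podset.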
Figure 10: Performance of two different solutions for PFC storms. (a) Throughput, (b) CNP, and (c) PFC pause of RDMA/TCP switching; (d) throughput, (e) CNP, and (f) PFC pause of shutdown. (Curves: In-ToR/Out-ToR TCP and RDMA throughput, CNP sent/handled, TX/RX pauses.)

The RDMA links are then disconnected due to timeout. The QPs are separately distributed over the server's two NIC ports, thus the RDMA links may need several attempts to reconnect successfully. Note that although the pause storm in the ToR is not terminated, it will not spread further, as the neighboring switch ports are transformed into the lossy mode via the RX pause watchdogs. The traffic throughput is not impacted during the migration to TCP, and I/O hangs are not present.

6 SLA Maintenance

6.1 SLA in Pangu

It is commonly known that network failures are hard to locate and repair. Network failure causes include mis-configuration, hardware/software bugs, and link errors. For example, the mis-configuration of the switch Access Control List (ACL) may only affect a specific flow while other flows behave normally [28, 45]. In comparison, malfunctions occurring at storage devices or nodes can generally be located easily.

Sometimes network failures may not be explicit. Ideally, when a node breaks down, the heartbeat mechanism should ensure that the unavailability of service daemons (BlockServers and ChunkServers) on the node is reported to their masters (BlockMasters and PanguMasters). However, real situations can be more complicated, and may fail to be detected with heartbeats alone. Connections may suffer intermittent packet loss or long latency rather than simple breakdowns. We also identified an interesting failure type involving a small number of links that flap between up and down states for a short period of time (e.g., several seconds). This results in an extremely high tail latency for I/O requests, denoted as slow I/O (e.g., over 1 second for storage clients). Hundreds of slow I/Os are observed daily for numerous reasons. Root causes of link flapping include optical ports covered with dust, loose physical interfaces, aging hardware, etc.

Previous Research on Network Failures. The majority of previous studies focus on determining the location of network failures (e.g., Pingmesh [14] and 007 [1]). These solutions focus on the system network and can achieve the timely discovery of network errors. However, it may still take hours for engineers to manually check, fix, and replace the failed switches and links. Cloud storage calls for a methodology that integrates the storage and network function modules to ensure stable service quality in failure cases.

6.2 Design

Our SLA design principle aims to exploit storage semantics whenever useful. Distributed storage is designed with a redundancy mechanism, and its performance is measured via the storage semantics. These semantics, such as data replicas and distributed placements, can be leveraged during a failure to improve system performance.

Network-Integrated Monitoring Service. Monitoring is a necessary component of distributed systems. A comprehensive monitoring system is key for the safe deployment
of RDMA in production storage systems, as well as reliable performance.

Both configurations and counters must be monitored. NIC configurations include system environments, PFC/ECN/QoS configurations, and several link configurations. Pangu automatically checks, collects, and reports suspicious terms. Inspecting potential mis-configurations can reduce configuration and environment errors. For example, accidental reboots and software upgrades may reset the QoS, PFC, and DCQCN [52] configurations and affect system performance. Monitoring can discover such cases and help fix them in advance.

Counters include storage system performance counters (e.g., IOPS/latency) and network counters (e.g., CNP sent/handled, TX/RX pauses, and RDMA errors on NICs). Congestion control counters that exceed thresholds can result in monitor daemons sending alarms to engineers for diagnosis and repair. Monitoring both storage and network indexes is crucial for the diagnosis of errors and for predictable performance. Storage indexes such as tail latency, slow I/O, and queuing time can directly reflect the status of a system. Moreover, monitoring system performance also helps locate errors. For example, network features are unable to quickly reflect the flapping problem described in §6.1. However, this problem can be easily located by monitoring slow I/Os on the endpoints of storage applications.

Faster Timeout for Connection Failures. The basic solution to network failures is to reconnect through an alternative path. Since we use a dual-home topology, each single-point failure of the network can be bypassed using a different path. Thus, the timeout duration of the QPs is crucial in improving the system performance during a failure. In the NIC manual, the timeout duration of QPs is calculated as 4µs × 2^timeout × 2^retry_cnt, where timeout and retry_cnt denote the retransmission timeout value and the number of retransmission retries, respectively. Initially, this value was a constant (approximately 16 seconds) configured in the hardware and could not be changed. In a combined effort with the NIC providers, we were able to fix this bug. By using a smaller timeout value for QPs, the action time required for reconnecting during network failures was reduced by 4×.

An alternative work-around involves altering the connection path by modifying the source ports of the QPs (rather than a direct reconnection). This can accelerate the link recovery during a fail-over. However, effectively changing the QP source port requires a more recent NIC firmware (MLNX OFED 4.7 or newer) than what is currently deployed in Pangu. We leave this challenge to future work.

Blacklist. We adopt a blacklist in Pangu to further improve the service quality in fail-over cases. BlockMasters collect information on I/O errors (including timeouts) and slow I/Os from clients. If a BlockServer has a large number of slow/error I/Os from multiple clients, the masters add it to the blacklist for a short period of time. The number of clients and slow/error I/Os that triggers the blacklist is configured according to the scale of the cluster. In order to ensure reliability, the maximum number of BlockServers in the blacklist is usually small (e.g., 3 hosts). This blacklist mechanism temporarily isolates the BlockServers that provide poor service. The system performance is not affected, and engineers have sufficient time to fix the problems.

6.3 Daily Operation Scheme of Pangu

The daily operations of Pangu rely on these modules working together. The monitoring system collects and reflects the status of Pangu. If abnormal network indicators or I/O metrics are observed, the monitoring system attempts to locate and report them to the administrators. For accidental failures such as link errors, the small QP timeout shortens the time required for failure recovery. The blacklist mechanism is able to determine and isolate nodes with poor service quality. By following these design and operator framework specifications, our RDMA-enabled Pangu system has not experienced any major faults in the last two years.

7 Experiences and Future Work

Monitoring NACK in Lossless RDMA. The operation of RDMA over a lossless fabric is difficult due to PFC risks. However, the lossless fabric increases the effectiveness of NACK events as indicators of the network error location, since NACKs are usually rare in a lossless fabric.

In order to detect and locate network problems, we build a subsystem based on packet loss in Pangu. In particular, Out-Of-Sequence (OOS) counters on RNICs and packet drop counters on switches are gathered. A packet loss is classified as either explicit or implicit based on whether it is recorded by switch counters. The former is easy to locate by checking the switch counters. However, determining the location of the latter is complex, as RNIC counters do not distinguish between flows. By monitoring NACKs in the network, we can extract flows' five-tuples and locate the root of a problem.

Building a System-Level Benchmark. To evaluate the system network performance and SLA, a representative benchmark must be constructed. Building the benchmark based on just the network metrics is simple. However, storage features such as replica completion time, slow I/O, and failure recovery should not be ignored. To measure the storage system performance and the SLA, we build a benchmark at the storage service level. The system evaluation indexes include FIO latency, IOPS, SLA with network errors, etc. Each upgrade in Pangu (for network and other components) is evaluated with the benchmark, allowing us to measure the overall system performance.

Congestion Control for Fail-Over Scenarios. In §3, we introduced the dual-home topology adopted in Pangu. Dual-home topology is also crucial to fail-over cases, since it provides a backup path on NICs. However, we encountered a problem when deploying dual-home topology in practice.
According to our test, when one ToR switch is down, the RX pause duration can increase to 200ms per second. This is due to the transfer of the traffic from the broken ToR switch to the other switch. DCQCN handles this traffic burst poorly under the resulting asymmetric topology. We adapt several DCQCN parameters as a temporary solution and leverage the fluid models [53] to analyze the available choices, including canceling the Fast Recovery stage and extending the rate-increase duration. When removing the Fast Recovery stage in DCQCN, the pauses can be eliminated, yet the flow tail latency increases due to the slow recovery of the flow rate. In contrast, extending the rate-increase duration can result in a sharp reduction in pauses while only slightly increasing the flow tail latency. In our experience, extending the rate-increase duration in DCQCN is effective for storage traffic patterns. The bandwidth drops slightly while the number of RX pauses is dramatically reduced.

This problem of DCQCN in fail-over scenarios (e.g., asymmetric topology and traffic bursts) indicates the importance of such scenarios when designing congestion control. We adopt parameter tuning to fix this problem at the price of a slight performance loss. The storage network still requires a flexible, robust, and well-implemented congestion control mechanism that functions well in all scenarios. In 2019, Alibaba designed HPCC [30], a novel congestion control algorithm for the RDMA network. Adapting and integrating HPCC with the storage networks is left for future work.

Slow Processing of RDMA Read. The majority of large RPCs in Pangu are transferred via RDMA READ. We observed that when a NIC receives too many RDMA requests within a short period, it sends out many PFC frames. This is due to the slowed receiving process that results from cache misses. When a NIC is preparing for an RDMA READ, it accesses the QP context in its cache. Processing many RDMA READs consumes an excessive amount of cache resources. For slow RX rates, the NIC sends out PFC pause frames to prevent packet drops. We are currently working with the RNIC provider to solve this problem.

Lossy RDMA in Storage. Lossy RDMA is supported by Mellanox CX-5 and CX-6 RNICs. Note that CX-6 supports Selective Repeat (SR) retransmission. SR might be the ultimate step required to effectively eliminate PFC. The construction of lossy RDMA is a focal point for all RDMA-based systems. We tested lossy RDMA with Pangu over an extensive period and will deploy it for new clusters. However, enabling the lossy feature with early-generation RNICs (e.g., CX-4) that have limited hardware resources and do not support SR is hard, and many production RDMA-based systems still host early-generation RNICs.

NVMe-over-Fabrics. The ChunkServer data flow in Pangu is processed by CPUs. However, with NVMe-over-Fabrics, NICs can directly write the received data into NVMe SSDs. This CPU-bypass solution can save CPU costs and reduce latency. We are currently building our specialized storage protocol (and corresponding hardware) based on NVMe-over-Fabrics. A customized storage protocol for Pangu with hardware support can allow for more flexibility and control.

8 Related Work

PFC in RDMA. PFC storms are the most well-known issue of RDMA. Several studies [11, 30, 52] focus on controlling network congestion to reduce the number of generated PFC pauses. DCQCN [52] is integrated in Mellanox RNICs. In Pangu, we tune several parameters in DCQCN to improve its performance in fail-over scenarios. However, PFC storms still occur due to hardware bugs [13]. In this paper, we present a different hardware bug that originates from switches. Existing solutions to remedy PFC storms include deadlock elimination [38] and performance optimization [46]. These solutions require switch modification. In Pangu, we combat PFC storms by switching affected links from RDMA to TCP without the need for any switch changes.

System & Network Co-Design. Recently, there has been an increasing amount of work that adopts system and network co-design, including RPC systems [21, 47], distributed memory systems [39], key-value stores [22], distributed databases and transaction processing systems [6], and graph-processing systems [40]. We co-design our storage system and RDMA in Pangu. To the best of our knowledge, we are the first to share the experience of employing RDMA networks in large-scale distributed storage systems.

9 Conclusions

As a distributed storage system, Pangu has provided storage services to tenants both inside and outside of Alibaba for over a decade. In order to overcome the challenges of rising high-speed storage media and growing business requirements, we integrate RDMA into the storage network of Pangu, providing a common solution to different types of PFC storms. This allows for the safe deployment of RDMA. Pangu has successfully moved to a 100Gbps network by solving several new problems, such as the memory bandwidth bottleneck and QP number explosion. Furthermore, we improve the system performance of Pangu in fail-over cases.

Acknowledgment

We are really grateful to Yiying Zhang for shepherding our paper. We also thank the anonymous reviewers for their constructive suggestions. Yixiao Gao and Chen Tian are supported in part by the National Key R&D Program of China 2018YFB1003505, the National Natural Science Foundation of China under Grant Numbers 61772265, 61802172, and 62072228, the Fundamental Research Funds for the Central Universities under Grant Number 14380073, the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Jiangsu Innovation and Entrepreneurship (Shuangchuang) Program.
References

[1] Behnaz Arzani, Selim Ciraci, Luiz Chamon, Yibo Zhu, Hongqiang (Harry) Liu, Jitu Padhye, Boon Thau Loo, and Geoff Outhred. 007: Democratically finding the cause of packet drops. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 419-435, Renton, WA, April 2018. USENIX Association.

[2] Amazon AWS. Summary of the amazon s3 service disruption in the northern virginia (us-east-1) region. https://www.usatoday.com/story/tech/news/2017/02/28/amazons-cloud-service-goes-down-sites-scramble/98530914/, 2020.

[3] Amazon AWS. Summary of the amazon s3 service disruption in the northern virginia (us-east-1) region. https://aws.amazon.com/cn/message/41926/, 2020.

[4] Microsoft Azure. Azure status history. https://status.azure.com/status/history/, 2020.

[5] Buzzfeed. Buzzfeed website. https://www.buzzfeed.com/, 2020.

[6] Yanzhe Chen, Xingda Wei, Jiaxin Shi, Rong Chen, and Haibo Chen. Fast and general distributed transactions using rdma and htm. In Proceedings of the Eleventh European Conference on Computer Systems, page 26. ACM, 2016.

[7] Inho Cho, Keon Jang, and Dongsu Han. Credit-scheduled delay-bounded congestion control for datacenters. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 239-252. ACM, 2017.

[8] Alibaba Cloud. How does cloud empower double 11 shopping festival. https://resource.alibabacloud.com/event/detail?id=1281, 2020.

[9] Google Cloud. Google cloud networking incident no.19005. https://status.cloud.google.com/incident/cloud-networking/19005, 2020.

[10] Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. Farm: Fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, pages 401-414, 2014.

[11] Yixiao Gao, Yuchen Yang, Tian Chen, Jiaqi Zheng, Bing Mao, and Guihai Chen. Dcqcn+: Taming large-scale incast congestion in rdma over ethernet networks. In 2018 IEEE 26th International Conference on Network Protocols (ICNP), pages 110-120. IEEE, 2018.

[12] Alibaba Group. Alibaba group website. https://www.alibabagroup.com/en/global/home, 1999-2020.

[13] Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. Rdma over commodity ethernet at scale. In Proceedings of the 2016 conference on ACM SIGCOMM 2016 Conference, pages 202-215. ACM, 2016.

[14] Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen, et al. Pingmesh: A large-scale system for data center network latency measurement and analysis. In ACM SIGCOMM Computer Communication Review, volume 45, pages 139-152. ACM, 2015.

[15] Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W Moore, Gianni Antichi, and Marcin Wójcik. Re-architecting datacenter networks and stacks for low latency and high performance. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 29-42. ACM, 2017.

[16] Jaehyun Hwang, Qizhe Cai, Ao Tang, and Rachit Agarwal. TCP ≈ RDMA: Cpu-efficient remote storage access with i10. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 127-140, Santa Clara, CA, February 2020. USENIX Association.

[17] Alibaba Inc. Block storage performance. https://www.alibabacloud.com/help/doc-detail/25382.html?spm=a2c5t.10695662.1996646101.searchclickresult.458e478fYtRYOO, 2018.

[18] Alibaba Inc. Pangu, the high performance distributed file system by alibaba cloud. https://www.alibabacloud.com/blog/pangu-the-high-performance-distributed-file-system-by-alibaba-cloud_594059, 2018.

[19] Intel. Data plane development kit. https://www.dpdk.org/, 2011.

[20] Intel. Intel memory latency checker. https://software.intel.com/content/www/us/en/develop/articles/intelr-memory-latency-checker.html, 2020.

[21] Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter rpcs can be general and fast. In USENIX NSDI, pages 1-16, 2019.

[22] Anuj Kalia, Michael Kaminsky, and David G Andersen. Using rdma efficiently for key-value services. In ACM SIGCOMM Computer Communication Review, volume 44, pages 295-306. ACM, 2014.

USENIX Association 18th USENIX Symposium on Networked Systems Design and Implementation 531
[23] Anuj Kalia, Michael Kaminsky, and David G Andersen. Design guidelines for high performance rdma systems. In 2016 USENIX Annual Technical Conference, page 437, 2016.

[24] Anuj Kalia, Michael Kaminsky, and David G Andersen. Fasst: Fast, scalable and simple distributed transactions with two-sided (rdma) datagram rpcs. In OSDI, volume 16, pages 185–201, 2016.

[25] Ana Klimovic, Heiner Litz, and Christos Kozyrakis. Reflex: Remote flash ≈ local flash. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '17, pages 345–359, New York, NY, USA, 2017. Association for Computing Machinery.

[26] Yanfang Le, Brent Stephens, Arjun Singhvi, Aditya Akella, and Michael M Swift. Rogue: Rdma over generic unconverged ethernet. In SoCC, pages 225–236, 2018.

[27] David Lee, S Jamaloddin Golestani, and Mark John Karol. Prevention of deadlocks and livelocks in lossless, backpressured packet networks, February 22 2005. US Patent 6,859,435.

[28] Dan Li, Songtao Wang, Konglin Zhu, and Shutao Xia. A survey of network update in sdn. Frontiers of Computer Science, 11(1):4–12, 2017.

[29] Hao Li, Asim Kadav, Erik Kruus, and Cristian Ungureanu. Malt: distributed data-parallelism for existing ml applications. In Proceedings of the Tenth European Conference on Computer Systems, page 3. ACM, 2015.

[30] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, et al. Hpcc: High precision congestion control. In Proceedings of the ACM Special Interest Group on Data Communication, pages 44–58. ACM, 2019.

[31] Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C. Evans, Steve Gribble, Nicholas Kidd, Roman Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Erik Rubow, Michael Ryan, Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, and Amin Vahdat. Snap: A microkernel approach to host networking. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pages 399–413, New York, NY, USA, 2019. Association for Computing Machinery.

[32] Mellanox. Mellanox rdma programming manual. https://www.mellanox.com/sites/default/files/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf, 2015.

[33] Christopher Mitchell, Yifeng Geng, and Jinyang Li. Using one-sided rdma reads to build a fast, cpu-efficient key-value store. In USENIX Annual Technical Conference, pages 103–114, 2013.

[34] Netflix. Netflix website. https://www.netflix.com/, 2020.

[35] J. Niu, J. Xu, and L. Xie. Hybrid storage systems: A survey of architectures and algorithms. IEEE Access, 6:13385–13406, 2018.

[36] Diego Ongaro and John Ousterhout. In search of an understandable consensus algorithm. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 305–319, Philadelphia, PA, June 2014. USENIX Association.

[37] Pinterest. Pinterest website. https://www.pinterest.com/, 2020.

[38] Kun Qian, Wenxue Cheng, Tong Zhang, and Fengyuan Ren. Gentle flow control: Avoiding deadlock in lossless networks. In Proceedings of the ACM Special Interest Group on Data Communication, pages 75–89. ACM, 2019.

[39] Yizhou Shan, Shin-Yeh Tsai, and Yiying Zhang. Distributed shared persistent memory. In Proceedings of the 2017 Symposium on Cloud Computing, pages 323–337, 2017.

[40] Jiaxin Shi, Youyang Yao, Rong Chen, Haibo Chen, and Feifei Li. Fast and concurrent rdf queries with rdma-based distributed graph exploration. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 317–332. USENIX Association, 2016.

[41] Galen M Shipman, Stephen Poole, Pavel Shamis, and Ishai Rabinovitz. X-srq: Improving scalability and performance of multi-core infiniband clusters. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, pages 33–42. Springer, 2008.

[42] SPDK. Storage performance development kit. https://www.spdk.io, 2020.

[43] Spotify. Spotify website. https://www.spotify.com/, 2020.

[44] Brent Stephens, Alan L Cox, Ankit Singla, John Carter, Colin Dixon, and Wesley Felter. Practical dcb for improved data center networks. In IEEE INFOCOM 2014 - IEEE Conference on Computer Communications, pages 1824–1832. IEEE, 2014.

[45] Bingchuan Tian, Xinyi Zhang, Ennan Zhai, Hongqiang Harry Liu, Qiaobo Ye, Chunsheng Wang, Xin Wu, Zhiming Ji, Yihong Sang, Ming Zhang, Da Yu, Chen Tian, Haitao Zheng, and Ben Y. Zhao. Safely and automatically updating in-network acl configurations with intent language. In Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM '19, pages 214–226, New York, NY, USA, 2019. Association for Computing Machinery.

[46] C. Tian, B. Li, L. Qin, J. Zheng, J. Yang, W. Wang, G. Chen, and W. Dou. P-pfc: Reducing tail latency with predictive pfc in lossless data center networks. IEEE Transactions on Parallel and Distributed Systems, 31(6):1447–1459, 2020.

[47] Shin-Yeh Tsai and Yiying Zhang. Lite kernel rdma support for datacenter applications. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 306–324. ACM, 2017.

[48] Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. Fast in-memory transaction processing using rdma and htm. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 87–104. ACM, 2015.

[49] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Manu Awasthi, Tameesh Suri, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. Performance characterization of hyperscale applications on nvme ssds. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '15, pages 473–474, New York, NY, USA, 2015. Association for Computing Machinery.

[50] Yiying Zhang and Steven Swanson. A study of application performance with non-volatile main memory. In Symposium on Mass Storage Systems and Technologies, 2015.

[51] Yang Zhao, Nai Xia, Chen Tian, Bo Li, Yizhou Tang, Yi Wang, Gong Zhang, Rui Li, and Alex X. Liu. Performance of container networking technologies. In Proceedings of the Workshop on Hot Topics in Container Networking and Networked Systems, HotConNet '17, pages 1–6, New York, NY, USA, 2017. Association for Computing Machinery.

[52] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for large-scale rdma deployments. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM '15, pages 523–536, New York, NY, USA, 2015. Association for Computing Machinery.

[53] Yibo Zhu, Monia Ghobadi, Vishal Misra, and Jitendra Padhye. Ecn or delay: Lessons learnt from analysis of dcqcn and timely. In Proceedings of the 12th International Conference on Emerging Networking Experiments and Technologies, pages 313–327. ACM, 2016.