[Chart: 10,000X improvement]
Ethernet Storage Fabric
Extending NVMe Over Fabrics (NVMe-oF)
NVMe SSDs shared by multiple servers
Better utilization of capacity, rack space, and power
Scalability, management, fault isolation
NVMe-oF industry standard
Version 1.0 completed in June 2016
RDMA protocol is part of the standard
NVMe-oF version 1.0 includes a Transport binding specification for RDMA
RDMA transports: Ethernet (RoCE) and InfiniBand
How Does NVMe-oF Maintain Performance?
RDMA: More Efficient Networking
RDMA performs four critical functions in hardware
RDMA is Natural Extension for NVMe
SW-HW communication through work & completion queues
Software queues in shared memory
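To make the shared queue model concrete, here is a minimal user-space sketch (mine, not taken from the deck) of how software drives an RDMA adapter through a work queue and a completion queue using libibverbs. It assumes a connected queue pair, a completion queue, and a registered memory region already exist.

/* Minimal sketch: SW->HW via a posted work request, HW->SW via the
 * completion queue. Assumes qp, cq and mr (covering buf) are already
 * set up by the caller. Compile with -libverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

int post_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, size_t len)
{
    /* Describe the work: one SEND of the buffer, signaled on completion. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr = NULL;

    /* Post the work request to the send queue: the SW->HW side. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Poll the completion queue: the HW->SW side. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "send failed: %s\n", ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;
}

The same submit-then-poll pattern is exactly what an NVMe driver does with its submission and completion queues, which is why NVMe maps so naturally onto RDMA.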
Memory Queue Based Data Transfer Flow
[Diagram: data transfer flow between memory queues and the RDMA adapter]
RDMA
NVMe and NVMe-oF Fit Together Well
NVMe-oF IO WRITE
[Ladder diagram: Host / NVMe Initiator / RNIC / RNIC / NVMe Target Subsystem]
The host posts a SEND carrying the Command Capsule (CC); the initiator RNIC delivers the Command Capsule to the target subsystem.
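For reference, the capsule named above is just the 64-byte NVMe submission queue entry, optionally followed by in-capsule data. A rough C sketch of its shape (simplified names of my own, not copied from the specification):

/* Illustrative shape of an NVMe-oF command capsule as carried in the SEND. */
#include <stdint.h>

#define NVME_SQE_SIZE 64

struct nvmeof_command_capsule {
    uint8_t sqe[NVME_SQE_SIZE];   /* the NVMe submission queue entry (the command) */
    uint8_t incapsule_data[];     /* optional in-capsule data, if the transport allows it */
};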
NVMe-oF IO READ
[Ladder diagram: Host / NVMe Initiator / RNIC / RNIC / NVMe Target Subsystem]
The host posts a SEND carrying the Command Capsule (CC); the initiator RNIC delivers the Command Capsule to the target subsystem.
The target posts an RDMA Write to write the data back to host memory (Write Last, then Ack), followed by a SEND carrying the NVMe Response Capsule (RC).
On the response completion the host frees the allocated buffer; upon SEND completion the send buffer, the CC, and the completion resources are freed, and the target frees its memory.
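A hedged sketch of the two target-side verbs operations in that ladder: the RDMA WRITE that pushes the read data into the host's buffer, then the SEND that carries the response capsule. It assumes a connected QP and that the host's remote address and rkey came from the command capsule's SGL; the function name is mine.

/* Target side of the read flow: RDMA WRITE the data, then SEND the response. */
#include <infiniband/verbs.h>
#include <stdint.h>

int push_data_and_respond(struct ibv_qp *qp,
                          struct ibv_mr *data_mr, void *data, uint32_t data_len,
                          uint64_t remote_addr, uint32_t rkey,
                          struct ibv_mr *rsp_mr, void *rsp, uint32_t rsp_len)
{
    struct ibv_send_wr *bad_wr = NULL;

    /* RDMA WRITE: place the data directly into host memory, no host CPU involved. */
    struct ibv_sge data_sge = {
        .addr = (uintptr_t)data, .length = data_len, .lkey = data_mr->lkey };
    struct ibv_send_wr write_wr = {
        .wr_id   = 1,
        .sg_list = &data_sge,
        .num_sge = 1,
        .opcode  = IBV_WR_RDMA_WRITE,
    };
    write_wr.wr.rdma.remote_addr = remote_addr;   /* host buffer from the capsule's SGL */
    write_wr.wr.rdma.rkey        = rkey;

    /* SEND: the response capsule, signaled so buffers can be freed on completion. */
    struct ibv_sge rsp_sge = {
        .addr = (uintptr_t)rsp, .length = rsp_len, .lkey = rsp_mr->lkey };
    struct ibv_send_wr send_wr = {
        .wr_id      = 2,
        .sg_list    = &rsp_sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };

    write_wr.next = &send_wr;     /* post both in one chain; QP ordering keeps WRITE before SEND */
    return ibv_post_send(qp, &write_wr, &bad_wr);
}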
NVMe-oF IO WRITE In-Capsule
[Ladder diagram: Host / NVMe Initiator / RNIC / RNIC / NVMe Target Subsystem]
The write data travels inside the Command Capsule itself, so the target does not have to fetch it separately.
The target returns a completion (Response Capsule), the Ack arrives, and upon SEND completion the initiator frees the send buffer.
NVMf is Great!
Shared Receive Queue
[Diagram: receive work queue entries pointing into a shared pool of memory buffers]
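A minimal libibverbs sketch of the shared receive queue idea, assuming an already-allocated protection domain and registered buffers; the queue sizes are arbitrary.

/* One SRQ serves receives for many connections instead of one RQ per QP. */
#include <infiniband/verbs.h>
#include <stdint.h>

struct ibv_srq *make_srq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr = {
        .attr = { .max_wr = 1024, .max_sge = 1 },   /* up to 1024 posted buffers */
    };
    return ibv_create_srq(pd, &attr);               /* later passed as srq in ibv_create_qp */
}

int post_recv_buffer(struct ibv_srq *srq, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = (uintptr_t)buf, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr = NULL;
    return ibv_post_srq_recv(srq, &wr, &bad_wr);    /* any QP attached to the SRQ may consume it */
}

Any queue pair created with its init attribute pointing at this SRQ draws buffers from the same pool, which keeps receive-buffer memory bounded as the number of connected initiators grows.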
NVMe-oF System Example
[Diagram: host memory with data buffers; NVMe SSD and RNIC attached over PCIe]
Target Data Path (NVMe WRITE)
[Diagram: host memory holding the data; NVMe SSD and RNIC attached over PCIe]
1. Fabrics command
2. Data fetch (the RNIC places the data into host memory)
3. Submit NVMe command
4. Doorbell
5. Command + data fetch (the NVMe SSD reads both from host memory)
6. NVMe completion
7. Send fabrics response
8. Fabrics response
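Steps 3 and 4 are the classic NVMe submission sequence. A simplified sketch (my own names, not kernel code) of what "submit command" and "doorbell" amount to:

/* Place a 64-byte NVMe command in the submission queue ring in host memory,
 * then write the new tail index to the mapped SQ tail doorbell register. */
#include <stdint.h>
#include <string.h>

struct nvme_sq {
    uint8_t           (*ring)[64];   /* SQ entries in host DRAM, 64 bytes each */
    uint16_t           tail;         /* next free slot */
    uint16_t           depth;
    volatile uint32_t *doorbell;     /* mapped SQ tail doorbell on the device */
};

static void nvme_submit(struct nvme_sq *sq, const void *cmd64)
{
    memcpy(sq->ring[sq->tail], cmd64, 64);            /* step 3: submit the command */
    sq->tail = (uint16_t)((sq->tail + 1) % sq->depth);
    __sync_synchronize();                             /* order the command write before the doorbell */
    *sq->doorbell = sq->tail;                         /* step 4: tell the SSD new work is available */
}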
Controller Memory Buffer
Internal memory of the NVMe device, exposed over PCIe
A few MB are enough to buffer the PCIe bandwidth for the latency of the NVMe device
Latency ~100-200 µs, bandwidth ~25-50 GbE, capacity ~2.5 MB
Enabler for peer-to-peer communication of data and commands between an RDMA-capable NIC and an NVMe SSD
Optional since NVMe 1.2
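A quick back-of-the-envelope check of those numbers (my arithmetic, not the slide's): the buffer only has to cover the bandwidth-delay product of the link rate and the device latency. Taking 50 GbE as roughly 6.25 GB/s and the worst-case 200 µs latency quoted above,

$6.25\,\mathrm{GB/s} \times 200\,\mu\mathrm{s} \approx 1.25\,\mathrm{MB}$

so the ~2.5 MB figure leaves about 2x headroom, which is why a few MB of CMB is enough to hide the device latency at line rate.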
Target Data Path with CMB (NVMe WRITE)
[Diagram: the RNIC writes data peer-to-peer over PCIe into the SSD's Controller Memory Buffer]
1. Fabrics command
2. Data fetch (directly into the CMB instead of host memory)
3. Submit NVMe command
4. Doorbell
5. Command fetch (the data is already in the CMB)
6. NVMe completion
7. Send fabrics response
8. Fabrics response
SW Arch for P2P NVMe and RDMA
NVMe Over Fabrics Target Offload
[Diagram: host root complex and memory subsystem; RNIC containing the NVMe over Fabrics target offload (RDMA transport, NVMf <-> NVMe translation); NVMe SSDs on PCIe; admin and IO paths; network]
NVMe over Fabrics is built on top of RDMA: transport communication is handled in hardware
NVMe over Fabrics target offload enables hosts to access the remote NVMe devices without any target CPU processing, by offloading the control part of the NVMf data path
Encapsulation/decapsulation of NVMf <-> NVMe is done by the adapter with 0% CPU
Resiliency: the NVMe devices remain exposed even through a kernel panic, and the data path is immune to OS noise
Admin operations are maintained in software
Target Data Path with NVMf Target Offload (NVMe WRITE)
[Diagram: NVMe SQ/CQ and data buffers in host memory; the RNIC drives the NVMe SSD directly over PCIe]
1. Fabrics command
2. Data fetch
3. Submit NVMe command (by the RNIC)
4. Poll CQ (by the RNIC)
5. Doorbell
6. Command + data fetch
7. NVMe completion; fabrics response
Software API for NVMf Target Offload
[Diagram: the NVMf target module between the block layer (submit_bio() for I/O), the NVMe-PCI driver, NVMe-RDMA, and the ConnectX®-5 I/O device]
To configure the offload, the NVMf target gets the IO queues, the properties, and the data buffer* for each namespace (NS) from the NVMe-PCI driver, registers the NVMf offload parameters, and binds the RDMA queue pair to the NVMf offloads on the ConnectX®-5 device; regular I/O still enters through submit_bio().
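The flow in that diagram, written out as a hypothetical C sketch. None of these type or function names are the real kernel or ConnectX APIs; they are illustrative stand-ins that only mirror the arrows above (query the NVMe-PCI driver, register the offload parameters, bind the queue pair).

/* Hypothetical control flow for setting up the NVMf target offload. */
#include <stdint.h>

struct nvme_dev;                    /* the NVMe-PCI controller (opaque here) */
struct rnic_dev;                    /* the offload-capable adapter (opaque here) */
struct rnic_qp;                     /* an RDMA queue pair on that adapter */

struct nvmf_offload_params {
    uint64_t sq_addr, cq_addr;      /* NVMe I/O queues handed over to the RNIC */
    uint64_t sq_doorbell, cq_doorbell;
    uint64_t data_buf_addr;         /* staging data buffer (e.g. a CMB region) */
    uint32_t data_buf_len;
};

/* Hypothetical helpers mirroring the arrows in the diagram. */
int nvme_get_io_queues(struct nvme_dev *d, struct nvmf_offload_params *p);
int nvme_get_properties(struct nvme_dev *d, struct nvmf_offload_params *p);
int nvme_get_data_buffer(struct nvme_dev *d, struct nvmf_offload_params *p);
int rnic_register_nvmf_offload(struct rnic_dev *r, const struct nvmf_offload_params *p);
int rnic_bind_qp_to_nvmf_offload(struct rnic_dev *r, struct rnic_qp *qp);

int setup_offload(struct nvme_dev *nvme, struct rnic_dev *rnic, struct rnic_qp *qp)
{
    struct nvmf_offload_params p = {0};

    /* 1. Ask the NVMe-PCI driver for everything the adapter needs to drive the SSD. */
    if (nvme_get_io_queues(nvme, &p) || nvme_get_properties(nvme, &p) ||
        nvme_get_data_buffer(nvme, &p))
        return -1;

    /* 2. Program the RNIC and bind the queue pair; from here on the adapter
     *    translates NVMf capsules to NVMe commands without the host CPU.
     *    Admin commands and regular submit_bio() I/O stay in software. */
    if (rnic_register_nvmf_offload(rnic, &p))
        return -1;
    return rnic_bind_qp_to_nvmf_offload(rnic, qp);
}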
Namespace / Controller Virtualization
Controller virtualization
Expose a single backend controller as multiple NVMf controllers
Multiplex the commands in the transport
Enables scaling the number of supported initiators
[Diagram: NVMe-oF offload front-end subsystem presenting namespaces 0-4]
Performance
Target Data Path with NVMf Target Offload and CMB (NVMe WRITE)
[Diagram: RDMA SQ/RQ in host memory; the RNIC submits NVMe commands over PCIe and places the data directly in the SSD's CMB]
The flow matches the offload path above, with the data placed directly in the CMB:
1. Fabrics command
2. Data fetch (into the CMB)
3. Submit NVMe command
4. Poll CQ
5. Doorbell
6. Command fetch (the data is already in the CMB)
7. NVMe completion; fabrics response