
Hardware Compute Partitioning on NVIDIA GPUs*

Joshua Bakita and James H. Anderson


Department of Computer Science, University of North Carolina at Chapel Hill
Email: {jbakita, anderson}@cs.unc.edu

Abstract—Embedded and autonomous systems are increasingly integrating AI/ML features, often enabled by a hardware accelerator such as a GPU. As these workloads become increasingly demanding, but size, weight, power, and cost constraints remain unyielding, ways to increase GPU capacity are an urgent need. In this work, we provide a means by which to spatially partition the computing units of NVIDIA GPUs transparently, allowing oft-idled capacity to be reclaimed via safe and efficient GPU sharing. Our approach works on any NVIDIA GPU since 2013, and can be applied via our easy-to-use, user-space library titled libsmctrl. We back the design of our system with deep investigations into the hardware scheduling pipeline of NVIDIA GPUs. We provide guidelines for the use of our system, and demonstrate it via an object detection case study using YOLOv2.

Fig. 1. With time-sharing, only one task can use the GPU at a time. This results in capacity loss when individual tasks cannot saturate the GPU compute cores (SMs) as in (a). Our work enables spatial partitioning, allowing the GPU to be subdivided among multiple tasks, reclaiming idle capacity as in (b).

I. INTRODUCTION

To counteract frequency-scaling limitations, processor manufacturers have increasingly turned to multiprocessing to meet ever-increasing computational demands. While well-studied for CPUs, this surge in parallelism has also affected computational accelerators such as GPUs. As such accelerators see increasing adoption in embedded and real-time systems, effective utilization of this parallelism is crucial to meet size, weight, power, and cost (SWaP-C) constraints.

Unfortunately, current GPU-management and scheduling approaches often treat the GPU as a single, monolithic device and enforce mutually-exclusive access control. This can result in severe capacity loss, and is akin to scheduling only one task at a time on a multicore CPU. Fig. 1(a) shows such losses on a five-compute-unit GPU.

Innovation in GPU scheduling has been held back by two issues. First, increasing GPU parallelism has not been partnered with a manufacturer-provided means to spatially partition compute units—the ability to run multiple tasks at once, with each task assigned to a mutually-exclusive subset of processing cores. Second, GPU hardware architecture and scheduling mechanisms are largely undocumented, preventing third parties from building efficient GPU-partitioning systems.

In this work, we present a novel mechanism for high-granularity spatial partitioning of GPU compute units on all NVIDIA GPUs from 2013 to the present day. This enables more efficient scheduling, as in Fig. 1(b). (We further explain this figure in Sec. II.) In this work, we further justify compute partitioning use by uncovering hardware parallelism and subdivisions of the memory and scheduling units in NVIDIA GPUs. We focus on NVIDIA GPUs due to their industry-leading architectures and adoption rates.

*Work was supported by NSF grants CPS 1837337, CPS 2038855, and CPS 2038960, ARO grant W911NF-20-1-0237, and ONR grant N00014-20-1-2698.

Prior work. Existing efforts to partition NVIDIA GPUs have been hamstrung by elusive documentation, and thus suffer unacceptable fragility, overheads, or programming-model constraints. One approach, used to implement a concept called “fractional GPUs” [1], [2], requires each GPU computation to be modified such that it cooperatively yields unallocated computing resources. Another approach, used in some recent GPU interference-analysis work [3], [4], injects custom “blocking kernels” to force subsequent work to execute on a subset of computing resources. Both of these approaches inherit many of the problems of cooperative multitasking, including a lack of isolation from misbehaving tasks. These approaches also curtail the permissible types of kernel launches, cause unavoidable instruction-cache interference, and, among other pitfalls, require extensive expert program modification.

Contributions. In this work, we:
1) Reveal a hardware mechanism for spatial partitioning of compute units in all NVIDIA GPUs since 2013.
2) Build and demonstrate a simple, effective, and portable GPU spatial partitioning API.
3) Provide heretofore unpublished details on the hardware scheduling pipeline of NVIDIA GPUs.
4) Detail heretofore unpublished architectural patterns used in NVIDIA GPUs, including the layout of and interconnects between compute and memory units.
5) Evaluate the limits of, and develop guidelines for, the effective use of spatial partitioning on NVIDIA GPUs.

6) Demonstrate via a case study how spatial partitioning can be applied to increase GPU utilization and reduce latency for a convolutional neural network (CNN).

Organization. We define key GPU terms in Sec. II before more thoroughly discussing relevant prior work in Sec. III. We detail our partitioning method and API in Sec. IV, and support the usefulness of our approach via an elucidation of GPU hardware in Sec. V. We evaluate our method and provide guidelines for its use in Sec. VI, demonstrate effectiveness via a case study in Sec. VII, and conclude in Sec. VIII.

II. BACKGROUND

To provide initial terms and context, we overview recent NVIDIA GPU architecture and the CUDA programming framework often used by GPGPU (General-Purpose GPU) tasks.

GPU architecture overview. GPUs are highly parallel accelerators, typically built of several discrete functional units that each have the internal capability for parallelism. Fig. 2 shows the internal functional units of one such recent NVIDIA GPU. The eight GPCs (General Processing Clusters),1 each consisting of sixteen SMs (Streaming Multiprocessors), collectively compose the GPU's compute/graphics engine.2 SMs are arranged in groups of two per TPC (Thread Processing Cluster).3 Each SM contains 64 CUDA cores. Other GPU engines include five asynchronous copy engines, three video encoding engines, one video decoding engine, and one JPEG decoding engine.4 Prior work has shown that these engines can operate with some degree of independence from the compute/graphics engine [6].

Fig. 2. Architecture of a recent NVIDIA discrete GPU (the Ampere GA100).

CUDA overview. In order to simplify programming these complex accelerators, NVIDIA developed the CUDA programming language and API. An example CUDA program, which adds two vectors A and B in parallel on the GPU, is shown in Alg. 1. Procedures executed on the GPU are called CUDA kernels. Comments in Alg. 1 describe each step. Note that even this simple example utilizes both the compute/graphics engine (Line 13) and a copy engine (Lines 11 and 14).

When a kernel is launched, the number of threads is specified as a number of thread blocks,5 and threads per block. In our example, for a 2,000-entry vector, one could set numBlocks = 2 and threadsPerBlock = 1000, as a block can contain no more than 1024 threads.

All CUDA applications run in their own memory address space, called a context, and time-division multiplexing is used to arbitrate among active CUDA and otherwise GPU-using applications (such as display tasks) by default [7].6

1 Formerly known as a Graphics Processing Cluster.
2 Due to manufacturing yields, at most seven of the eight GPCs are enabled in shipping products, with between seven and eight TPCs per GPC [5].
3 Formerly known as a Texture Processing Cluster.
4 Also known as PCE, NVENC, NVDEC, and NVJPG respectively.
5 Also known as Cooperative Thread Arrays (CTAs).
6 NVIDIA MPS or MiG can bypass this, but are only available in discrete and server GPUs respectively [8], [9]. Despite repeated calls from academia, NVIDIA has brought neither technology to its embedded chips.

Algorithm 1 Vector Addition in CUDA.

1: kernel VECADD(A: ptr to int, B: ptr to int, C: ptr to int, len: int)
2:   i := blockDim.x * blockIdx.x + threadIdx.x   ▷ Calculate index based on built-in thread and block information
3:   if i >= len then
4:     return                                     ▷ Exit thread if out of vector bounds
5:   end if
6:   C[i] := A[i] + B[i]
7: end kernel
8: procedure MAIN
9:   cudaMalloc(d_A, len)                         ▷ (i) Allocate GPU memory for arrays A, B, and C
10:  ...
11:  cudaMemcpy(d_A, h_A, len)                    ▷ (ii) Copy data from CPU to GPU memory for arrays A and B
12:  ...
13:  vecAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, len)   ▷ (iii) Launch the CUDA kernel on GPU
14:  cudaMemcpy(h_C, d_C, len)                    ▷ (iv) Copy results from GPU to CPU array C
15:  cudaFree(d_A)                                ▷ (v) Free GPU memory for arrays A, B, and C
16: end procedure
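Alg. 1 is pseudocode; the following is a minimal runnable CUDA sketch of the same program. The grid-size calculation, the initialization values, and the omission of error checking are our own illustrative additions and are not part of Alg. 1.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel corresponding to lines 1-7 of Alg. 1.
__global__ void vecAdd(const int *A, const int *B, int *C, int len) {
  int i = blockDim.x * blockIdx.x + threadIdx.x; // Index from built-in thread/block info
  if (i >= len)
    return;                                      // Exit thread if out of vector bounds
  C[i] = A[i] + B[i];
}

int main(void) {
  const int len = 2000;
  const size_t bytes = len * sizeof(int);
  int *h_A = (int *)malloc(bytes), *h_B = (int *)malloc(bytes), *h_C = (int *)malloc(bytes);
  for (int i = 0; i < len; i++) { h_A[i] = i; h_B[i] = 2 * i; }

  int *d_A, *d_B, *d_C;
  cudaMalloc((void **)&d_A, bytes);                     // (i) Allocate GPU memory
  cudaMalloc((void **)&d_B, bytes);
  cudaMalloc((void **)&d_C, bytes);
  cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // (ii) Copy inputs to the GPU
  cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

  // (iii) Launch: fix threads per block (at most 1024), round the block count up.
  const int threadsPerBlock = 1000;
  const int numBlocks = (len + threadsPerBlock - 1) / threadsPerBlock;  // = 2 for len = 2000
  vecAdd<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, len);

  cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);  // (iv) Copy result back
  printf("C[1999] = %d\n", h_C[1999]);                  // Expect 3 * 1999 = 5997
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);          // (v) Free GPU memory
  free(h_A); free(h_B); free(h_C);
  return 0;
}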

Inside a single CUDA application, multiple FIFO queues called streams can be used to allow work to use the GPU fully concurrently [10]. Combining all CUDA-using functionality into a single context with multiple streams is often done to avoid cases where several small applications, none of which can fully use the GPU, are exclusively timesliced [11]. A toy illustration of this is shown in Fig. 1, where combining multiple applications into a single context reduces GPU busy time by > 60%—if the GPU's SMs can be hardware partitioned. Without hardware partitioning, Job 1 would attempt to use two SMs when run alongside the others, preventing Job 2 from starting. With partitioning, we can force Job 1 to use a single SM for twice as long, yielding a more efficient overall schedule.

Real-time terminology. Tasks are composed of jobs, with a rate of release known as a period. After release, each job must complete by a subsequent deadline. If tasks release jobs exactly at the start of their period they are known as periodic. If, instead, the period only defines a minimum separation between releases, the task is known as sporadic. Tasks may have an associated criticality, which reflects their relative importance in the system. For example, in a self-driving car, a pedestrian-detection task would have high criticality, whereas in-screen display updates would have low criticality.

III. RELATED WORK

Hardware predictability is key in embedded and real-time systems. Since the advent of applying GPUs towards general-purpose compute tasks, researchers have sought to develop management approaches that enforce predictable behavior when a GPU is shared among applications.

One of the most notable early works in this area is TimeGraph [12], which uses an interception layer and GPU-interrupt-driven scheduler to serially arbitrate GPU work submission. This scheduling approach theoretically still applies to NVIDIA GPUs, but requires fully open-source drivers and does not allow for parallel executions on the GPU.

Another, simpler, but more broadly applicable approach to predictably share a GPU is to treat it as a resource to which accesses are protected by mutual-exclusion locking. The GPUSync framework [13] and extensions [6] can be considered exemplars of this approach. These works further innovate by allowing for predictable intra-GPU sharing. Through the observation that some auxiliary GPU units such as the copy engines can operate independently of compute work, these works allow for per-GPU-engine-granularity locking. This enables, for example, one application to perform GPU computations while another performs copies.

Subsequent to these works, significant effort has gone into allowing more predictable intra-GPU sharing, particularly of the GPU's primary compute/graphics engine. The recent work of Otterness and Anderson [14], [15] provided a breakthrough in this area, by allowing multiple applications to simultaneously share commodity AMD GPUs through transparent, hardware spatial partitioning of the main compute cores. This technique has yet to be extended to NVIDIA GPUs, with recent work remaining constrained to software-emulated partitioning [1]–[4], or conceptual analysis [16].

Significant supporting work to reverse engineer GPU design has accompanied GPU management efforts. Historically, commodity GPU designs have been shrouded in secrecy. Even instruction encodings—a commonly shared piece of information for any processor—have remained secret for GPUs [17]–[19]. Notable reverse engineering works include those of Otterness et al., Amert et al., and Olmedo et al. to expose the queue structure used for compute work in NVIDIA GPUs [10], [20], [21], the work of Capodieci et al. and Spliet and Mullins to clarify the preemption capabilities of NVIDIA GPUs [7], [22], and the work of Jain et al. to elucidate the memory hierarchy in NVIDIA GPUs [1]. Outside of the academic community, the Nouveau [23] and Mesa [24] reverse-engineered NVIDIA GPU driver projects provide crucial details on GPU-to-CPU interfaces.

Despite these efforts, we are aware of no published method that implements transparent spatial partitioning of the compute/graphics engine for NVIDIA GPUs.

IV. IMPLEMENTING COMPUTE PARTITIONING

We now discuss how we enable transparent hardware partitioning of compute units in NVIDIA GPUs, and overview our easy-to-use, portable, user-space API for doing so.

A. Partitioning on TPC Boundaries

We implemented partitioning by activating two existent, but little-known and unused fields in an obscure NVIDIA data structure known as a TMD (Task MetaData) used to launch GPU compute work [25] (we return in detail to this structure in Sec. V).7 The fields we uncovered are named SM_DISABLE_MASK_LOWER and _UPPER. These fields are publicly listed, from TMD structure V1.6 (Kepler) to V3.0 (Ampere) [26], but not used in any open-source software. As this is one of, if not the only, public references to an SM mask, we decided to explore the purpose and functionality of these fields. We refer to these fields as the SM mask going forward.

7 The TMD is called a QMD in some sources. QMD stands for Queue MetaData. This has been subsumed by the TMD, and old behavior can be accessed by flipping the IS_QUEUE TMD bit [25], [26].

Activating partitioning. Unfortunately, modifying a TMD created by the CUDA library is somewhat difficult—the TMD is almost immediately uploaded onto the GPU after construction. In order to modify the TMD, we discovered and utilized an undocumented CUDA debugging callback API to intercept and modify the TMD after construction, but before upload onto the GPU. Using this mechanism, we experimented with the SM mask and found, on older GPUs, it is a bitmask where an enabled bit indicates a disabled SM for the kernel described by the TMD. This can allow for spatial partitioning of SMs. Given a mutually exclusive partition configuration, partitioning can be enforced by setting the SM mask for every kernel such that every kernel in one partition is prohibited from running on any SM of another partition. We show such partitioning in Fig. 3.
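To make the mask encoding concrete, the helper below is our illustrative sketch (not part of any NVIDIA interface) for computing disable masks for mutually exclusive partitions on a small GPU; a set bit disables the corresponding SM for the kernel whose TMD carries the mask.

#include <stdint.h>

/* Build a mask that leaves only SMs [first, first+count) enabled for a kernel.
 * Bit i set means SM i is disabled. Assumes count < 64 and
 * first + count <= total_sms <= 64. */
static uint64_t sm_disable_mask(unsigned first, unsigned count, unsigned total_sms) {
  uint64_t allowed = ((1ull << count) - 1) << first;                  /* SMs this kernel may use */
  uint64_t all = (total_sms == 64) ? ~0ull : (1ull << total_sms) - 1;
  return all & ~allowed;                                              /* disable everything else */
}

/* Example for the 9-SM GTX 1060 3 GB split used in Fig. 3:
 *   sm_disable_mask(0, 5, 9) == 0x1e0  (K1 restricted to SMs 0-4)
 *   sm_disable_mask(5, 4, 9) == 0x01f  (K2 restricted to SMs 5-8) */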

Fig. 3. Partitioning a GTX 1060 3 GB into four- and five-SM partitions.

Fig. 4. An “SM mask” acting as a TPC mask on the Tesla P100 GPU.

The experiment in Fig. 3 involves two kernels, K1 and K2, which spin for a fixed amount of time. K2 is launched immediately after K1, and needs to complete about half as much work. Launch times are indicated by the colored arrows at the bottom. Each shaded square in this plot represents the execution of a block of a kernel, with height representing the number of resident threads, and width representing the length of execution. The jth block of kernel i is denoted Ki:j. We show CUDA's default behavior at left, and the behavior with an SM mask at right.

In the left plot, observe how K1's work is immediately distributed among all the SMs of the GPU upon launch, and that the blocks of K2 are prevented from starting execution until after K1's initial set of blocks completes. This pattern remains no matter the priority of K2, mirroring the findings of [10]. At right, we enable an SM mask of 0x1e0 on K1, and 0x01f on K2. With these masks, SMs 5-8 are disabled for K1, and SMs 0-4 are disabled for K2. Observe how this partitioning also allows for K2 to begin and finish earlier, as no blocks of K1 prevent K2 from starting on SMs 5-8.

TPCs, not SMs. The promising results of Fig. 3 led us to experiment on other, more recent, NVIDIA GPUs where we discovered an interesting discrepancy. We show this in Fig. 4. We again launch two kernels, but flip every-other bit in the SM mask, and run the experiment on two GPUs: first on the GTX 1060 3GB at left, then on the Tesla P100 at right. Based on the behavior in Fig. 3, we would expect our odd/even SM mask to disable every odd SM for K1, and every even SM for K2. This expectation holds only on the left. With the P100, the pattern repeats at double the period. Why?

The GTX 1060 and Tesla P100 are both based on NVIDIA's Pascal architecture, but use dies configured in a slightly different way. The P100 and newer GPUs are configured more similarly to the GA100 in Fig. 2, with two SMs per TPC, whereas the GTX 1060 and older GPUs have a one-to-one SM to TPC ratio. This key difference, and substantial supporting results, points to the “SM mask” actually serving as a TPC mask to the GPU. Presumably, the name of the field in the TMD is inherited from the days when the number of TPCs and SMs were identical, allowing for more term interchangeability.

B. Partitioning on GPC Boundaries

Fig. 5. Illustration of floorsweeping asymmetrically impacting GPU layout and TPC numbering for two GP106 dies.

Many GPU resources are instantiated on a per-GPC basis (more on this in Sec. V-C), so it may be desirable to obviate contention by aligning GPU partitions to GPC boundaries. Earlier work on AMD GPUs [14] assumed a constant mapping of compute units to shader engines (a GPC's AMD equivalent). NVIDIA GPUs are different. We illustrate this for two instances of the GTX 1060 3 GB's GP106 die in Fig. 5. Disabled TPCs are crossed out, and the remaining TPCs are renumbered. Note how both dies have nine enabled TPCs, but not the same TPCs. The first TPC of GPC 1 is disabled at right, but not at left. Instead the first TPC of GPC 0 is disabled. This causes TPC 4 to be in GPC 0 at right, but in GPC 1 at left. In our experiments, we find this pattern repeated across most NVIDIA GPUs from the past several years. Why would two ostensibly identical GPUs have such different internal configurations?

This difference between dies stems from floorsweeping, a technique increasingly employed by NVIDIA in recent years. As GPU dies have grown in size to meet ever-increasing computational demands, the likelihood of die manufacturing errors has increased. Floorsweeping enables use of these imperfect dies by fusing off (“sweeping” away) the defective parts, allowing the remaining parts to function normally. As die manufacturing errors are randomly distributed, the disabled units vary from die to die, as in Fig. 5.
Coupled with floorsweeping, dies are binned based on their number of defective parts. For the GP106 die shown in Fig. 5, if no TPCs are defective it is sold as a GTX 1060 6 GB, if at most one TPC is defective it is sold as a GTX 1060 3 GB, if at most two TPCs are defective it is sold as a Quadro P2000, and so on. Dies beyond a defect threshold are discarded.

When partitioning TPCs, each bit in the mask corresponds to a TPC after they have been renumbered, as in Fig. 5. This means that TPC to GPC mappings are necessary to partition on GPC boundaries across dies. We determined which GPU registers contain this information, and built a Linux kernel module named nvdebug to extract and expose this via an interface in /proc.8

C. Our API

Combining together the capability to partition TPCs and GPCs, we present our flexible user-space library and API, titled libsmctrl, written in C. We detail our API in Table I, with function names and parameters in the first column, function description in the next, and finally the prerequisites for use of the function. We support setting the TPC mask at global, per-stream, and per-kernel levels, where setting a mask at higher granularity will override a lower-granularity mask. This allows for most TPCs to be idled by default, with access granted only to explicitly permitted streams or kernels. Our library supports aarch64 and x86_64 CPUs, and CUDA versions stretching as far back as 2016. Our library is fully user-space, and compiled binaries are portable to any recent Linux kernel and NVIDIA driver. When active, our library adds only a few instructions of overhead onto each kernel call. Our implementation primarily relies on an undocumented CUDA callback API, but we employ several other techniques on older CUDA versions.

TABLE I
API OF OUR LIBSMCTRL LIBRARY.

Library Function | Description | Supported On
libsmctrl_set_global_mask(uint64_t mask) | Set TPCs disabled by default across the entire application. A set bit in the mask indicates a TPC is to be disabled globally. | CUDA 10.2 through 12.1
libsmctrl_set_stream_mask(cudaStream_t stream, uint64_t mask) | Set TPCs disabled for all kernels launched via stream (overrides global mask). | CUDA 8.0 through 12.1
libsmctrl_set_next_mask(uint64_t mask) | Set TPCs to be disabled for the next kernel launch from the caller's CPU thread (overrides global and per-stream masks, applies only to next launch). | CUDA 11.0 through 12.1
libsmctrl_get_gpc_info(uint32_t* num_gpcs, uint64_t** tpcs_for_gpc, int dev) | Get number of GPCs for device number dev, and a GPC-indexed array containing masks of which TPCs are associated with each GPC. | Compute Capability 5.0 and up (nvdebug module required)
libsmctrl_get_tpc_info(uint32_t* num_tpcs, int dev) | Get total number of TPCs on device number dev. | Compute Capability 3.5 and up

Basic partitioning. An example of our API in a CUDA program is shown in Listing 1. We assume a nine-TPC GPU, but a real program should compute the mask dynamically after using libsmctrl_get_tpc_info() to get the number of TPCs. For most programs, we recommend disabling most TPCs by default—CUDA may implicitly launch internal kernels to support some API calls and, if no default mask is set, those calls may interfere with your partitions. In Listing 1 we set the global default on Line 3, permitting work to run only on TPC 0 unless otherwise specified. It is possible to disable all TPCs by default (with a mask of ~0ull),9 but we recommend against this, as it causes kernels launched with the default TPC mask to hang indefinitely (including CUDA-internal ones).

Continuing on in the example of Listing 1, we create CUDA streams on Lines 7–9 to allow for concurrent kernel launches (prior work [10] discusses why this is necessary), and set the TPCs these streams can use on Lines 12 and 16. We allow otherStream to use TPCs 0–4, and urgentStream to use TPCs 5–8. This allows us to launch long-running and difficult-to-preempt kernels in otherStream, while still being able to immediately start sporadic work via urgentStream as it arrives. Our API is highly flexible, and supports many other usage patterns not detailed here.

1  int main() {
2    // Allow work to only use TPC 0 by default
3    libsmctrl_set_global_mask(~0x1ull);
4    ...
5    // Stream-ordering is still respected with
6    // partitioning, so avoid the NULL stream
7    cudaStream_t urgentStream, otherStream;
8    cudaStreamCreate(&urgentStream);
9    cudaStreamCreate(&otherStream);
10   // Override the global default settings
11   // Allow otherStream to use the first 5 TPCs
12   libsmctrl_set_stream_mask(otherStream,
13                             ~0x1full);
14   long_kernel<<<2048, 1024, 0, otherStream>>>();
15   // Allow urgentStream to use the last 4 TPCs
16   libsmctrl_set_stream_mask(urgentStream,
17                             ~0x1e0ull);
18   // Launch short, sporadic work as it arrives
19   bool done = 0;
20   while (!done) {
21     wait_for_work_or_done(&done);
22     sporadic_work<<<32, 32, 0, urgentStream>>>();
23     cudaStreamSynchronize(urgentStream);
24   }
25 }

Listing 1. Example usage of partitioning API (9 SM GPU).

8 See documentation linked at https://www.cs.unc.edu/~jbakita/rtas23-ae/.
9 Postfix ull indicates a 64-bit literal, and ~ is bitwise inversion.
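Because TPC counts differ across GPUs, and even across dies of the same model (Fig. 5), a portable program can derive its masks at runtime. The sketch below uses only the Table I functions; it assumes the library header is named libsmctrl.h and that restricting this stream to roughly the first half of the TPCs is the desired policy.

#include <stdint.h>
#include <cuda_runtime.h>
#include "libsmctrl.h"  /* assumed header name for our library */

/* Restrict a stream to roughly the first half of the GPU's TPCs.
 * Bit i set in a libsmctrl mask disables TPC i (post-renumbering, as in Fig. 5). */
static void reserve_first_half(cudaStream_t stream, int dev) {
  uint32_t num_tpcs;
  libsmctrl_get_tpc_info(&num_tpcs, dev);       /* e.g., 9 on a GTX 1060 3 GB */
  uint32_t my_tpcs = num_tpcs / 2;
  uint64_t enabled = (1ull << my_tpcs) - 1;     /* TPCs 0 .. my_tpcs-1 */
  libsmctrl_set_stream_mask(stream, ~enabled);  /* disable all others */
}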

GPC partitioning. To support partitioning on GPC boundaries, we provide an easy-to-use wrapper function around our nvdebug kernel module, and include a brief example of this in Listing 2. This example allocates the TPCs of two GPCs to the next kernel launch. Line 4 obtains a num_gpcs-length array of bitmasks. Array index i is associated with GPC i, and if bit j is set in the mask in that entry, TPC j is associated with GPC i. A TPC may only be associated with one GPC. On Lines 10–12 we combine the bitmasks of the TPCs for GPCs 0 and 1, then apply the mask on Line 16.

1  // Get mask of enabled TPCs for each GPC
2  uint32_t num_gpcs;
3  uint64_t *tpcs_for_gpc;
4  libsmctrl_get_gpc_info(&num_gpcs,
5                         &tpcs_for_gpc, 0);
6  assert(num_gpcs >= 2);
7
8  // Allow the next kernel to run on any TPC in
9  // the first two GPCs
10 uint64_t tpc_mask = 0;
11 tpc_mask |= tpcs_for_gpc[0];
12 tpc_mask |= tpcs_for_gpc[1];
13
14 // The above lines created a bitmask of TPCs to
15 // enable, so invert to get the disable mask
16 libsmctrl_set_next_mask(~tpc_mask);

Listing 2. Example usage of GPC information API.

D. Limitations

Our system works by intercepting TMDs after construction on the CPU, but pre-GPU-upload. As such, it will not work for kernels with TMDs constructed and executed from the GPU without the involvement of the CPU, such as when CUDA Dynamic Parallelism (CDP) or CUDA Graphs are used. Further, we cannot partition GPUs with over 64 TPCs, as all versions of the TMD structure presently contain only 64 bits for partitioning. No shipping GPU has that many TPCs, but NVIDIA's upcoming H100 allegedly contains up to 72. Finally, our current library does not support per-GPU global partition configuration, but could easily be extended to do so.

V. HOW NVIDIA GPUS SCHEDULE COMPUTE

The usefulness of GPU partitioning hinges on the ability of GPU hardware to supply jobs and data to each partition in parallel. In this section, we take a deep dive into the architectural patterns of NVIDIA GPUs, with a particular focus on how they allow for parallelism at every step of a kernel's journey from user-space dispatch to completion. We first briefly consider our sources of information before delving into the full pipeline, as numbered in Fig. 6.

Fig. 6. High-level GPU hardware compute scheduling pipeline: ① main (CPU) memory, ② Host Interface, ③ Compute Front End, ④ Task Management Unit (TMU; see Fig. 7), ⑤ Work Distribution Unit (WDU; see Fig. 8), and ⑥ the execution engine and memory (see Fig. 9).

A. Sourcing Architectural Details

We gleaned architectural information from cross-referenced public sources. NVIDIA's patents cover the implementation of GPU compute priorities [27], block distribution logic [28], [29], preemption design [30], and more [25], [31]–[34]. As patents can describe non-existent or infeasible inventions, we verified whether the patents discuss hypothetical or actual designs by cross-referencing them with NVIDIA's open-source nvgpu driver [35], NVIDIA's public headers [26], the work of the Nouveau and Mesa projects [23], [24], and other sources.

As an example of our validation process, consider NVIDIA's patent for TMD error checking [25]. Part of this patent describes, in order, the fields of the TMD structure. A publicly available header,10 used in the Mesa project,11 also provides the ordering and names of TMD fields. These sources match, indicating that the patent likely describes real hardware. We applied this process, or performed verification experiments, to attain the following content.

As the following information is generally a synthesis of all the above sources, we omit redundant inline citations. To the best of our knowledge, our discussion is accurate for NVIDIA GPUs from Kepler through at least Ampere.

B. The Compute Scheduling Pipeline

We now begin following the path of a single kernel through the GPU scheduling pipeline. Formally, we define a compute job on the GPU as a single kernel launch.

10 classes/compute/clc0c0qmd.h in open-gpu-doc [26].
11 gallium/drivers/nouveau/nvc0/clc0c0qmd.h in Mesa [24].
① Kernel instantiation. On NVIDIA GPUs, kernels are internally described via the aforementioned TMD structure. Beyond the already-discussed properties, it describes the thread blocks, threads per block, and shared memory resources needed for the kernel, and includes the address of the kernel's entry point (among many other fields, such as priority). After a user-space library (such as CUDA or OpenCL) constructs a TMD, it enqueues a launch command containing a pointer to the TMD into the next available command slot in a command segment. A command segment is a contiguous block of memory containing GPU commands.

The TMD is used as a kernel descriptor and handle throughout the scheduling pipeline, persisting until all computations of the GPU kernel are complete.

② Host Interface.12 The GPU's Host Interface bridges the gap from CPU to GPU for the TMD. This unit contains one or more Pushbuffer Direct Memory Access units (PBDMAs), along with context-switching control logic, among other sub-units. The PBDMAs load commands from user space via an indirect buffer, which is a circular buffer of pointers to command segments.13 These structures, shown at the top of Fig. 6, allow user-space applications to directly dispatch commands to the GPU without system call or driver overheads. The PBDMA units individually load, parse, and cache commands at a rate much faster than that of most GPU engines. While in our illustration we show GPU command queues as located in CPU memory (as is most common), they can also be located in GPU memory. After command acquisition and parsing, the Host Interface forwards the command to the appropriate engine. In the case of a kernel launch, it passes the TMD to the Compute Front End.

③ Compute Front End. The Compute Front End relays TMD pointers from ②, the Host Interface, to ④, the Task Management Unit. Further, we understand that this unit can orchestrate context switches in subsequent units.14 While this unit processes TMDs at a rate decoupled from the others, it may cause work to queue if the incorrect context is active. We focus on a single context in this work, and so do not further consider said issue. Further details on this unit are elusive.15

④ Task Management Unit (TMU).16 Illustrated in Fig. 7, the Task Management Unit queues TMDs by priority and arrival order until ⑤, the Work Distribution Unit, is ready to receive them. Due to the explicit scheduling responsibilities of the TMU, we investigate it in depth.

Fig. 7. A three-priority task management unit with five queued TMDs.

The TMU is built around a series of priority-level, singly-linked lists, with one linked-list head and tail pointer tracked for each priority level. Each list is exclusively composed of TMDs. In Fig. 7, we show a unit supporting three priority levels, with TMDs illustratively numbered by arrival order. Three TMDs are in the highest-priority-level list, none are in the medium-priority-level list, and two TMDs are in the lowest-priority-level list. Each priority-level list is formally called a TMD Group.

These lists allow the TMU to reorder TMDs that it receives, such that higher-priority ones skip ahead of lower-priority ones. Upon TMD arrival from the Compute Front End, or elsewhere,17 the TMU reads the TMD's GROUP_ID field, and appends the TMD to the tail of the specified TMD Group.18 For example, in Fig. 7, TMD4 is the most-recently arrived high-priority TMD, and TMD2 is the most-recently arrived low-priority TMD.

As unit ⑤ in the scheduling pipeline signals readiness for another TMD, the TMU removes and passes the head of the highest-priority non-empty list. This is illustrated with the bold outlines and “Next TMD” box in Fig. 7. Given the TMU state of Fig. 7, TMDs would exit the TMU in the following order: TMD1, TMD3, TMD4, TMD0, and then TMD2.

The TMU is the final unit which can receive TMDs at a rate decoupled from the TMD completion rate. This causes TMDs to eventually accumulate in the TMU whenever kernels are dispatched by user space at a rate faster than they can complete.
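The TMU's dispatch policy can be summarized by a small software model. The sketch below is our simplification (the real unit is hardware, uses linked lists, and tracks far more state); it replays the Fig. 7 example and prints the exit order given above.

#include <stdio.h>

#define NUM_PRIOS 3   /* three priority levels, as in Fig. 7 */
#define MAX_TMDS  16

/* One FIFO ("TMD Group") per priority level; higher index = higher priority. */
static int groups[NUM_PRIOS][MAX_TMDS];
static int head[NUM_PRIOS], tail[NUM_PRIOS];

static void tmu_add(int tmd, int prio) {          /* append at the tail of its TMD Group */
  groups[prio][tail[prio]++] = tmd;
}

static int tmu_next(void) {                       /* pop the head of the highest-priority non-empty group */
  for (int p = NUM_PRIOS - 1; p >= 0; p--)
    if (head[p] < tail[p])
      return groups[p][head[p]++];
  return -1;                                      /* TMU empty */
}

int main(void) {
  /* Arrival order and priorities from Fig. 7 (0 = low, 2 = high). */
  tmu_add(0, 0); tmu_add(1, 2); tmu_add(2, 0); tmu_add(3, 2); tmu_add(4, 2);
  for (int t = tmu_next(); t != -1; t = tmu_next())
    printf("TMD%d ", t);                          /* prints: TMD1 TMD3 TMD4 TMD0 TMD2 */
  printf("\n");
  return 0;
}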

12 Also known as “FIFO” in Nouveau.
13 A “pushbuffer” is a combination of these segments and pointers.
14 Via the Front-End Context Switch (FECS) unit. See Spliet et al. [22].
15 The Compute Front End (COMP FE), like the better-documented and similarly-situated Graphics Front End (GFX FE), likely also configures ⑥, the Execution Engine and Memory, but this is difficult to confirm.
16 More recently called the “Scheduler Unit” or “SKED” in patents [33], [34], and known in marketing text as the Grid Management Unit (GMU) [36].
17 The TMU may receive new TMDs directly from GPU compute units when features such as CUDA Dynamic Parallelism are employed.
18 Unless field ADD_TO_HEAD_OF_QMD_GROUP_LINKED_LIST is true.
19 Also known as the CUDA Work Distributor (CWD) in marketing literature [36], and as the Compute Work Distributor in a recent patent [28].

Fig. 8. A six-slot work distribution unit on a four-TPC GPU.

⑤ Work Distribution Unit (WDU).19 Illustrated in Fig. 8, the Work Distribution Unit dispatches TMDs from task slots
[34], and known in marketing text as the Grid Management Unit (GMU) [36].

to available TPCs. The number of task slots is hardware-limited,20 forcing the WDU to make scheduling decisions with incomplete information, e.g., when the number of ready kernels exceeds the number of task slots.

The WDU signals unit ④, the TMU, for a new TMD whenever a task slot becomes available. It then inserts the TMD into a task slot in the Task Table, and inserts a reference to the TMD in the Priority-Sorted Task Table. The priority-sorted table is ordered first by priority, then by arrival time. The Load Balancer dispatches thread blocks from the TMD at the head of the priority-sorted table. Once all of a TMD's blocks are launched, it is removed from the priority-sorted table, but left in the Task Table until all its blocks complete.

Fig. 8 illustrates the WDU state shortly after receiving the TMDs of Fig. 7, and after dispatching all blocks of TMD1. We reflect which kernels are executing on which TPCs via shading of the TPC Resource Trackers. These units relay states and commands between the TPCs and the WDU. Note how TMD1 is still executing (light hatching) on the TPCs, and so remains in the Task Table. TMD3, as the highest-priority and earliest-arrived TMD, is now dispatching blocks to TPCs. TMD4 is waiting for TMD3 to dispatch all its blocks before it can move to the head of the priority table and dispatch blocks.

This ordering and dispatch process can be disturbed by two things: TPC partitioning, and pending higher-priority work in ④, the TMU. When TPC partitioning is in use, if a TPC has available capacity, but the TMD at the head of the sorted table is prohibited from executing on that TPC, the WDU will skip forward in the table until it finds a TMD allowed to execute on the available TPC (or reaches the end of the table). If, for example, TMD3 was prohibited from executing on TPC4, the Load Balancer could instead dispatch blocks from TMD4.

When all WDU task slots are occupied, and any task slot contains a TMD with priority lower than any pending in ④, the TMU, the lower-priority TMD will be evicted from the WDU and replaced with the higher-priority TMD. When this occurs, the evicted TMD stops dispatching blocks and completes blocks already in progress. The evicted TMD is then inserted into the head of the respective priority-level list in the TMU. We will further explore what this means for real workloads in Sec. VI.

This concludes our coverage of the TMD from construction to dispatch. We now consider aspects of GPU architecture that are relevant while a thread block executes.

C. Memory Parallelism

Inherited from the high-bandwidth needs of graphical-processing tasks, GPUs are massively parallel in ways far beyond the high number of CUDA cores. This subsection extends the background of Sec. II with new details on the parallelism of GPU caches, interconnects, and DRAM controllers.

⑥ Execution engine and memory subsystem. We illustrate the data and instruction pipelines, from L1 caches to DRAM chips, for the execution engine in Fig. 9. An “I” (resp. “D”) postfix on a cache indicates an instruction (resp. data) cache. Starting at the bottom in Fig. 9, consider the Memory Partition Units. These are typically configured as one-per-DRAM chip, with each partition unit containing a DRAM controller and a subset of the L2 cache. Each partition unit independently connects to every GPC via a crossbar bus, such that partitioning Memory Partition Units will also partition the crossbar bus and L2.

Fig. 9. Topology of GPU execution engine and memory subsystem.

The crossbar bus links inside each GPC with a Memory Management Unit (MMU, with associated Translation Lookaside Buffer, TLB) for virtual memory support.21 This connects to L1 data caches in each SM, and to a GPC-wide L1.5I cache. The L1.5I cache respectively feeds per-SM instruction caches. In all, each GPC can theoretically be configured to operate with an exclusive subset of the GPU's cache, bus, and DRAM resources (if Memory Partition Units are partitioned, as in [1]). Further, each SM can operate from its L1 cache without generating interference.

To summarize this section, we find GPU hardware to be more than capable of feeding jobs to many partitions simultaneously, and also find it capable of supplying partitions with uncontended cache, bus, and DRAM resources.

VI. EVALUATING COMPUTE PARTITIONING

We now evaluate spatial compute partitioning to assess its scalability, its intersections with hardware scheduling units, and its usefulness compared to prior work.

A. Methodology

Our experiments utilized a variant of the cuda_scheduling_examiner framework integrated with our libsmctrl library. Unless otherwise noted, all experiments ran without other processes competing for the GPU. We found minimal impact due to host platform or Linux kernel version, so we do not control for those variables in our experiments.

20 Documented as the “Maximum number of resident grids per device” in Table 15 of the CUDA C++ Programming Guide [37].
21 Some GPUs instead put the MMU in the Memory Partition Unit.

Fig. 10. Complex TPC partitioning on the GTX 970.

B. Compute Partitioning: Flexible and Portable?

At this point, we have only shown TPC partitioning in small experiments involving few kernels. How does it scale to systems of many kernels and compute partitions?

We experimented with complex arrangements of the GPUSpin and multikernel benchmarks, finding no apparent limits on the number or arrangement of TPC partitions allowed, discovering no ways TPC partitioning could compromise the semantics of CUDA, and encountering few portability limits.

We plot a representative experiment in Fig. 10. This plot shows 38 kernels, across 17 unique TPC partitions and two streams. This experiment demonstrates per-kernel partitioning, stream-ordering, and complex partitioning. We discuss these aspects before considering how they apply to GPUs beyond the GTX 970 used in Fig. 10.

Per-kernel partitions. In the experiment of Fig. 10, every kernel is assigned a different partition than its predecessor. For example, consider Stream 2 (red). In the “T” at the bottom, K7 runs two blocks in a partition including only SM 5, whereas the subsequent K8 uses a partition including SMs 1-5. K7 and K8 (along with all other kernels) are released just before the 1.0s mark, but the partition is encoded into the kernel's TMD—not as a global GPU property—so it automatically takes effect when the kernel begins.

Preservation of stream-ordering. The GPU preserves the ordering of kernels in streams, even when a subsequent kernel may require a mutually exclusive set of SMs. For example, in Stream 2 in Fig. 10, K9 only needs SM 5, and K10 only needs SM 10. However, K10 does not start until K9 completes. CUDA stream semantics require subsequent kernels in the same stream to not launch until all prior ones have been completed—we find that to be preserved, even when TPC partitioning is in use.

Complex and overlapping partitions. TPC partitions may contain holes or overlap, and have no size restrictions. (By holes, we mean non-contiguous sets of TPCs.) For example, the “E” in Stream 1 of Fig. 10 features holes in K8 and K9's partitions, which are the non-adjacent SMs 7, 9, and 11. For an example of overlap, throughout the rest of Stream 1, we allow the partitions of each kernel to fully overlap with those of Stream 2, creating the mix of kernels from different streams in the “R”, “A”, and “S” periods. We see no variation in the behavior as the partition size is varied, with between one and ten SMs per partition in Fig. 10.

Portability. Fig. 10 shows an experimental result from a single GPU; however, we find this behavior to hold on every recent NVIDIA GPU. Specifically, any GPU of Compute Capability 3.5 (2013) or greater, and CUDA 8.0 (2017) or newer, supports TPC partitioning via our libsmctrl library. This includes embedded GPUs, such as that in the ARM64-based NVIDIA Xavier System-on-Chip. We list all specific GPUs we tested in Table II.

TABLE II
GPUS TESTED IN OUR EXPERIMENTS

GPU Name | Compute Capability | Architecture
Tesla K40 | 3.5 | Kepler
GTX 970 | 5.2 | Maxwell
Jetson TX1 | 5.3 | Maxwell (embedded)
Tesla P100 | 6.0 | Pascal
GTX 1060 3 GB | 6.1 | Pascal
GTX 1070 | 6.1 | Pascal
GTX 1080 | 6.1 | Pascal
Jetson TX2 | 6.2 | Pascal (embedded)
Titan V | 7.0 | Volta
Jetson Xavier | 7.2 | Volta (embedded)
Tesla T4 | 7.5 | Turing
A100 40 GB | 8.0 | Ampere

C. When Hardware Scheduling Breaks Down

In Sec. V, we discussed how much of the GPU, including most parts of the scheduling pipeline, is highly parallel and unlikely to hinder multiple compute partitions from running simultaneously. In this subsection, we examine the narrow cases when it does prevent partitions from running in parallel.

The consequences of greedy assignment. The WDU, the last step in the scheduling pipeline, dispatches blocks from kernels in global priority and then time-of-arrival order, with limited situations in which it will skip ahead. This greedy assignment approach can result in particularly non-optimal block assignments when overlapping partitions are in use.

We demonstrate one such case in Fig. 11. Both subfigures have identical configurations; the GPU is split into two partitions: SMs 0-8 for Stream 1, and SMs 0-3 for Stream 2. K1 and K2 are GPUSpin benchmarks launched simultaneously with identical configurations in both subfigures. In the desired outcome, at right, both kernels complete all their blocks before time 0.5s. However, this outcome only occurs approximately 50% of the time. The result at left occurs otherwise.

In the experiment of Fig. 11, K1 and K2 race to the WDU. Whichever reaches the WDU first has all its blocks dispatched first. When K1 arrives first, its blocks are dispatched across every TPC in its partition according to a previously researched assignment algorithm [21]. This saturates all the TPCs in K2's partition, forcing it to wait until K1 completes before running.

Fig. 11. Serial greedy block dispatch yielding unpredictable and bad assignments when overlapping partitions are used on the GTX 1060 3 GB.

If K2 arrives first, its blocks are dispatched across its partition, then K1's across its, resulting in a more-efficient assignment.

The consequences of task slot exhaustion. The constrained number of Task Slots is one of the few limits in the hardware scheduling pipeline in NVIDIA GPUs. Effectively, the number of Task Slots limits the number of kernels that the WDU can simultaneously consider for dispatch onto the GPU.

Task slots fill in priority order, with lower-priority tasks evicted as higher-priority ones become available in the TMU. This can cause unrelated compute partitions to block one another when many streams and stream priorities are in use. We illustrate one such case for the GTX 1060 3GB, which has 32 Task Slots, in Fig. 12.

Fig. 12. Task Slot exhaustion on the GTX 1060 3GB causing an unrelated partition to block K33 from dispatching blocks.

Fig. 12 includes 33 streams, each containing a single, respectively numbered kernel. The GPU is partitioned roughly in half: Streams 1-32 on SMs 0-3, and Stream 33 on SMs 4-8. K33 consists of 60 750-ms blocks, whereas K1-K32 each consist of one 2.5-s block. All kernels are derived from the GPUSpin benchmark. Streams 1-32 are higher priority than Stream 33.

Observe how K33 stops dispatching new blocks shortly after time 0.2s, not resuming until time 0.4s. K33 is in an independent stream and partition, but gets evicted from the WDU by the TMU after K1-K32 arrive, as they are higher priority. While these 32 kernels wait to complete execution, no Task Slots are available, and K33 must wait in the TMU, unable to dispatch blocks. Once one of the higher-priority kernels completes, as they do around time 0.4s, their Task Slot is vacated, allowing K33 to take it and once more resume dispatching blocks.

In order to prevent this sort of blocking across unrelated partitions, we recommend using no more CUDA streams than the GPU has WDU slots, or prohibiting the use of CUDA stream priorities across all partitions. As long as the number of pending kernels does not exceed the number of WDU slots, eviction will not occur, nor will it occur if all kernels have identical priority.

D. Evaluating Partitioning Strategy

The principal preexisting work on spatial partitioning of GPU compute units found that the assignment of compute units to partitions is crucial on AMD GPUs, as adding compute units to partitions may slow the partition in some cases [14]. That work proposed and evaluated two partitioning strategies: SE-packed, and SE-distributed. SE stands for Shader Engine, and is equivalent to a GPC in our context. The SE-packed algorithm attempts to allocate all TPCs from the same GPC to the partition before expanding the partition across multiple GPCs. Alternatively, the SE-distributed algorithm attempts to distribute the TPCs of each partition across GPCs as evenly as possible. As this language has been adopted by other work [16], we continue its use here.

To evaluate if any similar slowing effects occur on NVIDIA GPUs, we port the relevant portions of the hip_plugin_framework used in [14] to our variant of the cuda_scheduling_examiner and reproduce their experiment, with the results shown in Fig. 13.

In Fig. 13, we mirror the figure layout of prior work ([14], Fig. 9), including how long a matrix multiply takes for each partition size under each assignment algorithm. Unlike on AMD GPUs, we find that adding an additional compute unit (TPC) to a partition does not slow that partition, no matter whether the unit is assigned from the same, or a different, GPC. This is evident by the near-identical alignment of execution times for each algorithm. However, this finding does not obviate considerations around inter-task interference, and so we recommend using the SE-packed assignment strategy to minimize intra-GPC cache and bus interference.
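As a companion to this recommendation, the sketch below is ours; it assumes only the libsmctrl_get_gpc_info() data layout from Table I and Listing 2. It builds an SE-packed enable mask of n TPCs by exhausting one GPC before moving to the next; inverting the result gives the disable mask that libsmctrl expects.

#include <stdint.h>

/* SE-packed allocation: gather n TPCs GPC-by-GPC, lowest-numbered TPCs first.
 * tpcs_for_gpc[g] has bit j set if TPC j belongs to GPC g (see Listing 2). */
static uint64_t se_packed_enable_mask(uint32_t n, uint32_t num_gpcs,
                                      const uint64_t *tpcs_for_gpc) {
  uint64_t enabled = 0;
  for (uint32_t g = 0; g < num_gpcs && n > 0; g++) {
    uint64_t remaining = tpcs_for_gpc[g];
    while (remaining && n > 0) {
      uint64_t lowest = remaining & -remaining;  /* isolate lowest-numbered TPC */
      enabled |= lowest;
      remaining &= ~lowest;
      n--;
    }
  }
  return enabled;
}

/* Usage: libsmctrl_set_stream_mask(stream, ~se_packed_enable_mask(4, num_gpcs, tpcs_for_gpc)); */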

TABLE III
YOLOV 2 RUNTIMES
Case Competitor Partition Mean Min Max St. Dev.
1 None None 24 20 53 2.7
2 YOLOv2 None 40 35 69 2.8
Malfunctioning
3 None 82 73 111 2.9
YOLOv2
Malfunctioning
4 50/50 45 35 157 22
YOLOv2

When run alone on 8 TPCs,25 YOLOv2 takes 24 ms on


average to process a single frame (Case 1, Table III). When
combined with a second YOLOv2 instance, this time nearly
Fig. 13. SE-packed vs SE-distributed on the Titan V GPU.
doubles to 40 ms (Case 2, Table III)—an unsurprising result
VII. CASE STUDY

To demonstrate the real-world usefulness of our GPU partitioning framework, we perform a case study in which we apply TPC partitioning to a GPU shared by two instances of the YOLOv2 [38] convolutional neural network (CNN) trained for object detection.

Background and methodology. Object detection systems form a critical component of camera-based perception for autonomous systems. These systems take images as input, detect objects in them, and output bounding rectangles and labels for each object. Transformer-based systems [39] can reach higher accuracy than CNNs such as YOLOv2, but the comparative speed advantage of YOLOv2-like approaches has kept them relevant in embedded computing applications. We consider YOLOv2 for these reasons, and because it is built on the DarkNet framework, a simple and easily-modifiable library. We modify a DarkNet variant (https://github.com/AlexeyAB/darknet) to support running multiple instances of YOLOv2 in parallel in different CUDA streams, to support setting TPC masks for each stream via libsmctrl, and to integrate with LITMUSRT. LITMUSRT is a patch to the Linux kernel that adds support for more formal real-time schedulers than what Linux provides by default [40]–[42]. Our host system (a 16-core AMD 3950X with 32 GiB of RAM running LITMUSRT 5.4.224, NVIDIA driver 520.56.06, and CUDA 11.8) has plentiful CPU cores, so we use the partitioned fixed-priority (P-FP) LITMUSRT scheduler and pin each of our instances of YOLOv2 onto a separate core. We configure YOLOv2 such that the image copy-in, detection pass, and result copy-out sequence form a single job. We run all instances on the NVIDIA Titan V GPU, with TPC partitioning configured as specified. Images are loaded from disk in parallel threads. We configure each job with a period of 420 ms (a large period provides margin for our parallel image loading from disk), release jobs of all instances synchronously, and run 10,000 jobs, using a different image, drawn from the PascalVOC 2012 dataset [43], as input on each iteration. Job lengths are timed using a library from prior work [44].
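To make this setup concrete, the sketch below shows one way two instances could each be confined to a disjoint set of TPCs by attaching a mask to each instance's CUDA stream. It is a minimal sketch, not our actual integration: it assumes libsmctrl exposes a per-stream call of the form libsmctrl_set_stream_mask(stream, mask), in which a set bit disables the corresponding TPC (the exact prototype and mask semantics should be checked against the library's header), and run_yolov2_frame() is a hypothetical stand-in for a single detection job in our modified DarkNet.

/* Sketch: confine two YOLOv2 instances to disjoint 4-TPC partitions.
 * Assumption: libsmctrl provides libsmctrl_set_stream_mask(stream, mask),
 * where a set bit in `mask` disables that TPC for work launched into
 * `stream` (verify against libsmctrl.h). run_yolov2_frame() is a
 * hypothetical wrapper around one detection job. */
#include <stdint.h>
#include <cuda_runtime.h>
#include <libsmctrl.h>

extern void run_yolov2_frame(cudaStream_t stream); /* hypothetical */

int main(void) {
    cudaStream_t stream_a, stream_b;
    cudaStreamCreate(&stream_a);
    cudaStreamCreate(&stream_b);

    /* Instance A may only use TPCs 0-3; instance B only TPCs 4-7. */
    uint64_t allow_a = 0x0F;                        /* bits 0-3 */
    uint64_t allow_b = 0xF0;                        /* bits 4-7 */
    libsmctrl_set_stream_mask(stream_a, ~allow_a);  /* disable all others */
    libsmctrl_set_stream_mask(stream_b, ~allow_b);

    /* Each instance launches all of its kernels into its own stream;
     * the hardware then only dispatches them to the TPCs left enabled. */
    run_yolov2_frame(stream_a);
    run_yolov2_frame(stream_b);

    cudaStreamSynchronize(stream_a);
    cudaStreamSynchronize(stream_b);
    return 0;
}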
Protection from competitors. We first consider how our library is able to protect an instance of YOLOv2 from competing work which malfunctions. Table III contains the key data discussed throughout this section.

TABLE III. YOLOv2 runtimes (in ms).
Case  Competitor              Partition  Mean  Min  Max  St. Dev.
1     None                    None       24    20   53   2.7
2     YOLOv2                  None       40    35   69   2.8
3     Malfunctioning YOLOv2   None       82    73   111  2.9
4     Malfunctioning YOLOv2   50/50      45    35   157  22

When run alone on 8 TPCs, YOLOv2 takes 24 ms on average to process a single frame (Case 1, Table III). (Throughout our case study, we only consider up to 8 TPCs; YOLOv2 is a small network, and allocations of more than 8 TPCs on the Titan V yield negligible performance improvement.) When combined with a second YOLOv2 instance, this time nearly doubles to 40 ms (Case 2, Table III), an unsurprising result given the halved computing resources. More surprisingly, the standard deviation remains low, and the relative min and max are hardly changed by the addition of this competing work.

However, this predictability does not persist if the competing instance malfunctions, as demonstrated by Case 3 in Table III. In this case, the competing instance spawns numerous unexpected kernels, stealing compute from the primary YOLOv2 instance. This occurs because the WDU dispatches from all runnable kernels equally, and the faulty addition of kernels from the competing instance displaces work from our primary instance. This results in a more-than-doubling of the primary instance's execution time to 82 ms, an unacceptable spillover of a malfunction that should have been contained to the competing instance.
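For concreteness, this kind of misbehavior can be emulated by a competitor that erroneously floods its stream with surplus kernel launches. The following is only an illustrative sketch (busy_kernel, misbehave(), and the launch counts are ours and hypothetical, not the code used in our experiments); it shows why, absent a TPC mask, such launches compete directly with the primary instance's kernels at the WDU.

/* Illustrative sketch of a "malfunctioning" competitor: a faulty loop
 * bound spawns hundreds of surplus kernels per frame. With no TPC mask
 * applied to `stream`, these are dispatched alongside the primary
 * instance's kernels, stealing its compute capacity. */
#include <cuda_runtime.h>

__global__ void busy_kernel(void) {
    volatile float x = 0.0f;            /* volatile defeats optimization */
    for (int i = 0; i < (1 << 20); i++)
        x += 1.0f;
}

void misbehave(cudaStream_t stream) {
    for (int i = 0; i < 512; i++)       /* erroneous: should be 1 */
        busy_kernel<<<64, 256, 0, stream>>>();
}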
TPC partitioning can help. When enabled and used to allocate four TPCs to each instance, the execution time distribution for the primary YOLOv2 instance nearly returns to its pre-malfunction state, at about 44 ms per frame (Case 4, Table III). (The maximum and standard deviation in this case are driven by approximately 7% of samples which are extreme outliers. We suspect that these are not caused by a partitioning issue, as we see them in no other experiment, but rather by an internal CUDA synchronization issue, such as that uncovered in recent work [45]. We intend to investigate this anomaly further in future work.) This indicates that partitioning compute units can be sufficient to protect against interference; in short:

Obs. 1. TPC partitioning can defend a GPU-context-sharing task from losing performance to compute-intensive tasks.

Beyond fault containment, this can be usefully applied when other tasks are unknown or have input-dependent computational sizes, but are expected to take a relatively constant amount of compute. However, to apply partitioning in this context, we must answer how partitions should be sized.

Tuning partition sizes. To investigate how the choice of TPC partitioning affects runtimes, we tested the full range of partitioning options for two instances of YOLOv2 on 8 TPCs and plot the results in Fig. 14. This plot shows, for each partitioning configuration, the mean, min, and max execution time of each instance. For example, when six TPCs are allocated to Instance A and two to B, Instance A takes about 30 ms per frame, and Instance B about 70 ms.
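The complementary masks for such a sweep reduce to simple bit arithmetic. Below is a minimal sketch under the same assumed libsmctrl mask semantics as earlier (a set bit disables a TPC); split_masks() is our own helper, not part of the library.

/* Sketch: complementary disable-masks for a k/(8-k) split of 8 TPCs.
 * Assumes the convention that a set bit disables a TPC. */
#include <stdint.h>

static void split_masks(unsigned k, uint64_t *disable_a, uint64_t *disable_b)
{
    uint64_t allow_a = (1ull << k) - 1;              /* TPCs 0 .. k-1 */
    uint64_t allow_b = ((1ull << 8) - 1) & ~allow_a; /* TPCs k .. 7   */
    *disable_a = ~allow_a;
    *disable_b = ~allow_b;
}

/* Passing each mask to the assumed per-stream call for k = 1..7, and
 * running 10,000 jobs per configuration, corresponds to the sweep
 * plotted in Fig. 14 (k = 6 is the 6/2 example above). */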

Fig. 14. Job time of two instances across various partitions of 8 TPCs. (Top axis: number of TPCs allocated to Instance A, 1–7; bottom axis: number of TPCs allocated to Instance B, 7–1; y-axis: execution time, 0–160 ms. Grey denotes the min/max range; the colored line is the mean for each YOLOv2 instance.)
Obs. 2. TPC partitioning allows for smooth adjustment of task execution times.

As we increase the number of TPCs for an instance in Fig. 14, min, mean, and max execution times all decline; if we decrease the number of TPCs, the reverse occurs. This allows TPC allocations to be used as a proxy for priority: the larger an allocation, the sooner a task completes and the more easily it meets its deadline. However, there are further complications:

Obs. 3. Providing additional TPCs to a task may yield only a negligible performance improvement.

Fig. 14 shows that for YOLOv2, the incremental benefit of an additional TPC declines after the addition of a second TPC. This may indicate that all layers of YOLOv2 can utilize at least two TPCs (since doubling the partition size halved execution time), but only some can utilize three or more TPCs: adding the third TPC (a 50% capacity improvement) only speeds up YOLOv2 by 14% (10 ms).
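As a rough check on these numbers, read the approximate means off Fig. 14: about 70 ms with two TPCs and, after the stated 10 ms improvement, about 60 ms with three. The capacity added by the third TPC is (3 - 2)/2 = 50%, while execution time falls by only (70 - 60)/70, roughly 14%, so the extra TPC buys far less than its proportional share.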
In all, our case study found TPC partitioning to be useful in real-world applications for protection, prioritization, and characterization, while imposing a minimal performance cost.

VIII. CONCLUSION

In this work, we uncovered a new means by which to spatially partition the computing cores on all NVIDIA GPUs since 2013. Our library, libsmctrl, allows for easy partitioning while exposing critical GPU details, such as the TPC-to-GPC mappings. We reinforce the usefulness of partitioning via a deep dive into the capability of NVIDIA GPU hardware to support multiple parallel partitions. In our evaluation, we consider how the GPU's scheduling hardware can break partition boundaries and make non-optimal scheduling decisions in some cases. We compare to prior work, and find our approach more flexible and capable, while also easily applying to real-world workloads, as evidenced by our case study.

In future work, we hope to extend our approach such that co-running partitions need not share CUDA contexts. We further hope to investigate how the GPU enforces CUDA stream ordering, how it schedules other engines such as the copy engines, and how exactly the GPU Host Interface selects which queue to pull from when acquiring commands.

IX. ACKNOWLEDGEMENTS

We thank the developers of the nouveau project, as well as current and former employees of NVIDIA, who informed our survey of public GPU architecture sources. We also thank Nathan Otterness for his assistance in porting his AMD Compute Unit partitioning benchmarks to CUDA.

REFERENCES

[1] S. Jain, I. Baek, S. Wang, and R. Rajkumar, "Fractional GPUs: Software-based compute and memory bandwidth reservation for GPUs," in Proceedings of the 25th IEEE Real-Time and Embedded Technology and Applications Symposium, April 2019, pp. 29–41.
[2] B. Wu, G. Chen, D. Li, X. Shen, and J. Vetter, "Enabling and exploiting flexible task assignment on GPU through SM-centric program transformations," in Proceedings of the 29th ACM International Conference on Supercomputing, ser. ICS, June 2015, pp. 119–130.
[3] T. Yandrofski, L. Chen, N. Otterness, J. H. Anderson, and F. D. Smith, "Making powerful enemies on NVIDIA GPUs," in Proceedings of the 43rd IEEE Real-Time Systems Symposium, Dec 2022, to appear.
[4] N. Feddal, H.-E. Zahaf, and G. Lipari, "Toward precise real-time scheduling on NVIDIA GPUs," in Proceedings of the 15th Junior Researcher Workshop for Real-Time Computing, June 2022, pp. 20–24.
[5] NVIDIA, "NVIDIA A100 tensor core GPU architecture," NVIDIA, Tech. Rep., 2020.
[6] G. A. Elliott, "Real-time scheduling for GPUs with applications in advanced automotive systems," Ph.D. dissertation, The University of North Carolina at Chapel Hill, 2015.
[7] N. Capodieci, R. Cavicchioli, M. Bertogna, and A. Paramakuru, "Deadline-based scheduling for GPU with preemption support," in Proceedings of the 39th IEEE Real-Time Systems Symposium, Dec 2018, pp. 119–130.
[8] NVIDIA, "Multi-process service," 2021, version R495.
[9] NVIDIA, "NVIDIA multi-instance GPU and NVIDIA virtual compute server," 2020.
[10] T. Amert, N. Otterness, M. Yang, J. H. Anderson, and F. D. Smith, "GPU scheduling on the NVIDIA TX2: Hidden details revealed," in Proceedings of the 38th IEEE Real-Time Systems Symposium, Dec 2017, pp. 104–115.
[11] M. Yang, N. Otterness, T. Amert, J. Bakita, J. H. Anderson, and F. D. Smith, "Avoiding pitfalls when using NVIDIA GPUs for real-time tasks in autonomous systems," in Proceedings of the 30th Euromicro Conference on Real-Time Systems, July 2018, pp. 20:1–20:21.
[12] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa, "TimeGraph: GPU scheduling for real-time multi-tasking environments," in Proceedings of the 2011 USENIX Annual Technical Conference. USENIX Association, June 2011.
[13] G. A. Elliott, B. C. Ward, and J. H. Anderson, "GPUSync: A framework for real-time GPU management," in Proceedings of the 34th Real-Time Systems Symposium, Dec 2013, pp. 33–44.
[14] N. Otterness and J. H. Anderson, "Exploring AMD GPU scheduling details by experimenting with 'worst practices'," in Proceedings of the 29th International Conference on Real-Time Networks and Systems, April 2021, pp. 24–34.
[15] N. Otterness, "Developing real-time GPU-sharing platforms for artificial-intelligence applications," Ph.D. dissertation, The University of North Carolina at Chapel Hill, 2022.
[16] H.-E. Zahaf, I. S. Olmedo, J. Singh, N. Capodieci, and S. Faucou, "Contention-aware GPU partitioning and task-to-partition allocation for real-time workloads," in Proceedings of the 29th International Conference on Real-Time Networks and Systems, July 2021, pp. 226–236.
[17] A. B. Hayes, F. Hua, J. Huang, Y. Chen, and E. Z. Zhang, "Decoding CUDA binary," in 2019 IEEE/ACM International Symposium on Code Generation and Optimization, Feb 2019, pp. 229–241.
[18] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, "Dissecting the NVIDIA Volta GPU architecture via microbenchmarking," April 2018.
[19] Z. Jia, M. Maggioni, J. Smith, and D. P. Scarpazza, "Dissecting the NVIDIA Turing T4 GPU via microbenchmarking," March 2019.
[20] N. Otterness, M. Yang, T. Amert, J. Anderson, and F. D. Smith, "Inferring the scheduling policies of an embedded CUDA GPU," in Proceedings of the 13th Annual Workshop on Operating Systems Platforms for Embedded Real Time Applications, July 2017.
[21] I. S. Olmedo, N. Capodieci, J. L. Martinez, A. Marongiu, and M. Bertogna, "Dissecting the CUDA scheduling hierarchy: a performance and predictability perspective," in Proceedings of the 26th IEEE Real-Time and Embedded Technology and Applications Symposium, Apr 2020, pp. 213–225.
[22] R. Spliet and R. Mullins, "The case for limited-preemptive scheduling in GPUs for real-time systems," in Proceedings of the 14th Annual Workshop on Operating Systems Platforms for Embedded Real-Time Applications, July 2018.
[23] N. P. Authors, "Nouveau: Accelerated open source driver for nVidia cards," 2022. [Online]. Available: https://nouveau.freedesktop.org/
[24] M. P. Authors, "The mesa 3d graphics library," 2022. [Online]. Available: https://www.mesa3d.org/
[25] J. F. Duluk Jr, T. J. Purcell, J. D. Hall, and P. A. Cuadra, "Error checking in out-of-order task scheduling," U.S. Patent 9,965,321, May, 2018.
[26] NVIDIA, "Open GPU documentation." [Online]. Available: https://github.com/NVIDIA/open-gpu-doc
[27] T. J. Purcell, L. V. Shah, and J. F. Duluk Jr, "Scheduling and management of compute tasks with different execution priority levels," U.S. Patent Application 13/236,473, Sep., 2011.
[28] J. F. Duluk Jr, L. Durant, R. M. Navarro, A. Menezes, J. Tuckey, G. Hirota, and B. Pharris, "Dynamic partitioning of execution resources," U.S. Patent 11,307,903, Apr., 2022.
[29] K. M. Abdalla, L. V. Shah, J. F. Duluk Jr, T. J. Purcell, T. Mandal, and G. Hirota, "Scheduling and execution of compute tasks," U.S. Patent 9,069,609, Jun., 2015.
[30] P. A. Cuadra, C. Lamb, and L. V. Shah, "Software-assisted instruction level execution preemption," U.S. Patent 10,552,201, Feb., 2020.
[31] S. H. Duncan, L. V. Shah, S. J. Treichler, D. E. Wexler, J. F. Duluk Jr, P. B. Johnson, and J. S. R. Evans, "Concurrent execution of independent streams in multi-channel time slice groups," U.S. Patent 9,442,759, Sep., 2016.
[32] M. W. Rashid, G. Ward, W.-J. R. Huang, and P. B. Johnson, "Managing copy operations in complex processor topologies," U.S. Patent 10,275,275, Apr., 2019.
[33] M. Sullivan, S. K. S. Hari, B. M. Zimmer, T. Tsai, and S. W. Keckler, "System and methods for hardware-software cooperative pipeline error detection," U.S. Patent Application 17/737,374, Aug., 2022.
[34] P. Tasinga, D. B. Yastremsky, J. Wyman, A. Ihsani, P. Nahar, and P. Bhatt, "Neural network scheduler," U.S. Patent Application 17/115,631, Jun., 2022.
[35] NVIDIA, "nvgpu git repository." [Online]. Available: git://nv-tegra.nvidia.com/linux-nvgpu.git
[36] NVIDIA, "NVIDIA's next generation CUDA compute architecture: Kepler GK110/210," NVIDIA, Tech. Rep., 2014.
[37] NVIDIA, "CUDA C++ programming guide," 2022, version PG-02829-001_v11.8.
[38] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jul. 2017, pp. 7263–7271.
[39] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[40] "LITMUSRT home page," online at http://www.litmus-rt.org/, 2020.
[41] B. Brandenburg, "Scheduling and locking in multiprocessor real-time operating systems," Ph.D. dissertation, University of North Carolina at Chapel Hill, 2011.
[42] J. M. Calandrino, H. Leontyev, A. Block, U. C. Devi, and J. Anderson, "LITMUSRT: A testbed for empirically comparing real-time multiprocessor schedulers," in RTSS, 2006, pp. 111–126.
[43] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results," http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[44] J. Bakita, S. Ahmed, S. H. Osborne, S. Tang, J. Chen, F. D. Smith, and J. H. Anderson, "Simultaneous multithreading in mixed-criticality real-time systems," in Proceedings of the 27th Real-Time and Embedded Technology and Applications Symposium, May 2021, pp. 278–291.
[45] T. Amert, Z. Tong, S. Voronov, J. Bakita, F. D. Smith, and J. H. Anderson, "TimeWall: Enabling time partitioning for real-time multicore+accelerator platforms," in Proceedings of the 42nd IEEE Real-Time Systems Symposium, Dec 2021, pp. 455–468.
