effective use of spatial partitioning on NVIDIA GPUs.
6) Demonstrate via a case study how spatial partitioning can be applied to increase GPU utilization and reduce latency for a convolutional neural network (CNN).

[Figure: NVIDIA Ampere Discrete GPU (GA100), referenced later as Fig. 2; diagram not reproduced.]
a single CUDA application, multiple FIFO queues called streams can be used to allow work to use the GPU fully concurrently [10]. Combining all CUDA-using functionality into a single context with multiple streams is often done to avoid cases where several small applications, none of which can fully use the GPU, are exclusively timesliced [11]. A toy illustration of this is shown in Fig. 1, where combining multiple applications into a single context reduces GPU busy time by more than 60%—if the GPU’s SMs can be hardware partitioned. Without hardware partitioning, Job 1 would attempt to use two SMs when run alongside the others, preventing Job 2 from starting. With partitioning, we can force Job 1 to use a single SM for twice as long, yielding a more efficient overall schedule.

Real-time terminology. Tasks are composed of jobs, with a rate of release known as a period. After release, each job must complete by a subsequent deadline. If tasks release jobs exactly at the start of their period, they are known as periodic. If, instead, the period only defines a minimum separation between releases, the task is known as sporadic. Tasks may have an associated criticality, which reflects their relative importance in the system. For example, in a self-driving car, a pedestrian-detection task would have high criticality, whereas in-screen display updates would have low criticality.

III. RELATED WORK

Hardware predictability is key in embedded and real-time systems. Since the advent of applying GPUs towards general-purpose compute tasks, researchers have sought to develop management approaches that enforce predictable behavior when a GPU is shared among applications.

One of the most notable early works in this area is TimeGraph [12], which uses an interception layer and GPU-interrupt-driven scheduler to serially arbitrate GPU work submission. This scheduling approach theoretically still applies to NVIDIA GPUs, but requires fully open-source drivers and does not allow for parallel executions on the GPU.

Another, simpler, but more broadly applicable approach to predictably share a GPU is to treat it as a resource to which accesses are protected by mutual-exclusion locking. The GPUSync framework [13] and extensions [6] can be considered exemplars of this approach. These works further innovate by allowing for predictable intra-GPU sharing. Through the observation that some auxiliary GPU units such as the copy engines can operate independently of compute work, these works allow for per-GPU-engine-granularity locking. This enables, for example, one application to perform GPU computations while another performs copies.

Subsequent to these works, significant effort has gone into allowing more predictable intra-GPU sharing, particularly of the GPU’s primary compute/graphics engine. The recent work of Otterness and Anderson [14], [15] provided a breakthrough in this area, by allowing multiple applications to simultaneously share commodity AMD GPUs through transparent, hardware spatial partitioning of the main compute cores. This technique has yet to be extended to NVIDIA GPUs, with recent work remaining constrained to software-emulated partitioning [1]–[4], or conceptual analysis [16].

Significant supporting work to reverse engineer GPU design has accompanied GPU management efforts. Historically, commodity GPU designs have been shrouded in secrecy. Even instruction encodings—a commonly shared piece of information for any processor—have remained secret for GPUs [17]–[19]. Notable reverse engineering works include those of Otterness et al., Amert et al., and Olmedo et al. to expose the queue structure used for compute work in NVIDIA GPUs [10], [20], [21], the work of Capodieci et al. and Spliet and Mullins to clarify the preemption capabilities of NVIDIA GPUs [7], [22], and the work of Jain et al. to elucidate the memory hierarchy in NVIDIA GPUs [1]. Outside of the academic community, the Nouveau [23] and Mesa [24] reverse-engineered NVIDIA GPU driver projects provide crucial details on GPU-to-CPU interfaces.

Despite these efforts, we are aware of no published method that implements transparent spatial partitioning of the compute/graphics engine for NVIDIA GPUs.

IV. IMPLEMENTING COMPUTE PARTITIONING

We now discuss how we enable transparent hardware partitioning of compute units in NVIDIA GPUs, and overview our easy-to-use, portable, user-space API for doing so.

A. Partitioning on TPC Boundaries

We implemented partitioning by activating two existent, but little-known and unused fields in an obscure NVIDIA data structure known as a TMD (Task MetaData), used to launch GPU compute work [25] (we return in detail to this structure in Sec. V).7 The fields we uncovered are named SM_DISABLE_MASK_LOWER and _UPPER. These fields are publicly listed, from TMD structure V1.6 (Kepler) to V3.0 (Ampere) [26], but not used in any open-source software. As this is one of the only public references to an SM mask, we decided to explore the purpose and functionality of these fields. We refer to these fields as the SM mask going forward.

7 The TMD is called a QMD in some sources. QMD stands for Queue MetaData. This has been subsumed by the TMD, and the old behavior can be accessed by flipping the IS_QUEUE TMD bit [25], [26].

Activating partitioning. Unfortunately, modifying a TMD created by the CUDA library is somewhat difficult—the TMD is almost immediately uploaded onto the GPU after construction. In order to modify the TMD, we discovered and utilized an undocumented CUDA debugging callback API to intercept and modify the TMD after construction, but before upload onto the GPU. Using this mechanism, we experimented with the SM mask and found that, on older GPUs, it is a bitmask in which an enabled bit indicates a disabled SM for the kernel described by the TMD. This allows for spatial partitioning of SMs. Given a mutually exclusive partition configuration, partitioning can be enforced by setting the SM mask for every kernel such that every kernel in one partition is prohibited from running on any SM of another partition. We show such partitioning in Fig. 3.
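To make the mask encoding concrete, the following sketch (ours, not part of any NVIDIA or CUDA API; the helper names are hypothetical) builds the disable mask for a contiguous range of SMs and splits it into the two 32-bit TMD fields named above.

    #include <stdint.h>

    /* Sketch (ours): a set bit in the TMD's SM mask disables that SM for
     * the kernel. Keep SMs [first, first+count) enabled; disable the rest. */
    static uint64_t disable_mask_keep_range(unsigned first, unsigned count) {
        uint64_t keep = (count >= 64) ? ~0ull : (((1ull << count) - 1) << first);
        return ~keep; /* 1 = disabled */
    }

    /* Split into the SM_DISABLE_MASK_LOWER/_UPPER halves of the TMD. */
    static void split_sm_mask(uint64_t mask, uint32_t *lower, uint32_t *upper) {
        *lower = (uint32_t)(mask & 0xffffffffull);
        *upper = (uint32_t)(mask >> 32);
    }

For the four-/five-SM split of Fig. 3 on a nine-SM GPU, disable_mask_keep_range(0, 5) has low bits 0x1e0 (SMs 5–8 disabled) and disable_mask_keep_range(5, 4) has low bits 0x01f (SMs 0–4 disabled); bits for SMs that do not exist appear to be ignored, as Listing 1 later also sets them.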
Fig. 3. Partitioning a GTX 1060 3 GB into four- and five-SM partitions. [Plot not reproduced; two panels, “Sharing without an SM Mask” and “Sharing with an SM Mask,” show per-SM block executions of Stream 1 (K1) and Stream 2 (K2) over time in seconds.]

Fig. 4. An “SM mask” acting as a TPC mask on the Tesla P100 GPU. [Plot not reproduced; panels “Striped Mask on GTX 1060 3GB” and “Striped Mask on Tesla P100” show per-SM block executions over time in seconds.]

The experiment in Fig. 3 involves two kernels, K1 and K2, which spin for a fixed amount of time. K2 is launched immediately after K1, and needs to complete about half as much work. Launch times are indicated by the colored arrows at the bottom. Each shaded square in this plot represents the execution of a block of a kernel, with height representing the number of resident threads, and width representing the length of execution. The jth block of kernel i is denoted Ki:j. We show CUDA’s default behavior at left, and the behavior with an SM mask at right.

In the left plot, observe how K1’s work is immediately distributed among all the SMs of the GPU upon launch, and that the blocks of K2 are prevented from starting execution until after K1’s initial set of blocks completes. This pattern remains no matter the priority of K2, mirroring the findings of [10]. At right, we enable an SM mask of 0x1e0 on K1, and 0x01f on K2. With these masks, SMs 5–8 are disabled for K1, and SMs 0–4 are disabled for K2. Observe how this partitioning also allows K2 to begin and finish earlier, as no blocks of K1 prevent K2 from starting on SMs 5–8.

TPCs, not SMs. The promising results of Fig. 3 led us to experiment on other, more recent, NVIDIA GPUs, where we discovered an interesting discrepancy. We show this in Fig. 4. We again launch two kernels, but flip every other bit in the SM mask, and run the experiment on two GPUs: first on the GTX 1060 3 GB at left, then on the Tesla P100 at right. Based on the behavior in Fig. 3, we would expect our odd/even SM mask to disable every odd SM for K1, and every even SM for K2. This expectation holds only on the left. With the P100, the pattern repeats at double the period. Why?

The GTX 1060 and Tesla P100 are both based on NVIDIA’s Pascal architecture, but use dies configured in a slightly different way. The P100 and newer GPUs are configured more similarly to the GA100 in Fig. 2, with two SMs per TPC, whereas the GTX 1060 and older GPUs have a one-to-one SM-to-TPC ratio. This key difference, and substantial supporting results, points to the “SM mask” actually serving as a TPC mask to the GPU. Presumably, the name of the field in the TMD is inherited from the days when the numbers of TPCs and SMs were identical, allowing for more term interchangeability.
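A small sketch (ours) of the practical consequence: on GPUs with two SMs per TPC, each mask bit governs a pair of SMs, which is exactly the doubled period seen in Fig. 4. The helper below is illustrative only.

    /* Sketch (ours): map an SM index to the mask bit that governs it.
     * sms_per_tpc is 1 on the GTX 1060 and older, 2 on the P100 and newer. */
    static inline unsigned mask_bit_for_sm(unsigned sm_id, unsigned sms_per_tpc) {
        return sm_id / sms_per_tpc;
    }

    /* The striped masks of Fig. 4 (a set bit disables that unit):
     *   K1: 0xaaaaaaaaaaaaaaaaull  -- disables odd-numbered units
     *   K2: 0x5555555555555555ull  -- disables even-numbered units */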
B. Partitioning on GPC Boundaries

Many GPU resources are instantiated on a per-GPC basis (more on this in Sec. V-C), so it may be desirable to obviate contention by aligning GPU partitions to GPC boundaries. Earlier work on AMD GPUs [14] assumed a constant mapping of compute units to shader engines (a GPC’s AMD equivalent). NVIDIA GPUs are different. We illustrate this for two instances of the GTX 1060 3 GB’s GP106 die in Fig. 5. Disabled TPCs are crossed out, and the remaining TPCs are renumbered. Note how both dies have nine enabled TPCs, but not the same TPCs. The first TPC of GPC 1 is disabled at right, but not at left. Instead, the first TPC of GPC 0 is disabled. This causes TPC 4 to be in GPC 0 at right, but in GPC 1 at left. In our experiments, we find this pattern repeated across most NVIDIA GPUs from the past several years. Why would two ostensibly identical GPUs have such different internal configurations?

Fig. 5. Illustration of floorsweeping asymmetrically impacting GPU layout and TPC numbering for two GP106 dies. [Diagram not reproduced.]

This difference between dies stems from floorsweeping, a technique increasingly employed by NVIDIA in recent years. As GPU dies have grown in size to meet ever-increasing computational demands, the likelihood of die manufacturing errors has increased. Floorsweeping enables use of these imperfect dies by fusing off (“sweeping” away) the defective parts, allowing the remaining parts to function normally. As die manufacturing errors are randomly distributed, the disabled units vary from die to die, as in Fig. 5.

Coupled with floorsweeping, dies are binned based on their number of defective parts. For the GP106 die shown in Fig. 5, if no TPCs are defective it is sold as a GTX 1060 6 GB, if at most one TPC is defective it is sold as a GTX 1060 3 GB, if at most two TPCs are defective it is sold as a Quadro P2000, and so on. Dies beyond a defect threshold are discarded.

When partitioning TPCs, each bit in the mask corresponds to a TPC after they have been renumbered, as in Fig. 5. This means that TPC-to-GPC mappings are necessary to partition on GPC boundaries across dies. We determined which GPU registers contain this information, and built a Linux kernel module named nvdebug to extract and expose this via an interface in /proc.8

8 See documentation linked at https://www.cs.unc.edu/~jbakita/rtas23-ae/.

C. Our API

Combining together the capability to partition TPCs and GPCs, we present our flexible user-space library and API, titled libsmctrl, written in C. We detail our API in Table I, with function names and parameters in the first column, function descriptions in the next, and finally the prerequisites for use of each function. We support setting the TPC mask at global, per-stream, and per-kernel levels, where setting a mask at a higher granularity will override a lower-granularity mask. This allows for most TPCs to be idled by default, with access granted only to explicitly permitted streams or kernels. Our library supports aarch64 and x86_64 CPUs, and CUDA versions stretching as far back as 2016. Our library is fully user-space, and compiled binaries are portable to any recent Linux kernel and NVIDIA driver. When active, our library adds only a few instructions of overhead onto each kernel call. Our implementation primarily relies on an undocumented CUDA callback API, but we employ several other techniques on older CUDA versions.

TABLE I
API OF OUR LIBSMCTRL LIBRARY. [Table body omitted.]

Basic partitioning. An example of our API in a CUDA program is shown in Listing 1. We assume a nine-TPC GPU, but a real program should compute the mask dynamically after using libsmctrl_get_tpc_info() to get the number of TPCs. For most programs, we recommend disabling most TPCs by default—CUDA may implicitly launch internal kernels to support some API calls and, if no default mask is set, those calls may interfere with your partitions. In Listing 1 we set the global default on Line 3, permitting work to run only on TPC 0 unless otherwise specified. It is possible to disable all TPCs by default (with a mask of ~0ull),9 but we recommend against this, as it causes kernels launched with the default TPC mask to hang indefinitely (including CUDA-internal ones).

9 Postfix ull indicates a 64-bit literal, and ~ is bitwise inversion.

 1 int main() {
 2   // Allow work to only use TPC 0 by default
 3   libsmctrl_set_global_mask(~0x1ull);
 4   ...
 5   // Stream-ordering is still respected with
 6   // partitioning, so avoid the NULL stream
 7   cudaStream_t urgentStream, otherStream;
 8   cudaStreamCreate(&urgentStream);
 9   cudaStreamCreate(&otherStream);
10   // Override the global default settings
11   // Allow otherStream to use the first 5 TPCs
12   libsmctrl_set_stream_mask(otherStream,
13                             ~0x1full);
14   long_kernel<<<2048, 2048, 0, otherStream>>>();
15   // Allow urgentStream to use the last 4 TPCs
16   libsmctrl_set_stream_mask(urgentStream,
17                             ~0x1e0ull);
18   // Launch short, sporadic work as it arrives
19   bool done = 0;
20   while (!done) {
21     wait_for_work_or_done(&done);
22     sporadic_work<<<32, 32, 0, urgentStream>>>();
23     cudaStreamSynchronize(urgentStream);
24   }
25 }

Listing 1. Example usage of partitioning API (9 SM GPU).

Continuing on in the example of Listing 1, we create CUDA streams on Lines 7–9 to allow for concurrent kernel launches (prior work [10] discusses why this is necessary), and set the TPCs these streams can use on Lines 12 and 16. We allow otherStream to use TPCs 0–4, and urgentStream to use TPCs 5–8. This allows us to launch long-running and difficult-to-preempt kernels in otherStream, while still being able to immediately start sporadic work via urgentStream as it arrives. Our API is highly flexible, and supports many other usage patterns not detailed here.
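As a sketch of the dynamic sizing recommended above, the fragment below (ours) splits whatever GPU is present roughly in half between the two streams of Listing 1. The TPC count is assumed to have been obtained from libsmctrl_get_tpc_info(); we do not reproduce that function’s exact signature here, as it is given in Table I.

    /* Sketch (ours): size the two partitions of Listing 1 at run time.
     * num_tpcs is assumed to come from libsmctrl_get_tpc_info(). */
    uint32_t num_tpcs = /* from libsmctrl_get_tpc_info() */ 9;
    uint64_t all_tpcs  = (num_tpcs >= 64) ? ~0ull : ((1ull << num_tpcs) - 1);
    uint64_t low_half  = (1ull << ((num_tpcs + 1) / 2)) - 1; /* lower TPCs  */
    uint64_t high_half = all_tpcs & ~low_half;               /* the rest    */
    libsmctrl_set_stream_mask(otherStream,  ~low_half);  /* set bits = disabled */
    libsmctrl_set_stream_mask(urgentStream, ~high_half);

On the nine-TPC GPU assumed in Listing 1, this reproduces the ~0x1full and ~0x1e0ull masks used there.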
GPC partitioning. To support partitioning on GPC boundaries, we provide an easy-to-use wrapper function around our nvdebug kernel module, and include a brief example of this in Listing 2. This example allocates the TPCs of two GPCs to the next kernel launch. Line 4 obtains a num_gpcs-length array of bitmasks. Array index i is associated with GPC i, and if bit j is set in the mask in that entry, TPC j is associated with GPC i. A TPC may only be associated with one GPC. On Lines 10–12 we combine the bitmasks of the TPCs for GPCs 0 and 1, then apply the mask on Line 16.

 1 // Get mask of enabled TPCs for each GPC
 2 uint32_t num_gpcs;
 3 uint64_t *tpcs_for_gpc;
 4 libsmctrl_get_gpc_info(&num_gpcs,
 5                        &tpcs_for_gpc);
 6 assert(num_gpcs >= 2);
 7
 8 // Allow the next kernel to run on any TPC in
 9 // the first two GPCs
10 uint64_t tpc_mask = 0;
11 tpc_mask |= tpcs_for_gpc[0];
12 tpc_mask |= tpcs_for_gpc[1];
13
14 // The above lines created a bitmask of TPCs to
15 // enable, so invert to get the disable mask
16 libsmctrl_set_next_mask(~tpc_mask);

Listing 2. Example usage of GPC information API.
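Listing 2 constrains only the next kernel launch. The same per-GPC bitmasks can also back longer-lived, GPC-aligned partitions through the per-stream interface shown in Listing 1; a sketch (ours), reusing the variables from Listings 1 and 2:

    // Sketch (ours): give each of two streams its own GPC, reusing
    // num_gpcs and tpcs_for_gpc from Listing 2 and the streams from
    // Listing 1. Set bits in the passed mask mark *disabled* TPCs.
    assert(num_gpcs >= 2);
    libsmctrl_set_stream_mask(otherStream,  ~tpcs_for_gpc[0]); // GPC 0 only
    libsmctrl_set_stream_mask(urgentStream, ~tpcs_for_gpc[1]); // GPC 1 only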
D. Limitations

Our system works by intercepting TMDs after construction on the CPU, but pre-GPU-upload. As such, it will not work for kernels with TMDs constructed and executed from the GPU without the involvement of the CPU, such as when CUDA Dynamic Parallelism (CDP) or CUDA Graphs are used.

Further, we cannot partition GPUs with over 64 TPCs, as all versions of the TMD structure presently contain only 64 bits for partitioning. No shipping GPU has that many TPCs, but NVIDIA’s upcoming H100 allegedly contains up to 72.

Finally, our current library does not support per-GPU global partition configuration, but could easily be extended to do so.

V. HOW NVIDIA GPUS SCHEDULE COMPUTE

The usefulness of GPU partitioning hinges on the ability of GPU hardware to supply jobs and data to each partition in parallel. In this section, we take a deep dive into the architectural patterns of NVIDIA GPUs, with a particular focus on how they allow for parallelism at every step of a kernel’s journey from user-space dispatch to completion. We first briefly consider our sources of information before delving into the full pipeline, as numbered in Fig. 6.

Fig. 6. High-level GPU hardware compute scheduling pipeline. [Diagram not reproduced; it shows command segments and an indirect buffer feeding, in order, ① main (CPU) memory (DRAM), ② the Host Interface with its PBDMA units, ③ the Compute Front End, ④ the Task Management Unit (TMU; see Fig. 7), ⑤ the Work Distribution Unit (WDU; see Fig. 8), and ⑥ the Execution Engine and Memory—GPCs, memory crossbar, and Memory Partition Units (see Fig. 9)—in a representative NVIDIA GPU.]

A. Sourcing Architectural Details

We gleaned architectural information from cross-referenced public sources. NVIDIA’s patents cover the implementation of GPU compute priorities [27], block distribution logic [28], [29], preemption design [30], and more [25], [31]–[34]. As patents can describe non-existent or infeasible inventions, we verified whether the patents discuss hypothetical or actual designs by cross-referencing them with NVIDIA’s open-source nvgpu driver [35], NVIDIA’s public headers [26], the work of the Nouveau and Mesa projects [23], [24], and other sources.

As an example of our validation process, consider NVIDIA’s patent for TMD error checking [25]. Part of this patent describes, in order, the fields of the TMD structure. A publicly available header,10 used in the Mesa project,11 also provides the ordering and names of TMD fields. These sources match, indicating that the patent likely describes real hardware. We applied this process, or performed verification experiments, to attain the following content.

10 classes/compute/clc0c0qmd.h in open-gpu-doc [26].
11 gallium/drivers/nouveau/nvc0/clc0c0qmd.h in Mesa [24].

As the following information is generally a synthesis of all the above sources, we omit redundant inline citations. To the best of our knowledge, our discussion is accurate for NVIDIA GPUs from Kepler through at least Ampere.

B. The Compute Scheduling Pipeline

We now begin following the path of a single kernel through the GPU scheduling pipeline. Formally, we define a compute job on the GPU as a single kernel launch.

① Kernel instantiation. On NVIDIA GPUs, kernels are internally described via the aforementioned TMD structure. Beyond the already-discussed properties, it describes the thread blocks, threads per block, and shared memory resources needed for the kernel, and includes the address of the kernel’s entry point (among many other fields, such as priority). After a user-space library (such as CUDA or OpenCL) constructs a TMD, it enqueues a launch command containing a pointer to the TMD into the next available command slot in a command segment. A command segment is a contiguous block of memory containing GPU commands.

The TMD is used as a kernel descriptor and handle throughout the scheduling pipeline, persisting until all computations of the GPU kernel are complete.
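The C sketch below (ours) merely collects, in one place, the TMD fields that this paper refers to; it is illustrative only and is not the hardware layout. Authoritative field names, widths, and offsets are those in the clc0c0qmd.h headers cited above [24], [26].

    #include <stdint.h>

    /* Illustrative summary of TMD fields discussed in this paper; NOT the
     * real field layout. See clc0c0qmd.h [24], [26] for real definitions. */
    struct tmd_summary {
        uint64_t entry_point;           /* address of the kernel's entry point   */
        uint32_t grid_dim[3];           /* thread blocks to launch               */
        uint32_t block_dim[3];          /* threads per block                     */
        uint32_t shared_mem_bytes;      /* shared memory needed                  */
        uint32_t priority;              /* kernel priority                       */
        uint32_t group_id;              /* GROUP_ID: target TMD Group (Sec. V-B) */
        uint32_t sm_disable_mask_lower; /* SM_DISABLE_MASK_LOWER (Sec. IV-A)     */
        uint32_t sm_disable_mask_upper; /* SM_DISABLE_MASK_UPPER                 */
        uint32_t is_queue;              /* IS_QUEUE: legacy QMD behavior         */
    };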
② Host Interface.12 The GPU’s Host Interface bridges the gap from CPU to GPU for the TMD. This unit contains one or more Pushbuffer Direct Memory Access units (PBDMAs), along with context-switching control logic, among other sub-units. The PBDMAs load commands from user space via an indirect buffer, which is a circular buffer of pointers to command segments.13 These structures, shown at the top of Fig. 6, allow user-space applications to directly dispatch commands to the GPU without system call or driver overheads. The PBDMA units individually load, parse, and cache commands at a rate much faster than that of most GPU engines. While in our illustration we show GPU command queues as located in CPU memory (as is most common), they can also be located in GPU memory. After command acquisition and parsing, the Host Interface forwards the command to the appropriate engine. In the case of a kernel launch, it passes the TMD to the Compute Front End.

12 Also known as “FIFO” in Nouveau.
13 A “pushbuffer” is a combination of these segments and pointers.

③ Compute Front End. The Compute Front End relays TMD pointers from ②, the Host Interface, to ④, the Task Management Unit. Further, we understand that this unit can orchestrate context switches in subsequent units.14 While this unit processes TMDs at a rate decoupled from the others, it may cause work to queue if the incorrect context is active. We focus on a single context in this work, and so do not further consider said issue. Further details on this unit are elusive.15

14 Via the Front-End Context Switch (FECS) unit. See Spliet et al. [22].
15 The Compute Front End (COMP FE), like the better-documented and similarly-situated Graphics Front End (GFX FE), likely also configures ⑥, the Execution Engine and Memory, but this is difficult to confirm.

④ Task Management Unit (TMU).16 Illustrated in Fig. 7, the Task Management Unit queues TMDs by priority and arrival order until ⑤, the Work Distribution Unit, is ready to receive them. Due to the explicit scheduling responsibilities of the TMU, we investigate it in depth.

16 More recently called the “Scheduler Unit” or “SKED” in patents [33], [34], and known in marketing text as the Grid Management Unit (GMU) [36].

Fig. 7. A three-priority task management unit with five queued TMDs. [Diagram not reproduced.]

The TMU is built around a series of priority-level, singly-linked lists, with one linked-list head and tail pointer tracked for each priority level. Each list is exclusively composed of TMDs. In Fig. 7, we show a unit supporting three priority levels, with TMDs illustratively numbered by arrival order. Three TMDs are in the highest-priority-level list, none are in the medium-priority-level list, and two TMDs are in the lowest-priority-level list. Each priority-level list is formally called a TMD Group.

These lists allow the TMU to reorder TMDs that it receives, such that higher-priority ones skip ahead of lower-priority ones. Upon TMD arrival from the Compute Front End, or elsewhere,17 the TMU reads the TMD’s GROUP_ID field, and appends the TMD to the tail of the specified TMD Group.18 For example, in Fig. 7, TMD4 is the most-recently arrived high-priority TMD, and TMD2 is the most-recently arrived low-priority TMD.

17 The TMU may receive new TMDs directly from GPU compute units when features such as CUDA Dynamic Parallelism are employed.
18 Unless field ADD_TO_HEAD_OF_QMD_GROUP_LINKED_LIST is true.

As unit ⑤ in the scheduling pipeline signals readiness for another TMD, the TMU removes and passes the head of the highest-priority non-empty list. This is illustrated with the bold outlines and “Next TMD” box in Fig. 7. Given the TMU state of Fig. 7, TMDs would exit the TMU in the following order: TMD1, TMD3, TMD4, TMD0, and then TMD2.

The TMU is the final unit which can receive TMDs at a rate decoupled from the TMD completion rate. This causes TMDs to eventually accumulate in the TMU whenever kernels are dispatched by user space at a rate faster than they can complete.
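The selection rule just described can be captured in a few lines of C. The code below is a software model of our own making, for illustration only; it is not derived from NVIDIA code.

    #include <stddef.h>

    /* Software model (ours) of TMD Group selection: each group is a FIFO of
     * TMDs at one priority level, and the TMU hands unit 5, the WDU, the
     * head of the highest-priority non-empty group. */
    struct tmd_node  { struct tmd_node *next; /* TMD contents elided */ };
    struct tmd_group { struct tmd_node *head, *tail; };

    /* groups[0] is the highest-priority group in this model. */
    static struct tmd_node *tmu_next(struct tmd_group *groups, size_t n_groups) {
        for (size_t g = 0; g < n_groups; g++) {
            if (groups[g].head) {
                struct tmd_node *t = groups[g].head;
                groups[g].head = t->next;
                if (!groups[g].head)
                    groups[g].tail = NULL;
                return t;
            }
        }
        return NULL; /* every group is empty */
    }

Replaying the Fig. 7 state through this model yields the exit order given above: TMD1, TMD3, TMD4, TMD0, then TMD2.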
⑤ Work Distribution Unit (WDU).19 Illustrated in Fig. 8, the Work Distribution Unit dispatches TMDs from task slots to available TPCs. The number of task slots is hardware-limited,20 forcing the WDU to make scheduling decisions with only a limited window of pending TMDs. Each TMD occupies a task slot and is also tracked in a priority-sorted table; once all of a TMD’s blocks are launched, it is removed from the priority-sorted table, but left in the Task Table until all its blocks complete.

19 Also known as the CUDA Work Distributor (CWD) in marketing literature [36], and as the Compute Work Distributor in a recent patent [28].
20 Documented as the “Maximum number of resident grids per device” in Table 15 of the CUDA C++ Programming Guide [37].

Fig. 8. A six-slot work distribution unit on a four-TPC GPU. [Diagram not reproduced; it shows the Task Table/Task Slots, the priority-sorted task table, the TPC Resource Trackers, and the Load Balancer.]

Fig. 8 illustrates the WDU state shortly after receiving the TMDs of Fig. 7, and after dispatching all blocks of TMD1. We reflect which kernels are executing on which TPCs via shading of the TPC Resource Trackers. These units relay states and commands between the TPCs and the WDU. Note how TMD1 is still executing (light hatching) on the TPCs, and so remains in the Task Table. TMD3, as the highest-priority and earliest-arrived TMD, is now dispatching blocks to TPCs. TMD4 is waiting for TMD3 to dispatch all its blocks before it can move to the head of the priority table and dispatch blocks.

This ordering and dispatch process can be disturbed by two things: TPC partitioning, and pending higher-priority work in ④, the TMU. When TPC partitioning is in use, if a TPC has available capacity, but the TMD at the head of the sorted table is prohibited from executing on that TPC, the WDU will skip forward in the table until it finds a TMD allowed to execute on the available TPC (or reaches the end of the table). If, for example, TMD3 was prohibited from executing on TPC 4, the Load Balancer could instead dispatch blocks from TMD4.

When all WDU task slots are occupied, and any task slot contains a TMD with priority lower than any pending in ④, the TMU, the lower-priority TMD will be evicted from the WDU and replaced with the higher-priority TMD. When this occurs, the evicted TMD stops dispatching blocks and completes blocks already in progress. The evicted TMD is then inserted into the head of the respective priority-level list in the TMU. We will further explore what this means for real workloads in Sec. VI.
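The skip-ahead rule interacts with TPC partitioning as follows; the C below is a software model of ours, for illustration only, not NVIDIA code.

    #include <stdint.h>
    #include <stddef.h>

    /* Software model (ours) of the WDU skip-ahead rule: walk the priority-
     * sorted table (highest priority first) and dispatch from the first TMD
     * whose partition permits the TPC that just reported free capacity.
     * A set bit in tpc_disable_mask bars that TPC (Sec. IV-A). */
    struct sorted_entry { uint64_t tpc_disable_mask; int blocks_left; };

    static struct sorted_entry *wdu_pick(struct sorted_entry *table, size_t n,
                                         unsigned free_tpc) {
        for (size_t i = 0; i < n; i++) {
            if (table[i].blocks_left > 0 &&
                !(table[i].tpc_disable_mask & (1ull << free_tpc)))
                return &table[i]; /* dispatch a block from this TMD */
        }
        return NULL; /* reached the end of the table */
    }

In the Fig. 8 example, if TMD3’s mask barred TPC 4, this walk would fall through to TMD4, matching the Load Balancer behavior described above.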
This concludes our coverage of the TMD from construction to dispatch. We now consider aspects of GPU architecture that are relevant while a thread block executes.

C. Memory Parallelism

Inherited from the high-bandwidth needs of graphical-processing tasks, GPUs are massively parallel in ways far beyond the high number of CUDA cores. This subsection extends the background of Sec. II with new details on the parallelism of GPU caches, interconnects, and DRAM controllers.

⑥ Execution engine and memory subsystem. We illustrate the data and instruction pipelines, from L1 caches to DRAM chips, for the execution engine in Fig. 9. An “I” (resp. “D”) postfix on a cache indicates an instruction (resp. data) cache. Starting at the bottom in Fig. 9, consider the Memory Partition Units. These are typically configured as one per DRAM chip, with each partition unit containing a DRAM controller and a subset of the L2 cache. Each partition unit independently connects to every GPC via a crossbar bus, such that partitioning Memory Partition Units will also partition the crossbar bus and L2.

Fig. 9. Topology of GPU execution engine and memory subsystem. [Diagram not reproduced; it shows the SMs of each GPC, the memory crossbar, and the Memory Partition Units with their L2 slices, DRAM controllers, and attached DRAMs; not all DRAM is shown.]

The crossbar bus links inside each GPC with a Memory Management Unit (MMU, with associated Translation Lookaside Buffer, TLB) for virtual memory support.21 This connects to L1 data caches in each SM, and to a GPC-wide L1.5I cache. The L1.5I cache respectively feeds per-SM instruction caches. In all, each GPC can theoretically be configured to operate with an exclusive subset of the GPU’s cache, bus, and DRAM resources (if Memory Partition Units are partitioned, as in [1]). Further, each SM can operate from its L1 cache without generating interference.

21 Some GPUs instead put the MMU in the Memory Partition Unit.

To summarize this section, we find GPU hardware to be more than capable of feeding jobs to many partitions simultaneously, and also find it capable of supplying partitions with uncontended cache, bus, and DRAM resources.

VI. EVALUATING COMPUTE PARTITIONING

We now evaluate spatial compute partitioning to assess its scalability, its intersections with hardware scheduling units, and its usefulness compared to prior work.

A. Methodology

Our experiments utilized a variant of the cuda_scheduling_examiner framework integrated with our libsmctrl library. Unless otherwise noted, all experiments ran without other processes competing for the GPU. We found minimal impact due to host platform or Linux kernel version, so we do not control for those variables in our experiments.

Fig. 10. Complex TPC partitioning on the GTX 970. [Plot not reproduced; it shows, per SM, the blocks of kernels K1–K17 of Stream 1 and K1–K21 of Stream 2 over time in seconds.]

B. Compute Partitioning: Flexible and Portable?

At this point, we have only shown TPC partitioning in small experiments involving few kernels. How does it scale to systems of many kernels and compute partitions?

We experimented with complex arrangements of the GPUSpin and multikernel benchmarks, finding no apparent limits on the number or arrangement of TPC partitions allowed, discovering no ways TPC partitioning could compromise the semantics of CUDA, and encountering few portability limits.

We plot a representative experiment in Fig. 10. This plot shows 38 kernels, across 17 unique TPC partitions and two streams. This experiment demonstrates per-kernel partitioning, stream-ordering, and complex partitioning. We discuss these aspects before considering how they apply to GPUs beyond the GTX 970 used in Fig. 10.

Per-kernel partitions. In the experiment of Fig. 10, every kernel is assigned a different partition than its predecessor. For example, consider Stream 2 (red). In the “T” at the bottom, K7 runs two blocks in a partition including only SM 5, whereas the subsequent K8 uses a partition including SMs 1–5. K7 and K8 (along with all other kernels) are released just before the 1.0s mark, but the partition is encoded into the kernel’s TMD—not as a global GPU property—so it automatically takes effect when the kernel begins.

Preservation of stream-ordering. The GPU preserves the ordering of kernels in streams, even when a subsequent kernel may require a mutually exclusive set of SMs. For example, in Stream 2 in Fig. 10, K9 only needs SM 5, and K10 only needs SM 10. However, K10 does not start until K9 completes. CUDA stream semantics require subsequent kernels in the same stream to not launch until all prior ones have been completed—we find that to be preserved, even when TPC partitioning is in use.

Complex and overlapping partitions. TPC partitions may contain holes or overlap, and have no size restrictions. (By holes, we mean non-contiguous sets of TPCs.) For example, the “E” in Stream 1 of Fig. 10 features holes in K8 and K9’s partitions, which are the non-adjacent SMs 7, 9, and 11. For an example of overlap, throughout the rest of Stream 1, we allow the partitions of each kernel to fully overlap with those of Stream 2, creating the mix of kernels from different streams in the “R”, “A”, and “S” periods. We see no variation in the behavior as the partition size is varied, with between one and ten SMs per partition in Fig. 10.

Portability. Fig. 10 shows an experimental result from a single GPU; however, we find this behavior to hold on every recent NVIDIA GPU. Specifically, any GPU of Compute Capability 3.5 (2013) or greater, and CUDA 8.0 (2017) or newer, supports TPC partitioning via our libsmctrl library. This includes embedded GPUs, such as that in the ARM64-based NVIDIA Xavier System-on-Chip. We list all specific GPUs we tested in Table II.

TABLE II
GPUS TESTED IN OUR EXPERIMENTS

GPU Name        Compute Capability   Architecture
Tesla K40       3.5                  Kepler
GTX 970         5.2                  Maxwell
Jetson TX1      5.3                  Maxwell (embedded)
Tesla P100      6.0                  Pascal
GTX 1060 3 GB   6.1                  Pascal
GTX 1070        6.1                  Pascal
GTX 1080        6.1                  Pascal
Jetson TX2      6.2                  Pascal (embedded)
Titan V         7.0                  Volta
Jetson Xavier   7.2                  Volta (embedded)
Tesla T4        7.5                  Turing
A100 40 GB      8.0                  Ampere

C. When Hardware Scheduling Breaks Down

In Sec. V, we discussed how much of the GPU, including most parts of the scheduling pipeline, is highly parallel and unlikely to hinder multiple compute partitions from running simultaneously. In this subsection, we examine the narrow cases when it does prevent partitions from running in parallel.

The consequences of greedy assignment. The WDU, the last step in the scheduling pipeline, dispatches blocks from kernels in global priority and then time-of-arrival order, with limited situations in which it will skip ahead. This greedy assignment approach can result in particularly non-optimal block assignments when overlapping partitions are in use.

We demonstrate one such case in Fig. 11. Both subfigures have identical configurations; the GPU is split into two partitions: SMs 0–8 for Stream 1, and SMs 0–3 for Stream 2. K1 and K2 are GPUSpin benchmarks launched simultaneously with identical configurations in both subfigures. In the desired outcome, at right, both kernels complete all their blocks before time 0.5s. However, this outcome only occurs approximately 50% of the time. The result at left occurs otherwise.

In the experiment of Fig. 11, K1 and K2 race to the WDU. Whichever reaches the WDU first has all its blocks dispatched first. When K1 arrives first, its blocks are dispatched across every TPC in its partition according to a previously researched assignment algorithm [21]. This saturates all the TPCs in K2’s partition, forcing it to wait until K1 completes before running. If K2 arrives first, its blocks are dispatched across its partition, then K1’s across its, resulting in a more efficient assignment.
Fig. 11. Serial greedy block dispatch yielding unpredictable and bad assignments when overlapping partitions are used on the GTX 1060 3 GB. [Plot not reproduced; subfigure (a) shows K1 assigned first and subfigure (b) shows K2 assigned first, with per-SM block executions over time in seconds.]

The consequences of task slot exhaustion. The constrained number of Task Slots is one of the few limits in the hardware scheduling pipeline in NVIDIA GPUs. Effectively, the number of Task Slots limits the number of kernels that the WDU can simultaneously consider for dispatch onto the GPU.

Task slots fill in priority order, with lower-priority tasks evicted as higher-priority ones become available in the TMU. This can cause unrelated compute partitions to block one another when many streams and stream priorities are in use. We illustrate one such case for the GTX 1060 3 GB, which has 32 Task Slots, in Fig. 12.

Fig. 12. Task Slot exhaustion on the GTX 1060 3GB causing an unrelated partition to block K33 from dispatching blocks. [Plot not reproduced; it shows 33 streams, Stream 1 (K1) through Stream 33 (K33), with per-SM block executions over time in seconds and K33’s blocking time marked.]

Fig. 12 includes 33 streams, each containing a single, respectively numbered kernel. The GPU is partitioned roughly in half: Streams 1–32 on SMs 0–3, and Stream 33 on SMs 4–8. K33 consists of 60 750-ms blocks, whereas K1–K32 each consist of one 2.5-s block. All kernels are derived from the GPUSpin benchmark. Streams 1–32 are higher priority than Stream 33.

Observe how K33 stops dispatching new blocks shortly after time 0.2s, not resuming until time 0.4s. K33 is in an independent stream and partition, but gets evicted from the WDU by the TMU after K1–K32 arrive, as they are higher priority. While these 32 kernels wait to complete execution, no Task Slots are available, and K33 must wait in the TMU, unable to dispatch blocks. Once one of the higher-priority kernels completes, as they do around time 0.4s, their Task Slot is vacated, allowing K33 to take it and once more resume dispatching blocks.

In order to prevent this sort of blocking across unrelated partitions, we recommend using no more CUDA streams than the GPU has WDU slots, or prohibiting the use of CUDA stream priorities across all partitions. As long as the number of pending kernels does not exceed the number of WDU slots, eviction will not occur, nor will it occur if all kernels have identical priority.
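One way to follow this recommendation in application code is to draw all kernel launches from a fixed pool of streams no larger than the device’s task-slot count; the sketch below (ours) uses the 32-slot figure of the GTX 1060 3 GB discussed above, which would need to be adjusted per GPU.

    #include <cuda_runtime.h>

    /* Sketch (ours): cap the number of CUDA streams at the device's task-slot
     * count (32 on the GTX 1060 3 GB; other GPUs differ), per the
     * recommendation above, so that pending kernels are not evicted from the
     * WDU as described in Sec. VI-C. */
    #define STREAM_POOL_SIZE 32
    static cudaStream_t stream_pool[STREAM_POOL_SIZE];

    static void stream_pool_init(void) {
        for (int i = 0; i < STREAM_POOL_SIZE; i++)
            cudaStreamCreate(&stream_pool[i]);
    }

    /* Callers share pooled streams instead of creating their own. */
    static cudaStream_t stream_for_task(unsigned task_id) {
        return stream_pool[task_id % STREAM_POOL_SIZE];
    }

Note that sharing a stream serializes the kernels placed in it, so the pool size trades off eviction avoidance against intra-stream queueing.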
D. Evaluating Partitioning Strategy

The principal preexisting work on spatial partitioning of GPU compute units found that the assignment of compute units to partitions is crucial on AMD GPUs, as adding compute units to partitions may slow the partition in some cases [14]. That work proposed and evaluated two partitioning strategies: SE-packed and SE-distributed. SE stands for Shader Engine, and is equivalent to a GPC in our context. The SE-packed algorithm attempts to allocate all TPCs from the same GPC to the partition before expanding the partition across multiple GPCs. Alternatively, the SE-distributed algorithm attempts to distribute the TPCs of each partition across GPCs as evenly as possible. As this language has been adopted by other work [16], we continue its use here.

To evaluate if any similar slowing effects occur on NVIDIA GPUs, we port the relevant portions of the hip_plugin_framework used in [14] to our variant of the cuda_scheduling_examiner and reproduce their experiment, with the results shown in Fig. 13.

In Fig. 13, we mirror the figure layout of prior work ([14], Fig. 9), including how long a matrix multiply takes for each partition size under each assignment algorithm. Unlike on AMD GPUs, we find that adding an additional compute unit (TPC) to a partition does not slow that partition, no matter whether the unit is assigned from the same, or a different, GPC. This is evident by the near-identical alignment of execution times for each algorithm. However, this finding does not obviate considerations around inter-task interference, and so we recommend using the SE-packed assignment strategy to minimize intra-GPC cache and bus interference.
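To make the two strategies concrete, the sketch below (ours; neither function is part of libsmctrl) assembles a k-TPC enable mask from the per-GPC bitmasks returned by libsmctrl_get_gpc_info() in Listing 2.

    #include <stdint.h>

    /* Sketch (ours): build a k-TPC enable mask from per-GPC TPC bitmasks
     * (as returned by libsmctrl_get_gpc_info(); see Listing 2).
     * SE-packed: exhaust one GPC before taking TPCs from the next.
     * SE-distributed: take TPCs from the GPCs round-robin. */
    static uint64_t take_one_tpc(uint64_t *remaining) {
        uint64_t bit = *remaining & -*remaining; /* lowest set bit */
        *remaining &= ~bit;
        return bit;
    }

    static uint64_t se_packed(const uint64_t *tpcs_for_gpc, uint32_t num_gpcs,
                              unsigned k) {
        uint64_t mask = 0;
        for (uint32_t g = 0; g < num_gpcs && k > 0; g++) {
            uint64_t rem = tpcs_for_gpc[g];
            while (rem && k > 0) { mask |= take_one_tpc(&rem); k--; }
        }
        return mask; /* enabled TPCs; pass ~mask to libsmctrl_set_*_mask() */
    }

    static uint64_t se_distributed(const uint64_t *tpcs_for_gpc, uint32_t num_gpcs,
                                   unsigned k) {
        uint64_t mask = 0, rem[64] = {0};
        for (uint32_t g = 0; g < num_gpcs && g < 64; g++) rem[g] = tpcs_for_gpc[g];
        while (k > 0) {
            int progress = 0;
            for (uint32_t g = 0; g < num_gpcs && g < 64 && k > 0; g++)
                if (rem[g]) { mask |= take_one_tpc(&rem[g]); k--; progress = 1; }
            if (!progress) break; /* fewer than k TPCs exist */
        }
        return mask;
    }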
TABLE III
YOLOV2 RUNTIMES

Case   Competitor               Partition   Mean   Min   Max   St. Dev.
1      None                     None        24     20    53    2.7
2      YOLOv2                   None        40     35    69    2.8
3      Malfunctioning YOLOv2    None        82     73    111   2.9
4      Malfunctioning YOLOv2    50/50       45     35    157   22
[Figure (caption not recovered): YOLOv2 execution time versus the number of TPCs allocated to Instance A (1–7); grey denotes the min/max range, the colored line the mean, for YOLOv2 Instances A and B.]

[...] how exactly the GPU Host Interface selects which queue to pull from when acquiring commands.

IX. ACKNOWLEDGEMENTS

We thank developers of the nouveau project, as well as [...]
[20] N. Otterness, M. Yang, T. Amert, J. Anderson, and F. D. Smith, “Inferring the scheduling policies of an embedded CUDA GPU,” in Proceedings of the 13th Annual Workshop on Operating Systems Platforms for Embedded Real Time Applications, July 2017.
[21] I. S. Olmedo, N. Capodieci, J. L. Martinez, A. Marongiu, and
M. Bertogna, “Dissecting the CUDA scheduling hierarchy: a perfor-
mance and predictability perspective,” in Proceedings of the 26th IEEE
Real-Time and Embedded Technology and Applications Symposium, Apr
2020, pp. 213–225.
[22] R. Spliet and R. Mullins, “The case for limited-preemptive scheduling in
GPUs for real-time systems,” in Proceedings of 14th Annual Workshop
on Operating Systems Platforms for Embedded Real-Time Applications,
July 2018.
[23] N. P. Authors, “Nouveau: Accelerated open source driver for nVidia
cards,” 2022. [Online]. Available: https://nouveau.freedesktop.org/
[24] M. P. Authors, “The mesa 3d graphics library,” 2022. [Online].
Available: https://www.mesa3d.org/
[25] J. F. Duluk Jr, T. J. Purcell, J. D. Hall, and P. A. Cuadra, “Error checking
in out-of-order task scheduling,” U.S. Patent 9,965,321, May, 2018.
[26] NVIDIA, “Open GPU documentation.” [Online]. Available: https:
//github.com/NVIDIA/open-gpu-doc
[27] T. J. Purcell, L. V. Shah, and J. F. Duluk Jr, “Scheduling and management
of compute tasks with different execution priority levels,” U.S. Patent
Application 13/236,473, Sep., 2011.
[28] J. F. Duluk Jr, L. Durant, R. M. Navarro, A. Menezes, J. Tuckey,
G. Hirota, and B. Pharris, “Dynamic partitioning of execution resources,”
U.S. Patent 11,307,903, Apr., 2022.
[29] K. M. Abdalla, L. V. Shah, J. F. Duluk Jr, T. J. Purcell, T. Mandal, and
G. Hirota, “Scheduling and execution of compute tasks,” U.S. Patent
9,069,609, Jun., 2015.
[30] P. A. Cuadra, C. Lamb, and L. V. Shah, “Software-assisted instruction
level execution preemption,” U.S. Patent 10,552,201, Feb., 2020.
[31] S. H. Duncan, L. V. Shah, S. J. Treichler, D. E. Wexler, J. F. Duluk Jr,
P. B. Johnson, and J. S. R. Evans, “Concurrent execution of independent
streams in multi-channel time slice groups,” U.S. Patent 9,442,759, Sep.,
2016.
[32] M. W. Rashid, G. Ward, W.-J. R. Huang, and P. B. Johnson, “Man-
aging copy operations in complex processor topologies,” U.S. Patent
10,275,275, Apr., 2019.
[33] M. Sullivan, S. K. S. Hari, B. M. Zimmer, T. Tsai, and S. W. Keckler,
“System and methods for hardware-software cooperative pipeline error
detection,” U.S. Patent Application 17/737,374, Aug., 2022.
[34] P. Tasinga, D. B. Yastremsky, J. Wyman, A. Ihsani, P. Nahar,
and P. Bhatt, “Neural network scheduler,” U.S. Patent Application
17/115,631, Jun., 2022.
[35] NVIDIA, “nvgpu git repository.” [Online]. Available: git://nv-tegra.
nvidia.com/linux-nvgpu.git
[36] ——, “NVIDIA’s next generation CUDA compute architecture: Kepler
GK110/210,” NVIDIA, Tech. Rep., 2014.
[37] NVIDIA, “CUDA C++ programming guide,” 2022, version PG-02829-
001_v11.8.
[38] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Jul. 2017, pp. 7263–7271.
[39] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
S. Zagoruyko, “End-to-end object detection with transformers,” in
European conference on computer vision. Springer, 2020, pp. 213–
229.
[40] “LITMUSRT home page,” Online at http://www.litmus-rt.org/, 2020.
[41] B. Brandenburg, “Scheduling and locking in multiprocessor real-time
operating systems,” Ph.D. dissertation, University of North Carolina at
Chapel Hill, 2011.
[42] J. M. Calandrino, H. Leontyev, A. Block, U. C. Devi, and J. Anderson,
“LITMUSRT : A testbed for empirically comparing real-time multipro-
cessor schedulers,” in RTSS, 2006, pp. 111–126.
[43] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn,
and A. Zisserman, “The PASCAL Visual Object Classes
Challenge 2012 (VOC2012) Results,” http://www.pascal-
network.org/challenges/VOC/voc2012/workshop/index.html.
[44] J. Bakita, S. Ahmed, S. H. Osborne, S. Tang, J. Chen, F. D. Smith, and J. H. Anderson, “Simultaneous multithreading in mixed-criticality real-time systems,” in Proceedings of the 27th Real-Time and Embedded Technology and Applications Symposium, May 2021, pp. 278–291.
[45] T. Amert, Z. Tong, S. Voronov, J. Bakita, F. D. Smith, and J. H. Anderson, “TimeWall: Enabling time partitioning for real-time multicore+accelerator platforms,” in Proceedings of the 42nd IEEE Real-Time Systems Symposium, Dec 2021, pp. 455–468.