http://dx.doi.org/10.1109/TPDS.2014.2333526
A Performance Model for GPUs with Caches
Thanh Tuan Dao, Jungwon Kim, Sangmin Seo, Member, IEEE, Bernhard Egger, Member, IEEE, and Jaejin Lee, Member, IEEE
All authors are with the School of Computer Science and Engineering, Seoul National University, Seoul, Korea. E-mail: [email protected]
Abstract—To exploit the abundant computational power of the world's fastest supercomputers, an even workload distribution to the typically heterogeneous compute devices is necessary. While relatively accurate performance models exist for conventional CPUs, accurate performance estimation models for modern GPUs do not exist. This paper presents two accurate models for modern GPUs: a sampling-based linear model, and a model based on Machine-Learning (ML) techniques which improves the accuracy of the linear model and is applicable to modern GPUs with and without caches. We first construct the sampling-based linear model to predict the runtime of an arbitrary OpenCL kernel. Based on an analysis of NVIDIA GPUs' scheduling policies, we determine the earliest sampling points that allow an accurate estimation. The linear model cannot capture well the significant effects that memory coalescing and caching, as implemented in modern GPUs, have on performance. We therefore propose a model based on ML techniques that takes several compiler-generated statistics about the kernel as well as the GPU's hardware performance counters as additional inputs to obtain a more accurate runtime estimation for modern GPUs. We demonstrate the effectiveness and broad applicability of the model by applying it to three different NVIDIA GPU architectures and one AMD GPU architecture. On an extensive set of OpenCL benchmarks, the proposed model estimates the runtime performance with, on average, less than 7% error for a second-generation GTX 280 with no on-chip caches and less than 5% error for the Fermi-based GTX 580 with hardware caches. On the Kepler-based GTX 680, the linear model has an error of less than 10%. On the AMD Radeon HD 6970, the model estimates the runtime with an error rate of 8%. The proposed technique outperforms existing models by a factor of 5 to 6 in terms of accuracy.
Index Terms—GPU, performance modeling, caches, scheduling, OpenCL, NVIDIA, AMD
1 Introduction
Today, the world's second-fastest supercomputer,
Titan, and many more on the Top500 list [25]
are heterogeneous systems comprising both CPUs
and GPUs. In order to achieve optimal performance
the workload needs to be distributed evenly to the
different computing nodes, and to do so an accurate
performance model is required.
Static workload distributions [8], [28] based on
throughput and threshold values are often far away
from the optimal distribution because the performance difference between CPUs and GPUs heavily
depends on the program characteristics. A performance estimation model can be used to dynamically
distribute the workload inversely proportional to the
estimated execution time. While performance modeling for general-purpose CPUs has been researched
actively and accurate performance models are available, the state-of-the-art models for GPUs still suffer
from a relatively large estimation error. Models for
older GPU architectures do not consider the GPU’s
hardware caches [11], [2], [29], and many are not
suited for performance estimation at runtime [12],
[14], [2], [29].
A performance model suitable for runtime workload distribution thus needs to capture and characterize the intricate interactions of the most important
hardware and software features of modern GPUs.
This is undoubtedly a challenging problem because
of the GPU’s complex hardware that enables its massively parallel processing capability. Aside from a
large number of processors, each with multiple scalar
execution units, the GPU’s hardware thread context
switching mechanism and the on-board memory subsystem with different levels of cache memories further complicate the task. Several approaches [2], [3],
[11], [13], [15], [29] have been proposed to model
GPU performance, including analytical modeling and
Machine Learning (ML)-based modeling. The analytical modeling approaches [2], [11], [15], [29] typically
rely on micro-architecture information to predict the
performance of a program in a handcrafted manner.
As GPU architectures continue to evolve dramatically, this approach is unattractive: a minor change
in the architecture may require extensive work to
adapt the model to the new architecture. On the other
hand, instead of evaluating a number of predefined
formulas and quickly reporting the execution time,
ML models [3], [4], [13] rely on training data to learn
the mapping between program features and execution
time. ML-based approaches seem to be more appropriate for this task as they are more robust to changes
in the GPU architecture.
In this paper, we present two models to accurately
predict the performance of an OpenCL kernel on
GPUs. Both the linear and the ML-based model rely
on sampling information to overcome the limitations
of analytical methods. Sampling incurs an overhead
because a small part of the actual workload has to
be executed before the performance of the entire
workload can be predicted; thus, it is important to
keep the sampling overhead to a minimum. In order
to determine the earliest sampling points that allow
an accurate estimation, it is essential to have a clear
understanding how GPUs schedule a workload. Unfortunately, GPU manufacturers do not disclose how
the work is distributed to the different compute units
of a GPU, or how the more than 1000 active threads
are scheduled inside these units. We thus reverse-engineer the scheduling policies of GPUs by analyzing the results of a set of hand-crafted OpenCL micro-benchmarks. Based on these observations, we formulate and verify the scheduling policies of modern
NVIDIA GPUs which then form the basis of the two
models.
The analysis of the GPU’s scheduling policy allows
us to compute the minimal number of work-items at
which the GPU attains maximum throughput. The
sampled data is used by a linear model to extrapolate the total execution time of the entire kernel.
This linear model works reasonably well because the
amount of work per work-item in scientific workloads
is typically distributed evenly. On modern GPUs with
on-chip caches and memory coalescing, however, the
execution time shows much more variance. To cope
with such architectures we combine the linear model
with an ML-based approach that takes the data of the
GPU’s hardware performance counters as additional
inputs. Training the ML-based model with over 300
data sets allows the model to detect the complex
correlations between a workload’s sampled data and
the actual execution time. The experiments show that
the ML-based model is able to improve the accuracy
of the linear model significantly.
We evaluate the accuracy of the proposed model
with 70 different OpenCL kernels by comparing
the estimated kernel execution time with the actual
measured execution time. We evaluate the model
on different GPU architectures from two representative GPU vendors: NVIDIA and AMD. On the GTX
580 [16], a representative of NVIDIA's Fermi GPU architecture, we achieve, on average, an error rate of 5.72% for the linear model and 4.76% for the ML-based model. To demonstrate that our model is not
tied to a certain GPU architecture, we have applied it
to an NVIDIA GTX 280 and an NVIDIA GTX 680.
The GTX 280 is a second-generation GPU with no
hardware caches. The model achieves an error rate
of 7.19% for the linear and 6.42% for the ML-based
model. The GTX 680 is based on NVIDIA’s latest Kepler architecture [17]. It is currently impossible to read
the GPU’s performance counters from OpenCL, hence
we were only able to run the linear model. An error
rate of less than 10% shows that the linear model,
although simple, can cover a wide range of GPU
generations with acceptable error rates. Interestingly,
on the AMD Radeon HD 6970, the model achieves an
error rate of 8%. This shows that although the model is based on an analysis of a specific NVIDIA GPU, it can be applied not only to different GPU generations from NVIDIA but also to GPUs from AMD.
The contributions of this paper are:
• An in-depth analysis of the scheduling policies of NVIDIA GPUs that allows us to determine the sampling points with the fewest number of work-items that still allow an accurate performance estimation.
• A simple linear performance model that is very accurate for applications with an evenly distributed workload.
• An ML-based performance model that covers a wide range of applications and significantly improves the accuracy of the linear model.
• An implementation and evaluation of the models with 70 OpenCL kernels from four different benchmark suites and OpenCL kernel collections on four different generations of GPU architectures from two different vendors.
The rest of this paper is organized as follows.
Section 2 reviews related work. Section 3 contains a
brief introduction to the OpenCL execution model and
the Fermi GPU architecture. In Section 4, we provide
a comprehensive analysis of the way modern GPUs
execute kernels. The performance estimation models
described in detail in Section 5 are built upon these
observations. Section 6 evaluates and compares the
accuracy of our models, and Section 7 concludes this
paper.
2 Related Work
Wong et al. [27] have analyzed the NVIDIA GTX
280 and revealed a number of undisclosed characteristics by running micro-benchmarks. However, their
analysis does not consider contention and does thus
not provide much information about how warps are
scheduled. Liu et al. [15] have developed a performance predictor for a specific application. Although
the prediction is precise, this method requires a priori
knowledge of the application’s throughput. Bitirgen
et al. [4] proposed an interval-based framework to
dynamically select resource allocation decisions on
chip multiprocessors. Their model takes program behavior as input and estimates the performance of
the program at certain intervals. Though interesting,
this interval-based approach is not applicable to GPU
performance estimation.
Zhang and Owens [29] have developed a performance model for CUDA programs executing on
NVIDIA GeForce 200-series GPUs. Their main goal
is to help the programmer identify performance bottlenecks, potential optimizations and architecture improvements. Static program features are collected by
the compiler and dynamic features are collected in a
simulator [5]. Together with machine-specific characteristics, these features are used to analytically calculate the execution time of CUDA applications. The overhead of the simulation, however, can easily exceed the running time of the
kernel on a GPU itself. Jia et al. [14] proposed a
regression-based performance model for GPU design
space exploration. Training this model requires executing the program over a large number of design
points which makes this approach unsuitable for
workload distribution.
Kerr et al. [30] have proposed a Machine Learning
based model to predict performance of CUDA programs on GPUs and CPUs (based on Ocelot compiler).
The static features of the program are used to derive
the relationship between the program behavior and
the performance on a target architecture. Because the model relies only on static features of the program, it is limited to an average error of 30%. Luk et al. [32] have proposed a
linear regression-based model to distribute workload
to a heterogeneous system of CPUs and GPUs. Unlike
the proposed technique, the linear regression-model
needs to be trained for each new kernel encountered.
Hong and Kim [11], Baghsorkhi [2] and Grewe [31]
have presented performance models that can be used
for workload distribution. These models are based on
static information of CUDA/OpenCL programs and
require the programs to be written in a parametrized
way so that variables related to the problem size
can be analyzed symbolically. Another drawback of
these models is that they do not consider the presence
of caches and are thus not applicable to modern
GPUs with caches. Finally, none of these models take
into account the interaction between the application
and the hardware as well as the effect of compiler
optimizations.
The models proposed in this paper remove the
restrictions of static analysis by using dynamic information through sampling. Experiments with a large
number of OpenCL kernels from a wide range of applications show that the models can be easily applied
to GPUs of different architectures and produce a
more accurate performance estimate with an average
error of less than 5% for the GTX 580, 7% for the GTX
280, 10% for the GTX 680 and 8% for the HD 6970.
3 Background
3.1 OpenCL Framework
In this section, we briefly introduce the OpenCL
framework, NVIDIA's Fermi GPU architecture [19], and provide a short introduction to the
Machine Learning techniques applied in this work. To
make the discussion consistent we use OpenCL terms
to describe the GPU architecture.
In the OpenCL platform model [9] a host processor
is connected to one or more compute devices. A compute device contains a number of compute units (CUs),
each of which contains one or more processing elements
(PEs) (Figure 1).
Fig. 1: OpenCL architecture: a host processor with main memory is connected via an interconnect bus to one or more compute devices; each compute device contains compute units (CUs) with processing elements (PEs), local and private memory, and global/constant memory with a global/constant memory data cache.
An OpenCL application consists of a host program
and one or several kernels. The host program is executed on the host processor, and the kernels are executed on the devices. For each kernel, the host defines
an N -dimensional abstract index space in which the
kernel will be executed (N ∈ {1, 2, 3}). Each point in
this space defines one execution instance of the kernel,
called work-item. The work-items are organized into
groups of equal size, called work-groups. The workgroups are executed independently, i.e., concurrently
and in any order.
There are four different types of memory accesses
in an OpenCL kernel: global memory, constant memory, local memory, and private memory. Global and
constant memory accesses have the highest latency
since the accessed data resides in the device global
memory. The local memory is shared by all PEs in
the same compute unit. The private memory is local
to a PE. Accesses to the global memory or the constant
memory may be cached in the global/constant memory data cache if there is such a cache in the device.
3.2 GPU Architecture
NVIDIA's Fermi architecture [19] is designed for massively parallel processing. It comprises a fairly large
number of streaming multiprocessors (SMs). Each SM
contains 32 streaming processors (SPs) for general
computations, 16 load/store (LD/ST) units, and four
Special Function Units (SFUs) that handle transcendental operations. There are two levels of data caches:
an L1 cache for each SM and an L2 cache shared by all
the SMs. Figure 2 shows a schematic block diagram
of one of the total 16 SMs in the GTX 580 GPU. In
OpenCL terminology, the GPU represents a compute
device, an SM corresponds to a CU, and an SP to a
PE.
Fig. 2: Schematic block diagram of one SM in the GTX 580 GPU: two warp schedulers feed 32 SPs, 16 LD/ST units, and 4 SFUs; each SM has an L1 cache/shared memory, and all SMs share an L2 cache in front of the global (device) memory.
Fig. 3: Relationship of work-groups, work-items, and warps: consecutive work-items of the execution index space form work-groups, and every 32 consecutive work-items of a work-group form a warp.
When running a work-group on an SM, each work-item is executed by one thread. Threads are bundled
into so-called warps. One warp consists of 32 threads
(or work-items; Figure 3). All threads in a warp
execute in lock-step. The GPU compiler thus converts
conditionally executed code (such as if-else-constructs)
into predicated code. A divergence describes the situation when, at run-time, not all 32 threads take
the same control path. When divergence occurs, both
parts of the conditionally executed code must be
executed. Divergence inevitably leads to reduced performance.
The work-group is the minimal allocation unit of
work for an SM. A work-group running on an SM
is called an active work-group [22]. Similarly, active
warps denote warps that are currently executing or
eligible for execution. If enough hardware resources
are available, several work-groups can be active simultaneously in one SM. The number of active workgroups depends on the kernel code, the kernel’s index
space, and the hardware resources of the SM. The
metric to characterize this degree of concurrency is
termed occupancy [21]. Occupancy is defined as the
ratio of the actual number of active warps to the
maximum possible number of active warps per SM.
This maximum is hardware specific; in the case of the
GTX 580, at most 48 warps (and thus 48 ∗ 32 = 1536
threads or work-items) can be active in one SM at the
same time.
The work-groups and warps in a work-group are
formed from consecutive work-items. If the size of
a work-group is not divisible by the warp size (i.e.,
32), every work-group contains one warp that has
less than 32 work-items. Thus, to maximize resource
utilization, the number of work-items in a work-group
should always be a multiple of 32.
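As a small numeric illustration of this padding effect, the fraction of occupied warp slots can be computed as in the sketch below; the function name and the example sizes are ours, only the warp size of 32 comes from the text.

```python
WARP_SIZE = 32  # work-items per warp, as stated above

def warp_slot_utilization(work_group_size):
    """Fraction of warp slots occupied by real work-items in one work-group."""
    warps = -(-work_group_size // WARP_SIZE)       # ceiling division
    return work_group_size / (warps * WARP_SIZE)

print(warp_slot_utilization(100))   # 100 work-items -> 4 warps -> ~0.78
print(warp_slot_utilization(128))   # multiple of 32 -> 1.0
```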
The active work-groups that execute concurrently
in an SM might have different starting and finishing times. As soon as one work-group finishes, the
SM scheduler activates a new work-group. The 32
work-items of a warp are conceptually executed in
one cycle. Recall from Figure 2 that one SM in the
Fermi architecture contains 32 SPs for general-purpose
instructions but only 16 load/store units. Every time
a warp issues a load/store instruction, its execution
would thus have to be split up into two cycles. Instead
of doing so, every SM contains two warp schedulers
that, in every clock cycle, schedule two half-warps from
different warps [22]. One scheduler handles warps
with even IDs, and the other one handles warps with
odd IDs. Both schedulers issue the instruction to one
of the four execution units (2 x 16 SPs, 1 x 16 LD/ST
units, 1 x 4 SFUs). Such a setup uses the hardware
resources more efficiently and does not require extra
management of half-way executed warps because,
for example, a general-purpose instruction from one
half-warp and a memory operation from the other
half-warp can be executed in parallel. However, for
transcendental operations that are scheduled on the 4
SFUs, a half-warp will require at least four cycles to
be scheduled.
Each SM contains an instruction cache and 64KB of
local data memory that can be dynamically configured
as scratchpad memory or L1 cache. A unified L2 data
cache of 768KB is shared by all SMs in the GPU.
3.3 Support Vector Regression
Here, we briefly introduce Support Vector Regression
(SVR), the regression learning algorithm used in the
ML-based model. We use a non-linear form of a supervised learning algorithm called ǫ-SVR [26]. In this
form, the model takes a feature vector as its input; this
is simply a vector of input parameters. In a learning
phase the model is trained with a large number of input vectors and the corresponding output values. The
model learns the complex interactions between the
features of the input vector and the desired objective.
We use the leave-one-out cross-validation technique: we
train the model with training data from which all data
points of the kernel under test have been removed,
and then apply the model to the kernel. This method
guarantees that the training data set and the test data
set are distinct which is important if we only have a
limited number of training data points.
In our context, the input data comprises a set of performance features of an OpenCL kernel (provided by the compiler and obtained through sampling), and the objective is the execution time of the kernel.
For the interested reader, the following paragraph
provides some more details about ε-SVR. To keep the
discussion reasonably simple, we describe the linear
form; the basic idea is identical in the non-linear form.
The training phase is performed on a collection of
training data. A data instance has the form {x, y}
where x is a collection of d representative features
for this data instance and y is the objective that is
associated with this data instance. The goal is to find
a function f(x) that deviates by no more than ε from the training target y_i for all y_i while being as flat as possible. In the linear problem statement, the function f(x) has the form

f(x) = w · x + b

where w denotes the normal vector to the hyperplane and b is the bias, with w ∈ R^d, b ∈ R. Flatness in this case means that w is small [24]. Denoting the training data as (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), the objective then becomes

minimize   (1/2) ||w||^2                               (1)
subject to |w · x_i + b − y_i| ≤ ε,  for all i         (2)
This convex optimization problem can be solved using
its dual formulation by introducing Lagrange multipliers [26]. Condition 2 assumes that such a function
f (x) exists; however, this is often not the case for large
datasets. To overcome this limitation, soft margin
SVR [6] was introduced. Details of solving the dual
formulation and soft margin SVR are out of the scope
of this paper.
We chose ε-SVR because it generalizes well to unseen data. In the training step, rather than only minimizing the observed training error, ε-SVR finds the
trade-off between the error and the complexity of the
objective function. Once the function f (x) has been
determined, it can be used to predict the objective
value for yet unseen feature vectors x.
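To make the training procedure concrete, the sketch below trains a non-linear (RBF) ε-SVR with the leave-one-out scheme described above, using scikit-learn as a stand-in implementation; the hyperparameter values and the shape of the feature matrix are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_kernel_out_errors(X, y, kernel_names):
    """Train an RBF epsilon-SVR while holding out one kernel at a time and
    return the mean relative prediction error per held-out kernel.

    X: (n_samples, n_features) feature matrix, y: target values,
    kernel_names[i]: name of the OpenCL kernel that produced sample i."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    kernel_names = np.asarray(kernel_names)
    errors = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, kernel_names):
        # Standardize features, then fit a non-linear epsilon-SVR.
        model = make_pipeline(StandardScaler(),
                              SVR(kernel="rbf", C=10.0, epsilon=0.01))
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rel_err = np.abs(pred - y[test_idx]) / np.abs(y[test_idx])
        errors[str(kernel_names[test_idx][0])] = float(rel_err.mean())
    return errors
```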
4 Prerequisites to Efficient Profiling: An Insight into Warp Scheduling
In order to accurately estimate the runtime performance of an OpenCL kernel, a deep understanding of
the inner workings of a GPU is necessary. In addition,
a sampling-based model must keep the number of
sampled work-groups to a minimum. This section
presents our analysis of the NVIDIA GTX 580 scheduler and provides the foundation for the performance
model presented in the following section.
From pieces of information about GPU scheduling
gathered from various sources [16], [18], [19], [21] we
first state and verify the following two assumptions:
1) On the device-level, the work-group scheduler
assigns a new work-group to an SM as soon
as an active work-group on that SM finishes
executing.
2) On the SM-level, the two warp-schedulers use
a round-robin scheduling policy to schedule the
active warps. Whenever a warp blocks or has
executed for a certain number of cycles, the
scheduler picks a new warp.
To accurately measure the execution time of work-items, we add instrumentation code to the very beginning and the very end of real-world and several hand-crafted kernels. The code reads the GPU's clock cycle
performance counter. This instrumentation allows us
to record the start time s_wi(i) and finish time e_wi(i) for each work-item i. The execution time is then given as t_wi(i) = e_wi(i) − s_wi(i). For a warp w containing N work-items, let

s_warp(w) = MIN_{k=1..N} s_wi(k)
e_warp(w) = MAX_{k=1..N} e_wi(k)
t_warp(w) = e_warp(w) − s_warp(w)
The start, finish, and execution time for a work-group j comprising M warps are defined accordingly:

s_wg(j) = MIN_{k=1..M} s_warp(k)
e_wg(j) = MAX_{k=1..M} e_warp(k)
t_wg(j) = e_wg(j) − s_wg(j)
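The aggregation of the instrumented timestamps into warp and work-group times can be sketched as follows; the array names are placeholders for whatever the instrumented kernel returns, and work-group sizes are assumed to be multiples of the warp size.

```python
import numpy as np

WARP_SIZE = 32  # work-items per warp on the GPUs discussed here

def warp_and_workgroup_times(s_wi, e_wi, work_group_size):
    """Aggregate per-work-item start/finish cycles (s_wi, e_wi) into
    per-warp and per-work-group execution times."""
    s_wi = np.asarray(s_wi, dtype=np.int64)
    e_wi = np.asarray(e_wi, dtype=np.int64)
    # Per-warp times: min start / max finish over the warp's work-items.
    s_warp = s_wi.reshape(-1, WARP_SIZE).min(axis=1)
    e_warp = e_wi.reshape(-1, WARP_SIZE).max(axis=1)
    t_warp = e_warp - s_warp
    # Per-work-group times: min/max over the group's warps.
    warps_per_group = work_group_size // WARP_SIZE
    s_wg = s_warp.reshape(-1, warps_per_group).min(axis=1)
    e_wg = e_warp.reshape(-1, warps_per_group).max(axis=1)
    t_wg = e_wg - s_wg
    return t_warp, t_wg
```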
Recall that several work-groups can be active within
one SM. The compiler computes a kernel’s occupancy
as a by-product of compiling. Based on the occupancy
we can compute how many work-groups can be active
simultaneously within one SM. For example, with an
occupancy of 1.0 and work-groups containing 256
work-items each, six work-groups can be active at
the same time (one work-group contains 256/32 = 8
warps, and to achieve an occupancy of 1.0, we need
6 work-groups with 8 warps to get 48 warps, the
maximum number of active warps per SM). Similarly,
if the occupancy is 0.833, five work-groups can be
active at the same time. We call the number of active
work-groups per SM for a given kernel the saturation
point, denoted Psat .
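For illustration, the saturation point can be derived from the occupancy and the work-group size as in the following sketch; the maximum of 48 active warps per SM is the GTX 580 value quoted above, and the function name is ours.

```python
MAX_ACTIVE_WARPS_PER_SM = 48   # GTX 580 (Fermi); hardware specific
WARP_SIZE = 32

def saturation_point(occupancy, work_group_size):
    """Number of work-groups that can be active simultaneously on one SM."""
    active_warps = round(occupancy * MAX_ACTIVE_WARPS_PER_SM)      # e.g. 1.0 -> 48
    warps_per_group = (work_group_size + WARP_SIZE - 1) // WARP_SIZE
    return active_warps // warps_per_group

# Examples from the text: occupancy 1.0 with 256 work-items per work-group
# gives 6 active work-groups; occupancy 0.833 gives 5.
assert saturation_point(1.0, 256) == 6
assert saturation_point(0.833, 256) == 5
```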
When executing one, then two, then up to Psat
work-groups on one SM we expect that the throughput, i.e., the number of issued instructions per second (IPS) increases up to the saturation point. At
that point, the running kernel exploits as much of
the hardware resources as possible, and the warp
schedulers within the SM have the biggest freedom
when picking ready-to-run warps. Micro-benchmarks
consisting only of ALU operations and no memory
accesses indeed show the expected behavior. For more
complex kernels, however, the IPS can increase even
after the saturation point due to cache warm-up
effects. In Figure 4 (a), the saturation point of the
microbenchmark is two, and indeed the throughput
achieves a stable maximum at multiples of Psat . In
between saturation points, maximum throughput is
not achieved because not all hardware resources are
fully utilized.
Fig. 4: Instructions per microsecond for the first 160 work-groups of a microkernel and nbody.
For nbody, the kernel from the NBody application (Figure 4(b)), the saturation point Psat is 3, yet the throughput clearly increases after the
first saturation point. This increase is caused by cache
warm-up effects, i.e., the first work-groups suffer from
more cold misses and can profit less from memory
coalescing than later work-groups. Figure 5 plots
the start and finish time for each work-group for
the microbenchmark and nbody. In the case of the
microbenchmark, Psat = 2, so the first 32 work-groups are started at the same time (each of the 16
SMs receives two work-groups). We observe that the
work-group scheduler distributes the first 16 work-groups one-by-one to each of the 16 SMs. Work-groups 16 to 31 are again evenly distributed to all
SMs, and so on. For nbody in Figure 5(b) Psat = 3 so
initially, each SM is given three work-groups. These
correspond to the first 16 ∗ 3 = 48 work-groups. For
both benchmarks, we observe that whenever a work-group finishes, the work-group scheduler immediately
assigns a new work-group to the corresponding SM.
This confirms assumption 1 about the work-group
scheduler.
We expect all active work-groups in an SM to start
and finish at the same time. Figure 5(a) confirms this
assumption for the microbenchmark. In the case of
nbody, however, we observe that the third work-group on each SM takes almost double the time to
finish compared to the first work-group (Figure 5(b)).
We call such a behavior a staircase. We have run a
significant number of hand-crafted micro-benchmarks
that stress different types of instruction mixes in loops
to explain the staircase. Figure 6(a) shows the behavior of a kernel that contains a single-precision floating
point addition in a loop. There are a total of 96 work-groups in the kernel index space of this benchmark,
with each work-group containing 256 work-items. The
kernel’s occupancy is 1.0, hence six work-groups or
48 warps are active at the same time per SM. In total,
there are 16 SMs on the GPU, so all 96 work-groups
can run in parallel. This explains why all work-groups
start at time 0. Since there is no resource contention
between warps, warps block very infrequently (e.g.,
only on changes in the control flow caused by iterating the loop).
Fig. 5: Start/finish times for the first 96 work-groups of a microkernel and nbody.
Fig. 6: Micro-benchmark containing a loop with (a) a single-precision floating point addition and (b) a double-precision floating point addition.
According to assumption 2, if a warp
does not block, the scheduler will continue to run it
until it exceeds a certain number of cycles. After this
point the scheduler will schedule another warp that
is ready-to-run. Since the kernel does not exhibit a
staircase, the selection of the next warp seems to be
round-robin.
To trigger resource contention, we replace the
single-precision with a double-precision floating point
addition. Single-precision floating point operations
are executed on one SP, but for double-precision operations, the two rows of sixteen SPs are logically linked
and operate as 16 double-precision SPs (Figure 2).
If one of the half-warps executes a double-precision
floating point operation, all 32 SPs are occupied and
the other half-warp will block unless it executes a
memory or transcendental operation. The execution
time of the work-groups is shown in Figure 6(b) and
clearly exhibits a staircase. If the warp schedulers scheduled half-warps in a strictly round-robin fashion, no staircase would appear since all warps would get an equal share of the SM.
Fig. 7: Progress within work-groups in the presence of a staircase.
To better understand how the warp schedulers select the next warps, we have added additional instrumentation code to the kernel. We not only measure
the start and finish time of the kernel, but also record
the time after processing 20%, 40%, 60%, and 80% of
the workload. The result of this experiment is shown
in Figure 7. The figure reveals that, while all work-groups start around time 0, they do not progress
at the same speed. In fact, the third work-group on
each SM does not even start running past the first
couple of loop iterations before the first work-group
has finished. The same is true for the fourth work-group (it effectively starts executing when the second
one finishes), and the fifth work-group (when the
third work-group finishes). Using this result, we refine
assumption 2 on warp-scheduling:
2) On the SM-level, the two warp-schedulers use a
round-robin scheduling policy when the active
warp has executed for a certain number of cycles. If a warp blocks, the scheduler selects the
first warp that is ready-to-run from the work-group with the lowest ID.
With this refined assumption, we can explain kernels that do not exhibit staircase behavior: their warps seldom block and thus run more or less at the
same speed thanks to the round-robin scheduling policy. In high-contention situations, however, the warp
schedulers give higher priority to earlier work-groups
by picking the first warp that is ready to run from the
work-group that has been active for the longest time
(i.e., has the lowest ID).
Clearly, the instruction mix will have an effect
on what warp will be selected. Figure 8 shows the
start and end times of all 96 simultaneously active
work-groups. We run a loop with one, two, and
three double-precision floating point additions in the
loop body. We anticipate that the staircase is more
pronounced the higher the resource contention. This
behavior is indeed observed in Figures 8 (a-c): in the
case of only one double-precision addition, the mix
between operations that cause a warp to block and
those that do not is still balanced so that in most cases
the scheduler can run a warp until its time quantum
expires. As we add more double-precision additions
in Figure 8(b-c), the contention becomes more severe.
Fig. 8: Micro-benchmark containing a loop with (a) one, (b) two, and (c) three double-precision floating point additions.
The warp schedulers now clearly favor work-groups with a low ID. Indeed, for the case with three double-precision additions, the later work-groups will not
even start executing before the first work-group has
finished executing (Figure 8(c)).
A warp can block for several reasons: it blocks
when it encounters an instruction whose operands
are not available (data dependency), when the required execution units are occupied (hardware contention), or when a change in control flow occurs. We observe
that there are three types of operations that can cause
hardware contention: double precision instructions,
transcendental instructions (such as sine, cosine, exp,
etc.), and memory operations. However, whether the
warp will actually block or not also depends on the
instruction mix of the kernel. If the kernel contains
a well-balanced mix of instructions that are handled
by different hardware resources, even a kernel with
many double-precision floating point operations may
not exhibit a staircase.
The following section details the construction of the
linear model, the ML-based model and, in particular,
the points at which the workload is sampled based on
the observations on the scheduling mechanism.
5 Performance Estimation
Based on our analysis of the execution model of
Fermi GPUs in the previous section, we first construct
a linear model to estimate the execution time of a
workload. This simple linear model, however, cannot
capture the non-linear effects of memory accesses on
performance. Since recent GPUs such as those based
on the Fermi architecture [19] implement several levels of hardware caches, we then propose a model
based on machine learning techniques to improve the
prediction accuracy for modern architectures and non-linear benchmarks.
Fig. 9: Sampling at saturation points and the linear model.
The ML-based model, guided by
the linear model, has more relaxed assumptions than
the linear model and can thus make a more accurate
execution time estimation for the general case.
5.1 Linear Model
The linear model estimates the total execution time of
a kernel by extrapolating the execution time obtained
from two sampling points. Our goal is to choose two
sampling points with as small a number of work-groups as possible while maintaining an acceptable
accuracy for the estimation to keep the overhead to a
minimum.
The previous section shows that even though a kernel fully utilizes the hardware resources of a GPU at
the saturation point, Psat , the throughput can increase
even after Psat due to warm-up effects of the L1 and L2 caches on the GPU (Figure 4(b)). We thus sample each kernel twice, once at Psat1 and once at Psat2,
where both Psat1 and Psat2 are multiples of Psat . The
execution time of the first sample, t_Psat1, determines the displacement, and the difference in execution time from the second sample to the first, i.e., t_Psat2 − t_Psat1,
represents the slope. Sampling at integer multiples of
the saturation point has the added advantage that the
staircase behavior can be ignored in the linear model.
In fact, the staircase as explained in Section 4 only
manifests itself in between two saturation points, at
(integer multiples of) the saturation point the sampled
execution times are identical. Figure 9 visualizes this
observation. We have found that Psat1 = 2 ∗ Psat and
Psat2 = 3∗Psat are sufficient to avoid the non-linearity
of the first few work-groups.
Using two sampling points at multiples of Psat , the
execution time of a kernel can be modeled as follows.
The first sampling point determines the displacement,
and the difference in execution time from the second
to the first sampling point yields the increment (Figure 9)
t_linear(N) = t_Psat1 + (t_Psat2 − t_Psat1) / (Psat2 − Psat1) × (N − Psat1)

Here, t_linear(N) denotes the total execution time of the kernel for N work-groups. t_Psat1 and t_Psat2 stand for the sampled execution times at the saturation points Psat1 and Psat2, respectively.
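Written as code, the extrapolation is a one-liner; the sketch below only restates the formula above with hypothetical variable names.

```python
def t_linear(n_work_groups, t_psat1, t_psat2, psat1, psat2):
    """Linear extrapolation of kernel execution time from two samples.

    t_psat1, t_psat2: measured execution times of the two sampling runs,
    taken at psat1 and psat2 work-groups (e.g. 2*Psat and 3*Psat)."""
    slope = (t_psat2 - t_psat1) / (psat2 - psat1)   # time per extra work-group
    return t_psat1 + slope * (n_work_groups - psat1)
```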
This simple linear model accurately estimates the
execution time of a kernel when the following two
conditions hold: (1) the workload is evenly distributed
and (2) the memory access patterns are similar for
all work-groups. The accuracy drops for unevenly
distributed workloads or workloads that exhibit a
variance in performance caused by caching or memory coalescing effects. The ML-based model tries to
eliminate these shortcomings.
5.2 Model based on Machine Learning
For the ML-based model, we break the task of estimating the execution time into two smaller tasks. First, we
estimate the saturated instructions per second (IPS)
of a kernel using a machine learning algorithm. By
instructions per second we refer to the number of
issued instructions per second, not the number of
executed instructions. The difference between issued
instructions and executed instructions is that issued
instructions include instructions that are serialized
due to hardware contention, memory conflicts, or divergence. Serialization seriously affects performance,
thus using issued instructions improves the accuracy
of the model. After approximating the saturated IPS,
we compute the total number of instructions for the
whole kernel and can easily calculate the total execution time. The ML-based model makes less assumptions and covers a wider range of kernels than
the linear model since it only requires the kernel to
distribute the workload evenly between the workgroups.
Estimation of the saturated IPS. The features of a
kernel comprise static features collected by the compiler and dynamic features obtained during sampling
runs on the GPU. We use the data from the same two
runs as the linear model (at the second and the third
saturation point). Table 1 lists the kernel features that
were used to construct the IPS estimation model for
the GTX 580 and the GTX 280.
Training and deploying the learning model. An
ML-model requires a large amount of training data
during the training phase. To obtain a sufficient
amount of data from various kernels, each kernel is
run several times, every time with a different number
of work-groups. Since the performance factors of a
kernel often fluctuate when executing the first few
work-groups, several runs at a varying number of
work-groups help cover all behavioral differences of
the kernel.
Each run produces a set of performance counters.
We combine these values with the static features
obtained from the compiler and the result produced
by the linear model using the second and the third
saturation point to form a feature vector. The feature
vector is associated with the saturated IPS of the
kernel to form the training data. The saturated IPS
is obtained by running the kernel with the largest possible input data and recording the IPS.

Feature | Description | Source | GPU Model
active work-groups | number of work-groups that can be active at the same time | compiler | GTX 280/580
registers per work-item | number of registers used by a work-item | compiler | GTX 280/580
execution time | execution time of the run | performance counter | GTX 280/580
branches | number of branches | performance counter | GTX 280/580
work-items | number of sampled work-items | performance counter | GTX 280/580
divergent branches | number of divergent branches | performance counter | GTX 280
L1/L2 miss rate | ratio of L1/L2 miss count to the issued instructions count | performance counter | GTX 580
total L1 accesses | L1 miss count plus L1 hit count | performance counter | GTX 580
issued instructions | number of issued instructions | performance counter | GTX 280
executed instructions | number of executed instructions | performance counter | GTX 280/580
bank conflicts | bank conflict count | performance counter | GTX 580
shared loads/stores | shared load/store count | performance counter | GTX 580
global loads/coalesced | number of global total/coalesced load requests | performance counter | GTX 280
warp serializes | number of warp serializes | performance counter | GTX 280
IPS | IPS obtained by the linear model | computed | GTX 280/580
TABLE 1: List of features of the ML model for the GTX 280 and the GTX 580
Estimating the total execution time. Calculating the total execution time for a kernel is straightforward. The features of a kernel include: the static features of the kernel; the performance data obtained from the two sampling runs; and the result given by the linear model. The features are combined to form a feature vector that is fed into the ML model. The model outputs the estimated IPS for one work-group. Since we assume that the workload is evenly distributed between the work-groups, we simply multiply the estimated IPS by the number of work-groups as follows:

t_ML(N) = (I_Psat2 × N) / (Psat2 × IPS_est)

Here, t_ML(N) denotes the total execution time of the kernel for N work-groups. I_Psat2 is the number of instructions issued on one SM at Psat2 work-groups, recorded when evaluating the linear model. Using sampling information at Psat2 leads to more accurate results than at Psat1 because of lessened warm-up effects and therefore more precise performance counter data. IPS_est is the estimated IPS computed by the ML model.

To summarize, as illustrated in Figure 10, the processing routine to test an OpenCL kernel (that is not included in the training set) is as follows:
1) Based on occupancy and work-group size, analytically calculate the number of concurrently active work-groups, Psat.
2) Extract the data buffer needed by the first work-groups at both sampling points and copy it to the device memory.
3) Sample at Psat1 and Psat2 work-groups and record the performance counters.
4) Evaluate the linear model to obtain the estimated execution time and compute IPS_est.
5) Combine the data from the performance counters with the estimated execution time from the linear model to form a feature vector for the ML-based model. Then use the output of the model, the estimated saturated IPS, to compute the estimated total execution time of the kernel for the entire workload (a code sketch of steps 4 and 5 follows below).
Fig. 10: Actions and information flow for both models: the OpenCL compiler produces kernel statistics and the kernel binary; two sampling runs at 2*Psat and 3*Psat work-groups yield runtimes and performance counters; the linear model produces t_linear, which together with the kernel and runtime statistics forms the feature vector for the trained ML model that outputs t_ML.
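The sketch below strings steps 4) and 5) together; the dictionary keys, the feature list, and the regressor interface are illustrative assumptions (the actual feature set is the one in Table 1), so this shows the computation rather than the authors' implementation.

```python
def estimate_kernel_time(n_work_groups, sample, regressor, feature_names):
    """Combine the linear model (step 4) with the ML model (step 5).

    sample: measurements from the two sampling runs, assumed to contain
    't_psat1', 't_psat2' (sampled execution times), 'psat1', 'psat2'
    (work-group counts of the two runs), 'issued_psat2' (instructions
    issued on one SM at Psat2), plus the counters named in feature_names.
    regressor: a trained model (e.g. the epsilon-SVR sketched earlier)
    that maps a feature vector to the saturated IPS."""
    # Step 4: linear extrapolation, used both as an estimate and a feature.
    slope = (sample['t_psat2'] - sample['t_psat1']) / \
            (sample['psat2'] - sample['psat1'])
    t_lin = sample['t_psat1'] + slope * (n_work_groups - sample['psat1'])

    # Step 5: feature vector = sampled counters + linear-model result,
    # fed to the regressor to obtain the estimated saturated IPS.
    features = [sample[name] for name in feature_names] + [t_lin]
    ips_est = regressor.predict([features])[0]

    # t_ML(N) = (I_Psat2 * N) / (Psat2 * IPS_est)
    total_instructions = sample['issued_psat2'] * n_work_groups / sample['psat2']
    return total_instructions / ips_est
```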
6 Evaluation
6.1 Evaluation Setup
The proposed model has been developed for NVIDIA
GPUs with caches. However, its simplicity and generality allow it to be applicable to any GPU that
shares similar design concepts. To demonstrate that it
can be easily adapted to other GPU architectures, we
have evaluated the model on different architectures
from the two most significant GPU vendors: NVIDIA
and AMD. For NVIDIA, we evaluate the model on
three NVIDIA GPUs from three architectural generations: the Kepler-based GeForce GTX 680, the Fermi-based GeForce GTX 580, and the GeForce GTX 280
as a representative for NVIDIA’s second-generation
GPU architecture without hardware caches. Currently,
there are no CUDA/OpenCL drivers for the GTX 680
that support reading the performance counters from
within OpenCL programs. For this reason we evaluate
only the linear model on the GTX 680. For AMD, we
evaluate the model on a recent GPU, the Radeon HD 6970, which includes hardware caches.
To test the models, we use 70 kernels extracted from
39 OpenCL applications. The OpenCL applications
stem from the Parboil [10], SNU NPB [23], SHOC [7],
AMD [1] and NVIDIA [18] benchmark suites. The selected kernels all satisfy two conditions: (1) the kernel
is executed with a large number of work-groups (at
least several hundred), and (2) the kernel's execution
time is longer than one tenth of a second to minimize
Source | Application (Kernels) | Input
AMD | Binomial (binomial) | 32768 samples
NVIDIA | Blackscholes (blackscholes), BoxFilter (BoxRowsLmem), ConvolutionSeparable (convolutionRows, convolutionColumns), CopyComputeOverlap (VectorHypot), DXTCompress (compress), DotProduct (DotProduct), FDTD3d (FiniteDifferences), HiddenMarkov (ViterbiOneStep), Histogram (histogram64), Matmul (matmul), MatVecMul (uncoal0, uncoal1, coal0, coal1, coal2, coal3), MedianFilter (ckMedian), Nbody (nbody), QuasiRandomGenerator (Quasi, InvCND), RadixSort (reorderDataKeys, radixSortBlockKey), SobelFilter (ckSobel), SortingNetworks (sortLocal, sortLocal1, mergeGlobal, mergeLocal), Transpose (trans_naive), Tridiagonal (pcr, cycle) | default
Parboil | CP (cp), Cutcp (lattice), LBM (StreamCollideX), Mri-q (ComputeQ_GPU), Mri-gridding (binning, reorder, sort, rearrange, gridding), Sad (mb_calc, calc_8, calc_16), RPES (computeX), Tpacf (tpacf) | large
SHOC | FFT (fft1D_512, ifft1D_512), BFS (bfs), Sgemm (sgemmNN, sgemmNT), Spmv (spmv_csr_vector), Stencil2D (stencilKernel) | -s 4
SNU NPB | BT (initialize2), CG (conj_grad_2, conj_grad_6), EP (ep), FT (cffts1, cffts2, cffts3, indexmap, evolve), LU (setiv), MG (resid, norm2u3, rprj3, psinv, interp) | class=B/C
TABLE 2: OpenCL kernels
the effect of small measurement fluctuations. Table 2
lists the applications, the kernels, the source of the
application and the input data. All experiments are
performed with the NVIDIA OpenCL driver 1.1 and
AMD APP SDK v2.9.
By executing each kernel at several saturation
points, we extract a total of 683 data instances for
NVIDIA GPUs and 578 data instances for AMD
GPUs. We use the leave-one-out cross validation (see
Section 3.3) technique to evaluate the ML model: for
each kernel to be evaluated, all data sets generated
by that kernel are removed from the data set before
training the ML model.
To estimate the execution time of a kernel, we
sample at the second and the third saturation point.
The kernel features and the performance counters are
used to compute the execution time for the linear
model. The result of the linear model is combined with
the recorded performance counters to estimate the
final execution time using the ML-based model.
6.2 Performance estimation results
6.2.1 NVIDIA GPUs
Figures 11 and 12 show the error rates of the
linear and the ML-based model in terms of estimated
execution time compared to the actual execution time
for NVIDIA GPUs. The error rates of the linear and
ML-based model are, on average, 5.72% and 4.76% for
the GTX 580 and 7.19% and 6.42% for the GTX 280,
respectively.
As expected, the linear model performs extremely
well for the regular kernels, i.e., when all performance
factors scale linearly. GPUs perform best when executing regular applications, so it is not surprising that the
performance counter data of 80% of all kernels we
have encountered scale linearly.
For certain kernels, the linear model over- or underestimates the execution time significantly. This comes
from several sources. First, even though we have
excluded very short running kernels, a few kernels
still exhibit a rather short execution time and cause a
high fluctuation in the measurements. The kernels belonging to this group include resid, rprj3, psinv,
stencilKernel, and trans_naive. These kernels show
an average error of approximately 10%.
Second, for some kernels we have observed a
significant warm-up effect at the beginning of the
execution. These effects are caused by the hardware caches, or special instruction scheduling policies that cause the GPU to not evenly distribute
the work-groups across the SMs. The learning model
observes these non-linear effects through the performance counters and learns to correct the result given
by the linear model. On the GTX 580, the significant
cases belonging to this type include ifft1D_512,
BoxRowsLmem, Quasi, MergeLocal, integrate,
StreamCollide, resid, norm2u3, and binning.
The ML-based model achieves the biggest improvement for binning. This kernel has a large number of
warps that are serialized (due to branch divergence)
and warp re-issues (due to cache misses) during
its execution. These serializations are unevenly distributed across the saturation points. The learning
model observes this warm-up effect and is able to
produce a much more accurate result. The kernel
Quasi has high error rate with the linear model
because during sampling, the GPU does not seem
to evenly distribute the work-groups across the SMs.
The GPU performance counters only report data for
one SM, hence if the SM being sampled is assigned
more work-groups than some other SMs the performance counter numbers are distorted. The ML-based
model observes this behavior from the performance
counter sm_cta_launch and achieves an improved result. The kernels ifft1D_512 and BoxRowsLmem
have high error rates for the linear model because
the performance factors, such as the number of L1
and L2 cache misses or the number of coalesced
memory requests, do not scale linearly. On the GTX
280, the kernels falling into this category include
binning, ComputeX, cffts3, ViterbiOneStep
and reorderDataKeys.
The third source for large errors are unevenly
distributed workloads across the work-groups. This
happens when a kernel contains a conditional branch
or a loop whose iteration count depends on the
value of some input data, such as the work-group
ID or a work-item ID. The kernels that exhibit this
behavior are binning, gridding and gen_hists,
StreamCollide, uncoal1, integrate on the GTX
280 and gridding, lattice, rprj3, trans_naive,
integrate on the GTX 580. On most of those kernels, the ML-based model performs slightly better than the linear model.
Fig. 11: Error rates of the linear and ML-based model on the GTX 580 for the execution time of each kernel.
Fig. 12: Error rates of the linear and ML-based model on the GTX 280 for the execution time of each kernel.
Fig. 13: Error rates of the linear model on the GTX 680 for the execution time of each kernel.
This is due to the fact that the
learning model attributes this imbalance to warm-up
effects and thus computes a more precise throughput.
Comparing the performance of the linear model on the GTX 580 and the GTX 280, we observe a substantial difference in accuracy for a number of
kernels. This includes the kernels gridding,
StreamCollide, calc_16, gen_hists, psinv,
cffts3, setiv, ifft1D_512, BoxRowsLmem,
ViterbiOneStep, uncoal1, StencilKernel and
Quasi. There are several reasons for these differences:
First, certain kernels have an imbalanced number
of replays (across the work-groups) on the GTX 580
but are balanced in terms of replays on GTX 280. A
replay is the re-execution of an instruction after a
cache-miss [20]. The kernels falling into this category
are psinv, BoxRowsLmem and ifft1D_512. For
these kernels the linear model produces more
accurate results on the GTX 280 than the GTX 580.
Similarly, the linear model works better on the GTX
580 for kernels that have imbalanced replays on the
GTX 280 but are balanced on the GTX 580. Included
in this group are StreamCollide and uncoal1.
Second, on the GTX 280 the requirements for memory
coalescing are stricter than on the GTX 580. For those
kernels, calc_16, cffts3 and ViterbiOneStep,
the warm-up effect is stronger on the GTX 280 and
therefore the linear model performs better for the
GTX 580 than the GTX 280. Third, due to different
architectural parameters, the saturation points are not
necessarily equal for the same kernel. gen_hists
has a large workload imbalance in the last quarter
of its work-groups. On the GTX 580, the second
sampling point includes work-groups from the last
quarter whereas on the GTX 280 the second sampling
point includes fewer work-groups, which do not exhibit
workload imbalance. For this reason, the linear
model has more information on the GTX 580 and is
thus more accurate. Similarly, gridding is highly
imbalanced in terms of the number of instructions
per work-group.
Fig. 14: Error rates of the linear and ML-based model on the Radeon HD 6970 for the execution time of each kernel.
The occupancy of gridding for the
GTX 280 and the GTX 580 is different; therefore, the
degree of imbalance in the number of instructions at
the sampling points is different from the GTX 280
to the GTX 580. Finally, for kernels with a relatively
small sampling execution time, the measurement
variation is also a source of error, especially for
kernels that have large number of work-groups such
as psinv or stencilKernel.
We note that the irregularity in warp serialization
on the GTX 280 affects the result of the performance
estimation less than the irregularity in memory access
patterns on the GTX 580. This explains why the
overall error rate of the linear model on the GTX 280
is lower than that on the GTX 580.
We have implemented the model presented by
Hong and Kim [11], an analytical model to estimate
performance on a GTX 280. Hong’s model has an
average error rate of 30%, almost five times as much
as the ML-based model in Figure 12. Note that we use
the arithmetic average to compare error rates, while
Hong et al. report the geometric average in their paper.
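For reference, the two ways of averaging error rates mentioned here differ as illustrated below; the error values are made up, and the only point is that the geometric mean is never larger than the arithmetic mean, so the two reported averages are not directly comparable.

```python
import math

def arithmetic_mean(errors):
    return sum(errors) / len(errors)

def geometric_mean(errors):
    return math.exp(sum(math.log(e) for e in errors) / len(errors))

# Hypothetical per-kernel relative error rates (fractions).
errors = [0.02, 0.05, 0.30, 0.10]
print(arithmetic_mean(errors))   # ~0.118
print(geometric_mean(errors))    # ~0.070, always <= the arithmetic mean
```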
To demonstrate the portability of the proposed
model, we evaluate the linear model on one of the
most recent NVIDIA GPUs, a GTX 680 based on
the Kepler architecture, to see how it adapts to a
newer GPU. It is currently impossible to read the
performance counters from OpenCL, thus we cannot
apply the ML-based model to this architecture yet.
The linear model obtains a slightly worse overall
error rate of 10%. Figure 13 details the result on
the GTX 680 for each kernel. There are two kernels
that suffer from especially high error rates: gridding
and setiv. The kernel gridding has an extremely
uneven workload distribution across the work-groups
as described above. There is no serious workload
imbalance in setiv. However on the GTX 680, the
number of compute units and memory units has been
increased, therefore having only 32 work-items per
work-group and 8 active work-groups per SM does
not use all the GPU resources effectively. To confirm
this conjecture we double the number of work-groups
per sampling run, i.e., we sample at 2∗Psat and 4∗Psat .
This dramatically reduces the error rate for setiv.
Note that ML models perform better the more
training data is available. We expect the ML-based
model to perform better as we add more benchmarks.
6.2.2 AMD GPUs
Figure 14 shows the error rates of our model for the AMD Radeon HD 6970 GPU. The error rates of the linear and
ML-based model are, on average, 8.10% and 7.91%
respectively. It is interesting that the linear model,
although developed based on an analysis of a specific
NVIDIA GPU, predicts quite well for most kernels
running on the AMD HD 6970 GPU. This is because both
NVIDIA GPUs and AMD GPUs use the concept of
occupancy to indicate how many active groups of
work-items can be executed concurrently. This group
is called warp in NVIDIA GPUs and wave-front in
AMD GPUs. The difference between warp and wavefront is their size, i.e., the number of work-items
executed in a lock-step manner. This size is simply an
input to the linear model. Because the sampling points
embody the execution behavior of warps and wavefronts on the hardware, the linear model can predict
very well on both NVIDIA GPUs and AMD GPUs.
The linear model does not predict well on some kernels, including trans_naive, lattice, binning, gridding, setiv, gen_hist, evolve, compress and ViterbiOneStep, which show errors of more than 20%. There are several reasons for these errors. First,
as explained in the result section for NVIDIA
GPUs, there is a workload imbalance between workgroups for binning, gridding, gen_hist and
trans_naive. Second, although there is no significant imbalance in the workload, some performance
factors do not scale linearly (e.g., the cycles stalled due
to a memory access) across different saturation points
as in trans_naive, lattice, setiv, compress,
evolve and ViterbiOneStep.
On the Radeon HD 6970, the ML-based model predicts almost equally as well as the linear model. This
is because there is a limited number of performance
counters we can access using OpenCL. In particular, there are no performance counters related to hardware
caches on AMD GPUs that can be accessed through
the OpenCL interface. The performance counters used
as inputs to the machine learning model on AMD
are insufficient to capture the correlation between the
execution time and input performance factors, hence
its performance is similar to that of the linear model.
TABLE 3: Performance for different error classes (error rates in %).

Error class        GTX 280                      GTX 580
                 LM      ML     delta         LM      ML     delta
0-5%            1.88    2.36    -0.48        2.03    2.13    -0.10
5-10%           6.65    3.13     3.50        7.59    6.57     1.01
>10%           21.24   18.67     2.57       18.46   13.41     5.06
Fig. 15: Average error rates of the linear and ML-based model at different saturation points: (a) NVIDIA GTX 580, (b) AMD Radeon HD 6970.
6.3 The ML-based model on different classes of kernels
Table 3 compares the linear model to the ML-based
model in relation to the error class of the kernel for
two NVIDIA GPUs. The error class 0-5% comprises
all kernels for which the error of the linear model
is 0 to 5%, and the error classes 5-10% and >10%
are constructed accordingly. The results confirm that
the ML-based model performs better on kernels for which the linear model has a relatively large estimation error. For regular kernels, the linear model slightly outperforms the ML-based model: while the ML-based model is capable of adapting to irregularities in the kernels, this flexibility comes at the expense of a small degradation in accuracy for those regular kernels.
6.4 The performance at different saturation points
Figure 15 shows the average error rate of the linear
and ML-based model at different sampling points for
(a) the NVIDIA GTX 580 and (b) the AMD Radeon
HD 6970, respectively. The x-axis denotes the first
sampling point at x-multiples of the saturation point
(Psat); the second sampling point is at (x+1)∗Psat. For the GTX 580, warm-up effects in the caches lead to a reduced accuracy when sampling at the first saturation point. From the second saturation point on, the warm-up effects are mostly covered, and the accuracy of the model remains similar even with a larger amount of sampling. For the Radeon HD 6970, the
linear model achieves error rates of 8-11% for the first
five saturation points. This implies that the model
benefits from the architectural similarities between
GPUs from different vendors. The ML-based model does not perform significantly better; the reason is that AMD GPUs provide much fewer performance counters than NVIDIA GPUs. For example, there is not sufficient information about L1 and L2 cache usage. Without these performance counters, the ML-based model is unable to improve the accuracy of the estimation significantly.
For both architectures, sampling at the second and third saturation points provides reasonable accuracy at the lowest possible overhead. The overhead of sampling compared to the total runtime of the kernels amounts to, on average, 8% for NVIDIA and 15% for AMD GPUs. It is important to note that the sampling overhead is independent of the input data size and will thus be significantly smaller for real-world kernels with very large input data sets.
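The following back-of-the-envelope sketch illustrates why the relative sampling overhead shrinks with the input size: the two sampling runs execute a fixed number of work-groups, while the full kernel grows with the input. All constants are illustrative assumptions.

def overhead_fraction(n_total, psat=64, t_per_group=1e-6, t_launch=50e-6):
    # Fixed-size sampling cost versus input-proportional kernel cost.
    sampled_groups = 2 * psat + 3 * psat          # second + third saturation point
    t_sampling = sampled_groups * t_per_group + 2 * t_launch
    t_kernel = n_total * t_per_group + t_launch
    return t_sampling / t_kernel

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} work-groups -> sampling overhead {overhead_fraction(n):6.1%}")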
7 CONCLUSIONS
We have presented a linear and an ML-based performance estimation model for GPUs with or without
hardware caches. The wide applicability of the models
is demonstrated by evaluating 70 OpenCL kernels on three different generations of
NVIDIA GPU architectures and one AMD GPU.
The linear model outperforms existing models and
achieves error rates below 10%. On modern GPUs, the
complex effects of cached memory accesses are difficult for a linear model to capture. On NVIDIA GPUs,
with performance counters reflecting cache usage and
branch divergences, the ML-based model detects the
correlation of these factors and the execution time
and reduces the average estimation error to about
5%. The ML-based model performs especially well on
benchmarks that exhibit non-linear throughput and
are thus difficult to estimate for the linear model.
On the other hand, with AMD GPUs, the ML-based
model cannot detect this correlation due to the lack
of certain important performance counters.
The proposed models significantly outperform related work on various GPU architectures. The models
achieve a very good accuracy for a wide range of
benchmarks on different GPU architectures and can
thus be used as performance estimation models in a
dynamic workload-distribution framework.
ACKNOWLEDGEMENTS
This work was supported by the National Research
Foundation of Korea (NRF) grants funded by the
Korean government (MSIP) (No. 2013R1A3A2003664)
and the Ministry of Education, Science and Technology (No. 2012R1A1A1042938). ICT at Seoul National
University provided research facilities for this study.
REFERENCES
[1] AMD. ATI Stream Software Development Kit (SDK) v2.1, 2010.
[2] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W. W. Hwu. An adaptive performance modeling tool for GPU architectures. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010.
[3] B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. de Supinski, and M. Schulz. A regression-based approach to scalability prediction. In Proceedings of the 22nd Annual International Conference on Supercomputing, 2008.
[4] R. Bitirgen, E. Ipek, and J. F. Martinez. Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, 2008.
[5] S. Collange, D. Defour, and D. Parello. Barra, a parallel functional GPGPU simulator. Technical Report hal-00359342, Université de Perpignan, 2009.
[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[7] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
[8] M. Fatica. Accelerating Linpack with CUDA on heterogeneous clusters. In Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, 2009.
[9] Khronos OpenCL Working Group. The OpenCL Specification, Version 1.0. Khronos Group, 2009.
[10] The IMPACT Research Group. Parboil Benchmark Suite, 2009.
[11] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th International Symposium on Computer Architecture, June 2009.
[12] S. Hong and H. Kim. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010.
[13] E. Ipek, B. R. de Supinski, M. Schulz, and S. A. McKee. An approach to performance prediction for parallel applications. In Euro-Par, Springer LNCS, 2005.
[14] W. Jia, K. A. Shaw, and M. Martonosi. Stargazer: Automated regression-based GPU design space exploration. In 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2012.
[15] W. Liu, W. Muller-Wittig, and B. Schmidt. Performance predictions for general-purpose computation on GPUs. In International Conference on Parallel Processing, September 2007.
[16] NVIDIA. NVIDIA GeForce GTX 580.
[17] NVIDIA. NVIDIA GeForce GTX 680.
[18] NVIDIA. NVIDIA OpenCL SDK code samples.
[19] NVIDIA. NVIDIA Fermi Compute Architecture White Paper.
[20] NVIDIA. NVIDIA CUDA Profiler.
[21] NVIDIA. NVIDIA Compute - PTX: Parallel Thread Execution, ISA Version 1.4, March 2011.
[22] NVIDIA. OpenCL Programming Guide for the CUDA Architecture, Version 4.0, February 2011.
[23] S. Seo, G. Jo, and J. Lee. Performance characterization of the NAS Parallel Benchmarks in OpenCL. In Proceedings of the 2011 IEEE International Symposium on Workload Characterization, 2011.
[24] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.
[25] Top500.org. Top500 Supercomputer Sites - November 2013, 2013.
[26] V. Vapnik. The Nature of Statistical Learning. Springer, 1999.
[27] H. Wong, M. M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU microarchitecture through microbenchmarking. In 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2010.
[28] C. Yang, F. Wang, Y. Du, J. Chen, J. Liu, H. Yi, and K. Lu. Adaptive optimization for petascale heterogeneous CPU/GPU computing. In IEEE International Conference on Cluster Computing, pages 19–28, 2010.
[29] Y. Zhang and J. D. Owens. A quantitative performance analysis model for GPU architectures. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, February 2011.
[30] A. Kerr, G. Diamos, and S. Yalamanchili. Modeling GPU-CPU workloads and systems. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, 2010.
[31] D. Grewe and M. F. P. O'Boyle. A static task partitioning approach for heterogeneous systems using OpenCL. In Compiler Construction, 2011.
[32] C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.
Thanh Tuan Dao received the BS degree
in computer science from VNU University of
Engineering and Technology in 2008, and
the master's degree in computer science and
engineering from Seoul National University in
2011. He is currently a PhD student at the Center
for Manycore Programming, Seoul National
University. His research interests include
high-performance computing, compilers and
runtime systems for multicores/manycores.
Jungwon Kim received the BS degree in
computer science and engineering and the
PhD degree in electrical engineering and
computer science from Seoul National University in 2006 and 2013, respectively. He is
currently a postdoctoral research associate
at Oak Ridge National Laboratory. His research interests include compilers, runtime
systems for multicores/manycores, and scalable heterogeneous high-performance computing.
Sangmin Seo received the BS degree in
computer science and engineering and the
PhD degree in electrical engineering and
computer science from Seoul National University in 2007 and 2013, respectively. After
obtaining his PhD degree, he spent half a
year as the CEO of ManyCoreSoft Co., Ltd. He
is currently a postdoctoral appointee in the
Mathematics and Computer Science division
at Argonne National Laboratory. His research
interests include high-performance computing, parallel programming models, compilers,
and parallel runtime systems. He is a member of the IEEE and the
IEEE Computer Society.
Bernhard Egger received the diploma in
computer science from the Swiss Federal
Institute of Technology in Zürich, Switzerland
and the PhD degree in computer science
and engineering from Seoul National University, Korea. He then spent three years at
Samsung Electronic’s research institute before joining Seoul National University in 2011
where is currently is an assistant professor
at the School of Computer Science and Engineering. His research interests include programming language design, compilers, and
operating systems for heterogeneous manycore systems. He is a
member of the IEEE and ACM. More information can be found at
http://csap.snu.ac.kr/∼bernhard.
Jaejin Lee received the BS degree in
physics from Seoul National University in
1991, the MS degree in computer science
from Stanford University in 1995, and the
PhD degree in computer science from the
University of Illinois at Urbana-Champaign in
1999. He is a professor in the Department
of Computer Science and Engineering at
Seoul National University, Korea, where he
has been a faculty member since September
2002. Before joining Seoul National University, he was an assistant professor in the
Computer Science and Engineering Department at Michigan State
University. He was a recipient of an IBM Cooperative Fellowship and
a fellowship from the Korea Foundation for Advanced Studies during
his PhD study. His research interests include compilers, computer
architectures, embedded computer systems, and systems in high-performance computing. He is a member of the IEEE and ACM. More
information can be found at http://aces.snu.ac.kr/∼jlee.