Sponge: Portable Stream Programming on Graphics Engines
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan - Ann Arbor, MI
{hormati, mehrzads, mwoh, tnm, mahlke}@umich.edu
Abstract
Graphics processing units (GPUs) provide a low cost platform
for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL,
makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for
two primary reasons: tedious performance optimizations and lack
of portability. First, optimizing an algorithm for a specific GPU
is a time-consuming task that requires a thorough understanding
of both the algorithm and the underlying hardware. Unoptimized
CUDA programs typically only achieve a small fraction of the peak
GPU performance. Second, GPU code lacks efficient portability as
code written for one GPU can be inefficient when executed on another. Moving code from one GPU to another while maintaining
the desired performance is a non-trivial task often requiring significant modifications to account for the hardware differences. In
this work, we propose Sponge, a compilation framework for GPUs
using synchronous data flow streaming languages. Sponge is capable of performing a wide variety of optimizations to generate
efficient code for graphics engines. Sponge alleviates the problems
associated with current GPU programming methods by providing
portability across different generations of GPUs and CPUs, and a
better abstraction of the hardware details, such as the memory hierarchy and threading model. Using streaming, we provide a write-once software paradigm and rely on the compiler to automatically
create optimized CUDA code for a wide variety of GPU targets.
Sponge’s compiler optimizations improve the performance of the
baseline CUDA implementations by an average of 3.2x.
Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors—Compilers

General Terms Design, Languages, Performance

Keywords Streaming, Compiler, GPU, Optimization, Portability
1. Introduction
Support for parallelism in hardware has sharply grown in recent
years as a response to performance and power demands of both
emerging and traditional high performance application domains.
Among the multitude of vastly different solutions offered by hardware companies, graphics processing units (GPUs) have been
shown to provide significant performance, power efficiency and
cost benefits for general purpose computing in highly parallel computing domains. Recently, heterogeneous systems that combine traditional processors with powerful GPUs have become standard in
all systems ranging from servers to cell phones. GPUs achieve their
high performance and efficiency by providing a massively parallel
architecture with hundreds of in-order cores while exposing parallelism mechanisms and the memory hierarchy to the programmer.
Recent works have shown that, in the optimistic case, speedups of 100-300x [21] and, in the pessimistic case, speedups of 2.5x [17] have been achieved on the most recent GPUs compared to the latest processors. Maximizing the utilization of the GPU in heterogeneous systems will be key to achieving high performance and efficiency.
While GPUs provide an inexpensive, highly parallel system for accelerating parallel workloads, the programming complexity posed to application developers is a significant challenge. Developing applications that utilize the massive compute power and memory bandwidth requires a thorough understanding of the algorithm and the details of the underlying architecture. Graphics chip manufacturers, such as NVIDIA, have tried to alleviate the complexity problem by introducing user-friendly programming models, such as CUDA [19]. Although CUDA and other similar programming models abstract the underlying GPU architecture by providing a unified processor model, managing the amount of on-chip memory used per thread, the total number of threads per multiprocessor, and the pattern of off-chip memory accesses are examples of problems that developers still need to manage in order to maximize GPU utilization [23]. Often the programmer must perform a tedious cycle of performance tuning to extract the desired performance.

Figure 1: Runtime (in ms, lower is better) of a kernel optimized for architectures with different numbers of registers per thread, executed on a GeForce GTX 285, which has the most registers. The register counts (8 to 64 per thread) span mobile, desktop, and high performance GPU classes; the points corresponding to code optimized for the GeForce 8400 GS and for the GeForce GTX 285 are marked. The kernel used in this graph is organized in 128 blocks, each with 256 threads.
Another problem of developing applications in CUDA is the
lack of efficient portability between different generations of GPUs
and also between the host processors and GPUs in the system. Different NVIDIA GPUs vary in several key micro-architectural parameters such as number of registers, maximum number of active
threads, and the size of global memory. These parameters will vary
even more when newer high performance cards, such as NVIDIA’s
Fermi [20], and future resource-constrained mobile GPUs are released. These differences in hardware lead to a different set of optimization choices for each GPU. As a result, optimization decisions for one generation of GPUs are likely to be poor
choices for another generation. An example of this is shown in Figure 1. This figure shows a CUDA kernel that requires 78 registers
per thread, running with 128 blocks of 256 threads per block on
an NVIDIA GeForce GTX 285. This graph shows how the runtime (lower is better) would change if the benchmark were optimized for GPU architectures with fewer than the 16K registers available on each streaming multiprocessor of the GTX 285. For example, if this kernel is compiled for the GeForce 8400 GS, it will use 32 registers per thread, since there are 8K registers available for the 256 threads in each block on that architecture. Data elements that do not fit in the smaller register file will be spilled to the slower parts of the memory hierarchy, causing performance degradation. In short, CUDA code must be separately customized for each target GPU, as the choice of optimizations for peak performance is typically sensitive to the hardware configuration.
One solution to the GPU programming complexity is to adopt
a higher level programming abstraction similar to the stream programming model. The streaming model provides an extensive set of
compiler optimizations for mapping and scheduling applications to
various homogeneous and heterogeneous architectures ([5, 6, 14]).
The retargetability of streaming languages, such as StreamIt [26],
has made them an excellent choice for parallel system programmers in shared/distributed memory and tiled architectures. Streaming language retargetability and performance benefits on heterogeneous systems are mainly a result of having well-encapsulated
constructs that expose parallelism and communication without depending on the topology or granularity of the underlying architecture.
GPUs are important drivers for current and future heterogeneous systems; therefore, extending the applicability of streaming languages to GPUs is advantageous for several reasons. First, streaming, which expresses programs at a higher level than CUDA, enables optimizing and porting to different generations of GPUs and between different topologies of CPUs and GPUs. Second, the exposed communication in streaming programs helps the compiler to
efficiently map data transfers onto different memory hierarchies.
Finally, streaming applications can be tailored for any number of
cores and devices by performing graph restructurings such as horizontal or vertical fusion or fission of actors.
In this work, we introduce Sponge, a streaming compiler for the
StreamIt language that is capable of automatically producing customized CUDA code for a wide range of GPUs. Sponge consists
of stream graph optimizations to optimize the organization of the
computation graph and an efficient CUDA code generator to express the parallelism for the target GPU. Producing efficient CUDA
code is a multi-variable optimization problem and can be difficult
for software programmers due to the unconventional organization
and the interaction of computing resources of GPUs. Sponge is
equipped with a set of optimizations to handle the memory hierarchy and also to efficiently utilize the processing units.
The Stream-to-CUDA compilation in Sponge consists of four
steps. First, Sponge performs graph reorganization and modification on the stream graph and also classifies actors based on their
memory traffic. The classification information is used throughout
all the phases of the compilation. Second, memory layout optimizations are performed. These optimizations are designed to enable efficient utilization of the memory bandwidth. In this phase, Sponge
decides if actors should use the faster but smaller on-chip memories or the slower but larger off-chip memory on the GPU. Also,
techniques such as helper threads and bank conflict resolution in
the context of StreamIt are introduced to increase the efficiency of
memory accesses. The third compilation phase deals with the granularity of the actors assigned to each thread. In this step, based on the classification
information from step one, Sponge tries to create larger threads by
fusing producer/consumer actors in order to reduce communication and kernel call overheads. Finally, software prefetching and
loop unrolling are used to exploit unused registers to decrease loop
control code overhead and increase memory bandwidth utilization.
In summary, this paper makes the following contributions:
• Extending the applicability and portability of synchronous data-flow languages, specifically StreamIt, to GPUs.
• Streaming-specific optimizations for CUDA and generic CUDA optimizations for streaming applications.
• Discussion of the limitations of StreamIt as a GPU programming language.
The rest of the paper is organized as follows. In Section 2, the
stream programming model, the input language (StreamIt), and the
CUDA programming model are discussed. Portable stream compilation in Sponge and its optimizations are explained in Section 3.
Experiments are shown in Section 4. A comparison between two hand-optimized CUDA benchmarks and their StreamIt implementations is presented in Section 5. Related work is discussed in Section 6. Finally, Section 7 contains the conclusion.
Figure 2: CUDA/GPU execution model. Each thread has per-thread registers (int RegisterVar) and per-thread local memory (int LocalVarArray[10]); each block has per-block shared memory (__shared__ int SharedVar); and a sequence of grids (kernels) shares per-application device global memory (__device__ int GlobalVar).
2. Background
In this section, the CUDA programming model, GPU architecture,
and stream programming model are explained.
2.1 CUDA and GPUs
The CUDA programming model is a multi-threaded SIMD model
that enables implementation of general purpose programs on heterogeneous GPU/CPU systems. There are two different device
types in CUDA: the Host processor and the GPU. A CUDA program consists of a host code segment that contains the sequential
sections of the program, which is run on the CPU, and a parallel
code segment which is launched from the host onto one or more
GPU devices. Data-level parallelism (DLP) and thread-level parallelism (TLP) are handled differently in these systems. DLP is
converted into TLP and executed on the GPU devices, while TLP
is handled by executing multiple kernels on different GPU devices
launched by the host processor. The threading and memory abstraction of the CUDA model is shown in Figure 2.
The threading abstraction in CUDA consists of three levels of
hierarchy. The basic unit of work is a single thread. A group of threads executing the same code are combined together to form a thread block, or simply a block. Together, these thread blocks combine to form the parallel segments called grids, where each grid is scheduled onto a GPU one at a time. Threads within a thread block are synchronized through a barrier operation (__syncthreads()). However, there is no explicit software or hardware support for synchronization across thread blocks. Synchronization between thread blocks is performed through the global
memory of the GPU, and the barriers needed for synchronization
are handled by the host processor. Thread blocks communicate by
executing separate kernels on the GPU.
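To make the hierarchy concrete, the following minimal CUDA fragment (a generic illustration, not code generated by Sponge) shows a grid of blocks in which each block stages one element per thread into shared memory, synchronizes with __syncthreads(), and then writes its result; the kernel name, sizes, and computation are arbitrary assumptions.

    // Each block cooperatively stages a tile into per-block shared memory,
    // synchronizes, and then each thread processes one element.
    __global__ void scale_kernel(const float *in, float *out, float factor) {
      __shared__ float tile[256];                       // per-block shared memory
      int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id (in a register)
      tile[threadIdx.x] = in[gid];                      // one element per thread
      __syncthreads();                                  // barrier within the block
      out[gid] = tile[threadIdx.x] * factor;
    }

    // Host side: launch a grid of 128 blocks, each with 256 threads.
    // scale_kernel<<<128, 256>>>(d_in, d_out, 2.0f);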
The memory abstraction in CUDA consists of multiple levels of
hierarchy. The lowest level of memory is registers, which are onchip memories private to a single thread. The next level of memory
is shared memory, which is an on-chip memory shared only by
threads within the same thread block. Access latency to both the
registers and shared memory is extremely low. The next level of
memory is local memory, which is an off-chip memory private to
a single thread. Local memory is mainly used as spill memory for
local arrays. Mapping arrays to shared memory instead of spilling
to local memory can provide much better performance. Finally,
the last level of memory is global memory, which is an off-chip
memory that is accessible to all threads in the grid. This memory is
used primarily to stream data in and out of the GPU from the host
processor. The latency for off-chip memory is 100-150x more than
that for on-chip memories. Two other memory levels, texture memory and constant memory, are also available and are cached on chip. Texture memory
is accessible through special built-in texture functions and constant
memory is accessible to all threads in the grid.
The CUDA programming model is an abstraction layer to
access GPUs. NVIDIA GPUs use a single instruction multiple
thread (SIMT) model of execution where multiple thread blocks
are mapped to streaming multiprocessors (SM). Each SM contains
a number of processing elements called Streaming Processors (SP).
A thread executes on a single SP. Threads in a block are executed in
smaller execution groups of threads called warps. All threads in a
warp share one program counter and execute the same instructions.
If conditional branches within a warp take different paths, called
control path divergence, the warp will execute each branch path
serially, stalling the other paths until all the paths are complete.
Such control path divergences severely degrade the performance.
Because off-chip global memory access is very slow, GPUs support coalesced memory accesses. Coalescing memory accesses allows one bulk memory request from multiple threads in a half-warp
to be sent to global memory instead of multiple separate requests.
In order to coalesce memory accesses, three general restrictions apply: each thread in a half-warp must access successive addresses
in order of the thread number, the memory accesses can only be
32, 64, or 128-bit, and all the addresses must be aligned to either
64, 128 or 256-byte boundaries. Effective memory bandwidth is an
order of magnitude lower when using non-coalesced memory accesses, which underscores the importance of memory coalescing for achieving high performance.
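The contrast between the two patterns can be sketched with the following pair of kernels (illustrative only; array sizes and the reduction performed are assumptions). In the first kernel, consecutive threads of a half-warp read consecutive 32-bit words and the requests are combined; in the second, each thread walks a private segment, so consecutive threads touch addresses that are far apart and the requests are issued separately.

    // Coalesced: at each step, thread t reads buf[i * nThreads + t], so
    // neighboring threads access neighboring addresses.
    __global__ void coalesced_sum(const float *buf, float *out, int iters, int nThreads) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;
      float sum = 0.0f;
      for (int i = 0; i < iters; i++)
        sum += buf[i * nThreads + t];
      out[t] = sum;
    }

    // Uncoalesced: each thread reads its own contiguous segment, so neighboring
    // threads access addresses that are `iters` elements apart.
    __global__ void uncoalesced_sum(const float *buf, float *out, int iters) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;
      float sum = 0.0f;
      for (int i = 0; i < iters; i++)
        sum += buf[t * iters + i];
      out[t] = sum;
    }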
In modern GPUs, such as NVIDIA GTX 285, there are 30 SMs
each with 8 SPs. Each SM processes warps of 32 threads. The memory sizes for this GPU are: 16K registers per SM, 16KB of shared memory per SM divided into 16 banks, and 2GB of global
memory shared across all threads in the GPU.
Figure 4: Compilation flow in Sponge. A StreamIt program passes through Reorganization and Classification, Memory Layout (Shared/Global Memory, Helper Threads, Bank Conflict Resolution), Graph Restructuring, and Register Optimization (Software Prefetching, Loop Unrolling) before CUDA code is emitted.

2.2 Stream Programming Model
With the ubiquity of multi-core systems, the stream programming paradigm has become increasingly important. Exposed communication and an abundance of parallelism are the key features making streaming a flexible and architecture-independent solution for parallel programming. In this paper, we focus on stream programming models that are based on synchronous data flow (SDF) models [15]. In SDF, computation is performed by actors, which are autonomous and isolated computational units. Actors communicate through data-flow buffers (i.e., tapes), often realized as FIFOs. SDF, and its many variations, exposes the input and output processing rates of actors, and in turn this affords many optimization opportunities that can lead to efficient schedules (e.g., allocation of actors to cores, and tapes to local memories).
For our purposes, we assume all computation that is performed in an actor is largely embodied in a work method. The work method runs repeatedly as long as the actor has data to consume on its input port. The amount of data that the work method consumes is called the pop rate. Similarly, the amount of data each work invocation produces is called the push rate. Some streaming languages (e.g., StreamIt [26]) provide a non-destructive read which does not alter the state of the input buffer. The amount of data that is read in this manner is specified by the peek rate. An actor can also have an init method that is executed only once for the purpose of initializing the actor before the execution of the program starts.
We distinguish between stateful and stateless actors. A stateful actor modifies its local state and maintains a persistent history of its execution. Unlike a stateful actor, which restricts opportunities for parallelism, a stateless actor is data-parallel in that every invocation of the work method does not depend on or mutate the actor's state. The semantics of stateless actors thus allow us to replicate a stateless actor. This opportunity is quite fruitful in scaling the amount of parallelism that an application can exploit, as shown in [5, 6].
We use the StreamIt programming language to implement streaming programs. StreamIt is an architecture-independent streaming language based on SDF. The language allows a programmer to algorithmically describe the computational graph. In StreamIt, actors are known as filters. Filters can be organized hierarchically into pipelines (i.e., sequential composition), split-joins (i.e., parallel composition), and feedback loops (i.e., cyclic composition). StreamIt is a convenient language for describing streaming algorithms, and its accompanying static compilation technology makes it suitable for our work.
A crucial consideration in StreamIt programs is to create a steady state schedule, which involves rate-matching of the stream graph. Rate-matching guarantees that, in the steady state, the number of data elements that are produced by an actor is equal to the number of data elements its successors will consume. Rate-matching assigns a static repetition number to each actor. In the implementation of a StreamIt schedule, an actor is enclosed by a for-loop that iterates as many times as its repetition number. The steady state schedule is a sequence of appearances of these for-loops enclosed in an outer loop whose main job is to repeat the steady state schedule. The template code in Figure 3b shows the intermediate code for the steady state schedule of the streaming graph shown in Figure 3a.

Figure 3: An example stream graph (a) and the intermediate code template for executing its steady state schedule (b). Ri is the repetition number of actor i.
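In host-code terms, the schedule template of Figure 3b amounts to the following sketch (a simplified C rendering under the assumption that each actor's work method is exposed as a function pointer; names are illustrative).

    // Steady-state schedule: actor a runs R[a] times per steady-state iteration,
    // and the whole sequence repeats in an outer loop.
    typedef void (*WorkFn)(void);

    void run_steady_state(WorkFn work[], int R[], int numActors, int steadyIters) {
      for (int s = 0; s < steadyIters; s++)        // outer loop repeating the schedule
        for (int a = 0; a < numActors; a++)        // one for-loop per actor
          for (int r = 0; r < R[a]; r++)
            work[a]();                             // invoke the actor's work method
    }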
ActiveWarpsPerSM = \frac{ThreadsPerBlock \times ActiveBlocksPerSM}{THREADS\_PER\_WARP}   (1)

Iterations = \frac{InputBufferSize}{Pop \times ThreadsPerBlock \times Blocks}   (2)

ThreadsPerBlock_{LoT} = \frac{SHARED\_MEMORY\_SIZE}{Pop + Push}   (3)

ExecCycles_{LoT} = \frac{CompInsts \times COMP\_INST\_ISSUE\_DELAY}{ActiveWarpsPerSM} \times \frac{ThreadsPerBlock_{LoT} \times Iterations}{NUMBER\_SM}   (4)

ThreadsPerBlock_{HiT} = MAX\_THREAD\_PER\_BLOCK   (5)

MemCycles = (UncoalMemInsts + CoalMemInsts / COAL\_FACTOR) \times MEMORY\_DELAY + MEM\_INST\_ISSUE\_DELAY   (6)

ExecCycles_{HiT} = \frac{MemCycles}{ActiveWarpsPerSM} \times \frac{ThreadsPerBlock_{HiT} \times Iterations}{NUMBER\_SM}   (7)

GPU parameters (all-capital names):
SHARED_MEMORY_SIZE: size of shared memory on the GPU
THREADS_PER_WARP: number of threads in each warp
NUMBER_SM: number of streaming multiprocessors (SMs) on the GPU
MAX_THREAD_PER_BLOCK: maximum number of threads allowed per block
MEMORY_DELAY: number of cycles to access global memory
COAL_FACTOR: maximum number of memory accesses that can be coalesced
MEM_INST_ISSUE_DELAY: number of cycles to issue a memory instruction
COMP_INST_ISSUE_DELAY: number of cycles to issue a compute instruction

Actor parameters:
pop, push: push and pop rates of an actor
InputBufferSize: size of the input buffer of an actor
ThreadsPerBlock: number of threads in one block
Blocks: number of blocks on the GPU
Iterations: number of iterations to run an actor on the GPU
ActiveBlocksPerSM: blocks active on one SM
(Un)CoalMemInsts: (un)coalesced memory instructions in one actor

Figure 5: Equations for calculating the execution cycles of both HiT and LoT actors. Equations 1 and 2 are used for both HiT and LoT actors. The tables summarize the meaning of each variable.
3. Portable Stream Compilation
Sponge takes StreamIt programs as its input and generates GPU-specific CUDA code. Each actor in the StreamIt graph is converted to a CUDA kernel running with some number of threads
and blocks. By performing portable stream compilation, Sponge
decides how many threads and blocks to assign to the CUDA kernel generated for each actor. The input buffer size of the first actor, Ai , in the graph determines how many times that actor has to
run (Ri). As a result, Ri is changed to Ri divided by the product of the number of threads and the number of blocks assigned to the actor. We call the result the number of iterations for actor Ai.
Portable stream compilation in Sponge consists of four main
steps as shown in Figure 4. In the first phase, Sponge reads a
StreamIt program and performs Actor Reorganization and Classification, in which simple graph reorganization is done and actors are classified into two categories: High-Traffic (HiT) and Low-Traffic (LoT). The classification information is used throughout all
the phases of the compilation flow. The second phase deals with the
Memory Layout and Optimization of each actor. This step decides if
an actor uses shared or global memory, eliminates shared memory
bank conflicts and also improves memory performance by introducing Helper Threads to better utilize the unused processors and
bring the data needed by an actor into shared memory faster. This
compilation step is crucial to achieving better performance since
memory bandwidth can be a limiting factor on GPUs. The third
phase performs Graph Restructuring by changing the granularity
of the kernels and vertically fusing actors based on classification
results. After graph restructuring, the compiler reiterates from the
beginning of the compilation flow, treating the post-fused stream
graph as the input until no more graph restructuring is possible.
Finally, Register Optimization tries to utilize unused registers on
each SM by employing software prefetching and also by unrolling
for loops in each kernel.
3.1 Actor Reorganization and Classification
As mentioned in Section 2.1, GPUs are built for data-level parallelism and are not suitable for task-level parallelism and global synchronization. Therefore, splitter-joiner structures will not perform well on the GPU, since each joiner introduces a synchronization point. First, Sponge collapses splitter-joiners into one actor in cases where the actors between the splitter and joiner are stateless and equivalent. This removes the splitter and joiner actors and replaces the structure with a single actor. In cases where it is not possible to collapse a splitter-joiner structure into one actor, Sponge treats the splitter and joiner as special actors with more than one input and output. Based on the type and weights of the splitter and joiner actors, Sponge decides whether to allocate their input and output buffers in shared memory or global memory.
Sponge excludes stateful actors from being executed on the GPU and runs them on the host CPU. This is because only one instance of a stateful actor can be active, and data parallelism is not applicable to these actors. Host-to-GPU and GPU-to-host transfers are inserted before and after stateful actors, if necessary.
Next, Sponge classifies actors assigned to the GPU as either High-Traffic (HiT) or Low-Traffic (LoT). HiT actors have a large number of memory accesses. These actors perform better on a GPU if their buffers are mapped to global memory rather than shared memory, because mapping the buffers to shared memory would result in too few threads and under-utilization of the processors and the available memory bandwidth. LoT actors, on the other hand, are mostly computation dominated and, if mapped to shared memory, will still have a reasonable number of threads to utilize the GPU.
In order to determine whether an actor is LoT or HiT, Sponge estimates the execution cycles of the actor for both the global memory (HiT) and shared memory (LoT) mappings, based on Equations 1-7 in Figure 5. For each actor, Sponge treats that actor as both HiT and LoT and calculates the corresponding execution cycles (ExecCycles_LoT, ExecCycles_HiT). The two numbers show whether that actor is better treated as a LoT or HiT actor. If ExecCycles_HiT is smaller than ExecCycles_LoT for an actor, that actor will perform better if its buffer is mapped to global memory. Otherwise, it will be classified as a LoT actor, for which both shared and global memory will be used to help with coalescing of data accesses.
In the equations for LoT actors, the number of threads per block (ThreadsPerBlock) is determined by the size of shared memory (SHARED_MEMORY_SIZE; variables written in all capital letters denote GPU-specific parameters) and the number of pushes and pops. Threads per block defines the number of active warps per SM (ActiveWarpsPerSM) and the number of iterations based on Equations 1 and 2. Finally, the execution cycles of a LoT actor are estimated from the number of compute instructions and the distribution of threads across the GPU (Equation 4).
Execution time of HiT actors is calculated based on their memory access time, because these actors are mapped to global memory and have a large number of global memory reads and writes. Equations 5-7 show how the execution time estimate is computed from the number of coalesced and uncoalesced memory accesses. Unlike shared memory, the size of global memory does not limit the number of threads; therefore, the number of threads per block for HiT actors can be equal to the maximum number of threads allowed in each block (MAX_THREAD_PER_BLOCK).
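The classification can be summarized by the host-side sketch below, which transcribes Equations 1-7 directly; the data structures, rounding behavior, and the omission of occupancy limits are simplifying assumptions rather than Sponge's actual implementation.

    // Host-side transcription of Equations 1-7 (rounding and occupancy limits
    // are simplified). All-capital fields are GPU-specific parameters.
    struct GpuParams {
      int SHARED_MEMORY_SIZE, THREADS_PER_WARP, NUMBER_SM, MAX_THREAD_PER_BLOCK;
      int MEMORY_DELAY, COAL_FACTOR, MEM_INST_ISSUE_DELAY, COMP_INST_ISSUE_DELAY;
    };

    struct ActorProfile {
      int pop, push, inputBufferSize, blocks, activeBlocksPerSM;
      int compInsts, coalMemInsts, uncoalMemInsts;
    };

    static double execCycles(const GpuParams &g, const ActorProfile &a,
                             int threadsPerBlock, double cyclesPerThread) {
      double activeWarps = (double)threadsPerBlock * a.activeBlocksPerSM
                           / g.THREADS_PER_WARP;                           // Eq. 1
      double iterations  = (double)a.inputBufferSize
                           / ((double)a.pop * threadsPerBlock * a.blocks); // Eq. 2
      return (cyclesPerThread / activeWarps)
             * ((double)threadsPerBlock * iterations / g.NUMBER_SM);       // Eqs. 4 and 7
    }

    // LoT estimate: buffers live in shared memory (Equations 3 and 4).
    double execCyclesLoT(const GpuParams &g, const ActorProfile &a) {
      int threads = g.SHARED_MEMORY_SIZE / (a.pop + a.push);               // Eq. 3
      return execCycles(g, a, threads,
                        (double)a.compInsts * g.COMP_INST_ISSUE_DELAY);
    }

    // HiT estimate: buffers live in global memory (Equations 5, 6, and 7).
    double execCyclesHiT(const GpuParams &g, const ActorProfile &a) {
      int threads = g.MAX_THREAD_PER_BLOCK;                                // Eq. 5
      double memCycles = (a.uncoalMemInsts
                          + (double)a.coalMemInsts / g.COAL_FACTOR)
                         * g.MEMORY_DELAY + g.MEM_INST_ISSUE_DELAY;        // Eq. 6
      return execCycles(g, a, threads, memCycles);
    }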
Figure 6: This figure shows how HiT and LoT threads access their buffers. Part (a) illustrates the memory access pattern for a sample HiT actor with four pops and four pushes. Part (b) shows the access pattern for a LoT actor, whose threads cooperatively stage data between global and shared memory.
3.2 Memory Layout and Optimization
The memory hierarchy of GPUs is significantly different from that of conventional shared-memory and distributed-memory systems. As mentioned in Section 2.1, efficient use of global memory, shared memory, and registers on GPUs is crucial for obtaining high performance. Coalescing accesses to global memory can greatly reduce memory access overhead, but is not possible without careful memory layout. Utilizing shared memory, which is significantly faster than global memory, is also very beneficial; due to its limited size, however, shared memory can restrict the number of threads and degrade performance. In this section, the memory layout and optimization techniques used in Sponge are discussed. First, the way shared memory is utilized for LoT actors is explained. Second, helper threads, a technique that Sponge uses to reduce the global memory access time of actors, is discussed. Finally, shared memory bank conflict resolution in Sponge is explained.
3.2.1 Shared/Global Memory
To deal with high-latency memory access issues, Sponge uses the
classification information calculated in the previous phase and tries
to alleviate the problem by coalescing the buffer accesses or overlapping a large number of uncoalesced buffer accesses to amortize
the cost. As discussed earlier, HiT actors will be mapped to global
memory and LoT actors will use the shared memory. The kernel
generated for these actors will have a large number of threads, each
accessing its own buffer sequentially in global memory. The memory accesses will not be coalesced because the accesses of consecutive threads are not consecutive in the memory. Since the number of
threads is large, the overhead of memory accesses will be hidden by
the execution of many threads. Figure 6a illustrates how a HiT actor with four pops and four pushes accesses global memory. Since
the addresses generated by the first pop operations of the threads
are not consecutive in the memory, they are not coalesced.
LoT actors, unlike HiT actors, have a higher compute to memory ratio. Therefore, a LoT actor can use shared memory and have
a large number of threads. As shown in Figure 6b, threads of a
LoT kernel in a block can use coalesced memory accesses to copy
their input (output) buffer to (from) shared memory from (to) global
memory. To do so, the threads of a block work as a group and bring
parts of data that belong to other threads as well as part of their own
data. In this way, consecutive threads' accesses to global memory will be to consecutive locations and will be coalesced. Since all of the data is in shared memory, all threads in a block will have access to it. Figure 7 shows how the CUDA code needs to be changed to utilize shared memory in LoT actors. In the baseline form (Figure 7a), the input and output buffers are allocated in global memory and the work function directly accesses global memory. If shared memory is used, then two for loops are added before and after the work function to copy the data in and out of shared memory, as shown in Figure 7b. The addresses for the memory reads and writes in these for loops are set based on the ThreadID and the number of pushes and pops. Before and after the two new for loops, L1 and L2, barriers (__syncthreads()) are necessary because, as mentioned earlier, each thread does not fetch all of its own data and has to wait for other threads in the block to finish their data-fetch phase.
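A minimal sketch of the transformation in Figure 7b is shown below for a LoT actor with pop and push rates of four; the buffer layout, the placeholder computation, and the kernel name are assumptions. The kernel is assumed to be launched with POP * blockDim.x * sizeof(float) bytes of dynamic shared memory.

    #define POP  4
    #define PUSH 4

    // Hypothetical LoT kernel: the threads of a block cooperatively stage their
    // inputs from global to shared memory with coalesced accesses (loop L1), run
    // the work function out of shared memory, and stage the outputs back (loop L2).
    __global__ void lot_kernel(const float *g_in, float *g_out, int iterations) {
      extern __shared__ float s_buf[];               // POP * blockDim.x floats
      int chunk = blockDim.x * POP;                  // elements handled per block per iteration
      int base  = blockIdx.x * chunk;
      for (int it = 0; it < iterations; it++) {
        for (int k = 0; k < POP; k++)                // L1: consecutive threads, consecutive addresses
          s_buf[k * blockDim.x + threadIdx.x] = g_in[base + k * blockDim.x + threadIdx.x];
        __syncthreads();
        for (int k = 0; k < PUSH; k++)               // work: each thread pops its own contiguous elements
          s_buf[threadIdx.x * POP + k] *= 2.0f;      // placeholder computation
        __syncthreads();
        for (int k = 0; k < PUSH; k++)               // L2: coalesced write-back to global memory
          g_out[base + k * blockDim.x + threadIdx.x] = s_buf[k * blockDim.x + threadIdx.x];
        base += gridDim.x * chunk;
      }
    }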
3.2.2 Helper Threads
The first optimization of this phase is to use helper threads to
fetch data for the worker threads. In cases where there are not
enough threads to efficiently utilize all the SMs for LoT kernels or
a HiT actor has a fair number of threads when it is treated as a LoT
actor (mapped to shared memory), Sponge uses helper threads to
reduce the buffer (i.e. memory) accesses of each thread (push and
pop rate). Each helper thread aids some worker threads to bring
their data to shared memory in a coalesced way.
Figure 7c shows how the CUDA code is modified. Based on the
thread IDs, Sponge generates the helper and worker threads. Helper
threads are in charge of handling the data accesses and worker
threads are in charge of the computation. In order to avoid control
flow divergence, the thread assignment is performed such that the
helper and worker threads form complete warps. If the number of
worker threads is less than the warp size, then the helper threads are placed in the first set of warps and the worker threads form the last warp. This is done by predicating out the work function for the helper threads and the memory access for loops for the worker threads. The if statements in Figure 7c do this based on the thread ID. This technique works because control flow divergence negatively affects the performance within one warp but not across warps.
Sponge estimates the number of instructions that helper threads
will add to each thread and also takes into account the parallelism
between the helper and worker threads to calculate how beneficial
helper thread optimization will be for both LoT and HiT threads. As
illustrated in Figure 7c, Sponge counts the time it takes to run L1 ,
L2 , and L3 sections and estimates the total execution time based on
the equations in Figure 5. If the total execution time using helper
threads is reduced, Sponge generates CUDA code using them.
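The generated code of Figure 7c has roughly the following shape (a sketch with assumed names and layout; here the worker threads occupy the first warps, although, as described above, Sponge may place the helpers first, and the split always falls on warp boundaries).

    #define POP 4

    // Hypothetical split of a block into worker and helper threads: workers
    // compute, helpers stage data, and no warp diverges internally.
    __global__ void lot_with_helpers(const float *g_in, float *g_out,
                                     int iterations, int workers) {
      extern __shared__ float s_buf[];               // workers * POP floats
      int  helpers  = blockDim.x - workers;          // assumed to be at least one warp
      bool isWorker = (threadIdx.x < workers);
      int  helperId = threadIdx.x - workers;
      int  chunk    = workers * POP;                 // elements consumed per block per iteration
      int  base     = blockIdx.x * chunk;

      for (int it = 0; it < iterations; it++) {
        if (!isWorker)                               // helpers: coalesced global -> shared copy
          for (int j = helperId; j < chunk; j += helpers)
            s_buf[j] = g_in[base + j];
        __syncthreads();
        if (isWorker)                                // workers: the actor's work function
          for (int k = 0; k < POP; k++)
            s_buf[threadIdx.x * POP + k] *= 2.0f;    // placeholder computation
        __syncthreads();
        if (!isWorker)                               // helpers: coalesced shared -> global copy
          for (int j = helperId; j < chunk; j += helpers)
            g_out[base + j] = s_buf[j];
        __syncthreads();
        base += gridDim.x * chunk;
      }
    }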
Figure 7: Part (a) shows the baseline translation for a HiT actor. How shared memory is used in a LoT actor is illustrated in part (b). Part (c) shows the way Sponge generates CUDA code to divide threads into helpers and workers.
3.2.3 Bank Conflict Resolution
Shared memory bank conflict is another source of bottleneck in
GPU systems. For example, whenever threads of a kernel access
their input buffer in shared memory with

    data = buffer[baseAddress + s * threadId];

threads threadId and threadId + n access the same bank
whenever n is a multiple of m/d (m is the number of memory
banks) where d is the greatest common divisor of m and s. As a
consequence, there will be no bank conflicts only if half the warp
size is less than or equal to m/d. For current NVIDIA devices, this
translates to no bank conflict only if d is equal to 1, or in other
words, only if s is odd since m is a power of two (16 for GTX
285 [19]). In the StreamIt code, s is the number of pops. To make s
odd, if the number of pops is even, Sponge artificially changes the
pop rate of an actor by incrementing the pops by one. In this way,
an actor with 2k pops will use 2k + 1 entries in the memory and the
buffers get shifted in the memory. Removing bank conflicts greatly
improves the performance of some of the benchmarks. The same
technique can be applied for pushes.
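A minimal sketch of the padding, assuming 16 banks, 256 threads per block, and an actor with an even pop rate of s = 4 (names and the reduction are illustrative): with stride 4, threads t and t + 4 of a half-warp would hit the same bank, while the padded stride of 5 is odd and therefore conflict-free.

    #define S      4         // original (even) pop rate
    #define S_PAD  (S + 1)   // padded, odd stride used for the shared-memory layout

    __global__ void padded_access(const float *g_in, float *g_out) {
      __shared__ float s_buf[256 * S_PAD];           // one padded row per thread (blockDim.x <= 256)
      int t = threadIdx.x;
      for (int k = 0; k < S; k++)                    // only S of the S_PAD entries are used
        s_buf[t * S_PAD + k] = g_in[blockIdx.x * blockDim.x * S + t * S + k];
      __syncthreads();
      float sum = 0.0f;
      for (int k = 0; k < S; k++)
        sum += s_buf[t * S_PAD + k];                 // conflict-free: the stride S_PAD is odd
      g_out[blockIdx.x * blockDim.x + t] = sum;
    }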
3.3 Graph Restructuring
In this part, Sponge vertically fuses some actors to improve performance by increasing coalesced memory accesses, removing kernel call overhead, and also increasing instruction overlap. Fusion
is not beneficial in all cases because it can increase the memory
traffic (push + pop) of a pair of LoT actors and reduce the number
of threads (Equation 3). For HiT actors, fusion may increase the
memory traffic as a result of register spilling.
The main benefit of fusing HiT actors is replacing uncoalesced
memory accesses at the end of the first actor and at the beginning
of the second actor with coalesced accesses. The memory accesses
become coalesced because the two actors within the fused actor
are rate matched. Therefore, the first actor can write to the internal buffer using coalesced memory writes and the second actor can
read the same data with coalesced memory reads. Figure 8 illustrates how fusion can lead to coalescing of memory accesses in
a simple GPU that has warp size of four and can coalesce two
memory accesses into one. In this figure, the memory accesses between actors A (producer with 2 pushes running with 8 threads)
and B (consumer with 8 pops running with 2 threads) are shown.
Wi,j is the jth push by the ith thread of A, and Rk,m is the mth pop of the kth thread of B. Figure 8a shows how writes and reads are performed between these actors in the case of no fusion. In this case, each thread serially writes and reads from global memory, which results in all uncoalesced accesses (marked by U). If the buffer allocation for A is changed such that its memory accesses can be coalesced (marked by C), as shown in Figure 8b, the accesses of the threads running B will still be uncoalesced. Figure 8c shows the accesses to the internal buffer between A and B after fusion is performed. The new actor, (4A)B, runs with two threads. Since there are 8 pushes and pops between 4A and B, all the accesses will be coalesced, as shown in Figure 8c.
For the LoT actors, global memory accesses are already coalesced with the help of shared memory. These accesses happen in two for loops before and after the work function. Similar to the HiT case, the accesses between two fused LoT actors become coalesced. Therefore, the resulting LoT actor does not need to use shared memory anymore. This results in the elimination of a large number of complex address calculations and for loop control instructions.
Sponge uses its cost estimation equations to decide if fusing a pair of actors is beneficial or not. For a candidate pair, Sponge calculates the number of cycles for both cases where the resulting actor is HiT or LoT. If in either case the execution time is less than the sum of the original actors' execution times, fusion is performed.
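One simple realization of such a fused actor is sketched below for a producer A with push rate 2 and a consumer B with pop rate 8; the actor bodies, the input indexing, and the decision to keep the internal tape in per-thread storage are illustrative assumptions. The internal buffer never touches global memory, so the only remaining global accesses are the fused actor's own input reads and output writes.

    // Hypothetical fused actor (4A)B: four invocations of producer A (push 2)
    // feed one invocation of consumer B (pop 8) through a per-thread buffer.
    __device__ float actorA(float x) { return x * x; }          // placeholder work

    __device__ float actorB(const float v[8]) {                  // placeholder work
      float s = 0.0f;
      for (int k = 0; k < 8; k++) s += v[k];
      return s;
    }

    __global__ void fused_4A_B(const float *g_in, float *g_out) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;
      float inter[8];                                // internal tape between 4A and B
      for (int i = 0; i < 4; i++) {                  // run A four times (rate matching)
        float x = g_in[t * 4 + i];                   // input indexing is illustrative
        inter[2 * i]     = actorA(x);
        inter[2 * i + 1] = actorA(x + 1.0f);
      }
      g_out[t] = actorB(inter);                      // run B once on the internal buffer
    }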
3.4 Register Optimization
Registers on GPUs are a precious resource. Efficiently using the
registers can greatly improve performance. In this section, two
optimizations that Sponge performs to increase register utilization
are discussed.
3.4.1 Software Prefetch
To better tolerate long memory access latency, the CUDA threading
model allows some warps to make progress while others wait for
their memory access results. This mechanism is not effective in
some cases where all threads are waiting for their memory access
results. This case happens if all threads have very few independent
instructions between memory access instructions and the use of
the accessed data. Prefetching is a technique that some CUDA
programs use to overlap fetching data from global memory for
iteration i + 1 of an actor with compute instructions in iteration
i by utilizing the available registers.
Figure 9a shows how software prefetching can be done for LoT
actors. Before the main for loop, the first batch of data (for iteration
1) is loaded into registers (L1 ). Once L2 has started, the data is
moved from registers into shared memory. At this point, threads
have to wait for the shared memory transfers to finish before they can progress, because that data is needed for computation after this point. After all threads are done moving the data to shared memory, they pass the barrier synchronization point and begin to load the next batch of data into registers. The key is that the work function does not need the data from these new memory accesses, so compute and memory accesses can overlap. Loop L4 has to be wrapped in an if statement because the last iteration of the kernel does not need to prefetch any data. This if statement does not introduce branch divergence since all the threads take the same path at this point.
One possible downside of this technique is that using additional registers for prefetching can reduce the number of blocks that can run on an SM. However, prefetching is beneficial if it significantly reduces the amount of time each thread waits for global memory accesses. Since different classes of NVIDIA GPUs are equipped with different numbers of registers, Sponge tunes this optimization for each GPU target. If prefetching the whole buffer introduces register spills or reduces the number of concurrent blocks, Sponge tunes the prefetching optimization by applying it to only a fraction of the input buffer.
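In CUDA terms, the prefetching pattern of Figure 9a has roughly the following shape (a pop rate of four, a push rate of one, and the placeholder computation are assumptions; the kernel is assumed to be launched with POP * blockDim.x * sizeof(float) bytes of dynamic shared memory).

    #define POP 4

    // Hypothetical LoT kernel with software prefetching: while iteration `it` is
    // computed out of shared memory, the input of iteration it+1 is already on
    // its way from global memory into registers (loops L1/L4 in Figure 9a).
    __global__ void lot_prefetch(const float *g_in, float *g_out, int iterations) {
      extern __shared__ float s_buf[];               // POP * blockDim.x floats
      float pre[POP];                                // prefetch registers
      int chunk = blockDim.x * POP;                  // input elements per block per iteration
      int base  = blockIdx.x * chunk;

      for (int k = 0; k < POP; k++)                  // L1: prefetch the first batch
        pre[k] = g_in[base + k * blockDim.x + threadIdx.x];

      for (int it = 0; it < iterations; it++) {
        for (int k = 0; k < POP; k++)                // L2: registers -> shared memory
          s_buf[k * blockDim.x + threadIdx.x] = pre[k];
        __syncthreads();
        base += gridDim.x * chunk;
        if (it + 1 < iterations)                     // L4: prefetch the next batch
          for (int k = 0; k < POP; k++)
            pre[k] = g_in[base + k * blockDim.x + threadIdx.x];
        float r = 0.0f;                              // work: placeholder computation
        for (int k = 0; k < POP; k++)
          r += s_buf[threadIdx.x * POP + k];
        g_out[it * gridDim.x * blockDim.x + blockIdx.x * blockDim.x + threadIdx.x] = r;
        __syncthreads();                             // s_buf is overwritten next iteration
      }
    }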
Figure 8: Memory accesses between actor A, with 2 pushes and 8 threads, and actor B, with 8 pops and 2 threads. Wi,j (Ri,j) denotes the jth memory write (read) performed by the ith thread running actor A (B). U and C denote uncoalesced and coalesced accesses. Part (a) shows the accesses in the base case. Part (b) illustrates the same accesses when the buffer for A is allocated such that its writes are coalesced. Part (c) shows coalesced accesses between these two actors when they are fused as (4A)B and executed with two threads. The number in the top left corner of each box is the memory address of that location.

3.4.2 Loop Unrolling
Instruction processing bandwidth on the processing cores of current CUDA graphics engines can negatively affect the performance of an actor. Address calculation and loop control instructions can become important if an actor has a small number of computation instructions. In other words, these types of instructions introduce overhead and prevent a kernel from reaching the peak performance of a GPU. Loop unrolling is one way to reduce this overhead. This optimization can also increase register utilization by unrolling loops that use registers. The degree of unrolling depends on the number of registers the kernel uses and also the number of registers that are available on the GPU. Since different classes of GPUs are equipped with different numbers of registers per SM, blindly applying unrolling to the for loops in a kernel may worsen the performance.
An example of the unrolling is shown in Figure 9b. There are five potential for loops in a typical LoT actor generated by Sponge, as shown in Figure 9a: two for transferring data to and from global memory, two for prefetching, and one for the work function. Depending on the number of registers available on the target GPU and the instruction mix of the kernel, Sponge decides whether to perform loop unrolling on these for loops.
Figure 9b shows how unrolling is applied to all five for loops to both remove the for loop overheads and increase register utilization. In this example, an unrolling factor of two is applied to the work function. As shown in Figure 9a, loop L1 is unrolled to U1. Because the work function is unrolled two times, all the corresponding for loops now appear twice, except L4. Replicating L4 two times would result in having two if statements. To remove the conditional branch instruction overhead, these two replicated for loops are merged into U6.
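The effect of unrolling can be illustrated with the pair of kernels below (a generic sketch, not Sponge output): both perform the same coalesced copy, but the unrolled version executes half of the loop-control and branch instructions and exposes independent loads that can be in flight simultaneously. nvcc's #pragma unroll can perform a similar transformation automatically, but Sponge emits the unrolled code explicitly so it can match the factor to the register budget of the target GPU.

    #define POP 4

    // Rolled version: the loop control and address computation execute POP times.
    __global__ void copy_rolled(const float *in, float *out) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;
      int n = gridDim.x * blockDim.x;
      for (int k = 0; k < POP; k++)
        out[k * n + t] = in[k * n + t];
    }

    // Unrolled by a factor of two: half of the loop overhead, and the two
    // independent loads per step can be issued back to back.
    __global__ void copy_unrolled(const float *in, float *out) {
      int t = blockIdx.x * blockDim.x + threadIdx.x;
      int n = gridDim.x * blockDim.x;
      for (int k = 0; k < POP; k += 2) {
        float a = in[k * n + t];
        float b = in[(k + 1) * n + t];
        out[k * n + t]       = a;
        out[(k + 1) * n + t] = b;
      }
    }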
3.5 A Stream Compilation Example
In this section, a running example, as shown in Figure 10, is used to
better illustrate how the optimizations affect the streaming graph.
The base graph in Figure 10a has 12 unique actors, two of which
are in a splitter-joiner structure. Each box shows one actor in the
program. Each edge in this graph indicates a tape implemented
using FIFO queues. The text written inside each box shows how
each actor interacts with its input and output tapes. All the actors
are stateless except G. This actor, the source (A), and the sink (H) are mapped to the host processor.
In the classification phase, Sponge will remove the two splitters
and joiners and replace all the copies of C and E with one of each.
This is done because GPUs do not support task level parallelism
and the joiner will introduce synchronization overhead. After this,
actors are classified as HiT and LoT based on their memory traffic
and computation instructions. LoT actors use shared memory but
HiT actors operate on global memory. In the example, actors D and
F are identified as HiT actors (shown with a darker color) and B, C
and E as LoT actors. [i, j] next to each GPU actor shows the number of threads (i) and the number of blocks (j) that will run that actor.
For the LoT actors, the number of threads depends on the size of
shared memory and the memory usage of the actor. For HiT actors,
the number of threads is always equal to the maximum number of
threads allowed per block because global memory is significantly
larger than the actors’ memory footprint.
Next, helper threads are used to fetch data from global memory
to shared memory more efficiently. After applying this, the number
of threads for LoT actors will increase, but HiT actors will run with fewer threads because they have to use the shared memory. In the example, as shown in Figure 10c, except for actor D,
every actor benefits from using helper threads. If the number of
threads for an actor is written as w + h, then w shows the number
of worker threads and h shows number of helper threads assigned
to that actor.
Finally, graph restructuring is performed on the graph and as a
result several actors get fused together. Figure 10d shows the result
of fusion and then re-applying classification and helper thread
optimization. Actors B and C are fused together into a LoT actor, and actors D, E, and F are fused into a single actor that is classified as HiT.
4. Experiments
In this section Sponge’s optimization techniques are evaluated and
compared with two alternative approaches:
1. GPU baseline: All stateless actors of the benchmarks are
mapped to the GPU utilizing the maximum number of threads
supported (MAX_THREAD_PER_BLOCK). In this
technique, all of the actors are compiled as HiT actors. Stateful
actors as well as source and sink actors are mapped to the host
processor.
2. CPU baseline: All the actors are executed sequentially on the
CPU.
Figure 9: Part (a) shows how prefetching is performed to improve the performance of a kernel (loops L1-L5). Part (b) depicts the result of unrolling on the kernel in part (a) (loops U1-U7).
4.1 Methodology
A set of benchmarks from the StreamIt suite [26] are used to evaluate Sponge. The benchmarks are compiled and evaluated on a system containing a 3GHz Intel Core 2 Extreme CPU with 6GB of
RAM and a GeForce GTX 285 GPU with 2GB DDR3 global memory. Sponge compilation phases are implemented as a compiler
backend to the StreamIt compiler. Sponge generates customized
CUDA code which is compiled using NVIDIA nvcc 3.1 for execution on the GPU. GCC 4.1 is used to generate the x86 binary for
execution on the host processor.
4.2 Techniques Performance
In this section, we try to compare the Sponge optimization techniques to the GPU baseline and highlight the effectiveness of each
optimization. Figure 11a shows how Sponge-generated CUDA
code performs and shows the performance gain of each optimization technique. On average, Sponge improves the performance by
3.2x compared to the GPU baseline.
The first optimization, shared/global memory, which divides actors into two categories, LoT and HiT, is one of the most beneficial Sponge techniques. By using shared memory, Sponge is able to coalesce all the memory accesses in LoT actors; therefore, the performance of benchmarks containing LoT actors increases significantly. As shown in Figure 11a, Matrix Multiply Block benefits the most because this benchmark has several LoT actors. As a
result, most of the actors in Matrix Multiply Block have coalesced
memory accesses. In some benchmarks, such as Histogram, little
benefit is seen using this optimization because most actors are HiT
actors.
Prefetching and unrolling are two other optimizations illustrated
in Figure 11a. These optimizations, collectively, contribute to 3.1%
of the total average speedup. The prefetching technique is used only for LoT actors and is useful mostly in applications with many LoT actors, such as Merge Sort and Bitonic. Unrolling allows Sponge to utilize unused registers and reduce the number of instructions. This technique can increase the performance of LoT actors that use few registers. DCT, Merge Sort, Radix, and Bitonic have such actors, and unrolling can increase their performance.
Another effective optimization in Sponge is employing helper threads. As described in the previous sections, helper threads can reduce the execution time of both LoT and HiT actors, with two exceptions:
• LoT actors with many threads: In this case, it is not possible to run more threads to help the worker threads. Reducing the number of worker threads would decrease performance.
• HiT actors with few threads: To utilize helper threads, HiT
actors would be converted into LoT actors, which have fewer
threads because of the limited shared memory size. Though
transferring data to shared memory improves memory performance, too few worker threads can become a bottleneck, underutilizing the SMs and decreasing the overall performance.
As shown in Figure 11a, helper thread optimization effectively increases the performance of DCT, FFT, Matrix multiply
and Merge Sort. For example, DCT has multiple HiT actors with a large number of worker threads. In this case, coalescing data accesses using shared memory provides enough performance gain that running the actors with fewer threads does not result in a slowdown. On average, helper threads contribute 16% of the total
average speedup compared to the GPU baseline.
Graph restructuring decreases the overhead of kernel launching
and uncoalesced memory accesses. As discussed in Section 3.3,
there are some cases where fusing two actors may result in degraded performance. Since two actors that are fused must execute
together, the number of threads that the resulting actor can run will be less than the number of threads created by running each actor separately. Because the reduction of threads can decrease the performance, Sponge intelligently decides whether or not to use this optimization. Several benchmarks, such as FFT, have large pipelines of actors that are all fused together by Sponge. Graph restructuring provides a large portion of the speedup for these types of benchmarks. Since Batcher and Vector Add have only one actor, fusion cannot increase their performance. In Merge Sort, the opportunity for performing fusion is minimal because most of the actors in this benchmark are isolated from each other and do not form a pipeline.
Figure 10: Part (a) shows a stream graph with 12 unique actors. Part (b) shows how actor classification and graph reorganization affect this graph; shaded actors are HiT actors. Part (c) illustrates the result of the helper thread optimization. Part (d) depicts the same graph after applying graph restructuring. [i, j] next to each GPU actor shows the number of threads (i) and the number of blocks (j) that will run that actor. If i is written as w + h, w is the number of worker threads and h is the number of helper threads.
Figure 11: Effectiveness of Sponge optimization techniques on the StreamIt benchmarks. (a) Performance breakdown of Sponge optimizations (shared/global memory, prefetch/unrolling, helper threads, graph restructuring) in comparison to the baseline CUDA code, both running on the GPU (geometric mean 2.89, harmonic mean 2.52). (b) Speedup of Sponge-optimized code in comparison to the host CPU, with and without data transfer overhead (geometric mean 20.1 and harmonic mean 17.2 without transfers; geometric mean 7.8 and harmonic mean 4.7 with transfers).
4.3 Overall Performance
Figure 11b presents the speedup of Sponge’s generated CUDA applications against the CPU baseline, both with and without the
data transfer time between the GPU and CPU. On average, Sponge
achieves about 20x speedup compared to running each benchmark
completely on the CPU. The only case in which the CPU baseline outperforms Sponge is Vector Add when the data transfer overhead is included. In this special case, the memory-to-compute ratio in Vector Add is very high. Although the GPU can execute the Vector Add actor 10x faster than the CPU, the overhead of transferring the data between the host and GPU global memory decreases the overall performance.
GeForce GTX 285
Benchmark     Shared   Prefetch   Unrolling   Helping Threads
DCT           66.7     0          50          50
FFT           66.7     4.2        70.8        16.7
MM            62.5     12.5       37.5        25
MM Block      80       0          80          20
Bitonic       100      40.7       50.7        0.7
Batcher       100      0          100         0
Radix         100      0          100         0
Merge Sort    100      34.8       30.4        52.1
Comp Count    100      13         63.2        0
Vect Add      100      0          0           0
Histogram     33.3     0          33.3        0

Tesla C2050
Benchmark     Shared   Prefetch   Unrolling   Helping Threads
DCT           100      0          50          50
FFT           100      41.7       33.3        0
MM            87.5     12.5       50          25
MM Block      93.3     6.7        60          6.7
Bitonic       100      1.5        34.4        0.7
Batcher       100      0          100         0
Radix         100      100        0           0
Merge Sort    100      0          0           0
Comp Count    100      100        0           0
Vect Add      100      0          0           0
Histogram     33.3     66.7       0           33.3

Table 1: This table shows how Sponge optimizes each benchmark differently for two GPU targets. For each benchmark and target, the percentage of actors that are optimized by each optimization is shown.
4.4 Portability
Quantifying portability is inherently a hard problem. To show how
Sponge solves the portability issue, we show how it optimizes each
benchmark differently for two GPUs, Tesla C2050 and GeForce
GTX 285. The C2050 is based on the newer NVIDIA architecture (Fermi) which has 48KB of shared memory, 32K registers and
420 streaming processors, providing Sponge with more resources
to exploit. Table 1 shows how Sponge makes different decisions
based on the target architecture. This table illustrates the percentage of actors in each benchmark optimized using various optimizations in Sponge. As shown in Table 1, Sponge is able to classify
more actors as LoT actors and utilize the larger shared memory
in C2050. The number of registers also affects how Sponge performs unrolling and prefetching for each target. In general, Sponge
adapts its compilation strategy to the characteristics of the
GPU target without any source code modification or programmer
involvement.
5. Case Study and Future Work
Sponge is designed to reduce the performance gap between automatically generated CUDA programs and hand-optimized ones. In this section, two hand-optimized CUDA programs from the NVIDIA SDK are analyzed to highlight the reasons for the performance differences between Sponge-generated and hand-optimized CUDA code.
5.1 Black-Scholes
The Black-Scholes model is a differential equation that predicts how the value of an option changes. The benchmark reads five parameters from the input, computes the prices of a call option and a put option, and writes these two values to the output. In the code generated by Sponge for the GTX 285, there is only one kernel; it pops five elements from the input, computes the prices, and pushes the two results to the output buffer. This actor is classified as an LoT actor, so Sponge uses shared memory to coalesce all of the buffer accesses in the actor.
In the hand-optimized code, only one kernel is launched as well, but each parameter is placed in a separate array, so the kernel has five input arrays and two output arrays. With this layout, consecutive threads read consecutive elements of each input array and write consecutive elements of each output array, allowing all memory accesses to be coalesced. Coalescing the accesses without going through shared memory also reduces the number of instructions in the hand-optimized version. As a result, the hand-written program performs 1.3x better than Sponge's generated code.
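The difference between the two layouts can be sketched as follows. This is neither the SDK's nor Sponge's actual code; the price() helper and the parameter names are placeholders. The first kernel uses the structure-of-arrays layout of the hand-optimized version, where consecutive threads touch consecutive addresses in every array; the second shows the strided accesses implied by a single interleaved pop=5/push=2 stream buffer, which remain uncoalesced unless the data is first staged through shared memory.

// Illustrative pricing helper; stands in for the actual Black-Scholes formulas.
__device__ void price(float S, float X, float T, float R, float V,
                      float *call, float *put) {
  float d = X * expf(-R * T) * V;   // placeholder math, not the real model
  *call = S - d;
  *put  = d - S;
}

// Structure-of-arrays layout, as in the hand-optimized kernel: thread i reads
// element i of each parameter array, so every warp's loads and stores coalesce.
__global__ void blackScholesSoA(const float *S, const float *X, const float *T,
                                const float *R, const float *V,
                                float *call, float *put, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    price(S[i], X[i], T[i], R[i], V[i], &call[i], &put[i]);
}

// Single interleaved stream buffer, as a pop=5/push=2 actor sees it:
// consecutive threads read global memory with a stride of five elements.
__global__ void blackScholesInterleaved(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    const float *p = in + 5 * i;                      // strided, uncoalesced
    price(p[0], p[1], p[2], p[3], p[4], &out[2 * i], &out[2 * i + 1]);
  }
}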
This input/output buffer re-mapping is not currently done in Sponge because StreamIt does not support actors with multiple input and output streams: all input and output streams between StreamIt actors go through a single shared buffer between the actors. Future work will represent these multiple input/output streams in StreamIt so the compiler can detect such cases and improve the memory layout for GPUs.

5.2 Histogram
The histogram benchmark computes the distribution of pixel intensities within an image. Histogram is implemented using a technique called stream reduction, which is common in many GPU applications. Each phase of stream reduction removes some elements of the input data, performs computation on them, and sends the results as a new input to the next phase. The histogram benchmark has several phases. In the first phase, the input data array is divided into fixed-size blocks. In the second phase, a sub-histogram is computed for each block. In the final phase, all of the sub-histograms are collated into a single histogram.
A StreamIt graph of stream reduction is shown in Figure 12. The number of actors in this type of benchmark is data-size dependent; therefore, as the size of the input data grows, the number of phases increases and the overhead of launching the kernels becomes dominant. Sponge can fuse all of these phases together, but the final actor would then have a very large pop rate, which makes it impossible to use the limited shared memory to coalesce its memory accesses. In both cases, the large number of kernels and the uncoalesced memory accesses degrade performance.
[Figure 12: Stream graph of a generic stream reduction kernel. Sixteen inputs are reduced by successive levels of pop=2, push=1 actors (A, B, C, and D) connected through joiners down to a single output.]
In the hand-optimized CUDA implementation, there is only one kernel for all of the phases of the reduction, but the number of threads that do the actual work differs in each phase. As a result, the hand-written CUDA histogram benchmark outperforms Sponge's generated CUDA code by 5x.
We would like to enhance the performance of Sponge on this type of benchmark by detecting the stream reduction subgraph in the compiler and replacing it with a single specialized stream reduction kernel that mimics the behavior of the hand-optimized CUDA.
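A minimal sketch of the single-kernel pattern used by the hand-optimized implementation is shown below, assuming each block's portion of the reduction fits in shared memory and the block size is a power of two. It is a generic pairwise (pop=2, push=1) reduction in the spirit of Figure 12, not the actual SDK histogram kernel: each loop iteration corresponds to one phase, and the number of threads doing useful work halves from one phase to the next.

// Generic single-kernel stream reduction. combine() is a placeholder for the
// per-phase computation (e.g., summing sub-histogram bins).
__device__ float combine(float a, float b) { return a + b; }

__global__ void streamReduce(const float *in, float *partial, int n) {
  extern __shared__ float buf[];                 // one element per thread
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  buf[tid] = (i < n) ? in[i] : 0.0f;
  __syncthreads();

  // Each iteration is one reduction phase; only the first `active` threads work.
  for (int active = blockDim.x / 2; active > 0; active >>= 1) {
    if (tid < active)
      buf[tid] = combine(buf[tid], buf[tid + active]);
    __syncthreads();
  }
  if (tid == 0)
    partial[blockIdx.x] = buf[0];                // one partial result per block
}

// Launch with dynamic shared memory, e.g.:
//   streamReduce<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
// A small second launch (or a host-side pass) combines the per-block partials.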
6. Related Work
The most common language for writing GPU code is "C for CUDA" (C with NVIDIA extensions and certain restrictions). Tuning these C-like programs is highly challenging because managing the amount of on-chip memory used per thread, the total number of threads per multiprocessor, and the pattern of off-chip memory accesses are some of the problems that developers need to solve manually to achieve good performance [23]. To alleviate this burden, several recent studies have tried to automatically manage these parameters in CUDA programs.
One study, closely related to Sponge, is the optimizing compiler introduced by Udupa et al. [27]. They compile stream programs for GPUs using software pipelining techniques. In the software pipelining approach, different actors from different iterations
are simultaneously processed. Their technique, though promising, does not perform well on GPUs because it exploits task-level parallelism and is unable to exploit the massive data-level parallelism of GPUs. There has also been recent work [29] on GPU compilation for memory optimization and parallelism management. The input to that compiler is a naive GPU kernel function, which it analyzes to generate optimized CUDA code. Sponge is distinctly different because it exploits the information in the high-level stream graph to perform kernel-level optimizations specific to StreamIt, such as graph restructuring, and then applies lower-level optimizations, such as memory and thread hierarchy management.
CUDA-Lite [30] is another compilation framework that takes naive GPU kernel functions as input and tries to coalesce all memory accesses by using shared memory. Programmers must provide annotations describing certain properties of the data structures and the code regions designated for GPU execution. Our work is different in that Sponge does not need any annotations. Sponge also uses shared memory to coalesce memory accesses and can maximize the utilization of various GPU resources, such as registers. Another difference is that when the size of the shared memory limits the number of worker threads, Sponge is able to insert helper threads to accelerate the transfer of data between global and shared memory. hiCUDA [8] is a high-level, directive-based compiler framework for CUDA programming in which programmers insert directives into sequential C code to define the boundaries of kernel functions. Another work in the area of automatic CUDA generation is [16]; the authors generate optimized CUDA programs from OpenMP programs. They do not use shared memory in
their compiler for coalescing memory accesses. Hong et al. [9] propose an analytical performance model for GPUs that compilers can
use to predict the behavior of their generated code. Fung et al. [4]
regroup threads into new warps to minimize the number of divergent warps. Chen et al. [2] use communication and computation
threads to overlap the data exchange of the boundary nodes between adjacent thread blocks. This is fundamentally different from Sponge's use of helper threads, which perform parallel prefetching of data.
MCUDA [24] compiles CUDA programs for conventional shared-memory architectures; it can be used to improve the performance of traditional shared-memory parallel systems using CUDA optimization techniques. With the stream programming model, it is possible to apply architecture-specific optimizations for a wide range of architectures. Researchers have already proposed ways to map and optimize synchronous data-flow languages to SIMD engines [12], distributed shared-memory systems [14], and field-programmable gate arrays [10].
Runtime re-compilation of GPU binaries to adapt code to different targets is another approach that can provide portability across GPUs. OpenCL [13] is one of the approaches taken by industry to achieve portability. We believe that OpenCL, in its current form, suffers from the same inefficiencies as CUDA and does not provide an architecture-independent solution.
There is a large body of literature on exploiting parallelism in streaming codes for better performance. The most recent and relevant works include the compilation of streaming languages such as StreamIt, Brook [1], Sequoia [3], and Cg [18] to multi-cores or data-parallel architectures. For example, Gordon et al. [5, 6] perform stream graph refinements to statically determine the best mapping of a StreamIt program to a multi-core CPU. Liao et al. apply classic affine partitioning techniques to exploit the properties of stream operators [28]. There is also a rich history of scheduling and resource-allocation techniques developed in Ptolemy that make fundamental contributions to stream scheduling (e.g., [7, 22]). In recent work [25], the authors discuss the usefulness of different features of StreamIt for a wide range of streaming applications. Several works, such as [11], propose techniques to dynamically recompile streaming applications based on the availability of resources in a heterogeneous system. Sponge is a complementary addition to these works as GPUs become a commodity in heterogeneous systems.
7. Conclusion
Heterogeneous systems, where sequential work is done on traditional processors and parallelizable work is offloaded to a specialized computing engine, will be ubiquitous in the future. Among the different solutions that can take advantage of this parallelism, GPUs are the most popular and have been shown to provide significant performance, power-efficiency, and cost benefits for general-purpose computing in highly parallel domains. GPUs achieve their high performance and efficiency by providing a massively parallel architecture with hundreds of in-order cores while exposing the parallelism mechanisms and the memory hierarchy to the programmer. One key to maximizing performance in these future heterogeneous systems will be efficiently utilizing not only the host processor but also the GPU.
While GPUs provide a very desirable target platform for accelerating parallel workloads, their programming complexity poses a
significant challenge to application developers. Languages such as CUDA alleviate the complexity problem to some extent but fail to fully abstract the underlying GPU architecture. Therefore, managing
the amount of on-chip memory used per thread, the total number
of threads per multiprocessor, and the pattern of off-chip memory
accesses are problems that developers still need to manage in order
to maximize GPU utilization.
In this work, we propose Sponge, a streaming compiler for the StreamIt language that is capable of performing an array of optimizations on stream graphs and generating efficient CUDA code for GPUs. The optimizations in Sponge facilitate a write-once software paradigm in which programmers rely on the compiler to automatically create customized CUDA code for a wide variety of GPU targets. These optimizations improve performance over naive CUDA implementations by an average of 3.2x. Finally, as a case study, we compare Sponge's generated code against two hand-optimized CUDA benchmarks, Black-Scholes and Histogram. For Black-Scholes, Sponge achieves within 30% of the performance of the hand-optimized CUDA code. Future work on Sponge will improve the automatic detection of certain memory layouts and stream graph structures that are currently not supported.
Acknowledgement
Much gratitude goes to the anonymous referees who provided excellent feedback on this work. We also thank Todd Mowry for shepherding this paper. This research was supported by ARM Ltd. and the National Science Foundation under grants CNS-0964478 and CCF-0916689.
References
[1] I. Buck et al. Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics, 23(3):777–786, Aug. 2004.
[2] J. Chen, Z. Huang, F. Su, J.-K. Peir, J. Ho, and L. Peng. Weak execution ordering - exploiting iterative methods on many-core GPUs. In Proc. of the 2010 IEEE Symposium on Performance Analysis of Systems and Software, pages 154–163, 2010.
[3] K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Sequoia: Programming the memory hierarchy. In Proc. of the 2006 ACM/IEEE Conference on Supercomputing, page 83, 2006.
[4] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proc. of the 40th Annual International Symposium on Microarchitecture, pages 407–420, 2007.
[5] M. I. Gordon, W. Thies, and S. Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 151–162, 2006.
[6] M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J. Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. A stream compiler for communication-exposed architectures. In Tenth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 291–303, Oct. 2002.
[7] S. Ha and E. A. Lee. Compile-time scheduling and assignment of data-flow program graphs with data-dependent iteration. IEEE Transactions on Computers, 40(11):1225–1238, 1991.
[8] T. Han and T. Abdelrahman. hiCUDA: High-level GPGPU programming. IEEE Transactions on Parallel and Distributed Systems, (99):1–1, 2010.
[9] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proc. of the 36th Annual International Symposium on Computer Architecture, pages 152–163, 2009.
[10] A. Hormati, M. Kudlur, D. Bacon, S. Mahlke, and R. Rabbah. Optimus: Efficient realization of streaming applications on FPGAs. In Proc. of the 2008 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 41–50, Oct. 2008.
[11] A. H. Hormati, Y. Choi, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. Flextream: Adaptive compilation of streaming applications for heterogeneous architectures. In Proc. of the 18th International Conference on Parallel Architectures and Compilation Techniques, pages 214–223, 2009.
[12] A. H. Hormati, Y. Choi, M. Woh, M. Kudlur, T. Mudge, and S. Mahlke. MacroSS: Macro-SIMDization of streaming applications. In 18th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 285–296, 2010.
[13] KHRONOS Group. OpenCL - the open standard for parallel programming of heterogeneous systems, 2010.
[14] M. Kudlur and S. Mahlke. Orchestrating the execution of stream
programs on multicore platforms. In Proc. of the ’08 Conference on
Programming Language Design and Implementation, pages 114–124,
June 2008.
[15] E. Lee and D. Messerschmitt. Synchronous data flow. Proceedings of
the IEEE, 75(9):1235–1245, 1987.
[16] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: A compiler framework for automatic translation and optimization. In Proc. of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 101–110, 2009.
[17] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen,
N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100x GPU vs. CPU myth: an
evaluation of throughput computing on CPU and GPU. In Proc. of
the 37th Annual International Symposium on Computer Architecture,
pages 451–460, 2010.
[18] W. Mark, R. Glanville, K. Akeley, and J. Kilgard. Cg: A system for
programming graphics hardware in a C-like language. In Proc. of the
30th International Conference on Computer Graphics and Interactive
Techniques, pages 893–907, July 2003.
[19] NVIDIA. CUDA Programming Guide, June 2007. http://developer.download.nvidia.com/compute/cuda.
[20] NVIDIA. Fermi: NVIDIA's next generation CUDA compute architecture, 2009. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[21] NVIDIA. GPUs are only up to 14 times faster than CPUs says Intel, 2010. http://blogs.nvidia.com/ntersect/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel.html.
[22] J. L. Pino, S. S. Bhattacharyya, and E. A. Lee. A hierarchical multiprocessor scheduling framework for synchronous dataflow graphs.
Technical Report UCB/ERL M95/36, University of California, Berkeley, May 1995.
[23] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-M. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proc. of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 73–82, 2008.
[24] J. A. Stratton, S. S. Stone, and W.-M. W. Hwu. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proc. of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 16–30, 2008.
[25] W. Thies and S. Amarasinghe. An empirical characterization of stream
programs and its implications for language and compiler design. In
Proc. of the 19th International Conference on Parallel Architectures
and Compilation Techniques, page To Appear, 2010.
[26] W. Thies, M. Karczmarek, and S. P. Amarasinghe. StreamIt: A language for streaming applications. In Proc. of the 2002 International
Conference on Compiler Construction, pages 179–196, 2002.
[27] A. Udupa, R. Govindarajan, and M. J. Thazhuthaveetil. Software
pipelined execution of stream programs on gpus. In Proc. of the
2009 International Symposium on Code Generation and Optimization,
pages 200–209, 2009.
[28] S.-W. Liao, Z. Du, G. Wu, and G.-Y. Lueh. Data and computation transformations for Brook streaming applications on multiprocessors. In Proc. of the 2006 International Symposium on Code Generation and Optimization, pages 196–207, 2006.
[29] Y. Yang, P. Xiang, J. Kong, and H. Zhou. A GPGPU compiler for memory optimization and parallelism management. In Proc. of the '10 Conference on Programming Language Design and Implementation, pages 86–97, 2010.
[30] S.-Z. Ueng, M. Lathara, S. S. Baghsorkhi, and W.-M. W. Hwu. CUDA-Lite: Reducing GPU programming complexity. In Proc. of the 21st Workshop on Languages and Compilers for Parallel Computing, pages 1–15, 2008.