
A Performance Study of General Purpose Applications on Graphics Processors

Shuai Che, Jiayuan Meng, Jeremy W. Sheaffer, Kevin Skadron


[email protected] [email protected] [email protected] [email protected]
The University of Virginia, Department of Computer Science

Abstract

Graphics processors (GPUs), with many light-weight data-parallel cores, can provide substantial parallel computational power to accelerate general purpose applications. To best utilize the GPU's parallel computing resources, it is crucial to understand how GPU architectures and programming models can be applied to different categories of traditionally CPU applications. In this paper we examine several common, computationally demanding applications—Traffic Simulation, Thermal Simulation, and K-Means—whose performance may benefit from graphics hardware's parallel computing capabilities. We show that all of our applications can be accelerated using the GPU, demonstrating as high as 40× speedup when compared with a CPU implementation. We also examine the performance characteristics of our applications, presenting advantages and inefficiencies of the programming model and desirable features to more easily and completely support a larger body of applications.

1 Introduction

Traditional single-core microprocessors are having difficulty achieving higher clock frequencies. The limitations imposed by deep pipelining, transistor scaling, and power and thermal constraints have forced CPU vendors to find other ways to meet high performance computing needs. One solution is that of multi-core architectures, which integrate multiple cores onto a single chip. Examples are off-the-shelf Intel Duo-core and Quad-core products, the Sony/Toshiba/IBM alliance's Cell Broadband Engine [9], MIT RAW [24], and Sun's Niagara [13].

Another powerful solution is the GPU. GPUs represent highly specialized architectures designed for graphics rendering, their development driven by the computer gaming industry. Recently, there has been a trend to accelerate computationally intensive applications, including scientific applications, on graphics processors. This trend introduced the new term GPGPU, or General-Purpose computation on the GPU.

The GPU has several key advantages over CPU architectures for highly parallel, compute intensive workloads, including higher memory bandwidth, significantly higher floating-point throughput, and thousands of hardware thread contexts with hundreds of parallel compute pipelines executing programs in a SIMD fashion. The GPU can be an attractive alternative to CPU clusters in high performance computing environments.

The term GPGPU causes some confusion nowadays, with its implication of structuring a general-purpose application to make it amenable to graphics rendering APIs (OpenGL or DirectX) with no additional hardware support. NVIDIA has introduced a new data-parallel, C-language programming API called CUDA (for Compute Unified Device Architecture) that bypasses the rendering interface and avoids the difficulties of classic GPGPU. Parallel computations are instead expressed as general-purpose, C-language kernels operating in parallel over all the points in a domain.

To best utilize GPUs' powerful computing resources, it is necessary to examine to what extent traditionally CPU domain problems can be mapped to GPU architectures, and what kinds of features the GPU parallel programming model should support. A recent report from Berkeley [1] argued that successful parallel architectures should perform well over a set of 13 representative classes of problems, termed dwarves, each of which captures a body of related problems; they include the Structured Grid, N-Body, and Dynamic Programming dwarves.

Inspired by this work, and noting an apparent architectural convergence of CPUs and GPUs, our goal is to find a good programming model that can provide a rich and useful feature set for today's parallel computing needs. This paper introduces our work in progress, which is still at a developmental stage. We port some widely used applications to CUDA and analyze their performance. Our applications—traffic simulation, thermal simulation (with HotSpot [8]), and k-means—have a great deal of data-parallelism and benefit from the GPU's parallel computing resources. We compare results on an NVIDIA Geforce 8800 GTX against an Intel Pentium 4, both under Windows XP. The CPU code is compiled with Visual Studio 2005 with the SSE2 (Streaming SIMD Extensions 2) option and O2 optimization turned on.

All of our applications show satisfactory speedups. The maximum observed speedups are about 40× for the traffic simulation, 6.5× for HotSpot, and 8× for k-means. Also, to better utilize the GPU under CUDA, we make novel, architecturally-aware changes to the algorithms and data structures, such as the introduction of a pyramid data structure in the HotSpot benchmark to avoid the need for excess synchronization between thread blocks.

In addition to showing speedups using CUDA, another contribution of our work is that we present some advantages and inefficiencies of the CUDA model. For example, we find that a global synchronization mechanism is necessary in order to avoid the need to separate a logical kernel into disjoint parts. Double buffering or a bulk load/store mechanism would be useful as an additional mechanism to help reduce memory access latency.

2 Related Work

A lot of recent work has focused on GPGPU. A framework for solving linear algebra problems on graphics processors is presented by Krüger et al. [14]. Harris et al. present a cloud dynamics simulation using partial differential equations [7]. Some important database operations are implemented on the GPU by efficiently using pixel engines [6]. Bioinformatics algorithms such as sequence alignment [15] have also been successfully implemented on GPUs.

OpenGL [12] and DirectX [20] are the two major API interfaces for programming graphics applications, but they are not convenient for developing general purpose applications on GPUs. With increased programmability in recent GPU generations, languages such as Brook [2], Sh [18], and Cg [17] were developed to provide simpler programming environments for GPU developers.

CUDA is a new language and development environment from NVIDIA, allowing execution of applications with thousands of data-parallel threads on NVIDIA G80 class GPUs. AMD recently introduced their own GPU direct compute API, Close To the Metal (CTM). CTM also provides a way around graphics APIs, with a driver and programming interface designed specifically for compute.

3 GPU Architectures

3.1 The NVIDIA Geforce 8800 GTX GPU

GPUs have developed over the course of the last decade as highly specialized processors for the acceleration of raster graphics. GPUs have been developed hand-in-hand with graphics APIs, thus each stage in the traditional GPU pipeline corresponds very closely with a corresponding stage in the conceptual OpenGL pipeline. Recent GPUs have added programmability to the vertex and fragment shading stages of the pipeline. Instead of separate processing units for different shaders, the Unified Shader Model (USM) was introduced with DirectX 10, allowing for better utilization of GPU processing resources. ATI's Xenos chip for the Xbox 360, AMD R600, and NVIDIA's Geforce 8800 [16] all implement USM. The computing resources needed by shaders vary greatly among different applications. The unified design can overcome this issue by balancing loads between geometry, vertex, and fragment shader functionality, yielding maximum utilization of computing resources.

Figure 1. The Geforce 8800 GTX Architecture, with 16 multiprocessors, each with 8 streaming processors.

A diagram of the NVIDIA Geforce 8800 GTX architecture is shown in Figure 1. Its design is a radical departure from traditional mainstream GPU design. The 8800 GTX is comprised of 16 multiprocessors. Each multiprocessor has 8 streaming processors (SPs) for a total of 128 SPs. Each group of 8 SPs shares one L1 data cache. An SP contains a scalar ALU and can perform floating point operations. Instructions are executed in a SIMD fashion. The 8800 GTX has 768 MB of graphics memory, with a peak observed performance of 330 GFLOPS and 86 GB/s peak memory bandwidth [3]. This specialized architecture can sufficiently meet the needs of many massively data-parallel computations.

3.2 CUDA

Direct3D and OpenGL include many functions provided to render complex graphics scenes and abstract away the communication between applications and graphics drivers.
Unfortunately, for non-graphics applications, these APIs introduce many hurdles for the general purpose application programmer, including the inherent problems born of the need to coerce general purpose problems into looking like rendering problems, and the functional problem imposed by the lack of scatter functionality.

Figure 2. CUDA's shared memory architecture. Threads in one block can share data using on-chip shared memory [21].

CUDA provides a cleaner interface than traditional GPGPU to program the GPU for general purpose applications, exposing important architectural features of G80 and eliminating the need to map computations to a graphics API. In CUDA, the GPU is a device that can execute multiple concurrent threads. Threads are executed in SIMD thread blocks, which facilitate efficient data sharing. Blocks and threads are indexed by block and thread ids, and the GPU is conceptually viewed as a set of multiprocessors. One or more thread blocks are dispatched to each processor and executed using time sharing [21]. Blocks are further organized into grids, with blocks within a grid being run to completion and not currently guaranteed to execute in any particular order; hence, blocks should not communicate except by allowing an entire grid to complete, returning to the host, and subsequently starting a new kernel. The 3D capability of blocks, coupled with the 2D capability of grids, can potentially allow expression of 5D domains. As is illustrated in Figure 2, each multiprocessor has a set of register files, a shared memory data cache, and a read-only (constant) cache. Threads within one block can share data through shared memory, allowing very high access speeds. Constant memory is optimized for fast read-only access to a region of global, read/write device memory [21].

Below, we give a simple example of a CUDA program which assigns the values in an n × n matrix B to a matrix A, and an example of the same operation in C. If n is 128 and each block is 16 by 16, CUDA will divide the whole matrix into 64 blocks, and each assignment becomes the responsibility of a single thread. The CUDA programming model abstracts away the loop needed in the CPU implementation, a powerful abstraction very useful in solving more complicated problems.

• Matrix copy in CUDA

  /* each thread computes the row-major index of its element and copies it */
  index = n * BLOCK_SIZE * blockIdx.y +
          BLOCK_SIZE * blockIdx.x +
          n * threadIdx.y + threadIdx.x;
  A[index] = B[index];

• Matrix copy on the CPU in C

  for (i = 0; i < n; i++)
      for (j = 0; j < n; j++)
          A[i][j] = B[i][j];
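
For concreteness, the CUDA fragment above can be embedded in a complete kernel with its host-side launch. The following is a minimal sketch only; the function and variable names (matrixCopy, d_A, d_B) are illustrative rather than the exact code used in our study, n is assumed to be a multiple of BLOCK_SIZE, and error checking is omitted.

  #include <cuda_runtime.h>

  #define BLOCK_SIZE 16

  /* Each thread copies exactly one element of B into A. */
  __global__ void matrixCopy(float *A, const float *B, int n)
  {
      int index = n * BLOCK_SIZE * blockIdx.y +
                  BLOCK_SIZE * blockIdx.x +
                  n * threadIdx.y + threadIdx.x;
      A[index] = B[index];
  }

  /* Host side: allocate device buffers, copy the input, launch, copy back. */
  void copyMatrixOnGPU(float *h_A, const float *h_B, int n)
  {
      float *d_A, *d_B;
      size_t bytes = (size_t)n * n * sizeof(float);

      cudaMalloc((void **)&d_A, bytes);
      cudaMalloc((void **)&d_B, bytes);
      cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

      dim3 block(BLOCK_SIZE, BLOCK_SIZE);        /* 16 x 16 threads per block */
      dim3 grid(n / BLOCK_SIZE, n / BLOCK_SIZE); /* 8 x 8 = 64 blocks when n = 128 */
      matrixCopy<<<grid, block>>>(d_A, d_B, n);

      cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);
      cudaFree(d_A);
      cudaFree(d_B);
  }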

3.3 Parallel Execution Model

The two dominant parallel architectural models are the message passing and shared memory models [26]. CUDA's shared memory model allows threads within the same block to share data using high-speed on-chip shared memory. Threads from different blocks can share data via global memory. CUDA adopts a SIMD data-parallel model in which one instruction is executed multiple times in parallel on different data elements.

In CUDA, the CPU is viewed as a control processor that is responsible for parameter setup, data initialization, and execution of the serial portions of the program. The GPU is viewed as a co-processor whose job is to accelerate data-parallel computations.

Though CUDA provides a powerful API with a swift learning curve, there are several challenges for CUDA developers. First, before a kernel can be launched on the graphics processor, data must be copied from host memory to GPU device memory. Copy overhead is proportional to the amount of data to be copied. Also, CUDA has an explicitly controlled memory hierarchy that is critical to performance and requires fine-grained tuning. In addition, barrier functionality can only be imposed on threads within the same block. Synchronizing with threads in different blocks
can be achieved by terminating a function call; otherwise, read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) hazards become a concern [21].

4 Experiment Setup and Methodology

We chose representative commercial products from both the GPU and CPU markets: an NVIDIA Geforce 8800 GTX (128 stream processors clocked at 675 MHz with 768 MB of graphics memory and a 16 kB parallel data cache per multiprocessor block) with NVIDIA driver version 6.14.11.6201 and CUDA 1.0, and a traditional single-core Intel Pentium 4 CPU (2.4 GHz with 512 MB main memory, 512 kB L2 and 8 kB L1). We developed our GPU code using NVIDIA's CUDA API. The CPU code is compiled under Visual Studio with O2 and SSE2. The GPU implementation results are compared against the Intel CPU results under Windows. Given the same input data set, the speedup is calculated by taking the wall-clock time required by applications on the CPU divided by the time required by the GPU. Times are measured after initial setup (e.g. after file I/O) but do include PCI-E bus transfer times.

5 Experimental Results

5.1 Memory Overhead

Figure 3. Memory transfer overhead between our CPU and GPU. As the size of data increases, the time overhead for a data transfer increases linearly.

Classically, in a PC the graphics card is connected via the AGP or PCI Express bus to a North Bridge chip, which also connects to the CPU and main memory, so the data transfer rate of the bus and the memory bandwidth are very crucial to the performance of applications. We measured the memory overhead by varying the amount of data transferred between our CPU and GPU. We found that the time necessary to transfer data increases linearly with the amount of data, as illustrated in Figure 3; thus the constant overhead of each individual transfer is immeasurable so long as transfers are large or infrequent.
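
As a concrete illustration of how such a measurement can be made, the following sketch times host-to-device copies of increasing size using CUDA events; the buffer sizes and names are illustrative assumptions and this is not necessarily the exact harness used for Figure 3.

  #include <stdio.h>
  #include <stdlib.h>
  #include <cuda_runtime.h>

  /* Time host-to-device copies of increasing size and print the bandwidth. */
  int main(void)
  {
      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      for (size_t mb = 1; mb <= 256; mb *= 2) {
          size_t bytes = mb * 1024 * 1024;
          void *h_buf = malloc(bytes);
          void *d_buf;
          cudaMalloc(&d_buf, bytes);

          cudaEventRecord(start, 0);
          cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
          cudaEventRecord(stop, 0);
          cudaEventSynchronize(stop);

          float ms = 0.0f;
          cudaEventElapsedTime(&ms, start, stop);
          printf("%4zu MB: %8.3f ms (%.2f GB/s)\n",
                 mb, ms, bytes / (ms * 1.0e6));

          cudaFree(d_buf);
          free(h_buf);
      }

      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      return 0;
  }
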
5.2 Traffic Simulation

A great deal of civil engineering work exists on simulating automotive traffic. Aimed at improving traffic control to reduce congestion and accidents, these simulations tend to require a lot of compute power. There has been no previous work that uses graphics processors to accelerate traffic simulation. Our work is based on a part of the MITSIM model [27], which simulates traffic networks.

5.2.1 Algorithm Overview

Our work re-implements part of MITSIM's lane-changing model—only a portion of the MITSIM infrastructure, which provides a much broader and more comprehensive traffic simulation model than our benchmark—with cars running in 4 lanes [23]. The lanes are implemented as a 4-wide 2-D array with each cell representing a position. The car structure carries the information about the car's speed, position, the lane the car resides in, and so forth. All of this information is dynamically updated during the lifetime of the simulation. Cars can change lanes only if the behind-diagonal, next-to, and forward-diagonal cells are clear of cars. A car can accelerate when there are 2 blank positions in front of it.

This model is straightforward to implement on CUDA's shared memory model and benefits from using the GPU's large number of concurrent threads. In each iteration, the determination of the next action of each car (e.g. to change lanes, accelerate, brake, etc.) is dependent only on the locations of the cars within a small, local neighborhood.
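
As an illustration only (not the kernel used for the results below), a per-car update of the kind just described might look roughly like the following sketch; the occupancy-grid layout, the Car structure fields, and all names are assumptions made for this example.

  #define NUM_LANES 4
  #define ROAD_LEN  10000           /* illustrative number of cells per lane */

  struct Car {
      int lane;                     /* current lane, 0 .. NUM_LANES-1 */
      int pos;                      /* current cell index along the road */
      int speed;                    /* cells advanced per iteration */
  };

  /* occupancy[lane * ROAD_LEN + pos] is non-zero when that cell holds a car.
     One thread per car decides the car's next action from its local
     neighborhood and writes the updated state back. */
  __global__ void updateCars(Car *cars, const int *occupancy, int numCars)
  {
      int id = blockIdx.x * blockDim.x + threadIdx.x;
      if (id >= numCars) return;

      Car c = cars[id];

      if (c.pos + 2 < ROAD_LEN &&
          !occupancy[c.lane * ROAD_LEN + c.pos + 1] &&
          !occupancy[c.lane * ROAD_LEN + c.pos + 2]) {
          /* Accelerate when the two cells ahead in this lane are empty. */
          c.speed++;
      } else if (c.lane + 1 < NUM_LANES &&
                 c.pos > 0 && c.pos + 1 < ROAD_LEN &&
                 !occupancy[(c.lane + 1) * ROAD_LEN + c.pos - 1] &&
                 !occupancy[(c.lane + 1) * ROAD_LEN + c.pos] &&
                 !occupancy[(c.lane + 1) * ROAD_LEN + c.pos + 1]) {
          /* Change lanes only when the behind-diagonal, next-to, and
             forward-diagonal cells of the target lane are clear. */
          c.lane++;
      }

      c.pos += c.speed;
      if (c.pos >= ROAD_LEN) c.pos -= ROAD_LEN;   /* wrap around, for illustration */
      cars[id] = c;
  }

With 256 threads per block, such a kernel would be launched with one thread per car, e.g. updateCars<<<(numCars + 255) / 256, 256>>>(d_cars, d_occupancy, numCars), and the occupancy grid would be rebuilt between iterations (on the host or in a separate kernel).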

5.2.2 Traffic Simulation Results and Analysis

We used 256 threads per block. The number of blocks is determined by the number of cars the user specifies on the command line. We assign each car a unique id and map that id to a thread. The number of cars (and thus the number of threads) is varied in the experiment. As is evident in Figure 5, the G80 can achieve speedups of about 40× with respect to the Pentium 4. When the number of cars is 800 or less, our CPU implementation outperforms the GPU simulator; however, as the number of cars increases, the GPU implementation gradually surpasses the CPU, which demonstrates that the GPU
is more appropriate for data-parallel operation with many threads, given our configuration. Applications like the traffic simulation present an especially good fit for GPUs and the CUDA programming model as, except for synchronizing the blocks for each iteration, there are no data dependencies between threads within one block or between the blocks.

Figure 4. Execution time of the traffic simulation. The input size is the number of cars. The GPU implementation outperforms the CPU implementation on simulations with about 1600 or more cars.

Figure 5. Speedup of the traffic simulation. The maximum speedup is more than 40×.

5.3 Thermal Simulation

HotSpot [8] is a widely used tool to estimate processor temperature based on block layout and simulated performance measurement in architectural simulation. It is a representative of the structured grid dwarf [1], in which the computation is regionally divided into sub-blocks with high spatial locality. Structured grid applications are at the core of many scientific computations. Other notable examples include Lattice Boltzmann hydrodynamics [25] and Cactus [4]. A major challenge of these applications comes from dealing with the boundary data between sub-blocks.

5.3.1 HotSpot Algorithm Overview

In HotSpot, a silicon die is partitioned into functional blocks based on the floorplan of the microprocessor, with a thermal RC network connecting the various blocks [8]. The thermal simulation iteratively solves a set of differential equations for block temperatures. Our GPU implementation re-implements the transient thermal differential equation solver from HotSpot. Given the power density and temperature at time t and the simulation differential time ∆t, the transient solver can compute the processor temperature at time t + ∆t.

In HotSpot, each cell in the grid represents the average power density and temperature of the corresponding area on the chip. The grid is divided into sub-blocks along the x- and y-axes. Therefore, at the end of each iteration, the data that lies on the boundaries between blocks should be exchanged. In CUDA, data communication among blocks is not supported: the threads in one block can only read data written by another block from global memory. This solution involves heavy global memory read/write overhead, which significantly reduces the performance of the program.

To ameliorate the effects of this synchronization problem, we want to avoid exchanging data between blocks at the end of each iteration. A novel solution, based on a pyramid structure (see Figure 6), can achieve satisfactory speedups. Using this method, we assign each processing thread block a much bigger region than the final result. If the pyramid base is an N × N data block, after one iteration the inner (N − 2) × (N − 2) data block has precise results. For instance, if we want to compute the result of a grid which is comprised of many 4 × 4 blocks, at the beginning we can designate 16 × 16 as the size of each block in CUDA and load each block into shared memory for computation. The data processed in adjacent blocks should overlap, so that after 6 iterations each of the cells in the grid has a correct result.
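
To make the pyramid structure concrete, the following heavily simplified sketch shows the block structure only: each thread block loads an enlarged, overlapping 16 × 16 tile into shared memory, updates a region that shrinks by one cell per iteration, and after 6 iterations writes back the central 4 × 4 cells that are exact. The stencil used here is a placeholder average rather than HotSpot's actual transient thermal equation, and all names are illustrative.

  #define TILE       16   /* pyramid base loaded per thread block */
  #define RESULT      4   /* cells per block with exact results   */
  #define ITERATIONS  6   /* (TILE - RESULT) / 2 halo iterations  */

  __global__ void pyramidStep(const float *in, float *out, int n)
  {
      __shared__ float tile[TILE][TILE];

      int x = threadIdx.x, y = threadIdx.y;
      int gx = blockIdx.x * RESULT - ITERATIONS + x;   /* global column */
      int gy = blockIdx.y * RESULT - ITERATIONS + y;   /* global row    */

      /* Load the enlarged, overlapping tile; clamp reads at the grid border. */
      int cx = min(max(gx, 0), n - 1);
      int cy = min(max(gy, 0), n - 1);
      tile[y][x] = in[cy * n + cx];
      __syncthreads();

      /* Each pass shrinks the region of valid cells by one on every side. */
      for (int i = 1; i <= ITERATIONS; i++) {
          float v = tile[y][x];
          if (x >= i && x < TILE - i && y >= i && y < TILE - i) {
              v = 0.25f * (tile[y][x - 1] + tile[y][x + 1] +
                           tile[y - 1][x] + tile[y + 1][x]);  /* placeholder stencil */
          }
          __syncthreads();
          tile[y][x] = v;
          __syncthreads();
      }

      /* Only the central RESULT x RESULT cells are exact; write those back. */
      if (x >= ITERATIONS && x < ITERATIONS + RESULT &&
          y >= ITERATIONS && y < ITERATIONS + RESULT &&
          gx >= 0 && gx < n && gy >= 0 && gy < n) {
          out[gy * n + gx] = tile[y][x];
      }
  }

Launched with 16 × 16 threads per block and one block per 4 × 4 output region, no data is exchanged between blocks within these 6 iterations; the cost is the redundant computation in the overlapping halo cells, which is the trade-off discussed in Section 5.3.2.
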
Figure 6. Starting from an 8 × 8 block, it takes 1 iteration to compute the result of the inner 6 × 6 block.

5.3.2 HotSpot Simulation Results

Figure 7. Execution time of HotSpot. The x-axis represents the dimensions of the computation grid. A dimension of 50 means a 50 × 50 grid.

Figure 8. Speedup of HotSpot. The maximum speedup is more than 6.5× for HotSpot.

Our experiments show that our pyramid architecture can achieve a maximum speedup of more than 6.5× when compared with the Pentium 4 CPU. This result is much better than our original naïve implementation, which was based more closely on the CPU implementation and did not use the pyramid structure. Obviously, efficient use of shared memory is important for an efficient implementation.

However, as demonstrated in Figure 8, the thermal simulation does not show the high speedups seen in the traffic simulation. Additionally, the HotSpot speedup curve saturates early. We believe that this phenomenon is due to the fact that, though the pyramid architecture can efficiently reduce communication between thread blocks, it simultaneously creates more tasks for the GPU to perform. For example, if the result grid is comprised of a number of 4 × 4 blocks, then for each block, starting from a 16 × 16 block, the simulation requires 6 iterations to converge. For a grid of size 256 × 256, the GPU must allocate 64 × 64 blocks to run in parallel on 8 multiprocessors. A large number of blocks makes the speedup saturate early. However, using the pyramid architecture, the performance improves significantly when compared with our original implementation.

In developing this benchmark, we found that to develop an application with good performance, CUDA's parallel programming model requires that programmers have familiarity with the G80 architecture, as data must be explicitly managed in the memory hierarchy. Additionally, programmers also need to be careful when dealing with shared memory, because inappropriate scheduling can cause bank conflicts, which may force threads to access memory in serialized order [21].

Another problem with CUDA is that there is no mechanism to enforce that all the threads of different blocks synchronize at a certain point. Inside a GPU function call, using a for loop is dangerous when there are data dependencies between two adjacent loop iterations, because the CUDA parallel model is a for-all model.

5.4 K-Means

K-means is a clustering algorithm used extensively in data-mining and elsewhere, important primarily for its simplicity. Data-mining algorithms are classified into clustering, classification, and association rule mining, among others [22]. Many data-mining algorithms show a high degree of task parallelism or data parallelism. Researchers at Northwestern University developed Minebench using OpenMP [22]. Our goal with this benchmark is to test the applicability of the GPU to data-mining using k-means.

In k-means, a data object is comprised of several values, called features. By dividing a cluster of data objects into K sub-clusters, k-means represents all the data objects by the mean values or centroids of their respective sub-clusters.

5.4.1 K-Means Algorithm Overview

In a basic implementation, the initial cluster center for each sub-cluster is randomly chosen or derived from some heuristic. In each iteration, the algorithm associates each data object with its nearest center, based on some chosen distance metric. The new centroids are calculated by taking the mean of all the data objects within each sub-cluster respectively. The algorithm iterates until no data objects move from one sub-cluster to another [22].

The data objects can be organized in a 2-D array with each row representing one data object and each column representing the value of the corresponding feature. Some implementations use K-d trees [10] to accelerate the execution time. Advantages of k-means over other clustering algorithms include its fast convergence and ease of implementation.

In our CUDA implementation, the clusters of data objects are partitioned into thread blocks, with each thread associated with one data object. The task of searching for the nearest centroid to each data object is independent. Our program uses the Euclidean distance as its distance metric. After all the threads find their nearest centroid, a reduction step will produce the new centroid for each sub-cluster. An error function signals convergence and terminates the iteration.

We make the GPU responsible for calculating the distances of each object to the k clusters in parallel, while the CPU takes over the reduction step. CUDA does provide a handful of atomic operations that could be used to implement this reduction; however, these operations are integer only and are restricted to the Geforce 8600, so we choose to ship this latter step back to the CPU. The speedup measurement is performed on the distance calculation only, not on the serial reduction. The algorithm can be further parallelized by calculating the partial sum for the threads in each block and then adding the partial sums together for each sub-cluster to calculate a new centroid.
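
As an illustration of the per-object membership step described above (and not necessarily the exact kernel used for our measurements), the distance calculation might be expressed as follows; the array layout, the feature-count parameter, and all names are assumptions.

  #include <float.h>

  /* One thread per data object: find the index of the nearest centroid
     using squared Euclidean distance. objects is numObjects x numFeatures
     and centroids is k x numFeatures, both stored row-major. */
  __global__ void findNearestCentroid(const float *objects,
                                      const float *centroids,
                                      int *membership,
                                      int numObjects, int numFeatures, int k)
  {
      int obj = blockIdx.x * blockDim.x + threadIdx.x;
      if (obj >= numObjects) return;

      float bestDist = FLT_MAX;
      int bestCluster = 0;

      for (int c = 0; c < k; c++) {
          float dist = 0.0f;
          for (int f = 0; f < numFeatures; f++) {
              float diff = objects[obj * numFeatures + f] -
                           centroids[c * numFeatures + f];
              dist += diff * diff;
          }
          if (dist < bestDist) {
              bestDist = dist;
              bestCluster = c;
          }
      }
      membership[obj] = bestCluster;   /* copied back to the CPU for the reduction */
  }

After such a kernel completes, the membership array is copied back to host memory and the CPU computes the new centroid of each sub-cluster, as described above.
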
5.4.2 K-Means Results

Figure 9. Execution time of k-means (reduction part not included). The input size is the number of data objects. The time shown is for one iteration.

Figure 10. Speedup of k-means (reduction part not included). The maximum speedup is about 8×.

In Figure 9, we compare the running times of k-means separately on a Geforce 8800 and a Pentium 4 on a dataset from an intrusion detection problem in the 1999 KDD Cup [11]. Figure 10 shows that for a dataset with 1,638,400 elements, the CUDA program on the G80 can achieve about 8× speedup as compared with the Pentium 4. Because currently the CPU is responsible for calculating new cluster centroids, the memory overhead is relatively large. In each iteration we have to copy the data to the GPU to compute the membership of each data object, then copy back to the CPU for the reduction.

6 Discussion: CUDA Programming Model

In the course of developing these applications for CUDA, we have developed a short "wish-list" of features that could help improve performance. Firstly, we believe that CUDA would benefit from facilities for asynchronous bulk data movement from host memory to device memory. Currently, CUDA allows only synchronous direct memory access (DMA) transfers between the GPU and CPU. As a result, a kernel cannot begin execution until the DMA download completes, which introduces overhead.

Secondly, a pre-fetching mechanism can serve as a complementary memory access latency hiding tool in addition to hardware thread context switching. Especially when all
threads are accessing device memory intensively, switching threads is unable to sufficiently overlap computation with memory access because all threads are memory bound. Correlated threads within the same kernel are executed simultaneously, so it is probable that many threads will reach a memory intensive phase simultaneously.

It may also be beneficial to provide a faster local store. This could be useful in situations where data has to be loaded from device memory and reused many times. Options include caching, scratchpads, and bulk DMAs. Caching supports existing programming models and toolchains, but caching in shared memory raises the traditional problems of coherence and memory-ordering guarantees. The other two options also have drawbacks: scratchpads require explicit compiler management, and DMA may require double buffering, using extra memory.

Lack of persistent state in the shared memory data cache results in less efficient communication among producer and consumer kernels than might be possible. The producer kernel has to store the shared memory data into device memory; the data is then read back over the bus by the consumer kernel. This also undercuts the efficiency of global synchronization, which involves kernel termination and creation; however, a persistent shared memory contradicts the current programming model, in which thread blocks run to completion and by definition leave no state afterwards. Alternatively, a programmer can choose to use a novel algorithm that involves less communication and global synchronization, such as the pyramid algorithm that we use in HotSpot, but this leads to—often undesirable—tradeoffs in program complexity that would be better left unconsidered.

CUDA's performance is hurt by its inability to collect data from a set of producer threads and stream them to a set of consumer threads. Intermediate data has to be stored in device memory before it is consumed by another thread. By allowing fine-grained synchronization among producer and consumer threads, programs would be able to consume data at a higher rate, freeing its storage earlier for reuse. Since thread blocks must run to completion, and only one shared memory region can be allocated to a thread block, it is unwieldy to support high-quality producer/consumer streaming given the current implementation of CUDA's shared memory model.

Finally, even though we did not perform extensive performance tuning (tiling, managing bank conflicts, etc.), we found that CUDA programming required extensive knowledge of the underlying hardware architecture. This is actually true of almost any parallel programming environment today. This makes parallel programming an excellent vehicle for teaching computer architecture, and in fact we used CUDA to good effect in a senior-level computer-architecture course. We worry, however, that the associated learning curve will deter adoption of parallel programming, especially for more complex algorithms and data structures that do not map as easily to a manycore environment.

7 Conclusions and Future Work

This work compared the performance of CPU and GPU implementations of three naturally data-parallel applications: traffic simulation, thermal simulation, and k-means. Our experiments used NVIDIA's C-based CUDA interface and compared performance on an NVIDIA Geforce 8800 GTX with that on an Intel Pentium 4 CPU. Even though we did not perform extensive performance tuning, the GPU implementations of these applications obtained impressive speedups and add to the growing body of GPGPU work showing the potential of GPUs for general-purpose computing. Furthermore, the CUDA interface made programming these applications vastly easier than traditional rendering-based GPGPU approaches (in particular, we have prior experience with structured grids [5]). We also believe that the availability of shared memory and the domain abstraction provided by CUDA made these applications vastly easier to implement than traditional SPMD/thread-based approaches. In the case of k-means and the traffic simulation, CUDA was probably a bit more difficult than OpenMP, chiefly due to the need to explicitly move data and deal with the GPU's heterogeneous memory model. In the case of HotSpot with the pyramidal implementation, CUDA's "grid-of-blocks" paradigm probably simplified implementation compared to OpenMP.

The work we presented in this paper only shows a developmental stage of our work. We plan to extend our
GPGPU work by comparing with more recent commodity configurations, such as Intel dual-core processors, and examining the programmability of more complex applications with various kinds of data structures and memory access patterns. In addition, in order to better understand the pros and cons of GPU architectures for general-purpose parallel programming, new metrics are needed for characterizing the applications. With greater architectural convergence of CPUs and GPUs, our goal is to find a parallel programming model that can best aid developers to program in today's high-performance parallel computing environments, including GPUs and multi-core CPUs.

Our sample applications mapped nicely into CUDA and would map easily enough to most manycore programming environments. In all cases, however, managing data placement, communication, and synchronization becomes a nuisance at best—and intractable at worst—with more complex applications. Higher-level programming APIs are needed! Ideally, these should promote use of higher-level data structures that implicitly convey information about dependencies and data layout to the compiler or middleware, which can then manage much of the concurrency, data placement, communication, and synchronization. Operations on these data structures then would implicitly convey parallelism while preserving a more natural, quasi-sequential programming style [19]. Ideally, programmers should still be able to "drill down"—to manage the hardware themselves using a lower-level API such as CUDA—albeit at their own risk.

The common view that effective parallel programming requires low-level, explicit management of concurrency and hardware resources only applies to the most expert programmers! We contend that high-level APIs and automatic parallelism will boost productivity and, for many programmers, will yield better speedups. Higher-level APIs that abstract away hardware details also simplify portability among different parallel platforms.

8 Acknowledgements

This work is supported by NSF grant nos. CNS-0306404 and CNS-0615277, and by a grant from NVIDIA Research. We would like to thank Michael Garland of NVIDIA Research for pointing out the pyramidal algorithm for the HotSpot problem, Kevin Stammetti for his help with the traffic simulation algorithm and implementation, Karthik Sankaranarayanan for his help with the HotSpot grid solver, and John Stuppy and S. R. Siddarth for their help with the k-means implementation.

References

[1] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[2] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pages 777–786, New York, NY, USA, 2004. ACM Press.

[3] R. Fang, B. He, M. Lu, K. Yang, N. K. Govindaraju, Q. Luo, and P. V. Sander. GPUQP: query co-processing using graphics processors. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1061–1063, New York, NY, USA, 2007. ACM Press.

[4] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, and J. Shalf. The Cactus framework and toolkit: Design and applications. In VECPAR, pages 197–227, 2002.

[5] N. Goodnight, G. Lewin, D. Luebke, and K. Skadron. A multigrid solver for boundary value problems using graphics hardware. Technical report, University of Virginia, January 2003. University of Virginia Technical Report CS-2003-03.

[6] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 206, New York, NY, USA, 2005. ACM Press.

[7] M. J. Harris, W. V. Baxter, T. Scheuermann, and A. Lastra. Simulation of cloud dynamics on graphics hardware. In HWWS '03: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 92–101, Aire-la-Ville, Switzerland, 2003. Eurographics Association.

[8] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M. R. Stan. HotSpot: A compact thermal modeling methodology for early-stage VLSI design. IEEE Trans. VLSI Syst., 14(5):501–513, 2006.

[9] J. Kahle, M. Day, H. Hofstee, C. Johns, T. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. Res. Dev., 49(4/5):589–604, 2005.

[10] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, 2002.

[11] KDD Cup. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

[12] J. Kessenich, D. Baldwin, and R. Rost. The OpenGL shading language. Technical report, The OpenGL Architectural Review Board, 2006. http://www.opengl.org/documentation/glsl.

[13] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded SPARC processor. IEEE Micro, 2005.

[14] J. Krüger and R. Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 234, New York, NY, USA, 2005. ACM Press.

[15] W. Liu, B. Schmidt, G. Voss, and W. Muller-Wittig. Streaming algorithms for biological sequence alignment on GPUs. IEEE Transactions on Parallel and Distributed Systems, 18(9):1270–1281, 2007.

[16] D. Luebke and G. Humphreys. How GPUs work. IEEE Computer, 40(2):96–100, 2007.

[17] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: a system for programming graphics hardware in a C-like language. ACM Trans. Graph., 22(3):896–907, 2003.

[18] M. D. McCool. Metaprogramming GPUs with Sh. AK Peters, 2004.

[19] W.-m. W. Hwu, S. Ryoo, S.-Z. Ueng, J. H. Kelm, I. Gelado, S. S. Stone, R. E. Kidd, S. S. Baghsorkhi, A. Mahesri, S. C. Tsao, N. Navarro, S. S. Lumetta, M. I. Frank, and S. J. Patel. Implicitly parallel programming models for thousand-core microprocessors. In DAC, pages 754–759. IEEE, 2007.

[20] Microsoft Corp. DirectX. http://www.gamesforwindows.com/en-US/AboutGFW/Pages/DirectX10.aspx.

[21] NVIDIA. CUDA programming guide 1.0, 2007. http://developer.nvidia.com/object/cuda.html.

[22] J. Pisharath, Y. Liu, W. Liao, A. Choudhary, G. Memik, and J. Parhi. NU-MineBench 2.0. Technical report, Northwestern University Department of Electrical and Computer Engineering, 2005. http://cucis.ece.northwestern.edu/tech-reports/pdf/CUCIS-2004-08-001.pdf.

[23] K. Stammetti. Testing the feasibility of running a computationally intensive real-time traffic simulation on a multicore programming graphics processor, May 2007. U.Va. SEAS Senior Thesis.

[24] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture, page 2, Washington, DC, USA, 2004. IEEE Computer Society.

[25] G. Vahala, J. Yepez, L. Vahala, M. Soe, and J. Carter. 3D entropic lattice Boltzmann simulations of 3D Navier-Stokes turbulence. In Proceedings of the 47th Annual Meeting of the APS Division of Plasma Physics, 2005.

[26] B. Wilkinson and M. Allen. Parallel programming: techniques and applications using networked workstations and parallel computers. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1999.

[27] Q. Yang and H. Koutsopoulos. Microscopic traffic simulator for evaluation of dynamic traffic management systems. Transportation Research Part C (Emerging Technologies), 4(3), 1996.
