...tium 4, both under Windows XP. The CPU code is compiled with Visual Studio 2005 with the SSE2 (Streaming SIMD Extensions 2) option and O2 optimization turned on. All of our applications show satisfactory speedups. The maximum observed speedups are about 40× for traffic simulation, 6.5× for HotSpot, and 8× for k-means. Also, to better utilize the GPU under CUDA, we make novel, architecturally-aware changes to the algorithms and data structures, such as the introduction of a pyramid data structure in the HotSpot benchmark to avoid the need for excess synchronization between thread blocks.

In addition to showing speedups using CUDA, another contribution of our work is that we present some advantages and inefficiencies of the CUDA model. For example, we find that a global synchronization mechanism is necessary in order to avoid the need to separate a logical kernel into disjoint parts. Double buffering or a bulk load/store mechanism would also help reduce memory access latency.

2 Related Work

...raster graphics. GPUs have been developed hand-in-hand with graphics APIs; thus each stage in the traditional GPU pipeline corresponds closely to a stage in the conceptual OpenGL pipeline. Recent GPUs have added programmability to the vertex and fragment shading stages of the pipeline. Instead of separate processing units for different shaders, the Unified Shader Model (USM) was introduced with DirectX 10, allowing for better utilization of GPU processing resources. ATI's Xenos chip for the Xbox 360, AMD's R600, and NVIDIA's Geforce 8800 [16] all implement USM. The computing resources needed by shaders vary greatly among different applications. The unified design can overcome this issue by balancing loads between geometry, vertex, and fragment shader functionality, yielding maximum utilization of computing resources.
Unfortunately, for non-graphics applications, these APIs introduce many hurdles for the general-purpose application programmer, including the inherent problems born of the need to coerce general-purpose problems into looking like rendering problems and the functional problem imposed by the lack of scatter functionality.

3 The CUDA Programming Model

...register files, a shared memory data cache, and a read-only (constant) cache. Threads within one block can share data through the shared memory, allowing very high access speeds. Constant memory is optimized for fast read-only access to a region of global, read/write device memory [21].

Below, we give a simple example of a CUDA program which assigns the values of an n × n matrix B to a matrix A, along with an example of the same operation in C. If n is 128 and each block is 16 × 16, CUDA divides the whole matrix into 64 blocks, and each assignment becomes the responsibility of a single thread. The CUDA programming model abstracts away the loop needed in the CPU implementation, a powerful abstraction that is very useful in solving more complicated problems.
A[index] = B[index];
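For concreteness, here is a minimal sketch of both versions; the kernel name assign_kernel, the flattened index computation, and the launch configuration are our illustration, not the paper's original code listing. Each CUDA thread computes the index of its element and performs the single assignment shown above, while the C version needs an explicit loop:

    // CUDA version (sketch): one thread per element of the n x n matrices.
    __global__ void assign_kernel(float *A, const float *B, int n)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int index = row * n + col;
        A[index] = B[index];
    }

    // Host-side launch: for n = 128 and 16 x 16 blocks, the grid is
    // 8 x 8 = 64 blocks, so each thread handles exactly one element.
    //   dim3 block(16, 16);
    //   dim3 grid(n / 16, n / 16);
    //   assign_kernel<<<grid, block>>>(dA, dB, n);

    // Equivalent C version: the loop that CUDA abstracts away.
    void assign_c(float *A, const float *B, int n)
    {
        for (int i = 0; i < n * n; i++)
            A[i] = B[i];
    }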
Global synchronization across thread blocks can be achieved only by terminating a function call; otherwise, read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) hazards become a concern [21].
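Concretely, the only inter-block barrier available is the boundary between two kernel launches; the kernel names below are illustrative:

    // Sketch: launches issued to the same (default) stream execute in
    // order, so ending step_one acts as a global barrier. Every write
    // step_one makes to device memory is visible to all blocks of
    // step_two, avoiding the RAW/WAR/WAW hazards noted above.
    step_one<<<grid, threads>>>(d_data);
    step_two<<<grid, threads>>>(d_data);
    cudaThreadSynchronize();  // wait on the host for both to finish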
4 Experiment Setup and Methodology

We chose representative commercial products from both the GPU and CPU markets: an NVIDIA Geforce 8800 GTX (128 stream processors clocked at 675 MHz, 768 MB of graphics memory, and a 16 kB parallel data cache per multiprocessor block) with NVIDIA driver version 6.14.11.6201 and CUDA 1.0, and a traditional single-core Intel Pentium 4 CPU (2.4 GHz, with 512 MB main memory, a 512 kB L2 cache, and an 8 kB L1 cache). We developed our GPU code using NVIDIA's CUDA API. The CPU code was compiled under Visual Studio with O2 and SSE2. The GPU implementation results are compared against the Intel CPU results under Windows. Given the same input data set, speedup is calculated as the wall-clock time required by the application on the CPU divided by the time required on the GPU. Times are measured after initial setup (e.g. after file I/O) but do include PCI-E bus transfer times.
5 Experimental Results

5.1 Memory Overhead

The PCI-E bus also connects to the CPU and main memory, so the data transfer rate of the bus and the memory bandwidth are crucial to the performance of applications. We measured the memory overhead by varying the amount of data transferred between our CPU and GPU. As illustrated in Figure 3, the time necessary to transfer data increases linearly with the amount of data; the constant overhead of each individual transfer is therefore negligible as long as transfers are large or infrequent.
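A sketch of this kind of measurement, timing a single host-to-device copy with CUDA events (one possible timing method; buffer and function names are illustrative):

    #include <cuda_runtime.h>

    // Returns the time in milliseconds to copy 'bytes' bytes to the GPU.
    // Plotting this against 'bytes' exposes the fixed per-transfer
    // overhead as the intercept and the bus bandwidth as the slope.
    float time_h2d_copy(void *d_buf, const void *h_buf, size_t bytes)
    {
        cudaEvent_t start, stop;
        float ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }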
5.2 Traffic Simulation

A great deal of civil engineering work exists on simulating automotive traffic. Aimed at improving traffic control to reduce congestion and accidents, these simulations tend to require a great deal of compute power. There has been no previous work that uses graphics processors to accelerate traffic simulation. Our work is based on a part of the MITSIM model [27], which simulates traffic networks.

5.2.1 Algorithm Overview

Our work re-implements part of MITSIM's lane-changing model (only a portion of the MITSIM infrastructure, which provides a much broader and more comprehensive traffic simulation model than our benchmark), with cars running in 4 lanes [23]. The lanes are implemented as a 4-wide 2-D array, with each cell representing a position. The car structure carries information about the car's speed, position, the lane the car resides in, and so forth. All of this information is dynamically updated during the lifetime of the simulation. Cars can change lanes only if the behind-diagonal, next-to, and forward-diagonal cells are clear of cars. A car can accelerate when there are 2 blank positions in front of it.

This model is straightforward to implement on CUDA's shared memory model and benefits from the GPU's large number of concurrent threads. In each iteration, the determination of the next action of each car (e.g. to change lanes, accelerate, or brake) depends only on the locations of the cars within a small, local neighborhood.
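A sketch of the data layout and neighborhood tests just described; the struct fields, the occupancy array, and the function names are our illustration rather than MITSIM's or the paper's actual code:

    // Per-car state, stored in a 4 x road_len grid of cells (sketch).
    struct Car {
        float speed;
        int   lane;   // 0..3
        int   pos;    // cell index along the road
    };

    // occ[lane * road_len + pos] is nonzero iff that cell holds a car.
    // Lane change: the behind-diagonal, next-to, and forward-diagonal
    // cells in the target lane must all be clear.
    __device__ int lane_change_ok(const int *occ, int road_len,
                                  int target_lane, int pos)
    {
        return occ[target_lane * road_len + pos - 1] == 0 &&
               occ[target_lane * road_len + pos]     == 0 &&
               occ[target_lane * road_len + pos + 1] == 0;
    }

    // Acceleration: the 2 positions ahead in the car's lane are blank.
    __device__ int can_accelerate(const int *occ, int road_len,
                                  int lane, int pos)
    {
        return occ[lane * road_len + pos + 1] == 0 &&
               occ[lane * road_len + pos + 2] == 0;
    }
    // (End-of-road boundary checks omitted for brevity.)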
...the GPU is more appropriate for data-parallel operation with many threads given our configuration. Applications like the traffic simulation present an especially good fit for GPUs and the CUDA programming model because, except for synchronizing the blocks for each iteration, there are no data dependencies between threads within one block or between blocks.
5.3 HotSpot

Figure 6. Starting from an 8 × 8 block, it takes 1 iteration to compute the result of the inner 6 × 6 block.
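The pyramid scheme the caption illustrates can be sketched as follows: each thread block loads a tile that includes a one-cell halo into shared memory and computes one iteration locally, so only the shrunken interior is valid. The 4-neighbor averaging below is a generic stencil stand-in, not HotSpot's actual thermal model:

    #define TILE 8  // an 8 x 8 tile yields a valid 6 x 6 interior per iteration

    // Launch with dim3 block(TILE, TILE); tiles overlap by the halo width.
    __global__ void pyramid_step(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE][TILE];
        int tx = threadIdx.x, ty = threadIdx.y;
        int gx = blockIdx.x * (TILE - 2) + tx - 1;  // global coordinates,
        int gy = blockIdx.y * (TILE - 2) + ty - 1;  // shifted for the halo

        // Load the tile (cells outside the grid are treated as zero).
        tile[ty][tx] = (gx >= 0 && gx < n && gy >= 0 && gy < n)
                           ? in[gy * n + gx] : 0.0f;
        __syncthreads();

        // Only interior threads produce valid results.
        if (tx > 0 && tx < TILE - 1 && ty > 0 && ty < TILE - 1 &&
            gx > 0 && gx < n - 1 && gy > 0 && gy < n - 1) {
            out[gy * n + gx] = 0.25f * (tile[ty][tx - 1] + tile[ty][tx + 1] +
                                        tile[ty - 1][tx] + tile[ty + 1][tx]);
        }
    }

Iterating k times inside one kernel launch shrinks the valid interior by one cell per side per iteration (hence the pyramid), trading some redundant halo computation for fewer global synchronizations between thread blocks.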
5.4 K-Means

...In each iteration, the algorithm averages the points in each sub-cluster to calculate a new centroid.

Figure 10. Speedup of k-means (reduction part not included). The maximum speedup is about 8×.
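The naturally data-parallel half of each iteration assigns every point to its nearest centroid; recomputing the centroids by averaging is the reduction step that Figure 10 excludes from the speedup. A sketch of the assignment step, with illustrative names and squared Euclidean distance assumed:

    // One thread classifies one point against all k centroids (sketch).
    __global__ void assign_points(const float *points,     // npoints x dims
                                  const float *centroids,  // k x dims
                                  int *membership,
                                  int npoints, int dims, int k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npoints) return;

        int best = 0;
        float best_dist = 3.402823e+38f;  // FLT_MAX
        for (int c = 0; c < k; c++) {
            float dist = 0.0f;
            for (int j = 0; j < dims; j++) {
                float d = points[i * dims + j] - centroids[c * dims + j];
                dist += d * d;
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        membership[i] = best;
    }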
...When threads are accessing device memory intensively, switching threads cannot sufficiently overlap computation with memory access, because all threads are memory bound. Correlated threads within the same kernel are executed simultaneously, so it is probable that many threads will reach a memory-intensive phase simultaneously.

It may also be beneficial to provide a faster local store. This could be useful in situations where data has to be loaded from device memory and reused many times. Options include caching, scratchpads, and bulk DMAs. Caching supports existing programming models and toolchains, but caching in shared memory raises the traditional problems of coherence and memory-ordering guarantees. The other two options also have drawbacks: scratchpads require explicit compiler management, and DMA may require double buffering, using extra memory.

Lack of persistent state in the shared memory data cache results in less efficient communication among producer and consumer kernels than might be possible. The producer kernel has to store the shared memory data into device memory; the data is then read back over the bus by the consumer kernel. This also undercuts the efficiency of global synchronization, which involves kernel termination and creation; however, a persistent shared memory conflicts with the current programming model, in which thread blocks run to completion and by definition leave no state afterwards. Alternatively, a programmer can choose a novel algorithm that involves less communication and global synchronization, such as the pyramid algorithm that we use in HotSpot, but this leads to often-undesirable tradeoffs in program complexity.

CUDA's performance is also hurt by its inability to collect data from a set of producer threads and stream it to a set of consumer threads. Intermediate data has to be stored in device memory before it is consumed by another thread. By allowing fine-grained synchronization among producer and consumer threads, programs would be able to consume data at a higher rate, freeing its storage earlier for reuse. Since thread blocks must run to completion, and only one shared memory region can be allocated to a thread block, it is unwieldy to support high-quality producer/consumer streaming given the current implementation of CUDA's shared memory model.
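In practice the round trip looks like this (kernel and buffer names illustrative):

    // Sketch: producer/consumer data must be staged through device
    // memory across a kernel boundary; shared memory contents are
    // discarded when 'produce' terminates.
    produce<<<grid, threads>>>(d_intermediate);        // write to device memory
    consume<<<grid, threads>>>(d_intermediate, d_out); // read it back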
Finally, even though we did not perform extensive performance tuning (tiling, managing bank conflicts, and so on), we found that CUDA programming required extensive knowledge of the underlying hardware architecture. This is actually true of almost any parallel programming environment today. It makes parallel programming an excellent vehicle for teaching computer architecture, and in fact we used CUDA to good effect in a senior-level computer-architecture course. We worry, however, that the associated learning curve will deter adoption of parallel programming, especially for more complex algorithms and data structures that do not map as easily to a manycore environment.

7 Conclusions and Future Work

This work compared the performance of CPU and GPU implementations of three naturally data-parallel applications: traffic simulation, thermal simulation, and k-means. Our experiments used NVIDIA's C-based CUDA interface and compared performance on an NVIDIA Geforce 8800 GTX with that on an Intel Pentium 4 CPU. Even though we did not perform extensive performance tuning, the GPU implementations of these applications obtained impressive speedups and add to the growing body of GPGPU work showing the potential of GPUs for general-purpose computing. Furthermore, the CUDA interface made programming these applications vastly easier than traditional rendering-based GPGPU approaches (in particular, we have prior experience with structured grids [5]). We also believe that the availability of shared memory and the domain abstraction provided by CUDA made these applications vastly easier to implement than traditional SPMD/thread-based approaches. In the case of k-means and the traffic simulation, CUDA was probably a bit more difficult than OpenMP, chiefly due to the need to explicitly move data and deal with the GPU's heterogeneous memory model. In the case of HotSpot with the pyramidal implementation, CUDA's "grid-of-blocks" paradigm probably simplified implementation compared to OpenMP.

This paper presents only an early, developmental stage of our work.
We plan to extend our GPGPU work by comparing against more recent commodity configurations, such as Intel dual-core processors, and by examining the programmability of more complex applications with various kinds of data structures and memory access patterns. In addition, in order to better understand the pros and cons of GPU architectures for general-purpose parallel programming, new metrics are needed for characterizing the applications. With greater architectural convergence of CPUs and GPUs, our goal is to find a parallel programming model that can best aid developers in programming today's high-performance parallel computing environments, including GPUs and multi-core CPUs.

Our sample applications mapped nicely onto CUDA and would map easily enough to most manycore programming environments. In all cases, however, managing data placement, communication, and synchronization becomes a nuisance at best, and intractable at worst, with more complex applications. Higher-level programming APIs are needed! Ideally, these should promote the use of higher-level data structures that implicitly convey information about dependencies and data layout to the compiler or middleware, which can then manage much of the concurrency, data placement, communication, and synchronization. Operations on these data structures would then implicitly convey parallelism while preserving a more natural, quasi-sequential programming style [19]. Ideally, programmers should still be able to "drill down" and manage the hardware themselves using a lower-level API such as CUDA, albeit at their own risk.

The common view that effective parallel programming requires low-level, explicit management of concurrency and hardware resources only applies to the most expert programmers! We contend that high-level APIs and automatic parallelism will boost productivity and, for many programmers, will yield better speedups. Higher-level APIs that abstract away hardware details also simplify portability among different parallel platforms.
8 Acknowledgements

This work is supported by NSF grant nos. CNS-0306404 and CNS-0615277, and by a grant from NVIDIA Research. We would like to thank Michael Garland of NVIDIA Research for pointing out the pyramidal algorithm for the HotSpot problem; Kevin Stammetti for his help with the traffic simulation algorithm and implementation; Karthik Sankaranarayanan for his help with the HotSpot grid solver; and John Stuppy and S. R. Siddarth for their help with the k-means implementation.

References

[1] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[2] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream computing on graphics hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pages 777–786, New York, NY, USA, 2004. ACM Press.
[3] R. Fang, B. He, M. Lu, K. Yang, N. K. Govindaraju, Q. Luo, and P. V. Sander. GPUQP: Query co-processing using graphics processors. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1061–1063, New York, NY, USA, 2007. ACM Press.
[4] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, and J. Shalf. The Cactus framework and toolkit: Design and applications. In VECPAR, pages 197–227, 2002.
[5] N. Goodnight, G. Lewin, D. Luebke, and K. Skadron. A multigrid solver for boundary value problems using graphics hardware. Technical Report CS-2003-03, University of Virginia, January 2003.
[6] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast computation of database operations using graphics processors. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 206, New York, NY, USA, 2005. ACM Press.
[7] M. J. Harris, W. V. Baxter, T. Scheuermann, and A. Lastra. Simulation of cloud dynamics on graphics hardware. In HWWS '03: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 92–101, Aire-la-Ville, Switzerland, 2003. Eurographics Association.
[8] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M. R. Stan. HotSpot: A compact thermal modeling methodology for early-stage VLSI design. IEEE Trans. VLSI Syst., 14(5):501–513, 2006.
[9] J. Kahle, M. Day, H. Hofstee, C. Johns, T. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. Res. Dev., 49(4/5):589–604, 2005.
[10] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, 2002.
[11] KDD Cup. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[12] J. Kessenich, D. Baldwin, and R. Rost. The OpenGL shading language. Technical report, The OpenGL Architecture Review Board, 2006. http://www.opengl.org/documentation/glsl.
[13] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded SPARC processor. IEEE Micro, 2005.
[14] J. Krüger and R. Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 234, New York, NY, USA, 2005. ACM Press.
[15] W. Liu, B. Schmidt, G. Voss, and W. Muller-Wittig. Streaming algorithms for biological sequence alignment on GPUs. IEEE Transactions on Parallel and Distributed Systems, 18(9):1270–1281, 2007.
[16] D. Luebke and G. Humphreys. How GPUs work. IEEE Computer, 40(2):96–100, 2007.
[17] W. R. Mark, R. S. Glanville, K. Akeley, and M. J. Kilgard. Cg: A system for programming graphics hardware in a C-like language. ACM Trans. Graph., 22(3):896–907, 2003.
[18] M. D. McCool. Metaprogramming GPUs with Sh. AK Peters, 2004.
[19] W.-m. W. Hwu, S. Ryoo, S.-Z. Ueng, J. H. Kelm, I. Gelado, S. S. Stone, R. E. Kidd, S. S. Baghsorkhi, A. Mahesri, S. C. Tsao, N. Navarro, S. S. Lumetta, M. I. Frank, and S. J. Patel. Implicitly parallel programming models for thousand-core microprocessors. In DAC, pages 754–759. IEEE, 2007.
[20] Microsoft Corp. DirectX. http://www.gamesforwindows.com/en-US/AboutGFW/Pages/DirectX10.aspx.
[21] NVIDIA. CUDA programming guide 1.0, 2007. http://developer.nvidia.com/object/cuda.html.
[22] J. Pisharath, Y. Liu, W. Liao, A. Choudhary, G. Memik, and J. Parhi. NU-MineBench 2.0. Technical report, Northwestern University Department of Electrical and Computer Engineering, 2005. http://cucis.ece.northwestern.edu/techreports/pdf/CUCIS-2004-08-001.pdf.
[23] K. Stammetti. Testing the feasibility of running a computationally intensive real-time traffic simulation on a multicore programming graphics processor. U.Va. SEAS Senior Thesis, May 2007.
[24] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In ISCA '04: Proceedings of the 31st Annual International Symposium on Computer Architecture, page 2, Washington, DC, USA, 2004. IEEE Computer Society.
[25] G. Vahala, J. Yepez, L. Vahala, M. Soe, and J. Carter. 3D entropic lattice Boltzmann simulations of 3D Navier-Stokes turbulence. In Proceedings of the 47th Annual Meeting of the APS Division of Plasma Physics, 2005.
[26] B. Wilkinson and M. Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1999.
[27] Q. Yang and H. Koutsopoulos. A microscopic traffic simulator for evaluation of dynamic traffic management systems. Transportation Research Part C (Emerging Technologies), 4(3), 1996.