
A Compiler and Runtime System for Enabling Data Mining Applications on GPUs

Wenjing Ma and Gagan Agrawal


Department of Computer Science and Engineering
The Ohio State University
Columbus, OH 43210
{mawe,agrawal}@cse.ohio-state.edu

1. Abstract

With the increasing need for accelerating data mining and scientific data analysis on large data sets, and less chance to improve processor performance by simply increasing clock frequencies, multi-core architectures and accelerators like FPGAs and GPUs have become popular. A recent development in using GPUs for general computing has been the release of CUDA (Compute Unified Device Architecture) by NVIDIA. CUDA allows GPU programming with C language-like features, thus easing the development of non-graphics applications on a GPU. However, several challenges still remain in programming GPUs with CUDA, because CUDA involves explicit parallel programming and management of its complex memory hierarchy, as well as allocating device memory, moving data between CPU and device memory, and specifying thread grid configurations.

In this paper, we offer a solution that lets programmers generate CUDA code by specifying the sequential reduction loop(s) along with some information about the parameters. With program analysis and code generation, the applications are mapped to a GPU. Several additional optimizations are also performed by the middleware.

We have evaluated our system using three popular data mining applications: k-means clustering, EM clustering, and Principal Component Analysis (PCA). The speedup that each of these applications achieves over a sequential CPU version ranges between 20 and 50.

Categories and Subject Descriptors D.1.3 [Concurrent Programming]: Parallel Programming

General Terms Design

Keywords GPGPU, CUDA, Data Mining

2. System Design

Though CUDA has accelerated the use of GPUs for non-graphics applications, it still requires explicit parallel programming and management of the memory hierarchy by the programmers. Our system is designed to ease GPU programming for a specific class of applications. What the user needs to provide is just the reduction functions to be parallelized on GPUs and additional information about the variables. The system can generate CUDA functions that execute these reduction functions in parallel, and the host functions invoking them. The architecture of the system is shown in Figure 1. There are four components in the user input. The first three (variable information, reduction function(s), and additional optional functions) are analyzed by the system. The fourth component is the host program. The system itself has three components: the code analyzer, which obtains variable access patterns and combination operations; the variable analyzer; and the code generator. By analyzing the variables and the sequential reduction function(s), the system generates the kernel functions, grid configuration, and other necessary code. By compiling these functions with the user-specified host program, an executable file is generated.

We used LLVM as the framework for program analysis [2]. We particularly benefited from the clear structure of its Intermediate Representation (IR).

Since data mining and scientific data analysis applications share some common structures, our system targets GPU-based parallelization of only the functions that process a one-dimensional array of data. Thus we can simplify program analysis and automatic generation of CUDA programs, while still offering a simple and high-level interface for the programmers.

2.1 System API

We provide a convenient API for the programmer. For a reduction function, there are mainly three parts in the user input: a list specifying the type and size of each variable, a sequential reduction function, and optional initialization and combination functions.
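For concreteness, the following is a minimal sketch of the kind of sequential reduction function a user might supply for k-means; the function name, the signature, and the dist() helper are our illustrations, not the system's prescribed API. In the accompanying variable list, data would be described as input, while new_centers and counts would be output variables combined with "+".

    /* Hypothetical k-means reduction function of the kind the user
     * would supply; names and signature are illustrative only. */
    static float dist(const float *a, const float *b, int dim)
    {
        float s = 0.0f;
        for (int j = 0; j < dim; j++) {
            float t = a[j] - b[j];
            s += t * t;
        }
        return s;
    }

    void kmeans_reduce(const float *data, int n, int dim,
                       const float *centers, int k,
                       float *new_centers, int *counts)
    {
        for (int i = 0; i < n; i++) {          /* the loop to be parallelized */
            int best = 0;
            float best_d = dist(&data[i * dim], &centers[0], dim);
            for (int c = 1; c < k; c++) {
                float d = dist(&data[i * dim], &centers[c * dim], dim);
                if (d < best_d) { best_d = d; best = c; }
            }
            for (int j = 0; j < dim; j++)      /* "+" combination on output */
                new_centers[best * dim + j] += data[i * dim + j];
            counts[best] += 1;                 /* "+" combination on output */
        }
    }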
2.2 Code and Variable Analysis

The program analysis comprises three components. The first is obtaining the variable access features (input, output, temporary storage) from a reduction function. Another is extracting the combination operations (currently, we only consider the two operations "+" and "*"). These two tasks make use of the IR generated by LLVM. The third is variable analysis and parallelization, which mainly extracts information for variable access and replication, as well as arranging data on shared memory.
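To make these tasks concrete, consider a hypothetical statement inside a user's reduction loop (this example is ours, not one of the paper's benchmarks):

    /* "data" would be classified as an input variable; "sum" as an
     * output variable whose combination operation is "+", so per-thread
     * partial values of sum can later be merged. */
    sum += data[i] * data[i];

From the LLVM IR of such code, the analyzer records the access feature of each variable and the operator used to update each output.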

2.3 Mapping to CUDA

Using the user input and the information extracted by the variable and code analyzer, the system next generates the grid configuration, kernel invocation, kernel code, local reduction function, and combination function if necessary.

After variable analysis, variables distributed across threads are denoted as loop variables. The read-write variables not labeled as loop might be updated simultaneously by multiple threads, so we create a copy for each thread.

[Figure 1. Overall System Design: User Input is Shown as Shaded Boxes. The figure depicts the user input (variable information, reduction functions, optional functions) flowing into the Code Analyzer (in LLVM), whose variable access patterns and combination operations feed the Variable Analyzer; the Code Generator then produces the kernel functions and the grid configuration and kernel invocation, which are combined with the host program to build the executable.]

The local reduction function has the main loop that does the computation on the GPU, and it is generated by modifying the sequential reduction function provided by the user. These modifications include: 1) dividing the loop to be parallelized by the number of blocks and the number of threads in each block, and 2) rewriting the indices of the arrays which are distributed.
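As an illustration, the following CUDA sketch shows the shape a generated local reduction function might take; the kernel name, the strided loop partitioning, and the per-thread copy of the output are our assumptions for a simple "+" reduction, not the generator's literal output:

    /* Illustrative generated kernel: each thread handles a strided slice
     * of the 1-D input, and accumulates into its own copy of the
     * read-write output to avoid simultaneous updates. */
    __global__ void local_reduce(const float *data, int n, float *partial)
    {
        int tid      = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;

        float sum = 0.0f;
        /* 1) The sequential loop "for (i = 0; i < n; i++)" is divided
         * across blocks and threads by striding with the total thread
         * count. 2) The index into the distributed array is rewritten
         * accordingly. */
        for (int i = tid; i < n; i += nthreads)
            sum += data[i];

        partial[tid] = sum;   /* one copy of the output per thread */
    }

    /* Host-side invocation with a generated grid configuration
     * (here 16 blocks of 256 threads, the best-performing
     * configuration reported in Section 3). */
    local_reduce<<<16, 256>>>(d_data, n, d_partial);

A generated global combination step would then merge the per-thread copies in partial using the extracted "+" operation.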
2.4 Optimizations

The main optimization is using shared memory, because it can reduce the memory access time significantly. We developed strategies for arranging data on shared memory according to the size and access features of the variables. The size of each variable is calculated using the formula:

Size = length * sizeof(type) * thread_info

Here, length is the length of the variable, and type is one of char, int, or float. The last factor, thread_info, is 1 if the variable is denoted as input or loop, and n_threads otherwise. This implies that if an array is read-write and not distributed over all threads, we need n_threads copies of it.
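A minimal sketch of this sizing rule follows; the enum, struct, and function names are hypothetical, not the system's API:

    #include <stddef.h>

    /* Hypothetical description of a variable, mirroring the
     * user-supplied variable information. */
    enum Role { INPUT, LOOP, OUTPUT };   /* access feature of the variable */

    struct VarInfo {
        size_t length;      /* number of elements */
        size_t elem_size;   /* sizeof(char), sizeof(int), or sizeof(float) */
        enum Role role;
    };

    /* Size = length * sizeof(type) * thread_info, where thread_info is 1
     * for input/loop variables and n_threads for read-write variables
     * replicated across threads. */
    size_t shared_mem_size(struct VarInfo v, size_t n_threads)
    {
        size_t thread_info =
            (v.role == INPUT || v.role == LOOP) ? 1 : n_threads;
        return v.length * v.elem_size * thread_info;
    }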
We also provide flexibility for users to specify their own global combination function. They can also specify common data used across iterations, and the middleware will optimize the data movement for that data.

3. Applications and Experimental Results

We had three data mining applications ported to GPUs with our system. The sequential baseline executions were obtained on a Dell Dimension 9200 PC. It is equipped with an Intel Core 2 Duo E6420 processor with a 2.13 GHz clock rate, 1GB Dual Channel DDR2 SDRAM memory at 667 MHz, a 4MB L2 cache, and a 1066 MHz front side bus. The GPU versions used the same CPU, and a 768MB NVIDIA GeForce 8800 GTX, with 16 multiprocessors and 16KB shared memory on each multiprocessor.

The results reported in this paper are from k-means clustering [1], which is one of the most popular data mining algorithms. The performance of automatically generated CUDA programs on a 384 MB dataset is shown in Figure 2. All results are reported as a speedup over a sequential execution on the CPU. On the X scale, n threads implies executions with 1 block and n threads per block, while m blocks means m blocks and 256 threads per block. We report two different speedup numbers, with and without the data movement time.

The best speedups are nearly 50 over the CPU version, though when the data movement times are included, the speedup decreases to about 20. We can also see that the execution times of the middleware versions are close to the hand-coded version. The best performance is obtained with 256 threads per block and 16 or 32 blocks. More threads per block allows more concurrency. The maximum number of threads we can use in a block is 512, but this configuration does not obtain the best speedup, because of the larger amount of time spent on global combination. As there are 16 multiprocessors, the best speedups are obtained with 16 or 32 blocks.

[Figure 2. Speedup of k-means]

References

[1] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[2] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California, Mar 2004.

Copyright is held by the author/owner(s). PPoPP'09, February 14-18, 2009, Raleigh, North Carolina, USA. ACM 978-1-60558-397-6/09/02.

