Cuda Program + Wait For User Input
Applications on GPUs
Keywords: GPGPU, CUDA, Data Mining

2. System Design

Though CUDA has accelerated the use of GPUs for non-graphics applications, it still requires explicit parallel programming and management of the memory hierarchy by the programmer. Our system is designed to ease GPU programming for a specific class of applications. What the user needs to provide is just the reduction functions, together with variable information and optional functions.

2.2 Code and Variable Analysis

The program analysis comprises three components. The first is obtaining the variable access features (input, output, or temporary storage) from a reduction function. Another is extracting the combination operations (currently, we consider only the two operations "+" and "*"). These two tasks make use of the IR generated by LLVM. The third is variable analysis and parallelization, which mainly extracts information for variable access and replication, as well as arranging data on shared memory.
[Figure: System architecture. The user input (variable information, reduction functions, and optional functions) is processed by a Code Analyzer (in LLVM) and a Variable Analyzer, which extract the variable access pattern and combination operation; a Code Generator then emits the kernel functions, the grid configuration and kernel invocation, and the host program, which are built into the executable.]

Dell Dimension 9200 PC. It is equipped with an Intel(tm) Core(tm) 2 Duo E6420 processor with a 2.13 GHz clock rate, 1 GB Dual Channel DDR2 SDRAM memory at 667 MHz, a 4 MB L2 cache, and a 1066 MHz front side bus. The GPU versions used the same CPU, and a 768 MB NVIDIA GeForce 8800 GTX, with 16 multiprocessors and 16 KB shared memory on each multiprocessor.
The results reported in this paper are from k-means clustering [1], which is one of the most popular data mining algorithms. The performance of the automatically generated CUDA programs on a 384 MB dataset is shown in Figure 2. All results are reported as a speedup over a sequential execution on the CPU. On the X scale, "n threads" implies executions with 1 block and n threads per block, while "m blocks" means m blocks and 256 threads per block. We report two different speedup numbers, with and without the data movement time.
The best speedups are nearly 50 over the CPU version, though when the data movement times are included, the speedup decreases to about 20. We can also see that the execution times of the middleware versions are close to those of the hand-coded version. The best performance is obtained with 256 threads per block and 16 or 32 blocks. More threads per block allows more concurrency. The maximum number of threads we can use in a block is 512, but this configuration does not obtain the best speedup, because of the larger amount of time spent on the global combination. As there are 16 multiprocessors, the best speedups are obtained with 16 or 32 blocks.