Recipe For Running Simple CUDA Code On A GPU Based Rocks Cluster


Head node: sneth.pas.rochester.edu
http://astro.pas.rochester.edu/~aquillen/gpuworkshop.html

GPU workshop, University of Rochester, June 20, 2008, Alice Quillen

Outline
- The kernel: .cu CUDA files
- Calling __global__ functions
- Setting grids and threads
- Allocating memory on the device
- Compiling in emulate mode
- Submitting a job to the queue using SGE
- Timing your job for benchmarks

CUDA C, .cu files
Routines that call the device must be in plain C, in files with the extension .cu. Often there are two files:
1) The kernel: a .cu file containing the routines that run on the device.
2) A .cu file that calls the routines from the kernel and #includes the kernel file.

Optional additional .cpp or .c files with other routines can be linked in.

Kernel, Threads + Blocks


A kernel has two different types of functions:

__global__   called from the host, executed on the device
__device__   called from the device, run on the device
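A minimal sketch of the distinction (the helper name addhalf is invented for this illustration; this is not code from the workshop):

```cuda
// Hypothetical sketch: a __device__ helper called from a __global__ kernel.
__device__ float addhalf(float x)
{
    return x + 0.5f;   // runs on the device, callable only from device code
}

__global__ void addhalf_all(float* d_array)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[index] = addhalf(d_array[index]);   // device-to-device call
}
```

The host can only launch addhalf_all with the <<< >>> syntax; it cannot call addhalf directly.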

__global__ void addone(float* d_array)   // kernel is addone
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[index] += 1.0;
}

d_array must be allocated on the device. The body above runs once for every thread in every block, so the launch covers the whole array automatically; on a CPU this would be inside a loop over index.

Calling a __global__ routine: threads and blocks


Calling the routine from the host to run on the device:

int p = 256;                      // number of threads per block, a power of 2
dim3 threads(p, 1, 1);            // p threads per block
dim3 grid(arraylength/p, 1, 1);   // number of blocks is arraylength divided by
                                  // the number of threads per block;
                                  // arraylength is assumed a multiple of p

int sharedMemsize = 0;
addone <<< grid, threads, sharedMemsize >>> (d_array);   // call the kernel

Inside the routine, blockDim.x is p (the number of threads per block), threadIdx.x ranges from 0 to p-1 (covering all threads in a block), and blockIdx.x ranges from 0 to arraylength/p - 1 (covering all blocks).

__global__ void addone(float* d_array)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[index] += 1.0;
}

Memory allocation on device


int lengtharray = 10240;   // note: a multiple of 256; pad with zeros if your
                           // vector isn't a multiple
unsigned int memSize = lengtharray * sizeof(float);

// allocate array on host
float* h_array = (float*) malloc(memSize);   // array on host
float* d_array;                              // array on device

// allocate array on device
CUDA_SAFE_CALL(cudaMalloc((void**)&d_array, memSize));
// CUDA_SAFE_CALL is a CUDA Utility Tool macro. If an error is found, it prints
// an error message with the file name and line number where the error occurred.

// copy array from host to device
CUDA_SAFE_CALL(cudaMemcpy(d_array, h_array, memSize, cudaMemcpyHostToDevice));

// copy array from device to host
CUDA_SAFE_CALL(cudaMemcpy(h_array, d_array, memSize, cudaMemcpyDeviceToHost));

Global Device Memory


Keep track of which data is on the device and which is on the host. If you address host memory from the device, the code doesn't work.

A code that adds 1 to every element of a large array on the GPU


#include <cutil.h>            // CUDA-defined stuff
#include <stdio.h>            // usual I/O
#include "addone_kernel.cu"   // contains the kernel

int main(int argc, char** argv)
{
    CUT_DEVICE_INIT();   // device init; also tells you if multiple devices
                         // are present

    // Allocate host array, fill it with something
    // Allocate device array
    // Copy array from host to device
    // Launch kernel
    // Copy array back from device to host
    // Print some of it out so you can see that the device did something

    CUDA_SAFE_CALL(cudaFree(d_array));   // free device memory
    CUT_EXIT(argc, argv);                // use CUDA Utility Tool to exit cleanly
}
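A hedged sketch of how that skeleton might be filled in, combining the allocation, copy, and launch calls shown on the earlier slides (untested here; the fill values and printout are illustrative):

```cuda
#include <cutil.h>
#include <stdio.h>
#include "addone_kernel.cu"

int main(int argc, char** argv)
{
    CUT_DEVICE_INIT();

    int lengtharray = 10240;                      // a multiple of 256
    unsigned int memSize = lengtharray * sizeof(float);

    float* h_array = (float*) malloc(memSize);    // host array, filled below
    for (int i = 0; i < lengtharray; i++) h_array[i] = (float) i;

    float* d_array;                               // device array
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_array, memSize));
    CUDA_SAFE_CALL(cudaMemcpy(d_array, h_array, memSize, cudaMemcpyHostToDevice));

    int p = 256;
    dim3 threads(p, 1, 1);
    dim3 grid(lengtharray / p, 1, 1);
    addone <<< grid, threads >>> (d_array);       // launch the kernel

    CUDA_SAFE_CALL(cudaMemcpy(h_array, d_array, memSize, cudaMemcpyDeviceToHost));
    printf("h_array[0] = %f\n", h_array[0]);      // shows the device did something

    CUDA_SAFE_CALL(cudaFree(d_array));            // free memory
    free(h_array);
    CUT_EXIT(argc, argv);
}
```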

Compiling
Sample head of Makefile
# name of executable
EXECUTABLE := addone
# CUDA source files (compiled with nvcc); don't list the kernel,
# as it is #included in the other file
CUFILES    := addone.cu
CU_DEPS    := addone_kernel.cu
# C/C++ source files (compiled with gcc / g++)
CCFILES    :=
C_DEPS     :=
ROOTDIR     = /usr/local/NVIDIA_CUDA_SDK/common/
include $(ROOTDIR)/common.mk   # SDK make rules

Emulator
[aquillen@sneth ~]$ make emu=1

This makes executable code that runs on the CPU, not on the device. The executable can be run on the head node (which lacks GPUs), so it is not necessary to send a job to the queue. In addone_kernel.cu:

__global__ void addone(float* d_array)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    d_array[index] += 1.0;
    printf("%.2e\n", d_array[index]);   // emulator only!
}

This will not compile or run for the device, but it will compile and run in the emulator. It prints for every block and thread. To make the output much more manageable:

if (threadIdx.x == 0) printf("%.2e\n", d_array[index]);

Debugging
Code that works in emulator mode may not work on the device (particularly with memory access and synchronization problems). However, debugging in emulator mode is a pretty good way to get closer to working code. A profiler recently became available, but we have not yet had a chance to play with it.

Queue system
SGE (Sun Grid Engine) is an open-source batch-queuing system supported by Sun, and it comes as a roll with Rocks. Rocks is an open-source Linux cluster distribution that enables end users to easily build computational clusters. CUDA also comes as a roll.

To submit a job:

[aquillen@sneth ~]$ qsub sge.bash

where sge.bash is a text file that looks like the following example.

Example sge.bash file


#!/bin/bash                      # remind me that this uses the bash shell
#
#$ -cwd                          # use current directory
#$ -o std.out                    # stdout goes into this file
#$ -e std.err                    # stderr goes into this file
#$ -S /bin/bash                  # specify bash shell
#$ -m be                         # email at begin and end of job
#$ -M [email protected]     # where to send email
#$ -N gal10                      # name of job
#$ -l h_rt=24:00:00              # let it run no more than 24 hours
#$ -V                            # export environment variables to the job

date +%s                                      # time in seconds for a simple benchmark
# execute some code
/home/aquillen/galint/exe/release/mkgal.out   # executables
/home/aquillen/galint/exe/release/galint
date                                          # print date
date +%s
echo complete

queue commands
qstat           look at what jobs are running
qdel 701        kill that job
qhost           check all the nodes
qstat -j 703    explain to me what is going on with that job

Workshop
http://astro.pas.rochester.edu/~aquillen/gpuworkshop/assigna.html
Assignment #1: Submitting a job to the queue using SGE (qsub)
Here is a sample SGE (Grid Engine) file written in bash shell: sge.bash. Transfer this file to your home directory on our Rocks cluster sneth.pas.rochester.edu. Modify sge.bash so that it prints "hello, World" into a file called std.out and sleeps for 60 seconds or so. Submit it as a job to the queue (qsub sge.bash). Check the queue status to make sure your job is run (qstat).

Assignment #2: Compiling and running the routine addone, which adds 1 to every element of a large array
Here is a sample kernel (addone_kernel.cu), a routine that calls this kernel (addone.cu), a Makefile to compile it (Makefile), and an SGE bash file to send it to the queue (cudasge.bash). Transfer these 4 files to your home directory on sneth; they can also be copied directly from the directory ~aquillen/addone/ on sneth. Edit the cudasge.bash file so it is appropriate for you. Compile the code with make; the executable will be in a subdirectory called release. Run it by submitting the cudasge.bash file to the queue. Then compile the code in emulate mode (make emu=1); that executable will be in a subdirectory called emurelease. Run this code directly; it runs on the head node, so you don't need to submit it to the queue.

Assignment #3: Write your own routine by modifying the above addone routines
Modify the kernel so that it computes the derivative of the array using a finite difference (f[i+1] - f[i]). You will need input and output arrays as arguments to your __global__ routine, since the order in which threads and blocks execute is not specified. You do not want to address memory outside the array, as this will crash the code. If you check the addressing with an if statement, your routine will take twice as long to run; try to write the routine without an if statement. Hint: you can trade an irrelevant amount of device global memory for the if statement.

Assignment #4: Discuss how you could implement a box smooth or 1D convolution routine.

Benchmarking
There are many ways to time routines. Here is the one used in the SDK; I think it uses gettimeofday() from sys/time.h on Linux.

// initialize a timer
unsigned int timer = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));
// start it
CUT_SAFE_CALL(cutStartTimer(timer));

// run your code here

// stop the timer
CUT_SAFE_CALL(cutStopTimer(timer));
printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
// time in milliseconds, returned as a float

// delete the timer
CUT_SAFE_CALL(cutDeleteTimer(timer));

Links
http://astro.pas.rochester.edu/~aquillen/gpuworkshop.html
Includes links to the CUDA manual, the SDK (software development kit, with lots of good programming examples), and other online tutorials.
