CUDA Introduction
Parallel Computing with CUDA
Oswald Haan
[email protected]
Schedule
9:15 Introduction to Parallel Computing with CUDA
10:15 Using CUDA
11:00 Break
11:15 CUDA Application Examples
12:30 Lunch break
13:30 Using Multiple GPUs
14:30 CUDA Application Libraries
15:00 Closing
Course Material
• All presentations available under
http://wwwuser.gwdg.de/~ohaan/CUDA_2021-1
• All code examples can be copied with
cp -r ~ohaan/cuda_kurs/* .
cp -r ~ohaan/cuda_kurs_f/* .
• Detailed documentation from NVIDIA's CUDA Toolkit Documentation
http://docs.nvidia.com/cuda
• Introductory book: CUDA by Example,
Authors: J. Sanders and E. Kandrot
Course Material: CUDA for Fortran
General Purpose Processor vs. Graphics Processor
Size of chip area for different purposes
NVIDIA GPU Tesla K40
[Figure: Tesla K40 with multiple Streaming Multiprocessors (SMs); host memory, device memory, unified memory]
GPGPU-Programming Model:
Parallel execution of Kernel Threads
• SPMD (Single Program, Multiple Data):
• Multiple threads execute the same kernel program in parallel
• Data accesses and control flow within each thread can be differentiated by a thread identification number tid, which has a different value in each thread (see the sketch after this list)
• All threads have access to the common global device memory space
• The order of write accesses to the same memory object is not prescribed; a required write order must be enforced by explicit synchronization mechanisms
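A minimal sketch of this model (the kernel name scale and the array length n are hypothetical examples): each thread derives its own tid from the CUDA thread identification variables and works on a different array element:

__global__ void scale(float *a, float f, int n)
{
    // tid has a different value in every thread of the kernel launch
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)              // threads beyond the array length do nothing
        a[tid] = f * a[tid];
}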
GPGPU with CUDA
• CUDA (Compute Unified Device Architecture) is NVIDIA’s program
development environment.
• Contains extensions to C/C++ and library routines implementing the
GPGPU programming model
• Provides
compiler nvcc for CUDA programs (see the compile example after this list)
profiling and debugging tools
numerical libraries (e.g. CUDA Math Library, cuBLAS)
• Contains low-level drivers for NVIDIA's graphics cards
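A typical build and run step with the provided compiler (the file name saxpy.cu is a hypothetical example):

nvcc -o saxpy saxpy.cu    # compiles host and device parts into one executable
./saxpy                   # the program runs on the host and launches kernels on the device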
CUDA Fortran
• Contains extensions to Fortran and library routines implementing the
GPGPU programming model
• Developed jointly by PGI (Portland Group Inc.) and NVIDIA
• CUDA Fortran includes a Fortran 2003 compiler and tool chain for
programming NVIDIA GPUs using Fortran.
• Available in PGI 2010 and later releases.
• CUDA Fortran is supported on Linux, macOS and Windows.
CUDA Extensions to C/C++
• Qualifiers for functions indicating their scope
__global__ void kernel() called from host, executed on device
__device__ float function() called from device, executed on device
__host__ float function() called from host, executed on host
__device__ __host__ float function()
will be executed on device or on host, depending on where it is called from
• Qualifiers for variables specifying their location in device memory
__device__, __constant__, __shared__, __managed__
• Execution configuration setting the number and arrangement of threads in
kernels to be executed on device
kernel <<<grid, block,...>>> (arg1,arg2,...)
• thread-local variables for thread identification
threadIdx.x, blockDim.x, blockIdx.x, gridDim.x
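A short sketch combining these extensions (the names square, scale2 and d_a are hypothetical examples, not part of the CUDA API):

__device__ float square(float x)      // device function: called from device code
{
    return x * x;
}

__global__ void scale2(float *a)      // kernel: called from host, runs on device
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    a[tid] = square(a[tid]);
}

// host code: execution configuration with 4 blocks of 256 threads each
scale2<<<4, 256>>>(d_a);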
Some CUDA functions for memory management
To be called from host:
• cudaMalloc allocation of device memory
• cudaMallocHost allocation of non-paged (pinned) host memory
• cudaMallocManaged allocation of unified memory
• cudaMemcpy(..., kind) blocking copy
• cudaMemcpyAsync(..., kind) nonblocking copy
where kind can be
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
• cudaFree deallocating device memory
• cudaFreeHost deallocating pinned host memory
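A typical call sequence using these functions (a hypothetical sketch; n and the kernel launch are placeholders):

int n = 1024;
size_t bytes = n * sizeof(float);
float *h_a, *d_a;
cudaMallocHost((void**)&h_a, bytes);                  // pinned host memory
cudaMalloc((void**)&d_a, bytes);                      // device memory
// ... fill h_a on the host ...
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // blocking copy to device
// ... launch a kernel working on d_a ...
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // blocking copy of results
cudaFree(d_a);
cudaFreeHost(h_a);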
Synchronization
• blocking calls: cudaMalloc, cudaMemcpy,...
• cudaDeviceSynchronize();
will synchronize the host with all previously started activities of the
device
• __syncthreads()
will synchronize kernel threads running in the device
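Kernel launches return immediately on the host; a sketch of the typical use of cudaDeviceSynchronize (kernel name and arguments are taken from the hypothetical sketch above):

scale<<<grid, block>>>(d_a, 2.0f, n);  // asynchronous: host continues at once
cudaDeviceSynchronize();               // host waits until the kernel has finished
// results in device memory are now valid and can be copied back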
CUDA Fortran Extensions to Fortran 90
Qualifiers for subroutines and functions indicating their scope:
attributes(global), attributes(device), attributes(host)
[Figure: SM architecture: control unit (schedule, dispatch); SM 1 ... SM n with 32-bit registers and SP/DP/SF/LS functional units; L2 cache; main memory]
• All threads of a thread-block must complete the kernel execution before the
SM-resources used for this thread block are freed to be used for the next block
of threads
• The number of registers and the amount of shared memory space needed for a single thread in a thread-block of a given kernel is determined at compile time and can be queried with the compile flag --ptxas-options=-v (see the example below)
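For example (the file name kernel.cu is a placeholder):

nvcc --ptxas-options=-v kernel.cu

prints, for each kernel, the number of registers and the bytes of shared memory it uses.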
SIMT Thread Scheduling
• SIMT : single instruction – multiple threads
32 threads (called a warp of threads) are scheduled together,
always executing the same instruction simultaneously on groups of 32
CUDA cores on the SM
• A warp is resident on a SM, until all of its threads have completed the kernel
• The number of threads in resident warps on an SM can far exceed its number of CUDA cores.
Execution is switched to ready warps, in which all threads are ready to execute the next instruction.
This switching hides the latency of memory accesses.
Occupancy
• Occupancy = (number of resident warps per SM) / (maximum number of warps per SM)
• High occupancy helps to hide memory latency
• Conditions for 100% occupancy
threads per block is a multiple of 32 (size of a warp) and
threads per block is a divisor of 2048 (max. number of threads in a SMX) and
threads per block >= 64 (Geforce GTX 980), >=128 (Tesla K40) and
register size per thread <= 128 B (Geforce GTX 980, Tesla K40) and
shared memory per thread <= 48 B (Geforce GTX 980), <= 24 B (Tesla K40)
• Trade off: high occupancy – large amount of fast SM-memory per thread
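A worked example under these limits (assuming 2048 resident threads, i.e. 64 warps per SM, and a register file of 65536 32-bit registers per SM):

256 threads per block = 8 warps per block
8 resident blocks = 64 warps, occupancy = 64/64 = 100 %
with 40 registers (160 B > 128 B) per thread, only 6 blocks fit (6 x 256 x 40 = 61440 <= 65536 registers)
6 blocks = 48 resident warps, occupancy = 48/64 = 75 %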
Diverging Threads
• Conditional execution depending on thread number
• Groups of threads in a warp with different execution paths will be
scheduled separately
• Leads to longer execution times
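A sketch of a kernel with diverging threads (the kernel name diverge is hypothetical): even and odd threads of the same warp take different paths, and the two groups are scheduled one after the other:

__global__ void diverge(float *a)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)             // even threads of the warp execute this path
        a[tid] = 2.0f * a[tid];
    else                          // odd threads wait, then execute separately
        a[tid] = 0.0f;
}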
Thread Synchronization from Device
• In a device function, threads within a block can be synchronized by calling
the barrier
__syncthreads();
• Waits until all threads in a block have reached this instruction and all
accesses to global and shared memory from these threads are completed
• Danger of stalled execution:
if (i < cut) __syncthreads();
will hang if the threads of a block do not all take the same branch, i.e. if neither i < cut nor i >= cut holds for all of them
• Is used to coordinate memory access from threads within a single block
• __syncthreads() cannot coordinate the execution of threads from different blocks (see the sketch after this list)
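A sketch of coordinated shared-memory access within one block (the kernel name reverse is hypothetical and a block size of 256 threads is assumed): the barrier guarantees that all writes to buf are complete before any thread reads another thread's element:

__global__ void reverse(float *a)
{
    __shared__ float buf[256];               // one copy per block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[t] = a[base + t];
    __syncthreads();                         // all writes to buf are finished here
    a[base + t] = buf[blockDim.x - 1 - t];   // safe to read other threads' data
}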
Memory Organization, Hardware View
[Figure: hardware view: CPU with cache and main memory; GPU with control unit (schedule, dispatch), SM 1 ... SM n with shared 32-bit registers, L2 cache and its own main memory]
[Figure: software view: each thread has its own registers and local memory; the threads of a block share shared memory / L1 cache; all blocks access global, constant and texture memory]
CUDA type qualifiers