CUDA Introduction


Introduction to

Parallel Computing
with CUDA
Oswald Haan
[email protected]
Schedule
9:15 Introduction to Parallel Computing with CUDA
10:15 Using CUDA
11:00 Break
11:15 CUDA Application Examples
12:30 Lunch break
13:30 Using Multiple GPUs
14:30 CUDA Application Libraries
15:00 Closing
Course Material
• All presentations available under
http://wwwuser.gwdg.de/~ohaan/CUDA_2021-1
• All code examples can be copied with
cp -r ~ohaan/cuda_kurs/* .
cp -r ~ohaan/cuda_kurs_f/* .
• Detailed documentation from NVIDIA's CUDA Toolkit Documentation
http://docs.nvidia.com/cuda
• Introductory book: CUDA by Example,
by J. Sanders and E. Kandrot
Course Material
CUDA Fortran
• NVIDIA's web site for CUDA Fortran
  https://developer.nvidia.com/cuda-fortran
• CUDA Fortran Programming Guide
  https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/index.html
• A series of introductory articles
  https://devblogs.nvidia.com/author/gruetsch/
GPGPU: General Purpose Computation on Graphics Processing Units
[Figure: the host, a multicore CPU with its host memory, is connected to the device, a manycore GPU with its own device memory]
General Purpose Processor vs. Graphics Processor
Size of chip area for different purposes
NVIDIA GPU Tesla K40
[Figure: chip layout showing the Streaming Multiprocessors (SM)]
Tesla K40 provides:
15 Streaming Multiprocessors
12 GB Main Memory
288 GB/s Memory Bandwidth
1.5 MB L2 Cache

Tesla K40 SM
192 SP Floating Point Units
64 DP Floating Point Units
32 Special Functional Units
32 Load/Store Units
256 KB registers per SM
64 KB shared memory/L1 cache per SM
48 KB read only/texture cache per SM
Clock rate 745 MHz

Nominal maximal speed for floating point operations:
SP: 4.38 TeraFlop/s, DP: 1.43 TeraFlop/s
Nominal maximal speed for main memory accesses:
SP: 72 GigaWords/s, DP: 36 GigaWords/s
GeForce GTX 1080
[Figure: chip layout showing the Streaming Multiprocessors (SM)]
GeForce GTX 1080 provides:
20 Streaming Multiprocessors
8 GB Main Memory
320 GB/s Memory Bandwidth
2 MB L2 Cache

GeForce GTX 1080 Streaming Multiprocessor
128 SP Floating Point Units
4 DP Floating Point Units
32 Special Functional Units
32 Load/Store Units
256 KB registers per SM
96 KB shared memory per SM
48 KB texture/L1 cache per SM
Clock rate 1607 MHz

Nominal maximal speed for floating point operations:
SP: 8.23 TeraFlop/s, DP: 0.26 TeraFlop/s
Nominal maximal speed for device memory accesses:
SP: 80 GigaWords/s, DP: 40 GigaWords/s
GPGPU-Programming Model:
Offloading of kernels
• Within a conventional sequential program, special subroutines are defined as kernels, which can be offloaded to the GPU.
• Memory areas for data accessed by the kernel subroutine must be provided on the GPU.
• Input data to be used by the kernel must be copied from host to GPU before the kernel is invoked.
• Result data produced by the kernel must be copied from GPU to host after the kernel execution is completed.
Typical sequence of steps (host actions invoke the corresponding device actions):
1. memory allocation on host
2. invocation of memory allocation on device → memory allocation on device
3. copy of kernel input data from host to device
4. invocation of kernel execution on device → kernel execution on device
   (the host can execute independent workload while the kernel runs)
5. copy of kernel output data from device to host
Unified Memory
Introduced in CUDA 6
[Figure: without unified memory, the multicore CPU accesses host memory and the manycore GPU accesses device memory; with unified memory, multicore CPU and manycore GPU access a common unified memory]
GPGPU-Programming Model:
Parallel execution of Kernel Threads
• SPMD (Single Program – Multiple Data):
• Multiple threads execute the same kernel program in parallel
• Accessed data and control flow within each execution thread can be differentiated by using a thread identification number tid, which has a different value in each thread
• All threads have access to the common global device memory space
• Synchronization of write accesses to the same memory object is not prescribed; the write order must be enforced by means of explicit synchronization mechanisms
GPGPU with CUDA
• CUDA (Compute Unified Device Architecture) is NVIDIA’s program
development environment.
• Contains extensions to C/C++ and library routines implementing the
GPGPU programming model
• Provides
  the compiler nvcc for CUDA programs
  profiling and debugging tools
  numerical libraries (e.g. CUDA Math Library, cuBLAS)
• Contains low-level drivers for NVIDIA's graphics cards
CUDA Fortran
• Contains extensions to Fortran and library routines implementing the
GPGPU programming model
• Developed jointly by PGI (Portland Group Inc.) and NVIDIA
• CUDA Fortran includes a Fortran 2003 compiler and tool chain for
programming NVIDIA GPUs using Fortran.
• Available in PGI 2010 and later releases.
• CUDA Fortran is supported on Linux, macOS and Windows.
CUDA Extensions to C/C++
• Qualifiers for functions indicating their scope
__global__ void kernel() called from host, executed on device
__device__ float function() called from device, executed on device
__host__ float function() called from host, executed on host
__device__ __host__ float function()
will be executed on device or on host, depending on where it is called from
• Qualifiers for variables specifying their location in device memory
__device__, __constant__, __shared__, __managed__
• Execution configuration setting the number and arrangement of threads in
kernels to be executed on device
kernel <<<grid, block,...>>> (arg1,arg2,...)
• thread-local variables for thread identification
threadIdx.x, blockDim.x, blockIdx.x, gridDim.x
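
A minimal sketch, not taken from the course material, that puts these extensions together; the kernel name add_one, the helper inc and the array size N are chosen for illustration:

#include <cuda_runtime.h>

__device__ float inc(float x) { return x + 1.0f; }      // called from device, executed on device

__global__ void add_one(float *a)                        // kernel: called from host, executed on device
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;     // thread identification
    a[tid] = inc(a[tid]);
}

int main(void)
{
    const int N = 256;
    float *a_d;
    cudaMalloc((void **)&a_d, N * sizeof(float));        // device memory for the kernel data
    add_one<<<N / 64, 64>>>(a_d);                        // execution configuration: 4 blocks of 64 threads
    cudaDeviceSynchronize();                             // wait until the kernel has finished
    cudaFree(a_d);
    return 0;
}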
Some CUDA functions for memory management
To be called from host:
• cudaMalloc allocation of device memory
• cudaMallocHost allocation of non-paged (pinned) host memory
• cudaMallocManaged allocation of unified memory
• cudaMemcpy(..., kind) blocking copy
• cudaMemcpyAsync(..., kind) nonblocking copy
where kind can be
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
• cudaFree deallocating device memory
• cudaFreeHost deallocating pinned host memory
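
A minimal sketch of the typical call sequence for these functions; the names v_h, v_d and the vector length N are illustrative:

#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;
    float *v_h, *v_d;

    cudaMallocHost((void **)&v_h, N * sizeof(float));                 // pinned (non-paged) host memory
    cudaMalloc((void **)&v_d, N * sizeof(float));                     // device memory

    for (int i = 0; i < N; i++) v_h[i] = 1.0f;

    cudaMemcpy(v_d, v_h, N * sizeof(float), cudaMemcpyHostToDevice);  // blocking copy host -> device
    /* ... kernels operating on v_d would be launched here ... */
    cudaMemcpy(v_h, v_d, N * sizeof(float), cudaMemcpyDeviceToHost);  // blocking copy device -> host

    cudaFree(v_d);                                                    // deallocate device memory
    cudaFreeHost(v_h);                                                // deallocate pinned host memory
    return 0;
}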
Synchronization
• blocking calls: cudaMalloc, cudaMemcpy, ...
• non-blocking calls: kernel<<<...>>>(...), cudaMemcpyAsync, ...
• cudaDeviceSynchronize();
  will synchronize the host with all previously started activities of the device
• __syncthreads()
  will synchronize kernel threads running on the device
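
A minimal sketch contrasting blocking and non-blocking calls; the kernel scale and the buffer size are illustrative:

#include <cuda_runtime.h>

__global__ void scale(float s, float *a) { a[threadIdx.x] = s * a[threadIdx.x]; }

int main(void)
{
    const int N = 256;
    float *v_h, *v_d;
    cudaMallocHost((void **)&v_h, N * sizeof(float));     // pinned memory enables truly asynchronous copies
    cudaMalloc((void **)&v_d, N * sizeof(float));
    for (int i = 0; i < N; i++) v_h[i] = 1.0f;

    cudaMemcpyAsync(v_d, v_h, N * sizeof(float), cudaMemcpyHostToDevice, 0);  // non-blocking: returns immediately
    scale<<<1, N>>>(2.0f, v_d);                            // kernel launch is non-blocking as well
    /* ... independent workload on the host ... */
    cudaDeviceSynchronize();                               // host waits here for copy and kernel to complete

    cudaFree(v_d);
    cudaFreeHost(v_h);
    return 0;
}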
CUDA Fortran Extension to Fortran90
Qualifiers for subroutines and functions:
• attributes(host): to be executed on host, to be called from subprograms with host attribute
• attributes(global): only for subroutines; declares a kernel to be called from host and to be executed on device
• attributes(device): to be executed on device, to be called from subprograms with global or device attribute
• host is the default attribute
CUDA Fortran Extension to Fortran90
Qualifiers for variables determine in which memory the memory space
for the variables will be allocated:
• By default, variables declared in modules or host subprograms are
allocated in the host main memory.
• device: variable is allocated in device main memory
• managed: variable migrates between host and device, depending on where it is accessed (Unified Memory)
• constant: variable is allocated in device constant memory space
• shared: variable is allocated in device shared memory
• texture: variable is allocated in device texture memory space; accesses to texture data go through a separate cache on the device
• pinned: variable is allocated in host page-locked memory; copies from page-locked memory to device memory are faster
CUDA Fortran Extension to Fortran90
• Execution configuration setting the number and arrangement of
threads in kernels to be executed on device
kernel <<<grid, block,...>>> (arg1,arg2,...)
• thread-local variables for thread identification
threadIdx%x, blockDim%x, blockIdx%x, gridDim%x
• Kernel loop directive
!$cuf kernel do[(n)] <<< grid, block, ... >>>
Automatically generates device code for the following loop nest; the optional (n) specifies how many of the nested loops are mapped onto the thread grid.
For more details see: 2.11. Kernel Loop Directive in
https://www.pgroup.com/resources/docs/19.4/pdf/pgi19cudaforug.pdf
Thread Hierarchy
• The threads created by executing a kernel are organized in a two-level hierarchy:
  a 1-, 2- or 3-dim grid of
  1-, 2- or 3-dim blocks of threads.
• Each thread is unambiguously numbered by four index vectors:
  gridDim.j, j = x,y,z
  blockIdx.j = 0,…,gridDim.j-1
  blockDim.j, j = x,y,z
  threadIdx.j = 0,…,blockDim.j-1
Calculating the Thread ID
• Grid contains gridsize blocks, each block containing blocksize threads, where
  gridsize  = gridDim.x * gridDim.y * gridDim.z
  blocksize = blockDim.x * blockDim.y * blockDim.z
• Threads are numbered from 0 to blocksize*gridsize – 1:
  tid    = id_thr + blocksize * id_blk
  id_thr = threadIdx.x + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y
  id_blk = blockIdx.x + blockIdx.y * gridDim.x
         + blockIdx.z * gridDim.x * gridDim.y
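
A minimal sketch of these formulas as a device helper function; the function name global_tid is illustrative:

__device__ int global_tid(void)
{
    int blocksize = blockDim.x * blockDim.y * blockDim.z;
    int id_thr = threadIdx.x + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;    // position of the thread within its block
    int id_blk = blockIdx.x + blockIdx.y * gridDim.x
               + blockIdx.z * gridDim.x * gridDim.y;       // position of the block within the grid
    return id_thr + blocksize * id_blk;                    // 0 ... blocksize*gridsize - 1
}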
Calculating the Thread ID for CUDA Fortran
• Grid contains gridsize blocks, each block containing blocksize threads, where
  gridsize  = gridDim%x * gridDim%y * gridDim%z
  blocksize = blockDim%x * blockDim%y * blockDim%z
• Threads are numbered from 1 to blocksize*gridsize:
  tid    = id_thr + blocksize * (id_blk-1)
  id_thr = threadIdx%x + (threadIdx%y-1) * blockDim%x
         + (threadIdx%z-1) * blockDim%x * blockDim%y
  id_blk = blockIdx%x + (blockIdx%y-1) * gridDim%x
         + (blockIdx%z-1) * gridDim%x * gridDim%y
Limits for Grid and Block Dimensions
MaxGridDim.x = 2^31 – 1 = 2147483647
MaxGridDim.y = MaxGridDim.z = 2^16 – 1 = 65535
MaxBlockDim.x = MaxBlockDim.y = 1024
MaxBlockDim.z = 64
Total number of threads per block
  = blockDim.x*blockDim.y*blockDim.z ≤ 1024
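
A minimal sketch of choosing a 1-dim configuration within these limits; the block size of 256 threads, the kernel name init and the guard against excess threads are illustrative choices:

__global__ void init(float *a, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) a[tid] = 0.0f;                  // the last block may contain unused threads
}

void launch_init(float *a_d, int n)
{
    dim3 block(256);                             // <= 1024 threads per block
    dim3 grid((n + block.x - 1) / block.x);      // enough blocks to cover all n elements
    init<<<grid, block>>>(a_d, n);
}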
Example: scaling of a vector
Sequential execution for a vector in host memory at address v_h:
for(i=0; i<N; i++) {
    v_h[i] = s * v_h[i];
}
Example: scaling of a vector
Thread parallel execution for a vector in device memory at address v_d:
__global__ void kernel(float s, float *a) {
    a[threadIdx.x] = s * a[threadIdx.x];
}
called from host with:
kernel<<<1,N>>>(s, v_d);
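
A minimal sketch of the complete host-side flow for this example; the vector length N and the scale factor s are illustrative. It allocates v_d, copies the input, launches the kernel from the slide and copies the result back:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel(float s, float *a) { a[threadIdx.x] = s * a[threadIdx.x]; }

int main(void)
{
    const int N = 256;
    const float s = 2.0f;
    float v_h[N], *v_d;

    for (int i = 0; i < N; i++) v_h[i] = (float)i;

    cudaMalloc((void **)&v_d, N * sizeof(float));                     // memory on the device
    cudaMemcpy(v_d, v_h, N * sizeof(float), cudaMemcpyHostToDevice);  // input: host -> device

    kernel<<<1, N>>>(s, v_d);                                         // N threads in one block

    cudaMemcpy(v_h, v_d, N * sizeof(float), cudaMemcpyDeviceToHost);  // result: device -> host
    cudaFree(v_d);

    printf("v_h[1] = %f\n", v_h[1]);                                  // expect 2.0
    return 0;
}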
Hardware Hierarchy
[Figure: the graphics device contains the main memory, the L2 cache and several Streaming Multiprocessors (SM 1 ... SM n); each SM contains a control unit (schedule, dispatch), 32-bit registers, shared memory/L1 cache and CUDA cores (SP/DP/SF/LS functional units)]
Mapping of Threads to Hardware
• All threads of a thread-block are executed “simultaneously” on the same SM.
  Threads within a block can communicate via shared memory.
• All threads of a thread-block must complete the kernel execution before the SM resources used for this thread-block are freed to be used for the next block of threads.
• Different thread-blocks are distributed to the same or to different SMs according to the availability of resources on the SMs.
  Execution order of threads in different thread-blocks is not prescribed.
  No communication between threads in different thread-blocks.
Maximal Number of Active Threads in a SM
• Upper limit of the number of active threads per SM:
  1024 or 2048 threads
• Upper limit of the number of threads per block:
  1024 threads per block
• Upper limit of the number of active blocks per SM:
  32 blocks for GeForce GTX 980, 16 blocks for Tesla K40
Actual Number of Threads in a SM
• The actual number of threads residing in a SM can be smaller than the maximal number, depending on:
  1. Number of registers used per thread
     2^16 = 65536 registers = 262 kB per SM for GeForce GTX 980 and Tesla K40
     1020 B maximal register size per thread
     128 B register size per thread for maximal number of 2048 threads in a SM
  2. Shared memory used per thread
     96 kB shared memory per SM for GeForce GTX 980, 48 kB for Tesla K40
• The number of registers and the amount of shared memory space needed for a single thread in a thread-block of a given kernel is determined at compile time and can be enquired with the compile flag --ptxas-options=-v
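
For example, assuming a source file scale.cu (the file name is illustrative), ptxas reports the registers and shared memory used per kernel when compiling with:

nvcc --ptxas-options=-v scale.cu -o scale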
SIMT Thread Scheduling
• SIMT : single instruction – multiple threads
32 threads (called a warp of threads) are scheduled together,
always executing the same instruction simultaneously on groups of 32
CUDA cores on the SM
• A warp is resident on a SM, until all of its threads have completed the kernel
• The number of threads of resident warps on a SM can far exceed its number of CUDA cores.
  Execution is switched to ready warps, in which all threads are ready to execute the next instruction.
  This hides the latency of memory accesses.
Occupancy
• Occupancy = number of resident warps per SM / maximal number of warps per SM
• High occupancy helps to hide memory latency
• Conditions for 100% occupancy
threads per block is a multiple of 32 (size of a warp) and
threads per block is a divisor of 2048 (max. number of threads in a SMX) and
threads per block >= 64 (Geforce GTX 980), >=128 (Tesla K40) and
register size per thread <= 128 B (Geforce GTX 980, Tesla K40) and
shared memory per thread <= 48 B (Geforce GTX 980), <= 24 B (Tesla K40)
• Trade-off: high occupancy vs. large amount of fast SM memory per thread
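
The CUDA runtime can report the occupancy a given block size yields; a minimal sketch, assuming a vector-scaling kernel like the one from the earlier examples and a block size of 256 threads:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float s, float *a) { a[threadIdx.x] = s * a[threadIdx.x]; }

int main(void)
{
    int blockSize = 256, numBlocks = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                     // properties of device 0

    // maximal number of resident blocks of this kernel per SM for the given block size
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, scale, blockSize, 0);

    double occupancy = (double)(numBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("resident blocks per SM: %d, occupancy: %.2f\n", numBlocks, occupancy);
    return 0;
}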
Diverging Threads
• Conditional execution depending on thread number
• Groups of threads in a warp with different execution paths will be
scheduled separately
• Leads to longer execution times
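
A minimal sketch of a divergent branch: within one warp, threads with even and odd indices follow different paths and are therefore executed one path after the other (the kernel name diverge is illustrative):

__global__ void diverge(float *a)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        a[tid] = 0.0f;      // executed by the even-numbered threads of the warp
    else
        a[tid] = 1.0f;      // executed afterwards by the odd-numbered threads
}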
Thread Synchronization from Device
• In a device function, threads within a block can be synchronized by calling
the barrier
__syncthreads();
• Waits until all threads in a block have reached this instruction and all
accesses to global and shared memory from these threads are completed
• Danger of stalled execution:
  if (i < cut) __syncthreads();
  will hang if, within a block, some threads have i < cut and others have i >= cut
• Is used to coordinate memory access from threads within a single block
• __syncthreads() cannot coordinate the execution of threads from different blocks
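
A minimal sketch of the typical use: each thread writes one element of a shared-memory buffer, and the barrier guarantees that the whole block has finished writing before any thread reads a neighbouring element (the kernel name shift is illustrative and assumes blocks of 256 threads):

__global__ void shift(float *in, float *out)
{
    __shared__ float buf[256];                    // shared memory of the block, one element per thread
    int i = threadIdx.x;

    buf[i] = in[blockIdx.x * blockDim.x + i];     // every thread writes its own element
    __syncthreads();                              // wait until all threads of the block have written

    int j = (i + 1) % blockDim.x;                 // read the element written by the neighbouring thread
    out[blockIdx.x * blockDim.x + i] = buf[j];
}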
Memory Organization, Hardware View
[Figure: the host with its CPUs, caches and host main memory is connected to the graphics device; the device contains its own main memory, the L2 cache and the Streaming Multiprocessors (SM 1 ... SM n); each SM has a control unit (schedule, dispatch), shared 32-bit registers, shared memory/L1 cache and a constant + texture cache]

Device Memory, Software View
[Figure: each thread has its own registers and local memory; all threads of a block share the block's shared memory; all blocks of the grid access the common global memory, constant memory and texture memory]
CUDA type qualifiers
