int main() {
    ……
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_c, 128 * sizeof(int));
    ……
    cudaMemcpy(dev_a, a, 128 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, 128 * sizeof(int), cudaMemcpyHostToDevice);
    // Launch a kernel on the GPU with one thread for each element.
    addKernel<<<1, 128>>>(dev_c, dev_a, dev_b);
    cudaFree(dev_c);
    ……
}
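
The fragment above launches addKernel but elides its definition; a minimal sketch of what it could look like, assuming 128-element int vectors as in the launch configuration:

// Hypothetical definition of the addKernel launched above:
// one thread computes one element of c = a + b.
__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x; // <<<1, 128>>> uses a single block, so threadIdx.x is enough
    c[i] = a[i] + b[i];
}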
Harbin Institute of Technology · GPU Computing (GPU计算)
Tonghua Su
School of Software
Harbin Institute of Technology
Lec 3 CUDA Abstraction
Outline
1 Multithreading
2 CUDA Abstraction
3 Kernel Execution
4 Warp Scheduling
5 CUDA Toolchain
Design Philosophy
Multithreading
●SIMD in CPU
✓ All cores execute the same instruction simultaneously, but on different data
✓ Similar to vector computing on CRAY supercomputers
✓ e.g., the SSE4 and AVX instruction sets
●SIMT in SMX
✓ Multithreaded CUDA cores
✓ Threads on each SMX execute in groups sharing the same instruction
✓ Fine-grained parallelism
✓ Natural for graphics processing and much scientific computing
✓ SIMT is also a natural choice for many-core chips because it keeps each core simple
Multithreading
●Thread: an instruction stream with its own PC and data
✓ Owns private registers, private memory, a program counter, and thread execution state
✓ Thread-Level Parallelism (TLP): exploit the parallelism inherent between threads
●Multithreading
✓ Multiple threads share the functional units of one processor by overlapping execution
✓ The processor must duplicate the independent state of each thread
• e.g., a separate copy of the register file, a separate PC
✓ Often, hardware support for fast thread switching
✓ Memory is shared
✓ Hides memory-access stalls
GPU Multithreading
●Multithreaded Hardware
✓ Each CUDA core is a multithreaded processor
• Supports 96 threads per core on the GTX 8800
✓ Threads run in groups of 32 (called warps)
✓ Zero-cost “context switching”
• Each thread has its own registers (which limits the number of active threads)
✓ Shared memory / L1 cache
●Fine-grained multithreading
✓ Able to switch between warps on each instruction
✓ Scheduled without pre-emption
✓ Eligible warps are scheduled in round-robin fashion
GPU Multithreading
●Hiding Latency Stalls
✓ Fetch → Decode → Execute → Memory → Writeback
✓ Execution alternates between “active” warps, with warps becoming temporarily “inactive” while waiting for data
✓ Lots of active warps is the key to high performance
(Figure: execution timeline — warps 1–4 each run until they stall on memory, and the scheduler switches to the next ready warp to hide the stall.)
Quiz
●Quantify the stall caused by memory accesses
✓ Assume reading data from device memory has a latency of 400 clock cycles
✓ If each warp can run for 10 cycles, how many additional ready warps are needed to hide the gap left by the stall?
Assignment Warm-up
●Write a program that multiplies two matrices
● Write the CPU version first, then give the GPU code
● Represent each matrix as a linear (1-D) array
● Consider using multiple blocks
● Consider matrix sizes that are not an integer multiple of the block size
Outline
1 Multithreading
2 CUDA Abstraction
3 Kernel Execution
4 Warp Scheduling
5 CUDA Toolchain
CUDA
●CUDA (Compute Unified Device Architecture) was developed by Nvidia around 2007
✓ 2–4 week learning curve for those with experience of OpenMP and MPI programming
✓ Large user community on the NVIDIA forums
●CUDA is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the GPU
CUDA
●CUDA as Parallel Computing Platform
✓ Language: CUDA C, based on C with some extensions (extensive C++ support)
✓ Editor: Eclipse / Visual Studio
✓ Compiler: nvcc (see the compile example after this list)
✓ SDK: CUDA Toolkit, libraries, samples
✓ Profiler & debugger: Nsight
●CUDA as Programming Model
✓ A software abstraction of the GPU hardware
✓ Independent of the OS, the CPU, and the particular Nvidia GPU
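
A typical nvcc invocation, shown here as a minimal sketch (the file and program names vector_add.cu and vector_add are placeholders):

nvcc -o vector_add vector_add.cu                # compile CUDA C source into an executable
nvcc -arch=sm_35 -o vector_add vector_add.cu    # optionally target a specific compute capability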
CUDA Abstraction
●CUDA Virtualizes the Physical Hardware
✓ A thread is a virtualized CUDA core (registers, PC, state)
✓ A block is a virtualized streaming multiprocessor (threads, shared memory)
●Scheduled onto Physical Hardware without Pre-emption
✓ Threads/blocks launch and run to completion/suspension
✓ Blocks should be independent
CUDA Abstraction
●Key Parallel Abstractions in CUDA
✓ Hierarchy of concurrent threads
✓ Shared memory model for cooperating threads
✓ Lightweight synchronization primitives
(Figure: a streaming multiprocessor — instruction L1 cache, instruction fetch/dispatch, eight SPs, two SFUs, and shared memory — with warps (t0 … t31) from several blocks resident at once.)
CUDA Abstraction
●Key Parallel Abstractions in CUDA
✓ Hierarchy of concurrent threads
✓ Shared memory model for cooperating threads
✓ Lightweight synchronization primitives
(Figure: host/device view — kernels launched from the host run as grids on the device (Grid 1, Grid 2); each grid contains blocks such as Block (1, 1).)
Thread Hierarchy
●Thread —> Block —> Grid

// Grid with a single block of 128 threads:
int main() {
    ……
    int *dev_a, *dev_b, *dev_c;
    ……
    addKernel<<<1, 128>>>(dev_c, dev_a, dev_b);
    ……
}

// Grid with 100 blocks of 128 threads each:
int main() {
    ……
    int *dev_a, *dev_b, *dev_c;
    ……
    addKernel<<<100, 128>>>(dev_c, dev_a, dev_b);
    ……
}
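
With 100 blocks, each thread must combine its block and thread indices to get a unique global element index. A minimal sketch (the length parameter n is an assumption added here, so the vector size need not be a multiple of the block size):

__global__ void addKernel(int *c, const int *a, const int *b, int n)
{
    // Global index of this thread across all blocks of the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) // guard: the last block may run past the end of the data
        c[i] = a[i] + b[i];
}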
Thread Hierarchy
●Thread Mapping
(Figure: thread mapping — a thread maps to a CUDA core; a thread block maps to a streaming multiprocessor with its shared memory (SMEM); the grid, many blocks of threads, maps to the whole GPU.)
Thread Hierarchy
(Figure: the host/device view again — Kernel 2 launched as Grid 2 on the device; a grid contains blocks such as Block (1, 1).)
CUDA Abstraction
●Key Parallel Abstractions in CUDA
✓ Hierarchy of concurrent threads
✓ Shared memory model for cooperating threads
✓ Lightweight synchronization primitives
(Figure: the (device) grid with Block (0, 0) and Block (1, 0), each with its own shared memory; the host and all blocks access global, constant, and texture memory.)
Memory Model
(Figure: a thread executes on a streaming processor; a thread block runs on a streaming multiprocessor with its SMEM; the grid — many blocks of threads — spans the GPU and its global memory.)
Memory Model
● Each thread can:
✓ Read/write per-thread registers
✓ Read/write per-thread local memory
(Figure: grid with Block (0, 0) and Block (1, 0), each with its own shared memory; constant memory is shared across the grid.)
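
A minimal sketch of how these memory spaces appear in CUDA C (the identifiers are illustrative, and blocks are assumed to have at most 128 threads):

__constant__ float coeff[16];             // constant memory: read-only in kernels, cached

__global__ void memorySpaces(float *out)  // out points into global memory
{
    float scratch[4];                     // local memory: a private per-thread array
    __shared__ float tile[128];           // shared memory: visible to the whole block

    scratch[0] = coeff[threadIdx.x % 16]; // read constant memory
    tile[threadIdx.x] = scratch[0];
    __syncthreads();                      // make shared-memory writes visible block-wide

    float r = tile[(threadIdx.x + 1) % blockDim.x]; // r lives in a register
    out[blockIdx.x * blockDim.x + threadIdx.x] = r; // write back to global memory
}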
Multithreading Review
●Hiding Latency Stalls
✓ Fetch → Decode → Execute → Memory → Writeback
✓ Execution alternates between “active” warps, with warps becoming temporarily “inactive” while waiting for data
✓ Lots of active warps is the key to high performance
(Figure: the warp-stall timeline again, now mapped onto the hardware — warps 1–4 time-share the CUDA cores of a streaming multiprocessor; the GPU holds many SMs with shared memory (SMEM) plus constant memory.)
CUDA Abstraction
●Key Parallel Abstractions in CUDA
✓ Hierarchy of concurrent threads
✓ Shared memory model for cooperating threads
✓ Lightweight synchronization primitives
Synchronization
●Global Synchronization
✓ Finish a kernel and start a new one
✓ All writes from all threads complete before a kernel finishes
step1<<<grid1,blk1>>>(...);
// The system ensures that all writes from step1 complete.
step2<<<grid2,blk2>>>(...);
Synchronization
●Threads Synchronization
✓ To ensure that threads access shared memory in the right order
✓ __syncthreads()

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum to global memory.
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Make sure that each thread reads the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums stored in result[0 .. gridDim.x-1].
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;
        }
    }
}
Synchronization
●Race Conditions
✓ What is the value of a in thread 0?
✓ What is the value of a in thread 127?
// vector[0] was equal to 0
// threadId:0           threadId:127
vector[0] += 5;         vector[0] += 1;
...                     ...
a = vector[0];          a = vector[0];
✓ CUDA provides atomic operations to deal with this problem
Synchronization
●Atomics
✓ An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes
✓ Different types of atomic instructions:
• atomic{Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And, Or, Xor}
✓ Atomics are slower than normal load/store
✓ You can have the whole machine queuing on a single location in memory
✓ More types in Fermi
✓ Atomics unavailable on G80!
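
A minimal sketch of how an atomic resolves the race from the previous slide (the kernel names are illustrative):

__global__ void racyAdd(int *vector)
{
    vector[0] += 1;           // race: the read-modify-write can interleave across threads
}

__global__ void safeAdd(int *vector)
{
    atomicAdd(&vector[0], 1); // atomic: every increment is applied exactly once
}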
Quiz
●Given an input grayscale image, compute its gray-level histogram; write the CPU version
● You may load a grayscale image with OpenCV, or simulate one with an unsigned char array gray
● Store the data in linear memory
● For now, assume the image width times height is no larger than 1024
Quiz
●Given an input grayscale image, write a GPU program that computes its gray-level histogram
● Replace the histogram updates with atomic operations
● Compare the results with the CPU version