GPU Architecture: National Tsing-Hua University 2017, Summer Semester
Memory hierarchy
[Figure: warp scheduling — e.g. warp 3 instruction 95, then warp 8 instruction 12, then warp 3 instruction 96; instructions from different warps are interleaved]
Parallel Programming – NTHU LSA Lab 6
Warp Divergence
What if different threads in a warp need to do different things?
This includes any flow-control instruction (if, switch, do, for, while)
if (foo(threadIdx.x)) {   // inside a warp
    do_A();
} else {
    do_B();
}
Different execution paths within a warp are serialized
The compiler emits predicated instructions, which are carried out only if a logical flag is true
All threads compute the logical predicate and execute both predicated instructions/statements
Potentially a large loss of performance
Avoid Diverging in a Warp
Example with divergence (branch granularity < warp size):
if (threadIdx.x > 2) {...}
else {...}
Threads 0-2 follow a different path from the remaining threads of the same warp
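One common fix, sketched below with an illustrative kernel (the name and the data operations are assumptions, not from the slides), is to branch on the warp index rather than the lane index, so the branch granularity becomes a whole multiple of the warp size:

```cuda
// All 32 threads of a warp compute the same branch condition,
// so no single warp is split across the two paths.
__global__ void no_divergence(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Condition depends only on the warp index, not on the lane within it:
    if ((threadIdx.x / warpSize) > 2) {
        data[i] *= 2.0f;      // whole warps take this path ...
    } else {
        data[i] += 1.0f;      // ... and whole warps take this one
    }
}
```

Both paths still exist in the program, but each warp executes exactly one of them, so nothing is serialized.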
Shared memory
Read/write, per-block
Performance similar to registers
Constant memory
Read-only, per-grid
Cached
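The two memory spaces above can be seen in a short kernel sketch (the kernel, symbol names, and polynomial computation are illustrative assumptions, not from the course):

```cuda
// Constant memory: read-only, per-grid, cached; one copy visible to every
// thread. Filled from the host with cudaMemcpyToSymbol(coeffs, src, sizeof(coeffs)).
__constant__ float coeffs[4];

__global__ void poly(const float *x, float *y, int n) {
    // Shared memory: read/write, per-block, near-register speed.
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = x[i];     // each thread stages one element
    }
    __syncthreads();                  // tile is now visible block-wide
    if (i < n) {
        float v = tile[threadIdx.x];
        // Every thread reads the same constant address: a cached broadcast.
        y[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
    }
}
```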
Memory Access
A store writes a line to L1
If that line is evicted, it is written back to L2
The line may also be evicted from L2 and written to DRAM (global memory)
A load requests the line from L1
If a hit, the operation is complete
If a miss, the line is requested from L2
If a miss in L2, the line is requested from DRAM (global memory)
Only GPU threads can access local memory addresses; local memory itself resides off-chip in device DRAM, not on-chip
Register spilling
Max number of registers per thread is 63
Local memory is used if the register limit is exceeded
Array variables are always allocated in local memory (DRAM)
Max amount of local memory per thread is 512KB
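A hedged sketch of how spilling is typically provoked and observed (the kernel is illustrative; `-Xptxas -v` is the standard nvcc flag that reports register, spill, and local-memory usage per kernel):

```cuda
// Compile with: nvcc -Xptxas -v spill_demo.cu
// The ptxas report shows "bytes spill stores/loads" and local-memory usage.
__global__ void spill_demo(float *out) {
    float scratch[256];               // too large for registers -> local memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < 256; ++k)
        scratch[k] = i * 0.5f + k;    // per-thread array lives in DRAM
    float sum = 0.0f;
    for (int k = 0; k < 256; ++k)
        sum += scratch[k];
    out[i] = sum;
}
```

Because local memory is off-chip, spilled accesses cost roughly as much as global-memory accesses, which is why staying under the register limit matters.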