Data parallelism: GPU computing
CUDA: efficient programming
Prof. Marco Bertini
Basic problem: matrix multiplication
• When performing a matrix multiplication, each element of the output matrix P is an inner product of a row of M and a column of N.
• Each thread will compute a single value of the output matrix.
• We organize threads as blocks.
[Figure: M, N and P are WIDTH x WIDTH matrices; a BLOCK_WIDTH x BLOCK_WIDTH block of threads computes the P elements identified by Row and Col]
A solution
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
}
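The kernel body was not captured above; a minimal sketch of what the usual one-thread-per-output-element solution looks like (row-major 1D indexing, names as in the slides) is:

  __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // each thread computes one element of P
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
      float Pvalue = 0;
      // inner product of row 'Row' of M and column 'Col' of N
      for (int k = 0; k < Width; ++k)
        Pvalue += M[Row*Width + k] * N[k*Width + Col];
      P[Row*Width + Col] = Pvalue;
    }
  }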
Toy example visualization
[Figure: 4x4 toy example with BLOCK_WIDTH = 2. M, N and P are 4x4 matrices; P is divided into four 2x2 blocks, Block(0,0) to Block(1,1). The four threads of Block(0,0), Thread(0,0) to Thread(1,1), compute P0,0, P0,1, P1,0 and P1,1 from rows 0-1 of M and columns 0-1 of N.]
Reduce memory traffic: tiling
Tiling and CUDA memories
• Recall the tradeoffs of CUDA memories: global memory is large but slow, on-chip (shared) memory is small but fast.
[Figure: bad pattern - Thread 1 and Thread 2 each fetch the same data from global memory; good pattern - the data is loaded from global memory once and shared through on-chip memory]
Memory access in matrix multiplication
[Figure: accesses of the block(0,0) threads to the elements of M and N in the 4x4 example]
• Each thread accesses four elements of M and four elements of N during its execution. Among the four threads highlighted, there is a significant overlap in the M and N elements they access. For example, thread0,0 and thread0,1 both access M0,0 as well as the rest of row 0 of M.
• If we can somehow manage to have thread0,0 and thread0,1 collaborate so that these M elements are only loaded from global memory once, we can reduce the total number of accesses to the global memory by half.
• In fact, we can see that every M and N element is accessed exactly twice during the execution of block0,0.
• Have the multiple threads access their data from the on-chip memory.
• The tile is of BLOCK_SIZE elements in each dimension.
• All threads in a block participate.
• Each thread loads one M element and one N element in the tiled code.
[Figure: a BLOCK_WIDTH x BLOCK_WIDTH tile of M and of N brought on chip to compute the P elements at Row, Col]
Tiling phases
[Figure: Phase 0 loads for Block (0,0) - the first tiles of M and N copied into the shared memory arrays Mds and Nds]
• In general, if an input matrix is of dimension Width and the tile size is TILE_WIDTH, the dot product is performed in Width/TILE_WIDTH phases.
• The creation of these phases is key to the reduction of accesses to the global memory.
• With each phase focusing on a small subset of the input matrix values, the threads can collaboratively load the subset into the shared memory and use the values in the shared memory to satisfy their overlapping input needs in the phase.
• Shared memory allows each value to be accessed by multiple threads.
• Mds and Nds are re-used to hold the input values in different phases, reducing the amount of shared memory needed.
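A quick sanity check with the toy example (my numbers, following the slide's setup): with Width = 4 and TILE_WIDTH = 2 there are 4/2 = 2 phases. Without tiling, each of the 4 threads of block0,0 reads 4 M elements and 4 N elements, i.e. 32 global loads. With tiling, in each phase the block loads one 2x2 tile of M and one 2x2 tile of N (8 loads per phase, 16 in total), so global traffic is cut by a factor of TILE_WIDTH = 2. In general, tiling reduces global memory traffic by a factor of TILE_WIDTH.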
Barrier Synchronization
• __syncthreads() synchronizes all threads in a block.
• No thread proceeds past the barrier until every thread in the block has reached it.
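A minimal sketch of the pattern (a hypothetical kernel, not from the slides, that reverses a 256-element array within one block), showing why the barrier is needed between writing and reading shared memory:

  __global__ void reverseBlock(float* d) {
    __shared__ float s[256];   // one element per thread; blockDim.x assumed to be 256
    int t = threadIdx.x;
    s[t] = d[t];               // every thread writes one shared element
    __syncthreads();           // wait until all writes are visible...
    d[t] = s[255 - t];         // ...before reading an element written by another thread
  }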
Loading Input Tile 0 of M (Phase 0)
• 2D indexing for accessing Tile 0:
  • M[Row][tx]
  • N[ty][Col]
[Figure: tile 0 of M highlighted in the M, N, P layout]
Loading Input Tile 0 of N (Phase 0)
• Have each thread load an M element and an N element at the same relative position as its P element.
• 2D indexing for accessing Tile 0:
  • M[Row][tx]
  • N[ty][Col]
[Figure: tile 0 of N highlighted in the M, N, P layout]
Loading Input Tile 1 of M (Phase 1)
[Figure: tile 1 of M highlighted in the M, N, P layout; indexing M[Row][1*TILE_WIDTH+tx]]
Loading Input Tile 1 of N (Phase 1)
[Figure: tile 1 of N highlighted in the M, N, P layout; indexing N[1*TILE_WIDTH+ty][Col]]
Allocating M and N
• M and N can be allocated dynamically, using 1D row-major indexing:
  • M[Row][ph*TILE_WIDTH+tx] becomes M[Row*Width + ph*TILE_WIDTH+tx]
  • N[ph*TILE_WIDTH+ty][Col] becomes N[(ph*TILE_WIDTH+ty)*Width + Col]
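A host-side sketch of this dynamic 1D allocation (my example; error checking omitted), assuming square Width x Width matrices stored row-major and host buffers h_M, h_N already filled:

  float *d_M, *d_N, *d_P;
  size_t bytes = (size_t)Width * Width * sizeof(float);
  cudaMalloc((void**)&d_M, bytes);   // M, N and P are flat 1D device arrays
  cudaMalloc((void**)&d_N, bytes);
  cudaMalloc((void**)&d_P, bytes);
  cudaMemcpy(d_M, h_M, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_N, h_N, bytes, cudaMemcpyHostToDevice);
  // element (r, c) of M is d_M[r*Width + c]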
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
  // TILE_WIDTH is a compile-time constant, e.g. #define TILE_WIDTH 16
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  int tx = threadIdx.x;  int ty = threadIdx.y;
  int Row = blockIdx.y * TILE_WIDTH + ty;
  int Col = blockIdx.x * TILE_WIDTH + tx;
  float Pvalue = 0;
  // Loop over the M and N tiles required to compute the P element: iterates over phases
  for (int ph = 0; ph < Width/TILE_WIDTH; ++ph) {
    // Collaborative loading of M and N tiles into shared memory
    Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
    Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();   // assures that all threads have loaded the data into shared mem.
    for (int i = 0; i < TILE_WIDTH; ++i)
      Pvalue += Mds[ty][i] * Nds[i][tx];
    __syncthreads();   // assures that all threads have finished using the data in shared mem.
  }
  P[Row*Width+Col] = Pvalue;
}
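A host-side launch sketch for this kernel (my example; it assumes, as the kernel does, that Width is a multiple of TILE_WIDTH):

  dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // one thread per P element of a tile
  dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // one block per output tile
  MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
  cudaDeviceSynchronize();                                // wait for completion before using d_P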
Tile (Thread Block) Size Considerations
• Each thread block should have many threads.
  • For TILE_WIDTH = 16, each block has 16*16 = 256 threads and uses 2*256*4B = 2KB of shared memory.
• For 16KB of shared memory per SM, one can potentially have up to 8 thread blocks executing.
  • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block).
• The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only 2 thread blocks to be active at the same time.
  • However, the limit of 1,536 threads per SM in current-generation GPUs reduces the number of blocks per SM to one (a single 32x32 block already has 1,024 threads).
• Each __syncthreads() can reduce the number of active threads for a block.
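These limits depend on the device; a small sketch (my example) that queries the actual values with the CUDA runtime API:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Shared memory per block: %zu B\n", prop.sharedMemPerBlock);
    printf("Shared memory per SM:    %zu B\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
  }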
Phase 0 / Phase 1 Use for Block (0,0)
[Figure: 3x3 example with TILE_WIDTH = 2 - the M elements held in shared memory and the P elements of block(0,0) they update in each iteration. Threads that do not calculate a valid P element still participate in loading tile elements such as N[1,2].]
Boundary Condition for Input M Tile
• Each thread loads
  • M[Row][p*TILE_WIDTH+tx]
  • M[Row*Width + p*TILE_WIDTH+tx]
• Need to test
  • (Row < Width) && (p*TILE_WIDTH+tx < Width)
  • Else, load 0
[Figure: a TILE_WIDTH x TILE_WIDTH tile A of M straddling the matrix boundary]
Boundary Condition for Input N Tile
• Each thread loads
  • N[p*TILE_WIDTH+ty][Col]
  • N[(p*TILE_WIDTH+ty)*Width + Col]
• Need to test
  • (p*TILE_WIDTH+ty < Width) && (Col < Width)
  • Else, load 0
[Figure: a TILE_WIDTH x TILE_WIDTH tile B of N straddling the matrix boundary]
Loading Elements – with boundary check
  for (int p = 0; p < (Width-1) / TILE_WIDTH + 1; ++p) {
    if (Row < Width && p*TILE_WIDTH+tx < Width) {
      ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    } else {
      ds_M[ty][tx] = 0.0;
    }
    if (p*TILE_WIDTH+ty < Width && Col < Width) {
      ds_N[ty][tx] = N[(p*TILE_WIDTH+ty)*Width + Col];
    } else {
      ds_N[ty][tx] = 0.0;
    }
    __syncthreads();
Inner Product – Before and After
    if (Row < Width && Col < Width)
      for (int i = 0; i < TILE_WIDTH; ++i) Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
  } /* end of phase loop */
  if (Row < Width && Col < Width) P[Row*Width + Col] = Pvalue;
} /* end of kernel */
Some Important Points
• For each thread the conditions are different for:
  • Loading an M element
  • Loading an N element