10 Cuda Dgemm Tiled
DGEMM
Tiled matrix multiplication with CUDA
Jochen Kreutz (JSC)
Loop over tiles
Kernel function
// loop over inner tile dimension
for ( int iktile = 0; iktile < ntiles; iktile++ ) {
    // loop over row tiles
    for ( int irowtile = 0; irowtile < ntiles; irowtile++ ) {
        // loop over column tiles
        for ( int icoltile = 0; icoltile < ntiles; icoltile++ ) {
            // kernel function is called here for the current tiles
        }
    }
}
DGEMM
C := alpha*op(A)*op(B) + beta*C
Workflow:
Init data (elements of result matrix C have to be set to 0)
Loop over tiles in input matrices and over tiles in C
Kernel function
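The loop nest and the workflow can be sketched on the host as follows. This is a minimal CPU-only C++ sketch, not the exercise solution: `tile_kernel` and `dgemm_tiled` are hypothetical names, the matrices are assumed to be square, row-major, and of a size that is a multiple of the tile size, and op(A) = A, op(B) = B is assumed. Each tile update is itself a small DGEMM with beta = 1, which is why the elements of C have to start at 0.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the GPU kernel: multiplies one tile of A with one
// tile of B and accumulates into the matching tile of C (beta = 1 on the tile).
// All matrices are row-major with leading dimension n.
void tile_kernel(int n, int tile, double alpha,
                 const double* A, const double* B, double* C,
                 int row0, int col0, int k0) {
    for (int i = row0; i < row0 + tile; ++i) {
        for (int j = col0; j < col0 + tile; ++j) {
            double sum = 0.0;
            for (int k = k0; k < k0 + tile; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] += alpha * sum;
        }
    }
}

// Computes C = alpha * A * B tile by tile, using the loop order from the slide.
// Assumptions: n is a multiple of tile, and C is all zeros on entry.
void dgemm_tiled(int n, int tile, double alpha,
                 const std::vector<double>& A,
                 const std::vector<double>& B,
                 std::vector<double>& C) {
    assert(tile > 0 && n % tile == 0);
    const int ntiles = n / tile;
    // loop over inner tile dimension
    for (int iktile = 0; iktile < ntiles; iktile++) {
        // loop over row tiles
        for (int irowtile = 0; irowtile < ntiles; irowtile++) {
            // loop over column tiles
            for (int icoltile = 0; icoltile < ntiles; icoltile++) {
                tile_kernel(n, tile, alpha, A.data(), B.data(), C.data(),
                            irowtile * tile, icoltile * tile, iktile * tile);
            }
        }
    }
}
```

With n = 4 and tile = 2, the loops make 2 * 2 * 2 = 8 tile-kernel calls, and every tile of C is updated twice, once per inner tile index. In a GPU version, the body of the innermost loop is where the tiles would be transferred and the kernel launched.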
[Figure: GPU with global GPU memory]
GPU Programming with CUDA @ JSC, 24. - 26. April 2017
Tiled Matrix Multiplication – Implementation
Step 1
[Figure: input matrix A, input matrix B, result matrix C; GPU with global GPU memory]
Step 2
[Figure: input matrix A, input matrix B, result matrix C; GPU with global GPU memory]
Step 3
[Figure: input matrix A, input matrix B, result matrix C; A * B = C in global GPU memory]
Step 4
[Figure: input matrix A, input matrix B, result matrix C; GPU with global GPU memory]
Step 5
[Figure: input matrix A, input matrix B, result matrix C; GPU with global GPU memory]
Repeat steps 1 to 5
[Figure: input matrix A, input matrix B, result matrix C; GPU with global GPU memory]
Exercise: task1
.../exercises/tasks/Cuda_DGEMM_tiled.cu
Each stream will use its own tile buffers (multi buffering)
[Figure: pinned host memory; GPU with Stream 1, Stream 2, Stream 3 and GPU buffers]
[Figure: Stream 1 computes A * B = C in the GPU buffers]
[Figure: Stream 1: D2H (device-to-host copy); Stream 2: kernel, A * B = C; Stream 3: H2D (host-to-device copy)]
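The overlap of copies and kernels across streams can be modelled with a small host-only C++ sketch. It is an idealized assumption rather than a measurement: tile i is issued to stream i % nstreams, each of the three stages (H2D, kernel, D2H) takes exactly one time step, and tile i starts at step i. `Stage` and `stage_at` are hypothetical names, and streams are numbered from 0 here instead of from 1.

```cpp
#include <cassert>

// Stages a tile passes through on its stream, in order.
enum class Stage { Idle, H2D, Kernel, D2H };

// Returns the stage that `stream` executes at time `step` in the idealized
// pipeline. Assumptions: tile i runs on stream i % nstreams, starts at step i,
// and each stage lasts one step. At least 3 streams are required so that two
// tiles never occupy the same stream at the same time.
Stage stage_at(int step, int stream, int nstreams, int ntiles) {
    assert(nstreams >= 3);
    for (int offset = 0; offset < 3; ++offset) {
        const int tile = step - offset;  // tile that started `offset` steps ago
        if (tile >= 0 && tile < ntiles && tile % nstreams == stream) {
            if (offset == 0) return Stage::H2D;
            if (offset == 1) return Stage::Kernel;
            return Stage::D2H;
        }
    }
    return Stage::Idle;
}
```

With three streams, step 2 of this model reproduces the situation in the last figure: stream 0 is in D2H, stream 1 runs the kernel, and stream 2 is in H2D. Real overlap depends on the hardware and on how long each copy and kernel actually takes.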
Exercise: task2
.../exercises/tasks/Cuda_DGEMM_tiled_streams.cu
[Figure: pinned host memory; Device 0 and Device 1, each with Stream 1, Stream 2 and Stream 3; GPU buffers]
Exercise: task3
.../exercises/tasks/Cuda_DGEMM_tiled_streams_multigpu.cu