10 Cuda Dgemm Tiled

Member of the Helmholtz Association
DGEMM
Tiled matrix multiplication with CUDA
Jochen Kreutz (JSC)
GPU Programming with CUDA @ JSC, 24-26 April 2017

Overview
• Tiled matrix multiplication algorithm
• CUDA implementation with and without streams
• Using multi-GPUs and streams

Tiled Matrix Multiplication
[Figure: input matrix A, input matrix B and result matrix C, each split into a grid of tiles]
• Split matrices into tiles
• Allows for distributing work onto different streams (and GPUs)
• Do partial (block-wise) computation
• Sum up partial results
• Change the order of computations and run over all tiles of the result matrix in an inner loop
• Do the first computations for all tiles in the result matrix, then repeat with the next tiles of the input matrices
• Allows for concurrency in the computation of tiles in C
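Tiled Matrix Multiplication - Implementation
Loop over tiles; the kernel function is called in the innermost loop:

// loop over inner tile dimension
for ( int iktile = 0; iktile < ntiles; iktile++ ) {
    // loop over row tiles
    for ( int irowtile = 0; irowtile < ntiles; irowtile++ ) {
        // loop over column tiles
        for ( int icoltile = 0; icoltile < ntiles; icoltile++ ) {
            // kernel function: multiply tile (irowtile, iktile) of A by
            // tile (iktile, icoltile) of B and add to tile (irowtile, icoltile) of C
        }
    }
}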
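To make the loop order concrete, here is a minimal host-only sketch in plain C++. It is not the course solution: `gemm_tile_accumulate` and `tiled_matmul` are hypothetical names, a naive triple loop stands in for the cublasDgemm call, and the matrices are assumed to be square, row-major and evenly divisible into tiles (cuBLAS itself expects column-major data).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for cublasDgemm with alpha = 1 and beta = 1:
// C_tile += A_tile * B_tile for t x t tiles that live inside row-major
// matrices whose full row length is ld.
void gemm_tile_accumulate(const double* a, const double* b, double* c,
                          std::size_t t, std::size_t ld) {
    for (std::size_t i = 0; i < t; ++i)
        for (std::size_t k = 0; k < t; ++k)
            for (std::size_t j = 0; j < t; ++j)
                c[i * ld + j] += a[i * ld + k] * b[k * ld + j];
}

// Tiled product C = A * B for row-major n x n matrices (assumption: n % tile == 0).
std::vector<double> tiled_matmul(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::size_t n, std::size_t tile) {
    assert(tile > 0 && n % tile == 0);
    assert(A.size() == n * n && B.size() == n * n);
    const std::size_t ntiles = n / tile;
    std::vector<double> C(n * n, 0.0);  // must start at 0: every step accumulates

    // loop over inner tile dimension
    for (std::size_t iktile = 0; iktile < ntiles; ++iktile) {
        // loop over row tiles
        for (std::size_t irowtile = 0; irowtile < ntiles; ++irowtile) {
            // loop over column tiles
            for (std::size_t icoltile = 0; icoltile < ntiles; ++icoltile) {
                const double* a = &A[(irowtile * tile) * n + iktile * tile];
                const double* b = &B[(iktile * tile) * n + icoltile * tile];
                double* c = &C[(irowtile * tile) * n + icoltile * tile];
                gemm_tile_accumulate(a, b, c, tile, n);
            }
        }
    }
    return C;
}
```

For a fixed `iktile`, every iteration of the two inner loops writes to a different tile of C, so those iterations could be handed to different streams.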

• The tiled approach makes it possible to operate on large matrices that would not fit into GPU memory as a whole
• For each step, only 3 tiles have to be present on the device
• Use pinned memory for tiles to do asynchronous host-to-device copies and speed up data transfers
• Set beta to 1 in the cublasDgemm call to reuse previously calculated results
DGEMM
C := alpha*op(A)*op(B) + beta*C
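The memory argument can be checked with a little arithmetic. The sketch below uses hypothetical helper names and assumes square tiles of double-precision values; the sizes in the note afterwards are illustrative and not taken from the course material.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one square tile of doubles with tile_dim rows and columns.
std::uint64_t tile_bytes(std::uint64_t tile_dim) {
    assert(tile_dim > 0);
    return tile_dim * tile_dim * sizeof(double);
}

// Device memory needed per step: one tile each of A, B and C.
std::uint64_t device_bytes_per_step(std::uint64_t tile_dim) {
    return 3 * tile_bytes(tile_dim);
}
```

For example, a single 65536 x 65536 matrix of doubles takes 32 GiB, whereas three 4096 x 4096 tiles take 384 MiB in total.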

• Workflow:
  • Init data (elements of result matrix C have to be set to 0)
  • Loop over tiles in the input matrices and over tiles in C
    1. Read input data (3 tiles) from the global matrices in host memory into pinned buffers
    2. Transfer the 3 relevant tiles to the device
    3. Call cublasDgemm with beta = 1
    4. Read back the results from the device into the pinned buffer
    5. Write back the temporary results (1 tile) from the pinned host buffer to the global result matrix in host memory
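The five steps can be traced in a host-only sketch. Everything here is a stand-in chosen for illustration: ordinary `std::vector` buffers play the role of the pinned host buffers and of the device buffers, plain assignments replace the host/device transfers, and the hypothetical `dgemm_beta1` replaces cublasDgemm with alpha = 1 and beta = 1. Matrices are assumed square, row-major and evenly divisible into tiles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major storage

// Copy tile (trow, tcol) of the full n x n matrix into a contiguous t x t buffer.
void read_tile(const Matrix& full, std::size_t n, std::size_t t,
               std::size_t trow, std::size_t tcol, Matrix& buf) {
    for (std::size_t i = 0; i < t; ++i)
        for (std::size_t j = 0; j < t; ++j)
            buf[i * t + j] = full[(trow * t + i) * n + tcol * t + j];
}

// Copy a contiguous t x t buffer back into tile (trow, tcol) of the full matrix.
void write_tile(Matrix& full, std::size_t n, std::size_t t,
                std::size_t trow, std::size_t tcol, const Matrix& buf) {
    for (std::size_t i = 0; i < t; ++i)
        for (std::size_t j = 0; j < t; ++j)
            full[(trow * t + i) * n + tcol * t + j] = buf[i * t + j];
}

// Stand-in for cublasDgemm with alpha = 1, beta = 1: c += a * b.
void dgemm_beta1(const Matrix& a, const Matrix& b, Matrix& c, std::size_t t) {
    for (std::size_t i = 0; i < t; ++i)
        for (std::size_t k = 0; k < t; ++k)
            for (std::size_t j = 0; j < t; ++j)
                c[i * t + j] += a[i * t + k] * b[k * t + j];
}

Matrix tiled_workflow(const Matrix& A, const Matrix& B, std::size_t n, std::size_t t) {
    assert(t > 0 && n % t == 0);
    assert(A.size() == n * n && B.size() == n * n);
    const std::size_t ntiles = n / t;
    Matrix C(n * n, 0.0);                           // init: C = 0
    Matrix hostA(t * t), hostB(t * t), hostC(t * t);  // stand-ins for pinned buffers
    Matrix devA(t * t), devB(t * t), devC(t * t);     // stand-ins for device buffers

    for (std::size_t k = 0; k < ntiles; ++k)
        for (std::size_t r = 0; r < ntiles; ++r)
            for (std::size_t c = 0; c < ntiles; ++c) {
                // 1. read the 3 tiles from the full matrices into the host buffers
                read_tile(A, n, t, r, k, hostA);
                read_tile(B, n, t, k, c, hostB);
                read_tile(C, n, t, r, c, hostC);
                // 2. "transfer" the 3 tiles to the device (plain copies here)
                devA = hostA; devB = hostB; devC = hostC;
                // 3. multiply and accumulate (beta = 1 keeps the previous partial sum)
                dgemm_beta1(devA, devB, devC, t);
                // 4. read the result tile back into the host buffer
                hostC = devC;
                // 5. write the result tile back into the full result matrix
                write_tile(C, n, t, r, c, hostC);
            }
    return C;
}
```

The C tile travels to the device in step 2 because beta = 1 adds the new product to whatever partial sum is already stored there.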

[Figure sequence, one frame per workflow step: host memory holding the full matrices and the pinned host buffers, and global GPU memory holding the tiles. Frame captions: Step 1, Step 2, Step 3 (A * B = C on the GPU), Step 4, Step 5, Repeat steps 1 to 5.]
Exercise: task1
.../exercises/tasks/Cuda_DGEMM_tiled.cu

Tiled Matrix Multiplication – Using Streams
• Distribute computation of tiles to different streams
• Use asynchronous data transfers to overlap kernel executions and memory copies
• Unnecessary data movement can be hidden behind computation, which keeps the implementation simple
• Each stream will use its own tile buffers (multi-buffering)
• Synchronization will be necessary
• Example: 3 streams
[Figure: pinned host memory with one set of tile buffers per stream; GPU buffers for Stream 1, Stream 2 and Stream 3]
• For every tile:
  • H2D data transfer
  • Kernel execution (dgemm)
  • D2H data transfer
[Timeline: Stream 1: H2D, Kernel, D2H]
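One possible way to organize the host-side issue order is modelled below in plain C++, without any CUDA calls. The model rests on assumptions that the slides do not spell out: work items are dealt to the streams round-robin, and a stream is synchronized (and its result written back) right before its buffers are reused and once more at the end. `Op` and `issue_order` are hypothetical names. In a real implementation the "H2D" and "D2H" entries would presumably be cudaMemcpyAsync calls, "Kernel" a cublasDgemm call on a handle bound to the stream with cublasSetStream, and "Sync" a cudaStreamSynchronize.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One host-side action: which stream, which work item (tile), and what is done.
struct Op {
    int stream;
    int tile;
    std::string kind;  // "H2D", "Kernel", "D2H", "Sync" or "WriteBack"
};

// Order in which the host issues work for nitems tiles on nstreams streams.
std::vector<Op> issue_order(int nitems, int nstreams) {
    assert(nitems >= 0 && nstreams > 0);
    const char* async_kinds[] = {"H2D", "Kernel", "D2H"};
    // pending[s]: tile whose result still sits in the buffers of stream s (-1: none)
    std::vector<int> pending(nstreams, -1);
    std::vector<Op> ops;

    for (int tile = 0; tile < nitems; ++tile) {
        const int s = tile % nstreams;  // round-robin assignment (assumption)
        if (pending[s] >= 0) {
            // The buffers of stream s are about to be reused: wait for the
            // stream, then copy its finished result into the full matrix.
            ops.push_back(Op{s, pending[s], "Sync"});
            ops.push_back(Op{s, pending[s], "WriteBack"});
        }
        for (const char* kind : async_kinds)
            ops.push_back(Op{s, tile, kind});
        pending[s] = tile;
    }
    // Drain: collect the last result of every stream that did any work.
    for (int s = 0; s < nstreams; ++s) {
        if (pending[s] >= 0) {
            ops.push_back(Op{s, pending[s], "Sync"});
            ops.push_back(Op{s, pending[s], "WriteBack"});
        }
    }
    return ops;
}
```

With 3 streams, the first three tiles are queued without any waiting, so the transfers of one stream can overlap with the kernel of another; the first "Sync" appears only when the fourth tile needs the buffers of the first stream again.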


[Figure sequence: the three streams work on different tiles at the same time, for example Stream 1 in its D2H transfer, Stream 2 running the kernel (A * B = C) and Stream 3 in its H2D transfer]
• Synchronization is needed before writing results back to the global matrix in host memory

Exercise: task2
.../exercises/tasks/Cuda_DGEMM_tiled_streams.cu

Tiled Matrix Multiplication – Using Multi-GPUs with Streams
• Use all GPUs within a node
• Each GPU uses several streams
• First fill all streams of a GPU, then move to the next GPU
• Example: 2 GPUs, 3 streams
[Figure: pinned host memory feeding GPU buffers for Stream 1, Stream 2 and Stream 3 on Device 0 and on Device 1]
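The rule "first fill all streams of a GPU, then move to the next GPU" can be read as a simple index mapping. The sketch below is one such reading, not necessarily the scheme used in the exercise: `Slot` and `slot_for` are hypothetical names, devices and streams are numbered from 0, and the assignment wraps around once every stream on every GPU has work. In a CUDA program the device would be selected with cudaSetDevice before work is issued to one of its streams.

```cpp
#include <cassert>

// Target of one work item: which GPU and which stream on that GPU.
struct Slot {
    int device;
    int stream;
};

// Map the running work-item index to a (device, stream) pair.
// All streams of device 0 are filled first, then those of device 1, and so on;
// after ngpus * nstreams items the assignment starts over.
Slot slot_for(int item, int ngpus, int nstreams) {
    assert(item >= 0 && ngpus > 0 && nstreams > 0);
    const int flat = item % (ngpus * nstreams);
    return Slot{flat / nstreams, flat % nstreams};
}
```

With 2 GPUs and 3 streams, items 0 to 2 go to streams 0 to 2 of device 0, items 3 to 5 to streams 0 to 2 of device 1, and item 6 returns to stream 0 of device 0.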
Exercise: task3
.../exercises/tasks/Cuda_DGEMM_tiled_streams_multigpu.cu
