
Member of the Helmholtz Association

DGEMM
Tiled matrix multiplication with CUDA
Jochen Kreutz (JSC)

GPU Programming with CUDA @ JSC, 24. - 26. April 2017


Overview

- Tiled matrix multiplication algorithm
- CUDA implementation with and without streams
- Using multiple GPUs with streams


Tiled Matrix Multiplication

[Figure: input matrices A and B and result matrix C, each split into a grid of tiles]

- Split the matrices into tiles
- This allows distributing the work across different streams (and GPUs)


Tiled Matrix Multiplication

[Figure: one tile of A times one tile of B contributes a partial result to a tile of C]

- Do a partial (block-wise) computation
- Sum up the partial results




Tiled Matrix Multiplication

- Change the order of the computation and run over all tiles of the result matrix in an inner loop
- Do the first partial computation for all tiles of the result matrix, then repeat with the next tiles of the input matrices
- This allows the tiles of C to be computed concurrently


Tiled Matrix Multiplication - Implementation

The kernel function is called inside a loop over the tiles:

// loop over inner tile dimension
for ( int iktile = 0; iktile < ntiles; iktile++ ) {

  // loop over row tiles
  for ( int irowtile = 0; irowtile < ntiles; irowtile++ ) {

    // loop over column tiles
    for ( int icoltile = 0; icoltile < ntiles; icoltile++ ) {
      // kernel call for the current tiles goes here
    }
  }
}
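The loop order above can be exercised on the CPU. The following is a minimal, self-contained C sketch (matrix size, tile size and helper names are illustrative, not from the course code) that runs the three tile loops in exactly this order and accumulates partial products into C:

```c
#include <assert.h>
#include <string.h>

#define N 8          /* matrix dimension (small, for the demo) */
#define TILE 4       /* tile edge length */
#define NTILES (N / TILE)

/* C += A * B restricted to one (irowtile, icoltile) tile of C and the
   iktile-th tiles of A and B; matrices are row-major N x N. */
static void tile_dgemm(const double *A, const double *B, double *C,
                       int iktile, int irowtile, int icoltile) {
  for (int i = 0; i < TILE; i++)
    for (int j = 0; j < TILE; j++) {
      double sum = 0.0;
      for (int k = 0; k < TILE; k++)
        sum += A[(irowtile*TILE + i)*N + iktile*TILE + k] *
               B[(iktile*TILE + k)*N + icoltile*TILE + j];
      C[(irowtile*TILE + i)*N + icoltile*TILE + j] += sum;  /* beta = 1 */
    }
}

static void tiled_matmul(const double *A, const double *B, double *C) {
  memset(C, 0, N * N * sizeof(double));   /* C must start at 0 */
  /* inner tile dimension outermost, as on the slide */
  for (int iktile = 0; iktile < NTILES; iktile++)
    for (int irowtile = 0; irowtile < NTILES; irowtile++)
      for (int icoltile = 0; icoltile < NTILES; icoltile++)
        tile_dgemm(A, B, C, iktile, irowtile, icoltile);
}
```

Because each tile update only adds into C, running the iktile loop outermost produces the same result as the classical loop order, which is what makes the per-tile work independent.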


Tiled Matrix Multiplication - Implementation

- The tiled approach allows operating on large matrices that would not fit into GPU memory as a whole
- At each step only 3 tiles have to be present on the device
- Use pinned memory for the tiles to enable asynchronous host-to-device copies and to speed up data transfers
- Set beta to 1 in the cublasDgemm call to reuse previously calculated results

DGEMM:
C := alpha*op(A)*op(B) + beta*C
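The role of beta can be seen with a tiny CPU stand-in for the DGEMM update (a minimal sketch, not the cuBLAS implementation): with beta = 1 each call adds its product onto the existing C, so successive per-tile calls accumulate; with beta = 0 each call would overwrite C instead.

```c
#include <assert.h>

/* Reference row-major DGEMM update C := alpha*A*B + beta*C for n x n
   matrices (no transposes), illustrating why beta = 1 accumulates. */
static void dgemm_ref(int n, double alpha, const double *A, const double *B,
                      double beta, double *C) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      double sum = 0.0;
      for (int k = 0; k < n; k++) sum += A[i*n+k] * B[k*n+j];
      C[i*n+j] = alpha * sum + beta * C[i*n+j];
    }
}
```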


Tiled Matrix Multiplication - Implementation

- Workflow of the kernel function:
  - Initialize the data (the elements of the result matrix C have to be set to 0)
  - Loop over the tiles of the input matrices and over the tiles of C:
    1. Read the input data (3 tiles) from the global matrices into pinned buffers
    2. Transfer the 3 relevant tiles to the device
    3. Call cublasDgemm with beta = 1
    4. Read the results back from the device into a pinned buffer
    5. Write the temporary results (1 tile) from the pinned host buffer back to the global result matrix in host memory
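The five steps can be sketched as a loop body (hedged: the buffer names and the gather/scatter helpers are illustrative, not from the course code; h_a, h_b, h_c are pinned host buffers from cudaMallocHost, d_a, d_b, d_c device buffers, each holding tilesize*tilesize doubles, and the tile layout must match cuBLAS's column-major convention):

```cuda
for (int iktile = 0; iktile < ntiles; iktile++) {
  for (int irowtile = 0; irowtile < ntiles; irowtile++) {
    for (int icoltile = 0; icoltile < ntiles; icoltile++) {
      // 1. gather the 3 tiles from the global host matrices into pinned buffers
      gather_tile(A, h_a, irowtile, iktile);
      gather_tile(B, h_b, iktile, icoltile);
      gather_tile(C, h_c, irowtile, icoltile);
      // 2. transfer the tiles to the device
      cudaMemcpy(d_a, h_a, tilebytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_b, h_b, tilebytes, cudaMemcpyHostToDevice);
      cudaMemcpy(d_c, h_c, tilebytes, cudaMemcpyHostToDevice);
      // 3. partial product; beta = 1 accumulates into the C tile
      const double alpha = 1.0, beta = 1.0;
      cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  tilesize, tilesize, tilesize,
                  &alpha, d_a, tilesize, d_b, tilesize,
                  &beta, d_c, tilesize);
      // 4. read the result tile back into the pinned buffer
      cudaMemcpy(h_c, d_c, tilebytes, cudaMemcpyDeviceToHost);
      // 5. scatter the tile back into the global result matrix
      scatter_tile(h_c, C, irowtile, icoltile);
    }
  }
}
```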


Tiled Matrix Multiplication – Implementation

[Figure: the tiles of A, B and C in host memory are staged through pinned host buffers into global GPU memory]
Tiled Matrix Multiplication – Implementation

[Figures for steps 1-5: (1) the 3 tiles are gathered into pinned host memory, (2) transferred to global GPU memory, (3) A * B = C is computed on the GPU, (4) the result tile is transferred back to pinned host memory, (5) written back to the global result matrix; steps 1 to 5 are then repeated for the next tiles]
Exercise: task1

.../exercises/tasks/Cuda_DGEMM_tiled.cu



Tiled Matrix Multiplication – Using Streams

- Distribute the computation of the tiles across different streams
- Use asynchronous data transfers to overlap kernel executions and memory copies
- Data movement can largely be hidden behind computation
- Each stream uses its own tile buffers (multi-buffering)
- Synchronization is necessary
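The stream handling above can be sketched as follows (a hedged fragment with illustrative names: each stream s owns its own pinned host buffers h_a[s], h_b[s], h_c[s] and device buffers d_a[s], d_b[s], d_c[s], so transfers and dgemm calls for different tiles can overlap):

```cuda
cudaStream_t stream[NSTREAMS];
for (int s = 0; s < NSTREAMS; s++) cudaStreamCreate(&stream[s]);

const double alpha = 1.0, beta = 1.0;
int s = tile_index % NSTREAMS;        // round-robin assignment of tiles to streams
cublasSetStream(handle, stream[s]);   // the dgemm call runs in this stream

// asynchronous copies require pinned host buffers
cudaMemcpyAsync(d_a[s], h_a[s], tilebytes, cudaMemcpyHostToDevice, stream[s]);
cudaMemcpyAsync(d_b[s], h_b[s], tilebytes, cudaMemcpyHostToDevice, stream[s]);
cudaMemcpyAsync(d_c[s], h_c[s], tilebytes, cudaMemcpyHostToDevice, stream[s]);
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, tilesize, tilesize, tilesize,
            &alpha, d_a[s], tilesize, d_b[s], tilesize, &beta, d_c[s], tilesize);
cudaMemcpyAsync(h_c[s], d_c[s], tilebytes, cudaMemcpyDeviceToHost, stream[s]);

// before reusing h_c[s] on the host, the stream must have finished
cudaStreamSynchronize(stream[s]);
```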


Tiled Matrix Multiplication – Using Streams

- Example: 3 streams

[Figure: three streams, each with its own set of GPU tile buffers, fed from pinned host memory]


Tiled Matrix Multiplication – Using Streams

- Example: 3 streams
- For every tile:
  - H2D data transfer
  - Kernel execution (dgemm)
  - D2H data transfer

[Figure: timeline in which the H2D, kernel and D2H phases of the three streams overlap]




Tiled Matrix Multiplication – Using Streams

- Example: 3 streams
- Synchronization is needed before writing results back to the global matrix in host memory

[Figure: the streams in different phases at the same time, stream 1 doing a D2H transfer, stream 2 running A * B = C, stream 3 doing an H2D transfer]


Exercise: task2

.../exercises/tasks/Cuda_DGEMM_tiled_streams.cu



Tiled Matrix Multiplication – Using Multiple GPUs with Streams

- Use all GPUs within a node
- Each GPU uses several streams
- First fill all streams of one GPU, then move on to the next GPU
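The "fill all streams of one GPU first" scheme amounts to a fixed mapping from tile index to a (device, stream) pair. A small C sketch of that mapping (a hypothetical helper, not from the course code):

```c
#include <assert.h>

typedef struct { int device; int stream; } placement;

/* Assign tile number `tile` to a (device, stream) slot: the streams of
   GPU 0 are filled first, then those of GPU 1, and so on, wrapping
   around once every slot has been used. */
static placement place_tile(int tile, int ngpus, int nstreams) {
  placement p;
  int slot = tile % (ngpus * nstreams);
  p.device = slot / nstreams;
  p.stream = slot % nstreams;
  return p;
}
```

In the CUDA code, the resulting `device` would be selected with cudaSetDevice before issuing work into the corresponding stream.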


Tiled Matrix Multiplication – Using Multiple GPUs with Streams

- Example: 2 GPUs, 3 streams

[Figure: device 0 and device 1, each with streams 1-3 and its own GPU buffers, fed from pinned host memory]
Exercise: task3

.../exercises/tasks/Cuda_DGEMM_tiled_streams_multigpu.cu
