
Parallel Computing

Numerical linear algebra


Thorsten Grahs, 22.06.2015

Overview
Direct methods
Iterative methods
Splitting methods
Krylov subspace methods
Preconditioner
Efficient storage formats


Overview | Applications


Solution algorithms | Classification

Dense matrices
  Few/no zero entries
  Direct algorithms (Gauss or splitting algorithms)

Band matrices
  Special algorithms, e.g. the Thomas algorithm

Sparse matrices
  Special storage formats (store only non-zeros)
  Iterative algorithms
  (Jacobi, Gauss-Seidel, Krylov methods)


Dense matrix algorithms


Gaussian elimination
Solve system of equations Ax = b
Reduction

\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
0      & a_{22} & \cdots & a_{2n}\\
\vdots &        & \ddots & \vdots\\
0      & 0      & \cdots & a_{nn}
\end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{pmatrix}
\]

Substitution

\[
x_k = \frac{1}{a_{kk}}\Bigl(b_k - \sum_{j=k+1}^{n} a_{kj}\, x_j\Bigr)
\]
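As a concrete illustration of the substitution step, here is a minimal back-substitution routine in C for an upper-triangular system. The row-major, 0-based storage layout and the function name are illustrative assumptions, not part of the original slides.

#include <stddef.h>

/* Back substitution for the upper-triangular system U x = b.
 * u is stored row-major as an n x n array; the solution is written to x.
 * Assumes all diagonal entries are non-zero. */
void back_substitute(size_t n, const double *u, const double *b, double *x)
{
    for (size_t k = n; k-- > 0; ) {          /* k = n-1, ..., 0 */
        double s = b[k];
        for (size_t j = k + 1; j < n; j++)
            s -= u[k*n + j] * x[j];
        x[k] = s / u[k*n + k];
    }
}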

LU/LR decomposition
LU (Lower/Upper) or LR (Left/Right) decomposition
Solve system of equations Ax = b
LU decomposition

\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots &        & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
=
\begin{pmatrix}
l_{11} & 0      & \cdots & 0\\
l_{21} & l_{22} &        & \vdots\\
\vdots &        & \ddots & 0\\
l_{n1} & l_{n2} & \cdots & l_{nn}
\end{pmatrix}
\begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1n}\\
0      & u_{22} & \cdots & u_{2n}\\
\vdots &        & \ddots & \vdots\\
0      & 0      & \cdots & u_{nn}
\end{pmatrix}
\]

Solve
  Ax = b  ⇒  LUx = L(Ux) = Ly = b
  Ly = b,  Ux = y
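To make the two triangular solves concrete, a minimal forward-substitution sketch in C; combined with the back-substitution routine sketched above it solves L(Ux) = b. Names and the row-major layout are assumptions.

#include <stddef.h>

/* Forward substitution for the lower-triangular system L y = b
 * (row-major storage, non-zero diagonal). */
void forward_substitute(size_t n, const double *l, const double *b, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t j = 0; j < i; j++)
            s -= l[i*n + j] * y[j];
        y[i] = s / l[i*n + i];
    }
}

Usage (illustrative names): forward_substitute(n, L, b, y); back_substitute(n, U, y, x);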

Singular Value Decomposition

Singular Value Decomposition (SVD)
Solve system of equations Ax = b
Decompose A = UΣV^T with

\[
U^T U = V^T V = I =
\begin{pmatrix}
1      & 0      & \cdots & 0\\
0      & 1      & \cdots & 0\\
\vdots &        & \ddots & \vdots\\
0      & 0      & \cdots & 1
\end{pmatrix},
\qquad
\Sigma =
\begin{pmatrix}
\sigma_{11} & 0           & \cdots & 0\\
0           & \sigma_{22} & \cdots & 0\\
\vdots      &             & \ddots & \vdots\\
0           & 0           & \cdots & \sigma_{nn}
\end{pmatrix}
\]

Solve
  Ax = b  ⇒  x = V Σ^{-1} U^T b

Q-R Algorithm
QR decomposition
Solve system of equations Ax = b
Decompose A = QR with
  R upper triangular matrix
  Q orthogonal matrix with

\[
Q^T Q = I =
\begin{pmatrix}
1      & 0      & \cdots & 0\\
0      & 1      & \cdots & 0\\
\vdots &        & \ddots & \vdots\\
0      & 0      & \cdots & 1
\end{pmatrix}
\]

Solve
  Ax = b  ⇒  QRx = b  ⇒  Q^T QRx = Q^T b  ⇒  Rx = Q^T b

Parallel algorithms?
Example: Gaussian elimination (GE)
Standard sequential algorithm for the system of equations

\[
\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \ldots, n
\]
for k=1,n
  for i=k+1,n
    l(i,k) = a(i,k)/a(k,k)
  endfor(i)
  for j=k+1,n
    for i=k+1,n
      a(i,j) = a(i,j) - l(i,k)*a(k,j)
    endfor(i)
  endfor(j)
endfor(k)
GE example from L.R. Scott Scientific Parallel Computing
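For reference, a minimal sequential C version of this kij-ordered elimination loop (no pivoting). The row-major layout and 0-based indexing are my assumptions, not from the slides.

#include <stddef.h>

/* In-place LU-style elimination of a row-major n x n matrix a (no pivoting).
 * The multipliers l(i,k) are stored in the eliminated lower part of a. */
void gaussian_eliminate(size_t n, double *a)
{
    for (size_t k = 0; k < n; k++) {
        for (size_t i = k + 1; i < n; i++)
            a[i*n + k] /= a[k*n + k];               /* multiplier l(i,k) */
        for (size_t j = k + 1; j < n; j++)
            for (size_t i = k + 1; i < n; i++)
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
    }
}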

Iteration space for Gaussian elimination

The iteration space for the standard sequential algorithm for Gaussian
elimination forms a trapezoidal region with square cross-section in the
i, j plane.
Within each square (with k fixed) there are no dependencies.

What Gaussian elimination does

Gaussian elimination reduces the original to a triangular system

\[
\sum_{j=1}^{n} l_{ij} y_j = b_i, \qquad i = 1, \ldots, n
\]

Matrix L is the lower-triangular matrix computed in the GE example
(which determines the values below the diagonal) with ones placed on the diagonal.
Thus, Gaussian elimination performs a matrix factorization (LU decomposition)
  Ax = LUx = b
Solve the triangular systems
  Ly = b,  Ux = y

Properties of Gaussian elimination

The dominant part of the computation is the factorization
(GE example) in which L and U are determined.
The triangular system solves require less computation.
There are no loop-carried dependences in the innermost two loops
(i, j loops) in GE, because i, j > k.
Therefore these loops can be decomposed in any desired fashion.

Standard parallelization of GE
for k=1,n
  if ("I own column k")
    for i=k+1,n
      l(i,k) = a(i,k)/a(k,k)
    endfor(i)
    "broadcast" l(k+1:n)
  else "receive" l(k+1:n)
  endif
  for j=k+1,n ("modulo owning column j")
    for i=k+1,n
      a(i,j) = a(i,j) - l(i,k)*a(k,j)
    endfor(i)
  endfor(j)
endfor(k)
G. A. Geist and C. H. Romine. LU factorization algorithms on distributed-memory
multiprocessor architectures. SIAM J. Sci. Stat. Comput., 9:639-649, 1988.
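A hedged, simplified MPI sketch of this column-oriented scheme. For brevity every rank here keeps a full copy of a and "owns" the columns j with j % size == rank; a genuinely distributed column storage would look different. Function and variable names are illustrative.

#include <mpi.h>
#include <stdlib.h>

/* Simplified column-cyclic Gaussian elimination: the owner of column k
 * computes the multipliers and broadcasts them; every rank then updates
 * only the columns it owns. No pivoting. */
void parallel_ge(int n, double *a, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    double *l = malloc((size_t)n * sizeof *l);

    for (int k = 0; k < n; k++) {
        int owner = k % size;
        if (rank == owner)                        /* "I own column k" */
            for (int i = k + 1; i < n; i++)
                l[i] = a[i*n + k] / a[k*n + k];
        MPI_Bcast(l + k + 1, n - k - 1, MPI_DOUBLE, owner, comm);

        for (int j = k + 1; j < n; j++) {
            if (j % size != rank) continue;       /* "modulo owning column j" */
            for (int i = k + 1; i < n; i++)
                a[i*n + j] -= l[i] * a[k*n + j];
        }
    }
    free(l);
}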

Parallel complexity

For each value of k, n - k divisions are performed in computing the
multipliers l(i, k).
Subsequently, these multipliers are broadcast to all other processors:

\[
T_1 = (n-k)(T_a + T_k)
\]

Once these are received, (n - k)^2 multiply-add pairs are executed,
all of which can be done in parallel:

\[
T_2 = \frac{2(n-k)^2 T_a}{P}
\]

For each k the computation time used is

\[
T = T_1 + T_2 = (n-k)(T_a + T_k) + \frac{2(n-k)^2 T_a}{P}
\]

Parallel complexity

Summing over k, the total time of execution is

\[
T = \sum_{k=1}^{n-1}\left[(n-k)(T_a + T_k) + \frac{2(n-k)^2 T_a}{P}\right]
  \approx \frac{1}{2}\, n^2 (T_a + T_k) + \frac{2 n^3 T_a}{3P}
\]
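The asymptotic estimate follows from the standard power sums; a short derivation for reference:

\[
\sum_{k=1}^{n-1}(n-k) = \frac{n(n-1)}{2} \approx \frac{n^2}{2},
\qquad
\sum_{k=1}^{n-1}(n-k)^2 = \frac{(n-1)n(2n-1)}{6} \approx \frac{n^3}{3},
\]

so the division/broadcast term contributes roughly (1/2) n^2 (T_a + T_k) and the update term roughly 2 n^3 T_a / (3P).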

Overlapped Gaussian elimination

first processor computes l(2:n,1) & "broadcast"
for k=1,n
  unless ("I own column k") "receive" l(k+1:n,k)
  if ("I own column k+1")
    for i=k+1,n
      a(i,k+1) = a(i,k+1) - l(i,k)*a(k,k+1)
      l(i,k+1) = a(i,k+1)/a(k+1,k+1)
    endfor(i)
    "broadcast" l(k+1:n)
  endif
  for j=k+2,n ("modulo owning column j")
    for i=k+1,n
      a(i,j) = a(i,j) - l(i,k)*a(k,j)
    endfor(i)
  endfor(j)
endfor(k)

Overlapped GE | Results

Idea of overlapped Gaussian elimination
  Compute multipliers l(i, k) as soon as possible
  Broadcast them to all other processors

Figure: Efficiencies for Gaussian elimination for factoring full n × n
matrices on an Intel iPSC.

Problems

Storage consumption
  Gaussian elimination is very memory-consuming:
  GE memory: O(n^2)
  Since the number of unknowns in industrial applications can be around
  N ≈ 10^9, GE needs O(10^18), i.e. about 1 exabyte.

Time consumption
  The number of arithmetic operations is O(n^3)
  (not even thinking about communication).
  GE is slow (compared to other approaches).

GE timing Laplace (2D)


GE timing Laplace (3D)


Iterative solvers
Iterative algorithms
In general for sparse matrices


Iterative Solvers

In contrast to direct solvers, iterative solvers
  Compute approximate solutions u_k ≈ A^{-1} f
  Need only the matrix-vector product y_k = A x_k
  One iteration is cheap to compute
  Are based on fixed-point iteration
  Calculate successively better approximations u_{k+1} in an iterative process
  Improved convergence by preconditioning

Types of methods
  Splitting methods
  Krylov methods
  Multigrid

Fixed-point iteration

f is contractive when ‖f(x) - f(y)‖ ≤ λ‖x - y‖ with λ < 1 for all x, y.
Then there exists a unique fixed point x* = f(x*),
and for the iteration x^{k+1} = f(x^k) we have x^k → x*, with
  ‖x^{k+1} - x*‖ ≤ λ‖x^k - x*‖.

Splitting methods

Iterative methods with a splitting of the form
  A = M - N,   A, M, N ∈ ℝ^{n×n}
  M easy to invert (e.g. a diagonal matrix)
and iteration rule
  M x^{k+1} = N x^k + b
  x^{k+1} = C x^k + d   with   C := M^{-1} N,  d := M^{-1} b
are called splitting methods.

Matrix-Splitting


Convergence

Guaranteed for splitting methods with spectral radius ρ(M^{-1} N) < 1.
M^{-1} acts as a preconditioning matrix.
The preconditioner helps the method to converge faster.
  Best preconditioning matrix for convergence: M^{-1} = A^{-1}
  Simplest preconditioning matrix: M^{-1} = I
  Realistic preconditioning matrices lie in between:
    Jacobi:       M = D
    Gauss-Seidel: M = (D - L)

Jacobi method
or total-step iteration

Splitting
  A = D - L - R,   D, L, R ∈ ℝ^{n×n},  D diagonal matrix
Iteration rule
  D x^{k+1} = (L + R) x^k + b
  x^{k+1} = C x^k + d,   C := D^{-1}(L + R),  d := D^{-1} b

\[
x_i^{k+1} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j=1,\, j \ne i}^{n} a_{ij}\, x_j^k \Bigr),
\qquad i = 1, \ldots, n.
\]
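A minimal C sketch of one Jacobi sweep for a dense row-major matrix, to make the component formula concrete; function name and storage layout are illustrative assumptions.

#include <stddef.h>

/* One Jacobi sweep: xnew_i = (b_i - sum_{j != i} a_ij * xold_j) / a_ii.
 * a is a dense row-major n x n matrix with non-zero diagonal. */
void jacobi_sweep(size_t n, const double *a, const double *b,
                  const double *xold, double *xnew)
{
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t j = 0; j < n; j++)
            if (j != i)
                s -= a[i*n + j] * xold[j];
        xnew[i] = s / a[i*n + i];
    }
}

To iterate, repeat the sweep with xold and xnew swapped until the residual b - Ax is small enough.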

Gauss-Seidel method
Single-step method

Splitting
  A = D - L - R,   D, L, R ∈ ℝ^{n×n},  D diagonal matrix
Iteration rule
  (D - L) x^{k+1} = R x^k + b
  x^{k+1} = C x^k + d,   C := (D - L)^{-1} R,  d := (D - L)^{-1} b

\[
x_i^{k+1} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j=1}^{i-1} a_{ij}\, x_j^{k+1}
            - \sum_{j=i+1}^{n} a_{ij}\, x_j^k \Bigr),
\qquad i = 1, \ldots, n.
\]

SOR (Successive Over-Relaxation)
or relaxed Gauss-Seidel method
(with relaxation parameter ω ∈ (0, 2))

Splitting
  A = (1/ω) D - L - ((1-ω)/ω) D - R,   D, L, R ∈ ℝ^{n×n},  D diagonal matrix
Iteration rule
  (D - ωL) x^{k+1} = (1-ω) D x^k + ω R x^k + ω b
  x^{k+1} = C x^k + d,   C := (D - ωL)^{-1} [(1-ω) D + ω R],  d := ω (D - ωL)^{-1} b

\[
x_i^{k+1} = \frac{\omega}{a_{ii}} \Bigl( b_i - \sum_{j=1}^{i-1} a_{ij}\, x_j^{k+1}
            - \sum_{j=i+1}^{n} a_{ij}\, x_j^k \Bigr) + (1-\omega)\, x_i^k,
\qquad i = 1, \ldots, n.
\]
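A minimal in-place C sketch of one SOR sweep (ω = 1 recovers Gauss-Seidel); the dense row-major layout and names are assumptions.

#include <stddef.h>

/* One SOR sweep, updating x in place:
 * x_i <- (1-omega)*x_i + omega/a_ii * (b_i - sum_{j<i} a_ij x_j(new)
 *                                         - sum_{j>i} a_ij x_j(old)) */
void sor_sweep(size_t n, const double *a, const double *b,
               double *x, double omega)
{
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t j = 0; j < n; j++)
            if (j != i)
                s -= a[i*n + j] * x[j];   /* x[j] already updated for j < i */
        x[i] = (1.0 - omega) * x[i] + omega * s / a[i*n + i];
    }
}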

Krylov methods

Matrix-vector multiplication with A.
Krylov subspace
  K_k(A, r_0) := [r_0, A r_0, A^2 r_0, ..., A^{k-1} r_0]
Select the best solution in the Krylov subspace:

  r0 = b - A*x0
  compute "best" uk in K_N(A, r0)
  xk = x0 + uk

Krylov methods

The question arises in which sense "best" is meant.
Algorithms
  GMRES
  CG
  BiCG
  BiCGStab
  ...

GMRES

Generalized Minimal Residual method
(Saad, Schultz 1986)
Best = smallest residual:

  compute uk in K_N(A, r0)
  such that
  || b - A(x0 + uk) ||
  is minimized

Cost and memory grow linearly with N
  Pick a small k (e.g. k = 20)
  Restart with x0_restart = xk (compute a new basis)
  Restarted GMRES: GMRES(k)

CG

Conjugate Gradient method
(Hestenes/Stiefel 1952)
Starting point
  Minimizing the functional
    E(x) := (1/2)⟨Ax, x⟩ - ⟨b, x⟩
  is equivalent to solving the linear system Ax = b.
Idea
  Minimize the functional E over a subspace with search/basis vectors
  d_{k+1} := r_k + β_k d_k,  d_0 = r_0.
  The gradient of E at position x_k is
    ∇E|_{x_k} = A x_k - b =: r_k
  which is easy to compute.
  The d_k's are A-conjugate, i.e. ⟨A d_i, d_j⟩ = 0 for i ≠ j.

CG

Best = smallest residual in the A-norm,
i.e. ‖r_k‖_A < tol

  compute uk in K_N(A, r0)
  such that
  || b - A(x0 + uk) ||_A
  is minimized, where
  || v ||_A^2 = v^T A v

Matrix must be S.P.D.
(symmetric and positive definite)

CG | Prerequisites

Given:
  Matrix (dimension n × n): A
  Vectors (length n):       x, b
  Scalars:                  kMax (max. iterations)
                            eps (prescribed tolerance)

Local variables:
  Vectors (length n): r, d, Ad
  Scalars:            rNew, rOld, alpha, beta

CG | Pseudo code

  d = r = b - A*x
  rOld = r*r
  for k = 1, ..., kMax
    if rOld < eps then stop
    Ad = A*d
    alpha = rOld/(d*Ad)
    x = x + alpha*d
    r = r - alpha*Ad
    rNew = r*r
    beta = rNew/rOld
    d = beta*d + r
    rOld = rNew
  endfor
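A self-contained C sketch of this pseudo code for a dense row-major matrix; in practice A would be applied through a sparse matrix-vector product instead. Names, the storage layout, and the convergence test on sqrt(rOld) are my assumptions.

#include <math.h>
#include <stddef.h>
#include <stdlib.h>

static void matvec(size_t n, const double *a, const double *v, double *out)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += a[i*n + j] * v[j];
        out[i] = s;
    }
}

static double dot(size_t n, const double *u, const double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}

/* Unpreconditioned CG for a symmetric positive-definite matrix a
 * (dense, row-major). Returns the number of iterations performed. */
int cg(size_t n, const double *a, const double *b, double *x,
       int kMax, double eps)
{
    double *r  = malloc(n * sizeof *r);
    double *d  = malloc(n * sizeof *d);
    double *Ad = malloc(n * sizeof *Ad);

    matvec(n, a, x, Ad);                       /* d = r = b - A*x */
    for (size_t i = 0; i < n; i++) r[i] = d[i] = b[i] - Ad[i];
    double rOld = dot(n, r, r);

    int k;
    for (k = 0; k < kMax && sqrt(rOld) > eps; k++) {
        matvec(n, a, d, Ad);
        double alpha = rOld / dot(n, d, Ad);
        for (size_t i = 0; i < n; i++) x[i] += alpha * d[i];
        for (size_t i = 0; i < n; i++) r[i] -= alpha * Ad[i];
        double rNew = dot(n, r, r);
        double beta = rNew / rOld;
        for (size_t i = 0; i < n; i++) d[i] = beta * d[i] + r[i];
        rOld = rNew;
    }
    free(r); free(d); free(Ad);
    return k;
}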

CG

Finite termination/convergence of the CG method after N steps/iterations
for A ∈ ℝ^{N×N} (in exact arithmetic).

The name stems from the conjugate search directions d_i with
⟨A d_i, d_j⟩ = 0.

Figure: CG compared with the gradient descent method.

Convergence of iterative methods

Splitting methods (Jacobi/Gauss-Seidel)
vs.
Krylov methods (GMRES/CG)

Preconditioner

What does a preconditioner do?
  The preconditioner approximates the inverse of A:  M ≈ A^{-1}
What is this good for?
  Reduces the condition number of the matrix,
  i.e. reduces the spectral radius of the matrix.
  Helps to steer the iteration in the right direction.
  Better approximation ⇒ faster convergence.

Diagonal Preconditioner

Simplest preconditioner
  Inverse of the matrix diagonal
  Always cheap, sometimes effective

\[
A =
\begin{pmatrix}
1 & 0 & 5 & 0\\
7 & 2 & 0 & 6\\
0 & 8 & 3 & 0\\
0 & 0 & 9 & 4
\end{pmatrix},
\qquad
M =
\begin{pmatrix}
1 & 0   & 0   & 0\\
0 & 1/2 & 0   & 0\\
0 & 0   & 1/3 & 0\\
0 & 0   & 0   & 1/4
\end{pmatrix}
\]
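Applying such a diagonal (Jacobi) preconditioner is just an element-wise scaling by the inverse diagonal; a minimal sketch with assumed names:

#include <stddef.h>

/* z = M*r with M = diag(A)^{-1}: divide each residual entry by the
 * corresponding (non-zero) diagonal entry of A. */
void apply_diag_precond(size_t n, const double *diag, const double *r, double *z)
{
    for (size_t i = 0; i < n; i++)
        z[i] = r[i] / diag[i];
}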

Preconditioner

Solving with no preconditioner
Solver will continue until residual norm 5e-06
Iteration Number | Residual Norm
   0   5.000000e+00
   1   1.195035e+01
   2   1.147768e+01
   3   1.544747e+01
   4   1.735989e+01
   5   1.160256e+01
   6   1.550610e+01
   7   2.534706e+01
  ...
  89   4.568771e-05
  90   3.513509e-05
  91   3.462404e-05
  92   3.977090e-05
  93   2.056327e-06
Successfully converged after 93 iterations.

Solving with diagonal preconditioner (M = D^-1)
Solver will continue until residual norm 5e-06
Iteration Number | Residual Norm
   0   5.000000e+00
   1   5.148771e+01
   2   4.341677e+01
   3   3.299794e+01
   4   1.074329e+01
   5   2.807501e+00
   6   1.739602e+00
   7   1.538450e+00
   8   4.079266e-01
   9   7.029972e-02
  10   2.436168e-02
  11   6.656054e-03
  12   1.284295e-03
  13   1.432453e-06
Successfully converged after 13 iterations.

Preconditioner
Can be extremely efficient

Solving with no preconditioner
Solver will continue until residual norm 2.5e-04
Iteration Number | Residual Norm
   0   5.000000e+00
   1   1.195035e+01
   2   1.147768e+01
   3   1.544747e+01
   4   1.735989e+01
   5   1.160256e+01
   6   1.550610e+01
   7   2.534706e+01
  ...
 512   3.064404e-04
 513   2.943765e-04
 514   2.796355e-04
 515   2.690365e-04
 516   2.472412e-04
Successfully converged after 516 iterations.

Solving with smoothed aggregation preconditioner
Solver will continue until residual norm 2.5e-04
Iteration Number | Residual Norm
   0   2.560000e+02
   1   4.373383e+03
   2   2.359818e+03
   3   1.211452e+03
   4   6.210880e+02
   5   3.169577e+02
   6   1.569394e+02
   7   8.138102e+01
  ...
  20   6.011106e-03
  21   2.555795e-03
  22   1.083367e-03
  23   4.799830e-04
  24   2.213949e-04
Successfully converged after 24 iterations.

Storing in two-dimensional arrays

Dense or banded matrix
It is natural to use a 2D array to store a dense or banded matrix.
Unfortunately, there are a couple of significant issues that complicate
this seemingly simple approach:
  Row-major vs. column-major storage pattern is language dependent.
  It is not possible to dynamically allocate true two-dimensional arrays
  in C and C++ (at least not without pointer storage and manipulation
  overhead).

Row-major storage

Languages like C and C++ use what is often called a row-major storage
pattern for 2D matrices.
In C and C++, the last index in a multidimensional array indexes
contiguous memory locations. Thus a[i][j] and a[i][j+1] are adjacent
in memory.
Example
  The stride between adjacent elements in the same row is 1.
  The stride between adjacent elements in the same column is the row
  length (5 in this example).
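A small C sketch of the usual workaround: allocate one contiguous block and index it row-major by hand; the macro name and dimensions are illustrative.

#include <stdlib.h>

/* Row-major indexing of a dynamically allocated "2D" array:
 * element (i, j) of an nrows x ncols matrix lives at a[i*ncols + j],
 * so the stride within a row is 1 and within a column is ncols. */
#define IDX(i, j, ncols) ((i) * (ncols) + (j))

int main(void)
{
    size_t nrows = 4, ncols = 5;
    double *a = malloc(nrows * ncols * sizeof *a);
    for (size_t i = 0; i < nrows; i++)
        for (size_t j = 0; j < ncols; j++)
            a[IDX(i, j, ncols)] = 0.0;
    free(a);
    return 0;
}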

Column-major storage

In Fortran, 2D matrices are stored in column-major form.
The first index in a multidimensional array indexes contiguous memory
locations. Thus a(i, j) and a(i+1, j) are adjacent in memory.
Example
  The stride between adjacent elements in the same row is the column
  length (2 in this example), while the stride between adjacent elements
  in the same column is 1.
Notice that if C is used to read a matrix stored by Fortran
(or vice versa), the transpose matrix will be read.
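When C code has to touch column-major (Fortran-ordered) data, for instance before handing it to a Fortran library, the index arithmetic simply swaps roles; a minimal sketch with an assumed macro name:

/* Column-major indexing: element (i, j) of an nrows x ncols matrix stored
 * in Fortran order lives at a[j*nrows + i]. Reading it with the row-major
 * formula a[i*ncols + j] instead yields the transposed matrix. */
#define F_IDX(i, j, nrows) ((j) * (nrows) + (i))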

Significance of array ordering

There are two main reasons why HPC programmers need to be aware of this
issue:
  Memory access patterns can have a dramatic impact on performance,
  especially on modern systems with a complicated memory hierarchy.
  Two loop orderings may access the same elements of an array, but the
  order of accesses is different.
  Many important numerical libraries (e.g. LAPACK) are written in Fortran.
  To use them from C or C++ the programmer must often work with a
  transposed matrix.

Sparse storage formats

Efficient storage formats for sparse matrices
  Obviously, there is no need to store the zeros of a sparse matrix.
  There are several ways to store a sparse matrix efficiently.
Common sparse storage formats:
  Coordinate
  Diagonal
  Compressed Sparse Row (CSR)
  ELLPACK
  Hybrid

Storage format COO


Coordinate (COO)

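The slide's figure is not reproduced here; as a stand-in, a minimal C sketch of the COO layout (one row index, column index, and value per non-zero) together with the corresponding sparse matrix-vector product. Struct and function names are mine.

#include <stddef.h>

/* Coordinate (COO) format: one (row, col, value) triplet per non-zero. */
typedef struct {
    size_t nnz;          /* number of non-zeros           */
    const int *row;      /* row index of each non-zero    */
    const int *col;      /* column index of each non-zero */
    const double *val;   /* value of each non-zero        */
} coo_matrix;

/* y = A*x for a COO matrix; y must be zero-initialised by the caller. */
void coo_spmv(const coo_matrix *A, const double *x, double *y)
{
    for (size_t k = 0; k < A->nnz; k++)
        y[A->row[k]] += A->val[k] * x[A->col[k]];
}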

Storage format CSR


Compressed Sparse Row (CSR)

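Similarly, a minimal sketch of CSR (values and column indices stored row by row, with a row-pointer array marking where each row starts) and its SpMV; names are illustrative.

#include <stddef.h>

/* Compressed Sparse Row (CSR): row i owns the non-zeros
 * val[ptr[i] .. ptr[i+1]-1] with column indices col[ptr[i] .. ptr[i+1]-1]. */
typedef struct {
    size_t n;            /* number of rows           */
    const int *ptr;      /* row pointers, length n+1 */
    const int *col;      /* column indices           */
    const double *val;   /* non-zero values          */
} csr_matrix;

/* y = A*x for a CSR matrix. */
void csr_spmv(const csr_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->ptr[i]; k < A->ptr[i+1]; k++)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}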

Storage format DIA


Diagonal (DIA)


Storage format ELL


ELLPACK (ELL)


Storage format Hyb


Hybrid (HYB) ELL+COO
(CUSP only)


Use of formats


Formats comparison | structured

Structured matrix

Bytes per non-zero entry

Format   float   double
COO      12.00   16.00
CSR       8.45   12.45
DIA       4.05    8.10
ELL       8.11   12.16
HYB       8.11   12.16

Formats comparison | unstructured

Unstructured matrix

Bytes per non-zero entry

Format   float    double
COO       12.00    16.00
CSR        8.62    12.62
DIA      164.11   328.22
ELL       11.06    16.60
HYB        9.00    13.44

Formats comparison | random

Random matrix

Bytes per non-zero entry

Format   float    double
COO       12.00    16.00
CSR        8.42    12.42
DIA       76.83   153.65
ELL       14.20    21.29
HYB        9.60    14.20

Formats comparison | power-law

Power-law matrix

Bytes per non-zero entry

Format   float    double
COO       12.00    16.00
CSR        8.74    12.73
DIA      118.83   237.66
ELL       49.88    74.82
HYB       13.50    19.46

Literature

L.R. Scott, T. Clark & B. Bagheri.
Scientific Parallel Computing.
Princeton University Press, 2005.

Y. Saad.
Iterative Methods for Sparse Linear Systems.
Society for Industrial and Applied Mathematics (SIAM), 2003.

A. Meister.
Numerik linearer Gleichungssysteme.
Springer Spektrum, 2014.
