
Parallel Computing

Numerical linear algebra


Thorsten Grahs, 22.06.2015

Overview
Direct methods
Iterative methods
Splitting methods
Krylov subspace methods
Preconditioner
Efficient storage formats


Overview | Applications


Solution algorithms | Classification

Dense matrices
  Few/no zero entries
  Direct algorithms (Gauss or splitting algorithms)

Band matrices
  Special algorithms, e.g. the Thomas algorithm

Sparse matrices
  Special storage formats (store only non-zeros)
  Iterative algorithms
  (Jacobi, Gauss-Seidel, Krylov methods)


Dense matrix algorithms


Gaussian elimination
Solve system of equations Ax = b
Reduction

\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
0      & a_{22} & \cdots & a_{2n}\\
\vdots &        & \ddots & \vdots\\
0      & 0      & \cdots & a_{nn}
\end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{pmatrix}
\]

Substitution

\[
x_k = \frac{1}{a_{kk}}\Bigl(b_k - \sum_{j=k+1}^{n} a_{kj}\, x_j\Bigr)
\]
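As a concrete illustration of the substitution step, here is a minimal back-substitution routine in C for an upper-triangular system. The row-major, 0-based storage layout and the function name are illustrative assumptions, not part of the original slides.

#include <stddef.h>

/* Back substitution for the upper-triangular system U x = b.
 * u is stored row-major as an n x n array; the solution is written to x.
 * Assumes all diagonal entries are non-zero. */
void back_substitute(size_t n, const double *u, const double *b, double *x)
{
    for (size_t k = n; k-- > 0; ) {          /* k = n-1, ..., 0 */
        double s = b[k];
        for (size_t j = k + 1; j < n; j++)
            s -= u[k*n + j] * x[j];
        x[k] = s / u[k*n + k];
    }
}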

LU/LR decomposition
LU (Lower/Upper) or LR (Left/Right) decomposition
Solve system of equations Ax = b
LU decomposition

\[
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots &        & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
=
\begin{pmatrix}
l_{11} & 0      & \cdots & 0\\
l_{21} & l_{22} &        & \vdots\\
\vdots &        & \ddots & 0\\
l_{n1} & l_{n2} & \cdots & l_{nn}
\end{pmatrix}
\begin{pmatrix}
u_{11} & u_{12} & \cdots & u_{1n}\\
0      & u_{22} & \cdots & u_{2n}\\
\vdots &        & \ddots & \vdots\\
0      & 0      & \cdots & u_{nn}
\end{pmatrix}
\]

Solve
  Ax = b  ⇒  LUx = L(Ux) = Ly = b
  Ly = b,  Ux = y
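To make the two triangular solves concrete, a minimal forward-substitution sketch in C; combined with the back-substitution routine sketched above it solves L(Ux) = b. Names and the row-major layout are assumptions.

#include <stddef.h>

/* Forward substitution for the lower-triangular system L y = b
 * (row-major storage, non-zero diagonal). */
void forward_substitute(size_t n, const double *l, const double *b, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t j = 0; j < i; j++)
            s -= l[i*n + j] * y[j];
        y[i] = s / l[i*n + i];
    }
}

Usage (illustrative names): forward_substitute(n, L, b, y); back_substitute(n, U, y, x);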

Singular Value Decomposition

Singular Value Decomposition (SVD)
Solve system of equations Ax = b
Decompose A = UΣV^T with

\[
U^T U = V^T V = I =
\begin{pmatrix}
1      & 0      & \cdots & 0\\
0      & 1      & \cdots & 0\\
\vdots &        & \ddots & \vdots\\
0      & 0      & \cdots & 1
\end{pmatrix},
\qquad
\Sigma =
\begin{pmatrix}
\sigma_{11} & 0           & \cdots & 0\\
0           & \sigma_{22} & \cdots & 0\\
\vdots      &             & \ddots & \vdots\\
0           & 0           & \cdots & \sigma_{nn}
\end{pmatrix}
\]

Solve
  Ax = b  ⇒  x = V Σ^{-1} U^T b

Q-R Algorithm
QR decomposition
Solve system of equations Ax = b
Decompose A = QR with
  R upper triangular matrix
  Q orthogonal matrix with

\[
Q^T Q = I =
\begin{pmatrix}
1      & 0      & \cdots & 0\\
0      & 1      & \cdots & 0\\
\vdots &        & \ddots & \vdots\\
0      & 0      & \cdots & 1
\end{pmatrix}
\]

Solve
  Ax = b  ⇒  QRx = b  ⇒  Q^T QRx = Q^T b  ⇒  Rx = Q^T b

Parallel algorithms?
Example: Gaussian elimination (GE)
Standard sequential algorithm for the system of equations

\[
\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \ldots, n
\]
for k=1,n
  for i=k+1,n
    l(i,k) = a(i,k)/a(k,k)
  endfor(i)
  for j=k+1,n
    for i=k+1,n
      a(i,j) = a(i,j) - l(i,k)*a(k,j)
    endfor(i)
  endfor(j)
endfor(k)
GE example from L.R. Scott Scientific Parallel Computing
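For reference, a minimal sequential C version of this kij-ordered elimination loop (no pivoting). The row-major layout and 0-based indexing are my assumptions, not from the slides.

#include <stddef.h>

/* In-place LU-style elimination of a row-major n x n matrix a (no pivoting).
 * The multipliers l(i,k) are stored in the eliminated lower part of a. */
void gaussian_eliminate(size_t n, double *a)
{
    for (size_t k = 0; k < n; k++) {
        for (size_t i = k + 1; i < n; i++)
            a[i*n + k] /= a[k*n + k];               /* multiplier l(i,k) */
        for (size_t j = k + 1; j < n; j++)
            for (size_t i = k + 1; i < n; i++)
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
    }
}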

Iteration space for Gaussian elimination

The iteration space for the standard sequential algorithm for Gaussian
elimination forms a trapezoidal region with square cross-section in the
i, j plane.
Within each square (with k fixed) there are no dependencies.

What Gaussian elimination does

Gaussian elimination reduces the original to a triangular system

\[
\sum_{j=1}^{n} l_{ij} y_j = b_i, \qquad i = 1, \ldots, n
\]

Matrix L is the lower-triangular matrix computed in the GE example
(which determines the values below the diagonal) with ones placed on the diagonal.
Thus, Gaussian elimination performs a matrix factorization (LU decomposition)
  Ax = LUx = b
Solve the triangular systems
  Ly = b,  Ux = y

Properties of Gaussian elimination

The dominant part of the computation is the factorization
(GE example) in which L and U are determined.
The triangular system solves require less computation.
There are no loop-carried dependences in the innermost two loops
(i, j loops) in GE, because i, j > k.
Therefore these loops can be decomposed in any desired fashion.

Standard parallelization of GE
for k=1,n
  if ("I own column k")
    for i=k+1,n
      l(i,k) = a(i,k)/a(k,k)
    endfor(i)
    "broadcast" l(k+1:n)
  else "receive" l(k+1:n)
  endif
  for j=k+1,n ("modulo owning column j")
    for i=k+1,n
      a(i,j) = a(i,j) - l(i,k)*a(k,j)
    endfor(i)
  endfor(j)
endfor(k)
G. A. Geist and C. H. Romine. LU factorization algorithms on distributed-memory
multiprocessor architectures. SIAM J. Sci. Stat. Comput., 9:639-649, 1988.
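A hedged, simplified MPI sketch of this column-oriented scheme. For brevity every rank here keeps a full copy of a and "owns" the columns j with j % size == rank; a genuinely distributed column storage would look different. Function and variable names are illustrative.

#include <mpi.h>
#include <stdlib.h>

/* Simplified column-cyclic Gaussian elimination: the owner of column k
 * computes the multipliers and broadcasts them; every rank then updates
 * only the columns it owns. No pivoting. */
void parallel_ge(int n, double *a, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    double *l = malloc((size_t)n * sizeof *l);

    for (int k = 0; k < n; k++) {
        int owner = k % size;
        if (rank == owner)                        /* "I own column k" */
            for (int i = k + 1; i < n; i++)
                l[i] = a[i*n + k] / a[k*n + k];
        MPI_Bcast(l + k + 1, n - k - 1, MPI_DOUBLE, owner, comm);

        for (int j = k + 1; j < n; j++) {
            if (j % size != rank) continue;       /* "modulo owning column j" */
            for (int i = k + 1; i < n; i++)
                a[i*n + j] -= l[i] * a[k*n + j];
        }
    }
    free(l);
}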

Parallel complexity

For each value of k, n - k divisions are performed in computing the
multipliers l(i, k).
Subsequently, these multipliers are broadcast to all other processors:

\[
T_1 = (n-k)(T_a + T_k)
\]

Once these are received, (n - k)^2 multiply-add pairs are executed,
all of which can be done in parallel:

\[
T_2 = \frac{2(n-k)^2 T_a}{P}
\]

For each k the computation time used is

\[
T = T_1 + T_2 = (n-k)(T_a + T_k) + \frac{2(n-k)^2 T_a}{P}
\]

Parallel complexity

Summing over k, the total time of execution is

\[
T = \sum_{k=1}^{n-1}\left[(n-k)(T_a + T_k) + \frac{2(n-k)^2 T_a}{P}\right]
  \approx \frac{1}{2}\, n^2 (T_a + T_k) + \frac{2 n^3 T_a}{3P}
\]
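The asymptotic estimate follows from the standard power sums; a short derivation for reference:

\[
\sum_{k=1}^{n-1}(n-k) = \frac{n(n-1)}{2} \approx \frac{n^2}{2},
\qquad
\sum_{k=1}^{n-1}(n-k)^2 = \frac{(n-1)n(2n-1)}{6} \approx \frac{n^3}{3},
\]

so the division/broadcast term contributes roughly (1/2) n^2 (T_a + T_k) and the update term roughly 2 n^3 T_a / (3P).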

Overlapped Gaussian elimination

first processor computes l(2:n,1) & "broadcast"
for k=1,n
  unless ("I own column k") "receive" l(k+1:n,k)
  if ("I own column k+1")
    for i=k+1,n
      a(i,k+1) = a(i,k+1) - l(i,k)*a(k,k+1)
      l(i,k+1) = a(i,k+1)/a(k+1,k+1)
    endfor(i)
    "broadcast" l(k+1:n)
  endif
  for j=k+2,n ("modulo owning column j")
    for i=k+1,n
      a(i,j) = a(i,j) - l(i,k)*a(k,j)
    endfor(i)
  endfor(j)
endfor(k)

Overlapped GE | Results

Idea of overlapped Gaussian elimination
  Compute multipliers l(i, k) as soon as possible
  Broadcast them to all other processors

Figure: Efficiencies for Gaussian elimination for factoring full n × n
matrices on an Intel iPSC.

Problems

Storage consumption
  Gaussian elimination is very memory-consuming:
  GE memory: O(n^2)
  Since the number of unknowns in industrial applications can be around
  N ≈ 10^9, GE needs O(10^18), i.e. about 1 exabyte.

Time consumption
  The number of arithmetic operations is O(n^3)
  (not even thinking about communication).
  GE is slow (compared to other approaches).

GE timing Laplace (2D)


GE timing Laplace (3D)


Iterative solvers
Iterative algorithms
In general for sparse matrices


Iterative Solvers

In contrast to direct solvers, iterative solvers
  Compute approximate solutions u_k ≈ A^{-1} f
  Need only the matrix-vector product y_k = A x_k
  One iteration is cheap to compute
  Are based on fixed-point iteration
  Calculate successively better approximations u_{k+1} in an iterative process
  Improved convergence by preconditioning

Types of methods
  Splitting methods
  Krylov methods
  Multigrid

Fixed-point iteration

f is contractive when ‖f(x) - f(y)‖ ≤ λ‖x - y‖ with λ < 1 for all x, y.
Then there exists a unique fixed point x* = f(x*),
and for the iteration x^{k+1} = f(x^k) we have x^k → x*, with
  ‖x^{k+1} - x*‖ ≤ λ‖x^k - x*‖.

Splitting methods

Iterative methods with a splitting of the form
  A = M - N,   A, M, N ∈ ℝ^{n×n}
  M easy to invert (e.g. a diagonal matrix)
and iteration rule
  M x^{k+1} = N x^k + b
  x^{k+1} = C x^k + d   with   C := M^{-1} N,  d := M^{-1} b
are called splitting methods.

Matrix-Splitting


Convergence

Guaranteed for splitting methods with spectral radius ρ(M^{-1} N) < 1.
M^{-1} acts as a preconditioning matrix.
The preconditioner helps the method to converge faster.
  Best preconditioning matrix for convergence: M^{-1} = A^{-1}
  Simplest preconditioning matrix: M^{-1} = I
  Realistic preconditioning matrices lie in between:
    Jacobi:       M = D
    Gauss-Seidel: M = (D - L)

Jacobi method
or total-step iteration

Splitting
  A = D - L - R,   D, L, R ∈ ℝ^{n×n},  D diagonal matrix
Iteration rule
  D x^{k+1} = (L + R) x^k + b
  x^{k+1} = C x^k + d,   C := D^{-1}(L + R),  d := D^{-1} b

\[
x_i^{k+1} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j=1,\, j \ne i}^{n} a_{ij}\, x_j^k \Bigr),
\qquad i = 1, \ldots, n.
\]
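A minimal C sketch of one Jacobi sweep for a dense row-major matrix, to make the component formula concrete; function name and storage layout are illustrative assumptions.

#include <stddef.h>

/* One Jacobi sweep: xnew_i = (b_i - sum_{j != i} a_ij * xold_j) / a_ii.
 * a is a dense row-major n x n matrix with non-zero diagonal. */
void jacobi_sweep(size_t n, const double *a, const double *b,
                  const double *xold, double *xnew)
{
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t j = 0; j < n; j++)
            if (j != i)
                s -= a[i*n + j] * xold[j];
        xnew[i] = s / a[i*n + i];
    }
}

To iterate, repeat the sweep with xold and xnew swapped until the residual b - Ax is small enough.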

Gauss-Seidel method
Single-step method

Splitting
  A = D - L - R,   D, L, R ∈ ℝ^{n×n},  D diagonal matrix
Iteration rule
  (D - L) x^{k+1} = R x^k + b
  x^{k+1} = C x^k + d,   C := (D - L)^{-1} R,  d := (D - L)^{-1} b

\[
x_i^{k+1} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j=1}^{i-1} a_{ij}\, x_j^{k+1}
            - \sum_{j=i+1}^{n} a_{ij}\, x_j^k \Bigr),
\qquad i = 1, \ldots, n.
\]

SOR (Successive Over-Relaxation)
or relaxed Gauss-Seidel method
(with relaxation parameter ω ∈ (0, 2))

Splitting
  A = (1/ω) D - L - ((1-ω)/ω) D - R,   D, L, R ∈ ℝ^{n×n},  D diagonal matrix
Iteration rule
  (D - ωL) x^{k+1} = (1-ω) D x^k + ω R x^k + ω b
  x^{k+1} = C x^k + d,   C := (D - ωL)^{-1} [(1-ω) D + ω R],  d := ω (D - ωL)^{-1} b

\[
x_i^{k+1} = \frac{\omega}{a_{ii}} \Bigl( b_i - \sum_{j=1}^{i-1} a_{ij}\, x_j^{k+1}
            - \sum_{j=i+1}^{n} a_{ij}\, x_j^k \Bigr) + (1-\omega)\, x_i^k,
\qquad i = 1, \ldots, n.
\]
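A minimal in-place C sketch of one SOR sweep (ω = 1 recovers Gauss-Seidel); the dense row-major layout and names are assumptions.

#include <stddef.h>

/* One SOR sweep, updating x in place:
 * x_i <- (1-omega)*x_i + omega/a_ii * (b_i - sum_{j<i} a_ij x_j(new)
 *                                         - sum_{j>i} a_ij x_j(old)) */
void sor_sweep(size_t n, const double *a, const double *b,
               double *x, double omega)
{
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t j = 0; j < n; j++)
            if (j != i)
                s -= a[i*n + j] * x[j];   /* x[j] already updated for j < i */
        x[i] = (1.0 - omega) * x[i] + omega * s / a[i*n + i];
    }
}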

Krylov methods

Matrix-vector multiplication with A.
Krylov subspace
  K_k(A, r_0) := [r_0, A r_0, A^2 r_0, ..., A^{k-1} r_0]
Select the best solution in the Krylov subspace:

  r0 = b - A*x0
  compute "best" uk in K_N(A, r0)
  xk = x0 + uk

Krylov methods

The question arises in which sense "best" is meant.
Algorithms
  GMRES
  CG
  BiCG
  BiCGStab
  ...

GMRES

Generalized Minimal Residual method
(Saad, Schultz 1986)
Best = smallest residual:

  compute uk in K_N(A, r0)
  such that
  || b - A(x0 + uk) ||
  is minimized

Cost and memory grow linearly with N
  Pick a small k (e.g. k = 20)
  Restart with x0_restart = xk (compute a new basis)
  Restarted GMRES: GMRES(k)

CG

Conjugate Gradient method
(Hestenes/Stiefel 1952)
Starting point
  Minimizing the functional
    E(x) := (1/2)⟨Ax, x⟩ - ⟨b, x⟩
  is equivalent to solving the linear system Ax = b.
Idea
  Minimize the functional E over a subspace with search/basis vectors
  d_{k+1} := r_k + β_k d_k,  d_0 = r_0.
  The gradient of E at position x_k is
    ∇E|_{x_k} = A x_k - b =: r_k
  which is easy to compute.
  The d_k's are A-conjugate, i.e. ⟨A d_i, d_j⟩ = 0 for i ≠ j.

CG

Best = smallest residual in the A-norm,
i.e. ‖r_k‖_A < tol

  compute uk in K_N(A, r0)
  such that
  || b - A(x0 + uk) ||_A
  is minimized, where
  || v ||_A^2 = v^T A v

Matrix must be S.P.D.
(symmetric and positive definite)

CG | Prerequisites

Given:
  Matrix (dimension n × n): A
  Vectors (length n):       x, b
  Scalars:                  kMax (max. iterations)
                            eps (prescribed tolerance)

Local variables:
  Vectors (length n): r, d, Ad
  Scalars:            rNew, rOld, alpha, beta

CG | Pseudo code

  d = r = b - A*x
  rOld = r*r
  for k = 1, ..., kMax
    if rOld < eps then stop
    Ad = A*d
    alpha = rOld/(d*Ad)
    x = x + alpha*d
    r = r - alpha*Ad
    rNew = r*r
    beta = rNew/rOld
    d = beta*d + r
    rOld = rNew
  endfor
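A self-contained C sketch of this pseudo code for a dense row-major matrix; in practice A would be applied through a sparse matrix-vector product instead. Names, the storage layout, and the convergence test on sqrt(rOld) are my assumptions.

#include <math.h>
#include <stddef.h>
#include <stdlib.h>

static void matvec(size_t n, const double *a, const double *v, double *out)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += a[i*n + j] * v[j];
        out[i] = s;
    }
}

static double dot(size_t n, const double *u, const double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}

/* Unpreconditioned CG for a symmetric positive-definite matrix a
 * (dense, row-major). Returns the number of iterations performed. */
int cg(size_t n, const double *a, const double *b, double *x,
       int kMax, double eps)
{
    double *r  = malloc(n * sizeof *r);
    double *d  = malloc(n * sizeof *d);
    double *Ad = malloc(n * sizeof *Ad);

    matvec(n, a, x, Ad);                       /* d = r = b - A*x */
    for (size_t i = 0; i < n; i++) r[i] = d[i] = b[i] - Ad[i];
    double rOld = dot(n, r, r);

    int k;
    for (k = 0; k < kMax && sqrt(rOld) > eps; k++) {
        matvec(n, a, d, Ad);
        double alpha = rOld / dot(n, d, Ad);
        for (size_t i = 0; i < n; i++) x[i] += alpha * d[i];
        for (size_t i = 0; i < n; i++) r[i] -= alpha * Ad[i];
        double rNew = dot(n, r, r);
        double beta = rNew / rOld;
        for (size_t i = 0; i < n; i++) d[i] = beta * d[i] + r[i];
        rOld = rNew;
    }
    free(r); free(d); free(Ad);
    return k;
}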

CG

Finite termination/convergence of the CG method after N steps/iterations
for A ∈ ℝ^{N×N} (in exact arithmetic).

The name stems from the conjugate search directions d_i with
⟨A d_i, d_j⟩ = 0.

Figure: CG compared with the gradient descent method.

Convergence of iterative methods

Splitting methods (Jacobi/Gauss-Seidel)
vs.
Krylov methods (GMRES/CG)

Preconditioner

What does a preconditioner do?
  The preconditioner approximates the inverse of A:  M ≈ A^{-1}
What is this good for?
  Reduces the condition number of the matrix,
  i.e. reduces the spectral radius of the matrix.
  Helps to steer the iteration in the right direction.
  Better approximation ⇒ faster convergence.

Diagonal Preconditioner

Simplest preconditioner
  Inverse of the matrix diagonal
  Always cheap, sometimes effective

\[
A =
\begin{pmatrix}
1 & 0 & 5 & 0\\
7 & 2 & 0 & 6\\
0 & 8 & 3 & 0\\
0 & 0 & 9 & 4
\end{pmatrix},
\qquad
M =
\begin{pmatrix}
1 & 0   & 0   & 0\\
0 & 1/2 & 0   & 0\\
0 & 0   & 1/3 & 0\\
0 & 0   & 0   & 1/4
\end{pmatrix}
\]
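Applying such a diagonal (Jacobi) preconditioner is just an element-wise scaling by the inverse diagonal; a minimal sketch with assumed names:

#include <stddef.h>

/* z = M*r with M = diag(A)^{-1}: divide each residual entry by the
 * corresponding (non-zero) diagonal entry of A. */
void apply_diag_precond(size_t n, const double *diag, const double *r, double *z)
{
    for (size_t i = 0; i < n; i++)
        z[i] = r[i] / diag[i];
}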

Preconditioner

Solving with no preconditioner
Solver will continue until residual norm 5e-06
Iteration Number | Residual Norm
   0   5.000000e+00
   1   1.195035e+01
   2   1.147768e+01
   3   1.544747e+01
   4   1.735989e+01
   5   1.160256e+01
   6   1.550610e+01
   7   2.534706e+01
  ...
  89   4.568771e-05
  90   3.513509e-05
  91   3.462404e-05
  92   3.977090e-05
  93   2.056327e-06
Successfully converged after 93 iterations.

Solving with diagonal preconditioner (M = D^-1)
Solver will continue until residual norm 5e-06
Iteration Number | Residual Norm
   0   5.000000e+00
   1   5.148771e+01
   2   4.341677e+01
   3   3.299794e+01
   4   1.074329e+01
   5   2.807501e+00
   6   1.739602e+00
   7   1.538450e+00
   8   4.079266e-01
   9   7.029972e-02
  10   2.436168e-02
  11   6.656054e-03
  12   1.284295e-03
  13   1.432453e-06
Successfully converged after 13 iterations.

Preconditioner
Can be extremely efficient

Solving with no preconditioner
Solver will continue until residual norm 2.5e-04
Iteration Number | Residual Norm
   0   5.000000e+00
   1   1.195035e+01
   2   1.147768e+01
   3   1.544747e+01
   4   1.735989e+01
   5   1.160256e+01
   6   1.550610e+01
   7   2.534706e+01
  ...
 512   3.064404e-04
 513   2.943765e-04
 514   2.796355e-04
 515   2.690365e-04
 516   2.472412e-04
Successfully converged after 516 iterations.

Solving with smoothed aggregation preconditioner
Solver will continue until residual norm 2.5e-04
Iteration Number | Residual Norm
   0   2.560000e+02
   1   4.373383e+03
   2   2.359818e+03
   3   1.211452e+03
   4   6.210880e+02
   5   3.169577e+02
   6   1.569394e+02
   7   8.138102e+01
  ...
  20   6.011106e-03
  21   2.555795e-03
  22   1.083367e-03
  23   4.799830e-04
  24   2.213949e-04
Successfully converged after 24 iterations.

Storing in two-dimensional arrays

Dense or banded matrix
It is natural to use a 2D array to store a dense or banded matrix.
Unfortunately, there are a couple of significant issues that complicate
this seemingly simple approach:
  Row-major vs. column-major storage pattern is language dependent.
  It is not possible to dynamically allocate true two-dimensional arrays
  in C and C++ (at least not without pointer storage and manipulation
  overhead).

Row-major storage

Languages like C and C++ use what is often called a row-major storage
pattern for 2D matrices.
In C and C++, the last index in a multidimensional array indexes
contiguous memory locations. Thus a[i][j] and a[i][j+1] are adjacent
in memory.
Example
  The stride between adjacent elements in the same row is 1.
  The stride between adjacent elements in the same column is the row
  length (5 in this example).
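A small C sketch of the usual workaround: allocate one contiguous block and index it row-major by hand; the macro name and dimensions are illustrative.

#include <stdlib.h>

/* Row-major indexing of a dynamically allocated "2D" array:
 * element (i, j) of an nrows x ncols matrix lives at a[i*ncols + j],
 * so the stride within a row is 1 and within a column is ncols. */
#define IDX(i, j, ncols) ((i) * (ncols) + (j))

int main(void)
{
    size_t nrows = 4, ncols = 5;
    double *a = malloc(nrows * ncols * sizeof *a);
    for (size_t i = 0; i < nrows; i++)
        for (size_t j = 0; j < ncols; j++)
            a[IDX(i, j, ncols)] = 0.0;
    free(a);
    return 0;
}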

Column-major storage

In Fortran, 2D matrices are stored in column-major form.
The first index in a multidimensional array indexes contiguous memory
locations. Thus a(i, j) and a(i+1, j) are adjacent in memory.
Example
  The stride between adjacent elements in the same row is the column
  length (2 in this example), while the stride between adjacent elements
  in the same column is 1.
Notice that if C is used to read a matrix stored by Fortran
(or vice versa), the transpose matrix will be read.
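When C code has to touch column-major (Fortran-ordered) data, for instance before handing it to a Fortran library, the index arithmetic simply swaps roles; a minimal sketch with an assumed macro name:

/* Column-major indexing: element (i, j) of an nrows x ncols matrix stored
 * in Fortran order lives at a[j*nrows + i]. Reading it with the row-major
 * formula a[i*ncols + j] instead yields the transposed matrix. */
#define F_IDX(i, j, nrows) ((j) * (nrows) + (i))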

Significance of array ordering

There are two main reasons why HPC programmers need to be aware of this
issue:
  Memory access patterns can have a dramatic impact on performance,
  especially on modern systems with a complicated memory hierarchy.
  Two loop orderings may access the same elements of an array, but the
  order of accesses is different.
  Many important numerical libraries (e.g. LAPACK) are written in Fortran.
  To use them from C or C++ the programmer must often work with a
  transposed matrix.

Sparse storage formats

Efficient storage formats for sparse matrices
  Obviously, there is no need to store the zeros of a sparse matrix.
  There are several ways to store a sparse matrix efficiently.
Common sparse storage formats:
  Coordinate
  Diagonal
  Compressed Sparse Row (CSR)
  ELLPACK
  Hybrid

Storage format COO


Coordinate (COO)

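The slide's figure is not reproduced here; as a stand-in, a minimal C sketch of the COO layout (one row index, column index, and value per non-zero) together with the corresponding sparse matrix-vector product. Struct and function names are mine.

#include <stddef.h>

/* Coordinate (COO) format: one (row, col, value) triplet per non-zero. */
typedef struct {
    size_t nnz;          /* number of non-zeros           */
    const int *row;      /* row index of each non-zero    */
    const int *col;      /* column index of each non-zero */
    const double *val;   /* value of each non-zero        */
} coo_matrix;

/* y = A*x for a COO matrix; y must be zero-initialised by the caller. */
void coo_spmv(const coo_matrix *A, const double *x, double *y)
{
    for (size_t k = 0; k < A->nnz; k++)
        y[A->row[k]] += A->val[k] * x[A->col[k]];
}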

Storage format CSR


Compressed Sparse Row (CSR)

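Similarly, a minimal sketch of CSR (values and column indices stored row by row, with a row-pointer array marking where each row starts) and its SpMV; names are illustrative.

#include <stddef.h>

/* Compressed Sparse Row (CSR): row i owns the non-zeros
 * val[ptr[i] .. ptr[i+1]-1] with column indices col[ptr[i] .. ptr[i+1]-1]. */
typedef struct {
    size_t n;            /* number of rows           */
    const int *ptr;      /* row pointers, length n+1 */
    const int *col;      /* column indices           */
    const double *val;   /* non-zero values          */
} csr_matrix;

/* y = A*x for a CSR matrix. */
void csr_spmv(const csr_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->ptr[i]; k < A->ptr[i+1]; k++)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}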

Storage format DIA


Diagonal (DIA)


Storage format ELL


ELLPACK (ELL)


Storage format Hyb


Hybrid (HYB) ELL+COO
(CUSP only)


Use of formats


Formats comparison | structured

Structured matrix

Bytes per non-zero entry

Format   float   double
COO      12.00   16.00
CSR       8.45   12.45
DIA       4.05    8.10
ELL       8.11   12.16
HYB       8.11   12.16

Formats comparison | unstructured

Unstructured matrix

Bytes per non-zero entry

Format   float    double
COO       12.00    16.00
CSR        8.62    12.62
DIA      164.11   328.22
ELL       11.06    16.60
HYB        9.00    13.44

Formats comparison | random

Random matrix

Bytes per non-zero entry

Format   float    double
COO       12.00    16.00
CSR        8.42    12.42
DIA       76.83   153.65
ELL       14.20    21.29
HYB        9.60    14.20

Formats comparison | power-law

Power-law matrix

Bytes per non-zero entry

Format   float    double
COO       12.00    16.00
CSR        8.74    12.73
DIA      118.83   237.66
ELL       49.88    74.82
HYB       13.50    19.46

Literature

L.R. Scott, T. Clark & B. Bagheri.
Scientific Parallel Computing.
Princeton University Press, 2005.

Y. Saad.
Iterative Methods for Sparse Linear Systems.
Society for Industrial and Applied Mathematics (SIAM), 2003.

A. Meister.
Numerik linearer Gleichungssysteme.
Springer Spektrum, 2014.
