Pc10 NumLinAl I
Pc10 NumLinAl I
Pc10 NumLinAl I
Overview
Direct methods
Iterative methods
Splitting methods
Krylov subspace methods
Preconditioner
Efficient storage formats
Overview. Applications
a11 a12
0 a
22
..
..
.
.
0
0
x1
b1
a1n
x2
b2
2n
a
.. .. = ..
..
.
. . .
nn
a
xn
bn
Substitution
xk =
1
bk
akk
n
X
j=k+1
akj xj
LU/LR decomposition
LU (Lower/Upper) or LR (Left/Right) decomposition
Solve stem of equations Ax = b
LU decomposition
l11 0
a11 a12 a1n
a21 a22 a2n l21 l22
..
.. . .
.. = ..
..
.
. . .
.
.
an1 an2 ann
ln1 ln2
0
u11 u12
0 0 u22
.. ..
..
. .
.
lnn
0
0
..
.
Solve
Ax = b
LUx = L(Ux) = Ly = b
Ly = b Ux = y
u1n
u2n
..
..
. .
unn
1 0 0
0 1 0
UT U = VT V = I = .. .. . . .. ,
. .
. .
Decompose
0 0 1
Solve
Ax = V1 UT b
11 0
0 22
= ..
..
.
.
0
0
..
.
0
0
..
.
nn
Q-R Algorithm
QR decomposition
Solve system of equations Ax = b
Decompose A = QR with
R upper diagonal matrix
Q orthogonal matrix with
1 0
0 1
QT Q = I = .. .. . .
. .
.
0 0
0
0
..
.
1
Solve
Ax = b
QRx = b
QT QRx = QT b
Rx = QT b
Parallel algorithms?
Example Gaussian Elimination (GE)
Standard sequential algorithm for system of equations
n
P
aij xj = bi , i = 1, . . . , n
j=1
1
2
3
4
5
6
7
8
9
10
for k=1,n
for i =k+1,n
l ( i ,k) = a( i ,k) /a(k,k)
endfor( i )
for j =k+1,n
for i =k+1,n
a( i , j ) = a( i , j ) l ( i ,k) a(k, j )
endfor( i )
endfor( j )
endfor(k)
GE example from L.R. Scott Scientific Parallel Computing
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 9
Ly = b
Ux = y
Standard parallelization of GE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
for k=1,n
if ( " I own column k " )
for i =k+1,n
l ( i ,k) = a( i ,k) /a(k,k)
endfor( i )
"broadcast" l (k+1 : n)
else "receive" l (k+1 : n)
endif
for j =k+1,n ("modulo owning column j")
for i =k+1,n
a( i , j ) = a( i , j ) l ( i ,k) a(k, j )
endfor( i )
endfor( j )
endfor(k)
G. A. Geist and C. H. Romine. LU factorization algorithms on distributed-memory
multiprocessor architectures. SIAM J. Sci. Stat. Comput., 9:639649, 1988.
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 13
Parallel complexity
For each value of k, n k divisions are performed in
computing the multipliers l(i, k)
Subsequently, these multipliers are broadcast to all other
processors.
T1 = (n k)(Ta + Tk )
Once these are received, (n k)2 multiply-add pairs are
executed, all of which can be done in parallel
T2 = 2(n k)2 Ta /P
For each K
T = T1 + T2 = (n k)(Ta + Tk ) +
computation time is used
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 14
2(n k)2 Ta
P
Parallel complexity
Summing over k, the total time of execution is
n1
X
2(n k)2 Ta
T =
(n k)(Ta + Tk ) +
P
k=1
1 2
2n3 Ta
n (Ta + Tk ) +
2
3P
Overlapped GE Results
Idea of overlapped Gaussian elimination
Compute multipliers l(i, k) as soon as possible
Broadcast them to all other processors
Problems
Storage consumption
Gauss elimination is very memory consuming
GEmemory O(n2 )
Since unknowns from industrial application could be around
N 109
GE O(1018 ) i.e. 1 Exa Bytes.
time consuming
Number of arithmetic operations is O(n3 )
(dont think about communication)
GE is slow (compared to other approaches)
Iterative solvers
Iterative algorithms
In general for sparse matrices
Iterative Solvers
In contrast to direct solvers, iterative solvers
Compute approximate solutions uk = A1 f .
Needs only matrix-vector product yk = Ax
One iteration is cheap to compute, in 2D
Based on Fix-point iteration
Calculates successively better approximation uk+1 in an
iterative process
Improved convergence by pre-conditioning
Types of methods
Splitting methods
Krylov methods
Multigrid
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 22
Fix-point iteration
For x k+1 = f (x k ),
x k x
If kf (x) f (y)k < kx yk, < 1 x, y
kx k x k < kx x k
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 23
Splitting methods
Iterative methods with a splitting of the form
A = M N,
A, M, N Rnn
M easy to invert (f.i. diagonal matrix)
and iteration rule
Mx k+1 = Nx k + b
x k+1 = Cx k + d
.
with
C := M1 N,
d := M1 b
called splitting methods
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 24
Matrix-Splitting
Convergence
Guaranteed for splitting methods with
Spectral radius (M1 N) < 1
M 1 acts as a preconditioning matrix
Preconditioner helps the method to converge faster
Best preconditioning matrix for convergence: M1 = A1
Simplest preconditioning matrix: M1 = I
Realistic preconditioning matrix between
Jacobi:
M=D
Gau-Seidel: M = (D L)
Jacobi methods
or total-step iteration
Splitting
A=DLR
D, L, R Rnn , D diagonal matrix
iteration rule
Dx k+1 = (L + R)x k + b
x k+1 = Cx k + d,
xik+1 =
1
aii
C := D1 (L + R) d := D1 b
!
n
P
bi
aij xjk , i = 1, . . . , n.
j=1,j6=i
Gau-Seidel method
Single-step method
Splitting
A=DLR
D, L, R Rnn ,
D diagonal matrix
iteration rule
(D L)x k+1 = Rx k + b
1
aii
bi
i1
P
j=1
aij xjk+1
n
P
j=i+1
!
aij xjk
, i = 1, . . . , n.
Iteration rule
(D L)x k+1 = (1 )Dx k + Rx k + b
x k+1 = Cx k + d,
C := (D L)1 [(1 )D + R]
xik+1 =
aii
bi
i1
P
j=1
aij xjk+1
n
P
j=i+1
D diag.matrix
e.g..
d := (D L)1 b
!
aij xjk
i = 1, . .22.06.2015
. , n. Thorsten Grahs Parallel Computing I SS 2015 Seite 29
+ (1 )xik ,
Krylov methods
Matrix-vector multiplication with A
Krylov subspace
Kk (A, r0 ) := [r0 , Ar0 , A2 r0 , . . . , Ak1 r0 ]
Select best solution in Krylov subspace
1
2
3
r0 = b - A*x0
Compute
best
xk = x0 + uk
uk in KN(A, r0)
Krylov methods
The question arises in which sense
best
is meant.
Algorithms
GMRES
CG
BiCG
BiCGStab
...
GMRES
Generalized Minimum Residual Method
(Saad, Schultz 1986)
Best = smallest residual
1
2
3
4
5
Compute
uk in KN(A, r0)
such that
|| b - A (x0 + uk) ||
is minimized
Cost and memory grows linear with N
Pick small k (e.g. k = 20)
Restart with x 0 restart = x k (compute new basis)
Restarted GMRES GMRES(k)
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 32
CG
Conjugated Gradient method
(Hestenes/Stiefel 1952)
Starting point
Minimizing the functional
1
E(x) := hAx, yi hb, xi
2
is equivalent with solving the linear system Ax = b.
Idea
Minimizing the functional E over a subspace with
search/basis vectors dk+1 := rk + k dk , d0 = r0 .
The Gradient of E at position xk is
E|xk = Axx b := rk
which is easy to compute
The dk s are A-conjugated, i.e. hAdi , dj i = 0, i 6= j
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 33
CG
Best = smallest residual in the A-norm
i.e. krk kA < tol
1
2
3
4
5
6
Compute
uk in KN(A, r0)
such that
|| b - A (x0 + uk) ||A
is minimized, where
|| v ||A = vT A v
Matrix must be S.P.D.
Symmetric and Positive-Definite
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 34
CG | Prerequisites
Given:
Matrix (Dimension n n) :
Vectors (length n) :
Scalars:
A
x,b
kmax (Max. iteration)
eps (prescribed tolerance)
Local variable
Vectors (length n) : r, d, Ad
Scalars :
rNew, rOld, alpha, beta
CG | Pseudo code
1
2
3
4
5
6
7
8
9
10
d = r = b - A*x
rOld = r*r
For k = 1, ..., kMax
If modulus rOld smaller than eps is, than stop
else
Ad = A*d
alpha = rOld/(d*Ad)
x = x + alpha*d
r = r - alpha*Ad
rNew = r*r
11
12
13
14
d = beta*d + r
15
16
rOld = rNew
22.06.2015 Thorsten Grahs Parallel Computing I SS 2015 Seite 36
CG
Finite termination/convergence
of the CG method after N
steps/iterations
A RNN
in exact arithmetic
Preconditioner
What does a preconditioner?
The preconditioner approximates the inverse of A
What is this good for?
Reduces the condition number of the matrix
i.e. reduces the spectral radius of the matrix.
Helps to steer the iteration in the right direction
Better approximation Faster Convergence
M A1
Diagonal Preconditioner
Simplest preconditioner
Inverse of matrix diagonal
Always cheap, sometimes effective
1
0
5
0
7
2
0
6
0
8
3
0
A
0
0
9
4
1 0
0
0
0 1/2 0
0
0 0 1/3 0
0 0
0 1/4
M
Preconditioner
Preconditioner
Can be extremely efficient
Row-major storage
Languages like C and C++ use what is often called a
row-major storage pattern for 2D matrices.
In C and C++, the last index in a multidimensional array
indexes contiguous memory locations. Thus a[i][j] and
a[i][j + 1] are adjacent in memory.
Example
Column-major storage
In Fortran 2D matrices are stored in column-major form.
The first index in a multidimensional array indexes
contiguous memory locations. Thus a(i, j) and a(i + 1, j)
are adjacent in memory.
Example
Use of formats
Format
COO
CSR
DIA
ELL
HYB
float
12.00
8.45
4.05
8.11
8.11
double
16.00
12.45
8.10
12.16
12.16
Format
COO
CSR
DIA
ELL
HYB
float double
12.00
16.00
8.62
12.62
164.11 328.22
11.06
16.60
9.00
13.44
Format
COO
CSR
DIA
ELL
HYB
float
12.00
8.42
76.83
14.20
9.60
double
16.00
12.42
153.65
21.29
14.20
Format
COO
CSR
DIA
ELL
HYB
float double
12.00
16.00
8.74
12.73
118.83 237.66
49.88
74.82
13.50
19.46
Literature
L.R. Scott, T. Clark, & B. Bagheri
Scientific Parallel Computing
Princton Universit press, 2005.
Y. Saad
Iterative methods for sparse linear system
Society for Industrial and Applied Mathematics (SIAM),
2003
A. Meister
Numerik linearer Gleichungssysteme
Springer-Spektrum, 2014