Delft University of Technology
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft Institute of Applied Mathematics
Implicit Methods for Real-Time Simulation of
Interactive Waves
A thesis submitted to the
Delft Institute of Applied Mathematics
in partial fulfillment of the requirements
for the degree
MASTER OF SCIENCE
in
APPLIED MATHEMATICS
by
Akash Mittal
Delft, The Netherlands
July 23, 2014
Copyright © 2014 by Akash Mittal. All rights reserved.
MSc THESIS APPLIED MATHEMATICS
Implicit Methods for Real-Time Simulation of Interactive
Waves
Akash Mittal
Delft University of Technology
Daily supervisor: Prof. dr. ir. Cornelis Vuik
Responsible professor: Prof. dr. ir. Cornelis Vuik
Committee members
Prof. dr. ir. Cornelis Vuik
Dr. ir. Peter Wilders
Dr. Auke Ditzel
July 23, 2014
Delft, The Netherlands
Preface
This Master's thesis has been written within the Erasmus Mundus Master's programme Computer Simulations for Science and Engineering (COSSE)¹. It includes a six-month internship carried out at the Maritime Research Institute Netherlands (MARIN) in collaboration with Delft University of Technology.
MARIN provides a unique combination of simulation, model testing, full-scale measurements and training programmes to the shipbuilding and offshore industry and to governments. One such service is the supply of full-scale bridge simulators to shipping companies, which can be used not only to train captains, steersmen and other ship crew, but also for research and consultancy purposes.
MARIN continuously strives to bring innovation to its products. Currently it is working on a new wave model called 'Interactive Waves'. The project focuses on developing a simulator in which ships and waves interact. The new wave model is the Variational Boussinesq model (VBM) as suggested by Gert Klopman. However, this new, more realistic model brings much more computational effort with it. The VBM mainly requires an unsteady solver that solves a coupled system of equations at each frame (20 fps). The model is currently limited by time-step constraints, which makes it difficult to simulate large and non-uniform domains in real time. The focus of the current Master's thesis is to develop a fast and robust solver with good stability properties.
Acknowledgements
During the past two years of my studies, I had the opportunity to study for one year at the University of Erlangen-Nuremberg, Germany, as well as at Delft University of Technology, The Netherlands, under the COSSE programme. COSSE also gave me the opportunity to interact with some of the great minds in the fields of Computer Science and Applied Mathematics and to develop a network of friendships which will last beyond the boundaries of the programme.
Foremost, I would like to thank my supervisor Professor Kees Vuik from TU Delft for providing regular feedback and support during the thesis.
I would like to thank MARIN, and in particular Auke Ditzel, for giving me the opportunity to do research at MARIN on this challenging project, Auke van der Ploeg for his expertise on the linear solvers, and Anneke Sicherer-Roetman for the support on the programming aspects. I would also like to thank Martijn de Jong, whose work helped me a lot.
Also, I would like to thank Professor Ulrich Ruede from FAU Erlangen for his encouragement.
During the last two years of studying abroad, the support of my parents never ended, and I am very thankful for their advice.
Akash Mittal, Delft, July 2014
¹ For more information, please visit: http://www.kth.se/en/studies/programmes/em/cosse
Contents

1 Introduction
  1.1 Background
  1.2 Earlier Work
    1.2.1 Elwin van't Wout: Fast solver and model explanation
    1.2.2 Martijn de Jong: Developing a CUDA solver for large sparse matrices for MARIN
  1.3 Research Direction
  1.4 Organization of the Thesis

2 Model Equations and Discretization
  2.1 The Variational Boussinesq Model (VBM)
    2.1.1 Hamiltonian Description
    2.1.2 Linearized Model Description
  2.2 Spatial Discretization

3 Test Problems
  3.1 One Dimensional Test Problem
  3.2 Two Dimensional Test Problem - Assessment with MATLAB and C++
  3.3 Realistic problems: IJssel, Plymouth
    3.3.1 The Gelderse IJssel
    3.3.2 The Plymouth Sound
  3.4 Testing System

4 Temporal Discretization
  4.1 Explicit Time Stepping: Leap-Frog Scheme
  4.2 Implicit Time Stepping
    4.2.1 Fully Implicit Scheme
    4.2.2 β-Implicit Scheme
    4.2.3 Stability
    4.2.4 Backward Differentiation Formula
    4.2.5 Semi-Implicit Schemes
  4.3 Symplectic Integration

5 Analysis of Time Integration Schemes
  5.1 Stability and Accuracy of various time integration schemes
    5.1.1 Benchmark Solution using MATLAB integrators
    5.1.2 Explicit Scheme
    5.1.3 Fully Implicit Schemes
    5.1.4 Semi Implicit Schemes
  5.2 Study of Implicit Trapezoidal Scheme
    5.2.1 Higher Order Schemes and Impact on Phase Error
    5.2.2 Impact of Initial Conditions
    5.2.3 Impact of Boundary Conditions
    5.2.4 Summary
  5.3 Impact of Spatial Discretization
    5.3.1 Impact of Grid size on Accuracy
  5.4 Stability Analysis with Vertical Structure ψ included
    5.4.1 Benchmark
    5.4.2 Chorin's Projection technique
  5.5 Stability Analysis for the Two Dimensional Case

6 Iterative Solvers
  6.1 Background and Previous Work
    6.1.1 Properties of the Discretization Matrix Sψψ
    6.1.2 Preconditioned Conjugate Gradient Method
    6.1.3 Repeated Red-Black (RRB) ordering as preconditioner
    6.1.4 Implementation of the RRB preconditioner
    6.1.5 Effect of RRB ordering on the sparsity pattern
    6.1.6 RRB-k method
    6.1.7 The Schur complement and the 9-point stencil
    6.1.8 Constructing the preconditioning matrix M
    6.1.9 Solving Mz = r
    6.1.10 Efficient CUDA and C++ Code Implementation
  6.2 Solvers for Non-Symmetric Matrices
    6.2.1 From Symmetric to Non-Symmetric Matrices
    6.2.2 Arnoldi Iteration and GMRES
    6.2.3 BiConjugate Gradient methods
  6.3 Combined Solver - GMRES / Bi-CGSTAB
    6.3.1 Preconditioners
    6.3.2 Preconditioner for the Skew-Symmetric Matrix
    6.3.3 Impact of different Preconditioners on the Eigenvalue Spectrum

7 C++ and CUDA Implementation
  7.1 General comments on implementation
  7.2 Construction of the Preconditioner for the Implicit Equation
  7.3 Implementation of Chorin's Projection Scheme
    7.3.1 Sparse Matrix-Vector Multiplication
    7.3.2 Solve Step for the Coupled System
    7.3.3 RRB Solver for ψ, and ζ correction
  7.4 Plugging the Solver into the Interactive Waves Code
    7.4.1 Incoming Boundary Conditions - Previous Implementation
    7.4.2 Incoming Boundary Conditions - Current Implementation

8 Results - Two Dimensional Test Case
  8.1 Convergence properties
    8.1.1 Termination Criteria
    8.1.2 Variation with ∆t for various preconditioners
    8.1.3 Variation with time step ratio
    8.1.4 Impact of the problem size
  8.2 Bi-CGSTAB vs CG: Memory requirement and Operation Count
    8.2.1 Memory requirement: C++
    8.2.2 Memory requirement: CUDA
    8.2.3 Operation Count
  8.3 Timing Results
  8.4 CUDA profiling

9 Results - Real Test Case
  9.1 Variation with time step
  9.2 Variation with problem size
  9.3 Profiling Results
  9.4 Comparison: Explicit vs. Implicit
  9.5 Concluding Remarks - Realistic Test Case

10 Dispersion Analysis
  10.1 Dispersion Relation
  10.2 Numerical Dispersion for the One dimensional test case
  10.3 Non-Uniform Grid
    10.3.1 Spatial discretization
    10.3.2 Accuracy of the Non-Uniform Grid
    10.3.3 Non-Uniform Grid Results

11 Conclusion

12 Future Work

References
Chapter 1
Introduction
1.1 Background
Simulation of ocean waves can be categorized into two major groups. The first is based on physical models, whereas the second generates the ocean waves based on either geometrical shapes or an oceanographic spectrum. The latter group requires less computational effort, but results in waves which are less realistic in nature.
Currently MARIN (Maritime Research Institute Netherlands) provides ship maneuvering simulators to shipping companies, which can be used for training and research programs. Computation of the wave field in these simulators is based on the second method and utilizes wave spectra (Fourier theory). Being deterministic in time and space, they are easy to implement on distributed simulation systems. However, this model is not interactive, that is, the impact of the ship movement on the wave field is not taken into account. In order to capture the impact of the ship movement and perform realistic simulations, the waves need to be modeled physically, but in real time, which is a requirement imposed by the ship simulator.
The physical modeling of ocean wave simulation requires solving the non-linear Navier-Stokes equations. With the current simulation techniques available, it is not possible to solve these equations in three dimensions in real time. It is, however, possible to simplify the physical model to a certain extent by making suitable approximations. This is where the so-called Boussinesq approximation comes into the picture, which can be applied to weakly non-linear and fairly long waves. The idea is to employ these approximations and develop a numerical solver which is fast enough to carry out the simulations in real time.
The essential idea in the Boussinesq approximation is the elimination of the vertical coordinate from the flow structure, while still retaining some of the influence of the vertical structure of the flow. This is possible because waves propagate in the horizontal direction, whereas the flow has a different, non-wave-like behavior in the vertical direction. In Boussinesq-like models, an approximate vertical structure of the flow is used to eliminate the vertical coordinate. The classical Boussinesq models, however, are limited to long waves, having wavelengths much longer than the water depth, and they cannot deal with varying depths.
The variational Boussinesq model (developed by Klopman [2]) can deal with deep water with varying depth, as the vertical structure is treated as a function of depth, surface elevation and other parameters. The big advantage over the deterministic model is that it can incorporate the influence of the ship movement, thus making it an interactive wave model. Owing to the variation in the vertical structure and its interactive nature, it is much more computationally intensive, and much work has been done in order to improve the running time of the solver. Two previous Master's theses, by Van't Wout [3] and De Jong [1], have focused on creating a fast linear solver and increasing the efficiency by GPU programming.
MARIN intends to use this solver to simulate cases involving multiple ships. In these cases smaller grid sizes are required, and this puts severe restrictions on the current numerical techniques.
1.2 Earlier Work
1.2.1 Elwin van't Wout: Fast solver and model explanation
Elwin's work comprised improving the linear solver that is used in the real-time ship simulator. The underlying wave model is the variational Boussinesq model as suggested by Klopman [2].
Elwin starts with the derivation of the Variational Boussinesq Model (VBM). The starting point for the derivation of the model equations is the Euler equations, which are valid for motions of ideal fluids and are a special case of the Navier-Stokes equations (inviscid, incompressible flow). In order to reduce the problem from three dimensions to two dimensions, vertical shape functions are employed. The resulting model is linearized to reduce complexity. The linearized model is described in detail in Chapter 2.
The equations are then discretized (spatially) using a finite volume method on structured Cartesian grids. Time integration is performed using an explicit Leap-Frog scheme, which is described later. One of the equations results in a system of linear equations. Elwin provides a detailed description of various numerical methods available for solving linear systems of equations. Based on the properties of the matrix involved, he selects the Conjugate Gradient method as the solver. This iterative method is then combined with various preconditioners, namely diagonal scaling, modified incomplete Cholesky and repeated red-black ordering. Elwin modified the original implementation of the repeated red-black preconditioner into a repeated red-black preconditioner with a predefined number of levels, combined with a complete Cholesky decomposition on the maximum level. Elwin concludes that the repeated red-black preconditioner requires the lowest amount of computation time in most of the cases.
1.2.2 Martijn de Jong: Developing a CUDA solver for large sparse matrices for MARIN
The C++ RRB solver developed by Elwin was able to solve the system of linear equations within 50 ms for domains no bigger than 100,000 to 200,000 nodes. The focus of Martijn's thesis was to increase the performance of the above solver by utilizing the GPU architecture. By carefully selecting the storage data structures and optimal CUDA programming parameters, Martijn was able to achieve a speedup of 30x compared to the serial code for the linear solver. Martijn reports that the new solver can solve systems with more than 1.5 million nodes within 50 ms with ease.
1.3 Research Direction
Our main target is to explore different approaches to increase the performance of the current time-dependent solver, and to allow the solution of larger problems within a similar runtime. Our approach is to first analyze the proposed algorithms on smaller test problems in MATLAB, and then implement the suitable algorithms and solvers in C++ and CUDA. The following research questions and goals are addressed in the thesis to achieve the desired target:
• Analyze the stability and accuracy of various implicit time integration schemes.
• Implement a generalized Krylov subspace method.
  – Analyze the impact on the performance (speed-up) and storage requirements of the generalized Krylov subspace methods for the CUDA and C++ code.
  – Study the performance of different preconditioner combinations for solving the coupled set of equations.
• Provide a framework to study non-uniform meshes.
1.4 Organization of the Thesis
The organization of the thesis is given below:
1. Chapter 2 describes the governing equations and their discretization.
2. Chapter 3 provides a brief overview of the test cases and the systems used for testing.
3. Chapters 4 and 5 discuss various time integration techniques.
4. Chapter 6 discusses the iterative linear solvers and preconditioners.
5. Chapter 7 discusses the implementation of the linear solvers for the selected time integration scheme.
6. Chapters 8 and 9 provide the results for the various test cases in C++ and CUDA.
7. Chapter 10 discusses the analytical results for the numerical dispersion and provides a framework for the non-uniform grid two-dimensional test case.
Chapter 2
Model Equations and Discretization
2.1 The Variational Boussinesq Model (VBM)
The basic idea of the model is to minimize the pressure in the whole fluid. The starting point for the derivation of the model equations is the physical model of the Navier-Stokes equations. The Euler equations are a special case of the Navier-Stokes equations in which viscous forces are neglected. For ocean waves this is considered to be a valid assumption, as the Reynolds number is typically large, and surface waves in water are a good example of a stationary and ergodic random process. Also, rotational effects are negligible in large water waves. The Euler equation is given by:
$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p - \mathbf{g}, \qquad (2.1)$$
The fluid velocity is given by u, ρ denotes the mass density, p the pressure and g the gravitational acceleration. Understanding the physics behind the problem is important, as it provides information on the numerical techniques required for the solution of the problem. Note that the inviscid and irrotational assumption is not necessarily valid near solid boundaries, where very small flow structures associated with turbulence result from the no-slip boundary condition. In the current model this has been ignored, and the assumptions are considered to be valid throughout the domain.
Equation 2.1 is converted to the instationary Bernoulli equation (the fluid being assumed irrotational), and then the pressure is integrated over the whole domain. The basic idea of the variational Boussinesq model is to minimize this total pressure p. For irrotational flows, the velocity can be represented by a velocity potential (φ). Another parameter, ζ (water level), is also introduced while performing the integration. The problem thus becomes finding these functions (φ and ζ). The vertical structure of the flow is often known: the velocity potential can be written as a series expansion in predefined vertical shape functions, thus reducing the 3D model to a 2D model.
2.1.1 Hamiltonian Description
Being an irrotational and incompressible flow, the system of equations has a Hamiltonian structure, where the forces acting on the system are the gravitational and the pressure force. The Hamiltonian H(ζ, ϕ) is given by the sum of the kinetic and potential energy. In order to obtain the minimal value of the total pressure, the Hamiltonian function is minimized with respect to the variables ζ and ϕ [3]. This results in:
$$\frac{\partial\zeta}{\partial t} = \nabla H_{\varphi}(\varphi, \zeta) \qquad (2.2a)$$
$$\frac{\partial\varphi}{\partial t} = \nabla H_{\zeta}(\varphi, \zeta) \qquad (2.2b)$$
where ∇Hζ refers to the partial derivative of H with respect to ζ. The structure of the above
equations will play an important role in the discussion of temporal integration which is discussed
in Chapter 4.
2.1.2 Linearized Model Description
The variational Boussinesq model constructed based on the above simplifications is non-linear in nature. In order to simplify these equations, and thus reduce the computational effort, the Hamiltonian equations given by Equation (2.2) are linearized. A detailed derivation from the Hamiltonian to the linear set of equations is given in [3].
The final set of equations after performing the linearization is given by:
$$\frac{\partial\zeta}{\partial t} + \nabla\cdot(\zeta\mathbf{U} + h\nabla\varphi - hD_0\nabla\psi) = 0, \qquad (2.3a)$$
$$\frac{\partial\varphi}{\partial t} + \mathbf{U}\cdot\nabla\varphi + g\zeta = P_s, \qquad (2.3b)$$
$$M_0\psi + \nabla\cdot(hD_0\nabla\varphi - N_0\nabla\psi) = 0, \qquad (2.3c)$$
Notes:
1. Equations 2.3 are solved for three basic variables: the water level ζ, the (2-D) surface velocity potential ϕ and the vertical structure ψ. The splitting of the velocity potential into the surface potential and the vertical shape function is given by:
$$\phi(x, y, z, t) = \varphi(x, y, t) + f(z)\,\psi(x, y, t) \qquad (2.4)$$
2. The shape function f is chosen as either a parabolic or a cosine-hyperbolic shape. The model parameters (the functionals D, M and N) are computed using the shape function f.
3. The water depth h = h(x, y, t) is relative to the reference level z = 0. The bottom of a basin, river or sea is thus at level z = −h.
4. The total velocity can be split into the average current velocity and the velocity due to the wave front. U = U(x, y, t) is the time-averaged horizontal velocity of the current and is used as an external input to the model.
5. Equations 2.3 represent the motion of the waves; the impact of a moving ship is not seen directly. In order to model a ship, a pressure pulse on the water surface is defined. In the pressure functional given by Equation (2.3b), Ps represents the source term, with Ps := −ps/ρ. The draft of a ship's hull is the vertical distance between the waterline and the bottom of the hull. Given the draft of a ship, ps is computed as the hydrostatic pressure at the given depth. Let the draft be given as ds; then ps = g ds α(x, y), with α(x, y) a shape function that is one in the middle of the ship and zero on the horizontal boundary of the ship. Alternatively, the shape of the ship's hull can be imported from an external surface description file.
2.2 Spatial Discretization
Both Elwin and Martijn have assumed the domain to be rectangular, and divided it into a rectilinear Cartesian grid. The dimensions of the domain are $L_x \times L_y$, and it is divided into $N_x \times N_y$ grid points. The outermost nodes represent the physical boundary. The mesh spacing in each direction is given by $\Delta x = \frac{L_x}{N_x - 1}$ and $\Delta y = \frac{L_y}{N_y - 1}$. An example is given in Figure 2.1.
Figure 2.1: Physical domain and corresponding Cartesian grid
The model Equations 2.3 are discretized using a finite volume method. The variables are evaluated at the grid points, and the finite volumes are rectangles of size ∆x × ∆y centered around the grid points. The derivatives are approximated with centered differences, yielding a five-point stencil. For the grid point located at C (= center), the surrounding control volume V and its four nearest neighbors (N = north, E = east, S = south, W = west) are indicated in Figure 2.1. On node (i, j), the discretized versions of the variables ζ, ϕ and ψ are denoted by ζij, ϕij and ψij. In order to put the variables in matrix format, a one-dimensional ordering of the variables is defined, which gives the vectors $\vec\zeta$, $\vec\varphi$ and $\vec\psi$. The spatial discretization can be written as:
$$\frac{d}{dt}\begin{pmatrix}\vec\zeta\\ \vec\varphi\\ 0\end{pmatrix} + \begin{pmatrix}S_{\zeta\zeta} & S_{\zeta\varphi} & S_{\zeta\psi}\\ S_{\varphi\zeta} & S_{\varphi\varphi} & S_{\varphi\psi}\\ S_{\psi\zeta} & S_{\psi\varphi} & S_{\psi\psi}\end{pmatrix}\begin{pmatrix}\vec\zeta\\ \vec\varphi\\ \vec\psi\end{pmatrix} = \begin{pmatrix}0\\ P_s\\ 0\end{pmatrix} \qquad (2.5)$$
The matrices S are given by five-point stencils as follows:
$$S_{\zeta\zeta}:\;\begin{bmatrix}0 & \tfrac{1}{2\Delta y}V_N & 0\\[2pt] -\tfrac{1}{2\Delta x}U_W & \tfrac{1}{2\Delta x}(U_E-U_W)+\tfrac{1}{2\Delta y}(V_N-V_S) & \tfrac{1}{2\Delta x}U_E\\[2pt] 0 & -\tfrac{1}{2\Delta y}V_S & 0\end{bmatrix}\qquad(2.6)$$
where U and V denote the current velocity in the x and y direction respectively, evaluated at the corresponding cell faces.
$$S_{\zeta\varphi}:\;\begin{bmatrix}0 & \tfrac{1}{\Delta y^2}h_N & 0\\[2pt] \tfrac{1}{\Delta x^2}h_W & -\tfrac{1}{\Delta x^2}(h_W+h_E)-\tfrac{1}{\Delta y^2}(h_N+h_S) & \tfrac{1}{\Delta x^2}h_E\\[2pt] 0 & \tfrac{1}{\Delta y^2}h_S & 0\end{bmatrix}\qquad(2.7)$$
$$S_{\zeta\psi}:\;\begin{bmatrix}0 & -\tfrac{1}{\Delta y^2}h_N D_N & 0\\[2pt] -\tfrac{1}{\Delta x^2}h_W D_W & \tfrac{1}{\Delta x^2}(h_W D_W+h_E D_E)+\tfrac{1}{\Delta y^2}(h_N D_N+h_S D_S) & -\tfrac{1}{\Delta x^2}h_E D_E\\[2pt] 0 & -\tfrac{1}{\Delta y^2}h_S D_S & 0\end{bmatrix}\qquad(2.8)$$
$$S_{\varphi\zeta} = g,\qquad S_{\varphi\psi} = 0,\qquad S_{\psi\zeta} = 0\qquad(2.9)$$
$$S_{\varphi\varphi}:\;\begin{bmatrix}0 & \tfrac{1}{2\Delta y}V_N & 0\\[2pt] -\tfrac{1}{2\Delta x}U_W & -\tfrac{1}{2\Delta x}(U_E-U_W)-\tfrac{1}{2\Delta y}(V_N-V_S) & \tfrac{1}{2\Delta x}U_E\\[2pt] 0 & -\tfrac{1}{2\Delta y}V_S & 0\end{bmatrix}\qquad(2.10)$$
$$S_{\psi\varphi}:\;\Delta x\,\Delta y\begin{bmatrix}0 & \tfrac{1}{\Delta y^2}h_N D_N & 0\\[2pt] \tfrac{1}{\Delta x^2}h_W D_W & -\tfrac{1}{\Delta x^2}(h_W D_W+h_E D_E)-\tfrac{1}{\Delta y^2}(h_N D_N+h_S D_S) & \tfrac{1}{\Delta x^2}h_E D_E\\[2pt] 0 & \tfrac{1}{\Delta y^2}h_S D_S & 0\end{bmatrix}\qquad(2.11)$$
$$S_{\psi\psi}:\;\Delta x\,\Delta y\begin{bmatrix}0 & -\tfrac{1}{\Delta y^2}N_N & 0\\[2pt] -\tfrac{1}{\Delta x^2}N_W & \tfrac{1}{\Delta x^2}(N_W+N_E)+\tfrac{1}{\Delta y^2}(N_N+N_S)+M & -\tfrac{1}{\Delta x^2}N_E\\[2pt] 0 & -\tfrac{1}{\Delta y^2}N_S & 0\end{bmatrix}\qquad(2.12)$$
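To make explicit how such five-point stencils translate into a sparse matrix, a minimal MATLAB sketch is given below. It assembles a generic stencil block from per-node centre/north/east/south/west coefficient arrays on an Nx × Ny grid with lexicographic ordering; the function name and the coefficient arrays cC, cN, cE, cS, cW are placeholders introduced only for illustration, boundary conditions are not applied, and this is not MARIN's implementation.

```matlab
% Assemble a sparse matrix from a five-point stencil on an Nx-by-Ny grid.
% Nodes are ordered lexicographically: k = i + (j-1)*Nx, i = 1..Nx, j = 1..Ny.
% cC, cN, cE, cS, cW are Nx-by-Ny arrays with the stencil coefficient per node.
% Neighbours outside the grid are simply dropped (no boundary treatment).
function S = assemble_five_point(cC, cN, cE, cS, cW)
    [Nx, Ny] = size(cC);
    idx = @(i, j) i + (j - 1) * Nx;          % lexicographic node index
    rows = []; cols = []; vals = [];
    for j = 1:Ny
        for i = 1:Nx
            k = idx(i, j);
            rows(end+1) = k; cols(end+1) = k; vals(end+1) = cC(i, j); %#ok<AGROW>
            if i > 1,  rows(end+1) = k; cols(end+1) = idx(i-1, j); vals(end+1) = cW(i, j); end
            if i < Nx, rows(end+1) = k; cols(end+1) = idx(i+1, j); vals(end+1) = cE(i, j); end
            if j > 1,  rows(end+1) = k; cols(end+1) = idx(i, j-1); vals(end+1) = cS(i, j); end
            if j < Ny, rows(end+1) = k; cols(end+1) = idx(i, j+1); vals(end+1) = cN(i, j); end
        end
    end
    S = sparse(rows, cols, vals, Nx * Ny, Nx * Ny);
end
```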
The system can be written as:
$$\dot q = Lq + f, \qquad (2.13a)$$
$$S\vec\psi = b \qquad (2.13b)$$
with $q = \begin{pmatrix}\vec\zeta\\ \vec\varphi\end{pmatrix}$ and $\dot q$ its time derivative. The matrix $L = -\begin{pmatrix}S_{\zeta\zeta} & S_{\zeta\varphi}\\ S_{\varphi\zeta} & S_{\varphi\varphi}\end{pmatrix}$ is the spatial discretization matrix and $f = -\begin{pmatrix}S_{\zeta\psi}\vec\psi\\ S_{\varphi\psi}\vec\psi\end{pmatrix}$.
Elwin and Martijn focused on solving the system of equations represented by Equation (2.13b), whereas Equation (2.13a) was solved using the explicit time integration scheme described in the next chapter.
Chapter 3
Test Problems
3.1 One Dimensional Test Problem
To assess the stability and accuracy of various implicit time integration procedures, we shall consider the simplified one-dimensional problem given below as the test problem.
$$\frac{\partial\zeta}{\partial t} + \nabla\cdot(\zeta U + h\nabla\varphi) = 0, \qquad (3.1a)$$
$$\frac{\partial\varphi}{\partial t} + U\cdot\nabla\varphi + g\zeta = 0, \qquad (3.1b)$$
Here we have neglected the impact of the vertical structure ψ and of the pressure pulse Ps. This is now a coupled initial boundary value problem, where both equations are hyperbolic in nature and together represent a Hamiltonian set. The boundary conditions for both ζ and ϕ are taken as periodic.
Assuming no spatial variation of the mean current velocity U and the depth h, the above equations can be written as:
$$\frac{\partial\zeta}{\partial t} + U\frac{\partial\zeta}{\partial x} + h\frac{\partial^2\varphi}{\partial x^2} = 0, \qquad (3.2a)$$
$$\frac{\partial\varphi}{\partial t} + U\frac{\partial\varphi}{\partial x} + g\zeta = 0, \qquad (3.2b)$$
Implicit time integration procedures require solving a system of linear equations. We will use MATLAB to solve the system after performing the spatial discretization and applying the boundary conditions. The idea here is to assess the stability and accuracy of the various methods, not their performance. The solutions from the various time integration procedures will be benchmarked against MATLAB's inbuilt ODE integrators.
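A minimal MATLAB sketch of this set-up is given below. It builds the periodic central-difference matrices for Equation (3.2) on a uniform grid; the grid size N = 200 and the parameter values are our own illustrative choices (matching the test case used later in Chapter 5), not part of the thesis code.

```matlab
% One-dimensional test problem (3.2) with periodic boundary conditions.
% Central differences on a uniform grid give circulant difference matrices.
L  = 100;  N  = 200;  dx = L / N;                 % periodic grid, x_i = (i-1)*dx
U  = 1.0;  h  = 50;   g  = 9.81;

e  = ones(N, 1);
D1 = spdiags([-e zeros(N,1) e], -1:1, N, N);      % first derivative (times 1/(2*dx))
D1(1, N) = -1;  D1(N, 1) = 1;                     % periodic wrap-around
D2 = spdiags([e -2*e e], -1:1, N, N);             % second derivative (times 1/dx^2)
D2(1, N) = 1;   D2(N, 1) = 1;

Szz = (U / (2*dx)) * D1;      % S_zeta_zeta : U * d/dx
Szp = (h / dx^2)   * D2;      % S_zeta_phi  : h * d^2/dx^2
Spz = g * speye(N);           % S_phi_zeta  : g * I
Spp = (U / (2*dx)) * D1;      % S_phi_phi   : U * d/dx

% Coupled semi-discrete system: d/dt [zeta; phi] = -A * [zeta; phi]
A = [Szz, Szp; Spz, Spp];
```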
3.2 Two Dimensional Test Problem - Assessment with MATLAB and C++
In order to analyze the stability and accuracy of the complete system and implement the required solver in C++ and CUDA, it is necessary to study the two dimensional case, which results in a pentadiagonal matrix instead of a tridiagonal one.
In this test problem, a simplified problem is considered which addresses the issues related to the coupling of the vertical structure with the wave height and the surface potential in two dimensions. First the system is tested in MATLAB to understand the stability of the required scheme, and then it is implemented in C++ and CUDA. The test problem is used to build the coupled RRB solver for the implicit set of equations, first in C++ and then in CUDA. The solutions obtained from C++ and CUDA can then be benchmarked against the results from MATLAB.
The equations considered are:
$$\frac{\partial\zeta}{\partial t} + U_x\frac{\partial\zeta}{\partial x} + U_y\frac{\partial\zeta}{\partial y} + h\Big(\frac{\partial^2\varphi}{\partial x^2} + \frac{\partial^2\varphi}{\partial y^2}\Big) - hD\Big(\frac{\partial^2\psi}{\partial x^2} + \frac{\partial^2\psi}{\partial y^2}\Big) = 0, \qquad (3.3a)$$
$$\frac{\partial\varphi}{\partial t} + U_x\frac{\partial\varphi}{\partial x} + U_y\frac{\partial\varphi}{\partial y} + g\zeta = 0, \qquad (3.3b)$$
$$M\psi + hD\Big(\frac{\partial^2\varphi}{\partial x^2} + \frac{\partial^2\varphi}{\partial y^2}\Big) - N\Big(\frac{\partial^2\psi}{\partial x^2} + \frac{\partial^2\psi}{\partial y^2}\Big) = 0, \qquad (3.3c)$$
The algorithm required to solve the above set of equations is described in detail in Section 5.4.2.
3.3 Realistic problems: IJssel, Plymouth
Two realistic problems are obtained from MARIN's database and from the previous simulations by Martijn in [1]. Testing will be carried out with respect to the validated models in [1]. A very small time step (much lower than the allowable limit) will first be used to carry out simulations with the current model, which uses explicit Leap-Frog time integration. The results from implicit time integration will then be compared with the results from explicit time integration.
3.3.1 The Gelderse IJssel
The Gelderse IJssel, a small river, is a branch of the Rhine in the Dutch provinces Gelderland and Overijssel. The river flows from Westervoort and discharges into the IJsselmeer. In Figure 3.1 a (small) part of the river is shown. From this part several test problems are extracted. This is done by extracting small regions out of the displayed region. For the discretization an equidistant 2 m by 2 m grid is used.
Figure 3.1: The Gelderse IJssel (Google Maps)
3.3.2 The Plymouth Sound
Plymouth Sound is a bay located at Plymouth, a town in the south shore region of England, United Kingdom. The Plymouth Breakwater is a dam in the centre of the Plymouth Sound which protects anchored ships in the northern part of the Plymouth Sound against south-western storms. From this region test problems are also extracted, see Figure 3.2. For the discretization an equidistant 5 m by 5 m grid is used.
Figure 3.2: The Plymouth Sound (Google Maps)
3.4 Testing System
Below is an overview of the machine that has been used in our simulations. This system can be considered to have one of the best available GPU cards along with a strong CPU configuration. Although it is possible to obtain GPU cards with higher specifications, their cost is much higher.
Brand / Type:          Dell Precision Workstation T3600
Owner / System no.:    MARIN LIN0245
CPU:                   Intel Xeon E5-1620 @ 3.6 GHz
No. of cores:          4 (8 threads)
Cache:                 256 kB L1 / 1 MB L2 / 10 MB L3
Memory:                15.5 GB RAM DDR2 @ 1066 MHz
Motherboard:           Dell Custom
Operating System:      Scientific Linux Release 6.5
System kernel:         2.6.32-431.17.1.el6.x86_64
CUDA release:          6.0
Nvidia driver version: 331.62
GCC version:           4.4.7
GPU:                   (CUDA) GeForce GTX 780
GPU memory:            3072 MB
No. of GPU cores:      12 SM × 192 cores/SM = 2304 cores
GPU clock rate:        941 MHz
Compute capability:    3.5
Chapter 4
Temporal Discretization
The system of equations to be advanced in time is given in Equation (2.13a). Expanding the equation we get:
$$\dot{\vec\zeta} = -S_{\zeta\zeta}\vec\zeta - S_{\zeta\varphi}\vec\varphi - S_{\zeta\psi}\vec\psi \qquad (4.1a)$$
$$\dot{\vec\varphi} = -S_{\varphi\zeta}\vec\zeta - S_{\varphi\varphi}\vec\varphi - S_{\varphi\psi}\vec\psi + P_s \qquad (4.1b)$$
The equation for ϕ can be further simplified taking into account the structure of the stencils:
$$\dot{\vec\varphi} = -S_{\varphi\varphi}\vec\varphi - g\vec\zeta + P_s \qquad (4.1c)$$
where g is the gravitational constant. The above equations, in the form of Equation (2.13a), are solved using the explicit Leap-Frog scheme in [1] and [3]. The method is described in Section 4.1.
The equation for the variable ψ, which does not contain a time derivative but still depends on the other variables, is given by
$$S_{\psi\psi}\vec\psi = -S_{\psi\varphi}\vec\varphi \qquad (4.1d)$$
4.1 Explicit Time Stepping: Leap-Frog Scheme
In explicit schemes, the fluxes and the sources are computed at the nth (previous) time level and their contribution is added to the current value of the variable. In the work by Elwin and Martijn, the Leap-Frog method has been used to integrate Equation (2.13a). The Leap-Frog method is one of the symplectic integration techniques designed for the numerical solution of Hamilton's equations, given by Equation (2.2).
The method described in [3] depends on two previous time steps (a so-called multistep method). The exact solution to Equation (2.13a) is approximated at the time intervals tn = n∆t, n = 0, 1, 2, . . ., with ∆t > 0 being the time step size. The numerical approximations are denoted by qn ≈ q(tn).
To keep the derivation short, we will first focus on a fixed constant step size ∆t := tn+1 − tn. A Taylor series expansion gives:
$$q_{n+1} = q_n + \Delta t\,\dot q_n + \tfrac12\Delta t^2\,\ddot q_n + \tfrac16\Delta t^3\,\dddot q_n + O(\Delta t^4), \qquad (4.2a)$$
$$q_{n-1} = q_n - \Delta t\,\dot q_n + \tfrac12\Delta t^2\,\ddot q_n - \tfrac16\Delta t^3\,\dddot q_n + O(\Delta t^4) \qquad (4.2b)$$
Subtracting the second equation from the first, we obtain:
$$q_{n+1} - q_{n-1} = 2\Delta t\,\dot q_n + O(\Delta t^3), \qquad (4.3)$$
which reduces to
$$\dot q_n = \frac{q_{n+1} - q_{n-1}}{2\Delta t} + O(\Delta t^2). \qquad (4.4)$$
The explicit Leap-Frog method is second order accurate in time, and is only conditionally stable. The main advantage of this method is that it is simple and fast, and it is easily parallelizable. The disadvantage is that the time step is limited by the numerical stability requirements. For the equations represented by Equation (2.13a), the stability condition requires that the time step is shorter than the crossing time of the grid cells by the fastest wave:
$$\Delta t < \frac{\Delta x_i}{c_{\max}} \qquad (4.5)$$
for all grid cells and all directions i = x, y, where c represents the velocity of the wave. This is the famous Courant-Friedrichs-Lewy (CFL) condition, valid for explicit time integration of arbitrary hyperbolic PDEs. In the current implementation, with the uniform grid, ∆x is constant for each cell, and the velocity of the surface gravity wave is used to determine the time step such that it satisfies the CFL condition. A safety margin is added to ensure stability without needing to check the stability criterion at each time step.
For example, given the equation ẏ = λy (λ arises as an eigenvalue of a local Jacobian and can be complex), the Leap-Frog method is stable only for |λ∆t| ≤ 1. In order to approximate the interaction between the ship and the waves, a grid size of the order of half a meter or smaller is required. This limits the time step to at most 0.01 seconds, while the maneuvering simulator at MARIN often uses time steps as large as 0.1 seconds.
The intention to use smaller mesh sizes, to capture details of the wave interaction with ships at finer levels, will require a further reduction of the time step, hence increasing the computation time. This provides the motivation to explore time integration methods which are more stable, allowing us to use larger time steps while still giving good accuracy.
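For reference, a minimal MATLAB sketch of the Leap-Frog update for the semi-discrete system q̇ = Lq + f is given below; the matrix L, the source f and the time step dt are placeholders, and the CFL restriction (4.5) on dt is assumed to hold. The first step uses Forward Euler to start the two-level recursion.

```matlab
% Leap-Frog time stepping for q' = L*q + f (linear, autonomous system).
% q0 is the initial state, dt the (CFL-limited) time step, nsteps the number of steps.
function q = leapfrog(L, f, q0, dt, nsteps)
    qold = q0;
    q    = q0 + dt * (L * q0 + f);          % Forward Euler start-up step
    for n = 2:nsteps
        qnew = qold + 2 * dt * (L * q + f); % q_{n+1} = q_{n-1} + 2*dt*(L*q_n + f)
        qold = q;
        q    = qnew;
    end
end
```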
4.2 Implicit Time Stepping
4.2.1 Fully Implicit Scheme
The stability of the time integration can be significantly improved by employing fully implicit time integration techniques. The most common and simplest of the fully implicit schemes is the Backward Euler scheme. For the set of equations given by Equation (2.13a), we get:
$$\frac{q_{n+1} - q_n}{\Delta t} = Lq_{n+1} - S_{\zeta\psi}\vec\psi_{n+1} \qquad (4.6a)$$
where $q = \begin{pmatrix}\vec\zeta\\ \vec\varphi\end{pmatrix}$. The method is first order accurate in time (as can be derived from a Taylor series expansion) and is unconditionally stable.
While the stability of the implicit time discretization is a great improvement over the explicit schemes, one has to solve the implicit equations for the unknowns $q_{n+1}$ and $\vec\psi_{n+1}$, which requires solving a system of (differential-algebraic) equations represented by:
$$(I - \Delta t\,L)\,q_{n+1} = q_n - \Delta t\,S_{\zeta\psi}\vec\psi_{n+1} \qquad (4.7a)$$
$$S_{\psi\psi}\vec\psi_{n+1} = -S_{\psi\varphi}\vec\varphi_{n+1} \qquad (4.7b)$$
As the method is first order accurate in time, while the spatial discretization is second order accurate, the overall accuracy of the method is only first order. Such methods are usually good for reaching steady-state results, but are not very accurate in time. The disadvantages of using the fully implicit scheme include:
• Moving from explicit to implicit schemes incurs extra computational effort, as it requires solving the linear system of equations given by Equation (4.7a). For the system represented above, iterative linear solvers are used, which are described in more detail in Chapter 6.
• First order implicit methods simply under-relax more to maintain the stability of the iterative solution. It is this increased damping, which grows with the time-step size, that induces more inaccuracies in the transient behavior.
• The spatial discretization plays an important role in the stability of the numerical solution of unsteady hyperbolic equations. Generally, for the Euler equations, the central difference scheme is more accurate than first order upwind schemes. Stability is also not an issue for the central differencing scheme here, as diffusive forces are not active. If flux limiters, which are capable of capturing shocks, discontinuities or sharp changes in the solution domain, are used for the spatial discretization, the implicit scheme results in a non-linear system of equations, which then requires more computational effort (computation of the Jacobian).
4.2.2 β-Implicit Scheme
Semi-implicit methods try to combine the stability of the implicit methods with the efficiency of the explicit methods. The system of equations is discretized using a one-parameter (β) implicit scheme to advance in time:
$$q_{n+1} = q_n + \Delta t\,\big(\beta(Lq_{n+1} - S_{\zeta\psi}\vec\psi_{n+1}) + (1-\beta)(Lq_n - S_{\zeta\psi}\vec\psi_n)\big) \qquad (4.8a)$$
The parameter β can vary between 0 and 1. For β = 1 we get the Backward Euler method (fully implicit), and for β = 0.5 the so-called trapezoidal scheme.
For β = 0.5 we can rewrite the equations as:
$$q_{n+1} = q_n + 0.5\Delta t\,(\dot q_{n+1} + \dot q_n) \qquad (4.9a)$$
$$\phantom{q_{n+1}} = q_n + \Delta t\,\dot q_n + 0.5\Delta t\,(\dot q_{n+1} - \dot q_n) \qquad (4.9b)$$
$$\phantom{q_{n+1}} = q_n + \Delta t\,\dot q_n + \tfrac12\Delta t^2\frac{d\dot q}{dt} + O(\Delta t^3) \qquad (4.9c)$$
which gives the trapezoidal method, of second order temporal accuracy.
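The β-scheme leads to one linear solve per time step. A minimal MATLAB sketch for the homogeneous case q̇ = Lq is given below (the ψ coupling of (4.8a) is left out for brevity, and the direct backslash solve stands in for the iterative solvers discussed in Chapter 6); beta = 1 gives Backward Euler and beta = 0.5 the implicit trapezoidal rule.

```matlab
% One-parameter (beta) implicit scheme for q' = L*q:
%   (I - beta*dt*L) q_{n+1} = (I + (1-beta)*dt*L) q_n
function q = beta_scheme(L, q0, dt, nsteps, beta)
    n    = size(L, 1);
    Aimp = speye(n) - beta * dt * L;         % implicit (left-hand side) operator
    Aexp = speye(n) + (1 - beta) * dt * L;   % explicit (right-hand side) operator
    q = q0;
    for k = 1:nsteps
        q = Aimp \ (Aexp * q);               % one linear solve per time step
    end
end
```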
4.2.3 Stability
The Scalar Test Equation
To understand the stability of time integration methods, we consider the scalar, complex test equation:
$$w'(t) = \lambda w(t) \qquad (4.10)$$
where λ ∈ ℂ. Application of the time integration scheme (explicit or implicit) gives:
$$w_{n+1} = R(\Delta t\lambda)\,w_n, \qquad (4.11)$$
where R(∆tλ) is called the stability function. Let z = ∆tλ. For explicit schemes of order s, R(z) is a polynomial of degree ≤ s; for implicit methods it is a rational function with the degrees of both numerator and denominator ≤ s. The stability region is defined in terms of R(z) as:
$$S = \{z \in \mathbb{C} : |R(z)| \le 1\} \qquad (4.12)$$
A scheme with the property that S contains the entire left half plane ℂ⁻ = {z ∈ ℂ : Re(z) ≤ 0} is called A-stable. A scheme is said to be strongly A-stable if it is A-stable with |R(∞)| < 1, and it is said to be L-stable if in addition R(∞) = 0.
For the semi-implicit scheme with parameter β, the stability function is $R(z) = \dfrac{1 + (1-\beta)z}{1 - \beta z}$. The implicit trapezoidal rule, given by β = 0.5, is A-stable, whereas the fully implicit Backward Euler method (β = 1) is L-stable.
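The following small MATLAB check, included only as an illustration, evaluates |R(z)| on the imaginary axis (where the eigenvalues of the undamped wave system lie) for β = 0.5 and β = 1: the trapezoidal rule gives |R| = 1 (no artificial damping), while Backward Euler gives |R| < 1 away from the origin (damping).

```matlab
% Stability function of the beta-scheme: R(z) = (1 + (1-beta)*z) / (1 - beta*z).
R = @(z, beta) (1 + (1 - beta) .* z) ./ (1 - beta .* z);

z = 1i * linspace(-10, 10, 401);                              % purely imaginary test values
fprintf('max |R| on imaginary axis, beta=0.5 : %.4f\n', max(abs(R(z, 0.5))));   % = 1
fprintf('max |R| on imaginary axis, beta=1.0 : %.4f\n', max(abs(R(z, 1.0))));   % <= 1
fprintf('|R(2i)|: beta=0.5 -> %.4f,  beta=1.0 -> %.4f\n', abs(R(2i, 0.5)), abs(R(2i, 1.0)));
```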
Stability for Linear Systems
Let the linear system of equations (m equations, m unknowns) be given as:
$$w'(t) = Aw(t) + g(t) \qquad (4.13)$$
with A ∈ ℝ^{m×m}. Application of the semi-implicit scheme with parameter β gives:
$$w_{n+1} = R(\Delta t A)\,w_n + (I - \beta\Delta t A)^{-1}\Delta t\,g_{n+\beta} \qquad (4.14)$$
where
$$R(\Delta t A) = (I - \beta\Delta t A)^{-1}(I + (1-\beta)\Delta t A) \qquad (4.15)$$
and g_{n+β} = (1 − β)g(t_n) + βg(t_{n+1}). Starting from the initial solution w₀, we obtain:
$$w_n = R(\Delta t A)^n w_0 + \Delta t\sum_{i=0}^{n-1} R(\Delta t A)^{n-i-1}(I - \beta\Delta t A)^{-1} g_{i+\beta} \qquad (4.16)$$
If we perturb the initial solution to ŵ₀, we get the following formula for the perturbed solution at the nth time step:
$$\hat w_n - w_n = R(\Delta t A)^n(\hat w_0 - w_0) \qquad (4.17)$$
Hence, the powers R(∆tA)^n determine the growth of the initial errors.
Let λ_j, 1 ≤ j ≤ m, denote the eigenvalues of the matrix A, and let A be diagonalizable such that A = UΛU⁻¹ with Λ = diag(λ_j) [7]. Let K be the condition number of U. Then ∆tλ_j ∈ S for all 1 ≤ j ≤ m implies ‖R(∆tA)^n‖ ≤ K for all n ≥ 1, where S is the stability region described above for the scalar test equation.
Stability of the semi-implicit methods requires a moderate bound on these powers. If the powers are not bounded, some ∆tλ_j does not belong to the stability region S, resulting in an unstable method.
4.2.4 Backward Differentiation Formula
Another approach to achieve second order temporal accuracy is to use information from multiple previous time steps. This gives rise to a two-parameter, three-level time integration scheme:
$$q_{n+1} = q_n + \Delta t\Big(\alpha\frac{q_n - q_{n-1}}{\Delta t} - \alpha\dot q_n + \beta\dot q_{n+1} + (1-\beta)\dot q_n\Big) \qquad (4.18)$$
where $\dot q_n$ represents the derivative of q at the nth time level. This scheme is three-level whenever the parameter α ≠ 0. When α = 1/3 and β = 2/3, we obtain the second order Backward Differentiation Formula (BDF2) for constant time step ∆t.
The analysis of the accuracy of this multi-step method is presented below. Let us assume that q_n was computed from q_{n−1} with second order temporal accuracy. This implies
$$\frac{q_n - q_{n-1}}{\Delta t} = \dot q_n - \frac{\Delta t}{2}\frac{d\dot q}{dt} + O(\Delta t^2) \qquad (4.19)$$
Substituting this into Equation (4.18):
$$q_{n+1} = q_n + \Delta t\Big(\dot q_n - \alpha\frac{\Delta t}{2}\frac{d\dot q}{dt} + \beta\Delta t\frac{d\dot q}{dt} + O(\Delta t^2)\Big) \qquad (4.20a)$$
$$\phantom{q_{n+1}} = q_n + \Delta t\,\dot q_n + \Delta t^2\Big(\beta - \frac{\alpha}{2}\Big)\frac{d\dot q}{dt} + O(\Delta t^3) \qquad (4.20b)$$
The method is second order accurate in time when 2β − α = 1. At the start of the simulation, one could use the trapezoidal scheme for second order accuracy, or the more stable Backward Euler method (as the value at the (n−1)th time step is not available). The BDF2 method has better stability properties than the trapezoidal rule, and is regularly used for stiff problems. It can also be viewed as an implicit counterpart of the explicit Leap-Frog scheme.
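For the linear system q̇ = Lq, the constant-step BDF2 recursion can be rearranged to (3/2 I − ∆tL) q_{n+1} = 2q_n − ½q_{n−1}. A minimal MATLAB sketch is given below, with a Backward Euler step used to generate the first value, as suggested above; it is an illustration only, not the solver used in the remainder of the thesis.

```matlab
% BDF2 time stepping for q' = L*q with a Backward Euler start-up step.
function q = bdf2(L, q0, dt, nsteps)
    n    = size(L, 1);
    I    = speye(n);
    qold = q0;
    q    = (I - dt * L) \ q0;                    % Backward Euler for the first step
    Aimp = 1.5 * I - dt * L;                     % constant BDF2 system matrix
    for k = 2:nsteps
        qnew = Aimp \ (2 * q - 0.5 * qold);      % (3/2 I - dt*L) q_{n+1} = 2 q_n - 1/2 q_{n-1}
        qold = q;
        q    = qnew;
    end
end
```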
4.2.5 Semi-Implicit Schemes
Any implicit method requires solving a linear system of equations. In [1], a linear solver has been developed for the case in which the system of equations is represented by a pentadiagonal matrix. From Equation (4.7a), the linear system of equations formed for the variables $q_{n+1}$ is block pentadiagonal instead of pentadiagonal, as $q = \begin{pmatrix}\vec\zeta\\ \vec\varphi\end{pmatrix}$ and each of $\vec\zeta$ and $\vec\varphi$ is represented by its own five-point stencil.
As previously mentioned, it is our intention to make use of the linear solver in [1] with the minimum modifications possible. For this reason, we split Equation (4.7a) into Equation (4.1a) and Equation (4.1b), and then derive the corresponding equations for the various implicit time integration procedures.
In the first approach, we implicitly advance the variable $\vec\zeta$, and use the previous time step values of the variables $\vec\varphi$ and $\vec\psi$. The equations are given below:
$$\vec\zeta_{n+1} = \vec\zeta_n + \Delta t\,(-S_{\zeta\zeta}\vec\zeta_{n+1} - S_{\zeta\varphi}\vec\varphi_n - S_{\zeta\psi}\vec\psi_n) \qquad (4.21a)$$
$$\Big(\frac{I}{\Delta t} + S_{\zeta\zeta}\Big)\vec\zeta_{n+1} = \frac{1}{\Delta t}\vec\zeta_n - (S_{\zeta\varphi}\vec\varphi_n + S_{\zeta\psi}\vec\psi_n) \qquad (4.21b)$$
Please note that the variables ϕ and ψ are treated explicitly here.
After advancing $\vec\zeta$, the equation for $\vec\varphi$ is advanced implicitly:
$$\vec\varphi_{n+1} = \vec\varphi_n + \Delta t\,(-S_{\varphi\zeta}\vec\zeta_{n+1} - S_{\varphi\varphi}\vec\varphi_{n+1} - S_{\varphi\psi}\vec\psi_n) \qquad (4.22a)$$
$$\Big(\frac{I}{\Delta t} + S_{\varphi\varphi}\Big)\vec\varphi_{n+1} = \frac{1}{\Delta t}\vec\varphi_n - (S_{\varphi\zeta}\vec\zeta_{n+1} + S_{\varphi\psi}\vec\psi_n) \qquad (4.22b)$$
Now only the variable ψ is treated explicitly. As the value of ζ at the (n+1)th time step is available, we are able to use it. After obtaining the values of both ζ and ϕ at the new time step, the linear system of equations for $\vec\psi$ is solved.
As this is a combination of an implicit and an explicit scheme, the stability of the method is not guaranteed. Another possibility is to use a predictor-corrector kind of scheme, described below:
• Advance $\vec\varphi$ using an explicit time integration scheme.
• Based on the new value of $\vec\varphi$, compute $\vec\psi$ by solving the linear system of equations.
• Implicitly advance $\vec\zeta$, where the values of $\vec\varphi$ and $\vec\psi$ on the right-hand side of Equation (4.21) are substituted by the values computed above.
• Perform an implicit correction of $\vec\varphi$ and recompute $\vec\psi$.
This will demand more computational effort, but will be more stable than the scheme of Equation (4.21); a sketch of one such step is given below. Similar derivations can be done for higher order methods.
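The MATLAB sketch below illustrates one time step of this splitting in the notation of Equations (4.21)-(4.22), followed by the ψ solve of Equation (4.1d). It uses the Backward Euler form of (4.21b)/(4.22b) for brevity, the struct field names for the stencil blocks are our own placeholders, and backslash replaces the RRB-preconditioned solver of [1]; it is meant to show the order of the sub-steps, not MARIN's implementation.

```matlab
% One semi-implicit step in the spirit of Eqs. (4.21)-(4.22):
% zeta is advanced implicitly with the old phi/psi, then phi implicitly with the new zeta,
% and finally psi is recomputed from the algebraic constraint, Eq. (4.1d).
% S is a struct with the stencil blocks: zz = S_zeta_zeta, zp = S_zeta_phi, zpsi = S_zeta_psi,
% pz = S_phi_zeta, pp = S_phi_phi, ppsi = S_phi_psi, psip = S_psi_phi, psipsi = S_psi_psi.
function [zeta, phi, psi] = semi_implicit_step(S, zeta, phi, psi, dt)
    n = numel(zeta);
    I = speye(n);
    % Step 1: implicit update of zeta, Eq. (4.21b)
    zeta = (I / dt + S.zz) \ (zeta / dt - (S.zp * phi + S.zpsi * psi));
    % Step 2: implicit update of phi using the new zeta, Eq. (4.22b)
    phi  = (I / dt + S.pp) \ (phi  / dt - (S.pz * zeta + S.ppsi * psi));
    % Step 3: recompute psi from Eq. (4.1d)
    psi  = S.psipsi \ (-S.psip * phi);
end
```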
4.3 Symplectic Integration
Symplectic integrators are designed for the numerical solution of Hamiltonian equations such as Equation (2.2). They ensure time-reversibility and preservation of the symplectic structure of the equations.
The so-called symplectic Euler method (first order) can be constructed as follows:
$$\zeta_{n+1} = \zeta_n + \Delta t\,\nabla H_\varphi(\zeta_{n+1}, \varphi_n) \qquad (4.23a)$$
$$\varphi_{n+1} = \varphi_n + \Delta t\,\nabla H_\zeta(\zeta_{n+1}, \varphi_n) \qquad (4.23b)$$
The methods are implicit for general Hamiltonian systems. However if H(ζ, ϕ) is separable as
H(ζ, ϕ) = T (ζ) + U (ϕ), it turns out to be explicit.
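For the linearized system (5.2), the discrete gradients ∇Hϕ and ∇Hζ can be identified with the right-hand sides of the two semi-discrete equations (this identification is our own assumption, made only for illustration), so the symplectic Euler step (4.23) reduces to one linear solve for ζ followed by an explicit update of ϕ. A minimal MATLAB sketch is given below.

```matlab
% Symplectic Euler step (4.23) applied to the linear 1D system (5.2):
%   d(zeta)/dt = -Szz*zeta - Szp*phi ,   d(phi)/dt = -Spz*zeta - Spp*phi
% zeta is treated implicitly and phi explicitly, giving one linear solve per step.
function [zeta, phi] = symplectic_euler_step(Szz, Szp, Spz, Spp, zeta, phi, dt)
    n    = numel(zeta);
    zeta = (speye(n) + dt * Szz) \ (zeta - dt * Szp * phi);   % Eq. (4.23a), implicit in zeta
    phi  = phi - dt * (Spz * zeta + Spp * phi);               % Eq. (4.23b), uses the new zeta
end
```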
The Störmer-Verlet schemes are symplectic methods of order 2. They are composed of two symplectic Euler steps, each with step size ∆t/2:
$$\zeta_{n+1/2} = \zeta_n + \frac{\Delta t}{2}\nabla H_\varphi(\zeta_{n+1/2}, \varphi_n) \qquad (4.24a)$$
$$\varphi_{n+1} = \varphi_n + \frac{\Delta t}{2}\big(\nabla H_\zeta(\zeta_{n+1/2}, \varphi_n) + \nabla H_\zeta(\zeta_{n+1/2}, \varphi_{n+1})\big) \qquad (4.24b)$$
$$\zeta_{n+1} = \zeta_{n+1/2} + \frac{\Delta t}{2}\nabla H_\varphi(\zeta_{n+1/2}, \varphi_{n+1}) \qquad (4.24c)$$
Implicit mid-point rule:
For a fully implicit method, ζ and ϕ are combined into one vector $\vec q = \begin{pmatrix}\vec\zeta\\ \vec\varphi\end{pmatrix}$. The implicit mid-point rule is then given by:
$$q_{n+1} = q_n + \Delta t\,\nabla H\Big(\frac{q_{n+1} + q_n}{2}\Big) \qquad (4.25)$$
In the next Chapter, we implement the above time integration schemes for various test cases
and analyze their accuracy and stability behaviors.
Chapter 5
Analysis of Time Integration Schemes
5.1 Stability and Accuracy of various time integration schemes
To assess the stability of various implicit time integration procedures, we have considered a
simplified problem in one dimension given below.
$$\frac{\partial\zeta}{\partial t} + U\frac{\partial\zeta}{\partial x} + h\frac{\partial^2\varphi}{\partial x^2} = 0, \qquad (5.1a)$$
$$\frac{\partial\varphi}{\partial t} + U\frac{\partial\varphi}{\partial x} + g\zeta = 0, \qquad (5.1b)$$
Here U represents the mean current velocity and h represents the depth of the ocean. Both U
and h have been assumed to be uniform in x direction.
Periodic boundary conditions have been assumed for both ζ and ϕ, and the initial conditions
are given as:
• ζ(0, x) = cos(4πx/L)
• ϕ(0, x) = 1
The central difference scheme has been used for the spatial discretization. It results in the following equations:
$$\frac{\partial\vec\zeta}{\partial t} + S_{\zeta\zeta}\vec\zeta + S_{\zeta\varphi}\vec\varphi = 0, \qquad (5.2a)$$
$$\frac{\partial\vec\varphi}{\partial t} + S_{\varphi\zeta}\vec\zeta + S_{\varphi\varphi}\vec\varphi = 0, \qquad (5.2b)$$
where the discretization matrices are given by the following three-point stencils:
$$S_{\zeta\zeta}:\;\Big[\,-\tfrac{U}{2\Delta x}\quad 0\quad \tfrac{U}{2\Delta x}\,\Big],\qquad S_{\zeta\varphi}:\;\Big[\,\tfrac{h}{\Delta x^2}\quad -\tfrac{2h}{\Delta x^2}\quad \tfrac{h}{\Delta x^2}\,\Big],$$
$$S_{\varphi\zeta}:\;\big[\,0\quad g\quad 0\,\big],\qquad S_{\varphi\varphi}:\;\Big[\,-\tfrac{U}{2\Delta x}\quad 0\quad \tfrac{U}{2\Delta x}\,\Big]$$
The matrices Sζζ and Sϕϕ are skew-symmetric, the matrix Sζϕ is symmetric, whereas Sϕζ is simply a diagonal matrix. For the sample problem, the following values have been used:
• domain length L = 100 m
• ∆x = 0.5 m
• U = 1.0 m/sec
• h = 50 m
• simulation time tmax = 100 sec
For the combined system of equations represented by Equation (5.2), the maximum eigenvalue obtained is $\lambda_{\max} = 2\sqrt{gh}$. The time step for the explicit time integration schemes is limited by the CFL condition, given as:
$$\Delta t \le \frac{\Delta x}{\lambda_{\max}} = \frac{\Delta x}{2\sqrt{gh}} = 0.0113\ \mathrm{sec} = \Delta t_{cfl} \qquad (5.3)$$
The various time integration schemes discussed in Chapter 4 have been implemented. In the case of the fully implicit schemes (fully implicit Backward Euler, fully implicit trapezoidal, etc.), Equation (5.2) is solved simultaneously. In the case of the semi-implicit schemes, an iterative procedure is followed to solve Equation (5.2).
5.1.1 Benchmark Solution using MATLAB integrators
In order to obtain a benchmark solution, MATLAB ODE integrators are used. The system of ODEs given by Equation (5.2) is integrated with the ODE45 and ODE15s functions. ODE45 is an explicit Runge-Kutta method with adaptive time stepping; the time step is modified based on the error estimate provided. ODE15s is a higher order implicit method. For the benchmark, a relative error tolerance of 1e−6 is used.
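A minimal, self-contained MATLAB sketch of this benchmark computation is given below. It rebuilds the periodic central-difference matrices of the test case (N = 200 periodic nodes, corresponding to ∆x = 0.5 m, is our own choice) and integrates the coupled system with ode15s at the stated tolerance; ode45 can be substituted in the same way.

```matlab
% Benchmark integration of the semi-discrete system (5.2) with MATLAB's ode15s.
L = 100; N = 200; dx = L/N; U = 1.0; h = 50; g = 9.81;              % test-case parameters
e  = ones(N, 1);
D1 = spdiags([-e zeros(N,1) e], -1:1, N, N); D1(1,N) = -1; D1(N,1) = 1;  % periodic d/dx
D2 = spdiags([e -2*e e], -1:1, N, N);        D2(1,N) =  1; D2(N,1) = 1;  % periodic d2/dx2
A  = [ (U/(2*dx))*D1, (h/dx^2)*D2 ;  g*speye(N), (U/(2*dx))*D1 ];        % coupled block matrix

x    = (0:N-1)' * dx;
q0   = [cos(4*pi*x/L); ones(N, 1)];               % initial conditions for zeta and phi
opts = odeset('RelTol', 1e-6);
[~, Q] = ode15s(@(t, q) -A*q, [0 100], q0, opts); % d q / dt = -A q, integrate to t = 100 s
zeta_end = Q(end, 1:N)';                          % benchmark zeta at t = 100 s
```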
Using ODE15s, we obtain the following solution for the variable ζ.
Figure 5.1: ζ variation with length at different times using ODE15s
The result depends on the Initial condition chosen. If the Initial conditions are smooth, as in
Figure 5.1, the final solution is also smooth.
Figure 5.2: ζ variation with length at different times using ODE15s with discontinuity
If we introduce a discontinuity in the initial condition at x = 0, as shown in Figure 5.2, wiggles/oscillations are observed with the central discretization. In order to remove these oscillations, high resolution spatial discretization schemes (flux limiters or total variation diminishing schemes) are required, which are not considered in the current study.
The comparison of the ODE45 (explicit) and ODE15s (implicit) is shown in Figure 5.3.
Figure 5.3: ζ variation with length at different times using ODE15s and ODE45
In order to compare two solutions, we define the norm of the difference between the two vectors. Let xs be the number of grid points in the x direction.
$$\mathrm{Norm}(u, v) = \sqrt{\frac{1}{x_s}\sum_{i=1}^{x_s}\big(u(i) - v(i)\big)^2} \qquad (5.4)$$
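In MATLAB this norm is a one-liner; the small helper below is only an illustrative sketch (the name err_norm is ours).

```matlab
% Root-mean-square difference between two solution vectors, Eq. (5.4).
err_norm = @(u, v) sqrt(mean((u(:) - v(:)).^2));
% example usage: err_norm(zeta_leapfrog, zeta_benchmark)
```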
For the solutions given in Figure 5.3, the norm as given by Equation (5.4) is 2.0500e−4 . We will
now use ODE45 as the benchmark for the explicit time integration solvers given in Chapter 4,
and ODE15s as the benchmark for the implicit and semi-implicit time integration methods.
5.1.2 Explicit Scheme
The Leap-Frog scheme has been used as the explicit time integration scheme, as described in Chapter 4. For the first time step, as the values at two previous time steps are unknown, the Forward Euler method is used. The Leap-Frog scheme requires the CFL condition for the stability of the time integration: for time steps below ∆t_cfl the method is stable, and it becomes unstable when the time step is equal to or greater than ∆t_cfl.
For various time steps, the simulation was run in MATLAB for 100 seconds, and the results are plotted in Figure 5.4.
Figure 5.4: Comparison of ζ variation with length for ODE45 vs Explicit Leap-Frog
It is known that the Leap-Frog method does not suffer from phase lag or numerical dispersion (the analytical behaviour of the Leap-Frog method is discussed in Chapter 10). We can see that the solution is accurate for time steps below ∆t_CFL. In order to compare the results, the norm defined in Equation (5.4) is used; the values are tabulated below.
Table 5.1: Leap Frog vs ODE45, Error Norm

Time step (sec)     Error Norm
0.0113 (∆t_cfl)     0.0370
0.01                0.0229
0.001               0.0007
5.1.3 Fully Implicit Schemes
Equation (5.2) is solved using three implicit schemes: the Backward Euler scheme, the trapezoidal scheme and the second order Backward Differentiation Formula (BDF2). It is observed that all three methods are unconditionally stable. For a linear system of equations (as in our case), the trapezoidal method is equivalent to the implicit mid-point rule, which is a symplectic integrator. At larger time steps the results are not very accurate and the solution is damped. The following subsections give the plots of the three schemes for varying time steps.
Fully Implicit Backward Euler Method
Figure 5.5: Comparison of ζ variation with length for ODE15s vs Implicit Backward Euler
Even though the implicit Backward Euler method is unconditionally stable, it suffers from very large damping at larger time steps, as observed in Figure 5.5. As the time step is increased to 0.1 seconds, the solution dies out very quickly, which is unwanted for a symplectic system where the total energy should be conserved. The error norm is given in Table 5.2.
Fully Implicit Trapezoidal Method
Figure 5.6: Comparison of ζ variation with length for ODE15s vs Implicit Trapezoidal Method
As expected, the implicit trapezoidal scheme provides better results than Backward Euler. For larger time steps (0.1 sec) the solution is not very accurate. Some phase error is observed around x = 140, which results in a large error norm for ∆t = 0.1 sec. Along with the phase error, some damping is also observed. As studied later in Section 5.2, this reduction in amplitude is not damping, but results from the amplitude variation with the change in phase. The phase error reduces as we reduce the time step (to around 0.01 seconds).
Fully Implicit BDF2 Method
Figure 5.7: Comparison of ζ variation with length for ODE15s vs Implicit BDF2 Method
BDF2 is a multi-step second order method, and is not Symplectic in nature. As can be seen
from Figure 5.7, the solution suffers from very large numerical damping at larger time steps.
In Table 5.2, the error norm for various implicit schemes at different time steps is provided. In
terms of accuracy, the Trapezoidal method is better than the explicit Leap Frog scheme and
BDF2 for a given time step.
Table 5.2: Implicit Methods vs ODE15s, Error Norm

Time step (sec)     Backward Euler-Norm     Trapezoidal-Norm     BDF-Norm
0.1                 0.672                   0.470                0.673
0.01                0.672                   0.0136               0.0364
0.001               0.387                   0.0003               0.007

5.1.4 Semi Implicit Schemes
Semi-implicit schemes of the predictor-corrector type were proposed in Chapter 4 in order to make use of the existing structure of the RRB-k solver. The following algorithm is applied when using the semi-implicit scheme.
Here itermax represents the maximum number of sub-iterations performed for the corrector step of the algorithm. Apart from the time step, itermax is important for stability considerations: a larger value of itermax should typically allow the use of a larger time step. After performing numerical experiments, the maximum allowable time step as a function of itermax has been computed and is tabulated below.
Algorithm 5.1 Semi-implicit algorithm to solve Equation (5.2)
Input: ζ⃗₀, ϕ⃗₀ (initial vectors), ∆t, itermax
while t ≤ tmax do
  while iter ≤ itermax do
    if iter == 1 then
      Step 1:
      if t == 1 then
        Advance ϕ⃗ explicitly using the Forward Euler method on Equation (5.2)
      else
        Advance ϕ⃗ explicitly using the Leap-Frog method on Equation (5.2)
      end if
    else
      Advance ζ⃗ implicitly using the implicit trapezoidal scheme, with the implicit value of ϕ⃗ taken from Step 1.
      Correct ϕ⃗ implicitly using the implicit trapezoidal scheme, with the implicit value of ζ⃗ taken from the step above.
    end if
  end while
end while
Table 5.3: Semi-Implicit Methods: itermax vs. ∆t

itermax     ∆t maximum     ∆t / ∆t_cfl
3           0.019          1.68
5           0.020          1.77
7           0.021          1.86
9           0.021          1.86
11          0.021          1.86
19          0.022          1.95
21          0.022          1.95
23          0.022          1.95
100         0.022          1.95
It is observed that the time step can be increased up to about two times ∆t_CFL by increasing the number of iterations. The error norm for the semi-implicit schemes follows a similar trend to that of the explicit Leap-Frog scheme. For the parameters and the numerical techniques used in this test case, the semi-implicit methods provide only a small advantage over the explicit methods in terms of stability.
Impact of depth h
The depth h appears in the maximum eigenvalue of the system, and therefore influences the stability criterion of any scheme. In order to ascertain the impact of the depth on the stability of the semi-implicit schemes, a sensitivity analysis has been carried out. The CFL time step is given by Equation (5.3) and varies with the depth h.
For a given itermax, it is observed that the ratio between the maximum ∆t achieved and the ∆t given by the CFL condition remains the same.
5.2 Study of Implicit Trapezoidal Scheme
As discussed in the previous section, the implicit trapezoidal scheme provides a stable solution without much damping compared to the other methods, although it suffers from a phase shift (phase error) at larger time steps, resulting in larger error norms.
In this section, the impact of the phase difference on the amplitude of the solution is discussed for the coupled and the uncoupled set of equations.
In Figure 5.8, the variable ζ is plotted at t = 50 seconds. The system of equations has been decoupled by setting h = 0 and g = 0. No phase error is observed in this case, even when large time steps are used. As the initial solution does not contain sharp discontinuities, the central discretization does not introduce oscillations and the phase errors are minimal. This changes when we introduce the coupling between the two equations by setting h = 50 and g = 9.81.
Figure 5.8: Implicit Trapezoidal method without coupling at t = 50 sec for various time steps
For the coupled equations (Figure 5.9), it is observed that the amplitude of the solution does
not remain constant due to the impact of the source term, and varies with the phase. This
30
impact can be seen in Figure 5.9 where the benchmark solution is plotted at three different
time intervals. Without any coupling between the equations, the solution should travel with
the wave speed without changing the shape or the amplitude. With the coupling, a change in
amplitude is seen at different time intervals.
1
t=20s
0.8
t=22s
t=24s
0.6
0.4
ζ
0.2
0
−0.2
−0.4
−0.6
−0.8
−1
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.9: Solution of coupled equation at t = 20,22 and 24 sec with Matlab ODE integrator
This change in the amplitude with the phase for the coupled system causes large error norms
for Implicit Trapezoidal method. As Implicit Trapezoidal method induces a phase error, the
change in the phase results in a change in amplitude. This is reflected in the Figure 5.10
31
1.5
ζ ODE15s
ζ Implicit Trapezoidal, dt =0.01
ζ Implicit Trapezoidal, dt =0.1
ζ Implicit Trapezoidal, dt =1
1
ζ
0.5
0
−0.5
−1
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.10: Implicit Trapezoidal method with Coupling at t = 20 sec for various time steps
For time step ∆t = 0.01s, the solution is accurate and does not suffer from phase error. For
larger timer steps, both phase change and amplitude variation is observed. It is important to
observe that, for ∆t = 0.1s, the solution appears to be already damped at time 20s. If there
was damping in the solution, then it would be greater for larger time steps, which is not seen
for ∆t = 1s. Instead the amplitude at ∆t = 1s is greater than the benchmark solution.
It could be argued that the larger ∆t is impacting the stability of the solution. In order to
check this, the solution is plotted for time 100 sec in Figure 5.11.
32
1
ζ ODE15s
ζ Implicit Trapezoidal, dt =0.01
0.8
ζ Implicit Trapezoidal, dt =0.1
ζ Implicit Trapezoidal, dt =1
0.6
0.4
ζ
0.2
0
−0.2
−0.4
−0.6
−0.8
−1
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.11: Implicit Trapezoidal method with Coupling at t = 100 sec for various time steps
It is observed that the solution remains stable, and only phase error and amplitude difference
is observed for larger time steps.
5.2.1
Higher Order Schemes and Impact on Phase Error
Higher order (Two stage, Fourth ordered) Symplectic Implicit Runge Kutta scheme has been
implemented to analyze the impact on the phase error due to the larger time steps. These
methods are computationally more intense, but provides better accuracy. Given the coupled
set of equations as:
dq
+ Lq = 0
dt
we have
1
q n+1 − q n
= (K1 + K2 )
∆t
2
(5.5)
where
√
K1 = −L(q n + ∆t/4K1 + (1/4 + 3/6)∆tK2 )
√
K2 = −L(q n + (1/4 − 3/6)∆tK1 + 1/4∆tK2 )
At every time step, above system of equation is simultaneously solved and then plugged into
Equation (5.5). Please note that the coefficients are chosen such that the method remains
Symplectic in nature.
33
The results are shown in Figure 5.12.
Higher Order Runge Kutta Method
1
ζ ODE15s
0.8
ζ Implicit Trapezoidal, dt =1.0
ζ Implicit Trapezoidal, dt =0.1
0.6
0.4
ζ
0.2
0
−0.2
−0.4
−0.6
−0.8
−1
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.12: Higher order Runge Kutta method at time 100s
It is observed that time step ∆t = 0.1s, the solution is accurate and there is no phase error.
As compared to the Implicit Trapezoidal method, higher order Runge Kutta method is more
accurate. A phase error is observed when the time step is further increased.
5.2.2
Impact of Initial Conditions
In a real case scenario, the incoming waves are generated by mixing various waves of different
frequencies and amplitudes. The initial condition in this case is chosen as:
ζ 0 (x) = 0.25(
3
X
i=0
cos(
2i 4πx
)
L
(5.6)
For the initial conditions with single wave, the solution was accurate for ∆t = 0.01s. The
solution for same ∆t = 0.01 but the mixed initial conditions is plotted in Figure 5.13.
34
0.8
ζ ODE15s
0.6
ζ Implicit Trapezoidal, dt =0.01
0.4
ζ
0.2
0
−0.2
−0.4
−0.6
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.13: Implicit Trapezoidal Method with mixed initial conditions, ∆t =0.01s
It is observed that the solution is not accurate anymore. This can be attributed to high frequency
waves. The phase error for Implicit Trapezoidal schemes is generally proportional to square of
the frequency of the wave.
32πx
Instead of a mix of waves, simulation is carried with initial condition as ζ 0 (x) = cos(
).
L
The result for ∆t = 0.01s is plotted in Figure 5.14. Phase error is observed even for a single
initial wave with high frequency at ∆t = 0.01s. This implies that for high frequency waves, a
smaller time step is required to maintain accuracy.
Spatial discretization also influences the solution with the high frequency waves. As a rule of
thumb, a minimum of 15 grid points per wavelength should be used to accurately represent the
wave. In our study, we have used 12.5 gridpoints per wavelength. Increasing the number of grid
points reduces ∆x, and thus will result in lower CFL limit. In general, it is the low frequency
waves, with larges wavelengths which influences the Ships most. The ∆x should be chosen to
represent these low frequency waves accurately.
35
1.5
ζ ODE15s
1
ζ Implicit Trapezoidal, dt =0.01
ζ
0.5
0
−0.5
−1
−1.5
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.14: Implicit Trapezoidal Method with high frequency inlet conditions, ∆t =0.01s
5.2.3
Impact of Boundary Conditions
Periodic boundary conditions are prone to accumulation of phase error as the time grows. In
this section, we try to analyze the impact on the solution when different boundary conditions
are used.
We use the following:
• ζ(0, t) = 1.0 : Dirichlet Boundary Condition.
• For numerics, ζN +1 = ζN where N is the node number at right boundary. This is also
equivalent to the Neumann boundary condition.
• Neumann Boundary Condition for ϕ
We again observe phase error for larger time steps. Though we did not study the difference
between the phase error for a Periodic boundary condition and Neumann boundary condition.
36
5.2.4
Summary
A summary of the methods is provided below.
Figure 5.15: Comparison of various time integration schemes and stability estimates
37
5.3
Impact of Spatial Discretization
As discussed in Section 5.1.1, wiggles/oscillations are observed in ζ when the initial condition
has a discontinuity. Following are the possible reasons.
• Spatial Discretization
• Coupling of equations
In order to analyze the impact of the coupling on the oscillations, Equation (5.2) is decoupled
by setting h = 0 and g = 0. This results into two decoupled wave equations, whose analytical
solution is available. Let the initial condition be given as ζ 0 (x) and ϕ0 (x). Then the solution
at (x, t) is given by :
ζ(x, t) = ζ 0 (x − U t)
ϕ(x, t) = ϕ0 (x − U t)
Based on the above equations, if the initial condition has a discontinuity, then it should propagate with time without changing any shape. Spatial discretization usually impacts the solution
of the wave equation in the case of discontinuities or shock-waves.
Again, equations are discretized using Central discretization and solved using the Matlab benchmark code (ODE15s). The results obtained for ζ are given in Figure 5.16.
1.5
1
ζ
0.5
0
−0.5
−1
−1.5
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.16: ζ for decoupled wave equation after one time period with Central discretization
and with discontinuity in the initial condition
Wiggles and oscillations are still present in the solution, which leads us to explore the other
possibility of Spatial discretization. Whereas Central discretization causes oscillations around
the discontinuity, the error introduced by the Upwind discretization can be classified as follows:
38
• Smearing: The length of Smearing after the discontinuity.
• Phase Error
Figure 5.17: Classification of Errors induced by Upwind Discretization
In Figure 5.17, the anlaytical solution is shown as the black line, whereas the solution from the
Upwind discretization is shown in red.
Instead of Central discretization, the Upwind scheme is now used to discretize Equation (5.1).
This results in following discretization stencils:
Sζζ
Sζϕ
Sϕζ
Sϕϕ
−U
:
∆x
h
:
∆x2
g
: 0
−U
:
∆x
U
∆x
−2h
∆x2
0
U
∆x
0
h
∆x2
0
Using Matlab ODE15s and the above discretization stencils, the obtained solution is plotted in
Figure 5.18:
39
0.6
0.4
0.2
ζ
0
−0.2
−0.4
−0.6
−0.8
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.18: ζ for decoupled wave equation after one time period with Upwinding discretization
In Figure 5.18, we observe that the solution is damped, but no oscillations or wiggles are
present in the solution. Also, larger damping at x = 0 is observed which can be attributed to
the discontinuity.
Damping is a characteristic of the Upwinding discretization and can be removed by adopting
higher order or higher resolution methods. In order to confirm that the wiggles are generated
only because of the central discretization scheme and not because of the coupling, we run the
simulation with the Upwinding discretization and by setting h = 50 and g = 9.81, which again
gives us coupled set of equations. Again running the simulation with Matlab ODE15s integrator,
we obtain the following result:
40
1
0.8
ζ at t = 0
ζ at t = 100
0.6
0.4
ζ
0.2
0
−0.2
−0.4
−0.6
−0.8
−1
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.19: ζ for Coupled wave equation after one time period with Upwinding discretization
In Figure 5.19, for the coupled equations, the Upwinding discretization removes the wiggles and
also results in a smooth but damped solution.
We conclude that the wiggles are generated in the simulation if a discontinuity is present in
the initial condition, with the Central discretizations scheme. However, it does not impact the
stability of the Semi-implicit schemes.
41
5.3.1
Impact of Grid size on Accuracy
One of the goals of the current project is to make the mesh finer near the regions surrounding
a ship or obstacles. As we have seen that varying the time step impacts the accuracy of the
solution, we will try to assess the time step requirements for similar accuracy for different grid
sizes.
Simulations are run with the implicit Trapezoidal scheme, and varying grid spacing and adjusting time step. As the Implicit Trapezoidal method is second order accurate, reduction in
grid size implies reduction in time step if the CFL number is not changed. Thus the accuracy
of the solution improves by ∆t2 , and a better benchmark is required to compare the solution.
Again the Matlab ODE15s method is used for the benchmarking along with an absolute error
estimate of 1e−9 as compared to the previous absolute error of 1e−6 . The results obtained are
indicated in Table 5.4
∆x
CFL Number
Error Norm
2
1.8
0.717622
0.9
0.235133
0.4
0.053416
1.8
0.201493
0.9
0.046899
0.4
0.015783
1.8
0.011764
0.9
0.0063474
0.4
0.004944
1.8
0.005404
0.9
0.005445
0.4
0.003001
1
0.5
0.25
Table 5.4: Error Norm for various Grid sizes and CFL numbers
For the same CFL number, an error reduction from 0.71 to 0.2 is observed when the grid spacing
is reduced from 2 to 1. This approximate factor of 4 is attributed to the reduction in time step
by two times and the second order temporal accuracy. Another observation is that, for the
combination of small grid size (∆x = 0.25) and small ∆t, the error norm does not improve with
further reduction in ∆t.
5.4
Stability Analysis with Vertical Structure ψ included
Based on the observations in Section 5.1.3, the fully Implicit Trapezoidal scheme provides the
best stability properties for the test case given by Equation (5.1). We still need to evaluate the
impact of the vertical structure ψ which is given by an elliptic equation and does not include a
time derivative term. To assess the impact of this additional variable and equation, the following
42
test case has been considered.
∂ζ
∂2ϕ
∂2ψ
∂ζ
+U
+ h 2 − hD 2 = 0,
∂t
∂x
∂x
∂x
(5.7a)
∂ϕ
∂ϕ
+U
+ gζ = 0,
∂t
∂x
(5.7b)
Mψ + hD
∂2ψ
∂2ϕ
−
N
= 0,
∂x2
∂x2
(5.7c)
Three additional discretization matrices are formulated. The model parameters M, N D are
derived from the vertical structure shape. For the linearized parabolic model we obtain,
1
D = h,
3
1
M = h,
3
and N =
2 3
h
15
(5.8)
The discretization matrices are given as:
Sζϕ
Sϕζ
Sϕϕ
Sζψ
Sψϕ
Sψψ
−U
2∆x
h
:
∆x2
g
: 0
−U
:
2∆x
h2
:
3∆x2
h2
:
3∆x2
−2h3
:
15∆x2
Sζζ :
U
2∆x
0
−2h
∆x2
0
0
h
∆x2
U
2∆x
−2h2
3∆x2
−2h2
3∆x2
4h3
15∆x2
h2
3∆x2
h2
3∆x2
−2h3
15∆x2
Periodic boundary conditions for ψ has been assumed. The model parameters h, U and g
remain the same as in the case of two equation model.
5.4.1
Benchmark
Equations for ζ and ϕ are coupled and represented as a Lumped model. It contains terms from
ψ which is explicitly not known as it depends on ϕ.
Let the lumped equation be represented as :
where L =
Sζζ
Sϕζ
d~q
~
+ L~q = C ψ
dt
Sζϕ
Sζψ
and C = −
Sϕϕ
Sϕψ
43
(5.9)
The set of differential algebraic equations given by Equation (5.7) can be solved by the Matlab
ODE15s solver, an implicit method capable of solving stiff problems and DAE’s. The derivatives
in time are described as :
d~q
"
#
~
−L~
q
+
C
ψ
dt
~ =
~
dψ
−Sψϕ ϕ
~ − Sψψ ψ
dt
(5.10)
I2xs ×2xs
0
In order to solve the DAE by Matlab ODE15s, an additional input given by M =
0
0 3x ×3x
s
s
is required, where xs is the number of grid points. This matrix M, also called as the Singular
Matrix characterizes the DAE. One has to be careful while chosing the initial conditions for the
system of DAE’s. An unique solution is obtained when the initial conditions also satisfy the
~ and ϕ
the algebraic equation. We have chosen the initial values of ψ
~ , such that they satisfy the
algebraic equation.
5.4.2
Chorin’s Projection technique
In order to solve this coupled equation, a projection method similar to Chorin’s projection
method [8] has been used. Chorin’s Projection scheme was originally used to solve the incompressible unsteady state Navier Stokes equations. The key advantage of the projection method
is that the computation of the dependent variables are decoupled (Pressure and velocity in the
case of Navier Stokes equations). Typically the algorithm consists of two stage fractional step
schene, a method which uses multiple calculation steps for each numerical time-step.
The steps are split as following:
d~q
• Ignore the ψ term from the lumped equation, and solve the system
= −L~q with the
dt
Implicit Trapezoidal method. Denote the solution after one time step as ~q∗
~n+1 from Sψϕ ϕ
~ = 0 by solving the system of equations where ϕ
• Compute ψ
~ + Sψψ ψ
~ is
derived from q∗.
~n+1
• Compute ~qn+1 as ~qn+1 = ~q ∗ +∆tC ψ
In Figure 5.20, the solution obtained from Chorin’s projection scheme for different time steps
is compared with the Benchmark solution.
44
1.5
ζ ODE15s
ζ Chorin Projection Scheme dt =0.01
ζ Chorin Projection Scheme dt =0.1
1
ζ Chorin Projection Scheme dt =1.0
ζ
0.5
0
−0.5
−1
−1.5
0
20
40
60
80
100
x
120
140
160
180
200
Figure 5.20: ζ variation with length at different times using ODE15s using Chorin’s Projection
As the time step is reduced, the solution is more close to the Benchmark solution. As the time
step is increased, a larger phase error can be observed. High error norms for the solution at the
larger time steps can be attributed to this phase error.
5.5
Stability Analysis for Two Dimensional Case
In order to analyze the stability of the complete system, it is required to study the two dimensional case which results in a Pentadiagonal matrix instead of a Tridiagonal matrix.
Following system of equations are considered as the test case:
∂ζ
∂ζ
∂ζ
∂2ϕ ∂2ϕ
∂2ψ ∂2ψ
+ Ux
+ Uy
+ h( 2 +
)
−
hD(
−
) = 0,
∂t
∂x
∂y
∂x
∂y 2
∂x2
∂y 2
(5.11a)
∂ϕ
∂ϕ
∂ϕ
+ Ux
+ Uy
+ gζ = 0,
∂t
∂x
∂y
(5.11b)
Mψ + hD(
∂2ϕ ∂2ϕ
∂2ψ ∂2ψ
+
)
−
N
(
+
) = 0,
∂x2
∂y 2
∂x2
∂y 2
(5.11c)
Finite volume method based on Central discretization has been used for the spatial discretization
of the system of equations. A rectangular grid has been assumed with the possibility of different
grid sizes for the X and Y direction. The discretization matrices are Pentadiagonal matrices
represented by 5-point stencil as described in Chapter 2 in detail.
It is important to consider the boundary conditions while constructing the Pentadiagonal matrices. In our case, Periodic boundary conditions in both X and Y directions have been used.
The model parameters are given below for the Two dimensional case:
45
• Lx =100 m
• Ly =100 m
• Nx = 50 Number of grid points
• Ny = 50 Number of grid points
• Ux = Uy = 1.0 m/s
• h = 50 m
• time-max = 50 sec
• Initial Condition: ζ 0 (x, y) = cos(
4πx
)
Lx
• ϕ0 (0, x, y) = 1
• ψ 0 (x, y) = 0
The equations in lumped form for the benchmark solution and the Chorin’s projection remains
the same. Instead of the tridiagonal matrices, we have pentadiagonal block matrices. The grid
is ordered lexicographically in Column major format (Similar to Fortran ordering).
Stability, accuracy and the Symplectic nature of the solution from Chorin’s projection method
is compared with the Benchmark solution. The Implicit Trapezoidal method has been used as
the substep of the Chorin’s projection method.
Figure 5.21: ζ variation using ODE15s vs. Chorin’s Projection for ∆t = 0.01s and finer grid
46
Figure 5.22: ζ variation using ODE15s vs. Chorin’s Projection for ∆t = 1s and finer grid
Phase error is observed at larger time steps, though the stability fo the system is not impacted.
This gives us confidence in proceeding with Chorin’s Projection scheme for the real case of the
interactive waves.
47
Chapter 6
Iterative Solvers
6.1
Background and Previous Work
In [3] and [1], methods are given to solve the system of equations represented by Equation (2.13b)
also given below:
~ = −Sψϕ ϕ
Sψψ ψ
~=b
In this section, the solver and the properties of the matrix Sψψ will be presented.
6.1.1
Properties of the Discretization Matrix Sψψ
Numerical discretization of Equation (2.3c) gives Equation (2.13b). Sψψ is given by the five
point stencil given in 2.12. When using lexicographical ordering of the grid points, this leads
to a pentadiagonal matrix. Since the model parameters (functionals D, M and N ) are positive
[3], the center term in the stencil becomes positive and the outer elements are negative. The
center term is equal to the sum of the absolute value of the outer terms plus one additional
term which is also positive as h > 0 and M0C > 0. For a row of the matrix formed by the five
point stencil, the diagonal entry is given by the central point of the five point stencil, and the
off-diagonal entries are given by the remaining four entries of the stencil. As the diagonal entry
is greater than the sum of absolute value of other entries in a row of the matrix, the matrix is
strictly diagonally dominant.
To verify the symmetry of the matrix, one has to compare the outer diagonals with each other.
For example the contribution
westterm and the east term should be the same. Thus
of the
∆y ¯
∆y ¯
it is required that ∆x
NW
= ∆x
NE
. In the case of non-uniform mesh, (∆x)i,j will
i,j
i−i,j
be different from (∆x)i−1,j , and a different discretization scheme will be required to ensure
symmetry. For the uniform mesh, (∆x)i,j = (∆x)i−1,j . With the overbar notation, then the
requirement can be rewritten as (NW + NC )i,j = (NC + NE )i−1,j ⇔ (Ni−1,j + Ni,j ) = (Ni−1,j +
Ni,j ), which is clearly satisfied. Applying the same reasoning to the north and the south
neighbors, we conclude that the matrix is symmetrical. A symmetrical diagonally dominant
matrix is positive definite and special methods like Conjugate Gradient can be used to solve
Equation (2.13b).
48
6.1.2
Preconditioned Conjugate Gradient Method
The Conjugate Gradient(CG) method is an element of the class of Krylov subspace methods.
The Krylov subspace is created by repeatedly applying a matrix to a vector. These are iterative
methods, i.e., for a starting vector x0 , the method generates a series of vectors x1 , x2 , · · ·
converging to x = S −1 b.
A detailed description and derivation of the CG method is provided in [1], [3]. Here we will
describe the convergence characteristics and the requirements for the method, along with the
algorithm pseudo-code.
The CG method can only be applied to the matrices which are symmetric and positive definite.
Convergence of the CG Method for solving the linear system of equations depends on the
wavelength spectrum of the matrix S. The improvement in the error is bounded by the Condition
, where λmax and λmin are the maximum
Number of the matrix which is given as k(S) = λλmax
min
and minimum eigenvalues of the matrix S. Higher the condition number, slower the convergence
of the method. It can be derived [3] that for the given system represented by the stencil given
in 2.12, the number of iterations required for the convergence of the CG method is inversely
proportional to the mesh size.
Preconditioner
If the condition number of the matrix S, (k(S)) is large, preconditioning is used to replace
the original system Sψ − b = 0 by M−1 (Sψ − b = 0) so that (k(M −1 S)) gets smaller than
(k(S)). In most cases, preconditioning is necessary to ensure fast convergence of the CG method.
Algorithm 6.1 provides the pseudo-code for the PCG method.
The choice of the preconditioner is important to achieve a fast solver. Using a preconditioner
should ultimately lead to a reduction in work coming from a reduction in required number
of iterations for CG to converge. This implies that solving the system M z = r should be
relatively cheap to solve in the case of PCG. Martijn [1] worked on two preconditioners and
translated them in an effective CUDA solver. We will describe the RRB (Repeated Red Black)
preconditioner in brief here. The other preconditioner (Incomplete Poisson) is specific to the
SPD (Symmetric Positive Definite) matrices, and may not be applicable for non-symmetric
matrices (for example the matrix given by the stencil in Equation (2.6) ).
6.1.3
Repeated Red Black (RRB) ordering as preconditioner
The RRB method uses a renumbering of the grid points according to a red-black ordering. Redblack ordering is the classic way of parallelizing the Gauss Seidel iterative solver by removing
the dependencies between the adjacent cells.
What the RRB- method basically does is making the following decomposition:
A = LDLT + R
(6.1)
where L is a lower triangular matrix, D a block diagonal matrix and R a matrix that contains
the adjustments made during the lumping procedure.
In a two dimensional grid, the red nodes are given by the points xi,j , yi,j with i + j as even,
and the black nodes with i + j as odd. First the black nodes are numbered in lexicographical
ordering and then the red points. For the RRB ordering, the red nodes are again split up into
red and black nodes, and we repeat the procedure. When the black node elimination is repeated
k times, the method is named RRB-k method. With an elimination process on the finest level,
49
Algorithm 6.1 Preconditioned CG Algorithm to solve Sx = b
Input x (Start Vector), r (Right Hand Side) , S (Matrix),
M (Preconditioner), ǫ (Tolerance)
r = b − Sx
solve : M z = r
ρnew = rT z
i=0
while ρnew > ǫ2 ||b|| do
i=i+1
if i = 1 then
p=z
else
β=
ρnew
ρold
p = z + βp
end if
q = Ap
σ = pT q
α=
ρnew
σ
x = x + αp
r = r − αq
solve : M z = r
ρold = ρnew
ρnew = rT z
end while
50
a stencil is obtained on the next coarser level, which is then simplified by lumping some outer
elements (Converting the 9 point stencil on the coarser level back to 5 point stencil). The
process of elimination and lumping is then repeatedly applied on the resulting stencil, until the
grid is coarse enough to use a direct solver on it.
6.1.4
Implementation of RRB preconditioner
Martijn [1] has taken great efforts in describing the implementation of the RRB solver both
in C++ programming, and CUDA kernels. Instead of digressing into minor details, the main
points which will later impact the implicit time integration are described in brief here.
6.1.5
Effect of RRB ordering on sparsity pattern
The matrix S in the system Sψ = b is given by a five point stencil, see Section 2.2. This results
in a matrix whose structure is pentadiagonal. The sparsity pattern changes when RRB ordering
is applied.
Figure 6.1: Sparsity pattern of S ∈ R64x64 before RRB ordering [1]
Figure 6.2: Sparsity pattern of S ∈ R64x64 after RRB ordering [1]
51
6.1.6
RRB-k method
The maximum number of levels of the RRB ordering could be computed given the dimensions
of the grid in x and y direction. Though, it is not required to go upto the coarsest level. It
is possible to stop at any level of coarsening. Such level is named k, and hence the RRB-k
method. On level k, the nodes are numbered naturally (not in red-black fashion). From these
nodes a matrix E is formed, which is a symmetric pentadiagonal matrix. The idea of stopping
earlier is that we want to solve the remaining nodes accurately and exactly. In order to keep
the computational cost lower, it is important to keep the size of the matrix E small.
On level k, we have to solve the system of like Ex = b, where x is formed by level k nodes of
z and b by the level k nodes of r in the problem M z = r. It is possible to use direct solvers
(Complete Cholesky factorization) or Gaussian elimination for such small problem.
6.1.7
The Schur complement and the 9-point Stencil
With the basic red-black numbering, the matrix S can be written in a block matrix format.
The system Sψ = b can then be written in the form:
Db
Sbr ψb
b
= b
Srb
D r ψr
br
Subscript b and r denotes the black and red nodes respectively. Dr and Db are the diagonal
T (matrices with 4 off-diagonals).
matrices, and for a symmetric matrix S, Srb = Sbr
Next, the black nodes are eliminated by applying Gaussian elimination. This yields:
Db
Sbr
bb
ψb
=
br − Srb Db−1 Sbr
0
Dr − Srb Db−1 Sbr ψr
The matrix S1 := Dr − Srb Db−1 Sbr is called the 1st Schur complement and is given by a 9-point
stencil, the vector b1 := br − Srb Db−1 Sbr is the corresponding right-hand side. Using the solution
of red nodes, it is thus possible to compute the solution of black nodes. The catch here is that
instead of the five point stencil of S, more expensive 9-point stencil of S1 has to be solved.
6.1.8
Constructing the preconditioning matrix M
The preconditioning matrix M = LDLT is constructed in following four steps:
• Elimination of black nodes: 9-point stencil is generated from the 5-point stencil (creation
of the Schur complement). In the case of C++ code, all the five diagonals are stored
separately, whereas in CUDA code only three out of five diagonals are stored. West and
South stencil are used to access the East and North stencil respectively (utilizing the
symmetry of the matrix).
• Lumping from the 9-point stencil to the 5-point stencil for the remaining red nodes.
This ensures removing stencil dependencies by adding the respective coefficients to other
coefficients.
• Elimination of the first level red nodes using the lumped 5-point stencil (creating the
second level red nodes with 9-point stencil).
52
• Lumping on the second level. The resulting matrix on the coarse grid is pentadiagonal
and has the same properties as the original matrix S.
The above procedure is only one RRB iteration. It consists of an elimination S = L1 S1 LT1 ,
lumping S1 = S1 + R1 and again an elimination S1 = L2 S2 LT2 and lumping S2 = S2 + R2 .
Combined we have,
S = L1 L2 S2 LT2 LT1 + L1 L2 R2 LT2 LT1 + L1 R1 LT1
= LDLT + R.
6.1.9
Solving M z = r
As given in Algorithm 6.1, in each iteration the preconditioning step M z = r =⇒ LDLT z = r
has to be solved for z. This is done in three steps:
• La = r is solved using forward-substitution. This is done level-wise going from finer grid
to the coarser grid.
• b = D−1 a. As D is block diagonal matrix, its inverse is easy to compute.
• LT z = b is solved using back-substitution. This is done by going from coarser grids to
fine grids level-wise.
6.1.10
Efficient CUDA and C++ Code Implementation
In order to boost performance, Martijn has implemented efficient kernels for the following:
• Matrix Vector product
• Dot product : Performing mass reduction on GPU and using Kahan summation algorithm
on CPU. Kahan summation algorithm greatly reduces error in floating point additions (It
makes possible to sum n numbers with an error that only depends on the floating-point
precisions)
• Vector Updates
These optimized codes could be reused even if the PCG algorithm is changed with minor
modifications.
6.2
Solvers for Non-Symmetric Matrices
As we have seen, most of the implicit time integration methods described in Chapter 4 (except
MR-PC) will require solving a set of equations where the matrices have stencils as given in
Section 2.2. The matrices involved are not symmetric as compared to the matrix given by
Stencil Sψψ . For this reason, the PCG method as described in Chapter 3 cannot be applied to
solve the linear system of equations arising from the implicit time integration.
The objective is therefore to determine the solution vector for a large, sparse and linear system
of equations which is not symmetric. An overview of these methods can be found in [6], which
forms the basis for this Chapter.
53
As mentioned before in Chapter 3, the idea of some iterative methods is to project an mdimensional problem into a lower-dimensional Krylov subspace. Given a matrix A of size
m × m and a vector b of size m × 1, the associated Krylov sequence is the set of vectors
b, Ab, A2 b, A3 b, · · · , which can be computed by matrix-vector multiplications in the form b, Ab, A(Ab), A(A(Ab)))
The corresponding Krylov subspaces are the spaces spanned by successively larger group of these
vectors.
6.2.1
From Symmetric to Non-Symmetric Matrices
Figure 6.3: Classification of Krylov Subspace iterations
Figure 5.1 shows the classification of Krylov Subspace methods as we move from symmetric
matrices to non-symmetric matrices. The Conjugate Gradient (CG) method results in the
tridiagonal orthogonalization of the original matrix, which can be described as A = QT QT ,
where Q is the Unitary matrix and T is a tridiagonal matrix. When A is non-symmetric, this
result cannot be obtained from a CG iteration. Two approaches can be followed :
• Use of the so-called Arnoldi Iteration, a process of Hessenberg orthogonalization. This
results in A = QHQT where Q is a Unitary matrix and H is an upper Hessenberg matrix.
(An upper Hessenberg matrix has zero entries below the first sub-diagonal).
• Bi-orthogonalization methods are based on the opposite choice. If we insist on obtaining
a tridiagonal result, then we have to give up the unitary transformations, which gives us
tridiagonal biorthogonalization : A = V T V −1 , where V is non-singular but generally not
Unitary. The term ’biorthogonal’ refers to the fact that though all the columns of V are
not orthogonal to each other, they are orthogonal to the columns of (V −1 )T = (V T )−1 .
Let V be a non-singular matrix such that A = V T V −1 with T tridiagonal and define
W = (V T )−1 . Let vj and wj denote the j t h columns of V and W respectively. These
vectors are biorthogonal in the sense that wiT vj = δij where δij is the Kronecker delta
54
function. For each n with 1 ≤ n ≤ m, define the m × n matrices such that:
Vn = v 1 v 2 · · · v n ,
Wn = w 1 w 2 · · · w n
(6.2)
In matrix form, the biorthogonality can be written as WnT Vn = VnT Wn = In , where In is
the identity matrix of dimension n. The iterations in biorthogonalization methods can be
summarized as:
AVn = Vn+1 T̃n
(6.3a)
AT Wn = Wn+1 S̃n
(6.3b)
Tn =
SnT
=
WnT AVn
(6.3c)
Here Vn and Wn have dimensions m × n, T̃n+1 and S̃n+1 are tridiagonal matrices with
dimensions (n + 1) × n, and Tn = SnT is the n × n matrix obtained by deleting the last
row of T̃n+1 . BiConjugate gradient algorithms are described in Section 5.3.
6.2.2
Arnoldi Iteration and GMRES
Arnoldi iteration can be understood as the analogue of a Gram-Schmidt type iteration for
similarity transformations to upper Hessenberg form. It has the advantage that it can be
stopped part-way, leaving one with a partial reduction to Hessenberg form that is exploited
when dimensions upto n (dimension of the Krylov subspace) are considered. A simple algorithm
of Arnoldi iteration is given below:
Algorithm 6.2 Arnoldi Iteration
b = arbitrary, q1 = b/||b||
for n = 1, 2, 3, · · · do
v = Aqn
for j = 1 do
hjn = qj∗ v
v = v − hjn qj
end for
hn+1,n = ||v||
qn+1 = v/hn+1,n
end for
The above algorithm can be condensed in the following form:
• The matrices Qn = q1 q2 · · · qn generated by the Arnoldi iteration are reduced QR factors
of the Krylov matrix:
Kn = Q n Rn
(6.4)
where Kn is the m × n Krylov matrix.
• The Hessenberg matrices Hn are the corresponding projections : Hn = QTn AQn
′
′
• The successive iterates are related by the formula: AQn = Qn+1 Hn , where Hn is the
(n + 1) × n upper-left section of H.
55
The idea of GMRES is straightforward. At step n, the exact solution (xo = A−1 b) is approximated by the vector xn ∈ Kn that minimizes the norm of the residual rn = b − Axn , hence the
name : Generalized Minimal Residuals (GMRES). It was proposed in 1986 and is applicable to
a system of equations when the matrix is a general (Non-singular) square matrix. Arnoldi iteration is used to construct a sequence of Krylov matrix Qn whose columns q1 , q2 , · · · successively
span the Krylov subspace Kn . Thus we can write xn = Qn y , where y represents the vector
such that
||AQn y − b|| = minimum
(6.5)
Using the similarity transformation, this equation can be written as:
′
||Qn+1 Hn y − b|| = minimum
(6.6)
Multiplication by a Unitary matrix does not change the 2-norm, thus we can rewrite above
equation as:
′
′
||Q∗n+1 Qn+1 Hn y − Q∗n+1 b|| = minimum||Hn y − Q∗n+1 b|| = minimum
(6.7)
Finally, by construction of the Krylov matrices Qn , we have: Q∗n+1 b = ||b||e1 where e1 =
(1, 0, 0, · · · )T . Thus we obtain:
′
||Hn y − ||b||e1 || = minimum
(6.8)
The GMRES algorithm (unPreconditioned) can be written as :
Algorithm 6.3 GMRES
q1 = b/||b||
for n = 1, 2, 3, · · · do
<step n of Arnoldi iteration, Algorithm 6.2>
′
Find y to minimize ||Hn y − ||b||e1 ||(= ||rn ||)
xn = Qn y
end for
In order to find y, a QR factorization can be used which requires O(n2 ) flops, because of the
′
Hessenberg structure of the matrix H . Here n is the dimension of the Krylov subspace. Also
′
′
it is possible to get the QR factorization of Hn from that of Hn−1 by using Given’s Rotation.
One of the disadvantages of the GMRES method is the storage requirements. As it requires
storing the whole sequence of the Krylov subspace, a large amount of storage is required as
compares to the Conjugate Gradient method. For this reason, restarted versions of this method
are used, where computational and storage costs are limited by specifying a fixed number of
vectors to be generated.
6.2.3
BiConjugate Gradient methods
The BiConjugate Gradient method (BiCG) is the other extension for non-symmetric matrices.
As we saw in the previous section, the principle of GMRES is to pick vector xn such that
56
the residual corresponding to xn is minimized. The principle of BiCG algorithm is to pick
xn in the same sub-space, i.e. xn ∈ Kn , but to enforce that the residual is orthogonal to
w1 , A ∗ w1 , · · · , (A∗)n−1 w1 , where w1 ∈ Rm is an arbitrary vector satisfying the dot product,
w1∗ v1 = 1. Its advantage is that it can be implemented with three-term recurrences rather than
the (n+1) - term recurrences of GMRES (Difference arising from Hessenberg form of matrix vs.
tridiagonal form).
There are two major problems with the BiCG method :
• Convergence is slower as compared to GMRES and often erratic. Also, it may have the
consequence of reducing the ultimately attainable accuracy because of rounding errors.
• It required multiplication with AT (transpose) as well as A. Computing the transpose
brings serialization to the code and thus is not preferred.
To address this problem, other variants of BiCG method were developed. One of them is
stabilized BiCG method (Bi-CGSTAB). As for the Conjugate Gradient method, any other
Krylov subspace method needs a good preconditioner to ensure fast and robust convergence.
The algorithm for the preconditioned Bi-CGSTAB method is given below. One of the future
task is to adjust this algorithm so as to use the same building blocks as developed by Martijn
in [1] for the RRB-k method.
57
Algorithm 6.4 Bi-CGSTAB
Solve the system of equation given by Ax = b by Bi-CGSTAB
Compute r0 = b − Ax0 for some initial guess x0 .
Choose r̃ (for example r̃ = r0 )
for i = 1, 2, 3, · · · do
ρi−1 = r̃T ri−1
if ρi−1 = 0 then
Method Fails
end if
if i = 1 then
pi = ri−1
else
ρi−1 αi−1
ρi−2 ωi−1
pi = ri−1 + βi−1 (pi−1 − ωi−1 v i−1
βi−1 =
end if
Solve M p̃ = pi
v i = Ap̃
ρi−1
αi = T v i
r̃
s = ri−1 − αi v i
Check norm of s; if small enough, set xi = xi−1 + αi p̃
Solve M s̃ = s
t = As̃
tT s
ωi = T
t t
xi = xi−1 + αi p̃ + ωi s̃
r i = s − ωi t
Check Convergence, continue if necessary
For continuation it is required that ωi 6= 0
end for
58
6.3
Combined Solver - GMRES/ Bi-CGStab
We formulate the strategy for the implementation of the Solver for the Test Case discussed in the
Chapter 4, which can then be benchmarked with the solution from Matlab. We will implement
Chorin’s Projection scheme in C++ for the Two-Dimensional Case. As we intend to use the
same framework as used in Martijn’s Code, we should be able to plugin Martijn’s RRB solver
into our code to solve the system of equation represented by the symmetric matrix without any
modification. Step 1 of the Chorin’s projection scheme requires a different strategy than the
current implementation of the preconditioned Conjuate Gradient method. This is discussed in
detail in following sub sections.
For the first step of the Chorin’s Projection Scheme, the system of equation to be solved is
given by :
q̇ + Lq = 0,
(6.9)
S
ζ~
and q̇ its time derivative. The matrix L = ζζ
with q =
Sϕζ
ϕ
~
matrix as discussed in Chapter 2.
Sζϕ
is the spatial discretization
Sϕϕ
The Implicit Trapezoidal method is used to integrate the system of ODE’s.
(
I
I
+ βL)q n+1 = (
− (1 − β)L)q n
∆t
∆t
(6.10)
This can also be written as :
I
∆t + βSζζ
βSϕζ
βSζϕ
I
+ βSϕϕ
∆t
I
n+1
∆t − (1 − β)Sζζ
ζ~
=
ϕ
~
(1 − β)S
ϕζ
(1 − β)Sζϕ
I
+ (1 − β)Sϕϕ
∆t
n
ζ~
ϕ
~
(6.11)
The system matrix given by the matrix on the left hand side can be seen as combination of
4 block matrices. The diagonal blocks arise from the spatial discretization of the hyperbolic
system of equations, and thus are Skew-Symmetric in nature. The off-diagonal blocks represent
the coupling between the variables ζ and ϕ.
6.3.1
Preconditioners
Each GMRES or Bi-CGSTAB iteration requires the preconditioning step, namely solving a
system of equation M z = r, where r is some residual and M is the combined preconditioner.
For the diagonal block matrices, it should be possible to construct a preconditioner similar to
the RRB preconditioner of the symmetric pentadiagonal matrix. These are represented as P ,
and let Q represent the off-diagonal blocks in the preconditioner.
In general form, the Combined preconditioner M is given by:
Pζζ Qζϕ
M=
Qϕζ Pϕϕ
Various preconditioners can be constructed based on the structure of Q .
59
(6.12)
Pζζ
0
. This is a block Jacobi preconditioner for the
By setting Q = 0, we obtain : M1 =
0 Pϕϕ
combined system. Equation M1 z = r can be written as:
Pζζ
0
0
Pϕϕ
r
zζ
= ζ
rϕ
zϕ
(6.13)
which can be split into solving Pζζ zζ = rζ and Pϕϕ zϕ = rϕ
Similarly, block Gaussian forms of the preconditioner are given by:
0
Pϕϕ
(6.14)
Sζϕ
Pϕϕ
(6.15)
P
M2 = ζζ
Sϕζ
P
M3 = ζζ
0
In order to solve the system of equation M2 z = r, we can solve:
Pζζ
Sϕζ
0
Pϕϕ
zζ
r
= ζ
zϕ
rϕ
(6.16)
This can be split up into solving Pζζ zζ = rζ and Pϕϕ zϕ = rϕ − Sϕζ zζ
Similarly we can solve M3 z = r.
6.3.2
Preconditioner for Skew-Symmettric Matrix
As discussed above, it is required to formulate preconditioner for the diagonal blocks, which
are skew-symmetric in nature. We will again use Repeated Red Black method to construct the
preconditioner.
The same strategy as described in Section 6.1.8 will be followed to construct the preconditioner.
In the CUDA code, now instead of storing only three diagonals, all five diagonals will be used.
In the current implementation by Martijn [1], a clever trick is used which saves memory (Phase
2c of the construction step) by utilizing the symmetry. The matrix S is split into LDLT and
the stencils corresponding to LT are not stored.
In the case of the matrix being skew symmetric, it can be derived that the preconditioning
matrix has the form P = LDU , where U = −LT . Hence, we do not need to store the diagonals
corresponding to U , and we can use the same preconditioning code in C++ and CUDA as used
by Martijn.
Algorithm
In order to construct the preconditioner represented by the Matrix Equation (6.11), the stencils
that are part of the four blocks, namely Sζζ , Sϕϕ , Sζϕ and Sϕζ are copied into the solver class.
The impact of boundary conditions are incorporated while formulating these stencils, which will
be discussed later in detail.
Copied stencils are modified to represent the left-hand side of the Equation (6.11), and then
preconditioning matrices, Pζζ and Pϕϕ are constructed respectively.
The same algorithm is also implemented in Matlab, and the matrices generated are stored in
order to analyze their convergence characteristics as discussed below.
60
6.3.3
Impact of different Preconditioners on Eigenvalue Spectrum
The effectiveness of a preconditioner for the Symmetric matrix can be evaluated from the reduction in the condition number. For the non-symmetric matrices, the eigenvalues can also contain
complex imaginary components. Thus, the condition number cannot be used to determine the
convergence properties of a given preconditioner.
To understand the convergence property of methods like GMRES and Bi-CGSTAB on a given
matrix A, the eigenvalue spectrum is plotted. Let us assume a hypothetical system whose
eigenvalue spectrum is given in Figure 6.4
Figure 6.4: Eigenvalue Spectrum of matrix A
A circle is drawn which encapsulates all the eigenvalues. Let the distance of the circle from the
origin be C, and radius of the circle be R, as shown in Figure 6.4. The convergence property
R
of the matrix can be related to the ratio . Lower the number, less number of GMRES/BiC
CGSTAB iterations will be required. We will
now study the eigenvalue
spectrum of the coupled
I
+ βSζζ
βSζϕ
system given in Equation (6.11). Let A = ∆t
.
I
βSϕζ
+ βSϕϕ
∆t
Then for a sample test case, where ∆x = ∆y = 5, ∆t = 0.1 , β = 0.5 , h = 50 and Ux = Uy = 1
, the spectrum obtained is shown in Figure 6.5.
61
Figure 6.5: Eigenvalue Spectrum of matrix A
In order to analyze the impact of different preconditioners, the eigenvalue spectrum of M −1 A
is plotted in the following figures. The inverse of the preconditioner and matrix matrix multiplication is carried out in Matlab.
Figure 6.6: Eigenvalue Spectrum of matrix M1−1 A
62
Figure 6.7: Eigenvalue Spectrum of matrix M2−1 A
Figure 6.8: Eigenvalue Spectrum of matrix M3−1 A
63
For all the preconditioners, the ratio R/C is computed for two different values of ∆t and
tabulated below.
Table 6.1: Convergence characteristics with different preconditioners
Without Preconditioner
R
, ∆t = 0.1
C
0.6
R
, ∆t = 1.0
C
6
M1
0.4
4
M2
0.18
1
M3
0.18
1
Preconditioner
It can be seen that the Jacobi block preconditioner M1 does not impact the convergence characteristics, whereas the Gaussian preconditioners M2 and M3 results in a reduction of more
than 3-4 times. This can be attributed to the coupling between the two governing equations.
As discussed before, the maximum
eigenvalue of the governing system of equations is given
√
by the suface gravity wave gh. We have previously seen in the Semi-Implicit schemes, that
ignoring the coupling between the equations for ζ and φ do not give us a stable solution. Similar
behaviour is observed when we choose the Jacobi block preconditioner, as we ignore the contribution from the off-diagonal blocks which contain the components representing the coupling
and determined by the parameters g and h.
R
Also, we can oberseve that an increase in the time step results in a larger
ratio. As the time
C
I
reduces and the matrix A loses its diagonal dominance. If we use a large
step is increased,
∆t
time step, then we should observe in an increase in number of GMRES/Bi-CGSTAB iterations.
64
Chapter 7
C++ and CUDA Implementation
The solver for the Implicit Trapezoidal method in C++ and CUDA is first developed for the
two-dimensional test case. We will employ the Chorin’s Projection scheme as the algorithm to
march in time. An important consideration while developing the code has been to keep the
modularity as it was developed by Martijn [1]. This helps in plugging different solvers with ease
into the code. The kernels developed for CUDA computation by Martijn have been modified in
order to reflect the requirements of implicit set of equations.
The complete flowchart for the Implicit solver for the simplified test problem is given in Figure 7.4. The idea is to benchmark the solver routines, and then use them in the Interactive
wave program. In this chapter, the implementation of the major steps which are also used later
in the Interactive code in detail are explained.
7.1
General comments on implementation
Throughout the C++ code, the class Array2D is used which is a very efficient custom-built C++
class exploiting pointers that let us store and manipulate 2D arrays in a comparatively cheap
manner. The memory structure in C++ is shown in Figure 7.1. Here COLUMNS = Nx1+3 and
ROWS=Nx2+3. This includes the Ghost layers which are used for boundary computations.
In the CUDA environment, the arrays from the host (C++) are copied to the device (GPU) and
stored in an embedded grid. When the original grid has size Nx1 (x) by Nx2 (y) on the host, on
the device it arrives as a Nx2 (x) by Nx1 (y) array. This mirroring effect is due to the storage
pattern difference between C++ and GPU (row-major vs. column-major order respectively).
The structure of the grid in CUDA is shown in Figure 7.2. The embedding is carried out by
choosing a appropriate BORDER WIDTH. Martijn[1] has taken BORDER WIDTH of 16 which
makes that data stored according to the ’optimal 16- spacing (= 64 byte)’ thus allowing access
to the elements in a coalesced fashion.
65
Figure 7.1: Array2D structure in C++ [1]
Figure 7.2: Embedded grid structure in CUDA [1]
The new array is nx[0] by ny[0]. Also a so-called compute-block is indicated. A compute-block
is the basic identity on which CUDA will do the computations. The compute-block is square
given by the dimension DIM COMPUTE BLOCK: 32 x 32 elements . The computing area is
given by cx[0] by cy[0], which is a multiple of the compute-block.
Since the repeated red black ordering is used in the preconditioning step, the data is restored in
66
a r1/r2/b1/b2 storage format. It is divided into four groups representing the repeated red black
ordering: The r1-nodes (even/even red nodes), the b1-nodes (the odd/even black nodes), the
r2 nodes (the odd/odd red nodes), and the b2-nodes (the even/odd black nodes). ”Even/Odd”
here implies that the first coordinate is even and the second coordinate is odd. After the data
is copied in the device from the host, it is restored in this r1/r2/b1/b2 storage format. The
storage structure is shown in Figure 7.3. More details on the storage format, and how to access
the elements by thread indexing can be found in [1].
Figure 7.3: r1/r2/b1/b2 Storage Format [1]
67
Start
1.1: Construct pentadiagonal stencils, taking into account the boundary conditions
1.2 :Construct the Solver for ζ and ϕ and setup the preconditioners for the skew symmetric diagonal blocks.
1.3: Construct the Solver for ψ and setup
the preconditioner for the symmetric matrix.
1.4 Construct the right hand side stencils for
the coupled equation representing ζ and ϕ
1.5 Initiate variables: ζ, ϕ and psi
Is current
no
time ≤ tmax ?
Yes
1.6 Compute the RHS for the coupled
equation, ignoring the ψ contribution.
1.7 Solved the coupled system for ζ and ϕ using preconditioned BiCGStab or GMRES
1.8 Compute the RHS for the psi equation using above value of ϕ
1.9 Solve for psi using RRB Solver
1.10: Update ζ by including contribution from psi
Increment time
Figure 7.4: Flowchart for the Two Dimensional Simplified Test Problem
68
Stop
7.2
Construction of Preconditioner for the Implicit Equation
The spatial discretization matrices are stored in the Array2D structures in C++. Say for Sζζ
which represents the pentadiagonal stencil for the hyperbolic term give in Equation (2.6), the
five diagonals are stored in the variables matzz C, matzz S, matzz W , matzz N , matzz E, where
the subscript zz denotes ζζ, and S to E denotes the directions. In the CUDA implementation,
a structure Grid is used which contain the 9 stencils as the structure members, including the
stencils corresponding to Noth-East, South-East, North-West and South-West directions. It also
has structure members nx, ny, cx and cy which represent the dimensions of the corresnponding
embedded grid and compute area respectively. More details can be found in [1].
The stencils are then adjusted for the boundary conditions. As we have used the Ghost cell
approach in specifying the spatial discretization, it is easy to adjust the stencil at the boundary
for the Neumann boundary conditions. The discretization stencil in two dimensions for Sζζ is
given as:
0
1 U
Sζζ : − 2∆x
W
0
1
2∆y VN
−
1
2∆y VN
1
1
2∆y VS − 2∆x UW
1
VS
− 2∆y
+
1
2∆x UE
0
1
2∆x UE
0
(7.1)
Say for example, at the North boundary, the Neumann boundary conditions are applicable.
Then we set ζi,N x2 +1 = ζi,N x2 , which translates into the following stencil at the North boundary.
Here (i, N x2 + 1) is the Ghost node.
0
1
Sζζ : − 2∆x UW
0
0
1
1
1
V
+
(
V
−
2∆y N
2∆y N
2∆y VS −
1
VS
− 2∆y
1
2∆x UW
+
1
2∆x UE )
0
1
2∆x UE
0
(7.2)
This is programmed in C++ as follows:
Listing 7.1: Stencil construction for Neumann boundary conditions
1
2
3
4
5
6
7
8
9
f o r ( int i 1 = 1 ; i 1 <= Nx1 ; ++i 1 )
{
// North Boundary
m a t z z c [ i 1 ] [ Nx2 ] += m a t z z n [ i 1 ] [ Nx2 ] ;
m a t z z n [ i 1 ] [ Nx2 ] = 0 ;
// South Boundary
mat zz c [ i1 ] [ 1 ]
+= m a t z z s [ i 1 ] [ 1 ] ;
mat zz s [ i1 ] [ 1 ]
= 0;
}
In order to construct the preconditioner for the system of equations represented by the Equation (6.11), the stencils that are part of the four blocks, namely Sζζ , Sϕϕ , Sζϕ and Sϕζ are copied
into the solver class. The off-diagonal block matrices are symmetric in nature, whereas for the
main diagonal block matrices: L = −U T , where L is the strictly lower triangular part of matrix
and U is the strictly upper part of the matrix. Hence, only three diagonals are required to be
copied into the Solver class: South, West and Center. The North and East stencils can then be
reconstructed inside the Solver class. The corresponding C++ code for the reconstruction of
stencils for the diagonal block matrices is shown below.
69
Listing 7.2: Stencil Reconstruction inside the solver class
1
2
3
4
5
6
7
8
f o r ( int j j = 0 ; j j <= Nx2
{
f o r ( int i i = 0 ; i i <=
{
mat zz n [ i i ] [ j j ] =
mat zz e [ i i ] [ j j ] =
}
}
+ 1 ; ++j j )
Nx1 + 1 ; ++i i )
−m a t z z s [ i i ] [ j j + 1 ] ;
−mat zz w [ i i + 1 ] [ j j ] ;
Please note that this reconstruction is valid only when the Neumann boundary conditions have
already been applied to the stencils.
Inside the Solver class, the stencils are modified to represent the left-hand side of the Equation (6.11). This also involves division by the time step (∆t), which is copied to the Solver
class. After the modification, the stencils are stored in C++ standard library format (std::vector
<Array2D <REAL>>). This allows an easy access to the member matrices, and provides modularity to the preconditioner construction and preconditioning solve steps. The preconditioning
matrices, Pζζ and Pϕϕ are then constructed respectively by following the algorithm given in
Section 6.3.2.
The stencils for the right-hand side of the Equation (6.11) are constructed similarly, and stored
in memory for the computation of right-hand side through Sparse matrix vector multiplication
routines.
~ = −Sψϕ ϕ
The Solver for ψ, which is represented by the elliptic equation, Sψψ ψ
~ is constructed
in exactly the same manner as in [1].
7.3
Implementation of Chorin’s Projection Scheme
The time loop in the Figure 7.4 follows the Chorin’s projection scheme described in Chapter 6.
7.3.1
Sparse Matrix Vector Multiplication
The right-hand side of the Equation (6.11) is computed through sparse matrix vector multiplication. In Martijn’s code [1], this step is carried out while computing the time derivatives for
the explicit Leap Frog method. The stencils are computed on the fly. In our algorithm, it is
done differently, as we need to store the matrices in any case. Thus we avoid recomputing the
stencils at every time step. As we have a coupled system, the computations are done in two
parts.
I
~
RHSζ =
− (1 − β)Sζζ ζ~ + (1 − β)Sζϕ ϕ
∆t
I
~
~
RHSϕ = (1 − β)Sϕζ ζ +
+ (1 − β)Sϕϕ ϕ
∆t
(7.3a)
(7.3b)
Computing right-hand sides for both variables ζ and ϕ require two sparse matrix vector mutiplications each.
70
Following code snippet shows the sparse matrix vector multiplication in C++. The diagonals are
represented by cn, .... They are obtained by accessing the elements of the std::vector <Array2D
<REAL>>. The multiplication by X is stored in Y which has the structure of Array2D.
OpenMP directives have been used to employ parallel programming on the CPU.
Listing 7.3: Sparse matrix vector multiplication in C++
1 #pragma
2
for
3
{
4
5
6
7
8
9
10
11
12
13
14
15
16
}
omp f o r
( int i 1 = 1 ; i 1 < nx1 ; ++i 1 )
int i 1 e = i 1 +1;
int i1w = i 1 −1;
f o r ( int j 1 = 1 ; j 1 < nx2 ; ++j 1 )
{
int j 1 n = j 1 +1;
int j 1 s = j1 −1;
Y[ i 1 ] [ j 1 ] = cn [ i 1
] [ j 1 ] ∗ X[ i 1
] [ j1n ] +
ce [ i 1 ] [ j1
] ∗ X[ i 1 e ] [ j 1
] +
cs [ i1
] [ j1
] ∗ X[ i 1
] [ j1s ] +
cw [ i 1
] [ j1
] ∗ X[ i1w ] [ j 1
]
+ X[ i 1 ] [ j 1 ] ∗ c c [ i 1 ] [ j 1 ] ;
}
The computation is done in similar fashion in CUDA. The grid representing the stencil Sζζ
is passed to the CUDA kernel along with the vectors Y and X. A two-dimensional thread
indexing is used here. The computation is done over the compute area shown in Figure 7.2.
Please note that this implementation is different from the sparse matrix vector mupltiplication
on the r1r2b1b2 storage format which will be explained later in the solve step.
Listing 7.4: Sparse matrix vector multiplication in CUDA
1 template <c l a s s T>
2
global
void k e r n e l m a t v 1 (T ∗y , const T ∗x , const Grid &g )
3 {
4
// Computation o f t h e Thread ID−−−−−−−−−−−−
5
int l d = g . nx ;
6
int u = BORDER WIDTH + bx ∗ Bx + tx ;
7
int v = BORDER WIDTH + by ∗ By + ty ;
8
int t i d = v ∗ l d + u ;
9
// −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
10
y [ t i d ] = g . cc [ t i d ] ∗ x [ t i d ] +
11
g . nn [ t i d ] ∗ x [ t i d+l d ] +
12
g . s s [ t i d ] ∗ x [ t i d −l d ] +
13
g . e e [ t i d ] ∗ x [ t i d +1] +
14
g .ww[ t i d ] ∗ x [ t i d − 1 ] ;
15
16 }
71
7.3.2
Solve Step for the Coupled System
This is the backbone of the Chorin’s projection scheme, and one of the most important step
in current thesis. This solves the system of equations arising from the Implicit Trapezoidal
approach. The algorithm has already been discussed. We first present the implementation
of major functions and kernels which are used in the combined solver, whether GMRES/BiCGStab.
In the CUDA implementation, variables ζ and ϕ are restored in the r1r2b1b2 format as the
preconditioning step is involved.
Vector updates
Vector updates (AXPY) is a very easy to implement routine. Following is the equivalent routine
in C++.
Listing 7.5: Vector update (axpy) in C++
1
2
3
4
5
6
7
8
9
10
11
12
13
// Computes y = a∗ y + b ∗ x
void axpy ( Array2D<REAL> &y , const Array2D<REAL> &x ,
const REAL a , const REAL b )
{
#pragma omp p a r a l l e l f o r private ( i 1 , j 1 )
f o r ( int i 1 = 0 ; i 1 < y . dim1 ( ) ; i 1 ++)
{
f o r ( int j 1 = 0 ; j 1 < y . dim2 ( ) ; j 1++)
{
y [ i1 ] [ j1 ] = y [ i1 ] [ j1 ]∗ a + x [ i1 ] [ j1 ]∗b ;
}
}
}
In the actual implementation in C++, the function axpy() is split into vector addition and scalar
multiplication separately, in order avoid incurring extra FLOPS. In the CUDA implementation,
the vector update is carried out over the r1r2b1b2 storage. First the location or thread index
of the respective nodes are computed and then the vector is updated. In the previous implementation by Martijn [1], computation on only red nodes was carried out. Here as we do not
operate on the Schur Complement as discussed before, computation is carried out over both red
and black nodes.
The Code snippet is shown below:
Listing 7.6: Sparse Matrix Vector Multiplication in C++
template <class T>
__global__ void kernel_axpy(T *y, const T *x,
                            const T a, const T b, const Grid g)
{
    int cgx = g.cx;
    int cgy = g.cy;
    int ld  = g.nx;

    int v_r1 = BORDER_WIDTH  + by * By + ty;
    int u_r1 = BORDER_WIDTH  + bx * Bx + tx;
    int v_r2 = BORDER_WIDTH2 + cgy + by * By + ty;
    int u_r2 = BORDER_WIDTH2 + cgx + bx * Bx + tx;

    int u_b1 = BORDER_WIDTH2 + cgx + bx * Bx + tx;
    int v_b1 = BORDER_WIDTH  + by * By + ty;
    int u_b2 = BORDER_WIDTH  + bx * Bx + tx;
    int v_b2 = BORDER_WIDTH2 + cgy + by * By + ty;

    int loc_r1 = ld * v_r1 + u_r1;
    int loc_r2 = ld * v_r2 + u_r2;
    int loc_b1 = ld * v_b1 + u_b1;
    int loc_b2 = ld * v_b2 + u_b2;

    y[loc_r1] = a * y[loc_r1] + b * x[loc_r1];
    y[loc_r2] = a * y[loc_r2] + b * x[loc_r2];
    y[loc_b1] = a * y[loc_b1] + b * x[loc_b1];
    y[loc_b2] = a * y[loc_b2] + b * x[loc_b2];
}
Let us take the example of the r1 nodes to understand the above code snippet. First the two-dimensional index (u_r1, v_r1) of the r1 node is determined using the structure shown in Figure 7.3. Then the global thread index is computed, which is given by loc_r1 = ld * v_r1 + u_r1, and the vector y is updated. The computation is carried out in the same way for the r2, b1 and b2 nodes.
Dot Products
Two implementations of the dot product (also called the inner product) for vectors of type Array2D<REAL> are provided in the C++ code. The first implementation uses the BLAS library. The second implementation is a generic one, which can be used if the BLAS library is not installed; it uses the OpenMP reduction scheme. This implementation excludes the ghost boundary nodes and computes the dot product over the inner nodes only.
Listing 7.7: Dot Product version 2 in C++
REAL dot_product(const Array2D<REAL> &Vec1, const Array2D<REAL> &Vec2)
{
    REAL sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i1 = 1; i1 < Vec1.dim1() - 2; ++i1)
    {
        for (int j1 = 1; j1 < Vec1.dim2() - 2; ++j1)
        {
            sum += Vec1[i1][j1] * Vec2[i1][j1];
        }
    }
    return sum;
}
On the GPU it is difficult to use existing libraries like CUBLAS, as the variables are stored in the r1r2b1b2 format and these library routines require the input vectors as linear one-dimensional arrays. Also, as the majority of the routines for the RRB solver are already custom built, we do not strive to use the CUBLAS library, but build upon the existing implementation from Martijn [1]. As discussed before, this is done in two steps:
• Do a mass reduction on the GPU over the compute blocks, and store the intermediate results.
• Use Kahan summation on the CPU to add the elements of the intermediate results (a small sketch of this CPU step is given below).
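The second step is plain compensated (Kahan) summation of the per-block partial sums copied back from the device. The snippet below is only an illustrative sketch of that idea, not the routine used in the actual code; the vector of partial results and the use of double precision are assumptions made for this example.

#include <vector>

// Compensated (Kahan) summation of the per-block partial sums.
double kahan_sum(const std::vector<double> &partial)
{
    double sum = 0.0;
    double c   = 0.0;                 // running compensation for lost low-order bits
    for (double v : partial) {
        double y = v - c;             // apply the compensation
        double t = sum + y;           // low-order bits of y may be lost here
        c   = (t - sum) - y;          // recover what was lost
        sum = t;
    }
    return sum;
}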
The reader is encouraged to go through Martijn's work [1] to understand the implementation of the above two steps. Here we explain the modifications required to compute the dot product over all the nodes instead of only the red nodes. The change is needed in the first step, the mass reduction on the GPU, and can be achieved easily: each compute block is divided into sub-blocks by introducing the block factor DOTP_BF.
Listing 7.8: Dot product in CUDA
template <class T>
__global__ void kernel_dotp1(T *odata, const T *y, const T *x,
                             const Grid g)
{
    int cx = g.cx;
    int cy = g.cy;
    int ld = g.nx;

    int u_r1 = BORDER_WIDTH  + bx * DIM_COMPUTE_BLOCK + tx;
    int v_r1 = BORDER_WIDTH  + by * DIM_COMPUTE_BLOCK + ty;
    int u_r2 = BORDER_WIDTH2 + cx + bx * DIM_COMPUTE_BLOCK + tx;
    int v_r2 = BORDER_WIDTH2 + cy + by * DIM_COMPUTE_BLOCK + ty;

    // Also include the black nodes
    int u_b1 = BORDER_WIDTH2 + cx + bx * DIM_COMPUTE_BLOCK + tx;
    int v_b1 = BORDER_WIDTH  + by * DIM_COMPUTE_BLOCK + ty;
    int u_b2 = BORDER_WIDTH  + bx * DIM_COMPUTE_BLOCK + tx;
    int v_b2 = BORDER_WIDTH2 + cy + by * DIM_COMPUTE_BLOCK + ty;

    int loc_r1 = ld * v_r1 + u_r1;
    int loc_r2 = ld * v_r2 + u_r2;
    int loc_b1 = ld * v_b1 + u_b1;
    int loc_b2 = ld * v_b2 + u_b2;
    int tid    = Bx * ty + tx;

    __shared__ T sm[DIM_COMPUTE_BLOCK * (DIM_COMPUTE_BLOCK / DOTP_BF)];
    T sum = 0;

    for (int k = 0; k < DOTP_BF; ++k) {
        sum += y[loc_r1] * x[loc_r1];
        sum += y[loc_r2] * x[loc_r2];
        sum += y[loc_b1] * x[loc_b1];
        sum += y[loc_b2] * x[loc_b2];

        loc_r1 += ld * (DIM_COMPUTE_BLOCK / DOTP_BF);
        loc_r2 += ld * (DIM_COMPUTE_BLOCK / DOTP_BF);
        loc_b1 += ld * (DIM_COMPUTE_BLOCK / DOTP_BF);
        loc_b2 += ld * (DIM_COMPUTE_BLOCK / DOTP_BF);
    }

    sm[tid] = sum;
    __syncthreads();

    for (int k = (DIM_COMPUTE_BLOCK / 2) *
                 (DIM_COMPUTE_BLOCK / DOTP_BF); k > 16; k >>= 1) {
        if (tid < k) {
            sm[tid] += sm[tid + k];
            __syncthreads();
        }
    }
    if (tid < 16)
        sm[tid] += sm[tid + 16];
    __syncthreads();
    if (tid < 8)
        sm[tid] += sm[tid + 8];
    __syncthreads();
    if (tid < 4)
        sm[tid] += sm[tid + 4];
    __syncthreads();
    if (tid < 2)
        sm[tid] += sm[tid + 2];
    __syncthreads();
    if (tid < 1)
        sm[tid] += sm[tid + 1];
    __syncthreads();

    if (tid == 0)
        odata[by * gridDim.x + bx] = sm[tid];
}
Sparse matrix vector multiplication over r1r2b1b2
Inside the solve routine, the C++ implementation of the sparse matrix vector multiplication remains the same as in Listing 7.3. In CUDA the matrix-vector product is implemented differently, as the r1r2b1b2 storage has to be taken into account. For each r1, r2, b1 and b2 node, first the location is determined; then the matrix vector product is computed using the stencil values adjacent to the node.
Listing 7.9: Sparse matrix vector multiplication in CUDA
template <class T>
__global__ void kernel_matv1(T *y, const T *x, const Grid g)
{
    int ld = g.nx;   // leading dimension

    int u_r1 = BORDER_WIDTH  + bx * Bx + tx;
    int v_r1 = BORDER_WIDTH  + by * By + ty;
    int u_r2 = BORDER_WIDTH2 + g.cx + bx * Bx + tx;
    int v_r2 = BORDER_WIDTH2 + g.cy + by * By + ty;

    int u_b1 = BORDER_WIDTH2 + g.cx + bx * Bx + tx;
    int v_b1 = BORDER_WIDTH  + by * By + ty;
    int u_b2 = BORDER_WIDTH  + bx * Bx + tx;
    int v_b2 = BORDER_WIDTH2 + g.cy + by * By + ty;

    int loc_r1 = ld * v_r1 + u_r1;
    int loc_r2 = ld * v_r2 + u_r2;
    int loc_b1 = ld * v_b1 + u_b1;
    int loc_b2 = ld * v_b2 + u_b2;

    const T x_r1    = FETCH_XX(u_r1,     v_r1,     ld);
    const T x_r1up1 = FETCH_XX(u_r1 + 1, v_r1,     ld);
    const T x_r1vp1 = FETCH_XX(u_r1,     v_r1 + 1, ld);
    const T x_r2    = FETCH_XX(u_r2,     v_r2,     ld);
    const T x_r2vm1 = FETCH_XX(u_r2,     v_r2 - 1, ld);
    const T x_r2um1 = FETCH_XX(u_r2 - 1, v_r2,     ld);

    const T x_b1    = FETCH_XX(u_b1,     v_b1,     ld);
    const T x_b1um1 = FETCH_XX(u_b1 - 1, v_b1,     ld);
    const T x_b1vp1 = FETCH_XX(u_b1,     v_b1 + 1, ld);
    const T x_b2    = FETCH_XX(u_b2,     v_b2,     ld);
    const T x_b2vm1 = FETCH_XX(u_b2,     v_b2 - 1, ld);
    const T x_b2up1 = FETCH_XX(u_b2 + 1, v_b2,     ld);

    y[loc_b1] = g.cc[loc_b1] * x_b1    +
                g.nn[loc_b1] * x_r2    +
                g.ss[loc_b1] * x_r2vm1 +
                g.ee[loc_b1] * x_r1up1 +
                g.ww[loc_b1] * x_r1;

    y[loc_b2] = g.cc[loc_b2] * x_b2    +
                g.nn[loc_b2] * x_r1vp1 +
                g.ss[loc_b2] * x_r1    +
                g.ee[loc_b2] * x_r2    +
                g.ww[loc_b2] * x_r2um1;

    y[loc_r1] = g.cc[loc_r1] * x_r1    +
                g.nn[loc_r1] * x_b2    +
                g.ss[loc_r1] * x_b2vm1 +
                g.ee[loc_r1] * x_b1    +
                g.ww[loc_r1] * x_b1um1;

    y[loc_r2] = g.cc[loc_r2] * x_r2    +
                g.nn[loc_r2] * x_b1vp1 +
                g.ss[loc_r2] * x_b1    +
                g.ee[loc_r2] * x_b2up1 +
                g.ww[loc_r2] * x_b2;
}
Preconditioning Solve
As discussed in Section 6.3, each iteration requires a preconditioning step, in which M z = r has to be solved for z. For the main diagonal block matrices, the preconditioning matrix can be written as M = LDU. Here L is a lower triangular matrix whose diagonal entries are equal to 1. Let Ls be the strictly lower triangular part of L. Then we can assign Us = −Ls^T, where Us is the strictly upper triangular part of U; the diagonal entries of U are also equal to 1. During the preconditioner construction step, only the strictly lower triangular part Ls and the diagonal matrix D are stored.
Solving M z = r can be done in three steps. Set y := Uz and x := DUz = Dy; then (a compact restatement follows the list):
• solve Lx = r using forward substitution;
• compute y = D⁻¹x;
• solve Uz = y using backward substitution, where U can be constructed with the help of Us = −Ls^T.
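Written out with the notation above, and using that only Ls and D are stored (with Us = −Ls^T in the skew-symmetric case), the three steps amount to the following. This is only a compact restatement of the procedure described in the text, not an additional result:

\[
M z = r,\quad M = LDU,\quad L = I + L_s,\quad U = I + U_s,\quad U_s = -L_s^T
\;\Longrightarrow\;
\begin{cases}
(I + L_s)\,x = r & \text{(forward substitution)}\\
y = D^{-1}x & \text{(division by the diagonal)}\\
(I - L_s^T)\,z = y & \text{(backward substitution)}
\end{cases}
\]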
The implementation of the above three steps is described in detail in [1]. Step 1, solving Lx = r, is done grid-wise in two phases, where grid refers to the repeated red-black ordering level; we go from the finest grid to the coarsest grid. In the case of the symmetric RRB solver, the CG algorithm operates only on the red nodes, and the first-level black nodes are eliminated while constructing the Schur complement. Phase 1 corresponds to updating the r2-nodes using the r1-nodes in the same level. Phase 2 corresponds to updating the r1- and r2-nodes using the b1- and b2-nodes in the same level. Thus, starting with Phase 1, we also end in Phase 1.
In the case of the skew-symmetric preconditioning step, the computation is done over all the nodes, so we start with Phase 2 of the process instead, in which the r1- and r2-nodes are updated using the b1- and b2-nodes at the finest level.
Step 2 of the preconditioning solve is easy, as it only requires division by the diagonal. Again, for the symmetric RRB solver this division is carried out only for the r1- and r2-nodes at the finest level; in our implementation, this step is carried out over all the nodes at the finest level.
In Step 3 we go from the coarse grids to the fine grids and backward substitution is used. Here Phase 4 corresponds to updating the r1-nodes using the r2-nodes in the same level, and Phase 3 corresponds to updating the b1- and b2-nodes using the r1- and r2-nodes in the same level. The backward substitution starts with Phase 4 and ends with Phase 4, as there is no need to update the black nodes at the finest level.
Again, two changes are required in the implementation for the skew-symmetric preconditioning step. In the RRB solver, the symmetry Us = Ls^T has been used to carry out the computations in Phase 4 and Phase 3; here we employ the skew-symmetry Us = −Ls^T instead. Secondly, our backward substitution now ends with Phase 3, in which the black nodes at the finest level are also updated using the red nodes at the finest level.
7.3.3 RRB Solver for ψ, and the ζ Correction
In order to solve the elliptic equation (2.13b) for ψ, first the right-hand side is computed. This is done in C++ with a simple sparse matrix vector multiplication. The stencil corresponding to Sψϕ has already been stored in the system, and thus computing −Sψϕ ϕ⃗ is easy. In CUDA, the previous routine solvepsi() from Martijn [1] has been used.
After computing the right-hand side of the equation, the RRB solve function is called. Depending on the predefined directives, it calls either the C++ or the CUDA RRB solver. With the updated value of ψ, ζ can now be updated, which is the last step in Chorin's projection scheme. This step only requires a sparse matrix vector multiplication and a vector update. Please note that in CUDA, the computations outside the solve steps are carried out on the embedded grid shown in Figure 7.2; only for the solve routines are the computations carried out on the r1r2b1b2 structure shown in Figure 7.3.
7.4 Plugging the Solver into the Interactive Waves Code
The solver developed above is then plugged into the Interactive Waves code. As the solver for the test case has been developed only for Neumann boundary conditions, some changes are required to take the impact of incoming waves into account. We first discuss the previous implementation of the boundary conditions by de Jong [1] for the RRB solver and then employ the same ideas in the implicit solver.
7.4.1 Incoming Boundary Conditions: Previous Implementation
Dirichlet or incoming boundary conditions depend on the current time and on fixed parameters like the wavenumber and frequency, which are inputs to the model. At every time step, the incomingboundary() routine is called to generate the incoming boundary conditions. Let us see how they are applied to Equation (2.13b). The stencil Sψψ has been constructed with Neumann boundary conditions in mind, whereas during the construction of Sψϕ the Neumann boundary conditions are not applied. First the routine boundaryzetaphi() is called, which is shown below.
Listing 7.10: Dirichlet/Incoming boundary conditions
if (m_bcN == 0) /* reflective Neumann boundary */
{
    for (int j1 = 1; j1 <= m_Nx1; ++j1)
    {
        phi[j1][Nx2+1] = phi[j1][m_Nx2]
                       + phi_in_N_ext[j1]
                       - phi_in_N_cen[j1];
    }
}
In the above code, the incoming boundary conditions, given by the variables phi_in_N_ext and phi_in_N_cen, are applied to the variable φ. These variables are generated by the incomingboundary() routine. The incoming boundary conditions are applied to the variable ψ in a similar way.
The right-hand side of Equation (2.13b) is computed with sparse matrix vector multiplication.
On the North boundary, the stencil Sψϕ takes the following form:
Sψϕ : ∆x∆y [  0                 (1/∆y²) hN DN                                                                        0
              (1/∆x²) hW DW     −( (1/∆y²) hN DN + (1/∆x²) hW DW + (1/∆x²) hE DE + (1/∆y²) hS DS )                  (1/∆x²) hE DE
              0                 (1/∆y²) hS DS                                                                        0  ]        (7.4)
Please note that the Neumann boundary conditions are not applied while constructing this stencil. Thus the stencil component at the North boundary, given by (1/∆y²) hN DN, is not zero. When the sparse matrix multiplication at the North boundary is carried out, we get the following code:
Listing 7.11: Right-hand side contribution at the North boundary
for (int j1 = 1; j1 <= m_Nx1; ++j1)
{
    b_psi[j1][Nx2] -= phi[j1][Nx2]   * S_psi_phi_C[j1][Nx2]
                    + phi[j1][Nx2+1] * S_psi_phi_N[j1][Nx2]
                    + phi[j1][Nx2-1] * S_psi_phi_S[j1][Nx2]
                    + phi[j1+1][Nx2] * S_psi_phi_E[j1][Nx2]
                    + phi[j1-1][Nx2] * S_psi_phi_W[j1][Nx2];
}
We can see that the value of φ at the ghost node [j1][Nx2+1], which was set in Listing 7.10, has been used in Listing 7.11. Now we will see what happens if the Neumann boundary conditions are applied to the stencil Sψψ, which is the stencil on the left-hand side of Equation (2.13b). In that case, for the North boundary, the term Sψψ,N will be zero and Sψψ,C′ = Sψψ,C + Sψψ,N, as discussed in Listing 7.1. The matrix vector multiplication then results in:
Sψψ ψ⃗ = Sψψ,C′ ψ[j1][Nx2] + · · · = (Sψψ,C + Sψψ,N) ψ[j1][Nx2] + · · ·        (7.5)

whereas if we have Dirichlet/incoming boundary conditions instead of Neumann boundary conditions, the sparse matrix multiplication should result in the following:

Sψψ ψ⃗ = Sψψ,C ψ[j1][Nx2] + Sψψ,N ψ[j1][Nx2 + 1] + · · ·
       = Sψψ,C ψ[j1][Nx2] + Sψψ,N (ψ[j1][Nx2] + ψ.in.N.ext[j1] − ψ.in.N.cen[j1]) + · · ·
       = (Sψψ,C + Sψψ,N) ψ[j1][Nx2] + Sψψ,N (ψ.in.N.ext[j1] − ψ.in.N.cen[j1]) + · · ·
       = Sψψ,C′ ψ[j1][Nx2] + Sψψ,N (ψ.in.N.ext[j1] − ψ.in.N.cen[j1]) + · · ·        (7.6)
Comparing Equation (7.5) and Equation (7.6), we can easily see that there is a missing term. This has to be taken into account additionally while computing the right-hand side of Equation (2.13b). As the Neumann boundary conditions have only been applied to Sψψ, we need to subtract Sψψ,N (ψ.in.N.ext[j1] − ψ.in.N.cen[j1]) from the right-hand side, for all directions respectively. This can be seen as a superimposition of Dirichlet boundary conditions over the Neumann boundary conditions. A similar method needs to be adopted while computing the right-hand side for the implicit set of equations. This is explained in the next section.
7.4.2 Incoming Boundary Conditions: Current Implementation
Let us consider the system of coupled equations arising from the implicit trapezoidal time integration given in Equation (7.7):

( I/∆t + βL ) q^(n+1) = ( I/∆t − (1 − β)L ) q^n        (7.7)

Here, assume that Neumann boundary conditions have been applied while constructing the matrix L. Now assume that on the North boundary we have Dirichlet (incoming) boundary conditions, generated by the routine incomingwaves(). In that case, we need to superimpose the incoming boundary conditions on the Neumann boundary conditions, as was done in the previous section. There is one catch, though. In the previous section, the incoming conditions were taken at the next time step only. In our case of the implicit trapezoidal scheme, we need to account for the incoming boundary conditions at both the previous and the new time step. This is crucial for the stability of the trapezoidal scheme, as the incoming boundary conditions depend on time. In order to avoid computing the incoming boundary conditions twice, the following steps are carried out (a sketch of the resulting loop structure is given after the list):
• Introduce a start-up step, which initializes the incoming waves and also solves for ψ based on the initial values of φ. This step is called before the time loop.
• Inside the time loop, initialize the right-hand side for the coupled implicit equation based on the previous incoming waves. The correction due to the superimposition of the incoming waves is computed.
• Increment time and call the incomingwaves() routine to compute the incoming boundary conditions at the new time step.
• Update the right-hand side for the coupled implicit equation with a sparse matrix vector multiplication and the correction due to the superimposition of the incoming waves at the new time step.
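The sketch below illustrates the resulting structure of the time loop. It is only an outline under the assumptions stated here: the helper routines are illustrative placeholders passed in as callbacks, not the actual function names used in the Interactive Waves code.

#include <functional>

// Outline of the time loop with incoming boundary conditions handled at both
// the old and the new time level, as described in the list above.
void run_time_loop(double dt, double t_end,
                   const std::function<void(double)> &compute_incoming_waves,
                   const std::function<void()> &init_rhs_with_old_waves,
                   const std::function<void()> &update_rhs_with_new_waves,
                   const std::function<void()> &solve_coupled_system)
{
    // The start-up step (incoming waves initialized and psi solved from the
    // initial phi) is assumed to have been done before this loop is entered.
    double t = 0.0;
    while (t < t_end) {
        init_rhs_with_old_waves();     // RHS part and BC correction at level n
        t += dt;
        compute_incoming_waves(t);     // incoming boundary data at level n+1
        update_rhs_with_new_waves();   // sparse matvec + BC correction at n+1
        solve_coupled_system();        // Bi-CGSTAB (or GMRES) for zeta and phi
    }
}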
Incoming waves have a great influence on the way the final solution behaves. Devising a proper strategy to handle the boundary conditions has been a very important step in the current thesis for maintaining the stability of the implicit time integration method. We encourage implementing a simplified two-dimensional test case with incoming boundary waves in the future, so that the required properties can be studied in more detail and in a simplified way.
Chapter 8
Results - Two-Dimensional Test Case
In this chapter, we discuss the results of the two-dimensional test problem for the C++ and CUDA implementations.
We have seen the benchmark results from MATLAB in Section 5.5. There is one difference: the boundary conditions used in Section 5.5 are periodic, whereas the test case built in C++ and CUDA makes use of Neumann boundary conditions. The MATLAB code has been modified to take the Neumann boundary conditions into account by adjusting the stencil formulation.
We will jump directly to the important problem-related results, which include a discussion of the role of the different preconditioners, the number of iterations required for GMRES/Bi-CGSTAB, the storage requirements, and the floating point operations. Then we will discuss the timing results corresponding to the C++ and the CUDA code.
8.1 Convergence Properties
Here we discuss the number of iterations required to solve the coupled set of equations by either the Bi-CGSTAB or the GMRES algorithm. The number of iterations required depends on the variables given below:
• the structure of the preconditioner
• the accuracy or solver tolerance
• the time step and the grid size
• the problem size
The other parameters, which are fixed throughout this section, are:
• Ux = Uy = 1 m/s
• h = 50 m
• g = 9.81 m/s²
8.1.1 Termination Criteria
The number of iterations required to achieve convergence depends on the tolerance chosen. We have used the relative error of the 2-norm of the residual ri with respect to the 2-norm of the initial right-hand side vector b as the termination criterion, i.e. we stop the iterative process when

⟨ri, ri⟩ / ⟨b0, b0⟩ ≤ psitol²        (8.1)

This criterion is scale invariant, which implies that increasing the number of grid points does not lead to a more stringent criterion. We have fixed the tolerance at 1e−5. The iterative process is stopped either when we satisfy the termination criterion or when we reach the maximum number of iterations (a small sketch of this test is given below).
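As an illustration, the termination test can be coded as a simple comparison of the squared quantities, which avoids taking square roots. The snippet below is a minimal sketch with illustrative names (rr and bb stand for ⟨ri, ri⟩ and ⟨b0, b0⟩), not the actual solver code.

// Stop when <r_i, r_i> / <b_0, b_0> <= psitol^2, or when the iteration cap is hit.
bool stop_iterating(double rr, double bb, double psitol, int iter, int max_iter)
{
    return (rr <= psitol * psitol * bb) || (iter >= max_iter);
}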
8.1.2 Variation with ∆t for Various Preconditioners
The matrix solved by the linear iterative solvers is given in Equation (6.11). The diagonal elements contain a 1/∆t term, which results in a reduction of the diagonal dominance of the matrix as the time step increases. The overall convergence properties also depend on the preconditioner chosen, as discussed previously in Section 6.3.2. Here M1 refers to the block-Jacobi preconditioner

M1 = [ Pζζ   0
       0     Pϕϕ ],

while M2 and M3 refer to the two variants of the block Gauss-Seidel preconditioners

M2 = [ Pζζ   0             M3 = [ Pζζ   Sζϕ
       Sϕζ   Pϕϕ ],               0     Pϕϕ ].
Figure 8.1: Variation of the number of iterations with the time step: Bi-CGSTAB
Figure 8.1 shows the variation of the number of iterations with the time step for the different preconditioners; here Bi-CGSTAB is used as the iterative solver. The results are close to our expectations as discussed in Section 6.3.3: while the block-Jacobi preconditioner does not greatly help in reducing the number of iterations required to solve the system, the Gauss-Seidel preconditioners result in a large reduction in the number of iterations.
It is well known that in certain situations the Bi-CGSTAB method suffers from stagnation. In order to analyze this, we implemented the GMRES(m) method, which is the restarted version of the GMRES method discussed in Section 6.2.2. The major drawback of the GMRES method is that the amount of work and storage required per iteration rises linearly with the iteration count. In the restarted version, after a chosen number of iterations m, the accumulated data is cleared and the intermediate result is used as the initial guess for the next m iterations. This procedure is repeated until convergence is reached; a sketch of this restart logic is given below. Here we have chosen m = 30. The results with the GMRES(m) method are shown in Figure 8.2.
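The restart mechanism itself is simple. The sketch below shows only this outer logic under the stated assumptions: gmres_cycle is an illustrative placeholder that runs at most m inner GMRES iterations, updates the solution in place and returns the new residual norm; it is not part of the actual solver interface.

#include <functional>

// Outer loop of GMRES(m): after every m inner iterations the Krylov data is
// discarded and the current iterate is used as the new starting guess.
double restarted_gmres(const std::function<double(int)> &gmres_cycle,
                       double res0, double tol, int m, int max_restarts)
{
    double res = res0;
    for (int k = 0; k < max_restarts && res > tol; ++k) {
        res = gmres_cycle(m);   // accumulated basis is cleared inside each cycle
    }
    return res;
}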
Figure 8.2: Variation of the number of iterations with timestep: GMRES(30)
Let us consider the results with the preconditioners M2 and M3. The number of iterations required for convergence is higher for GMRES(m), and the difference increases as we take larger time steps. The comparison between the two methods is shown in Figure 8.3.
Figure 8.3: Variation of the number of iterations with timestep: GMRES vs Bi-CGSTAB for preconditioner M2
The results encourage us to proceed with the implementation of Bi-CGSTAB in CUDA, though there is a catch here. If we look at time steps below 1 second, the difference in the number of iterations is not very large. Also, on the computational side, the GMRES method only requires one sparse matrix vector multiplication and one preconditioner solve per iteration, whereas the Bi-CGSTAB method requires two sparse matrix vector multiplications and two preconditioner solves per iteration. A good comparison would therefore be the timing results of the two methods.
For the CUDA implementation of the implicit solver, we have only considered the Bi-CGSTAB algorithm, as it was relatively easy to switch from the existing CUDA CG solver to a CUDA Bi-CGSTAB solver. As the preconditioner M2 provides the optimal number of iterations amongst the preconditioners compared, we have only considered it for further analysis, though functionality is provided in the code to switch between the various preconditioners.
8.1.3 Variation with the Time Step Ratio
The ∆t based on the CFL number depends on the spatial step size: if the spatial step size is reduced, then ∆tCFL reduces proportionally. We examined the behavior of the Bi-CGSTAB method with the preconditioner M2 and plotted the required number of iterations for various ∆t/∆tCFL ratios. This is shown in Figure 8.4.
Figure 8.4: Variation of the number of iterations with the ∆t/∆tCFL ratio: Bi-CGSTAB with preconditioner M2
We do this for various step sizes (keeping the number of grid points the same). It is observed that, irrespective of the step size, if the ∆t/∆tCFL ratio is constant, the number of iterations remains constant too.¹ Also, we can see a linear increase in the number of iterations with the time step, although after a certain time step the slope becomes larger. This can be attributed to the impact of the initial guess: if the time step is larger than the wave period, then we lose information about the wave. In that case the initial guess for the iterative solver, which is the solution from the previous time step, may not be good enough.
Further research can be carried out into determining a suitable initial guess by extrapolating values from previous time steps. Although this would result in an extra storage requirement, the trade-off between storage and efficiency would have to be evaluated for such a case.
8.1.4 Impact of the Problem Size
To analyze the impact of the problem size, we have chosen a square grid with N = 32, 64, 128, 256, 512, 1024, where N is the number of unknowns in both the x- and y-directions. The step size is fixed at ∆x = 5 m, and ∆t = 1 s. The results are gathered in Figure 8.5.
¹ This is not presented in this report.
Figure 8.5: Variation of the number of iterations with the problem size
We can see that the iteration count is almost constant for N > 256; we have checked this for both the GMRES and Bi-CGSTAB methods. The realistic test cases like the river IJssel or Plymouth Sound have larger problem sizes, which makes this observation even more attractive for them: we can expect the required number of iterations for convergence to be constant for realistic problem sizes.
There is a catch here. In the two-dimensional test case, the waves occupy the complete grid starting from the initial conditions; the solver then has to compute the waves at all grid points from the first time step onwards. This is different for the realistic case. Initially the sea is in a steady-state position, and the waves are then introduced through the incoming boundary waves. The required number of iterations may vary with time as the waves start filling the grid. We will discuss this in detail in the next chapter.
8.2 Bi-CGSTAB vs CG: Memory Requirement and Operation Count
In this section, we discuss the extra memory and operation count required to convert the symmetric RRB solver based on preconditioned CG into the non-symmetric RRB solver based on preconditioned Bi-CGSTAB.
8.2.1 Memory Requirement: C++
In C++, computing the memory requirement is straightforward, as we do not need to allocate memory in a different storage scheme as in CUDA.
The storage requirement for the preconditioned CG method is given in Table 8.1 [1]. Here n = Nx1 × Nx2 is the total number of grid points.
Table 8.1: Storage requirements for the preconditioned CG algorithm

Variable   Type     Description        Memory
A          Matrix   System matrix      5n
P          Matrix   Preconditioner     5n
b          Vector   Right-hand side    n
x          Vector   Solution           n
r          Vector   Residual           n
p          Vector   Search direction   n
q          Vector   q = Ap             n
z          Vector   z = M⁻¹r           n
Total = 16 × n
In the Bi-CGSTAB C++ solver, we need to allocate memory for all system matrices, preconditioners, and for the variables corresponding to both ζ and ϕ. Table 8.2 shows the memory
requirement of the Bi-CGSTAB solver.
Table 8.2: Storage requirements for the Bi-CGSTAB algorithm

Variable                Type     Description        Memory
Sζζ, Sζϕ, Sϕϕ, Sϕζ      Matrix   System matrices    5n × 4
Pζζ, Pϕϕ                Matrix   Preconditioners    5n × 2
bζ, bϕ                  Vector   Right-hand side    n × 2
xζ, xϕ                  Vector   Solution           n × 2
Rζ, Rϕ                  Vector   Residual           n × 2
R̃ζ, R̃ϕ                  Vector   Initial residual   n × 2
Pζ, Pϕ                  Vector   p                  n × 2
P̂ζ, P̂ϕ                  Vector   M⁻¹p               n × 2
Sζ, Sϕ                  Vector   s                  n × 2
Ŝζ, Ŝϕ                  Vector   M⁻¹s               n × 2
Tζ, Tϕ                  Vector   t                  n × 2
Total = 48 × n
We can see that there is a threefold increase in the memory required. When the PCG method is used with the explicit Leap-Frog scheme, the other stencils are not stored but computed on the fly.
Let us discuss whether we really need to store the stencils for the Bi-CGSTAB (or GMRES) method. The stencils Sζζ and Sϕϕ are used to compute the preconditioner, and they have to be stored in memory. It would be possible to store only three diagonals and use the skew-symmetry, but that makes the code harder to read.
The stencils Sζϕ and Sϕζ are used in the Bi-CGSTAB algorithm while computing the matrix-vector product. As this operation is required at every iteration (we will discuss the required number of operations in the next section), we have decided to store these stencils in memory instead of computing them at every iteration. In the previous implementation by Martijn [1], these stencils were constructed only once per time step, whereas in our case they are required at least twice per iteration of Bi-CGSTAB or GMRES. It would, however, be possible to use the symmetry of the stencil Sζϕ and the diagonal nature of the stencil Sϕζ to reduce the memory requirement from 48n to 42n.
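The reduction from 48n to 42n quoted above is just a counting argument (a back-of-the-envelope check, not a separate result): exploiting the symmetry of Sζϕ would allow storing three instead of five diagonals, and the diagonal nature of Sϕζ would allow storing one instead of five:

\[
48n \;-\; \underbrace{2n}_{S_{\zeta\varphi}\ \text{symmetric}} \;-\; \underbrace{4n}_{S_{\varphi\zeta}\ \text{diagonal}} \;=\; 42n .
\]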
8.2.2 Memory Requirement: CUDA
In this section, we give an indication of the memory requirement of the Bi-CGSTAB solver on the GPU and compare it with the CG requirements.
In the CUDA implementation, we have seen the use of an embedded grid and the r1r2b1b2 storage format. It can be argued that the embedding results in a larger memory requirement, but that is only relevant for small problems: the embedding width is 16 grid points, which is not significant for the larger problems that we intend to solve.
For storing the preconditioners, which are constructed with the repeated red-black ordering scheme, extra storage is required, as we need to store the grid at each RRB level. At every level, the number of grid points is reduced by a factor of 4.
Thus we end up with a geometric series starting with n and with ratio 1/4. The total memory requirement (in the ideal case of a very large initial grid) is given by:

Total memory required = Σ_{i=0}^{∞} n/4^i = 4n/3        (8.2)

For smaller problems, the ratio will be higher than 4/3 due to the effect of the embedded grid. The list of data objects required in the PCG algorithm is given below.
Table 8.3: Storage requirements for the preconditioned CG algorithm in CUDA

Variable         Type                      Storage format          Memory
*m_prec          preconditioner stencils   repeated r1/r2/b1/b2    5 × 4n/3
*m_orig          original stencil          single r1/r2/b1/b2      5 × n
*m_dX            vector x                  standard                n
*m_dB            vector b                  standard                n
*m_dXr1r2b1b2    vector x                  single r1/r2/b1/b2      n
*m_dBr1r2b1b2    vector b                  single r1/r2/b1/b2      n
*m_dYr1r2b1b2    vector y                  single r1/r2/b1/b2      n
*m_dPr1r2b1b2    vector p                  single r1/r2/b1/b2      n
*m_dQr1r2b1b2    vector q                  single r1/r2/b1/b2      n
*m_dRr1r2b1b2    vector r                  single r1/r2/b1/b2      n
*m_dZr1r2b1b2    vector z                  repeated r1/r2/b1/b2    4n/3
*m_drem          for dot products          standard                n
Total = 23 × n
The vectors x and b are stored twice, in different storage formats. Compared to the C++ PCG implementation there is an increase in the storage requirement of at least a factor 1.4.
We now tabulate the storage requirements for the Bi-CGSTAB method in CUDA. In Table 8.4, the label 'lex' denotes the lexicographic or standard ordering.
Table 8.4: Storage requirements for the preconditioned Bi-CGSTAB algorithm in CUDA

Variable                     Type                      Storage format          Memory
*m_precζζ, *m_precϕϕ         preconditioner stencils   repeated r1/r2/b1/b2    2 × 5 × 4n/3
*m_origζζ, *m_origϕϕ         original stencils         single r1/r2/b1/b2      2 × 5 × n
*m_origζϕ, *m_precϕζ         original stencils         single r1/r2/b1/b2      2 × 5 × n
*m_dXζ-lex, *m_dXϕ-lex       vector x                  standard                2 × n
*m_dBζ-lex, *m_dBϕ-lex       vector b                  standard                2 × n
*m_dXζ, *m_dXϕ               vector x                  single r1/r2/b1/b2      2 × n
*m_dBζ, *m_dBϕ               vector b                  single r1/r2/b1/b2      2 × n
*m_dYζ, *m_dYϕ               vector y                  single r1/r2/b1/b2      2 × n
*m_dRζ, *m_dRϕ               vector r                  single r1/r2/b1/b2      2 × n
*m_dR̃ζ, *m_dR̃ϕ               vector r̃                  single r1/r2/b1/b2      2 × n
*m_dVζ, *m_dVϕ               vector v                  single r1/r2/b1/b2      2 × n
*m_dPζ, *m_dPϕ               vector p                  single r1/r2/b1/b2      2 × n
*m_dP̂ζ, *m_dP̂ϕ               vector p̂                  single r1/r2/b1/b2      2 × n
*m_dTζ, *m_dTϕ               vector t                  single r1/r2/b1/b2      2 × n
*m_dŜζ, *m_dŜϕ               vector ŝ                  single r1/r2/b1/b2      2 × n
*m_dtemp                     vector temp               single r1/r2/b1/b2      n
*m_dZζ, *m_dZϕ               vector z                  repeated r1/r2/b1/b2    2 × 5 × 4n/3
*m_drem                      for dot products          standard                n
Total ≃ 72 × n
Clearly, the CUDA implementation of Bi-CGSTAB puts extra strain on the memory requirement compared to the CG implementation. Martijn [1] reports that the realistic test problem of Plymouth Sound described in Chapter 3 requires around 180 MB of storage (with 1.5 million grid points). With the Bi-CGSTAB implementation, this number will increase roughly three times to 540 MB, plus the requirement of the CG solver itself. This is still under the global memory limit of 3048 MB of the GTX780 card.
8.2.3 Operation Count
We compare the floating point operations (flops) required for the preconditioned CG and the preconditioned Bi-CGSTAB method. The work is split into the dot product, the vector update, the matrix vector multiplication and the preconditioning solve step. Let the size of the problem be given by n and let i be the number of iterations required for convergence. The flops for each operation are:
• dot product: n − 1 summations and n multiplications ≈ 2n
• vector update: n summations and n multiplications = 2n
• sparse matrix vector multiplication: 5n multiplications and 4n summations = 9n
• preconditioning (solving LDLᵀz = r): 2n multiplications and 2n summations for each of the lower and upper triangular solves, and n multiplications for the division by the diagonal = 9n
Table 8.5 provides the flop count for the PCG algorithm. The second column 'Count' specifies the number of times each operation is called.
Table 8.5: Operation count for the preconditioned CG algorithm

Operation               Count    Flops per operation
dot product             2i + 1   2n
vector update           3i + 1   2n
matrix vector product   i + 1    9n
preconditioning solve   i        9n
Total flops: n(28i + 13)
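The total in Table 8.5 follows directly from multiplying the counts by the flops per operation; the small verification below is included only to make the arithmetic explicit:

\[
(2i+1)\,2n + (3i+1)\,2n + (i+1)\,9n + i\,9n
  = (4 + 6 + 9 + 9)\,i\,n + (2 + 2 + 9)\,n
  = n\,(28i + 13).
\]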
In the case of the Bi-CGSTAB algorithm, as the computation is done for both ζ and ϕ, we have twice the operations compared to a regular Bi-CGSTAB method. This is reflected in the column 'Count' in Table 8.6, where the dot products and vector updates are multiplied by 2. The matrix vector product and the preconditioning solve require special attention.
Let the coupled matrix be given as

A = [ Sζζ   Sζϕ
      Sϕζ   Sϕϕ ].

Then the matrix vector multiplication with the vector x⃗ = [ ζ⃗ ; ϕ⃗ ] is given by:

A x⃗ = [ Sζζ ζ⃗ + Sζϕ ϕ⃗
        Sϕζ ζ⃗ + Sϕϕ ϕ⃗ ]        (8.3)

We require two matrix vector multiplications and one vector summation for each variable ζ and ϕ, resulting in 19n flops per variable.
The flops for the preconditioning solve of the coupled system depend on the preconditioner chosen. Here we study the preconditioner M2, which has the structure:

M2 = [ Pζζ   0
       Sϕζ   Pϕϕ ]        (8.4)

Solving M2 z = r requires the following:
• two preconditioning solve operations, which require 9n flops each = 18n
• one matrix vector product = 9n
• one vector summation = n
resulting in a total of 28n flops.
Table 8.6: Operation count for the preconditioned Bi-CGSTAB algorithm

Operation               Count         Flops per operation
dot product             2 × (5i + 1)  2n
vector update           2 × (6i + 1)  2n
matrix vector product   2 × (2i + 1)  19n
preconditioning solve   2i            28n
Total flops: n(176i + 46)
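As before, the total follows from the table entries; for large iteration counts the ratio to the PCG count n(28i + 13) is roughly 176/28 ≈ 6.3, consistent with the factor of about six mentioned below:

\[
2(5i+1)\,2n + 2(6i+1)\,2n + 2(2i+1)\,19n + 2i\,28n
  = (20 + 24 + 76 + 56)\,i\,n + (4 + 4 + 38)\,n
  = n\,(176i + 46).
\]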
The flop count of Bi-CGSTAB is about 6 times higher than that of the CG algorithm. This also makes sense, as we now solve a system of equations which has twice the number of variables and is coupled in nature. The extra computational effort gives us the flexibility of choosing a time step irrespective of the CFL number. Also, the Bi-CGSTAB algorithm has the advantage that the number of iterations i required to solve the system does not depend on the problem size. We will discuss the impact of the problem size and the time step on the required computation time for both the Bi-CGSTAB and the CG solver in the next sections.
8.3 Timing Results
In this section, we discuss how to establish the efficiency of the different solvers. Our aim is to perform the simulations at time steps larger than what is allowed by the CFL limitation for the explicit time stepping case. This will allow us to carry out simulations for finer, larger and non-uniform meshes in real time. The following four quantities are defined:
• Implicit solver time: the time required by the Bi-CGSTAB implicit solver to compute one time step. The simulation is run for 100 seconds of real time and an average is computed. This is reported in milliseconds (ms). For example, if the time step is 0.1 seconds, then 1000 time steps are required to compute the 100 seconds of real time, and the average over these 1000 steps is taken as the implicit solver time.
• CG solver time: the time required by the RRB-k solver developed by Martijn [1] for the symmetric system of equations to carry out the simulation for one time step, similar to the implicit solver time described above.
• Additional time: the time required by the other functions (computing the right-hand side, etc.). Again an average is taken.
• Total time: this is computed differently. The total time is the time required to carry out the simulation for 100 seconds; if the simulation is carried out for a different duration, this will be mentioned. It is reported in seconds (s).
We first analyze the implicit solver time for the implementation in C++. To get insight into the computational cost of the various functions involved, a profiler like Valgrind has been used. It is possible to generate so-called callgrind files, which can thereafter be studied using a package called kcachegrind. These profilers are available as open-source software. It is, however, not possible to profile GPU tasks with Valgrind, and it gives unreliable results when OpenMP is used. We have therefore used Valgrind for the C++ case without the OpenMP implementation.
It was observed that most of the time is consumed by the sparse matrix vector multiplication, which is required for both the preconditioning solve and the matrix vector products. Later on, we will profile the CUDA implicit solver with a different profiler called nvprof.
The following tables show the timing results for the test problem in two dimensions. The grid step size is taken as ∆x = ∆y = 5 m. The relative tolerance is set at 1e−5. Single precision has been used for this test study; for the real test cases, double precision will be used, as some errors creep into the simulation after simulating for longer times with single precision. It should be possible to remove this discrepancy, which will be part of the future work. We run the code in C++ and CUDA for various problem sizes and time steps. Furthermore, the C++ runs are divided into runs with and without the OpenMP implementation. The algorithm chosen for solving the implicit set of equations is Bi-CGSTAB.
In the tables below, the column 'Iterations' refers to the iterations required for the convergence of the Bi-CGSTAB algorithm and the rest are the time quantities defined above. The total time refers to the runtime for 100 seconds of real-time simulation, as discussed before.
Table 8.7: Timing results for C++ with a single core. Nx = 240 and Ny = 120

∆t     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
0.05   1            29.79            6.19                        7.31
0.1    1            15.69            7.19                        7.10
0.25   3            9.61             15.54                       6.97
0.5    6            7.62             29.49                       6.87
1      13           7.55             66.54                       6.79
Table 8.8: Timing results for C++ with OpenMP (4 cores). Nx = 240 and Ny = 120

∆t     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
0.05   1            13.36            2.24                        3.65
0.1    1            7.74             3.33                        3.60
0.25   2            3.66             4.97                        3.43
0.5    6            2.95             10.50                       3.42
1      13           2.74             22.83                       3.35
Table 8.9: Timing results with CUDA. The CG solver is still in C++ with OpenMP². Nx = 240 and Ny = 120

∆t     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
0.05   1            13.33            2.20                        3.61
0.1    1            7.60             3.05                        3.54
0.25   3            3.85             4.67                        3.77
0.5    7            2.95             10.06                       3.37
1      16           2.43             19.66                       3.58
The observations from the above tables are:
• There is a gain of a factor of three in the implicit solver time when we go from a single core to a four-core processor. This corresponds to a parallel efficiency of 75% for the Bi-CGSTAB algorithm.
• Only a marginal speed-up of the CUDA routines compared to the C++ routines with OpenMP is observed. This can be attributed to two reasons:
  – The size of the problem (number of grid points). For smaller problems, it is possible that the communication between the host (CPU) and the device (GPU) becomes the bottleneck. This is not a concern for the OpenMP parallelization, as it uses a shared memory architecture and the communication time is minimal.
  – The number of iterations required for convergence in CUDA is slightly higher than for the C++ implementation.
• The total time for the simulation decreases up to a time step of 1 second and then stagnates. As discussed in the previous sections, the number of iterations required for the convergence of Bi-CGSTAB increases linearly with the time step, and after a certain time step the slope of this linear curve starts increasing. This increases the resultant computation time, leading to the stagnation.
In order to check the impact of the problem size on the efficiency of the solvers, we ran the code for different problem sizes, keeping the time step constant at 0.5 seconds.
² The CUDA CG solver for the two-dimensional test case has not been implemented. Complete integration of the two CUDA solvers has been done only for the real test case and the non-uniform grid case.
Table 8.10: Timing results in C++ without OpenMP. ∆t = 0.5s. Variation with problem size

Nx × Ny   Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
240x120   6            7.56             29.24                       6.78
480x240   5            26.90            100.88                      26.83
960x480   4            106.87           380.93                      123.24
Table 8.11: Timing results in C++ with OpenMP. ∆t = 0.5s. Variation with problem size

Nx × Ny     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
240x120     6            2.97             10.42                       3.38
480x240     5            12.15            43.21                       13.81
960x480     4            45.16            142.32                      64.95
1024x1024   4            116.874          344.612                     189.287
2048x2048   4            517.471          1281.47                     1113.57
Table 8.12: Timing results with the CUDA implicit solver. ∆t = 0.5s. Variation with problem size

Nx × Ny     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
240x120     7            2.95             10.06                       3.37
480x240     6            6.82             13.91                       14.73
960x480     6            21.65            24.38                       67.62
1024x1024   5            58.14            42.31                       190.61
2048x2048   5            279.12           122.73                      1093.67
The real impact of the CUDA implicit solver can be seen now. For the small problem size of 240x120 grid points we did not see any improvement, but a speed-up of 10 times can be observed for the problem size of 2048x2048 points. The speed-up graph is shown below.
Figure 8.6: Speed-up of the CUDA implicit solver with respect to the C++ solver with OpenMP implementation
While we obtain a speed-up of 10 times for the implicit solver time with increasing problem size, the speed-up of the total time saturates. The RRB solver implementation in C++ now becomes the limiting factor, instead of the implicit solver computation. We will observe the impact of the problem size with the CUDA RRB solver implementation in the next chapter.
8.4 CUDA Profiling
As we analyzed the C++ code to identify the bottlenecks with the Valgrind tool, we do the same with the CUDA code using Nvidia's nvprof command line utility.
We carried out the simulation for the test case with a problem size of 2048x2048, which allows us to focus on the compute-intensive parts of the program. A time step of 0.5 seconds is chosen, which is around 5 times the time step based on the CFL condition. We saw that the number of iterations required for this case is 5. To limit the profiling to the Bi-CGSTAB algorithm, we have used the CUDA functions cudaProfilerStart() and cudaProfilerStop() to start and stop the profile data collection. The timing results obtained from the profiler show the distribution over all the compute kernels (or functions). The kernels can then be grouped into four major operations: preconditioning solve, matrix vector multiplication, vector update and dot product. The results are shown in the following table. The column 'Time' represents the actual time spent in these functions, for the simulation which was carried out for 100 seconds of real time. This excludes the time spent on communication between the host and the device.
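For reference, delimiting the profiled region looks roughly as follows. This is a minimal sketch rather than the actual solver code: the function name and the placement of the synchronization are illustrative, and nvprof is assumed to be started with profiling-from-start disabled so that only this region is recorded.

#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

// Only the work between cudaProfilerStart() and cudaProfilerStop() is profiled.
void profiled_bicgstab_step()
{
    cudaProfilerStart();
    // ... launch the Bi-CGSTAB kernels for one solve here ...
    cudaDeviceSynchronize();   // make sure all kernels have finished
    cudaProfilerStop();
}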
Table 8.13: Profiling of the Bi-CGSTAB algorithm implementation

Function                       Time (s)   Time (%)
Preconditioning solve          6.543      36.18%
Matrix vector multiplication   4.952      27.39%
AXPY (vector update)           4.807      26.59%
Dot product                    1.780      9.84%
The above table provides insight into the time-consuming steps. When the tool was used initially, it showed a lot of time being spent in CUDA memory copies: the implementation at that time initialized many variables repeatedly and also required extra copying in the preconditioning step. Following the profiling results, it was possible to remove such extra time-consuming steps and optimize the code.
The test code allowed us to study the mathematical and computational aspects of solving the implicit set of equations. We will now perform simulations for the real test cases and provide the timing results in the next chapter.
Chapter 9
Results - Real Test Case
In this chapter we discuss the results for the real test cases of the river IJssel and Plymouth Sound. The analysis has been carried out over various problem sizes and time steps. CUDA has been used for the routines of both the Bi-CGSTAB and CG solvers, along with the computation of the other functions (right-hand side computation, etc.).
Before discussing the performance of the implicit solver, we present the stability and accuracy of the results for various time steps ∆t. We have used Chorin's projection scheme, as discussed in the previous chapters, to perform the time integration. It is of interest to see whether we can perform the simulations without any constraint on the time step, as this was the intention with which we started our study.
We have used a smaller test case of the Plymouth Sound, with dimensions Nx × Ny = 240 × 120 and ∆x = ∆y = 5 m. The accuracy is checked by observing the plots visually and by computing the Froude force (Fext); the details of the computation of this force are not reported here. If the simulations are carried out for a fixed period of time (say 100 seconds), then the Froude force at the end of those 100 seconds should not change. We already know that implicit methods at larger time steps suffer from phase lag; we will try to analyze the differences in the Froude force for various time steps. The following three figures show a snapshot after 40 seconds of real-time simulation for various time steps. The CFL condition dictates that the time step ∆t should be less than or equal to 0.15 seconds for the explicit method.
Figure 9.1: Snapshot at t=40 seconds, ∆t = 0.2s
Figure 9.2: Snapshot at t=40 seconds, ∆t = 0.4s
Figure 9.3: Snapshot at t=40 seconds, ∆t = 1.0s
The observations from the above figures are:
• The solution remains stable for time steps greater than ∆tCFL. We checked the stability in detail by running the simulation over 1000 seconds and by testing other problem sizes. In the case of the explicit simulator, a time step greater than 0.15 seconds cannot be used, as it results in an unstable solution.
• As we increase the time step, the solution behind the ship and near the boundary looks distorted. This is certainly not the impact of the phase error, and can be attributed to the following:
  – As the time step is increased, the information on the interaction of the ship with the waves at intermediate times is lost. For example, take ∆t = 1 s. The ship's velocity is 15.5 m/s, which implies that the ship moves 15.5 m (three grid points here, as ∆x = 5 m) in one time step, so we lose the information of the ship's interaction with the waves at the intermediate grid points.
  – The incoming boundary conditions are computed using predefined periods and amplitudes. A time step should be chosen which can resolve the wave form of the incoming boundary waves; as a rule of thumb, one wave period should be covered by 10-15 time steps. For the current simulation, the minimum period is around 4 seconds, so to capture it we need the time step to be less than or equal to 0.4 seconds. We will see later, for the case of Plymouth Sound, that even with a time step of 0.4 seconds we can get slightly inaccurate results.
Based on the above results, we will limit the time step to 0.4 seconds for our discussion of the timing and code profiling. Please note that this time step limitation is due to physical considerations and not due to the numerical restrictions imposed by the CFL condition for explicit methods; the physical limitations remain the same irrespective of the spatial step size.
The timing results are presented in a different way than the results for the two-dimensional test case in the previous chapter. Our intention is not to compute the speed-up between the C++ and CUDA code, but to understand the behaviour of the implicit code under variation of the time step and the problem size, and to compare it with the explicit scheme.
• Implicit solver time: This is the time required by the Bi-CGSTAB implicit solver to carry
out the simulation for 100 seconds. This is reported in seconds(s).
• CG solver time : This is the time required by the RRB-k solver developed by Martijn for
the symmetric system of equations to carry out simulation for 100 seconds
• Additional time : This is the time required by the other functions (Computing right hand
side etc.).
• Total time : Total time is the time required to carry out the simulation for 100 seconds.
This is the sum of the above times.
We have switched from single precision floating point numbers to double precision for CUDA
simulation. It would be part of the future work to understand why the single precision accuracy
is not providing stable results for the current CUDA implicit solver. One of the area which will
require attention is the sparse matrix vector multiplication with the r1r2b1b2 ordering and the
preconditioning solve function. The implicit solver implemented in C++ works fine with single
precision.
9.1 Variation with Time Step
In order to see the impact of the time step on the accuracy and the speed of the solver, we have taken the problem of the river IJssel with 0.5 and 1 million nodes. The ∆t based on the CFL condition is 0.12 seconds. In this case, the incoming boundary waves are based on a deterministic model with a maximum frequency of 0.8 rad/s (long waves). The following tables show the timing results for various time steps. The column 'Iterations' refers to the average number of iterations required for the convergence of the Bi-CGSTAB algorithm over the 100 seconds.
Table 9.1: Timing results for IJssel-500000

∆t    Iterations   Total time (s)   Implicit solver time (s)   CG solver time (s)
0.1   3.3          35.98            18.43                      5.15
0.2   4.97         18.65            13.1                       2.34
0.4   8.88         14.63            11.87                      1.15
Table 9.2: Timing results for IJssel-1000000

∆t    Iterations   Total time (s)   Implicit solver time (s)   CG solver time (s)
0.1   3.3          49.88            31.20                      7.97
0.2   4.96         32.58            23.28                      3.95
0.4   8.93         25.77            21.05                      1.97
The observations are:
• The total simulation time reduces as we increase the time step.
• Increasing the problem size does not affect the required number of iterations for the Bi-CGSTAB solver. This is similar to the results observed for the two-dimensional test case.
• As we go from ∆t = 0.2 to ∆t = 0.4, the implicit solver time remains almost constant, as the number of iterations required for Bi-CGSTAB convergence increases proportionally with the time step.
In order to see the accuracy of the solution with respect to the time step, we plot Fext against time for various time steps.
Figure 9.4: Froude force variation with time for various time steps: river IJssel, 1.0 million nodes
As the time step is increased, the difference in the Froude force starts increasing, which can be attributed to either numerical dispersion or the pressure pulse. A snapshot of the simulation for the test case of the river IJssel with 1.5 million nodes is shown in Figure 9.5.
Figure 9.5: Snapshot of River IJssel with 1.5 million nodes at t=100 seconds, ∆t = 0.2s
9.2 Variation with Problem Size
In order to see the impact of the problem size on the efficiency of the solvers, we carried out simulations for the test case of Plymouth Sound for various problem sizes, keeping the time step constant at 0.2 and 0.4 seconds. The results are shown in Table 9.3 and Table 9.4.
Table 9.3: Timing results for Plymouth. ∆t = 0.2s. Variation with problem size

N         Iterations   Total time (s)   Implicit solver time (s)   CG solver time (s)
50000     4.1          4.83             2.91                       0.82
100000    4.1          9.19             5.13                       1.59
200000    4.2          11.63            6.53                       1.76
500000    5.1          23.28            14.3                       2.86
1000000   5.45         41.77            26.63                      4.77
1500000   5.86         62.26            41.09                      6.65
Table 9.4: Timing results for Plymouth. ∆t = 0.4s. Variation with problem size

N         Iterations   Total time (s)   Implicit solver time (s)   CG solver time (s)
50000     6.45         3.28             2.33                       0.42
100000    6.63         4.22             3.07                       0.48
200000    6.7          6.20             4.62                       0.67
500000    8.74         16.57            12.13                      1.43
1000000   9.31         30.5             22.96                      2.40
1500000   10.58        49.26            37.03                      3.28
The observations are:
• A proportional increase of the simulation time with the problem size is observed.
• The CG solver time does not increase in proportion to the problem size. This can be attributed to a slight reduction in the required number of iterations for the CG solver, which was also reported by de Jong [1].
• For the implicit solver, we see an increase in the number of iterations. This is different from what was observed for the two-dimensional test case and the real case of the river IJssel. For the Plymouth Sound, there is a large influence of the incoming boundary waves. As the problem size increases, the waves occupy a larger part of the domain over time. If there were no waves, the solution would be zero and no iterations would be required; as the waves start propagating, the problem field gets filled with more and more waves, resulting in more iterations being required for convergence.
• As we increase the time step, we see an overall improvement in the solver performance: with a time step ratio of 2, the number of iterations required for the convergence of the Bi-CGSTAB method does not increase by a factor of 2.
The snapshots for Plymouth Sound with 1.5 million nodes for different time steps are shown in
Figure 9.6 and Figure 9.7.
Figure 9.6: Snapshot of Plymouth Sound with 1.0 million nodes at t=100 seconds, ∆t = 0.2s
Figure 9.7: Snapshot of Plymouth Sound with 1.0 million nodes at t=100 seconds, ∆t = 0.4s
There is a need to check the accuracy of the solution. In the case of the river IJssel, the incoming waves have smaller frequencies and thus larger periods, so we do not suffer from the inaccuracy of not capturing the incoming boundary waves. In the case of Plymouth Sound, the minimum period of the incoming waves is 4 seconds, and with a time step of 0.4 seconds we may not have enough temporal resolution. The variation of Fext with time for different time steps for the test case of Plymouth Sound with 1.0 million nodes is plotted in Figure 9.8.
Figure 9.8: Froude force variation with time for various time steps: Port Plymouth
For ∆t = 0.1 s and ∆t = 0.2 s the results are relatively close. Later, the results of the explicit and implicit methods are also compared for ∆t = 0.1 s. When we further increase the time step to 0.4 s, a deviation from the results at lower time steps is observed, although this does not impact the stability of the solution: we ran the simulation for 1000 seconds of real time and did not observe any divergence.
9.3 Profiling Results
In this study, all the functions are implemented in CUDA. A custom profiler is built into the code, which helps to analyze the time required by each function, whether it is implemented in C++ or CUDA; this removes the restriction of profilers like Valgrind, which can only be used for the C++ implementation. A small sketch of such a timer is given below.
The following tables show the time required by each function inside computewave(), which computes the waves for a single time step. The timings were compiled for a run of 20 seconds of real time. The other computations required are related to computing and moving the ship pulses.
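A timer of this kind can be as simple as the sketch below. It is only an illustration of the idea (the class name and design are assumptions, not the actual profiler in the code): the device is synchronized before the clock is stopped so that asynchronous CUDA kernels are fully accounted for, which also makes the same timer usable for plain C++ functions.

#include <chrono>
#include <cuda_runtime.h>

// Accumulates the wall-clock time of the enclosing scope into a counter.
class ScopedTimer {
public:
    explicit ScopedTimer(double &accumulated_seconds)
        : acc_(accumulated_seconds), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        cudaDeviceSynchronize();   // wait for outstanding GPU work
        auto stop = std::chrono::steady_clock::now();
        acc_ += std::chrono::duration<double>(stop - start_).count();
    }
private:
    double &acc_;
    std::chrono::steady_clock::time_point start_;
};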
Table 9.5: Profiling results for the CUDA implementation: run time = 20 seconds. IJssel, 1 million nodes, ∆t = 0.1s

Function                  Time (s)   Time (%)
Initiate RHS()            0.071      0.83%
Incoming Waves()          0.112      1.32%
Boundary Zetaphi()        0.169      1.99%
Compute RHS zeta phi()    0.509      6.00%
Implicit Solve()          5.215      61.49%
Compute RHS psi()         0.145      1.71%
CG Solver()               1.555      18.33%
Boundary psi()            0.004      0.04%
Correct zeta()            0.001      0.01%
Boundary zeta()           0.444      5.24%
Maximum waves()           0.257      3.03%
Table 9.6: Profiling results for the CUDA implementation: run time = 20 seconds. IJssel, 1 million nodes, ∆t = 0.4s

Function                  Time (s)   Time (%)
Initiate RHS()            0.018      0.42%
Incoming Waves()          0.028      0.66%
Boundary Zetaphi()        0.043      1.03%
Compute RHS zeta phi()    0.127      3.03%
Implicit Solve()          3.374      80.48%
Compute RHS psi()         0.036      0.86%
CG Solver()               0.392      9.35%
Boundary psi()            0.001      0.02%
Correct zeta()            0.000      0.01%
Boundary zeta()           0.111      2.64%
Maximum waves()           0.063      1.50%
The above tables provide an insight into the time-consuming steps of the CUDA solver. The function Boundary zeta() also includes the transfer of ζ from the device (GPU) to the host (CPU). This time could be reduced if we later decide to plot the waves directly from the GPU instead of transferring them to the CPU at every time step. Also, the time required by Maximum waves() can be reduced by switching on the OpenMP flag. Further optimization of the implicit solver is possible by reducing the memory utilization and by using single-precision floating-point numbers instead of the double-precision numbers used in this study.
9.4 Comparison: Explicit vs. Implicit
We consider the test case of Plymouth Sound to compare the explicit CUDA Leap-Frog solver developed by Martijn de Jong with the Implicit solver developed in this thesis. In the explicit solver, further work was done after Martijn's thesis to move functions such as the computation of the time integral and the right-hand side from C++ to CUDA. We have used this updated solver in our study. In the code, it is referred to as CudaWavesComputer.
First we compare the accuracy of the two solvers. We plot Fext versus time with a time step of 0.1 seconds, which is below the CFL limit. As the time step is the same, the incoming boundary waves and the pressure pulse are captured with the same temporal resolution. The difference between the two methods seen in Figure 9.9 can be attributed to numerical dispersion. In the next chapter, we evaluate the numerical dispersion of the two methods analytically.
Figure 9.9: Froude force variation with time for the explicit and implicit solvers: Plymouth
Sound with 1.5 Million Nodes, ∆t=0.1 seconds
Next, we compare the timing results for the two methods. We conducted simulations for the explicit case with ∆t = 0.1 seconds and for the implicit solver with ∆t = 0.2 and 0.4 seconds. Only the total simulation time is shown here for comparison.
Table 9.7: Comparison of explicit and implicit solvers: Plymouth Sound (simulation times in seconds)

N          Explicit ∆t = 0.1   Implicit ∆t = 0.2   Implicit ∆t = 0.4
50000      5.99                4.83                3.28
100000     7.71                9.19                4.23
200000     6.95                11.63               6.20
500000     13.55               23.28               16.57
1000000    24.05               41.77               30.50
1500000    69.97               55.47               49.27
The above table is also shown in graph format below.
Figure 9.10: Variation of computational time for the explicit and implicit solvers with problem
size: Plymouth Sound
Let us compare the explicit results at ∆t = 0.1 seconds with the implicit results at ∆t = 0.4 seconds. For the smaller problems, the implicit solver is faster than the explicit solver. This could be attributed to memory bandwidth limitations for small problem sizes. Also, the number of iterations required by the Bi-CGSTAB solver for Plymouth Sound is lower for the smaller problem sizes, as discussed previously. As we increase the problem size, the number of iterations for the convergence of the Bi-CGSTAB method increases, thus increasing the overall time of the implicit solver. We see this for problem sizes from 500,000 nodes up to 1.0 million nodes, where the explicit solver requires less time.
For the problem size of 1.5 million nodes, we see a sudden increase in the overall time required by the explicit scheme. The reason is that for the case set-up with 1.5 million nodes the pulses are defined based on a mesh file, which makes moving and computing the ship pulses a compute-intensive operation. Around 22.8% of the overall computation time is spent on the pulses. In contrast, if the pulses are not defined through meshes, only 3-4% of the computation time is spent on them.
When we move to the implicit method with the larger time step of ∆t = 0.4 seconds, the computation time spent on the pulse movement is reduced accordingly. Only 6.7% of the overall computation time is now spent on computing and moving the pulses, compared to 22.8% for the explicit solver.
We assess the impact of increasing the problem size for the River IJssel test case as well. The results are shown in Table 9.8.
Table 9.8: Comparison of explicit and implicit solvers: River IJssel (simulation times in seconds)

N          Explicit   Implicit ∆t = 0.2   Implicit ∆t = 0.4
500000     21.17      18.65               14.63
1000000    22.96      32.73               25.77
1500000    56.01      55.53               42.02
Following are the observations:
• For the smaller problems, the implicit solver is faster than the explicit solver for both test cases. The small number of iterations required for the convergence of the Bi-CGSTAB solver for smaller problems could be one of the reasons.
• For the test cases of 0.5 million and 1.0 million nodes, the times required for the simulation are almost the same; in fact, the explicit solver is faster.
• As we further increase the problem size, with ∆t = 0.4 seconds, which is four times larger than the ∆t of the explicit scheme, the implicit solver again becomes faster than the explicit solver. Please note that the number of iterations required for convergence of the Bi-CGSTAB algorithm for the River IJssel test case does not increase with the problem size, unlike the Plymouth Sound case.
9.5 Concluding Remarks - Realistic Test Case
• Chorin's projection scheme provides stable results for the realistic test cases.
• The implicit solver is faster than the explicit solver for certain test cases. However, to achieve this we need to use larger time steps, which can result in large numerical dispersion. In the next chapter, we discuss the numerical dispersion of the implicit and explicit schemes in more detail.
• The time step is governed by the required frame rate (frames per second, fps) of the ship simulator. It is intended to achieve 20 fps in real time, which implies a time step of 0.05 seconds. With the current problem set-up, ∆tCFL > 0.1s, so the explicit solver can be used at 0.05 seconds; using the implicit solver at 0.05 seconds would not improve performance.
However, the solver's cost scales linearly with the problem size, and large problems cannot be solved within 0.05 seconds per time step. In that case, it would be necessary to move to non-uniform grids, which result in fewer grid points.
It is also intended to simulate more than one ship and to capture the interaction between them. Such a set-up requires a finer mesh near the interaction region. If we move towards a non-uniform grid set-up, it would be possible to have a finer grid near the ships and a coarser grid away from them. With the explicit scheme, this might impose a time-step limitation, as the CFL number is governed by the smallest spatial step size. Our implicit solver comes in handy in such cases, as it is not restricted by the CFL number. In the next chapter, we provide a framework for setting up a non-uniform grid for the two-dimensional test case.
Chapter 10
Dispersion Analysis
Let us consider the one-dimensional wave equation given by

\frac{\partial \zeta}{\partial t} + c_0 \frac{\partial \zeta}{\partial x} = 0    (10.1)
where ζ is the wave height and c0 is the wave velocity. In our system, represented by the Variational Boussinesq approximation, the wave velocity is given by the surface gravity wave speed. The spatial derivative is discretized using central differences, and the time derivative is integrated with the implicit theta method and with the explicit Leap-Frog method.
The theta-method discretization is given as

\zeta_i^{n+1} = \zeta_i^{n} - \frac{C\theta}{2}\left(\zeta_{i+1}^{n+1} - \zeta_{i-1}^{n+1}\right) - \frac{C(1-\theta)}{2}\left(\zeta_{i+1}^{n} - \zeta_{i-1}^{n}\right)    (10.2)
where C is the Courant number given by C = c0 ∆t/∆x. The Leap-Frog discretization is given
by:
\zeta_i^{n+1} = \zeta_i^{n-1} - C\left(\zeta_{i+1}^{n} - \zeta_{i-1}^{n}\right)    (10.3)
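For reference, a minimal C++ sketch of one Leap-Frog step (10.3) for the 1D advection equation with periodic boundary conditions is given below; array and function names are illustrative only. The theta scheme (10.2) instead couples the unknowns at level n+1 and therefore requires a (cyclic) tridiagonal solve at every time step.

#include <cstddef>
#include <vector>

// One Leap-Frog step: zeta^{n+1}_i = zeta^{n-1}_i - C*(zeta^n_{i+1} - zeta^n_{i-1}),
// with C = c0*dt/dx and periodic boundaries.
void leapFrogStep(const std::vector<double>& zOld,   // zeta at level n-1
                  const std::vector<double>& zCur,   // zeta at level n
                  std::vector<double>& zNew,         // zeta at level n+1
                  double C)
{
    const std::size_t N = zCur.size();
    for (std::size_t i = 0; i < N; ++i) {
        const std::size_t ip = (i + 1) % N;          // periodic neighbours
        const std::size_t im = (i + N - 1) % N;
        zNew[i] = zOld[i] - C * (zCur[ip] - zCur[im]);
    }
}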
The Taylor series expansion gives the modified equivalent partial differential equation for the
theta method as [9]:
\frac{\partial \zeta}{\partial t} + c_0 \frac{\partial \zeta}{\partial x} = \frac{c_0^2 \Delta t}{2}\,(2\theta - 1)\,\frac{\partial^2 \zeta}{\partial x^2} - \frac{c_0 \Delta x^2}{6}\left(1 + C^2(3\theta - 1)\right)\frac{\partial^3 \zeta}{\partial x^3} + O(\Delta x^4)    (10.4)
and the modified equivalent partial differential equation for the Leap-Frog is given as:
\frac{\partial \zeta}{\partial t} + c_0 \frac{\partial \zeta}{\partial x} = -\frac{c_0 \Delta x^2}{6}\left(1 - C^2\right)\frac{\partial^3 \zeta}{\partial x^3} + O(\Delta x^4)    (10.5)
As expected for the theta method, numerical diffusion arises with magnitude

\kappa_{num} = \frac{c_0^2 \Delta t}{2}\,(2\theta - 1)    (10.6)

and this numerical diffusion vanishes for θ = 0.5, which is the Implicit Trapezoidal method.
For a given value of the Courant number, the numerical dispersion of the Trapezoidal method is always greater than that of the Leap-Frog scheme. In fact, if C = 1, the numerical dispersion vanishes for the Leap-Frog scheme. This is not the case for the Trapezoidal method, which gives a finite dispersion coefficient of c0∆x²/4 when C = 1.
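As a quick check of the value quoted above, substituting θ = 1/2 and C = 1 into the dispersion coefficient of the modified equation (10.4) gives

\frac{c_0 \Delta x^2}{6}\left(1 + C^2(3\theta - 1)\right) = \frac{c_0 \Delta x^2}{6}\left(1 + \tfrac{1}{2}\right) = \frac{c_0 \Delta x^2}{4},

whereas the corresponding Leap-Frog coefficient \frac{c_0 \Delta x^2}{6}(1 - C^2) in (10.5) vanishes for C = 1.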
10.1 Dispersion Relation
The exact dispersion relation is obtained via the substitution \zeta(x, t) = \hat{\zeta}\, e^{i(kx - \omega t)}, which gives the relationship between the frequency and the wavenumber as:

\omega = c_0 k    (10.7)

The phase speed is then given by c = \omega / k = c_0, and hence the equation is not dispersive, since all waves propagate at the same speed c0 regardless of the wavenumber k.
Deriving the dispersion relation for the Implicit-Trapezoidal method and the Leap-Frog method
from the Taylor series expansion, we obtain:
\frac{c}{c_0} = 1 - \frac{1}{6}(k\Delta x)^2\left(1 + \frac{C^2}{2}\right) \qquad \text{(Theta method, } \theta = 1/2\text{)}

\frac{c}{c_0} = 1 - \frac{1}{6}(k\Delta x)^2\left(1 - C^2\right) \qquad \text{(Leap-Frog method)}
These show that c = c0 only when the Leap-Frog method is used and when C = 1. This is not
possible for the Implicit Trapezoidal method.
The above relation holds only in the case of small k∆x as we have ignored the higher order terms
in the Taylor series expansion. A general dispersion relation can be derived by substituting:
\zeta(x_{j\pm1}, t_{n\pm1}) = \hat{\zeta}\, e^{i(kx_{j\pm1} - \omega t_{n\pm1})}
                            = \hat{\zeta}\, e^{i(kx_j - \omega t_n)}\, e^{\pm ik\Delta x}\, e^{\mp i\omega\Delta t}
                            = \zeta(x_j, t_n)\, e^{\pm ik\Delta x}\, e^{\mp i\omega\Delta t}
If we substitute the above relation into the discretized equations, we get the following relations.
The numerical dispersion relation for the Implicit Trapezoidal method is given by:
\frac{c}{c_0} = \frac{i}{Ck\Delta x}\,\log\!\left(\frac{2 - iC\sin k\Delta x}{2 + iC\sin k\Delta x}\right)    (10.8)
and for the Leap-Frog scheme is given by:

\frac{c}{c_0} = \frac{\sin^{-1}\!\left(C\sin k\Delta x\right)}{Ck\Delta x}    (10.9)

Figure 10.1 shows the variation of c/c0 with k∆x for various Courant numbers.
Figure 10.1: Dispersion analysis of the Implicit Trapezoidal method: c/c0 versus k∆x for Courant numbers C = 0.5, 1, 2 and 4
In order to limit the dispersion error, we need to formulate the problem such that we operate near the top part of the curve. For example, to keep the dispersion error under 10% while using a time step above the CFL limit, say ∆t = 2 × ∆tCFL, k∆x should stay below about 0.5.
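To make this concrete, the ratio c/c0 of equation (10.8) can be evaluated directly. The short C++ sketch below (names are illustrative and not part of the thesis code) computes c/c0 for a given Courant number and k∆x via the complex logarithm; for C = 2 and k∆x = 0.5 it returns a value just below 0.9, consistent with the roughly 10% error quoted above.

#include <cmath>
#include <complex>
#include <cstdio>

// Numerical phase-speed ratio c/c0 of the Implicit Trapezoidal scheme, eq. (10.8):
// c/c0 = i/(C*k*dx) * log((2 - i*C*sin(k*dx)) / (2 + i*C*sin(k*dx))).
double trapezoidalPhaseSpeedRatio(double C, double kdx)
{
    const std::complex<double> I(0.0, 1.0);
    const std::complex<double> num = 2.0 - I * C * std::sin(kdx);
    const std::complex<double> den = 2.0 + I * C * std::sin(kdx);
    const std::complex<double> ratio = I / (C * kdx) * std::log(num / den);
    return ratio.real();   // imaginary part is zero up to round-off
}

int main()
{
    for (double C : {0.5, 1.0, 2.0, 4.0})
        std::printf("C = %.1f, k*dx = 0.5  ->  c/c0 = %.3f\n",
                    C, trapezoidalPhaseSpeedRatio(C, 0.5));
    return 0;
}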
10.2 Numerical Dispersion for the One-Dimensional Test Case
For the one-dimensional test case discussed in Chapter 5, we examine the impact of numerical dispersion for various time steps of the Implicit Trapezoidal method. We consider a fixed location x = 50 m and observe the wave passing through for the various time steps. The results are shown in Figure 10.2.
Figure 10.2: Phase lag for the Implicit Trapezoidal scheme. Solution at x=50m with periodic
boundary conditions
We can see a clear phase lag, which increases for larger time steps. It is difficult to quantify this error, as we have a coupled set of equations, but a general trend can be observed.
10.3 Non-Uniform Grid
In this section we develop the discretization scheme for the non-uniform grid and explore the possibility of employing the implicit solver on such a grid. We focus on a rectilinear grid in two dimensions, as shown in Figure 10.3.
Figure 10.3: Rectilinear grid in two dimensions
The grid size in the x- and y-directions varies smoothly. The grid-size distribution in the x-direction is shown in Figure 10.4. Near the boundaries, the grid size is constant; in the central area, it varies quadratically and is continuous. The first derivative of this grid-size distribution, however, has a discontinuity. Based on the parameters of the quadratic curve, the ratio of the maximum grid size to the minimum grid size can be varied.
Figure 10.4: Grid size distribution in one dimension (∆x versus grid index Nx)
For the two-dimensional rectilinear grid, the y-spacing can also vary with a similar distribution. In order to access elements, we do not need to store the locations of all the grid points. As the grid is rectilinear, we store the x-coordinates and y-coordinates in two separate arrays x and y. When we need to access a grid element (i, j), its location is given by (x(i), y(j)).
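A minimal C++ sketch of this storage scheme is shown below; the struct and member names are illustrative and not taken from the thesis code. Only Nx + Ny coordinates are stored instead of Nx × Ny point locations.

#include <cstddef>
#include <utility>
#include <vector>

// Rectilinear grid: only the two 1D coordinate arrays are stored.
struct RectilinearGrid {
    std::vector<double> x;   // x-coordinates, size Nx
    std::vector<double> y;   // y-coordinates, size Ny

    // Location of grid point (i, j).
    std::pair<double, double> location(std::size_t i, std::size_t j) const {
        return { x[i], y[j] };
    }

    // Cell sizes ∆x_i and ∆y_j used in the finite-volume discretization below
    // (valid for interior points only).
    double dx(std::size_t i) const { return 0.5 * (x[i + 1] - x[i - 1]); }
    double dy(std::size_t j) const { return 0.5 * (y[j + 1] - y[j - 1]); }
};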
10.3.1 Spatial discretization
The governing equations are discretized based on the grid shown in Figure 10.5. The variables are evaluated at the grid points (i, j), and the finite volume, shown as the dotted rectangle in Figure 10.5, is centered around the grid point.
Figure 10.5: Finite Volume grid: the control volume around point (i, j), with face midpoints (i±1/2, j), (i, j±1/2) and neighbours (i±1, j), (i, j±1)
We define the following:
• \Delta x_i = \frac{x_{i+1} - x_{i-1}}{2}
• \Delta y_j = \frac{y_{j+1} - y_{j-1}}{2}
The governing equation for the variable ζ is given by

\frac{\partial \zeta}{\partial t} + U_x \frac{\partial \zeta}{\partial x} + U_y \frac{\partial \zeta}{\partial y} + h\left(\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2}\right) - hD\left(\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2}\right) = 0.    (10.10)

Let us consider the discretization of each term separately. For the time derivative,

\iint \frac{\partial \zeta}{\partial t}\, dx\, dy = \frac{\partial \zeta_{i,j}}{\partial t}\, \Delta x_i \Delta y_j.

For the single spatial derivative in the x-direction,

\iint U_x \frac{\partial \zeta}{\partial x}\, dx\, dy = U_x\left(\zeta_{i+1/2,j} - \zeta_{i-1/2,j}\right)\Delta y_j
 = U_x\left(\frac{\zeta_{i+1,j} + \zeta_{i,j}}{2} - \frac{\zeta_{i,j} + \zeta_{i-1,j}}{2}\right)\Delta y_j
 = U_x\,\frac{\zeta_{i+1,j} - \zeta_{i-1,j}}{2}\,\Delta y_j.    (10.11)
Similarly,

\iint U_y \frac{\partial \zeta}{\partial y}\, dx\, dy = U_y\,\frac{\zeta_{i,j+1} - \zeta_{i,j-1}}{2}\,\Delta x_i.    (10.12)

For the second derivative in the x-direction,

\iint h \frac{\partial^2 \phi}{\partial x^2}\, dx\, dy = h\left(\frac{\partial \phi}{\partial x}\Big|_{i+1/2,j} - \frac{\partial \phi}{\partial x}\Big|_{i-1/2,j}\right)\Delta y_j
 = h\left(\frac{\phi_{i+1,j} - \phi_{i,j}}{x_{i+1} - x_i} - \frac{\phi_{i,j} - \phi_{i-1,j}}{x_i - x_{i-1}}\right)\Delta y_j
 = \phi_{i+1,j}\,\frac{h\Delta y_j}{x_{i+1} - x_i} + \phi_{i-1,j}\,\frac{h\Delta y_j}{x_i - x_{i-1}} - \phi_{i,j}\left(\frac{h\Delta y_j}{x_{i+1} - x_i} + \frac{h\Delta y_j}{x_i - x_{i-1}}\right).

Similarly,

\iint h \frac{\partial^2 \phi}{\partial y^2}\, dx\, dy = \phi_{i,j+1}\,\frac{h\Delta x_i}{y_{j+1} - y_j} + \phi_{i,j-1}\,\frac{h\Delta x_i}{y_j - y_{j-1}} - \phi_{i,j}\left(\frac{h\Delta x_i}{y_{j+1} - y_j} + \frac{h\Delta x_i}{y_j - y_{j-1}}\right).
The above discretization results in the formation of pentadiagonal matrices, represented by 5-point stencils. In order to use the RRB solver, we need to show that these matrices are either skew-symmetric or symmetric in nature.
As done previously for the uniform grid, we do not divide the whole equation by the ∆x∆y term that comes with the time integral. This is because ∆x∆y now varies over the grid, and we would like to keep this non-uniform part with the diagonal term.
The stencil for the convective term is given by:

S_{\zeta\zeta}:\quad
\begin{pmatrix}
0 & U_y \frac{\Delta x_i}{2} & 0 \\
-U_x \frac{\Delta y_j}{2} & 0 & U_x \frac{\Delta y_j}{2} \\
0 & -U_y \frac{\Delta x_i}{2} & 0
\end{pmatrix}    (10.13)
To verify the symmetry (or skew-symmetry) of the matrix, we need to compare the outer diagonals with each other, as was done for the uniform grid case. In this case, the absolute contributions from the West and the East terms should be the same, i.e.

-U_x \frac{\Delta y}{2}\Big|_{i,j} = -U_x \frac{\Delta y}{2}\Big|_{i-1,j}.

This is true, as ∆y remains constant if we traverse in the x-direction. Similarly, we can show that the diffusive term is symmetric in nature.
Please note that the above derivation is applicable only to rectilinear grids. Also, if we were to divide the equation by ∆x_i ∆y_j, then we would need

-U_x \frac{1}{2\Delta x}\Big|_{i,j} = U_x \frac{1}{2\Delta x}\Big|_{i-1,j},

which is true only when ∆x_i = ∆x_{i-1}, i.e. the uniform grid case.
Based on the above formulation, we can now construct the stencils for the two-dimensional case and use the RRB solver (both the implicit and the symmetric variants).
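As an illustration of how these stencils can be assembled in code, the C++ sketch below builds the five coefficients at an interior point (i, j): the convective part follows (10.13) and the diffusive part follows the h-Laplacian derivation above. Function, struct and variable names are illustrative and not taken from the thesis code.

#include <cstddef>
#include <vector>

struct Stencil5 { double C, N, S, E, W; };   // centre and four neighbours

// Convective stencil (10.13): skew-symmetric off-diagonal pairs, zero centre.
Stencil5 zetaConvectiveStencil(double Ux, double Uy, double dxi, double dyj)
{
    Stencil5 s;
    s.E =  Ux * dyj / 2.0;   s.W = -Ux * dyj / 2.0;
    s.N =  Uy * dxi / 2.0;   s.S = -Uy * dxi / 2.0;
    s.C =  0.0;
    return s;
}

// Diffusive stencil for the h * Laplacian(phi) term: symmetric off-diagonal
// pairs; the centre is minus the sum of the neighbour coefficients.
Stencil5 phiDiffusiveStencil(double h, const std::vector<double>& x,
                             const std::vector<double>& y,
                             std::size_t i, std::size_t j)
{
    const double dxi = 0.5 * (x[i + 1] - x[i - 1]);   // ∆x_i
    const double dyj = 0.5 * (y[j + 1] - y[j - 1]);   // ∆y_j
    Stencil5 s;
    s.E = h * dyj / (x[i + 1] - x[i]);
    s.W = h * dyj / (x[i] - x[i - 1]);
    s.N = h * dxi / (y[j + 1] - y[j]);
    s.S = h * dxi / (y[j] - y[j - 1]);
    s.C = -(s.E + s.W + s.N + s.S);
    return s;
}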
10.3.2 Accuracy of the Non-Uniform Grid
The accuracy of the spatial discretization described above can be analyzed by expanding the derivatives in Taylor series and comparing the results. The single derivative is accurate up to O((x_{i+1} − x_i)(x_i − x_{i−1})), i.e. to second order in the space increments. The second derivative is accurate up to O((x_{i+1} − x_i) − (x_i − x_{i−1})), i.e. to first order in the difference of the grid intervals. In the case of a uniform grid this would simply be O(∆x²); for a non-uniform grid the error is larger and depends on the type of stretching in the grid.
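To illustrate the second claim, a brief Taylor-series sketch for the per-volume second difference used above, with h_+ = x_{i+1} − x_i and h_- = x_i − x_{i−1}, reads

\frac{2}{h_+ + h_-}\left[\frac{\phi_{i+1} - \phi_i}{h_+} - \frac{\phi_i - \phi_{i-1}}{h_-}\right]
= \frac{\partial^2 \phi}{\partial x^2}\bigg|_i + \frac{h_+ - h_-}{3}\,\frac{\partial^3 \phi}{\partial x^3}\bigg|_i + O(h^2),

so the leading error is proportional to the difference of the grid intervals and vanishes on a uniform grid (h_+ = h_-).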
Another impact of the non-uniform grid can be seen in the numerical dispersion. As we saw before, the numerical dispersion depends on the CFL number and on k∆x, and that analysis was carried out in one dimension only. On a non-isotropic grid, that is, if the number of points per wavelength is not the same in every direction, additional dispersion is introduced. Also, if the grid size varies, the numerical wave speed varies as well, which causes effects such as spurious reflection of the waves. We will try to observe these effects in the results given in the next section.
10.3.3 Non-Uniform Grid Results
The grid used in the simulations is shown in Figure 10.6.
Figure 10.6: The Non-Uniform grid
Following are the parameters used in the simulation:
• Lx = 1476 m, Ly = 908 m
• Nx = 240, Ny = 120
• Ux = Uy = 1 m/s
• h = 50 m
• g = 9.81 m/s²
• double precision for computation
The maximum grid spacing in the x-direction is 13.8 m and the minimum grid spacing is 1.37 m, resulting in ∆xmax/∆xmin = 10. The same ratio is applied to the grid spacing in the y-direction. ∆tCFL based on the minimum grid spacing is 0.031 s. Our aim is to carry out the simulations for various ∆t's and observe the required computation time and the number of iterations of the Bi-CGSTAB algorithm. It is also important to reflect upon the accuracy and stability of the results obtained.
First we show the results for ∆t = 0.5 s, which gives ∆t/∆tCFL = 15.1.
Figure 10.7: Initial conditions for ζ for the Non-Uniform grid
1. The snapshots of the results are generated at various time steps with the help of ParaView, which is open-source visualization software.
Figure 10.8: ζ at t=50s for the Non-Uniform grid
From Figure 10.7 and Figure 10.8, we can see that the algorithm developed is stable for the
Non-Uniform grid, even at large CFL ratios.
We now compare the results with those obtained at the time step corresponding to the CFL limit, ∆tCFL = 0.031 s. We obtain the result by cutting a slice normal to the y-direction, allowing us to compare the results in a one-dimensional plot. The results are shown in Figure 10.9. We can see that the numerical dispersion is very small. This can be attributed to the fact that we are operating at very low k∆x ratios. For the smallest grid spacing of 1.37 m, the ratio k∆x = (4π/L)∆x = 0.01. As the CFL number is inversely proportional to the spatial step size, this also corresponds to the region of the grid with the highest CFL ratio. For the largest grid spacing, k∆x = 0.1, but the CFL ratio for that spacing is 10 times lower than the CFL ratio for the smallest grid size. Thus we are always near c/c0 = 1, i.e. the top of the curve shown in Figure 10.1.
This is in contrast to the previous case of a uniform grid, where larger CFL numbers led to large dispersion errors.
Figure 10.9: ζ at t=50s with ∆t = 0.031s and ∆t = 0.5s
We now show the timing results for the solver. The simulations are run for 50 seconds of real time. The implicit Bi-CGSTAB solver is implemented both in C++ and in CUDA, whereas the CG solver is implemented in CUDA.
Table 10.1: Timing results for the Non-Uniform Grid case with the C++ Bi-CGSTAB solver, ∆tCFL = 0.062s, ∆xmax/∆xmin = 10

∆t (s)   Avg. no. of iterations per time step   Run time of simulation (s)
0.031    2                                      37.56
0.062    2                                      19.13
0.124    4                                      11.41
0.248    6                                      6.08
0.496    14                                     5.09
0.992    29                                     4.11
Table 10.2: Timing results for the Non-Uniform Grid case with the CUDA implicit solver, ∆tCFL = 0.062s, ∆xmax/∆xmin = 10

∆t (s)   Avg. no. of iterations per time step   Run time of simulation (s)
0.031    2                                      34.36
0.062    2                                      18.39
0.124    4                                      10.43
0.248    8                                      6.2
0.496    20                                     4.57
0.992    36                                     3.18
Following are the observations from the above results:
• An explicit method would require ∆t = 0.031s for this simulation due to the CFL limitation. With the implicit solver, the simulation can be carried out at larger time steps. The required frame rate is 20 fps, which corresponds to a ∆t of 0.05 seconds. Performing the simulation at 0.05 seconds would require only 2 iterations for the convergence of the Bi-CGSTAB algorithm and would save the time the explicit solver spends on the double computation of the CG solver, computing the pulses, etc. Also, in the case of multiple ships, where non-uniform grids are required, the pressure pulse computation can become a time-consuming step.
• As observed in the uniform grid case, the CUDA implementation of the Bi-CGSTAB solver requires more iterations to converge than the C++ solver. Part of the problem lies in the r1r2b1b2 storage structure in CUDA and its impact on the sparse matrix-vector multiplication and the preconditioning solve function. Both of these need to be studied carefully and constitute part of the future work.
This concludes our discussion of the non-uniform test case. We have been able to use the solvers developed for the uniform grid with a suitable spatial discretization and stencil formulation. Much work still needs to be carried out to study the accuracy and speed-up of these methods and their adaptation to the real case. We will now present our conclusions and propose future work.
Chapter 11
Conclusion
We started with the efficient implementation of the explicit CUDA RRB solver, and intended to develop a robust implicit solver which is based on a similar RRB framework and can overcome the time-step limitation of the explicit schemes. We were able to achieve the following:
• Analyzed various time integration schemes and studied their stability and accuracy behaviour with simplified test problems in MATLAB. The Implicit Trapezoidal method came out as the front-runner, with suitable symplectic and stability properties for the VBM model.
• Chorin's projection scheme has been developed to solve the coupled system of equations arising from the VBM model. By splitting the operations carefully, the stability of the method has been ensured.
• Suitable physics-based preconditioners which can take care of the weak coupling of the system have been formulated and their performance studied. The preconditioners were studied analytically through eigenvalue analysis and then by performing numerical experiments. The Gauss-Seidel block preconditioner with the gravitational constant as the coupling term gives the lowest number of iterations amongst the various preconditioners.
• In order to develop the CUDA and C++ solvers in an object-oriented way, and to perform benchmarking, a two-dimensional test problem has been formulated. The object-oriented approach allows various solvers to be plugged in and makes use of the efficient solver developed by Martijn. Various profilers were used to optimize the code, which can then be directly incorporated in the real solver.
• A careful implementation of the incoming boundary waves allowed a stable solution for the real test case. The implicit solver for the real test cases outperforms the explicit solver in certain scenarios; a speed-up factor of up to 7/5 was observed.
• A framework for the further development of the non-uniform grid solver has been provided. A suitable spatial discretization has been evaluated, resulting in system matrices which can be handled by the RRB solvers.
Chapter 12
Future Work
We would like to recommend the following to MARIN regarding the Interactive Waves project. The recommendations also indicate topics for future research.
• Optimize the CUDA implementation of the implicit solver.
– Investigate the role of the initial guess on the convergence of the Bi-CGSTAB method. It is possible to get a better estimate of the initial guess by linear interpolation of the results from previous time steps.
– Investigate the difference between the CUDA and C++ implementations of the Bi-CGSTAB solver. It should be possible to further reduce the number of iterations required for convergence of the CUDA implicit solver.
– Get the CUDA implicit solver working with single-precision floating-point numbers.
• Implement the weakly-reflective boundary conditions for the implicit solver routine. In the current thesis, only the reflective (Neumann) boundary conditions have been implemented.
• Build upon the non-uniform two-dimensional test case. It is also possible to develop a non-rectilinear grid and reduce the number of grid points further. In order to track the ship movement, a moving grid would be required. This would also require recomputing the system matrices and reconstructing the preconditioner at subsequent time steps. The preconditioner construction step of the implicit solver should be optimized further through efficient memory utilization and use of the C++ standard library.
• Move the pulse computation to the GPU. As we observed for the Plymouth Sound test case with 1.5 million nodes, this becomes a time-consuming step. It will also become more important in the case of multiple ships.
• The ship simulator supplied by MARIN runs on a distributed system. A strategy needs to be developed which either implements the current solver on a distributed system or handles the data transfer to the different compute nodes efficiently.
References
[1] Jong, M. de, Developing a CUDA solver for large sparse matrices for MARIN, MSc Thesis Applied Mathematics, Delft University of Technology, Delft, February 2012.
[2] Klopman, G., Variational Boussinesq Modelling of Surface Gravity Waves over Bathymetry, PhD Thesis, University of Twente, Twente, May 2010.
[3] Wout, E. van 't, Improving the Linear Solver used in the Interactive Wave Model of a Real-time Ship Simulator, MSc Thesis Applied Mathematics, Delft University of Technology, Delft, August 2009.
[4] Keppens, R., Tóth, G., Botchev, M. A., and Ploeg, A. van der, Implicit and semi-implicit schemes: algorithms, International Journal for Numerical Methods in Fluids, 30, 335-352, 1999.
[5] Botchev, M. A., Sleijpen, G. L. G., and Vorst, H. A. van der, Low-dimensional Krylov subspace iterations for enhancing stability of time-step integration schemes, Preprint 1004, Dept. of Mathematics, Utrecht University, the Netherlands, 1997.
[6] Trefethen, L. N., and Bau, D. III, Numerical Linear Algebra, Society for Industrial and Applied Mathematics, Philadelphia, 1997.
[7] Hundsdorfer, W., and Verwer, J. G., Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations, Springer, Berlin, 2003.
[8] Chorin, A. J., Numerical Solution of the Navier-Stokes Equations, Math. Comp. 22, 745-762, 1968.
[9] Fringer, O., Lecture Notes [PDF Document], retrieved from online web site: http://web.stanford.edu/~fringer/teaching/cee262c/solution4.pdf