
Implicit Methods for Real-Time Simulation of Interactive Waves

2014

The project focuses on developing a simulator in which ships and waves interact. The new wave model is the Variational Boussinesq model (VBM). This more realistic model, however, requires considerably more computational effort: the VBM needs an unsteady solver that solves a coupled system of equations at every frame (20 fps). The model is currently limited by time-step constraints, which makes it difficult to simulate large and non-uniform domains in real time. The focus of this Master's thesis is to develop a fast and robust solver with good stability properties.

Delft University of Technology
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft Institute of Applied Mathematics

Implicit Methods for Real-Time Simulation of Interactive Waves

A thesis submitted to the Delft Institute of Applied Mathematics in partial fulfillment of the requirements for the degree MASTER OF SCIENCE in APPLIED MATHEMATICS by Akash Mittal, Delft, The Netherlands, July 23, 2014. Copyright © 2014 by Akash Mittal. All rights reserved.

MSc THESIS APPLIED MATHEMATICS
Implicit Methods for Real-Time Simulation of Interactive Waves
Akash Mittal, Delft University of Technology
Daily supervisor and responsible professor: Prof. dr. ir. Cornelis Vuik
Committee members: Prof. dr. ir. Cornelis Vuik, Dr. ir. Peter Wilders, Dr. Auke Ditzel
July 23, 2014, Delft, The Netherlands

Preface

This Master's thesis has been written within the Erasmus Mundus Master's programme Computer Simulations for Science and Engineering (COSSE)¹. It included a six-month internship carried out at the Maritime Research Institute Netherlands (MARIN) in collaboration with Delft University of Technology. MARIN provides a unique combination of simulation, model testing, full-scale measurements and training programmes to the shipbuilding and offshore industry and to governments. One such service is the supply of full-scale bridge simulators to shipping companies, which can be used not only to train captains, steersmen and other ship crew, but also for research and consultancy purposes. MARIN continuously strives to bring innovation to its products. Currently it is working on a new wave model called 'Interactive Waves'. The project focuses on developing a simulator in which ships and waves interact. The new wave model is the Variational Boussinesq model (VBM) as suggested by Gert Klopman. This more realistic model, however, requires considerably more computational effort: the VBM needs an unsteady solver that solves a coupled system of equations at every frame (20 fps). The model is currently limited by time-step constraints, which makes it difficult to simulate large and non-uniform domains in real time. The focus of this Master's thesis is to develop a fast and robust solver with good stability properties.

Acknowledgements

During the past two years of my studies, I had the opportunity to study for one year at the University of Erlangen-Nuremberg, Germany, as well as at Delft University of Technology, The Netherlands, under the COSSE programme. COSSE also gave me the possibility to interact with some of the great minds in the field of Computer Science and Applied Mathematics, and to develop a network of friends which will last beyond the boundaries of the programme. Foremost, I would like to thank my supervisor Professor Kees Vuik from TU Delft for providing regular feedback and support during the thesis. I would like to thank MARIN, and in particular Auke Ditzel for giving me the opportunity to do research at MARIN on this challenging project, Auke van der Ploeg for his expertise on the linear solvers, and Anneke Sicherer-Roetman for the support on the programming aspects. I would also like to thank Martijn de Jong, whose work helped me a lot, and Professor Ulrich Ruede from FAU Erlangen for his encouragement. During these two years of studying abroad, the support of my parents never ended and I am very thankful for their advice.
Akash Mittal, Delft, July 2014

¹ For more information, please visit: http://www.kth.se/en/studies/programmes/em/cosse

Contents

1 Introduction
  1.1 Background
  1.2 Earlier Work
    1.2.1 Elwin van 't Wout: Fast solver and model explanation
    1.2.2 Martijn de Jong: Developing a CUDA solver for large sparse matrices for MARIN
  1.3 Research Direction
  1.4 Organization of the Thesis

2 Model Equations and Discretization
  2.1 The Variational Boussinesq Model (VBM)
    2.1.1 Hamiltonian Description
    2.1.2 Linearized Model Description
  2.2 Spatial Discretization

3 Test Problems
  3.1 One Dimensional Test Problem
  3.2 Two Dimensional Test Problem - Assessment with MATLAB and C++
  3.3 Realistic Problems: IJssel, Plymouth
    3.3.1 The Gelderse IJssel
    3.3.2 The Plymouth Sound
  3.4 Testing System

4 Temporal Discretization
  4.1 Explicit Time Stepping: Leap-Frog Scheme
  4.2 Implicit Time Stepping
    4.2.1 Fully Implicit Scheme
    4.2.2 β-Implicit Scheme
    4.2.3 Stability
    4.2.4 Backward Differentiation Formula
    4.2.5 Semi-Implicit Schemes
  4.3 Symplectic Integration

5 Analysis of Time Integration Schemes
  5.1 Stability and Accuracy of Various Time Integration Schemes
    5.1.1 Benchmark Solution using MATLAB Integrators
    5.1.2 Explicit Scheme
    5.1.3 Fully Implicit Schemes
    5.1.4 Semi-Implicit Schemes
  5.2 Study of Implicit Trapezoidal Scheme
    5.2.1 Higher Order Schemes and Impact on Phase Error
    5.2.2 Impact of Initial Conditions
    5.2.3 Impact of Boundary Conditions
    5.2.4 Summary
  5.3 Impact of Spatial Discretization
    5.3.1 Impact of Grid Size on Accuracy
  5.4 Stability Analysis with Vertical Structure ψ Included
    5.4.1 Benchmark
    5.4.2 Chorin's Projection Technique
  5.5 Stability Analysis for the Two Dimensional Case

6 Iterative Solvers
  6.1 Background and Previous Work
    6.1.1 Properties of the Discretization Matrix Sψψ
    6.1.2 Preconditioned Conjugate Gradient Method
    6.1.3 Repeated Red-Black (RRB) Ordering as Preconditioner
    6.1.4 Implementation of the RRB Preconditioner
    6.1.5 Effect of RRB Ordering on the Sparsity Pattern
    6.1.6 RRB-k Method
    6.1.7 The Schur Complement and the 9-Point Stencil
    6.1.8 Constructing the Preconditioning Matrix M
    6.1.9 Solving Mz = r
    6.1.10 Efficient CUDA and C++ Code Implementation
  6.2 Solvers for Non-Symmetric Matrices
    6.2.1 From Symmetric to Non-Symmetric Matrices
    6.2.2 Arnoldi Iteration and GMRES
    6.2.3 BiConjugate Gradient Methods
  6.3 Combined Solver - GMRES / Bi-CGSTAB
    6.3.1 Preconditioners
    6.3.2 Preconditioner for Skew-Symmetric Matrix
    6.3.3 Impact of Different Preconditioners on the Eigenvalue Spectrum

7 C++ and CUDA Implementation
  7.1 General Comments on Implementation
  7.2 Construction of the Preconditioner for the Implicit Equation
  7.3 Implementation of Chorin's Projection Scheme
    7.3.1 Sparse Matrix-Vector Multiplication
    7.3.2 Solve Step for the Coupled System
    7.3.3 RRB Solver for ψ and ζ Correction
  7.4 Plugging the Solver into the Interactive Waves Code
    7.4.1 Incoming Boundary Conditions - Previous Implementation
    7.4.2 Incoming Boundary Conditions - Current Implementation

8 Results - Two Dimensional Test Case
  8.1 Convergence Properties
    8.1.1 Termination Criteria
    8.1.2 Variation with ∆t for Various Preconditioners
    8.1.3 Variation with Time Step Ratio
    8.1.4 Impact of the Problem Size
  8.2 Bi-CGSTAB vs CG: Memory Requirement and Operation Count
    8.2.1 Memory Requirement: C++
    8.2.2 Memory Requirement: CUDA
    8.2.3 Operation Count
  8.3 Timing Results
  8.4 CUDA Profiling
9 Results - Real Test Case
  9.1 Variation with Time Step
  9.2 Variation with Problem Size
  9.3 Profiling Results
  9.4 Comparison: Explicit vs. Implicit
  9.5 Concluding Remarks - Realistic Test Case

10 Dispersion Analysis
  10.1 Dispersion Relation
  10.2 Numerical Dispersion for the One Dimensional Test Case
  10.3 Non-Uniform Grid
    10.3.1 Spatial Discretization
    10.3.2 Accuracy of the Non-Uniform Grid
    10.3.3 Non-Uniform Grid Results

11 Conclusion

12 Future Work

References

Chapter 1
Introduction

1.1 Background

Simulation of ocean waves can be categorized into two major groups. The first is based on physical models, whereas the second generates the ocean waves from either geometrical shapes or an oceanographic spectrum. The latter group requires less computational effort, but results in waves that are less realistic in nature. Currently MARIN (Maritime Research Institute Netherlands) provides ship manoeuvring simulators to shipping companies, which can be used for training and research programmes. Computation of the wave field in these simulators is based on the second method and utilizes wave spectra (Fourier theory). Being deterministic in time and space, they are easy to implement on distributed simulation systems. However, this model is not interactive: the impact of the ship's movement on the wave field is not taken into account.

In order to capture the impact of the ship's movement and perform realistic simulations, the waves need to be modeled physically, but in real time, which is a requirement imposed by the ship simulator. Physical modeling of ocean waves requires solving the non-linear Navier-Stokes equations. With the simulation techniques currently available, it is not possible to solve these equations in three dimensions in real time. It is, however, possible to simplify the physical model to a certain extent by making suitable approximations. This is where the so-called Boussinesq approximation comes into the picture, which can be applied to weakly non-linear and fairly long waves. The idea is to employ these approximations and develop a numerical solver which is fast enough to carry out the simulations in real time. The essential idea in the Boussinesq approximation is the elimination of the vertical coordinate from the flow structure, while still retaining some of the influence of the vertical structure of the flow.
This is possible because waves propagate in the horizontal direction, whereas the flow has a different, non-wave-like behaviour in the vertical direction. In Boussinesq-like models, an approximate vertical structure of the flow is used to eliminate the vertical coordinate. However, the classical Boussinesq models are limited to long waves, having wavelengths much longer than the water depth, and they cannot deal with varying depths. The variational Boussinesq model (developed by Klopman [2]) can deal with deep water and varying depth, as the vertical structure is treated as a function of depth, surface elevation and other parameters. The big advantage over the deterministic model is that it can incorporate the influence of the ship's movement, thus making it an interactive wave model. Owing to the variation in the vertical structure and its interactive nature, it is much more computationally intensive, and much work has been done in order to improve the running time of the solver. Two previous Master's theses, by van 't Wout [3] and de Jong [1], have focused on creating a fast linear solver and increasing the efficiency by GPU programming. MARIN intends to use this solver to simulate cases involving multiple ships. In these cases smaller grid sizes are required, and this puts severe restrictions on the current numerical techniques.

1.2 Earlier Work

1.2.1 Elwin van 't Wout: Fast solver and model explanation

Elwin's work comprised improving the linear solver that is used in the real-time ship simulator. The underlying wave model is the variational Boussinesq model as suggested by Klopman [2]. Elwin starts with a derivation of the Variational Boussinesq Model (VBM). The starting point of the derivation of the model equations is the Euler equations, which are valid for motions of ideal fluids and are a special case of the Navier-Stokes equations (inviscid incompressible flow). In order to reduce the problem from three dimensions to two dimensions, vertical shape functions are employed. The resulting model is linearized to reduce complexity; the linearized model is described in detail in Chapter 2. The equations are then discretized in space using a finite volume method on structured Cartesian grids. Time integration is performed using an explicit Leap-Frog scheme, which is described later. One of the equations results in a system of linear equations. Elwin provides a detailed description of the various numerical methods available for solving linear systems of equations. Based on the properties of the matrix involved, he selects the Conjugate Gradient method as the solver. This iterative method is then combined with various preconditioners, namely diagonal scaling, modified incomplete Cholesky, and repeated red-black ordering. Elwin modified the original implementation of the repeated red-black preconditioner to a repeated red-black preconditioner with a predefined number of levels, combined with a complete Cholesky decomposition on the maximum level. Elwin concludes that the repeated red-black preconditioner requires the lowest amount of computation time in most of the cases.

1.2.2 Martijn de Jong: Developing a CUDA solver for large sparse matrices for MARIN

The C++ RRB solver developed by Elwin was able to solve the system of linear equations within 50 ms for domains no bigger than 100,000 to 200,000 nodes. The focus of Martijn's thesis was to increase the performance of the above solver by utilizing the GPU architecture.
By carefully selecting the storage data structures and optimal CUDA programming parameters, Martijn was able to achieve a speedup of 30x compared to the serial code for the linear solver. Martijn reports that the new solver can solve systems that have more than 1.5 million nodes within 50 ms with ease.

1.3 Research Direction

Our main target is to explore different approaches to increase the performance of the current time-dependent solver, and to allow the solution of larger problems in a similar runtime. Our approach is to first analyze the proposed algorithms on smaller test problems in MATLAB, and then implement the suitable algorithms and solvers in C++ and CUDA. The following research questions and goals are addressed in the thesis to achieve the desired target:

• Analyze the stability and accuracy of various implicit time integration schemes.
• Implement a generalized Krylov subspace method.
  – Analyze the impact on the performance (speed-up) and storage requirements of the generalized Krylov subspace methods for the CUDA and C++ codes.
  – Study the performance of different preconditioner combinations to solve the coupled set of equations.
• Provide a framework to study non-uniform meshes.

1.4 Organization of the Thesis

The organization of the thesis is given below:

1. Chapter 2 describes the governing equations and discretization.
2. Chapter 3 provides a brief overview of the test cases and the systems used for testing.
3. Chapters 4 and 5 discuss various time integration techniques.
4. Chapter 6 discusses the iterative linear solvers and preconditioners.
5. Chapter 7 discusses the implementation of the linear solvers for the selected time integration.
6. Chapters 8 and 9 provide the results for the various test cases in C++ and CUDA.
7. Chapter 10 discusses the analytical results for the numerical dispersion and provides a framework for the non-uniform grid two-dimensional test case.

Chapter 2
Model Equations and Discretization

2.1 The Variational Boussinesq Model (VBM)

The basic idea of the model is to minimize the pressure in the whole fluid. The starting point of the derivation of the model equations is the physical model of the Navier-Stokes equations. The Euler equations are a special case of the Navier-Stokes equations in which viscous forces are neglected. For ocean waves this is considered a valid assumption, as the Reynolds number is typically large, and surface waves in water are a superb example of a stationary and ergodic random process. Rotational effects are also negligible in large water waves. The Euler equation is given by:

∂u/∂t + (u·∇)u = −(1/ρ)∇p − g,   (2.1)

The fluid velocity is given by u, ρ denotes the mass density, p the pressure and g the gravitational acceleration. Understanding the physics behind the problem is important, as it provides information on the numerical techniques required for its solution. Note that the inviscid and irrotational assumption is not necessarily valid near solid boundaries, where very small flow structures associated with turbulence result from the no-slip boundary condition. In the current model this has been ignored, and the assumptions are considered valid throughout the domain. Equation (2.1) is converted to the instationary Bernoulli equation (the fluid being assumed irrotational), and then the pressure is integrated over the whole domain. The basic idea of the variational Boussinesq model is to minimize the total pressure p. For irrotational flows, the velocity can be represented by a velocity potential (φ).
Another variable, ζ (the water level), is also introduced while performing the integration. The problem thus becomes finding these functions (φ and ζ). The vertical structure of the flow is often known. The velocity potential can be written as a series expansion in predefined vertical shape functions, thus reducing the 3D model to a 2D model.

2.1.1 Hamiltonian Description

Being an irrotational and incompressible flow, the system of equations has a Hamiltonian structure, where the forces acting on the system are the gravitational and the pressure force. The Hamiltonian H(ζ, ϕ) is given by the sum of the kinetic and potential energy. In order to obtain the minimal value of the total pressure, the Hamiltonian function is minimized with respect to the variables ζ and ϕ [3]. This results in:

∂ζ/∂t = ∇_ϕ H(ϕ, ζ),   (2.2a)
∂ϕ/∂t = ∇_ζ H(ϕ, ζ),   (2.2b)

where ∇_ζ H refers to the partial derivative of H with respect to ζ. The structure of the above equations will play an important role in the discussion of temporal integration in Chapter 4.

2.1.2 Linearized Model Description

The variational Boussinesq model constructed on the basis of the above simplifications is non-linear. In order to simplify these equations, and thus reduce the computational effort, the Hamiltonian equations given by Equation (2.2) are linearized. The detailed derivation from the Hamiltonian to the linear set of equations is described in [3]. The final set of equations after performing the linearization is:

∂ζ/∂t + ∇·(ζU + h∇ϕ − hD₀∇ψ) = 0,   (2.3a)
∂ϕ/∂t + U·∇ϕ + gζ = P_s,   (2.3b)
M₀ψ + ∇·(hD₀∇ϕ − N₀∇ψ) = 0.   (2.3c)

Notes:

1. Equations (2.3) are solved for three basic variables: the water level ζ, the (2-D) surface velocity potential ϕ and the vertical structure ψ. The splitting of the velocity potential into the surface potential and the vertical shape function is given by:

   φ(x, y, z, t) = ϕ(x, y, t) + f(z)ψ(x, y, t)   (2.4)

2. The shape function f is chosen to be either a parabolic or a cosine-hyperbolic shape. The model parameters (the functionals D, M and N) are computed using the shape function f.

3. The water depth h = h(x, y, t) is relative to the reference level z = 0. The bottom of a basin, river or sea is thus at level z = −h.

4. The total velocity can be split into the average current velocity and the velocity due to the wave front. U = U(x, y, t) is the time-averaged horizontal velocity of the current and is used as an external input to the model.

5. Equations (2.3) represent the motion of waves; the impact of a moving ship does not appear directly. In order to model a ship, a pressure pulse on the water surface is defined. In Equation (2.3b), P_s represents the source term, with P_s := −p_s/ρ. The draft of a ship's hull is the vertical distance between the waterline and the bottom of the hull. Given the draft of a ship, p_s is computed as the hydrostatic pressure at that depth. Let the draft be given as d_s; then p_s = g d_s α(x, y), with α(x, y) a shape function that is one in the middle of the ship and zero on the horizontal boundary of the ship. Alternatively, the shape of the ship's hull can be imported from an external surface description file.

2.2 Spatial Discretization

Both Elwin and Martijn have assumed the domain to be rectangular, and divided it into a rectilinear Cartesian grid. The dimensions of the domain are Lx × Ly, and it is divided into Nx × Ny grid points. The outermost nodes represent the physical boundary. The mesh spacing in each direction is given by ∆x = Lx/(Nx − 1) and ∆y = Ly/(Ny − 1).
An example is given in Figure 2.1.

Figure 2.1: Physical domain and corresponding Cartesian grid

The model Equations (2.3) are discretized using a finite volume method. The variables are evaluated at the grid points, and the finite volumes are rectangles of size ∆x × ∆y centered around the grid points. The derivatives are approximated with centered differences, yielding a five-point stencil. For the grid point located at C (= center), the surrounding control volume V and its four nearest neighbours (N = north, E = east, S = south, W = west) are indicated in Figure 2.1. On node (i, j), the discretized versions of the variables ζ, ϕ and ψ are denoted by ζij, ϕij and ψij. In order to put the variables in matrix format, a one-dimensional ordering of the variables is defined, which gives the vectors ζ⃗, ϕ⃗ and ψ⃗. The spatial discretization can be written as:

d/dt (ζ⃗, ϕ⃗, 0)ᵀ + [ S_ζζ  S_ζϕ  S_ζψ ;  S_ϕζ  S_ϕϕ  S_ϕψ ;  S_ψζ  S_ψϕ  S_ψψ ] (ζ⃗, ϕ⃗, ψ⃗)ᵀ = (0, P_s, 0)ᵀ.   (2.5)

The matrices S are given by five-point stencils (Equations (2.6)-(2.12)): S_ζζ and S_ϕϕ contain the convective terms built from U/(2∆x) and V/(2∆y), where U and V denote the current velocity in the x and y direction respectively; S_ζϕ contains the depth-weighted diffusive terms h/∆x² and h/∆y²; S_ζψ and S_ψϕ contain the corresponding terms weighted by hD; and S_ψψ contains the terms weighted by N together with the contribution of M. The remaining couplings are scalar or empty: S_ϕζ = g, S_ϕψ = 0 and S_ψζ = 0.

The system can be written as:

q̇ = Lq + f,   (2.13a)
S_ψψ ψ⃗ = b,   (2.13b)

with q = (ζ⃗, ϕ⃗)ᵀ and q̇ its time derivative. The matrix L = −[ S_ζζ  S_ζϕ ;  S_ϕζ  S_ϕϕ ] is the spatial discretization matrix and f = −(S_ζψ ψ⃗, S_ϕψ ψ⃗)ᵀ. Elwin and Martijn focused on solving the system of equations represented by Equation (2.13b), whereas Equation (2.13a) was solved using the explicit time integration schemes described in the next chapter.

Chapter 3
Test Problems

3.1 One Dimensional Test Problem

To assess the stability and accuracy of various implicit time integration procedures, we shall consider the simplified one-dimensional problem given below as the test problem:

∂ζ/∂t + ∇·(ζU + h∇ϕ) = 0,   (3.1a)
∂ϕ/∂t + U·∇ϕ + gζ = 0.   (3.1b)

Here we have neglected the impact of the vertical structure ψ and of the pressure pulse P_s. This is a coupled initial boundary value problem in which both equations are hyperbolic in nature and together represent a Hamiltonian set. The boundary conditions for both ζ and ϕ are taken to be periodic. Assuming no spatial variation of the mean current velocity U and the depth h, the above equations can be written as:

∂ζ/∂t + U ∂ζ/∂x + h ∂²ϕ/∂x² = 0,   (3.2a)
∂ϕ/∂t + U ∂ϕ/∂x + gζ = 0.   (3.2b)

Implicit time integration procedures will require solving a system of linear equations. We will use MATLAB to solve the system after performing the spatial discretization and applying the boundary conditions. The idea here is to assess the stability and accuracy of the various methods, not their performance. The solutions from the various time integration procedures will be benchmarked against MATLAB's built-in ODE integrators.
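For later reference (the same semi-discrete system reappears in Chapter 5), a minimal sketch of how Equation (3.2) can be discretized with central differences on a periodic grid is given below. It is a self-contained C++ illustration, not part of the thesis code; the grid parameters follow Section 5.1, and the function name `rhs` is chosen here only for the example.

```cpp
// Minimal sketch: central-difference semi-discretization of Eq. (3.2)
// on a periodic 1D grid.  dzeta/dt = -(U dzeta/dx + h d2phi/dx2),
// dphi/dt  = -(U dphi/dx + g zeta).  Parameter values follow Section 5.1.
#include <cstdio>
#include <vector>
#include <cmath>

struct State { std::vector<double> zeta, phi; };

// Evaluate the right-hand side of the semi-discrete system.
static State rhs(const State& s, double U, double h, double g, double dx) {
    const int n = static_cast<int>(s.zeta.size());
    State d{std::vector<double>(n), std::vector<double>(n)};
    for (int i = 0; i < n; ++i) {
        const int ip = (i + 1) % n, im = (i - 1 + n) % n;   // periodic neighbours
        const double dzeta_dx  = (s.zeta[ip] - s.zeta[im]) / (2.0 * dx);
        const double dphi_dx   = (s.phi[ip]  - s.phi[im])  / (2.0 * dx);
        const double d2phi_dx2 = (s.phi[ip] - 2.0 * s.phi[i] + s.phi[im]) / (dx * dx);
        d.zeta[i] = -(U * dzeta_dx + h * d2phi_dx2);        // Eq. (3.2a)
        d.phi[i]  = -(U * dphi_dx  + g * s.zeta[i]);        // Eq. (3.2b)
    }
    return d;
}

int main() {
    const double PI = std::acos(-1.0);
    const double L = 100.0, dx = 0.5, U = 1.0, h = 50.0, g = 9.81;
    const int n = static_cast<int>(L / dx);
    State s{std::vector<double>(n), std::vector<double>(n, 1.0)};
    for (int i = 0; i < n; ++i) s.zeta[i] = std::cos(4.0 * PI * i * dx / L);
    State d = rhs(s, U, h, g, dx);
    std::printf("dzeta/dt at x=0: %g, dphi/dt at x=0: %g\n", d.zeta[0], d.phi[0]);
    return 0;
}
```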
3.2 Two Dimensional Test Problem - Assessment with MATLAB and C++

In order to analyze the stability and accuracy of the complete system and to implement the required solver in C++ and CUDA, it is necessary to study the two-dimensional case, which results in a pentadiagonal matrix instead of a tridiagonal matrix. In this test problem, a simplified problem is considered which addresses the issues related to the coupling of the vertical structure with the wave height and the surface potential in two dimensions. First the system is tested with MATLAB to understand the stability of the required scheme, and then it is implemented in C++ and CUDA. The test problem is used to build the coupled RRB solver for the implicit set of equations, first in C++ and then in CUDA. The solutions obtained from C++ and CUDA can then be benchmarked against the results from MATLAB. The equations considered are:

∂ζ/∂t + U_x ∂ζ/∂x + U_y ∂ζ/∂y + h(∂²ϕ/∂x² + ∂²ϕ/∂y²) − hD(∂²ψ/∂x² + ∂²ψ/∂y²) = 0,   (3.3a)
∂ϕ/∂t + U_x ∂ϕ/∂x + U_y ∂ϕ/∂y + gζ = 0,   (3.3b)
Mψ + hD(∂²ϕ/∂x² + ∂²ϕ/∂y²) − N(∂²ψ/∂x² + ∂²ψ/∂y²) = 0.   (3.3c)

The algorithm required to solve the above set of equations is described in detail in Section 5.4.2.

3.3 Realistic Problems: IJssel, Plymouth

Two realistic problems are obtained from MARIN's database and from the previous simulations by Martijn in [1]. Testing will be carried out with respect to the validated models in [1]. A very small time step (much lower than the allowable limit) will be used to first carry out simulations with the current model, which uses explicit Leap-Frog time integration. Results from implicit time integration will then be compared with the results from explicit time integration.

3.3.1 The Gelderse IJssel

The Gelderse IJssel, a small river, is a branch of the Rhine in the Dutch provinces Gelderland and Overijssel. The river flows from Westervoort and discharges into the IJsselmeer. In Figure 3.1 a (small) part of the river is shown. From this part several test problems are extracted, by cutting small regions out of the displayed region. For the discretization an equidistant 2 m by 2 m grid is used.

Figure 3.1: The Gelderse IJssel (Google Maps)

3.3.2 The Plymouth Sound

Plymouth Sound is a bay located at Plymouth, a town on the south coast of England, United Kingdom. The Plymouth Breakwater is a dam in the centre of the Plymouth Sound which protects ships anchored in the northern part of the Sound against south-western storms. From this region test problems are extracted as well, see Figure 3.2. For the discretization an equidistant 5 m by 5 m grid is used.

Figure 3.2: The Plymouth Sound (Google Maps)

3.4 Testing System

Below is an overview of the machine that has been used in our simulations. This system can be considered to have one of the best available GPU cards along with its CPU configuration. Although it is possible to obtain GPU cards with higher specifications, their cost is much higher.

Brand / Type: Dell Precision Workstation T3600
Owner / System no.: MARIN LIN0245
CPU: Intel Xeon E5-1620 @ 3.6 GHz
No. of cores: 4 (8 threads)
Cache: 256 kB L1 / 1 MB L2 / 10 MB L3
Memory: 15.5 GB RAM DDR2 @ 1066 MHz
Motherboard: Dell Custom
Operating System: Scientific Linux Release 6.5
System kernel: 2.6.32-431.17.1.el6.x86_64
CUDA release: 6.0
Nvidia driver version: 331.62
GCC version: 4.4.7
GPU (CUDA): GeForce GTX 780
GPU memory: 3072 MB
No. of cores (GPU): 12 SM × 192 cores/SM = 2304 cores
GPU clock rate: 941 MHz
Compute capability: 3.5
Chapter 4
Temporal Discretization

The system of equations to be advanced in time is given by Equation (2.13a). Expanding the equation we get:

dζ⃗/dt = −S_ζζ ζ⃗ − S_ζϕ ϕ⃗ − S_ζψ ψ⃗,   (4.1a)
dϕ⃗/dt = −S_ϕζ ζ⃗ − S_ϕϕ ϕ⃗ − S_ϕψ ψ⃗ + P_s.   (4.1b)

The equation for ϕ can be further simplified by taking into account the structure of the stencils:

dϕ⃗/dt = −S_ϕϕ ϕ⃗ − g ζ⃗ + P_s,   (4.1c)

where g is the gravitational constant. The above equations, in the form of Equation (2.13a), are solved using the explicit Leap-Frog scheme in [1] and [3]. The method is described in Section 4.1. The equation for the variable ψ, which does not contain a time derivative but still depends on the other variables, is given by

S_ψψ ψ⃗ = −S_ψϕ ϕ⃗.   (4.1d)

4.1 Explicit Time Stepping: Leap-Frog Scheme

In explicit schemes, the fluxes and the sources are computed at the nth (previous) time level and their contribution is added to the current value of the variable. In the work by Elwin and Martijn, the Leap-Frog method has been used to integrate Equation (2.13a). The Leap-Frog method is one of the symplectic integration techniques designed for the numerical solution of Hamilton's equations, given by Equation (2.2). The method described in [3] depends on two previous time steps (a so-called multistep method). The exact solution to Equation (2.13a) is approximated at the time intervals t_n = n∆t, n = 0, 1, 2, . . ., with ∆t > 0 the time step size. The numerical approximations are denoted by q_n ≈ q(t_n). To keep the derivation short, we first focus on a fixed constant step size ∆t := t_{n+1} − t_n. A Taylor series expansion gives:

q_{n+1} = q_n + ∆t q̇_n + (∆t²/2) q̈_n + (∆t³/6) q⃛_n + O(∆t⁴),   (4.2a)
q_{n−1} = q_n − ∆t q̇_n + (∆t²/2) q̈_n − (∆t³/6) q⃛_n + O(∆t⁴).   (4.2b)

Subtracting the second equation from the first, we obtain:

q_{n+1} − q_{n−1} = 2∆t q̇_n + O(∆t³),   (4.3)

which reduces to

q̇_n = (q_{n+1} − q_{n−1}) / (2∆t) + O(∆t²).   (4.4)

The explicit Leap-Frog method is second order accurate in time, and is only conditionally stable. The main advantage of this method is that it is simple and fast, and easily parallelizable. The disadvantage is that the time step is limited by the numerical stability requirements. For the equations represented by Equation (2.13a), the stability condition requires that the time step is shorter than the crossing time of the grid cells by the fastest wave:

∆t < ∆x_i / c_max   (4.5)

for all grid cells and all directions i = x, y, where c represents the velocity of the wave. This is the famous Courant-Friedrichs-Lewy (CFL) condition, valid for explicit time integration of arbitrary hyperbolic PDEs. In the current implementation, with the uniform grid, ∆x is constant for each cell, and the velocity of the surface gravity wave is used to determine the time step such that it satisfies the CFL condition. A safety margin is added to ensure stability without needing to check the stability criterion at each time step. For example, given the equation ẏ = λy (λ arises as an eigenvalue of a local Jacobian and can be complex), the Leap-Frog method is stable only for |λ∆t| ≤ 1. In order to approximate the interaction between the ship and the waves, a grid size of the order of half a meter or smaller is required. This limits the time step to at most 0.01 seconds, while the manoeuvring simulator at MARIN often uses time steps as large as 0.1 seconds.
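As a concrete illustration of this restriction, the sketch below applies the Leap-Frog update q_{n+1} = q_{n−1} + 2∆t λ q_n to the scalar test equation ẏ = λy with a purely imaginary λ (a single undamped wave mode), once with |λ∆t| below 1 and once above it. This is only an illustrative example; the choice λ = iω with ω = 44.3 rad/s (roughly 2√(gh) for the parameters used later) is ours.

```cpp
// Leap-Frog applied to the scalar test equation  dq/dt = i*omega*q.
// For |omega*dt| <= 1 the amplitude stays bounded; above 1 it blows up.
#include <complex>
#include <cstdio>
#include <algorithm>
#include <cmath>

static double leapfrog_amplitude(double omega, double dt, double tmax) {
    const std::complex<double> lambda(0.0, omega);
    std::complex<double> qm1(1.0, 0.0);                 // q_0
    std::complex<double> q   = qm1 + dt * lambda * qm1; // q_1 (forward Euler start)
    double maxabs = std::abs(q);
    for (double t = dt; t < tmax; t += dt) {
        std::complex<double> qp1 = qm1 + 2.0 * dt * lambda * q;  // Leap-Frog step
        qm1 = q; q = qp1;
        maxabs = std::max(maxabs, std::abs(q));
    }
    return maxabs;   // the exact solution has |q| = 1 for all t
}

int main() {
    const double omega = 44.3;                    // ~ 2*sqrt(g*h) for g=9.81, h=50
    std::printf("dt=0.010 (omega*dt<1): max|q| = %g\n",
                leapfrog_amplitude(omega, 0.010, 100.0));
    std::printf("dt=0.030 (omega*dt>1): max|q| = %g\n",
                leapfrog_amplitude(omega, 0.030, 2.0));
    return 0;
}
```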
The intention to use smaller mesh sizes, in order to capture the details of the wave interaction with ships at finer levels, will require a further reduction of the time step, and hence increase the computation time. This provides the motivation to explore time integration methods which are more stable, allowing us to use larger time steps while still obtaining good accuracy.

4.2 Implicit Time Stepping

4.2.1 Fully Implicit Scheme

The stability of the time integration can be significantly improved by employing fully implicit time integration techniques. The most common and simplest fully implicit scheme is the Backward Euler scheme. For the set of equations given by Equation (2.13a), we get:

(q_{n+1} − q_n)/∆t = Lq_{n+1} − S_ζψ ψ⃗_{n+1},   (4.6a)

where q = (ζ⃗, ϕ⃗)ᵀ. The method is first order accurate in time¹ and is unconditionally stable. While the stability of the implicit time discretization is a great improvement over the explicit schemes, one has to solve the implicit equations for the unknowns q_{n+1} and ψ⃗_{n+1}, which requires solving a system of (differential algebraic) equations represented by:

(I − ∆tL) q_{n+1} = q_n − ∆t S_ζψ ψ⃗_{n+1},   (4.7a)
S_ψψ ψ⃗_{n+1} = −S_ψϕ ϕ⃗_{n+1}.   (4.7b)

As the method is first order accurate in time, and the spatial discretization is second order accurate, the overall accuracy of the method is only first order. Such methods are usually good for obtaining steady-state results, but not very accurate for transients. The disadvantages of using the fully implicit scheme include:

• Moving from explicit to implicit schemes incurs extra computational effort, as it requires solving the linear system of equations given by Equation (4.7a). For the system represented above, iterative linear solvers are used, which are described in more detail in Chapter 6.
• First order implicit methods simply under-relax more to maintain the stability of the iterative solution. It is this increased damping, growing with the time-step size, which induces larger inaccuracies in the transient behaviour.
• Spatial discretization plays an important role in the stability of the numerical solution of unsteady hyperbolic equations. Generally, for the Euler equations, the central difference scheme is more accurate than first order upwind schemes, and stability is not an issue for central differencing as no diffusive forces are active. In case flux limiters are used for the spatial discretization, which are capable of capturing shocks, discontinuities or sharp changes in the solution, the implicit scheme results in a non-linear system of equations, which then requires more computational effort (computation of the Jacobian).

4.2.2 β-Implicit Scheme

Semi-implicit methods try to combine the stability of the implicit methods with the efficiency of the explicit methods. The system of equations is discretized using a one-parameter (β) implicit scheme to advance in time:

q_{n+1} = q_n + ∆t [ β(Lq_{n+1} − S_ζψ ψ⃗_{n+1}) + (1 − β)(Lq_n − S_ζψ ψ⃗_n) ].   (4.8a)

The parameter β can vary between 0 and 1. For β = 1 we get the Backward Euler method (fully implicit), and for β = 0.5 the so-called trapezoidal scheme. For β = 0.5 we can rewrite the equation as:

q_{n+1} = q_n + 0.5∆t (q̇_{n+1} + q̇_n)   (4.9a)
        = q_n + ∆t q̇_n + 0.5∆t (q̇_{n+1} − q̇_n)   (4.9b)
        = q_n + ∆t q̇_n + (∆t²/2) dq̇/dt + O(∆t³),   (4.9c)

which gives the trapezoidal method of second order temporal accuracy.

¹ This can be derived from the Taylor series expansion.
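Before the formal stability discussion, the sketch below applies the β-scheme update (4.8a) to the scalar test equation ẇ = λw (so L reduces to λ and the ψ-term drops out), for which the implicit step is just a scalar division. The values λ = 44.3i and the step sizes are our own illustrative choices, not taken from the thesis code.

```cpp
// beta-implicit (theta) scheme applied to the scalar test equation dw/dt = lambda*w:
//   w_{n+1} = w_n + dt*( beta*lambda*w_{n+1} + (1-beta)*lambda*w_n ),
// so that w_{n+1} = w_n * (1 + (1-beta)*dt*lambda) / (1 - beta*dt*lambda).
#include <complex>
#include <cstdio>

static double final_amplitude(double beta, std::complex<double> lambda,
                              double dt, double tmax) {
    std::complex<double> w(1.0, 0.0);
    const std::complex<double> growth =
        (1.0 + (1.0 - beta) * dt * lambda) / (1.0 - beta * dt * lambda);
    for (double t = 0.0; t < tmax; t += dt) w *= growth;
    return std::abs(w);   // the exact solution keeps |w| = 1 for purely imaginary lambda
}

int main() {
    const std::complex<double> lambda(0.0, 44.3);   // a single undamped wave mode
    for (double dt : {0.01, 0.1}) {
        std::printf("dt=%4.2f  backward Euler (beta=1.0): |w(T)| = %.3f   "
                    "trapezoidal (beta=0.5): |w(T)| = %.3f\n",
                    dt, final_amplitude(1.0, lambda, dt, 10.0),
                        final_amplitude(0.5, lambda, dt, 10.0));
    }
    return 0;
}
```

For β = 0.5 the amplitude of this undamped mode is preserved exactly, whereas β = 1 damps it; this anticipates the behaviour observed in Chapter 5.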
4.2.3 Stability

The Scalar Test Equation

To understand the stability of time integration methods, we consider the scalar, complex test equation:

w′(t) = λw(t),   (4.10)

where λ ∈ C. Application of a time integration scheme (explicit or implicit) gives:

w_{n+1} = R(∆tλ) w_n,   (4.11)

where R(∆tλ) is called the stability function. Let z = ∆tλ. For explicit schemes of order s, R(z) is a polynomial of degree ≤ s; for implicit methods it is a rational function with the degrees of both numerator and denominator ≤ s. The stability region is defined in terms of R(z) as:

S = { z ∈ C : |R(z)| ≤ 1 }.   (4.12)

A scheme with the property that S contains the entire left half plane C⁻ = { z ∈ C : Re(z) ≤ 0 } is called A-stable. A scheme is said to be strongly A-stable if it is A-stable with |R(∞)| < 1, and it is said to be L-stable if in addition |R(∞)| = 0. For the semi-implicit scheme with parameter β, the stability function is

R(z) = (1 + (1 − β)z) / (1 − βz).

The implicit trapezoidal rule, given by β = 0.5, is A-stable, whereas the fully implicit Backward Euler method (β = 1) is L-stable.

Stability for Linear Systems

Let a linear system of equations (m equations, m unknowns) be given as:

w′(t) = Aw(t) + g(t),   (4.13)

with A ∈ R^{m×m}. Application of the semi-implicit scheme with parameter β gives:

w_{n+1} = R(∆tA) w_n + (I − β∆tA)⁻¹ ∆t g_{n+β},   (4.14)

where

R(∆tA) = (I − β∆tA)⁻¹ (I + (1 − β)∆tA)   (4.15)

and g_{n+β} = (1 − β)g(t_n) + βg(t_{n+1}). Starting from the initial solution w_0, we obtain:

w_n = R(∆tA)ⁿ w_0 + ∆t Σ_{i=0}^{n−1} R(∆tA)^{n−i−1} (I − β∆tA)⁻¹ g_{i+β}.   (4.16)

If we perturb the initial solution to ŵ_0, we get for the perturbed solution at the nth time step:

ŵ_n − w_n = R(∆tA)ⁿ (ŵ_0 − w_0).   (4.17)

Hence, the powers R(∆tA)ⁿ determine the growth of the initial errors. Let λ_j with 1 ≤ j ≤ m denote the eigenvalues of the matrix A, and let A be diagonalizable such that A = UΛU⁻¹ with Λ = diag(λ_j) [7]. Let K be the condition number of U. Then

∆tλ_j ∈ S, 1 ≤ j ≤ m  ⟹  ||R(∆tA)ⁿ|| ≤ K  for all n ≥ 1,

where S is the stability region introduced for the scalar test equation. Stability of the semi-implicit methods requires a moderate bound on these powers. In case the powers are not bounded, some ∆tλ_j does not belong to the stability region S, resulting in an unstable method.

4.2.4 Backward Differentiation Formula

Another approach to achieve second order temporal accuracy is to use information from multiple previous time steps. This gives rise to a two-parameter, three-level time integration scheme:

q_{n+1} = q_n + ∆t [ α (q_n − q_{n−1})/∆t − α q̇_n + β q̇_{n+1} + (1 − β) q̇_n ],   (4.18)

where q̇_n represents the derivative of q at the nth time level. This scheme is three-level whenever the parameter α ≠ 0. When α = 1/3 and β = 2/3, we obtain the second order Backward Differentiation Formula (BDF2) for constant time step ∆t. The accuracy analysis for this multi-step method is presented below. Let us assume that q_n was computed from q_{n−1} with second order temporal accuracy. This implies

(q_n − q_{n−1})/∆t = q̇_n − (∆t/2) dq̇/dt + O(∆t²).   (4.19)

Substituting this into Equation (4.18):

q_{n+1} = q_n + ∆t [ q̇_n − α(∆t/2) dq̇/dt + β∆t dq̇/dt + O(∆t²) ]   (4.20a)
        = q_n + ∆t q̇_n + ∆t² (β − α/2) dq̇/dt + O(∆t³).   (4.20b)

The method is second order accurate in time when 2β − α = 1. At the start of the simulation one can use the trapezoidal scheme for second order accuracy, or the more stable Backward Euler method (as the value at the (n − 1)th time step is not yet available).
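Written out for α = 1/3 and β = 2/3, the scheme (4.18) indeed reduces to the familiar BDF2 formula; the short check below is added here only for completeness.

```latex
% Substituting alpha = 1/3, beta = 2/3 into Equation (4.18):
q_{n+1} = q_n + \tfrac{1}{3}\left(q_n - q_{n-1}\right)
        + \Delta t\left[-\tfrac{1}{3}\dot{q}_n + \tfrac{2}{3}\dot{q}_{n+1}
        + \tfrac{1}{3}\dot{q}_n\right]
        = \tfrac{4}{3}q_n - \tfrac{1}{3}q_{n-1} + \tfrac{2}{3}\Delta t\,\dot{q}_{n+1},
\qquad
2\beta - \alpha = \tfrac{4}{3} - \tfrac{1}{3} = 1 .
```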
The BDF2 method has better stability properties than the trapezoidal rule, and is regularly used for stiff problems. It can also be viewed as an implicit counterpart of the explicit Leap-Frog scheme.

4.2.5 Semi-Implicit Schemes

Any implicit method requires solving a linear system of equations. In [1], a linear solver has been developed for the case where the system of equations is represented by a pentadiagonal matrix. From Equation (4.7a), the linear system of equations formed for the variables q_{n+1} is block pentadiagonal instead of pentadiagonal, as q = (ζ⃗, ϕ⃗)ᵀ and both ζ⃗ and ϕ⃗ are represented by their own five-point stencils. As previously mentioned, it is our intention to make use of the linear solver of [1] with as few modifications as possible. For this reason, we split Equation (4.7a) into Equation (4.1a) and Equation (4.1b), and then derive the corresponding equations for the various implicit time integration procedures.

In the first approach, we advance the variable ζ⃗ implicitly and use the previous time step values of the variables ϕ⃗ and ψ⃗. The equations are given below:

ζ⃗_{n+1} = ζ⃗_n + ∆t (−S_ζζ ζ⃗_{n+1} − S_ζϕ ϕ⃗_n − S_ζψ ψ⃗_n),   (4.21a)
(I/∆t + S_ζζ) ζ⃗_{n+1} = ζ⃗_n/∆t − (S_ζϕ ϕ⃗_n + S_ζψ ψ⃗_n).   (4.21b)

Note that the variables ϕ and ψ are here treated explicitly. After advancing ζ⃗, the equation for ϕ⃗ is advanced implicitly:

ϕ⃗_{n+1} = ϕ⃗_n + ∆t (−S_ϕζ ζ⃗_{n+1} − S_ϕϕ ϕ⃗_{n+1} − S_ϕψ ψ⃗_n),   (4.22a)
(I/∆t + S_ϕϕ) ϕ⃗_{n+1} = ϕ⃗_n/∆t − (S_ϕζ ζ⃗_{n+1} + S_ϕψ ψ⃗_n).   (4.22b)

Now only the variable ψ is treated explicitly; as the value of ζ at the (n + 1)th time step is available, we are able to use it. After obtaining the values of both ζ and ϕ at the new time step, the linear system of equations for ψ⃗ is solved. As this is a combination of an implicit and an explicit scheme, stability of the method is not guaranteed. Another possibility is to use a predictor-corrector type of scheme, described below:

• Advance ϕ⃗ using an explicit time integration scheme.
• Based on the new value of ϕ⃗, compute ψ⃗ by solving the linear system of equations.
• Implicitly advance ζ⃗, where the values of ϕ⃗ and ψ⃗ on the right-hand side of Equation (4.21) are substituted by the values computed above.
• Perform an implicit correction of ϕ⃗ and compute ψ⃗.

This will certainly demand more computational effort, but will be more stable than Equation (4.21). Similar derivations can be done for higher order methods.

4.3 Symplectic Integration

Symplectic integrators are designed for the numerical solution of Hamiltonian equations, given by Equation (2.2). Symplectic integrators ensure time-reversibility and preservation of the symplectic nature of the equations. The so-called symplectic Euler method (first order) can be constructed as follows:

ζ_{n+1} = ζ_n + ∆t ∇_ϕ H(ζ_{n+1}, ϕ_n),   (4.23a)
ϕ_{n+1} = ϕ_n + ∆t ∇_ζ H(ζ_{n+1}, ϕ_n).   (4.23b)

The method is implicit for general Hamiltonian systems; however, if H(ζ, ϕ) is separable as H(ζ, ϕ) = T(ζ) + U(ϕ), it turns out to be explicit. The Störmer-Verlet schemes are symplectic methods of order 2. They are composed of two symplectic Euler steps with step size ∆t/2:

ζ_{n+1/2} = ζ_n + (∆t/2) ∇_ϕ H(ζ_{n+1/2}, ϕ_n),   (4.24a)
ϕ_{n+1} = ϕ_n + (∆t/2) ( ∇_ζ H(ζ_{n+1/2}, ϕ_n) + ∇_ζ H(ζ_{n+1/2}, ϕ_{n+1}) ),   (4.24b)
ζ_{n+1} = ζ_{n+1/2} + (∆t/2) ∇_ϕ H(ζ_{n+1/2}, ϕ_{n+1}).   (4.24c)
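To make the difference with a non-symplectic method concrete, the sketch below applies the (explicit, because separable) symplectic Euler step and the ordinary forward Euler step to a single wave mode with the model Hamiltonian H(ζ, ϕ) = ½gζ² + ½hk²ϕ², and monitors the energy. The choice of this Hamiltonian, the wavenumber k and the sign convention (dζ/dt = ∂H/∂ϕ, dϕ/dt = −∂H/∂ζ) are made here purely for illustration and are not taken from the thesis.

```cpp
// Symplectic Euler vs. forward Euler for one wave mode with U = 0:
//   d(zeta)/dt =  h*k^2*phi = dH/dphi,   d(phi)/dt = -g*zeta = -dH/dzeta,
// with H(zeta,phi) = 0.5*g*zeta^2 + 0.5*h*k^2*phi^2 (an assumed model Hamiltonian).
#include <cstdio>
#include <cmath>

int main() {
    const double PI = std::acos(-1.0);
    const double g = 9.81, h = 50.0, k = 2.0 * PI / 50.0;   // one 50 m wave
    const double dt = 0.01, tmax = 100.0;

    double zs = 1.0, ps = 0.0;   // symplectic Euler state (zeta, phi)
    double ze = 1.0, pe = 0.0;   // forward Euler state
    auto energy = [&](double z, double p) {
        return 0.5 * g * z * z + 0.5 * h * k * k * p * p;
    };
    const double E0 = energy(zs, ps);

    for (double t = 0.0; t < tmax; t += dt) {
        // symplectic Euler: update zeta first, then use the NEW zeta for phi
        zs += dt * h * k * k * ps;
        ps += dt * (-g * zs);
        // forward Euler: both updates use the old state
        const double ze_new = ze + dt * h * k * k * pe;
        const double pe_new = pe + dt * (-g * ze);
        ze = ze_new; pe = pe_new;
    }
    std::printf("relative energy drift: symplectic %.2e, forward Euler %.2e\n",
                energy(zs, ps) / E0 - 1.0, energy(ze, pe) / E0 - 1.0);
    return 0;
}
```

The symplectic step keeps the energy bounded, while forward Euler lets it grow without bound; this is the property that motivates the symplectic view of the Leap-Frog and mid-point schemes.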
Implicit mid-point rule: for a fully implicit method, ζ and ϕ are combined in one vector q = (ζ⃗, ϕ⃗)ᵀ. The implicit mid-point rule is then given by:

q_{n+1} = q_n + ∆t ∇H( (q_{n+1} + q_n)/2 ).   (4.25)

In the next chapter we implement the above time integration schemes for various test cases and analyze their accuracy and stability behaviour.

Chapter 5
Analysis of Time Integration Schemes

5.1 Stability and Accuracy of Various Time Integration Schemes

To assess the stability of various implicit time integration procedures, we consider the simplified one-dimensional problem given below:

∂ζ/∂t + U ∂ζ/∂x + h ∂²ϕ/∂x² = 0,   (5.1a)
∂ϕ/∂t + U ∂ϕ/∂x + gζ = 0.   (5.1b)

Here U represents the mean current velocity and h the depth of the ocean. Both U and h are assumed to be uniform in the x direction. Periodic boundary conditions are assumed for both ζ and ϕ, and the initial conditions are:

• ζ(0, x) = cos(4πx/L)
• ϕ(0, x) = 1

The central difference scheme is used for the spatial discretization. It results in the following equations:

∂ζ⃗/∂t + S_ζζ ζ⃗ + S_ζϕ ϕ⃗ = 0,   (5.2a)
∂ϕ⃗/∂t + S_ϕζ ζ⃗ + S_ϕϕ ϕ⃗ = 0,   (5.2b)

where the discretization matrices are given by the three-point stencils:

S_ζζ : [ −U/(2∆x)   0   U/(2∆x) ]
S_ζϕ : [ h/∆x²   −2h/∆x²   h/∆x² ]
S_ϕζ : [ 0   g   0 ]
S_ϕϕ : [ −U/(2∆x)   0   U/(2∆x) ]

The matrices S_ζζ and S_ϕϕ are skew-symmetric, the matrix S_ζϕ is symmetric, whereas S_ϕζ is simply a diagonal matrix. For the sample problem, the following values have been used:

• domain length L = 100 m
• ∆x = 0.5 m
• U = 1.0 m/s
• h = 50 m
• t_max = 100 s

For the combined system of equations represented by Equation (5.2), the maximum eigenvalue obtained is λ_max = 2√(gh). The time step for explicit time integration schemes is limited by the CFL condition:

∆t ≤ ∆x/λ_max = ∆x/(2√(gh)) = 0.0113 s = ∆t_cfl.   (5.3)

The various time integration schemes discussed in Chapter 4 have been implemented. In the case of the fully implicit schemes (fully implicit Backward Euler, fully implicit trapezoidal, etc.), Equation (5.2) is solved simultaneously. In the case of the semi-implicit schemes, an iterative procedure is followed to solve Equation (5.2).

5.1.1 Benchmark Solution using MATLAB Integrators

In order to obtain the benchmark solution, MATLAB ODE integrators are used. The system of ODEs given by Equation (5.2) is integrated by the ODE45 and ODE15s functions. ODE45 is an explicit Runge-Kutta method with adaptive time stepping; the time step is modified based on the error estimate provided. ODE15s is a higher order implicit method. For the benchmark, a relative error tolerance of 1e−6 is used. Using ODE15s, we obtain the solution for the variable ζ shown in Figure 5.1.

Figure 5.1: ζ variation with length at different times using ODE15s

The result depends on the initial condition chosen. If the initial condition is smooth, as in Figure 5.1, the final solution is also smooth.

Figure 5.2: ζ variation with length at different times using ODE15s, with a discontinuity in the initial condition

If we introduce a discontinuity in the initial condition at x = 0, as shown in Figure 5.2, wiggles/oscillations are observed with the central discretization. In order to remove these oscillations, high resolution spatial discretization schemes (flux limiters or total variation diminishing schemes) are required, which are not considered in the current study. The comparison of ODE45 (explicit) and ODE15s (implicit) is shown in Figure 5.3.
Figure 5.3: ζ variation with length at different times using ODE15s and ODE45

In order to compare two solutions, we define the norm of the difference between the two vectors. Let x_s be the number of grid points in the x direction. Then

Norm(u, v) = sqrt( (1/x_s) Σ_{i=1}^{x_s} (u(i) − v(i))² ).   (5.4)

For the solutions given in Figure 5.3, the norm as given by Equation (5.4) is 2.0500e−4. We will now use ODE45 as the benchmark for the explicit time integration solvers given in Chapter 4, and ODE15s as the benchmark for the implicit and semi-implicit time integration methods.

5.1.2 Explicit Scheme

The Leap-Frog scheme has been used as the explicit time integration scheme, as described in Chapter 4. For the first time step, as the values at two previous time steps are unknown, the Forward Euler method is used. The Leap-Frog scheme requires the CFL condition for the stability of the time integration. For time steps up to ∆t_cfl the method is stable; it becomes unstable when the time step is equal to or greater than ∆t_cfl. For various time steps, the simulation was run in MATLAB for 100 seconds, and the results are plotted in Figure 5.4.

Figure 5.4: Comparison of ζ variation with length for ODE45 vs explicit Leap-Frog

It is known that the Leap-Frog method does not suffer from phase lag or numerical dispersion (the analytical behaviour of the Leap-Frog method is discussed in Chapter 10). We can see that the solution is accurate for time steps below ∆t_cfl. In order to compare the results, the norm defined in Equation (5.4) is used and tabulated below.

Table 5.1: Leap-Frog vs ODE45, error norm

Time step (s)      Error norm
0.0113 (∆t_cfl)    0.0370
0.01               0.0229
0.001              0.0007

5.1.3 Fully Implicit Schemes

Equation (5.2) is solved using three implicit schemes: the Backward Euler scheme, the trapezoidal scheme and the Backward Differentiation Formula (BDF2). It is observed that all the methods are unconditionally stable. For a linear system of equations (as in our case), the trapezoidal method is equivalent to the implicit mid-point rule, which is a symplectic integrator. At larger time steps, the results are not very accurate and the solution is damped. The following subsections give the plots of the three schemes for varying time steps.

Fully Implicit Backward Euler Method

Figure 5.5: Comparison of ζ variation with length for ODE15s vs implicit Backward Euler

Even though the implicit Backward Euler method is unconditionally stable, it suffers from very large damping at larger time steps, as observed in Figure 5.5. As the time step is increased to 0.1 seconds, the solution dies out very quickly, which is unwanted for equations with a symplectic structure, in which the total energy should be conserved. The error norm is given in Table 5.2.

Fully Implicit Trapezoidal Method

Figure 5.6: Comparison of ζ variation with length for ODE15s vs implicit trapezoidal method

As expected, the implicit trapezoidal scheme provides better results than Backward Euler.
For larger time steps (0.1 s), the solution is not very accurate. Some phase error is observed around x = 140, which results in a large error norm for ∆t = 0.1 s. Along with the phase error, some damping is also observed. As studied later in Section 5.2, this reduction in amplitude is not damping, but results from the amplitude variation that accompanies a change in phase. The phase error reduces as we reduce the time step (to around 0.01 seconds).

Fully Implicit BDF2 Method

Figure 5.7: Comparison of ζ variation with length for ODE15s vs implicit BDF2 method

BDF2 is a multi-step second order method, and is not symplectic in nature. As can be seen from Figure 5.7, the solution suffers from very large numerical damping at larger time steps. In Table 5.2, the error norm for the various implicit schemes at different time steps is provided. In terms of accuracy, the trapezoidal method is better than the explicit Leap-Frog scheme and BDF2 for a given time step.

Table 5.2: Implicit methods vs ODE15s, error norm

Time step (s)   Backward Euler norm   Trapezoidal norm   BDF norm
0.1             0.672                 0.470              0.673
0.01            0.672                 0.0136             0.0364
0.001           0.387                 0.0003             0.007

5.1.4 Semi-Implicit Schemes

Semi-implicit schemes of predictor-corrector type were proposed in Chapter 4 in order to make use of the existing structure of the RRB-k solver. The following algorithm is applied when using the semi-implicit scheme; a code sketch of this predictor-corrector loop, reduced to a single Fourier mode, is given at the end of this section. Here itermax represents the maximum number of sub-iterations performed for the corrector step of the algorithm. Apart from the time step, itermax is important for stability considerations: a larger value of itermax should typically allow the use of a larger time step. After performing numerical experiments, itermax versus the maximum allowable time step has been computed and is tabulated below.

Algorithm 5.1: Semi-implicit algorithm to solve Equation (5.2)
Input: ζ⃗₀, ϕ⃗₀ (initial vectors), ∆t, itermax
while t ≤ t_max do
    while iter ≤ itermax do
        if iter == 1 then    (Step 1)
            if t == 1 then
                advance ϕ⃗ explicitly using the Forward Euler method on Equation (5.2)
            else
                advance ϕ⃗ explicitly using the Leap-Frog method on Equation (5.2)
            end if
        else
            advance ζ⃗ implicitly using the implicit trapezoidal scheme, with the implicit value of ϕ⃗ taken from Step 1
            correct ϕ⃗ implicitly using the implicit trapezoidal scheme, with the implicit value of ζ⃗ taken from the previous step
        end if
    end while
end while

Table 5.3: Semi-implicit methods: itermax vs. ∆t

itermax   ∆t maximum   ∆t maximum / ∆t_cfl
3         0.019        1.68
5         0.020        1.77
7         0.021        1.86
9         0.021        1.86
11        0.021        1.86
19        0.022        1.95
21        0.022        1.95
23        0.022        1.95
100       0.022        1.95

It is observed that the time step can be increased up to about twice ∆t_cfl by increasing the number of iterations. The error norms for the semi-implicit schemes follow a similar trend to that of the explicit Leap-Frog scheme. For the parameters and the numerical techniques used in this test case, the semi-implicit methods provide only a small advantage over the explicit methods in terms of stability.

Impact of depth h

The depth h appears in the maximum eigenvalue of the system, and influences the stability criterion of any scheme. In order to ascertain the impact of the depth on the stability of the semi-implicit schemes, a sensitivity analysis has been carried out. The CFL time step is given by Equation (5.3) and varies with the depth h. For a given itermax, it is observed that the ratio between the maximum ∆t achieved and the ∆t given by the CFL condition remains the same.
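The sketch referred to above is given here. It applies the predictor-corrector idea of Algorithm 5.1 to a single Fourier mode of Equation (5.1), for which ∂/∂x becomes ik and the "implicit solves" reduce to scalar divisions; it is therefore only a toy analogue of the MATLAB implementation, and the mode number, the predictor choice (forward Euler throughout) and the parameter values are our own assumptions.

```cpp
// Predictor-corrector (semi-implicit trapezoidal) update for one Fourier mode
// of Eq. (5.1):  zeta' = -i*k*U*zeta + h*k^2*phi,   phi' = -i*k*U*phi - g*zeta.
#include <complex>
#include <cstdio>
#include <cmath>

using cd = std::complex<double>;

int main() {
    const double PI = std::acos(-1.0);
    const double g = 9.81, h = 50.0, U = 1.0, L = 100.0;
    const double k = 4.0 * PI / L;            // mode of the initial condition
    const cd ikU(0.0, k * U);
    const double dt = 0.02, tmax = 100.0;
    const int itermax = 5;

    cd zeta(1.0, 0.0), phi(0.0, 0.0);
    const double E0 = g * std::norm(zeta) + h * k * k * std::norm(phi);  // invariant

    for (double t = 0.0; t < tmax; t += dt) {
        // predictor: advance phi explicitly (forward Euler here, for simplicity)
        cd phi_star = phi + dt * (-ikU * phi - g * zeta);
        cd zeta_new = zeta;
        for (int iter = 0; iter < itermax; ++iter) {
            // trapezoidal step for zeta, using the current estimate of phi at n+1
            zeta_new = (zeta * (1.0 - 0.5 * dt * ikU)
                        + 0.5 * dt * h * k * k * (phi_star + phi))
                       / (1.0 + 0.5 * dt * ikU);
            // corrector: trapezoidal step for phi, using the new zeta
            phi_star = (phi * (1.0 - 0.5 * dt * ikU)
                        - 0.5 * dt * g * (zeta_new + zeta))
                       / (1.0 + 0.5 * dt * ikU);
        }
        zeta = zeta_new; phi = phi_star;
    }
    const double E = g * std::norm(zeta) + h * k * k * std::norm(phi);
    std::printf("relative drift of g|zeta|^2 + h k^2 |phi|^2 after %.0f s: %.2e\n",
                tmax, E / E0 - 1.0);
    return 0;
}
```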
5.2 Study of Implicit Trapezoidal Scheme

As discussed in the previous section, the implicit trapezoidal scheme provides a stable solution without much damping compared to the other methods, although it suffers from a phase shift (phase error) at larger time steps, resulting in larger error norms. In this section, the impact of the phase difference on the amplitude of the solution is discussed for the coupled and for the decoupled set of equations. In Figure 5.8, the variable ζ is plotted at 50 seconds. The system of equations has been decoupled by setting h = 0 and g = 0. No phase error is observed for this case, even when large time steps are used. As the initial solution does not contain sharp discontinuities, the central discretization method does not introduce oscillations and the phase errors are minimal. This changes when we introduce the coupling between the two equations by setting h = 50 and g = 9.81.

Figure 5.8: Implicit trapezoidal method without coupling at t = 50 s for various time steps

For the coupled equations (Figure 5.9), it is observed that the amplitude of the solution does not remain constant, due to the impact of the source term, and varies with the phase. This can be seen in Figure 5.9, where the benchmark solution is plotted at three different time instants. Without any coupling between the equations, the solution would travel with the wave speed without changing its shape or amplitude. With the coupling, a change in amplitude is seen at the different time instants.

Figure 5.9: Solution of the coupled equations at t = 20, 22 and 24 s with the MATLAB ODE integrator

This change of the amplitude with the phase for the coupled system causes large error norms for the implicit trapezoidal method: because the implicit trapezoidal method induces a phase error, the change in phase results in a change in amplitude. This is reflected in Figure 5.10.

Figure 5.10: Implicit trapezoidal method with coupling at t = 20 s for various time steps

For time step ∆t = 0.01 s, the solution is accurate and does not suffer from phase error. For larger time steps, both a phase change and an amplitude variation are observed. It is important to note that, for ∆t = 0.1 s, the solution appears to be already damped at time 20 s. If there were true damping in the solution, it would be greater for larger time steps, which is not seen for ∆t = 1 s; instead, the amplitude at ∆t = 1 s is greater than that of the benchmark solution. It could be argued that the larger ∆t affects the stability of the solution. In order to check this, the solution is plotted at time 100 s in Figure 5.11.

Figure 5.11: Implicit trapezoidal method with coupling at t = 100 s for various time steps

It is observed that the solution remains stable; only a phase error and an amplitude difference are observed for larger time steps.
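The phase error of the trapezoidal rule can be quantified on the scalar oscillatory test equation ẇ = iωw; the short derivation below is added as background for the observations above and in the next subsection (it is a standard textbook result, not taken from the thesis).

```latex
% Trapezoidal rule applied to  w' = i\omega w :
R(i\omega\Delta t) = \frac{1 + i\omega\Delta t/2}{1 - i\omega\Delta t/2},
\qquad |R| = 1,
\qquad \arg R = 2\arctan\!\left(\frac{\omega\Delta t}{2}\right)
              = \omega\Delta t - \frac{(\omega\Delta t)^3}{12} + \dots
% There is no amplitude error, but a phase lag of about (\omega\Delta t)^3/12 per
% step; over a fixed time T (i.e. T/\Delta t steps) the accumulated phase error is
% roughly T\,\omega^3\Delta t^2/12, so it grows quadratically with \Delta t and
% rapidly with the frequency \omega of the mode.
```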
5.2.1 Higher Order Schemes and Impact on Phase Error

A higher order (two-stage, fourth order) symplectic implicit Runge-Kutta scheme has been implemented to analyze the impact of larger time steps on the phase error. These methods are computationally more intensive, but provide better accuracy. Given the coupled set of equations

dq/dt + Lq = 0,

we have

q^(n+1) − q^n = (∆t/2)(K1 + K2),    (5.5)

where

K1 = −L(q^n + (1/4)∆t K1 + (1/4 + √3/6)∆t K2),
K2 = −L(q^n + (1/4 − √3/6)∆t K1 + (1/4)∆t K2).

At every time step, the above system of equations is solved simultaneously for K1 and K2, and the result is then plugged into Equation (5.5). Note that the coefficients are chosen such that the method remains symplectic in nature. The results are shown in Figure 5.12.

Figure 5.12: Higher order Runge-Kutta method at time 100 s

It is observed that for time step ∆t = 0.1 s the solution is accurate and there is no phase error. Compared to the implicit Trapezoidal method, the higher order Runge-Kutta method is more accurate. A phase error is observed when the time step is increased further.

5.2.2 Impact of Initial Conditions

In a real case scenario, the incoming waves are generated by mixing various waves of different frequencies and amplitudes. The initial condition in this case is chosen as:

ζ⁰(x) = 0.25 Σ_{i=0}^{3} cos(2^i · 4πx/L).    (5.6)

For the initial condition with a single wave, the solution was accurate for ∆t = 0.01 s. The solution for the same ∆t = 0.01 s but with the mixed initial condition is plotted in Figure 5.13.

Figure 5.13: Implicit Trapezoidal method with mixed initial conditions, ∆t = 0.01 s

It is observed that the solution is no longer accurate. This can be attributed to the high frequency waves: the phase error of the implicit Trapezoidal scheme is generally proportional to the square of the frequency of the wave.

Instead of a mix of waves, the simulation is also carried out with the single high-frequency initial condition ζ⁰(x) = cos(32πx/L). The result for ∆t = 0.01 s is plotted in Figure 5.14. A phase error is observed even for a single initial wave with high frequency at ∆t = 0.01 s. This implies that for high frequency waves a smaller time step is required to maintain accuracy.

The spatial discretization also influences the solution for high frequency waves. As a rule of thumb, a minimum of 15 grid points per wavelength should be used to accurately represent a wave; in our study, we have used 12.5 grid points per wavelength. Increasing the number of grid points reduces ∆x, and thus results in a lower CFL limit. In general, it is the low frequency waves, with the largest wavelengths, which influence the ships most. The grid spacing ∆x should be chosen to represent these low frequency waves accurately.

Figure 5.14: Implicit Trapezoidal method with high frequency inlet conditions, ∆t = 0.01 s

5.2.3 Impact of Boundary Conditions

Periodic boundary conditions are prone to accumulation of phase error as time grows. In this section, we analyze the impact on the solution when different boundary conditions are used. We use the following:

• ζ(0, t) = 1.0: Dirichlet boundary condition.
• For the numerics, ζ_{N+1} = ζ_N, where N is the node number at the right boundary.
This is also equivalent to a Neumann boundary condition.
• Neumann boundary condition for ϕ.

We again observe a phase error for larger time steps. The difference between the phase error for the periodic boundary condition and the Neumann boundary condition has not been studied.

5.2.4 Summary

A summary of the methods is provided in Figure 5.15.

Figure 5.15: Comparison of various time integration schemes and stability estimates

5.3 Impact of Spatial Discretization

As discussed in Section 5.1.1, wiggles/oscillations are observed in ζ when the initial condition has a discontinuity. The following are the possible causes:

• the spatial discretization;
• the coupling of the equations.

In order to analyze the impact of the coupling on the oscillations, Equation (5.2) is decoupled by setting h = 0 and g = 0. This results in two decoupled wave equations, whose analytical solution is available. Let the initial condition be given by ζ⁰(x) and ϕ⁰(x). Then the solution at (x, t) is given by:

ζ(x, t) = ζ⁰(x − Ut),
ϕ(x, t) = ϕ⁰(x − Ut).

Based on the above equations, if the initial condition has a discontinuity, it should propagate in time without changing shape. The spatial discretization usually impacts the solution of the wave equation in the case of discontinuities or shock waves. Again, the equations are discretized using central discretization and solved using the Matlab benchmark code (ODE15s). The results obtained for ζ are given in Figure 5.16.

Figure 5.16: ζ for the decoupled wave equation after one time period with central discretization and with a discontinuity in the initial condition

Wiggles and oscillations are still present in the solution, which leads us to explore the other possibility, namely the spatial discretization. Whereas central discretization causes oscillations around the discontinuity, the error introduced by upwind discretization can be classified as follows:

• smearing: the length of smearing after the discontinuity;
• phase error.

Figure 5.17: Classification of errors induced by upwind discretization

In Figure 5.17, the analytical solution is shown as the black line, whereas the solution from the upwind discretization is shown in red. Instead of central discretization, the upwind scheme is now used to discretize Equation (5.1). This results in the following discretization stencils (written as [west, center, east] entries):

Sζζ : [ −U/∆x     U/∆x       0 ]
Sζϕ : [ h/∆x²     −2h/∆x²    h/∆x² ]
Sϕζ : [ 0         g          0 ]
Sϕϕ : [ −U/∆x     U/∆x       0 ]

Using Matlab ODE15s and the above discretization stencils, the obtained solution is plotted in Figure 5.18.

Figure 5.18: ζ for the decoupled wave equation after one time period with upwind discretization

In Figure 5.18, we observe that the solution is damped, but no oscillations or wiggles are present. Also, larger damping at x = 0 is observed, which can be attributed to the discontinuity. Damping is a characteristic of upwind discretization and can be removed by adopting higher order or high resolution methods. In order to confirm that the wiggles are generated only because of the central discretization scheme and not because of the coupling, we run the simulation with the upwind discretization and with h = 50 and g = 9.81, which again gives the coupled set of equations.
Again running the simulation with the Matlab ODE15s integrator, we obtain the result shown in Figure 5.19.

Figure 5.19: ζ for the coupled wave equation after one time period with upwind discretization

In Figure 5.19, for the coupled equations, the upwind discretization removes the wiggles and results in a smooth but damped solution. We conclude that the wiggles are generated in the simulation when a discontinuity is present in the initial condition and the central discretization scheme is used. However, this does not impact the stability of the semi-implicit schemes.

5.3.1 Impact of Grid Size on Accuracy

One of the goals of the current project is to make the mesh finer in the regions surrounding a ship or obstacles. As we have seen that varying the time step impacts the accuracy of the solution, we assess the time step requirements for similar accuracy at different grid sizes. Simulations are run with the implicit Trapezoidal scheme, varying the grid spacing and adjusting the time step. As the implicit Trapezoidal method is second order accurate, a reduction in grid size implies a reduction in time step if the CFL number is not changed. The accuracy of the solution then improves as ∆t², and a better benchmark is required to compare the solutions. Again the Matlab ODE15s method is used for benchmarking, now with an absolute error tolerance of 1e−9 instead of the previous 1e−6. The results obtained are given in Table 5.4.

Table 5.4: Error norm for various grid sizes and CFL numbers

  ∆x     CFL number   Error norm
  2      1.8          0.717622
  2      0.9          0.235133
  2      0.4          0.053416
  1      1.8          0.201493
  1      0.9          0.046899
  1      0.4          0.015783
  0.5    1.8          0.011764
  0.5    0.9          0.0063474
  0.5    0.4          0.004944
  0.25   1.8          0.005404
  0.25   0.9          0.005445
  0.25   0.4          0.003001

For the same CFL number, an error reduction from 0.71 to 0.2 is observed when the grid spacing is reduced from 2 to 1. This approximate factor of 4 is attributed to the reduction of the time step by a factor of two and the second order temporal accuracy. Another observation is that, for the combination of a small grid size (∆x = 0.25) and a small ∆t, the error norm does not improve with a further reduction in ∆t.

5.4 Stability Analysis with Vertical Structure ψ Included

Based on the observations in Section 5.1.3, the fully implicit Trapezoidal scheme provides the best stability properties for the test case given by Equation (5.1). We still need to evaluate the impact of the vertical structure ψ, which is given by an elliptic equation and does not include a time derivative term. To assess the impact of this additional variable and equation, the following test case has been considered:

∂ζ/∂t + U ∂ζ/∂x + h ∂²ϕ/∂x² − hD ∂²ψ/∂x² = 0,    (5.7a)
∂ϕ/∂t + U ∂ϕ/∂x + gζ = 0,    (5.7b)
Mψ + hD ∂²ϕ/∂x² − N ∂²ψ/∂x² = 0.    (5.7c)

Three additional discretization matrices are formulated. The model parameters M, N and D are derived from the vertical structure shape. For the linearized parabolic model we obtain

D = (1/3)h,   M = (1/3)h,   and   N = (2/15)h³.    (5.8)

The discretization stencils (written as [west, center, east] entries) are given by:

Sζζ : [ −U/(2∆x)        0                       U/(2∆x) ]
Sζϕ : [ h/∆x²           −2h/∆x²                 h/∆x² ]
Sϕζ : [ 0               g                       0 ]
Sϕϕ : [ −U/(2∆x)        0                       U/(2∆x) ]
Sζψ : [ −h²/(3∆x²)      2h²/(3∆x²)              −h²/(3∆x²) ]
Sψϕ : [ h²/(3∆x²)       −2h²/(3∆x²)             h²/(3∆x²) ]
Sψψ : [ −2h³/(15∆x²)    M + 4h³/(15∆x²)         −2h³/(15∆x²) ]

Periodic boundary conditions for ψ have been assumed. The model parameters h, U and g remain the same as in the case of the two-equation model.

5.4.1 Benchmark

The equations for ζ and ϕ are coupled and represented as a lumped model.
It contains terms from ψ, which is not explicitly known as it depends on ϕ. Let the lumped equation be represented as:

dq/dt + Lq = Cψ,    (5.9)

where

L = [ Sζζ  Sζϕ ]        C = − [ Sζψ ]
    [ Sϕζ  Sϕϕ ]              [ Sϕψ ]

The set of differential algebraic equations given by Equation (5.7) can be solved by the Matlab ODE15s solver, an implicit method capable of solving stiff problems and DAEs. The time derivatives are described as:

[ dq/dt ]   [ −Lq + Cψ        ]
[ dψ/dt ] = [ −Sψϕ ϕ − Sψψ ψ  ]    (5.10)

In order to solve the DAE with Matlab ODE15s, an additional input is required, namely the mass matrix

M = [ I_{2xs×2xs}  0 ]
    [ 0            0 ]   (of size 3xs × 3xs),

where xs is the number of grid points. This singular mass matrix M characterizes the DAE. One has to be careful while choosing the initial conditions for the system of DAEs: a unique solution is obtained only when the initial conditions also satisfy the algebraic equation. We have chosen the initial values of ψ and ϕ such that they satisfy the algebraic equation.

5.4.2 Chorin's Projection Technique

In order to solve this coupled system, a projection method similar to Chorin's projection method [8] has been used. Chorin's projection scheme was originally used to solve the incompressible unsteady Navier-Stokes equations. The key advantage of the projection method is that the computations of the dependent variables are decoupled (pressure and velocity in the case of the Navier-Stokes equations). Typically the algorithm is a two-stage fractional step scheme, a method which uses multiple calculation steps for each numerical time step. The steps are split as follows:

• Ignore the ψ term in the lumped equation, and solve the system dq/dt = −Lq with the implicit Trapezoidal method. Denote the solution after one time step by q*.
• Compute ψ^(n+1) from Sψϕ ϕ + Sψψ ψ = 0 by solving this system of equations, where ϕ is taken from q*.
• Compute q^(n+1) as q^(n+1) = q* + ∆t C ψ^(n+1).

In Figure 5.20, the solution obtained with Chorin's projection scheme for different time steps is compared with the benchmark solution.

Figure 5.20: ζ variation with length at different time steps using ODE15s and Chorin's projection

As the time step is reduced, the solution gets closer to the benchmark solution. As the time step is increased, a larger phase error can be observed. The high error norms of the solution at the larger time steps can be attributed to this phase error.

5.5 Stability Analysis for the Two-Dimensional Case

In order to analyze the stability of the complete system, it is required to study the two-dimensional case, which results in a pentadiagonal matrix instead of a tridiagonal matrix. The following system of equations is considered as the test case:

∂ζ/∂t + Ux ∂ζ/∂x + Uy ∂ζ/∂y + h(∂²ϕ/∂x² + ∂²ϕ/∂y²) − hD(∂²ψ/∂x² + ∂²ψ/∂y²) = 0,    (5.11a)
∂ϕ/∂t + Ux ∂ϕ/∂x + Uy ∂ϕ/∂y + gζ = 0,    (5.11b)
Mψ + hD(∂²ϕ/∂x² + ∂²ϕ/∂y²) − N(∂²ψ/∂x² + ∂²ψ/∂y²) = 0.    (5.11c)

A finite volume method based on central discretization has been used for the spatial discretization of the system of equations. A rectangular grid has been assumed, with the possibility of different grid sizes in the X and Y directions. The discretization matrices are pentadiagonal matrices represented by a 5-point stencil, as described in detail in Chapter 2. An illustrative sketch of assembling such a 5-point stencil matrix with periodic boundaries is given below.
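The sketch is for illustration only: it assembles a generic 5-point stencil with entries [C, W, E, S, N] into a dense matrix with periodic wrap-around in both directions. The actual code stores the five stencil diagonals separately and never forms a dense matrix.

#include <vector>

// Assemble a 5-point stencil on an Nx-by-Ny grid with periodic boundaries in
// both directions (lexicographic ordering; dense storage for illustration only).
std::vector<std::vector<double>> assemble5Point(int Nx, int Ny,
                                                double C, double W, double E,
                                                double S, double N)
{
    const int n = Nx * Ny;
    std::vector<std::vector<double>> A(n, std::vector<double>(n, 0.0));

    auto idx = [Nx](int i, int j) { return j * Nx + i; };   // lexicographic index

    for (int j = 0; j < Ny; ++j)
        for (int i = 0; i < Nx; ++i)
        {
            const int row = idx(i, j);
            A[row][row] = C;
            A[row][idx((i - 1 + Nx) % Nx, j)] += W;   // periodic wrap in x
            A[row][idx((i + 1) % Nx,      j)] += E;
            A[row][idx(i, (j - 1 + Ny) % Ny)] += S;   // periodic wrap in y
            A[row][idx(i, (j + 1) % Ny)]      += N;
        }
    return A;
}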
It is important to consider the boundary conditions while constructing the pentadiagonal matrices. In our case, periodic boundary conditions in both the X and Y directions have been used. The model parameters for the two-dimensional case are given below:

• Lx = 100 m
• Ly = 100 m
• Nx = 50 grid points
• Ny = 50 grid points
• Ux = Uy = 1.0 m/s
• h = 50 m
• maximum simulation time = 50 sec
• initial condition: ζ⁰(x, y) = cos(4πx/Lx)
• ϕ(0, x, y) = 1
• ψ⁰(x, y) = 0

The equations in lumped form for the benchmark solution and for Chorin's projection remain the same. Instead of tridiagonal matrices, we now have pentadiagonal block matrices. The grid is ordered lexicographically in column-major format (similar to Fortran ordering). The stability, accuracy and symplectic nature of the solution from Chorin's projection method are compared with the benchmark solution. The implicit Trapezoidal method has been used as the sub-step of Chorin's projection method.

Figure 5.21: ζ variation using ODE15s vs. Chorin's projection for ∆t = 0.01 s and finer grid

Figure 5.22: ζ variation using ODE15s vs. Chorin's projection for ∆t = 1 s and finer grid

A phase error is observed at larger time steps, though the stability of the system is not impacted. This gives us confidence in proceeding with Chorin's projection scheme for the real case of the interactive waves.

Chapter 6 Iterative Solvers

6.1 Background and Previous Work

In [3] and [1], methods are given to solve the system of equations represented by Equation (2.13b), also given below:

Sψψ ψ = −Sψϕ ϕ = b.

In this section, the solver and the properties of the matrix Sψψ are presented.

6.1.1 Properties of the Discretization Matrix Sψψ

Numerical discretization of Equation (2.3c) gives Equation (2.13b). Sψψ is given by the five-point stencil given in (2.12). When lexicographical ordering of the grid points is used, this leads to a pentadiagonal matrix. Since the model parameters (the functionals D, M and N) are positive [3], the center term in the stencil is positive and the outer elements are negative. The center term is equal to the sum of the absolute values of the outer terms plus one additional term, which is also positive as h > 0 and M0C > 0. For a row of the matrix formed by the five-point stencil, the diagonal entry is given by the central point of the stencil, and the off-diagonal entries are given by the remaining four entries. As the diagonal entry is greater than the sum of the absolute values of the other entries in a row, the matrix is strictly diagonally dominant.

To verify the symmetry of the matrix, one has to compare the outer diagonals with each other. For example, the contribution of the west term of one cell and the east term of its western neighbour should be the same; thus it is required that (∆y/∆x · N̄_W)_{i,j} = (∆y/∆x · N̄_E)_{i−1,j}. In the case of a non-uniform mesh, (∆x)_{i,j} will differ from (∆x)_{i−1,j}, and a different discretization scheme will be required to ensure symmetry. For a uniform mesh, (∆x)_{i,j} = (∆x)_{i−1,j}. With the overbar notation, the requirement can then be rewritten as (N_W + N_C)_{i,j} = (N_C + N_E)_{i−1,j} ⇔ (N_{i−1,j} + N_{i,j}) = (N_{i−1,j} + N_{i,j}), which is clearly satisfied. Applying the same reasoning to the north and south neighbours, we conclude that the matrix is symmetric. A symmetric, strictly diagonally dominant matrix with positive diagonal is positive definite, and special methods like the Conjugate Gradient method can be used to solve Equation (2.13b).
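As a small, purely illustrative check of the two properties argued above, one can test an assembled matrix numerically; dense storage and a fixed tolerance are used here only to keep the sketch short.

#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Check symmetry of a dense matrix (illustration only).
bool isSymmetric(const Matrix& A, double tol = 1e-12)
{
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = i + 1; j < A.size(); ++j)
            if (std::abs(A[i][j] - A[j][i]) > tol) return false;
    return true;
}

// Check strict diagonal dominance: |a_ii| must exceed the sum of the other |a_ij|.
bool isStrictlyDiagonallyDominant(const Matrix& A)
{
    for (std::size_t i = 0; i < A.size(); ++i)
    {
        double offDiagSum = 0.0;
        for (std::size_t j = 0; j < A.size(); ++j)
            if (j != i) offDiagSum += std::abs(A[i][j]);
        if (A[i][i] <= offDiagSum) return false;
    }
    return true;
}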
6.1.2 Preconditioned Conjugate Gradient Method

The Conjugate Gradient (CG) method is an element of the class of Krylov subspace methods. The Krylov subspace is created by repeatedly applying a matrix to a vector. These are iterative methods, i.e., for a starting vector x0, the method generates a series of vectors x1, x2, · · · converging to x = S⁻¹b. A detailed description and derivation of the CG method is provided in [1], [3]. Here we describe the convergence characteristics and the requirements of the method, along with the algorithm pseudo-code.

The CG method can only be applied to matrices which are symmetric and positive definite. Convergence of the CG method for solving the linear system of equations depends on the eigenvalue spectrum of the matrix S. The improvement in the error is bounded in terms of the condition number of the matrix, given by k(S) = λmax/λmin, where λmax and λmin are the maximum and minimum eigenvalues of the matrix S. The higher the condition number, the slower the convergence of the method. It can be derived [3] that for the system represented by the stencil given in (2.12), the number of iterations required for convergence of the CG method is inversely proportional to the mesh size.

Preconditioner

If the condition number k(S) of the matrix S is large, preconditioning is used to replace the original system Sψ − b = 0 by M⁻¹(Sψ − b) = 0, so that k(M⁻¹S) becomes smaller than k(S). In most cases, preconditioning is necessary to ensure fast convergence of the CG method. Algorithm 6.1 provides the pseudo-code for the PCG method.

The choice of the preconditioner is important to achieve a fast solver. Using a preconditioner should ultimately lead to a reduction in work, coming from a reduction in the required number of CG iterations. This implies that the system Mz = r should be relatively cheap to solve in the case of PCG. Martijn [1] worked on two preconditioners and translated them into an effective CUDA solver. We describe the RRB (Repeated Red-Black) preconditioner in brief here. The other preconditioner (Incomplete Poisson) is specific to SPD (symmetric positive definite) matrices, and may not be applicable to non-symmetric matrices (for example the matrix given by the stencil in Equation (2.6)).

6.1.3 Repeated Red-Black (RRB) Ordering as Preconditioner

The RRB method uses a renumbering of the grid points according to a red-black ordering. Red-black ordering is the classic way of parallelizing the Gauss-Seidel iterative solver by removing the dependencies between adjacent cells. What the RRB method basically does is make the following decomposition:

A = LDL^T + R,    (6.1)

where L is a lower triangular matrix, D a block diagonal matrix and R a matrix that contains the adjustments made during the lumping procedure. In a two-dimensional grid, the red nodes are given by the points (x_{i,j}, y_{i,j}) with i + j even, and the black nodes by those with i + j odd. First the black nodes are numbered in lexicographical ordering and then the red points. For the RRB ordering, the red nodes are again split up into red and black nodes, and the procedure is repeated. When the black node elimination is repeated k times, the method is named the RRB-k method.
With an elimination process on the finest level, a stencil is obtained on the next coarser level, which is then simplified by lumping some outer elements (converting the 9-point stencil on the coarser level back to a 5-point stencil). The process of elimination and lumping is then repeatedly applied on the resulting stencil, until the grid is coarse enough to use a direct solver on it.

Algorithm 6.1 Preconditioned CG algorithm to solve Sx = b
Input: x (start vector), b (right-hand side), S (matrix), M (preconditioner), ε (tolerance)
r = b − Sx
solve Mz = r
ρ_new = r^T z
i = 0
while ρ_new > ε² ‖b‖ do
  i = i + 1
  if i = 1 then
    p = z
  else
    β = ρ_new / ρ_old
    p = z + βp
  end if
  q = Sp
  σ = p^T q
  α = ρ_new / σ
  x = x + αp
  r = r − αq
  solve Mz = r
  ρ_old = ρ_new
  ρ_new = r^T z
end while

6.1.4 Implementation of the RRB Preconditioner

Martijn [1] has taken great effort in describing the implementation of the RRB solver, both in the C++ code and in the CUDA kernels. Instead of digressing into minor details, the main points which later impact the implicit time integration are described in brief here.

6.1.5 Effect of the RRB Ordering on the Sparsity Pattern

The matrix S in the system Sψ = b is given by a five-point stencil, see Section 2.2. This results in a matrix whose structure is pentadiagonal. The sparsity pattern changes when the RRB ordering is applied.

Figure 6.1: Sparsity pattern of S ∈ R^{64×64} before RRB ordering [1]

Figure 6.2: Sparsity pattern of S ∈ R^{64×64} after RRB ordering [1]

6.1.6 RRB-k Method

The maximum number of levels of the RRB ordering can be computed from the dimensions of the grid in the x and y directions. However, it is not required to go up to the coarsest level; it is possible to stop at any level of coarsening. This level is named k, hence the RRB-k method. On level k, the nodes are numbered naturally (not in red-black fashion). From these nodes a matrix E is formed, which is a symmetric pentadiagonal matrix. The idea of stopping earlier is that we want to solve the remaining nodes accurately and exactly. In order to keep the computational cost low, it is important to keep the size of the matrix E small. On level k, we have to solve a system of the form Ex = b, where x is formed by the level-k nodes of z and b by the level-k nodes of r in the problem Mz = r. It is possible to use a direct solver (complete Cholesky factorization) or Gaussian elimination for such a small problem.

6.1.7 The Schur Complement and the 9-point Stencil

With the basic red-black numbering, the matrix S can be written in block matrix format. The system Sψ = b can then be written in the form:

[ Db   Sbr ] [ ψb ]   [ bb ]
[ Srb  Dr  ] [ ψr ] = [ br ]

Subscripts b and r denote the black and red nodes respectively. Dr and Db are diagonal matrices, Sbr and Srb are matrices with 4 off-diagonals, and for a symmetric matrix S, Srb = Sbr^T.

Next, the black nodes are eliminated by applying Gaussian elimination. This yields:

[ Db   Sbr                 ] [ ψb ]   [ bb                 ]
[ 0    Dr − Srb Db⁻¹ Sbr   ] [ ψr ] = [ br − Srb Db⁻¹ bb   ]

The matrix S1 := Dr − Srb Db⁻¹ Sbr is called the first Schur complement and is given by a 9-point stencil; the vector b1 := br − Srb Db⁻¹ bb is the corresponding right-hand side. Using the solution of the red nodes, it is then possible to compute the solution of the black nodes. The catch is that instead of the five-point stencil of S, the more expensive 9-point stencil of S1 has to be solved.
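For illustration, a minimal sketch of forming the first Schur complement and the reduced right-hand side for a red-black partitioned system is given below. It uses Eigen dense matrices and exploits that D_b is diagonal; the actual solver of course works stencil-wise and never forms these matrices explicitly.

#include <Eigen/Dense>

// Form S1 = Dr - Srb * Db^{-1} * Sbr and b1 = br - Srb * Db^{-1} * bb for a
// red-black partitioned system (dense, illustration only).
struct ReducedSystem
{
    Eigen::MatrixXd S1;
    Eigen::VectorXd b1;
};

ReducedSystem schurComplement(const Eigen::VectorXd& Db,   // diagonal of D_b
                              const Eigen::MatrixXd& Sbr,
                              const Eigen::MatrixXd& Srb,
                              const Eigen::MatrixXd& Dr,
                              const Eigen::VectorXd& bb,
                              const Eigen::VectorXd& br)
{
    const Eigen::VectorXd DbInv = Db.cwiseInverse();       // D_b is diagonal
    ReducedSystem red;
    red.S1 = Dr - Srb * DbInv.asDiagonal() * Sbr;
    red.b1 = br - Srb * DbInv.cwiseProduct(bb);
    return red;
}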
6.1.8 Constructing the Preconditioning Matrix M

The preconditioning matrix M = LDL^T is constructed in the following four steps:

• Elimination of the black nodes: a 9-point stencil is generated from the 5-point stencil (creation of the Schur complement). In the C++ code all five diagonals are stored separately, whereas in the CUDA code only three of the five diagonals are stored; the west and south stencils are used to access the east and north stencils respectively (utilizing the symmetry of the matrix).
• Lumping from the 9-point stencil to a 5-point stencil for the remaining red nodes. This removes stencil dependencies by adding the respective coefficients to other coefficients.
• Elimination of the first-level red nodes using the lumped 5-point stencil (creating the second-level red nodes with a 9-point stencil).
• Lumping on the second level. The resulting matrix on the coarse grid is pentadiagonal and has the same properties as the original matrix S.

The above procedure is only one RRB iteration. It consists of an elimination S = L1 S1 L1^T, lumping S1 ← S1 + R1, again an elimination S1 = L2 S2 L2^T and lumping S2 ← S2 + R2. Combined we have

S = L1 L2 S2 L2^T L1^T + L1 L2 R2 L2^T L1^T + L1 R1 L1^T = LDL^T + R.

6.1.9 Solving Mz = r

As given in Algorithm 6.1, in each iteration the preconditioning step Mz = r, i.e. LDL^T z = r, has to be solved for z. This is done in three steps:

• La = r is solved using forward substitution. This is done level-wise, going from the finer grid to the coarser grid.
• b = D⁻¹a. As D is a block diagonal matrix, its inverse is easy to compute.
• L^T z = b is solved using back substitution. This is done by going from the coarser grids to the finer grids level-wise.

6.1.10 Efficient CUDA and C++ Code Implementation

In order to boost performance, Martijn has implemented efficient kernels for the following:

• Matrix-vector product.
• Dot product: performing a mass reduction on the GPU and using the Kahan summation algorithm on the CPU. The Kahan summation algorithm greatly reduces the error in floating-point additions (it makes it possible to sum n numbers with an error that only depends on the floating-point precision).
• Vector updates.

These optimized kernels can be reused with minor modifications even if the PCG algorithm is changed.

6.2 Solvers for Non-Symmetric Matrices

As we have seen, most of the implicit time integration methods described in Chapter 4 (except MR-PC) require solving a set of equations in which the matrices have the stencils given in Section 2.2. The matrices involved are not symmetric, in contrast to the matrix given by the stencil Sψψ. For this reason, the PCG method as described in Chapter 3 cannot be applied to solve the linear systems arising from the implicit time integration. The objective is therefore to determine the solution vector of a large, sparse linear system of equations which is not symmetric. An overview of such methods can be found in [6], which forms the basis for this chapter.

As mentioned before in Chapter 3, the idea of some iterative methods is to project an m-dimensional problem onto a lower-dimensional Krylov subspace. Given a matrix A of size m × m and a vector b of size m × 1, the associated Krylov sequence is the set of vectors b, Ab, A²b, A³b, · · ·, which can be computed by matrix-vector multiplications of the form b, Ab, A(Ab), A(A(Ab)), · · ·. The corresponding Krylov subspaces are the spaces spanned by successively larger groups of these vectors.
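As a small illustration of building such a Krylov sequence by repeated matrix-vector products (dense Eigen types, for illustration only):

#include <vector>
#include <Eigen/Dense>

// Build the Krylov sequence {b, Ab, A(Ab), ...} with k vectors; the Krylov
// subspace K_k is the span of these vectors.
std::vector<Eigen::VectorXd> krylovSequence(const Eigen::MatrixXd& A,
                                            const Eigen::VectorXd& b,
                                            int k)
{
    std::vector<Eigen::VectorXd> seq;
    seq.reserve(k);
    Eigen::VectorXd v = b;
    for (int j = 0; j < k; ++j)
    {
        seq.push_back(v);
        v = A * v;          // next Krylov vector
    }
    return seq;
}

In practice these vectors quickly become nearly linearly dependent, which is one reason why the Arnoldi iteration discussed below orthonormalizes them on the fly.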
6.2.1 From Symmetric to Non-Symmetric Matrices

Figure 6.3: Classification of Krylov subspace iterations

Figure 6.3 shows the classification of Krylov subspace methods as we move from symmetric to non-symmetric matrices. The Conjugate Gradient (CG) method results in a tridiagonal orthogonalization of the original matrix, which can be described as A = QTQ^T, where Q is a unitary matrix and T is a tridiagonal matrix. When A is non-symmetric, this result cannot be obtained from a CG iteration. Two approaches can be followed:

• Use of the so-called Arnoldi iteration, a process of Hessenberg orthogonalization. This results in A = QHQ^T, where Q is a unitary matrix and H is an upper Hessenberg matrix (an upper Hessenberg matrix has zero entries below the first sub-diagonal).
• Biorthogonalization methods are based on the opposite choice. If we insist on obtaining a tridiagonal result, then we have to give up the unitary transformations, which gives us a tridiagonal biorthogonalization: A = VTV⁻¹, where V is non-singular but generally not unitary. The term 'biorthogonal' refers to the fact that although the columns of V are not orthogonal to each other, they are orthogonal to the columns of (V⁻¹)^T = (V^T)⁻¹.

Let V be a non-singular matrix such that A = VTV⁻¹ with T tridiagonal, and define W = (V^T)⁻¹. Let vj and wj denote the j-th columns of V and W respectively. These vectors are biorthogonal in the sense that w_i^T v_j = δ_ij, where δ_ij is the Kronecker delta. For each n with 1 ≤ n ≤ m, define the m × n matrices

Vn = [v1 v2 · · · vn],   Wn = [w1 w2 · · · wn].    (6.2)

In matrix form, the biorthogonality can be written as Wn^T Vn = Vn^T Wn = In, where In is the identity matrix of dimension n. The iterations in biorthogonalization methods can be summarized as:

A Vn = Vn+1 T̃n,    (6.3a)
A^T Wn = Wn+1 S̃n,    (6.3b)
Tn = Sn^T = Wn^T A Vn.    (6.3c)

Here Vn and Wn have dimensions m × n, T̃n and S̃n are tridiagonal matrices of dimension (n + 1) × n, and Tn = Sn^T is the n × n matrix obtained by deleting the last row of T̃n. BiConjugate Gradient algorithms are described in Section 6.2.3.

6.2.2 Arnoldi Iteration and GMRES

The Arnoldi iteration can be understood as the analogue of a Gram-Schmidt type iteration for similarity transformations to upper Hessenberg form. It has the advantage that it can be stopped part-way, leaving one with a partial reduction to Hessenberg form that is exploited when dimensions up to n (the dimension of the Krylov subspace) are considered. A simple version of the Arnoldi iteration is given below:

Algorithm 6.2 Arnoldi iteration
b = arbitrary, q1 = b/‖b‖
for n = 1, 2, 3, · · · do
  v = A qn
  for j = 1 to n do
    h_jn = q_j* v
    v = v − h_jn q_j
  end for
  h_{n+1,n} = ‖v‖
  q_{n+1} = v / h_{n+1,n}
end for

The above algorithm can be condensed in the following form:

• The matrices Qn = [q1 q2 · · · qn] generated by the Arnoldi iteration are reduced QR factors of the Krylov matrix:

Kn = Qn Rn,    (6.4)

where Kn is the m × n Krylov matrix.
• The Hessenberg matrices Hn are the corresponding projections: Hn = Qn^T A Qn.
• The successive iterates are related by the formula AQn = Qn+1 H′n, where H′n is the (n + 1) × n upper-left section of H.

The idea of GMRES is straightforward. At step n, the exact solution x* = A⁻¹b is approximated by the vector xn ∈ Kn that minimizes the norm of the residual rn = b − Axn, hence the name: Generalized Minimal Residuals (GMRES).
It was proposed in 1986 and is applicable to a system of equations whose matrix is a general (non-singular) square matrix. The Arnoldi iteration is used to construct a sequence of Krylov matrices Qn whose columns q1, q2, · · · successively span the Krylov subspace Kn. Thus we can write xn = Qn y, where y is the vector such that

‖A Qn y − b‖ = minimum.    (6.5)

Using the similarity transformation, this can be written as:

‖Qn+1 H′n y − b‖ = minimum.    (6.6)

Multiplication by a unitary matrix does not change the 2-norm, thus we can rewrite the above equation as:

‖Q*n+1 Qn+1 H′n y − Q*n+1 b‖ = ‖H′n y − Q*n+1 b‖ = minimum.    (6.7)

Finally, by construction of the Krylov matrices Qn, we have Q*n+1 b = ‖b‖ e1, where e1 = (1, 0, 0, · · ·)^T. Thus we obtain:

‖H′n y − ‖b‖ e1‖ = minimum.    (6.8)

The (unpreconditioned) GMRES algorithm can be written as:

Algorithm 6.3 GMRES
q1 = b/‖b‖
for n = 1, 2, 3, · · · do
  <step n of the Arnoldi iteration, Algorithm 6.2>
  find y that minimizes ‖H′n y − ‖b‖ e1‖ (= ‖rn‖)
  xn = Qn y
end for

In order to find y, a QR factorization can be used, which requires O(n²) flops because of the Hessenberg structure of the matrix H′; here n is the dimension of the Krylov subspace. It is also possible to obtain the QR factorization of H′n from that of H′n−1 by using Givens rotations. One of the disadvantages of the GMRES method is its storage requirement: as it requires storing the whole sequence of Krylov vectors, a large amount of storage is needed compared to the Conjugate Gradient method. For this reason, restarted versions of the method are used, where computational and storage costs are limited by specifying a fixed number of vectors to be generated.

6.2.3 BiConjugate Gradient Methods

The BiConjugate Gradient method (BiCG) is the other extension to non-symmetric matrices. As we saw in the previous section, the principle of GMRES is to pick the vector xn such that the residual corresponding to xn is minimized. The principle of the BiCG algorithm is to pick xn in the same subspace, i.e. xn ∈ Kn, but to enforce that the residual is orthogonal to w1, A*w1, · · · , (A*)^(n−1) w1, where w1 ∈ R^m is an arbitrary vector satisfying w1* v1 = 1. Its advantage is that it can be implemented with three-term recurrences rather than the (n+1)-term recurrences of GMRES (the difference arising from the Hessenberg form of the matrix vs. the tridiagonal form). There are two major problems with the BiCG method:

• Convergence is slower than for GMRES and often erratic. This may also reduce the ultimately attainable accuracy because of rounding errors.
• It requires multiplication with A^T (the transpose) as well as with A. Computing with the transpose brings serialization into the code and is therefore not preferred.

To address these problems, other variants of the BiCG method were developed. One of them is the stabilized BiCG method (Bi-CGSTAB). As for the Conjugate Gradient method, any other Krylov subspace method needs a good preconditioner to ensure fast and robust convergence. The algorithm for the preconditioned Bi-CGSTAB method is given below. One of the future tasks is to adjust this algorithm so as to use the same building blocks as developed by Martijn in [1] for the RRB-k method.

Algorithm 6.4 Bi-CGSTAB: solve the system of equations Ax = b
Compute r0 = b − Ax0 for some initial guess x0.
Choose r̃ (for example r̃ = r0)
for i = 1, 2, 3, · · · do
  ρ_{i−1} = r̃^T r_{i−1}
  if ρ_{i−1} = 0 then the method fails
  if i = 1 then
    p_i = r_{i−1}
  else
    β_{i−1} = (ρ_{i−1} / ρ_{i−2}) (α_{i−1} / ω_{i−1})
    p_i = r_{i−1} + β_{i−1} (p_{i−1} − ω_{i−1} v_{i−1})
  end if
  solve M p̃ = p_i
  v_i = A p̃
  α_i = ρ_{i−1} / (r̃^T v_i)
  s = r_{i−1} − α_i v_i
  check the norm of s; if small enough, set x_i = x_{i−1} + α_i p̃ and stop
  solve M s̃ = s
  t = A s̃
  ω_i = (t^T s) / (t^T t)
  x_i = x_{i−1} + α_i p̃ + ω_i s̃
  r_i = s − ω_i t
  check convergence; continue if necessary
  for continuation it is required that ω_i ≠ 0
end for

6.3 Combined Solver - GMRES/Bi-CGSTAB

We formulate the strategy for the implementation of the solver for the test case discussed in Chapter 4, which can then be benchmarked against the solution from Matlab. We implement Chorin's projection scheme in C++ for the two-dimensional case. As we intend to use the same framework as used in Martijn's code, we should be able to plug Martijn's RRB solver into our code, without any modification, to solve the system of equations represented by the symmetric matrix. Step 1 of Chorin's projection scheme requires a different strategy than the current implementation of the preconditioned Conjugate Gradient method. This is discussed in detail in the following subsections.

For the first step of Chorin's projection scheme, the system of equations to be solved is given by:

q̇ + Lq = 0,    (6.9)

with q = [ζ; ϕ] and q̇ its time derivative. The matrix

L = [ Sζζ  Sζϕ ]
    [ Sϕζ  Sϕϕ ]

is the spatial discretization matrix as discussed in Chapter 2. The implicit Trapezoidal method is used to integrate the system of ODEs:

(I/∆t + βL) q^(n+1) = (I/∆t − (1 − β)L) q^n.    (6.10)

This can also be written as:

[ I/∆t + βSζζ   βSζϕ         ] [ ζ ]^(n+1)    [ I/∆t − (1−β)Sζζ   −(1−β)Sζϕ        ] [ ζ ]^n
[ βSϕζ          I/∆t + βSϕϕ  ] [ ϕ ]        = [ −(1−β)Sϕζ         I/∆t − (1−β)Sϕϕ  ] [ ϕ ]      (6.11)

The system matrix on the left-hand side can be seen as a combination of four block matrices. The diagonal blocks arise from the spatial discretization of the hyperbolic system of equations, which is skew-symmetric in nature. The off-diagonal blocks represent the coupling between the variables ζ and ϕ.

6.3.1 Preconditioners

Each GMRES or Bi-CGSTAB iteration requires a preconditioning step, namely solving a system of equations Mz = r, where r is some residual and M is the combined preconditioner. For the diagonal block matrices, it should be possible to construct a preconditioner similar to the RRB preconditioner of the symmetric pentadiagonal matrix. These blocks are represented as P; let Q represent the off-diagonal blocks in the preconditioner. In general form, the combined preconditioner M is given by:

M = [ Pζζ  Qζϕ ]
    [ Qϕζ  Pϕϕ ]    (6.12)

Various preconditioners can be constructed based on the structure of Q. By setting Q = 0, we obtain:

M1 = [ Pζζ  0   ]
     [ 0    Pϕϕ ]

This is a block Jacobi preconditioner for the combined system. The equation M1 z = r can be written as:

[ Pζζ  0   ] [ zζ ]   [ rζ ]
[ 0    Pϕϕ ] [ zϕ ] = [ rϕ ]    (6.13)

which can be split into solving Pζζ zζ = rζ and Pϕϕ zϕ = rϕ. Similarly, block Gaussian forms of the preconditioner are given by:

M2 = [ Pζζ  0   ]
     [ Sϕζ  Pϕϕ ]    (6.14)

M3 = [ Pζζ  Sζϕ ]
     [ 0    Pϕϕ ]    (6.15)

In order to solve the system of equations M2 z = r, we solve:

[ Pζζ  0   ] [ zζ ]   [ rζ ]
[ Sϕζ  Pϕϕ ] [ zϕ ] = [ rϕ ]    (6.16)

This can be split up into solving Pζζ zζ = rζ and then Pϕϕ zϕ = rϕ − Sϕζ zζ. Similarly we can solve M3 z = r.
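To illustrate how such a block preconditioner is applied, a minimal sketch of the solve M2 z = r (block forward substitution) is given below. Eigen dense types and direct solves stand in for the RRB preconditioner blocks Pζζ and Pϕϕ, purely for illustration.

#include <Eigen/Dense>

// Apply the block Gaussian preconditioner M2 = [Pzz 0; Spz Ppp]:
//   Pzz z_zeta = r_zeta
//   Ppp z_phi  = r_phi - Spz * z_zeta
void applyM2(const Eigen::MatrixXd& Pzz, const Eigen::MatrixXd& Ppp,
             const Eigen::MatrixXd& Spz,
             const Eigen::VectorXd& rZeta, const Eigen::VectorXd& rPhi,
             Eigen::VectorXd& zZeta, Eigen::VectorXd& zPhi)
{
    zZeta = Pzz.partialPivLu().solve(rZeta);
    zPhi  = Ppp.partialPivLu().solve(rPhi - Spz * zZeta);
}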
6.3.2 Preconditioner for the Skew-Symmetric Matrix

As discussed above, it is required to formulate a preconditioner for the diagonal blocks, which are skew-symmetric in nature. We again use the Repeated Red-Black method to construct the preconditioner, following the same strategy as described in Section 6.1.8. In the CUDA code, instead of storing only three diagonals, all five diagonals are now used. In the current implementation by Martijn [1], a clever trick is used which saves memory (Phase 2c of the construction step) by utilizing the symmetry: the matrix S is split into LDL^T and the stencils corresponding to L^T are not stored. In the case of a skew-symmetric matrix, it can be derived that the preconditioning matrix has the form P = LDU with U = −L^T. Hence, we do not need to store the diagonals corresponding to U, and we can use the same preconditioning code in C++ and CUDA as used by Martijn.

Algorithm

In order to construct the preconditioner represented by the matrix of Equation (6.11), the stencils that are part of the four blocks, namely Sζζ, Sϕϕ, Sζϕ and Sϕζ, are copied into the solver class. The impact of the boundary conditions is incorporated while formulating these stencils, which will be discussed later in detail. The copied stencils are modified to represent the left-hand side of Equation (6.11), and then the preconditioning matrices Pζζ and Pϕϕ are constructed. The same algorithm is also implemented in Matlab, and the matrices generated are stored in order to analyze their convergence characteristics, as discussed below.

6.3.3 Impact of Different Preconditioners on the Eigenvalue Spectrum

The effectiveness of a preconditioner for a symmetric matrix can be evaluated from the reduction in the condition number. For non-symmetric matrices, the eigenvalues can also contain imaginary components; thus the condition number cannot be used to determine the convergence properties of a given preconditioner. To understand the convergence behaviour of methods like GMRES and Bi-CGSTAB on a given matrix A, the eigenvalue spectrum is plotted. Let us assume a hypothetical system whose eigenvalue spectrum is given in Figure 6.4.

Figure 6.4: Eigenvalue spectrum of matrix A

A circle is drawn which encapsulates all the eigenvalues. Let the distance of the centre of this circle from the origin be C and its radius be R, as shown in Figure 6.4. The convergence behaviour for the matrix can be related to the ratio R/C: the lower this ratio, the fewer GMRES/Bi-CGSTAB iterations are required. We now study the eigenvalue spectrum of the coupled system given in Equation (6.11). Let

A = [ I/∆t + βSζζ   βSζϕ        ]
    [ βSϕζ          I/∆t + βSϕϕ ]

Then for a sample test case, with ∆x = ∆y = 5, ∆t = 0.1, β = 0.5, h = 50 and Ux = Uy = 1, the spectrum obtained is shown in Figure 6.5.

Figure 6.5: Eigenvalue spectrum of matrix A

In order to analyze the impact of the different preconditioners, the eigenvalue spectrum of M⁻¹A is plotted in the following figures. The inversion of the preconditioner and the matrix-matrix multiplication are carried out in Matlab.

Figure 6.6: Eigenvalue spectrum of matrix M1⁻¹A

Figure 6.7: Eigenvalue spectrum of matrix M2⁻¹A

Figure 6.8: Eigenvalue spectrum of matrix M3⁻¹A

For all the preconditioners, the ratio R/C is computed for two different values of ∆t and tabulated below.
Table 6.1: Convergence characteristics with different preconditioners

  Preconditioner           R/C, ∆t = 0.1   R/C, ∆t = 1.0
  Without preconditioner   0.6             6
  M1                       0.4             4
  M2                       0.18            1
  M3                       0.18            1

It can be seen that the block Jacobi preconditioner M1 hardly impacts the convergence characteristics, whereas the block Gaussian preconditioners M2 and M3 result in a reduction of the R/C ratio by a factor of 3-4 or more. This can be attributed to the coupling between the two governing equations. As discussed before, the maximum eigenvalue of the governing system of equations is determined by the surface gravity wave speed √(gh). We have previously seen for the semi-implicit schemes that ignoring the coupling between the equations for ζ and ϕ does not give a stable solution. Similar behaviour is observed when we choose the block Jacobi preconditioner, as we then ignore the contribution of the off-diagonal blocks, which contain the components representing the coupling, determined by the parameters g and h.

We can also observe that an increase in the time step results in a larger R/C ratio. As the time step is increased, I/∆t becomes smaller and the matrix A loses its diagonal dominance. If we use a large time step, we should therefore observe an increase in the number of GMRES/Bi-CGSTAB iterations.

Chapter 7 C++ and CUDA Implementation

The solver for the implicit Trapezoidal method in C++ and CUDA is first developed for the two-dimensional test case. We employ Chorin's projection scheme as the algorithm to march in time. An important consideration while developing the code has been to keep the modularity of the code as developed by Martijn [1]; this makes it easy to plug different solvers into the code. The kernels developed for the CUDA computations by Martijn have been modified to reflect the requirements of the implicit set of equations. The complete flowchart of the implicit solver for the simplified test problem is given in Figure 7.4. The idea is to benchmark the solver routines, and then use them in the interactive wave program. In this chapter, the implementation of the major steps, which are also used later in the interactive code, is explained in detail.

7.1 General Comments on the Implementation

Throughout the C++ code, the class Array2D is used, a very efficient custom-built C++ class exploiting pointers that lets us store and manipulate 2D arrays in a comparatively cheap manner. The memory structure in C++ is shown in Figure 7.1. Here COLUMNS = Nx1+3 and ROWS = Nx2+3; this includes the ghost layers which are used for the boundary computations.

In the CUDA environment, the arrays from the host (C++) are copied to the device (GPU) and stored in an embedded grid. When the original grid has size Nx1 (x) by Nx2 (y) on the host, on the device it arrives as an Nx2 (x) by Nx1 (y) array. This mirroring effect is due to the storage pattern difference between C++ and the GPU (row-major vs. column-major order respectively). The structure of the grid in CUDA is shown in Figure 7.2. The embedding is carried out by choosing an appropriate BORDER_WIDTH. Martijn [1] has taken a BORDER_WIDTH of 16, which means the data is stored according to the 'optimal 16-spacing' (= 64 bytes), thus allowing access to the elements in a coalesced fashion.

Figure 7.1: Array2D structure in C++ [1]

Figure 7.2: Embedded grid structure in CUDA [1]

The new array is nx[0] by ny[0]. Also a so-called compute-block is indicated. A compute-block is the basic entity on which CUDA performs the computations; it is square, with dimension DIM_COMPUTE_BLOCK: 32 × 32 elements.
The computing area is given by cx[0] by cy[0], which is a multiple of the compute-block. Since the repeated red-black ordering is used in the preconditioning step, the data is restored in an r1/r2/b1/b2 storage format. It is divided into four groups representing the repeated red-black ordering: the r1-nodes (the even/even red nodes), the b1-nodes (the odd/even black nodes), the r2-nodes (the odd/odd red nodes), and the b2-nodes (the even/odd black nodes). "Even/odd" here means that the first coordinate is even and the second coordinate is odd. After the data is copied from the host to the device, it is restored in this r1/r2/b1/b2 storage format. The storage structure is shown in Figure 7.3. More details on the storage format, and on how to access the elements by thread indexing, can be found in [1].

Figure 7.3: r1/r2/b1/b2 storage format [1]

Figure 7.4: Flowchart for the two-dimensional simplified test problem. The flowchart consists of the following steps:
1.1 Construct the pentadiagonal stencils, taking into account the boundary conditions.
1.2 Construct the solver for ζ and ϕ and set up the preconditioners for the skew-symmetric diagonal blocks.
1.3 Construct the solver for ψ and set up the preconditioner for the symmetric matrix.
1.4 Construct the right-hand side stencils for the coupled equation representing ζ and ϕ.
1.5 Initialize the variables ζ, ϕ and ψ.
While the current time ≤ tmax:
  1.6 Compute the RHS for the coupled equation, ignoring the ψ contribution.
  1.7 Solve the coupled system for ζ and ϕ using preconditioned Bi-CGSTAB or GMRES.
  1.8 Compute the RHS for the ψ equation using the above value of ϕ.
  1.9 Solve for ψ using the RRB solver.
  1.10 Update ζ by including the contribution from ψ.
  Increment the time.
When the current time exceeds tmax, stop.

7.2 Construction of the Preconditioner for the Implicit Equation

The spatial discretization matrices are stored in Array2D structures in C++. For example, for Sζζ, which represents the pentadiagonal stencil for the hyperbolic term given in Equation (2.6), the five diagonals are stored in the variables matzz_c, matzz_s, matzz_w, matzz_n and matzz_e, where the subscript zz denotes ζζ and the suffixes c, s, w, n, e denote the center and the directions. In the CUDA implementation, a structure Grid is used which contains the nine stencils as structure members, including the stencils corresponding to the North-East, South-East, North-West and South-West directions. It also has structure members nx, ny, cx and cy, which represent the dimensions of the corresponding embedded grid and compute area respectively. More details can be found in [1].

The stencils are then adjusted for the boundary conditions. As we have used the ghost cell approach in specifying the spatial discretization, it is easy to adjust the stencil at the boundary for the Neumann boundary conditions. The discretization stencil in two dimensions for Sζζ is given as:

Sζζ :  N : V_N/(2∆y)
       W : −U_W/(2∆x)
       C : (U_E − U_W)/(2∆x) + (V_N − V_S)/(2∆y)
       E : U_E/(2∆x)
       S : −V_S/(2∆y)    (7.1)

Say, for example, that at the North boundary the Neumann boundary condition is applicable. Then we set ζ_{i,Nx2+1} = ζ_{i,Nx2}, which translates into the following stencil at the North boundary; here (i, Nx2+1) is the ghost node.
Sζζ (North boundary) :  N : 0
                        W : −U_W/(2∆x)
                        C : V_N/(2∆y) + (U_E − U_W)/(2∆x) + (V_N − V_S)/(2∆y)
                        E : U_E/(2∆x)
                        S : −V_S/(2∆y)    (7.2)

This is programmed in C++ as follows:

Listing 7.1: Stencil construction for Neumann boundary conditions

for (int i1 = 1; i1 <= Nx1; ++i1)
{
    // North boundary
    matzz_c[i1][Nx2] += matzz_n[i1][Nx2];
    matzz_n[i1][Nx2]  = 0;
    // South boundary
    matzz_c[i1][1] += matzz_s[i1][1];
    matzz_s[i1][1]  = 0;
}

In order to construct the preconditioner for the system of equations represented by Equation (6.11), the stencils that are part of the four blocks, namely Sζζ, Sϕϕ, Sζϕ and Sϕζ, are copied into the solver class. The off-diagonal block matrices are symmetric in nature, whereas for the main diagonal block matrices L = −U^T, where L is the strictly lower triangular part and U the strictly upper triangular part of the matrix. Hence, only three diagonals need to be copied into the solver class: South, West and Center. The North and East stencils can then be reconstructed inside the solver class. The corresponding C++ code for the reconstruction of the stencils of the diagonal block matrices is shown below.

Listing 7.2: Stencil reconstruction inside the solver class

for (int jj = 0; jj <= Nx2 + 1; ++jj)
{
    for (int ii = 0; ii <= Nx1 + 1; ++ii)
    {
        matzz_n[ii][jj] = -matzz_s[ii][jj + 1];
        matzz_e[ii][jj] = -matzz_w[ii + 1][jj];
    }
}

Please note that this reconstruction is valid only when the Neumann boundary conditions have already been applied to the stencils. Inside the solver class, the stencils are modified to represent the left-hand side of Equation (6.11). This also involves division by the time step ∆t, which is copied to the solver class. After the modification, the stencils are stored in C++ standard library format (std::vector<Array2D<REAL>>). This allows easy access to the member matrices, and provides modularity to the preconditioner construction and the preconditioning solve steps. The preconditioning matrices Pζζ and Pϕϕ are then constructed by following the algorithm given in Section 6.3.2. The stencils for the right-hand side of Equation (6.11) are constructed similarly, and stored in memory for the computation of the right-hand side through sparse matrix-vector multiplication routines.

The solver for ψ, which is represented by the elliptic equation Sψψ ψ = −Sψϕ ϕ, is constructed in exactly the same manner as in [1].

7.3 Implementation of Chorin's Projection Scheme

The time loop in Figure 7.4 follows Chorin's projection scheme described in Chapter 6.

7.3.1 Sparse Matrix-Vector Multiplication

The right-hand side of Equation (6.11) is computed through sparse matrix-vector multiplication. In Martijn's code [1], this step is carried out while computing the time derivatives for the explicit Leap-Frog method, and the stencils are computed on the fly. In our algorithm it is done differently, as we need to store the matrices in any case; thus we avoid recomputing the stencils at every time step. As we have a coupled system, the computations are done in two parts:

RHSζ = (I/∆t − (1 − β)Sζζ) ζ − (1 − β)Sζϕ ϕ,    (7.3a)
RHSϕ = −(1 − β)Sϕζ ζ + (I/∆t − (1 − β)Sϕϕ) ϕ.    (7.3b)

Computing the right-hand sides for both variables ζ and ϕ requires two sparse matrix-vector multiplications each. The following code snippet shows the sparse matrix-vector multiplication in C++.
The diagonals are represented by cn, ce, cs, cw and cc. They are obtained by accessing the elements of the std::vector<Array2D<REAL>>. The result of the multiplication with X is stored in Y, which has the Array2D structure. OpenMP directives are used for parallel execution on the CPU.

Listing 7.3: Sparse matrix-vector multiplication in C++

#pragma omp for
for (int i1 = 1; i1 < nx1; ++i1)
{
    int i1e = i1 + 1;
    int i1w = i1 - 1;
    for (int j1 = 1; j1 < nx2; ++j1)
    {
        int j1n = j1 + 1;
        int j1s = j1 - 1;
        Y[i1][j1] = cn[i1][j1] * X[i1][j1n] +
                    ce[i1][j1] * X[i1e][j1] +
                    cs[i1][j1] * X[i1][j1s] +
                    cw[i1][j1] * X[i1w][j1] +
                    cc[i1][j1] * X[i1][j1];
    }
}

The computation is done in a similar fashion in CUDA. The grid representing the stencil Sζζ is passed to the CUDA kernel along with the vectors Y and X. A two-dimensional thread indexing is used here, and the computation is done over the compute area shown in Figure 7.2. Please note that this implementation is different from the sparse matrix-vector multiplication on the r1r2b1b2 storage format, which will be explained later in the solve step.

Listing 7.4: Sparse matrix-vector multiplication in CUDA

template <class T>
__global__ void kernel_mat_v1(T *y, const T *x, const Grid &g)
{
    // Computation of the thread ID
    int ld  = g.nx;
    int u   = BORDER_WIDTH + bx * Bx + tx;
    int v   = BORDER_WIDTH + by * By + ty;
    int tid = v * ld + u;
    // -----------------------------------------
    y[tid] = g.cc[tid] * x[tid] +
             g.nn[tid] * x[tid + ld] +
             g.ss[tid] * x[tid - ld] +
             g.ee[tid] * x[tid + 1] +
             g.ww[tid] * x[tid - 1];
}

7.3.2 Solve Step for the Coupled System

This is the backbone of Chorin's projection scheme, and one of the most important steps in the current thesis: it solves the system of equations arising from the implicit Trapezoidal approach. The algorithm has already been discussed. We first present the implementation of the major functions and kernels which are used in the combined solver, whether GMRES or Bi-CGSTAB is used. In the CUDA implementation, the variables ζ and ϕ are restored in the r1r2b1b2 format, as the preconditioning step is involved.

Vector updates

The vector update (AXPY) is a very easy routine to implement. The equivalent routine in C++ is given below.

Listing 7.5: Vector update (axpy) in C++

// Computes y = a*y + b*x
void axpy(Array2D<REAL> &y, const Array2D<REAL> &x, const REAL a, const REAL b)
{
    #pragma omp parallel for
    for (int i1 = 0; i1 < y.dim1(); i1++)
    {
        for (int j1 = 0; j1 < y.dim2(); j1++)
        {
            y[i1][j1] = y[i1][j1] * a + x[i1][j1] * b;
        }
    }
}

In the actual C++ implementation, the function axpy() is split into a separate vector addition and a scalar multiplication, in order to avoid incurring extra FLOPS. In the CUDA implementation, the vector update is carried out over the r1r2b1b2 storage: first the location (thread index) of the respective nodes is computed, and then the vector is updated. In the previous implementation by Martijn [1], the computation was carried out on the red nodes only. Here, as we do not operate on the Schur complement, as discussed before, the computation is carried out over both red and black nodes.
The code snippet is shown below.

Listing 7.6: Vector update (axpy) in CUDA

    template <class T>
    __global__ void kernel_axpy(T *y, const T *x, const T a, const T b,
                                const Grid g)
    {
        int cgx = g.cx;
        int cgy = g.cy;
        int ld  = g.nx;

        int v_r1 = BORDER_WIDTH  + by * By + ty;
        int u_r1 = BORDER_WIDTH  + bx * Bx + tx;
        int v_r2 = BORDER_WIDTH2 + cgy + by * By + ty;
        int u_r2 = BORDER_WIDTH2 + cgx + bx * Bx + tx;

        int u_b1 = BORDER_WIDTH2 + cgx + bx * Bx + tx;
        int v_b1 = BORDER_WIDTH  + by * By + ty;
        int u_b2 = BORDER_WIDTH  + bx * Bx + tx;
        int v_b2 = BORDER_WIDTH2 + cgy + by * By + ty;

        int loc_r1 = ld * v_r1 + u_r1;
        int loc_r2 = ld * v_r2 + u_r2;
        int loc_b1 = ld * v_b1 + u_b1;
        int loc_b2 = ld * v_b2 + u_b2;

        y[loc_r1] = a * y[loc_r1] + b * x[loc_r1];
        y[loc_r2] = a * y[loc_r2] + b * x[loc_r2];
        y[loc_b1] = a * y[loc_b1] + b * x[loc_b1];
        y[loc_b2] = a * y[loc_b2] + b * x[loc_b2];
    }

Let us take the r1 nodes as an example to understand the above code snippet. First the two-dimensional index (u, v) of the r1 node is determined using the structure shown in Figure 7.3. Then the global thread index is computed, which is given by loc_r1 = ld * v_r1 + u_r1, and the vector y is updated. The computation for the r2, b1 and b2 nodes is carried out in the same way.

Dot Products

Two implementations of the dot product (also called the inner product) for vectors of type Array2D<REAL> are provided in the C++ code. The first implementation uses the BLAS library. The second implementation is a generic one, which can be used if the BLAS library is not installed; it uses the OpenMP reduction scheme. This implementation excludes the ghost boundary nodes and computes the dot product over the inner nodes only.

Listing 7.7: Dot product version 2 in C++

    REAL dot_product(const Array2D<REAL> &Vec1, const Array2D<REAL> &Vec2)
    {
        REAL sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i1 = 1; i1 < Vec1.dim1() - 2; ++i1)
        {
            for (int j1 = 1; j1 < Vec1.dim2() - 2; ++j1)
            {
                sum += Vec1[i1][j1] * Vec2[i1][j1];
            }
        }
        return sum;
    }

On the GPU it is difficult to use existing libraries like CUBLAS, as the variables are stored in the r1r2b1b2 format, whereas these library routines require the input vectors as linear one-dimensional arrays. Also, as the majority of the routines for the RRB-solver are already custom built, we do not strive to use the CUBLAS library, but build upon the existing implementation from Martijn [1]. As discussed before, this is done in two steps:

• Do a mass reduction on the GPU over the compute blocks, and store the intermediate results.
• Use Kahan summation on the CPU to add the elements of the intermediate results.

The reader is encouraged to go through Martijn's work [1] to understand the implementation of these two steps. Here we explain the modifications required to compute the dot product over all the nodes, instead of only the red nodes. The change is required in the first step, the mass reduction on the GPU, and can be achieved easily. Each compute block is divided into sub-blocks by introducing a block factor called DOTP_BF.

Listing 7.8: Dot product in CUDA

    template <class T>
    __global__ void kernel_dotp1(T *odata, const T *y, const T *x,
                                 const Grid g)
    {
        int cx = g.cx;
        int cy = g.cy;
        int ld = g.nx;

        int u_r1 = BORDER_WIDTH  + bx * DIM_COMPUTE_BLOCK + tx;
        int v_r1 = BORDER_WIDTH  + by * DIM_COMPUTE_BLOCK + ty;
        int u_r2 = BORDER_WIDTH2 + cx + bx * DIM_COMPUTE_BLOCK + tx;
        int v_r2 = BORDER_WIDTH2 + cy + by * DIM_COMPUTE_BLOCK + ty;

        // Also include the black nodes
        int u_b1 = BORDER_WIDTH2 + cx + bx * DIM_COMPUTE_BLOCK + tx;
        int v_b1 = BORDER_WIDTH  + by * DIM_COMPUTE_BLOCK + ty;
        int u_b2 = BORDER_WIDTH  + bx * DIM_COMPUTE_BLOCK + tx;
        int v_b2 = BORDER_WIDTH2 + cy + by * DIM_COMPUTE_BLOCK + ty;

        int loc_r1 = ld * v_r1 + u_r1;
        int loc_r2 = ld * v_r2 + u_r2;
        int loc_b1 = ld * v_b1 + u_b1;
        int loc_b2 = ld * v_b2 + u_b2;
        int tid    = Bx * ty + tx;

        __shared__ T sm[DIM_COMPUTE_BLOCK * (DIM_COMPUTE_BLOCK / DOTP_BF)];
        T sum = 0;

        for (int k = 0; k < DOTP_BF; ++k)
        {
            sum += y[loc_r1] * x[loc_r1];
            sum += y[loc_r2] * x[loc_r2];
            sum += y[loc_b1] * x[loc_b1];
            sum += y[loc_b2] * x[loc_b2];

            loc_r1 += ld * (DIM_COMPUTE_BLOCK / DOTP_BF);
            loc_r2 += ld * (DIM_COMPUTE_BLOCK / DOTP_BF);
            loc_b1 += ld * (DIM_COMPUTE_BLOCK / DOTP_BF);
            loc_b2 += ld * (DIM_COMPUTE_BLOCK / DOTP_BF);
        }

        sm[tid] = sum;
        __syncthreads();

        for (int k = (DIM_COMPUTE_BLOCK / 2) * (DIM_COMPUTE_BLOCK / DOTP_BF);
             k > 16; k >>= 1)
        {
            if (tid < k)
            {
                sm[tid] += sm[tid + k];
                __syncthreads();
            }
        }

        if (tid < 16) sm[tid] += sm[tid + 16]; __syncthreads();
        if (tid <  8) sm[tid] += sm[tid +  8]; __syncthreads();
        if (tid <  4) sm[tid] += sm[tid +  4]; __syncthreads();
        if (tid <  2) sm[tid] += sm[tid +  2]; __syncthreads();
        if (tid <  1) sm[tid] += sm[tid +  1]; __syncthreads();

        if (tid == 0) odata[by * gridDim.x + bx] = sm[tid];
    }

Sparse matrix vector multiplication over r1r2b1b2

Inside the solve routine, the implementation of the sparse matrix vector multiplication in C++ remains the same as in Listing 7.3. In CUDA, the matrix-vector product is implemented differently, as the r1r2b1b2 storage has to be taken into account. For each r1, r2, b1 and b2 node, first the location is determined. Then the matrix-vector product is computed using the stencil values adjacent to the node.

Listing 7.9: Sparse matrix vector multiplication in CUDA

    template <class T>
    __global__ void kernel_matv1(T *y, const T *x, const Grid g)
    {
        int ld = g.nx;  // leading dimension

        int u_r1 = BORDER_WIDTH  + bx * Bx + tx;
        int v_r1 = BORDER_WIDTH  + by * By + ty;
        int u_r2 = BORDER_WIDTH2 + g.cx + bx * Bx + tx;
        int v_r2 = BORDER_WIDTH2 + g.cy + by * By + ty;

        int u_b1 = BORDER_WIDTH2 + g.cx + bx * Bx + tx;
        int v_b1 = BORDER_WIDTH  + by * By + ty;
        int u_b2 = BORDER_WIDTH  + bx * Bx + tx;
        int v_b2 = BORDER_WIDTH2 + g.cy + by * By + ty;

        int loc_r1 = ld * v_r1 + u_r1;
        int loc_r2 = ld * v_r2 + u_r2;
        int loc_b1 = ld * v_b1 + u_b1;
        int loc_b2 = ld * v_b2 + u_b2;

        const T x_r1    = FETCH_XX(u_r1,     v_r1,     ld);
        const T x_r1up1 = FETCH_XX(u_r1 + 1, v_r1,     ld);
        const T x_r1vp1 = FETCH_XX(u_r1,     v_r1 + 1, ld);
        const T x_r2    = FETCH_XX(u_r2,     v_r2,     ld);
        const T x_r2vm1 = FETCH_XX(u_r2,     v_r2 - 1, ld);
        const T x_r2um1 = FETCH_XX(u_r2 - 1, v_r2,     ld);

        const T x_b1    = FETCH_XX(u_b1,     v_b1,     ld);
        const T x_b1um1 = FETCH_XX(u_b1 - 1, v_b1,     ld);
        const T x_b1vp1 = FETCH_XX(u_b1,     v_b1 + 1, ld);
        const T x_b2    = FETCH_XX(u_b2,     v_b2,     ld);
        const T x_b2vm1 = FETCH_XX(u_b2,     v_b2 - 1, ld);
        const T x_b2up1 = FETCH_XX(u_b2 + 1, v_b2,     ld);

        y[loc_b1] = g.cc[loc_b1] * x_b1    +
                    g.nn[loc_b1] * x_r2    +
                    g.ss[loc_b1] * x_r2vm1 +
                    g.ee[loc_b1] * x_r1up1 +
                    g.ww[loc_b1] * x_r1;

        y[loc_b2] = g.cc[loc_b2] * x_b2    +
                    g.nn[loc_b2] * x_r1vp1 +
                    g.ss[loc_b2] * x_r1    +
                    g.ee[loc_b2] * x_r2    +
                    g.ww[loc_b2] * x_r2um1;

        y[loc_r1] = g.cc[loc_r1] * x_r1    +
                    g.nn[loc_r1] * x_b2    +
                    g.ss[loc_r1] * x_b2vm1 +
                    g.ee[loc_r1] * x_b1    +
                    g.ww[loc_r1] * x_b1um1;

        y[loc_r2] = g.cc[loc_r2] * x_r2    +
                    g.nn[loc_r2] * x_b1vp1 +
                    g.ss[loc_r2] * x_b1    +
                    g.ee[loc_r2] * x_b2up1 +
                    g.ww[loc_r2] * x_b2;
    }

Preconditioning Solve

As discussed in Section 6.3, each iteration requires the preconditioning step, in which M z = r has to be solved for z. For the main diagonal block matrices, the preconditioning matrix can be written as M = LDU. Here L is a lower triangular matrix with unit diagonal. Let Ls be the strictly lower triangular part of L. Then we assign Us = −Lsᵀ, where Us is the strictly upper triangular part of U; the diagonal elements of U are also equal to 1. During the preconditioner construction step, only the strictly lower triangular part Ls and the diagonal matrix D are stored. Solving M z = r can then be done in three steps. Set y := Uz and x := DUz = Dy, then:

• solve Lx = r using forward substitution,
• compute y = D⁻¹x,
• solve Uz = y using backward substitution.

U can be constructed with the help of Us = −Lsᵀ. The implementation of these three steps is described in detail in [1]. Step 1, solving Lx = r, is done grid-wise in two phases, where "grid" refers to a repeated red-black ordering level; we go from the finest grid to the coarsest grid. In the case of the symmetric RRB solver, the CG algorithm operates only on the red nodes; the first-level black nodes are eliminated while constructing the Schur complement. Phase 1 corresponds to updating the r2-nodes using the r1-nodes in the same level. Phase 2 corresponds to updating the r1- and r2-nodes using the b1- and b2-nodes in the same level. Thus, starting with Phase 1, we also end in Phase 1. In the case of the skew-symmetric preconditioning step, as the computation is done over all the nodes, we instead start with Phase 2, in which the r1- and r2-nodes are updated using the b1- and b2-nodes on the finest level. Step 2 of the preconditioning solve is easy, as it only requires division by the diagonals. For the symmetric RRB solver, this division is carried out only for the r1- and r2-nodes on the finest level; in our implementation, it is carried out over all the nodes on the finest level. In Step 3 we go from the coarse grids to the fine grids and backward substitution is used.
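To make the three-step structure concrete, the following minimal C++ sketch solves M z = r for a generic M = LDU with L = I + Ls and U = I − Lsᵀ. It deliberately ignores the level-wise red-black organization and stores Ls densely purely for illustration; the function and variable names are assumptions for this sketch and do not correspond to the actual data structures of the solver.

    #include <vector>

    // Sketch: solve (I + Ls) D (I - Ls^T) z = r in three steps.
    // Ls is the strictly lower triangular part, D the diagonal.
    void prec_solve(const std::vector<std::vector<double>> &Ls,
                    const std::vector<double> &D,
                    const std::vector<double> &r,
                    std::vector<double> &z)
    {
        const int n = static_cast<int>(r.size());
        std::vector<double> x(n), y(n);

        // Step 1: forward substitution, (I + Ls) x = r
        for (int i = 0; i < n; ++i) {
            double s = r[i];
            for (int j = 0; j < i; ++j) s -= Ls[i][j] * x[j];
            x[i] = s;
        }
        // Step 2: diagonal scaling, y = D^{-1} x
        for (int i = 0; i < n; ++i) y[i] = x[i] / D[i];

        // Step 3: backward substitution, (I - Ls^T) z = y
        for (int i = n - 1; i >= 0; --i) {
            double s = y[i];
            for (int j = i + 1; j < n; ++j) s -= (-Ls[j][i]) * z[j];
            z[i] = s;
        }
    }

In the actual solver these substitutions are not carried out on a dense matrix but level by level on the repeated red-black structure, as described next.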
In this backward substitution, Phase 4 corresponds to updating the r1-nodes using the r2-nodes in the same level, and Phase 3 corresponds to updating the b1- and b2-nodes using the r1- and r2-nodes in the same level. The backward substitution starts with Phase 4 and ends with Phase 4, as there is no need to update the black nodes at the finest level. Again, two changes are required for the implementation in the case of the skew-symmetric preconditioning step. In the RRB solver, symmetry (L = Lᵀ) has been used to carry out the computations in Phase 4 and Phase 3; here we employ the skew symmetry using L = (−L)ᵀ. Secondly, our backward substitution ends with Phase 3 instead, where the black nodes at the finest level are also updated using the red nodes at the finest level.

7.3.3 RRB Solver for ψ, and ζ Correction

In order to solve the elliptic equation (2.13b) for ψ, first the right-hand side is computed. In C++ this is done with a simple sparse matrix vector multiplication. The stencil corresponding to Sψϕ has already been stored in the system, and thus computing −Sψϕ ϕ⃗ is easy. In CUDA, the previous routine solvepsi() from Martijn [1] has been used. After computing the right-hand side, the RRB solve function is called; depending on the predefined directives, it either calls the C++ or the CUDA RRBSolver. With the updated value of ψ, ζ can now be updated, which is the last step in Chorin's projection scheme. This step only requires a sparse matrix multiplication and a vector update. Please note that in CUDA the computations outside the solve steps are carried out on the embedded grid shown in Figure 7.2; only for the solve routines are the computations carried out on the r1r2b1b2 structure shown in Figure 7.3.

7.4 Plugging the Solver into the Interactive Waves Code

The solver developed above is then plugged into the Interactive Waves code. As the solver for the test case has been developed only for Neumann boundary conditions, some changes are required to take the impact of Incoming Waves into account. We first discuss the previous implementation of the boundary conditions by de Jong [1] for the RRB solver and then employ the same ideas in the implicit solver.

7.4.1 Incoming Boundary Conditions - Previous Implementation

Dirichlet or Incoming boundary conditions depend on the current time and on fixed parameters such as the wavenumber and frequency, which are inputs to the model. At every time step, the incomingboundary() routine is called to generate the Incoming boundary conditions. Let us see how they are applied to Equation (2.13b). The stencil Sψψ has been constructed with Neumann boundary conditions in mind, whereas during the construction of Sψϕ the Neumann boundary conditions are not applied. First the routine boundaryzetaphi() is called, which is shown below.

Listing 7.10: Dirichlet/Incoming boundary conditions

    if (m_bcN == 0) /* reflective Neumann boundary */
    {
        for (int j1 = 1; j1 <= m_Nx1; ++j1)
        {
            phi[j1][Nx2+1] = phi[j1][m_Nx2]
                           + phi_in_N_ext[j1] - phi_in_N_cen[j1];
        }
    }

In the above code, the Incoming boundary conditions, given by the variables phi_in_N_ext and phi_in_N_cen, are applied to the variable φ. These variables are generated by the incomingboundary() routine. Similarly, the Incoming boundary conditions are applied to the variable ψ.

The right-hand side of Equation (2.13b) is computed with a sparse matrix vector multiplication.
On the North boundary, the stencil Sψϕ takes the following form:

\[
S_{\psi\varphi}:\quad
\begin{bmatrix}
0 & \frac{1}{\Delta y^{2}} h_N D_N & 0 \\[4pt]
\frac{1}{\Delta x^{2}} h_W D_W &
-\Bigl(\frac{1}{\Delta y^{2}} h_N D_N + \frac{1}{\Delta y^{2}} h_S D_S + \frac{1}{\Delta x^{2}} h_W D_W + \frac{1}{\Delta x^{2}} h_E D_E\Bigr) &
\frac{1}{\Delta x^{2}} h_E D_E \\[4pt]
0 & \frac{1}{\Delta y^{2}} h_S D_S & 0
\end{bmatrix}
\qquad (7.4)
\]

Please note that the Neumann boundary conditions are not applied while constructing this stencil. Thus the stencil component at the North boundary, given by (1/∆y²) h_N D_N, is not zero. When the sparse matrix multiplication at the North boundary is carried out, we get the following code:

Listing 7.11: Right-hand side contribution at the North boundary

    for (int j1 = 1; j1 <= m_Nx1; ++j1)
    {
        b_psi[j1][Nx2] -= phi[j1][Nx2]   * S_psi_phi_C[j1][Nx2]
                        + phi[j1][Nx2+1] * S_psi_phi_N[j1][Nx2]
                        + phi[j1][Nx2-1] * S_psi_phi_S[j1][Nx2]
                        + phi[j1+1][Nx2] * S_psi_phi_E[j1][Nx2]
                        + phi[j1-1][Nx2] * S_psi_phi_W[j1][Nx2];
    }

We can see that the value of φ at the ghost node [j1][Nx2+1], which was set in Listing 7.10, is used in Listing 7.11. Now let us see what happens if the Neumann boundary conditions are applied to the stencil Sψψ, which is the stencil on the left-hand side of Equation (2.13b). In that case, for the North boundary, the term Sψψ,N is zero and Sψψ,C′ = Sψψ,C + Sψψ,N, as discussed in Listing 7.1. The matrix vector multiplication then results in:

\[
S_{\psi\psi}\vec{\psi} = S_{\psi\psi,C'}\,\psi[j1][Nx2] + \cdots = (S_{\psi\psi,C} + S_{\psi\psi,N})\,\psi[j1][Nx2] + \cdots \qquad (7.5)
\]

whereas if we have Dirichlet/Incoming boundary conditions instead of Neumann boundary conditions, the sparse matrix multiplication should result in:

\[
\begin{aligned}
S_{\psi\psi}\vec{\psi} &= S_{\psi\psi,C}\,\psi[j1][Nx2] + S_{\psi\psi,N}\,\psi[j1][Nx2+1] + \cdots \\
&= S_{\psi\psi,C}\,\psi[j1][Nx2] + S_{\psi\psi,N}\,\bigl(\psi[j1][Nx2] + \psi.in.N.ext[j1] - \psi.in.N.cen[j1]\bigr) + \cdots \\
&= (S_{\psi\psi,C} + S_{\psi\psi,N})\,\psi[j1][Nx2] + S_{\psi\psi,N}\,\bigl(\psi.in.N.ext[j1] - \psi.in.N.cen[j1]\bigr) + \cdots \\
&= S_{\psi\psi,C'}\,\psi[j1][Nx2] + S_{\psi\psi,N}\,\bigl(\psi.in.N.ext[j1] - \psi.in.N.cen[j1]\bigr) + \cdots
\end{aligned}
\qquad (7.6)
\]

Comparing Equation (7.5) and Equation (7.6), we can easily see that there is a missing term. This has to be taken into account additionally while computing the right-hand side of Equation (2.13b). As Neumann boundary conditions have only been applied to Sψψ, we need to subtract Sψψ,N (ψ.in.N.ext[j1] − ψ.in.N.cen[j1]) from the right-hand side, and similarly for the other directions. This can be seen as a superimposition of Dirichlet boundary conditions over Neumann boundary conditions. A similar method needs to be adopted while computing the right-hand side for the implicit set of equations. This is explained in the next section.

7.4.2 Incoming Boundary Conditions - Current Implementation

Let us consider the system of coupled equations arising from the Implicit Trapezoidal time integration, given in Equation (7.7):

\[
\Bigl(\frac{I}{\Delta t} + \beta L\Bigr) q^{\,n+1} = \Bigl(\frac{I}{\Delta t} - (1-\beta) L\Bigr) q^{\,n} \qquad (7.7)
\]

Here, assume that Neumann boundary conditions have been applied while constructing the matrix L. Now assume that on the North boundary we have Dirichlet (Incoming) boundary conditions, generated by the routine incomingwaves(). In that case, we need to superimpose the Incoming boundary conditions on the Neumann boundary conditions, as was done in the previous section. There is one catch, though. In the previous section, the Incoming conditions were only needed at the next time step. For the Implicit Trapezoidal scheme, we need to account for the Incoming boundary conditions at the previous time step and at the new time step.
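Schematically, and only as a sketch (the exact sign and form of the correction follow the superimposition described around Equation (7.6), and b_in is an assumed name for the collected boundary terms, not a variable in the code), the right-hand side of Equation (7.7) then contains weighted boundary contributions from both time levels:

\[
\Bigl(\frac{I}{\Delta t} + \beta L\Bigr) q^{\,n+1}
= \Bigl(\frac{I}{\Delta t} - (1-\beta)L\Bigr) q^{\,n}
+ (1-\beta)\, b_{\mathrm{in}}^{\,n} + \beta\, b_{\mathrm{in}}^{\,n+1},
\]

where \(b_{\mathrm{in}}^{\,n}\) collects terms of the form \(S_N(\,\cdot\,.in.ext - \,\cdot\,.in.cen)\) evaluated at time level \(n\).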
This is crucial for the stability of the Trapezoidal scheme, as the Incoming boundary conditions depend on time. In order to avoid computing the Incoming boundary conditions twice, the following steps are carried out:

• Introduce a start-up step, which initializes the Incoming waves and also solves for ψ based on the initial values of φ. This step is called before the time loop.
• Inside the time loop, initialize the right-hand side for the coupled implicit equation based on the previous Incoming waves. The correction due to the superimposition of Incoming waves is computed.
• Increment time and call the incomingwaves() routine to compute the Incoming boundary conditions at the new time step.
• Update the right-hand side for the coupled implicit equation by a sparse matrix vector multiplication and the correction due to the superimposition of Incoming waves at the new time step.

Incoming waves have a great influence on the behaviour of the final solution. Devising a proper strategy for handling the boundary conditions has been a very important step in the current thesis for maintaining the stability of the implicit time integration method. We encourage implementing a simplified two-dimensional test case with Incoming boundary waves in the future, so that the required properties can be studied in more detail and in a simplified setting.

Chapter 8

Results - Two Dimensional Test case

In this chapter we discuss the results of the two-dimensional test problem for the C++ and CUDA implementations. We have seen the benchmark results from MATLAB in Section 5.5. There is one difference: the boundary conditions used in Section 5.5 are periodic, whereas the test case built in C++ and CUDA makes use of Neumann boundary conditions. The MATLAB code has been modified to take the Neumann boundary conditions into account by adjusting the stencil formulation. We will directly jump to the important problem-related results, which include a discussion of the role of the different preconditioners, the number of iterations required for GMRES/Bi-CGSTAB, the storage requirements, and the floating point operations. Then we discuss the timing results corresponding to the C++ and the CUDA code.

8.1 Convergence properties

Here we discuss the number of iterations required to solve the coupled set of equations, either by the Bi-CGSTAB or the GMRES algorithm. The number of iterations required depends on the variables given below:

• the structure of the preconditioner
• the accuracy or the solver tolerance
• the time step and the grid size
• the problem size

The other parameters, which are fixed during this section, are:

• Ux = Uy = 1 m/s
• h = 50 m
• g = 9.81 m/s²

8.1.1 Termination Criteria

The number of iterations required to achieve convergence depends on the tolerance chosen. We have used the relative error of the 2-norm of the residual ri with respect to the 2-norm of the initial right-hand side vector b as the termination criterion, i.e. we stop the iterative process when:

\[
\frac{\langle r_i, r_i\rangle}{\langle b_0, b_0\rangle} \leq \mathrm{psitol}^{2} \qquad (8.1)
\]

This criterion is scale invariant, which implies that increasing the number of grid points does not lead to a more stringent criterion. We have fixed the tolerance at 1e−5. The iterative process is stopped either when the termination criterion is satisfied, or when the maximum number of iterations is reached.

8.1.2 Variation with ∆t for various preconditioners

The matrix solved by the linear iterative solvers is given in Equation (6.11).
The diagonal elements contain a 1/∆t term, which means that the diagonal dominance of the matrix reduces as the time step increases. The overall convergence properties also depend on the preconditioner chosen. This has been discussed previously in Section 6.3.2. Here M1 refers to the block-Jacobi preconditioner, and M2 and M3 refer to the two variants of block Gauss-Seidel preconditioners:

\[
M_1 =
\begin{bmatrix}
P_{\zeta\zeta} & 0 \\
0 & P_{\varphi\varphi}
\end{bmatrix},
\qquad
M_2 =
\begin{bmatrix}
P_{\zeta\zeta} & 0 \\
S_{\varphi\zeta} & P_{\varphi\varphi}
\end{bmatrix},
\qquad
M_3 =
\begin{bmatrix}
P_{\zeta\zeta} & S_{\zeta\varphi} \\
0 & P_{\varphi\varphi}
\end{bmatrix}
\]

Figure 8.1: Variation of number of iterations with timestep: Bi-CGSTAB

Figure 8.1 shows the variation of the number of iterations with the time step for the different preconditioners. Here Bi-CGSTAB is used as the iterative solver. The results are close to our expectations, as discussed in Section 6.3.3. While the block-Jacobi preconditioner does not greatly help in reducing the number of iterations required to solve the system, the Gauss-Seidel preconditioners result in a large reduction in the number of iterations.

It is well known that in certain situations the Bi-CGSTAB method suffers from stagnation. In order to analyze this, we implemented the GMRES(m) method, which is the restarted version of the GMRES method discussed in Section 6.2.2. The major drawback of the GMRES method is that the amount of work and storage required per iteration rises linearly with the iteration count. In the restarted version, after a chosen number of iterations m, the accumulated data is cleared and the intermediate results are used as the initial data for the next m iterations. This procedure is repeated until convergence is reached. Here we have chosen m = 30. The results with the GMRES(m) method are shown in Figure 8.2.

Figure 8.2: Variation of the number of iterations with timestep: GMRES(30)

Let us consider the results with preconditioners M2 and M3. The number of iterations required for convergence is higher for GMRES(m), and the difference increases as we take larger time steps. The comparison between the two methods is shown in Figure 8.3.

Figure 8.3: Variation of the number of iterations with timestep: GMRES vs Bi-CGSTAB for preconditioner M2

The results encourage us to proceed with the implementation of Bi-CGSTAB on CUDA. There is a catch here, though. If we look at the time steps below 1 second, the difference in the number of iterations is not very large. Also, from the computational side, the GMRES method only requires one sparse matrix vector multiplication and one preconditioner solve per iteration, whereas the Bi-CGSTAB method requires two sparse matrix vector multiplications and two preconditioner solves per iteration. A good comparison would therefore be the timing results of the two methods. For the CUDA implementation of the implicit solver, we have only considered the Bi-CGSTAB algorithm, as it was relatively easy to switch from the existing CUDA CG solver to a CUDA Bi-CGSTAB solver. As the preconditioner M2 requires the lowest number of iterations amongst the preconditioners compared, we have only considered it for further analysis, though functionality is provided in the code to switch between the various preconditioners.

8.1.3 Variation with time step ratio

The ∆t based on the CFL number depends on the spatial step size. If the spatial step size is reduced, then ∆tCFL will also reduce proportionally. We examined the behavior of the Bi-CGSTAB method with the preconditioner M2, and plotted the required number of iterations for various ∆t/∆tCFL ratios. This is shown in Figure 8.4.
Figure 8.4: Variation of the number of iterations with the ∆t/∆tCFL ratio: Bi-CGSTAB with preconditioner M2

We do this for various step sizes (keeping the number of grid points the same). It is observed that, irrespective of the step size, if the ∆t/∆tCFL ratio is constant, the number of iterations remains constant too.¹ Also, we can see a linear increase in the number of iterations with the time step, although after a certain time step the slope becomes larger. This can be attributed to the impact of the initial guess. If the time step is larger than the wave period, then we lose information about the wave. In this case, the initial guess for the iterative solver, which is the solution from the previous time step, may not be good enough. Further research can be carried out on determining a suitable initial guess by extrapolating values from previous time steps. Although this would lead to extra storage requirements, it would be worth evaluating the trade-off between storage and efficiency for such a case.

¹ This is not presented in this report.

8.1.4 Impact of the problem size

To analyze the impact of the problem size, we have chosen a square grid with N = 32, 64, 128, 256, 512, 1024, where N is the number of unknowns in both the x- and y-direction. The step size is fixed at ∆x = 5 m, and ∆t = 1 s. The results are gathered in Figure 8.5.

Figure 8.5: Variation of number of iterations with problem size

We can see that the iteration count is almost constant for N > 256. We have checked this for both the GMRES and Bi-CGSTAB methods. The realistic test cases, like the River IJssel or Plymouth Sound, have larger problem sizes, which makes this observation even more attractive for them: we can expect the required number of iterations for convergence to be constant for realistic problem sizes. There is a catch here. In the two-dimensional test case, the waves occupy the complete grid from the initial conditions onwards. The solver then has to compute the waves at all grid points from the first time step. This is different for the realistic case. Initially the sea is in a steady-state position, and then the waves are introduced through incoming boundary waves. The required number of iterations may therefore vary with time as the waves fill the grid. We will discuss this in detail in the next chapter.

8.2 Bi-CGSTAB vs CG: Memory requirement and Operation Count

In this section, we discuss the extra memory and operation count required in converting the symmetric RRB solver, based on preconditioned CG, to the non-symmetric RRB solver based on preconditioned Bi-CGSTAB.

8.2.1 Memory requirement: C++

In C++, computing the memory requirement is straightforward, as we do not need to allocate memory in a different storage scheme as in CUDA. The storage requirement for the preconditioned CG method is given in Table 8.1 [1]. Here n = Nx1 × Nx2 is the total number of grid points.

Table 8.1: Storage requirements for the preconditioned CG algorithm

Variable   Type     Description        Memory
A          Matrix   System matrix      5n
P          Matrix   Preconditioner     5n
b          Vector   Right-hand side    n
x          Vector   Solution           n
r          Vector   Residual           n
p          Vector   Search direction   n
q          Vector   q = Ap             n
z          Vector   z = M⁻¹r           n
                    Total              16 × n

In the Bi-CGSTAB C++ solver, we need to allocate memory for all system matrices, preconditioners, and for the variables corresponding to both ζ and ϕ. Table 8.2 shows the memory requirement of the Bi-CGSTAB solver.
Table 8.2: Storage requirements for the Bi-CGSTAB algorithm

Variable              Type     Description        Memory
Sζζ, Sζϕ, Sϕϕ, Sϕζ    Matrix   System matrices    5n × 4
Pζζ, Pϕϕ              Matrix   Preconditioners    5n × 2
bζ, bϕ                Vector   Right-hand side    n × 2
xζ, xϕ                Vector   Solution           n × 2
Rζ, Rϕ                Vector   Residual           n × 2
R̃ζ, R̃ϕ                Vector   Initial residual   n × 2
Pζ, Pϕ                Vector   p                  n × 2
P̂ζ, P̂ϕ                Vector   M⁻¹p               n × 2
Sζ, Sϕ                Vector   s                  n × 2
Ŝζ, Ŝϕ                Vector   M⁻¹s               n × 2
Tζ, Tϕ                Vector   t                  n × 2
                               Total              48 × n

We can see that there is a three-fold increase in the memory required. When the PCG method is used with the explicit Leap-Frog scheme, the other stencils are not stored but computed on the fly. Let us discuss whether we really need to store the stencils for the Bi-CGSTAB (or GMRES) method. The stencils Sζζ and Sϕϕ are used to compute the preconditioner, and they have to be stored in memory. It would be possible to store only three diagonals and use skew-symmetry, but that makes the code harder to read. The stencils Sζϕ and Sϕζ are used in the Bi-CGSTAB algorithm when computing the matrix-vector product. As this operation is required at every iteration (we discuss the required number of operations in the next section), we have decided to store these stencils in memory instead of computing them at every iteration. In the previous implementation by Martijn [1], these stencils were constructed only once per time step, whereas in our case they are required at least twice per iteration of Bi-CGSTAB or GMRES. It would be possible to exploit the symmetry of the stencil Sζϕ and the diagonal nature of the stencil Sϕζ to reduce the memory requirement from 48n to 42n.

8.2.2 Memory requirement: CUDA

In this section, we give an indication of the memory requirement for the Bi-CGSTAB solver on the GPU and compare it with the CG requirements.

In the CUDA implementation, we have seen the use of an embedded grid and the r1r2b1b2 storage format. It can be argued that the embedding results in a larger memory requirement, but that is only relevant for small problems: the embedding width is 16 grid points, which is not significant for the larger problems that we intend to solve. For storing the preconditioners, which are constructed with the repeated red-black ordering scheme, extra storage is required, as we need to store the grid at each RRB level. At every level, the number of grid points is reduced by a factor of 4. Thus we end up with a geometric series starting with n and with ratio 1/4. The total memory requirement (in the ideal case of a very large initial grid) is given by:

\[
\text{Total memory required} = \sum_{i=0}^{\infty} \frac{n}{4^{i}} = \frac{4n}{3} \qquad (8.2)
\]

For smaller problems, the ratio will be higher than 4/3 due to the embedded grid. The list of data objects required in the PCG algorithm is given below:

Table 8.3: Storage requirements for the preconditioned CG algorithm in CUDA

Variable         Type                      Storage format          Memory
*m_prec          preconditioner stencils   repeated r1/r2/b1/b2    5 × 4n/3
*m_orig          original stencil          single r1/r2/b1/b2      5 × n
*m_dX            vector x                  standard                n
*m_dB            vector b                  standard                n
*m_dXr1r2b1b2    vector x                  single r1/r2/b1/b2      n
*m_dBr1r2b1b2    vector b                  single r1/r2/b1/b2      n
*m_dYr1r2b1b2    vector y                  single r1/r2/b1/b2      n
*m_dPr1r2b1b2    vector p                  single r1/r2/b1/b2      n
*m_dQr1r2b1b2    vector q                  single r1/r2/b1/b2      n
*m_dRr1r2b1b2    vector r                  single r1/r2/b1/b2      n
*m_dZr1r2b1b2    vector z                  repeated r1/r2/b1/b2    4n/3
*m_drem          for dot products          standard                n
                                           Total                   23 × n

The vectors x and b are stored twice, in different storage formats.
Compared to the C++ PCG implementation, there is an increase in the storage requirement of at least 1.4 times (23n versus 16n).

We now tabulate the storage requirements for the Bi-CGSTAB method in CUDA. In Table 8.4, the nomenclature 'lex' represents the lexicographic or standard ordering.

Table 8.4: Storage requirements for the preconditioned Bi-CGSTAB algorithm in CUDA

Variable                   Type                      Storage format          Memory
*m_precζζ, *m_precϕϕ       preconditioner stencils   repeated r1/r2/b1/b2    2 × 5 × 4n/3
*m_origζζ, *m_origϕϕ       original stencils         single r1/r2/b1/b2      2 × 5 × n
*m_origζϕ, *m_origϕζ       original stencils         single r1/r2/b1/b2      2 × 5 × n
*m_dXζ-lex, *m_dXϕ-lex     vector x                  standard                2 × n
*m_dBζ-lex, *m_dBϕ-lex     vector b                  standard                2 × n
*m_dXζ, *m_dXϕ             vector x                  single r1/r2/b1/b2      2 × n
*m_dBζ, *m_dBϕ             vector b                  single r1/r2/b1/b2      2 × n
*m_dYζ, *m_dYϕ             vector y                  single r1/r2/b1/b2      2 × n
*m_dRζ, *m_dRϕ             vector r                  single r1/r2/b1/b2      2 × n
*m_dR̃ζ, *m_dR̃ϕ             vector r̃                  single r1/r2/b1/b2      2 × n
*m_dVζ, *m_dVϕ             vector v                  single r1/r2/b1/b2      2 × n
*m_dPζ, *m_dPϕ             vector p                  single r1/r2/b1/b2      2 × n
*m_dP̂ζ, *m_dP̂ϕ             vector p̂                  single r1/r2/b1/b2      2 × n
*m_dTζ, *m_dTϕ             vector t                  single r1/r2/b1/b2      2 × n
*m_dŜζ, *m_dŜϕ             vector ŝ                  single r1/r2/b1/b2      2 × n
*m_dtemp                   vector temp               single r1/r2/b1/b2      n
*m_dZζ, *m_dZϕ             vector z                  repeated r1/r2/b1/b2    2 × 5 × 4n/3
*m_drem                    for dot products          standard                n
                                                     Total                   ≃ 72 × n

Clearly, the CUDA implementation of Bi-CGSTAB puts extra strain on the memory requirement compared to the CG implementation. Martijn [1] reports that the realistic test problem of Plymouth Sound described in Chapter 3 requires around 180 MB of storage (with 1.5 million grid points). With the Bi-CGSTAB implementation, this number will increase roughly three times, to 540 MB, plus the requirement of the CG solver itself. This is still under the global memory limit of 3048 MB of the GTX780 card.

8.2.3 Operation Count

We compare the floating point operations (flops) required for the preconditioned CG and the preconditioned Bi-CGSTAB methods. The work is split into dot products, vector updates, matrix vector multiplications and the preconditioning solve step. Let n be the size of the problem and i the number of iterations required for convergence. The flops per operation are:

• dot product: n − 1 summations and n multiplications ≈ 2n
• vector update: n summations and n multiplications = 2n
• sparse matrix vector multiplication: 5n multiplications and 4n summations = 9n
• preconditioning, i.e. solving LDLᵀz = r: 2n multiplications and 2n summations for each of the lower and upper triangular solves, and n multiplications for the division by the diagonal = 9n

Table 8.5 gives the flop count for the PCG algorithm. The second column 'Count' specifies the number of times each operation is called.

Table 8.5: Operation count for the preconditioned CG algorithm

Operation               Count    Flops per operation
Dot product             2i + 1   2n
Vector update           3i + 1   2n
Matrix vector product   i + 1    9n
Preconditioning solve   i        9n
Total flops:            n(28i + 13)

In the case of the Bi-CGSTAB algorithm, as the computation is done for both ζ and ϕ, we have twice the operations of a regular Bi-CGSTAB method. This is reflected in the column 'Count' of Table 8.6, where the dot products and vector updates are multiplied by 2. The matrix vector product and the preconditioning solve require special attention. Let the coupled matrix be given as

\[
A =
\begin{bmatrix}
S_{\zeta\zeta} & S_{\zeta\varphi} \\
S_{\varphi\zeta} & S_{\varphi\varphi}
\end{bmatrix}.
\]
Then the matrix vector multiplication with the vector \(\vec{x} = (\vec{\zeta}, \vec{\varphi})^{T}\) is given by:

\[
A\vec{x} =
\begin{bmatrix}
S_{\zeta\zeta}\vec{\zeta} + S_{\zeta\varphi}\vec{\varphi} \\
S_{\varphi\zeta}\vec{\zeta} + S_{\varphi\varphi}\vec{\varphi}
\end{bmatrix}
\qquad (8.3)
\]

We require two matrix vector multiplications and one vector summation for each of the variables ζ and ϕ, resulting in 19n flops per variable.

The flop count for the preconditioning solve of the coupled system depends on the preconditioner chosen. Here we study the preconditioner M2, which has the structure:

\[
M_2 =
\begin{bmatrix}
P_{\zeta\zeta} & 0 \\
S_{\varphi\zeta} & P_{\varphi\varphi}
\end{bmatrix}
\qquad (8.4)
\]

Solving M2 z = r requires the following:

• two preconditioning solve operations, which require 9n flops each = 18n
• one matrix vector product = 9n
• one vector summation = n

resulting in a total of 28n flops.

Table 8.6: Operation count for the preconditioned Bi-CGSTAB algorithm

Operation               Count           Flops per operation
Dot product             2 × (5i + 1)    2n
Vector update           2 × (6i + 1)    2n
Matrix vector product   2 × (2i + 1)    19n
Preconditioning solve   2i              28n
Total flops:            n(176i + 46)

The flop count of Bi-CGSTAB is roughly 6 times higher than that of the CG algorithm. This makes sense, as we now solve a system of equations which has twice the number of variables and is coupled in nature. The extra computational effort gives us the flexibility of choosing a time step irrespective of the CFL number. Also, the Bi-CGSTAB algorithm has the advantage that the number of iterations i required to solve the system does not depend on the problem size. We will discuss the impact of the problem size and the time step on the required computation time for both the Bi-CGSTAB and the CG solver in the next sections.

8.3 Timing Results

In this section, we discuss how to establish the efficiency of the different solvers. Our aim is to perform the simulations at time steps larger than what the CFL limitation allows in the explicit time stepping case. This will allow us to carry out simulations for finer, larger and non-uniform meshes in real time. The following four quantities are defined:

• Implicit solver time: the time required by the Bi-CGSTAB implicit solver to compute one time step. The simulation is run for 100 seconds of real time, and an average is computed. This is reported in milliseconds (ms). For example, if the time step is 0.1 seconds, then 1000 time steps are required to compute the 100 seconds of real time, and the average over these 1000 steps is taken as the Implicit solver time.
• CG solver time: the time required by the RRB-k solver developed by Martijn [1] for the symmetric system of equations to carry out the simulation for one time step, analogous to the Implicit solver time described above.
• Additional time: the time required by the other functions (computing the right-hand side etc.). Again, an average is taken.
• Total time: this is computed differently. The Total time is the time required to carry out the simulation for 100 seconds of real time; if the simulation is carried out for a different duration, this is mentioned. It is reported in seconds (s).

We first analyze the Implicit solver time for the C++ implementation. To get insight into the computational cost of the various functions involved, a profiler like Valgrind can be used. It is possible to generate so-called callgrind files, which can thereafter be studied using a package called KCachegrind. These profilers are available as open source. It is not possible to profile GPU tasks with Valgrind, though, and it gives unreliable results when OpenMP is used. We have therefore used Valgrind for the C++ case without the OpenMP implementation.
It was observed that most of the time is consumed by the sparse matrix vector multiplication, which is required both for the preconditioning solve and for the matrix vector products. Later on, we profile the CUDA implicit solver with a different profiler, called nvprof.

The following tables show the timing results for the test problem in two dimensions. The grid step size is taken as ∆x = ∆y = 5 m. The relative tolerance is set at 1e−5. Single precision has been used for this test study. For the real test cases, double precision will be used, as some errors creep into the simulation after simulating for longer times with single precision. It should be possible to remove this discrepancy; this is part of the future work. We run the code in C++ and CUDA for various problem sizes and time steps. Furthermore, the C++ runs are divided into runs with and without the OpenMP implementation. The algorithm chosen for solving the implicit set of equations is Bi-CGSTAB. In the tables below, the column 'Iterations' refers to the iterations required for the convergence of the Bi-CGSTAB algorithm, and the remaining columns are the time quantities defined above. The Total time refers to the runtime for 100 seconds of real-time simulation, as discussed before.

Table 8.7: Timing results for C++ with a single core. Nx = 240 and Ny = 120

∆t     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
0.05   1            29.79            6.19                        7.31
0.1    1            15.69            7.19                        7.10
0.25   3            9.61             15.54                       6.97
0.5    6            7.62             29.49                       6.87
1      13           7.55             66.54                       6.79

Table 8.8: Timing results for C++ with OpenMP (4 cores). Nx = 240 and Ny = 120

∆t     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
0.05   1            13.36            2.24                        3.65
0.1    1            7.74             3.33                        3.60
0.25   2            3.66             4.97                        3.43
0.5    6            2.95             10.50                       3.42
1      13           2.74             22.83                       3.35

Table 8.9: Timing results with CUDA. The CG solver is still in C++ with OpenMP.² Nx = 240 and Ny = 120

∆t     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
0.05   1            13.33            2.20                        3.61
0.1    1            7.60             3.05                        3.54
0.25   3            3.85             4.67                        3.77
0.5    7            2.95             10.06                       3.37
1      16           2.43             19.66                       3.58

The following observations can be made from the above tables:

• There is a gain of three times in the Implicit solver time when we go from a single core to a four-core processor. This corresponds to a parallel efficiency of 75% for the Bi-CGSTAB algorithm.
• Only a marginal speed-up of the CUDA routines compared to the C++ routines with OpenMP is observed. This can be attributed to two reasons:
  – The size of the problem (number of grid points). For smaller problems, the communication time between the host (CPU) and the device (GPU) can become the bottleneck. This is not a concern for the OpenMP parallelization, as it uses a shared-memory architecture and the communication time is minimal.
  – The number of iterations required for convergence in CUDA is slightly higher than for the C++ implementation.
• The total simulation time decreases up to a time step of 1 second, and then it stagnates. As discussed in the previous sections, the number of iterations required for convergence of Bi-CGSTAB increases linearly with the time step, and after a certain time step the slope of this curve starts increasing. This increases the resulting computation time, leading to stagnation.

In order to check the impact of the problem size on the efficiency of the solvers, we ran the code for different problem sizes, keeping the time step constant at 0.5 seconds.

² The CUDA CG solver for the two-dimensional test case has not been implemented.
Complete integration of the two CUDA solvers has been done only for the real test case and the non-uniform grid case.

Table 8.10: Timing results in C++ without OpenMP. ∆t = 0.5 s. Variation with problem size

Nx × Ny     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
240x120     6            7.56             29.24                       6.78
480x240     5            26.90            100.88                      26.83
960x480     4            106.87           380.93                      123.24

Table 8.11: Timing results in C++ with OpenMP. ∆t = 0.5 s. Variation with problem size

Nx × Ny     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
240x120     6            2.97             10.42                       3.38
480x240     5            12.15            43.21                       13.81
960x480     4            45.16            142.32                      64.95
1024x1024   4            116.874          344.612                     189.287
2048x2048   4            517.471          1281.47                     1113.57

Table 8.12: Timing results with the CUDA implicit solver. ∆t = 0.5 s. Variation with problem size

Nx × Ny     Iterations   Total time (s)   Implicit solver time (ms)   CG solver time (ms)
240x120     7            2.95             10.06                       3.37
480x240     6            6.82             13.91                       14.73
960x480     6            21.65            24.38                       67.62
1024x1024   5            58.14            42.31                       190.61
2048x2048   5            279.12           122.73                      1093.67

The real impact of the CUDA implicit solver can now be seen. For the small problem size of 240x120 grid points, we did not see any improvement, but a speed-up of 10 times is observed for the problem size of 2048x2048 points. The speed-up graph is shown below.

Figure 8.6: Speedup of the CUDA implicit solver with respect to the C++ solver with OpenMP implementation

While we obtain a speed-up of 10 times in the Implicit solver time for increasing problem size, the speed-up in the Total time saturates. The RRB solver implementation in C++ now becomes the limiting factor, instead of the implicit solver computation. We will observe the impact of the problem size with the CUDA RRB solver implementation in the next chapter.

8.4 CUDA profiling

As we analyzed the C++ code with the Valgrind tool to identify the bottlenecks, we do the same with the CUDA code using Nvidia's nvprof command line utility. We carried out the simulation for the test case with the problem size of 2048x2048, which allows focusing on the compute-intensive parts of the program. A time step of 0.5 seconds is chosen, which is around 5 times the time step based on the CFL condition; the number of iterations required for this case is 5. To limit the profiling to the Bi-CGSTAB algorithm, we have used the CUDA functions cudaProfilerStart() and cudaProfilerStop() to start and stop the profile data collection. The timing results obtained from the profiler show the distribution over all the compute kernels (or functions). The kernels can then be grouped into four major operations: preconditioning solve, matrix vector multiplication, vector update and dot product. The results are shown in the following table. The column 'Time' represents the actual time spent in these functions for a simulation of 100 seconds of real time. This excludes the time spent on communication between the host and the device.

Table 8.13: Profiling of the Bi-CGSTAB algorithm implementation

Function                       Time (s)   Time (%)
Preconditioning solve          6.543      36.18%
Matrix vector multiplication   4.952      27.39%
AXPY (vector update)           4.807      26.59%
Dot product                    1.780      9.84%

The above table provides insight into the time-consuming steps. When the tool was used initially, it showed a lot of time being spent in CUDA memcpy operations. The implementation at that time initialized many variables repeatedly, and also required extra copying in the preconditioning step.
Following the profiling results, it was possible to remove such extra time-consuming steps and optimize the code.

The test code allowed us to study the mathematical and computational aspects of solving the implicit set of equations. We will now perform simulations for the real test cases and provide the timing results in the next chapter.

Chapter 9

Results - Real Test case

In this chapter we discuss the results for the real test cases of the River IJssel and Plymouth Sound. The analysis has been carried out over various problem sizes and time steps. CUDA has been used for the routines of both the Bi-CGSTAB and CG solvers, along with the computation of the other functions (right-hand side computation etc.). Before discussing the performance of the implicit solver, we present the stability and accuracy of the results for various time steps ∆t. We have used Chorin's projection scheme, as discussed in the previous chapters, to perform the time integration. It is of interest to see whether we can perform the simulations without any constraint on the time step, as this was the intention with which we started our study. We have used a smaller test case of the Plymouth Sound, with the dimensions Nx × Ny = 240 × 120 and ∆x = ∆y = 5 m. The accuracy is checked by observing the plots visually and by computing the Froude force (Fext); the details of the computation of this force are not reported here. If the simulations are carried out for a fixed period of time (say 100 seconds), then the Froude force at the end of the 100 seconds should not change. We already know that implicit methods at larger time steps suffer from phase lag. We will try to analyze the differences in Froude force for various time steps. The following three figures show a snapshot after 40 seconds of real-time simulation for various time steps. The CFL condition dictates that the time step ∆t should be less than or equal to 0.15 seconds for the explicit methods.

Figure 9.1: Snapshot at t = 40 seconds, ∆t = 0.2 s
Figure 9.2: Snapshot at t = 40 seconds, ∆t = 0.4 s
Figure 9.3: Snapshot at t = 40 seconds, ∆t = 1.0 s

The following observations can be made from the above figures:

• The solution remains stable for time steps greater than ∆tCFL. We checked the stability in detail by running the simulation over 1000 seconds, and by testing other problem sizes. In the case of the explicit simulator, a time step greater than 0.15 seconds cannot be used, as it results in an unstable solution.
• As we increase the time step, the solution behind the ship and near the boundary looks distorted. This is not the impact of the phase error, and can be attributed to the following:
  – As the time step is increased, the information about the interaction of the ship with the waves at intermediate times is lost. For example, take ∆t = 1 s. The ship's velocity is 15.5 m/s, which implies that the ship moves 15.5 m (three grid points here, as ∆x = 5 m) in one time step, and we lose the information of the ship's interaction with the waves at the intermediate grid points.
  – The incoming boundary conditions are computed using predefined time periods and amplitudes. A time step should be chosen which can resolve the wave form of the incoming boundary waves. As a rule of thumb, one time period should be covered by 10-15 time steps. For the current simulation, the minimum time period is around 4 seconds. In order to capture this time period, we need the time step to be less than or equal to 0.4 seconds.
We will see later, for the case of the Plymouth Sound, that even with a time step of 0.4 seconds we can get slightly inaccurate results.

Based on the above results, we limit the time step to 0.4 seconds for our discussion of the timing and code profiling. Please note that this time step limitation is due to physical considerations, and not due to the numerical restrictions imposed by the CFL condition for the explicit methods. The physical limitations remain the same irrespective of the spatial step size. The timing results are presented differently from the results discussed for the two-dimensional test case in the previous chapter. Our intention is not to compute the speed-up between the C++ and CUDA code, but to understand the behaviour of the implicit code when the time step and the problem size are varied, and to compare it with the explicit scheme. The four quantities to quantify time are re-defined here:

• Implicit solver time: the time required by the Bi-CGSTAB implicit solver to carry out the simulation for 100 seconds. This is reported in seconds (s).
• CG solver time: the time required by the RRB-k solver developed by Martijn for the symmetric system of equations to carry out the simulation for 100 seconds.
• Additional time: the time required by the other functions (computing the right-hand side etc.).
• Total time: the time required to carry out the simulation for 100 seconds. This is the sum of the above times.

We have switched from single precision to double precision floating point numbers for the CUDA simulation. It is part of the future work to understand why single precision does not provide stable results for the current CUDA implicit solver; the areas that require attention are the sparse matrix vector multiplication with the r1r2b1b2 ordering and the preconditioning solve function. The implicit solver implemented in C++ works fine in single precision.

9.1 Variation with time step

In order to see the impact of the time step on the accuracy and the speed of the solver, we have taken the problem of the river IJssel with 0.5 and 1 million nodes. The ∆t based on the CFL condition is 0.12 seconds. In this case, the incoming boundary waves are based on a deterministic model with a maximum frequency of 0.8 rad/second (long waves). The following tables show the timing results for various time steps. The column 'Iterations' refers to the average number of iterations required for the convergence of the Bi-CGSTAB algorithm over the 100 seconds.

Table 9.1: Timing results for IJssel-500000

∆t    Iterations   Total time (s)   Implicit solver time (s)   CG solver time (s)
0.1   3.3          35.98            18.43                      5.15
0.2   4.97         18.65            13.1                       2.34
0.4   8.88         14.63            11.87                      1.15
Table 9.2: Timing results for IJssel-1000000

∆t    Iterations   Total time (s)   Implicit solver time (s)   CG solver time (s)
0.1   3.3          49.88            31.20                      7.97
0.2   4.96         32.58            23.28                      3.95
0.4   8.93         25.77            21.05                      1.97

The following observations can be made:

• The total simulation time reduces as we increase the time step.
• Increasing the problem size does not impact the required number of iterations for the Bi-CGSTAB solver. This is similar to the results observed for the two-dimensional test case.
• As we go from ∆t = 0.2 to ∆t = 0.4, the Implicit solver time remains almost constant, as the number of iterations required for Bi-CGSTAB convergence increases proportionally with the time step.

In order to see the accuracy of the solution with respect to the time step, we plot Fext against time for the various time steps.

Figure 9.4: Froude force variation with time for various time steps: River IJssel, 1.0 million nodes

As the time step is increased, the difference in the Froude force starts to increase, which can be attributed to either numerical dispersion or the pressure pulse. A snapshot of the simulation for the test case of the River IJssel with 1.5 million nodes is shown in Figure 9.5.

Figure 9.5: Snapshot of the River IJssel with 1.5 million nodes at t = 100 seconds, ∆t = 0.2 s

9.2 Variation with problem size

In order to see the impact of the problem size on the efficiency of the solvers, we carried out simulations for the test case of the Plymouth Sound for various problem sizes, keeping the time step constant at 0.2 and 0.4 seconds. The results are shown in Table 9.3 and Table 9.4.

Table 9.3: Timing results for Plymouth. ∆t = 0.2 s. Variation with problem size

N         Iterations   Total time (s)   Implicit solver time (s)   CG solver time (s)
50000     4.1          4.83             2.91                       0.82
100000    4.1          9.19             5.13                       1.59
200000    4.2          11.63            6.53                       1.76
500000    5.1          23.28            14.3                       2.86
1000000   5.45         41.77            26.63                      4.77
1500000   5.86         62.26            41.09                      6.65

Table 9.4: Timing results for Plymouth. ∆t = 0.4 s. Variation with problem size

N         Iterations   Total time (s)   Implicit solver time (s)   CG solver time (s)
50000     6.45         3.28             2.33                       0.42
100000    6.63         4.22             3.07                       0.48
200000    6.7          6.20             4.62                       0.67
500000    8.74         16.57            12.13                      1.43
1000000   9.31         30.5             22.96                      2.40
1500000   10.58        49.26            37.03                      3.28

The following observations can be made:

• A proportional increase of the simulation time with the problem size is observed.
• The CG solver time does not increase in proportion to the problem size. This can be attributed to a slight reduction in the required number of iterations for the CG solver, which was also reported by de Jong [1].
• For the implicit solver, we see an increase in the number of iterations. This differs from what was observed for the two-dimensional test case and the real case of the river IJssel. For the Plymouth Sound, there is a large influence of the incoming boundary waves. As the problem size increases, the waves occupy a larger area as time progresses. If there were no waves, the solution would be zero and no iterations would be required. As the waves propagate, the problem domain fills with more and more waves, resulting in a larger number of iterations required for convergence.
• As we increase the time step, we see an overall improvement in the solver performance: when the time step is doubled, the number of iterations required for the convergence of the Bi-CGSTAB method does not double.

The snapshots for the Plymouth Sound with 1.5 million nodes for different time steps are shown in Figure 9.6 and Figure 9.7.

Figure 9.6: Snapshot of Plymouth Sound with 1.0 million nodes at t = 100 seconds, ∆t = 0.2 s
Figure 9.7: Snapshot of Plymouth Sound with 1.0 million nodes at t = 100 seconds, ∆t = 0.4 s

There is a need to check the accuracy of the solution. In the case of the river IJssel, the incoming waves have smaller frequencies and thus larger time periods. Hence, we do not suffer from the inaccuracy of not capturing the incoming boundary waves. In the case of the Plymouth Sound, the minimum time period of the incoming waves is 4 seconds, and with a time step of 0.4 seconds we may not have enough temporal resolution. The variation of Fext with time for different time steps for the test case of the Plymouth Sound with 1.0 million nodes is plotted in Figure 9.8.
Figure 9.8: Froude force variation with time for various time steps: Port Plymouth

For ∆t = 0.1 s and ∆t = 0.2 s, the results are relatively close. Later, the results of the explicit and implicit methods are also compared for ∆t = 0.1 s. When we further increase the time step to 0.4 s, a deviation from the results at lower time steps is observed, although this does not impact the stability of the solution: we ran the simulation for 1000 seconds of real time and did not observe any divergence.

9.3 Profiling Results

In this study, all the functions are implemented in CUDA. A custom profiler is built into the code, which helps to analyze the time required by each function, whether it is implemented in C++ or CUDA; this removes the restriction of profilers like Valgrind, which can only be used for a C++ implementation. The following tables show the time required by each function inside computewave(), which computes the wave field for a single time step. We have compiled the times for a run of 20 seconds of real time. The remaining computations are related to computing and moving the ship pulses.

Table 9.5: Profiling results for the CUDA implementation: run time = 20 seconds. IJssel, 1 million nodes, ∆t = 0.1 s

Function                  Time (s)   Time (%)
Initiate RHS()            0.071      0.83%
Incoming Waves()          0.112      1.32%
Boundary Zetaphi()        0.169      1.99%
Compute RHS zeta phi()    0.509      6.00%
Implicit Solve()          5.215      61.49%
Compute RHS psi()         0.145      1.71%
CG Solver()               1.555      18.33%
Boundary psi()            0.004      0.04%
Correct zeta()            0.001      0.01%
Boundary zeta()           0.444      5.24%
Maximum waves()           0.257      3.03%

Table 9.6: Profiling results for the CUDA implementation: run time = 20 seconds. IJssel, 1 million nodes, ∆t = 0.4 s

Function                  Time (s)   Time (%)
Initiate RHS()            0.018      0.42%
Incoming Waves()          0.028      0.66%
Boundary Zetaphi()        0.043      1.03%
Compute RHS zeta phi()    0.127      3.03%
Implicit Solve()          3.374      80.48%
Compute RHS psi()         0.036      0.86%
CG Solver()               0.392      9.35%
Boundary psi()            0.001      0.02%
Correct zeta()            0.000      0.01%
Boundary zeta()           0.111      2.64%
Maximum waves()           0.063      1.50%

The above tables provide insight into the time-consuming steps of the CUDA solver. The function boundaryzeta() also includes the transfer of ζ from the device (GPU) to the host (CPU). This time could be reduced if we later decide to plot the waves directly from the GPU and not transfer them to the CPU at every time step. Also, the time required by maximum waves() can be reduced by switching on the OpenMP flag. Further optimization of the implicit solver is possible by reducing the memory utilization and by using single precision instead of the double precision floating point numbers used in this study.

9.4 Comparison: Explicit vs. Implicit

We consider the test case of the Plymouth Sound to perform the comparison between the explicit CUDA Leap-Frog solver developed by Martijn and the implicit solver developed in this thesis. For the explicit solver, further work was done after Martijn's thesis to move functions like computing the time integral, the right-hand side etc. from C++ to CUDA. We have used this solver in our study; in the code, it is referred to as CudaWavesComputer. First we compare the accuracy of the two solvers. We plot Fext versus time with a time step of 0.1 seconds, which is below the CFL limit. As the time step is the same, the incoming boundary waves and the pressure pulse are captured with the same temporal resolution. The difference between the two methods seen in Figure 9.9 can be attributed to numerical dispersion.
In the next chapter, we evaluate the numerical dispersion of the two methods analytically.

Figure 9.9: Froude force variation with time for the explicit and implicit solvers: Plymouth Sound with 1.5 million nodes, ∆t = 0.1 seconds

Next, we compare the timing results for the two methods. We conducted the simulation for the explicit case with ∆t = 0.1 seconds and for the implicit solver with ∆t = 0.2 and 0.4 seconds. Only the total time of the simulation is shown here for comparison.

Table 9.7: Comparison of explicit and implicit solvers: Plymouth Sound

N         Explicit ∆t = 0.1   Implicit ∆t = 0.2   Implicit ∆t = 0.4
50000     5.99                4.83                3.28
100000    7.71                9.19                4.23
200000    6.95                11.63               6.20
500000    13.55               23.28               16.57
1000000   24.05               41.77               30.50
1500000   69.97               55.47               49.27

The above table is also shown in graph format below.

Figure 9.10: Variation of the computational time for the explicit and implicit solvers with problem size: Plymouth Sound

Let us compare the explicit results at ∆t = 0.1 seconds with the implicit results at ∆t = 0.4 seconds. For the smaller problems, the implicit solver is faster than the explicit solver. This could be attributed to memory bandwidth limitations for small problems. Also, the number of iterations required by the Bi-CGSTAB solver for the Plymouth Sound is lower for the smaller problem sizes, as discussed previously. As we increase the problem size, the number of iterations for the convergence of the Bi-CGSTAB method increases, thus increasing the overall time of the implicit solver. We see this for problem sizes from 500,000 nodes up to 1.0 million nodes: there the explicit solver requires less time. For the problem size of 1.5 million nodes, we see a sudden increase in the overall time required by the explicit scheme. The reason is that, for the case setup with 1.5 million nodes, the pulses are defined based on a mesh file. This makes moving and computing the ship pulses a compute-intensive operation: around 22.8% of the overall computation time is spent on the pulses. In contrast, if the pulses are not defined through meshes, then only 3-4% of the computation time is spent on the pulses. When we move to the implicit method with the larger time step of ∆t = 0.4 seconds, we subsequently reduce the computation time of the pulse movement: only 6.7% of the overall computation time is then spent on computing and moving the pulses, compared to the 22.8% required by the explicit solver.

We assess this impact of the increase in problem size for the River IJssel test case as well. The results are shown in Table 9.8.

Table 9.8: Comparison of explicit and implicit solvers: River IJssel

N         Explicit (s)   Implicit ∆t = 0.2   Implicit ∆t = 0.4
500000    21.17          18.65               14.63
1000000   22.96          32.73               25.77
1500000   56.01          55.53               42.02

The following observations can be made:

• For the smaller problems, the implicit solver is faster than the explicit solver in both test cases. The small number of iterations required for the convergence of the Bi-CGSTAB solver for small problems could be one of the reasons.
• For the test cases with 0.5 million and 1.0 million nodes, the time required for the simulation is almost the same; in fact, the explicit solver is faster.
• As we further increase the problem size, with ∆t = 0.4 seconds, which is four times larger than the ∆t of the explicit scheme, the implicit solver again becomes faster than the explicit solver.
Please note that the number of iterations required for convergence of the Bi-CGSTAB algorithm for the River IJssel test case does not increase with problem size, unlike in the Plymouth Sound case.

9.5 Concluding Remarks: Realistic Test Cases

• Chorin's projection scheme provides stable results for the realistic test cases.
• The implicit solver is faster than the explicit solver for certain test cases. However, in order to achieve this we need to use larger time steps, which can result in large numerical dispersion. In the next chapter, we will discuss the numerical dispersion of the implicit and the explicit schemes in more detail.
• The time step is governed by the required frames per second (fps) of the ship simulator. It is intended to achieve 20 fps in real time, which implies a time step of 0.05 seconds. With the current problem set-up, ∆t_CFL > 0.1s, so the explicit solver can be used at 0.05 seconds, and using the implicit solver at 0.05 seconds will not improve performance. However, the solver's cost scales linearly with the problem size, and large problems cannot be solved within 0.05 seconds. In that case, it would be necessary to move to non-uniform grids, which results in a smaller number of grid points. Also, it is intended to simulate more than one ship and to capture the interaction between them. Such a set-up will require a finer mesh near the interaction region. If we move towards a non-uniform grid set-up, it is possible to have a finer grid near the ships and a coarser grid away from them. With the explicit scheme, this might impose a time-step limitation, as the CFL number will be governed by the smallest spatial step size. Our implicit solver will come in handy in such cases, as it is not restricted by the CFL number. In the next chapter, we provide a framework for setting up the non-uniform grid for the two-dimensional test case.

Chapter 10
Dispersion Analysis

Let us consider the one-dimensional wave equation given by:

    \frac{\partial \zeta}{\partial t} + c_0 \frac{\partial \zeta}{\partial x} = 0    (10.1)

where ζ is the wave height and c_0 is the wave velocity. In our system, represented by the Variational Boussinesq approximation, the wave velocity is given by the surface gravity wave speed. The spatial derivative is discretized using central differences, and the time derivative is integrated with the implicit theta method and with the explicit Leap-Frog method. The theta-method discretization is given as

    \zeta_i^{n+1} = \zeta_i^{n} - \frac{C(1-\theta)}{2}\left(\zeta_{i+1}^{n} - \zeta_{i-1}^{n}\right) - \frac{C\theta}{2}\left(\zeta_{i+1}^{n+1} - \zeta_{i-1}^{n+1}\right)    (10.2)

where C is the Courant number, given by C = c_0 ∆t/∆x. The Leap-Frog discretization is given by:

    \zeta_i^{n+1} = \zeta_i^{n-1} - C\left(\zeta_{i+1}^{n} - \zeta_{i-1}^{n}\right)    (10.3)

The Taylor series expansion gives the modified equivalent partial differential equation for the theta method as [9]:

    \frac{\partial \zeta}{\partial t} + c_0 \frac{\partial \zeta}{\partial x} = \frac{c_0^2 \Delta t}{2}(2\theta - 1)\frac{\partial^2 \zeta}{\partial x^2} - \frac{c_0 \Delta x^2}{6}\left(1 + C^2(3\theta - 1)\right)\frac{\partial^3 \zeta}{\partial x^3} + O(\Delta x^4)    (10.4)

and the modified equivalent partial differential equation for the Leap-Frog scheme is given as:

    \frac{\partial \zeta}{\partial t} + c_0 \frac{\partial \zeta}{\partial x} = -\frac{c_0 \Delta x^2}{6}\left(1 - C^2\right)\frac{\partial^3 \zeta}{\partial x^3} + O(\Delta x^4)    (10.5)

As expected for the theta method, numerical diffusion arises with the magnitude:

    \kappa_{num} = \frac{c_0^2 \Delta t}{2}(2\theta - 1)    (10.6)

and this numerical diffusion vanishes for θ = 0.5, which is the Implicit-Trapezoidal method.

For a given value of the Courant number, the numerical dispersion of the Trapezoidal method is always greater than that of the Leap-Frog scheme. In fact, if C = 1, the numerical dispersion vanishes for the Leap-Frog scheme. This is not the case for the Trapezoidal method, which gives a finite dispersion coefficient of c_0 ∆x^2/4 when C = 1.
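To make explicit what one implicit time step of the theta scheme (10.2) involves, the sketch below assembles and solves the resulting tridiagonal system for the interior unknowns with the Thomas algorithm, assuming fixed (Dirichlet) boundary values. It is a minimal one-dimensional sketch of the scheme only, not the coupled VBM solver used in this thesis.

    // One theta-method step for the 1-D advection equation (10.1)-(10.2).
    // Minimal sketch with fixed boundary values zeta[0] and zeta[N-1].
    #include <vector>

    void thetaStep(std::vector<double>& zeta, double C, double theta) {
        const int N = static_cast<int>(zeta.size());
        const int m = N - 2;                          // number of interior unknowns
        if (m < 1) return;
        std::vector<double> a(m, -0.5 * C * theta);   // sub-diagonal
        std::vector<double> b(m, 1.0);                // main diagonal
        std::vector<double> c(m,  0.5 * C * theta);   // super-diagonal
        std::vector<double> d(m);                     // right-hand side

        for (int i = 1; i <= m; ++i)                  // explicit part of (10.2)
            d[i - 1] = zeta[i] - 0.5 * C * (1.0 - theta) * (zeta[i + 1] - zeta[i - 1]);
        d[0]     += 0.5 * C * theta * zeta[0];        // known boundary values go to the RHS
        d[m - 1] -= 0.5 * C * theta * zeta[N - 1];

        // Thomas algorithm: forward elimination, then back substitution.
        for (int i = 1; i < m; ++i) {
            double w = a[i] / b[i - 1];
            b[i] -= w * c[i - 1];
            d[i] -= w * d[i - 1];
        }
        std::vector<double> x(m);
        x[m - 1] = d[m - 1] / b[m - 1];
        for (int i = m - 2; i >= 0; --i)
            x[i] = (d[i] - c[i] * x[i + 1]) / b[i];

        for (int i = 1; i <= m; ++i) zeta[i] = x[i - 1];
    }

Calling this routine repeatedly with θ = 0.5 gives the Implicit-Trapezoidal integration of (10.1), while θ = 0 reduces it to a one-level explicit update.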
10.1 Dispersion Relation

The exact dispersion relation is obtained via the substitution ζ(x, t) = ζ̂ exp[i(kx − ωt)], which gives the relationship between the frequency and the wavenumber as:

    \omega = c_0 k    (10.7)

The phase speed is then given by c = ω/k = c_0, and hence the equation is not dispersive, since all the waves propagate at the same speed c_0 regardless of the wavenumber k. Deriving the dispersion relation for the Implicit-Trapezoidal method and the Leap-Frog method from the Taylor series expansion, we obtain:

    c = c_0\left[1 - \tfrac{1}{6}(k\Delta x)^2\left(1 + \tfrac{C^2}{2}\right)\right]    (Theta method, θ = 1/2)
    c = c_0\left[1 - \tfrac{1}{6}(k\Delta x)^2\left(1 - C^2\right)\right]               (Leap-Frog method)

These show that c = c_0 only when the Leap-Frog method is used and C = 1. This is not possible for the Implicit-Trapezoidal method. The above relations hold only in the case of small k∆x, as we have ignored the higher-order terms in the Taylor series expansion. A general dispersion relation can be derived by substituting:

    \zeta(x_{j\pm1}, t_{n\pm1}) = \hat{\zeta}\exp[i(kx_{j\pm1} - \omega t_{n\pm1})]
                                = \hat{\zeta}\exp[i(kx_j - \omega t_n)]\exp(\pm ik\Delta x)\exp(\mp i\omega\Delta t)
                                = \zeta(x_j, t_n)\exp(\pm ik\Delta x)\exp(\mp i\omega\Delta t)

If we substitute the above relation into the discretized equations, we get the following relations. The numerical dispersion relation for the Implicit-Trapezoidal method is given by:

    \frac{c}{c_0} = \frac{i}{Ck\Delta x}\log\!\left(\frac{2 - iC\sin k\Delta x}{2 + iC\sin k\Delta x}\right)    (10.8)

and for the Leap-Frog scheme by:

    \frac{c}{c_0} = \frac{\sin^{-1}(C\sin k\Delta x)}{Ck\Delta x}    (10.9)

Figure 10.1 shows the variation of c/c_0 with k∆x for various Courant numbers.

Figure 10.1: Dispersion analysis of the Implicit-Trapezoidal method (c/c_0 versus k∆x for C = 0.5, 1, 2 and 4)

In order to limit the dispersion error, we need to formulate the problem such that we operate near the top part of the curve. For example, to keep the dispersion error under 10% while using a time step above the CFL limit, say ∆t = 2 × ∆t_CFL, k∆x should be below 0.5.

10.2 Numerical Dispersion for the One-Dimensional Test Case

For the one-dimensional test case discussed in Chapter 5, we discuss the impact of numerical dispersion for various time steps of the Implicit-Trapezoidal method. We consider a fixed location x = 50m and observe the wave passing through for various time steps. The results are shown in Figure 10.2.

Figure 10.2: Phase lag for the Implicit-Trapezoidal scheme. Solution at x = 50m with periodic boundary conditions

We can see a clear phase lag, which increases for larger time steps. It is difficult to quantify this error, as we have a coupled set of equations, although a general trend can be observed.

10.3 Non-Uniform Grid

In this section we develop the discretization scheme for the non-uniform grid and explore the possibility of employing the implicit solver for the non-uniform grid. We focus on a rectilinear grid in two dimensions, as shown in Figure 10.3.

Figure 10.3: Rectilinear grid in two dimensions

The grid size in the x- and y-directions varies smoothly. The grid size distribution in the x-direction is shown in Figure 10.4. Near the boundaries, the grid size is constant. In the central area, it varies quadratically and is continuous, though the first derivative of such a grid size distribution has a discontinuity. Based on the parameters of the quadratic curve, the ratio of the maximum grid size to the minimum grid size can be varied.
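One possible way to generate such a spacing distribution is sketched below: constant spacing near the two boundaries and a quadratic variation of ∆x towards the centre. The function and parameter names are illustrative and are not taken from the thesis code; the actual grid generator may differ.

    // Sketch of a smoothly varying grid-spacing distribution in one dimension:
    // constant spacing over nBoundary cells near each boundary, quadratic
    // variation in the centre (dxMin at the midpoint, dxMax at the edges).
    // Assumes nBoundary < (nx - 1) / 2.
    #include <vector>

    std::vector<double> gridSpacing(int nx, int nBoundary, double dxMax, double dxMin) {
        std::vector<double> dx(nx, dxMax);
        const double centre = 0.5 * (nx - 1);
        const double halfWidth = centre - nBoundary;      // cells over which dx varies
        for (int i = nBoundary; i < nx - nBoundary; ++i) {
            double s = (i - centre) / halfWidth;           // s runs from -1 to 1
            dx[i] = dxMin + (dxMax - dxMin) * s * s;       // quadratic profile
        }
        return dx;
    }

    // Node coordinates follow by accumulating the spacings.
    std::vector<double> gridCoordinates(const std::vector<double>& dx) {
        std::vector<double> x(dx.size() + 1, 0.0);
        for (std::size_t i = 0; i < dx.size(); ++i) x[i + 1] = x[i] + dx[i];
        return x;
    }

The ratio dxMax/dxMin then directly controls the stretching of the grid, and the spacing itself is continuous while its first derivative jumps at the junctions with the constant-spacing regions, as described above.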
Figure 10.4: Grid size distribution in one dimension (∆x versus grid index)

For the two-dimensional rectilinear grid, the y-spacing can also vary with a similar distribution. In order to access elements, we do not need to store the location of all the grid points. As the grid is rectilinear, we store the x-coordinates and y-coordinates in two separate arrays x and y. When we have to access any grid element (i, j), its location is given by (x(i), y(j)).

10.3.1 Spatial discretization

The governing equations are discretized based on the grid shown in Figure 10.5. The variables are evaluated at the grid points (i, j), and the finite volume, shown as the dotted rectangle in Figure 10.5, is centred around the grid point.

Figure 10.5: Finite volume grid (control volume around node (i, j), with faces at (i ± 1/2, j) and (i, j ± 1/2))

We define the following:
• \Delta x_i = \frac{x_{i+1} - x_{i-1}}{2}
• \Delta y_j = \frac{y_{j+1} - y_{j-1}}{2}

The governing equation for the variable ζ is given by:

    \frac{\partial \zeta}{\partial t} + U_x\frac{\partial \zeta}{\partial x} + U_y\frac{\partial \zeta}{\partial y} + h\left(\frac{\partial^2 \varphi}{\partial x^2} + \frac{\partial^2 \varphi}{\partial y^2}\right) - hD\left(\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2}\right) = 0    (10.10)

Let us consider the discretization of each term separately. For the time derivative,

    \iint \frac{\partial \zeta}{\partial t}\,dx\,dy = \frac{\partial \zeta_{i,j}}{\partial t}\,\Delta x_i \Delta y_j

For the single spatial derivative in the x-direction,

    \iint U_x\frac{\partial \zeta}{\partial x}\,dx\,dy = U_x\left(\zeta_{i+1/2,j} - \zeta_{i-1/2,j}\right)\Delta y_j
      = U_x\left(\frac{\zeta_{i+1,j} + \zeta_{i,j}}{2} - \frac{\zeta_{i,j} + \zeta_{i-1,j}}{2}\right)\Delta y_j
      = U_x\,\frac{\zeta_{i+1,j} - \zeta_{i-1,j}}{2}\,\Delta y_j    (10.11)

Similarly,

    \iint U_y\frac{\partial \zeta}{\partial y}\,dx\,dy = U_y\,\frac{\zeta_{i,j+1} - \zeta_{i,j-1}}{2}\,\Delta x_i    (10.12)

For the second derivative in the x-direction,

    \iint h\frac{\partial^2 \varphi}{\partial x^2}\,dx\,dy = h\left(\frac{\partial \varphi}{\partial x}\Big|_{i+1/2,j} - \frac{\partial \varphi}{\partial x}\Big|_{i-1/2,j}\right)\Delta y_j
      = h\left(\frac{\varphi_{i+1,j} - \varphi_{i,j}}{x_{i+1} - x_i} - \frac{\varphi_{i,j} - \varphi_{i-1,j}}{x_i - x_{i-1}}\right)\Delta y_j
      = \varphi_{i+1,j}\,\frac{h\Delta y_j}{x_{i+1} - x_i} + \varphi_{i-1,j}\,\frac{h\Delta y_j}{x_i - x_{i-1}} - \varphi_{i,j}\left(\frac{h\Delta y_j}{x_{i+1} - x_i} + \frac{h\Delta y_j}{x_i - x_{i-1}}\right)

Similarly,

    \iint h\frac{\partial^2 \varphi}{\partial y^2}\,dx\,dy = \varphi_{i,j+1}\,\frac{h\Delta x_i}{y_{j+1} - y_j} + \varphi_{i,j-1}\,\frac{h\Delta x_i}{y_j - y_{j-1}} - \varphi_{i,j}\left(\frac{h\Delta x_i}{y_{j+1} - y_j} + \frac{h\Delta x_i}{y_j - y_{j-1}}\right)

The above discretization results in the formation of pentadiagonal matrices, represented by 5-point stencils. In order to use the RRB solver, we need to show that these matrices are either skew-symmetric or symmetric in nature. As was done previously for the uniform grid, we do not divide the whole equation by the ∆x∆y term, which comes from the time integral. This is done because ∆x∆y now varies over the grid, and we would like to keep this non-uniform part with the diagonal term. The stencil for the convective term is given by:

    S_{\zeta\zeta}:\ \begin{bmatrix} 0 & U_y\frac{\Delta x_i}{2} & 0 \\ -U_x\frac{\Delta y_j}{2} & 0 & U_x\frac{\Delta y_j}{2} \\ 0 & -U_y\frac{\Delta x_i}{2} & 0 \end{bmatrix}    (10.13)

To verify the symmetry (or skew-symmetry) of the matrix, we need to compare the outer diagonals with each other, as was done for the uniform grid case. In this case, the absolute contributions from the West and the East terms should be the same, i.e., -U_x\frac{\Delta y_j}{2}\big|_{i,j} = -U_x\frac{\Delta y_j}{2}\big|_{i-1,j}. This is true because ∆y_j remains constant if we traverse in the x-direction. Similarly, we can prove that the diffusive term is symmetric in nature.

Please note that the above derivation is applicable only for rectilinear grids. Also, if we were to divide the equation by ∆x_i ∆y_j, then we would need U_x\frac{1}{2\Delta x_i}\big|_{i,j} = U_x\frac{1}{2\Delta x_{i-1}}\big|_{i-1,j}, which is true only when ∆x_i = ∆x_{i-1}; but that is the uniform grid case.

Based on the above formulation, we can now formulate the stencils for the two-dimensional case and use the RRB solver (both the implicit and the symmetric variants).
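As a sketch of how the convective stencil (10.13) and the corresponding diffusive stencil could be filled on the rectilinear grid, without dividing by ∆x_i ∆y_j as argued above, one might write something like the following. The struct layout and function names are illustrative and do not reflect the RRB solver's actual storage format.

    // Sketch: 5-point stencil coefficients at node (i, j) on a rectilinear grid,
    // following the finite-volume discretization of Section 10.3.1. The equation
    // is NOT divided by dx_i * dy_j, so the convective part stays skew-symmetric
    // and the diffusive part stays symmetric on a non-uniform grid.
    #include <vector>

    struct Stencil { double C, N, S, E, W; };   // centre, north, south, east, west

    Stencil convectiveStencil(int i, int j, double Ux, double Uy,
                              const std::vector<double>& x, const std::vector<double>& y) {
        double dxi = 0.5 * (x[i + 1] - x[i - 1]);
        double dyj = 0.5 * (y[j + 1] - y[j - 1]);
        // Skew-symmetric convective part, cf. stencil (10.13).
        return { 0.0,  0.5 * Uy * dxi, -0.5 * Uy * dxi,
                       0.5 * Ux * dyj, -0.5 * Ux * dyj };
    }

    Stencil diffusiveStencil(int i, int j, double h,
                             const std::vector<double>& x, const std::vector<double>& y) {
        double dxi = 0.5 * (x[i + 1] - x[i - 1]);
        double dyj = 0.5 * (y[j + 1] - y[j - 1]);
        double e = h * dyj / (x[i + 1] - x[i]);   // East coupling
        double w = h * dyj / (x[i] - x[i - 1]);   // West coupling
        double n = h * dxi / (y[j + 1] - y[j]);   // North coupling
        double s = h * dxi / (y[j] - y[j - 1]);   // South coupling
        return { -(e + w + n + s), n, s, e, w };  // symmetric 5-point stencil for h * Laplacian
    }

Because ∆y_j is the same for the nodes (i, j) and (i - 1, j), the East coefficient produced at one node matches (up to sign) the West coefficient produced at its neighbour, which is exactly the skew-symmetry property verified above.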
10.3.2 Accuracy of the Non-Uniform Grid

The accuracy of the spatial discretization described above can be analyzed by expanding the derivatives in Taylor series and comparing the results. The single derivative is accurate up to O((x_{i+1} − x_i)(x_i − x_{i−1})), i.e. to second order in the space increments. The second derivative is accurate up to O((x_{i+1} − x_i) − (x_i − x_{i−1})), i.e. to first order in the difference of the grid intervals. In the case of a uniform grid this would be just O(∆x^2), but the error is larger for a non-uniform grid and depends on the type of stretching in the grid.

Another impact of the non-uniform grid can be seen in the numerical dispersion. As we saw before, the numerical dispersion depends on the CFL number and on k∆x, and that analysis was carried out in one dimension only. In the case of a non-isotropic grid, that is, if the number of points per wavelength is not the same in every direction, the dispersion increases further. Also, if the grid size varies, then the numerical wave speed varies as well, and this causes effects such as reflection of the wave. We will try to observe these effects in the results given in the next section.

10.3.3 Non-Uniform Grid Results

The grid used in the simulations is shown in Figure 10.6.

Figure 10.6: The Non-Uniform grid

The following parameters are used in the simulation:
• Lx = 1476 m, Ly = 908 m
• Nx = 240, Ny = 120
• Ux = Uy = 1 m/s
• h = 50 m
• g = 9.81 m/s^2
• double precision for computation

The maximum grid spacing in the x-direction is 13.8 m and the minimum grid spacing is 1.37 m, resulting in ∆x_max/∆x_min = 10. The same ratio is applied to the grid spacing in the y-direction. The ∆t_CFL based on the minimum grid spacing is 0.031 s. Our aim is to carry out the simulations for various ∆t's and observe the required computation time and the number of iterations of the Bi-CGSTAB algorithm. It is also important to reflect upon the accuracy and stability of the results obtained.
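For reference, a common way to estimate the CFL-limited time step on such a grid is to take the most restrictive cell, using the advection velocity plus the shallow-water wave speed sqrt(g h) as the signal speed. The sketch below is only this generic estimate and may differ in detail from the criterion actually used for the VBM solver in this thesis; the function name is illustrative.

    // Generic sketch of a CFL-type time-step estimate on a rectilinear grid.
    // Assumption: signal speed = |U| + sqrt(g*h) in each direction; the exact
    // criterion used in the VBM solver may differ.
    #include <vector>
    #include <cmath>
    #include <algorithm>

    double estimateDtCfl(const std::vector<double>& dx, const std::vector<double>& dy,
                         double Ux, double Uy, double g, double h) {
        const double c0 = std::sqrt(g * h);                         // surface gravity wave speed
        const double dxMin = *std::min_element(dx.begin(), dx.end());
        const double dyMin = *std::min_element(dy.begin(), dy.end());
        // The smallest cells dominate: directional signal rates add up to 1/dt.
        return 1.0 / ((std::abs(Ux) + c0) / dxMin + (std::abs(Uy) + c0) / dyMin);
    }

Since the minimum spacing enters in the denominator, a stretching ratio of 10 directly tightens the explicit time-step limit by roughly the same factor, which is the motivation for using the implicit solver on this grid.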
First we show the results for ∆t = 0.5 s, which gives ∆t/∆t_CFL = 15.1.

Figure 10.7: Initial conditions for ζ for the Non-Uniform grid

The snapshots of the results are generated at various time steps with the help of Paraview, an open-source visualization software.

Figure 10.8: ζ at t = 50 s for the Non-Uniform grid

From Figure 10.7 and Figure 10.8, we can see that the algorithm developed is stable for the non-uniform grid, even at large CFL ratios. We will now compare the results with those obtained at the time step corresponding to the CFL limit, ∆t_CFL = 0.031 s. We obtain the result by cutting a slice normal to the y-direction, thus allowing us to compare the results in a one-dimensional plot. The results are shown in Figure 10.9. We can see that the numerical dispersion is very small. This can be attributed to the fact that we are operating at very low k∆x ratios: for the smallest grid spacing of 1.37 m, the ratio k∆x = (4π/L)∆x = 0.01. As the CFL number is inversely proportional to the spatial step size, this also corresponds to the region in the grid with the highest CFL ratio. For the largest grid spacing, k∆x = 0.1, but the CFL ratio for such spacing is 10 times lower than the CFL ratio for the smallest grid size. Thus we are always near c/c_0 = 1, i.e. the top of the curve shown in Figure 10.1. This is in contrast to the previous case of a uniform grid, where with larger CFL numbers we encountered large dispersion errors.

Figure 10.9: ζ at t = 50 s with ∆t = 0.031 s and ∆t = 0.5 s

We now show the timing results for the solver. The simulations are run for 50 seconds of real time. The implicit Bi-CGSTAB solver is implemented both in C++ and CUDA, whereas the CG solver is implemented in CUDA.

Table 10.1: Timing results for the Non-Uniform Grid case with the C++ Bi-CGSTAB solver, ∆t_CFL = 0.062 s, ∆x_max/∆x_min = 10

  ∆t (s)   Avg. no. of iterations per time step   Run time of simulation (s)
  0.031              2                                   37.56
  0.062              2                                   19.13
  0.124              4                                   11.41
  0.248              6                                    6.08
  0.496             14                                    5.09
  0.992             29                                    4.11

Table 10.2: Timing results for the Non-Uniform Grid case with the CUDA implicit solver, ∆t_CFL = 0.062 s, ∆x_max/∆x_min = 10

  ∆t (s)   Avg. no. of iterations per time step   Run time of simulation (s)
  0.031              2                                   34.36
  0.062              2                                   18.39
  0.124              4                                   10.43
  0.248              8                                    6.2
  0.496             20                                    4.57
  0.992             36                                    3.18

Following are the observations from the above results:

• An explicit method would require ∆t = 0.031 s for this simulation due to the CFL limitation. With the implicit solver, the simulation can be carried out at larger time steps. The required frame rate is 20 fps, which corresponds to a ∆t of 0.05 seconds. Performing the simulation at 0.05 seconds would require only 2 iterations for convergence of the Bi-CGSTAB algorithm and would save the time required for the double computation of the CG solver, computing the pulses, etc., compared to using the explicit solver. Also, in the case of multiple ships, where non-uniform grids are required, the pressure-pulse computation can become a time-consuming step.
• As observed in the uniform grid case, the CUDA implementation of the Bi-CGSTAB solver requires more iterations to converge than the C++ solver. Part of the problem lies in the r1r2b1b2 storage structure in CUDA and its impact on the sparse matrix-vector multiplication and the preconditioning solve function. Both of these need to be studied carefully and constitute part of the future work.

This concludes our discussion of the non-uniform test case. We have been able to use the solvers developed for the uniform grid with a suitable spatial discretization and stencil formulation. Much work still needs to be carried out to study the accuracy and speed-up of these methods and their adaptation to the real case. We will now present our conclusions and propose the future work.

Chapter 11
Conclusion

We started with the efficient implementation of the explicit CUDA RRB solver, and intended to develop a robust implicit solver which is based on a similar RRB framework and can overcome the numerical limitations of the explicit schemes. We were able to achieve the following:

• We analyzed various time integration schemes and studied their stability and accuracy behaviour with simplified test problems in MATLAB. The Implicit-Trapezoidal method came out as the front runner, with suitable symplectic and stability properties for the VBM model.
• Chorin's projection scheme has been developed to solve the coupled system of equations arising from the VBM model. By splitting the operations carefully, stability of the method has been ensured.
• Suitable physics-based preconditioners which can take care of the weak coupling of the system have been formulated and their performance studied. The preconditioners were studied analytically through eigenvalue analysis and then by performing numerical experiments. The Gauss-Seidel block preconditioner with the gravitational constant as the coupling term provides the optimum number of iterations amongst the various preconditioners.
• In order to develop the CUDA and C++ solvers in an object-oriented way, and to perform benchmarking, a two-dimensional test problem has been formulated. The object-oriented approach allows plugging in various solvers and making use of the efficient solver developed by Martijn. Various profilers were used to optimize the code, which can then be directly incorporated in the real solver.
• A careful implementation of the incoming boundary waves allowed a stable solution for the real test case. The implicit solver for the real test cases outperforms the explicit solver in certain scenarios. A speed-up factor of up to 7/5 was observed.
• A framework for the further development of the non-uniform grid solver has been provided. A suitable spatial discretization has been evaluated, resulting in system matrices which can be handled by the RRB solvers.

Chapter 12
Future Work

We would like to recommend to MARIN the following regarding the Interactive Waves project. The recommendations also indicate topics for future research.

• Optimize the CUDA implementation of the implicit solver.
  – Investigate the role of the initial guess on the convergence of the Bi-CGSTAB method. It is possible to get a better estimate of the initial guess by linear interpolation of results from previous time steps.
  – Investigate the difference between the CUDA and C++ implementations of the Bi-CGSTAB solver. It is possible to further reduce the number of iterations required for convergence of the CUDA implicit solver.
  – Get the CUDA implicit solver working with single-precision floating-point numbers.
• Implement the weakly-reflective boundary conditions for the implicit solver routine. In the current thesis, only the reflective (Neumann) boundary conditions have been implemented.
• Build upon the non-uniform two-dimensional test case. It is also possible to develop a non-rectilinear grid and reduce the number of grid points further. In order to track the ship movement, a moving grid would be required. This would also require recomputing the system matrices and reconstructing the preconditioner at subsequent time steps. The preconditioner construction step of the implicit solver should be optimized further through efficient memory utilization and use of the C++ standard library.
• Move the pulse computation to the GPU. As we observed for the Plymouth Sound test case with 1.5 million nodes, this becomes a time-consuming step. It will also become more important in the case of multiple ships.
• The ship simulator supplied by MARIN runs on a distributed system. A strategy needs to be developed which either implements the current solver on a distributed system or transfers the data to the different compute nodes efficiently.

References

[1] Jong, M. de, Developing a CUDA solver for large sparse matrices for MARIN, MSc Thesis Applied Mathematics, Delft University of Technology, Delft, February 2012.
[2] Klopman, G., Variational Boussinesq Modelling of Surface Gravity Waves over Bathymetry, PhD Thesis, University of Twente, Twente, May 2010.
[3] Wout, E. van 't, Improving the Linear Solver used in the Interactive Wave Model of a Real-time Ship Simulator, MSc Thesis Applied Mathematics, Delft University of Technology, Delft, August 2009.
[4] Keppens, R., Tòth, G., Botchev, M. A., and Ploeg, A. van der, Implicit and semi-implicit schemes: algorithms, International Journal for Numerical Methods in Fluids, 30, 335-352, 1999.
[5] Botchev, M. A., Sleijpen, G. L. G., and Vorst, H. A. van der, Low-dimensional Krylov subspace iterations for enhancing stability of time-step integration schemes, Preprint 1004, Dept. of Mathematics, Utrecht University, the Netherlands, 1997.
[6] Trefethen, L. N., Bau, D. III, Numerical Linear Algebra, Society for Industrial and Applied Mathematics, Philadelphia, 1997.
[7] Hundsdorfer, W., Verwer, J. G., Numerical Solution of Time-Dependent Advection-Diffusion-Reaction Equations, Springer, Berlin, 2003.
[8] Chorin, A. J., Numerical Solution of the Navier-Stokes Equations, Math. Comp., 22, 745-762, 1968.
[9] Fringer, O., Lecture Notes [PDF Document], retrieved from online web site: http://web.stanford.edu/ fringer/teaching/cee262c/solution4.pdf