Alya towards Exascale: Algorithmic Scalability
using PSCToolkit
Herbert Owen1 , Oriol Lehmkuhl1 , Pasqua D’Ambra2*,
Fabio Durastante3,2 , Salvatore Filippone4,2
1* Barcelona Supercomputing Centre (BSC), Plaça d’Eusebi Güell, Barcelona, 08034, Spain.
2 Institute for Applied Computing, National Research Council (CNR),
Via P. Castellino, 111, Naples, 80131, NA, Italy.
3 Department of Mathematics, University of Pisa, Largo Bruno
Pontecorvo, 5, Pisa, 56127, PI, Italy.
4 Department of Civil and Computer Engineering, University of Rome
“Tor Vergata”, Via del Politecnico, 1, Rome, 00133, RM, Italy.
*Corresponding author(s). E-mail(s):
[email protected];
Contributing authors:
[email protected];
[email protected];
[email protected];
[email protected];
Abstract
In this paper, we describe an upgrade of the Alya code with up-to-date parallel linear solvers capable of achieving reliability, efficiency and scalability in the
computation of the pressure field at each time step of the numerical procedure
for solving a Large Eddy Simulation formulation of the incompressible Navier-Stokes equations. We developed a software module in Alya’s kernel to interface
the libraries included in the current version of PSCToolkit, a framework for the
iterative solution of sparse linear systems on parallel distributed-memory computers by Krylov methods coupled to Algebraic MultiGrid preconditioners. The
Toolkit has undergone various extensions within the EoCoE-II project with the
primary goal of facing the exascale challenge. Results on a realistic benchmark for
airflow simulations in wind farm applications show that the PSCToolkit solvers
significantly outperform the original versions of the Conjugate Gradient method
available in Alya’s kernel in terms of scalability and parallel efficiency and
represent a very promising software layer to move the Alya code towards exascale.
Keywords: Navier-Stokes equations, iterative linear solvers, algebraic multigrid,
parallel scalability
MSC Classification: 65F08 , 65F10 , 65M55 , 65Y05 , 65Z05
1 Introduction
Alya is a high-performance computational mechanics code for complex coupled multiphysics engineering problems. In this work, we present the interfacing between Alya
and the PSCToolkit to overcome one of Alya’s main obstacles in the path towards
exascale, namely the lack of state-of-the-art parallel algebraic linear solvers with adequate algorithmic scalability, as already identified in [29], where Alya’s strengths and
weaknesses in facing the exascale challenge have been analyzed by scalability studies
up to one hundred thousand cores.
Although Alya can be applied to a wide range of problems, in this work, we shall
concentrate on solving turbulent incompressible flow problems using a Large Eddy
Simulation (LES) approach. Due to the wide range of scales present in turbulent high-Reynolds-number flows, their accurate solution requires computational meshes with
a huge number of degrees of freedom (dofs). Alya uses a Finite Element (FE) spatial
discretization, while its time discretization is based on finite difference methods; when
an implicit time discretization is applied, the two main kernels of a simulation are
the assembly of stiffness matrices and the solution of the associated linear system at
each time step. In [29] the authors observed that the FE assembly implemented in
Alya showed nearly perfect scalability, as one could a priori expect, while the solution
of linear systems by available iterative linear solvers was the main weakness in the
path towards exascale. The problem is related to Alya’s lack of solvers with optimal
algorithmic scalability, i.e., solvers able to obtain a given accuracy employing an almost
constant number of iterations for an increasing number of dofs.
Alya’s sparse linear algebra solvers are specifically developed with tight integration
with the overall parallelization scheme; they include Krylov-based solvers, such as Generalized Minimal Residual (GMRES) or Conjugate Gradient (CG), coupled to some
deflation approach or a simple diagonal preconditioner. As shown in [29], when incompressible flow problems are considered, the solution of a Poisson-type equation for the
pressure field becomes challenging as the size of the problem increases. Indeed, when a
uniform mesh multiplication [19] is used to generate successively finer meshes, each time the
mesh is refined to obtain elements with half the size, the number of iterations for solving the pressure equation is approximately doubled, showing a mesh-size-dependent
behavior. To overcome these scalability issues, we interfaced PSCToolkit to Alya to
take advantage of the Algebraic MultiGrid (AMG) preconditioners available through
the AMG4PSBLAS library; this effort has been carried out in the context of the European
Center of Excellence for Energy (EoCoE) applications.
The rest of the paper is organized as follows. In Section 2, we describe the general
framework of Alya and the type of fluid dynamics problem we wish to test the new
solvers on; in Section 3, we give an overall description of PSCToolkit, and then we
focus on the AMG preconditioners employed in Section 3.1. Section 4 discusses the
new module written to interface the solver library to the Alya software and the related
issues. Section 5 describes the actual test case, while Section 6 analyzes the numerical
scalability results in detail. Finally, Section 7 summarizes the results obtained and
illustrates the new lines of development.
2 Alya Description
Alya is a high-performance computational mechanics code for complex coupled multiphysics engineering problems. It can solve problems in the simulation of turbulent
incompressible/compressible flows, non-linear solid mechanics, chemistry, particle
transport, heat transfer, and electrical propagation. Alya has been designed for massively parallel supercomputers and exploits several parallel programming models/tools.
It relies on MPI to support a distributed-memory model; some kernels support vectorization at the CPU level and GPU accelerators are exploited through OpenACC
pragmas or CUDA.
Multi-physics coupling is achieved following a multi-code strategy that uses MPI
to communicate different instances of Alya. Each instance solves a particular physics,
enabling asynchronous execution. Coupled problems can be solved by retaining the
scalability properties of the individual instances. Alya is one of the two Computational
Fluid Dynamic (CFD) codes of the Unified European Applications Benchmark Suite
(UEBAS) [25]. It is also part of the Partnership for Advanced Computing in Europe
(PRACE) Accelerator benchmark suite [26].
As mentioned in Section 1, large-scale CFD applications are the main problems
targeted by Alya; hence, the basic mathematical models include various formulations
of the Navier-Stokes equations, whose strong form for incompressible flows in a suitable
domain is the following:
∂_t u − 2ν∇·ε(u) + u·∇u + ∇p = f,    (1)
∇·u = 0,    (2)

where u and p are the velocity and pressure fields, respectively, ε(u) = (1/2)(∇u + ∇^T u) is the velocity strain rate tensor, ν is the kinematic viscosity, and f denotes the vector of
external body forces. The problem is supplied with an initial divergence-free velocity
field and appropriate boundary conditions.
The flow is turbulent for most real-world flow problems, and some turbulence modeling is needed to make the problem solvable with currently available computational
resources. For all the examples presented in this work, we rely on the functionalities
of Alya, which apply the spatially filtered Navier-Stokes equations coupled to the Vreman subgrid-scale model [30] for turbulence closure. In practice, a spatially varying
turbulent viscosity supplements the laminar viscosity and the velocity and pressure
unknowns correspond to spatially filtered values. Finally, since the size of the dynamically important eddies at high Reynolds numbers becomes too small to be grid resolved
close to the wall, we employ a wall modeling technique [24] to impose the boundary
conditions for the LES equations. For simplicity, the non-linear term has been written
in its convective form, which is most commonly encountered in computational practice.
Space discretization is based on a Galerkin FE approximation, employing hybrid
unstructured meshes, which can include tetrahedra, prisms, hexahedra, and pyramids.
Temporal discretization is performed through an explicit third-order Runge–Kutta
scheme, where the Courant–Friedrichs–Lewy number is set to CFL = 1.0 for the cases
presented in this work. A non-incremental fractional step method is used to stabilize
the pressure, allowing the use of finite element pairs that do not satisfy the inf-sup
condition [12], such as the equal order interpolation for the velocity and pressure
applied in this work. A detailed description of the above numerical method, together
with examples for turbulent flows, showing its high accuracy and low dissipation, can
be found in [22].
The fractional step method allows uncoupling the solution of velocity and
pressure [12]. At each Runge–Kutta substep, an explicit approach computes an intermediate velocity, and then a linear system coming from a Poisson-type equation is
solved for the pressure; finally, the incompressible velocity is recovered. In the path
towards exascale, the solution of the linear system for the pressure is the most demanding step. To reduce the computational burden, for most problems, an approximate
projection method for Runge–Kutta time-stepping schemes is applied, which allows
solving for the pressure only at the final substep [10].
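As an illustration of this splitting, the following Python sketch outlines one explicit Runge-Kutta fractional-step update; the callables (advect_diffuse, grad, div, solve_pressure_poisson) and the Runge-Kutta coefficients are hypothetical placeholders to be supplied by the discretization, not Alya routines, so the sketch only mirrors the structure of the scheme described above.

    def fractional_step(u, p, dt, rk_coeffs, advect_diffuse, grad, div, solve_pressure_poisson):
        # One explicit Runge-Kutta step of the non-incremental fractional-step scheme:
        # intermediate velocities are computed explicitly, and the SPD pressure system
        # is solved only at the final substep (approximate projection).  All callables
        # are placeholders supplied by the caller.
        u_s = u.copy()
        for k, c in enumerate(rk_coeffs):               # e.g. rk_coeffs = (1/3, 1/2, 1)
            u_hat = u + c * dt * (advect_diffuse(u_s) - grad(p))   # intermediate velocity
            if k == len(rk_coeffs) - 1:
                # Poisson-type pressure equation: the system handed to the
                # preconditioned Krylov solver at each time step.
                p = solve_pressure_poisson(div(u_hat) / (c * dt))
                u_s = u_hat - c * dt * grad(p)          # recover a divergence-free velocity
            else:
                u_s = u_hat
        return u_s, p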
It is important to note that most flow problems solved with Alya use a fixed
mesh. For such problems, the linear system matrix for the pressure equation remains
constant during the whole simulation. Therefore, the matrix assembly and the setup
of a matrix preconditioner are needed only once at the beginning of the numerical
procedure. Given that the number of time steps for LES is usually of the order of
10^5, it is clear that the linear solver computational times and scalability are the most
relevant issues to be tackled.
3 PSCToolkit: PSBLAS and AMG4PSBLAS
We have interfaced Alya to exploit the solvers and preconditioners developed in the
PSCToolkit1 software framework for parallel sparse computations, proven on current petascale supercomputers and targeting the next-generation exascale machines.
PSCToolkit is composed of two main libraries, named PSBLAS (Parallel Sparse
Basic Linear Algebra Subprograms) [17, 18], and AMG4PSBLAS (Algebraic MultiGrid
Preconditioners for PSBLAS) [16].
Both libraries are written in modern Fortran; PSBLAS implements algorithms and
functionalities of parallel iterative Krylov subspace linear solvers, while AMG4PSBLAS
is the package containing sophisticated preconditioners. In particular, AMG4PSBLAS
provides one-level Additive Schwarz (AS) and Algebraic MultiGrid (AMG) preconditioners. In the following, we will describe in some detail the AMG preconditioners we
use within the Alya test cases.
1 See psctoolkit.github.io on how to obtain and run the code.
3.1 AMG preconditioners
Algebraic MultiGrid methods can be viewed as a particular instance of a general
stationary iterative method:
x^(k) = x^(k−1) + B (b − A x^(k−1)),  k = 1, 2, …,  given x^(0) ∈ R^n,

for the solution of a linear system

Ax = b,  A ∈ R^{n×n},  b ∈ R^n,
where A is symmetric and positive-definite (SPD), and the iteration matrix B is
defined recursively; see, e.g., [28] for an exhaustive account. AMG methods are often
employed as preconditioners for Krylov subspace solvers; what distinguishes the methods implemented in AMG4PSBLAS are the specific details of the construction procedure
for the B matrix.
We define A_0 = A and consider the sequence {A_l}_{l=0}^{n_ℓ−1} of coarse matrices computed by the triple-matrix Galerkin product:

A_{l+1} = P_l^T A_l P_l,  l = 0, …, n_ℓ − 1,

where {P_l}_{l=0}^{n_ℓ−1} is a sequence of prolongation matrices of size n_l × n_{l+1}, with n_{l+1} < n_l and n_0 = n. To complete the formal construction we also need a sequence {M_l}_{l=0}^{n_ℓ−2} of A_l-convergent smoothers for the coarse matrices A_l, i.e., matrices M_l for which ∥I_l − M_l^{−1} A_l∥_{A_l} < 1 holds true, where I_l is the identity matrix of size n_l and ∥v∥_{A_l} = √(v^T A_l v) is the A_l-norm. The preconditioner matrix B for the V-cycle with ν pre- and post-smoothing iterations is then given by the multiplicative composition of the following error propagation matrices,

I_l − B_l A_l = (I_l − M_l^{−T} A_l)^ν (I_l − P_l B_{l+1} P_l^T A_l)(I_l − M_l^{−1} A_l)^ν,  ∀ l < n_ℓ,    (3)

with B_{n_ℓ} ≈ A_{n_ℓ}^{−1}, either as a direct solution or as a convergent iterative procedure with a fine enough tolerance.
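To make the recursion explicit, the following dense NumPy sketch applies one V-cycle to a vector, given already-built lists of level matrices, prolongators and smoothers; it mirrors the error-propagation formula (3), but it is a serial toy model and not the AMG4PSBLAS implementation.

    import numpy as np

    def vcycle(b, A, P, M, level=0, nu=4):
        # Apply one V-cycle of the preconditioner B to the vector b (dense toy model).
        # A: level matrices A_0,...,A_{n_l}; P: prolongators P_0,...,P_{n_l-1};
        # M: smoothers M_0,...,M_{n_l-1}.  It mirrors (3): pre-smoothing with M_l,
        # coarse-grid correction with B_{l+1}, post-smoothing with M_l^T.
        if level == len(A) - 1:
            return np.linalg.solve(A[level], b)          # coarsest solve (exact here)
        x = np.zeros_like(b)
        for _ in range(nu):                              # nu pre-smoothing sweeps
            x = x + np.linalg.solve(M[level], b - A[level] @ x)
        r_c = P[level].T @ (b - A[level] @ x)            # restrict the residual
        x = x + P[level] @ vcycle(r_c, A, P, M, level + 1, nu)   # coarse correction
        for _ in range(nu):                              # nu post-smoothing sweeps
            x = x + np.linalg.solve(M[level].T, b - A[level] @ x)
        return x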
For the case at hand, we select each iteration matrix of the smoother sequence {M_l}_{l=0}^{n_ℓ−1} as the one representing four iterations (ν = 4) of the hybrid forward/backward Gauss-Seidel method. We consider having A in a general row-block parallel distribution over n_p processes, i.e., A is divided into n_p blocks of size n_b × n, and we call A_pp the corresponding diagonal block of A. We then decompose each block A_pp as A_pp = L_pp + D_pp + L_pp^T, where D_pp = diag(A_pp) and L_pp is strictly lower triangular. To enforce symmetry in (3), we select M_l as the block-diagonal matrices (Figure 1)

M_l = blockdiag(M_{l,pp})_{p=1}^{n_p},   M_{l,pp} = ω (L_{l,pp} + D_{l,pp}),   l = 0, …, n_ℓ,
where ω is a damping parameter. The overall procedure thus amounts essentially to
using four sweeps of the damped block-Jacobi method on the matrix of the level while
solving the blocks with the forward, respectively backward, Gauss-Seidel method.
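A minimal serial sketch of the local action of this smoother on one diagonal block is given below (NumPy, for illustration only): the forward variant is used in the pre-smoother and the backward (transposed) one in the post-smoother, while the coupling between blocks is of block-Jacobi type.

    import numpy as np

    def local_hybrid_gs(A_pp, b_p, omega=1.0, nu=4, backward=False):
        # nu sweeps of the local Gauss-Seidel smoother on one diagonal block A_pp,
        # i.e. the action of a single process in the damped block-Jacobi iteration
        # described above (dense toy model; M follows the definition
        # M_{l,pp} = omega * (L_pp + D_pp), transposed for the backward sweep).
        D = np.diag(np.diag(A_pp))
        L = np.tril(A_pp, k=-1)                  # strictly lower triangular part
        M = omega * (L + D)
        if backward:
            M = M.T
        x = np.zeros_like(b_p)
        for _ in range(nu):
            x = x + np.linalg.solve(M, b_p - A_pp @ x)   # a triangular solve in practice
        return x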
Fig. 1: Depiction of the structure of the hybrid forward/backward Gauss-Seidel
method on a general row-block parallel distribution of a symmetric positive definite
matrix A.
To build the prolongation (and thus the restriction) matrices, we employ the coarsening strategy based on compatible weighted matching; a full account of the derivation
and detailed theoretical analysis may be found in [13, 15, 16]. This is a recursive procedure that starts from the adjacency graph G = (V, E) associated with the sparse
matrix A; this is the graph in which the vertex set V consists of either the row or
column indices of A and the edge set E corresponds to the index pairs (i, j) of the
nonzero entries in A. The method works by constructing a matching M in the graph
G to obtain a partition into subgraphs. We recall that a graph matching is a subset
of the graph’s edges such that no two of them are incident on the same vertex. Specifically, we consider more than a purely topological matching by taking into account
the weights of the edges, i.e., the values of the entries of the matrix A. In the first
step, we associate an edge weight matrix C, computed from the entries ai,j in A and
an arbitrary vector w; then, we compute an approximate maximum product matching of the whole graph to obtain the aggregates defining the coarse spaces. We define
C = (c_{i,j})_{i,j} as

c_{i,j} = 1 − (2 a_{i,j} w_i w_j)/(a_{i,i} w_i^2 + a_{j,j} w_j^2);    (4)

then, M is an approximate maximum product matching of G with edge weight matrix C, i.e.,

M ≈ arg max_{M′} ∏_{(i,j)∈M′} c_{i,j}.    (5)
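As an illustration of how (4) and (5) translate into an aggregation, the following serial NumPy sketch computes the edge weights and builds pairwise aggregates with a simple greedy pass, adding singleton aggregates for unmatched vertices as discussed below; it is only a toy surrogate for the distributed matching algorithm actually used by the library.

    import numpy as np

    def matching_aggregates(A, w):
        # Compute the edge weights (4) for a dense symmetric toy matrix A and an
        # arbitrary vector w, then form pairwise aggregates by a greedy pass over
        # the edges sorted by decreasing weight.  This serial greedy matching is
        # only an illustrative surrogate for the distributed half-approximate
        # maximum weight matching computed by MatchBox-P.
        n = A.shape[0]
        edges = []
        for i in range(n):
            for j in range(i + 1, n):
                if A[i, j] != 0.0:
                    c_ij = 1.0 - 2.0 * A[i, j] * w[i] * w[j] / (A[i, i] * w[i]**2 + A[j, j] * w[j]**2)
                    edges.append((c_ij, i, j))
        matched = [False] * n
        aggregates = []
        for c_ij, i, j in sorted(edges, reverse=True):
            if not matched[i] and not matched[j]:
                matched[i] = matched[j] = True
                aggregates.append([i, j])            # matched pair -> aggregate G_p
        aggregates += [[i] for i in range(n) if not matched[i]]   # singletons
        return aggregates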
The aggregates are then the subsets of indices {G_p}_{p=1}^{|M|} of the whole index set I of A made of pairs of indices matched by the algorithm, where we denote with |M| the cardinality of the graph matching M. In other terms, we have obtained the decomposition

I = {1, …, n} = ⋃_{p=1}^{n_M} G_p,   G_p ∩ G_r = ∅ if p ≠ r;
see, e.g., Figure 2 in which the matching of a test graph is computed–in more detail,
Figure 2a shows a black dot for each non-zero element of the adjacency matrix;
Figure 2b shows the corresponding graph obtained from it; while Figure 2c highlights
the aggregated nodes, i.e., the G_p sets.

(a) Adjacency matrix. (b) Original graph. (c) Matched edges.
Fig. 2: Matching of the graph bcspwr01 from the Harwell-Boeing collection. The matched nodes in the graph are highlighted by a bold red edge.

In most cases, we will end up with a sub-optimal matching, i.e., not all vertices will be endpoints of matched edges; thus, we usually have
unmatched vertices. To each unmatched vertex, we associate an aggregate Gs that is
a singleton, and we denote with nS the total number of singletons. The main computational cost of this phase is represented by the computation of the approximate graph
matching on a graph that is distributed across thousands of processors. The parallel
coarsening implemented in AMG4PSBLAS uses the MatchBox-P software library [11]; this
implements a distributed parallel algorithm for the computation of half-approximate
maximum weight matching with complexity O(|E|∆), where |E| is the cardinality of
the graph edge set and ∆ is the maximum vertex degree, i.e., the maximum number
of edges incident on any given node of the graph. The procedure guarantees a solution that is at least half of the optimal weight, i.e., the approximation in (5) holds
within 1/2 of the optimum. The message aggregation and the overlapping between communication and computation employed by this strategy reduce the impact of the
data communication on parallel efficiency; we refer the reader to [16] for a complete
set of experiments showcasing this feature. Finally, to build the prolongator matrices,
the last ingredients we need are the vectors w_e, identifying for each edge e_{i↦j} ∈ M the orthonormal projection of w onto the non-singleton aggregate G_p. For the sake of the explanation, we consider an ordering of the indices in which we move all the unknowns corresponding to unmatched vertices to the bottom (this ordering is for explanatory purposes only and is not actually enforced in practice), and thus define a tentative prolongator

P̂ = [ P̃  0 ; 0  W ] ∈ R^{n×n_c},    (6)

where P̃ = blockdiag(w_{e_1}, …, w_{e_{n_M}}) and W = diag(w_s/|w_s|), s = 1, …, n_S, corresponds to the unmatched vertices. The resulting number of coarse variables is then given by n_c = n_M + n_S. The matrix P̂ we have
just built is a piecewise constant interpolation operator whose range includes, by
construction, the vector w. The actual prolongator P is then obtained from P̂ as
P = (I − ωD^{−1}A)P̂, where D = diag(A) and ω = 1/∥D^{−1}A∥_∞ ≈ 1/ρ(D^{−1}A), with ρ(D^{−1}A) the spectral radius of D^{−1}A. Indeed, the P we have built is an instance of
smoothed aggregation. Please observe that the procedure we have described produces,
at best, a halving of the size of the system at each new level of the hierarchy. Given the
size of the systems we are interested in, this may be unsatisfactory since the number
of levels in the hierarchy, and thus the operational cost needed to cross it, would be
too large. Fortunately, it is rather easy to overcome this issue: to obtain aggregates
of size greater than two, we just have to collect them together by multiplying the
corresponding prolongators (restrictors). This permits us to select the desired size of
the aggregates (2, 4, 8, and so on) as an input parameter of the method.
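The construction just described can be summarized by the following dense NumPy sketch (a toy model, not the library code), which assembles the tentative prolongator from the aggregates and then applies the weighted-Jacobi smoothing step; composing two such pairwise prolongators yields aggregates of size 4, three of them aggregates of size 8, and so on.

    import numpy as np

    def smoothed_prolongator(A, w, aggregates):
        # Build the tentative prolongator (6) from the aggregates and the vector w
        # (each column is the normalized restriction of w to an aggregate, so that
        # w lies in its range), then apply the Jacobi smoothing step
        # P = (I - omega * D^{-1} A) * P_hat.  Dense toy model for illustration.
        n = A.shape[0]
        P_hat = np.zeros((n, len(aggregates)))
        for p, agg in enumerate(aggregates):
            v = w[agg]
            P_hat[agg, p] = v / np.linalg.norm(v)
        Dinv_A = A / np.diag(A)[:, None]                 # D^{-1} A
        omega = 1.0 / np.linalg.norm(Dinv_A, ord=np.inf) # ~ 1 / rho(D^{-1} A)
        P = P_hat - omega * (Dinv_A @ P_hat)
        return P, P.T @ A @ P                            # prolongator and Galerkin coarse matrix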
To conclude the description of the preconditioners, we need to specify the choice
for the coarsest solver. While using a direct solver at the coarsest level is the easiest
way to ensure that the coarsest grid is resolved to the needed tolerance, such an
approach for an AMG method running on many thousands of parallel cores can be very
expensive. If the matching strategy has worked satisfactorily, the coarsest-level matrix
will tend to have both a small global size and a small number of rows per core: in this
case the cost of data communication will dominate the local arithmetic computations
causing a deterioration of the method efficiency. We use here a dual strategy: on
the one hand, we employ a distributed coarsest solver running on all the parallel
cores, whilst on the other, we limit the maximum size of the coarsest-level matrix to
around 200 unknowns per core. Specifically, we use the Flexible Conjugate Gradient
(FCG) method with a block-Jacobi preconditioner in which the blocks are solved approximately by an incomplete LU factorization with one level of fill-in, ILU(1); the stopping criterion is based on a reduction of the relative residual by 3 orders of magnitude or a maximum of 30 iterations.
To have a comparison with the preconditioner just discussed, we also consider the
same construction but with a different aggregation procedure: the decoupled version
of the classic smoothed aggregation of Vaněk et al [27]. This is an aggregation option
that was already available in previous versions of the library [9, 14], and was already
successfully used in CFD applications [2, 3]. The basic idea is to build a coarse set
of indices by grouping unknowns into disjoint subsets (the aggregates) by using an
affinity measure and defining a simple tentative prolongator whose range contains
the so-called near null space of the matrix of the given level, i.e., a sample of the
eigenvector corresponding to the smallest eigenvalue. The strategy is implemented in
an embarrassingly parallel fashion: each process produces aggregates by looking only at its local unknowns, i.e., the aggregation is performed in a decoupled fashion, in contrast to the previous matching-based procedure, which instead crosses process boundaries.
Table 1 summarizes the different preconditioners we have discussed here and that
are used in the experiments of Section 6.1.
Pre-smoother:     4 iterations of hybrid forward Gauss-Seidel
Post-smoother:    4 iterations of hybrid backward Gauss-Seidel
Coarsest solver:  FCG preconditioned by block-Jacobi with ILU(1) block solvers
Cycle:            V-cycle

Aggregation                                        Label
Coupled smoothed based on matching, |G| ≤ 8        MLVSMATCH3
Coupled smoothed based on matching, |G| ≤ 16       MLVSMATCH4
Decoupled classic smoothed                         MLVSBM

Table 1: Summary of the described preconditioners; the labels are used to describe the results in Section 6.1.
Remark 1. The AMG4PSBLAS library provides interfaces to some widely used parallel
direct solvers, such as SuperLU [23] and MUMPS [1]. Thus, we could have used any of
those within the damped block-Jacobi method, either in the smoother or in the coarsest solver. Concerning the smoothers, it has been observed in the literature [4, 16] that the combination with the Gauss-Seidel method delivers better smoothing properties for the overall method. In the coarsest-solver case, the local matrices are usually small enough that no significant performance increase is observed when using a direct
solver. We also stress that the preconditioner described in this section depends only
on native PSCToolkit code, i.e., the user does not have to install optional third-party
libraries to use it.
4 Interfacing Alya to PSCToolkit
The Alya code is organized in a modular way, and its architecture is split into modules,
kernel, and services, which can be separately compiled and linked. Each module represents a physical model, i.e., a set of partial differential equations which can interact
for running a multi-physics simulation in a time-splitting approach, while Alya’s kernel implements the functionalities for dealing with the discretization mesh, the solvers
and the I/O functionalities. As already mentioned, the governing equations of a physical model are discretized in space by using FE methods and all the functionalities
to assemble the global stiffness matrix and right-hand-side (RHS) of the corresponding set of equations, including boundary conditions and material properties, are the
responsibility of the module. Instead, all the functionalities needed to solve the algebraic linear systems are implemented in the kernel. Some work on data structures and
distribution of matrices and RHS was necessary to interface Alya with libraries from
PSCToolkit, as described in the following.
Alya uses the compressed sparse row matrix scheme for the internal representation
of sparse matrices. This scheme is supported by PSCToolkit, so no significant difficulty
was met from this perspective. The main difficulty in the interfacing process was how
the data, i.e. the discretization mesh and the corresponding unknowns, are distributed
among the parallel processes and the way the related sparse matrix rows and RHS
are locally assembled. The Alya code is based on a domain decomposition where the
discretization mesh is partitioned into disjoint subsets of elements/nodes, referred to
as subdomains. Then, each subdomain is assigned to a parallel process that carries out
all the geometrical and algebraic operations corresponding to that part of the domain
and the associated unknowns. The interface elements/nodes on the boundary between
two subdomains are assigned to one of the subdomains (see Figure 3).
(a) In white, interface nodes.
(b) In white, halo elements.
Fig. 3: Mesh partitioning into (3a) disjoint sets of nodes, and (3b) disjoint sets of
elements.
The sparse matrices expressing the linear couplings among the unknowns are distributed in such a way that each parallel process holds the entries associated with the
couplings generated on its subdomain. Two different options are possible for sparse
matrix distribution: the partial row format and the full row format [20], respectively.
In the full row format, if a mesh element/node and the corresponding unknown belong
to a process, all row entries related to that unknown are stored by that process. In
the partial row format, the row of a matrix corresponding to an unknown is not full
and needs contributions from unknowns belonging to different processes. Alya uses a
partial row format for storing the matrix.
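The difference between the two formats can be seen on a one-dimensional toy example (illustrative NumPy snippet below, unrelated to the actual Alya or PSBLAS data structures): in the full row format, the complete interface row is the sum of the partial rows assembled independently by the two neighbouring processes.

    import numpy as np

    # Toy 1D example: a linear-FE Laplacian on 5 nodes, elements (0,1),(1,2) on
    # process 0 and (2,3),(3,4) on process 1; the interface node 2 is owned by
    # process 0.  In the partial row format each process keeps only the
    # contributions assembled from its own elements, so row 2 is incomplete on
    # both sides; in the full row format the owner gathers and sums them.
    row2_from_p0 = np.array([0.0, -1.0,  1.0,  0.0, 0.0])   # contribution of element (1,2)
    row2_from_p1 = np.array([0.0,  0.0,  1.0, -1.0, 0.0])   # contribution of element (2,3)

    row2_full = row2_from_p0 + row2_from_p1                  # full row held by process 0
    print(row2_full)                                         # -> [ 0. -1.  2. -1.  0.]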
The libraries from PSCToolkit build the preconditioners and apply the Krylov
methods on the assumption of a full row format; nevertheless, support for partial
row format was added to the libraries’ pre-processing stage so that the interfacing
can be as transparent as possible. The pre-processing support implies the retrieval
of remote information for those matrix contributions that correspond to elements on
the boundary; the data communication is split between the discovery of the needed
entries (which only needs to be executed when the discretization mesh changes) and the
actual retrieval of the matrix entries, which must happen at any time step where
the matrix coefficients and/or vector entries may be rebuilt, prior to an invocation
of the solvers. When the topology of the mesh does not change, and there is only an
update in the coefficients, it is also possible to reuse the same preconditioner; this may
be a full reuse of the overall matrix hierarchy, or a partial reuse, employing the same
prolongators/restrictors to rebuild the AMG hierarchy and smoothers.
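The overall call pattern can be summarized by the hedged Python sketch below; all the callables are hypothetical placeholders standing in for the Alya module and the library routines, so the sketch only illustrates the logic described in this section.

    def pressure_solver_driver(n_steps, assemble_matrix, assemble_rhs, discover_halo,
                               exchange_rows, build_prec, update_prec, krylov_solve,
                               coefficients_change=False, reuse="full"):
        # Hedged sketch of the per-time-step interaction just described; every
        # callable argument is a placeholder, not an Alya or PSCToolkit API name.
        halo = discover_halo()                       # only when the mesh topology changes
        A = exchange_rows(assemble_matrix(), halo)   # partial-row data -> full-row matrix
        prec = build_prec(A)                         # AMG setup: once for a fixed mesh
        x = None
        for step in range(n_steps):
            if coefficients_change:                  # same topology, new coefficients
                A = exchange_rows(assemble_matrix(), halo)
                if reuse == "none":
                    prec = build_prec(A)             # full preconditioner rebuild
                elif reuse == "partial":             # keep prolongators, rebuild the rest
                    prec = update_prec(prec, A)
                # reuse == "full": keep the existing hierarchy and smoothers
            x = krylov_solve(A, assemble_rhs(step), prec)
        return x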
We developed a software module in Alya’s kernel for declaration, allocation, and
initialization of the library’s data structures as well as for using solvers and preconditioners. PSBLAS makes available some of the widely used iterative methods based on
Krylov projection methods through a single interface to a driver routine, while preconditioners for PSBLAS Krylov solvers are available through the AMG4PSBLAS package.
The main functionalities for selecting and building the chosen preconditioner are the
responsibility of the software module included in Alya’s kernel, while the functionalities for applying it within the PSBLAS Krylov solver are completely transparent to
the Alya code and are the responsibility of the library.
5 The Bolund test case
Fig. 4: Photograph of the Bolund hill [7].
Our main aim was to test the libraries for systems stemming from fluid dynamics
simulation of incompressible flow arising in the study of wind-farm efficiency. The
test case is based on the Bolund experiment, a classical benchmark for microscale
atmospheric flow models over complex terrain [5, 6]. An incompressible flow treatment
is used because the Mach number, i.e., the ratio of the speed of the flow to the speed
of sound, is much smaller than 0.3. The test case is based on a small (12 m) isolated
steep hill at Roskilde Fjord in Denmark having a significantly steep escarpment in the
main wind direction and uniformly covered by grass so that the resulting flow is not
influenced by individual roughness elements. This is considered the ideal benchmark
for the validation of neutral flow models and, hence, a most relevant scenario for the
analysis of software modeling for wind energy. Though relatively small, its geometrical
shape induces complex 3D flow. Bolund was equipped with several measurement masts
with conventional meteorological instruments and remote sensing Lidars to obtain
detailed information of mean wind, wind shear, turbulence intensities, etc. A publicly
available database for evaluating currently available flow models and methodologies
for turbine siting in complex terrain regarding wind resources and loads is available
at [7].
Fig. 5: Volume rendering of the velocity over Bolund obtained with Alya.
We discretize the incompressible Navier-Stokes equation (1) as described in
Section 2. At each time step of the LES procedure, we solved the SPD linear systems
arising from the pressure equation employing the preconditioned flexible version of
the CG method (FCG) provided by PSBLAS. Starting from an initial guess for the pressure taken from the previous time step, we stopped the linear iterations when the Euclidean norm of the relative residual was no larger than TOL = 10^-3. The Reynolds number based on the friction velocity for this test case is approximately Re_τ = Uh/ν ≈ 10^7, with U = 10 m s^-1. As discussed in [6, Section 2.1], we can neglect the Coriolis force in the horizontal direction and use the formulation (1) since the Rossby number Ro = 667 ≫ 1.
The next Section 6 details the scalability results obtained for this test case with the
new solvers and preconditioners from PSCToolkit described in Section 3.
6 Parallel Performance Results
In the following we discuss the results of experiments run on two of the most powerful European supercomputers. The first set of experiments aimed to analyze the
behavior of different AMG preconditioners available from AMG4PSBLAS and run on
the Marenostrum-4 supercomputer up to 12288 CPU cores. Marenostrum-4 is composed of 3456 nodes with 2 Intel Xeon Platinum 8160 CPUs with 24 cores per CPU.
It is ranked 121st in the November 2023 TOP500 list (available at www.top500.org), with more than 10 petaflops
of peak performance and is operated by the Barcelona Supercomputing Center. The
simulations have been performed with the Alya code interfaced to PSBLAS (3.7.0.1)
and AMG4PSBLAS (1.0), built with GNU compilers 7.2. The second set of experiments
aimed to reach very large scales and was run using only one of the most promising preconditioners from AMG4PSBLAS on the Juwels supercomputer, up to 23551 CPU cores.
Juwels is composed of 2271 compute nodes with 2 Intel Xeon Platinum 8168 CPUs,
of 24 cores each. It is ranked 127th in the June 2022 TOP500 list, with more than
9 petaflops of peak performance, and is operated by the Jülich Supercomputing Centre. The simulations have been performed with the Alya code interfaced to the same
versions of the solver libraries mentioned above, built with GNU compilers 10.3.
6.1 Comparison of AMG Preconditioners
In this section, we discuss results obtained on Marenostrum-4 and compare the behavior of FCG coupled to the preconditioners described in Section 3.1 and summarized in
Table 1. We ran both a strong scalability analysis for unstructured tetrahedral meshes of three fixed sizes and a weak scalability analysis, obtained by fixing different
mesh sizes per core and linearly increasing both mesh size and the number of cores.
A general row-block matrix distribution based on the Metis 4.0 mesh partitioner [21]
was applied for the parallel runs.
6.1.1 Strong scalability
We first focus on strong scalability results obtained on the Bolund experiment for
three fixed-size meshes (small, medium and large) including n1 = 5570786 ≈ 6 × 10^6, n2 = 43619693 ≈ 4.4 × 10^7 and n3 = 345276325 ≈ 0.35 × 10^9 dofs, respectively. Three
different configurations of the number of cores, obtained by doubling each time the
number of MPI cores with respect to the minimum number of cores (nodes) needed
to run at full load, were employed for the three different mesh sizes: from minp = 48
to maxp = 192 cores in the case of the small mesh, from minp = 384 to maxp = 1536
cores for the medium mesh, and finally from minp = 3072 to maxp = 12288 cores
for the large mesh. We analyze the parallel efficiency and convergence behavior of the
linear solvers for 20 time steps after a pre-processing phase so that we focus on the
solvers’ behavior in the simulation of a fully developed flow. Note that in the Alya
code a master-slave approach is employed, where the master process is not involved
in the parallel computations.
In Figures 6-7, we report a comparison of the different methods in terms of the
total number of iterations of the linear solvers and of the solve time per iteration (in
seconds), respectively. Note that in the figures we also have results obtained with a
version of Deflated CG (AlyaDefCG), available from the original Alya code.
We can observe that the total number of linear iterations is much smaller than
that with the original AlyaDefCG, for all three meshes, when AMG4PSBLAS multilevel
preconditioners are applied. For the small mesh, the minimum number of linear iterations is obtained by MLVSBM which shows a fixed number of 60 iterations for all core
counts, while MLVSMATCH3 requires 90 iterations for all core counts except on 192 cores, where one fewer iteration was needed, and MLVSMATCH4 requires 100 iterations; in this case, the original AlyaDefCG requires 700 iterations for all core counts.

Fig. 6: Strong scalability: total iteration number of the linear solvers (one panel per system size n1, n2 and n3; x-axis: number of MPI cores; curves: MLVSBM, MLVSMATCH3, MLVSMATCH4, AlyaDefCG).
In the case of the medium mesh, we observe a larger number of iterations of the
solvers employing AMG4PSBLAS preconditioners with respect to the large mesh. We have
a minimum number of iterations with MLVSMATCH3 ranging from 122 to 123 for all
number of cores, while MLVSMATCH4 requires a range from 160 to 161 iterations
and MLVSBM requires a range from 172 to 174 iterations. The original AlyaDefCG
requires a number of iterations ranging from 1040 to 1042 for the medium mesh.
In the case of the large mesh, the number of iterations required by MLVSMATCH3
ranges between 108 on 3072 cores and 137 on 12288 cores, while MLVSMATCH4
requires a more stable number of iterations ranging from 115 to 117; a similarly stable
behavior is observed for MLVSBM which requires a number of iterations ranging from
121 to 123. AlyaDefCG requires a number of iterations ranging from 1404 to 1406 for
the large mesh.
The oscillations in the number of iterations seem to be mostly dependent on the
data partitioning obtained by Metis, which in turn, appears to have a larger impact
on the MLVSMATCH3 preconditioner in the case of the large mesh. A deeper analysis
of the impact of the data partitioner on the solver behavior, albeit interesting, is out of the scope of our current work and would require a significant amount of computing resources.

Fig. 7: Strong scalability: time per iteration (s) of the linear solvers (one panel per system size n1, n2 and n3; x-axis: number of MPI cores).
In all cases, the time needed per iteration decreases for an increasing number of
cores and, as expected, it is larger for the AMG preconditioners, where the cost for the
preconditioner application at each FCG iteration is larger than that of AlyaDefCG.
Depending on mesh size and number of cores, the AMG preconditioners show very
similar behavior, although MLVSBM always requires a smaller time per iteration for
the large mesh and for the medium mesh when 1536 cores are used.
In Figures 8-9, we can see the total solve time spent in the linear solvers and
the resulting speedup for the preconditioners. Here, we define speedup as the ratio
S_p = T_minp/T_p, where T_minp is the total time for solving linear systems when the minimum number of total cores, per each problem size, is involved in the simulation, and T_p is the total time spent in linear solvers for all the increasing number of cores
used for the specified mesh size.
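For reference, both this speedup and the scaled speedup used in the weak-scalability analysis of Section 6.1.2 can be computed from the measured solve times as in the short sketch below; the timing values shown are placeholders, not measured data.

    def speedup(times):
        # Strong-scaling speedup S_p = T_minp / T_p, with times[0] measured on the
        # minimum number of cores for the given mesh size.
        return [times[0] / t for t in times]

    def scaled_speedup(times, scalfactors):
        # Weak-scaling scaled speedup scalfactor * T_minp / T_p (used in Section 6.1.2).
        return [s * times[0] / t for s, t in zip(scalfactors, times)]

    # Placeholder timings, not measured data; ideal strong-scaling values are 1, 2, 4.
    print(speedup([20.0, 11.0, 6.0]))                  # -> [1.0, 1.818..., 3.333...]
    print(scaled_speedup([20.0, 22.0, 25.0], (1, 8, 64)))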
We observe that the AMG preconditioners from AMG4PSBLAS generally achieve
shorter execution times than the original AlyaDefCG; indeed, the expected longer
time per iteration is more than compensated by the large reduction in the number of
iterations especially for the small and large mesh. In good agreement with the behavior
in terms of iterations and time per iteration, we observe that MLVSBM generally shows
the shortest execution time for the small mesh, especially for small number of cores,
while for the medium and large mesh, MLVSMATCH3 and MLVSMATCH4 show
better or comparable behaviour with respect to MLVSBM. The best speedups are
generally obtained, except for the small mesh, by the original AlyaDefCG, while in
the case of AMG preconditioners, the very good convergence behavior and solve time
on the smallest number of cores limit the speedup for the increasing number of cores.
For the AMG4PSBLAS preconditioners, speedups are in good agreement with the total
solve times, showing that MLVSMATCH3 and MLVSMATCH4 are generally better
or comparable with respect to MLVSBM for all meshes when the small and medium
number of cores are used, while MLVSBM is better for medium and large mesh when
the largest number of cores is used.
Fig. 8: Strong scalability: total solve time (s) of the linear solvers (one panel per system size n1, n2 and n3; x-axis: number of MPI cores).
In conclusion, the selected solvers from PSCToolkit generally outperform the original Alya solver for the employed test case, and the choice of the best preconditioner from AMG4PSBLAS depends on the target mesh size and the number of parallel cores employed. This is an advantage for Alya’s users who, having available a
large set of parallel preconditioners through the interface to PSCToolkit, can select the best one for their specific aims.

Fig. 9: Strong scalability: speedup of the linear solvers (one panel per system size n1, n2 and n3; x-axis: number of MPI cores). We note that the ideal values for the speedups in all three configurations are 1, 2 and 4, respectively.
6.1.2 Weak scalability
In this section, we analyze the weak scalability of the AMG4PSBLAS preconditioners,
i.e., we observe the solvers looking at their behavior when we fix the mesh size per
core and increase the number of cores.
We considered the same test case and the three meshes of the previous section in
the three possible configurations of computational cores, from 48 up to 3072, from 96
up to 6144 and from 192 to 12288. The different configurations of cores correspond to
three different (decreasing) mesh sizes per core equal to nxcore1 = 1.1e5, nxcore2 =
5.9e4, and nxcore3 = 2.9e4, respectively. Note that the medium and the large mesh
correspond to scaling factors of 8 and 64, respectively, with respect to the small mesh;
therefore in the same way, we scaled the number of cores for our weak scalability
analysis.
We limit our analysis to the average number of linear iterations per time step of the different preconditioners in the various simulations, and to the execution times and scaled speedup of the solve phase. In Figure 10, we report the
average number of iterations for each time step. We can observe a general increase,
ranging from 35 to 70 for an increasing number of cores when the original AlyaDefCG
is employed. On the other hand, when AMG preconditioners from AMG4PSBLAS coupled
with FCG by PSBLAS are applied, we observe a constant average number of iterations
equal to 5 for MLVSMATCH4 both for the small and the large mesh, independently
of the number of cores, while MLVSBM requires 3 iterations for the small mesh and
6 for the large mesh. MLVSMATCH3 ranges from 4 to 6 iterations on the small mesh
and the large mesh, respectively. In the case of the medium mesh, in agreement with what
was observed for the strong scalability analysis, all the preconditioners require a larger
average number of iterations, which is 8 for MLVSBM and MLVSMATCH4, and 6 for
MLVSMATCH3. This behavior indicates a very promising algorithmic scalability of
MLVSMATCH4.

Fig. 10: Weak scalability: average number of linear iterations per time step. nxcore1 dofs per core (10a), nxcore2 dofs per core (10b), nxcore3 dofs per core (10c). (x-axis: number of MPI cores; curves: MLVSBM, MLVSMATCH3, MLVSMATCH4, AlyaDefCG.)

In Figures 11-12, we can see the total solve time and the corresponding scaled speedup. We can observe that, as expected from the previous sections, all
preconditioners from AMG4PSBLAS generally lead to a smaller increase ratio in the solve
times with respect to the original AlyaDefCG, when the mesh size goes from the small
to the large one. In more detail, we observe that, for all mesh sizes per core, smaller
increase ratios in the execution time are generally obtained with MLVSMATCH3 and
MLVSMATCH4. This is better observed by looking at the scaled speedup. It is defined
as scalfactor × T_minp/T_p, where scalfactor = 1, 8, 64 for the three increasing numbers of cores, T_minp is the total time for solving linear systems when the minimum number of total cores is involved in the simulation, for each mesh size per core, and T_p is the
total time spent in linear solvers for all the increasing number of cores used for the
specified mesh size per core. We observe that the best values are obtained with the
MLVSMATCH3 and MLVSMATCH4 preconditioners when nxcore1 and nxcore2 dofs
per core are used. In detail, for nxcore1 dofs per core, MLVSMATCH3 reaches the
best value of about 71% of scaled efficiency on 3072 cores, and about 44% of scaled efficiency on 6144 cores when nxcore2 dofs per core are employed. This shows that the scalability of MLVSMATCH3 and MLVSMATCH4 is very promising in facing the
exascale challenge, especially when the resources are used at their best in terms of
node memory capacity and bandwidth. On the other hand, in the case of nxcore3 dofs
per core (12c), the scaled speedup of AlyaDefCG is better; this is essentially due to
the very large solve time spent by this solver on 192 cores.
Fig. 11: Weak scalability: total solve time (s) of the linear solvers. nxcore1 dofs per core (11a), nxcore2 dofs per core (11b), nxcore3 dofs per core (11c).

Fig. 12: Weak scalability: the scaled speedup of the linear solvers. nxcore1 dofs per core (12a), nxcore2 dofs per core (12b), nxcore3 dofs per core (12c).
6.2 Results at extreme scales
In this section, we discuss some results obtained on the Juwels supercomputer by
increasing the number of dofs up to n4 ≈ 2.9 × 10^9. We limit our analysis to the
weak scalability results of one of the most promising solvers in PSCToolkit. Indeed,
due to the limited access to the Juwels resources and taking into account the above
preconditioner comparison, we only ran experiments using the MLVSMATCH4
preconditioner. A general row-block data distribution based on a parallel geometric
partitioning using Space Filling Curve (SFC) [8] was applied for these experiments.
As in the previous experiments, we analyze the parallel efficiency and convergence
behavior of the linear solver for 20 time steps after a pre-processing phase so that
we focus on the solver behavior in the simulation of a fully developed flow for all the
meshes but the largest one, where we were not able to skip the transient phase due
to long simulation time. In this last case, we considered a total number of time steps
equal to 1379 and analyzed solver performance in the last 20 time steps. Note that
increasing mesh size imposes a decrease in time step due to stability constraints of the
explicit time discretization that is preferred for LES simulations. Therefore, the total
simulated time depends on the mesh size. Furthermore, to reduce the performance oscillations observed with full-node runs, we used only 46 cores per node.
As already mentioned, we analyze the weak scalability of the solvers; we considered
a mesh size per core equal to nxcore1 and used a scaling factor of 8 for going up to
the largest mesh size; therefore in the same way, we scaled the number of cores for our
weak scalability analysis. We limit our analysis to the average number of linear iterations of the solver per time step and to the execution times
and scaled speedup for the solve phase. We compare the results obtained by using
the PSCToolkit solver against Alya’s Conjugate Gradient solver (hereafter AlyaCG).
Observe that in these experiments, we also tried to use the Deflated CG implemented
in Alya, but it did not work for the two larger test cases, and AlyaCG performed better on the smaller meshes. In Figure 13, we report the average number of
iterations per time step. When the original AlyaCG is employed, we observe a general increase for an increasing number of cores, from 133 to 331 average iterations, except on 368 cores, where an average of 95 iterations is obtained; when the PSCToolkit solver is applied, we instead observe very good algorithmic scalability, with an average number of linear iterations per time step ranging from 4 to 6. In Figures 14a-14b, we can see the
total solve time and the corresponding scaled speedup. We can observe that the good
algorithmic scalability of MLVSMATCH4 leads to an almost flat execution time for
solving when the first three meshes are employed, while a decrease is observed for the
simulation carried out with the largest mesh, owing to a smaller average number of iterations per time step. On the contrary, the original AlyaCG generally shows a large increase for increasing numbers of cores and mesh sizes, except for the second mesh, where a decrease in the average number of iterations per time step is observed. Then we look
at the scaled speedup, defined as scalfactor × T_45/T_p, where scalfactor = 1, 8, 64, 512 for the increasing numbers of cores, T_45 is the total time for solving linear systems when 45 cores are involved in the simulation, and T_p is the total time spent in linear solvers
for all the increasing number of cores. We observe that for the two larger meshes,
MLVSMATCH4 has a super-linear scaled speed-up of about 71 (up from the ideal
speedup of 64) and 640 (up from the ideal speedup of 512), respectively, showing
that its very good algorithmic scalability is coupled with excellent implementation
scalability of all the basic computational kernels. This scalability is very promising in facing the exascale challenge.

Fig. 13: Weak scalability: average number of linear iterations per time step. Systems size from n1 to n4 (x-axis: number of MPI cores; curves: AlyaCG, MLVSMATCH4).
Fig. 14: Weak scalability: systems size from n1 to n4. (a) Total solve time of the linear solvers (s). (b) Scaled speedup of the linear solvers.
7 Conclusions
In this paper, we presented our work on improving the linear solver capabilities of
a large-scale CFD code by interfacing it with a software framework, including new
and state-of-the-art algebraic linear solvers, specifically designed to exploit the very
large potential of current petascale supercomputers and aimed at the early exascale
supercomputers. Our activities were carried out in the context of the European Center
of Excellence for Energy applications, where one of the lighthouse codes was the Alya
code, developed at the Barcelona Supercomputing Center (BSC) and applied to wind
flow studies for renewable production. However, this work has a wider impact, and
confirms the benefits of using third-party software libraries developed by specialists,
in complex, multi-component and multi-physics simulation codes.
From Alya’s perspective, the most significant achievement has been obtaining
excellent algorithmic scalability thanks to multigrid preconditioners, as shown in the
weak scalability studies. This allows us to solve much bigger problems efficiently. Our
first objective for the future is to test the GPU version of PSCToolkit. During EoCoE,
we have significantly optimized the FE assembly on GPUs, making it four times more
energy efficient than the CPU version; integrating a competitive linear algebra GPU
package is now the next priority. After that, having the entire workflow for incompressible flow problems on GPUs should be relatively straightforward. We expect to
have a much higher number of unknowns per MPI process when
GPU accelerators are exploited. Therefore, strong scalability should be much less critical. While we have focused on a wind energy problem in this work, we wish to test the
solver in other incompressible flow problems in the future. Moreover, since the solver
is fully interfaced with Alya, it will be interesting to test the suitability of PSCToolkit
for other problems, such as solid mechanics or heat transfer.
Acknowledgments. We thank the two anonymous reviewers whose suggestions
helped improve and clarify this manuscript. P.D., F.D. and S.F. are members of the
INdAM Research Group GNCS.
8 Declarations
Ethical Approval
N/A
Competing interests
N/A
Authors’ contributions
The first two authors are involved in the development of the Alya code and worked
on the integration of PSCToolkit into Alya. Herbert Owen made the main contribution to the setup and running of the experiments. The last three authors are the developers of PSCToolkit and experts in linear algebra. They contributed equally to the setup of suitable linear solvers, the analysis of the results, and the writing of the paper.
Funding
The research received funding and PRACE grants for supercomputer access from
Horizon 2020 Project “Energy oriented Centre of Excellence: toward exascale for
energy” (EoCoE–II), Project ID: 824158.
Availability of data and materials
PSCToolkit is available at https://psctoolkit.github.io/. The data that support the
findings of this study are available from the corresponding author upon reasonable
request.
References
[1] Amestoy P, Duff I, L’Excellent JY (2000) Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput Methods Appl Mech Eng 184(2):501–520. https://doi.org/10.1016/S0045-7825(99)00242-X
[2] Aprovitola A, D’Ambra P, Denaro F, et al (2011) Scalable algebraic multilevel
preconditioners with application to CFD. Lecture Notes in Computational Science
and Engineering 74 LNCSE:15–27. https://doi.org/10.1007/978-3-642-14438-7_2
[3] Aprovitola A, D’Ambra P, Denaro FM, et al (2015) SParC-LES: Enabling large
eddy simulations with parallel sparse matrix computation tools. Computers and
Mathematics with Applications 70(11):2688 – 2700. https://doi.org/10.1016/j.
camwa.2015.06.028
[4] Baker AH, Falgout RD, Kolev TV, et al (2011) Multigrid smoothers for ultraparallel computing. SIAM J Sci Comput 33(5):2864–2887. https://doi.org/10.
1137/100798806
[5] Bechmann A, Sorensen NN, Berg J, et al (2011) The Bolund experiment, Part II:
Blind comparison of microscale flow models. Boundary-Layer Meteorology 141(2)
[6] Berg J, Mann J, Bechmann A, et al (2011) The Bolund Experiment,
Part I: Flow Over a Steep, Three-Dimensional Hill. Bound-Layer Meteorol
141(2):219. https://doi.org/10.1007/s10546-011-9636-y
[7] Bolund (2022) The Bolund experiment. https://www.bolund.vindenergi.dtu.dk,
accessed: 2022-09-27
[8] Borrell R, Cajas JC, Mira D, et al (2018) Parallel mesh partitioning based on
space filling curves. Comput & Fluids 173:264–272. https://doi.org/10.1016/j.compfluid.2018.01.040
[9] Buttari A, D’Ambra P, Di Serafino D, et al (2007) 2LEV-D2P4: A package
of high-performance preconditioners for scientific and engineering applications.
Applicable Algebra in Engineering, Communications and Computing 18(3):223 –
239. https://doi.org/10.1007/s00200-007-0035-z
[10] Capuano F, Coppola G, Chiatto M, et al (2016) Approximate projection method for the incompressible Navier-Stokes equations. AIAA Journal
54(7):2179–2182. https://doi.org/10.2514/1.J054569
[11] Catalyürek UV, Dobrian F, Gebremedhin A, et al (2011) Distributed-memory
parallel algorithms for matching and coloring. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pp
1971–1980, https://doi.org/10.1109/IPDPS.2011.360
[12] Codina R (2001) Pressure stability in fractional step finite element methods for
incompressible flows. J Comput Phys 170:112–140
[13] D’Ambra P, Vassilevski PS (2013) Adaptive AMG with coarsening based on compatible weighted matching. Comput Vis Sci 16(2):59–76. https://doi.org/10.1007/
s00791-014-0224-9
[14] D’Ambra P, di Serafino D, Filippone S (2010) MLD2P4: a package of parallel algebraic multilevel domain decomposition preconditioners in Fortran 95. ACM Trans
Math Software 37(3):Art. 30, 23. https://doi.org/10.1145/1824801.1824808
[15] D’Ambra P, Filippone S, Vassilevski PS (2018) BootCMatch: a software package for bootstrap AMG based on graph weighted matching. ACM Trans Math
Software 44(4):Art. 39, 25. https://doi.org/10.1145/3190647
[16] D’Ambra P, Durastante F, Filippone S (2021) AMG preconditioners for linear
solvers at extreme scale. SIAM J on Sci Comp 43(5). https://doi.org/10.1137/
20M134914X
[17] Filippone S, Buttari A (2012) Object-oriented techniques for sparse matrix
computations in Fortran 2003. ACM TOMS 38(4):23:1–23:20
[18] Filippone S, Colajanni M (2000) PSBLAS: a library for parallel linear algebra
computations on sparse matrices. ACM TOMS 26(4):527–550
[19] Houzeaux G, de la Cruz R, Owen H, et al (2013) Parallel uniform mesh multiplication applied to a Navier-Stokes solver. Comput & Fluids 80:142–151.
https://doi.org/10.1016/j.compfluid.2012.04.017
[20] Houzeaux G, Borrell R, Fournier Y, et al (2018) High-Performance Computing: Dos and Don’ts. In: Ionescu A (ed) Computational Fluid Dynamics - Basic
Instruments and Applications in Science. IntechOpen, pp 3–41, https://doi.org/
10.5772/intechopen.72042
[21] Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392. https://doi.org/10.1137/S1064827595287997
[22] Lehmkuhl O, Houzeaux G, Owen H, et al (2019) A low-dissipation finite element
scheme for scale resolving simulations of turbulent flows. Journal of Computational Physics 390:51–65. https://doi.org/10.1016/j.jcp.2019.04.004
[23] Li XS (2005) An overview of SuperLU: algorithms, implementation, and user
interface. ACM Trans Math Software 31(3):302–325. https://doi.org/10.1145/
1089014.1089017
[24] Owen H, Chrysokentis G, Avila M, et al (2020) Wall-modeled large-eddy simulation in a finite element framework. International Journal for Numerical Methods in Fluids 92(1):20–37. https://doi.org/10.1002/fld.4770
[25] PRACE (accessed May 2020) Unified European Application Benchmark Suite.
https://repository.prace-ri.eu/git/UEABS/ueabs/
[26] PRACEBS (accessed May 2020) PRACE benchmark suite. https://prace-ri.eu/training-support/technical-documentation/benchmark-suites/
[27] Vaněk P, Mandel J, Brezina M (1996) Algebraic multigrid by smoothed
aggregation for second and fourth order elliptic problems. Computing
56(3):179–196. https://doi.org/10.1007/BF02238511. International GAMM-Workshop on Multi-level Methods (Meisdorf, 1994)
[28] Vassilevski PS (2008) Multilevel block factorization preconditioners. Springer,
New York, matrix-based analysis and algorithms for solving finite element
equations
[29] Vazquez M, Houzeaux G, Koric S, et al (2016) Alya: multiphysics engineering
simulation toward exascale. J Comput Sci 14:15–27. https://doi.org/10.1016/j.jocs.2015.12.007
[30] Vreman AW (2004) An eddy-viscosity subgrid-scale model for turbulent shear
flow: Algebraic theory and applications. Phys Fluids 16(10):3670–3681