
Alya towards Exascale: Algorithmic Scalability using PSCToolkit

Herbert Owen(1), Oriol Lehmkuhl(1), Pasqua D'Ambra(2,*), Fabio Durastante(3,2), Salvatore Filippone(4,2)

arXiv:2210.16660v2 [math.NA] 19 Dec 2023

1* Barcelona Supercomputing Centre (BSC), Plaça d'Eusebi Güell, Barcelona, 08034, Spain.
2 Institute for Applied Computing, National Research Council (CNR), Via P. Castellino, 111, Naples, 80131, NA, Italy.
3 Department of Mathematics, University of Pisa, Largo Bruno Pontecorvo, 5, Pisa, 56127, PI, Italy.
4 Department of Civil and Computer Engineering, University of Rome "Tor Vergata", Via del Politecnico, 1, Rome, 00133, RM, Italy.

*Corresponding author(s). E-mail(s): [email protected];
Contributing authors: [email protected]; [email protected]; [email protected]; [email protected];

Abstract

In this paper, we describe an upgrade of the Alya code with up-to-date parallel linear solvers capable of achieving reliability, efficiency and scalability in the computation of the pressure field at each time step of the numerical procedure for solving a Large Eddy Simulation formulation of the incompressible Navier-Stokes equations. We developed a software module in Alya's kernel to interface the libraries included in the current version of PSCToolkit, a framework for the iterative solution of sparse linear systems on parallel distributed-memory computers by Krylov methods coupled to Algebraic MultiGrid preconditioners. The Toolkit has undergone various extensions within the EoCoE-II project with the primary goal of facing the exascale challenge. Results on a realistic benchmark for airflow simulations in wind farm applications show that the PSCToolkit solvers significantly outperform the original versions of the Conjugate Gradient method available in Alya's kernel in terms of scalability and parallel efficiency, and represent a very promising software layer to move the Alya code towards exascale.

Keywords: Navier-Stokes equations, iterative linear solvers, algebraic multigrid, parallel scalability

MSC Classification: 65F08, 65F10, 65M55, 65Y05, 65Z05

1 Introduction

Alya is a high-performance computational mechanics code for complex coupled multi-physics engineering problems. In this work, we present the interfacing between Alya and PSCToolkit to overcome one of Alya's main obstacles on the path towards exascale, namely the lack of state-of-the-art parallel algebraic linear solvers with adequate algorithmic scalability, as already identified in [29], where Alya's strengths and weaknesses in facing the exascale challenge were analyzed by scalability studies on up to one hundred thousand cores.

Although Alya can be applied to a wide range of problems, in this work we concentrate on solving turbulent incompressible flow problems using a Large Eddy Simulation (LES) approach. Due to the wide range of scales present in turbulent high-Reynolds-number flows, their accurate solution requires computational meshes with a huge number of degrees of freedom (dofs). Alya uses a Finite Element (FE) spatial discretization, while its time discretization is based on finite difference methods; when an implicit time discretization is applied, the two main kernels of a simulation are the assembly of stiffness matrices and the solution of the associated linear system at each time step.
In [29] the authors observed that the FE assembly implemented in Alya showed nearly perfect scalability, as one could a priori expect, while the solution of linear systems by the available iterative linear solvers was the main weakness on the path towards exascale. The problem is related to Alya's lack of solvers with optimal algorithmic scalability, i.e., solvers able to obtain a given accuracy employing an almost constant number of iterations for an increasing number of dofs. Alya's sparse linear algebra solvers are specifically developed with tight integration with the overall parallelization scheme; they include Krylov-based solvers, such as Generalized Minimal Residual (GMRES) or Conjugate Gradient (CG), coupled to some deflation approach or a simple diagonal preconditioner. As shown in [29], when incompressible flow problems are considered, the solution of a Poisson-type equation for the pressure field becomes challenging as the size of the problem increases. Indeed, when uniform mesh multiplication [19] is used to obtain successively finer meshes, each time the mesh is refined to obtain elements with half the size, the number of iterations for solving the pressure equation approximately doubles, showing a mesh-size-dependent behavior. To overcome these scalability issues, we interfaced PSCToolkit to Alya to take advantage of the Algebraic MultiGrid (AMG) preconditioners available through the AMG4PSBLAS library; this effort has been carried out in the context of the European Center of Excellence for Energy (EoCoE) applications.

The rest of the paper is organized as follows. In Section 2, we describe the general framework of Alya and the type of fluid dynamics problem we wish to test the new solvers on; in Section 3, we give an overall description of PSCToolkit, and then we focus on the AMG preconditioners employed in Section 3.1. Section 4 discusses the new module written to interface the solver library to the Alya software and the related issues. Section 5 describes the actual test case, while Section 6 analyzes the numerical scalability results in detail. Finally, Section 7 summarizes the results obtained and illustrates new lines of development.

2 Alya Description

Alya is a high-performance computational mechanics code for complex coupled multi-physics engineering problems. It can solve problems in the simulation of turbulent incompressible/compressible flows, non-linear solid mechanics, chemistry, particle transport, heat transfer, and electrical propagation. Alya has been designed for massively parallel supercomputers and exploits several parallel programming models/tools. It relies on MPI to support a distributed-memory model; some kernels support vectorization at the CPU level, and GPU accelerators are exploited through OpenACC pragmas or CUDA. Multi-physics coupling is achieved following a multi-code strategy that uses MPI to communicate between different instances of Alya. Each instance solves a particular physics, enabling asynchronous execution. Coupled problems can be solved while retaining the scalability properties of the individual instances. Alya is one of the two Computational Fluid Dynamics (CFD) codes of the Unified European Applications Benchmark Suite (UEABS) [25]. It is also part of the Partnership for Advanced Computing in Europe (PRACE) Accelerator benchmark suite [26].
As mentioned in Section 1, large-scale CFD applications are the main problems targeted by Alya; hence, the basic mathematical models include various formulations of the Navier-Stokes equations, whose strong form for incompressible flows in a suitable domain is the following:

    ∂_t u − 2ν∇ · ε(u) + u · ∇u + ∇p = f,   (1)
    ∇ · u = 0,                              (2)

where u and p are the velocity and pressure fields respectively, ε(u) = (1/2)(∇u + ∇^T u) is the velocity strain rate tensor, ν is the kinematic viscosity, and f denotes the vector of external body forces. The problem is supplied with an initial divergence-free velocity field and appropriate boundary conditions. The flow is turbulent for most real-world flow problems, and some turbulence modeling is needed to make the problem solvable with currently available computational resources. For all the examples presented in this work, we rely on the functionalities of Alya, which apply the spatially filtered Navier-Stokes equations coupled to the Vreman subgrid-scale model [30] for turbulence closure. In practice, a spatially varying turbulent viscosity supplements the laminar viscosity, and the velocity and pressure unknowns correspond to spatially filtered values. Finally, since the size of the dynamically important eddies at high Reynolds numbers becomes too small to be grid resolved close to the wall, we employ a wall modeling technique [24] to impose the boundary conditions for the LES equations. For simplicity, the non-linear term has been written in its convective form, which is the one most commonly encountered in computational practice.

Space discretization is based on a Galerkin FE approximation, employing hybrid unstructured meshes, which can include tetrahedra, prisms, hexahedra, and pyramids. Temporal discretization is performed through an explicit third-order Runge-Kutta scheme, where the Courant-Friedrichs-Lewy number is set to CFL = 1.0 for the cases presented in this work. A non-incremental fractional step method is used to stabilize the pressure, allowing the use of finite element pairs that do not satisfy the inf-sup condition [12], such as the equal-order interpolation for velocity and pressure applied in this work. A detailed description of the above numerical method, together with examples for turbulent flows showing its high accuracy and low dissipation, can be found in [22].

The fractional step method allows uncoupling the solution of velocity and pressure [12]. At each Runge-Kutta substep, an explicit approach computes an intermediate velocity, then a linear system coming from a Poisson-type equation is solved for the pressure; finally, the incompressible velocity is recovered. On the path towards exascale, the solution of the linear system for the pressure is the most demanding step. To reduce the computational burden, for most problems an approximate projection method for Runge-Kutta time-stepping schemes is applied, which allows solving for the pressure only at the final substep [10]. It is important to note that most flow problems solved with Alya use a fixed mesh. For such problems, the linear system matrix for the pressure equation remains constant during the whole simulation. Therefore, the matrix assembly and the setup of a matrix preconditioner are needed only once at the beginning of the numerical procedure. Given that the number of time steps for LES is usually of the order of 10^5, it is clear that the linear solver computational times and scalability are the most relevant issues to be tackled.
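To fix ideas, the following is a minimal sketch, on a tiny periodic finite-difference grid, of the fractional-step update just described (intermediate velocity, pressure Poisson solve, projection). It is not Alya's finite element implementation: the grid, forcing, and FFT-based Poisson solve are illustrative stand-ins for the large sparse SPD pressure system that Alya hands to the linear solver at each time step.

```python
import numpy as np

# Minimal sketch of one non-incremental fractional-step update on a tiny periodic
# 2D grid with centered differences (NOT Alya's FE/Runge-Kutta scheme).
n, h, dt, nu = 64, 1.0 / 64, 1e-3, 1e-3
u = np.zeros((2, n, n))                         # velocity components (u_x, u_y)
f = 1e-2 * np.random.rand(2, n, n)              # body force (placeholder)

def ddx(a, axis):                               # centered first derivative
    return (np.roll(a, -1, axis) - np.roll(a, 1, axis)) / (2 * h)

def lap(a):                                     # 5-point Laplacian
    return (np.roll(a, 1, 0) + np.roll(a, -1, 0) +
            np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4 * a) / h**2

# 1) explicit intermediate velocity (advection omitted to keep the sketch short)
u_star = u + dt * (nu * np.array([lap(u[0]), lap(u[1])]) + f)

# 2) pressure Poisson equation lap(p) = div(u*)/dt, solved spectrally here;
#    in Alya this is the sparse SPD system solved iteratively at each step
rhs = (ddx(u_star[0], 0) + ddx(u_star[1], 1)) / dt
k = 2 * np.pi * np.fft.fftfreq(n, d=h)
eig = -(4 / h**2) * (np.sin(k[:, None] * h / 2)**2 + np.sin(k[None, :] * h / 2)**2)
eig[0, 0] = 1.0                                 # fix the constant (zero-mean) mode
p = np.real(np.fft.ifft2(np.fft.fft2(rhs) / eig))

# 3) correction: subtract the pressure gradient to (approximately) project u*
u = u_star - dt * np.array([ddx(p, 0), ddx(p, 1)])
```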
3 PSCToolkit: PSBLAS and AMG4PSBLAS

We have interfaced Alya to exploit the solvers and preconditioners developed in the PSCToolkit(1) software framework for parallel sparse computations, proven on current petascale supercomputers and targeting the next-generation exascale machines. PSCToolkit is composed of two main libraries, named PSBLAS (Parallel Sparse Basic Linear Algebra Subprograms) [17, 18] and AMG4PSBLAS (Algebraic MultiGrid Preconditioners for PSBLAS) [16]. Both libraries are written in modern Fortran; PSBLAS implements algorithms and functionalities of parallel iterative Krylov subspace linear solvers, while AMG4PSBLAS is the package containing sophisticated preconditioners. In particular, AMG4PSBLAS provides one-level Additive Schwarz (AS) and Algebraic MultiGrid (AMG) preconditioners. In the following, we describe in some detail the AMG preconditioners we use within the Alya test cases.

(1) See psctoolkit.github.io on how to obtain and run the code.

3.1 AMG preconditioners

Algebraic MultiGrid methods can be viewed as a particular instance of a general stationary iterative method

    x^(k) = x^(k-1) + B (b - A x^(k-1)),   k = 1, 2, ...,   given x^(0) ∈ R^n,

for the solution of a linear system Ax = b, A ∈ R^{n×n}, b ∈ R^n, where A is symmetric and positive-definite (SPD), and the iteration matrix B is defined recursively; see, e.g., [28] for an exhaustive account. AMG methods are often employed as preconditioners for Krylov subspace solvers; what distinguishes the methods implemented in AMG4PSBLAS are the specific details of the construction procedure for the B matrix.

We define A_0 = A and consider the sequence {A_l}_{l=0}^{n_ℓ} of coarse matrices computed by the triple-matrix Galerkin product

    A_{l+1} = P_l^T A_l P_l,   l = 0, ..., n_ℓ - 1,

where {P_l}_{l=0}^{n_ℓ-1} is a sequence of prolongation matrices of size n_l × n_{l+1}, with n_{l+1} < n_l and n_0 = n. To complete the formal construction we also need a sequence {M_l}_{l=0}^{n_ℓ-1} of A_l-convergent smoothers for the coarse matrices A_l, i.e., matrices M_l for which

    ‖I_l - M_l^{-1} A_l‖_{A_l} < 1

holds true, where I_l is the identity matrix of size n_l and ‖v‖_{A_l} = (v^T A_l v)^{1/2} is the A_l-norm. The preconditioner matrix B for the V-cycle with ν pre- and post-smoothing iterations is then given by the multiplicative composition of the following error propagation matrices,

    I_l - B_l A_l = (I_l - M_l^{-T} A_l)^ν (I_l - P_l B_{l+1} P_l^T A_l) (I_l - M_l^{-1} A_l)^ν,   for all l < n_ℓ,   (3)

with B_{n_ℓ} ≈ A_{n_ℓ}^{-1}, obtained either from a direct solution or from a convergent iterative procedure with a fine enough tolerance.

For the case at hand, we select each iteration matrix of the smoother sequence {M_l}_{l=0}^{n_ℓ-1} as the one representing four iterations (ν = 4) of the hybrid forward/backward Gauss-Seidel method. We consider A in a general row-block parallel distribution over n_p processes, i.e., A is divided into n_p blocks of size n_b × n, and we call A_pp the corresponding diagonal block of A. We then decompose each block A_pp as A_pp = L_pp + D_pp + L_pp^T, where D_pp = diag(A_pp) and L_pp is strictly lower triangular. To enforce symmetry in (3), we select M_{l,pp} as the block diagonal matrices (Figure 1)

    M_l = blockdiag(M_{l,pp})_{p=1}^{n_p},   M_{l,pp} = ω (L_{l,pp} + D_{l,pp}),   l = 0, ..., n_ℓ,

where ω is a damping parameter. The overall procedure thus amounts essentially to using four sweeps of the damped block-Jacobi method on the matrix of the level, while solving the blocks with the forward, respectively backward, Gauss-Seidel method.

Fig. 1: Depiction of the structure of the hybrid forward/backward Gauss-Seidel method on a general row-block parallel distribution of a symmetric positive definite matrix A.
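As a concrete illustration of the smoother just described, the following minimal NumPy sketch (not AMG4PSBLAS code) applies a damped block-Jacobi sweep in which each process-local diagonal block is treated by a forward (or backward) Gauss-Seidel step, i.e., M_pp = ω(L_pp + D_pp); the toy matrix, block layout, and damping parameter are illustrative assumptions.

```python
import numpy as np

def hybrid_gs_sweep(A, b, x, blocks, omega=1.0, forward=True):
    """One sweep of M = blockdiag(omega * (L_pp + D_pp)) applied to r = b - A x."""
    r = b - A @ x
    x_new = x.copy()
    for lo, hi in blocks:                       # each (lo, hi) mimics one process block
        App = A[lo:hi, lo:hi]
        M = np.tril(App) if forward else np.triu(App)   # L_pp + D_pp or D_pp + L_pp^T
        # forward/backward triangular solve on the local part of the residual
        x_new[lo:hi] += np.linalg.solve(omega * M, r[lo:hi])
    return x_new

# toy SPD system split into two "process" blocks
n = 8
A = np.diag(4.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
b = np.ones(n)
x = np.zeros(n)
blocks = [(0, 4), (4, 8)]
for _ in range(4):                              # nu = 4 pre-smoothing sweeps
    x = hybrid_gs_sweep(A, b, x, blocks, forward=True)
```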
To build the prolongation (and thus the restriction) matrices, we employ the coarsening based on the compatible weighted matching strategy; a full account of the derivation and a detailed theoretical analysis may be found in [13, 15, 16]. This is a recursive procedure that starts from the adjacency graph G = (V, E) associated with the sparse matrix A; this is the graph in which the vertex set V consists of either the row or column indices of A and the edge set E corresponds to the index pairs (i, j) of the nonzero entries in A. The method works by constructing a matching M in the graph G to obtain a partition into subgraphs. We recall that a graph matching is a subset of the graph's edges such that no two of them are incident on the same vertex. Specifically, we consider more than a purely topological matching by taking into account the weights of the edges, i.e., the values of the entries of the matrix A. In the first step, we associate an edge weight matrix C, computed from the entries a_{i,j} of A and an arbitrary vector w; then, we compute an approximate maximum product matching of the whole graph to obtain the aggregates defining the coarse spaces. We define C = (c_{i,j})_{i,j} as

    c_{i,j} = 1 - (2 a_{i,j} w_i w_j) / (a_{i,i} w_i^2 + a_{j,j} w_j^2);   (4)

then, M is an approximate maximum product matching of G with edge weight matrix C, i.e.,

    M ≈ arg max_{M'} ∏_{(i,j) ∈ M'} c_{i,j}.   (5)

The aggregates are then the subsets of indices {G_p}_{p=1}^{|M|} of the whole index set I of A made of the pairs of indices matched by the algorithm, where we denote by |M| the cardinality of the graph matching M. In other terms, we have obtained the decomposition

    I = {1, ..., n} = ⋃_p G_p,   G_p ∩ G_r = ∅ if p ≠ r;

see, e.g., Figure 2, in which the matching of a test graph is computed. In more detail, Figure 2a has a black dot corresponding to each non-zero element of the adjacency matrix; Figure 2b shows the corresponding graph obtained from it; while Figure 2c highlights the aggregated nodes, i.e., the G_p sets.

Fig. 2: Matching of the graph bcspwr01 from the Harwell-Boeing collection: (a) adjacency matrix, (b) original graph, (c) matched edges. The matched nodes in the graph are highlighted by a bold red edge.

In most cases, we will end up with a sub-optimal matching, i.e., not all vertices will be endpoints of matched edges; thus, we usually have unmatched vertices. To each unmatched vertex, we associate an aggregate G_s that is a singleton, and we denote by n_S the total number of singletons. The main computational cost of this phase is represented by the computation of the approximate graph matching on a graph that is distributed across thousands of processors. The parallel coarsening implemented in AMG4PSBLAS uses the MatchBox-P software library [11]; this implements a distributed parallel algorithm for the computation of a half-approximate maximum weight matching with complexity O(|E|∆), where |E| is the cardinality of the graph edge set and ∆ is the maximum vertex degree, i.e., the maximum number of edges incident on any given node of the graph. The procedure guarantees a solution that is at least half of the optimal weight, i.e., the approximation in (5) holds within 1/2 of the optimum.
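The following serial sketch illustrates the idea behind this coarsening step; it is not the MatchBox-P algorithm. It builds the edge weights of (4) and then runs a greedy heaviest-edge-first matching, a standard 1/2-approximation of the maximum weight matching, returning pair aggregates plus singletons; the test matrix and the choice w = 1 are illustrative assumptions.

```python
import numpy as np

def matching_aggregates(A, w):
    """Pair aggregates from a greedy matching on the weights of Eq. (4), plus singletons."""
    n = A.shape[0]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if A[i, j] != 0.0:
                c = 1.0 - (2.0 * A[i, j] * w[i] * w[j]) / (A[i, i] * w[i]**2 + A[j, j] * w[j]**2)
                edges.append((c, i, j))
    edges.sort(reverse=True)                  # heaviest edges first
    matched, aggregates = set(), []
    for c, i, j in edges:
        if i not in matched and j not in matched:
            matched |= {i, j}
            aggregates.append([i, j])         # pair aggregate G_p
    aggregates += [[i] for i in range(n) if i not in matched]   # singleton aggregates G_s
    return aggregates

# toy SPD matrix (1D Laplacian) with w = constant vector
n = 6
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
print(matching_aggregates(A, np.ones(n)))     # three pair aggregates, no singletons
```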
The message aggregation and the overlapping between communication and computation employed by this strategy reduce the impact of data communication on parallel efficiency; we refer the reader to [16] for a complete set of experiments showcasing this feature.

Finally, to build the prolongator matrices, the last ingredients we need are the vectors w_e identifying, for each edge e_{i↦j} ∈ M, the orthonormal projection of w on the non-singleton aggregate G_p. For the sake of the explanation, we consider an ordering of the indices in which we move all the unknowns corresponding to unmatched vertices to the bottom(2), and thus define a tentative prolongator

    P̂ = [ P̃   0 ]
        [ 0   W ]  ∈ R^{n × n_c},   (6)

where P̃ = blockdiag(w_{e_1}, ..., w_{e_{n_M}}), and W = diag(w_s/|w_s|), s = 1, ..., n_S, corresponds to the unmatched vertices. The resulting number of coarse variables is then given by n_c = n_M + n_S. The matrix P̂ we have just built is a piecewise constant interpolation operator whose range includes, by construction, the vector w. The actual prolongator P is then obtained from P̂ as P = (I − ωD^{-1}A)P̂, where D = diag(A) and ω = 1/‖D^{-1}A‖_∞ ≈ 1/ρ(D^{-1}A), with ρ(D^{-1}A) the spectral radius of D^{-1}A. Indeed, the P we have built is an instance of smoothed aggregation.

(2) This ordering is for explanatory purposes only, and is not actually enforced in practice.

Please observe that the procedure we have described produces, at best, a halving of the size of the system at each new level of the hierarchy. Given the size of the systems we are interested in, this may be unsatisfactory, since the number of levels in the hierarchy, and thus the operational cost needed to cross it, would be too large. Fortunately, it is rather easy to overcome this issue: to obtain aggregates of size greater than two, we just have to collect them together by multiplying the corresponding prolongators (restrictors). This permits us to select the desired size of the aggregates (2, 4, 8, and so on) as an input parameter of the method.

To conclude the description of the preconditioners, we need to specify the choice of the coarsest solver. While using a direct solver at the coarsest level is the easiest way to ensure that the coarsest grid is resolved to the needed tolerance, such an approach can be very expensive for an AMG method running on many thousands of parallel cores. If the matching strategy has worked satisfactorily, the coarsest-level matrix will tend to have both a small global size and a small number of rows per core: in this case the cost of data communication will dominate the local arithmetic computations, causing a deterioration of the method's efficiency. We use here a dual strategy: on the one hand, we employ a distributed coarsest solver running on all the parallel cores, whilst on the other, we limit the maximum size of the coarsest-level matrix to around 200 unknowns per core. Specifically, we use the Flexible Conjugate Gradient (FCG) method with a block-Jacobi preconditioner in which the blocks are solved approximately by an incomplete LU factorization with one level of fill-in, ILU(1); the stopping criterion is a reduction of the relative residual by 3 orders of magnitude or a maximum number of iterations equal to 30.
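Continuing the previous sketch, and under the same illustrative assumptions, the following shows how a tentative prolongator of the form (6) can be assembled from the aggregates and then smoothed as P = (I − ωD^{-1}A)P̂ with ω = 1/‖D^{-1}A‖_∞, after which the next-level matrix is the Galerkin product P^T A P. This is a toy reconstruction of the construction described above, not the AMG4PSBLAS implementation.

```python
import numpy as np

def build_prolongator(A, w, aggregates):
    """Piecewise-constant tentative prolongator (range contains w), then Jacobi smoothing."""
    n, nc = A.shape[0], len(aggregates)
    P_hat = np.zeros((n, nc))
    for p, agg in enumerate(aggregates):                 # one column per aggregate
        v = w[agg] / np.linalg.norm(w[agg])              # normalized restriction of w
        P_hat[agg, p] = v
    DinvA = A / np.diag(A)[:, None]                      # D^{-1} A (row scaling)
    omega = 1.0 / np.abs(DinvA).sum(axis=1).max()        # 1 / ||D^{-1} A||_inf
    return (np.eye(n) - omega * DinvA) @ P_hat           # P = (I - omega D^{-1} A) P_hat

# continue the toy example: build the next-level (Galerkin) matrix A_1 = P^T A P
n = 6
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
P = build_prolongator(A, np.ones(n), [[0, 1], [2, 3], [4, 5]])
A1 = P.T @ A @ P                                         # coarse matrix of the next level
```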
To have a comparison with the preconditioner just discussed, we also consider the same construction but with a different aggregation procedure: the decoupled version of the classic smoothed aggregation of Vaněk et al. [27]. This is an aggregation option that was already available in previous versions of the library [9, 14], and has already been used successfully in CFD applications [2, 3]. The basic idea is to build a coarse set of indices by grouping unknowns into disjoint subsets (the aggregates) using an affinity measure, and to define a simple tentative prolongator whose range contains the so-called near null space of the matrix of the given level, i.e., a sample of the eigenvector corresponding to the smallest eigenvalue. The strategy is implemented in an embarrassingly parallel fashion: each processor produces aggregates by only looking at local unknowns, i.e., the aggregation is performed in a decoupled fashion, in contrast to the previous matching procedure, which instead crosses the boundary of the single process. Table 1 summarizes the different preconditioners we have discussed here and that are used in the experiments of Section 6.1.

Table 1: Summary of the described preconditioners; the labels are used to describe the results in Section 6.1.

    Pre-smoother:    4 iterations of hybrid forward Gauss-Seidel
    Post-smoother:   4 iterations of hybrid backward Gauss-Seidel
    Coarsest solver: FCG preconditioned by block-Jacobi with ILU(1) block solvers
    Cycle:           V-cycle

    Label        Aggregation
    MLVSMATCH3   Coupled smoothed, based on matching, |G| ≤ 8
    MLVSMATCH4   Coupled smoothed, based on matching, |G| ≤ 16
    MLVSBM       Decoupled classic smoothed

Remark 1. The AMG4PSBLAS library provides interfaces to some widely used parallel direct solvers, such as SuperLU [23] and MUMPS [1]. Thus, we could have used any of those within the damped block-Jacobi method, either for the smoothers or for the coarsest solver. As far as the smoothers are concerned, it has been observed in the literature [4, 16] that the combination with the Gauss-Seidel method delivers better smoothing properties for the overall method. In the coarsest solver case, the size of the local matrices is small enough that using a direct solver does not usually yield a significant performance increase. We also stress that the preconditioner described in this section depends only on native PSCToolkit code, i.e., the user does not have to install optional third-party libraries to use it.

4 Interfacing Alya to PSCToolkit

The Alya code is organized in a modular way, and its architecture is split into modules, kernel, and services, which can be separately compiled and linked. Each module represents a physical model, i.e., a set of partial differential equations, which can interact with others for running a multi-physics simulation in a time-splitting approach, while Alya's kernel implements the functionalities for dealing with the discretization mesh, the solvers, and the I/O. As already mentioned, the governing equations of a physical model are discretized in space by using FE methods, and all the functionalities to assemble the global stiffness matrix and right-hand side (RHS) of the corresponding set of equations, including boundary conditions and material properties, are the responsibility of the module. Instead, all the functionalities needed to solve the algebraic linear systems are implemented in the kernel. Some work on data structures and on the distribution of matrices and RHS was necessary to interface Alya with the libraries from PSCToolkit, as described in the following. Alya uses the compressed sparse row matrix scheme for the internal representation of sparse matrices. This scheme is supported by PSCToolkit, so no significant difficulty was met from this perspective.
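For reference, the compressed sparse row (CSR) scheme mentioned above stores a matrix as row pointers, column indices, and values; the small example below (not Alya's or PSBLAS's internal data structures) shows the layout and a matrix-vector product over it.

```python
import numpy as np

# CSR storage of A = [[ 4, -1,  0],
#                     [-1,  4, -1],
#                     [ 0, -1,  4]]
row_ptr = np.array([0, 2, 5, 7])          # start of each row in col_idx/vals
col_idx = np.array([0, 1, 0, 1, 2, 1, 2])
vals    = np.array([4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0])

def csr_matvec(row_ptr, col_idx, vals, x):
    """y = A x using the CSR arrays."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = vals[lo:hi] @ x[col_idx[lo:hi]]
    return y

print(csr_matvec(row_ptr, col_idx, vals, np.ones(3)))   # [3. 2. 3.]
```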
The main difficulty in the interfacing process was how the data, i.e., the discretization mesh and the corresponding unknowns, are distributed among the parallel processes, and the way the related sparse matrix rows and RHS are locally assembled. The Alya code is based on a domain decomposition where the discretization mesh is partitioned into disjoint subsets of elements/nodes, referred to as subdomains. Then, each subdomain is assigned to a parallel process that carries out all the geometrical and algebraic operations corresponding to that part of the domain and the associated unknowns. The interface elements/nodes on the boundary between two subdomains are assigned to one of the subdomains (see Figure 3).

Fig. 3: Mesh partitioning into (3a) disjoint sets of nodes (interface nodes in white), and (3b) disjoint sets of elements (halo elements in white).

The sparse matrices expressing the linear couplings among the unknowns are distributed in such a way that each parallel process holds the entries associated with the couplings generated on its subdomain. Two different options are possible for the sparse matrix distribution: the partial row format and the full row format [20]. In the full row format, if a mesh element/node and the corresponding unknown belong to a process, all row entries related to that unknown are stored by that process. In the partial row format, the row of the matrix corresponding to an unknown is not full and needs contributions from unknowns belonging to different processes. Alya uses the partial row format for storing the matrix. The libraries from PSCToolkit build the preconditioners and apply the Krylov methods under the assumption of a full row format; nevertheless, support for the partial row format was added to the libraries' pre-processing stage so that the interfacing can be as transparent as possible. The pre-processing support implies the retrieval of remote information for those matrix contributions that correspond to elements on the boundary; the data communication is split between the discovery of the needed entries (which needs to be executed only when the discretization mesh changes) and the actual retrieval of the matrix entries, which must happen at any time step where the matrix coefficients and/or vector entries may be rebuilt, prior to an invocation of the solvers. When the topology of the mesh does not change, and there is only an update in the coefficients, it is also possible to reuse the same preconditioner; this may be a full reuse of the overall matrix hierarchy, or a partial reuse, employing the same prolongators/restrictors to rebuild the AMG hierarchy and smoothers.

We developed a software module in Alya's kernel for the declaration, allocation, and initialization of the library's data structures, as well as for using the solvers and preconditioners. PSBLAS makes available some of the widely used iterative methods based on Krylov projection methods through a single interface to a driver routine, while preconditioners for the PSBLAS Krylov solvers are available through the AMG4PSBLAS package. The main functionalities for selecting and building the chosen preconditioner are the responsibility of the software module included in Alya's kernel, while the functionalities for applying it within the PSBLAS Krylov solver are completely transparent to the Alya code and are the responsibility of the library.
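The toy example below illustrates the partial-row versus full-row distinction for a matrix row attached to an interface node shared by two subdomains; the data layout and the plain loop standing in for MPI communication are purely illustrative, not the PSBLAS pre-processing code.

```python
from collections import defaultdict

# Global node 2 is an interface node owned by process 0. In the partial row
# format each process keeps only the contributions assembled from its own
# elements; the owner obtains the full row by adding the partial contributions
# received from its neighbours, which is what the pre-processing stage does
# for boundary rows.
partial_rows = {
    0: {2: {1: -1.0, 2: 1.0}},          # process 0: its local couplings of row 2
    1: {2: {2: 1.0, 3: -1.0}},          # process 1: its contributions to the same row
}

def assemble_full_row(glob_row, partial_rows):
    """Sum the partial contributions of all processes for one global row."""
    full = defaultdict(float)
    for proc, rows in partial_rows.items():  # in practice MPI messages, here a loop
        for col, val in rows.get(glob_row, {}).items():
            full[col] += val
    return dict(full)

print(assemble_full_row(2, partial_rows))   # {1: -1.0, 2: 2.0, 3: -1.0}
```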
5 The Bolund test case

Fig. 4: Photograph of the Bolund hill [7].

Our main aim was to test the libraries on systems stemming from the fluid dynamics simulation of incompressible flow arising in the study of wind-farm efficiency. The test case is based on the Bolund experiment, a classical benchmark for microscale atmospheric flow models over complex terrain [5, 6]. An incompressible flow treatment is used because the Mach number, i.e., the ratio of the speed of the flow to the speed of sound, is much smaller than 0.3. The test case is based on a small (12 m) isolated steep hill at Roskilde Fjord in Denmark, having a significantly steep escarpment in the main wind direction and uniformly covered by grass, so that the resulting flow is not influenced by individual roughness elements. This is considered the ideal benchmark for the validation of neutral flow models and, hence, a most relevant scenario for the analysis of software modeling for wind energy. Though relatively small, its geometrical shape induces complex 3D flow. Bolund was equipped with several measurement masts with conventional meteorological instruments and remote sensing Lidars to obtain detailed information on mean wind, wind shear, turbulence intensities, etc. A publicly available database for evaluating currently available flow models and methodologies for turbine siting in complex terrain, regarding wind resources and loads, is available at [7].

Fig. 5: Volume rendering of the velocity over Bolund obtained with Alya.

We discretize the incompressible Navier-Stokes equations (1)-(2) as described in Section 2. At each time step of the LES procedure, we solved the SPD linear systems arising from the pressure equation employing the preconditioned flexible version of the CG method (FCG) from PSBLAS. Starting from an initial guess for the pressure taken from the previous time step, we stopped the linear iterations when the Euclidean norm of the relative residual was no larger than TOL = 10^-3. The Reynolds number based on the friction velocity for this test case is approximately Re_τ = Uh/ν ≈ 10^7, with U = 10 m/s. As discussed in [6, Section 2.1], we can neglect the Coriolis force in the horizontal direction and use the formulation (1) since the Rossby number Ro = 667 ≫ 1. The next Section 6 details the scalability results obtained for this test case with the new solvers and preconditioners from PSCToolkit described in Section 3.
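The solver usage pattern just described, a preconditioned CG-type iteration warm-started from the previous time step's pressure and stopped on the relative residual, can be sketched as follows; this is not the PSBLAS FCG driver, and the Jacobi preconditioner used here merely stands in for the AMG V-cycle application.

```python
import numpy as np

def pcg(A, b, x0, prec, tol=1e-3, maxit=1000):
    """Preconditioned CG with relative-residual stopping: ||b - A x|| <= tol * ||b||."""
    x = x0.copy()
    r = b - A @ x
    z = prec(r)
    p = z.copy()
    rz = r @ z
    bnorm = np.linalg.norm(b)
    for it in range(maxit):
        if np.linalg.norm(r) <= tol * bnorm:     # relative residual criterion
            break
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, it

# toy SPD "pressure" system, Jacobi preconditioning, warm start from previous step
n = 50
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
b = np.ones(n)
p_prev = np.zeros(n)                              # pressure from the previous time step
x, its = pcg(A, b, p_prev, prec=lambda r: r / np.diag(A))
```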
6 Parallel Performance Results

In the following we discuss the results of experiments run on two of the most powerful European supercomputers. The first set of experiments aimed to analyze the behavior of different AMG preconditioners available from AMG4PSBLAS and was run on the Marenostrum-4 supercomputer on up to 12288 CPU cores. Marenostrum-4 is composed of 3456 nodes with 2 Intel Xeon Platinum 8160 CPUs with 24 cores per CPU. It is ranked 121st in the November 2023 TOP500 list(3), with more than 10 petaflops of peak performance, and is operated by the Barcelona Supercomputing Center. The simulations have been performed with the Alya code interfaced to PSBLAS (3.7.0.1) and AMG4PSBLAS (1.0), built with GNU compilers 7.2. The second set of experiments aimed to reach very large scales and was run by using only one of the most promising preconditioners from AMG4PSBLAS on the Juwels supercomputer, on up to 23551 CPU cores. Juwels is composed of 2271 compute nodes with 2 Intel Xeon Platinum 8168 CPUs, of 24 cores each. It is ranked 127th in the June 2022 TOP500 list, with more than 9 petaflops of peak performance, and is operated by the Jülich Supercomputing Centre. These simulations have been performed with the Alya code interfaced to the same versions of the solver libraries mentioned above, built with GNU compilers 10.3.

(3) Available at www.top500.org.

6.1 Comparison of AMG Preconditioners

In this section, we discuss results obtained on Marenostrum-4 and compare the behavior of FCG coupled to the preconditioners described in Section 3.1 and summarized in Table 1. We run both a strong scalability analysis, for unstructured meshes of tetrahedra of three fixed sizes, and a weak scalability analysis, obtained by fixing different mesh sizes per core and linearly increasing both the mesh size and the number of cores. A general row-block matrix distribution based on the Metis 4.0 mesh partitioner [21] was applied for the parallel runs.

6.1.1 Strong scalability

We first focus on strong scalability results obtained on the Bolund experiment for three fixed-size meshes (small, medium and large) including n1 = 5570786 ≈ 6 × 10^6, n2 = 43619693 ≈ 4.4 × 10^7 and n3 = 345276325 ≈ 0.35 × 10^9 dofs, respectively. Three different configurations of the number of cores, obtained by doubling each time the number of MPI cores with respect to the minimum number of cores (nodes) needed to run at full load, were employed for the three different mesh sizes: from minp = 48 to maxp = 192 cores in the case of the small mesh, from minp = 384 to maxp = 1536 cores for the medium mesh, and finally from minp = 3072 to maxp = 12288 cores for the large mesh. We analyze the parallel efficiency and convergence behavior of the linear solvers for 20 time steps after a pre-processing phase, so that we focus on the solvers' behavior in the simulation of a fully developed flow. Note that in the Alya code a master-slave approach is employed, where the master process is not involved in the parallel computations.

In Figures 6-7, we report a comparison of the different methods in terms of the total number of iterations of the linear solvers and of the solve time per iteration (in seconds), respectively. Note that in the figures we also report results obtained with a version of Deflated CG (AlyaDefCG), available from the original Alya code.

Fig. 6: Strong scalability: total iteration count of the linear solvers for systems of size n1, n2 and n3, as a function of the number of MPI cores.

We can observe that the total number of linear iterations is much smaller than that of the original AlyaDefCG, for all three meshes, when the AMG4PSBLAS multilevel preconditioners are applied. For the small mesh, the minimum number of linear iterations is obtained by MLVSBM, which shows a fixed number of 60 iterations for all core counts, while MLVSMATCH3 requires 90 iterations for all core counts except on 192 cores, where one fewer iteration was needed, and MLVSMATCH4 requires 100 iterations; in this case, the original AlyaDefCG requires 700 iterations for all core counts. In the case of the medium mesh, we observe a larger number of iterations of the solvers employing AMG4PSBLAS preconditioners with respect to the large mesh. We obtain the minimum number of iterations with MLVSMATCH3, ranging from 122 to 123 for all numbers of cores, while MLVSMATCH4 requires from 160 to 161 iterations and MLVSBM from 172 to 174 iterations.
The original AlyaDefCG requires a number of iterations ranging from 1040 to 1042 for the medium mesh. In the case of the large mesh, the number of iterations required by MLVSMATCH3 ranges between 108 on 3072 cores and 137 on 12288 cores, while MLVSMATCH4 requires a more stable number of iterations, ranging from 115 to 117; a similarly stable behavior is observed for MLVSBM, which requires a number of iterations ranging from 121 to 123. AlyaDefCG requires a number of iterations ranging from 1404 to 1406 for the large mesh. The oscillations in the number of iterations seem to be mostly dependent on the data partitioning obtained by Metis, which, in turn, appears to have a larger impact on the MLVSMATCH3 preconditioner in the case of the large mesh. A deeper analysis of the impact of the data partitioner on the solver behavior, albeit interesting, is out of the scope of our current work and would require a significant amount of computing resources.

Fig. 7: Strong scalability: time per iteration (in seconds) of the linear solvers for systems of size n1, n2 and n3, as a function of the number of MPI cores.

In all cases, the time needed per iteration decreases for an increasing number of cores and, as expected, it is larger for the AMG preconditioners, where the cost of the preconditioner application at each FCG iteration is larger than that of AlyaDefCG. Depending on mesh size and number of cores, the AMG preconditioners show very similar behavior, although MLVSBM always requires a smaller time per iteration for the large mesh, and for the medium mesh when 1536 cores are used.

In Figures 8-9, we can see the total solve time spent in the linear solvers and the resulting speedup for the preconditioners. Here, we define the speedup as the ratio S_p = T_minp / T_p, where T_minp is the total time for solving the linear systems when the minimum number of total cores, per each problem size, is involved in the simulation, and T_p is the total time spent in the linear solvers for each of the increasing numbers of cores used for the specified mesh size. We observe that the AMG preconditioners from AMG4PSBLAS generally achieve shorter execution times than the original AlyaDefCG; indeed, the expected longer time per iteration is more than compensated by the large reduction in the number of iterations, especially for the small and large meshes. In good agreement with the behavior in terms of iterations and time per iteration, we observe that MLVSBM generally shows the shortest execution time for the small mesh, especially for small numbers of cores, while for the medium and large meshes MLVSMATCH3 and MLVSMATCH4 show better or comparable behaviour with respect to MLVSBM. The best speedups are generally obtained, except for the small mesh, by the original AlyaDefCG, while in the case of the AMG preconditioners, the very good convergence behavior and solve time on the smallest number of cores limit the speedup for increasing numbers of cores. For the AMG4PSBLAS preconditioners, speedups are in good agreement with the total solve times, showing that MLVSMATCH3 and MLVSMATCH4 are generally better or comparable with respect to MLVSBM for all meshes when small and medium numbers of cores are used, while MLVSBM is better for the medium and large meshes when the largest number of cores is used.
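For clarity, the speedup metric used in Figures 8-9 (and, multiplied by the mesh scaling factor, in the weak scalability analysis below) can be computed as in the following sketch; the timings are placeholders, not measured data, and the efficiency ratio is one common way of normalizing the speedup by the core count.

```python
def speedup(times_by_cores):
    """times_by_cores: {number_of_cores: total_solve_time}; S_p = T_minp / T_p."""
    min_cores = min(times_by_cores)
    t_min = times_by_cores[min_cores]
    return {p: t_min / t for p, t in times_by_cores.items()}

def parallel_efficiency(times_by_cores):
    """Efficiency relative to the smallest core count: S_p / (p / minp)."""
    min_cores = min(times_by_cores)
    s = speedup(times_by_cores)
    return {p: s[p] / (p / min_cores) for p in times_by_cores}

# hypothetical timings for a small-mesh configuration (48, 96, 192 cores)
timings = {48: 20.0, 96: 12.0, 192: 8.0}
print(speedup(timings))              # e.g. {48: 1.0, 96: ~1.67, 192: 2.5}
print(parallel_efficiency(timings))  # e.g. {48: 1.0, 96: ~0.83, 192: 0.625}
```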
Fig. 8: Strong scalability: total solve time (in seconds) of the linear solvers for systems of size n1, n2 and n3, as a function of the number of MPI cores.

In conclusion, the selected solvers from PSCToolkit generally outperform the original Alya solver for the employed test case, and the choice of the best preconditioner from AMG4PSBLAS depends on the target mesh size and the number of employed parallel cores. This appears as an advantage for Alya's users who, having a large set of parallel preconditioners available through the interface to PSCToolkit, can select the best one for their specific aims.

Fig. 9: Strong scalability: speedup of the linear solvers for systems of size n1, n2 and n3, as a function of the number of MPI cores. We note that the ideal values of the speedup in the three configurations are 1, 2 and 4, respectively.

6.1.2 Weak scalability

In this section, we analyze the weak scalability of the AMG4PSBLAS preconditioners, i.e., we observe the solvers' behavior when we fix the mesh size per core and increase the number of cores. We considered the same test case and the three meshes of the previous section in the three possible configurations of computational cores, from 48 up to 3072, from 96 up to 6144, and from 192 up to 12288. The different configurations of cores correspond to three different (decreasing) mesh sizes per core equal to nxcore1 = 1.1 × 10^5, nxcore2 = 5.9 × 10^4, and nxcore3 = 2.9 × 10^4, respectively. Note that the medium and the large mesh correspond to scaling factors of 8 and 64, respectively, with respect to the small mesh; therefore, in the same way, we scaled the number of cores for our weak scalability analysis. We limit our analysis to observing the average number of linear iterations of the different employed preconditioners per time step in the various simulations, and to analyzing the execution times and the scaled speedup of the solve phase.

In Figure 10, we report the average number of iterations for each time step. We can observe a general increase, ranging from 35 to 70 for an increasing number of cores, when the original AlyaDefCG is employed. On the other hand, when the AMG preconditioners from AMG4PSBLAS coupled with FCG by PSBLAS are applied, we observe a constant average number of iterations equal to 5 for MLVSMATCH4, both for the small and the large mesh, independently of the number of cores, while MLVSBM requires 3 iterations for the small mesh and 6 for the large mesh. MLVSMATCH3 ranges from 4 to 6 iterations on the small mesh and the large mesh, respectively. In the case of the medium mesh, in agreement with what was observed in the strong scalability analysis, all the preconditioners require a larger average number of iterations, which is 8 for MLVSBM and MLVSMATCH4, and 6 for MLVSMATCH3. This behavior indicates a very promising algorithmic scalability of MLVSMATCH4.

Fig. 10: Weak scalability: average number of linear iterations per time step, for nxcore1 dofs per core (10a), nxcore2 dofs per core (10b), and nxcore3 dofs per core (10c).
In Figures 11-12, we can see the total solve time and the corresponding scaled speedup. We can observe that, as expected from the previous sections, all preconditioners from AMG4PSBLAS generally lead to a smaller increase ratio in the solve times with respect to the original AlyaDefCG when the mesh size goes from the small to the large one. In more detail, we observe that, for all mesh sizes per core, smaller increase ratios in the execution time are generally obtained with MLVSMATCH3 and MLVSMATCH4. This is better observed by looking at the scaled speedup. It is defined as scalfactor × T_minp / T_p, where scalfactor = 1, 8, 64 for the three increasing numbers of cores, T_minp is the total time for solving the linear systems when the minimum number of total cores is involved in the simulation, per each mesh size per core, and T_p is the total time spent in the linear solvers for each of the increasing numbers of cores used for the specified mesh size per core. We observe that the best values are obtained with the MLVSMATCH3 and MLVSMATCH4 preconditioners when nxcore1 and nxcore2 dofs per core are used. In detail, for nxcore1 dofs per core, MLVSMATCH3 reaches the best value of about 71% of scaled efficiency on 3072 cores, and about 44% of scaled efficiency on 6144 cores when nxcore2 dofs per core are employed. This shows that the scalability of MLVSMATCH3 and MLVSMATCH4 is very promising in facing the exascale challenge, especially when the resources are used at their best in terms of node memory capacity and bandwidth. On the other hand, in the case of nxcore3 dofs per core (12c), the scaled speedup of AlyaDefCG is better; this is essentially due to the very large solve time spent by this solver on 192 cores.

Fig. 11: Weak scalability: total solve time (s) of the linear solvers, for nxcore1 dofs per core (11a), nxcore2 dofs per core (11b), and nxcore3 dofs per core (11c).

Fig. 12: Weak scalability: scaled speedup of the linear solvers, for nxcore1 dofs per core (12a), nxcore2 dofs per core (12b), and nxcore3 dofs per core (12c).

6.2 Results at extreme scales

In this section, we discuss some results obtained on the Juwels supercomputer by increasing the number of dofs up to n4 ≈ 2.9 × 10^9. We limit our analysis to the weak scalability results of one of the most promising solvers in PSCToolkit. Indeed, due to the limited access to the Juwels resources, and taking into account the above comparison of preconditioners, we only ran experiments using the MLVSMATCH4 preconditioner. A general row-block data distribution based on a parallel geometric partitioning using Space Filling Curves (SFC) [8] was applied for these experiments.
As in the previous experiments, we analyze the parallel efficiency and convergence behavior of the linear solver for 20 time steps after a pre-processing phase, so that we focus on the solver behavior in the simulation of a fully developed flow, for all the meshes but the largest one, where we were not able to skip the transient phase due to the long simulation time. In this last case, we considered a total number of time steps equal to 1379 and analyzed the solver performance in the last 20 time steps. Note that increasing the mesh size imposes a decrease in the time step due to the stability constraints of the explicit time discretization that is preferred for LES simulations. Therefore, the total simulated time depends on the mesh size. Furthermore, to reduce the oscillations observed when running on full nodes, we used only 46 cores per node. As already mentioned, we analyze the weak scalability of the solvers; we considered a mesh size per core equal to nxcore1 and used a scaling factor of 8 to go up to the largest mesh size; therefore, in the same way, we scaled the number of cores for our weak scalability analysis. We limit our analysis to observing the average number of linear iterations of the solver per time step and to analyzing the execution times and the scaled speedup of the solve phase. We compare the results obtained by using the PSCToolkit's solver against Alya's Conjugate Gradient solver (hereafter AlyaCG). Observe that in these experiments we also tried to use the Deflated CG implemented in Alya, but it does not work for the two larger test cases, and AlyaCG appears better in the case of the smaller meshes.

In Figure 13, we report the average number of iterations per time step. When the original AlyaCG is employed, we observe a general increase, ranging from 133 to 331 for an increasing number of cores, except on 368 cores, where an average iteration count of 95 is obtained; on the other hand, very good algorithmic scalability, with an average number of linear iterations per time step ranging from 4 to 6, is observed when the PSCToolkit's solver is applied.

Fig. 13: Weak scalability: average number of linear iterations per time step. System sizes from n1 to n4.

In Figures 14a-14b, we can see the total solve time and the corresponding scaled speedup. We can observe that the good algorithmic scalability of MLVSMATCH4 leads to an almost flat execution time for the solve phase when the first three meshes are employed, while a decrease is observed for the simulation carried out with the largest mesh, due to a smaller average number of iterations per time step. On the contrary, the original AlyaCG generally shows a huge increase for increasing numbers of cores and mesh sizes, except in the second configuration, where a decrease in the average number of iterations per time step is observed. We then look at the scaled speedup, defined as scalfactor × T_45 / T_p, where scalfactor = 1, 8, 64, 512 for the increasing numbers of cores, T_45 is the total time for solving the linear systems when 45 cores are involved in the simulation, and T_p is the total time spent in the linear solvers for each of the increasing numbers of cores. We observe that for the two larger meshes, MLVSMATCH4 has a super-linear scaled speedup of about 71 (compared with the ideal value of 64) and 640 (compared with the ideal value of 512), respectively, showing that its very good algorithmic scalability is coupled with excellent implementation scalability of all the basic computational kernels. This scalability is very promising in facing the exascale challenge.
Fig. 14: Weak scalability, system sizes from n1 to n4: (a) total solve time of the linear solvers (s); (b) scaled speedup of the linear solvers.

7 Conclusions

In this paper, we presented our work on improving the linear solver capabilities of a large-scale CFD code by interfacing it with a software framework, including new and state-of-the-art algebraic linear solvers, specifically designed to exploit the very large potential of current petascale supercomputers and aimed at the early exascale supercomputers. Our activities were carried out in the context of the European Center of Excellence for Energy applications, where one of the lighthouse codes was the Alya code, developed at the Barcelona Supercomputing Center (BSC) and applied to wind flow studies for renewable energy production. However, this work has a wider impact, and confirms the benefits of using third-party software libraries developed by specialists in complex, multi-component and multi-physics simulation codes.

From Alya's perspective, the most significant achievement has been obtaining excellent algorithmic scalability thanks to multigrid preconditioners, as shown in the weak scalability studies. This allows us to solve much bigger problems efficiently. Our first objective for the future is to test the GPU version of PSCToolkit. During EoCoE, we have significantly optimized the FE assembly on GPUs, making it four times more energy efficient than the CPU version; integrating a competitive linear algebra GPU package is now the next priority. After that, having the entire workflow for incompressible flow problems on GPUs should be relatively straightforward. We expect to have a much higher number of unknowns per MPI process when GPU accelerators are exploited; therefore, strong scalability should be much less critical. While we have focused on a wind energy problem in this work, we wish to test the solver on other incompressible flow problems in the future. Moreover, since the solver is fully interfaced with Alya, it will be interesting to test the suitability of PSCToolkit for other problems, such as solid mechanics or heat transfer.

Acknowledgments. We thank the two anonymous reviewers whose suggestions helped improve and clarify this manuscript. P.D., F.D. and S.F. are members of the INdAM Research Group GNCS.

8 Declarations

Ethical Approval
N/A

Competing interests
N/A

Authors' contributions
The first two authors are involved in the development of the Alya code and worked on the integration of PSCToolkit into Alya. Herbert Owen made the main contribution to the setup and running of the experiments. The last three authors are the developers of PSCToolkit and experts in linear algebra. They contributed equally to the setup of suitable linear solvers, the analysis of the results, and the writing of the paper.

Funding
The research received funding, and PRACE grants for supercomputer access, from the Horizon 2020 Project "Energy oriented Centre of Excellence: toward exascale for energy" (EoCoE-II), Project ID: 824158.

Availability of data and materials
PSCToolkit is available at https://psctoolkit.github.io/. The data that support the findings of this study are available from the corresponding author upon reasonable request.
References

[1] Amestoy P, Duff I, L'Excellent JY (2000) Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput Methods Appl Mech Eng 184(2):501-520. https://doi.org/10.1016/S0045-7825(99)00242-X

[2] Aprovitola A, D'Ambra P, Denaro F, et al (2011) Scalable algebraic multilevel preconditioners with application to CFD. Lecture Notes in Computational Science and Engineering 74:15-27. https://doi.org/10.1007/978-3-642-14438-7_2

[3] Aprovitola A, D'Ambra P, Denaro FM, et al (2015) SParC-LES: Enabling large eddy simulations with parallel sparse matrix computation tools. Computers and Mathematics with Applications 70(11):2688-2700. https://doi.org/10.1016/j.camwa.2015.06.028

[4] Baker AH, Falgout RD, Kolev TV, et al (2011) Multigrid smoothers for ultraparallel computing. SIAM J Sci Comput 33(5):2864-2887. https://doi.org/10.1137/100798806

[5] Bechmann A, Sorensen NN, Berg J, et al (2011) The Bolund experiment, Part II: Blind comparison of microscale flow models. Boundary-Layer Meteorology 141(2)

[6] Berg J, Mann J, Bechmann A, et al (2011) The Bolund Experiment, Part I: Flow Over a Steep, Three-Dimensional Hill. Bound-Layer Meteorol 141(2):219. https://doi.org/10.1007/s10546-011-9636-y

[7] Bolund (2022) The Bolund experiment. https://www.bolund.vindenergi.dtu.dk, accessed: 2022-09-27

[8] Borrell R, Cajas JC, Mira D, et al (2018) Parallel mesh partitioning based on space filling curves. Comput & Fluids 173:264-272. https://doi.org/10.1016/j.compfluid.2018.01.040

[9] Buttari A, D'Ambra P, Di Serafino D, et al (2007) 2LEV-D2P4: A package of high-performance preconditioners for scientific and engineering applications. Applicable Algebra in Engineering, Communications and Computing 18(3):223-239. https://doi.org/10.1007/s00200-007-0035-z

[10] Capuano F, Coppola G, Chiatto M, et al (2016) Approximate projection method for the incompressible Navier-Stokes equations. AIAA Journal 54(7):2179-2182. https://doi.org/10.2514/1.J054569

[11] Catalyürek UV, Dobrian F, Gebremedhin A, et al (2011) Distributed-memory parallel algorithms for matching and coloring. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pp 1971-1980. https://doi.org/10.1109/IPDPS.2011.360

[12] Codina R (2001) Pressure stability in fractional step finite element methods for incompressible flows. J Comput Phys 170:112-140

[13] D'Ambra P, Vassilevski PS (2013) Adaptive AMG with coarsening based on compatible weighted matching. Comput Vis Sci 16(2):59-76. https://doi.org/10.1007/s00791-014-0224-9

[14] D'Ambra P, di Serafino D, Filippone S (2010) MLD2P4: a package of parallel algebraic multilevel domain decomposition preconditioners in Fortran 95. ACM Trans Math Software 37(3):Art. 30, 23 pp. https://doi.org/10.1145/1824801.1824808

[15] D'Ambra P, Filippone S, Vassilevski PS (2018) BootCMatch: a software package for bootstrap AMG based on graph weighted matching. ACM Trans Math Software 44(4):Art. 39, 25 pp. https://doi.org/10.1145/3190647

[16] D'Ambra P, Durastante F, Filippone S (2021) AMG preconditioners for linear solvers at extreme scale. SIAM J Sci Comput 43(5). https://doi.org/10.1137/20M134914X

[17] Filippone S, Buttari A (2012) Object-oriented techniques for sparse matrix computations in Fortran 2003. ACM TOMS 38(4):23:1-23:20

[18] Filippone S, Colajanni M (2000) PSBLAS: a library for parallel linear algebra computations on sparse matrices. ACM TOMS 26(4):527-550

[19] Houzeaux G, de la Cruz R, Owen H, et al (2013) Parallel uniform mesh multiplication applied to a Navier-Stokes solver. Comput & Fluids 80:142-151. https://doi.org/10.1016/j.compfluid.2012.04.017

[20] Houzeaux G, Borrell R, Fournier Y, et al (2018) High-Performance Computing: Dos and Don'ts. In: Ionescu A (ed) Computational Fluid Dynamics - Basic Instruments and Applications in Science. IntechOpen, pp 3-41. https://doi.org/10.5772/intechopen.72042

[21] Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359-392. https://doi.org/10.1137/S1064827595287997

[22] Lehmkuhl O, Houzeaux G, Owen H, et al (2019) A low-dissipation finite element scheme for scale resolving simulations of turbulent flows. Journal of Computational Physics 390:51-65. https://doi.org/10.1016/j.jcp.2019.04.004

[23] Li XS (2005) An overview of SuperLU: algorithms, implementation, and user interface. ACM Trans Math Software 31(3):302-325. https://doi.org/10.1145/1089014.1089017

[24] Owen H, Chrysokentis G, Avila M, et al (2020) Wall-modeled large-eddy simulation in a finite element framework. International Journal for Numerical Methods in Fluids 92(1):20-37. https://doi.org/10.1002/fld.4770

[25] PRACE (accessed May 2020) Unified European Application Benchmark Suite. https://repository.prace-ri.eu/git/UEABS/ueabs/

[26] PRACEBS (accessed May 2020) PRACE benchmark-suite. https://prace-ri.eu/training-support/technical-documentation/benchmark-suites/

[27] Vaněk P, Mandel J, Brezina M (1996) Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing 56(3):179-196. https://doi.org/10.1007/BF02238511. International GAMM-Workshop on Multi-level Methods (Meisdorf, 1994)

[28] Vassilevski PS (2008) Multilevel Block Factorization Preconditioners: Matrix-Based Analysis and Algorithms for Solving Finite Element Equations. Springer, New York

[29] Vazquez M, Houzeaux G, Koric S, et al (2016) Alya: multiphysics engineering simulation toward exascale. J Comput Sci 14:15-27. https://doi.org/10.1016/j.jocs.2015.12.007

[30] Vreman AW (2004) An eddy-viscosity subgrid-scale model for turbulent shear flow: Algebraic theory and applications. Phys Fluids 16(10):3670-3681