
Block Iterators for Sparse Matrices

2016, Proceedings of the 2016 Federated Conference on Computer Science and Information Systems

Finding an optimal block size for a given sparse matrix forms an important problem for storage formats that partition matrices into uniformly-sized blocks. Finding a solution to this problem can take a significant amount of time, which, effectively, may negate the benefits that such a format brings into sparse-matrix computations. A key to an efficient solution is the ability to quickly iterate, for a particular block size, over the nonzero blocks of a matrix. This work proposes an efficient parallel algorithm for this task and evaluates it experimentally on modern multi-core and many-core high performance computing (HPC) architectures.

Daniel Langr*†, Ivan Šimeček* and Tomáš Dytrych‡
* Czech Technical University in Prague, Faculty of Information Technology, Department of Computer Systems, Thákurova 9, 160 00, Praha, Czech Republic. Email: {langrd,xsimecek}@fit.cvut.cz
† Výzkumný a zkušební letecký ústav, a.s., Beranových 130, 199 05, Praha, Czech Republic
‡ Czech Academy of Sciences, Nuclear Physics Institute, Řež 130, 250 68, Řež, Czech Republic. Email: [email protected]

Proceedings of the Federated Conference on Computer Science and Information Systems, pp. 695–704, ACSIS, Vol. 8, ISSN 2300-5963, DOI: 10.15439/2016F35

This work was supported by the Czech Science Foundation under Grant No. 16-16772S, and by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme, as well as by the Czech Ministry of Education, Youth and Sports via the project Large Research, Development and Innovations Infrastructures (LM2011033).

I. INTRODUCTION

Storage formats prescribe the way sparse matrices are stored in computer memory. Many existing formats are based on partitioning matrices into blocks such that 1) the blocks have a uniform size, and 2) this size is not fixed for a given format and may be chosen for each matrix individually. We call such formats uniformly-blocking formats or, for short, UB formats.

Considering a particular sparse matrix A and a particular UB format, we thus face the problem of finding an optimal block size (for some notion of optimality). Typically, we want to find a block size that provides the highest performance of sparse matrix-vector multiplication (SpMV) performed with A. Generally, this task cannot be accomplished until the matrix is fully assembled, or at least until its structure of nonzero elements is fully known, which implies that matrices cannot be assembled in UB formats directly (there are usually other reasons as well). Instead, one needs to
1) assemble A in some suitable (non-parametrized) simple storage format,
2) find an optimal block size for A,
3) transform A in memory from its original storage format to the given UB format.

The second and third steps form an important problem related to a given UB format. If the solution of this problem takes too long, it might effectively negate the benefits that a UB format brings into subsequent computations with A. To find an optimal block size, there is usually no option other than
1) to form some set of possibly-optimal block sizes, and
2) to evaluate an optimization criterion for all of them.
Note that this approach generally gives a pseudo-optimal block size instead of an optimal one.
For the sake of simplicity, we do not distinguish between these two cases and call both of them optimal throughout this text.

Evaluation of an optimization criterion for a given UB format, a given matrix A, and a particular tested block size typically involves gathering some information about all nonzero blocks of A. We therefore need to examine all these nonzero blocks. Such a procedure can be described briefly as follows: for all nonzero blocks of A, perform some calculations that contribute to the evaluation of the optimization criterion. Thus, in fact, we need to iterate over the nonzero blocks of A. When an optimal block size is found, this iterative process has to be run once again within the third step mentioned above, i.e., during the final transformation of A to the given UB format.

This paper addresses the problem of fast iteration over the nonzero blocks of a sparse matrix. We propose an efficient, scalable parallel algorithm for this problem and evaluate it experimentally on modern multi-core and many-core HPC architectures, where matrices frequently emerge in multi-threaded programs. (In distributed-memory environments, the objects of our concern are "local" matrices formed by the nonzero elements mapped to particular application processes.)

II. CASE STUDY

As an illustrative case study, we work throughout this text with the adaptive-blocking hierarchical storage format (ABHSF) [1], [2]. This format partitions A into uniform square blocks of size s and stores each block in memory in a space-optimal way. The optimization criterion of the ABHSF is the total memory footprint of A, which is being minimized. This is a very common optimization criterion for storage formats in general (not only UB formats), since the performance of SpMV is limited by the bandwidths of memory subsystems on modern HPC architectures [3].

Many UB formats work with in-block row and column indexes. Optimal (space-optimal) block sizes are then typically those that employ most or all of the available indexing bits. In the case of the ABHSF and byte-padded in-block indexes, setting s = 256 is almost always optimal or at least close to optimal [1]. (Such a choice eliminates the discussed optimization problem; however, it does not eliminate the need to iterate over the nonzero blocks of A, since this process is still required for the transformation of A into the ABHSF.)

On the other hand, if we really want to minimize the memory footprint of A stored in the ABHSF, we need to use the minimum possible number of bits for in-block indexes, i.e., ⌈log2 s⌉. In such cases, any block size s can generally be optimal. Let S = {s1, s2, . . .} denote some set of possibly-optimal block sizes. The transformation of A into the ABHSF can then be written as Algorithm 1.

Algorithm 1: Transformation of A to the ABHSF
  Input: A: sparse matrix
  Input: S = {s1, s2, . . .}: set of possibly-optimal block sizes
  Data: Mopt, M, sopt, i: auxiliary variables
   1  Mopt ← 0
   2  for i ← 1 to |S| do
   3      M ← 0
   4      for each nonzero block B of size si of A do
   5          find a space-optimal way W to store B in memory considering ⌈log2 si⌉ bits for in-block indexes
   6          calculate the contribution c(B, W) of B stored in W to the memory footprint of A
   7          M ← M + c(B, W)
   8      end
   9      if i = 1 or M < Mopt then
  10          sopt ← si
  11          Mopt ← M
  12      end
  13  end
  14  for each nonzero block B of size sopt of A do
  15      store B in memory in the ABHSF
  16  end

Algorithm 1 iterates over the nonzero blocks of A exactly |S| + 1 times. Therefore, we want |S| to be 1) large enough to find the best possible block size, and 2) small enough to prevent long algorithm running times. One way to get close to both these outcomes is to consider only the block sizes

    S = {2^k : 1 ≤ k ≤ kmax},                                          (1)

which implies maximum utilization of all k bits for in-block indexes. The kmax parameter, which corresponds to |S|, determines the upper bound for tested block sizes. In practice, setting kmax = 10 is typically sufficient, which implies 11 iterations over the nonzero blocks of A while testing the block sizes s = 2, 4, 8, . . . , 1024 within Algorithm 1. A compact C++ rendition of this search loop is sketched below.
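To make the search loop of Algorithm 1 concrete, the following C++ sketch shows one possible way to drive it; it is an illustration only, not the authors' implementation. The block iterator and the per-block cost model are passed in as callables (hypothetical interfaces named here iterate and cost), and the toy stand-ins in main exist only so the sketch compiles and runs.

    #include <cstdint>
    #include <cstdio>
    #include <limits>

    // iterate(k, fn): calls fn(block_nnz) once per nonzero block for s = 2^k.
    // cost(block_nnz, k): returns the block's contribution c(B, W) in bytes.
    template <typename Iterate, typename Cost>
    std::uint32_t find_optimal_block_size(Iterate iterate, Cost cost,
                                          std::uint32_t k_max = 10) {
        std::uint64_t best = std::numeric_limits<std::uint64_t>::max();   // M_opt
        std::uint32_t s_opt = 2;
        for (std::uint32_t k = 1; k <= k_max; ++k) {      // S = {2^k : 1 <= k <= k_max}
            std::uint64_t footprint = 0;                  // M
            iterate(k, [&](std::uint64_t block_nnz) { footprint += cost(block_nnz, k); });
            if (footprint < best) {                       // lines 9-12 of Algorithm 1
                best = footprint;
                s_opt = 1u << k;
            }
        }
        return s_opt;   // the final pass (lines 14-16 of Algorithm 1) follows separately
    }

    int main() {
        // Toy stand-ins so the sketch runs: three blocks with 5, 1, and 9 nonzeros
        // for every tested k, and a crude cost model charging 2*k bits of in-block
        // indexes per nonzero (for illustration only; not the real ABHSF cost model).
        auto iterate = [](std::uint32_t, auto&& fn) { fn(5); fn(1); fn(9); };
        auto cost = [](std::uint64_t block_nnz, std::uint32_t k) {
            return block_nnz * (2 * k) / 8 + 1;           // bytes, roughly
        };
        std::printf("chosen block size: %u\n",
                    static_cast<unsigned>(find_optimal_block_size(iterate, cost)));
    }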
III. NOTATION

Let A be an m × n sparse matrix, where ai,j denotes the value of the element of A located in its ith row and jth column. As a mathematical object, A can be written as

            ( a1,1  · · ·  a1,n )
    A   =   (  ⋮     ⋱      ⋮   )                                      (2)
            ( am,1  · · ·  am,n )

However, within computer programs, we typically work only with the nonzero elements of sparse matrices (or even only with the nonzero elements from a single triangular part if a matrix exhibits some kind of symmetry). An element of A is determined by its value, row index, and column index; let us write it as a triplet (i, j, ai,j). As a data structure, we can consider A to be a set of matrix nonzero elements:

    A = { (i, j, ai,j) : 1 ≤ i ≤ m, 1 ≤ j ≤ n, ai,j ≠ 0 }.             (3)

Moreover, nonzero elements stored in memory are accessible in some order, which is typically prescribed by a given storage format. If this order matters, we can consider A to be a sequence of matrix nonzero elements:

    A = ( (il, jl, ail,jl) ),  l = 1, . . . , nnz,  with ail,jl ≠ 0,    (4)

where nnz denotes the number of nonzero elements of A. In the text below, we use forms (2), (3), and (4) interchangeably, preferring the particular one depending on the actual context. For the sake of simplicity, we also consider partitioning into square blocks only; the generalization to rectangular blocks is straightforward.

Partitioning A into square blocks of size s yields an M × N block matrix, where M = ⌈m/s⌉ and N = ⌈n/s⌉. For indexing block rows and columns, we use the capital letters I and J, respectively. A block is called nonzero if it contains at least one nonzero matrix element. A matrix element (i, j, ai,j) belongs to the block with indexes

    I = ⌊(i − 1)/s⌋ + 1  and  J = ⌊(j − 1)/s⌋ + 1.                      (5)

By using \ for integer division, we can rewrite (5) as

    I = (i − 1)\s + 1  and  J = (j − 1)\s + 1.                          (6)

The element's in-block indexes can be found correspondingly as (i − 1) mod s + 1 and (j − 1) mod s + 1.

Note that the calculations of block indexes and local in-block indexes for nonzero matrix elements involve integer division and modulo, which are relatively expensive arithmetic operations [4]. When possibly-optimal block sizes are chosen according to (1), both integer division and modulo can be substituted by much faster logical shift and bitwise AND operations. A brief sketch of both index calculations is given below.
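The following minimal C++ sketch (an illustration, not code from the paper) contrasts the two ways of computing the block and in-block indexes of an element (i, j): the general formulas (5)–(6) with integer division and modulo, and the shift/AND variant available when s = 2^k.

    #include <cassert>
    #include <cstdint>

    struct block_coords { std::uint32_t I, J, r, c; };  // block indexes and in-block indexes

    // General block size s: integer division and modulo, 1-based, as in (5)-(6).
    inline block_coords coords_div(std::uint32_t i, std::uint32_t j, std::uint32_t s) {
        return { (i - 1) / s + 1, (j - 1) / s + 1,
                 (i - 1) % s + 1, (j - 1) % s + 1 };
    }

    // Power-of-two block size s = 2^k: a shift replaces the division and a bitwise
    // AND replaces the modulo.
    inline block_coords coords_shift(std::uint32_t i, std::uint32_t j, std::uint32_t k) {
        const std::uint32_t mask = (1u << k) - 1;
        return { ((i - 1) >> k) + 1, ((j - 1) >> k) + 1,
                 ((i - 1) & mask) + 1, ((j - 1) & mask) + 1 };
    }

    int main() {
        // Example: element (i, j) = (1000, 37) with block size s = 16 = 2^4.
        const block_coords a = coords_div(1000, 37, 16);
        const block_coords b = coords_shift(1000, 37, 4);
        assert(a.I == b.I && a.J == b.J && a.r == b.r && a.c == b.c);
    }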
IV. ALGORITHMS

Let us now analyse the problem of iteration over the nonzero blocks of A. In the most generic case, we have, at the outset, no knowledge of which blocks of A are nonzero and which nonzero elements of A belong to these blocks. There are basically two ways to find this out:
1) iterate over all blocks of A and, for each block, find its nonzero elements;
2) iterate over all nonzero elements of A, find for each of them the corresponding block via (6), and save the information that the element belongs to this block.
Pseudocodes for these two options are provided as Algorithms 2 and 3, respectively. Processing of blocks is application-dependent; it might, e.g., represent the calculation of block contributions to the optimization criterion (lines 5–7 of Algorithm 1) or the storage of blocks in memory (line 15).

Algorithm 2: Iteration over nonzero blocks of A: variant 1
  Input: A: sparse matrix
  Input: s: block size
  Data: B: nonzero elements of a single block
  Data: I, J: indexes
   1  for I ← 1 to ⌈m/s⌉ do
   2      for J ← 1 to ⌈n/s⌉ do
   3          B ← {}
   4          for all (i, j, ai,j) ∈ A do
   5              if (i − 1)\s + 1 = I and (j − 1)\s + 1 = J then
   6                  B ← B ∪ {(i, j, ai,j)}
   7              end
   8          end
   9          if B ≠ {} then
  10              process block B with indexes I and J
  11          end
  12      end
  13  end

Algorithm 3: Iteration over nonzero blocks of A: variant 2
  Input: A: sparse matrix
  Input: s: block size
  Data: BI,J: nonzero elements of a block in the Ith block row and Jth block column
  Data: I, J: indexes
   1  for I ← 1 to ⌈m/s⌉ do
   2      for J ← 1 to ⌈n/s⌉ do BI,J ← {}
   3  end
   4  for all (i, j, ai,j) ∈ A do
   5      I ← (i − 1)\s + 1
   6      J ← (j − 1)\s + 1
   7      BI,J ← BI,J ∪ {(i, j, ai,j)}
   8  end
   9  for I ← 1 to ⌈m/s⌉ do
  10      for J ← 1 to ⌈n/s⌉ do
  11          if BI,J ≠ {} then
  12              process block BI,J with indexes I and J
  13          end
  14      end
  15  end

Algorithm 2 has low memory requirements; its auxiliary space is S2(A, s) = O(s^2), since, at a given time, only the nonzero elements of a single block need to be kept in memory. The drawback of this algorithm is its high time complexity T2(A, s) = Θ(m · n · nnz / s^2). As for Algorithm 3, its time complexity is considerably lower, namely T3(A, s) = Θ(m · n / s^2 + nnz). However, the auxiliary space of Algorithm 3 is

    S3(A, s) = O(m · n / s^2 + nnz),

since one needs to save the information about all nonzero elements for each nonzero block. Moreover, an implementation of this algorithm would likely require some complex dynamic data structure, which might introduce problems with memory fragmentation and expensive insertion/look-up operations.

Whenever working with sparse matrices, we generally want to avoid algorithms with Ω(nnz) auxiliary space as much as possible. Within many running instances of HPC programs, matrices are the largest objects in computer memory, and their sizes determine the extent of the underlying computational problems. Any algorithm with Ω(nnz) auxiliary space (such as Algorithm 3) thus, in effect, considerably limits the size of the problem being solved.

To avoid the high time complexity of Algorithm 2 as well as the high auxiliary space of Algorithm 3, we propose another solution for iteration over the nonzero blocks of A that works as follows:
1) The nonzero elements of A are reordered such that the nonzero elements of each block are laid out consecutively (grouped together) in memory. In other words, the nonzero elements are sorted with respect to blocks.
2) A single iteration over the nonzero elements is performed, during which the elements of each nonzero block are identified and processed.
The pseudocode of such a solution is provided as Algorithm 4.

Algorithm 4: Iteration over nonzero blocks of A: variant 3
  Input: A: sparse matrix
  Input: s: block size
  Data: I, I′, J, J′, l, l1: indexes
   1  sort A = ((il, jl, ail,jl)), l = 1, . . . , nnz, with respect to blocks
   2  l1 ← 1
   3  I ← (i1 − 1)\s + 1
   4  J ← (j1 − 1)\s + 1
   5  for l ← 2 to nnz do
   6      I′ ← (il − 1)\s + 1
   7      J′ ← (jl − 1)\s + 1
   8      if I′ ≠ I or J′ ≠ J then
   9          process block with indexes I and J that contains the nonzero elements (iq, jq, aiq,jq), q = l1, . . . , l − 1
  10          l1 ← l
  11          I ← I′
  12          J ← J′
  13      end
  14  end
  15  process block with indexes I and J that contains the nonzero elements (iq, jq, aiq,jq), q = l1, . . . , nnz

The time complexity and auxiliary space of Algorithm 4 are dominated by the sorting step (line 1). Let us assume that we use an in-place randomized quicksort with time complexity O(nnz · log2 nnz) and auxiliary space O(log2 nnz). The overall time complexity of Algorithm 4 then will be

    T4(A, s) = O(nnz · log2 nnz),

and its auxiliary space

    S4(A, s) = O(log2 nnz)

as well. Algorithm 4 thus reduces both the time complexity of Algorithm 2 and the auxiliary space of Algorithm 3, however, at the following price: it requires A to be provided in a format that facilitates reordering/sorting its nonzero elements. There is practically only one candidate: the coordinate storage format (COO) [5], [6]. It consists of three arrays containing the row indexes, column indexes, and values of the nonzero elements, and, at the same time, it does not prescribe any particular ordering for these arrays. To require A to be initially in COO is not as restrictive in practice as it might seem, since:
1) any sparse matrix can be easily and quickly transformed into COO regardless of its original storage format,
2) COO is the most convenient format for assembling sparse matrices (newly generated nonzero elements are simply appended to the corresponding COO arrays).
A scenario where matrices are first assembled in COO and then transformed to another, computationally more suitable, storage format (such as some UB format) is thus perfectly viable for HPC programs.

To sort the nonzero elements with respect to blocks, we can define sorting keys by using the pairs of block indexes I and J calculated by (5). For example, if we want the blocks to be sorted lexicographically, we can calculate the sorting keys as I · N + J. Again, note that choosing (1) for possibly-optimal block sizes implies faster calculation of the sorting keys and therefore, in effect, a likely faster sorting step within Algorithm 4. A sequential sketch of this approach over the COO arrays is shown below.
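The following sequential C++ sketch illustrates Algorithm 4 over the three COO arrays, using the lexicographic block key I · N + J. It is a simplified illustration under two assumptions that differ from the paper's implementation: block coordinates and keys are 0-based, and the sort is performed on an index permutation with std::sort, whereas the experiments use AQsort to permute the COO arrays themselves in place.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct coo {                        // coordinate (COO) storage format
        std::vector<std::uint32_t> rows, cols;
        std::vector<double> vals;
        std::uint32_t m = 0, n = 0;     // matrix dimensions
    };

    // Lexicographic block key for 1-based element indexes and block size s = 2^k.
    inline std::uint64_t block_key(std::uint32_t i, std::uint32_t j,
                                   std::uint32_t k, std::uint64_t N) {
        const std::uint64_t I = (i - 1) >> k;   // 0-based block row
        const std::uint64_t J = (j - 1) >> k;   // 0-based block column
        return I * N + J;
    }

    // Calls process(first, last, key) once per nonzero block, where [first, last)
    // indexes the sorted permutation of the block's elements.
    template <typename ProcessBlock>
    void iterate_nonzero_blocks(const coo& A, std::uint32_t k, ProcessBlock process) {
        const std::uint64_t N = ((A.n - 1) >> k) + 1;    // number of block columns
        std::vector<std::size_t> perm(A.rows.size());
        for (std::size_t l = 0; l < perm.size(); ++l) perm[l] = l;
        // Line 1 of Algorithm 4: sort with respect to blocks (here via a permutation).
        std::sort(perm.begin(), perm.end(), [&](std::size_t a, std::size_t b) {
            return block_key(A.rows[a], A.cols[a], k, N)
                 < block_key(A.rows[b], A.cols[b], k, N);
        });
        // Single pass: emit each maximal run of elements sharing a block key.
        std::size_t first = 0;
        for (std::size_t l = 1; l <= perm.size(); ++l) {
            if (l == perm.size() ||
                block_key(A.rows[perm[l]], A.cols[perm[l]], k, N) !=
                block_key(A.rows[perm[first]], A.cols[perm[first]], k, N)) {
                process(perm.data() + first, perm.data() + l,
                        block_key(A.rows[perm[first]], A.cols[perm[first]], k, N));
                first = l;
            }
        }
    }

    int main() {
        coo A;                                   // a tiny 4x4 example matrix
        A.m = A.n = 4;
        A.rows = {1, 1, 3, 4};
        A.cols = {1, 4, 3, 4};
        A.vals = {1.0, 2.0, 3.0, 4.0};
        iterate_nonzero_blocks(A, 1, [](const std::size_t* f, const std::size_t* l,
                                        std::uint64_t key) {
            std::printf("block key %llu: %zu element(s)\n",
                        static_cast<unsigned long long>(key),
                        static_cast<std::size_t>(l - f));
        });
    }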
A. Parallelization

Parallelization of the (expensive) Algorithms 2 and 3 is straightforward. In Algorithm 2, we can parallelize the inner-most loop (line 4) while synchronizing concurrent updates of B at line 6. In Algorithm 3, we can parallelize the loops over blocks (lines 1–2 and 9–10) as well as the loop over nonzero matrix elements (line 4), while using thread-local I and J indexes and synchronizing concurrent updates of BI,J at line 7. Parallelization of Algorithm 4 is a bit more complex; we propose its multi-threaded variant as Algorithm 5.

Algorithm 5: Parallel iteration over nonzero blocks of A
  Input: A: sparse matrix
  Input: s: block size
  Input: T: number of threads
  Data: I, I′, J, J′, l, l1, t: thread-private indexes
  Data: tb[]: thread-shared integer array of size T + 1
   1  sort A = ((il, jl, ail,jl)), l = 1, . . . , nnz, in parallel with respect to blocks
   2  tb[1] ← 1
   3  tb[T + 1] ← nnz + 1
   4  for all threads do in parallel
   5      t ← current thread number (between 1 and T)
   6      if t > 1 then
   7          l ← (nnz · (t − 1))\T + 1
   8          I ← (il − 1)\s + 1
   9          J ← (jl − 1)\s + 1
  10          l ← l + 1
  11          while l ≤ nnz do
  12              I′ ← (il − 1)\s + 1
  13              J′ ← (jl − 1)\s + 1
  14              if I′ ≠ I or J′ ≠ J then break
  15              l ← l + 1
  16          end
  17          tb[t] ← l
  18      end
  19      perform barrier to synchronize threads
  20      l1 ← tb[t]
  21      I ← (il1 − 1)\s + 1
  22      J ← (jl1 − 1)\s + 1
  23      for l ← tb[t] + 1 to tb[t + 1] − 1 do
  24          I′ ← (il − 1)\s + 1
  25          J′ ← (jl − 1)\s + 1
  26          if I′ ≠ I or J′ ≠ J then
  27              process block with indexes I and J that contains the nonzero elements (iq, jq, aiq,jq), q = l1, . . . , l − 1
  28              l1 ← l
  29              I ← I′
  30              J ← J′
  31          end
  32      end
  33      process block with indexes I and J that contains the nonzero elements (iq, jq, aiq,jq), q = l1, . . . , tb[t + 1] − 1
  34  end

Note that we cannot simply parallelize the main loop of Algorithm 4 (line 5), since its uniform splitting would generally cause threads to start with nonzero elements that are not, in the sequence (4), first within their corresponding blocks. Algorithm 5 therefore splits the load among threads such that:
1) the amortized number of nonzero elements processed by each thread is nnz / T, where T denotes the number of threads;
2) all nonzero elements of each particular block are processed by a single thread only.
Such a splitting is calculated at lines 2–18 of Algorithm 5 and stored in the auxiliary array tb[]. Each thread can then process its exclusive portion of nonzero elements independently of the other threads (lines 20–33). Threads are required to be indexed from 1 to T; if not, some mapping from thread IDs to such an indexing must be provided. An OpenMP-style sketch of this splitting scheme is shown below.
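The following OpenMP sketch illustrates the splitting scheme of lines 2–18 of Algorithm 5. It is a simplified illustration, not the authors' code: indexes are 0-based (tb[0] = 0, tb[T] = nnz, and thread t processes the range [tb[t], tb[t+1])), and per-element block membership is assumed to be available as a precomputed array of sorted block keys rather than being recomputed from (il, jl).

    #include <cstddef>
    #include <cstdio>
    #include <vector>
    #include <omp.h>

    // Computes thread boundaries over nnz sorted nonzero elements so that no block
    // is split between two threads; key[l] is the block key of the l-th element.
    std::vector<std::size_t> thread_boundaries(const std::vector<std::size_t>& key,
                                               int T) {
        const std::size_t nnz = key.size();
        std::vector<std::size_t> tb(static_cast<std::size_t>(T) + 1);
        tb[0] = 0;                     // corresponds to tb[1] <- 1 in Algorithm 5
        tb[T] = nnz;                   // corresponds to tb[T+1] <- nnz + 1
        #pragma omp parallel num_threads(T)
        {
            const int t = omp_get_thread_num();
            if (t > 0) {
                // Nominal start index, then advance to the first element of the
                // next block (lines 7-17 of Algorithm 5).
                std::size_t l = nnz * static_cast<std::size_t>(t)
                              / static_cast<std::size_t>(T);
                while (l > 0 && l < nnz && key[l] == key[l - 1]) ++l;
                tb[t] = l;
            }
            // The implicit barrier at the end of this parallel region plays the
            // role of line 19; afterwards thread t would process [tb[t], tb[t+1]).
        }
        return tb;
    }

    int main() {
        // Sorted block keys of 10 nonzero elements forming blocks of sizes 4, 3, 3.
        std::vector<std::size_t> key = {0, 0, 0, 0, 2, 2, 2, 5, 5, 5};
        auto tb = thread_boundaries(key, 3);
        for (int t = 0; t <= 3; ++t)
            std::printf("tb[%d] = %zu\n", t, tb[t]);
    }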
Fig. 1: Strong scalability of Algorithm 5 with the do-nothing processor, measured for different architectures, different matrices, and different block sizes. Panels: (a) Salomon node, s = 4; (b) Salomon node, s = 512; (c) Intel Xeon Phi, s = 4; (d) Intel Xeon Phi, s = 512. Each panel plots runtime [s] against the number of threads for the matrices nlpkkt240, HV15R, kron_g500-logn21, and arabic-2005.

V. EXPERIMENTS

We have conducted an extensive experimental study to evaluate Algorithm 5. Within this study, we worked with matrices from the University of Florida Sparse Matrix Collection (UFSMC) [7]. The matrices that we used are listed in the Appendix; their characteristics can be found on the UFSMC web pages (http://www.cise.ufl.edu/research/sparse/matrices/). We tried to choose matrices emerging in a wide range of scientific and engineering disciplines and thus having different properties, such as:
• different types of elements: real, complex, integer, binary;
• different sizes and shapes: square, rectangular;
• different kinds of symmetries: unsymmetric, symmetric, Hermitian;
• different numbers of nonzero elements: from 1.1 · 10^7 for the kim2 matrix to 6.4 · 10^8 for the arabic-2005 matrix;
• different densities, i.e., relative counts nnz / (m · n) of nonzero elements: from 5.12 · 10^−7 for the nlpkkt240 matrix to 1.11 · 10^−2 for the TSOPF_RS_b2383 matrix;
• different patterns of nonzero elements.

The matrices were read on input from files downloaded from the UFSMC. All these files stored the nonzero elements of the matrices in the reverse lexicographical order (RLO), and in the same order we stored the elements in memory in the COO format as the first step of our benchmark program.

The measurements were performed on the following two shared-memory HPC architectures:
1) nodes of the Salomon supercomputer operated by the IT4Innovations National Supercomputing Center in Ostrava, Czech Republic, each having two 12-core Intel Xeon E5-2680v3 CPUs and 128 GB of RAM;
2) an Intel Xeon Phi coprocessor of type 7120P with 16 GB of RAM.
The benchmark codes were written in C++; we used the GNU g++ compiler version 5.1.0 on Salomon and the Intel icpc compiler version 16.0.1 for the Intel Xeon Phi builds. Parallelization was implemented with OpenMP. As for sorting (line 1 of Algorithm 5), we used AQsort (https://github.com/DanielLangr/AQsort), an OpenMP-based multi-threaded variant of in-place quicksort that can work with multiple arrays, such as the arrays of the COO storage format in our case.

Fig. 2: Runtime of Algorithm 5 (Overall) and its two phases (Sorting and Iteration) with the do-nothing processor, measured for different architectures, different matrices, different block sizes, and the optimal number of threads. Panels: (a) Salomon node, nlpkkt240, T = 24; (b) Intel Xeon Phi, arabic-2005, T = 122. Each panel plots runtime [s] against block sizes 2, 4, 8, . . . , 1024.

Fig. 3: Aggregated runtime of Algorithm 5 with the do-nothing processor run 10 times in a row for block sizes s = 2, 4, 8, . . . , 1024, measured for 26 matrices from the UFSMC, different architectures, and the optimal number of threads. Panels: (a) Salomon node, T = 24; (b) Intel Xeon Phi, T = 122. Each panel plots runtime [s] against the number of matrix nonzero elements, distinguishing matrices with real, complex, integer, and binary elements.

A. Processors

Lines 27 and 33 of Algorithm 5 contain the processing of found nonzero blocks. Within this study, we invoked two different block processors at these points. The first one did nothing useful at all, which allowed us to evaluate the algorithm itself without any application-dependent computations; we call this processor the do-nothing processor. (A processor that would do nothing at all might be optimized away by the compiler. We therefore designed the do-nothing processor such that it summed the numbers of nonzero elements of blocks, which also allowed us to verify that the algorithm correctly iterated over all nonzero blocks of A.) The second processor was designed for the problem of finding an optimal block size when storing A in the ABHSF; we call it the ABHSF-opt processor. This processor calculated and summed the contributions of blocks to the overall memory footprint of A. A sketch of the do-nothing processor is given below.
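As an illustration of the block-processor interface, the sketch below shows a do-nothing processor in the spirit described above: it merely accumulates the numbers of nonzero elements of the processed blocks, which keeps the compiler from removing the iteration and allows checking that all nnz elements were visited. The callback signature follows the hypothetical iterate_nonzero_blocks() sketch given earlier, not the authors' actual code; an ABHSF-opt processor would have the same shape but accumulate c(B, W) instead.

    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    // Accumulates the number of visited nonzero elements; one instance per thread.
    struct do_nothing_processor {
        std::uint64_t visited = 0;
        void operator()(const std::size_t* first, const std::size_t* last,
                        std::uint64_t /*block_key*/) {
            visited += static_cast<std::uint64_t>(last - first);
        }
    };

    int main() {
        do_nothing_processor p;
        std::size_t perm[4] = {0, 1, 2, 3};   // stand-in for one block with 4 elements
        p(perm, perm + 4, 0);
        assert(p.visited == 4);               // after a full run this would equal nnz
    }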
B. Scalability

First, we measured the strong scalability of Algorithm 5; the results for 4 different matrices and 2 different block sizes are shown in Fig. 1. In all cases, parallelization led to a considerable reduction of the runtime required for the iteration over the nonzero blocks of A. This runtime was dominated by the sorting phase of the algorithm (see Section V-C for details); consequently, the overall scalability of Algorithm 5 was determined by the scalability of AQsort within our study. The maximum number of threads, i.e., 24 for Salomon nodes and 122 for the Intel Xeon Phi, was chosen experimentally; beyond these points, the runtime of AQsort started to grow significantly.

C. Algorithm Phases

The second experiment evaluated the contributions of the sorting and iteration phases of Algorithm 5 to its overall runtime; the results are presented in Fig. 2. The set of tested block sizes was selected according to (1) while setting kmax = 10. The sorting phase of Algorithm 5 clearly dominates the overall algorithm runtime. The runtime of the iteration phase (with the do-nothing processor in this case) is practically negligible. We can also notice that larger block sizes yielded slightly faster sorting due to the lower number of distinct sorting keys (fewer nonzero elements need to be swapped in memory).

Fig. 4: Runtime of the sorting phase of Algorithm 5 for different block sizes when nonzero elements were sorted from the RLO (Sorting from RLO) and from the ordering given by half a block size (Sorting from s/2). Panels: (a) Salomon node, nlpkkt240, T = 24; (b) Intel Xeon Phi, nlpkkt240, T = 122. Each panel plots runtime [s] against block sizes 4, 8, 16, . . . , 1024.

Fig. 5: Runtime of Algorithm 5 (Overall) and its two phases (Sorting and Iteration) with the do-nothing processor, measured for different architectures, different matrices, different block sizes, and the optimal number of threads. Panels: (a) Salomon node, nlpkkt240, T = 24; (b) Intel Xeon Phi, arabic-2005, T = 122. Each panel plots runtime [s] against the block sizes 2, 3, 4, 5, 12, 16, 100, 128, 500, and 512.

D. Multiple Block Sizes

Within the problem of finding an optimal block size, we need to iterate over the nonzero blocks of matrices multiple times while testing different block sizes. In the next experiment, we therefore ran Algorithm 5 ten times in a row, testing the block sizes s = 2, 4, 8, . . . , 1024 as proposed in Section II. We used 26 matrices from the UFSMC (listed in the Appendix) and, for each one, measured the aggregated runtime of all 10 algorithm runs. The results are presented in Fig. 3, which shows runtimes as a function of the number of matrix nonzero elements. We can observe that the relation between runtime and nnz was roughly linear. Although, for a constant T, the time complexity of Algorithm 5 is O(nnz · log2 nnz), modern implementations of parallel quicksort yield in practice such a linear growth of runtime on modern multi-core and many-core architectures (details are beyond the scope of this text). A sketch of the corresponding per-phase timing loop is shown below.
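A minimal sketch of how the per-phase runtimes reported in Sections V-C and V-D could be collected is given below; it is an assumption-laden illustration, not the paper's benchmark harness. The two phase functions are empty stubs standing in for AQsort and for lines 2–34 of Algorithm 5, and timing is done with omp_get_wtime().

    #include <cstdio>
    #include <omp.h>

    // Stubs standing in for the two phases: in the experiments the sorting phase is
    // AQsort over the COO arrays and the iteration phase is lines 2-34 of Algorithm 5.
    void sort_by_blocks(unsigned /*k*/) {}
    void iterate_blocks(unsigned /*k*/) {}

    int main() {
        double total_sort = 0.0, total_iter = 0.0;
        for (unsigned k = 1; k <= 10; ++k) {      // block sizes s = 2^k per Eq. (1)
            const double t0 = omp_get_wtime();
            sort_by_blocks(k);
            const double t1 = omp_get_wtime();
            iterate_blocks(k);
            const double t2 = omp_get_wtime();
            total_sort += t1 - t0;
            total_iter += t2 - t1;
            std::printf("s = %4u: sorting %.3f s, iteration %.3f s\n",
                        1u << k, t1 - t0, t2 - t1);
        }
        std::printf("aggregated: sorting %.3f s, iteration %.3f s\n",
                    total_sort, total_iter);
    }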
E. Initial Ordering Effects

Fig. 2 shows the runtimes for sorting the nonzero elements of matrices from the RLO to the block-aware ordering. However, when we are looking for an optimal block size from S for a given matrix, we need to sort its nonzero elements from the input ordering (the RLO in our case) only once, for the first tested block size s1. Then, for all other tested block sizes sk, k > 1, the sorting algorithm takes as its input nonzero elements already sorted with respect to the block size sk−1. Within our study, we considered block sizes sk = 2^k, which implies sk = 2 · sk−1 for k > 1. Moreover, we defined the sorting keys according to the lexicographic ordering of blocks. Consequently, for k > 1, the nonzero elements on the input of Algorithm 5 were partially sorted, which should result in shorter sorting times. We performed an experiment to verify this assumption; the results are shown in Fig. 4. They clearly indicate that AQsort was able to take advantage of such partially sorted data; the amount of time saved was significant, especially on the Salomon CPU-based nodes.

F. Block Size Effects

Recall that in the previous text we made the assumption that setting block sizes sk = 2^k should provide faster runs of Algorithm 5 due to the possibility of calculating the block indexes I and J with cheap logical and bitwise operations. However, if we need to test block sizes other than powers of 2, we cannot avoid integer division (6). To evaluate the difference between both cases, we measured the runtimes of Algorithm 5 and both its phases (sorting and iteration) for several block sizes of both classes, s = 2^k and s ≠ 2^k; the results are presented in Fig. 5. The conclusion is obvious: the way of deriving block indexes for the nonzero elements had a tremendous impact on the algorithm. Its runtime grew by almost a factor of 4 and 7 on Salomon nodes and the Intel Xeon Phi, respectively, when using integer division instead of bitwise/logical operations.

Fig. 6: Aggregated runtime of the sorting and iteration phases of Algorithm 5 run 10 times in a row for block sizes s = 2, 4, 8, . . . , 1024, measured for different matrices on a Salomon node using 24 threads. The iteration phases were measured for both the do-nothing and ABHSF-opt processors.

Fig. 7: Matrix structure memory footprints for different matrices stored in the ABHSF and different block sizes. Left panel: arabic-2005, HV15R, nlpkkt240, uk-2002. Right panel: rgg_n_2_23_s0, nlpkkt160, wikipedia-20061104, hollywood-2009.

G. ABHSF

Up to now, we have presented measurements that used the do-nothing processor, designed to evaluate Algorithm 5 itself. However, in practice, we iterate over the nonzero blocks of sparse matrices to do something useful, and the question is how the algorithm runtime will change in such cases.
To answer this question, we substituted the do-nothing processor with the ABHSF-opt processor introduced in Section V-A and used Algorithm 5 for finding optimal block sizes for all tested matrices. The results of this experiment are presented in Fig. 6. They show the aggregated runtimes of all 10 sorting phases as well as of all 10 iteration phases, and, for comparison, we show the results for both types of block processors. We can observe that with the ABHSF-opt processor, the iteration phase took considerably longer in comparison with the do-nothing processor. However, the overall runtime of the whole algorithm was still dominated by its sorting phase.

In regard to the memory footprints of sparse matrices, we can usually focus only on matrix structure memory footprints, i.e., the memory footprints of the information describing the structure of nonzero elements (compression of the values of nonzero elements pays off only for special kinds of matrices where the same values emerge many times). For illustration, we show in Fig. 7 the relation between block sizes and the matrix structure memory footprints of selected matrices stored in the ABHSF. For most of the tested matrices, we found only a single minimum, which typically corresponded to block sizes of 8 or 16 (left side of Fig. 7 and the nlpkkt160 matrix). However, we also observed a few "pathological" cases with different behavior (right side of Fig. 7). For example, the minimum for the wikipedia-20061104 matrix was not even found within the whole tested range s = 2, 4, 8, . . . , 1024; such a result might indicate that the ABHSF is not a suitable format for this matrix.

We also present in Table I a comparison of the matrix structure memory footprints for selected matrices and three storage formats: COO, the compressed sparse row (CSR) format, and the ABHSF. CSR is likely the most commonly used format for sparse matrices, together with its compressed sparse column (CSC) counterpart (they are also often abbreviated as CRS and CCS). The measurements revealed that storing sparse matrices in the ABHSF can result in substantial memory savings; a back-of-the-envelope check of the COO and CSR figures is sketched after Table I.

TABLE I: Matrix structure memory footprints in MB for selected matrices stored in the COO and CSR storage formats with 32-bit indexes and in the ABHSF with optimal block sizes.

  Matrix                   COO      CSR    ABHSF
  arabic-2005           4882.8   2528.2    400.1
  hollywood-2009         438.8    223.8     81.5
  HV15R                 2159.7   1087.5    135.4
  nlpkkt160              907.4    485.5    116.9
  nlpkkt240             3061.2   1637.4    410.3
  rgg_n_2_23_s0          484.5    274.2    115.8
  uk-2002               2274.4   1207.9    215.7
  wikipedia-20061104     300.5    162.2    130.1
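The COO and CSR columns of Table I can be reproduced with a back-of-the-envelope calculation, as the following C++ sketch does for arabic-2005. It assumes 32-bit indexes (as stated in the table caption), roughly 6.4 · 10^8 nonzero elements (Section V), and a row count of about 2.27 · 10^7, which is an assumed value taken from the UFSMC rather than from this text.

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Approximate parameters of arabic-2005: about 6.4e8 nonzeros (Section V);
        // the row count below is an assumed value for illustration only.
        const std::uint64_t nnz = 640000000ULL;
        const std::uint64_t m   = 22700000ULL;
        const double MB = 1024.0 * 1024.0;

        const double coo = 2.0 * 4.0 * nnz / MB;              // two 32-bit index arrays
        const double csr = (4.0 * nnz + 4.0 * (m + 1)) / MB;  // column indexes + row pointers
        std::printf("COO structure: %.0f MB, CSR structure: %.0f MB\n", coo, csr);
        // Prints roughly 4883 MB and 2528 MB, close to the arabic-2005 row of Table I;
        // the ABHSF column (400.1 MB) reflects the block-wise space-optimal storage.
    }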
VI. RELATED WORK

We have previously proposed an algorithm for the purpose of storing matrices in a file system in the ABHSF [2, Algorithm 1]. This algorithm served as a starting point for the development of Algorithm 4, which was generalized for any UB format. For examples of designed UB formats, see, e.g., [1], [5], [8]–[21].

VII. CONCLUSIONS

The contribution of this paper is an efficient, scalable parallel algorithm for fast iteration over the nonzero blocks of sparse matrices. This algorithm is a building block of the process of transforming sparse matrices into UB storage formats. We have presented an extensive experimental study of the proposed algorithm using matrices from the UFSMC that came from different scientific and engineering disciplines and thus featured different characteristics. Measurements conducted on modern multi-core and many-core HPC architectures revealed that, if the set of tested block sizes is chosen properly, the process of finding an optimal block size takes up to tens of seconds even for very large matrices.

The remaining question is whether or not it pays off to transform matrices into UB formats. The answer to this question is highly application-dependent. For instance, if a matrix is used within an iterative linear solver or an eigensolver, we would first need to know how many SpMV operations are applied to the given matrix and how much time this operation takes. In our future work, we want to focus on the ABHSF and undertake research that should tell how many SpMV-based iterations need to be performed with a given matrix to reduce the overall application runtime when considering matrix storage in this format.

APPENDIX

The list of sparse matrices from the UFSMC used in the experiments: 3Dspectralwave, af_shell10, arabic-2005, cage15, fem_hifreq_circuit, Flan_1565, Freescale1, FullChip, GL7d19, hollywood-2009, HV15R, indochina-2004, kim2, kron_g500-logn21, ldoor, nlpkkt160, nlpkkt200, nlpkkt260, relat9, rgg_n_2_23_s0, RM07R, spal_004, TSOPF_RS_b2383, uk-2002, wb-edu, wikipedia-20061104.

ACKNOWLEDGMENTS

The authors acknowledge support from P. Tvrdík from the Czech Technical University in Prague, P. Vrchota and J. Fiala from Výzkumný a zkušební letecký ústav, a.s., and M. Pajr from CQK Holding and IHPCI. The authors would like to thank M. Václavík of the Czech Technical University in Prague for providing access to an Intel Xeon Phi accelerator installed at the Star university cluster.

REFERENCES

[1] D. Langr, I. Šimeček, P. Tvrdík, T. Dytrych, and J. P. Draayer, "Adaptive-blocking hierarchical storage format for sparse matrices," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS 2012). IEEE Xplore Digital Library, 2012, pp. 545–551.
[2] D. Langr, I. Šimeček, and P. Tvrdík, "Storing sparse matrices in the adaptive-blocking hierarchical storage format," in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS 2013). IEEE Xplore Digital Library, 2013, pp. 479–486.
[3] D. Langr and P. Tvrdík, "Evaluation criteria for sparse matrix storage formats," IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 2, pp. 428–440, 2016. doi: 10.1109/TPDS.2015.2401575
[4] A. Fog, "Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs," 2016, accessed April 8, 2016 at http://www.agner.org/optimize/instruction_tables.pdf.
[5] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd ed. Philadelphia, PA: SIAM, 1994.
[6] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2003. ISBN 0898715342
[7] T. A. Davis and Y. F. Hu, "The University of Florida Sparse Matrix Collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1:1–1:25, 2011. doi: 10.1145/2049662.2049663
[8] M. Belgin, G. Back, and C. J. Ribbens, "Pattern-based sparse matrix representation for memory-efficient SMVM kernels," in Proceedings of the 23rd International Conference on Supercomputing, ser. ICS '09. New York, NY, USA: ACM, 2009. doi: 10.1145/1542275.1542294. ISBN 978-1-60558-498-0 pp. 100–109.
[9] M. Belgin, G. Back, and C. J. Ribbens, "A library for pattern-based sparse matrix vector multiply," International Journal of Parallel Programming, vol. 39, no. 1, pp. 62–87, 2011. doi: 10.1007/s10766-010-0145-2
[10] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks," in Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures, ser. SPAA '09. New York, NY, USA: ACM, 2009. doi: 10.1145/1583991.1584053. ISBN 978-1-60558-606-9 pp. 233–244.
[11] A. Buluç, S. Williams, L. Oliker, and J. Demmel, "Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication," in Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, ser. IPDPS '11. IEEE Computer Society, 2011. doi: 10.1109/IPDPS.2011.73 pp. 721–733.
[12] J. W. Choi, A. Singh, and R. W. Vuduc, "Model-driven autotuning of sparse matrix-vector multiply on GPUs," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP '10. New York, NY, USA: ACM, 2010. doi: 10.1145/1693453.1693471 pp. 115–126.
[13] E.-J. Im and K. Yelick, "Optimizing sparse matrix computations for register reuse in SPARSITY," in Proceedings of the International Conference on Computational Science (ICCS 2001), Part I, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2001, vol. 2073, pp. 127–136.
[14] E.-J. Im, K. Yelick, and R. Vuduc, "Sparsity: Optimization framework for sparse matrix kernels," International Journal of High Performance Computing Applications, vol. 18, no. 1, pp. 135–158, 2004. doi: 10.1177/1094342004041296
[15] V. Karakasis, G. Goumas, and N. Koziris, "A comparative study of blocking storage methods for sparse matrices on multicore architectures," in Computational Science and Engineering, 2009. CSE '09. International Conference on, vol. 1, Aug 2009. doi: 10.1109/CSE.2009.223 pp. 247–256.
[16] R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick, "Performance modeling and analysis of cache blocking in sparse matrix vector multiply," University of California, Tech. Rep. UCB/CSD-04-1335, 2004.
[17] R. Nishtala, R. W. Vuduc, J. W. Demmel, and K. A. Yelick, "When cache blocking of sparse matrix vector multiply works and why," Applicable Algebra in Engineering, Communication and Computing, vol. 18, no. 3, pp. 297–311, 2007. doi: 10.1007/s00200-007-0038-9
[18] I. Šimeček, D. Langr, and P. Tvrdík, "Space-efficient sparse matrix storage formats for massively parallel systems," in Proceedings of the 14th IEEE International Conference of High Performance Computing and Communications (HPCC 2012). IEEE Computer Society, 2012. doi: 10.1109/HPCC.2012.18 pp. 54–60.
[19] I. Šimeček and D. Langr, "Space and execution efficient formats for modern processor architectures," in Proceedings of the 17th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC 2015). IEEE Computer Society, 2015. doi: 10.1109/SYNASC.2015.24 pp. 98–105.
[20] F. S. Smailbegovic, G. N. Gaydadjiev, and S. Vassiliadis, "Sparse Matrix Storage Format," in Proceedings of the 16th Annual Workshop on Circuits, Systems and Signal Processing, ProRisc 2005, 2005, pp. 445–448.
[21] P. Stathis, S. Vassiliadis, and S. Cotofana, "A hierarchical sparse matrix storage format for vector processors," in Proceedings of the 17th International Symposium on Parallel and Distributed Processing, ser. IPDPS '03. Washington, DC, USA: IEEE Computer Society, 2003, p. 61.