Introduction to Parallel Computing: Design and Analysis of Algorithms. Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. 1994, 498 pages. ISBN 0-8053-3170-0.
IEEE Transactions on Parallel and Distributed Systems, 1993
In this paper, we present the scalability analysis of a parallel Fast Fourier Transform (FFT) algorithm on mesh- and hypercube-connected multicomputers using the isoefficiency metric. The isoefficiency function of an algorithm-architecture combination is defined as the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency. On the hypercube architecture, a commonly used parallel FFT algorithm can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in problem size. But there is a limit on the achievable efficiency, and this limit is determined by the ratio of the CPU speed to the communication bandwidth of the hypercube channels. Efficiencies higher than this threshold value can be obtained if the problem size is increased very rapidly. If the hardware supports cut-through routing, then this threshold can also be overcome by using an alternative, less scalable parallel formulation. The scalability analysis for mesh-connected multicomputers reveals that the FFT cannot make efficient use of large-scale mesh architectures unless the bandwidth of the communication channels is increased as a function of the number of processors. We also show that it is more cost-effective to implement the FFT algorithm on a hypercube rather than a mesh, despite the fact that large-scale meshes are cheaper to construct than large hypercubes. Although the scope of this paper is limited to the Cooley-Tukey FFT algorithm on a few classes of architectures, the methodology can be used to study the performance of various FFT algorithms on a variety of architectures, such as SIMD hypercube and mesh architectures and shared-memory architectures.
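A minimal sketch of the kind of cost model this analysis uses, assuming a binary-exchange FFT on a hypercube; the cost expression and the constants t_c (per-butterfly compute), t_s (message startup), and t_w (per-word transfer) are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: plug-in efficiency model for an n-point FFT on a
# p-processor hypercube; constants are made up for illustration.
import math

def fft_efficiency(n, p, t_c=1.0, t_s=25.0, t_w=4.0):
    """Efficiency under an assumed compute + communication cost model."""
    work = n * math.log2(n)                      # serial work W = n log n
    t_par = (t_c * (n / p) * math.log2(n)        # local butterfly computation
             + t_s * math.log2(p)                # message startups
             + t_w * (n / p) * math.log2(p))     # words exchanged across stages
    return (t_c * work) / (p * t_par)

# Holding efficiency roughly fixed while p grows shows how fast n must
# grow -- the isoefficiency behaviour the abstract describes.
for p in (16, 64, 256, 1024):
    n = 4096 * p                                 # problem size grown with p
    print(p, round(fft_efficiency(n, p), 3))
```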
Doctoral thesis by Anshul Gupta, University of Minnesota (excerpt from the thesis certification page).
Introduction to Parallel Computing. Lecture slides by P. B. Sunil Kumar, Department of Physics, IIT Madras, Chennai 600036 (www.physics.iitm.ac.in/~sunil), 19 November 2010.
IEEE Transactions on Parallel and Distributed Systems, 1997
In this paper, we describe scalable parallel algorithms for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in the parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
This paper describes an optimally scalable parallel algorithm for factorization of sparse matrices. This algorithm incurs strictly less communication overhead than any known parallel formulation of sparse matrix factorization, and hence, can utilize a higher number of ...
Journal of Parallel and Distributed Computing, 1994
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and to motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms that have been developed for studying scalability issues, and discuss their interrelationships. For example, we derive an important relationship between time-constrained scaling and the isoefficiency function. We point out some of the weaknesses of the existing schemes for measuring scalability, and discuss possible ways of extending them.
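For orientation, the basic quantities that such scalability analyses start from can be computed directly from measured runtimes; the timings in this sketch are hypothetical placeholders, not data from the paper.

```python
# Hedged sketch: speedup, efficiency, and total overhead from measured times.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    return speedup(t_serial, t_parallel) / p

def total_overhead(t_serial, t_parallel, p):
    # T_o = p * T_P - T_S : all processor-time not spent on useful work
    return p * t_parallel - t_serial

t_s = 120.0                              # hypothetical serial time (seconds)
runs = {8: 16.4, 32: 4.9, 128: 1.6}      # hypothetical parallel times
for p, t_p in runs.items():
    print(p, round(efficiency(t_s, t_p, p), 2),
          round(total_overhead(t_s, t_p, p), 1))
```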
*This work was supported by Army Research Office grant #28408-MA-SDI to the University of Minnesota and by the Army High Performance Computing Research Center at the University of Minnesota. An extended version of this paper is available from the authors upon request.
A number of parallel formulations of the dense matrix multiplication algorithm have been developed. For an arbitrarily large number of processors, any of these algorithms or their variants can provide near-linear speedup for sufficiently large matrix sizes, and none of the algorithms can be clearly claimed to be superior to the others. In this paper, we analyze the performance and scalability of a number of parallel formulations of the matrix multiplication algorithm and predict the conditions under which each formulation is better than the others.
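A small illustration of why no single formulation dominates: under textbook-style communication models, the cheaper formulation depends on n, p, and the machine constants. The two cost expressions below (a 1-D block formulation that gathers a full matrix, and a Cannon-style 2-D block formulation) and the constants t_s, t_w are assumptions for illustration, not the paper's models.

```python
# Hedged sketch: assumed communication-cost models for n x n matrix
# multiplication on p processors under two decompositions.
import math

def one_d_block(n, p, t_s=50.0, t_w=2.0):
    # 1-D row-block decomposition: all-gather of roughly n*n words
    return t_s * math.log2(p) + t_w * n * n

def two_d_block(n, p, t_s=50.0, t_w=2.0):
    # Cannon-style 2-D decomposition: sqrt(p) shifts of n^2/p-sized blocks
    q = math.isqrt(p)
    return 2 * (t_s + t_w * n * n / p) * q

for n, p in [(512, 16), (512, 256), (4096, 256)]:
    print(n, p, round(one_d_block(n, p)), round(two_d_block(n, p)))
```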
Optimally Scalable Parallel Sparse Cholesky Factorization. Anshul Gupta, Vipin Kumar. Abstract: In this paper, we describe a parallel algorithm for sparse Cholesky factorization that incurs strictly less communication overhead than any known parallel formulation of sparse matrix ...
During the past few years, algorithmic improvements alone have shaved almost an order of magnitude off the time required for the direct solution of general sparse systems of linear equations. Combined with a similar increase in the performance-to-cost ratio due to hardware advances during this period, current sparse solver technology makes it possible to quickly and easily solve problems that might have been considered impractically large until recently. In this paper, we compare the performance of some commonly used software packages for solving general sparse systems. In particular, we demonstrate the consistently high level of performance achieved by WSMP, the most recent of such solvers. We compare the various algorithmic components of these solvers and show that the choices made in WSMP enable it to run two to three times faster than the best among other similar solvers. As a result, WSMP can factor some of the largest sparse matrices available from real applications in a few seconds on a 4-CPU workstation.
The scalability of the parallel fast Fourier transform (FFT) algorithm on mesh- and hypercube-connected multicomputers is analyzed. The hypercube architecture provides linearly increasing performance for the FFT algorithm with an increasing number of processors and a moderately increasing problem size. However, there is a limit on the efficiency, which is determined by the communication bandwidth of the hypercube channels. Efficiencies higher than this limit can be obtained only if the problem size is increased very rapidly. Technology-dependent features, such as the communication bandwidth, determine the upper bound on the overall performance that can be obtained from a P-processor system. The upper bound can be moved up either by improving the communication-related parameters linearly or by increasing the problem size exponentially. The scalability analysis shows that the FFT algorithm cannot make efficient use of large-scale mesh architectures. The addition of features such as cut-through routing and multicasting does not improve the overall scalability on this architecture.
IEEE Transactions on Parallel and Distributed Systems, 1993
This paper provides a tutorial introduction to a performance evaluation metric called the isoefficiency function. Traditional methods for evaluating serial algorithms are inadequate for analyzing the performance of parallel algorithm-architecture combinations. The isoefficiency function has proven useful for evaluating the performance of a wide variety of such combinations.
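A worked restatement of the metric, consistent with the definition given earlier in this listing ("the rate at which the problem size should grow with the number of processors to maintain a fixed efficiency"); the notation (W for problem size in units of serial work, T_P for parallel time, T_o for total overhead, E for efficiency) is the standard one and is assumed here rather than quoted from the paper.

```latex
% Sketch of the isoefficiency relation under the standard notation.
\begin{align*}
  T_o(W,p) &= p\,T_P - W, \\
  E &= \frac{W}{p\,T_P} \;=\; \frac{1}{1 + T_o(W,p)/W}, \\
  \text{fixed } E \;\Longrightarrow\; W &= \frac{E}{1-E}\,T_o(W,p) \;=\; K\,T_o(W,p).
\end{align*}
% Solving the last relation for W as a function of p yields the isoefficiency
% function: how fast the problem size must grow with p to keep E constant.
```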
IEEE Transactions on Parallel and Distributed Systems, 1995
In this paper, we describe a scalable parallel algorithm for sparse matrix factorization, analyze its performance and scalability, and present experimental results for up to 1024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithm substantially improves the state of the art in the parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithm to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that is asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithm incurs less communication overhead and is more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithm can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of our sparse Cholesky factorization algorithm delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.
Journal of Parallel and Distributed Computing, 1993
There are several metrics that characterize the performance of a parallel system, such as parallel execution time, speedup, and efficiency. A number of properties of these metrics have been studied. For example, it is a well-known fact that, given a parallel architecture and a problem of a fixed size, the speedup of a parallel algorithm does not continue to increase with an increasing number of processors. It usually tends to saturate or peak at a certain limit. Thus it may not be useful to employ more than an optimal number of processors for solving a problem on a parallel computer. This optimal number of processors depends on the problem size, the parallel algorithm, and the parallel architecture. In this paper, we study the impact of parallel processing overheads and the degree of concurrency of a parallel algorithm on the optimal number of processors to be used when the criterion for optimality is minimizing the parallel execution time. We then study a more general criterion of optimality and show how operating at the optimal point is equivalent to operating at a unique value of efficiency which is characteristic of the criterion of optimality and the properties of the parallel system under study. We put the technical results derived in this paper in perspective with similar results that have appeared in the literature before and show how this paper generalizes and/or extends these earlier results.
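A minimal numerical sketch of the "optimal number of processors" idea under the minimum-execution-time criterion; the overhead model and constants below are illustrative assumptions, not the paper's, and simply show that a modeled run time can stop improving beyond some processor count.

```python
# Hedged sketch: pick the p that minimizes a modeled parallel run time
# T_P(W, p) = (W + overhead(W, p)) / p under an assumed overhead model.
import math

def parallel_time(W, p, t_s=100.0, t_w=5.0):
    overhead = p * (t_s * math.log2(p) + t_w * math.sqrt(W / p))
    return (W + overhead) / p

def optimal_p(W, p_max=4096):
    # search powers of two up to p_max (a stand-in for the degree of concurrency)
    candidates = [2 ** k for k in range(1, int(math.log2(p_max)) + 1)]
    return min(candidates, key=lambda p: parallel_time(W, p))

for W in (1e5, 1e7, 1e9):
    p_opt = optimal_p(W)
    print(int(W), p_opt, round(parallel_time(W, p_opt), 1))
```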
IEEE Transactions on Parallel and Distributed Systems, 1995
This paper analyzes the performance and scalability of an iteration of the Preconditioned Conjugate Gradient (PCG) algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to a further improvement in scalability on the CM-5 by a constant factor. As a result, a parallel formulation of the PCG algorithm with the IC preconditioner may execute faster than that with a simple diagonal preconditioner, even if the latter runs faster in a serial implementation. For the matrices resulting from three-dimensional finite difference grids, the scalability is quite good on a hypercube or the CM-5, but not as good on a 2-D mesh architecture. In the case of unstructured sparse matrices with a constant number of non-zero elements in each row, the parallel formulation of the PCG iteration is unscalable on any message-passing parallel architecture unless some ordering is applied to the sparse matrix. The parallel system can be made scalable either if, after re-ordering, the non-zero elements of the N × N matrix can be confined in a band whose width is O(N^y) for any y < 1, or if the number of non-zero elements per row increases as N^x for any x > 0. Scalability increases as the number of non-zero elements per row is increased and/or the width of the band containing these elements is reduced. For unstructured sparse matrices, the scalability is asymptotically the same for all architectures. Many of these analytical results are experimentally verified on the CM-5 parallel computer.
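To make the communication structure concrete, here is a serial sketch of one PCG iteration (written with NumPy) with comments marking the two inner products that become global reductions in a message-passing implementation; the code and the tiny example system are illustrative and are not taken from the paper.

```python
# Hedged sketch: one PCG step; the two inner products are the operations
# that turn into all-reduce communication on a distributed-memory machine.
import numpy as np

def pcg_iteration(A, M_inv, x, r, z, p_vec, rz_old):
    """One step of preconditioned conjugate gradient for SPD A."""
    q = A @ p_vec                      # matvec: neighbour communication only
    alpha = rz_old / (p_vec @ q)       # inner product #1 -> all-reduce
    x = x + alpha * p_vec
    r = r - alpha * q
    z = M_inv(r)                       # preconditioner solve (diagonal or IC)
    rz_new = r @ z                     # inner product #2 -> all-reduce
    p_vec = z + (rz_new / rz_old) * p_vec
    return x, r, z, p_vec, rz_new

# Tiny usage example with a diagonal (Jacobi) preconditioner.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)
r = b - A @ x
M_inv = lambda v: v / np.diag(A)
z = M_inv(r)
p_vec = z.copy()
rz = r @ z
for _ in range(10):
    x, r, z, p_vec, rz = pcg_iteration(A, M_inv, x, r, z, p_vec, rz)
print(x)   # approaches the solution of A x = b
```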
Many problems in engineering and scientific domains require solving large sparse systems of linear equations, as a computationally intensive step towards the final solution. It has long been a challenge to develop efficient parallel formulations of sparse direct solvers due to ...