International Journal of Computational Science and Engineering, 2009
This article focuses on load balancing issues associated with hybrid programming models for the parallelization of tiled algorithms onto SMP clusters. Although tiled algorithms usually exhibit relatively regular computation and communication patterns, their hybrid parallelization often suffers from intrinsic load imbalance between threads. This imbalance arises mainly because most existing message passing libraries provide limited multi-threading support, allowing only the master thread to perform inter-node message passing communication. To mitigate this effect, we propose a generic method for applying load balancing to the coarse-grain hybrid model, so that the computational load is distributed appropriately among the working threads. We adopt both a static and a dynamic load balancing approach, and implement three alternative balancing variations. All implementations are experimentally evaluated against kernel benchmarks, in order to demonstrate the potential of such load balancing schemes for extracting maximum performance from hybrid parallel programs.
This paper presents a complete framework for the parallelization of nested loops by applying a tiling transformation and automatically generating MPI code that allows for an advanced scheduling scheme. In particular, by advanced scheduling scheme we mean two separate techniques: first, the application of a suitable tiling transformation, and second, the overlapping of computation and communication when executing the parallel program. As far as the choice of a scheduling-efficient tiling transformation is concerned, the data dependencies of the initial algorithm are taken into account and an appropriate transformation matrix is automatically generated according to a well-established theory. On the other hand, overlapping computation with communication partly hides the communication overhead and allows for more efficient processor utilization. We address all issues concerning automatic parallelization of the initial algorithm. More specifically, we have developed a tool that automaticall...
The parallelization of nested-loop algorithms onto popular multi-level parallel architectures, such as clusters of SMPs, is not a trivial issue, since the existence of data dependencies in the algorithm imposes severe restrictions on the task decomposition to be applied. In this paper we propose three techniques for the parallelization of such algorithms, namely pure MPI parallelization, fine-grain hybrid MPI/OpenMP parallelization and coarse-grain hybrid MPI/OpenMP parallelization. We further apply an advanced hyperplane scheduling scheme that enables pipelined execution and the overlapping of communication with useful computation, thus leading to almost full CPU utilization. We implement the three variations and perform a number of micro-kernel benchmarks to verify the intuition that the hybrid programming model could potentially exploit the characteristics of an SMP cluster more efficiently than the pure message-passing programming model. We conclude that the overall performance of each model is both application and hardware dependent, and propose some directions for improving the efficiency of the hybrid model.
Proceedings of the 2004 ACM symposium on Applied computing - SAC '04, 2004
This paper presents an overview of our work, concerning a complete end-to-end framework for automatically generating message passing parallel code for tiled nested for-loops. It considers general parallelepiped tiling transformations and general convex iteration spaces. We address all problems regarding both the generation of sequential tiled code and its parallelization. We have implemented our techniques in a tool which automatically generates MPI parallel code and conducted several series of experiments, concerning the compilation time of our tool, the efficiency of the generated code and the speedup attained on a cluster of PCs. Apart from confirming the value of our techniques, our experimental results show the merit of general parallelepiped tiling transformations and verify previous theoretical work on scheduling-optimal tile shapes.
18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., 2004
This paper compares the performance of three programming paradigms for the parallelization of nested loop algorithms onto SMP clusters. More specifically, we propose three alternative models for tiled nested loop algorithms, namely a pure message passing paradigm and two hybrid ones that implement communication through both message passing and shared memory access. The hybrid models adopt an advanced hyperplane scheduling scheme that allows both for minimal thread synchronization and for pipelined execution with overlapping of computation and communication phases. We focus on the experimental evaluation of all three models, and test their performance against several iteration spaces and parallelization grains with the aid of a typical micro-kernel benchmark. We conclude that the hybrid models can in some cases be more beneficial than the monolithic pure message passing model, as they better exploit the configuration characteristics of a hierarchical parallel platform, such as an SMP cluster.
Proceedings. IEEE International Conference on Cluster Computing, 2002
This paper presents a complete end-to-end framework for automatically generating message-passing code for tiled iteration spaces. It considers general parallelepiped tiling transformations and general convex iteration spaces. We aim to address all problems concerning data parallel code generation efficiently by transforming the initial non-rectangular tile to a rectangular one. In this way, data distribution and communication become simple and straightforward. We have implemented our parallelizing techniques in a tool which automatically generates MPI code and ran several experiments on a cluster of PCs. Our experimental results show the merit of general parallelepiped tiling transformations, and confirm previous theoretical work on scheduling-optimal tile shapes.
2005 International Conference on Parallel Processing Workshops (ICPPW'05), 2005
This paper focuses on load balancing issues associated with hybrid programming models for the parallelization of fully permutable nested loops onto SMP clusters. Hybrid parallel programming models usually suffer from intrinsic load imbalance between threads, mainly because most existing message passing libraries provide limited multi-threading support, allowing only the master thread to perform inter-node message passing communication. To mitigate this effect, we propose a generic method for applying static load balancing to the coarse-grain hybrid model, so that the computational load is distributed appropriately among the working threads. We experimentally evaluate the efficiency of the proposed scheme against a micro-kernel benchmark, and demonstrate the potential of such load balancing schemes for extracting maximum performance from hybrid parallel programs.
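A rough sketch of what static load balancing can look like when only the master thread communicates: give the master fewer rows of computation so that its compute time plus communication time matches the compute time of each worker. The cost model below (communication expressed as an equivalent number of rows, `comm_rows`) and the function name are assumptions made for illustration and are not taken from the paper.

```python
def static_balance(total_rows, n_threads, comm_rows):
    """Split total_rows among n_threads; thread 0 is the communicating master.

    Hypothetical cost model: the master's communication costs as much time
    as computing comm_rows rows. Balance requires
        master + comm_rows == worker
        master + (n_threads - 1) * worker == total_rows
    which gives worker = (total_rows + comm_rows) / n_threads.
    """
    if n_threads == 1:
        return [total_rows]
    ideal_master = (total_rows + comm_rows) / n_threads - comm_rows
    # Clamp to a valid row count; the master may end up with no computation.
    master = min(total_rows, max(0, int(ideal_master)))
    rest = total_rows - master
    base, extra = divmod(rest, n_threads - 1)
    # Spread the remaining rows as evenly as possible among the workers.
    workers = [base + (1 if k < extra else 0) for k in range(n_threads - 1)]
    return [master] + workers


if __name__ == "__main__":
    print(static_balance(100, 4, 20))
```

In practice the communication cost would have to be measured or estimated for the target platform; the sketch only shows how such an estimate translates into an uneven row distribution.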
Tiling is a well-known loop transformation used to reduce communication overhead in distributed memory machines. Although a lot of theoretical research has been done concerning the selection of proper tile shapes that reduce processor idle times, there is no complete approach to automatically parallelize non-rectangularly tiled iteration spaces, and consequently there are no actual experimental results to verify previous theoretical work on the effect of the tile shape on the overall completion time of a tiled algorithm. This paper presents a complete end-to-end framework for automatically generating message-passing code for tiled iteration spaces. It considers general parallelepiped tiling transformations and convex iteration spaces. We aim to address all problems concerning data parallel code generation efficiently by transforming the initial non-rectangular tile to a rectangular one. In this way, data distribution and the respective communication pattern become simple and straightforward. We have implemented our parallelizing techniques in a tool which automatically generates MPI code and ran several benchmarks on a cluster of PCs. Our experimental results show the merit of general parallelepiped tiling transformations, and verify previous theoretical work on scheduling-optimal, non-rectangular tile shapes.
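To make the notion of a parallelepiped tiling concrete, the sketch below uses the common formulation in which a tiling matrix H maps an iteration point j to its tile via floor(H j). The two matrices, the 4 x 4 tile size and the function name are hypothetical examples; the abstract does not specify the matrices used in the paper.

```python
from fractions import Fraction
from math import floor


def tile_of(point, H):
    """Return the tile coordinates floor(H * point) of an iteration point.

    H is a list of rows; exact Fractions avoid floating-point rounding
    at tile boundaries.
    """
    return tuple(floor(sum(h * x for h, x in zip(row, point))) for row in H)


# Rectangular tiling with 4 x 4 tiles (illustrative values).
H_RECT = [[Fraction(1, 4), Fraction(0)],
          [Fraction(0), Fraction(1, 4)]]

# Skewed (non-rectangular) tiling: the second tile boundary follows j2 - j1.
H_SKEW = [[Fraction(1, 4), Fraction(0)],
          [Fraction(-1, 4), Fraction(1, 4)]]

if __name__ == "__main__":
    print(tile_of((5, 6), H_RECT))
    print(tile_of((5, 6), H_SKEW))
```

The same iteration point generally lands in different tiles under the two matrices, which is why the tile shape affects both the communication pattern and the schedule.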
This article focuses on the effect of both process topology and load balancing on various programming models for SMP clusters and iterative algorithms. More specifically, we consider nested loop algorithms with constant flow dependencies that can be parallelized on SMP clusters with the aid of the tiling transformation. We investigate three parallel programming models, namely a popular monolithic message passing implementation and two hybrid ones that employ both message passing and multi-threading. We conclude that the selection of an appropriate mapping topology for the mesh of processes has a significant effect on the overall performance, and provide an algorithm for specifying such an efficient topology according to the iteration space and data dependencies of the algorithm. We also propose static load balancing techniques for distributing the computation between threads, which reduce the disadvantage of the master thread assuming all inter-process communication due to limitations often imposed by the message passing library. Both improvements are implemented as compile-time optimizations and are experimentally evaluated. An overall comparison of the above parallel programming styles on SMP clusters, based on micro-kernel experimental evaluation, is also provided.
Parallel and Distributed Processing Techniques and Applications, 2002
Tiling, or supernode transformation, is extensively discussed as a loop transformation for efficiently executing nested loops on distributed memory machines. In addition, a lot of work has been done concerning the selection of a communication-minimal and a scheduling-optimal tiling transformation. However, no complete approach has been presented in terms of implementation for non-rectangularly tiled iteration spaces. Code generation in this
Papers by N. Drosinos