Metaheuristics on GPUs


J. Parallel Distrib. Comput.

73 (2013) 1–3

Contents lists available at SciVerse ScienceDirect

J. Parallel Distrib. Comput.


journal homepage: www.elsevier.com/locate/jpdc

Editorial

Metaheuristics on GPUs

1. Motivations

Computationally hard discrete and continuous optimization problems are found virtually everywhere in society. In the discrete case, real-life problem instances are typically beyond the reach of exact methods under realistic response-time requirements, and one often needs to resort to some form of approximate method. Metaheuristics denote a general class of approximate methods based on heuristics. They rely on strategies for diversifying the search effort and for escaping local optima, and over the past few decades they have shown remarkable performance in discrete optimization [2].

In real-life continuous optimization problems, one rarely finds an analytical expression for the objective. There are problems of noise, non-linearities, and correlations between variables, so direct and global methods are needed. With only a few exceptions, metaheuristics were originally designed for discrete optimization. In recent years, many metaheuristics have been adapted to continuous problems, with very good results.

Given the computational hardness of the targeted problems, it is not surprising that parallel computing has been investigated to further improve the performance of metaheuristics, in both the discrete and the continuous case. Lately, we have seen a drastic change in the development of commodity computers. The steady increase of CPU frequency has stopped; performance is now improved through a growing number of cores for task-parallel computing. Moreover, improvements in the performance and programmability of graphics processing units (GPUs) add impressive stream-processing capabilities to ordinary computers. To take advantage of the new computing capabilities offered by the GPU, algorithms need to be redesigned, reimplemented, and even rethought [1].
Since 2006, the conference series International Conference on Metaheuristics and Nature Inspired Computing (META) has become one of the main gathering points for top researchers from academia and industry. META 2010, held on Djerba Island, Tunisia, from October 27 to 31, 2010, continued this tradition of excellence and friendly scientific exchange. A special session on metaheuristics on graphics hardware was an important and novel part of that year's META conference.

2. Content

This special issue is devoted to publications on the use of GPUs in metaheuristics, both in discrete and continuous optimization. The topics include:

- Case studies focused on specific problems: scheduling, routing, sequence alignment, stochastic combinatorial optimization.
- Metaheuristic algorithms: local search, ant colonies, genetic programming, genetic algorithms, differential evolution.
- Software frameworks.
- Surveys of GPU computing and prospects for the future.
In the following, we briefly outline the content and contribution of the nine papers that were finally selected for publication.

In the first paper, A.R. Brodtkorb, T.R. Hagen and M.L. Sætra give an overview of current GPU programming strategies, profile-driven development, and an outlook on future trends in GPGPU programming. Together with the expanding use of GPUs, there has been a tremendous development in programming languages and tools, and getting started with GPU programming has never been easier. Whilst getting started can be simple, fully utilizing GPU hardware is an art that can take months or years to master. The aim of the article is to simplify the design and implementation of efficient parallel algorithms on GPUs.

In the second paper, C. Schulz proposes an efficient local search algorithm on the GPU and investigates its application to the well-known vehicle routing problem. More precisely, the author investigates local search with the best-improving strategy for the 2-opt and 3-opt operators on a giant-tour representation. Resource extension functions are used for constant-time move evaluation. Using CUDA, a basic implementation called The Benchmark Version has been developed and deployed on a Fermi architecture. Both neighborhood setup and evaluation are performed entirely on the GPU. The Benchmark Version is the initial step of an incremental improvement process in which a number of important implementation aspects have been systematically studied. Ten well-known test instances from the literature are used in computational experiments, and profiling tools are used to identify bottlenecks. In the final version, the device is fully saturated, given a large enough problem instance. A speedup of almost an order of magnitude relative to The Benchmark Version is observed. The author concludes that, with some effort, local search may be implemented very efficiently on GPUs.
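The kernel of such a GPU local search is the simultaneous, independent evaluation of all moves in a neighborhood. The following NumPy sketch mimics that data-parallel pattern on the CPU for symmetric 2-opt over a plain distance matrix; the function and variable names are ours, and the paper's actual CUDA code additionally relies on resource extension functions for constrained routing:

```python
import numpy as np

def best_2opt_move(tour, dist):
    """Evaluate every 2-opt move of a cyclic tour at once and return the
    best improving one as (i, j, delta), or None if no move improves.
    Reversing tour[i:j+1] removes edges (tour[i-1], tour[i]) and
    (tour[j], tour[j+1]), replacing them with (tour[i-1], tour[j]) and
    (tour[i], tour[j+1])."""
    tour = np.asarray(tour)
    n = len(tour)
    i, j = np.triu_indices(n - 1, k=1)      # all segment endpoints, i < j
    a, b = tour[i - 1], tour[i]             # first removed edge (a, b)
    c, d = tour[j], tour[(j + 1) % n]       # second removed edge (c, d)
    delta = dist[a, c] + dist[b, d] - dist[a, b] - dist[c, d]
    k = int(np.argmin(delta))               # best-improving strategy
    return (int(i[k]), int(j[k]), float(delta[k])) if delta[k] < 0 else None
```

On a GPU, each (i, j) pair would map to one thread and the argmin becomes a parallel reduction; saturating the device then requires the large neighborhood cardinalities discussed next.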
He shows, however, that maximum efficiency requires a neighborhood cardinality of at least one million. Full exploration of a billion neighbors takes a few seconds and may be deemed too expensive with current technology. Reducing neighborhoods through filtering is an obvious remedy. Experiments on simple models of neighborhood filtering indicate, however, that the speedup effect is limited on data-parallel accelerators.

The authors of the third paper, J. Blazewicz, W. Frohmberg, M. Kierzynka and P. Wojciechowski, propose a GPU-based fast and accurate algorithm for multiple sequence alignment.

0743-7315/$ – see front matter © 2012 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2012.09.014


Multiple sequence alignment (MSA) methods are essential in biological analysis. Several MSA algorithms have been proposed in recent years. The quality of the results produced by these methods is reasonable, but no single method consistently outperforms the others. Additionally, the growing number of sequences in biological databases is perceived as one of the major upcoming challenges for alignment methods. The lack of performance concerns not only alignment problems, but can be observed in many areas of biologically related research. To overcome this problem in the field of pairwise alignment, several GPU computing approaches have been proposed lately, showing the great potential of the GPU platform. The main idea of this paper is therefore to design and implement an MSA method that can take advantage of modern graphics cards. The solution is based on T-Coffee, a well-known MSA algorithm whose computational time, however, is often unacceptable. The tests performed show that the proposed method, named G-MSA, is highly efficient, achieving up to a 193-fold speedup on a single GPU while the quality of its results remains very good. Due to effective memory usage, the method can align huge sets of sequences that previously could only be aligned on computer clusters. Moreover, multiple-GPU support with load balancing makes the application very scalable.

The authors of the fourth paper, J.M. Cecilia, J.M. García, A. Nisbet, M. Amos, and M. Ujaldón, focus on a GPU implementation of ant colony optimization (ACO), a population-based metaheuristic comprising two major stages: tour construction and pheromone update. Because of its inherently parallel nature, ACO is well suited to GPU implementation, but it also poses significant challenges due to irregular memory access patterns.
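The tour construction stage is, at its core, repeated roulette-wheel selection over the unvisited cities, weighted by pheromone and heuristic desirability. A minimal sequential sketch follows; the α/β weighting is the standard ACO formulation, and all names are illustrative rather than code from the paper:

```python
import numpy as np

def construct_tour(pheromone, dist, alpha=1.0, beta=2.0, rng=None):
    """Build one ant's tour by roulette-wheel selection: from the current
    city, pick the next with probability proportional to
    pheromone^alpha * (1/distance)^beta over the unvisited cities."""
    rng = rng or np.random.default_rng()
    n = len(dist)
    current = int(rng.integers(n))
    tour, unvisited = [current], set(range(n)) - {current}
    while unvisited:
        cand = np.fromiter(unvisited, dtype=int)
        w = pheromone[current, cand] ** alpha / dist[current, cand] ** beta
        current = int(cand[rng.choice(len(cand), p=w / w.sum())])
        tour.append(current)
        unvisited.remove(current)
    return tour
```

On a GPU, one thread block per ant is a typical mapping; the paper's I-Roulette mechanism replaces the inherently sequential prefix sum behind this selection with a more parallel-friendly scheme.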
The contribution of this paper is threefold: (1) a data-parallelism scheme for tour construction tailored to GPUs; (2) novel GPU programming strategies for the pheromone update stage; and (3) a new mechanism, called I-Roulette, that replicates the classic roulette wheel while improving GPU parallelism. The proposed implementation yields speedup factors exceeding 20 for each of the two stages of the ACO algorithm, as applied to the TSP, compared to its sequential counterpart running on a single-threaded high-end CPU. Moreover, an extensive discussion of different implementation paths on GPUs shows how to deal with parallel graph connected components. This, in turn, suggests a broader area of enquiry in which algorithm designers may learn to adapt similar optimization methods to GPU architectures.

In the fifth paper, A. Delévacq, P. Delisle, M. Gravel, and M. Krajecki design and implement parallel ant colony optimization on GPUs. The MAX-MIN Ant System (MMAS) algorithm augmented with 3-opt local search is used as a framework for implementing the parallel ants and multiple ant colonies parallelization approaches. The four resulting GPU algorithms are extensively evaluated and compared, on both speedup and solution quality, on a state-of-the-art Fermi GPU architecture. A rigorous effort is made to keep the parallel algorithms true to the original MMAS applied to the traveling salesman problem. The authors report speedups of up to 23.60 with solution quality similar to that of the original sequential implementation. With the intent of providing a parallelization framework for ACO on GPUs, a comparative experimental study highlights the performance impact of ACO parameters, GPU technical configuration, memory structures, and parallelization granularity.

The authors of the sixth paper, H. Wang, S. Rahnamayan, and Z. Wu, propose a parallel differential evolution algorithm with self-adapting control parameters and generalized opposition-based learning for solving high-dimensional global optimization problems, a time-consuming task because of the high complexity of the problems. The proposed approach, called GOjDE, employs self-adapting control parameters and generalized opposition-based learning (GOBL). The self-adapting parameter strategy avoids manually tuning the control parameters, and GOBL helps improve the quality of candidate solutions. Simulation experiments are conducted on a set of recently proposed high-dimensional benchmark problems with dimensions of 100, 200, 500 and 1000. The results demonstrate that GOjDE is better than, or at least comparable to, six other algorithms, and that employing a GPU can effectively reduce computation time. The maximum speedup obtained is 75.

The seventh paper, authored by D. Weyland, R. Montemanni, and L.M. Gambardella, deals with a metaheuristic framework for stochastic combinatorial optimization problems based on GPGPU, with a case study on the probabilistic traveling salesman problem with deadlines (PTSPD). Computational studies reveal significant improvements over state-of-the-art methods for the PTSPD. Additionally, the obtained results reveal the huge potential of the proposed framework and of sampling-based methods for stochastic combinatorial optimization problems.

The authors of the eighth paper, D.A. Augusto and H.J. Barbosa, focus on accelerated parallel genetic programming (GP) tree evaluation with OpenCL. Being an embarrassingly parallel technique, GP can in principle scale to very diverse problems as computational power is added to its arsenal. With today's availability of many powerful parallel architectures, a challenge is to take advantage of all those heterogeneous compute devices in a portable and uniform way. This work proposes (1) a transcription of existing GP parallelization strategies into the OpenCL programming platform, and (2) a freely available implementation to evaluate its suitability for GP, by assessing the performance of parallel strategies on CPUs and GPUs from different vendors.
Benchmarks in the symbolic regression and data classification domains were performed. On the GPU, the implementation achieves 13 billion node evaluations per second, delivering almost 10 times the throughput of a twelve-core CPU.

The authors of the last paper, F. Pinel, B. Dorronsoro, and P. Bouvry, solve very large instances of the independent task scheduling problem on the GPU. Two new parallel algorithms are designed and implemented. First, a parallel version of the Min-min heuristic is described. Second, GraphCell, an advanced parallel cellular genetic algorithm (CGA) for the GPU, is detailed. Two new generic recombination operators that take advantage of the massive parallelism of the GPU are proposed for GraphCell. A speedup study shows the high performance of the parallel Min-min algorithm on the GPU versus several CPU versions of the algorithm (both sequential and parallel using multiple threads). GraphCell improves state-of-the-art solutions, especially for larger problems, and provides an alternative to their GPU Min-min heuristic when more accurate solutions are needed, at the expense of an increased runtime.

Parallelization of metaheuristics with stream-processing accelerators such as the GPU started roughly a decade ago. Since then, more than one hundred papers have been published on the topic. A large part of this literature describes rather basic implementations of well-known metaheuristics and reports moderate speedups relative to unspecified CPU implementations. In contrast, this special issue contains reports from high-quality research on GPU-based parallelization of metaheuristics; reports that confirm the large potential for performance gains and also provide a solid basis for further work. Among the most important and exciting challenges ahead is the development of novel, flexible, and self-adaptive metaheuristics that fully exploit a typically heterogeneous collection of available hardware units.
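To make the last paper's starting point concrete: the Min-min heuristic works from an ETC ("expected time to compute") matrix, repeatedly assigning the unscheduled task with the smallest minimum completion time to the machine that achieves it. A sequential sketch follows; the data layout and names are our own assumptions, and the paper's contribution is parallelizing these per-task minimizations on the GPU:

```python
import numpy as np

def min_min(etc):
    """Min-min scheduling of independent tasks.
    etc[t, m] = expected time to compute task t on machine m.
    Returns (machine assignment per task, makespan)."""
    n_tasks, n_machines = etc.shape
    ready = np.zeros(n_machines)            # machine ready times
    assign = np.full(n_tasks, -1, dtype=int)
    pending = list(range(n_tasks))
    while pending:
        tasks = np.array(pending)
        ct = ready[None, :] + etc[tasks]    # completion times, pending x machines
        best_m = ct.argmin(axis=1)          # best machine per task ("min" #1)
        k = int(ct[np.arange(len(tasks)), best_m].argmin())  # best task ("min" #2)
        t, m = int(tasks[k]), int(best_m[k])
        assign[t] = m
        ready[m] += etc[t, m]
        pending.remove(t)
    return assign, float(ready.max())
```

Each outer iteration is a row-wise minimization followed by a reduction, which maps naturally onto data-parallel hardware.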


References

[1] T.-V. Luong, N. Melab, E.-G. Talbi, GPU computing for parallel local search metaheuristic algorithms, IEEE Transactions on Computers (2012).
[2] E.-G. Talbi, Metaheuristics: From Design to Implementation, Wiley, 2009.

Geir Hasle SINTEF, Oslo, Norway

El-Ghazali Talbi University of Lille, CNRS, INRIA, Lille, France E-mail address: [email protected].

24 September 2012 Available online 29 September 2012


Corresponding editor.
