Hardware Implementation

Girma S. Tewolde, Darrin M. Hanna, and Richard E. Haskell
Manuscript received November 12, 2008.
G. S. Tewolde is with the Department of Electrical and Computer Engineering, Kettering University, Flint, MI 48504 USA (e-mail: [email protected]).
D. M. Hanna is with the Electrical & Computer Engineering Dept., Oakland University, Rochester, MI 48309 USA (e-mail: [email protected]).
R. E. Haskell is with the Electrical & Computer Engineering Dept., Oakland University, Rochester, MI 48309 USA (e-mail: [email protected]).

Abstract—The ever increasing popularity of the particle swarm optimization (PSO) algorithm is recently attracting attention in the embedded computing world. Although PSO is considered efficient compared to other contemporary population based optimization techniques, for many continuous multimodal and multidimensional problems it still suffers from performance loss when it is targeted onto embedded application platforms. Examples of such target applications include small mobile robots and distributed sensor nodes in sensor network applications. In a previous work we presented a novel, modular, efficient and portable hardware architecture to accelerate the performance of the PSO for embedded applications. This paper extends that work by presenting a parallelization technique for further speedup of the PSO algorithm, dividing the swarm into a set of sub-swarms that execute in parallel. The underlying communication topology and messaging protocols are described. Finally, the performance of the proposed system is evaluated on mathematical and real-world benchmark functions.

I. INTRODUCTION

Particle Swarm Optimization (PSO) [1]-[3] is a population based stochastic heuristic algorithm developed to tackle complex optimization problems. Its attractive features of relative simplicity, efficiency and flexibility are some of the reasons why PSO is gaining strong ground to compete with other well established global optimization techniques.

Since its initial introduction, PSO has attracted a great deal of interest from researchers in various areas trying to use the algorithm for real-world applications [3]-[8]. Similar efforts have also been put into theoretical and experimental research aimed at improving the algorithm's performance [9]-[17]. In recent years there has been an increasing desire to expand the application domain of the PSO algorithm to embedded systems and industrial areas of application. An embedded system is a special purpose computer system designed to perform one or a few dedicated functions, often with real-time computing constraints. It is usually embedded as part of a complete device including hardware and mechanical parts.

The basic characteristic of embedded systems is that they run on platforms that have strict constraints on computational, communication, and energy resources, as well as restrictions on the physical size of the system. The central processing elements in such systems are typically low capacity 8- or 16-bit processors that run at low clock frequencies ranging from a few hundred kilohertz to a few hundred megahertz (i.e., one to three orders of magnitude lower than those of PCs and powerful workstations). Hence, one of the main goals in embedded system design is to find the right balance between these system constraints and the desired execution performance.

Application examples from recent publications that could benefit from embedded implementations of the PSO include system parameter identification and optimal control design [18], unsupervised robotic learning [19], and power system optimization [8]. However, most of the previous works demonstrate the effectiveness of the PSO on these and related applications by way of simulation on PCs or high performance workstations. There is little work that targets the PSO algorithm to embedded platforms.

In our papers [20] and [21], we demonstrated the severe performance degradation of the software implementation of the PSO algorithm when implemented on low-capacity embedded processors. To address this problem we developed a novel, efficient, modular, flexible and general purpose hardware architecture for the standard PSO algorithm, targeted at programmable logic implementation technologies. The performance of the proposed hardware architecture, evaluated on mathematical and real world test problems, was shown to be greatly accelerated.

The next logical step is to find schemes for further enhancement of the execution performance of the hardware PSO, especially for application targets that are endowed with large amounts of hardware resources. In this paper we present a hardware parallelization technique for the PSO algorithm that divides the swarm of particles into multiple sub-swarms, with parallel threads of PSO responsible for manipulating the individual sub-swarms.

This paper is organized as follows. Section II presents a short introduction to the PSO algorithm. In Section III, we describe the benchmark test functions used for performance evaluation. The PSO performance on an embedded microcomputer is then evaluated and results presented in Section IV. For the performance evaluation we use a Freescale 16-bit microcomputer, the MC9S12DP256 [22], running at a 25 MHz clock. The serial hardware architecture for a PSO engine targeted on a field programmable gate array (FPGA) and its performance evaluation are presented in Sections V and VI, respectively. In Section VII, the parallel architecture for the hardware PSO and its performance evaluation are presented. Finally, Section VIII provides concluding remarks.
II. PARTICLE SWARM OPTIMIZATION

Development of the Particle Swarm Optimization algorithm was inspired by the social behavior of bird flocks, fish schools, bee swarms, and animal herds [1]-[3]. Individual particles in a particle swarm represent candidate solutions for the optimization problem. The particles move around in the parameter space by using systematic rules to adjust their velocities and positions in response to the swarm's experience in locating quality solutions. Thus, the dynamics of the particles in the swarm is influenced by a combination of two factors: the personal and social best experiences.

Since the particles in the swarm fly in a continuous hyperspace in search of the optimum solution, the basic PSO algorithm is naturally suited for real-valued problems. Several modifications have also been proposed in the literature to enable PSO to handle discrete and combinatorial optimization problems.

One of the characteristic features of PSO is that each particle in the swarm has memory that allows it to keep track of its own personal best and neighborhood best experiences in the search process. At each step of the algorithm each individual particle adjusts its flying velocity in response to these cognitive and social learning experiences. In effect, the particles tend to be attracted to the best solution they have individually found and the best solution that any particle in their neighborhood has found.
When modeling a D-dimensional optimization problem, each particle X_i in the swarm is represented by a vector in the D-dimensional space, X_i = (x_i1, x_i2, ..., x_iD)^T. Particle i's previous best position is designated by a vector P_i = (p_i1, p_i2, ..., p_iD)^T, and its fitness evaluation at that point, called the personal best fitness, is pbest_i. The movement of each particle in the search space is determined by its velocity vector V_i = (v_i1, v_i2, ..., v_iD)^T. Different variations of the velocity update equations have been proposed in the literature [3], [10], [11], [16], the inertia-weight and constriction-factor forms being the most common ones. For our implementations in this paper we use the constriction-factor form of the velocity update equation as given in equation (1) below [10], [11]. The position of the particles is updated using equation (2).

    v_ij(t) = χ · ( v_ij(t−1) + φ1·r1·(p_ij − x_ij(t−1)) + φ2·r2·(p_gj − x_ij(t−1)) )    (1)

    x_ij(t) = x_ij(t−1) + v_ij(t)    (2)

where i = 1, 2, …, N, j = 1, 2, …, D, and r1 and r2 are uniform random numbers drawn from [0, 1].

For the test runs carried out in this paper we choose a commonly used set of generic PSO parameter values, χ = 0.729 and φ1 = φ2 = 2.05, which work well in most cases. When necessary, however, problem specific parameter tuning strategies could be implemented for even better solution quality and PSO execution performance. In this regard, we have developed a hierarchical dual PSO technique for this purpose, with a top level PSO tuning the parameters of a function optimizer PSO, resulting in greatly improved performance over PSO with generic parameters [23]. However, for this paper we focus only on the hardware technique for accelerating the execution performance for embedded applications.
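For illustration, the update rule of equations (1) and (2) translates directly into software. The C sketch below applies one step to a whole swarm using the generic parameter values above; it is not the hardware datapath of the later sections, and the array layout and the rand01() helper are assumptions of this sketch.

    #include <stdlib.h>

    #define N 30   /* swarm size (generic value used in this paper) */
    #define D 30   /* problem dimension */

    /* Uniform random number in [0, 1); any PRNG would do here. */
    static double rand01(void) { return rand() / ((double)RAND_MAX + 1.0); }

    /* One PSO step in constriction-factor form, equations (1) and (2). */
    static void pso_step(double x[N][D], double v[N][D],
                         const double pbest[N][D], const double gbest[D])
    {
        const double chi = 0.729, phi1 = 2.05, phi2 = 2.05;  /* generic values */
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < D; j++) {
                double r1 = rand01(), r2 = rand01();
                v[i][j] = chi * (v[i][j]
                                 + phi1 * r1 * (pbest[i][j] - x[i][j])
                                 + phi2 * r2 * (gbest[j]    - x[i][j]));  /* eq. (1) */
                x[i][j] += v[i][j];                                       /* eq. (2) */
            }
        }
    }

After each such step, the personal best and global best records are refreshed from the new fitness values before the next call.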
III. BENCHMARK TEST FUNCTIONS

For purposes of testing and comparing different evolutionary optimization methods, a suite of mathematical benchmark functions is often used. Because of their relative simplicity for hardware implementation, for preliminary testing and evaluation of our proposed hardware architecture we use the Sphere and Rosenbrock functions [3]. In addition to these two functions, for evaluating the algorithm's performance on a real world problem scenario, we also consider an emission source localization problem in a distributed sensor network environment [20]. These test functions are described below.

A. Mathematical Benchmark Functions

This subsection describes the Sphere and Rosenbrock benchmark functions we used for performance evaluation.

a) Sphere (De Jong's f1).

    f1(x) = Σ_{i=1}^{D} x_i^2    (3)

b) Rosenbrock.

    f2(x) = Σ_{i=1}^{D−1} [ 100·(x_{i+1} − x_i^2)^2 + (x_i − 1)^2 ]    (4)
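Both functions reduce to a few lines of C. The routines below follow equations (3) and (4) directly and are the kind of fitness evaluators the pso_step() sketch above would call; they are reference sketches, not the FPGA fitness modules.

    /* Sphere (De Jong's f1), equation (3); global minimum 0 at the origin. */
    double sphere(const double x[], int d)
    {
        double s = 0.0;
        for (int i = 0; i < d; i++)
            s += x[i] * x[i];
        return s;
    }

    /* Rosenbrock, equation (4); global minimum 0 at (1, 1, ..., 1).
     * Adjacent coordinates are coupled, so the sum runs over d-1 terms. */
    double rosenbrock(const double x[], int d)
    {
        double s = 0.0;
        for (int i = 0; i < d - 1; i++) {
            double a = x[i + 1] - x[i] * x[i];
            double b = x[i] - 1.0;
            s += 100.0 * a * a + b * b;
        }
        return s;
    }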
B. Emission Source Localization Problem

We consider a scenario where a distributed sensor network is used for localization of emission sources in an environment. A sensor network with the appropriate types of sensors is assumed to be deployed to detect the intensity of the emission at the sensor locations. The sensors have only limited perception of their immediate locality, with some degree of uncertainty. Further information, with test results on larger scale problem dimensions and sensor noise levels, and relative comparisons of PSO with other solution strategies, can be found in [20].

The fitness function f is formulated to capture the important parameters relating the sensor locations, their measurements, and the estimated emission source location and its intensity. Equation (5) is derived based on the least-squares method, whereby the sum of the squares of the errors between the measurement estimates from the theoretical model and the actual readings, over all sensors, is minimized.

    f(Q0, x0, y0, W) = Σ_{j=1}^{Ns} dQ_j^2 = Σ_{j=1}^{Ns} ( Q0 − (q_j·d_j^2 + w_j·Q0) )^2    (5)

where W is the vector of the weighting factors w_j, j = 1, 2, 3, …, Ns, introduced to capture measurement uncertainties.
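Equation (5) maps to code just as directly. The sketch below assumes the usual inverse-square emission model of [20], in which a reading q_j at distance d_j from a source of intensity Q0 satisfies Q0 ≈ q_j·d_j^2; the Sensor struct and its field names are illustrative assumptions, not the data layout of the hardware design.

    typedef struct {
        double sx, sy;   /* sensor coordinates          */
        double q;        /* measured emission intensity */
        double w;        /* weighting factor w_j        */
    } Sensor;

    /* Least-squares fitness of equation (5) for a candidate source of
     * intensity Q0 at position (x0, y0); smaller is better. */
    double emission_fitness(double Q0, double x0, double y0,
                            const Sensor s[], int Ns)
    {
        double f = 0.0;
        for (int j = 0; j < Ns; j++) {
            double dx = s[j].sx - x0, dy = s[j].sy - y0;
            double d2 = dx * dx + dy * dy;                 /* d_j^2         */
            double dQ = Q0 - (s[j].q * d2 + s[j].w * Q0);  /* error term dQ_j */
            f += dQ * dQ;
        }
        return f;
    }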
A summarized list of the benchmark test functions, their dimensions, ranges and error thresholds is given in Table 1. The emission source localization problem, for the tests in this paper, is configured with zero sensor measurement noise.

TABLE 1
BENCHMARK TEST FUNCTIONS, THEIR DIMENSIONS, RANGES, AND MAXIMUM ERROR THRESHOLDS.

    Function                       Dimension   Range        Error threshold
    Sphere                         30          [-100,100]   < 0.01
    Rosenbrock                     30          [-30,30]     < 100
    Emission source localization   30          [-50,50]     < 1.0

IV. PERFORMANCE OF PSO ON EMBEDDED MICROCOMPUTERS

To analyze the performance of the PSO algorithm on embedded microcomputer platforms, we pick the Freescale MC9S12DP256 microcomputer [22], which is in common use for various embedded applications. This microcomputer comes with a 16-bit CPU, 12 Kbytes of RAM and 256 Kbytes of Flash EEPROM on chip, and runs at a maximum clock frequency of 25 MHz. The program and data of our PSO implementation comfortably fit within the available on chip memory.

TABLE 2
EXECUTION PERFORMANCE RESULTS OF THE SOFTWARE IMPLEMENTATION OF THE PSO ON THE FREESCALE MC9S12DP256 MICROCOMPUTER @ 25 MHZ.

    Test functions                 No. Iters   Ex. Time (sec)   Ck. Cycles
    Sphere                         372.6       100.6            2.52E+09
    Rosenbrock                     695.4       254.6            6.36E+09
    Emission source localization   127.0       46.1             1.15E+09

Table 2 shows the average, over twenty independent runs on the embedded processor platform, of the execution performance of the PSO on the benchmark functions. From these results, we observe that the PSO implementation on the 16-bit MC9S12DP256 microcomputer running at a 25 MHz clock takes on average about 100, 255, and 46 seconds, respectively, to solve the 30-dimensional Sphere, Rosenbrock, and emission source localization problems. What we mean by solving the problem here is that a global best particle in the PSO population has managed to produce a result better than the given minimum error threshold.

The execution speed achieved by the software implementation of the PSO on the embedded processor could not be considered fast enough for many real-time applications with strict timing constraints. In our previous work [20] and [21] we developed an efficient, modular, flexible, and reusable hardware architecture to accelerate the execution performance of the standard PSO algorithm. Due to strict hardware resource constraints on low-footprint embedded systems, the approach followed in the previous architecture is a serial implementation, to allow sharing of the computational units by all particles in the swarm. However, when larger hardware resources are available it is possible to further speed up the algorithm's execution by a parallelization scheme that divides the entire swarm into a number of sub-swarms that are executed concurrently.
V. SERIAL HARDWARE ARCHITECTURE

From previous work in the literature on hardware implementation of the PSO, Reynolds et al. [24] proposed an FPGA implementation of a modified version of the PSO for inversion of neural networks. Their goal was to simplify the hardware implementation through different techniques, such as using deterministic approaches in the particle update operations rather than employing random number sources. They use fixed-point, rather than floating-point, number representation to minimize the required logic resources. Although this modified version is reported to have helped in reducing the hardware cost and greatly enhancing the execution speed, it also resulted in slightly degraded solution quality compared to the full-blown standard PSO implementation. The method is, however, presented as a problem specific solution.

As in the software world, several approaches could be considered to map the PSO algorithm to hardware, depending on the available hardware resources and the required performance. For example, if sufficient hardware resource is available, one can consider a parallel implementation of the algorithm for greater performance. On the other hand, on target platforms with limited hardware resources a serial implementation of the PSO that allows reuse of computational elements by the entire swarm and the fitness evaluation module could produce a more compact design.

In the serial implementation of the PSO algorithm on hardware, all particles share common computational hardware resources for updating their velocity and position data, for evaluating their fitness values, and for updating their personal best and global best performance data. A control state machine is designed to regulate the access to the shared computing and memory resources.

For our PSO engine all the swarm data, for the position, velocity, and fitness of the particles, is represented as single-precision floating-point numbers, using the standard IEEE-754 floating-point format. Using this format makes the design more easily reusable and portable than custom, but perhaps more efficient, number formats. However, the attractive features of floating-point numbers come at the cost of a relatively more complex and resource intensive computational engine to carry out the floating-point operations in hardware. Thus, in small density FPGA targets the need arises to share a limited number of hardware floating-point operators among the different components of the PSO algorithm. On the other hand, when the hardware PSO engine is targeted onto more capable FPGA devices, or multi-FPGA systems, the larger available computational resources could be exploited to add parallelism to some of the operations and hence achieve even greater improvements in performance. Figure 1 shows a high-level architecture for the direct serial implementation of the PSO on FPGA. Refer to [21] for details of the major functional blocks of the hardware architecture.

[Figure 1: block diagram comprising a Datapath Unit, a Fitness Evaluation Module, a Control Unit, registers for temporary data, an RNG Module, Floating Point Operation Modules, and the Swarm Memory, connected by data, address, status and control signals.]

Fig. 1. High level architecture for the direct serial implementation of the PSO on FPGA.
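Viewed from software, the scheduling performed by the control state machine amounts to time-multiplexing one shared update datapath and one shared fitness unit over all N particles, as the C sketch below illustrates. It mirrors only the control flow; the actual design is a register-transfer-level architecture (see [21] and Fig. 1), and the helper names are our own. Here fitness() stands for any test function of Section III, and update_particle() is the per-particle form of equations (1)-(2).

    #define N 30
    #define D 30

    extern double rand01(void);                      /* as in the Section II sketch   */
    extern double fitness(const double x[], int d);  /* e.g. sphere() or rosenbrock() */

    /* Per-particle constriction update, equations (1)-(2). */
    static void update_particle(double x[D], double v[D],
                                const double pb[D], const double gb[D])
    {
        const double chi = 0.729, phi = 2.05;
        for (int j = 0; j < D; j++) {
            v[j] = chi * (v[j] + phi * rand01() * (pb[j] - x[j])
                               + phi * rand01() * (gb[j] - x[j]));
            x[j] += v[j];
        }
    }

    /* One iteration as the serial hardware sequences it: every particle, in
     * turn, passes through the single shared datapath and the single shared
     * fitness unit, then the best records are updated in place. */
    static void serial_pso_iteration(double x[N][D], double v[N][D],
                                     double pbest[N][D], double pbest_f[N],
                                     double gbest[D], double *gbest_f)
    {
        for (int i = 0; i < N; i++) {
            update_particle(x[i], v[i], pbest[i], gbest); /* shared datapath     */
            double f = fitness(x[i], D);                  /* shared fitness unit */
            if (f < pbest_f[i]) {                         /* personal best       */
                pbest_f[i] = f;
                for (int j = 0; j < D; j++) pbest[i][j] = x[i][j];
                if (f < *gbest_f) {                       /* global best         */
                    *gbest_f = f;
                    for (int j = 0; j < D; j++) gbest[j] = x[i][j];
                }
            }
        }
    }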
VI. PERFORMANCE EVALUATION OF SERIAL HARDWARE PSO

For prototyping, testing and evaluating the performance of the proposed serial hardware architecture, a Spartan-3E family FPGA, the XC3S500E [25], is used. For the performance evaluation, we are mainly interested in the quality of the solution generated by the hardware PSO and its execution performance. Therefore, we incorporated in our designs small measurement and output units that count and display the number of iterations the algorithm executes to arrive at a given error threshold and the number of system clock cycles it takes for its execution. These quantities are the criteria we use to compare different implementations of the PSO algorithm.

All the benchmark functions were tested on the hardware PSO engine by running them 20 times each, using a different random number seed value for each run. Table 3 shows the execution performance results in terms of the average number of iterations and execution times (in seconds and clock cycles) required to arrive at the given error thresholds.

TABLE 3
AVERAGE EXECUTION PERFORMANCE OF THE HARDWARE IMPLEMENTATION OF THE PSO ON A SPARTAN-3E, XC3S500E FPGA, AT 25 MHZ CLOCK.

    Test functions                 No. Iterations   Ex. Time (sec)   Ck. Cycles
    Sphere                         338.7            0.28             6.93E+06
    Rosenbrock                     470.4            0.39             9.72E+06
    Emission source localization   132.0            0.11             2.75E+06

The execution speedups of the hardware PSO over the software implementation on the embedded processor platform are presented in Table 4. These numbers are calculated by dividing the execution time (in seconds) of the software implementation by the execution time of the direct serial hardware implementation on the FPGA (for example, 100.6 / 0.28 ≈ 360 for the Sphere function). These results show that the serial hardware PSO running on the FPGA at a 25 MHz clock achieves accelerated performance, with average speedups ranging from 360 to 653 over the implementation on the 16-bit MC9S12DP256 processor running at the same clock frequency.

TABLE 4
EXECUTION SPEEDUP OF THE HARDWARE IMPLEMENTATION OF THE PSO OVER SOFTWARE IMPLEMENTATIONS ON EMBEDDED PROCESSORS.

    Test functions                 Execution time speedup over MC9S12DP256 @ 25 MHz
    Sphere                         360
    Rosenbrock                     653
    Emission source localization   419

VII. PARALLELIZATION SCHEME

Since PSO emulates a population-based multi-agent optimization, it naturally presents a good opportunity to parallelize the algorithm. Each particle of the swarm carries a potential solution that flies through the solution search space by modifying its velocity and position according to its own and its neighborhood's or global best performance histories. The computations involved in the PSO algorithm can be divided into operations that update the particle velocity and position information, on one hand, and computations required for the fitness evaluation of the particle in the objective function landscape, on the other.

Most previous work in the literature on parallel PSO implementations considers coarse grained parallelization of the fitness evaluation computations over a parallel cluster of computers [26]-[28]. Schutte et al. [26] present a synchronous parallel implementation of the PSO on a cluster of computers, testing on analytical and bio-mechanical system identification problems. The computing infrastructure in this configuration is organized into one coordinating node and several slave nodes for fitness function evaluations. The responsibility of the coordinating node includes the particle update operations and the assignment of fitness evaluation tasks to the parallel slave nodes. In the synchronous parallel implementation scheme the coordinating node waits for all the fitness evaluations to be completed before proceeding with updating of the best fitness values. Consequently, the swarm reacts more slowly to changes of the best fitness values and the corresponding positions. This causes unavoidable performance loss, especially if the slave nodes are of heterogeneous type or if the fitness evaluation process requires variable execution time depending on the position of the candidate solution in the search space.

Venter and Sobieszczanski-Sobieski [27] and Koh et al. [28] present strategies that apply an asynchronous parallelization scheme for the PSO algorithm on a parallel cluster of computers. The main idea behind the asynchronous approach is to allow update of the best position and fitness data, in the current iteration, as they become available, without waiting for the completion of the fitness evaluation of all the particles in the swarm. In both of these reported works, similar to the work of Schutte et al., the strategy is coarse-grained parallelization of the fitness function computation tasks on a parallel cluster of computers with a master-slave configuration. According to the papers, the asynchronous approach outperforms the synchronous one in terms of parallel efficiency when heterogeneity exists in the computational task or the environment.
A. Degree of parallelism

We define the degree of parallelism of an implementation as the number of parallel threads of execution that it can run simultaneously. The type and amount of resources of the hardware platform, the number of particles in the swarm, and the nature or complexity of the optimization problem are factors that determine the degree of parallelism for a hardware implementation of the PSO algorithm. For example, if enough hardware resource is available, one could consider employing a fully parallel implementation of the algorithm, with each particle having its own execution engine for velocity and position updates as well as for its fitness evaluation. On the other hand, if hardware space is limited and if the main execution bottleneck is the fitness function evaluation, a possible parallelization approach could benefit from parallelization of the fitness evaluation modules while using a common basic PSO engine for the particle updates.

For the embedded system applications that we are mainly interested in, such as the emission source localization problem scenario described in previous sections, the available hardware resources are often limited. So, unlike the case of clusters of tens to hundreds or thousands of high performance computers, we often target our system on small footprint microprocessors or FPGAs. The goal is therefore to come up with a suitable and efficient architecture of the parallel PSO for low to medium density FPGAs, with possible extensions to larger capacity hardware platforms. Hence, instead of a fully parallel implementation we consider sub-swarm or group level parallelization, whereby the entire swarm is evenly divided into a number of sub-swarms, each of which has its own computational resources. The execution of the parallel sub-swarms is asynchronous; they communicate only when they discover a new global best fitness. The architecture we are proposing allows the communication to proceed in the background without interrupting the PSO execution of the sub-swarms.

B. Hardware Architecture

The architecture for the direct serial hardware implementation of the PSO is reused, with appropriate modifications, to build the independent components forming the individual threads of the parallel execution. To make the comparison process between the serial and parallel implementations of the PSO as fair as possible, we use the same generic parameters, including the swarm size, for both implementations. Thus, if the swarm size is N in the serial implementation, a parallel implementation with p parallel threads will have p sub-swarms, each with N/p particles. Each sub-swarm will have its own computation resources, shared by all the particles in the sub-swarm.

However, the p parallel sub-swarms need to exchange information about the best performing particle they discover and the corresponding best fitness value. So, one of the main design decisions is how to define the communication strategy in such a way that the associated overhead does not overshadow the anticipated performance gain from the parallelization effort. Since in the gbest PSO topology the global best particle information is needed by all particles in the entire swarm for their velocity and position updates, we need to devise an efficient mechanism to provide access to this information. One approach for allowing access by all sub-swarms to the global best data is to use shared memory. However, since this data is required for all particles' velocity update operations, simultaneous access to this common data by all sub-swarms creates a potential bottleneck, causing performance loss.

As an alternative approach, instead of saving the global best information at a common shared memory location, we propose keeping local copies in each of the sub-swarms, hence totally eliminating the bottleneck due to shared memory access. However, this approach requires a communication mechanism to allow the sub-swarms to exchange the global best data as they discover them. For the physical interconnection between the sub-swarms, we employ a ring communication topology to interconnect them, along with a separate communication controller. In the illustration shown in Figure 2, SS#k (where k = 1, 2, …, p) are the parallel sub-swarms, each keeping local copies of the global best fitness (fgbest) and the global best particle position (gbx, in mem_gbx).

[Figure 2: the Communication Controller exchanges the newgb_k, ack_k and done_k signals with the sub-swarms SS#1 … SS#p and broadcasts the SYNC pulse to all of them; the sub-swarms pass the global best data around a ring, each holding local copies of gbx and fgbest.]

Figure 2. Communication infrastructure for exchanging information between parallel sub-swarms.
As soon as sub-swarm k discovers a new global best fitness value, it starts to execute a sequence of steps to share the information with the other sub-swarms, by first sending a communication request signal (newgb_k) to the communication controller module. The communication controller module handles such requests from the sub-swarms and acts to initiate the data communication of the newly discovered global best particle to all the sub-swarms.

Although it is possible for more than one sub-swarm to assert newgb_k communication requests simultaneously, the communication controller is configured to acknowledge only one request at a time. To keep the design of the communication controller simple, we do not send the newly discovered best fitness values from the requesting sub-swarms to the communication controller. Hence, lacking knowledge of the actual newly discovered global best fitness values of the different sub-swarms, the communication controller simply chooses one of the sub-swarms arbitrarily in the case of multiple simultaneous requests. For example, the sub-swarm ID number k could be used as a priority number that the communication controller uses to resolve such competing requests. Thus, only the chosen sub-swarm receives the acknowledgement (ack_k) signal.

From the above mentioned strategy one can observe that there is no guarantee that the communication controller chooses the actual overall global best data to be distributed to all the sub-swarms when there are multiple simultaneous requests. However, we found the strategy described to be a better compromise than adding more complex functionality, with the associated cost of logic, in the communication controller to search for the actual overall global best fitness value in cases where it receives competing requests. Furthermore, after the current communication is done, it is possible that previously missed global best data is rediscovered in the following iterations and transmitted to the other sub-swarms.

In the communication process, immediately after sending the ack_k signal to the chosen sub-swarm, the communication controller sends a synchronization (SYNC) pulse to all the sub-swarms in the ring communication topology. The actual data communication then starts following the SYNC pulse.
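The request/acknowledge protocol can be modeled compactly in software. The following C sketch mimics one arbitration round of the communication controller: fixed-priority selection among the pending newgb_k lines, an acknowledgement to the winner, then the SYNC-initiated distribution of the winner's local copy to the other sub-swarms. Signal and field names follow Figure 2; the struct layout, the use of memcpy() in place of the ring transfer, and the keep-own-copy-if-better rule on the receiving side (minimization is assumed) are assumptions of this sketch.

    #include <string.h>

    #define P 5    /* degree of parallelism */
    #define D 30   /* problem dimension     */

    typedef struct {
        double gbx[D];   /* local copy of the global best position (mem_gbx) */
        double fgbest;   /* local copy of the global best fitness            */
        int    newgb;    /* request line to the communication controller     */
    } SubSwarm;

    /* One arbitration round. The controller never sees the fitness values
     * themselves: under multiple simultaneous newgb_k requests it simply
     * picks the lowest sub-swarm ID (a fixed priority), acks it, and has
     * the winner's data passed to all other sub-swarms. Returns the ID of
     * the acknowledged sub-swarm, or -1 when no request is pending. */
    int controller_round(SubSwarm ss[P])
    {
        int k, winner = -1;
        for (k = 0; k < P; k++)            /* fixed-priority arbitration */
            if (ss[k].newgb) { winner = k; break; }
        if (winner < 0) return -1;

        ss[winner].newgb = 0;              /* ack_k consumes the request */
        for (k = 0; k < P; k++) {          /* SYNC, then ring transfer   */
            if (k == winner) continue;
            if (ss[winner].fgbest < ss[k].fgbest) {  /* keep own copy if better */
                memcpy(ss[k].gbx, ss[winner].gbx, sizeof ss[k].gbx);
                ss[k].fgbest = ss[winner].fgbest;
            }
        }
        return winner;
    }

In this model a sub-swarm whose better value loses the arbitration simply keeps its request pending (or re-raises it when the value is rediscovered), which matches the recovery behavior described above.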
C. Performance Evaluation

The target hardware platforms we use for testing are the Xilinx Spartan-3E family XC3S500E FPGA and the larger capacity Virtex II Pro family XC2VP30 FPGA. The XC3S500E FPGA can fit just two parallel threads of the Sphere or the Rosenbrock benchmark test functions, and no more than a single serial thread of the emission source localization problem. On the other hand, the Virtex II Pro FPGA fits up to five parallel threads of each of the test functions. Thus, for p = 2, the Sphere and Rosenbrock are tested on the XC3S500E, and the emission source localization problem on the XC2VP30. For p = 3 and 5, all functions are tested on the XC2VP30 target hardware.

For all test experiments we measured the execution performance of the hardware implementations of the parallel and serial PSO. For a fair comparison, other than the degree of parallelism, all algorithm parameters are kept the same for all implementations. The PSO parameters are set to their generic values. We use a swarm size of N = 30 for the serial implementations. For a parallel implementation with degree of parallelism p, the number of particles in each sub-swarm is set to N/p. Different initial seed values are set for the random number generators of the different sub-swarms so as to minimize any possible overlap of the initial particle distributions of the different sub-swarms.

Table 5 presents the execution performance of the serial PSO and of the parallel one with different degrees of parallelism. The actual speedup of the parallel PSO over the serial implementation is computed as the execution time of the serial PSO divided by that of the parallel PSO for each of the different p values (Table 6); for example, for the Sphere function at p = 2 the speedup is 6.93E6 / 3.36E6 ≈ 2.06.

TABLE 5
COMPARISON OF THE EXECUTION PERFORMANCES OF THE SERIAL AND PARALLEL (WITH P = 2, 3 AND 5) IMPLEMENTATIONS OF PSO.

                                   p = 1             p = 2             p = 3             p = 5
    Test functions                 Iters  Ck. cyc.   Iters  Ck. cyc.   Iters  Ck. cyc.   Iters  Ck. cyc.
    Sphere                         339    6.93E6     330    3.36E6     368    2.49E6     447    1.81E6
    Rosenbrock                     470    9.72E6     330    3.39E6     376    2.58E6     342    1.40E6
    Emission source localization   132    2.75E6     134    1.39E6     163    1.12E6     222    9.07E5

TABLE 6
SPEEDUPS OF THE PARALLEL IMPLEMENTATIONS OF PSO OVER THE SERIAL VERSION ON THE THREE TEST FUNCTIONS.

    Test functions                 p = 2   p = 3   p = 5
    Sphere                         2.06    2.78    3.83
    Rosenbrock                     2.87    3.77    6.94
    Emission source localization   1.98    2.46    3.03

The experimental test results demonstrate that the proposed parallel PSO architecture achieves execution speedups over the serial implementation, although the degrees of the speedups do not exactly match the degrees of parallelism. Due to the stochastic nature of the algorithm, the division of the optimization task into multiple sub-swarms of smaller sizes can have varying effects on the performance. For example, among the three test scenarios, the parallel PSO achieved the highest speedups on the Rosenbrock function for all degrees of parallelism. For the Sphere and emission source localization problems, however, the speedups are about equal to the number of parallel threads for p = 2, but they are lower for higher values of p.

Different factors are believed to affect the performance of the parallel PSO, including the nature of the problem, how well the different sub-swarms subdivide the search space with minimal overlap, the configuration of the random number sources, and the communication strategy. In the current implementation, the same type of random number generator, with different seed values, is used for all the sub-swarms. Further work is required to better understand the impact of the configurations of the different random number sources on the performance of the parallel PSO.
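As one concrete illustration of "the same type of random number generator with different seed values", the sketch below instantiates a family of identical xorshift32 generators, one state per sub-swarm. xorshift32 is our illustrative choice here and not necessarily the RNG module of the hardware design in Fig. 1.

    #include <stdint.h>

    typedef struct { uint32_t s; } Rng;   /* one independent state per sub-swarm */

    /* Marsaglia's xorshift32; the state must be nonzero. */
    static uint32_t xorshift32(Rng *r)
    {
        uint32_t x = r->s;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        return r->s = x;
    }

    /* Uniform double in [0, 1) drawn from a sub-swarm's own stream. */
    static double rng_uniform(Rng *r) { return xorshift32(r) / 4294967296.0; }

    /* Seed the p sub-swarm generators with well-separated nonzero values. */
    void seed_subswarms(Rng rng[], int p, uint32_t base_seed)
    {
        for (int k = 0; k < p; k++) {
            rng[k].s = base_seed + 0x9E3779B9u * (uint32_t)(k + 1);
            if (rng[k].s == 0)
                rng[k].s = 1;   /* xorshift32 must never start at zero */
        }
    }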
VIII. CONCLUSION

This paper presented a parallelization scheme to speed up the performance of the hardware PSO. The proposed architecture employs sub-swarm or group level parallelization, for targets on low to medium density FPGAs, with possible extension to larger capacity hardware platforms. The effectiveness of the proposed architecture has been demonstrated on different test functions, for various degrees of parallelism. In future work, systematic approaches for partitioning the global search space into the working regions of the different sub-swarms will be investigated. Such strategies could potentially reduce the overlap of the different partitions of the search space handled by the different sub-swarms, hence possibly reducing the total execution effort needed to reach a desirable solution.

REFERENCES

[1] R. Eberhart and J. Kennedy, "A New Optimizer using Particle Swarm Theory," Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39-43.
[2] J. Kennedy and R. C. Eberhart, "Particle Swarm Optimization," Proceedings of the IEEE International Conference on Neural Networks, 1995, Vol. 4, pp. 1942-1948.
[3] J. Kennedy and R. C. Eberhart, Swarm Intelligence. Morgan Kaufmann Publishers, 2001.
[4] X. Cui, T. E. Potok, and P. Palathingal, "Document Clustering Using Particle Swarm Optimization," Proceedings of the 2005 IEEE Swarm Intelligence Symposium (SIS 2005), pp. 185-191.
[5] A. S. Mohais, R. Mohais, and C. Ward, "Earthquake Classifying Neural Networks Trained with Dynamic Neighborhood PSOs," Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (GECCO'07), pp. 110-117.
[6] G. S. Tewolde and D. M. Hanna, "Particle Swarm Optimization for Classification of Breast Cancer Data using Single and Multisurface Methods of Data Separation," The 2007 IEEE International Conference on Electro-Information Technology (EIT2007).
[7] G. S. Tewolde and D. M. Hanna, "Particle Swarm Optimization for Cluster-Based Classification of Breast Cancer Data," The 2007 International Conference on Genetic and Evolutionary Methods (GEM2007).
[8] B. Yang, Y. Chen, and Z. Zhao, "Survey on Applications of Particle Swarm Optimization in Electric Power Systems," IEEE International Conference on Control and Automation (ICCA 2007), pp. 481-486.
[9] D. Bratton and J. Kennedy, "Defining a Standard for Particle Swarm Optimization," Proceedings of the 2007 IEEE Swarm Intelligence Symposium (SIS 2007).
[10] M. Clerc and J. Kennedy, "The Particle Swarm – Explosion, Stability and Convergence in a Multidimensional Complex Space," IEEE Transactions on Evolutionary Computation, Vol. 6, No. 1, February 2002, pp. 58-73.
[11] R. C. Eberhart and Y. Shi, "Comparing Inertia Weights and Constriction Factors in Particle Swarm Optimization," The 2000 Congress on Evolutionary Computation, Vol. 1, pp. 84-88.
[12] M. Meissner, M. Schmuker, and G. Schneider, "Optimized Particle Swarm Optimization (OPSO) and its application to artificial neural network training," BMC Bioinformatics, 2006, vol. 7:125.
[13] R. Mendes, Population Topologies and Their Influence in Particle Swarm Performance, PhD Thesis, Universidade do Minho, Braga, Portugal, Feb. 2004.
[14] R. Mendes, J. Kennedy, and J. Neves, "The Fully Informed Particle Swarm: Simpler, Maybe Better," IEEE Transactions on Evolutionary Computation, 2004.
[15] F. van den Bergh and A. P. Engelbrecht, "A cooperative approach to particle swarm optimization," IEEE Transactions on Evolutionary Computation, vol. 8, no. 3, pp. 225-239, Jun. 2004.
[16] Y. Shi and R. C. Eberhart, "A modified particle swarm optimizer," Proceedings of the IEEE Congress on Evolutionary Computation (CEC 1998), Piscataway, NJ, pp. 69-73, 1998.
[17] Y. Shi and R. C. Eberhart, "Parameter Selection in Particle Swarm Optimization," The 7th Annual Conference on Evolutionary Programming, San Diego, USA, 1998.
[18] Y. Liu, J. Zhang, and S. Wang, "Optimization design based on PSO algorithm for PID controller," Fifth World Congress on Intelligent Control and Automation (WCICA 2004), Vol. 3, pp. 2419-2422.
[19] J. Pugh, A. Martinoli, and Y. Zhang, "Particle swarm optimization for unsupervised robotic learning," Proceedings of the 2005 IEEE Swarm Intelligence Symposium (SIS 2005), pp. 92-99.
[20] G. S. Tewolde, D. M. Hanna, and R. E. Haskell, "Hardware PSO for Sensor Network Applications," Proceedings of the 2008 IEEE Swarm Intelligence Symposium (SIS 2008).
[21] G. S. Tewolde, D. M. Hanna, and R. E. Haskell, "Accelerating the Performance of Particle Swarm Optimization for Embedded Applications," submitted to the 2009 IEEE Congress on Evolutionary Computation (CEC 2009).
[22] Freescale, MC9S12DP256B Device User Guide V02.15. [Online]. Available: http://www.freescale.com/files/microcontrollers/doc/data_sheet/
[23] G. S. Tewolde, D. M. Hanna, and R. E. Haskell, "Enhancing Performance of PSO with Automatic Parameter Tuning Technique," submitted to the 2009 IEEE Swarm Intelligence Symposium (SIS 2009).
[24] P. D. Reynolds, R. W. Duren, M. L. Trumbo, and R. J. Marks II, "FPGA Implementation of Particle Swarm Optimization for Inversion of Large Neural Networks," Proceedings of the 2005 IEEE Swarm Intelligence Symposium.
[25] Xilinx, Xilinx Spartan-3E FPGA Family: Complete Data Sheet, 2007. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds312.pdf
[26] J. F. Schutte, J. A. Reinbolt, B. J. Fregly, R. T. Haftka, and A. D. George, "Parallel global optimization with the particle swarm algorithm," International Journal for Numerical Methods in Engineering, 61(13), 2004, pp. 2296-2315.
[27] G. Venter and J. Sobieszczanski-Sobieski, "A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations," 6th World Congress of Structural and Multidisciplinary Optimization, 2005.
[28] B. Koh, A. D. George, R. T. Haftka, and B. J. Fregly, "Parallel Asynchronous Particle Swarm Optimization," International Journal for Numerical Methods in Engineering, 2006, 67: 578-595.