
Multi-Swarm Parallel PSO:

Hardware Implementation
Girma S. Tewolde, Darrin M. Hanna, and Richard E. Haskell

Abstract—The ever increasing popularity of the particle swarm optimization (PSO) algorithm has recently been attracting attention in the embedded computing world. Although PSO is considered efficient compared to other contemporary population-based optimization techniques, for many continuous multimodal and multidimensional problems it still suffers from performance loss when it is targeted onto embedded application platforms. Examples of such target applications include small mobile robots and distributed sensor nodes in sensor network applications. In previous work we presented a novel, modular, efficient and portable hardware architecture to accelerate the performance of the PSO for embedded applications. This paper extends that work by presenting a parallelization technique for further speedup of the PSO algorithm, dividing the swarm into a set of sub-swarms that execute in parallel. The underlying communication topology and messaging protocols are described. Finally, the performance of the proposed system is evaluated on mathematical and real-world benchmark functions.

Manuscript received November 12, 2008.
G. S. Tewolde is with the Department of Electrical and Computer Engineering, Kettering University, Flint, MI 48504 USA (e-mail: [email protected]).
D. M. Hanna is with the Electrical & Computer Engineering Dept., Oakland University, Rochester, MI 48309, USA (e-mail: [email protected]).
R. E. Haskell is with the Electrical & Computer Engineering Dept., Oakland University, Rochester, MI 48309, USA (e-mail: [email protected]).

978-1-4244-2762-8/09/$25.00 ©2009 IEEE

I. INTRODUCTION

Particle Swarm Optimization (PSO) [1]-[3] is a population-based stochastic heuristic algorithm developed to tackle complex optimization problems. Its attractive features of relative simplicity, efficiency and flexibility are some of the reasons why PSO is gaining strong ground to compete with other well-established global optimization techniques.
Since its initial introduction, PSO has attracted a great deal of interest from researchers in various areas trying to use the algorithm for real-world applications [3]-[8]. Similar efforts have also been devoted to theoretical and experimental research aimed at improving the algorithm's performance [9]-[17].
In recent years there has been an increasing desire to expand the application domain of the PSO algorithm to embedded systems and industrial areas of application. An embedded system is a special-purpose computer system designed to perform one or a few dedicated functions, often with real-time computing constraints. It is usually embedded as part of a complete device including hardware and mechanical parts.
The basic characteristic of embedded systems is that they run on platforms that have strict constraints on computational, communication, and energy resources, as well as restrictions on the physical size of the system. The central processing elements in such systems are typically low-capacity 8- or 16-bit processors that run at low clock frequencies, ranging from a few hundred kilohertz to a few hundred megahertz (i.e., one to three orders of magnitude lower than those of PCs and powerful workstations). Hence, one of the main goals in embedded system design is to find the right balance between these system constraints and the desired execution performance. Application examples from recent publications that could benefit from embedded implementations of the PSO include system parameter identification and optimal control design [18], unsupervised robotic learning [19], and power system optimization [8]. However, most previous works demonstrate the effectiveness of the PSO on these and related applications by way of simulation on PCs or high-performance workstations. There is little work that targets the PSO algorithm to embedded platforms. In our papers [20] and [21] we demonstrated the severe performance degradation of the software implementation of the PSO algorithm when it is run on low-capacity embedded processors. To address this problem we developed a novel, efficient, modular, flexible and general-purpose hardware architecture for the standard PSO targeted on programmable logic implementation technologies. The performance of the proposed hardware architecture, evaluated on mathematical and real-world test problems, is shown to be greatly accelerated.
The next logical step is to find schemes for further enhancement of the execution performance of the hardware PSO, especially for application targets that are endowed with large amounts of hardware resources. In this paper we present a hardware parallelization technique for the PSO algorithm, dividing the swarm of particles into multiple sub-swarms with parallel threads of PSO responsible for manipulating the individual sub-swarms.
This paper is organized as follows. Section II presents a short introduction to the PSO algorithm. In Section III we describe the benchmark test functions used for performance evaluation. The PSO performance on an embedded microcomputer is then evaluated and results presented in Section IV. For the performance evaluation we use a Freescale 16-bit microcomputer, the MC9S12DP256 [22], running at a 25 MHz clock.
The serial hardware architecture for a PSO engine targeted on a field programmable gate array (FPGA) and its performance evaluation are presented in Sections V and VI, respectively. In Section VII, the parallel architecture for the hardware PSO and its performance evaluation are presented. Finally, Section VIII provides concluding remarks.

II. PARTICLE SWARM OPTIMIZATION

Development of the Particle Swarm Optimization algorithm was inspired by the social behavior of bird flocks, fish schools, bee swarms, and animal herds [1]-[3]. Individual particles in a particle swarm represent candidate solutions for the optimization problem. The particles move around in the parameter space by using systematic rules to adjust their velocities and positions in response to the swarm's experience in locating quality solutions. Thus, the dynamics of the particles in the swarm is influenced by a combination of two factors: the personal and social best experiences.
Since the particles in the swarm fly in a continuous hyperspace in search of the optimum solution, the basic PSO algorithm is naturally suited for real-valued problems. Several modifications have also been proposed in the literature to enable PSO to handle discrete and combinatorial optimization problems.
One of the characteristic features of PSO is that each particle in the swarm has memory that allows it to keep track of its own personal best and neighborhood best experiences in the search process. At each step of the algorithm each individual particle adjusts its flying velocity in response to these cognitive and social learning experiences. In effect, the particles tend to be attracted to the best solution they have individually found and the best solution that any particle in their neighborhood has found.
When modeling a D-dimensional optimization problem, each particle X_i in the swarm is represented by a vector in the D-dimensional space, X_i = (x_i1, x_i2, …, x_iD)^T. Particle i's previous best position is designated by a vector P_i = (p_i1, p_i2, …, p_iD)^T, and its fitness evaluation at that point, called the personal best fitness, is pbest_i. The movement of each particle in the search space is determined by its velocity vector V_i = (v_i1, v_i2, …, v_iD)^T. Different variations of the velocity update equations have been proposed in the literature [3], [10], [11], [16], the inertia-weight and constriction-factor forms being the most common ones. For our implementations in this paper we use the constriction-factor form of the velocity update equation, given in equation (1) below [10], [11]. The position of the particles is updated using equation (2).

  v_ij(t) = χ·(v_ij(t−1) + φ1·r1·(p_ij − x_ij(t−1)) + φ2·r2·(p_gj − x_ij(t−1)))   (1)

  x_ij(t) = x_ij(t−1) + v_ij(t)   (2)

where i = 1, 2, …, N and j = 1, 2, …, D.
For the test runs carried out in this paper we choose a commonly used set of generic PSO parameter values, χ = 0.729, φ1 = φ2 = 2.05, which work well in most cases. When necessary, however, problem-specific parameter tuning strategies could be implemented for even better solution quality and PSO execution performance. In this regard, we have developed a hierarchical dual-PSO technique for this purpose, with a top-level PSO tuning the parameters of a function-optimizer PSO, resulting in greatly improved performance over PSO with generic parameters [23]. However, in this paper we focus only on the hardware technique for accelerating the execution performance for embedded applications.

III. BENCHMARK TEST FUNCTIONS

For purposes of testing and comparing different evolutionary optimization methods, a suite of mathematical benchmark functions is often used. Because of their relative simplicity for hardware implementation, for preliminary testing and evaluation of our proposed hardware architecture we use the Sphere and Rosenbrock functions [3]. In addition to these two functions, to evaluate the algorithm's performance on a real-world problem scenario, we also consider an emission source localization problem in a distributed sensor network environment [20]. These test functions are described below.

A. Mathematical Benchmark Functions

This subsection describes the Sphere and Rosenbrock benchmark functions we used for performance evaluation.

a) Sphere (De Jong's f1).

  f1(x) = Σ_{i=1}^{D} x_i²   (3)

b) Rosenbrock.

  f2(x) = Σ_{i=1}^{D−1} (100·(x_{i+1} − x_i²)² + (x_i − 1)²)   (4)

B. Emission Source Localization Problem

We consider a scenario where a distributed sensor network is used for localization of emission sources in an environment. A sensor network with the appropriate types of sensors is assumed to be deployed to detect the intensity of the emission at the sensor locations. The sensors have only limited perception of their immediate locality, with some degree of uncertainty. Further information, with test results on larger-scale problem dimensions and sensor noise levels and relative comparisons of PSO with other solution strategies, can be found in [20].
The fitness function f is formulated to capture the important parameters relating the sensor locations, their measurements, and the estimated emission source location and its intensity.
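To make the update rules concrete, the constriction-factor recursion of equations (1) and (2) and the benchmarks of equations (3) and (4) can be sketched in software. This is our own illustrative floating-point reference model (the function and variable names are ours), not the paper's hardware implementation:

```python
import random

# Generic parameters used in the paper: chi = 0.729, phi1 = phi2 = 2.05.
CHI, PHI1, PHI2 = 0.729, 2.05, 2.05

def sphere(x):
    """Equation (3): f1(x) = sum over i of x_i^2."""
    return sum(xi * xi for xi in x)

def rosenbrock(x):
    """Equation (4): f2(x) = sum over i of 100*(x_{i+1} - x_i^2)^2 + (x_i - 1)^2."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
               for i in range(len(x) - 1))

def update_particle(x, v, p, g):
    """Equations (1) and (2): constriction-factor velocity and position
    update for one particle, with fresh r1, r2 drawn per dimension."""
    for j in range(len(x)):
        r1, r2 = random.random(), random.random()
        v[j] = CHI * (v[j] + PHI1 * r1 * (p[j] - x[j]) + PHI2 * r2 * (g[j] - x[j]))
        x[j] += v[j]

def pso(fitness, dim, n=30, iters=200, lo=-100.0, hi=100.0):
    """A minimal gbest PSO loop built from the pieces above."""
    xs = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    vs = [[0.0] * dim for _ in range(n)]
    ps = [x[:] for x in xs]                      # personal best positions
    pbest = [fitness(x) for x in xs]             # personal best fitnesses
    gi = min(range(n), key=pbest.__getitem__)
    gx, gbest = ps[gi][:], pbest[gi]             # global best position/fitness
    for _ in range(iters):
        for i in range(n):
            update_particle(xs[i], vs[i], ps[i], gx)
            f = fitness(xs[i])
            if f < pbest[i]:
                pbest[i], ps[i] = f, xs[i][:]
                if f < gbest:
                    gbest, gx = f, xs[i][:]
    return gx, gbest
```

For example, `pso(sphere, 2)` drives the 2-D Sphere function toward its minimum at the origin; in the paper the same recursion is carried out by dedicated floating-point datapath units on the FPGA rather than by a software loop.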
Equation (5) is derived based on the least-squares method, whereby the sum of the squares of the errors between the measurement estimates from the theoretical model and the actual readings, over all sensors, is minimized.

  f(Q0, x0, y0, W) = Σ_{j=1}^{Ns} dQ_j² = Σ_{j=1}^{Ns} (Q0 − (q_j·d_j² + w_j·Q0))²   (5)

where W is the vector of the weighting factors w_j, j = 1, 2, …, Ns, introduced to capture measurement uncertainties. A summarized list of the benchmark test functions, their dimensions, ranges and error thresholds is given in Table 1. The emission source localization problem, for the tests in this paper, is configured with zero sensor measurement noise.

TABLE 1
BENCHMARK TEST FUNCTIONS, THEIR DIMENSIONS, RANGES, AND MAXIMUM ERROR THRESHOLDS.

  Function                       Dimension   Range         Error threshold
  Sphere                         30          [-100, 100]   < 0.01
  Rosenbrock                     30          [-30, 30]     < 100
  Emission source localization   30          [-50, 50]     < 1.0

IV. PERFORMANCE OF PSO ON EMBEDDED MICROCOMPUTERS

To analyze the performance of the PSO algorithm on embedded microcomputer platforms, we pick the Freescale MC9S12DP256 microcomputer [22], which is in common use for various embedded applications. This microcomputer comes with a 16-bit CPU, 12 Kbytes of RAM and 256 Kbytes of Flash EEPROM on chip, and runs at a maximum clock frequency of 25 MHz. The program and data of our PSO implementation comfortably fit within the available on-chip memory.

TABLE 2
EXECUTION PERFORMANCE RESULTS OF THE SOFTWARE IMPLEMENTATION OF THE PSO ON THE FREESCALE MC9S12DP256 MICROCOMPUTER @ 25 MHZ.

  Test functions                 No. Iters   Ex. Time (sec)   Ck. Cycles
  Sphere                         372.6       100.6            2.52E+09
  Rosenbrock                     695.4       254.6            6.36E+09
  Emission source localization   127.0       46.1             1.15E+09

Table 2 shows the average, over twenty independent runs on the embedded processor platform, of the execution performance of the PSO on the benchmark functions. From these results, we observe that the PSO implementation on the 16-bit MC9S12DP256 microcomputer running at a 25 MHz clock takes on average about 100, 255, and 46 seconds, respectively, to solve the 30-dimensional Sphere, Rosenbrock, and emission localization problems. By solving the problem we mean that a global best particle in the PSO population has managed to produce a result better than the given minimum error threshold.
The execution speed achieved by the software implementation of the PSO on the embedded processor cannot be considered fast enough for many real-time applications with strict timing constraints. In our previous work [20] and [21] we developed an efficient, modular, flexible, and reusable hardware architecture to accelerate the execution performance of the standard PSO algorithm. Due to strict hardware resource constraints on low-footprint embedded systems, the approach followed in the previous architecture is a serial implementation that allows sharing of the computational units by all particles in the swarm. However, when larger hardware resources are available it is possible to further speed up the algorithm's execution by a parallelization scheme that divides the entire swarm into a number of sub-swarms that are executed concurrently.

V. SERIAL HARDWARE ARCHITECTURE

Among previous work in the literature on hardware implementation of the PSO, Reynolds et al. [24] proposed an FPGA implementation of a modified version of the PSO for inversion of neural networks. Their goal was to simplify the hardware implementation by using different techniques, such as using deterministic approaches in the particle update operations rather than employing random number sources. They use fixed-point, rather than floating-point, number representation to minimize the required logic resources. Although this modified version is reported to have helped in reducing the hardware cost and greatly enhancing the execution speed, it also resulted in slightly degraded solution quality compared to the full-blown standard PSO implementation. The method is, however, presented as a problem-specific solution.
As in the software world, several approaches could be considered to map the PSO algorithm to hardware, depending on the available hardware resources and the required performance. For example, if sufficient hardware resource is available, one can consider a parallel implementation of the algorithm for greater performance. On the other hand, on target platforms with limited hardware resources a serial implementation of the PSO that allows reuse of computational elements by the entire swarm and the fitness evaluation module could produce a more compact design.
In the serial implementation of the PSO algorithm on hardware, all particles share common computational hardware resources for updating their velocity and position data, for evaluating their fitness values, and for updating their personal best and global best performance data. A control state machine is designed to regulate access to the shared computing and memory resources.
For our PSO engine all the swarm data, for the position, velocity, and fitness of the particles, is represented in single-precision floating-point numbers, using the standard IEEE-754 floating-point format. Using this format makes the design more easily reusable and portable than custom, but perhaps more efficient, number formats. However, the attractive features of floating-point numbers come at the cost of a relatively complex and resource-intensive computational engine to carry out the floating-point operations in hardware.
Thus, in small-density FPGA targets the need arises to share a limited number of hardware floating-point operators among the different components of the PSO algorithm. On the other hand, when the hardware PSO engine is targeted onto more capable FPGA devices, or multi-FPGA systems, the larger available computational resources could be exploited to add parallelism to some of the operations and hence achieve even greater improvements in performance. Figure 1 shows a high-level architecture for the direct serial implementation of the PSO on FPGA. Refer to [21] for details of the major functional blocks of the hardware architecture.

[Figure: block diagram of the serial PSO engine, showing the Control Unit, Datapath Unit, Fitness Evaluation Module, RNG Module, Floating-Point Operation Modules, Registers for Temporary Data, and Swarm Memory, connected by data, address, control, and status signals.]
Fig. 1. High level architecture for the direct serial implementation of the PSO on FPGA.

VI. PERFORMANCE EVALUATION OF SERIAL HARDWARE PSO

For prototyping, testing and evaluating the performance of the proposed serial hardware architecture, a Spartan-3E family FPGA, the XC3S500E [25], is used. For the performance evaluation we are mainly interested in the quality of the solution generated by the hardware PSO and its execution performance. Therefore, we incorporated in our designs small measurement and output units that count and display the number of iterations the algorithm executes to arrive at a given error threshold and the number of system clock cycles it takes for its execution. These quantities are the criteria we use to compare different implementations of the PSO algorithm.
All the benchmark functions were tested on the hardware PSO engine by running them 20 times each, using a different random number seed value for each run. Table 3 shows the execution performance results in terms of the average number of iterations and execution times (in seconds and clock cycles) required to arrive at the given error thresholds.
The execution speedups of the hardware PSO over the software implementation on the embedded processor platform are presented in Table 4. These numbers are calculated by dividing the execution time (in seconds) of the software implementation by the execution time of the direct serial hardware implementation on the FPGA. These results show that the serial hardware PSO running on an FPGA at a 25 MHz clock achieves accelerated performance, with average speedups ranging from 360 to 653 over the implementation on the 16-bit MC9S12DP256 processor running at the same clock frequency.

TABLE 3
AVERAGE EXECUTION PERFORMANCE OF THE HARDWARE IMPLEMENTATION OF THE PSO ON A SPARTAN-3E, XC3S500E FPGA, AT 25 MHZ CLOCK.

  Test functions                 No. Iterations   Ex. Time (sec)   Ck. Cycles
  Sphere                         338.7            0.28             6.93E+06
  Rosenbrock                     470.4            0.39             9.72E+06
  Emission source localization   132.0            0.11             2.75E+06

TABLE 4
EXECUTION SPEEDUP OF THE HARDWARE IMPLEMENTATION OF THE PSO OVER SOFTWARE IMPLEMENTATIONS ON EMBEDDED PROCESSORS.

  Test functions                 Execution time speedup over MC9S12DP256 @ 25 MHz
  Sphere                         360
  Rosenbrock                     653
  Emission source localization   419

VII. PARALLELIZATION SCHEME

Since PSO emulates a population-based multi-agent optimization, it naturally presents a good opportunity to parallelize the algorithm. Each particle of the swarm carries a potential solution that flies through the solution search space by modifying its velocity and position according to its own and its neighborhood's or global best performance histories. The computations involved in the PSO algorithm could be divided into operations that update the particle velocity and position information, on one hand, and computations required for the fitness evaluation of the particle in the objective function landscape, on the other.
Most previous work in the literature on parallel PSO implementations considers coarse-grained parallelization of the fitness evaluation computations over parallel clusters of computers [26]-[28]. Schutte et al. [26] present a synchronous parallel implementation of the PSO on a cluster of computers, testing on analytical and bio-mechanical system identification problems. The computing infrastructure in this configuration is organized into one coordinating node and several slave nodes for fitness function evaluations. The responsibilities of the coordinating node include the particle update operations and assigning fitness evaluation tasks to the parallel slave nodes. In the synchronous parallel implementation scheme the coordinating node waits for all the fitness evaluations to be completed before proceeding with updating the best fitness values.
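The synchronous master-slave organization described above can be sketched as follows. This is our own simplified software model of the scheme, not the cluster code of [26]: the coordinating node farms out the fitness evaluations and blocks until all of them return before touching the best values.

```python
from concurrent.futures import ThreadPoolExecutor

def synchronous_step(positions, fitness, n_workers=4):
    """One synchronous iteration: the coordinator assigns every fitness
    evaluation to the worker pool and waits for ALL of them to complete
    (the map below acts as the synchronization barrier); only then does
    it update the best-so-far record."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(fitness, positions))
    best = min(range(len(positions)), key=scores.__getitem__)
    return positions[best], scores[best]

# Tiny example: three candidate positions on a 1-D quadratic.
bx, bf = synchronous_step([[-3.0], [0.5], [2.0]], lambda x: x[0] ** 2)
```

Because of the barrier, one slow evaluation delays the whole swarm, which is exactly the drawback the asynchronous variants discussed next try to remove.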
Consequently, the swarm reacts more slowly to changes of the best fitness values and the corresponding positions. This causes unavoidable performance loss, especially if the slave nodes are heterogeneous or if the fitness evaluation process requires variable execution time depending on the position of the candidate solution in the search space.
Venter and Sobieszczanski-Sobieski [27] and Koh et al. [28] present strategies that apply an asynchronous parallelization scheme for the PSO algorithm on parallel clusters of computers. The main idea behind the asynchronous approach is to allow update of the best position and fitness data, in the current iteration, as they become available, without waiting for the completion of the fitness evaluation of all the particles in the swarm. In both of these reported works, similar to the work of Schutte et al., the strategy is coarse-grained parallelization of the fitness function computation tasks on a parallel cluster of computers in a master-slave configuration. According to the papers, the asynchronous approach outperforms the synchronous one in terms of parallel efficiency when heterogeneity exists in the computational task or the environment.

A. Degree of Parallelism

We define the degree of parallelism of an implementation as the number of parallel threads of execution that it can run simultaneously. The type and amount of resources of the hardware platform, the number of particles in the swarm, and the nature or complexity of the optimization problem are factors that determine the degree of parallelism for a hardware implementation of the PSO algorithm. For example, if enough hardware resource is available, one could consider employing a fully parallel implementation of the algorithm, with each particle having its own execution engine for velocity and position updates as well as for its fitness evaluation. On the other hand, if hardware space is limited and if the main execution bottleneck is the fitness function evaluation, a possible parallelization approach could benefit from parallelizing the fitness evaluation modules while using a common basic PSO engine for the particle updates.
For the embedded system applications that we are mainly interested in, such as the emission source localization problem scenario described in previous sections, the available hardware resources are often limited. So, unlike the case of clusters of tens to hundreds or thousands of high-performance computers, we often target our system to small-footprint microprocessors or FPGAs. The goal, then, is to come up with a suitable and efficient architecture of the parallel PSO for low- to medium-density FPGAs, with possible extensions to larger-capacity hardware platforms. Therefore, instead of a fully parallel implementation we consider sub-swarm or group-level parallelization, whereby the entire swarm is evenly divided into a number of sub-swarms, each of which has its own computational resources. The execution of the parallel sub-swarms is asynchronous. They communicate only when they discover a new global best fitness. The architecture we are proposing allows the communication to proceed in the background without interrupting the PSO execution of the sub-swarms.

B. Hardware Architecture

The architecture for the direct serial hardware implementation of the PSO is reused, with appropriate modifications, to make the independent components forming the individual threads of the parallel execution. To make the comparison between the serial and parallel implementations of the PSO as fair as possible we use the same generic parameters, including the swarm size, for both implementations. Thus, if the swarm size is N in the serial implementation, a parallel implementation with p parallel threads will have p sub-swarms, each with N/p particles. Each sub-swarm will have its own computation resources shared by all the particles in the sub-swarm. However, the p parallel sub-swarms need to exchange information about the best-performing particle they discover and the corresponding best fitness value. So, one of the main design decisions is how to define the communication strategy in such a way that the associated overhead does not overshadow the anticipated performance gain from the parallelization effort. Since in the gbest PSO topology the global best particle information is needed by all particles in the entire swarm for their velocity and position updates, we need to devise an efficient mechanism to provide access to this information. One approach for allowing access by all sub-swarms to the global best data is to use shared memory. However, since this data is required for all particles' velocity update operations, simultaneous access to this common data by all sub-swarms creates a potential bottleneck, causing performance loss.
As an alternative approach, instead of saving the global best information in a common shared memory location, we propose keeping local copies in each of the sub-swarms, totally eliminating the bottleneck due to shared memory access. However, this approach requires a communication mechanism to allow the sub-swarms to exchange the global best data as they discover it. For the physical interconnection between the sub-swarms, we employ a ring communication topology to interconnect them, along with a separate communication controller. In the illustration shown in Figure 2, SS#k (where k = 1, 2, …, p) are the parallel sub-swarms, each keeping local copies of the global best fitness (fgbest) and the global best particle position (gbx in mem_gbx).
As soon as sub-swarm k discovers a new global best fitness value it starts to execute a sequence of steps to share the information with the other sub-swarms, by first sending a communication request signal (newgb_k) to the communication controller module. The communication controller module handles such requests from the sub-swarms and acts to initiate the data communication of the newly discovered global best particle to all the sub-swarms.
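The request/grant protocol just described can be modeled in a few lines. This is a behavioral sketch with names of our choosing (the paper's controller is a hardware state machine): sub-swarms keep local copies of the global best, and the controller grants one newgb request per round, here by lowest sub-swarm ID, as the text suggests.

```python
def arbitrate(requests):
    """Grant exactly one pending newgb request per round. The text suggests
    the sub-swarm ID can serve as a fixed priority, so the lowest ID wins."""
    return min(requests) if requests else None

def broadcast(subswarms, winner):
    """Model of the post-SYNC data phase: the winner's local global-best
    copy is written into every other sub-swarm's local copy (no shared
    memory), unless a sub-swarm already holds a better value."""
    gbx, fgbest = subswarms[winner]['gbx'], subswarms[winner]['fgbest']
    for ss in subswarms:
        if ss['fgbest'] > fgbest:            # minimization: smaller is better
            ss['gbx'], ss['fgbest'] = gbx[:], fgbest

# Three sub-swarms with local copies; sub-swarms 1 and 2 request simultaneously.
subswarms = [{'gbx': [0.0], 'fgbest': 5.0},
             {'gbx': [1.0], 'fgbest': 2.0},
             {'gbx': [2.0], 'fgbest': 9.0}]
winner = arbitrate({1, 2})                   # lowest ID wins: sub-swarm 1
broadcast(subswarms, winner)
```

Note the property discussed in the text: had sub-swarm 2 held the better value, the fixed-priority grant would still pick sub-swarm 1, and sub-swarm 2's value would only propagate in a later round.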
[Figure: ring of parallel sub-swarms SS#1, SS#2, …, SS#p, each holding local copies of gbx and fgbest, connected by a data ring and to a Communication Controller via the newgb_k, ack_k, done_k, and SYNC signals.]
Figure 2. Communication infrastructure for exchanging information between parallel sub-swarms.

Although it is possible for more than one sub-swarm to assert newgb_k communication requests simultaneously, the communication controller is configured to acknowledge only one request at a time. To keep the design of the communication controller simple we do not send the newly discovered best fitness values from the requesting sub-swarms to the communication controller. Hence, for lack of knowledge of the actual newly discovered global best fitness values from the different sub-swarms, the communication controller simply chooses one of the sub-swarms in the case of multiple simultaneous requests. For example, the sub-swarm ID number k could be used as a priority number that the communication controller uses to resolve among such competing requests. Thus, only the chosen sub-swarm receives the acknowledgement (ack_k) signal.
From the above-mentioned strategy one can observe that there is no guarantee that the communication controller chooses the actual overall global best data to be distributed to all the sub-swarms when there are multiple simultaneous requests. However, we found the strategy described to be a better compromise than adding more complex functionality, with the associated cost of logic, to the communication controller to enable it to search for the actual overall global best fitness value when it receives competing requests. Furthermore, after the current communication is done it is possible for previously missed global best data to be rediscovered in the following iterations and transmitted to the other sub-swarms. In the communication process, immediately after sending the ack_k signal to the chosen sub-swarm, the communication controller sends a synchronization (SYNC) pulse to all the sub-swarms in the ring communication topology. The actual data communication then starts following the SYNC pulse.

C. Performance Evaluation

The available target hardware platforms we use for testing are the Xilinx Spartan-3E family XC3S500E FPGA and the larger-capacity Virtex-II Pro family XC2VP30 FPGA. The XC3S500E FPGA can fit just two parallel threads of the Sphere or the Rosenbrock benchmark test functions, and no more than a single serial thread of the emission source localization problem. On the other hand, the Virtex-II Pro FPGA fits up to five parallel threads of each of the test functions. Thus, for p = 2, the Sphere and Rosenbrock are tested on the XC3S500E, and the emission source localization problem on the XC2VP30. For p = 3 and 5, all functions are tested on the XC2VP30 target hardware.
For all test experiments we measured the execution performance of the hardware implementations of the parallel and serial PSO. For a fair comparison, other than the degree of parallelism, all algorithm parameters are kept the same for all implementations. The PSO parameters are set to their generic values. We use a swarm size of N = 30 for the serial implementations. For a parallel implementation with degree of parallelism p, the number of particles in each sub-swarm is set to N/p. Different initial seed values are set for the random number generators of the different sub-swarms so as to minimize any possible overlaps of the initial particle distributions of the different sub-swarms. Table 5 presents the execution performance of the serial PSO and the parallel one with different degrees of parallelism. The actual speedup of the parallel PSO over the serial implementation is computed as the execution time of the serial PSO divided by that of the parallel PSO for each of the different p values (Table 6).

TABLE 5
COMPARISON OF THE EXECUTION PERFORMANCES OF THE SERIAL AND PARALLEL (WITH P = 2, 3 AND 5) IMPLEMENTATIONS OF PSO.

                      p = 1              p = 2              p = 3              p = 5
  Test functions      Iters  Ck. cycles  Iters  Ck. cycles  Iters  Ck. cycles  Iters  Ck. cycles
  Sphere              339    6.93E6      330    3.36E6      368    2.49E6      447    1.81E6
  Rosenbrock          470    9.72E6      330    3.39E6      376    2.58E6      342    1.40E6
  Emission source
  localization        132    2.75E6      134    1.39E6      163    1.12E6      222    9.07E5

TABLE 6
SPEEDUPS OF THE PARALLEL IMPLEMENTATIONS OF PSO OVER THE SERIAL VERSION ON THE THREE TEST FUNCTIONS.

  Test functions                 p = 2   p = 3   p = 5
  Sphere                         2.06    2.78    3.83
  Rosenbrock                     2.87    3.77    6.94
  Emission source localization   1.98    2.46    3.03

The experimental test results demonstrate that the proposed parallel PSO architecture achieves execution speedups over the serial implementation, although the degrees of the speedups do not exactly match the degrees of parallelism. Due to the stochastic nature of the algorithm, the division of the optimization task into multiple sub-swarms of smaller sizes can have varying effects on performance. For example, among the three test scenarios, the parallel PSO has achieved the highest speedups for the Rosenbrock function for all degrees of parallelism.
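The Table 6 speedups follow directly from the clock-cycle counts in Table 5: serial cycles divided by parallel cycles for each p. A quick check, with the values transcribed from Table 5:

```python
# Clock cycles from Table 5 (p = 1 is the serial baseline).
cycles = {
    'Sphere':     {1: 6.93e6, 2: 3.36e6, 3: 2.49e6, 5: 1.81e6},
    'Rosenbrock': {1: 9.72e6, 2: 3.39e6, 3: 2.58e6, 5: 1.40e6},
    'Emission':   {1: 2.75e6, 2: 1.39e6, 3: 1.12e6, 5: 9.07e5},
}

# Speedup = serial cycles / parallel cycles, rounded to two decimals.
speedup = {f: {p: round(c[1] / c[p], 2) for p in (2, 3, 5)}
           for f, c in cycles.items()}

# These reproduce Table 6.
assert speedup['Sphere'] == {2: 2.06, 3: 2.78, 5: 3.83}
assert speedup['Rosenbrock'] == {2: 2.87, 3: 3.77, 5: 6.94}
assert speedup['Emission'] == {2: 1.98, 3: 2.46, 5: 3.03}
```

The check also makes the sub-linear scaling visible: only the Rosenbrock runs exceed their degree of parallelism (a super-linear effect of the smaller sub-swarms needing fewer iterations), while the other two functions fall below it.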
parallelism. For the Sphere and emission source localization Transactions on Evolutionary Computation, Vol. 6, No. 1, February
2002, pp. 58-73.
problems, however, the speedups are about equal to the [11] R.C. Eberhart and Y. Shi, “Comparing Inertia Weights and
number of parallel threads for p = 2, but it is lower for higher Constriction Factors in Particle Swarm Optimization,” The 2000
values of p. Congress on Evolutionary Computing, Vol. 1, pp. 84-88.
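As a quick check on these observations, parallel efficiency can be computed directly from Table 6 as speedup divided by p. The short script below (values copied from the table) shows efficiency staying above 1.0 for Rosenbrock at every p, while falling below 1.0 for the other two problems at p = 5.

```python
# Speedup = T_serial / T_parallel; efficiency = speedup / p.
# Speedup values below are taken from Table 6.
speedups = {
    "Sphere":                       {2: 2.06, 3: 2.78, 5: 3.83},
    "Rosenbrock":                   {2: 2.87, 3: 3.77, 5: 6.94},
    "Emission source localization": {2: 1.98, 3: 2.46, 5: 3.03},
}
for fn, by_p in speedups.items():
    for p, s in by_p.items():
        print(f"{fn}: p={p}, speedup={s:.2f}, efficiency={s / p:.2f}")
```

The superlinear Rosenbrock efficiencies are consistent with the stochastic search argument above: several small sub-swarms can, for some landscapes, reach a target fitness in fewer total evaluations than one large swarm.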
Different factors are believed to affect the performance of the
parallel PSO, including the nature of the problem, how well the
different sub-swarms subdivide the search space with minimal
overlap, the configuration of the random number sources, and the
communication strategy. In the current implementation, the same
type of random number generator, with different seed values, is
used for all the sub-swarms. Further work is required to better
understand the impact of the configurations of the different
random number sources on the performance of the parallel PSO.
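On FPGAs, such per-sub-swarm random sources are commonly realized as identical linear-feedback shift registers (LFSRs) loaded with distinct seeds. The 16-bit LFSR below is an illustrative assumption of how "same generator, different seeds" can yield separate streams; it is not the paper's actual circuit.

```python
# 16-bit maximal-length Fibonacci LFSR (taps 16, 14, 13, 11).
# Using an LFSR here is an assumption; the paper only states that the
# same generator type is reused with different seed values.
def lfsr16(seed):
    state = seed & 0xFFFF
    assert state != 0, "LFSR must be seeded with a nonzero value"
    while True:
        bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state

# One independent stream per sub-swarm, distinguished only by its seed.
streams = [lfsr16(seed) for seed in (0xACE1, 0x1234, 0xBEEF)]
samples = [[next(g) for _ in range(4)] for g in streams]
```

Because all streams come from one generator polynomial, they are shifted copies of a single sequence; how much such seed-only separation correlates the sub-swarms is exactly the open question raised above.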
VIII. CONCLUSION

This paper presented a parallelization scheme to speed up the
performance of hardware PSO. The proposed architecture employs
sub-swarm, or group-level, parallelization for targets on low- to
medium-density FPGAs, with possible extension to larger-capacity
hardware platforms. The effectiveness of the proposed architecture
has been demonstrated on different test functions for various
degrees of parallelism. In future work, systematic approaches to
partitioning the global search space into the working regions of
the different sub-swarms will be investigated. Such strategies
could potentially reduce the overlap between the partitions of the
search space handled by the different sub-swarms, hence possibly
reducing the total execution effort required to reach a desirable
solution.
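As a rough illustration of what such a partitioning strategy could look like (it is future work, not part of the presented architecture), the global search box could be sliced into p disjoint working regions, one per sub-swarm; the axis choice and equal slice widths below are assumptions.

```python
# Hypothetical sketch: split a d-dimensional search box into p disjoint
# regions along one axis, one region per sub-swarm.
def partition_space(lower, upper, p, axis=0):
    width = (upper[axis] - lower[axis]) / p
    regions = []
    for k in range(p):
        lo, hi = list(lower), list(upper)
        lo[axis] = lower[axis] + k * width
        hi[axis] = lower[axis] + (k + 1) * width
        regions.append((lo, hi))  # (lower corner, upper corner) of region k
    return regions

# 3 sub-swarms over the box [-10, 10]^2, split along dimension 0:
regions = partition_space([-10.0, -10.0], [10.0, 10.0], p=3)
```

The regions are contiguous and non-overlapping by construction, which is the property the conclusion targets: no two sub-swarms spend effort searching the same portion of the space.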
REFERENCES

[1] R. Eberhart and J. Kennedy, "A New Optimizer Using Particle Swarm Theory," Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39-43.
[2] J. Kennedy and R. C. Eberhart, "Particle Swarm Optimization," Proceedings of the IEEE International Conference on Neural Networks, 1995, Vol. 4, pp. 1942-1948.
[3] J. Kennedy and R. C. Eberhart, Swarm Intelligence. Morgan Kaufmann Publishers, 2001.
[4] X. Cui, T. E. Potok, and P. Palathingal, "Document Clustering Using Particle Swarm Optimization," Proceedings of the 2005 IEEE Swarm Intelligence Symposium (SIS 2005), pp. 185-191.
[5] A. S. Mohais, R. Mohais, and C. Ward, "Earthquake Classifying Neural Networks Trained with Dynamic Neighborhood PSOs," Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (GECCO 2007), pp. 110-117.
[6] G. S. Tewolde and D. M. Hanna, "Particle Swarm Optimization for Classification of Breast Cancer Data Using Single and Multisurface Methods of Data Separation," The 2007 IEEE International Conference on Electro-Information Technology (EIT 2007).
[7] G. S. Tewolde and D. M. Hanna, "Particle Swarm Optimization for Cluster-Based Classification of Breast Cancer Data," The 2007 International Conference on Genetic and Evolutionary Methods (GEM 2007).
[8] B. Yang, Y. Chen, and Z. Zhao, "Survey on Applications of Particle Swarm Optimization in Electric Power Systems," IEEE International Conference on Control and Automation (ICCA 2007), pp. 481-486.
[9] D. Bratton and J. Kennedy, "Defining a Standard for Particle Swarm Optimization," Proceedings of the 2007 IEEE Swarm Intelligence Symposium (SIS 2007).
[10] M. Clerc and J. Kennedy, "The Particle Swarm – Explosion, Stability and Convergence in a Multidimensional Complex Space," IEEE Transactions on Evolutionary Computation, Vol. 6, No. 1, February 2002, pp. 58-73.
[11] R. C. Eberhart and Y. Shi, "Comparing Inertia Weights and Constriction Factors in Particle Swarm Optimization," The 2000 Congress on Evolutionary Computation, Vol. 1, pp. 84-88.
[12] M. Meissner, M. Schmuker, and G. Schneider, "Optimized Particle Swarm Optimization (OPSO) and its Application to Artificial Neural Network Training," BMC Bioinformatics, 2006, vol. 7:125.
[13] R. Mendes, Population Topologies and Their Influence in Particle Swarm Performance, PhD Thesis, Universidade do Minho, Braga, Portugal, Feb. 2004.
[14] R. Mendes, J. Kennedy, and J. Neves, "The Fully Informed Particle Swarm: Simpler, Maybe Better," IEEE Transactions on Evolutionary Computation, 2004.
[15] F. van den Bergh and A. P. Engelbrecht, "A Cooperative Approach to Particle Swarm Optimization," IEEE Transactions on Evolutionary Computation, vol. 8, no. 3, pp. 225-239, Jun. 2004.
[16] Y. Shi and R. C. Eberhart, "A Modified Particle Swarm Optimizer," Proceedings of the IEEE Congress on Evolutionary Computation (CEC 1998), Piscataway, NJ, 1998, pp. 69-73.
[17] Y. Shi and R. C. Eberhart, "Parameter Selection in Particle Swarm Optimization," The 7th Annual Conference on Evolutionary Programming, San Diego, USA, 1998.
[18] Y. Liu, J. Zhang, and S. Wang, "Optimization Design Based on PSO Algorithm for PID Controller," Fifth World Congress on Intelligent Control and Automation (WCICA 2004), Vol. 3, pp. 2419-2422.
[19] J. Pugh, A. Martinoli, and Y. Zhang, "Particle Swarm Optimization for Unsupervised Robotic Learning," Proceedings of the 2005 IEEE Swarm Intelligence Symposium (SIS 2005), pp. 92-99.
[20] G. S. Tewolde, D. M. Hanna, and R. E. Haskell, "Hardware PSO for Sensor Network Applications," Proceedings of the 2008 IEEE Swarm Intelligence Symposium (SIS 2008).
[21] G. S. Tewolde, D. M. Hanna, and R. E. Haskell, "Accelerating the Performance of Particle Swarm Optimization for Embedded Application," submitted to the 2009 IEEE Congress on Evolutionary Computation (CEC 2009).
[22] Freescale, MC9S12DP256B Device User Guide V02.15. [Online]. Available: http://www.freescale.com/files/microcontrollers/doc/data_sheet/
[23] G. S. Tewolde, D. M. Hanna, and R. E. Haskell, "Enhancing Performance of PSO with Automatic Parameter Tuning Technique," submitted to the 2009 IEEE Swarm Intelligence Symposium (SIS 2009).
[24] P. D. Reynolds, R. W. Duren, M. L. Trumbo, and R. J. Marks II, "FPGA Implementation of Particle Swarm Optimization for Inversion of Large Neural Networks," Proceedings of the 2005 IEEE Swarm Intelligence Symposium (SIS 2005).
[25] Xilinx, Xilinx Spartan-3E FPGA Family: Complete Data Sheet, 2007. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds312.pdf
[26] J. F. Schutte, J. A. Reinbolt, B. J. Fregly, R. T. Haftka, and A. D. George, "Parallel Global Optimization with Particle Swarm Algorithm," International Journal for Numerical Methods in Engineering, 61(13), 2004, pp. 2296-2315.
[27] G. Venter and J. Sobieszczanski-Sobieski, "A Parallel Particle Swarm Optimization Algorithm Accelerated by Asynchronous Evaluations," 6th World Congress of Structural and Multidisciplinary Optimization, 2005.
[28] B. Koh, A. D. George, R. D. Haftka, and B. J. Fregly, "Parallel Asynchronous Particle Swarm Optimization," International Journal for Numerical Methods in Engineering, 2006, 67: 578-595.