Optimizing Simulated Annealing
Patrick BANGERT
School of Engineering and Science, International University Bremen
P.O. Box 750 561, 28725 Bremen, Germany.
March 21, 2005
ABSTRACT
The method of simulated annealing was used to obtain a
heuristic solution for the minimum length word equivalent
to a given word in the braid groups (a known NP-complete
problem). The simulated annealing paradigm with a simple
cooling schedule leaves five parameters for the user to
choose; these were initially chosen empirically, based on
performance experiments, as is the usual practice. A
downhill simplex method was then developed to further
optimize these critical parameters, and a quality improvement
of up to 26.1% was observed. This additional improvement
made the algorithm competitive, on average, with custom
designed heuristics for this problem.
The conclusions going beyond the present combinatorial
problem are: (1) fine-tuning of cooling schedule parameters
is critical for the solution quality in simulated annealing, (2)
downhill simplex methods (as opposed to Newton’s method,
for example) are well-suited for this task and (3) significant
quality improvement is possible even for a simple cooling
schedule.
INTRODUCTION
Simulated Annealing
Simulated annealing is a paradigm method for optimization
problems of many types [8]. It was first presented in 1953 as
an idea from industrial physics together with an interpretation of this industrial process from statistical mechanics [4].
To form a strong alloy, one heats up the components, mixes
them well and lets them cool slowly according to a strict
(empirically determined) cooling schedule in order to minimize the number of defects in the solidification process of
the material. This physical annealing process is interpreted
in the setting of combinatorial optimization (by which we
mean finding the minimum of a cost function C(x) for the
vector of variables x in many dimensions) by the following
meta-algorithm:
Data: A candidate solution S and a cost function C(x).
Result: A solution S′ that minimizes the cost function C(x).

T ← Starting Temperature
while not frozen do
    while not at equilibrium do
        S′ ← perturbation of S
        if C(S′) < C(S) or selection criterion then S ← S′
    end
    T ← next lower temperature
end

Algorithm 1: General Simulated Annealing
In words, we begin with a guessed solution S and heat it
up using some starting temperature. While at one temperature, the system is allowed to transit to other states until it
reaches equilibrium at that temperature. Transitions that
lower cost are always accepted and transitions that increase
cost are accepted relative to some selection criterion that
usually is a function of both the current temperature and
the cost increase of the proposed change. When equilibrium
has been reached, the temperature is lowered, and this
continues until the system is at equilibrium at a very low
temperature, at which point we call it frozen. Due to the
equilibration and the guaranteed acceptance of a downhill
transition, the final state is certain to be a minimum in the
cost function. It may however be a local minimum and not
the global one. In order to increase the chances of getting
the global minimum, the algorithm allows uphill transitions
preferentially early in the execution of the algorithm, i.e. at
high temperatures.
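
To fix ideas, here is a minimal Python sketch of algorithm 1. It is our own illustration rather than code from the paper; the problem-specific pieces (perturbation, selection criterion, equilibrium and freezing tests, cooling step) are passed in as functions.

    def simulated_annealing(s, cost, perturb, accept_uphill, t_start,
                            cool, frozen, at_equilibrium):
        """General simulated annealing (algorithm 1), parameterized by
        the unspecified components discussed in the text."""
        t = t_start
        while not frozen(t):
            step = 0
            while not at_equilibrium(step, t):
                s_new = perturb(s)
                delta = cost(s_new) - cost(s)
                # Downhill transitions are always accepted; uphill
                # transitions only if the selection criterion allows it.
                if delta < 0 or accept_uphill(delta, t):
                    s = s_new
                step += 1
            t = cool(t)  # lower the temperature and equilibrate again
        return s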
Simple as this idea may seem, it is very powerful and this
general algorithm has been used to solve a variety of problems. It is regarded as the solution to the most studied
problem of optimization, the travelling salesman problem
[6]. Research on simulated annealing reached its heyday
in the mid-1980s, and the method has been employed in a variety of settings since then; a search reveals 986 Ph.D. theses in the last
18 years that focussed on using the method to solve some
problem of practical significance [3, 7]. The later advent
of genetic algorithms stole the limelight from simulated annealing for some time [8]. However, from direct comparisons
between these two approaches it appears that simulated annealing nearly always wins in all three important categories:
implementation time, use of computing resources (memory
and time) and solution quality [3, 8].
In our general presentation in algorithm 1, we have not
specified five crucial components in the method: (1) starting temperature, (2) freezing condition, (3) equilibrium condition, (4) perturbation mechanism and (5) selection criterion. The perturbation mechanism is problem dependent.
It defines a neighborhood structure in the space of possible solutions (which solutions can be reached by means of
a single transition from any given solution) and thus the
mechanism has considerable influence over the convergence
speed of simulated annealing. If the neighborhood is very
large, then we are able to explore the solution space quickly
but will have trouble settling down later. Conversely, if
the neighborhood is very small, convergence might be prohibitively slow. A healthy medium must be chosen but there
is no general theory available at present to discuss what a
good medium might be; we are firmly in the realm of empirical testing where this issue is concerned. Below, we shall
discuss an optimization problem coming from group theory. Here, the transition mechanism will be given naturally
by the presentation of the group but which presentation
to choose is still up to the user. In our example, there is
one presentation that is special for physical reasons and so
we do not, in fact, have a choice of transition mechanism.
The transition mechanism is certainly more complex than
the other elements of simulated annealing and cannot be
treated using the same methods as we shall treat these, i.e.
it cannot be continuously parameterized in general.
Taking the transition mechanism as discussed, we may
simplify the other steps in a general way. The simplest
version of simulated annealing sets five constants A, B, C,
D and E to some initial values and looks like this:

Data: A candidate solution S and a cost function C(x).
Result: A solution S′ that minimizes the cost function C(x).

T ← A
while T > B do
    for i = 1 to C do
        S′ ← perturbation of S
        if C(S′) < C(S) or Random < exp[(C(S) − C(S′))/DT] then S ← S′
    end
    T ← ET
end

Algorithm 2: Simple Simulated Annealing

In words, we start with a constant temperature A and define a constant temperature B to be the freezing point.
Equilibration is assumed to occur after at most C steps of
the proposal-acceptance loop, in which the selection criterion
is the thermodynamic Maxwell-Boltzmann distribution; after
equilibration, the temperature is reduced by the constant
factor E. The standard choices for these constants are A =
C(S), B equal to one hundredth of the best lower bound
on the cost, C = 1000, D = 1 and E = 0.9. After successfully
implementing this algorithm, one usually plays with
these parameters until the program behaves satisfactorily.
Implementing this method is very fast, and we observe from
the literature that the vast majority of applications use the
version of simulated annealing given in algorithm 2 with
the five parameters determined manually [3].
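
As a concrete counterpart, the simple schedule of algorithm 2 can be written out directly in Python; the defaults follow the standard choices quoted above, while A and B remain problem dependent. This is a sketch under our own naming, not the code used for the experiments.

    import math
    import random

    def simple_annealing(s, cost, perturb, a, b, c=1000, d=1.0, e=0.9):
        """Simple simulated annealing (algorithm 2) with the five
        constants A..E exposed as parameters a..e."""
        t = a                      # A: starting temperature
        while t > b:               # B: freezing temperature
            for _ in range(c):     # C: steps assumed to reach equilibrium
                s_new = perturb(s)
                delta = cost(s_new) - cost(s)
                # Maxwell-Boltzmann selection criterion with scale D:
                # accept uphill moves with probability exp(-delta/(D*T)).
                if delta < 0 or random.random() < math.exp(-delta / (d * t)):
                    s = s_new
            t *= e                 # E: geometric cooling factor
        return s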
The literature on simulated annealing makes two
marked omissions: neither a direct comparison of a variety of
cooling schedules nor a systematic investigation of the dependence of simulated annealing on the above parameters
has been made (only partial and scattered data are available on
both issues). In this paper, we wish to rectify
the second omission. We find that small changes in the
parameters of the simple cooling schedule presented above
can have significant effects on the average-case solution quality of the resulting simulated annealing algorithm.
We do this by means of a representative example: finding
the minimum length word in a braid group.
The Minimum Length Braid Word Problem
The braid groups are given by their Artin presentation
B_n = ⟨ {σ_i} : σ_i σ_j ≈ σ_j σ_i for |i − j| > 1, σ_i σ_{i+1} σ_i ≈ σ_{i+1} σ_i σ_{i+1} ⟩, 1 ≤ i, j < n.   (1)
Suppose that w ∈ B_n and let L(w) give the letter length of
the word w. The minimum word problem asks us to find
a word u ∈ B_n such that u ≈ w and L(u) ≤ L(v) for
any word v ∈ B_n with v ≈ w; that is, an equivalent word
of minimum length. In B_1 and B_2, the problem is trivial
and in B_3, a polynomial-time algorithm exists [2]. For B_n
with n > 3, the problem is known to be NP-complete [5].
A number of heuristic methods have been designed for this
problem and seem to work very well [1].

The cost function is clearly L(w) and the transitions are
the group relations, together with the obvious possibility of either
cancelling or inserting a pair σ_i σ_i^{-1} anywhere in the word.
As such, the cost function is always positive or zero.

This problem is representative of many combinatorial
problems: it is a symbol sequence with a certain number of allowed moves on it together with a cost function.
The travelling salesman problem and protein folding are
just two industrial examples that are very similar in nature
to minimizing braids [8]. The issue of braid minimization
is important when one models the magnetic field lines of
the solar corona [1]. As one cannot see inside the sun, the
lines appear to have endpoints, and due to magnetohydrodynamics the topology must be conserved, i.e. the field
lines are braided. The amount of energy stored in a braid
is proportional to the number of essential crossings in it, the crossings that are needed to achieve the topology defined by the braid. However, most braids are randomly
generated due to the erratic motion of the endpoints on the
photospheric surface; hence the attempt to find the shortest braid word topologically equivalent to a given one. The
modelling effort for magnetic field lines is directed towards
predicting solar flares and coronal mass ejections, which have
significant effects on the Earth and some of its important
systems (electrical power grids, satellites, communication
networks etc.). For these reasons, we believe braid minimization to be a representative and important problem to
study. Heuristics must be used as the problem is NP-complete for n > 3. Past efforts at designing a custom
heuristic have been successful, but the problem lends itself
well to simulated annealing, which we compare with the custom methods here.
SIMULATED ANNEALING AS A FUNCTION
Suppose we start simulated annealing with configuration
S and obtain configuration S ′ as a result, then we define
the reduction ratio α by
α = C(S′) / C(S)   (2)
As C(S ′ ) ≤ C(S) and cost is always non-negative, we have
0 ≤ α ≤ 1. The efficacy of the optimization method may
be measured by α at least on average when the initial solution S is generated randomly. The quality is high when
α is low (if α = 0.2, then 80% of initial cost has been removed). Clearly the α obtained varies between different
starting configurations and even between runs of simulated
annealing with the same starting configuration because of
the random element in the method. Therefore, we agree to
run simulated annealing N times using randomly generated
initial configurations of the same cost. The configurations
are all different but the cost must be the same in order to
get an accurate measurement of the reduction for a braid
of a particular length. Then we define

ᾱ = [C(S′_1) + C(S′_2) + · · · + C(S′_N)] / [N · C(S)],   (3)

where S′_k is the final configuration of the k-th run; this is the
average reduction ratio over the N runs. As
long as N is large enough, the random variations of the
method should cancel out and the result should become
predictable to within a given accuracy. Using this interpretation, we may regard the simulated annealing method as
defining a function ᾱ = ᾱ(A, B, C, D, E) depending on five
parameters (for the simple schedule). Since quality is high
when the reduction ratio is low, we would like the
average reduction ratio to be as small as possible.
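
In code, evaluating ᾱ amounts to averaging over N independent runs started from random words of the same length. The sketch below reuses the hypothetical simple_annealing and the braid moves from above; random_word is a placeholder generator of random braid words.

    import random

    def random_word(n, word_length):
        """A random braid word of the given letter length in B_n."""
        return [random.randint(1, n - 1) * random.choice([1, -1])
                for _ in range(word_length)]

    def average_reduction_ratio(params, n, word_length, runs=1000):
        """Estimate alpha-bar(A, B, C, D, E) as in equation (3)."""
        a, b, c, d, e = params

        def perturb(w):
            move = random.choice(
                [cancel_pair, swap_distant, lambda u: insert_pair(u, n)])
            return move(w)

        total = 0
        for _ in range(runs):
            start = random_word(n, word_length)  # initial cost = word_length
            final = simple_annealing(start, length, perturb,
                                     a, b, int(round(c)), d, e)
            total += length(final)
        # Every initial word has the same cost, so equation (3) reduces
        # to the mean final length divided by the initial length.
        return total / (runs * word_length)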
This is now an optimization problem for a function rather than a combinatorial problem. We can
evaluate the function only at considerable computational
cost (N runs of simulated annealing on N randomly generated initial configurations) and we do not know its derivative. Even approximating the derivative comes
only at heavy computational cost. Many optimization methods, such as Newton's method or, more generally, the family of methods known as Quasi-Newton or Newton-Kantorovich methods, rely on
computing the derivative of the objective function. Some
of them are computationally expensive because they compute the Hessian matrix, but complexity considerations are secondary here. The most important argument
against all these methods is that the derivative cannot be computed accurately for the function constructed here,
and this loss of accuracy in an iterative method would yield
meaningless answers. Indeed, such methods were tried; the results were unpredictable because of error accumulation and much worse than those obtained by
methods not requiring derivatives. The
method of choice for optimizing a function over several dimensions without computing its derivative is the downhill
simplex method (alternatively, one may use direction set
methods). Thus, we use the downhill simplex method to
minimize ᾱ(A, B, C, D, E).
The starting point for the simplex method is given
by those values of the five parameters that we obtained after
some manual experiments, because most practitioners of the simulated annealing paradigm
choose their parameters in exactly this way [3].
The other vertices of the simplex are set by manually estimating a length scale for each parameter [6].
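
One way to carry out this step (not necessarily the implementation used here) is SciPy's Nelder-Mead downhill simplex, which accepts an explicit initial simplex; the vertices below are the manually tuned point plus one point per parameter, displaced by a guessed length scale. All numerical values in this sketch are illustrative placeholders, and average_reduction_ratio is the hypothetical evaluator sketched above.

    import numpy as np
    from scipy.optimize import minimize

    # Manually tuned starting point (A, B, C, D, E) and a manually
    # estimated length scale per parameter; illustrative values only.
    x0 = np.array([100.0, 0.01, 1000.0, 1.0, 0.9])
    scale = np.array([20.0, 0.005, 200.0, 0.2, 0.05])

    # Initial simplex: x0 plus one vertex per parameter, displaced
    # along that coordinate by its length scale.
    simplex = np.vstack([x0] + [x0 + s * np.eye(5)[i]
                                for i, s in enumerate(scale)])

    result = minimize(
        lambda p: average_reduction_ratio(p, n=5, word_length=100),
        x0, method="Nelder-Mead",
        options={"initial_simplex": simplex, "xatol": 1e-2, "fatol": 1e-3})
    print(result.x)  # schedule parameters that minimize alpha-bar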
RESULTS AND DISCUSSION
We implemented the downhill simplex method to minimize ᾱ(A, B, C, D, E), where the annealing itself was implemented according to
algorithm 2 and ᾱ was evaluated using equation 3. We first
experimented manually, observing the reduction ratio and
attempting to find optimal parameters by hand; this is
the approach taken by the vast majority of practitioners
of the simulated annealing methodology, who assume that the
algorithm is robust enough to find the global minimum in
any case. The manually found point was then used as the starting point for
the downhill simplex method, which arrived at different values from
the manual search.
In table 1, we give the computational results, representing
about four months of computation time. The first column
gives the index of the braid group studied. The second
column is a custom heuristic that represents the state-of-the-art solution for this problem [1]. The third column is
the result of the manual simulated annealing search. The
fourth column gives the result of the downhill simplex optimized average reduction ratio (we took N = 1000 and
L(w) = 100). The fifth column gives the improvement
of the downhill simplex values over the manual values. The
numbers in brackets express the computed error in the last
digit. We easily see that the optimized simulated annealing algorithm is competitive with the custom heuristic (it
is actually better for n < 6).
Table 1: Reduction ratios α (average final length divided by
initial length) as a function of the number of strings n. The
quality of the solution is high when the reduction ratio is
low. For n < 6 the optimized simulated annealing algorithm
is better than the best hitherto known custom heuristic algorithm for the particular problem of braid minimization.
Moreover, the downhill simplex method significantly improves (by up to 26% in this experiment) the quality of
simulated annealing over the common manual tuning.
n    Crossing Force    Manual Annealing    Downhill Annealing    Change (%)
3    0.426(6)          0.327(3)            0.293(3)              10.4
4    0.430(5)          0.451(4)            0.384(4)              14.9
5    0.447(4)          0.537(6)            0.438(6)              18.4
6    0.455(4)          0.596(5)            0.474(5)              20.4
7    0.469(4)          0.734(3)            0.550(3)              25.1
8    0.471(4)          0.681(4)            0.518(4)              23.9
9    0.476(4)          0.541(5)            0.534(5)               1.3
10   0.481(4)          0.736(3)            0.544(3)              26.1
Many simulated annealing papers have been published
that center around the performance of the algorithm in
terms of reaching an acceptable minimum quickly [8]. A
variety of cooling schedules have been designed that can
reduce the computation time at the expense of solution
quality. While the author experimented with a number of
open-source implementations of simulated annealing for a
variety of optimization problems, using tools such as a profiler, speed-ups of up to three orders of magnitude were
achieved. This is in contrast to claimed speed-up factors
of between 1.2 and 2.0 that come from changing the cooling schedule at the expense of solution quality [8]. Thus
the author believes the speed of the simulated annealing
method to be so dominated by programming care that he
has not attempted to simultaneously optimize solution quality and execution speed. This simultaneous optimization
would, however, be no problem in principle, once one made
the essentially arbitrary decision of how important speed is
relative to quality.

While the results shown in columns three and four of table 1 were computed for initial braid lengths of 100 generators, a large number of experiments were made with other
initial lengths. As in previous work, we find an approximately logarithmic growth of ᾱ with respect to both group
index n and initial length L(w) [1]. The average reduction
ratio seems to become relatively constant after n = 10 and
L(w) = 100. However, the final parameter list found for
a particular n and L(w) depends very strongly on n and
L(w). This means that the five parameters for the simple simulated annealing schedule must be viewed as
functions of the input size and not as constants.
CONCLUSIONS
Thus we may draw a number of conclusions that would
appear to hold in general: (1) the solution quality obtained
using simulated annealing depends strongly on the numerical values of the parameters of the cooling schedule, (2) the
downhill simplex method is effective in locating the optimal
parameter values for a specific input size, (3) the parameters depend strongly on input size and should therefore not
be global constants for an optimization problem, and (4) the
improvement in solution quality can be significant for theoretical and practical problems (up to 26.1% improvement
was measured in these experiments, which is large enough
to have significant industrial impact).
Furthermore, the reason that the usual manual search is
so much worse than an automated search seems to be that
the solution quality (as measured by the average reduction
ratio) depends strongly on the cooling schedule parameters,
i.e. the landscape is a complex mountain range with narrow valleys that are hard to find manually. Finally, the
improved schedule parameters generally lead to slightly
greater execution time, but in view of the dramatic improvement in quality (as well as the fact that execution time
seems to be dominated by programming care) this is well
worth it.
ACKNOWLEDGEMENTS
I am very grateful to Milko Krastev for helping with the
software and the initial experiments and to the CLAMV
computing facility at the International University Bremen
for supplying the computing power.
REFERENCES
[1] P. D. Bangert, M. A. Berger and R. Prandi, In Search
of Minimal Random Braid Configurations, J.
Phys. A, 35 (2002), pp. 43–59.
[2] M. A. Berger, Minimum Crossing Numbers for
Three-Braids, J. Phys. A, 27 (1994), pp. 6205–6213.
[3] N. E. Collins, R. W. Eglese and B. L. Golden, Simulated
Annealing: An Annotated Bibliography, Am. J.
Math. Manag. Sci., 8 (1988), pp. 209–307.
[4] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A.
Teller and E. Teller, Equation of State Calculations by Fast Computing Machines, J. Chem.
Phys., 21 (1953), pp. 1087–1092.
[5] M. S. Paterson and A. A. Razborov, The Set of Minimal Braids is co-NP-Complete, J. Algorithms, 12
(1991), pp. 393–408.
[6] W. H. Press, S. A. Teukolsky, W. T. Vetterling and
B. P. Flannery, Numerical Recipes in C, Cambridge
University Press, Cambridge, 1992.
[7] ProQuest, http://www.umi.com/proquest/
[8] P. Salamon, P. Sibani and R. Frost, Facts, Conjectures, and Improvements for Simulated Annealing, Society for Industrial and Applied Mathematics,
Philadelphia, PA, 2002.