Best Practices For Comparing Optimization Algorithms
Optimization and Engineering, June 2017
The final publication is available at Springer via http://dx.doi.org/10.1007/s11081-017-9366-1
Abstract
Comparing, or benchmarking, optimization algorithms is a complicated task that
involves many subtle considerations to yield a fair and unbiased evaluation. In this
paper, we systematically review the benchmarking process of optimization algorithms,
and discuss the challenges of fair comparison. We provide suggestions for each step of
the comparison process and highlight the pitfalls to avoid when evaluating the perfor-
mance of optimization algorithms. We also discuss various methods of reporting the
benchmarking results. Finally, some suggestions for future research are presented to
improve the current benchmarking process.
Keywords. Benchmarking; Algorithm comparison; Guidelines; Performance; Soft-
ware; Testing; Metric; Timing; Optimization algorithms.
1 Introduction
As the number of optimization methods, and implementations of those methods, has in-
creased, researchers have pursued comparative studies to evaluate their performance. When
done well, such studies can be of great value in helping end-users choose the most suitable op-
timization method for their problems. Such studies are generally referred to as optimization
benchmarking.
In the most general sense, benchmarking is the comparison of one or more products
to an industrial standard product over a series of performance metrics. In the case of
benchmarking optimization algorithms, the products are the specific implementations of
given algorithms, and the performance metrics are generated by running the implementations
on a series of test problems. This framework presents a certain clarity in benchmarking
optimization algorithms, as there is at least some agreement on what constitutes “better”.
If one algorithm runs faster, uses less memory, and returns a better final function value,
on all possible problems, then it can be considered better than the alternative. Of course,
in practice such a clear conclusion seldom arises. Thus, interpreting the conclusions of
algorithmic comparisons can be tricky.
Nonetheless, when done well, benchmarking optimization algorithms can have great prac-
tical value. It can reveal both strengths and weaknesses of an algorithm, which allows for
better research focus. It can aid in determining if a new version of optimization software is
performing up to expectations. And, it can help guide end-users in selecting a good choice
of algorithm for a particular real-world problem.
However, when done poorly, benchmarking optimization algorithms can also be mislead-
ing. It can hide an algorithm’s weaknesses (or strengths), report improvements that do not
exist, or suggest the incorrect algorithmic choice for a given situation.
In optimization benchmarking many subjective choices are made, such as the test set
to solve, the computing environment to use, the performance criteria to measure, etc. Our
primary objective in this paper is to help researchers to design a proper benchmarking
approach that is more comprehensive, less biased, and less subject to variations within
a particular software or hardware environment. Our secondary objective is to provide a
comprehensive review of the benchmarking literature for optimization algorithms.
In pursuing these objectives, we focus on single-objective optimization algorithms that
run in serial (i.e., that do not use parallel processing). Comparing algorithms for multi-
objective optimization, or optimization algorithms that use parallel processing, introduces
new levels of complexity to the benchmarking process. While we provide some comments
on the challenges for benchmarking some algorithms in the conclusions, we consider these
issues outside of the scope of this paper.
We also note that much of the presentation within this paper discusses algorithms as
if the underlying optimization problem is a continuous unconstrained problem. This is for
ease of presentation, and in most cases translating the ideas to other styles of optimization
problems is clear. As such, we limit ourselves to discussing specific styles of optimization
problems only when the translation is not straight-forward.
delivered an updated set of guidelines. In 2002, Dolan and Moré introduced performance
profiles [DM02], which have rapidly become a gold standard in benchmarking of optimiza-
tion algorithms, with more recent work pointing out their limitations [GS16]. In this paper,
we attempt to provide a modern picture of best-practice in the optimization benchmarking
process.
to deal with nonconvexity when applying a proximal-bundle method to a nonsmooth opti-
mization problem. As such, to see the value of the method, the authors compare it against
other proximal-bundle methods on a collection of nonconvex nonsmooth optimization prob-
lems. If they had compared their method against a quasi-Newton method on smooth convex
optimization problems, then very little insight would have been gained.
Regardless of the reason, another question researchers must consider is what aspect of the
algorithm is most important. Is a fast algorithm that returns infeasible solutions acceptable?
Is it more important that an algorithm solves every problem, or that its average performance
is very good? Is the goal to find a global minimizer, or a highly accurate local minimizer?
The answers to these questions should guide the choice of performance metrics that need
to be collected (Section 4) and how they should be analyzed (Section 5). Answering these
types of questions before running the experiments is time well spent.
3 Test sets
A test problem contains a test function together with some further criteria such as the
constraint set, feasible domain, starting points, etc. A test set is a collection of test problems.
Obviously, benchmarking only yields meaningful results when competing algorithms are
evaluated on the same test set with the same performance measures.
The selection of the appropriate test sets to benchmark the performance of optimization
algorithms is a widely noticed issue among researchers [Pin07, DM04, JBNP90, SSL13, Zv08].
Generally, there are three sources for collecting test problems: real-world problems, pre-
generated problems, and randomly-generated problems. Real-world problems can be found
through instances of specific applications, and pre-generated problems exist in common test
set libraries, see Table 1. Conversely, randomly-generated test problems are often more ad
hoc in nature, with researchers creating methods to randomly generate test problems that
are only used in a single paper [NN91, HS10, HP14] (among many others). However, some
researchers have gone to the effort to study methods to randomly-generate test problems for
a given area; some examples appear in Table 2.
While the real-world test sets provide specialized information about the performance
of the optimization algorithms within a specific application, the results may be difficult to
generalize. The difficulties lie in the fact that real-world test sets are often small and the
problems are often application-specific. Nonetheless, if the goal is to determine the best
algorithm to use for a particular real-world application, then a real-world test set focused
on that application is usually the best option.
On the other hand, the artificial and randomly-generated test sets can provide useful
information about the algorithmic characteristics of optimization algorithms. Artificial and
randomly-generated test sets can be extremely large in size, thereby providing an enormous
amount of comparative data. However, it can be difficult to rationalize their connection to
the real-world performance of optimization algorithms. If the goal is to compare a collection
of algorithms across a very wide spectrum, then artificial and randomly-generated test sets
are usually the better option.
When selecting a test set, it is always important to keep the particular goal of the
comparison in mind. Regardless of the goal, an appropriate test set should generally seek
to avoid the following deficiencies.
i. Too few problems. Having more problems in the test set makes the experiment
more reliable and helps the results reveal more information about the strengths or
weaknesses of the evaluated algorithms.
ii. Too little variety in problem difficulty. A test set containing only simple problems
is not enough to identify the strengths and weaknesses of algorithms. In contrast, a
test set which only has problems that are so difficult that no algorithm can solve them,
clearly, does not provide useful information on the relative performance of algorithms.
iii. Problems with unknown solutions. When possible, it is better to use test prob-
lems where the solution is known. Depending on the analysis performed (see section
5), the “solution” could be interpreted as the minimum function value, or the set of
global minimizers. Having access to the solution greatly improves the ability to eval-
uate the quality of the algorithmic output. However, when the test set is comprised
of real-world test problems, then a lack of known solutions may need to be accepted
as inevitable.
iv. Biased starting points. Allowing different algorithms to use different starting-
points will obviously bias the result. However, more subtle problems can also exist.
For example, if a starting point lies on the boundary of a constraint set, then an
interior point method will be severely disadvantaged. Another example comes from
considering the Beale test function, which has a global minimizer at (3, 0.5) [MGH81].
If a compass search (see, e.g., [KLT03]) with an initial step length of 1 is started at
(0.5, 0.5), then it will converge to the exact minimizer in just 4 iterations. However,
if a starting point of (0.51, 0.51) is used, then the exact same algorithm will use 63
iterations to reach a point within 10−2 of the global minimizer.1
v. Hidden structures. Many test sets have some structure that is not realistic in real-
world problems. For example, about 50% of the problems in the test set [MGH81] have
solution points that occur at integer valued coordinates. An algorithm that employs
some form of rounding may perform better than usual on these problems.
Thus, when choosing test sets for the benchmarking task the following considerations
should be taken into account as much as possible.
i. If the test set contains only few problems, then the experiment should be referred to
as a case study or a proof of concept, but not benchmarking. While there is no fixed
number that determines what is “enough problems to be considered benchmarking”,
we recommend that in order to achieve a reliable conclusion about the performance, an
experiment should contain at least 20 test problems (preferably more). In the specific
case of comparing a new version of an optimization algorithm with a previous version,
the number of test problems should be significantly greater – in the order of 100 or
more. In all cases, the more problems tested, the more reliable the conclusions.
ii. When possible, a test set should include at least two groups of problems: an easy
group which consists of the problems that are easy to solve within a reasonable time
on a regular contemporary computer using all the optimization algorithms tested,
and a hard group that contains the problems which are solvable but computationally
expensive and may require a specific optimization algorithm.
iii. Whenever possible, ensure at least a portion of the test set includes problems with
known solutions.
iv. For test sets that include starting points, new starting points can be generated by
introducing small (random) perturbations to the given starting points. For other
test sets, randomly-generated starting points can be created from scratch. In either
case, all starting points should be created for each problem, and then every algorithm
should be provided the same starting point for testing. This approach can be further
used to increase result reliability, by repeating tests on the same function with a variety
of starting points.
v. Examine the test set with a critical eye and try to determine any hidden structure.
Some structures can be removed through methods similar to the random perturbation
of starting points in (iv). One quick test is to set an algorithm to minimize f (x)
starting at x0 and then set the algorithm to minimize f̂(x) = f(x − p) starting from
x̂0 = x0 + p (where p is any random point). Constraints can then be shifted in a similar
manner, effectively shifting the entire problem horizontally by the vector p. While it
relocates the origin, and moves any constraints away from special integer values, it
has no theoretical effect on the geometry of the problem. As such, the results of
both tests should be extremely similar (theoretically they should be identical, but
numerical errors may cause some deviation). If the results of both tests differ, then
perhaps some hidden structure is being exploited by the algorithm, or perhaps some
hidden constraints are causing issues. Regardless of the reason, the researcher should
recognize the issue and consider a wider test set.
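To make the shift test in item (v) concrete, the following sketch is our own illustration (not code from the paper); it assumes NumPy and SciPy are available and uses SciPy's Nelder–Mead as a stand-in solver, together with the Beale function mentioned in item (iv).

```python
import numpy as np
from scipy.optimize import minimize

def shift_test(f, x0, solver="Nelder-Mead", seed=0):
    """Run a solver on f(x) and on the shifted problem
    f_hat(x) = f(x - p) started from x0 + p, and report both minima."""
    rng = np.random.default_rng(seed)
    p = rng.normal(size=len(x0))                     # random shift vector
    res_orig = minimize(f, x0, method=solver)
    res_shift = minimize(lambda x: f(x - p), x0 + p, method=solver)
    # Theoretically identical; a large gap suggests the solver exploits
    # hidden structure (e.g., integer-valued solution coordinates).
    return res_orig.fun, res_shift.fun

# Beale function, global minimizer at (3, 0.5):
beale = lambda x: ((1.5 - x[0] + x[0] * x[1]) ** 2
                   + (2.25 - x[0] + x[0] * x[1] ** 2) ** 2
                   + (2.625 - x[0] + x[0] * x[1] ** 3) ** 2)
print(shift_test(beale, np.array([0.5, 0.5])))
```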
Using suitable standard test sets is usually a good option when benchmarking optimiza-
tion algorithms. In particular, it is usually easier to compare results across research groups
when standard tests are employed, although even within a specific research field there is gen-
erally no consensus on the appropriate test set to draw specific conclusions. Many interesting
and diverse test sets have been reported in the literature, see Tables 1 and 2.
Producing random test sets using test problem generators has its own drawbacks. Are
the generated problems representative or difficult? Is there any hidden structure in the
problems? Some papers that use random test problem generators are listed in Table 2.
Figure 1 shows a decision tree that summarizes the fundamental decisions required for
assembling an appropriate test set for benchmarking of optimization algorithms.
[Figure 1: a decision tree for test-set selection: a new algorithm calls for a test set of at least 20 problems, while a version update calls for at least 100 (see Table 2 for sources); for starting points, whether fixed or random, ensure that no algorithm abuses starting-point information and that all algorithms are given the same starting information.]
4 Performance measures
In general, performance measures fall into three categories: efficiency, reliability, and quality of
algorithmic output. We discuss these performance categories in Subsections 4.1, 4.2, and 4.3.
Table 3 provides a classification of the comparative measures for optimization algorithms
based partly on the guidelines provided by Hoffman and Jackson [HJ82].
In order to allow for maximal data analysis (and thereby the best understanding of
overall performance), it is recommended to collect at least some data from every performance
category.
4.1 Efficiency
The efficiency of an optimization algorithm refers to the computational effort required to
obtain a solution. In mathematical programming, there are two primary measures of effi-
ciency, the number of fundamental evaluations and the running time. A third, less common,
measure is memory usage.
Number of fundamental evaluations: The term fundamental evaluation is used
to refer to any subroutine that is called by the algorithm in order to gain fundamental
information about the optimization problem. The most obvious example is an objective
function evaluation but the evaluation may involve complex simulation algorithms. Other
fundamental evaluations could include gradient evaluations, Hessian evaluations, or con-
straint function evaluations. The number of fundamental evaluations can be used as a
standard unit of time, and is often assumed to be platform independent. In many sit-
uations, the number of fundamental evaluations is a particularly important measure, as
for real-world problems these evaluations often dominate over the internal workings of the
algorithm [HS83, CDM79, DS78, CGT96, Bar87, NN91, VBB05, HL70, MTL72, Asa73,
SNSH+ 03, NSHV05, Eas82, ALDP14, Evt85, PSKv14, KS15]. Note however that this mea-
sure is unreasonable when fundamental evaluations do not dominate the internal workings
of the algorithm [AKZ05].
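In practice, fundamental evaluations can be counted without modifying the solver by wrapping each oracle in a counting layer. The sketch below is our own illustration and the class name is hypothetical.

```python
class CountedFunction:
    """Wrap a callable (objective, gradient, constraint, ...) and count calls."""

    def __init__(self, fun):
        self.fun = fun
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1                      # one more fundamental evaluation
        return self.fun(*args, **kwargs)

# Usage sketch: wrap every oracle handed to the solver, then read the counters.
#   f = CountedFunction(objective); g = CountedFunction(gradient)
#   solver(f, g, x0); total_evaluations = f.count + g.count
```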
Running Time: Running time, as a measure for optimization benchmarking, is usually
measured by either CPU time or wall clock time.2 Wall clock time contains CPU time, and
has been argued to be more useful in a real-world setting [McG02]. However, wall clock time
is not reproducible or verifiable, since it is tied to a specific hardware platform and software
combination. CPU time is considerably more stable, as it is independent of background
operations of the computer. Moreover, CPU time is more-or-less consistent for the same
version of an operating system running on the same computer architecture.
2 Wall clock time refers to the amount of time the human tester has to wait to get an answer from the
computer. Conversely, CPU time is the amount of time the CPU spends on the algorithm, at the exclusion
of operating system tasks and other processes.
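As an illustration of recording both clocks for a single run (our sketch, not code from the paper), Python's standard library exposes time.process_time for CPU time and time.perf_counter for wall clock time; the solver call below is a placeholder.

```python
import time

def timed_run(solve, *args, **kwargs):
    """Return (result, cpu_seconds, wall_seconds) for one solver run."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    result = solve(*args, **kwargs)      # placeholder for the actual solver call
    cpu1, wall1 = time.process_time(), time.perf_counter()
    return result, cpu1 - cpu0, wall1 - wall0
```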
It should be noted that, due to the wide variety and complexity of modern computing
architectures, the number of situations in which time is dominated by memory access costs
is increasing, hence the precision of CPU timers has been reduced. To improve the precision
of CPU timers, tools such as cache and memory access tracers can help obtain a more
accurate measurement of CPU time. For a more detailed discussion of these techniques we
refer to [Knu94, LL96].
Another issue with CPU time is the increasing prevalence of multi-core machines. Thor-
ough reporting would require indicating the number of cores, the CPU time for each core,
but also how efficiently the different levels of memory were used and cache hits/misses. Since
such measurements are not straightforward to obtain for multi-core machines, the wall-clock
time along with the hardware specifications are usually reported. (Unless the new algorithm
contribution focuses specifically on optimizing computation for a multi-core architecture, in
which case more precise measures are warranted.) Ultimately, the onus is on the researchers
to explain how simplified measurements support the conclusions drawn; this is especially
true for multi-core machines.
Regardless of whether wall clock time or CPU time is used, in order to maximize the
reliability of the data, it is important to ensure that any background operations of the
computer are kept to a minimum. Furthermore, any manuscript regarding the benchmarking
should clearly state which form of running time was collected.
Other measures: In addition to the categorization presented above, in some specific
cases, there is another issue that influences the choice of an appropriate measure for running
time: the type of algorithm. For example, to evaluate the running time for branch-and-
bound based algorithms, the number of branch-and-bound nodes is a common criterion,
while for simplex and interior point based algorithms, the number of iterations is often
used. Therefore, when deciding on the choice of a suitable efficiency measure, the type of
algorithms to be evaluated should also be taken into account.
4.2 Reliability
The reliability and robustness of an optimization algorithm are defined as the ability of the
algorithm to “perform well” over a wide range of optimization problems [MGH81, BGK+ 95].
The most common performance measure to evaluate the reliability is success rate [TŽ89,
RM78, Eas82, SS00]. Success rate is gauged by counting the number of test problems that are
successfully solved within a pre-selected tolerance. This can be done using objective function
value, or distance of the solution point from a minimizer. In convex optimization these
two approaches are largely, but not perfectly, interchangeable. However, if the objective
function has multiple local minimizers, then the researcher must decide whether good local
solutions are acceptable outcomes, or if the algorithm must converge to a global minimizer
[Sch80, RW77]. In addition to the success rate, the average objective function values and the
average constraint violation values have also been reported to measure reliability [Sch80].
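As a simple illustration (our own sketch, assuming NumPy), one possible fixed-target success test declares a problem solved when the final function value closes all but a fraction tol of the initial gap to the known optimal value; f_found, f_star, and f_0 are arrays over the test problems.

```python
import numpy as np

def success_rate(f_found, f_star, f_0, tol=1e-6):
    """Fraction of problems with f_found - f_star <= tol * (f_0 - f_star)."""
    f_found, f_star, f_0 = map(np.asarray, (f_found, f_star, f_0))
    solved = (f_found - f_star) <= tol * (f_0 - f_star)
    return solved.mean()

# Example: three problems with f* = 0 and f(x0) = 10; the second run misses the target.
print(success_rate([1e-8, 0.3, 2e-7], [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]))  # 0.666...
```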
When studying reliability, the researcher should consider whether the algorithms are
deterministic, or non-deterministic, and repeat tests multiple times if the algorithm is non-
deterministic. Reliability can be based on a fixed starting point (if one is given with the test
set), but it is often better to use multiple starting points.
In deterministic optimization algorithms, reliability can be interpreted as the number of
problems in the given test set that are solved by the optimization algorithm. When dealing
with non-deterministic algorithms, it is important to repeat each test multiple times, to
make sure that reliability is measured in aggregate, and not skewed by a single lucky (or
unlucky) algorithmic run.
Using multiple repeats of each test raises the issue of how to aggregate the results. One
option is to consider each algorithmic run as a separate test problem and then compare
solvers across this expanded test set. This allows comparisons based on worst-case or best-
case scenarios. Another option is to use averaged data, for example, average runtime, average
solution accuracy, average reliability, etc. If averaging is used, then it is important to also
include standard deviations of the data. In either case, data collection is best performed by
considering each algorithmic run as a separate test problem, as average values can easily be
extracted from this data, while reconstructing the full test data from averaged values is not
possible.
It should be noted that, in some cases, multiple repeats of a non-deterministic method are
impractical due to the time it takes to solve a single problem.
Multiple starting points: As mentioned in Section 3, many academic test problems
come with suggested starting points. While algorithms should always be tested using these
starting points, it is often enlightening to test the algorithm using other starting points as
well. Most deterministic algorithms should show little change in performance if a starting
point is perturbed by a small random vector – provided the new starting point retains
whatever feasibility properties the algorithm requires in a starting point.
Hillstrom [Hil77] was one of the first to consider testing optimization algorithms at non-
standard starting points. He proposed using random starting points chosen from a box
surrounding the standard starting point. In another approach to this problem, in [MGH81]
the authors present a large collection of test functions along with some procedures and start-
ing points to assess the reliability and robustness of unconstrained optimization algorithms.
In some cases, prior knowledge is available about the solution of a test problem. Some
methods use such information to construct a starting point close to the optimal solution
[NW06].
Regardless of how starting points are selected, fair benchmarking requires all the algo-
rithms to use the same starting point for each test. Therefore, starting points should be
generated and stored outside of the testing process.
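A minimal sketch of this workflow follows (our illustration; the perturbation radius, seed, and file name are arbitrary choices, and NumPy is assumed): generate the perturbed starting points once, store them, and feed the identical points to every solver.

```python
import numpy as np

def make_starting_points(x0, n_points=10, radius=0.1, seed=42):
    """Reproducibly sample n_points perturbations of x0 from a box of
    half-width radius * max(1, |x0_i|) around each coordinate."""
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, dtype=float)
    half_width = radius * np.maximum(1.0, np.abs(x0))
    return x0 + rng.uniform(-1.0, 1.0, size=(n_points, x0.size)) * half_width

# Generate once and save; every solver later loads the same file.
starts = make_starting_points([0.5, 0.5])
np.save("starting_points_beale.npy", starts)
```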
It is often valuable to “normalize” these quantities by dividing by the starting accuracy,
yielding a measure γ (with an analogous version based on $x_{\mathrm{acc}}$). Note that $f_{\mathrm{acc}}$ is
multiplied by −1, so γ can be interpreted as the number of new digits of accuracy obtained,
up to a maximal improvement value of M.
Similar measures can be used to quantify the amount of constraint violation for a test
run. Considering $\min\{f(x) : g_i(x) \le 0,\ i = 1, 2, \ldots, m\}$,
$$\sum_{i=1}^{m} \max\{0, g_i(\bar{x})\} \quad \text{gives the sum of the violated constraints,}$$
$$\sum_{i=1}^{m} \big(\max\{0, g_i(\bar{x})\}\big)^2 \quad \text{gives the squared sum of the violated constraints,}$$
$$\frac{1}{m}\sum_{i=1}^{m} \max\{0, g_i(\bar{x})\} \quad \text{gives the mean constraint violation, and}$$
$$\prod_{i:\, g_i(\bar{x})>0} g_i(\bar{x}) \quad \text{amounts to the geometric mean of the violated constraints.}$$
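These summaries are straightforward to compute; the sketch below is our own illustration (assuming NumPy), with the constraints supplied as callables g_i under the convention g_i(x) ≤ 0, and with the last measure reported directly as the geometric mean of the strictly violated constraint values.

```python
import numpy as np

def violation_measures(g_list, x_bar):
    """Constraint-violation summaries for min{f(x) : g_i(x) <= 0, i = 1..m}."""
    g = np.array([g_i(x_bar) for g_i in g_list], dtype=float)
    v = np.maximum(0.0, g)                   # individual violations
    violated = g[g > 0]
    geo_mean = np.exp(np.log(violated).mean()) if violated.size else 0.0
    return {"sum": v.sum(),
            "squared_sum": (v ** 2).sum(),
            "mean": v.mean(),
            "geometric_mean": geo_mean}

# Example with two constraints, one of which is violated at x_bar = (1, 1):
print(violation_measures([lambda x: x[0] - 2.0, lambda x: x[0] + x[1] - 1.5], (1.0, 1.0)))
```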
The selection of the appropriate strategy among the variety of approaches depends on the
objectives of the experimental research, the problem structure, and the type of optimization
algorithms used. The researcher should also carefully select the success criteria, e.g., how to
fairly compare a solution that barely satisfies the constraints versus a solution that barely
violates the constraints but returns a much lower objective function value.
No known solution available: In many situations, the test set used will not have
known solutions to all problems. This is particularly true if the test set includes instances of
real-world applications. To evaluate the quality of an algorithmic output in this situation,
some new considerations are required [McG96, JMR96].
First, it should be immediately obvious that, if no known solution is available, then fixed-
target approaches cannot be applied. Fixed-cost approaches are still applicable, but since
f (x∗ ) is not known, measuring the accuracy of the final algorithmic output’s function value,
f (x̄), becomes difficult. Measuring the accuracy of the final algorithmic output’s point, x̄,
becomes essentially impossible.
To measure the quality of the final algorithmic output’s function value f (x̄), the simplest
approach is to replace f (x∗ ) with the best known value for the problem. For any given test
run, this guarantees that at least one algorithm gets the exact answer, so it is important
to select a reasonable maximal improvement value. Another approach is to estimate the
optimal solution using statistical techniques. For example, in combinatorial optimization
problems, some researchers [Dan77, Der85] use a sample of algorithmic outputs to predict
the location of the solution. In [GS85], such an approach is explained in an evaluation of
non-deterministic algorithms. Another strategy is to calculate a lower bound on the cost of
an optimal solution, and to compare the algorithmic output cost with that lower bound. As
an example, in bin packing problems the total sum of the item weights divided by the bin
capacity provides a lower bound on the number of bins used in any packing. Finally, one
may abandon comparing the algorithmic output quality with the optimal solution, and only
assess the quality of the algorithmic output against similar results published in the literature
or against the other algorithms being tested.
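For instance, for one-dimensional bin packing such a lower bound is a one-line computation (our sketch, assuming a common bin capacity).

```python
import math

def bin_packing_lower_bound(weights, capacity):
    """No packing can use fewer bins than ceil(total weight / capacity)."""
    return math.ceil(sum(weights) / capacity)

# Items totalling 3.5 units of weight need at least 4 unit-capacity bins,
# so an algorithm that uses 5 bins is within a factor 5/4 of optimal.
print(bin_packing_lower_bound([0.6, 0.7, 0.4, 0.5, 0.8, 0.3, 0.2], 1.0))  # -> 4
```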
5 Analyzing and reporting the results
Many studies use basic statistics (e.g., average solving time) to report the experimental
results. Basic statistics are a reasonable starting point, but provide little information about
the overall performance of optimization methods. Reporting methods can be loosely broken
down into three categories: numerical tables, graphics, and performance ratio methods (e.g.,
performance and data profiles).
5.1 Tables
Numerical tables provide the most complete method of reporting benchmarking results, so
for the sake of completeness, we recommend making full tables of results readily available.
However, such tables are cumbersome, so are often better included in an appendix or in
additional online material linked to an article.
As full tables of results can easily overwhelm a reader, researchers have developed var-
ious techniques that provide easy-to-understand and compact methods for reporting the
experimental results. Summary tables can be employed as a first step [SK06]. For example,
in [BDF97] optimization methods were rated by the percentage of problems for which a
method’s time is termed competitive or very competitive. The solving time of an algorithm
was called competitive if Ts ≤ 2Tmin, in which Ts is the solving time obtained by an al-
gorithm on a particular problem and Tmin is the minimum solving time obtained among
all the algorithms on that specific problem. Similarly, if Ts ≤ (4/3)Tmin, then they call that
method very competitive. Tables such as these provide good talking points for discussing
benchmarking data, but fail to give a complete picture of the results. One criticism of this
particular approach is that it does not explore how much the table would change if, for example,
the cut-off for very competitive was changed from (4/3)Tmin to (5/4)Tmin.
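A sketch of this tally (our illustration, assuming NumPy; times is a problems-by-solvers array of solving times):

```python
import numpy as np

def competitiveness_summary(times, factors=(4.0 / 3.0, 2.0)):
    """Percentage of problems on which each solver's time is within the given
    factors of the best time (e.g., 'very competitive' and 'competitive')."""
    times = np.asarray(times, dtype=float)
    t_min = times.min(axis=1, keepdims=True)         # best time per problem
    return {f: 100.0 * np.mean(times <= f * t_min, axis=0) for f in factors}

# Example with three problems and two solvers:
print(competitiveness_summary([[1.0, 1.2], [2.0, 5.0], [3.0, 3.1]]))
```

Recomputing the table for several cut-off factors, as suggested above, is then a one-line change.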
Many other forms of summary tables are present throughout optimization benchmarking
literature, however all suffer from the same fundamental problem – to be readable a summary
table must distill the results down to a highly condensed format, thereby eliminating much
of the benchmarking information.
5.2 Graphics
Well-conceived graphics can provide more information than some other data presentations.
Simple graphical methods, such as histograms, box-plots, and trajectory plots, provide a next
step in analysis, while more complete methods include performance profiles, data profiles,
and accuracy profiles. Depending on the objectives of an experimental research, one or more
of these techniques might be useful to report the results. In [Tuk77, TGM83], different types
of plots are introduced, which are useful for data representation in general.
A more specialized plot for optimization algorithms is the trajectory plot [FRK+ 08,
RS07, SR80b, Eas82, EF74, KLM+ 10, SS00]. In a trajectory plot, the performance of an
optimization algorithm on a given test problem is visualized by plotting a path that connects
the points generated by each iteration of the algorithm. An example appears in Figure 2,
where the trajectories of two algorithms attempting to minimize the Rosenbrock function
are plotted. Both algorithms begin at the point (3, 3), and the first iteration moves both
algorithms to the point (0.2, 3.5). Algorithm 1 (represented by the solid line) proceeds to
(0.7, −0.2) and continues in a zig-zag path towards the minimizer. Algorithm 2 (represented
by the dashed line) proceeds to (1.1, 1.3) and then follows a fairly straight path towards
the minimizer, albeit with very small step sizes. While trajectory plots are useful for building
a better understanding of how each algorithm behaves, they are not particularly good for
benchmarking as they can only present the results for one test problem at a time. They are
also limited to plots of functions of 2 or 3 variables, or to plotting projections onto subspaces
for more than 3 variables.
Figure 2: A sample trajectory plot.
[Figure 3: a sample convergence plot, showing the best function value found against the number of function evaluations for methods M1–M4.]
In Figure 3 the results of four optimization methods are plotted for a given test problem.
In this example, method M1 starts well, but stalls after about 300 function evaluations,
while method M2 shows steady decrease for about 800 function evaluations before stalling.
Method M3 initially decreases the fastest, but stalls after about 350 function evaluations.
Finally, method M4 starts very slowly, but ultimately finds the lowest value. Like trajectory
plots, convergence plots are useful for discussing some specific behavior of the algorithm,
but are poor for benchmarking as they can only be used to analyze one test problem at a
time.
While trajectory and convergence plots are useful to visualize a method on one problem,
their main drawback is that they represent the results for a single problem per plot. So
if the test set contains a large number of problems then it will be difficult to evaluate the
overall performance of these methods. Other types of plots can be found in the literature,
but generally have the same limitations as trajectory and convergence plots [BSV04] [RS07].
For many optimization algorithms, researchers are interested in how performance scales
with the size of the input (e.g., dimension of the problem). For such research it can be
valuable to produce a runtime plot. Runtime plots visualize the data by plotting the time
to solve across a series of problem instances with different sizes. Runtime plots can suffer
from issues similar to those of trajectory and convergence plots, namely, they represent the results
for a single series of problem instances. However, this problem can be somewhat mitigated
by aggregating data from a collection of problems to create an “average runtime” plot.
Performance profiles [DM02] are built from the performance ratio
$$r_{p,s} = \frac{t_{p,s}}{\min\{t_{p,s'} : s' \in \mathcal{S}\}}$$
for a specific problem p and solver s, where $t_{p,s}$ is the performance measure (e.g., running time)
recorded for solver s on problem p; the best solver on a problem has $r_{p,s} = 1$. The performance
profile of a solver s is defined as follows
$$\rho_s(\tau) = \frac{1}{|\mathcal{P}|}\,\mathrm{size}\{p \in \mathcal{P} : r_{p,s} \le \tau\}, \qquad (3)$$
where |P| represents the cardinality of the test set P. Then, ρs(τ) is the portion of the time
that the performance ratio rp,s for solver s ∈ S is within a factor τ ∈ R of the best possible
performance ratio.
Note that ρs (1) represents the percentage of problems for which solver s ∈ S has the best
performance among all the other solvers. And for τ sufficiently large, ρs (τ ) is the percentage
of the test set that can be solved by s ∈ S. Solvers with consistently high values for ρs (τ )
are of interest.
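The profile in (3) is easy to compute from a matrix of timings; the following sketch is our own illustration (assuming NumPy, with failed runs recorded as np.inf) and returns ρs evaluated on a grid of τ values.

```python
import numpy as np

def performance_profile(times, taus):
    """times: (n_problems, n_solvers) array, with np.inf marking failures.
    Returns an array of shape (len(taus), n_solvers) holding rho_s(tau)."""
    times = np.asarray(times, dtype=float)
    best = times.min(axis=1, keepdims=True)           # best time per problem
    ratios = times / best                              # r_{p,s}; best solver = 1
    taus = np.asarray(taus, dtype=float).reshape(-1, 1, 1)
    return np.mean(ratios[None, :, :] <= taus, axis=1)

# rho_s(1) is each solver's share of wins; a large tau gives its overall success rate.
times = np.array([[1.0, 2.0], [4.0, 3.0], [2.0, np.inf]])
print(performance_profile(times, [1.0, 2.0, 4.0]))
```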
Figure 4 shows a sample performance profile plot (created using data from [BHLH15])
for logarithmic values of τ . The logarithmic values are employed to deal with smaller values
for τ. This results in a more readable plot that also shows the long-term behavior of the
methods. To demonstrate the difference, Figure 5 shows the same performance profile using
non-logarithmic values of τ . Depending on the data collected, logarithmic or non-logarithmic
values of τ may be more appropriate. Researchers should create both profiles, but it may
be only necessary to provide one in the final manuscript.
3 We thank “Mathematics Referee #1” for pointing out that reference.
[Figure 4: performance profiles ρs(τ) for methods M1–M4, plotted against log2(τ).]
[Figure 5: the same performance profiles plotted against non-logarithmic values of τ.]
Some variations use a function-value-based quantity, rather than time, as the performance
measure to compare the profiles. In particular, $t_{p,s}$ is replaced with a measure for problem p
and solver s in which $f_w$ is the largest (worst) function value obtained among all the algorithms,
and $\hat{f}_{p,s}$ is the estimated function value after k function evaluations.
In another example, [SK15] creates a performance measure based on proximity to optimal
points.
The primary advantage of performance profiles is that they implicitly include both speed
and success rate in the analysis. The value of ρs (α) gives a sense of how promising the
algorithmic outputs are relative to the best solution found by all the optimization algorithms
that are compared together.
One criticism of performance profiles is that the researcher must select a definition of
when the convergence test passes or fails. Changing this definition can substantially change
the performance profile [HVL11]. Also note that if a fixed-cost approach is used to perform
the benchmarking experiments, then performance profiles become inappropriate, as all
algorithms will use the same “time”. Another criticism is that the profile is only showing
performance with respect to the best method and does not allow one to compare other
methods with each other due to the appearance of a switching phenomenon [GS16].
Nonetheless, performance profiles have become a gold-standard in modern optimization
benchmarking, and should be included in optimization benchmarking analysis whenever
possible with an appropriate interpretation.
Accuracy profiles are built from the measure $\gamma_{p,s} = \min\{-f_{\mathrm{acc}}^{p,s}, M\}$ (the number of digits of
accuracy obtained, capped at a maximal improvement value M),
where $f_{\mathrm{acc}}^{p,s} = \log_{10}\big(f(\bar{x}_{p,s}) - f(x_p^*)\big) - \log_{10}\big(f(x_p^0) - f(x_p^*)\big)$, $\bar{x}_{p,s}$ is the candidate solution
point obtained by solver s on problem p, $x_p^*$ is the optimal point for problem p, and $x_p^0$ is
the initial point for problem p. The performance of the solver s ∈ S on the test set P is
measured using the following function
$$R_s(\tau) = \frac{1}{|\mathcal{P}|}\,\mathrm{size}\{\gamma_{p,s} \mid \gamma_{p,s} \ge \tau,\ p \in \mathcal{P}\}.$$
The accuracy profile Rs (τ ) shows the proportion of problems such that the solver s ∈ S is
able to obtain a solution within an accuracy of τ of the best solution. An example accuracy
profile (using data from [HS06]) appears in Figure 6.
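The quantities above translate directly into code; the sketch below is our own illustration (assuming NumPy), where f_bar has shape (problems, solvers), f_star and f_0 have one entry per problem, and max_digits plays the role of the maximal improvement value M.

```python
import numpy as np

def accuracy_profile(f_bar, f_star, f_0, taus, max_digits=16):
    """Compute gamma_{p,s} = min{-f_acc^{p,s}, max_digits} and return R_s(tau)."""
    f_bar = np.asarray(f_bar, dtype=float)
    f_star = np.asarray(f_star, dtype=float).reshape(-1, 1)
    f_0 = np.asarray(f_0, dtype=float).reshape(-1, 1)
    f_acc = np.log10(f_bar - f_star) - np.log10(f_0 - f_star)
    gamma = np.minimum(-f_acc, max_digits)      # digits of accuracy gained
    taus = np.asarray(taus, dtype=float).reshape(-1, 1, 1)
    return np.mean(gamma[None, :, :] >= taus, axis=1)

# Example: two problems, two solvers, profile evaluated at 2, 4, and 6 digits.
f_bar = np.array([[1e-6, 1e-3], [1e-9, 1e-2]])
print(accuracy_profile(f_bar, [0.0, 0.0], [1.0, 1.0], [2.0, 4.0, 6.0]))
```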
[Figure 6: accuracy profiles for the compared methods (horizontal axis: relative accuracy obtained).]
In Figure 6, we see 4 methods (M1, M2, M3, and M4) plotted against each other in an
accuracy profile format. Examining the profile, notice that method M1 achieves 5 digits
of accuracy on almost all test-problems, and 6 digits of accuracy on about 75% of test
problems. All other methods achieve this level of accuracy on 50% or less of the test problems.
Thus, if 5 or 6 digits is the desired level of accuracy, then M1 is a clear winner. However,
if the particular application requires much higher accuracy, then M3 becomes a contender.
Indeed, only M3 was able to achieve 12 digits of accuracy on any reasonable portion of the
test problems. (In this particular test, accuracy was capped at 16 digits, but no method
managed to achieve this on a significant portion of the test problems.)
Accuracy profiles do not provide as much information as performance profiles, but are
suitable when fixed-cost data sets are collected. This is appropriate in cases where the cost
of obtaining the exact solution exceeds the budget, so the optimization target is to find as
good a solution as possible within a limited time.
A data profile is defined by
$$d_s(\alpha) = \frac{1}{|\mathcal{P}|}\,\mathrm{size}\Big\{p \in \mathcal{P} : \frac{t_{p,s}}{n_p + 1} \le \alpha\Big\},$$
in which $t_{p,s}$ is the number of function evaluations required to satisfy the convergence
test, $n_p$ is the number of variables in problem p ∈ P, and $d_s(k)$ is the percentage of
problems that can be solved with $k(n_p + 1)$ function evaluations. The value $k(n_p + 1)$ is
used since $n_p + 1$ is the number of function evaluations required to compute a “simplex
gradient” (a one-sided finite-difference estimate of the gradient).
It is worth noting that data profiles could easily be defined by replacing $t_{p,s}/(n_p + 1)$ with
any other measure of fundamental evaluations. Moreover, if $t_{p,s}/(n_p + 1)$ is replaced by iterations, then
data profiles become a slight variation of the operational characteristics defined in [SS00].
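A compact sketch of the data-profile computation follows (ours, assuming NumPy): evals holds the number of function evaluations each solver needed to pass the convergence test (np.inf when it never did), and dims holds each problem's dimension n_p.

```python
import numpy as np

def data_profile(evals, dims, alphas):
    """d_s(alpha): fraction of problems with t_{p,s} / (n_p + 1) <= alpha."""
    evals = np.asarray(evals, dtype=float)
    dims = np.asarray(dims, dtype=float).reshape(-1, 1)
    budgets = evals / (dims + 1.0)                    # in units of simplex gradients
    alphas = np.asarray(alphas, dtype=float).reshape(-1, 1, 1)
    return np.mean(budgets[None, :, :] <= alphas, axis=1)

# Example: two 10-dimensional problems, two solvers, budgets of 5 and 20 simplex gradients.
evals = np.array([[40.0, 200.0], [np.inf, 120.0]])
print(data_profile(evals, [10, 10], [5.0, 20.0]))
```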
[Figure 7: data profiles ds(α) for methods M1–M4 (horizontal axis: number of function evaluations).]
Figure 7 shows a typical data profile. Suppose the user has a budget limit of 100 simplex
gradients. According to Figure 7, with this budget method M4 has the best performance,
solving roughly 22% of the problems, while M3 has the worst performance among all the
solvers, since with this budget it does not solve any problems.
Like performance profiles, data profiles are cumulative distribution functions, and thus,
monotone increasing step functions with a range in [0, 1]. Data profiles do not provide the
number of function evaluations required to solve a specific problem, but instead provide a
visualization of the aggregate data. Also note that the data profile for a given solver s ∈ S
is independent of other solvers; this is not the case for performance profiles.
Although the data profiles are useful for benchmarking, they have not received the same
extensive attention as the performance profiles. This is partly because they are newer, but
perhaps also because they are primarily used with derivative-free optimization algorithms.
However, data profiles could be easily adjusted to a broader class of algorithms by replacing
tp,s with any measure of time, and np + 1 by any dimensional normalization factor. For
example, for a sub-gradient based method, ds (α) could be redefined as
$$d_s(\alpha) = \frac{1}{|\mathcal{P}|}\,\mathrm{size}\{p \in \mathcal{P} : g_{p,s} \le \alpha\},$$
where gp,s is the number of sub-gradient evaluations. This might make them an appropriate
tool for benchmarking bundle methods [HUL93, §XIV-XV].
Table 4 summarizes the reporting methods discussed in this section.
Table 4: Summary of reporting methods.

Reporting method | Evaluates | Advantage | Drawback | Recommendation
Full Data Tables | — | Comprehensive | Overwhelming | Provide in appendix or online data set.
Summary Tables; Simple Graphs | Varies | Brief | Incomplete | Provide as talking point, but include other forms of analysis.
Trajectory Plots; Convergence Plots | Speed and Accuracy; Efficiency | Clear; Precise | Examines one problem at a time. | Good for case-studies, but should include other forms of analysis for larger data sets.
Performance Profiles; Accuracy Profiles; Data Profiles | Speed and Robustness; Accuracy; Speed and Robustness | Strong graphical representation that incorporates the entire dataset. | Cannot be used for fixed-cost data sets. | Include at least one of these three profiles whenever possible.
7 Conclusion
This article reviews the issue of benchmarking optimization algorithms. A careful, less-
biased, explicitly-stated, and comprehensive evaluation of optimization algorithms requires
an a priori benchmarking design. To develop an appropriate experimental design, the first
step is to clarify the questions that are to be answered
by the experiment. This includes selecting a suitable test set and suitable performance mea-
sures based on the objectives of the research. The data must be analyzed and processed in a
transparent, fair, and complete manner. Within this paper we discuss each of these topics,
and present a review of the state-of-the-art for each of these steps. We include several tables
and figures that summarize the process, and provide key advice designed to lead to a fair
benchmarking process.
A final important point must be raised in regards to optimization benchmarking:
as in all scientific research, benchmarking optimization algorithms
should be reported in a manner that allows for reproducibility of the
experiments.
When reporting results, make sure to describe algorithms, parameters, test problems, the
computational environment, and the statistical techniques employed with an acceptable level
of detail. It should be acknowledged that it is usually difficult to provide enough information
in a published paper to enable the reader to rerun the stated experiments and replicate
completely the reported results. Moreover, the pace of computational development is so
high that it is virtually impossible to entirely reproduce a computational experiment, due to
development and modifications in operating systems, computer architecture, programming
languages, etc. However, the minimum standard for replication of the experiments is that at
least the authors themselves should be able to replicate the experiments [CDM79]. Therefore,
it is important that the researcher keep all the programs and data necessary to redo all the
computations and recreate all graphs. Such programs should be made available whenever
possible.
A benchmarking challenge that we have not addressed is how to compare optimization
algorithms that are different in nature5 . For example, consider the comparison of a de-
terministic and a non-deterministic method [GK17, KM17]. If multiple repeats of the
non-deterministic method are considered, is it fair to compare their average quality to the
single run of the deterministic method? Some ideas on this, including a proposed method
for comparing deterministic and non-deterministic methods, can be found in [SKM16].
Another benchmarking challenge that has not been fully addressed is how to compare al-
gorithms that approach the same problem from fundamentally different viewpoints6 . For
example, when working with constrained optimization problems, some researchers have ex-
plored infeasible point methods while others have focused on interior point methods. Infea-
sible point methods typically take a two phase approach, where one phase aims for decrease
in function value and the second phase aims to improve feasibility. Interior point methods
assume a strictly feasible starting point and use some form of penalty function to maintain
feasibility of all trial points. Comparing these two styles of algorithms is very challenging,
and possibly meaningless, as one assumes an infeasible starting point and the other assumes
a feasible starting point. Other algorithms adopt a hybrid approach by approximating the
feasible set with some tolerance [RW17]; in that case, the tolerance parameter could greatly
influence the result of the comparison.
A source of debate in benchmarking global optimization algorithms is how to deal with
rescaling of the domain7 . Many global optimization algorithms are designed with the baseline
assumption that the optimization problem’s constrained region is the unit hypercube [0, 1]n .
Of course, in practical applications this is not always true. Some algorithms deal with this
at the solver level, using the constraint set’s diameter to select parameters like initial step
lengths; while other algorithms deal with this at the problem level, assuming that the end-
user will rescale the constraint set to be the unit hypercube (which is not always easy to do).
Comparisons of algorithms that place fundamentally different assumptions on the problem
structure may impact the selection of an appropriate test set and may limit the conclusions
one can draw from the numerical results.
Another potential limitation on what conclusions can be drawn from a numerical study
is the sensitivity of the results to algorithmic parameters8 . A robust study should investigate a range
of parameters and report on their impact on the validity of the conclusions. We leave the
complexity of how best to report such information to future research.
5 We thank “Mathematics Referee #1” for pointing out this challenge.
6 We thank “Engineering Referee #3” and “Mathematics Referee #2” for pointing out this challenge.
7 We thank “Mathematics Referee #2” for pointing out this challenge.
8 We thank “Mathematics Referee #1” for pointing out this challenge.
References
[ACM91] B. M. Averick, R. G. Carter, and J. J. Moré. The MINPACK-2 test problem
collection. Technical report, Argonne National Laboratory, Argonne, IL, 1991.
[ADO10] C. Audet, C. K. Dang, and D. Orban. Algorithmic parameter optimization of
the DFO method with the OPAL framework. In Ken Naono, Keita Teranishi,
John Cavazos, and Reiji Suda, editors, Software Automatic Tuning, pages 255–
274. Springer New York, 2010.
[ADO14] C. Audet, K.-C. Dang, and D. Orban. Optimization of algorithms with OPAL.
Math. Program. Comput., 6(3):233–254, 2014.
[AKZ05] M. M. Ali, C. Khompatraporn, and Z. B. Zabinsky. A numerical evaluation
of several stochastic algorithms on selected continuous global optimization test
problems. Journal of Global Optimization, 31(4):635–672, 2005.
[AL07] B. Addis and M. Locatelli. A new class of test functions for global optimization.
Journal of Global Optimization, 38(3):479–501, 2007.
[ALDP14] C. Audet, S. Le Digabel, and M. Peyrega. Linear equalities in blackbox opti-
mization. Technical report, Les Cahiers du GERAD, 2014.
[And08] N. Andrei. An unconstrained optimization test functions collection. Advanced
Modeling and Optimization, 10(1):147–161, 2008.
[AO06] C. Audet and D. Orban. Finding optimal algorithmic parameters using
derivative-free optimization. SIAM J. Optim., 17(3):642–664, 2006.
[Asa73] J. Asaadi. A computational comparison of some non-linear programs. Mathe-
matical Programming, 4(1):144–154, 1973.
[Bar70] Y. Bard. Comparison of gradient methods for the solution of nonlinear param-
eter estimation problems. SIAM Journal on Numerical Analysis, 7(1):157–186,
1970.
[Bar87] R. R. Barton. Testing strategies for simulation optimization. In Proceedings
of the 19th Conference on Winter Simulation, WSC ’87, pages 391–401, New
York, NY, USA, 1987. ACM.
[BBLP05] T. Bartz-Beielstein, C. W. G. Lasarczyk, and M. Preuss. Sequential parame-
ter optimization. In The 2005 IEEE Congress on Evolutionary Computation,
volume 1, pages 773–780, 2005.
[Bel69] E. J. Beltrami. A comparison of some recent iterative methods for the numerical
solution of nonlinear programs. In M. Beckmann and H. P. Künzi, editors, Com-
puting Methods in Optimization Problems, volume 14 of Lecture Notes in Op-
erations Research and Mathematical Economics, pages 20–29. Springer Berlin
Heidelberg, 1969.
[Ber13] T. Berthold. Measuring the impact of primal heuristics. Operations Research
Letters, 41(6):611 – 614, 2013.
[BGK+ 95] R. S. Barr, B. L. Golden, J. P. Kelly, M. G. C. Resende, and W. R. Stewart Jr.
Designing and reporting on computational experiments with heuristic methods.
Journal of Heuristics, 1(1):9–32, 1995.
[BGKR10] A. Balint, D. Gall, G. Kapler, and R. Retz. Experiment design and admin-
istration for computer clusters for SAT-solvers (EDACC), system description.
Journal on Satisfiability, Boolean Modeling and Computation, 7:77–82, 2010.
[BH93] R. S. Barr and B. L. Hickman. Reporting computational experiments with
parallel algorithms: Issues, measures, and experts’ opinions. INFORMS Journal
on Computing, 5(1):2–18, 1993.
[BHBG07] M. Baz, B. Hunsaker, P. Brooks, and A. Gosavi. Automated tuning of op-
timization software parameters. Technical report, University of Pittsburgh,
Department of Industrial Engineering, 2007.
[BHLH15] V. Beiranvand, W. Hare, Y. Lucet, and S. Hossain. Multi-haul quasi network
flow model for vertical alignment optimization. Technical report, Computer
Science, University of British Columbia, Kelowna BC, Canada, 2015.
[Bir09] M. Birattari. Tuning Metaheuristics: A Machine Learning Perspective. Springer
Publishing Company, Incorporated, 1st ed. 2005. 2nd printing edition, 2009.
[Box66] M. J. Box. A comparison of several current optimization methods, and the use
of transformations in constrained problems. The Computer Journal, 9(1):67–77,
1966.
[BSV03] H. Y. Benson, D. F. Shanno, and R. J. Vanderbei. A comparative study of
large-scale nonlinear optimization algorithms. In Gianni Di Pillo and Almerico
Murli, editors, High Performance Algorithms and Software for Nonlinear Opti-
mization, volume 82 of Applied Optimization, pages 95–127. Springer US, 2003.
[BSV04] Hande Y. Benson, David F. Shanno, and Robert J. Vanderbei. Interior-point
methods for nonconvex nonlinear programming: jamming and numerical test-
ing. Math. Program., 99(1, Ser. A):35–48, 2004.
[Buc92] A. G. Buckley. Algorithm 709: Testing algorithm implementations. ACM
Transactions on Mathematical Software (TOMS), 18(4):375–391, December
1992.
[CDM79] H. Crowder, R. S. Dembo, and J. M. Mulvey. On reporting computational
experiments with mathematical software. ACM Transactions on Mathematical
Software (TOMS), 5(2):193–203, 1979.
[CGT96] A. R. Conn, N. Gould, and P. L. Toint. Numerical experiments with the
LANCELOT package (release A) for large-scale nonlinear optimization. Math-
ematical Programming, 73(1):73–110, 1996.
[Col68] A. R. Colville. A comparative study of nonlinear programming codes. Technical
Report 320-2949, IBM Scientific Center, New York, 1968.
[CPL14] CPLEX’s automatic tuning tool. Technical report, IBM Corporation, 2014.
[Dan77] D. G. Dannenbring. Procedures for estimating optimal solution values for large
combinatorial problems. Management Science, 23(12):1273–1283, 1977.
[Dem76] R. S. Dembo. A set of geometric programming test problems and their solutions.
Mathematical Programming, 10(1):192–213, 1976.
[Dem78] R. S. Dembo. Current state of the art of algorithms and computer software
for geometric programming. Journal of Optimization Theory and Applications,
26(2):149–183, 1978.
[Der85] U. Derigs. Using confidence limits for the global optimum in combinatorial
optimization. Operations Research, 33(5):1024–1049, 1985.
[FPS02] D. Famularo, P. Pugliese, and Y. D. Sergeyev. Test problems for Lipschitz uni-
variate global optimization with multiextremal constraints. In Gintautas Dze-
myda, Vydūnas Šaltenis, and Antanas Žilinskas, editors, Stochastic and global
optimization, volume 59 of Nonconvex Optim. Appl., pages 93–109. Kluwer
Acad. Publ., Dordrecht, 2002.
[FRK+ 08] K. R. Fowler, J. P. Reese, C. E. Kees, J. E. Dennis Jr., C. T. Kelley, C. T. Miller,
C. Audet, A. J. Booker, G. Couture, R. W. Darwin, M. W. Farthing, D. E.
Finkel, J. M. Gablonsky, G. Gray, and T. G. Kolda. Comparison of derivative-
free optimization methods for groundwater supply and hydraulic capture com-
munity problems. Advances in Water Resources, 31(5):743 – 757, 2008.
[GJ09a] J. C. Gilbert and X. Jonsson. LIBOPT an environment for testing solvers
on heterogeneous collections of problems the manual, version 2.1. Technical
report, INRIA, BP 105, 78153 Le Chesnay, 2009.
[GJ09b] D. Grundel and D. Jeffcoat. Combinatorial test problems and problem genera-
tors. In Christodoulos A. Floudas and Panos M. Pardalos, editors, Encyclopedia
of Optimization, pages 391–394. Springer US, 2009.
[Hil77] K. E. Hillstrom. A simulation test approach to the evaluation of nonlinear op-
timization algorithms. ACM Transactions on Mathematical Software (TOMS),
3(4):305–315, 1977.
[HJ82] K. L. Hoffman and R. H. F. Jackson. In pursuit of a methodology for testing
mathematical programming software. In JohnM. Mulvey, editor, Evaluating
Mathematical Programming Techniques, volume 199 of Lecture Notes in Eco-
nomics and Mathematical Systems, pages 177–199. Springer Berlin Heidelberg,
1982.
[HJL92] P. Hansen, B. Jaumard, and S. H. Lu. Global optimization of univariate Lips-
chitz functions: II. New algorithms and computational comparison. Mathemat-
ical Programming, 55(1-3):273–292, 1992.
[HKT01] P. Hough, T. Kolda, and V. Torczon. Asynchronous parallel pattern search for
nonlinear optimization. SIAM Journal on Scientific Computing, 23(1):134–156,
2001.
[HL70] H. Y. Huang and A. V. Levy. Numerical experiments on quadratically con-
vergent algorithms for function minimization. Journal of Optimization Theory
and Applications, 6(3):269–282, 1970.
[HMSW53] A. Hoffman, M. Mannos, D. Sokolowsky, and N. Wiegmann. Computational
experience in solving linear programs. Journal of the Society for Industrial and
Applied Mathematics, 1(1):17–33, 1953.
[HP14] W. Hare and C. Planiden. The NC-proximal average for multiple functions.
Optimization Letters, 8(3):849–860, 2014.
[HRCV88] E. N. Houstis, J. R. Rice, C. C. Christara, and E. A. Vavalis. Performance
of scientific software. In J.R. Rice, editor, Mathematical Aspects of Scientific
Software, volume 14 of The IMA Volumes in Mathematics and Its Applications,
pages 123–155. Springer New York, 1988.
[HS81] W. Hock and K. Schittkowski. Test examples for nonlinear programming codes.
Lecture notes in economics and mathematical systems. Springer-Verlag, 1981.
[HS83] W. Hock and K. Schittkowski. A comparative performance evaluation of 27
nonlinear programming codes. Computing, 30(4):335–358, 1983.
[JBNP90] R. H. F. Jackson, P. T. Boggs, S. G. Nash, and S. Powell. Guidelines for
reporting results of computational experiments. report of the ad hoc committee.
Mathematical Programming, 49(1-3):413–425, 1990.
[JMR96] D. S. Johnson, L. A. McGeoch, and E. E. Rothberg. Asymptotic experimental
analysis for the Held-Karp traveling salesman bound. In Proceedings of the
Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’96,
pages 341–350, Philadelphia, PA, USA, 1996. Society for Industrial and Applied
Mathematics.
[JY13] M. Jamil and X. S. Yang. A literature survey of benchmark functions for global
optimisation problems. International Journal of Mathematical Modelling and
Numerical Optimisation, 4(2):150–194, 2013.
[KAA+ 11] T. Koch, T. Achterberg, E. Andersen, O. Bastert, T. Berthold, R. E. Bixby,
E. Danna, G. Gamrath, A. M. Gleixner, S. Heinz, A. Lodi, H. Mittelmann,
T. Ralphs, D. Salvagnin, D. E. Steffy, and K. Wolter. MIPLIB 2010. Mathe-
matical Programming Computation, 3(2):103–163, 2011.
[KLM+ 10] M. Kortelainen, T. Lesinski, J. Moré, W. Nazarewicz, J. Sarich, N. Schunck,
M. V. Stoitsov, and S. Wild. Nuclear energy density optimization. Physical
Review C, 82(2):024313, 2010.
[KLT03] T. G. Kolda, R. M. Lewis, and V. Torczon. Optimization by direct search: new
perspectives on some classical and modern methods. SIAM Rev., 45(3):385–482,
2003.
[KM16] D. E. Kvasov and M. S. Mukhametzhanov. One-dimensional global
search: Nature-inspired vs. lipschitz methods. AIP Conference Proceedings,
1738(1):400012, 2016.
[KM17] D. E. Kvasov and M. S. Mukhametzhanov. Metaheuristic vs. deterministic
global optimization algorithms: The univariate case. Applied Mathematics and
Computation, 2017.
[Knu94] D. E. Knuth. The Stanford GraphBase: a platform for combinatorial computing,
volume 37. Addison-Wesley Publishing Company, 1994.
[KPZ05] C. Khompatraporn, J. D. Pinter, and Z. B. Zabinsky. Comparative assessment
of algorithms and software for global optimization. Journal of Global Optimiza-
tion, 31(4):613–633, 2005.
[KS15] D. E. Kvasov and Y. D. Sergeyev. Deterministic approaches for solving practical
black-box global optimization problems. Advances in Engineering Software,
80:58 – 66, 2015.
[LL96] A. LaMarca and R. Ladner. The influence of caches on the performance of
heaps. Journal of Experimental Algorithmics (JEA), 1, 1996.
[LM84] M. L. Lenard and M. Minkoff. Randomly generated test problems for positive
definite quadratic programming. ACM Transactions on Mathematical Software,
10(1):86–96, January 1984.
[LZ00] D. Liu and X. S. Zhang. Test problem generator by neural network for algo-
rithms that try solving nonlinear programming problems globally. Journal of
Global Optimization, 16(3):229–243, 2000.
[McG96] C. C. McGeoch. Toward an experimental method for algorithm simulation.
INFORMS Journal on Computing, 8(1):1–15, 1996.
[McG02] C. C. McGeoch. Experimental analysis of algorithms. In Panos M. Pardalos
and H. Edwin Romeijn, editors, Handbook of Global Optimization, volume 62
of Nonconvex Optimization and Its Applications, pages 489–513. Springer US,
2002.
[MGH81] J. J. Moré, B. S. Garbow, and K. E. Hillstrom. Testing unconstrained op-
timization software. ACM Transactions on Mathematical Software (TOMS),
7(1):17–41, March 1981.
[Mit03] H. D. Mittelmann. An independent benchmarking of SDP and SOCP solvers.
Mathematical Programming, 95(2):407–430, 2003.
[MP06] H. D. Mittelmann and A. Pruessner. A server for automated performance
analysis of benchmarking data. Optimization Methods and Software, 21(1):105–
120, 2006.
[MTL72] A. Miele, J. L. Tietze, and A. V. Levy. Comparison of several gradient al-
gorithms for mathematical programming problems. Aero-astronautics report
no.94, Rice University, Houston, Texas, 1972.
[NE06] V. Nannen and A. E. Eiben. A method for parameter calibration and relevance
estimation in evolutionary algorithms. In Proceedings of the 8th Annual Confer-
ence on Genetic and Evolutionary Computation, GECCO ’06, pages 183–190,
New York, NY, USA, 2006. ACM.
[Net] Netlib: Netlib linear programming library.
[NL14] C.-K. Ng and D. Li. Test problem generator for unconstrained global optimiza-
tion. Comput. Oper. Res., 51:338–349, 2014.
[NN91] S. Nash and J. Nocedal. A numerical study of the limited memory BFGS
method and the Truncated-Newton method for large scale optimization. SIAM
Journal on Optimization, 1(3):358–372, 1991.
[Pin02] J. D. Pintér. Global optimization: Software, test problems, and applications.
In PanosM. Pardalos and H.Edwin Romeijn, editors, Handbook of Global Op-
timization, volume 62 of Nonconvex Optimization and Its Applications, pages
515–569. Springer US, 2002.
[Pin07] J. D. Pintér. Nonlinear optimization with GAMS /LGO. Journal of Global
Optimization, 38(1):79–101, 2007.
[PK13] J. D. Pintér and F. J. Kampas. Benchmarking nonlinear optimization software
in technical computing environments. TOP, 21(1):133–162, 2013.
[PRCLF12] J. A. Parejo, A. Ruiz-Cortés, S. Lozano, and P. Fernandez. Metaheuristic opti-
mization frameworks: a survey and benchmarking. Soft Computing, 16(3):527–
561, 2012.
[PSKv14] R. Paulavičius, Y. D. Sergeyev, D. E. Kvasov, and J. Žilinskas. Globally-biased
Disimpl algorithm for expensive global optimization. J. Global Optim., 59(2-
3):545–567, 2014.
[Rid07] E. Ridge. Design of experiments for the tuning of optimisation algorithms.
University of York, Department of Computer Science, 2007.
[RK10] E. Ridge and D. Kudenko. Tuning an algorithm using design of experiments. In
Thomas Bartz-Beielstein, Marco Chiarandini, Lus Paquete, and Mike Preuss,
editors, Experimental Methods for the Analysis of Optimization Algorithms,
pages 265–286. Springer Berlin Heidelberg, 2010.
[RM78] M. J. Rijckaert and X. M. Martens. Comparison of generalized geometric
programming algorithms. Journal of Optimization Theory and Applications,
26(2):205–242, 1978.
[Ros14] R. E. Rosenthal. GAMS–a user’s guide. Technical report, GAMS Development
Corporation, 2014.
[RS07] R. G. Regis and C. A. Shoemaker. A stochastic radial basis function method
for the global optimization of expensive functions. INFORMS Journal on Com-
puting, 19(4):497–509, 2007.
[RS13] L. Rios and N. V. Sahinidis. Derivative-free optimization: a review of algorithms
and comparison of software implementations. Journal of Global Optimization,
56(3):1247–1293, 2013.
[RU01] R. L. Rardin and R. Uzsoy. Experimental evaluation of heuristic optimization
algorithms: A tutorial. Journal of Heuristics, 7(3):261–304, 2001.
[RW77] H. Ramsin and P. Wedin. A comparison of some algorithms for the nonlinear
least squares problem. BIT Numerical Mathematics, 17(1):72–90, 1977.
[RW17] R. G. Regis and S. M. Wild. CONORBIT: constrained optimization by ra-
dial basis function interpolation in trust regions. Optimization Methods and
Software, 0(0):1–29, 2017.
[RXMC03] M. Romesis, M. Xie, K. Minkovich, and J. Cong. Optimality study project.
Technical report, UCLA Computer Science Department, 2003. http://cadlab.cs.ucla.edu/~pubbench/.
[Sch80] K. Schittkowski. Nonlinear programming codes: information, tests, perfor-
mance. Lecture notes in economics and mathematical systems. Springer-Verlag,
1980.
[Sch93] F. Schoen. A wide class of test functions for global optimization. Journal of
Global Optimization, 3(2):133–137, 1993.
[Sch08] K. Schittkowski. An updated set of 306 test problems for nonlinear program-
ming with validated optimal solutions - user’s guide. Technical report, Depart-
ment of Computer Science, University of Bayreuth, 2008.
[TM10] N. P. Tedford and J. R. R.A. Martins. Benchmarking multidisciplinary design
optimization algorithms. Optimization and Engineering, 11(1):159–183, 2010.
[Tuk77] J. W. Tukey. Exploratory data analysis. Pearson, 1977.
[TŽ89] A. Törn and A. Žilinskas. Global optimization, volume 350 of Lecture Notes in
Computer Science. Springer-Verlag, Berlin, 1989.
[VBB05] F. Vanden Berghen and H. Bersini. CONDOR, a new parallel, constrained ex-
tension of powell’s UOBYQA algorithm: Experimental results and comparison
with the DFO algorithm. Journal of Computational and Applied Mathematics,
181(1):157 – 175, 2005.
[VS99] R. J. Vanderbei and D. F. Shanno. An Interior-Point algorithm for nonconvex
nonlinear programming. Computational Optimization and Applications, 13(1-
3):231–252, 1999.
[VV07] A. I. F. Vaz and L. N. Vicente. A particle swarm pattern search method
for bound constrained global optimization. Journal of Global Optimization,
39(2):197–219, 2007.