Best Practices For Comparing Optimization Algorithms
Optimization and Engineering, June 2017
The final publication is available at Springer via http://dx.doi.org/10.1007/s11081-017-9366-1
Abstract
Comparing, or benchmarking, optimization algorithms is a complicated task that
involves many subtle considerations to yield a fair and unbiased evaluation. In this
paper, we systematically review the benchmarking process of optimization algorithms,
and discuss the challenges of fair comparison. We provide suggestions for each step of
the comparison process and highlight the pitfalls to avoid when evaluating the perfor-
mance of optimization algorithms. We also discuss various methods of reporting the
benchmarking results. Finally, some suggestions for future research are presented to
improve the current benchmarking process.
Keywords. Benchmarking; Algorithm comparison; Guidelines; Performance; Soft-
ware; Testing; Metric; Timing; Optimization algorithms.
1 Introduction
As the number of optimization methods, and implementations of those methods, has in-
creased, researchers have pursued comparative studies to evaluate their performance. When
done well, such studies can be of great value in helping end-users choose the most suitable op-
timization method for their problems. Such studies are generally referred to as optimization
benchmarking.
In the most general sense, benchmarking is the comparison of one or more products
to an industrial standard product over a series of performance metrics. In the case of
benchmarking optimization algorithms, the products are the specific implementations of
given algorithms, and the performance metrics are generated by running the implementations
on a series of test problems. This framework presents a certain clarity in benchmarking
optimization algorithms, as there is at least some agreement on what constitutes “better”.
If one algorithm runs faster, uses less memory, and returns a better final function value,
on all possible problems, then it can be considered better than the alternative. Of course,
in practice such a clear conclusion seldom arises. Thus, interpreting the conclusions of
algorithmic comparisons can be tricky.
Nonetheless, when done well, benchmarking optimization algorithms can have great prac-
tical value. It can reveal both strengths and weaknesses of an algorithm, which allows for
better research focus. It can aid in determining if a new version of optimization software is
performing up to expectations. And, it can help guide end-users in selecting a good choice
of algorithm for a particular real-world problem.
However, when done poorly, benchmarking optimization algorithms can also be mislead-
ing. It can hide an algorithm’s weaknesses (or strengths), report improvements that do not
exist, or suggest the incorrect algorithmic choice for a given situation.
In optimization benchmarking many subjective choices are made, such as the test set
to solve, the computing environment to use, the performance criteria to measure, etc. Our
primary objective in this paper is to help researchers to design a proper benchmarking
approach that is more comprehensive, less biased, and less subject to variations within
a particular software or hardware environment. Our secondary objective is to provide a
comprehensive review of the benchmarking literature for optimization algorithms.
In pursuing these objectives, we focus on single-objective optimization algorithms that
run in serial (i.e., that do not use parallel processing). Comparing algorithms for multi-
objective optimization, or optimization algorithms that use parallel processing, introduces
new levels of complexity to the benchmarking process. While we provide some comments
on the challenges for benchmarking some algorithms in the conclusions, we consider these
issues outside of the scope of this paper.
We also note that much of the presentation within this paper discusses algorithms as
if the underlying optimization problem is a continuous unconstrained problem. This is for
ease of presentation, and in most cases translating the ideas to other styles of optimization
problems is clear. As such, we limit ourselves to discussing specific styles of optimization
problems only when the translation is not straight-forward.
delivered an updated set of guidelines. In 2002, Dolan and Moré introduced performance
profiles [DM02], which have rapidly become a gold standard in benchmarking of optimiza-
tion algorithms, with more recent work pointing out their limitations [GS16]. In this paper,
we attempt to provide a modern picture of best-practice in the optimization benchmarking
process.
to deal with nonconvexity when applying a proximal-bundle method to a nonsmooth opti-
mization problem. As such, to see the value of the method, the authors compare it against
other proximal-bundle methods on a collection of nonconvex nonsmooth optimization prob-
lems. If they had compared their method against a quasi-Newton method on smooth convex
optimization problems, then very little insight would have been gained.
Regardless of the reason, another question researchers must consider is what aspect of the
algorithm is most important. Is a fast algorithm that returns infeasible solutions acceptable?
Is it more important that an algorithm solves every problem, or that its average performance
is very good? Is the goal to find a global minimizer, or a highly accurate local minimizer?
The answers to these questions should guide the choice of performance metrics that need
to be collected (Section 4) and how they should be analyzed (Section 5). Answering these
types of questions before running the experiments is time well spent.
3 Test sets
A test problem contains a test function together with some further criteria such as the
constraint set, feasible domain, starting points, etc. A test set is a collection of test problems.
Obviously, benchmarking only yields meaningful results when competing algorithms are
evaluated on the same test set with the same performance measures.
The selection of the appropriate test sets to benchmark the performance of optimization
algorithms is a widely noticed issue among researchers [Pin07, DM04, JBNP90, SSL13, Zv08].
Generally, there are three sources for collecting test problems: real-world problems, pre-
generated problems, and randomly-generated problems. Real-world problems can be found
through instances of specific applications, and pre-generated problems exist in common test
set libraries, see Table 1. Conversely, randomly-generated test problems are often more ad
hoc in nature, with researchers creating methods to randomly generate test problems that
are only used in a single paper [NN91, HS10, HP14] (among many others). However, some
researchers have gone to the effort to study methods to randomly-generate test problems for
a given area; some examples appear in Table 2.
While the real-world test sets provide specialized information about the performance
of the optimization algorithms within a specific application, the results may be difficult to
generalize. The difficulties lie in the fact that real-world test sets are often small and the
problems are often application-specific. Nonetheless, if the goal is to determine the best
algorithm to use for a particular real-world application, then a real-world test set focused
on that application is usually the best option.
On the other hand, the artificial and randomly-generated test sets can provide useful
information about the algorithmic characteristics of optimization algorithms. Artificial and
randomly-generated test sets can be extremely large in size, thereby providing an enormous
amount of comparative data. However, it can be difficult to rationalize their connection to
the real-world performance of optimization algorithms. If the goal is to compare a collection
of algorithms across a very wide spectrum, then artificial and randomly-generated test sets
are usually the better option.
When selecting a test set, it is always important to keep the particular goal of the
comparison in mind. Regardless of the goal, an appropriate test set should generally seek
to avoid the following deficiencies.
i. Too few problems. Having more problems in the test set makes the experiment
more reliable and helps the results reveal more information about the strengths or
weaknesses of the evaluated algorithms.
ii. Too little variety in problem difficulty. A test set containing only simple problems
is not enough to identify the strengths and weaknesses of algorithms. In contrast, a
test set which only has problems that are so difficult that no algorithm can solve them,
clearly, does not provide useful information on the relative performance of algorithms.
iii. Problems with unknown solutions. When possible, it is better to use test prob-
lems where the solution is known. Depending on the analysis performed (see section
5), the “solution” could be interpreted as the minimum function value, or the set of
global minimizers. Having access to the solution greatly improves the ability to eval-
uate the quality of the algorithmic output. However, when the test set is comprised
of real-world test problems, then a lack of known solutions may need to be accepted
as inevitable.
iv. Biased starting points. Allowing different algorithms to use different starting-
points will obviously bias the result. However, more subtle problems can also exist.
For example, if a starting point lies on the boundary of a constraint set, then an
interior point method will be severely disadvantaged. Another example comes from
considering the Beale test function, which has a global minimizer at (3, 0.5) [MGH81].
If a compass search (see, e.g., [KLT03]) with an initial step length of 1 is started at
(0.5, 0.5), then it will converge to the exact minimizer in just 4 iterations. However,
if a starting point of (0.51, 0.51) is used, then the exact same algorithm will use 63
iterations to reach a point within 10−2 of the global minimizer.1
v. Hidden structures. Many test sets have some structure that is not realistic in real-
world problems. For example, about 50% of the problems in the test set [MGH81] have
solution points that occur at integer valued coordinates. An algorithm that employs
some form of rounding may perform better than usual on these problems.
Thus, when choosing test sets for the benchmarking task the following considerations
should be taken into account as much as possible.
i. If the test set contains only few problems, then the experiment should be referred to
as a case study or a proof of concept, but not benchmarking. While there is no fixed
number that determines what is “enough problems to be considered benchmarking”,
we recommend that in order to achieve a reliable conclusion about the performance, an
experiment should contain at least 20 test problems (preferably more). In the specific
case of comparing a new version of an optimization algorithm with a previous version,
the number of test problems should be significantly greater – in the order of 100 or
more. In all cases, the more problems tested, the more reliable the conclusions.
ii. When possible, a test set should include at least two groups of problems: an easy
group which consists of the problems that are easy to solve within a reasonable time
on a regular contemporary computer using all the optimization algorithms tested,
and a hard group that contains the problems which are solvable but computationally
expensive and may require a specific optimization algorithm.
iii. Whenever possible, ensure at least a portion of the test set includes problems with
known solutions.
iv. For test sets that include starting points, new starting points can be generated by
introducing small (random) perturbations to the given starting points. For other
test sets, randomly-generated starting points can be created from scratch. In either
case, all starting points should be created for each problem, and then every algorithm
should be provided the same starting point for testing. This approach can be further
used to increase result reliability, by repeating tests on the same function with a variety
of starting points.
v. Examine the test set with a critical eye and try to determine any hidden structure.
Some structures can be removed through methods similar to the random perturbation
of starting points in (iv). One quick test is to set an algorithm to minimize f (x)
starting at x0 and then set the algorithm to minimize f̂(x) = f(x − p) starting from
x̂0 = x0 + p (where p is any random point). Constraints can then be shifted in a similar
manner, effectively shifting the entire problem horizontally by the vector p. While it
relocates the origin, and moves any constraints away from special integer values, it
has no theoretical effect on the geometry of the problem. As such, the results of
both tests should be extremely similar (theoretically they should be identical, but
numerical errors may cause some deviation). If the results of both tests differ, then
perhaps some hidden structure is being exploited by the algorithm, or perhaps some
hidden constraints are causing issues. Regardless of the reason, the researcher should
recognize the issue and consider a wider test set.
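To make the shift test in item (v) concrete, the following sketch is our own illustration (not code from the paper); it assumes NumPy and SciPy are available and uses SciPy's Nelder–Mead as a stand-in solver, together with the Beale function mentioned in item (iv).

```python
import numpy as np
from scipy.optimize import minimize

def shift_test(f, x0, solver="Nelder-Mead", seed=0):
    """Run a solver on f(x) and on the shifted problem
    f_hat(x) = f(x - p) started from x0 + p, and report both minima."""
    rng = np.random.default_rng(seed)
    p = rng.normal(size=len(x0))                     # random shift vector
    res_orig = minimize(f, x0, method=solver)
    res_shift = minimize(lambda x: f(x - p), x0 + p, method=solver)
    # Theoretically identical; a large gap suggests the solver exploits
    # hidden structure (e.g., integer-valued solution coordinates).
    return res_orig.fun, res_shift.fun

# Beale function, global minimizer at (3, 0.5):
beale = lambda x: ((1.5 - x[0] + x[0] * x[1]) ** 2
                   + (2.25 - x[0] + x[0] * x[1] ** 2) ** 2
                   + (2.625 - x[0] + x[0] * x[1] ** 3) ** 2)
print(shift_test(beale, np.array([0.5, 0.5])))
```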
Using suitable standard test sets is usually a good option when benchmarking optimiza-
tion algorithms. In particular, it is usually easier to compare results across research groups
when standard tests are employed, although even within a specific research field there is gen-
erally no consensus on the appropriate test set to draw specific conclusions. Many interesting
and diverse test sets have been reported in the literature, see Tables 1 and 2.
Producing random test sets using test problem generators has its own drawbacks. Are
the generated problems representative or difficult? Is there any hidden structure in the
problems? Some papers that use random test problem generators are listed in Table 2.
Figure 1 shows a decision tree that summarizes the fundamental decisions required for
assembling an appropriate test set for benchmarking of optimization algorithms.
[Figure 1: a decision tree for test-set selection: a new algorithm calls for a test set of at least 20 problems, while a version update calls for at least 100 (see Table 2 for sources); for starting points, whether fixed or random, ensure that no algorithm abuses starting-point information and that all algorithms are given the same starting information.]
4 Performance measures
In general, performance measures fall into three categories: efficiency, reliability, and quality of
algorithmic output. We discuss these performance categories in Subsections 4.1, 4.2, and 4.3.
Table 3 provides a classification of the comparative measures for optimization algorithms
based partly on the guidelines provided by Hoffman and Jackson [HJ82].
In order to allow for maximal data analysis (and thereby the best understanding of
overall performance), it is recommended to collect at least some data from every performance
category.
4.1 Efficiency
The efficiency of an optimization algorithm refers to the computational effort required to
obtain a solution. In mathematical programming, there are two primary measures of effi-
ciency, the number of fundamental evaluations and the running time. A third, less common,
measure is memory usage.
Number of fundamental evaluations: The term fundamental evaluation is used
to refer to any subroutine that is called by the algorithm in order to gain fundamental
information about the optimization problem. The most obvious example is an objective
function evaluation but the evaluation may involve complex simulation algorithms. Other
fundamental evaluations could include gradient evaluations, Hessian evaluations, or con-
straint function evaluations. The number of fundamental evaluations can be used as a
standard unit of time, and is often assumed to be platform independent. In many sit-
uations, the number of fundamental evaluations is a particularly important measure, as
for real-world problems these evaluations often dominate over the internal workings of the
algorithm [HS83, CDM79, DS78, CGT96, Bar87, NN91, VBB05, HL70, MTL72, Asa73,
SNSH+ 03, NSHV05, Eas82, ALDP14, Evt85, PSKv14, KS15]. Note however that this mea-
sure is unreasonable when fundamental evaluations do not dominate the internal workings
of the algorithm [AKZ05].
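In practice, fundamental evaluations can be counted without modifying the solver by wrapping each oracle in a counting layer. The sketch below is our own illustration and the class name is hypothetical.

```python
class CountedFunction:
    """Wrap a callable (objective, gradient, constraint, ...) and count calls."""

    def __init__(self, fun):
        self.fun = fun
        self.count = 0

    def __call__(self, *args, **kwargs):
        self.count += 1                      # one more fundamental evaluation
        return self.fun(*args, **kwargs)

# Usage sketch: wrap every oracle handed to the solver, then read the counters.
#   f = CountedFunction(objective); g = CountedFunction(gradient)
#   solver(f, g, x0); total_evaluations = f.count + g.count
```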
Running Time: Running time, as a measure for optimization benchmarking, is usually
measured by either CPU time or wall clock time.2 Wall clock time contains CPU time, and
has been argued to be more useful in a real-world setting [McG02]. However, wall clock time
is not reproducible or verifiable, since it is tied to a specific hardware platform and software
combination. CPU time is considerably more stable, as it is independent of background
operations of the computer. Moreover, CPU time is more-or-less consistent for the same
version of an operating system running on the same computer architecture.
2 Wall clock time refers to the amount of time the human tester has to wait to get an answer from the
computer. Conversely, CPU time is the amount of time the CPU spends on the algorithm, at the exclusion
of operating system tasks and other processes.
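As an illustration of recording both clocks for a single run (our sketch, not code from the paper), Python's standard library exposes time.process_time for CPU time and time.perf_counter for wall clock time; the solver call below is a placeholder.

```python
import time

def timed_run(solve, *args, **kwargs):
    """Return (result, cpu_seconds, wall_seconds) for one solver run."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    result = solve(*args, **kwargs)      # placeholder for the actual solver call
    cpu1, wall1 = time.process_time(), time.perf_counter()
    return result, cpu1 - cpu0, wall1 - wall0
```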
It should be noted that, due to the wide variety and complexity of modern computing
architectures, the number of situations in which time is dominated by memory access costs
is increasing, hence the precision of CPU timers has been reduced. To improve the precision
of CPU timers, tools such as cache and memory access tracers can help obtain a more
accurate measurement of CPU time. For a more detailed discussion of these techniques we
refer to [Knu94, LL96].
Another issue with CPU time is the increasing prevalence of multi-core machines. Thor-
ough reporting would require indicating the number of cores, the CPU time for each core,
but also how efficiently the different levels of memory were used and cache hits/misses. Since
such measurements are not straightforward to obtain for multi-core machines, the wall-clock
time along with the hardware specifications are usually reported. (Unless the new algorithm
contribution focuses specifically on optimizing computation for a multi-core architecture, in
which case more precise measures are warranted.) Ultimately, the onus is on the researchers
to explain how simplified measurements support the conclusions drawn; this is especially
true for multi-core machines.
Regardless of whether wall clock time or CPU time is used, in order to maximize the
reliability of the data, it is important to ensure that any background operations of the
computer are kept to a minimum. Furthermore, any manuscript regarding the benchmarking
should clearly state which form of running time was collected.
Other measures: In addition to the categorization presented above, in some specific
cases, there is another issue that influences the choice of an appropriate measure for running
time: the type of algorithm. For example, to evaluate the running time for branch-and-
bound based algorithms, the number of branch-and-bound nodes is a common criterion,
while for simplex and interior point based algorithms, the number of iterations is often
used. Therefore, when deciding on the choice of a suitable efficiency measure, the type of
algorithms to be evaluated should also be taken into account.
4.2 Reliability
The reliability and robustness of an optimization algorithm are defined as the ability of the
algorithm to “perform well” over a wide range of optimization problems [MGH81, BGK+ 95].
The most common performance measure to evaluate the reliability is success rate [TŽ89,
RM78, Eas82, SS00]. Success rate is gauged by counting the number of test problems that are
successfully solved within a pre-selected tolerance. This can be done using objective function
value, or distance of the solution point from a minimizer. In convex optimization these
two approaches are largely, but not perfectly, interchangeable. However, if the objective
function has multiple local minimizers, then the researcher must decide whether good local
solutions are acceptable outcomes, or if the algorithm must converge to a global minimizer
[Sch80, RW77]. In addition to the success rate, the average objective function values and the
average constraint violation values have also been reported to measure reliability [Sch80].
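As a simple illustration (our own sketch, assuming NumPy), one possible fixed-target success test declares a problem solved when the final function value closes all but a fraction tol of the initial gap to the known optimal value; f_found, f_star, and f_0 are arrays over the test problems.

```python
import numpy as np

def success_rate(f_found, f_star, f_0, tol=1e-6):
    """Fraction of problems with f_found - f_star <= tol * (f_0 - f_star)."""
    f_found, f_star, f_0 = map(np.asarray, (f_found, f_star, f_0))
    solved = (f_found - f_star) <= tol * (f_0 - f_star)
    return solved.mean()

# Example: three problems with f* = 0 and f(x0) = 10; the second run misses the target.
print(success_rate([1e-8, 0.3, 2e-7], [0.0, 0.0, 0.0], [10.0, 10.0, 10.0]))  # 0.666...
```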
When studying reliability, the researcher should consider whether the algorithms are
deterministic, or non-deterministic, and repeat tests multiple times if the algorithm is non-
deterministic. Reliability can be based on a fixed starting point (if one is given with the test
set), but it is often better to use multiple starting points.
In deterministic optimization algorithms, reliability can be interpreted as the number of
problems in the given test set that are solved by the optimization algorithm. When dealing
with non-deterministic algorithms, it is important to repeat each test multiple times, to
make sure that reliability is measured in aggregate, and not skewed by a single lucky (or
unlucky) algorithmic run.
Using multiple repeats of each test raises the issue of how to aggregate the results. One
option is to consider each algorithmic run as a separate test problem and then compare
solvers across this expanded test set. This allows comparisons based on worst-case or best-
case scenarios. Another option is to use averaged data, for example, average runtime, average
solution accuracy, average reliability, etc. If averaging is used, then it is important to also
include standard deviations of the data. In either case, data collection is best performed by
considering each algorithmic run as a separate test problem, as average values can easily be
extracted from this data, while reconstructing the full test data from averaged values is not
possible.
It should be noted that, in some cases, multiple repeats of a non-deterministic method are
impractical due to the time it takes to solve a single problem.
Multiple starting points: As mentioned in Section 3, many academic test problems
come with suggested starting points. While algorithms should always be tested using these
starting points, it is often enlightening to test the algorithm using other starting points as
well. Most deterministic algorithms should show little change in performance if a starting
point is perturbed by a small random vector – provided the new starting point retains
whatever feasibility properties the algorithm requires in a starting point.
Hillstrom [Hil77] was one of the first to consider testing optimization algorithms at non-
standard starting points. He proposed using random starting points chosen from a box
surrounding the standard starting point. In another approach to this problem, in [MGH81]
the authors present a large collection of test functions along with some procedures and start-
ing points to assess the reliability and robustness of unconstrained optimization algorithms.
In some cases, prior knowledge is available about the solution of a test problem. Some
methods use such information to construct a starting point close to the optimal solution
[NW06].
Regardless of how starting points are selected, fair benchmarking requires all the algo-
rithms to use the same starting point for each test. Therefore, starting points should be
generated and stored outside of the testing process.
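A minimal sketch of this workflow follows (our illustration; the perturbation radius, seed, and file name are arbitrary choices, and NumPy is assumed): generate the perturbed starting points once, store them, and feed the identical points to every solver.

```python
import numpy as np

def make_starting_points(x0, n_points=10, radius=0.1, seed=42):
    """Reproducibly sample n_points perturbations of x0 from a box of
    half-width radius * max(1, |x0_i|) around each coordinate."""
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, dtype=float)
    half_width = radius * np.maximum(1.0, np.abs(x0))
    return x0 + rng.uniform(-1.0, 1.0, size=(n_points, x0.size)) * half_width

# Generate once and save; every solver later loads the same file.
starts = make_starting_points([0.5, 0.5])
np.save("starting_points_beale.npy", starts)
```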
It is often valuable to “normalize” these quantities by dividing by the starting accuracy,
yielding a measure γ (with an analogous version based on $x_{\mathrm{acc}}$). Note that $f_{\mathrm{acc}}$ is
multiplied by −1, so γ can be interpreted as the number of new digits of accuracy obtained,
up to a maximal improvement value of M.
Similar measures can be used to quantify the amount of constraint violation for a test
run. Considering $\min\{f(x) : g_i(x) \le 0,\ i = 1, 2, \ldots, m\}$,
$$\sum_{i=1}^{m} \max\{0, g_i(\bar{x})\} \quad \text{gives the sum of the violated constraints,}$$
$$\sum_{i=1}^{m} \big(\max\{0, g_i(\bar{x})\}\big)^2 \quad \text{gives the squared sum of the violated constraints,}$$
$$\frac{1}{m}\sum_{i=1}^{m} \max\{0, g_i(\bar{x})\} \quad \text{gives the mean constraint violation, and}$$
$$\prod_{i:\, g_i(\bar{x})>0} g_i(\bar{x}) \quad \text{amounts to the geometric mean of the violated constraints.}$$
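These summaries are straightforward to compute; the sketch below is our own illustration (assuming NumPy), with the constraints supplied as callables g_i under the convention g_i(x) ≤ 0, and with the last measure reported directly as the geometric mean of the strictly violated constraint values.

```python
import numpy as np

def violation_measures(g_list, x_bar):
    """Constraint-violation summaries for min{f(x) : g_i(x) <= 0, i = 1..m}."""
    g = np.array([g_i(x_bar) for g_i in g_list], dtype=float)
    v = np.maximum(0.0, g)                   # individual violations
    violated = g[g > 0]
    geo_mean = np.exp(np.log(violated).mean()) if violated.size else 0.0
    return {"sum": v.sum(),
            "squared_sum": (v ** 2).sum(),
            "mean": v.mean(),
            "geometric_mean": geo_mean}

# Example with two constraints, one of which is violated at x_bar = (1, 1):
print(violation_measures([lambda x: x[0] - 2.0, lambda x: x[0] + x[1] - 1.5], (1.0, 1.0)))
```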
The selection of the appropriate strategy among the variety of approaches depends on the
objectives of the experimental research, the problem structure, and the type of optimization
algorithms used. The researcher should also carefully select the success criteria, e.g., how to
fairly compare a solution that barely satisfies the constraints versus a solution that barely
violates the constraints but returns a much lower objective function value.
No known solution available: In many situations, the test set used will not have
known solutions to all problems. This is particularly true if the test set includes instances of
real-world applications. To evaluate the quality of an algorithmic output in this situation,
some new considerations are required [McG96, JMR96].
First, it should be immediately obvious that, if no known solution is available, then fixed-
target approaches cannot be applied. Fixed-cost approaches are still applicable, but since
f (x∗ ) is not known, measuring the accuracy of the final algorithmic output’s function value,
f (x̄), becomes difficult. Measuring the accuracy of the final algorithmic output’s point, x̄,
becomes essentially impossible.
To measure the quality of the final algorithmic output’s function value f (x̄), the simplest
approach is to replace f (x∗ ) with the best known value for the problem. For any given test
run, this guarantees that at least one algorithm gets the exact answer, so it is important
to select a reasonable maximal improvement value. Another approach is to estimate the
optimal solution using statistical techniques. For example, in combinatorial optimization
problems, some researchers [Dan77, Der85] use a sample of algorithmic outputs to predict
the location of the solution. In [GS85], such an approach is explained in an evaluation of
non-deterministic algorithms. Another strategy is to calculate a lower bound on the cost of
an optimal solution, and to compare the algorithmic output cost with that lower bound. As
an example, in bin packing problems the total sum of the item weights divided by the bin
capacity provides a lower bound on the number of bins used in any packing. Finally, one
may abandon comparing the algorithmic output quality with the optimal solution, and only
assess the quality of the algorithmic output against similar results published in the literature
or against the other algorithms being tested.
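For instance, for one-dimensional bin packing such a lower bound is a one-line computation (our sketch, assuming a common bin capacity).

```python
import math

def bin_packing_lower_bound(weights, capacity):
    """No packing can use fewer bins than ceil(total weight / capacity)."""
    return math.ceil(sum(weights) / capacity)

# Items totalling 3.5 units of weight need at least 4 unit-capacity bins,
# so an algorithm that uses 5 bins is within a factor 5/4 of optimal.
print(bin_packing_lower_bound([0.6, 0.7, 0.4, 0.5, 0.8, 0.3, 0.2], 1.0))  # -> 4
```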
5 Analyzing and reporting the results
Many studies use basic statistics (e.g., average solving time) to report the experimental
results. Basic statistics are a reasonable starting point, but provide little information about
the overall performance of optimization methods. Reporting methods can be loosely broken
down into three categories: numerical tables, graphics, and performance ratio methods (e.g.,
performance and data profiles).
5.1 Tables
Numerical tables provide the most complete method of reporting benchmarking results, so
for the sake of completeness, we recommend making full tables of results readily available.
However, such tables are cumbersome, so are often better included in an appendix or in
additional online material linked to an article.
As full tables of results can easily overwhelm a reader, researchers have developed var-
ious techniques that provide easy-to-understand and compact methods for reporting the
experimental results. Summary tables can be employed as a first step [SK06]. For example,
in [BDF97] optimization methods were rated by the percentage of problems for which a
method’s time is termed competitive or very competitive. The solving time of an algorithm
was called competitive if Ts ≤ 2Tmin, in which Ts is the solving time obtained by an al-
gorithm on a particular problem and Tmin is the minimum solving time obtained among
all the algorithms on that specific problem. Similarly, if Ts ≤ (4/3)Tmin, then they call that
method very competitive. Tables such as these provide good talking points for discussing
benchmarking data, but fail to give a complete picture of the results. One criticism of this
particular approach is that it does not explore how much the table would change if, for example,
the cut-off for very competitive was changed from (4/3)Tmin to (5/4)Tmin.
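A sketch of this tally (our illustration, assuming NumPy; times is a problems-by-solvers array of solving times):

```python
import numpy as np

def competitiveness_summary(times, factors=(4.0 / 3.0, 2.0)):
    """Percentage of problems on which each solver's time is within the given
    factors of the best time (e.g., 'very competitive' and 'competitive')."""
    times = np.asarray(times, dtype=float)
    t_min = times.min(axis=1, keepdims=True)         # best time per problem
    return {f: 100.0 * np.mean(times <= f * t_min, axis=0) for f in factors}

# Example with three problems and two solvers:
print(competitiveness_summary([[1.0, 1.2], [2.0, 5.0], [3.0, 3.1]]))
```

Recomputing the table for several cut-off factors, as suggested above, is then a one-line change.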
Many other forms of summary tables are present throughout optimization benchmarking
literature, however all suffer from the same fundamental problem – to be readable a summary
table must distill the results down to a highly condensed format, thereby eliminating much
of the benchmarking information.
5.2 Graphics
Well-conceived graphics can provide more information than some other data presentations.
Simple graphical methods, such as histograms, box-plots, and trajectory plots, provide a next
step in analysis, while more complete methods include performance profiles, data profiles,
and accuracy profiles. Depending on the objectives of an experimental research, one or more
of these techniques might be useful to report the results. In [Tuk77, TGM83], different types
of plots are introduced, which are useful for data representation in general.
A more specialized plot for optimization algorithms is the trajectory plot [FRK+ 08,
RS07, SR80b, Eas82, EF74, KLM+ 10, SS00]. In a trajectory plot, the performance of an
optimization algorithm on a given test problem is visualized by plotting a path that connects
the points generated by each iteration of the algorithm. An example appears in Figure 2,
where the trajectories of two algorithms attempting to minimize the Rosenbrock function
are plotted. Both algorithms begin at the point (3, 3), and the first iteration moves both
algorithms to the point (0.2, 3.5). Algorithm 1 (represented by the solid line) proceeds to
(0.7, −0.2) and continues in a zig-zag path towards the minimizer. Algorithm 2 (represented
by the dashed line) proceeds to (1.1, 1.3) and then follows a fairly straight path towards
the minimizer, albeit with very small step sizes. While trajectory plots are useful for building
a better understanding of how each algorithm behaves, they are not particularly good for
benchmarking as they can only present the results for one test problem at a time. They are
also limited to plots of functions of 2 or 3 variables, or to plotting projections onto subspaces
for more than 3 variables.
Figure 2: A sample trajectory plot.
[Figure 3: a sample convergence plot, showing the best function value found against the number of function evaluations for methods M1–M4.]
In Figure 3 the results of four optimization methods are plotted for a given test problem.
In this example, method M1 starts well, but stalls after about 300 function evaluations,
while method M2 shows steady decrease for about 800 function evaluations before stalling.
Method M3 initially decreases the fastest, but stalls after about 350 function evaluations.
Finally, method M4 starts very slowly, but ultimately finds the lowest value. Like trajectory
plots, convergence plots are useful for discussing some specific behavior of the algorithm,
but are poor for benchmarking as they can only be used to analyze one test problem at a
time.
While trajectory and convergence plots are useful to visualize a method on one problem,
their main drawback is that they represent the results for a single problem per plot. So
if the test set contains a large number of problems then it will be difficult to evaluate the
overall performance of these methods. Other types of plots can be found in the literature,
but generally have the same limitations as trajectory and convergence plots [BSV04] [RS07].
For many optimization algorithms, researchers are interested in how performance scales
with the size of the input (e.g., dimension of the problem). For such research it can be
valuable to produce a runtime plot. Runtime plots visualize the data by plotting the time
to solve across a series of problem instances with different sizes. Runtime plots can suffer
from issues similar to those of trajectory and convergence plots, namely, they represent the results
for a single series of problem instances. However, this problem can be somewhat mitigated
by aggregating data from a collection of problems to create an “average runtime” plot.
Performance profiles [DM02] are built from the performance ratio
$$r_{p,s} = \frac{t_{p,s}}{\min\{t_{p,s'} : s' \in \mathcal{S}\}}$$
for a specific problem p and solver s, where $t_{p,s}$ is the performance measure (e.g., running time)
recorded for solver s on problem p; the best solver on a problem has $r_{p,s} = 1$. The performance
profile of a solver s is defined as follows
$$\rho_s(\tau) = \frac{1}{|\mathcal{P}|}\,\mathrm{size}\{p \in \mathcal{P} : r_{p,s} \le \tau\}, \qquad (3)$$
where |P| represents the cardinality of the test set P. Then, ρs(τ) is the portion of the time
that the performance ratio rp,s for solver s ∈ S is within a factor τ ∈ R of the best possible
performance ratio.
Note that ρs (1) represents the percentage of problems for which solver s ∈ S has the best
performance among all the other solvers. And for τ sufficiently large, ρs (τ ) is the percentage
of the test set that can be solved by s ∈ S. Solvers with consistently high values for ρs (τ )
are of interest.
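The profile in (3) is easy to compute from a matrix of timings; the following sketch is our own illustration (assuming NumPy, with failed runs recorded as np.inf) and returns ρs evaluated on a grid of τ values.

```python
import numpy as np

def performance_profile(times, taus):
    """times: (n_problems, n_solvers) array, with np.inf marking failures.
    Returns an array of shape (len(taus), n_solvers) holding rho_s(tau)."""
    times = np.asarray(times, dtype=float)
    best = times.min(axis=1, keepdims=True)           # best time per problem
    ratios = times / best                              # r_{p,s}; best solver = 1
    taus = np.asarray(taus, dtype=float).reshape(-1, 1, 1)
    return np.mean(ratios[None, :, :] <= taus, axis=1)

# rho_s(1) is each solver's share of wins; a large tau gives its overall success rate.
times = np.array([[1.0, 2.0], [4.0, 3.0], [2.0, np.inf]])
print(performance_profile(times, [1.0, 2.0, 4.0]))
```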
Figure 4 shows a sample performance profile plot (created using data from [BHLH15])
for logarithmic values of τ . The logarithmic values are employed to deal with smaller values
for τ. This results in a more readable plot that also shows the long-term behavior of the
methods. To demonstrate the difference, Figure 5 shows the same performance profile using
non-logarithmic values of τ . Depending on the data collected, logarithmic or non-logarithmic
values of τ may be more appropriate. Researchers should create both profiles, but it may
be only necessary to provide one in the final manuscript.
3 We thank “Mathematics Referee #1” for pointing out that reference.
[Figure 4: performance profiles ρs(τ) for methods M1–M4, plotted against log2(τ).]
[Figure 5: the same performance profiles plotted against non-logarithmic values of τ.]
Some variations use a function-value-based quantity, rather than time, as the performance
measure to compare the profiles. In particular, $t_{p,s}$ is replaced with a measure for problem p
and solver s in which $f_w$ is the largest (worst) function value obtained among all the algorithms,
and $\hat{f}_{p,s}$ is the estimated function value after k function evaluations.
In another example, [SK15] creates a performance measure based on proximity to optimal
points.
The primary advantage of performance profiles is that they implicitly include both speed
and success rate in the analysis. The value of ρs (α) gives a sense of how promising the
algorithmic outputs are relative to the best solution found by all the optimization algorithms
that are compared together.
One criticism of performance profiles is that the researcher must select a definition of
when the convergence test passes or fails. Changing this definition can substantially change
the performance profile [HVL11]. Also note that if a fixed-cost approach is used to perform
the benchmarking experiments, then performance profiles become inappropriate, as all
algorithms will use the same “time”. Another criticism is that the profile is only showing
performance with respect to the best method and does not allow one to compare other
methods with each other due to the appearance of a switching phenomenon [GS16].
Nonetheless, performance profiles have become a gold-standard in modern optimization
benchmarking, and should be included in optimization benchmarking analysis whenever
possible with an appropriate interpretation.
Accuracy profiles are built from the measure $\gamma_{p,s} = \min\{-f_{\mathrm{acc}}^{p,s}, M\}$ (the number of digits of
accuracy obtained, capped at a maximal improvement value M),
where $f_{\mathrm{acc}}^{p,s} = \log_{10}\big(f(\bar{x}_{p,s}) - f(x_p^*)\big) - \log_{10}\big(f(x_p^0) - f(x_p^*)\big)$, $\bar{x}_{p,s}$ is the candidate solution
point obtained by solver s on problem p, $x_p^*$ is the optimal point for problem p, and $x_p^0$ is
the initial point for problem p. The performance of the solver s ∈ S on the test set P is
measured using the following function
$$R_s(\tau) = \frac{1}{|\mathcal{P}|}\,\mathrm{size}\{\gamma_{p,s} \mid \gamma_{p,s} \ge \tau,\ p \in \mathcal{P}\}.$$
The accuracy profile Rs (τ ) shows the proportion of problems such that the solver s ∈ S is
able to obtain a solution within an accuracy of τ of the best solution. An example accuracy
profile (using data from [HS06]) appears in Figure 6.
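The quantities above translate directly into code; the sketch below is our own illustration (assuming NumPy), where f_bar has shape (problems, solvers), f_star and f_0 have one entry per problem, and max_digits plays the role of the maximal improvement value M.

```python
import numpy as np

def accuracy_profile(f_bar, f_star, f_0, taus, max_digits=16):
    """Compute gamma_{p,s} = min{-f_acc^{p,s}, max_digits} and return R_s(tau)."""
    f_bar = np.asarray(f_bar, dtype=float)
    f_star = np.asarray(f_star, dtype=float).reshape(-1, 1)
    f_0 = np.asarray(f_0, dtype=float).reshape(-1, 1)
    f_acc = np.log10(f_bar - f_star) - np.log10(f_0 - f_star)
    gamma = np.minimum(-f_acc, max_digits)      # digits of accuracy gained
    taus = np.asarray(taus, dtype=float).reshape(-1, 1, 1)
    return np.mean(gamma[None, :, :] >= taus, axis=1)

# Example: two problems, two solvers, profile evaluated at 2, 4, and 6 digits.
f_bar = np.array([[1e-6, 1e-3], [1e-9, 1e-2]])
print(accuracy_profile(f_bar, [0.0, 0.0], [1.0, 1.0], [2.0, 4.0, 6.0]))
```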
[Figure 6: accuracy profiles for the compared methods (horizontal axis: relative accuracy obtained).]
In Figure 6, we see 4 methods (M1, M2, M3, and M4) plotted against each other in an
accuracy profile format. Examining the profile, notice that method M1 achieves 5 digits
of accuracy on almost all test-problems, and 6 digits of accuracy on about 75% of test
problems. All other methods achieve this level of accuracy on 50% or less of the test problems.
Thus, if 5 or 6 digits is the desired level of accuracy, then M1 is a clear winner. However,
if the particular application requires much higher accuracy, then M3 becomes a contender.
Indeed, only M3 was able to achieve 12 digits of accuracy on any reasonable portion of the
test problems. (In this particular test, accuracy was capped at 16 digits, but no method
managed to achieve this on a significant portion of the test problems.)
Accuracy profiles do not provide as much information as performance profiles, but are
suitable when fixed-cost data sets are collected. This is appropriate in cases where the cost
of obtaining the exact solution exceeds the budget, so the optimization target is to find as
good a solution as possible within a limited time.
A data profile is defined by
$$d_s(\alpha) = \frac{1}{|\mathcal{P}|}\,\mathrm{size}\Big\{p \in \mathcal{P} : \frac{t_{p,s}}{n_p + 1} \le \alpha\Big\},$$
in which $t_{p,s}$ is the number of function evaluations required to satisfy the convergence
test, $n_p$ is the number of variables in problem p ∈ P, and $d_s(k)$ is the percentage of
problems that can be solved with $k(n_p + 1)$ function evaluations. The value $k(n_p + 1)$ is
used since $n_p + 1$ is the number of function evaluations required to compute a “simplex
gradient” (a one-sided finite-difference estimate of the gradient).
It is worth noting that data profiles could easily be defined by replacing $t_{p,s}/(n_p + 1)$ with
any other measure of fundamental evaluations. Moreover, if $t_{p,s}/(n_p + 1)$ is replaced by iterations, then
data profiles become a slight variation of the operational characteristics defined in [SS00].
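A compact sketch of the data-profile computation follows (ours, assuming NumPy): evals holds the number of function evaluations each solver needed to pass the convergence test (np.inf when it never did), and dims holds each problem's dimension n_p.

```python
import numpy as np

def data_profile(evals, dims, alphas):
    """d_s(alpha): fraction of problems with t_{p,s} / (n_p + 1) <= alpha."""
    evals = np.asarray(evals, dtype=float)
    dims = np.asarray(dims, dtype=float).reshape(-1, 1)
    budgets = evals / (dims + 1.0)                    # in units of simplex gradients
    alphas = np.asarray(alphas, dtype=float).reshape(-1, 1, 1)
    return np.mean(budgets[None, :, :] <= alphas, axis=1)

# Example: two 10-dimensional problems, two solvers, budgets of 5 and 20 simplex gradients.
evals = np.array([[40.0, 200.0], [np.inf, 120.0]])
print(data_profile(evals, [10, 10], [5.0, 20.0]))
```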
[Figure 7: data profiles ds(α) for methods M1–M4 (horizontal axis: number of function evaluations).]
Figure 7 shows a typical data profile. Suppose the user has a budget limit of 100 simplex
gradients. According to Figure 7, with this budget method M4 has the best performance,
solving roughly 22% of the problems, while M3 has the worst performance among all the
solvers, since with this budget it does not solve any problems.
Like performance profiles, data profiles are cumulative distribution functions, and thus,
monotone increasing step functions with a range in [0, 1]. Data profiles do not provide the
number of function evaluations required to solve a specific problem, but instead provide a
visualization of the aggregate data. Also note that the data profile for a given solver s ∈ S
is independent of other solvers; this is not the case for performance profiles.
Although the data profiles are useful for benchmarking, they have not received the same
extensive attention as the performance profiles. This is partly because they are newer, but
perhaps also because they are primarily used with derivative-free optimization algorithms.
However, data profiles could be easily adjusted to a broader class of algorithms by replacing
tp,s with any measure of time, and np + 1 by any dimensional normalization factor. For
example, for a sub-gradient based method, ds (α) could be redefined as
$$d_s(\alpha) = \frac{1}{|\mathcal{P}|}\,\mathrm{size}\{p \in \mathcal{P} : g_{p,s} \le \alpha\},$$
where gp,s is the number of sub-gradient evaluations. This might make them an appropriate
tool for benchmarking bundle methods [HUL93, §XIV-XV].
Table 4 summarizes the reporting methods discussed in this section.
Table 4: Summary of reporting methods.

Reporting method | Evaluates | Advantage | Drawback | Recommendation
Full Data Tables | — | Comprehensive | Overwhelming | Provide in appendix or online data set.
Summary Tables; Simple Graphs | Varies | Brief | Incomplete | Provide as talking point, but include other forms of analysis.
Trajectory Plots; Convergence Plots | Speed and Accuracy; Efficiency | Clear; Precise | Examines one problem at a time. | Good for case-studies, but should include other forms of analysis for larger data sets.
Performance Profiles; Accuracy Profiles; Data Profiles | Speed and Robustness; Accuracy; Speed and Robustness | Strong graphical representation that incorporates the entire dataset. | Cannot be used for fixed-cost data sets. | Include at least one of these three profiles whenever possible.
7 Conclusion
This article reviews the issue of benchmarking optimization algorithms. A careful, less-
biased, explicitly-stated, and comprehensive evaluation of optimization algorithms requires
an a priori benchmarking design. To develop an appropriate experimental design, the first
step is to clarify the questions that are to be answered
by the experiment. This includes selecting a suitable test set and suitable performance mea-
sures based on the objectives of the research. The data must be analyzed and processed in a
transparent, fair, and complete manner. Within this paper we discuss each of these topics,
and present a review of the state-of-the-art for each of these steps. We include several tables
and figures that summarize the process, and provide key advice designed to lead to a fair
benchmarking process.
A final important point must be raised in regards to optimization benchmarking:
as in all scientific research, benchmarking optimization algorithms
should be reported in a manner that allows for reproducibility of the
experiments.
When reporting results, make sure to describe algorithms, parameters, test problems, the
computational environment, and the statistical techniques employed with an acceptable level
of detail. It should be acknowledged that it is usually difficult to provide enough information
in a published paper to enable the reader to rerun the stated experiments and replicate
completely the reported results. Moreover, the pace of computational development is so
high that it is virtually impossible to entirely reproduce a computational experiment, due to
development and modifications in operating systems, computer architecture, programming
languages, etc. However, the minimum standard for replication of the experiments is that at
least the authors themselves should be able to replicate the experiments [CDM79]. Therefore,
it is important that the researcher keep all the programs and data necessary to redo all the
computations and recreate all graphs. Such programs should be made available whenever
possible.
A benchmarking challenge that we have not addressed is how to compare optimization
algorithms that are different in nature5 . For example, consider the comparison of a de-
terministic and a non-deterministic method [GK17, KM17]. If multiple repeats of the
non-deterministic method are considered, is it fair to compare their average quality to the
single run of the deterministic method? Some ideas on this, including a proposed method
for comparing deterministic and non-deterministic methods, can be found in [SKM16].
Another benchmarking challenge that has not been fully addressed is how to compare al-
gorithms that approach the same problem from fundamentally different viewpoints6 . For
example, when working with constrained optimization problems, some researchers have ex-
plored infeasible point methods while others have focused on interior point methods. Infea-
sible point methods typically take a two phase approach, where one phase aims for decrease
in function value and the second phase aims to improve feasibility. Interior point methods
assume a strictly feasible starting point and use some form of penalty function to maintain
feasibility of all trial points. Comparing these two styles of algorithms is very challenging,
and possibly meaningless, as one assumes an infeasible starting point and the other assumes
a feasible starting point. Other algorithms adopt a hybrid approach by approximating the
feasible set with some tolerance [RW17]; in that case, the tolerance parameter could greatly
influence the result of the comparison.
A source of debate in benchmarking global optimization algorithms is how to deal with
rescaling of the domain7 . Many global optimization algorithms are designed with the baseline
assumption that the optimization problem’s constrained region is the unit hypercube [0, 1]n .
Of course, in practical applications this is not always true. Some algorithms deal with this
at the solver level, using the constraint set’s diameter to select parameters like initial step
lengths; while other algorithms deal with this at the problem level, assuming that the end-
user will rescale the constraint set to be the unit hypercube (which is not always easy to do).
Comparisons of algorithms that place fundamentally different assumptions on the problem
structure may impact the selection of an appropriate test set and may limit the conclusions
one can draw from the numerical results.
Another potential limitation on what conclusions can be drawn from a numerical study
is the sensitivity of the results to algorithmic parameters8 . A robust study should investigate a range
of parameters and report on their impact on the validity of the conclusions. We leave the
complexity of how best to report such information to future research.
5 We thank “Mathematics Referee #1” for pointing out this challenge.
6 We thank “Engineering Referee #3” and “Mathematics Referee #2” for pointing out this challenge.
7 We thank “Mathematics Referee #2” for pointing out this challenge.
8 We thank “Mathematics Referee #1” for pointing out this challenge.
References
[ACM91] B. M. Averick, R. G. Carter, and J. J. Moré. The MINPACK-2 test problem
collection. Technical report, Argonne National Laboratory, Argonne, IL, 1991.
[ADO10] C. Audet, C. K. Dang, and D. Orban. Algorithmic parameter optimization of
the DFO method with the OPAL framework. In Ken Naono, Keita Teranishi,
John Cavazos, and Reiji Suda, editors, Software Automatic Tuning, pages 255–
274. Springer New York, 2010.
[ADO14] C. Audet, K.-C. Dang, and D. Orban. Optimization of algorithms with OPAL.
Math. Program. Comput., 6(3):233–254, 2014.
[AKZ05] M. M. Ali, C. Khompatraporn, and Z. B. Zabinsky. A numerical evaluation
of several stochastic algorithms on selected continuous global optimization test
problems. Journal of Global Optimization, 31(4):635–672, 2005.
[AL07] B. Addis and M. Locatelli. A new class of test functions for global optimization.
Journal of Global Optimization, 38(3):479–501, 2007.
[ALDP14] C. Audet, S. Le Digabel, and M. Peyrega. Linear equalities in blackbox opti-
mization. Technical report, Les Cahiers du GERAD, 2014.
[And08] N. Andrei. An unconstrained optimization test functions collection. Advanced
Modeling and Optimization, 10(1):147–161, 2008.
[AO06] C. Audet and D. Orban. Finding optimal algorithmic parameters using
derivative-free optimization. SIAM J. Optim., 17(3):642–664, 2006.
[Asa73] J. Asaadi. A computational comparison of some non-linear programs. Mathe-
matical Programming, 4(1):144–154, 1973.
[Bar70] Y. Bard. Comparison of gradient methods for the solution of nonlinear param-
eter estimation problems. SIAM Journal on Numerical Analysis, 7(1):157–186,
1970.
[Bar87] R. R. Barton. Testing strategies for simulation optimization. In Proceedings
of the 19th Conference on Winter Simulation, WSC ’87, pages 391–401, New
York, NY, USA, 1987. ACM.
[BBLP05] T. Bartz-Beielstein, C. W. G. Lasarczyk, and M. Preuss. Sequential parame-
ter optimization. In The 2005 IEEE Congress on Evolutionary Computation,
volume 1, pages 773–780, 2005.
[Bel69] E. J. Beltrami. A comparison of some recent iterative methods for the numerical
solution of nonlinear programs. In M. Beckmann and H. P. Künzi, editors, Com-
puting Methods in Optimization Problems, volume 14 of Lecture Notes in Op-
erations Research and Mathematical Economics, pages 20–29. Springer Berlin
Heidelberg, 1969.
[Ber13] T. Berthold. Measuring the impact of primal heuristics. Operations Research
Letters, 41(6):611 – 614, 2013.
[BGK+ 95] R. S. Barr, B. L. Golden, J. P. Kelly, M. G. C. Resende, and W. R. Stewart Jr.
Designing and reporting on computational experiments with heuristic methods.
Journal of Heuristics, 1(1):9–32, 1995.
[BGKR10] A. Balint, D. Gall, G. Kapler, and R. Retz. Experiment design and admin-
istration for computer clusters for SAT-solvers (EDACC), system description.
Journal on Satisfiability, Boolean Modeling and Computation, 7:77–82, 2010.
[BH93] R. S. Barr and B. L. Hickman. Reporting computational experiments with
parallel algorithms: Issues, measures, and experts’ opinions. INFORMS Journal
on Computing, 5(1):2–18, 1993.
[BHBG07] M. Baz, B. Hunsaker, P. Brooks, and A. Gosavi. Automated tuning of op-
timization software parameters. Technical report, University of Pittsburgh,
Department of Industrial Engineering, 2007.
[BHLH15] V. Beiranvand, W. Hare, Y. Lucet, and S. Hossain. Multi-haul quasi network
flow model for vertical alignment optimization. Technical report, Computer
Science, University of British Columbia, Kelowna BC, Canada, 2015.
[Bir09] M. Birattari. Tuning Metaheuristics: A Machine Learning Perspective. Springer
Publishing Company, Incorporated, 1st ed. 2005. 2nd printing edition, 2009.
[Box66] M. J. Box. A comparison of several current optimization methods, and the use
of transformations in constrained problems. The Computer Journal, 9(1):67–77,
1966.
[BSV03] H. Y. Benson, D. F. Shanno, and R. J. Vanderbei. A comparative study of
large-scale nonlinear optimization algorithms. In Gianni Di Pillo and Almerico
Murli, editors, High Performance Algorithms and Software for Nonlinear Opti-
mization, volume 82 of Applied Optimization, pages 95–127. Springer US, 2003.
[BSV04] Hande Y. Benson, David F. Shanno, and Robert J. Vanderbei. Interior-point
methods for nonconvex nonlinear programming: jamming and numerical test-
ing. Math. Program., 99(1, Ser. A):35–48, 2004.
[Buc92] A. G. Buckley. Algorithm 709: Testing algorithm implementations. ACM
Transactions on Mathematical Software (TOMS), 18(4):375–391, December
1992.
[CDM79] H. Crowder, R. S. Dembo, and J. M. Mulvey. On reporting computational
experiments with mathematical software. ACM Transactions on Mathematical
Software (TOMS), 5(2):193–203, 1979.
[CGT96] A. R. Conn, N. Gould, and P. L. Toint. Numerical experiments with the
LANCELOT package (release A) for large-scale nonlinear optimization. Math-
ematical Programming, 73(1):73–110, 1996.
[Col68] A. R. Colville. A comparative study of nonlinear programming codes. Technical
Report 320-2949, IBM Scientific Center, New York, 1968.
[CPL14] CPLEX’s automatic tuning tool. Technical report, IBM Corporation, 2014.
[Dan77] D. G. Dannenbring. Procedures for estimating optimal solution values for large
combinatorial problems. Management Science, 23(12):1273–1283, 1977.
[Dem76] R. S. Dembo. A set of geometric programming test problems and their solutions.
Mathematical Programming, 10(1):192–213, 1976.
[Dem78] R. S. Dembo. Current state of the art of algorithms and computer software
for geometric programming. Journal of Optimization Theory and Applications,
26(2):149–183, 1978.
[Der85] U. Derigs. Using confidence limits for the global optimum in combinatorial
optimization. Operations Research, 33(5):1024–1049, 1985.
[FPS02] D. Famularo, P. Pugliese, and Y. D. Sergeyev. Test problems for Lipschitz uni-
variate global optimization with multiextremal constraints. In Gintautas Dze-
myda, Vydūnas Šaltenis, and Antanas Žilinskas, editors, Stochastic and global
optimization, volume 59 of Nonconvex Optim. Appl., pages 93–109. Kluwer
Acad. Publ., Dordrecht, 2002.
[FRK+ 08] K. R. Fowler, J. P. Reese, C. E. Kees, J. E. Dennis Jr., C. T. Kelley, C. T. Miller,
C. Audet, A. J. Booker, G. Couture, R. W. Darwin, M. W. Farthing, D. E.
Finkel, J. M. Gablonsky, G. Gray, and T. G. Kolda. Comparison of derivative-
free optimization methods for groundwater supply and hydraulic capture com-
munity problems. Advances in Water Resources, 31(5):743 – 757, 2008.
[GJ09a] J. C. Gilbert and X. Jonsson. LIBOPT an environment for testing solvers
on heterogeneous collections of problems the manual, version 2.1. Technical
report, INRIA, BP 105, 78153 Le Chesnay, 2009.
[GJ09b] D. Grundel and D. Jeffcoat. Combinatorial test problems and problem genera-
tors. In Christodoulos A. Floudas and Panos M. Pardalos, editors, Encyclopedia
of Optimization, pages 391–394. Springer US, 2009.
[Hil77] K. E. Hillstrom. A simulation test approach to the evaluation of nonlinear op-
timization algorithms. ACM Transactions on Mathematical Software (TOMS),
3(4):305–315, 1977.
[HJ82] K. L. Hoffman and R. H. F. Jackson. In pursuit of a methodology for testing
mathematical programming software. In JohnM. Mulvey, editor, Evaluating
Mathematical Programming Techniques, volume 199 of Lecture Notes in Eco-
nomics and Mathematical Systems, pages 177–199. Springer Berlin Heidelberg,
1982.
[HJL92] P. Hansen, B. Jaumard, and S. H. Lu. Global optimization of univariate Lips-
chitz functions: II. New algorithms and computational comparison. Mathemat-
ical Programming, 55(1-3):273–292, 1992.
[HKT01] P. Hough, T. Kolda, and V. Torczon. Asynchronous parallel pattern search for
nonlinear optimization. SIAM Journal on Scientific Computing, 23(1):134–156,
2001.
[HL70] H. Y. Huang and A. V. Levy. Numerical experiments on quadratically con-
vergent algorithms for function minimization. Journal of Optimization Theory
and Applications, 6(3):269–282, 1970.
[HMSW53] A. Hoffman, M. Mannos, D. Sokolowsky, and N. Wiegmann. Computational
experience in solving linear programs. Journal of the Society for Industrial and
Applied Mathematics, 1(1):17–33, 1953.
[HP14] W. Hare and C. Planiden. The NC-proximal average for multiple functions.
Optimization Letters, 8(3):849–860, 2014.
[HRCV88] E. N. Houstis, J. R. Rice, C. C. Christara, and E. A. Vavalis. Performance
of scientific software. In J.R. Rice, editor, Mathematical Aspects of Scientific
Software, volume 14 of The IMA Volumes in Mathematics and Its Applications,
pages 123–155. Springer New York, 1988.
[HS81] W. Hock and K. Schittkowski. Test examples for nonlinear programming codes.
Lecture notes in economics and mathematical systems. Springer-Verlag, 1981.
[HS83] W. Hock and K. Schittkowski. A comparative performance evaluation of 27
nonlinear programming codes. Computing, 30(4):335–358, 1983.
[JBNP90] R. H. F. Jackson, P. T. Boggs, S. G. Nash, and S. Powell. Guidelines for
reporting results of computational experiments. report of the ad hoc committee.
Mathematical Programming, 49(1-3):413–425, 1990.
[JMR96] D. S. Johnson, L. A. McGeoch, and E. E. Rothberg. Asymptotic experimental
analysis for the Held-Karp traveling salesman bound. In Proceedings of the
Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’96,
pages 341–350, Philadelphia, PA, USA, 1996. Society for Industrial and Applied
Mathematics.
[JY13] M. Jamil and X. S. Yang. A literature survey of benchmark functions for global
optimisation problems. International Journal of Mathematical Modelling and
Numerical Optimisation, 4(2):150–194, 2013.
[KAA+ 11] T. Koch, T. Achterberg, E. Andersen, O. Bastert, T. Berthold, R. E. Bixby,
E. Danna, G. Gamrath, A. M. Gleixner, S. Heinz, A. Lodi, H. Mittelmann,
T. Ralphs, D. Salvagnin, D. E. Steffy, and K. Wolter. MIPLIB 2010. Mathe-
matical Programming Computation, 3(2):103–163, 2011.
[KLM+ 10] M. Kortelainen, T. Lesinski, J. Moré, W. Nazarewicz, J. Sarich, N. Schunck,
M. V. Stoitsov, and S. Wild. Nuclear energy density optimization. Physical
Review C, 82(2):024313, 2010.
[KLT03] T. G. Kolda, R. M. Lewis, and V. Torczon. Optimization by direct search: new
perspectives on some classical and modern methods. SIAM Rev., 45(3):385–482,
2003.
[KM16] D. E. Kvasov and M. S. Mukhametzhanov. One-dimensional global
search: Nature-inspired vs. lipschitz methods. AIP Conference Proceedings,
1738(1):400012, 2016.
[KM17] D. E. Kvasov and M. S. Mukhametzhanov. Metaheuristic vs. deterministic
global optimization algorithms: The univariate case. Applied Mathematics and
Computation, 2017.
[Knu94] D. E. Knuth. The Stanford GraphBase: a platform for combinatorial computing,
volume 37. Addison-Wesley Publishing Company, 1994.
[KPZ05] C. Khompatraporn, J. D. Pinter, and Z. B. Zabinsky. Comparative assessment
of algorithms and software for global optimization. Journal of Global Optimiza-
tion, 31(4):613–633, 2005.
[KS15] D. E. Kvasov and Y. D. Sergeyev. Deterministic approaches for solving practical
black-box global optimization problems. Advances in Engineering Software,
80:58 – 66, 2015.
[LL96] A. LaMarca and R. Ladner. The influence of caches on the performance of
heaps. Journal of Experimental Algorithmics (JEA), 1, 1996.
[LM84] M. L. Lenard and M. Minkoff. Randomly generated test problems for positive
definite quadratic programming. ACM Transactions on Mathematical Software,
10(1):86–96, January 1984.
[LZ00] D. Liu and X. S. Zhang. Test problem generator by neural network for algo-
rithms that try solving nonlinear programming problems globally. Journal of
Global Optimization, 16(3):229–243, 2000.
[McG96] C. C. McGeoch. Toward an experimental method for algorithm simulation.
INFORMS Journal on Computing, 8(1):1–15, 1996.
[McG02] C. C. McGeoch. Experimental analysis of algorithms. In Panos M. Pardalos
and H. Edwin Romeijn, editors, Handbook of Global Optimization, volume 62
of Nonconvex Optimization and Its Applications, pages 489–513. Springer US,
2002.
[MGH81] J. J. Moré, B. S. Garbow, and K. E. Hillstrom. Testing unconstrained op-
timization software. ACM Transactions on Mathematical Software (TOMS),
7(1):17–41, March 1981.
[Mit03] H. D. Mittelmann. An independent benchmarking of SDP and SOCP solvers.
Mathematical Programming, 95(2):407–430, 2003.
[MP06] H. D. Mittelmann and A. Pruessner. A server for automated performance
analysis of benchmarking data. Optimization Methods and Software, 21(1):105–
120, 2006.
[MTL72] A. Miele, J. L. Tietze, and A. V. Levy. Comparison of several gradient al-
gorithms for mathematical programming problems. Aero-astronautics report
no.94, Rice University, Houston, Texas, 1972.
[NE06] V. Nannen and A. E. Eiben. A method for parameter calibration and relevance
estimation in evolutionary algorithms. In Proceedings of the 8th Annual Confer-
ence on Genetic and Evolutionary Computation, GECCO ’06, pages 183–190,
New York, NY, USA, 2006. ACM.
[Net] Netlib: Netlib linear programming library.
[NL14] C.-K. Ng and D. Li. Test problem generator for unconstrained global optimiza-
tion. Comput. Oper. Res., 51:338–349, 2014.
[NN91] S. Nash and J. Nocedal. A numerical study of the limited memory BFGS
method and the Truncated-Newton method for large scale optimization. SIAM
Journal on Optimization, 1(3):358–372, 1991.
[Pin02] J. D. Pintér. Global optimization: Software, test problems, and applications.
In PanosM. Pardalos and H.Edwin Romeijn, editors, Handbook of Global Op-
timization, volume 62 of Nonconvex Optimization and Its Applications, pages
515–569. Springer US, 2002.
[Pin07] J. D. Pintér. Nonlinear optimization with GAMS /LGO. Journal of Global
Optimization, 38(1):79–101, 2007.
[PK13] J. D. Pintér and F. J. Kampas. Benchmarking nonlinear optimization software
in technical computing environments. TOP, 21(1):133–162, 2013.
[PRCLF12] J. A. Parejo, A. Ruiz-Cortés, S. Lozano, and P. Fernandez. Metaheuristic opti-
mization frameworks: a survey and benchmarking. Soft Computing, 16(3):527–
561, 2012.
[PSKv14] R. Paulavičius, Y. D. Sergeyev, D. E. Kvasov, and J. Žilinskas. Globally-biased
Disimpl algorithm for expensive global optimization. J. Global Optim., 59(2-
3):545–567, 2014.
[Rid07] E. Ridge. Design of experiments for the tuning of optimisation algorithms.
University of York, Department of Computer Science, 2007.
[RK10] E. Ridge and D. Kudenko. Tuning an algorithm using design of experiments. In
Thomas Bartz-Beielstein, Marco Chiarandini, Lus Paquete, and Mike Preuss,
editors, Experimental Methods for the Analysis of Optimization Algorithms,
pages 265–286. Springer Berlin Heidelberg, 2010.
[RM78] M. J. Rijckaert and X. M. Martens. Comparison of generalized geometric
programming algorithms. Journal of Optimization Theory and Applications,
26(2):205–242, 1978.
[Ros14] R. E. Rosenthal. GAMS–a user’s guide. Technical report, GAMS Development
Corporation, 2014.
[RS07] R. G. Regis and C. A. Shoemaker. A stochastic radial basis function method
for the global optimization of expensive functions. INFORMS Journal on Com-
puting, 19(4):497–509, 2007.
[RS13] L. Rios and N. V. Sahinidis. Derivative-free optimization: a review of algorithms
and comparison of software implementations. Journal of Global Optimization,
56(3):1247–1293, 2013.
[RU01] R. L. Rardin and R. Uzsoy. Experimental evaluation of heuristic optimization
algorithms: A tutorial. Journal of Heuristics, 7(3):261–304, 2001.
[RW77] H. Ramsin and P. Wedin. A comparison of some algorithms for the nonlinear
least squares problem. BIT Numerical Mathematics, 17(1):72–90, 1977.
[RW17] R. G. Regis and S. M. Wild. CONORBIT: constrained optimization by ra-
dial basis function interpolation in trust regions. Optimization Methods and
Software, 0(0):1–29, 2017.
[RXMC03] M. Romesis, M. Xie, K. Minkovich, and J. Cong. Optimality study project.
Technical report, UCLA Computer Science Department, 2003. http://cadlab.cs.ucla.edu/~pubbench/.
[Sch80] K. Schittkowski. Nonlinear programming codes: information, tests, perfor-
mance. Lecture notes in economics and mathematical systems. Springer-Verlag,
1980.
[Sch93] F. Schoen. A wide class of test functions for global optimization. Journal of
Global Optimization, 3(2):133–137, 1993.
[Sch08] K. Schittkowski. An updated set of 306 test problems for nonlinear program-
ming with validated optimal solutions - user’s guide. Technical report, Depart-
ment of Computer Science, University of Bayreuth, 2008.
[TM10] N. P. Tedford and J. R. R.A. Martins. Benchmarking multidisciplinary design
optimization algorithms. Optimization and Engineering, 11(1):159–183, 2010.
[Tuk77] J. W. Tukey. Exploratory data analysis. Pearson, 1977.
[TŽ89] A. Törn and A. Žilinskas. Global optimization, volume 350 of Lecture Notes in
Computer Science. Springer-Verlag, Berlin, 1989.
[VBB05] F. Vanden Berghen and H. Bersini. CONDOR, a new parallel, constrained ex-
tension of powell’s UOBYQA algorithm: Experimental results and comparison
with the DFO algorithm. Journal of Computational and Applied Mathematics,
181(1):157 – 175, 2005.
[VS99] R. J. Vanderbei and D. F. Shanno. An Interior-Point algorithm for nonconvex
nonlinear programming. Computational Optimization and Applications, 13(1-
3):231–252, 1999.
[VV07] A. I. F. Vaz and L. N. Vicente. A particle swarm pattern search method
for bound constrained global optimization. Journal of Global Optimization,
39(2):197–219, 2007.