A Genetic Algorithm for the Index Selection
Problem
Jozef Kratica1 , Ivana Ljubić2 and Dušan Tošić3
1
2
Institute of Mathematics, Serbian Academy of Sciences and Arts,
Kneza Mihajla 35/I, pp. 367, 11001 Belgrade, Yugoslavia
[email protected]
Institute of Computer Graphics and Algorithms, Vienna University of Technology,
Favoritenstraße 9–11/186, 1040 Vienna, Austria
[email protected]
3
Faculty of Mathematics, University of Belgrade,
Studentski trg 16, Belgrade, Yugoslavia
[email protected]
Abstract. This paper considers the problem of minimizing the response
time for a given database workload by a proper choice of indexes. This
problem is NP-hard and known in the literature as the Index Selection
Problem (ISP).
We propose a genetic algorithm (GA) for solving the ISP. Computational
results of the GA on standard ISP instances are compared to branchand-cut method and its initialisation heuristics and two state of the art
MIP solvers: CPLEX and OSL. These results indicate good performance,
reliability and efficiency of the proposed approach.
1
Introduction
The Index Selection Problem (ISP) is one of the crucial problems in the physical
design of databases. The main goal is to choose a subset of given indexes to be
created in a database, so that the response time for a given database workload
is minimized.
The ISP represents a generalization of the well-known Uncapacitated Facility
Location Problem - UFLP, which is known as NP-hard in a strong sense [1].
Some authors also consider the ISP problem as the multilevel (or multi-stage)
uncapacitated facility location problem ([2, 3]). Suppose we are given a set of
indexes (that can be built or not) and a set of queries. Each built index requires
a maintenance time, while each query can be answered in answer time, which
depends on the set of built indexes. The main goal is to minimize the overall
execution time, defined as the sum of the maintenance times plus the sum of the
answer times for all queries.
In the simple case, each query can be answered either without using any
index, in a given answer time, or with using one built index, reducing answer
time by a gain specified for every index usable for query. In this situation the
problem can be formulated as an UFLP.
2
Jozef Kratica, Ivana Ljubić and Dušan Tošić
However, in most database management systems each query may use not only
a single index, but a set of indexes ([4, 5]). This problem can not be formulated
as a pure UFLP, but must be generalized and formulated as an ISP. Detailed
description of the problem and analysis of all aspects of its influence on the
managing of database systems is beyond the scope of this paper. Some important
survey articles are [6, 7].
In this paper we propose a genetic algorithm (GA) for solving the ISP. In
the following section, we will present the mathematical formulation of the problem and point out the previous work. Section 3 explains in detail all important
aspects of the GA: encoding, evaluation of solutions and genetic operators. In
Section 4, we will compare the results of the GA with branch-and-cut (BnC) algorithm proposed in [8, 9]. Results are also compared with those obtained using
two state of the art MIP solvers: CPLEX and OSL. Finally, in Section 5, we will
draw out some conclusions and propose ideas for the future work.
2
The Index Selection Problem
Let N = {1, 2, ..., n} be the set of all indexes and M = {1, 2, ..., m} the set
of all queries for a database. Each index can be built or not - each built index
requires a given maintenance time fj > 0. In the ISP, not only single indexes,
but also sets of indexes can be used for a query. Therefore, we are given a set
of configurations P = {1, 2, ..., p} - each configuration k ∈ P is associated with
some subset Nk ⊂ N of indexes. A configuration is active, if all its indexes are
built. When all indexes of a certain configuration k ∈ P are built, during the
execution of a query i ∈ M , we gain gik ≥ 0 time. In practice, for the most pairs
(i,k), i ∈ M , k ∈ P the gains gik are equal zero. This can easily be explained
with the fact that a certain configuration has an influence only on a limited
number of queries from M . Our goal is to build the indexes, so that the total
time needed to execute all the queries is minimized, i.e. that the total time of
gains is maximized. Formally, the problem can be stated as:
XX
X
max
gik · xik −
fj · yj
(1)
i∈M k∈P
j∈N
subject to
X
xik ≤ 1,
i∈M
(2)
k∈P
xik ≤ yj ,
xik , yj ∈ {0, 1},
i ∈ M, j ∈ N, k ∈ P
i ∈ M, j ∈ N, k ∈ P
(3)
(4)
In this formulation, each solution is represented by the set of built indexes
S ⊂ N , where y represents a characteristic vector of S, i.e.:
A Genetic Algorithm for the Index Selection Problem
yj =
½
1, index j ∈ S
0, otherwise
For each query-configuration pair (i, k) ∈ M × P , we introduce a variable:
½
1, query i uses configuration k
xik =
0, otherwise
3
(5)
(6)
Constraint (2) explains that each query i ∈ M can use at most one configuration.
Constraint (3) says that the queries can be served only by active configurations.
2.1
Previous Work
In [6], two methods for solving the ISP are proposed. One is a heuristics based
on the solution of the suitably defined knapsack subproblem and on Lagrangean
decomposition. This method is used on larger scale ISP instances, up to several
thousand indexes and queries, and it produces solutions of a good quality. The
second method is a branch-and-bound algorithm based on the linear programming relaxation of the model. The performance of the algorithm is enhanced by
means of the preprocessing which reduces the size of the candidate index set.
Optimal solutions on the ISP instances involving several hundred indexes and
queries have been reported.
In [8], an improvement of the previous method by using branch-and-cut
(BnC) algorithm is proposed. The problem is formulated as the Set Packing
Problem that contains all clique inequalities. For initialisation of the root node
in the branch-and-bound tree, a so called LP-heuristics is used: this is an adapted
greedy heuristics for the UFLP, followed by a 2-interchange heuristics. The greedy
heuristics starts with an empty index set, and iteratively adds to the current set
the index leading to the largest objective function increase. The 2-interchange
procedure tries to improve the current solution by adding an index, removing
an index, or interchanging two indexes. Analogous algorithms working on configurations instead of indexes are also applied, and the best solution is taken
as a primal bound. Computational results on two different challenging types of
instances (class A and class B) are reported. Although these instances contain
only 50 (respectively 100) indexes and queries, presented results indicate that
two new classes are more complicated than the instances tested in the previous
paper with several thousand of indexes and queries.
A separation procedure for a suitably defined family of lifted odd-hole inequalities is embedded within the branch-and-cut method presented in [9]. These
inequalities are obtained by applying Chvátal-Gomory derivation to the clique
inequalities. As an initialisation heuristics, the LP-heuristics described above is
used. The results of this approach compared to those obtained using CPLEX 3.0
MIP solver, show the effectiveness of the presented approach on the instances
of class A and class B. However, the performance analysis indicates that the
instances of class B are easily solvable by both exact algorithms, and that the
instances of real interest are those of class A.
4
Jozef Kratica, Ivana Ljubić and Dušan Tošić
3
The Genetic Algorithm
We used traditional generational GA (containing Npop individuals) with overlapping populations: Nelit is the number of elitist individuals that survive to the
next generation. Since the evaluation of the objective value is a time-consuming
operation, we store a certain amount of already calculated values in a cachetable of size NCache . Before the objective value for a certain individual is calculated, the table is checked [10, 11]. Least Recently Used caching strategy, which
is simple but effective, is used for that purpose. Caching is implemented by a
hash-queue data structure.
3.1
Encoding and Objective Function
We store only non-zero gik (i ∈ M, k ∈ P ) values, since their number is small
compared to the matrix dimension |M | · |P |. For each query i we need to remember three variables: the number of gik variables that are greater than zero, the
values itself stored in an array and their original indices. Thus, we enhance the
evaluation of the objective function, since during the search the complete matrix
need not to be considered. Figure 1 shows an example. A similar strategy has
been used when the contents of the configurations are stored. Each solution S
1 2 3
1 15 0 0
2 0 20 0
3 0 0 30
4 5 6 7 8
0 20 0 0 0
0 0 0 0 0
0 0 0 0 40
1 2 3 4 5
1 2 15 1 20 5
2 1 20 2
3 2 30 3 40 8
1 2 3 4 5
1 2 20 5 15 1
2 1 20 2
3 2 40 8 30 3
(b)
(c)
(a)
Fig. 1. An example how the values (gik ) can be efficiently stored: (a) The rows represent
the queries, while the columns represent possible configurations; (b) The matrix is
reduced so that the first number which represents the number of non-zero values is
followed by an array of values itself and their original indices. (c) The gains gik are
finally sorted in the decreasing order.
is encoded using the binary vector y, given by (5). According to the values of
indexes that are built, we determine the set of active configurations.
For each query i, we need to find the configuration k ∗ whose time gain is
maximized, i.e.
k ∗ = max gik .
(7)
k∈P
Finally, we need to sum all such gains over all queries, and to subtract the
maintenance times of the built indexes.
The whole objective value’s evaluation process can be enhanced significantly
if the values gik are sorted in a decreasing order, for each query i ∈ M . That
way, the first active configuration (if such exists) in the sorted array, represents
the best one.
If the obtained value is negative, the building of the indexes does not pay
off. In that case yj = 0, for all j ∈ N and there are no active configurations the objective value is set to zero.
A Genetic Algorithm for the Index Selection Problem
5
Time Complexity of the Objective Function The set of built indexes can
be found in O(n) time. Each configuration can contain n indexes in the worst
case. Thus, all active configurations can be found in O(np) time. For each query
i ∈ M , finding the best configuration according to (7), can be done in O(p)
time. Hence, finding the best configurations for all queries needs O(mp) time.
However, this is only a theoretical upper bound, since the matrix (gik ) is usually
sparse. Finally, subtraction of the maintenance time for each built index i ∈ S
can be performed in O(n) time. In total, evaluation of the objective function for
each individual can be done in O((n + m)p) time.
3.2
Genetic Operators and Other Aspects of the GA
In order to avoid premature convergence, multiple individuals are discarded from
the population. To recombine two individuals, we use the uniform crossover
described in [12]. The recombination is applied to a pair of selected individuals
with probability pcro .
Mutating every single gene may be unnecessarily low, since usually only
several genes really mutate, while the majority of them remains the same. For
a given mutation rate, the number of muted genes of one generation can be
predicted. Using the Central limit theorem, by the Gaussian distribution, the
mutation procedure is performed only on the number of randomly chosen genes.
That way, a relatively small mutation probability does not create drawbacks on
the performance of the simple mutation operator we used.
In our model, the mutation rate has been changed over the generations: We
start with a mutation rate pmstart , and let it converge towards pmend :
−Ngen
pm = pmend + (pmstart − pmend ) · 2 pmgen
(8)
where Ngen represents the current generation, and pmgen a mutation’s biasparameter. Note that for the large number of generations, the mutation rate
converges towards pmend , but the value itself can not be achieved.
As a selection method, we used the fine grained tournament selection, proposed in [13]. Instead of having an integer parameter Ntour , which represents the
tournament size in the classic tournament selection, the new operator’s input is
a floating point parameter Ftour representing an average tournament size. The
size of each of Npop −Nelite tournaments is chosen so that this value is on average
as close as possible to Ftour . Such a selection scheme can significantly affect the
overall performance of the GA, as it can be seen in [14].
4
Empirical Results
We have tested our GA on class A of randomly generated ISP instances, obtained
by the generator of A. Caprara and J.J. Salazar Gonzales [9]. This generator
intends to produce instances of the structure similar to real-world ISP instances,
in the spirit of [6].
6
Jozef Kratica, Ivana Ljubić and Dušan Tošić
All instances are given with 50 queries, 50 indexes and 500 configurations.
For each j ∈ N , fj = 100, and for each k ∈ P , |Nk | is chosen as a random
integer from {1, . . . , 5}. |Nk | indexes inside of Nk are then randomly selected
from N . For each k ∈ P , a set Mk is defined by randomly selecting 5 queries
from M ; The answer time gain gik , ∀i ∈ M , is set to zero if i ∈ M \ Mk , and to
a random integer from {1, . . . , t|Nk |} otherwise (where the parameter t is given
as an input).
The following setup was used for the GA, as it proved to be robust in preliminary tests: population size Npop = 150; number of elitist individuals Nelit = 100;
size of the cache NCache = 5000; average group size for the fine grained tournament selection Ftour = 5.5; crossover probability pcro = 0.85, where 30% of information has been exchanged; mutation parameters pmstart = 0.01, pmend = 0.002
and pmgen = 300. Each run was terminated after 2 000 iterations. On all the instances we considered, this criterion allowed the GA to converge so that only
minor improvements in the quality of final solutions can be expected when prolonging the runs. Thus, the main goal was to find high-quality solutions, and
running times were considered only secondary.
Table 1 shows average results of our GA approach obtained on 40 instances of
class A, running on an AMD Athlon K7/1.333GHz (Pentium IV class machine)
with 256MB SDRAM memory. The GA was run 20 times on each instance.
Average time needed to detect the best value is given in t[s] column, while
ttot [s] represents the total time needed to execute all 2000 generations. On average, the best value has been reached after gen generations. In sr column, the
success rate is given, i.e. in how many runs (out of 20) the optimal value has
been reached.
The quality of a solution S obtained by the GA is evaluated as a percentage
gap with respect to the optimal cost cost(S ∗ ):
gap(S) =
cost(S) − cost(S ∗ )
· 100% .
cost(S ∗ )
(9)
In the column gap, average over 20 runs is given. The values cost(S ∗ ) can be
found in Table 2. Standard deviation of the average gap (σ) is also presented in
the Table 1. There are additional four columns showing in how many runs the
obtained gaps were less than 0.2%, 1%, 5% respectively, and in how many runs
the gap was greater than 5%.
The last two columns are related to the caching: eval represents the average number of needed evaluations, while cache[%] displays savings (in percent)
achieved by using the caching technique. On average, instead of making 100, 000
calls of the objective function, we took between 75% and 85% of the values from
the cache-table.
Results of two MIP solvers (CPLEX and OSL) for the same set of instances
are given in Table 2. OSL is tested also on AMD Athlon K7/1.333GHz machine,
but CPLEX according to its license was run on Pentium III/800 MHz with
1GB memory. The columns describe: the instance name, the optimal solution,
the number of CPLEX iterations, the number of branch-and-bound nodes and
A Genetic Algorithm for the Index Selection Problem
7
Table 1. Empirical results of the GA on class A instances
Inst.
t[s] ttot [s] gen sr
i200.111
i200.222
i200.333
i200.444
i200.555
i175.111
i175.222
i175.333
i175.444
i175.555
i150.111
i150.222
i150.333
i150.444
i150.555
i125.111
i125.222
i125.333
i125.444
i125.555
i100.111
i100.222
i100.333
i100.444
i100.555
i075.111
i075.222
i075.333
i075.444
i075.555
i050.111
i050.222
i050.333
i050.444
i050.555
i025.111
i025.222
i025.333
i025.444
i025.555
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.1
0.2
0.1
0.1
0.1
0.3
0.2
0.1
0.1
0.1
0.1
0.3
0.3
0.1
0.1
0.2
0.2
0.2
0.3
0.2
0.3
0.3
0.3
0.2
0.4
1.1
0.9
1.3
0.9
0.9
1.1
1.0
1.2
1.0
0.9
1.0
1.2
1.1
1.0
1.0
1.0
1.3
1.1
1.0
1.1
1.2
1.4
1.1
0.9
1.2
1.1
1.2
1.3
0.9
1.3
1.1
1.1
1.1
1.2
1.1
1.1
1.1
1.0
1.0
1.1
39
41
83
40
40
67
62
120
47
40
69
68
104
41
38
50
219
97
44
46
528
280
73
40
150
139
294
306
48
138
256
189
180
429
195
360
517
423
193
623
20
20
20
20
20
20
20
19
20
20
20
18
20
20
20
19
18
20
20
20
14
6
20
20
19
17
16
18
20
20
12
6
9
8
9
8
11
2
7
2
err <
err >
0.2% 1% 5%
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
0
1
0
2
0
0
0
0
0
0
0
0
0
0
0
6
0
0
0 14
0
0
0
0
0
0
0
0
1
0
0
3
0
0
4
0
2
0
0
0
0
0
0
0
0
8
0
0
0
0 14
7
4
0
0
7
5
0
0 11
0
0 12
0
0
8
0
5
7
0
0
1
0
5 13
5%
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
6
12
0
gap
σ
0.0
0.0
0.0
0.0
0.0
0.0
0.0
<0.1
0.0
0.0
0.0
<0.1
0.0
0.0
0.0
<0.1
<0.1
0.0
0.0
0.0
<0.1
0.2
0.0
0.0
<0.1
0.1
0.1
<0.1
0.0
0.0
0.1
1.1
0.2
0.5
0.7
1.3
2.0
2.9
5.1
0.8
0.0
0.0
0.0
0.0
0.0
0.0
0.0
<0.1
0.0
0.0
0.0
<0.1
0.0
0.0
0.0
<0.1
<0.1
0.0
0.0
0.0
<0.1
0.1
0.0
0.0
<0.1
0.3
0.2
<0.1
0.0
0.0
0.1
1.2
0.3
0.5
0.7
1.6
3.1
2.6
4.1
0.5
eval cache[%]
16844
14684
22449
14485
12448
17151
14941
20447
15747
13355
15659
19950
18381
15448
15244
15883
22142
17742
14433
18594
20335
25388
16624
13368
20469
17299
19688
23088
13982
21611
19693
19667
18180
21174
17980
21381
21263
19434
17808
21784
83.2
85.3
77.6
85.5
87.6
82.9
85.1
79.6
84.3
86.7
84.3
80.1
81.6
84.6
84.8
84.1
77.9
82.3
85.6
81.4
79.7
74.6
83.4
86.6
79.5
82.7
80.3
76.9
86.0
78.4
80.3
80.3
81.8
78.8
82.0
78.6
78.7
80.6
82.2
78.2
8
Jozef Kratica, Ivana Ljubić and Dušan Tošić
Table 2. Results obtained by two MIP solvers: CPLEX and OSL
Instance cost(S ∗ )
i200.111
i200.222
i200.333
i200.444
i200.555
i175.111
i175.222
i175.333
i175.444
i175.555
i150.111
i150.222
i150.333
i150.444
i150.555
i125.111
i125.222
i125.333
i125.444
i125.555
i100.111
i100.222
i100.333
i100.444
i100.555
i075.111
i075.222
i075.333
i075.444
i075.555
i050.111
i050.222
i050.333
i050.444
i050.555
i025.111
i025.222
i025.333
i025.444
i025.555
41102
40412
41034
41030
40706
35452
34803
35355
35352
35097
29781
29199
29712
29703
29458
24121
23612
24088
24054
23839
18479
18138
18486
18419
18255
12956
12698
12899
12777
12761
7475
7400
7423
7312
7384
2356
2347
2339
2278
2289
Iterations
190
319
383
177
191
220
937
1477
221
211
348
3829
4027
604
531
1191
20619
17458
2325
2338
12752
40775
57212
7863
21598
60188
83844
117720
149620
147597
1138414
1005745
2383662
2309766
1764340
70114754
19846560
131290943
90164493
71976083
CPLEX
Nodes
0
4
13
0
0
0
26
66
0
0
2
172
220
7
8
39
959
1038
69
101
465
2001
3346
314
994
2376
4100
5714
4602
7658
36611
40367
124310
98514
61630
1039922
282383
3945602
1596756
1178812
OSL
t[s] Iterations
t[s]
0.5
3866
22
7.6
5364
56
8.2
5834
70
0.4
4316
24
0.4
3845
20
0.7
5233
54
17.4
6445
71
18.4
18565
232
0.5
4509
24
0.5
4097
22
10.6
5577
64
26.5
26104
281
22.1
45093
546
18.3
6056
72
13.0
6751
78
18.4
8375
105
63.3
222424 2349
41.1
137939 1555
24.1
10920
127
29.2
11763
127
82.7
62798
636
190.5
1792061 18071
187.4
2202795 28493
72.3
51660
496
79.9
212854 2037
288.5
768490 7157
349.2
486.9
495.1
2153357 18380
584.5
2245747 19196
4151.2
3719.7
9900.6
9783.8
7664.2
249709.9
66569.9
441895.3
294388.1
170476.5
-
A Genetic Algorithm for the Index Selection Problem
Table 3. Comparing results obtained by LP-heuristics, BnC and GA
Instance
i200.111
i200.222
i200.333
i200.444
i200.555
i175.111
i175.222
i175.333
i175.444
i175.555
i150.111
i150.222
i150.333
i150.444
i150.555
i125.111
i125.222
i125.333
i125.444
i125.555
i100.111
i100.222
i100.333
i100.444
i100.555
i075.111
i075.222
i075.333
i075.444
i075.555
i050.111
i050.222
i050.333
i050.444
i050.555
i025.111
i025.222
i025.333
i025.444
i025.555
BnC (HP9000/720)
root-t[s] root-gap[%]
ttot [s]
1.1
0
1.1
1.0
0
1.0
0.9
0
0.9
1.0
0
1.0
1.4
0
1.4
1.2
0
1.2
1.3
0
1.3
1.1
0
1.1
1.1
0
1.1
2.0
0
2.0
2.0
0
2.1
2.5
0
2.5
1.3
0
1.3
2.1
0
2.1
3.3
0
3.3
4.7
0
4.8
5.2
0
5.2
1.6
0
1.6
4.3
0
4.3
7.4
0
7.4
32.6
0
33.3
25.4
0
25.4
4.8
0
4.8
31.6
0.2
37.6
26.6
0.3
45.2
83.8
1.9
391.2
78.9
1.8
773.4
47.5
0.5
75.9
92.1
2.3
815.2
79.1
1.9
457.2
139.6
7.4
7553.5
152.4
7.1
6583.4
134.2
5.2
1692.0
192.5
8.0
9284.9
157.4
8.1
10566.7
565.3
37.7 > 50000
595.5
38.4 > 50000
585.5
36.0 > 50000
691.3
35.8 > 50000
584.6
34.9 > 50000
GA (AMD K5)
t[s]
ttot [s]
1.5
22.9
1.6
21.2
2.3
28.0
1.3
20.1
1.5
18.7
2.0
22.5
1.6
21.4
3.8
25.4
1.6
21.0
1.3
19.1
1.6
20.6
1.9
24.1
3.3
24.4
1.6
21.3
1.4
20.7
1.5
20.5
2.0
25.0
2.8
23.2
1.5
20.3
2.0
24.1
8.1
24.9
6.9
27.8
2.4
22.4
1.5
18.8
5.2
26.8
3.6
22.3
9.7
26.0
6.1
27.4
1.9
20.3
3.4
26.0
4.5
23.8
7.6
25.6
7.7
24.9
5.8
24.5
2.8
23.4
7.0
23.2
6.3
23.3
7.0
22.7
7.6
21.4
5.2
22.2
9
10
Jozef Kratica, Ivana Ljubić and Dušan Tošić
CPLEX’s running time in seconds; OSL’s number of iterations and its running
time.
Results of the branch-and-cut algorithm described in [8, 9] directly compared
to the GA are given in Table 3. It is clear that AMD Athlon K7/1.333GHz is a
more powerful machine than the one used in [9] (HP 9000/720 at 80 MHz, 59
Specs, 58 MIPS, 18 Mflops), but the exact speedup factor is unknown. This factor
is unfortunately not constant, and greatly depends on instance’s characteristics.
For a fair comparison of the performance, the GA with the same set of parameters
is tested also on AMD K5/100MHz machine with 64MB memory (Pentium I
class machine). The performance on this machine is very similar to that of HP
9000/720.
The first two columns of Table 3 represent the running time of the LPheuristics (described in Sect. 2.1) (root-t[s]) and its gap (root-gap[%]), respectively. These values are followed by BnC’s total time needed to obtain optimal
solution, GA’s average time needed to detect the best value, and GA’s total time
needed to execute all 2000 generations.
For the instances of type i200, i175, i150 and i125, solutions obtained by the
LP-heuristics are already optimal, and no branching has been performed. For
some of these instances is the BnC even faster than our GA. On the other side,
the instances of type i100, i075, i050 and i025 seem to be quite challenging for
the exact algorithms, since BnC’s running time grows up exponentially. Even
the initialisation time is at least four times greater than GA’s average running
time. The quality of initial solutions is also very bad (for the group i025, gaps
are even greater than 30%).
In contrast to BnC’s exponential nature, the GA performs very efficiently,
keeping a satisfying quality of the obtained solutions (in the worst case, the
success rate was 10%, and the total running time was less than 2 seconds).
In 50% of the instances, the GA had a success rate of 100%, while in 70% of
them, the average gap was not greater than 0.1%. In only one out of 40 tested
instances, the average gap was greater than 3% (i025.444).
For extended results obtained on the class B of instances, see
http://www.geocities.com/jkratica/instances/. These instances appear to
be easily solvable for BnC, CPLEX and OSL (optimal solutions are usually
achieved within few seconds).
5
Conclusions
We proposed a genetic algorithm for the Index Selection Problem based on binary encoding, efficient data structures (for the evaluation of the objective function), on the uniform crossover, and simple mutation. The algorithm is tested
on the class of challenging instances known from the literature. Obtained results
indicate its efficiency and reliability.
The algorithm fits well into the parallel implementation and different island models, described in [15], should be tested. Incorporation of some problemdependent variation operators, or a local-search procedures, could possibly make
the algorithm more powerful for the instances of a larger size.
A Genetic Algorithm for the Index Selection Problem
11
Acknowledgements
We thank to A. Caprara and J.J. Salazar Gonzales, the authors of [8, 9], for
providing us the generator of test-instances for the ISP.
Ivana Ljubić is supported by the Doctoral Scholarship Programme of the
Austrian Academy of Sciences (DOC).
References
1. J. Krarup and P.M. Pruzan. The simple plant location problem: Survey and synthesis. European Journal of Operational Research, 12:36–81, 1983.
2. D. W. Tcha and B.-Y. Lee. A branch-and-bound algorithm for the multilevel
uncapacitated facility location problem. European Jornal of Operational Research,
18(1):35–43, 1984.
3. Y. A. Kochetov and E. N. Goncharov. Probabilistic tabu search algorithm for the
multi-stage uncapacitated facility location problem. In B. Fleischmann, R. Lasch,
U. Derigs, W. Domschke, and U. Rieder, editors, Operations Research Proceedings
2000, pages 65–70. Springer, 2000.
4. E. Barcucci, R. Pinzani, and R. Sprugnoli. Optimal selection of secondary indexes.
IEEE Transactions on Software Engineering, 16, 1990.
5. M.Y.L. Ip, L.V. Saxton, and V.V. Raghavan. On the selection of an optimal set
of indexes. IEEE Transactions on Software Engineering, pages 135–143, 1983.
6. A. Caprara, M. Fischetti, and D. Maio. Exact and approximate algorithms for
the index selection problem in physical database design. IEEE Transactions on
Knowledge and Data Engineering, 7(6), 1995.
7. S. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases. ACM Transactions on Database Systems, 13:91–128, 1988.
8. A. Caprara and J.J. Salazar Gonzalez. A branch-and-cut algorithm for a generalization of the uncapacitated facility location. Problem Trabajos de Operativa TOP, 4(1):135–163, 1996.
9. A. Caprara and J.J Salazar Gonzalez. Separating lifted odd-hole inequalities to
solve the index selection problem. Discrete Applied Mathematics, 92:111–134, 1999.
10. J. Kratica. Improving performances of the genetic algorithm by caching. Computers
and Artificial Intelligence, 18(3):271–283, 1999.
11. J. Kratica. Parallelization of Genetic Algorithms for Solving Some NP-complete
Problems (in Serbian). PhD thesis, Faculty of Mathematics, Belgrade, 2000.
12. G. Syswerda. Uniform crossover in genetic algorithms. In 3th International Conference on Genetic ALgorithms, ICGA’89, pages 2–9, Internet, 1989. Morgan Kaufmann, San Mateo, Calif.
13. V. Filipović. Proposition for improvement tournament selection operator in genetic
algorithms (in serbian). Master’s thesis, Faculty of Mathematics, Belgrade, 1998.
14. V. Filipović, J. Kratica, D. Tošić, and I. Ljubić. Fine grained tournament selection
for the simple plant location problem. In 5th Online World Conference on Soft
Computing Methods in Industrial Applications, WSC5, pages 152–158, Internet,
2000. ISBN: 951-22-5205-8.
15. Erick Cantu-Paz. Efficient and Accurate Parallel Genetic Algorithms. Kluwer
Academic Publishers, Boston, 2000.