A Genetic Algorithm for the Index Selection Problem

Kratica, Jozef; Ljubić, Ivana; Tošić, Dušan

A Genetic Algorithm for the Index Selection Problem

Jozef Kratica

2003

visibility

…

description

11 pages

link

1 file

This paper considers the problem of minimizing the response time for a given database workload by a proper choice of indexes. This problem is NP-hard and known in the literature as the Index Selection Problem (ISP). We propose a genetic algorithm (GA) for solving the ISP. Computational results of the GA on standard ISP instances are compared to branchand- cut method and its initialisation heuristics and two state of the art MIP solvers: CPLEX and OSL. These results indicate good performance, reliability and efficiency of the proposed approach.

A Genetic Algorithm for the Index Selection Problem Jozef Kratica1 , Ivana Ljubić2 and Dušan Tošić3 1 2 Institute of Mathematics, Serbian Academy of Sciences and Arts, Kneza Mihajla 35/I, pp. 367, 11001 Belgrade, Yugoslavia [email protected] Institute of Computer Graphics and Algorithms, Vienna University of Technology, Favoritenstraße 9–11/186, 1040 Vienna, Austria [email protected] 3 Faculty of Mathematics, University of Belgrade, Studentski trg 16, Belgrade, Yugoslavia [email protected] Abstract. This paper considers the problem of minimizing the response time for a given database workload by a proper choice of indexes. This problem is NP-hard and known in the literature as the Index Selection Problem (ISP). We propose a genetic algorithm (GA) for solving the ISP. Computational results of the GA on standard ISP instances are compared to branchand-cut method and its initialisation heuristics and two state of the art MIP solvers: CPLEX and OSL. These results indicate good performance, reliability and efficiency of the proposed approach. 1 Introduction The Index Selection Problem (ISP) is one of the crucial problems in the physical design of databases. The main goal is to choose a subset of given indexes to be created in a database, so that the response time for a given database workload is minimized. The ISP represents a generalization of the well-known Uncapacitated Facility Location Problem - UFLP, which is known as NP-hard in a strong sense [1]. Some authors also consider the ISP problem as the multilevel (or multi-stage) uncapacitated facility location problem ([2, 3]). Suppose we are given a set of indexes (that can be built or not) and a set of queries. Each built index requires a maintenance time, while each query can be answered in answer time, which depends on the set of built indexes. The main goal is to minimize the overall execution time, defined as the sum of the maintenance times plus the sum of the answer times for all queries. In the simple case, each query can be answered either without using any index, in a given answer time, or with using one built index, reducing answer time by a gain specified for every index usable for query. In this situation the problem can be formulated as an UFLP. 2 Jozef Kratica, Ivana Ljubić and Dušan Tošić However, in most database management systems each query may use not only a single index, but a set of indexes ([4, 5]). This problem can not be formulated as a pure UFLP, but must be generalized and formulated as an ISP. Detailed description of the problem and analysis of all aspects of its influence on the managing of database systems is beyond the scope of this paper. Some important survey articles are [6, 7]. In this paper we propose a genetic algorithm (GA) for solving the ISP. In the following section, we will present the mathematical formulation of the problem and point out the previous work. Section 3 explains in detail all important aspects of the GA: encoding, evaluation of solutions and genetic operators. In Section 4, we will compare the results of the GA with branch-and-cut (BnC) algorithm proposed in [8, 9]. Results are also compared with those obtained using two state of the art MIP solvers: CPLEX and OSL. Finally, in Section 5, we will draw out some conclusions and propose ideas for the future work. 2 The Index Selection Problem Let N = {1, 2, ..., n} be the set of all indexes and M = {1, 2, ..., m} the set of all queries for a database. Each index can be built or not - each built index requires a given maintenance time fj > 0. In the ISP, not only single indexes, but also sets of indexes can be used for a query. Therefore, we are given a set of configurations P = {1, 2, ..., p} - each configuration k ∈ P is associated with some subset Nk ⊂ N of indexes. A configuration is active, if all its indexes are built. When all indexes of a certain configuration k ∈ P are built, during the execution of a query i ∈ M , we gain gik ≥ 0 time. In practice, for the most pairs (i,k), i ∈ M , k ∈ P the gains gik are equal zero. This can easily be explained with the fact that a certain configuration has an influence only on a limited number of queries from M . Our goal is to build the indexes, so that the total time needed to execute all the queries is minimized, i.e. that the total time of gains is maximized. Formally, the problem can be stated as:   XX X max  gik · xik − fj · yj  (1) i∈M k∈P j∈N subject to X xik ≤ 1, i∈M (2) k∈P xik ≤ yj , xik , yj ∈ {0, 1}, i ∈ M, j ∈ N, k ∈ P i ∈ M, j ∈ N, k ∈ P (3) (4) In this formulation, each solution is represented by the set of built indexes S ⊂ N , where y represents a characteristic vector of S, i.e.: A Genetic Algorithm for the Index Selection Problem yj = ½ 1, index j ∈ S 0, otherwise For each query-configuration pair (i, k) ∈ M × P , we introduce a variable: ½ 1, query i uses configuration k xik = 0, otherwise 3 (5) (6) Constraint (2) explains that each query i ∈ M can use at most one configuration. Constraint (3) says that the queries can be served only by active configurations. 2.1 Previous Work In [6], two methods for solving the ISP are proposed. One is a heuristics based on the solution of the suitably defined knapsack subproblem and on Lagrangean decomposition. This method is used on larger scale ISP instances, up to several thousand indexes and queries, and it produces solutions of a good quality. The second method is a branch-and-bound algorithm based on the linear programming relaxation of the model. The performance of the algorithm is enhanced by means of the preprocessing which reduces the size of the candidate index set. Optimal solutions on the ISP instances involving several hundred indexes and queries have been reported. In [8], an improvement of the previous method by using branch-and-cut (BnC) algorithm is proposed. The problem is formulated as the Set Packing Problem that contains all clique inequalities. For initialisation of the root node in the branch-and-bound tree, a so called LP-heuristics is used: this is an adapted greedy heuristics for the UFLP, followed by a 2-interchange heuristics. The greedy heuristics starts with an empty index set, and iteratively adds to the current set the index leading to the largest objective function increase. The 2-interchange procedure tries to improve the current solution by adding an index, removing an index, or interchanging two indexes. Analogous algorithms working on configurations instead of indexes are also applied, and the best solution is taken as a primal bound. Computational results on two different challenging types of instances (class A and class B) are reported. Although these instances contain only 50 (respectively 100) indexes and queries, presented results indicate that two new classes are more complicated than the instances tested in the previous paper with several thousand of indexes and queries. A separation procedure for a suitably defined family of lifted odd-hole inequalities is embedded within the branch-and-cut method presented in [9]. These inequalities are obtained by applying Chvátal-Gomory derivation to the clique inequalities. As an initialisation heuristics, the LP-heuristics described above is used. The results of this approach compared to those obtained using CPLEX 3.0 MIP solver, show the effectiveness of the presented approach on the instances of class A and class B. However, the performance analysis indicates that the instances of class B are easily solvable by both exact algorithms, and that the instances of real interest are those of class A. 4 Jozef Kratica, Ivana Ljubić and Dušan Tošić 3 The Genetic Algorithm We used traditional generational GA (containing Npop individuals) with overlapping populations: Nelit is the number of elitist individuals that survive to the next generation. Since the evaluation of the objective value is a time-consuming operation, we store a certain amount of already calculated values in a cachetable of size NCache . Before the objective value for a certain individual is calculated, the table is checked [10, 11]. Least Recently Used caching strategy, which is simple but effective, is used for that purpose. Caching is implemented by a hash-queue data structure. 3.1 Encoding and Objective Function We store only non-zero gik (i ∈ M, k ∈ P ) values, since their number is small compared to the matrix dimension |M | · |P |. For each query i we need to remember three variables: the number of gik variables that are greater than zero, the values itself stored in an array and their original indices. Thus, we enhance the evaluation of the objective function, since during the search the complete matrix need not to be considered. Figure 1 shows an example. A similar strategy has been used when the contents of the configurations are stored. Each solution S 1 2 3 1 15 0 0 2 0 20 0 3 0 0 30 4 5 6 7 8 0 20 0 0 0 0 0 0 0 0 0 0 0 0 40 1 2 3 4 5 1 2 15 1 20 5 2 1 20 2 3 2 30 3 40 8 1 2 3 4 5 1 2 20 5 15 1 2 1 20 2 3 2 40 8 30 3 (b) (c) (a) Fig. 1. An example how the values (gik ) can be efficiently stored: (a) The rows represent the queries, while the columns represent possible configurations; (b) The matrix is reduced so that the first number which represents the number of non-zero values is followed by an array of values itself and their original indices. (c) The gains gik are finally sorted in the decreasing order. is encoded using the binary vector y, given by (5). According to the values of indexes that are built, we determine the set of active configurations. For each query i, we need to find the configuration k ∗ whose time gain is maximized, i.e. k ∗ = max gik . (7) k∈P Finally, we need to sum all such gains over all queries, and to subtract the maintenance times of the built indexes. The whole objective value’s evaluation process can be enhanced significantly if the values gik are sorted in a decreasing order, for each query i ∈ M . That way, the first active configuration (if such exists) in the sorted array, represents the best one. If the obtained value is negative, the building of the indexes does not pay off. In that case yj = 0, for all j ∈ N and there are no active configurations the objective value is set to zero. A Genetic Algorithm for the Index Selection Problem 5 Time Complexity of the Objective Function The set of built indexes can be found in O(n) time. Each configuration can contain n indexes in the worst case. Thus, all active configurations can be found in O(np) time. For each query i ∈ M , finding the best configuration according to (7), can be done in O(p) time. Hence, finding the best configurations for all queries needs O(mp) time. However, this is only a theoretical upper bound, since the matrix (gik ) is usually sparse. Finally, subtraction of the maintenance time for each built index i ∈ S can be performed in O(n) time. In total, evaluation of the objective function for each individual can be done in O((n + m)p) time. 3.2 Genetic Operators and Other Aspects of the GA In order to avoid premature convergence, multiple individuals are discarded from the population. To recombine two individuals, we use the uniform crossover described in [12]. The recombination is applied to a pair of selected individuals with probability pcro . Mutating every single gene may be unnecessarily low, since usually only several genes really mutate, while the majority of them remains the same. For a given mutation rate, the number of muted genes of one generation can be predicted. Using the Central limit theorem, by the Gaussian distribution, the mutation procedure is performed only on the number of randomly chosen genes. That way, a relatively small mutation probability does not create drawbacks on the performance of the simple mutation operator we used. In our model, the mutation rate has been changed over the generations: We start with a mutation rate pmstart , and let it converge towards pmend : −Ngen pm = pmend + (pmstart − pmend ) · 2 pmgen (8) where Ngen represents the current generation, and pmgen a mutation’s biasparameter. Note that for the large number of generations, the mutation rate converges towards pmend , but the value itself can not be achieved. As a selection method, we used the fine grained tournament selection, proposed in [13]. Instead of having an integer parameter Ntour , which represents the tournament size in the classic tournament selection, the new operator’s input is a floating point parameter Ftour representing an average tournament size. The size of each of Npop −Nelite tournaments is chosen so that this value is on average as close as possible to Ftour . Such a selection scheme can significantly affect the overall performance of the GA, as it can be seen in [14]. 4 Empirical Results We have tested our GA on class A of randomly generated ISP instances, obtained by the generator of A. Caprara and J.J. Salazar Gonzales [9]. This generator intends to produce instances of the structure similar to real-world ISP instances, in the spirit of [6]. 6 Jozef Kratica, Ivana Ljubić and Dušan Tošić All instances are given with 50 queries, 50 indexes and 500 configurations. For each j ∈ N , fj = 100, and for each k ∈ P , |Nk | is chosen as a random integer from {1, . . . , 5}. |Nk | indexes inside of Nk are then randomly selected from N . For each k ∈ P , a set Mk is defined by randomly selecting 5 queries from M ; The answer time gain gik , ∀i ∈ M , is set to zero if i ∈ M \ Mk , and to a random integer from {1, . . . , t|Nk |} otherwise (where the parameter t is given as an input). The following setup was used for the GA, as it proved to be robust in preliminary tests: population size Npop = 150; number of elitist individuals Nelit = 100; size of the cache NCache = 5000; average group size for the fine grained tournament selection Ftour = 5.5; crossover probability pcro = 0.85, where 30% of information has been exchanged; mutation parameters pmstart = 0.01, pmend = 0.002 and pmgen = 300. Each run was terminated after 2 000 iterations. On all the instances we considered, this criterion allowed the GA to converge so that only minor improvements in the quality of final solutions can be expected when prolonging the runs. Thus, the main goal was to find high-quality solutions, and running times were considered only secondary. Table 1 shows average results of our GA approach obtained on 40 instances of class A, running on an AMD Athlon K7/1.333GHz (Pentium IV class machine) with 256MB SDRAM memory. The GA was run 20 times on each instance. Average time needed to detect the best value is given in t[s] column, while ttot [s] represents the total time needed to execute all 2000 generations. On average, the best value has been reached after gen generations. In sr column, the success rate is given, i.e. in how many runs (out of 20) the optimal value has been reached. The quality of a solution S obtained by the GA is evaluated as a percentage gap with respect to the optimal cost cost(S ∗ ): gap(S) = cost(S) − cost(S ∗ ) · 100% . cost(S ∗ ) (9) In the column gap, average over 20 runs is given. The values cost(S ∗ ) can be found in Table 2. Standard deviation of the average gap (σ) is also presented in the Table 1. There are additional four columns showing in how many runs the obtained gaps were less than 0.2%, 1%, 5% respectively, and in how many runs the gap was greater than 5%. The last two columns are related to the caching: eval represents the average number of needed evaluations, while cache[%] displays savings (in percent) achieved by using the caching technique. On average, instead of making 100, 000 calls of the objective function, we took between 75% and 85% of the values from the cache-table. Results of two MIP solvers (CPLEX and OSL) for the same set of instances are given in Table 2. OSL is tested also on AMD Athlon K7/1.333GHz machine, but CPLEX according to its license was run on Pentium III/800 MHz with 1GB memory. The columns describe: the instance name, the optimal solution, the number of CPLEX iterations, the number of branch-and-bound nodes and A Genetic Algorithm for the Index Selection Problem 7 Table 1. Empirical results of the GA on class A instances Inst. t[s] ttot [s] gen sr i200.111 i200.222 i200.333 i200.444 i200.555 i175.111 i175.222 i175.333 i175.444 i175.555 i150.111 i150.222 i150.333 i150.444 i150.555 i125.111 i125.222 i125.333 i125.444 i125.555 i100.111 i100.222 i100.333 i100.444 i100.555 i075.111 i075.222 i075.333 i075.444 i075.555 i050.111 i050.222 i050.333 i050.444 i050.555 i025.111 i025.222 i025.333 i025.444 i025.555 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.3 0.2 0.1 0.1 0.1 0.1 0.3 0.3 0.1 0.1 0.2 0.2 0.2 0.3 0.2 0.3 0.3 0.3 0.2 0.4 1.1 0.9 1.3 0.9 0.9 1.1 1.0 1.2 1.0 0.9 1.0 1.2 1.1 1.0 1.0 1.0 1.3 1.1 1.0 1.1 1.2 1.4 1.1 0.9 1.2 1.1 1.2 1.3 0.9 1.3 1.1 1.1 1.1 1.2 1.1 1.1 1.1 1.0 1.0 1.1 39 41 83 40 40 67 62 120 47 40 69 68 104 41 38 50 219 97 44 46 528 280 73 40 150 139 294 306 48 138 256 189 180 429 195 360 517 423 193 623 20 20 20 20 20 20 20 19 20 20 20 18 20 20 20 19 18 20 20 20 14 6 20 20 19 17 16 18 20 20 12 6 9 8 9 8 11 2 7 2 err < err > 0.2% 1% 5% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 14 0 0 0 0 0 0 0 0 1 0 0 3 0 0 4 0 2 0 0 0 0 0 0 0 0 8 0 0 0 0 14 7 4 0 0 7 5 0 0 11 0 0 12 0 0 8 0 5 7 0 0 1 0 5 13 5% 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 6 12 0 gap σ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 <0.1 0.0 0.0 0.0 <0.1 0.0 0.0 0.0 <0.1 <0.1 0.0 0.0 0.0 <0.1 0.2 0.0 0.0 <0.1 0.1 0.1 <0.1 0.0 0.0 0.1 1.1 0.2 0.5 0.7 1.3 2.0 2.9 5.1 0.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 <0.1 0.0 0.0 0.0 <0.1 0.0 0.0 0.0 <0.1 <0.1 0.0 0.0 0.0 <0.1 0.1 0.0 0.0 <0.1 0.3 0.2 <0.1 0.0 0.0 0.1 1.2 0.3 0.5 0.7 1.6 3.1 2.6 4.1 0.5 eval cache[%] 16844 14684 22449 14485 12448 17151 14941 20447 15747 13355 15659 19950 18381 15448 15244 15883 22142 17742 14433 18594 20335 25388 16624 13368 20469 17299 19688 23088 13982 21611 19693 19667 18180 21174 17980 21381 21263 19434 17808 21784 83.2 85.3 77.6 85.5 87.6 82.9 85.1 79.6 84.3 86.7 84.3 80.1 81.6 84.6 84.8 84.1 77.9 82.3 85.6 81.4 79.7 74.6 83.4 86.6 79.5 82.7 80.3 76.9 86.0 78.4 80.3 80.3 81.8 78.8 82.0 78.6 78.7 80.6 82.2 78.2 8 Jozef Kratica, Ivana Ljubić and Dušan Tošić Table 2. Results obtained by two MIP solvers: CPLEX and OSL Instance cost(S ∗ ) i200.111 i200.222 i200.333 i200.444 i200.555 i175.111 i175.222 i175.333 i175.444 i175.555 i150.111 i150.222 i150.333 i150.444 i150.555 i125.111 i125.222 i125.333 i125.444 i125.555 i100.111 i100.222 i100.333 i100.444 i100.555 i075.111 i075.222 i075.333 i075.444 i075.555 i050.111 i050.222 i050.333 i050.444 i050.555 i025.111 i025.222 i025.333 i025.444 i025.555 41102 40412 41034 41030 40706 35452 34803 35355 35352 35097 29781 29199 29712 29703 29458 24121 23612 24088 24054 23839 18479 18138 18486 18419 18255 12956 12698 12899 12777 12761 7475 7400 7423 7312 7384 2356 2347 2339 2278 2289 Iterations 190 319 383 177 191 220 937 1477 221 211 348 3829 4027 604 531 1191 20619 17458 2325 2338 12752 40775 57212 7863 21598 60188 83844 117720 149620 147597 1138414 1005745 2383662 2309766 1764340 70114754 19846560 131290943 90164493 71976083 CPLEX Nodes 0 4 13 0 0 0 26 66 0 0 2 172 220 7 8 39 959 1038 69 101 465 2001 3346 314 994 2376 4100 5714 4602 7658 36611 40367 124310 98514 61630 1039922 282383 3945602 1596756 1178812 OSL t[s] Iterations t[s] 0.5 3866 22 7.6 5364 56 8.2 5834 70 0.4 4316 24 0.4 3845 20 0.7 5233 54 17.4 6445 71 18.4 18565 232 0.5 4509 24 0.5 4097 22 10.6 5577 64 26.5 26104 281 22.1 45093 546 18.3 6056 72 13.0 6751 78 18.4 8375 105 63.3 222424 2349 41.1 137939 1555 24.1 10920 127 29.2 11763 127 82.7 62798 636 190.5 1792061 18071 187.4 2202795 28493 72.3 51660 496 79.9 212854 2037 288.5 768490 7157 349.2 486.9 495.1 2153357 18380 584.5 2245747 19196 4151.2 3719.7 9900.6 9783.8 7664.2 249709.9 66569.9 441895.3 294388.1 170476.5 - A Genetic Algorithm for the Index Selection Problem Table 3. Comparing results obtained by LP-heuristics, BnC and GA Instance i200.111 i200.222 i200.333 i200.444 i200.555 i175.111 i175.222 i175.333 i175.444 i175.555 i150.111 i150.222 i150.333 i150.444 i150.555 i125.111 i125.222 i125.333 i125.444 i125.555 i100.111 i100.222 i100.333 i100.444 i100.555 i075.111 i075.222 i075.333 i075.444 i075.555 i050.111 i050.222 i050.333 i050.444 i050.555 i025.111 i025.222 i025.333 i025.444 i025.555 BnC (HP9000/720) root-t[s] root-gap[%] ttot [s] 1.1 0 1.1 1.0 0 1.0 0.9 0 0.9 1.0 0 1.0 1.4 0 1.4 1.2 0 1.2 1.3 0 1.3 1.1 0 1.1 1.1 0 1.1 2.0 0 2.0 2.0 0 2.1 2.5 0 2.5 1.3 0 1.3 2.1 0 2.1 3.3 0 3.3 4.7 0 4.8 5.2 0 5.2 1.6 0 1.6 4.3 0 4.3 7.4 0 7.4 32.6 0 33.3 25.4 0 25.4 4.8 0 4.8 31.6 0.2 37.6 26.6 0.3 45.2 83.8 1.9 391.2 78.9 1.8 773.4 47.5 0.5 75.9 92.1 2.3 815.2 79.1 1.9 457.2 139.6 7.4 7553.5 152.4 7.1 6583.4 134.2 5.2 1692.0 192.5 8.0 9284.9 157.4 8.1 10566.7 565.3 37.7 > 50000 595.5 38.4 > 50000 585.5 36.0 > 50000 691.3 35.8 > 50000 584.6 34.9 > 50000 GA (AMD K5) t[s] ttot [s] 1.5 22.9 1.6 21.2 2.3 28.0 1.3 20.1 1.5 18.7 2.0 22.5 1.6 21.4 3.8 25.4 1.6 21.0 1.3 19.1 1.6 20.6 1.9 24.1 3.3 24.4 1.6 21.3 1.4 20.7 1.5 20.5 2.0 25.0 2.8 23.2 1.5 20.3 2.0 24.1 8.1 24.9 6.9 27.8 2.4 22.4 1.5 18.8 5.2 26.8 3.6 22.3 9.7 26.0 6.1 27.4 1.9 20.3 3.4 26.0 4.5 23.8 7.6 25.6 7.7 24.9 5.8 24.5 2.8 23.4 7.0 23.2 6.3 23.3 7.0 22.7 7.6 21.4 5.2 22.2 9 10 Jozef Kratica, Ivana Ljubić and Dušan Tošić CPLEX’s running time in seconds; OSL’s number of iterations and its running time. Results of the branch-and-cut algorithm described in [8, 9] directly compared to the GA are given in Table 3. It is clear that AMD Athlon K7/1.333GHz is a more powerful machine than the one used in [9] (HP 9000/720 at 80 MHz, 59 Specs, 58 MIPS, 18 Mflops), but the exact speedup factor is unknown. This factor is unfortunately not constant, and greatly depends on instance’s characteristics. For a fair comparison of the performance, the GA with the same set of parameters is tested also on AMD K5/100MHz machine with 64MB memory (Pentium I class machine). The performance on this machine is very similar to that of HP 9000/720. The first two columns of Table 3 represent the running time of the LPheuristics (described in Sect. 2.1) (root-t[s]) and its gap (root-gap[%]), respectively. These values are followed by BnC’s total time needed to obtain optimal solution, GA’s average time needed to detect the best value, and GA’s total time needed to execute all 2000 generations. For the instances of type i200, i175, i150 and i125, solutions obtained by the LP-heuristics are already optimal, and no branching has been performed. For some of these instances is the BnC even faster than our GA. On the other side, the instances of type i100, i075, i050 and i025 seem to be quite challenging for the exact algorithms, since BnC’s running time grows up exponentially. Even the initialisation time is at least four times greater than GA’s average running time. The quality of initial solutions is also very bad (for the group i025, gaps are even greater than 30%). In contrast to BnC’s exponential nature, the GA performs very efficiently, keeping a satisfying quality of the obtained solutions (in the worst case, the success rate was 10%, and the total running time was less than 2 seconds). In 50% of the instances, the GA had a success rate of 100%, while in 70% of them, the average gap was not greater than 0.1%. In only one out of 40 tested instances, the average gap was greater than 3% (i025.444). For extended results obtained on the class B of instances, see http://www.geocities.com/jkratica/instances/. These instances appear to be easily solvable for BnC, CPLEX and OSL (optimal solutions are usually achieved within few seconds). 5 Conclusions We proposed a genetic algorithm for the Index Selection Problem based on binary encoding, efficient data structures (for the evaluation of the objective function), on the uniform crossover, and simple mutation. The algorithm is tested on the class of challenging instances known from the literature. Obtained results indicate its efficiency and reliability. The algorithm fits well into the parallel implementation and different island models, described in [15], should be tested. Incorporation of some problemdependent variation operators, or a local-search procedures, could possibly make the algorithm more powerful for the instances of a larger size. A Genetic Algorithm for the Index Selection Problem 11 Acknowledgements We thank to A. Caprara and J.J. Salazar Gonzales, the authors of [8, 9], for providing us the generator of test-instances for the ISP. Ivana Ljubić is supported by the Doctoral Scholarship Programme of the Austrian Academy of Sciences (DOC). References 1. J. Krarup and P.M. Pruzan. The simple plant location problem: Survey and synthesis. European Journal of Operational Research, 12:36–81, 1983. 2. D. W. Tcha and B.-Y. Lee. A branch-and-bound algorithm for the multilevel uncapacitated facility location problem. European Jornal of Operational Research, 18(1):35–43, 1984. 3. Y. A. Kochetov and E. N. Goncharov. Probabilistic tabu search algorithm for the multi-stage uncapacitated facility location problem. In B. Fleischmann, R. Lasch, U. Derigs, W. Domschke, and U. Rieder, editors, Operations Research Proceedings 2000, pages 65–70. Springer, 2000. 4. E. Barcucci, R. Pinzani, and R. Sprugnoli. Optimal selection of secondary indexes. IEEE Transactions on Software Engineering, 16, 1990. 5. M.Y.L. Ip, L.V. Saxton, and V.V. Raghavan. On the selection of an optimal set of indexes. IEEE Transactions on Software Engineering, pages 135–143, 1983. 6. A. Caprara, M. Fischetti, and D. Maio. Exact and approximate algorithms for the index selection problem in physical database design. IEEE Transactions on Knowledge and Data Engineering, 7(6), 1995. 7. S. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases. ACM Transactions on Database Systems, 13:91–128, 1988. 8. A. Caprara and J.J. Salazar Gonzalez. A branch-and-cut algorithm for a generalization of the uncapacitated facility location. Problem Trabajos de Operativa TOP, 4(1):135–163, 1996. 9. A. Caprara and J.J Salazar Gonzalez. Separating lifted odd-hole inequalities to solve the index selection problem. Discrete Applied Mathematics, 92:111–134, 1999. 10. J. Kratica. Improving performances of the genetic algorithm by caching. Computers and Artificial Intelligence, 18(3):271–283, 1999. 11. J. Kratica. Parallelization of Genetic Algorithms for Solving Some NP-complete Problems (in Serbian). PhD thesis, Faculty of Mathematics, Belgrade, 2000. 12. G. Syswerda. Uniform crossover in genetic algorithms. In 3th International Conference on Genetic ALgorithms, ICGA’89, pages 2–9, Internet, 1989. Morgan Kaufmann, San Mateo, Calif. 13. V. Filipović. Proposition for improvement tournament selection operator in genetic algorithms (in serbian). Master’s thesis, Faculty of Mathematics, Belgrade, 1998. 14. V. Filipović, J. Kratica, D. Tošić, and I. Ljubić. Fine grained tournament selection for the simple plant location problem. In 5th Online World Conference on Soft Computing Methods in Industrial Applications, WSC5, pages 152–158, Internet, 2000. ISBN: 951-22-5205-8. 15. Erick Cantu-Paz. Efficient and Accurate Parallel Genetic Algorithms. Kluwer Academic Publishers, Boston, 2000.

Log In

A Genetic Algorithm for the Index Selection Problem

Related papers

Related papers

Related topics