An Evolutionary Approach to the Index Selection Problem
Javier Calle, Yago Sáez, Dolores Cuadra
Computer Science Department - Carlos III University of Madrid
{fcalle,ysaez,dcuadra}@inf.uc3m.es
Abstract— In this paper, evolutionary algorithms are explored
with the objective of demonstrating that they offer the most
efficient and adequate solution to the Index Selection Problem
(ISP). The final target is to develop a self-tuning database
system requiring little (or no) intervention from experts in
physical design. Following the evaluation of the proposal and
the discussion of experimental results, conclusions are made
regarding the possibilities presented by evolutionary
algorithms for future projects.
Keywords—Index Selection Problem, Evolutionary Algorithms, Self-Tuning.
I. INTRODUCTION
One of the principal concerns present when trying to
improve the performance of a database instance is that of
finding the most appropriate physical design for that
database. Within these general concerns, the Index Selection
Problem (ISP) can be defined as the search for a particular
combination of indexes such that the cost for a given
workload in the database is minimized. This problem has
been traditionally formalized as the linear combination of 0-1
values on a string of variables, each representing a different
candidate index (i.e., 0-1 integer linear programming) [7].
Since the ISP is NP-hard [9], any implementation ought
to consider certain restrictions that allow a solution to be
reached in a reasonable amount of time. Almost all proposed
solutions to the ISP, for example, first begin with a search for
a reduced subset of candidate indexes. Later, many of these
proposals focus on the search for heuristics and efficient
pruning techniques in order to avoid the exploration of index
combinations known, a priori, to be ineffective. Finally,
certain authors [14] opt for statistics-based simulations
(rather than taking measurements on real environment
executions) in order to save time and not affect databases in
use. While the algorithms usually studied as potential
solutions to the ISP fix the characteristics of the database and
the database management system (DBMS), and even set the
workload as static, it is nevertheless the case that each of
these parameters is, in reality, dynamic. Thus, it would be
preferable to find an algorithm capable, from the start,
of adapting dynamically to any change in these parameters.
Furthermore, a number of additional studies can be found
focusing on specific system types (e.g., relational systems,
OLTP, etc.) or even certain auxiliary structures [4]. The
ability, therefore, of an algorithm to find general solutions
applicable to distinct systems and given diverse structures
would make its use even more recommendable. Given these
considerations, evolutionary algorithms present themselves
as the most promising solution, insofar as they can perfectly
adapt to the definition of the problem and, in addition, can
dynamically adapt to variations in the objective function [10]
[11]. Other related studies demonstrating the viability of the
proposed solution can also be found [18].
It is the principal objective of this study to offer an actual
measurement of the goodness and superiority of this proposal
when compared with the results obtained by frequently-used
tools, as well as expert database administrators. This first
evaluation, therefore, is made using a relational database in
static conditions (i.e., fixing characteristics of the database,
DBMS and workload), looking for any type of auxiliary
structure offered by the DBMS. It is the hope of the authors
of this study that the demonstrated relative goodness of
evolutionary algorithms as a solution to the ISP may ground
future studies that are more general (i.e., applicable to
distinct systems) and focus on dynamic conditions.
II. RELATED WORK AND PROPOSAL
Let C be the set of candidate indexes for a database, with
a cost function f: W × DB_i × P(C) → R providing a real value
for the execution of a workload (W) on a database with a given
state (DB_i) and a particular physical design which, for
reasons of brevity, is here restricted to the selection of an
index combination from the power set of C, that is, k ∈ P(C).
The ISP can now be defined as the search for the k which
minimizes the cost.
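Stated compactly, and fixing the workload W and the database state DB_i, the definition above amounts to the following minimization; this is only a restatement of the prose definition in standard notation, with no additional assumptions:

```latex
k^{*} \;=\; \operatorname*{arg\,min}_{k \,\in\, \mathcal{P}(C)} \; f\!\left(W,\, DB_i,\, k\right)
```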
One of the first problems to be addressed when selecting
the correct indexes for a particular database schema is to
decide the workload for which the indexes will optimize the
performance of the database. The workload is a significantly
large set of updating and query instructions (i.e., sentences)
representing the operations that occur in a database. One way
of selecting a representative workload is by utilizing the
logging capabilities [2] of many DBMSs to capture the trace
of queries and modifications made in any of those particular
systems. In certain published works [3], for example, the
new self-tuning characteristics of Oracle RDBMS are
presented. The selection of a good physical design is largely
influenced by the analysis of the most frequent operations
with the largest margin for improvement. In the
proposal presented here, design is grounded in a set of
candidate indexes which will be iteratively optimized during
the execution of a predefined set of operations. To select
these operations, Oracle uses the Automatic Workload Repository
(AWR), which is updated every hour with operations and
statistics collected by the DBMS.
The second problem to resolve is the selection of a set of
candidate indexes for a given workload. Given that, in many
cases, the search space of candidates is often computationally
unmanageable (i.e., with an exponential number of solutions),
a selection should be made among all possible index
combinations, eliminating those which are not representative
for the selected workload. In the majority of the proposals
studied, this filtration is carried out by the DBMS advisor.
To give some examples, the SQL Server Advisor bases its
recommendation on the syntactical structure of SQL
sentences [2]. In Oracle [3], the history stored (weekly) in
the AWR is used for the creation of four lists whose rankings
depend on the total time spent during (1) a given week, (2)
any given day of the week, (3) any given hour of the week,
and (4) the average time consumed during the week.
Next, the execution cost of each SQL sentence in the selected
workload is estimated under a given configuration of candidate
indexes. In most cases, dynamic programming [2][5] or what-if
analysis and optimization based on greedy algorithms
[8][14] are used. At this point, the procedure for
solving the ISP does not execute the definitive configuration
of indexes in the database schema. Rather, it is the
responsibility of the administrator, according to the data
provided by the advisor tool and the administrator’s own
experience, to select the definitive configuration of indexes
to be introduced in the DBMS. Additionally, certain other
proposals increase the degree of confidence in the
solution through the use of visual tools which, for example,
can demonstrate the interaction between indexes [19].
Solving the ISP for a DB instance at run time and
without the intervention of a DB administrator is one of the
goals of self-tuning. From the 1970s until the present day,
researchers have studied not only the ISP in static conditions,
but also other physical structures like materialized views and
their relationships with indexes [1], as well as the tuning of
database configuration parameters [13].
Furthermore, the ISP, once restricted to a set of
candidate indexes, amounts to a search for combinations
of elements (each of which may or may not appear) that
optimize an objective function. Therefore, the ISP may be
approached as an optimization problem whose objective
is the minimization of the cost, as measured in
response time or number of logical reads. The situation,
therefore, is ideal for the proposal of search and optimization
techniques like genetic algorithms for the solution of the ISP.
Genetic algorithms (GAs) are stochastic search and
optimization techniques inspired by the theory of evolution.
Over many generations, populations evolve according to the
principles of natural selection and the survival of the fittest.
Imitating this process, GAs simulate populations of
individuals – each representing a possible or candidate
solution to a single problem – capable of evolving under the
evolutionary pressure exerted by the objective function. The
best solutions (i.e., fittest individuals) have a greater
probability of surviving and, as a result, reproducing (i.e.,
being bred) with other surviving solutions. As will be
described in more detail in the following section, proper
evolution depends largely on the correct initial encoding (i.e.,
genetic representation) of these candidate solutions.
The process carried out by a GA can be summarized in
the following steps. First, an initial population is randomly
generated in which each individual of that population
represents a candidate solution to a single problem and is
genetically represented by one or more chromosomes
(normally bit strings). Once this population has been created,
individuals are evaluated according to a fitness function. The
fittest individuals are selected for reproduction using the
genetic operators, crossover and mutation. In crossover, part
of the genetic representation of each parent is handed down
to the child. Mutation, on the other hand, allows for the
appearance of new genetic characteristics in the child, similar
to what occurs in nature, through the random modification of
certain genes. Once this new population has been created and
has replaced the former population, the new individuals are
evaluated for fitness. This process is repeated numerous
times until a termination condition programmed by the
designer has been met. Diverse types of genetic operators for
selection, crossover and mutation exist, as well as distinct
probabilities with which they may be applied. The parameters
chosen in this study allow for the replication of the
experimentation environment and will be described in
greater detail in the third section.
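As an illustration of these steps only, the following minimal Python sketch reproduces the canonical generational loop just described. It is not the implementation used in the experiments (which relies on the AForge.Genetic library, as explained in Section III); the sketch uses simple truncation selection with elitism rather than the ranking/roulette-wheel selection applied later, the default rates merely mirror the parameters listed in Section III, and every helper name is illustrative.

```python
import random

def run_ga(fitness, n_bits, pop_size=40, generations=100,
           crossover_rate=0.75, mutation_rate=0.10):
    """Canonical generational GA over bit-string chromosomes (minimal sketch).

    `fitness` maps a bit list to a cost to be minimized (lower = fitter).
    """
    def random_individual():
        return [random.randint(0, 1) for _ in range(n_bits)]

    def crossover(a, b):
        # Uniform crossover: every gene is inherited from one parent or the other.
        return [random.choice(pair) for pair in zip(a, b)]

    def mutate(ind):
        # Single random bit flip.
        ind = ind[:]
        ind[random.randrange(len(ind))] ^= 1
        return ind

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness)           # evaluate and rank
        parents = scored[: pop_size // 2]                  # fittest half breeds
        children = [scored[0]]                             # elitism: best survives
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b) if random.random() < crossover_rate else a[:]
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = children                              # replace the old generation
    return min(population, key=fitness)

# Toy run: minimize the number of set bits (a stand-in for a real cost function).
best = run_ga(fitness=sum, n_bits=20)
```

In the ISP setting, each bit string encodes the presence or absence of candidate indexes, and the fitness callable would return the cost of executing the workload under that configuration.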
The selection here of GAs over other techniques is due to
a number of reasons including, firstly, their behavior when
working with very large search spaces, something quite
common in real-world examples of the ISP. Furthermore,
GAs are highly recommendable techniques for the solution
of non-linear optimization problems, and their performance
has been sufficiently demonstrated for both academic
[12],[17] and real-world [16] search and optimization
problems. As an additional consideration, in real-world
instances of the ISP, it is not a necessary prerequisite that the
global optimal solution be found if the pinpointing of a local
optimal solution can offer a significant improvement.
Finally, GAs have proven to be robust techniques in the
presence of fitness function noise, something generally
present in real-world cases.
It is important to mention that the application of GAs to
the ISP has been proposed in this study as a hypothesis
which, in the case that it proves to be valid, can be used as a
launching point for future studies and improvements. The
proposal of this technique here, however, is not intended to
rule out or take the place of a deep analysis of the fitness
landscape to determine if any alternative and more adequate
optimization techniques exist.
In order to adapt the ISP to the GA, the former may be
reformulated in the following way: the system ought to find
an array of variables x ∈ M – where M is the total
number of possible index configurations in the system – that
minimizes the response time and number of logical reads
required for a defined set of instructions. The objective
function (i.e., fitness function) will be that which minimizes
the cost in time and/or reads resulting from the execution of
a predefined set of SQL sentences under the configuration of
indexes proposed by the array of variables.
The parameter M can be calculated using Equation (1):

(1)   M = \sum_{i=1}^{nCols} C_{nCols}^{i} \cdot nIndexType_{i}

being C_{nCols}^{i} = \frac{nCols!}{(nCols - i)! \cdot i!},
where nCols represents the total number of columns
involved in the study, where i permits the study of each
possible index grouping including anywhere from 1 to nCols
elements, and where nIndexType represents the total number
of index types (e.g., secondary, bitmap, cluster, etc.) that
could be included in the study. The resulting number of
possible combinations (M) will then be used to determine the
size in bits of the chromosome (N) in Equation (2).
(2)   N = \lfloor \log_2(M) \rfloor + 1, where \lfloor x \rfloor = n if n < x < n + 1 and n ∈ \mathbb{N}
This encoding has the advantage of working with bit
strings, the most recommendable for GAs. Values 1 and 0
represent indexes that exist and do not exist, respectively.
Nevertheless, the encoding has the inconvenience of
presenting a certain amount of redundancy depending on the
proximity of M to a power of two. As a solution, this paper
proposes the use of non-binary chromosomes based on the
types of possible indexes (nIndexTypes).
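As a concrete illustration of Equations (1) and (2), the short sketch below computes the search-space size M and the chromosome length N for an invented configuration. For simplicity it assumes that the same number of index types applies to every column grouping; the specific figures are examples only and do not correspond to the experimental scenario.

```python
from math import comb, floor, log2

def search_space_size(n_cols, n_index_type):
    # Equation (1): every grouping of 1..nCols columns may be materialized
    # as any of the nIndexType index types (assumed constant here).
    return sum(comb(n_cols, i) * n_index_type for i in range(1, n_cols + 1))

def chromosome_bits(m):
    # Equation (2): smallest bit-string length able to encode M combinations.
    return floor(log2(m)) + 1

M = search_space_size(n_cols=9, n_index_type=2)   # e.g., 9 columns, 2 index types
N = chromosome_bits(M)                            # -> M = 1022, N = 10 bits
```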
III. EXPERIMENTAL DESIGN
The principal objective of this study is to determine
whether the GA can solve the ISP efficiently. In
order to carry out these experiments, a relational database
management system (RDBMS) was required that allows for the
use of indexes and provides statistical information about
their use and performance. As a result, the Oracle Database
11g™ (http://www.oracle.com/us/products/database) was
chosen not only because it fulfilled these prerequisites, but
also due to its widespread use by consumers (that said,
however, it is important to recognize that the proposal could
be scalable to any other DBMS, as well). The DBMS was
installed on a dedicated server (Quad-Core AMD Opteron™
8356 processor at 2.3 GHz with 8 GB of RAM) running
a Windows Server 2008 Enterprise (64-bit) OS. The
experimental scenario and files used to run the experiments
can be found at http://labda.inf.uc3m.es/evolutionary/.
As mentioned previously, the hypothesis tested by this
study was whether the method based on GAs could yield
solutions to the ISP with lower costs than other methods with
respect to response time and number of logical reads
(specifically, consistent gets [6]). These other methods tested
here included the recommendations from tuning experts for
the selected DBMS, as well as the proposals from the widely
used analytical tool, Toad® (http://www.quest.com/toad)
for Oracle. In each case, a resulting set k ∈ P(C0) was
yielded which was efficient for a database (DBi) and given
workload (W), and with a set of candidate indexes C0.
A. Independent Variables
For this experiment, a database was used with a single
table whose description included 25 fields, four of which
were numeric and the rest of which were alphanumeric.
Thus, while the database used contained only a
single table, that table had to be significantly
large. The hypothesis was tested in two different
scenarios. In the first, the table contained 6 million
records (approximately 3 GB); in the second, it was
increased to 30 million records (nearly 15 GB).
In both scenarios, the maximum and average record sizes per
table were 521 and 371.63 bytes, respectively.
Regarding the measurement of performance, not only the
database itself and its design are crucial, but so too are the
state of the DBMS (in particular, those of its buffers) and,
naturally, the workload (W). Even though the design of W does not
affect the behavior of the proposal, it is nevertheless
convenient to design W such that performance can be
improved or worsened depending on the inclusion of
different indexes. Therefore, in this experiment, W included
a sufficient number of updating and query operations (with a
0.1% volatility in the former). The queries (eight in total)
involved conditional expressions over one, two or three
columns in the table, affecting nine columns overall. In order
that each execution could be carried out under the same
conditions, buffers were flushed in each of the iterations.
Additionally, since the different updating processes change
the state of the database, it was returned to its original state
following each execution of W. In this study, only the cost
for the execution of W, rather than for re-initialization and
preparation processes, is taken into account.
In order to demonstrate the goodness of the proposal,
indexes for each of the columns involved in a selection were
included in the set of candidate indexes for the experimental
scenarios. Some of these columns, for example, gave rise to
multiple candidates since two distinct index types (i.e.,
secondary and bitmap) were considered for columns with
low cardinality (i.e., fewer than 28 distinct values). Finally,
toward a more realistic experiment, multi-attribute indexes
were included for columns appearing together in the same
sentence. In the end, the set contained some twenty indexes
among which the most appropriate were to be identified.
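A candidate set of this kind can be sketched programmatically as follows. This is a hypothetical helper written for illustration, not the procedure used by the authors or by any advisor tool: single-column secondary indexes are proposed for every column referenced in a selection, a bitmap alternative is added when the column's cardinality falls below the threshold mentioned above, and a multi-attribute index is proposed for columns appearing together in the same sentence.

```python
def candidate_indexes(query_columns, cardinality, bitmap_threshold=28):
    """Hypothetical candidate-set builder.

    query_columns: one set of referenced columns per query in the workload.
    cardinality:   mapping column name -> number of distinct values.
    """
    candidates = set()
    for cols in query_columns:
        for col in cols:
            candidates.add(("SECONDARY", (col,)))               # default B-tree index
            if cardinality.get(col, float("inf")) < bitmap_threshold:
                candidates.add(("BITMAP", (col,)))              # low-cardinality alternative
        if len(cols) > 1:
            candidates.add(("SECONDARY", tuple(sorted(cols))))  # multi-attribute index
    return candidates

# Toy usage with invented columns and cardinalities.
cands = candidate_indexes(
    query_columns=[{"city", "status"}, {"status"}, {"amount"}],
    cardinality={"city": 500, "status": 5, "amount": 100000},
)
```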
B. Dependent Variables
The metric used here to demonstrate the efficiency of a
particular method was the number of data block consistent
gets [6]. This measurement can be as representative as
response time and offers a good deal of independence from
other factors (e.g., state of the database server, processes
foreign to the DBMS, state and management policy of the
intermediate memory, state and configuration of the storage
devices, etc.) that could potentially affect performance. As a
secondary metric, response time was also observed. In order
to reduce alterations due to other factors, each execution was
repeated 50 times with the DBMS cache emptied for each
respective repetition. For response time, the average of the
measurements obtained in each execution was used.
C. Methods
This study compares the following methods:
• Toad® for Oracle: a commercial tool with two
distinct configurations (Toad2 and Toad5).
• Experts: recommendations proposed by two
professional administrators, Expert #1 and Expert
#2, with access to the database and to all the
statistical data collected by the DBMS.
• GA: a recommendation based on the application of a
simple or canonical GA following Goldberg [16].
• None: configuration by default from the DBMS.
In order to allow for the easy replication of these
experiments, AForge.Genetic, an open-source genetic
algorithm library, was used. Furthermore, the genetic
operators of selection, crossover and mutation utilized in the
experiments were those implemented by default in the
library. The GA execution parameters tested were:
• Population Size: 40 / 100
• Selection Meth.: Ranking/Roulette-wheel (both with
elitism)
• Crossover (type and rate): Uniform; 75%
• Mutation (type and rate): Single Chr. random bit
flip; 10%
• Termination Condition: 100 generations
• Evaluation Function: fitness is the number of Oracle
consistent gets, to be minimized
The GA procedure applied in the experiments can be
summarized as follows. First, the individuals of the initial
population were randomly generated. Following this
initialization process, each individual was evaluated during
the sequential execution of the workload with the number of
consistent gets used as the fitness value. Once evaluated, the
fittest individuals were selected using the ranking/roulette-wheel technique. Of the selected individuals from the initial
population, 75% were then subjected to crossover, followed
by the mutation of one of the genes in 10% of the selected
individuals. At this point, the new individuals obtained were
evaluated (it was not necessary to re-evaluate the unmodified
individuals) and selection, crossover and mutation were
repeated until satisfying the termination condition. The
computational cost of each iteration (i.e., generation) of the
process depended on the evaluation of the new individuals in
that iteration. In each evaluation, the database was returned
to its initial state, the indexes proposed by the individual to
be evaluated were created and the workload was launched.
Thus, the execution time for a complete generation could be
measured as the time taken for the DBMS to carry out this
process multiplied by the number of individuals to be
evaluated. Depending upon the convergence of the GA, the
number of new individuals analyzed would diminish, thereby
increasing the execution velocity. Despite the procedure
described and used in the present experiment, the possibility
of carrying out a simulation of the workload rather than its
execution should not be ruled out for future studies.
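The per-individual evaluation can be pictured roughly as in the sketch below. This is only an illustration of the procedure just described, not the code used in the experiments: the cx_Oracle connection, the reset_database and decode_indexes helpers, and the workload list are all assumed; the buffer cache is flushed so that every run starts from the same state, and the fitness is taken as the increase in the instance-wide consistent gets counter (read from Oracle's v$sysstat view) caused by executing the workload.

```python
import cx_Oracle  # assumed driver; any connection mechanism would do

def consistent_gets(cur):
    # Cumulative counter; the fitness uses its increase around the workload run.
    cur.execute("SELECT value FROM v$sysstat WHERE name = 'consistent gets'")
    return cur.fetchone()[0]

def evaluate(individual, conn, workload_sql, reset_database, decode_indexes):
    """Fitness of one index configuration = consistent gets spent on the workload."""
    cur = conn.cursor()
    reset_database(conn)                              # return the DB to its initial state
    for ddl in decode_indexes(individual):            # chromosome -> CREATE INDEX statements
        cur.execute(ddl)
    cur.execute("ALTER SYSTEM FLUSH BUFFER_CACHE")    # identical buffer state per run
    before = consistent_gets(cur)
    for sentence in workload_sql:                     # sequential execution of W
        cur.execute(sentence)
    conn.commit()
    return consistent_gets(cur) - before              # lower is fitter
```

Caching the fitness of unmodified individuals, as described above, avoids repeating this costly evaluation in later generations.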
IV. EXPERIMENTAL RESULTS
Two experiments were carried out using tables with 6
and 30 million records, respectively. Once the solutions
proposed by the GA and other sources (see Methods) had
been obtained, 50 executions of each solution in the same
conditions were carried out. Averages of the measurements
from those executions were taken for a comparative analysis.
A. First scenario: 6 million records
In the first scenario, different index combinations were
obtained by each of the methods tested – the GA, the
commercial tool with two different configurations (i.e.,
Toad2 and Toad5), the two experts consulted and the empty
(i.e., default) configuration. In figure 1.b, response times for
each of the proposed combinations (averages from the 50
executions) are compared. In figures 1.a and 1.c, the same
solutions for each different method are detailed with respect
to consistent gets and database block gets, respectively.
B. Second scenario: 30 million records
In the second scenario, the same experiments were
repeated on a table five times larger than in the previous
scenario. Similar to the first scenario, the proposed solutions
were different for each method evaluated. However, results
also differed in the majority of cases from those observed in
the first scenario with the smaller table. As evidenced by
figure 2.a, the GA again optimized consistent
gets, which in turn led to the response time optimization
seen in figure 2.b. In addition, as can be seen in figures 2.c
and 2.b, the large reduction in database block gets in the
combinations proposed by the index experts resulted in a
response time that came significantly close to, yet without
surpassing, that of the solution proposed by the GA.
C. Additional experimentation
The virtual machine software VMware (www.vmware.com) was
installed on the DB server, allowing processor and main memory
resources to be restricted, and each
experiment was repeated under different conditions. While the
experiments, therefore, were executed with diverse hardware
configurations and workloads, the results obtained in each
case were nevertheless similar with respect to the superior
performance of the GA over the other methods tested.
In order to analyze the diversity of solutions proposed by
the GA (due to its stochasticity), an additional experiment
was performed which sought to repeat the executions of the
GA with the same conditions as in the first execution.
However, variations in the proposed solutions were rarely
produced, and in those cases they could be traced to the
appearance of a local minimum that was close to the global
optimum and, objectively speaking, superior to all those
proposed by either the commercial tool or the experts.
D. Statistical testing
In the two scenarios, statistical tests were performed to
prove the significance of the results obtained. First, results
were subjected to the Kolmogorov-Smirnov test to evaluate
the normality of their distribution. Measurements based on
consistent gets were found to have a normal distribution and
a later Student’s t-test (p < 0.05) showed significant results,
ruling out the null hypothesis. Regarding the measurements
based on response time, however, the normality of the
distribution was not clear. As a result, the non-parametric
Mann-Whitney-Wilcoxon test was used, finding statistically
significant differences (p < 0.05) in each case.
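Assuming the 50 measurements per method are available as plain arrays, the testing pipeline described above can be reproduced with standard routines from scipy.stats; the data below are synthetic placeholders, not the values measured in the experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ga_gets = rng.normal(41_000, 500, size=50)       # placeholder: 50 runs of the GA solution
expert_gets = rng.normal(52_000, 800, size=50)   # placeholder: 50 runs of an expert solution

# 1. Kolmogorov-Smirnov test for normality (against a standardized normal).
z = (ga_gets - ga_gets.mean()) / ga_gets.std(ddof=1)
_, p_normal = stats.kstest(z, "norm")

# 2. Student's t-test if normality holds, Mann-Whitney-Wilcoxon otherwise.
if p_normal >= 0.05:
    _, p_value = stats.ttest_ind(ga_gets, expert_gets)
else:
    _, p_value = stats.mannwhitneyu(ga_gets, expert_gets)

significant = p_value < 0.05   # reject the null hypothesis of equal performance
```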
V. DISCUSSION
The metric used in the objective function for the GA in
this study was the number of consistent gets rather than
response time. This was due to the far greater variability of the
latter metric, which is influenced by diverse factors that introduce
noise into the measurements. Even though the selected algorithm is
robust with respect to noise, using the response time metric
for convergence would nevertheless be much more costly.
Furthermore, the selected metric is also effective in reducing
database access time, which led the GA to obtain optimal
results, as can be observed in figures 1.a and 2.a.
While the commercial tool demonstrated positive
improvements in performance, they were inferior to those
presented by the GA. One of the causes for this difference
can be found in the volatility of the tables (since the
workload includes updating operations). In general,
commercial tools tend to focus on the optimization of record
localization operations and often only on the localizations
with larger margin for improvement. Thus, the tools pay
little or no attention to the cost coming from table and
resulting index updating operations. Such operations are
similarly difficult for experts to assess. By contrast,
the nature of the operations executed remains
completely foreign to the GA, which only observes the
associated metric (i.e., consistent gets, in this case).
It is also important to note that the proposals of the
commercial tools include the tuning of certain instance
configuration parameters which, when combined, may yield
satisfactory results. The present study has left these
parameters out of the experiments discussed here, despite the
fact that they could also be used by the GA for the
comprehensive tuning of the database. This fact constitutes
an extremely interesting area for future research.
Regarding the performance of the GA relative to the
other mechanisms tested, one must note that the latter
required data about the execution of the workload over a
number of days, data which the GA did not need.
The GA, however, has disadvantages in resource consumption and
execution time due to its need to test, establish and retract diverse
configurations. Since this lowered performance would be
undesirable in many production servers, all experiments
of the present study were conducted on a replicated server.
The results presented here clearly support the use of GAs
in the ISP. Nevertheless, future results could be improved further by
testing the application of additional metrics or even the
simultaneous application of multiple metrics (multi-objective
algorithms). Finally, the improvement of the effectiveness of
the algorithm without efficiency loss is another future goal.
VI. CONCLUSIONS
The current study has evaluated a proposal for the
solution of the ISP using GAs. In the course of that
evaluation, the efficacy of GAs has been compared with that
of commercial tools as well as experts in database
administration and tuning. In each case studied, the solution
offered from the GA was superior to those obtained through
the other methods used. While these results clearly support
the case for the use of GAs on a professional level, there
nevertheless exist certain performance-related weaknesses as
well as a large margin for improvement that ought to be
explored before any particular product is marketed.
The consumption of DB resources during the exploration
of solutions is high and unacceptable for many production
servers. In such cases, therefore, it is the recommendation of
the authors to execute the GA over a replicated server.
Figure 1 (a, b, c). Average results from table with 6 million records
Figure 2 (a, b, c). Average results for table with 30 million records
A particular advantage presented by GAs is that their
continued execution provides results that dynamically adapt
to variations in the workload over time. Thus, it would be
particularly interesting to dedicate future research efforts to
measure the time elapsed from the moment that workload
changes are produced to the moment when these changes
have a noticeable effect upon the proposal.
The parameters of the experiments carried out in this
study – with respect to the environment, DBMS and applied
workload – were all highly general. Nevertheless, and in
order to further demonstrate this generality and exploit the
use of GAs for the ISP, currently planned studies by the
authors aim to expand research to additional DBMSs, other
auxiliary structures (e.g., R-tree and clustered indexes), as
well as instance configuration parameters. In this way, the
authors hope to encounter a reliable mechanism for
comprehensive database tuning. Similar to the majority of
ISP-related techniques, the proposal presented here is
grounded in a set of candidate indexes. Compared with other
ISP-related techniques, however, this set can be made
relatively large while maintaining reasonable cost margins.
Nevertheless, performance decreases with an increased
number of possible solutions. Furthermore, it must be added
that a large part of the other improvements suggested – the
inclusion of configuration parameters (thereby increasing
chromosome size), the application of additional metrics
(either individual or simultaneous with a multi-objective
algorithm), etc. – may result in even lower performance and a
higher number of generations required to obtain good solutions.
Insofar as the performance of the GA, which was already
a weak point of the technique, could decrease further with
the implementation of these new improvements, it would be
convenient to propose a change in the way solutions are
evaluated by replacing the real executions in the replicated
server with approximate cost calculations using simulations or
heuristics. While this proposal would undoubtedly take away
a certain amount of realism from the measurements, it would
also allow for the improvement of the proposed solutions (by
carrying out larger-scale tests) without greatly affecting the
cost. However, given the fact that the GA explores a volume
of solutions much lower than the total number of possible
solutions, the cost could nevertheless be maintained within
margins unthinkable for more exhaustive searches, even
those carried out through simulation.
Finally, and insofar as it would allow for a more precise
classification of the ISP [15], the analysis of the fitness
landscape constitutes a necessary topic for future research.
With such a classification available, researchers could
therefore perform an important comparative study of the
implementation of alternative optimization techniques.
ACKNOWLEDGMENT
This work has been supported by the Spanish Ministry
of Education and Science (project Thuban TIN2008-02711).
REFERENCES
[1] Agrawal, S., Chaudhuri, S., and Narasayya, V. 2000.
Automated selection of materialized views and indexes for
SQL databases. In Procs. 26th VLDB 2000, pp 496–505.
[2] Agrawal, S. Chaudhuri, S., Kollar, L., Marathe, A. P.,
Narasayya, V.R., and Syamala, M. 2004. Database Tuning
Advisor for Microsoft SQL Server 2005. In Procs. 30th VLDB
Conference 2004, pp. 1110–1121.
[3] Belknap, P., Dageville, B., Dias, K., and Yagoub, K. 2009.
Self-Tuning for SQL Performance in Oracle Database 11g.
IEEE Int. Conf. on Data Engineering (ICDE), pp.1694-1700.
[4] Bellatreche, L., Missaoui, R., Necir, H., and Drias, H. 2007.
Selection and Pruning Algorithms for Bitmap Index Selection
Problem Using Data Mining. In Song et al. (Eds.): DaWak
2007. LNCS 4654, pp 221-230. Springer.
[5] Bruno, N., and Chaudhuri, S. 2005. Automatic Physical
Database Tuning: a Relaxation-based Approach. In Procs.
ACM SIGMOD Conference 2005, pp. 227–238.
[6] Burleson, D.K. 2010. Oracle Tuning: The Definitive
Reference. Second ed., Rampant TechPress.
[7] Caprara, A., Fischetti, M., and Maio, D. 1995. Exact and
Approximate Algorithms for the Index Selection Problem in
Physical Database Design. IEEE Transactions on Knowledge
and Data Engineering 7(6): pp. 955-967.
[8] Chaudhuri, S., and Narasayya, V. 1998. Autoadmin ‘what-if’
index analysis utility. SIGMOD Records 27(2), pp. 367–378.
[9] Chaudhuri, S., Datar, M., and Narasayya, V. 2004. Index
selection for databases: A hardness study and a principled
heuristic solution. IEEE Transactions on Knowledge and Data
Engineering, 16(11): pp. 1313-1323.
[10] Cobb, H.G., and Grefenstette, J.J. 1993. Genetic Algorithms
for Tracking Changing Environments. Proc. 5th Intl. Conf. on
Genetic Algorithms, pp. 523-529.
[11] Dasgupta, D., and McGregor, D.R. 1992. Nonstationary
function optimization using structured genetic algorithm. In
Procs. of PPSN II, pp. 145-154.
[12] DeJong, K. 1986. An Analysis of Reproduction and
Crossover in a Binary-coded Genetic Algorithm. PhD thesis,
University of Michigan, Ann Arbor.
[13] Duan, S., Thummala, V., and Babu, S. 2009. Tuning database
configuration parameters with iTuned. Proc. VLDB Endow.
2, 1, pp. 1246-1257.
[14] Finkelstein, S., Schkolnick, M., and Tiberio, P. 1988. Physical
database design for relational databases. ACM Transactions
on Database Systems 13(1): pp. 91–128.
[15] Galindo-Legaria, C., Waas, F. 2002. The Effect of Cost
Distributions on Evolutionary Optimization Algorithms.
GECCO 2002: 351-358.
[16] Goldberg, D. 1989. Genetic Algorithms in Search,
Optimization, and Machine Learning. Addison-Wesley.
[17] Holland, J. 1975. Adaptation in Natural and Artificial
Systems. University of Michigan Press.
[18] Kratica, J., Ljubić, I., and Tošić, D. 2003. A genetic algorithm
for the index selection problem. In G. R. Raidl et al. (Eds.),
EvoWorkshops 2003. LNCS vol 2611, pp 281-291, Springer.
[19] Schnaitter, K., Polyzotis, N., and Getoor, L. 2009. Index
Interactions in Physical Design Tuning: Modeling, Analysis,
and Applications. VLDB 2009, Vol. 2, N.1, pp. 1234-124.