Index selection in a self-adaptive data base management system

Arvola Chan

Index selection in a self-adaptive data base management system

Arvola Chan

1976

visibility

…

description

8 pages

link

1 file

We address the problem of a,utomrticaiiy adjusting the physical organization of a data base to optimize its performa& as Its access rquirements change. We describe the p'rindpies of'the automatic index ,selectioB facility of a prototype self-adaptive data base management system that is currently under development. The importance of accurate usage model acquisitioti and data. characteristics estimation is stressed. The statistics gathering mechanisms that' are being incorporated into our prototype system are discussed. Exponential smoothing techniques are used for averrging statistlo observed ,over different periods uf time in order to.predict future characteristics. An heuristic algorithm for selecting indicts 'to .matc@ projected access kqt$~ments'is presented. The cost nude1 on which the decision procedure is besed is flexible. enough to incorporate the overhead costs of index creation, index storage and application pram recompilation.

INDEX SELECTION IN A SELF-ADAPTIVE SYSTEM ‘DATA BASE MANAGEMENT Michrei Hammer Arroii Chan Laboratory forCom~mS&ncr,MIT, Camqrtdge, Massachusetts,02139. We address the problem of a,utomrticaiiy adjusting the physical organization of a data base to optimize its performa& as Its accessrquirements change. We describe the p’rindpies of’the automatic index ,selectioB facility of a prototype self-adaptive data base management system that is currently under development. The importance of accurate usagemodel acquisitioti and data. characteristics estimation is stressed.The statistics gathering mechanisms that’ are being incorporated into our prototype system are discussed. Exponential smoothing techniques are used for averrging statistlo observed,over different periods uf time in order to.predict future characteristics. An heuristic algorithm for selecting indicts ‘to .matc@projected access kqt$~ments’is presented. The cost nude1 on which the decision procedure is besed is flexible. enough to incorporate the overhead costs of index creation, index storage and application p-ram recompilation. INTRODUCTION The efficlcnt utilization of a data baseis highly dependenton the optimal matching of its phystcai~organizatlon to its accessrquirknenu and other ch~ractcristics (such as the distribution of values in It). We consider here the. problem of autorivaticaliy timing the physical organizationof an integrateddata base. By an integrated data base. we mean one that supports a diversity of applicationsin an enterprise, the development .of such data bases.is expected, to be one of the most‘important data processtngactivities for the rest of the 70’s Ill. There are’ many reasons for the incorporatlon of heretoforeseparatebut relateddata baseswith a high degree.of duplication into a slngie Integrated one. The reduction of storageand updating costs,and the elimination of Incontistensies that may be caused by different copiesof the data In different stagesof updating, are among the more important ones. Viewing an integrated data baseas-the repositoryof information for running an enterprise; it can no longer be consldered as a,static entity. Instead, it must be looked upon as continually changing in size, with access gradually .altcring as applications evolve, and a) users.develop familiarity with the system. nscqucntly, the tuning of a data base’sphysical organization must also be’a continual process. In current data asc management systems, the responsibility of maklng reorganization decisions falls on the data base (DBA), whose judgementsare basedon intuition and on a limited amount of communication with some indtvidual data base users. For large integrated data bases,a more systematic means for acquiring Information about data base usage. and a more algorithmic way of evaluating the costs of aiternatlve configurations, wth be essential. A minimal capability of a data base management system should be the incorporation of monitoring mechanismsthat collect.usagestatisticswhile performing query processing..,A more sophisticated sy&m would sense the’ change in accessrqutements, .evaluatc the cost/benefits of various reorganization strategies,and recommendaction to the DBA; eventually,such a systemmieht Lt.&f perform the necessarytuning.’ INDEX SELECTION IN AN ADAPTIVE DA%A BOSESYSTEM. We are currentiy’deveioping a self-adaptivedata basemanagementsystemwhich monitors the accesspatterns and the data characteristicsof a data base,and usesthis information to tune its physical organization. We operate in the environment of a relational data base system,which provides a level of physical data independence that facilitates physlcai reorganization. Continuous monitoring of the usageof a retational data base opens up many possibilities for its reorganization, and we expectto experimentwith a variety of alternatives and study their costs and tradeoffs. As a first cut at the problem, we have concentrated.on the problem of index selection. A secondary index (sometimesreferred to as an inversion) Is a well-known softwarestructure which can improve the performance of accessesto a relation’(file) Cl1 For each.domain(field) of the relation that is inverted, .a table IS maintained, which for each value of the domain in questioncontainspointers to ail those tupies (records) whose contents in thc,desjgnated domain is the specified value. Clearly, the presenceof a secondary index for a particular domain can improve the execution,of many queriesthat referencethat domain; on the other ‘hand; maintenance of such an index has o&s that slow down the performanceof data base updates, insertions. and delictions. Roughly speaking, a domain that is referenced frqucntlj relative to its modtftcation is a good 1 candidate for index maintenance. The choiceof which (if any) domainsto invert must be done with care; a.good choice can significantly improve the performanceof the system,while a bad selectioncan seriously degrade it. The goal of our systemis to makea good choiceof thosedomainsfor which to maintain secondary indices, based on how the data baseis actually used. The operation .of the initial .version of our prototypesystemcan be describedas follows. The specifications of data base interactions,‘by both interactive users and application programs,are expressedin a non-procedural language; these are first translated into an internal representationmadeup of calls to systemlevel modules. The language processorhas availabie to it a modelof the current‘stateof the data base,which contains, among other things, a list of the currently maintained set of secondaryindices,plus various informatio.6 about these 4ndices. Using this information, the language processorcan choosethe best strategy for processing each data base operation in the current environment. Statisticsgathering mechanisms.areembeddedwithin the system modules that interpret the object code of the languageprocessor;and recorddata concerning the execution of every.data base transaction. The statistical information gatheredfor a run is depositedin a collection area and summarized from time to time. When the reorgan,isationcomponentof the systemis invoked (which will occur at ,fixed. intervals of time). the statistical information collected,overthe precedinginterval is combined with statistics from previous intervals and used to obtain a forecastof the accessrequirementsof the upcoming interval; in additiou, a projected assessmentof various characteristiu’.of the data in the data base is made. A near-optimal set of domains for which indices should be maintained is then determinedheuristically; optimality meanswith respect to total cost, taking into account index storageand maintenance,This cost is comparedwith the projected cost for the existing set of indices. Reorganizationis performedonly if its payoff is great enough to,cover its cost as well as that of application program retranslation.. In this paper, we stressour approach to the problemof acquiring an accurateusagemodel and estimating data characteristics, by means of continuous monitoring and the application of forecasting techniques. We have considered the reduced problem of choosing indices for a single relation, but we expect our approach to be extendible to more complex situations once we have developeda model for the cost of processing multi-relation queries using different strategies. We believe,that an accurateinteraction and cost model is essential for a practical environment, and that the use of problem-orientedheuristicsto cut .down the index search.space will be fruitful in achieving efficient and near+imum solution.procedures. ORGANIZATION OF THE PAPER The remainder of this paper is organized as follows. We begin with.a summaryof our view of the data base organization: the file model, the storage and index organizations, and the data base operations. Next, a procedure for determining the strategy(whether or not to useexisting indices)for processingthc,qualif.ication part of a query. is presented. We then describethe statisticsthat are to be gathered during query processing, and explain the use of exponential smoothing techniquesin the+rivation of parametersfor the cost model.’ This is followed by a discussion of the need for heuristicsin solving the index selectionproblem, and then the heuristics that we have devised are presented.Finally, we include a bdef comparisonwith previous related work. FILE MODEL We have chosen the relational model 121of data as the loglcal interface between application programs or interactive users and the data base system,since it provides the level of physical data independence needed to facilitate- physical reorganization. The totality of formatted data ‘in the data base consists therefore. of one or more relations. However, we addresshere the reducedproblemof selectingindices for a data base made up.of ‘a single relation. Herein, the data baseconsistsof a single relation (file) R with n tuples (records)T -. (ti, tg, -..*, tm) where each ti t Di; the lth domain, i.e. R is a subsetof the cartesianproduct of the domains Df, DP *‘““,‘D, Even though .insertion and deletion of tuples are permitted,we will assumethat the cardinal&y (number of tuples) of the relation remains relatively unchangedbetweentwo consecutivepoints at which index selection is considered. Following Rothnie 133,we assumea pagedmemoryenvironmentfor tuple storage. More specificly, the R tuples in the relation R are stored in R pagesof t tuples each,where t is constantand n and p approximately so. The. accessing cost of a page is assumedto’be independentof the sequenceof page accesses, and dominates all other internal processing costs. Hence, the processingcost for a query is measuredsolely in terms of.& number of pages that have to be accessedto resolvethe query. If a query is resolvedby sequentiallyscanning alt tuples in th.e’ relation. then the total number of, pages that are accessedis just p. If an index is available for a domain referenced in a query, it may significantly reducethe numba’of tuples that need’to be examined to resolve the query. These tuples may’correspondinglyresideon fewer than p pages W; shall assumethat such a restricted set of tuples will be rahdomly distributed over the memoryspace,but that it is possibleto access them such that each page is. touched at most once. Yue and Wong 141have derived an exact formula for the expected number of pages that have to be touched in order to accessr tup1e.s which are randomly distributd over the file. Let b(rpf) be the expected number of page accesses for referencingr randomly distributed tuples in a file of p pages each containing t tuples, and let 2 f(0) - 0 f(i+l) - ((~p-l)-ill(~qt-i)~f))tf(i)+ (plrtll(pH-I) then it, has been shown in I4 that b(r;p,t) - f(r). We will makeuse of the above formula todetermine if it is orofitable to use index(u). to resolvea query. INDEX ‘MODEL An index on a domain of a .relation is a mapping from valuesof the domain to tupks inthe relation with those value.% We assumethat indices, just like tuples in the relation,are storedin a paged memory,with the usual twolevel hierarchical organization, i.e. the key valuesand their asmclatedtuple identifier (TID) lbts are assumedto be on separate levels t!il Each value is storedtogetherwith a pointer to the head of its associatedTID list, and the length of that list. To further simplify our discussionhere, we will assumethat the keys and the associated list Pointers and list lengths are organizedas a B-tree [61 eachnodeof which is stored on a page. One or more TID lists may be stored on the .samepage. The costof obtaining the TID list associated.with a domain value. in the index is thus made up of the cost of locating the’head of the list (equal to the height of the B-tree, which is dependent on the number of values in the domain),plus the costof reading the page(s)on which the TID Iist’is stored. Hence; the averagecost for using an index can be readily estimated,given the number of values in .the domain and the lengths of the associatalTID’lists. The cost of modlfying am index as the result of a tuple Insert/delete/update is, however, more difficult .to determine. We assumethat only those domain’values that are associatedwith existing tuples in the relation are stored in the index. In most cases,an insert/delete/updatewill.causea single node in the B-tree to be updated. In rarer cases,a key may have to be inserted into/deleted from the B-tree, resulting possibly in a node overflow/underflow, which may in turn propagatethrough the tree. As for the TID list involved, it is possible’ in many cases to write back a lengthened version of it onto the samepage on which it was stored before the insertion, by utilizing empty spacesin the original page. Occasionally,though, the empty spacemay run out, and it becomesnecessaryto allocatea new page. Similarly, occasionalgarbagecollection may be necessaryto recompact. the TID Ii&, and recover wasted spacecausedby deletions. Therefore, for’ a .given domain, the cost of key insertion, key ‘deletion and. garbage collectiondependson the sequencein which “updating” operations involving this domain are performed; therefore it is very difficult to incorporatethis cost into any,overall parametric tndex maintenance cost,model. We will therefore considerthe index maintenancecost for each domain as made up of two components. The first dependsonly on the frequenciesof lnse_tts/deletes/updates for this domain and consists of the cost of updating a node in the B-tree plus the costof’ reading and writing an average TID list (an update is esjentially equivalent to an delete and an insert); this componentcan be readily estimated.‘The second, made up of the unquantifiable overheadsof node splitting/merging in the B-tree and garbage collection .in the TID space, will be recorded by the statisticsgathering mechanismif the domain ls indexed. Otherwise, it can very roughly be estimatedby using the normalizedaverageoverheadof thoseindices.that are maintained. TRANSkTION c 9 2 m MODEL AND PROCRSSING Out transaction model allows for the retrieval, insertion, uodatinp, and deletion of tuuler in a relation. Data selection (the specification of somesubsetof a relation by.mians 0;.its propertles)‘is the’fundamental component of all, these operations. In choosing data selectionoperatorsto be included in our transaction, we have limited ourselves to’ those for which the.utility of using indices can readily be estimated. These include conjunctions of equality conditions and disJunction of quality conditions. (By an quality condition, we mean a predicate of the form A ‘- k, where.A is somedomain nameand k is a constantor a program variable). The use of indices and. list processing techniques for. resolving thesekinds of queriesis well known I’?),and will not be reiterated here. However, our tuple accesscost model implies that it may not always be desirable to resolve a conjunction or a disjunction of quallty conditions using indices. (This will occur when the set of qualified tuples is expected to reside on close to p pages; we say “close to”, becauseutilizing the index(es) also entails some page accesses.) Therefore, we need a meanswhereby the number of tupks that can be expectedto qualify for such a query can .be estimated. We can define’s selectivity measurefor a domaln index as the averagefraction of the.set of tuples under consideration that have historically satisfied an equality condition involving that domain. We will, furthermore, assumethat the joint ‘sekctlvity of a number of domainsspecified together is equal to the product of’ their individual selectlvities. Hence,glven the sekctivitiu of domainsin eachcondition, the number of tuples that can be expected to satisfy a conjunction of theseconditionscan be determined;.in addition, the number of tuples for a disjunction can be estimatedusing the inclusion-exclusionprinciple 161 Thus, given a query and an exlstlng set of indices, the query processorwill use the availabk indicesonly if the expectedsum of the cost of accessing the indices and the resolved list of tuplcc is lessthan the costof doing a sequentialsearchof the entire tuple space. STATISTICS e I * GATHERING An important component of our adaptive systemis the monitoring mechanismsthat collect usage statistics as data base operations are performed. The statisticsgatheredover a time period are summarizedand then “averaged” with summaries from previous periods. The siatistks,to be gatheredfor the purpose of index selection fall into three general classes. 3 I z (1) update related &tics - This h&s xveral components. First of all, thettare’the total numbers& tuplc!. that ate deleted from (inserted into) the relation in the current time period. (We assume that each insertion or deletion of a tuple will tquite some maintenance for each active index.) In addition, for each domain of th’e relation; we maintain the number of updates made to that domain in the tuplei. (This: involves some maintenance of the index for that domain.) Finally, we record the actual difficultto-patametetire costs of the maintenance of each index (node splitting and merging, garbage co&ction. etc.), measured in terms of the actual number of page accessesexpended for, such overheads. c.2) domain: selectivity statistics -i For each domain, we maintain its average selectivity aver all i~ses of the domain in equality conditions in the current interval. This is accomplished by recording the ,numbet of times tile domain octuts in equality conditions and the selectivity of the domaln ,in each ,of xhese predicaates. The selectivity of the use of a domain in an qtiality condition is measured as follows: . (a) Ii an index for the domain is used tp resolve the -particular quality condition, then the precise selectivity of the domain for this query can be calculated as the fractian of’tupks in tbe.telation with the domain value in question. (b) Suppose the equality condjtion appears in a con]uncUon of conditions that”is resolved ‘b’y sequential scanning. (This scan may be of a reduced set of tupks obtained. via an index.1 Let the total number of tuples scanned be Nb Let the conJunction be of the for& Ci A Ci A I*’ A ,C$ where each of the C, is an ‘quality condition involving domain I$ ‘Let Np Np ““‘; N, be’ the number of tupks’that satisfy Cl, Cl A C+, -, C, A % A - A C, respectively. (Note thati these values are’readily available). The selectjvi~y of domain D, for this $ry is then approximated as Ni/NimI (c) Suppose the quality condition appears in a disjunction if conditions that .is tesolvt+ by sequentia! scanning. .Let the total numbet‘of tuples scanned be Ng. Let the dispnction be of the form Cl V c2 v -.* V C, where each of the C, is an quality tondition itivolving domaln Di. Ler N,, Np “**‘*,N, be the number of tuples that sat!sfy C,, “ci A C+, “‘-; “cl h “c2 A ,y” IJ Ci respectively. (Note that thesevalues ate readily available). The sekctivity of domain Di for this query is then approximated as N,/(No - Sum (N] for ] c 1). Note that this measurement of selectivity accounts for mm-uniform distribution .of domain values over the tuples, as well as the tion-uniform use of domain values in the quality predicates. (3) query type statistics - Every time a query is process& its associated qualifyiq expression is recorded in an access history file using a canonical representation (such as a bit ceding scheme for each domain that appears in the qualification expression and a bit indicating whether it is a conJunction ‘dt’ a disjunction). Note that by recording the actual queries that ticut, we avoid the sttong (and often inaccurate) assumption that the simultaneous occurrence of domains in a query ate mutually independent events. ACCESS PATTERN FORECASTING AND PARAMiTER ESTIMdTION We assume that the index selection problem is to be reconsidered ‘at fixed intervals. At each reotganizatlonal point, we forecast a number of characteristics for the period up to the next reorganizational poinL Specifitilly, v.e predict .the following: (1) the number of occurrences’ of each query type (the type of.a query is detetmjned by whether :its qualification component is a conjunction or a dlsjunction, and by the domains used ip the ‘constituent quality conditioris); .(2) the’sekctivity (3) of each domain of the relation; the numbet,of distinct values’in each domain (if a domain.‘is not indexed in the current interval. then rhe current value for this number is approxitiated as the reciprocal of the observed sekct#viq); (4) the number of tupks (hence the number of pagy),in the entire relation; (51 the.expected cost (in terms of page accesses)of maintaining an ind!x for each domain and .the expected cosr’of ech UK of such an index. We could do these projections sokly on .the basis of statistic colkcted during the previous time period, or we could combini together the statistics collected ‘ooet a11previous periods. However, neither alone would be 4 satisfactory for the purpose of a stable and yet responsiveadaptive system.In thc’former case,the system would be overly vulnerable to chance.fluctuations in accessrequirementsand data characteristics,whereas in the latter case,,it would be too insensitive to real changes Intuitively, a weighted moving average of observations over previous periods should be more satisfactory. We utilize the technique of exponential smoothing [9,101for our .forecasting and estimation proceduresbecauseof its simplicity of.computation, its flexibility, its minimal storage requirement, and its ability ‘to be generalized to account for trends and cycles. The basic formulation of exponenrial smoothing in making a forecastof a discretetime seriesis as follows: new forecast - a 6 (actual observationfrom last period) + (1-a)$ (previous forecast) where a is called a smoothing constantand takeson valuesbetween0 and’l. In essence,this is a weighted average of all previous observations with the weight decreasinggeometricallyover successivelyearlier observations. The ‘rate of response to recent’changescan be adjusted simply by changing the smoothing constant: the larger the smoothing constant, the more sensitive is the forecastto recentchangesand chance‘fluctuations. (The value of. a can be ‘sekcted by the. DBA or can.be adaptively chosenby the system’itselfto minimize the difference, between observed and predkted data.) The compactness of the schemelies in the fact that only two parameters need to be maintain.ed for each time series,the current observationand the previous estimate. The above schemeis used for the estimation of domain selectivity. .Since the file site can generally be expectedto be growing and the level of activity in a data base system can generally be expected to, be upward moving as usersbecomefamiliar .with the capabilities of the system,we will use a modified. version -of exponential smoothing,which takes’trends into account, in order to forecast query frequencies, index maintenancecost,and fik size. Its formulation is as follows 191: I E new average I a e (current observation)+ (l-a) t (old av!rage) current t,rend- new average- old average new trend - a I (current trend) t (l-a) * (old trend) m new forecast - new average+ ((l-a)/u) o (newtrend) In this form, it is necessaryto storeonly the previouslycalculatedvalues for the (new) average and for the (new) trend, and the.calculation is still simple. .4 INDEX SELECTIONATAREORGANIZATIONPOINT AS we have said, at each reorganizational point, a number of forecasts‘are made. A projection is made of the selectivity of each’domain in the relation, together.with the expectedcostsof accessingand ma’intaining an index on that domain during the next interval. In addition; the expectedsize,and ,hencethe storagerequirement, of an index’for each domain is estimated. Then the costof creationfor each index that is not currently maintained it approximated. For any proposedset of indices,we can thus projecta,total costfor the next time.period. (This cost includes retrieval processing; index creation, maintenanceand storage and application program recompilation.) Therefore, we should be able to choosethat setof domainswhoseindexing in the next interval will minimize this total cost. A straightforward approach to the index selectionproblem would be to evaluate the total cost for each possible choice of ,domains to be indexed, and then selectthat set of domains which gives .the smallest cost. With m domains in the relation, there are 2m possiblechoicesof index sets. For small m, this enumerative approach is probably the best strategy,as the optimal combinationof domainsto be indexed is guaranteed to be found. But for moderate m, the costof repeatedapplication of thecost evaluation procedurebecomesvery expensive, and for large m; it is prohibitive. It is not uncommonto find single%elationdata baseswith tens of domains. Therefore, it is appropriate to look for ways whereby the searchspaceof potential index setscan be systematically aeductd There are two possible approachesto this. One is to look for propertiesin the cost function that will allow it to be minimized without exhaustive enumeration,such ai through a depth-first search. This’is the approach that Schkolnick Dlj has taken. .However, even with a simplified model for cost evaluation, his upper bound of the order of 2mO+08 m sets to. be tested is not enough of a reduction to enable the inexpensive. selection of the, optimal’index set for a relation with a moderatenumberof domains. If a more accurateand detailed cust model is to be used, then the i,ncentive for reducing the search.spaceis even stronger. Yet in this case,the hope of fin.ding an algorithmic way of exploring a reducedsearchspaceof practical ‘size that will yield the optimal solution is even’ dimmer. Therefore, it is appropriate to draw on the experience of artificial intelligence researchers working in areaswhere formal mathematicalstructuresare computationally impractical U2.131and use heuristic methods that significantly prune down the searchspaceand that work towards obtaining a near-optimal solutiql. 5 1 111 LaHI INDEX SELECTION HEURISTICS In our index selection procedure, wemakeuse of five problem-oriented heuristics. (1) A near optimum choice of domains to beindexed can be made incrementally using a depth-first search. This heuristic permits analysis of the problem as a stepwise minimization, each time adding to.the index ,tEt that domain which will bring the best improvement to the cost function. There .have been two previous suggestions regarding the incremental selection of domains to be itrdexed.. Farley and Schuster Ml suggest that the incremental selection process can’be terminated once no single domain ‘in the non-indexed set can be chosen that will yield incremental cost/benefit& This Is insufficient for our choice of query and tuple access’models. There are two reasons why it may be necessary to consider the incremental savings brought by considering two or more indices together. First, it may happen that.for a query involving a conJunction of conditions, the selectivity of any one domain may n,ot be’sufficient to restrict the number of tupks to be accessedto be less than the total number of pages in the ‘relation, whereas the joint selecttvity of two or more domains might Secondly, a d!sjunction. ofatnditlons can An be resolved via indices only if all of the domains involved In the. disjunction are indexed. alternative strategy has been suggested by Held [IQ who, at any staged the Incremen~lIndex selectidn procedure, considers the incremental savings of each of the possible subset of domains in the candidate set with less than or equal to some,fixed number of domains in it. This, of course, may be very inefficient. We have taken an intermediate.approach. We consider the adjoining of multiple domains to the index set only if no single domain that will yield positive*incremenhI savings can ‘be found. (2) Some domains can be eliminated from the initial candidate set by virtue of their low’occurrence frquencies in queries. This ,effectively reduces m, the initial number of domains. in the candidate set. From the forecasted frequency of each retrieval specification, we can find. the pro&ted number of occurrences of each domain. Using the selectivity estimate of the domain, we,can find a gross upper bound on .the number of page accessesthat an index on the domain can save inthe processing of the forecasted set of queries. If this upper bound is less than the projected cost of m&talning.an index on the domain, then this domain can safely be excluded from the initial candidate’ set, i.e., the domain. occurs so infrquently that it can never be profitabk to index it. The upper bound ‘is calculated as follows. An index saves the largest number of page accessesif its domain only occurs in one-term queries, Under, this assumption, the maximum number of page accessesis saved for each occurrence Jf all the qualified tuples are clustered together. Let p - number of pages in relation s I selectivity of domain f - number of occurrences then an upper bound on the number of page accessessaved is f * p * (1 - s).. (Foreach occurrence, we would have accessp pages without an index, and only s9p with one). (3) Only a small subset of all possible candidates need be considered to determine the next domain or set of domalns to be adjoined to the index set at each stage. We can rank. the domains by the above described upper bound and consider only the top ranking M domains. and combinations of these, for’ detailed incremental savings calculation. Furthermore,‘.a.bound M’ (M’&i) can be put on the n.umber of domains that will be considered together. (4) Not all queries can use indices profitably. The expected number of tuples that will qualify for each query type can be estimated using the selectivity of the domains involved. Those that cannot profitably make use of indices’are eliminated from the pro&ted query set whose processing cost is to be minimized. This eliminates some unnecessary cost calculations. Similarly, only.queries that involve domains’still in the candidate set need to be retained in the query set for incrcmeptal savings calculation. (This heuristic does not cut down the search space, but does speed up the cost evahsation at each step.) (5) An upper bound can be put on the number of costevaluations that are performed selection procedure. The procedure is terminated when this bound is exceeded. in the entire To tllustrate the above heuristics, we present the details of our index selection ‘procedure. Our procedure can be divided into two stages. During the first, a tentative set of domains to be indexed fs chosen in an jtrcremental fashion. It is possible to put a bound on the cost of this phase by ad&sting the parameters M and M.‘. An. alternative (Or additional) constraint is to bound the total .number of sets of indices for which cost evaluation is performed. (1) Rank the domains in the relation using the procedure described above atid let S be the set of domains that occur so infrequently that indexing them can not be profitabk 6 (2) Partition the set of domains D in the relation into tliree.dis&int subsets: D, - the index set, Dc - the candidate a&, and D, - the non-index set Initialize D, to null, D, to S, and Dc to D - D,. (3) Consider in turn the incrementalsavingsgained by indexing eachof the M top ranking domains. in the candidate set. Adjoin to D, the one that will give the best improvement to the cost function. If one cannot be found, then consider larger-sizedcombinations(up to M3 of these M domains. Consider combinations of the next larger size only if it is not profitable to ad&in any of the combinations less than or qua1 to the current size. Revert to consideradjoining single domains after each adjoinment. Remove the domains from DCas they are adjoined to D, (4) Terminate the first phaseof the procedureif: (a) no subset of the M (or number of domainsleft in the candidate set, whichever is smaller) top ranking domains up to size hi’ can be chosenthat will improve the cost function. (b) the upper bound on the total numberof costevaluationsis reached. The next stage may be called the bump-shift phaseB21. Domainsthat have been adjoinedto the index set early in the first stage may turn out to be uneconomicalas the result of later addition of other domains to the set, and thus should be removed from the index set. Only individual domainswill be consideredfor removal from the tentatively chosen index set.’ If more.than one can be chosen,then the one that leadsto the best cost improvement is removed first,’ and the procedure repeateduntil a local minimum,of the cost function is reached. (A possible extention to this stageis to consider the profitability of maintaining combinedindices for those domains that are adJoined to the index set together.) Once an optimal setof domainsfor indexing has been chosen, the total ,cost for this choice of domains, including query processingcost,index storage,maintenanceand creation (if applicable) costs, is compared with the cost for the existing set of indices. Reorganization is done only if the difference between the two is enough to cover the application programrecompilationcost. COMPARISON WITH PREVIOUS WORK We have presented an experimental and heuristic approachto the index selectionproblem that is different in many respectsfrom recentstudieson index selectionby StonebrakerOSI,King U7l, Schkolnick Oil, Farley M, and Held Da]. These other studies have either beenformal analyses,which have made many simplifying assumptions in order to obtain an analytic solution, or elsesystemdesignsthat have been incompleteor unrealistic in various w’ays. Our work attempts to go farther than theseby utilizing more complete.andaccurate models of cost and access,and by emphasizing important aspectsof realistic data baseenvironments. Below we summarize,the novel aspects of our approach. We have stressedthe importance of accurate usage model acquisition and data characteristic estimation in a dynamic environment where accessacquirementsare continually ‘changing. ‘.We believe it necessary to apply forecasting techniques to predict future accessrequirements based on past observations, in order to capture and respond to the dynamic and changing nature of data .base usage. Our. scheme endeavours to obtain a precise model of data base usageby recording actual query patterns, thereby avoiding the strong ‘assumption that the probabilities of two domains appearing -in a query are mutually independent, We also take into considerationthe facts that valuesof a domain are not equally likely to be used in querks, and that they are not evenly distributed amongtuplesof the relation, by monitoring the actual sekctivities of domain values that are used in queries. The size of actual data basesis reflected,in our concernfor efficient heuristics to speedup the index selection process. Our cost models account for such real overheadsas the expenseof index accessingand the cost of key insertion/deletion and garbagecollection in index maintenance,and are basedon reasonablemodelsfor data base storage. Our approach of minimizing the total processingcostfor the upcoming time interval, rather than the expected cost for a single query, is flexibk enough to accountfor the overhead costs of index creation, index storage and application program recompilation. CONCLUSIONS We have presenteda high-level descriptionof our approachto th.epmblem of index selectionin an adaptive data base management systemthat we are developing. This has beendone,however,only in the restricted environment of a single-relation data baseaccessedthrough a restrictedinterface with limited capabilities for the selection of data: TO fully realize the flexibility of a relational data base, it is necessaryto consider’s multi-relation environment together with a high-kvel non-pro&dural language Interface that permits queries. with arbitrary interconnection betweenrelations in the qualification part and high level operatorson the qualified data. In s’uch an environment, it is necessaryto consider the utility of indices for more complicated operations (such as restriction, projection, division, join, etc. [21)and to selectindicesfor all the relations in the data base as a whole. This ‘Is where the recording of detailed accesshistory is necessaryfor optimal index selection, and the use of heurisdcs should be fruitful:for cutting down the searchspace and for selecting.richer index structures (such as combined indices). Our heuristic index selectionprocedureshould be readily extendible to such an t?iWolr~nt, 7 1 provided a cost evaluation procedurehas been defined to estimatethe cost of processingan arbitrary query with the presence of a particular set ?f indices. Our plan is to experimentally assess the optimality of our heurlst.ic index selection algorithm on a reducedenvironment,before embarking on the more ambitious propct of index selection for a more.genetal environment. More fundamentally,our intent Is to experimentally study the needs for, and capabilities of, a self-adaptive data managementsystemin realistic data baseenvironments.. ACKNOWLEDGEMENTS This research’was supported by the Advanced ResearchProjectsAgencyof the Department of D.efenseand ivat monitored by the Office of Naval Researchunder contractno. N9Wlf-1. REFERENCES W J. Martin, “Computer Data-Base Orga?ization”, Prentice Halt Inc. Englewocd Cliffs, New Jersey, 1975. f21 E. F. Codd, “A Relational Model of Data for Large SharedData Banks”,CACM, Vol. 13,No. 6, June, 1970.. C31J. B. Rothnie, T. Lozano, ‘Attribute BasedFile Organization in a Paged Memory Environment”;CACM, Vol. 17,No. 2, Feb., 1974. f41 P. C. Y’ue, C. K. Wang, “StorageCost Considerationsin SecondaryIndex Selection-,International Journal of Computer’and Information Sciences,Vol. 4, No. 4,1975. ‘f5l A. F. Cardenas, “Analyiis and Performanceof Inverted Data BaseStructures”,CACM, Vol. 18,‘No. 6, May, 1975. 163R..Bayer, E. McCreight, l Organization and Maintenanceof Large Ordered Indexes”, Acta Informaticq. Vol. I, Fast. $1972. f71 M; M. Astrahan; D. DChamberlin, “Implementationof a StructuredEnglish Query Language”, .Proceedings of the ACM-SIGMOD Conferenceon’the Managementof Data,May, 1975. 181C. L. Liu, “Introduction to Combinatorial Mathematics’McGraw-Hill Book Company,1968. I91 R. G. Brown, “Statistical Forecastingfor Inventory Controls,McGraw-Hill Book Company, 19.59. El01R.-G. Brown,’ “Smoothing, Forecastingand Prediction of DiscreteTime Series: Prentice Hall Inc, Englewood Cliffs, New Jersey,1962. 011 M. .Schkolnick, “The Optimal Selection of SecondaryIndices For Files”, ‘ResearchReport, Department of Computer Science,Carnegie-Mellon University, Nov.,l974. I121A. A. Kuehn, M. J. Hamburger, “A Heuristic Programfor Locating Warehouses”,Management ‘Science,-Vol. 9;.No. 4, July, 1963. 1131R. C. Meier, W. T. Newell, H. L. Pazer, “Simulation in Business and Economics”, Prentice Hall Inc., Engjewood Cliffs, New Jersey,1963. flt] J. H. G. Farley, S. .A. Schtuter, Query Execution and Index Selectionfor Relational Data Bases”, Technical Report CSRG-53,University of Toronto, Mar;,.1975. 051 G. D. Held, “Storage Structures for Relational Data Base Management SystemsI: Memorandum No. ERJM533, University of California, Berkeley,Aug.,.!975. OS1.M. Stonebraker, The Choice of Partial Inversionsand CombinedIndice?, International Journal of Computer and Information Sciences,Vol. 3, No. 2,1974. fll W. F. King, ‘On the Selection,of Indices for a File”, IBM ResearchRJ 1341,San Jose,Jan., 1974. 8

Log In

Index selection in a self-adaptive data base management system

Related papers

Related papers

Related topics