INDEX
SELECTION
IN A SELF-ADAPTIVE
SYSTEM
‘DATA BASE MANAGEMENT
Michrei Hammer
Arroii Chan
Laboratory forCom~mS&ncr,MIT,
Camqrtdge, Massachusetts,02139.
We address the problem of a,utomrticaiiy adjusting the physical organization of a data base to
optimize its performa& as Its accessrquirements change. We describe the p’rindpies of’the
automatic index ,selectioB facility of a prototype self-adaptive data base management system
that is currently under development. The importance of accurate usagemodel acquisitioti and
data. characteristics estimation is stressed.The statistics gathering mechanisms that’ are being
incorporated into our prototype system are discussed. Exponential smoothing techniques are
used for averrging statistlo observed,over different periods uf time in order to.predict future
characteristics. An heuristic algorithm for selecting indicts ‘to .matc@projected access
kqt$~ments’is presented. The cost nude1 on which the decision procedure is besed is flexible.
enough to incorporate the overhead costs of index creation, index storage and application
p-ram recompilation.
INTRODUCTION
The efficlcnt utilization of a data baseis highly dependenton the optimal matching of its phystcai~organizatlon
to its accessrquirknenu and other ch~ractcristics (such as the distribution of values in It). We consider here the.
problem of autorivaticaliy timing the physical organizationof an integrateddata base. By an integrated data base.
we mean one that supports a diversity of applicationsin an enterprise, the development .of such data bases.is
expected, to be one of the most‘important data processtngactivities for the rest of the 70’s Ill. There are’ many
reasons for the incorporatlon of heretoforeseparatebut relateddata baseswith a high degree.of duplication into
a slngie Integrated one. The reduction of storageand updating costs,and the elimination of Incontistensies that
may be caused by different copiesof the data In different stagesof updating, are among the more important ones.
Viewing an integrated data baseas-the repositoryof information for running an enterprise; it can no longer be
consldered as a,static entity. Instead, it must be looked upon as continually changing in size, with access
gradually .altcring as applications evolve, and a) users.develop familiarity
with the system.
nscqucntly, the tuning of a data base’sphysical organization must also be’a continual process. In current data
asc management systems, the responsibility of maklng reorganization decisions falls on the data base
(DBA), whose judgementsare basedon intuition and on a limited amount of communication with
some indtvidual data base users. For large integrated data bases,a more systematic means for acquiring
Information about data base usage. and a more algorithmic way of evaluating the costs of aiternatlve
configurations, wth be essential. A minimal capability of a data base management system should be the
incorporation of monitoring mechanismsthat collect.usagestatisticswhile performing query processing..,A more
sophisticated sy&m would sense the’ change in accessrqutements, .evaluatc the cost/benefits of various
reorganization strategies,and recommendaction to the DBA; eventually,such a systemmieht Lt.&f perform the
necessarytuning.’
INDEX SELECTION IN AN ADAPTIVE DA%A BOSESYSTEM.
We are currentiy’deveioping a self-adaptivedata basemanagementsystemwhich monitors the accesspatterns and
the data characteristicsof a data base,and usesthis information to tune its physical organization. We operate in
the environment of a relational data base system,which provides a level of physical data independence that
facilitates physlcai reorganization. Continuous monitoring of the usageof a retational data base opens up many
possibilities for its reorganization, and we expectto experimentwith a variety of alternatives and study their costs
and tradeoffs. As a first cut at the problem, we have concentrated.on the problem of index selection. A
secondary index (sometimesreferred to as an inversion) Is a well-known softwarestructure which can improve the
performance of accessesto a relation’(file) Cl1 For each.domain(field) of the relation that is inverted, .a table IS
maintained, which for each value of the domain in questioncontainspointers to ail those tupies (records) whose
contents in thc,desjgnated domain is the specified value. Clearly, the presenceof a secondary index for a
particular domain can improve the execution,of many queriesthat referencethat domain; on the other ‘hand;
maintenance of such an index has o&s that slow down the performanceof data base updates, insertions. and
delictions. Roughly speaking, a domain that is referenced frqucntlj relative to its modtftcation is a good
1
candidate for index maintenance. The choiceof which (if any) domainsto invert must be done with care; a.good
choice can significantly improve the performanceof the system,while a bad selectioncan seriously degrade it.
The goal of our systemis to makea good choiceof thosedomainsfor which to maintain secondary indices, based
on how the data baseis actually used.
The operation .of the initial .version of our prototypesystemcan be describedas follows. The specifications of
data base interactions,‘by both interactive users and application programs,are expressedin a non-procedural
language; these are first translated into an internal representationmadeup of calls to systemlevel modules. The
language processorhas availabie to it a modelof the current‘stateof the data base,which contains, among other
things, a list of the currently maintained set of secondaryindices,plus various informatio.6 about these 4ndices.
Using this information, the language processorcan choosethe best strategy for processing each data base
operation in the current environment. Statisticsgathering mechanisms.areembeddedwithin the system modules
that interpret the object code of the languageprocessor;and recorddata concerning the execution of every.data
base transaction. The statistical information gatheredfor a run is depositedin a collection area and summarized
from time to time. When the reorgan,isationcomponentof the systemis invoked (which will occur at ,fixed.
intervals of time). the statistical information collected,overthe precedinginterval is combined with statistics from
previous intervals and used to obtain a forecastof the accessrequirementsof the upcoming interval; in additiou,
a projected assessmentof various characteristiu’.of the data in the data base is made. A near-optimal set of
domains for which indices should be maintained is then determinedheuristically; optimality meanswith respect to
total cost, taking into account index storageand maintenance,This cost is comparedwith the projected cost for
the existing set of indices. Reorganizationis performedonly if its payoff is great enough to,cover its cost as well
as that of application program retranslation..
In this paper, we stressour approach to the problemof acquiring an accurateusagemodel and estimating data
characteristics, by means of continuous monitoring and the application of forecasting techniques. We have
considered the reduced problem of choosing indices for a single relation, but we expect our approach to be
extendible to more complex situations once we have developeda model for the cost of processing multi-relation
queries using different strategies. We believe,that an accurateinteraction and cost model is essential for a
practical environment, and that the use of problem-orientedheuristicsto cut .down the index search.space will be
fruitful in achieving efficient and near+imum solution.procedures.
ORGANIZATION
OF THE PAPER
The remainder of this paper is organized as follows. We begin with.a summaryof our view of the data base
organization: the file model, the storage and index organizations, and the data base operations. Next, a
procedure for determining the strategy(whether or not to useexisting indices)for processingthc,qualif.ication part
of a query. is presented. We then describethe statisticsthat are to be gathered during query processing, and
explain the use of exponential smoothing techniquesin the+rivation of parametersfor the cost model.’ This is
followed by a discussion of the need for heuristicsin solving the index selectionproblem, and then the heuristics
that we have devised are presented.Finally, we include a bdef comparisonwith previous related work.
FILE MODEL
We have chosen the relational model 121of data as the loglcal interface between application programs or
interactive users and the data base system,since it provides the level of physical data independence needed to
facilitate- physical reorganization. The totality of formatted data ‘in the data base consists therefore. of one or
more relations. However, we addresshere the reducedproblemof selectingindices for a data base made up.of ‘a
single relation. Herein, the data baseconsistsof a single relation (file) R with n tuples (records)T -. (ti, tg, -..*, tm)
where each ti t Di; the lth domain, i.e. R is a subsetof the cartesianproduct of the domains Df, DP *‘““,‘D,
Even though .insertion and deletion of tuples are permitted,we will assumethat the cardinal&y (number of tuples)
of the relation remains relatively unchangedbetweentwo consecutivepoints at which index selection is considered.
Following Rothnie 133,we assumea pagedmemoryenvironmentfor tuple storage. More specificly, the R tuples in
the relation R are stored in R pagesof t tuples each,where t is constantand n and p approximately so. The.
accessing cost of a page is assumedto’be independentof the sequenceof page accesses,
and dominates all other
internal processing costs. Hence, the processingcost for a query is measuredsolely in terms of.& number of
pages that have to be accessedto resolvethe query. If a query is resolvedby sequentiallyscanning alt tuples in th.e’
relation. then the total number of, pages that are accessedis just p. If an index is available for a domain
referenced in a query, it may significantly reducethe numba’of tuples that need’to be examined to resolve the
query. These tuples may’correspondinglyresideon fewer than p pages W; shall assumethat such a restricted set
of tuples will be rahdomly distributed over the memoryspace,but that it is possibleto access them such that each
page is. touched at most once. Yue and Wong 141have derived an exact formula for the expected number of
pages that have to be touched in order to accessr tup1e.s
which are randomly distributd over the file. Let b(rpf)
be the expected number of page accesses
for referencingr randomly distributed tuples in a file of p pages each
containing t tuples, and let
2
f(0) - 0
f(i+l) - ((~p-l)-ill(~qt-i)~f))tf(i)+ (plrtll(pH-I)
then it, has been shown in I4 that b(r;p,t) - f(r). We will makeuse of the above formula todetermine if it is
orofitable to use index(u). to resolvea query.
INDEX ‘MODEL
An index on a domain of a .relation is a mapping from valuesof the domain to tupks inthe relation with those
value.% We assumethat indices, just like tuples in the relation,are storedin a paged memory,with the usual twolevel hierarchical organization, i.e. the key valuesand their asmclatedtuple identifier (TID) lbts are assumedto be
on separate levels t!il Each value is storedtogetherwith a pointer to the head of its associatedTID list, and the
length of that list. To further simplify our discussionhere, we will assumethat the keys and the associated list
Pointers and list lengths are organizedas a B-tree [61 eachnodeof which is stored on a page. One or more TID
lists may be stored on the .samepage. The costof obtaining the TID list associated.with a domain value. in the
index is thus made up of the cost of locating the’head of the list (equal to the height of the B-tree, which is
dependent on the number of values in the domain),plus the costof reading the page(s)on which the TID Iist’is
stored. Hence; the averagecost for using an index can be readily estimated,given the number of values in .the
domain and the lengths of the associatalTID’lists.
The cost of modlfying am index as the result of a tuple Insert/delete/update is, however, more difficult .to
determine. We assumethat only those domain’values that are associatedwith existing tuples in the relation are
stored in the index. In most cases,an insert/delete/updatewill.causea single node in the B-tree to be updated. In
rarer cases,a key may have to be inserted into/deleted from the B-tree, resulting possibly in a node
overflow/underflow, which may in turn propagatethrough the tree. As for the TID list involved, it is possible’ in
many cases to write back a lengthened version of it onto the samepage on which it was stored before the
insertion, by utilizing empty spacesin the original page. Occasionally,though, the empty spacemay run out, and
it becomesnecessaryto allocatea new page. Similarly, occasionalgarbagecollection may be necessaryto recompact.
the TID Ii&, and recover wasted spacecausedby deletions. Therefore, for’ a .given domain, the cost of key
insertion, key ‘deletion and. garbage collectiondependson the sequencein which “updating” operations involving
this domain are performed; therefore it is very difficult to incorporatethis cost into any,overall parametric tndex
maintenance cost,model. We will therefore considerthe index maintenancecost for each domain as made up of
two components. The first dependsonly on the frequenciesof lnse_tts/deletes/updates
for this domain and consists
of the cost of updating a node in the B-tree plus the costof’ reading and writing an average TID list (an update
is esjentially equivalent to an delete and an insert); this componentcan be readily estimated.‘The second, made
up of the unquantifiable overheadsof node splitting/merging in the B-tree and garbage collection .in the TID
space, will be recorded by the statisticsgathering mechanismif the domain ls indexed. Otherwise, it can very
roughly be estimatedby using the normalizedaverageoverheadof thoseindices.that are maintained.
TRANSkTION
c
9
2
m
MODEL AND PROCRSSING
Out transaction model allows for the retrieval, insertion, uodatinp, and deletion of tuuler in a relation. Data
selection (the specification of somesubsetof a relation by.mians 0;.its propertles)‘is the’fundamental component
of all, these operations. In choosing data selectionoperatorsto be included in our transaction, we have limited
ourselves to’ those for which the.utility of using indices can readily be estimated. These include conjunctions of
equality conditions and disJunction of quality conditions. (By an quality condition, we mean a predicate of the
form A ‘- k, where.A is somedomain nameand k is a constantor a program variable). The use of indices and.
list processing techniques for. resolving thesekinds of queriesis well known I’?),and will not be reiterated here.
However, our tuple accesscost model implies that it may not always be desirable to resolve a conjunction or a
disjunction of quallty conditions using indices. (This will occur when the set of qualified tuples is expected to
reside on close to p pages; we say “close to”, becauseutilizing the index(es) also entails some page accesses.)
Therefore, we need a meanswhereby the number of tupks that can be expectedto qualify for such a query can
.be estimated. We can define’s selectivity measurefor a domaln index as the averagefraction of the.set of tuples
under consideration that have historically satisfied an equality condition involving that domain. We will,
furthermore, assumethat the joint ‘sekctlvity of a number of domainsspecified together is equal to the product of’
their individual selectlvities. Hence,glven the sekctivitiu of domainsin eachcondition, the number of tuples that
can be expected to satisfy a conjunction of theseconditionscan be determined;.in addition, the number of tuples
for a disjunction can be estimatedusing the inclusion-exclusionprinciple 161 Thus, given a query and an exlstlng
set of indices, the query processorwill use the availabk indicesonly if the expectedsum of the cost of accessing
the indices and the resolved list of tuplcc is lessthan the costof doing a sequentialsearchof the entire tuple space.
STATISTICS
e
I
*
GATHERING
An important component of our adaptive systemis the monitoring mechanismsthat collect usage statistics as data
base operations are performed. The statisticsgatheredover a time period are summarizedand then “averaged”
with summaries from previous periods. The siatistks,to be gatheredfor the purpose of index selection fall into
three general classes.
3
I
z
(1) update related &tics
- This h&s xveral components. First of all, thettare’the total numbers& tuplc!.
that ate deleted from (inserted into) the relation in the current time period. (We assume that each
insertion or deletion of a tuple will tquite some maintenance for each active index.) In addition, for
each domain of th’e relation; we maintain the number of updates made to that domain in the tuplei.
(This: involves some maintenance of the index for that domain.) Finally, we record the actual difficultto-patametetire costs of the maintenance of each index (node splitting and merging, garbage co&ction.
etc.), measured in terms of the actual number of page accessesexpended for, such overheads.
c.2) domain: selectivity statistics -i For each domain, we maintain its average selectivity aver all i~ses of the
domain in equality conditions in the current interval. This is accomplished by recording the ,numbet of
times tile domain octuts in equality conditions and the selectivity of the domaln ,in each ,of xhese
predicaates. The selectivity of the use of a domain in an qtiality condition is measured as follows:
.
(a)
Ii an index for the domain is used tp resolve the -particular quality condition, then the precise
selectivity of the domain for this query can be calculated as the fractian of’tupks in tbe.telation
with the domain value in question.
(b)
Suppose the equality condjtion appears in a con]uncUon of conditions that”is resolved ‘b’y
sequential scanning. (This scan may be of a reduced set of tupks obtained. via an index.1 Let the
total number of tuples scanned be Nb Let the conJunction be of the for& Ci A Ci A I*’ A ,C$
where each of the C, is an ‘quality condition involving domain I$ ‘Let Np Np ““‘; N, be’ the
number of tupks’that satisfy Cl, Cl A C+, -, C, A % A - A C, respectively. (Note thati these
values are’readily available). The selectjvi~y of domain D, for this $ry is then approximated as
Ni/NimI
(c) Suppose the quality condition appears in a disjunction if conditions that .is tesolvt+ by sequentia!
scanning. .Let the total numbet‘of tuples scanned be Ng. Let the dispnction be of the form Cl V
c2 v -.* V C, where each of the C, is an quality
tondition itivolving
domaln Di. Ler N,, Np
“**‘*,N, be the number of tuples that sat!sfy C,, “ci A C+, “‘-; “cl h “c2 A ,y” IJ Ci respectively.
(Note that thesevalues ate readily available). The sekctivity of domain Di for this query is then
approximated as N,/(No - Sum (N] for ] c 1).
Note that this measurement of selectivity accounts for mm-uniform distribution .of domain values over
the tuples, as well as the tion-uniform use of domain values in the quality predicates.
(3) query type statistics - Every time a query is process& its associated qualifyiq expression is recorded in
an access history file using a canonical representation (such as a bit ceding scheme for each domain that
appears in the qualification expression and a bit indicating whether it is a conJunction ‘dt’ a
disjunction). Note that by recording the actual queries that ticut, we avoid the sttong (and often
inaccurate) assumption that the simultaneous occurrence of domains in a query ate mutually
independent events.
ACCESS PATTERN
FORECASTING
AND PARAMiTER
ESTIMdTION
We assume that the index selection problem is to be reconsidered ‘at fixed intervals. At each reotganizatlonal
point, we forecast a number of characteristics for the period up to the next reorganizational poinL Specifitilly,
v.e
predict .the following:
(1)
the number of occurrences’ of each query type (the type of.a query is detetmjned by whether :its
qualification component is a conjunction or a dlsjunction, and by the domains used ip the ‘constituent
quality conditioris);
.(2) the’sekctivity
(3)
of each domain of the relation;
the numbet,of distinct values’in each domain (if a domain.‘is not indexed in the current interval. then
rhe current value for this number is approxitiated as the reciprocal of the observed sekct#viq);
(4) the number of tupks (hence the number of pagy),in the entire relation;
(51 the.expected cost (in terms of page accesses)of maintaining an ind!x for each domain and .the expected
cosr’of ech UK of such an index.
We could do these projections sokly on .the basis of statistic colkcted during the previous time period, or we
could combini together the statistics collected ‘ooet a11previous periods. However, neither alone would be
4
satisfactory for the purpose of a stable and yet responsiveadaptive system.In thc’former case,the system would
be overly vulnerable to chance.fluctuations in accessrequirementsand data characteristics,whereas in the latter
case,,it would be too insensitive to real changes Intuitively, a weighted moving average of observations over
previous periods should be more satisfactory. We utilize the technique of exponential smoothing [9,101for our
.forecasting and estimation proceduresbecauseof its simplicity of.computation, its flexibility, its minimal storage
requirement, and its ability ‘to be generalized to account for trends and cycles. The basic formulation of
exponenrial smoothing in making a forecastof a discretetime seriesis as follows:
new forecast - a 6 (actual observationfrom last period) + (1-a)$ (previous forecast)
where a is called a smoothing constantand takeson valuesbetween0 and’l. In essence,this is a weighted average
of all previous observations with the weight decreasinggeometricallyover successivelyearlier observations. The
‘rate of response to recent’changescan be adjusted simply by changing the smoothing constant: the larger the
smoothing constant, the more sensitive is the forecastto recentchangesand chance‘fluctuations. (The value of. a
can be ‘sekcted by the. DBA or can.be adaptively chosenby the system’itselfto minimize the difference, between
observed and predkted data.) The compactness of the schemelies in the fact that only two parameters need to be
maintain.ed for each time series,the current observationand the previous estimate. The above schemeis used for
the estimation of domain selectivity.
.Since the file site can generally be expectedto be growing and the level of activity in a data base system can
generally be expected to, be upward moving as usersbecomefamiliar .with the capabilities of the system,we will
use a modified. version -of exponential smoothing,which takes’trends into account, in order to forecast query
frequencies, index maintenancecost,and fik size. Its formulation is as follows 191:
I
E
new average I a e (current observation)+ (l-a) t (old av!rage)
current t,rend- new average- old average
new trend - a I (current trend) t (l-a) * (old trend)
m
new forecast - new average+ ((l-a)/u) o (newtrend)
In this form, it is necessaryto storeonly the previouslycalculatedvalues for the (new) average and for the (new)
trend, and the.calculation is still simple.
.4
INDEX SELECTIONATAREORGANIZATIONPOINT
AS we have said, at each reorganizational point, a number of forecasts‘are made. A projection is made of the
selectivity of each’domain in the relation, together.with the expectedcostsof accessingand ma’intaining an index
on that domain during the next interval. In addition; the expectedsize,and ,hencethe storagerequirement, of an
index’for each domain is estimated. Then the costof creationfor each index that is not currently maintained it
approximated. For any proposedset of indices,we can thus projecta,total costfor the next time.period. (This cost
includes retrieval processing; index creation, maintenanceand storage and application program recompilation.)
Therefore, we should be able to choosethat setof domainswhoseindexing in the next interval will minimize this
total cost.
A straightforward approach to the index selectionproblem would be to evaluate the total cost for each possible
choice of ,domains to be indexed, and then selectthat set of domains which gives .the smallest cost. With m
domains in the relation, there are 2m possiblechoicesof index sets. For small m, this enumerative approach is
probably the best strategy,as the optimal combinationof domainsto be indexed is guaranteed to be found. But
for moderate m, the costof repeatedapplication of thecost evaluation procedurebecomesvery expensive, and for
large m; it is prohibitive. It is not uncommonto find single%elationdata baseswith tens of domains. Therefore,
it is appropriate to look for ways whereby the searchspaceof potential index setscan be systematically aeductd
There are two possible approachesto this. One is to look for propertiesin the cost function that will allow it to
be minimized without exhaustive enumeration,such ai through a depth-first search. This’is the approach that
Schkolnick Dlj has taken. .However, even with a simplified model for cost evaluation, his upper bound of the
order of 2mO+08 m sets to. be tested is not enough of a reduction to enable the inexpensive. selection of the,
optimal’index set for a relation with a moderatenumberof domains. If a more accurateand detailed cust model
is to be used, then the i,ncentive for reducing the search.spaceis even stronger. Yet in this case,the hope of
fin.ding an algorithmic way of exploring a reducedsearchspaceof practical ‘size that will yield the optimal
solution is even’ dimmer. Therefore, it is appropriate to draw on the experience of artificial intelligence
researchers working in areaswhere formal mathematicalstructuresare computationally impractical U2.131and use
heuristic methods that significantly prune down the searchspaceand that work towards obtaining a near-optimal
solutiql.
5
1
111
LaHI
INDEX
SELECTION
HEURISTICS
In our index selection procedure,
wemakeuse of five problem-oriented heuristics.
(1) A near optimum choice of domains to beindexed can be made incrementally using a depth-first search.
This heuristic permits analysis of the problem as a stepwise minimization, each time adding to.the
index ,tEt that domain which will bring the best improvement to the cost function. There .have been
two previous suggestions regarding the incremental selection of domains to be itrdexed.. Farley and
Schuster Ml suggest that the incremental selection process can’be terminated once no single domain ‘in
the non-indexed set can be chosen that will yield incremental cost/benefit& This Is insufficient for our
choice of query and tuple access’models. There are two reasons why it may be necessary to consider the
incremental savings brought by considering two or more indices together. First, it may happen that.for
a query involving a conJunction of conditions, the selectivity of any one domain may n,ot be’sufficient
to restrict the number of tupks to be accessedto be less than the total number of pages in the ‘relation,
whereas the joint selecttvity of two or more domains might Secondly, a d!sjunction. ofatnditlons
can
An
be resolved via indices only if all of the domains involved In the. disjunction are indexed.
alternative strategy has been suggested by Held [IQ who, at any staged the Incremen~lIndex selectidn
procedure, considers the incremental savings of each of the possible subset of domains in the candidate
set with less than or equal to some,fixed number of domains in it. This, of course, may be very
inefficient. We have taken an intermediate.approach. We consider the adjoining of multiple domains
to the index set only if no single domain that will yield positive*incremenhI savings can ‘be found.
(2)
Some domains can be eliminated from the initial candidate set by virtue of their low’occurrence
frquencies in queries. This ,effectively reduces m, the initial number of domains. in the candidate set.
From the forecasted frequency of each retrieval specification, we can find. the pro&ted number of
occurrences of each domain. Using the selectivity estimate of the domain, we,can find a gross upper
bound on .the number of page accessesthat an index on the domain can save inthe processing of the
forecasted set of queries. If this upper bound is less than the projected cost of m&talning.an
index on
the domain, then this domain can safely be excluded from the initial candidate’ set, i.e., the domain.
occurs so infrquently that it can never be profitabk to index it. The upper bound ‘is calculated as
follows. An index saves the largest number of page accessesif its domain only occurs in one-term
queries, Under, this assumption, the maximum number of page accessesis saved for each occurrence Jf
all the qualified tuples are clustered together. Let
p - number of pages in relation
s I selectivity of domain
f - number of occurrences
then an upper bound on the number of page accessessaved is f * p * (1 - s).. (Foreach occurrence, we
would have accessp pages without an index, and only s9p with one).
(3) Only a small subset of all possible candidates need be considered to determine the next domain or set of
domalns to be adjoined to the index set at each stage. We can rank. the domains by the above
described upper bound and consider only the top ranking M domains. and combinations of these, for’
detailed incremental savings calculation. Furthermore,‘.a.bound M’ (M’&i) can be put on the n.umber
of domains that will be considered together.
(4)
Not all queries can use indices profitably. The expected number of tuples that will qualify for each
query type can be estimated using the selectivity of the domains involved. Those that cannot profitably
make use of indices’are eliminated from the pro&ted query set whose processing cost is to be
minimized. This eliminates some unnecessary cost calculations. Similarly, only.queries that involve
domains’still
in the candidate set need to be retained in the query set for incrcmeptal savings
calculation. (This heuristic does not cut down the search space, but does speed up the cost evahsation at
each step.)
(5)
An upper bound can be put on the number of costevaluations that are performed
selection procedure. The procedure is terminated when this bound is exceeded.
in the entire
To tllustrate the above heuristics, we present the details of our index selection ‘procedure. Our procedure can be
divided into two stages. During the first, a tentative set of domains to be indexed fs chosen in an jtrcremental
fashion. It is possible to put a bound on the cost of this phase by ad&sting the parameters M and M.‘. An.
alternative (Or additional) constraint is to bound the total .number of sets of indices for which cost evaluation is
performed.
(1) Rank the domains in the relation using the procedure described above atid let S be the set of domains
that occur so infrequently that indexing them can not be profitabk
6
(2) Partition the set of domains D in the relation into tliree.dis&int subsets: D, - the index set, Dc - the
candidate a&, and D, - the non-index set Initialize D, to null, D, to S, and Dc to D - D,.
(3) Consider in turn the incrementalsavingsgained by indexing eachof the M top ranking domains. in the
candidate set. Adjoin to D, the one that will give the best improvement to the cost function. If one
cannot be found, then consider larger-sizedcombinations(up to M3 of these M domains. Consider
combinations of the next larger size only if it is not profitable to ad&in any of the combinations less
than or qua1 to the current size. Revert to consideradjoining single domains after each adjoinment.
Remove the domains from DCas they are adjoined to D,
(4) Terminate the first phaseof the procedureif:
(a) no subset of the M (or number of domainsleft in the candidate set, whichever is smaller) top
ranking domains up to size hi’ can be chosenthat will improve the cost function.
(b) the upper bound on the total numberof costevaluationsis reached.
The next stage may be called the bump-shift phaseB21. Domainsthat have been adjoinedto the index set early
in the first stage may turn out to be uneconomicalas the result of later addition of other domains to the set, and
thus should be removed from the index set. Only individual domainswill be consideredfor removal from the
tentatively chosen index set.’ If more.than one can be chosen,then the one that leadsto the best cost improvement
is removed first,’ and the procedure repeateduntil a local minimum,of the cost function is reached. (A possible
extention to this stageis to consider the profitability of maintaining combinedindices for those domains that are
adJoined to the index set together.) Once an optimal setof domainsfor indexing has been chosen, the total ,cost
for this choice of domains, including query processingcost,index storage,maintenanceand creation (if applicable)
costs, is compared with the cost for the existing set of indices. Reorganization is done only if the difference
between the two is enough to cover the application programrecompilationcost.
COMPARISON WITH PREVIOUS WORK
We have presented an experimental and heuristic approachto the index selectionproblem that is different in
many respectsfrom recentstudieson index selectionby StonebrakerOSI,King U7l, Schkolnick Oil, Farley M, and
Held Da]. These other studies have either beenformal analyses,which have made many simplifying assumptions
in order to obtain an analytic solution, or elsesystemdesignsthat have been incompleteor unrealistic in various
w’ays. Our work attempts to go farther than theseby utilizing more complete.andaccurate models of cost and
access,and by emphasizing important aspectsof realistic data baseenvironments. Below we summarize,the novel
aspects of our approach. We have stressedthe importance of accurate usage model acquisition and data
characteristic estimation in a dynamic environment where accessacquirementsare continually ‘changing. ‘.We
believe it necessary to apply forecasting techniques to predict future accessrequirements based on past
observations, in order to capture and respond to the dynamic and changing nature of data .base usage. Our.
scheme endeavours to obtain a precise model of data base usageby recording actual query patterns, thereby
avoiding the strong ‘assumption that the probabilities of two domains appearing -in a query are mutually
independent, We also take into considerationthe facts that valuesof a domain are not equally likely to be used in
querks, and that they are not evenly distributed amongtuplesof the relation, by monitoring the actual sekctivities
of domain values that are used in queries.
The size of actual data basesis reflected,in our concernfor efficient heuristics to speedup the index selection
process. Our cost models account for such real overheadsas the expenseof index accessingand the cost of key
insertion/deletion and garbagecollection in index maintenance,and are basedon reasonablemodelsfor data base
storage. Our approach of minimizing the total processingcostfor the upcoming time interval, rather than the
expected cost for a single query, is flexibk enough to accountfor the overhead costs of index creation, index
storage and application program recompilation.
CONCLUSIONS
We have presenteda high-level descriptionof our approachto th.epmblem of index selectionin an adaptive data
base management systemthat we are developing. This has beendone,however,only in the restricted environment
of a single-relation data baseaccessedthrough a restrictedinterface with limited capabilities for the selection of
data: TO fully realize the flexibility of a relational data base, it is necessaryto consider’s multi-relation
environment together with a high-kvel non-pro&dural language Interface that permits queries. with arbitrary
interconnection betweenrelations in the qualification part and high level operatorson the qualified data. In s’uch
an environment, it is necessaryto consider the utility of indices for more complicated operations (such as
restriction, projection, division, join, etc. [21)and to selectindicesfor all the relations in the data base as a whole.
This ‘Is where the recording of detailed accesshistory is necessaryfor optimal index selection, and the use of
heurisdcs should be fruitful:for cutting down the searchspace and for selecting.richer index structures (such as
combined indices). Our heuristic index selectionprocedureshould be readily extendible to such an t?iWolr~nt,
7
1
provided a cost evaluation procedurehas been defined to estimatethe cost of processingan arbitrary query with
the presence of a particular set ?f indices. Our plan is to experimentally assess the optimality of our heurlst.ic
index selection algorithm on a reducedenvironment,before embarking on the more ambitious propct of index
selection for a more.genetal environment. More fundamentally,our intent Is to experimentally study the needs
for, and capabilities of, a self-adaptive data managementsystemin realistic data baseenvironments..
ACKNOWLEDGEMENTS
This research’was supported by the Advanced ResearchProjectsAgencyof the Department of D.efenseand ivat
monitored by the Office of Naval Researchunder contractno. N9Wlf-1.
REFERENCES
W J. Martin, “Computer Data-Base Orga?ization”, Prentice Halt Inc. Englewocd Cliffs, New Jersey, 1975.
f21 E. F. Codd, “A Relational Model of Data for Large SharedData Banks”,CACM, Vol. 13,No. 6, June, 1970..
C31J. B. Rothnie, T. Lozano, ‘Attribute BasedFile Organization in a Paged Memory Environment”;CACM, Vol.
17,No. 2, Feb., 1974.
f41 P. C. Y’ue, C. K. Wang, “StorageCost Considerationsin SecondaryIndex Selection-,International Journal of
Computer’and Information Sciences,Vol. 4, No. 4,1975.
‘f5l A. F. Cardenas, “Analyiis and Performanceof Inverted Data BaseStructures”,CACM, Vol. 18,‘No. 6, May,
1975.
163R..Bayer, E. McCreight, l Organization and Maintenanceof Large Ordered Indexes”, Acta Informaticq. Vol. I,
Fast. $1972.
f71 M; M. Astrahan; D. DChamberlin, “Implementationof a StructuredEnglish Query Language”, .Proceedings of
the ACM-SIGMOD Conferenceon’the Managementof Data,May, 1975.
181C. L. Liu, “Introduction to Combinatorial Mathematics’McGraw-Hill Book Company,1968.
I91 R. G. Brown, “Statistical Forecastingfor Inventory Controls,McGraw-Hill Book Company, 19.59.
El01R.-G. Brown,’ “Smoothing, Forecastingand Prediction of DiscreteTime Series: Prentice Hall Inc, Englewood
Cliffs, New Jersey,1962.
011 M. .Schkolnick, “The Optimal Selection of SecondaryIndices For Files”, ‘ResearchReport, Department of
Computer Science,Carnegie-Mellon University, Nov.,l974.
I121A. A. Kuehn, M. J. Hamburger, “A Heuristic Programfor Locating Warehouses”,Management ‘Science,-Vol.
9;.No. 4, July, 1963.
1131R. C. Meier, W. T. Newell, H. L. Pazer, “Simulation in Business and Economics”, Prentice Hall Inc.,
Engjewood Cliffs, New Jersey,1963.
flt] J. H. G. Farley, S. .A. Schtuter, Query Execution and Index Selectionfor Relational Data Bases”, Technical
Report CSRG-53,University of Toronto, Mar;,.1975.
051 G. D. Held, “Storage Structures for Relational Data Base Management SystemsI: Memorandum No. ERJM533, University of California, Berkeley,Aug.,.!975.
OS1.M. Stonebraker, The Choice of Partial Inversionsand CombinedIndice?, International Journal of Computer
and Information Sciences,Vol. 3, No. 2,1974.
fll W. F. King, ‘On the Selection,of Indices for a File”, IBM ResearchRJ 1341,San Jose,Jan., 1974.
8