On the Estimation of Join Result Sizes
Arun Swami 1 and K. Bernhard Schiefer 2

1 IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099
2 IBM Toronto Lab, 895 Don Mills Road, North York, Ontario, Canada M3C 1W3
Abstract. Good estimates of join result sizes are critical for query optimization in relational database management systems. We address the
problem of incrementally obtaining accurate and consistent estimates of
join result sizes. We have invented a new rule for choosing join selectivities for estimating join result sizes. The rule is part of a new unified
algorithm called Algorithm ELS (Equivalence and Largest Selectivity).
Prior to computing any result sizes, equivalence classes are determined
for the join columns. The algorithm also takes into account the effect
of local predicates on table and column cardinalities. These computations allow the correct selectivity values for each eligible join predicate
to be computed. We show that the algorithm is correct and gives better
estimates than current estimation algorithms.
1 Introduction
The join is an important operation in relational database management systems.
When a user-generated query involves multiple joins, the cost of executing the
query can vary dramatically depending on the query evaluation plan (QEP)
chosen by the query optimizer. The join order and access methods used are
important determinants of the lowest cost QEP.
The query optimizer estimates the eventual result size, or cardinality, of
joining the specified tables and uses this information to choose between different
join orders and access methods. Thus, the estimation of join result sizes in a
query is an important problem, as the estimates have a significant influence on
the QEP chosen.
A survey of the use of statistics and estimation techniques in query optimization is given in [7]. There has been a lot of work done on estimating selectivities,
e.g., ([1, 10, 8]). In [6], a number of new selectivity estimation methods were
proposed for highly skewed distributions such as Zipf distributions [17, 3]. Errors in the statistics maintained by the database system can affect the various
estimates computed by the query optimizer. An analytical model was used in
[4] to study the propagation of errors in the estimated size of join results as the
number of joins increases in queries with a single equivalence class.
When more than two tables are involved, the query optimization algorithm
often needs to estimate the join result sizes incrementally. Incremental estimation is used, for example, in the dynamic programming algorithm [13], the AB
algorithm [15] and randomized algorithms [14, 5]. The optimizer first determines
the join result size from joining the first two tables, then determines the join result size from the join with a third table, and so on. Thus, the query optimizer
incrementally estimates the final join result size from the result sizes of the
intermediate tables produced.
We have identified a number of problems with current practice in join size
estimation. The problems include dealing with local predicates correctly, correct
handling of multiple columns in a single table that belong to a single equivalence
class, and correct incremental estimation of join result sizes. In this paper, we
describe an algorithm called ELS that solves these problems and give arguments
for the correctness of the solutions.
In the next section, we give a brief overview of the terminology and known
results in join size estimation that are used in this paper. The reader knowledgeable in query optimization can skim Section 2. Section 3 describes some of the
problems in current join size estimation practice. Algorithm ELS is outlined in
Section 4. In [16] we also show how one can handle the case of multiple local
predicates on a single column. In Sections 5, 6, and 7, we describe the techniques used in Algorithm ELS to address the problems described in Section 3.
We illustrate the value of our algorithm in terms of execution time in Section 8.
In Section 9 we summarize the contributions of this paper and indicate some
directions for future work.
2 Background
The following assumptions are made by most query-processing cost models in
practice and this paper:
1. Independence: Within each relation, values chosen from distinct columns that
are involved in join predicates are independent.
2. Uniformity of Attribute Values: The distinct values in a join column appear
equifrequently in the column.
3. Containment: When joining two tables, the set of values in the join column
with the smaller column cardinality is a subset of the set of values in the join
column with the larger column cardinality.
The implications of these and other assumptions are discussed in depth in [2].
Note that we need the uniformity assumption only for the join columns, i.e.,
we can use data distribution information for local predicate selectivities. Some
work has been done ([1]) to relax some of these assumptions. However, this
requires assuming a certain parametric model of the data (nonuniform Zipf distribution) and determination of the parameter values, or keeping a large amount
of metadata (correlation statistics). There is a tradeoff between implementation
complexity and possible gains in accuracy of estimates. Most query optimizers
choose to work with the assumptions listed above. As in other work, we focus
on conjunctive queries where the selection condition in the WHERE clause is a
conjunction of predicates. These queries constitute the most important class of
queries.
A table may participate in several join predicates. Let a table R be joined
with another (possibly intermediate) table I. At this point in the join ordering,
the query optimizer only needs to consider the predicates that link columns in
table R with the corresponding columns in a second table S that is present in
table I. These join predicates are termed eligible join predicates.
The query may include predicates that involve only a single table. Such predicates are called local predicates. The join resutr size is typically estimated as
the product of the cardinalities of the operand tables after applying the local
predicates (if any) and the join predicates [13]. When queries contain equality
predicates (either local or join predicates), it is possible to derive additional
implied predicates using transitive closure. (Similar derivations are possible for
nonequality predicates too, but equality predicates are the most common and
important class of predicates that generate implied predicates.) Transitive closure can be applied to two join predicates that share a common join column in
order to obtain another join predicate.
Example 1a
Suppose the following SQL statement is entered as a query 3.

SELECT R1.a
FROM R1, R2, R3
WHERE (R1.x = R2.y) AND (R2.y = R3.z)

We can use the principle of transitivity with the join predicates J1: (R1.x = R2.y) and J2: (R2.y = R3.z) to obtain the join predicate J3: (R1.x = R3.z).
Performing this predicate transitive closure gives the optimizer maximum
freedom to vary the join order and ensures that the same QEP is generated
for equivalent queries independently of how the queries are specified. In the
above example, without predicate J3, most query optimizers would avoid the
join order beginning with (R1 ⋈ R3) since this would be evaluated as a cartesian
product, which is very expensive 4. Since join predicates can share columns, the
effects of applying them may not be independent. In the above example, once
join predicates J1 and J2 have been evaluated, J3 has in effect been evaluated,
and hence evaluating it separately can have no further effect on the size of the
join result. Equivalence classes of join columns can be used to capture these
dependencies.
Initially, each column is an equivalence class by itself. When an equality
(local or join) predicate is seen during query optimization, the equivalence classes
corresponding to the two columns on each side of the equality are merged. In
the example above, given join predicate J1, columns x and y are in the same
equivalence class. When join predicate J2 is seen, the equivalence classes of
columns y and z are merged. Thus columns x, y and z are now in a single
equivalence class. As we will see later, we can use the equivalence class structure
to correctly take into account the effect of dependencies between join predicates.
We will often use the terminology "x and y are j-equivalent (columns)" to denote
that x and y belong to the same equivalence class.
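The equivalence-class bookkeeping described here can be sketched with a tiny union-find structure; the class and method names below are ours, for illustration only, not from the paper.

```python
class EquivalenceClasses:
    """Union-find over join columns; each column starts in its own class."""

    def __init__(self):
        self.parent = {}

    def find(self, col):
        # A column not seen before is its own equivalence class.
        self.parent.setdefault(col, col)
        while self.parent[col] != col:
            self.parent[col] = self.parent[self.parent[col]]  # path halving
            col = self.parent[col]
        return col

    def merge(self, a, b):
        # Called once for every equality predicate a = b seen during optimization.
        self.parent[self.find(a)] = self.find(b)

    def j_equivalent(self, a, b):
        return self.find(a) == self.find(b)

# The running example: J1: R1.x = R2.y and J2: R2.y = R3.z merge
# columns x, y and z into a single equivalence class.
ec = EquivalenceClasses()
ec.merge("R1.x", "R2.y")
ec.merge("R2.y", "R3.z")
```

After the two merges, `ec.j_equivalent("R1.x", "R3.z")` holds, which is exactly the dependency that the implied predicate J3 captures.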
Multiple equivalence classes can be handled using the independence assumption as follows. Since the join columns in different equivalence classes are assumed to satisfy the independence assumption, the predicates involving columns
3 In the rest of the paper, we show only the WHERE clause of SQL queries.
4 We use ⋈ as the symbol for the join operation.
in one equivalence class reduce the join result size independent of the predicates
involving columns in another equivalence class. Thus, in the rest of the paper,
we can focus on the case of a single equivalence class.
Two kinds of statistics are typically important in query optimization. One
statistic is the number of tuples contained in a single table. This value is known
as the table cardinality and is denoted by ||R||, where R is the table. The second statistic is the number of distinct values present in a column. This value is known as the column cardinality and is denoted by dx, where x is the column.
These two statistics are important because they are used to estimate the size of
the results from different operations on the table data, which in turn helps to
determine the cost of these operations.
Consider a join predicate J: (R1.x1 = R2.x2), where the column cardinality of x1 (x2) of table R1 (R2) is d1 (d2). The size of the join result can be estimated [13] as:

||R1 ⋈ R2|| = (d' × ||R1|| × ||R2||) / (d1 × d2), where d' = min(d1, d2)    (Equation 1)

Since min(d1, d2)/(d1 × d2) = 1/max(d1, d2), Equation 1 can be rewritten as:

||R1 ⋈ R2|| = ||R1|| × ||R2|| × S_J, where S_J = 1/max(d1, d2)    (Equation 2)
where S_J is termed the selectivity of the join predicate J. In order to enable a more intuitive understanding of Equation 2, we will derive Equation 1 using the assumptions listed above. For each distinct value, there are ||Ri||/di tuples in Ri having that value in the join column (from the uniformity assumption). Using the containment assumption, we get that the number of distinct values that appear in the join columns of both the tables is min(d1, d2). But these are precisely the values for which the join takes place. Hence, the number of tuples in the intermediate result (R1 ⋈ R2) is:

||R1 ⋈ R2|| = min(d1, d2) × (||R1||/d1) × (||R2||/d2) = (||R1|| × ||R2||) / max(d1, d2)
In [12], Rosenthal showed that Equation 1 holds even if the uniformity assumption is weakened to require expected uniformity for only one of the join columns.
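As a minimal sketch (function names are ours, not the paper's), Equations 1 and 2 translate directly into code; the two forms agree up to floating-point rounding.

```python
def join_size_two_way(r1_card, r2_card, d1, d2):
    """Equation 1: ||R1 >< R2|| = min(d1, d2) * (||R1||/d1) * (||R2||/d2)."""
    return min(d1, d2) * (r1_card / d1) * (r2_card / d2)

def join_selectivity(d1, d2):
    """Equation 2: S_J = 1 / max(d1, d2)."""
    return 1.0 / max(d1, d2)

# Two 1000-tuple tables whose join columns have cardinalities 100 and 1000.
size_eq1 = join_size_two_way(1000, 1000, 100, 1000)   # 1000.0
size_eq2 = 1000 * 1000 * join_selectivity(100, 1000)  # ~1000
```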
Equation 1 can be generalized for n relations to get Equation 3 below. Let R⋈ⁿ denote the result of joining tables R1, R2, ..., Rn on columns x1, x2, ..., xn respectively. Here there are equality predicates between each pair of columns and, in the terminology of equivalence classes, all the xi are in a single equivalence class, i.e., are j-equivalent. Denote the column cardinality of xi by di for i ∈ 1...n. Let d(1), d(2), ..., d(n) be a rearrangement of d1, d2, ..., dn in increasing order as d(1) ≤ d(2) ≤ ... ≤ d(n). Then,

d(1) = min(d(1), d(2), ..., d(n)) = min(d1, d2, ..., dn)
Then, using the assumptions listed above, it can be shown that the size of the intermediate result R⋈ⁿ is:

||R⋈ⁿ|| = d(1) × (||R1||/d1) × (||R2||/d2) × ... × (||Rn||/dn)
        = (||R1|| × ||R2|| × ... × ||Rn||) / (d(2) × d(3) × ... × d(n)),
where d(1) = min(d1, d2, ..., dn)    (Equation 3)

that is, all column cardinalities except for the smallest one are present in the denominator.
Example 1b
Continuing with Example 1a, let the statistics for the tables be as follows:

||R1|| = 100, ||R2|| = 1000, ||R3|| = 1000, dx = 10, dy = 100, dz = 1000

The join selectivities from Equation 2 are:

J1: S_J1 = 1/max(dx, dy) = 1/max(10, 100) = 0.01
J2: S_J2 = 1/max(dy, dz) = 1/max(100, 1000) = 0.001
J3: S_J3 = 1/max(dx, dz) = 1/max(10, 1000) = 0.001

If R2 is joined with R3, the size of the intermediate result table can be estimated using Equation 2:

||R2 ⋈ R3|| = ||R2|| × ||R3|| × S_J2 = 1000 × 1000 × 0.001 = 1000

Using Equation 3, the size of (R1 ⋈ R2 ⋈ R3) is estimated to be:

||R1 ⋈ R2 ⋈ R3|| = (100 × 1000 × 1000) / (100 × 1000) = 1000

which is the correct answer.
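Equation 3 can be written as a small function (naming is ours); applied to the statistics above, it reproduces the 1000-tuple estimate.

```python
def join_size_n_way(table_cards, col_cards):
    """Equation 3 for a single equivalence class: the product of the table
    cardinalities divided by every column cardinality except the smallest."""
    size = 1.0
    for t in table_cards:
        size *= t
    for d in sorted(col_cards)[1:]:  # drop d(1) = min(d1, ..., dn)
        size /= d
    return size

# Example 1b: ||R1||, ||R2||, ||R3|| = 100, 1000, 1000; dx, dy, dz = 10, 100, 1000.
estimate = join_size_n_way([100, 1000, 1000], [10, 100, 1000])  # 1000.0
```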
3 Problems in Join Size Estimation
There are a number of problems with current practice in join size estimation.
The problems include dealing with local predicates correctly, correct handling
of multiple columns in a single table that belong to a single equivalence class, and
correct incremental estimation of join result sizes.
3.1 Effect of Local Predicates
If a local predicate is present, the local predicate may reduce the cardinality
of the table. This reduced cardinality is called the effective cardinality of the
table. It can be used in other cardinality calculations, such as for estimating
join result sizes. When the local predicate is on a join column, the predicate
can also reduce the column cardinality of the join column. Thus, it is clear that
local predicates will affect join result sizes, since the predicates affect both the
number of participating tuples and, possibly, the column cardinality of the join
columns. It is evident that an algorithm is needed for taking into account the
effect of local predicates when estimating join result sizes. We do not know of
any algorithm that correctly takes into account both local predicates and join
predicates. In Section 5, we describe how to correctly take into account the effect
of local predicates in join size estimation.
3.2 Single Table J-Equivalent Columns
During the incremental calculation of join result sizes, it can happen that two
or more of the eligible join predicates involve join columns of the next table that
are j-equivalent. For example, consider a table R1 joining with another table
R2 using the following join predicates: (R1.x = R2.y) and (R1.x = R2.z). By
transitive closure, we get the predicate (R2.y = R2.z). When R1 is joined with
R2, all three predicates are eligible and involve columns y and z of R2 that are
j-equivalent by virtue of the predicate (R2.y = R2.z).
Current query optimizers do not treat this as a special case and therefore, by
default, the join selectivities of all such join predicates are used in calculating the
join result sizes. However, using them produces an incorrect result because the
join columns are not independent. In Section 6, we describe an algorithm that
correctly handles the case when two or more of the join columns of a relation
are j-equivalent.
3.3 J-Equivalent Columns from Different Tables
Another problem that arises in incremental computation is that the effects of all the eligible predicates involving different tables are not independent. In Example 1b, once join predicates J1 and J2 have been evaluated, J3 has in effect been
evaluated and hence, evaluating it separately can have no further effect on the
estimated size of the join result. Yet, in current practice, its selectivity will be
included in the result size estimate.
In [13], the selectivities of all the eligible join predicates are multiplied together along with the product of the individual effective table cardinalities (as reduced by the selectivities of local predicates). We call this the multiplicative rule (M). This rule can compute an incorrect join result size because all the join selectivities are used without accounting for dependencies. Rule M can dramatically underestimate the join result size, as shown in Example 2.
Example 2. As in Example 1b, let R2 and R3 be joined first, and then R1. From Example 1b, we have ||R2 ⋈ R3|| = 1000. When R1 is joined, Rule M estimates the join result size to be:

||R2 ⋈ R3 ⋈ R1|| = ||R2 ⋈ R3|| × ||R1|| × S_J1 × S_J3
= 1000 × 100 × 0.01 × 0.001 = 1 (correct answer is 1000)
The problem of underestimation of join result sizes by Rule M suggests dividing the eligible join predicates into groups, with join predicates involving j-equivalent columns being grouped together, and then choosing for each group a single join selectivity. The problem appears to be analogous to the problem of picking among multiple local predicates on a single column [16]. Thus, one would expect to pick the smallest join selectivity value for each group. We refer to this idea as the smallest selectivity rule (SS). Though this seems to be the intuitive choice, this choice of join selectivity can compute an incorrect join result size as shown in Example 3.
Example 3. As in Example 1b, let R2 and R3 be joined first, and then R1. From Example 1b, we have ||R2 ⋈ R3|| = 1000. When R1 is to be joined, Rule SS puts predicates J1 and J3 in a single group, since columns x, y and z are j-equivalent. Since S_J3 < S_J1, it uses only predicate J3 in calculating the join result size. Thus, Rule SS estimates the join result size as follows:

||R2 ⋈ R3 ⋈ R1|| = ||R2 ⋈ R3|| × ||R1|| × S_J3
= 1000 × 100 × 0.001 = 100 (correct answer is 1000)
A third procedure has been proposed but, to our knowledge, has not been
implemented in any optimizer. The proposal is to assign a representative join
selectivity to each equivalence class and to use that selectivity whenever join
predicates in that equivalence class are being used. The problem with this proposal is that there is no certainty that a correct value for this representative
join selectivity exists that will work in all cases. In our example query, if the
representative selectivity is 0.01, the estimate for the final join result size will be
10000, which is too high. If the representative selectivity is 0.001, the estimate
for the final join result size will be 100, which is too low. In Section 7, we describe a new rule for correctly choosing the join selectivities when undertaking
an incremental estimation of join result sizes.
4 Algorithm ELS
We describe an algorithm called ELS (Equivalence and Largest Selectivity) that
solves the problems described in the previous section. Algorithm ELS consists
of two phases. The first is a preliminary phase that is to be performed before any
join result sizes are computed. In this phase, the predicates implied due to transitive closure are generated and join selectivities are calculated. The preliminary
phase consists of steps 1 through 5. The second phase computes, incrementally,
the join result sizes. The processing for this final phase is outlined in step 6.
1. Examine each given predicate.
   - Remove any predicate that is identical to another predicate, so that queries involving duplicate predicates such as (R1.x > 500) AND (R1.x > 500) can be handled.
   - Build equivalence classes for all columns that are participating in any of the predicates.
2. Generate all implied predicates using transitive closure. There are five variations:
   a. Two join predicates can imply another join predicate.
      (R1.x = R2.y) AND (R2.y = R3.z) ⇒ (R1.x = R3.z)
   b. Two join predicates can imply a local predicate.
      (R1.x = R2.y) AND (R1.x = R2.w) ⇒ (R2.y = R2.w)
   c. Two local predicates can imply another local predicate.
      (R1.x = R1.y) AND (R1.y = R1.z) ⇒ (R1.x = R1.z)
   d. A join predicate and a local predicate can imply another join predicate.
      (R1.x = R2.y) AND (R1.x = R1.v) ⇒ (R2.y = R1.v)
   e. A join predicate and a local predicate can imply another local predicate. Here op denotes a comparison operator, and c is a constant.
      (R1.x = R2.y) AND (R1.x op c) ⇒ (R2.y op c)
3. Assign to each local predicate a selectivity estimate that incorporates any distribution statistics. In [16], we present a complete algorithm to handle the case of multiple local predicates on a single column. In essence, the most restrictive equality predicate is chosen if it exists; otherwise we choose the pair of range predicates that forms the tightest bound.
4. For each table, compute an estimate of the table cardinality and the column cardinality of each column after all the local predicates have been included. We show that if these new estimates are correctly used in the computation of join selectivities and join result sizes, we do not need to concern ourselves with local predicates after this step. This is discussed further in Section 5.
5. Process each join predicate by computing its join selectivity. If two join columns from the same table are j-equivalent, special care is needed in the computation of the join selectivities. We describe an algorithm for handling this case in Section 6.
6. For each intermediate result of the join order, estimate the result size. We have invented a new rule, which is described in Section 7, to estimate the result size. We also prove that the rule is correct under the assumptions stated in Section 2.
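For the equality variations of step 2, transitive closure amounts to closing equivalence classes and emitting one predicate per column pair. The sketch below is our illustration (names ours); it covers equality-to-equality derivations and omits constant propagation (variation e).

```python
from itertools import combinations

def close_equalities(equalities):
    """Given equality predicates as (column, column) pairs, return their
    transitive closure: one predicate for every pair of columns that ends
    up in the same equivalence class."""
    classes = []  # disjoint sets of columns
    for a, b in equalities:
        merged = {a, b}
        rest = []
        for cls in classes:
            if cls & merged:
                merged |= cls  # overlapping classes collapse into one
            else:
                rest.append(cls)
        classes = rest + [merged]
    return {pair for cls in classes
            for pair in combinations(sorted(cls), 2)}

# The running example: J1 and J2 imply J3 (variation a).
preds = close_equalities([("R1.x", "R2.y"), ("R2.y", "R3.z")])
# preds now also contains the implied predicate ("R1.x", "R3.z")
```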
5 Effect of Local Predicates on Join Result Sizes
In Section 3, we described the need to incorporate the effect of local predicates
in the estimation of join result sizes. Let R be one of the tables participating in a
join. Let a join column of R be x and let a local predicate involve column y of R.
Here, y could be identical with x. Let ||R|| be the cardinality of R before the local predicate is applied, and ||R||' be the cardinality of R after the local predicate is applied. Let dx denote the column cardinality of join column x before the local predicate is applied, and d'x denote the column cardinality of column x after the local predicate is applied. Similarly, we can define dy and d'y.

The estimates of ||R||', d'x, and d'y depend on the local predicate and the information available about the column y. For example, if the local predicate is of the form y = a where a is a literal, we know that d'y = 1. If we have distribution statistics on y, they can be used to accurately estimate ||R||'. Otherwise, we can use the uniformity assumption (see Section 2) to estimate ||R||' (i.e., ||R||' = ||R||/dy). For some other local predicate L involving y, if the local predicate selectivity S_L is known, we can estimate ||R||' = ||R|| × S_L and d'y = dy × S_L.
Now that we have estimated ||R||' and d'y, we can estimate d'x as follows. If y is identical with x, clearly d'x = d'y. If y and x are different columns, we use a simple urn model to estimate d'x. The process is one of assigning ||R||' balls among dx urns that are initially empty. Any of the urns can be chosen uniformly. The balls correspond to the selected tuples and the urns correspond to the distinct values in column x. The number of distinct values in x remaining after selection corresponds to the number of non-empty urns. We obtain the expected number of non-empty urns as follows, where n = dx and k = ||R||'.

Prob{ball is put in urn i} = 1/n
Prob{ball is not put in urn i} = (1 − 1/n)
Prob{none of the k balls is put in urn i} = (1 − 1/n)^k
Prob{at least one of the k balls is put in urn i} = (1 − (1 − 1/n)^k)

Then, the expected number of non-empty urns is given by n × (1 − (1 − 1/n)^k). Hence, in the case where column x is distinct from column y, we have that

d'x = dx × (1 − (1 − 1/dx)^||R||')

Note that if ||R||' is sufficiently large and approaches ||R||, the term (1 − 1/dx)^||R||' will be close to 0 and d'x ≈ dx. This estimate can be quite different from another common estimate, d'x = dx × (||R||'/||R||). This can be seen from the following numerical example. Let dx = 10000, ||R|| = 100000 and ||R||' = 50000. Then the urn model estimate gives d'x = 9933 whereas the other estimate gives d'x = 5000. If ||R||' = ||R||, the estimate according to the urn model is d'x = 10000.
At this point, the table and column cardinality estimates have already incorporated all local predicates. The rest of the estimation algorithm only has
to deal with join predicates. We have thus taken into account the effect of local predicates and simplified subsequent processing. We can also take advantage
of distribution statistics on columns that are involved in local predicates. This
enables us to obtain more accurate estimates of join result sizes. The original,
unreduced table and column cardinalities are retained for use in cost calculations
before the local predicates have been applied, for example, when estimating the
cost of accessing the table.
We have assumed that if the effect of a local predicate on the table cardinality
and the column cardinalities of join columns is taken into account, then the local
predicate is, in effect, applied before or together with the join predicates on this
table. If this is true, the following argument shows that our approach
to size estimation is correct. Consider all the local and join predicates being
used to form an intermediate result. For the estimation of the result size, it does
not matter in which order the predicates are applied. We can assume that the
local predicates are applied first, evaluate their effects on the table statistics,
and then compute the size obtained by applying the join predicates. Hence, in
the following discussion, we will focus on join predicates and assume that local
predicates can be handled as described above.
6 J-Equivalent Join Columns in a Single Table
Algorithm ELS needs to deal with the special case described in Section 3.2.
Consider the following example query, involving columns from the same table
that are j-equivalent.
WHERE (R1.x = R2.y) AND (R1.x = R2.w)

After transitive closure of predicates, we obtain the following transformed query:

WHERE (R1.x = R2.y) AND (R1.x = R2.w) AND (R2.y = R2.w)

where the implied local predicate (R2.y = R2.w) is added using rule 2.b from Section 4. Let the statistics for the tables in the query be as follows: ||R1|| = 100, ||R2|| = 1000, dx = 100, dy = 10, dw = 50. After transitive closure, R2.y
participates in every join in which R2.w participates and vice versa. Hence, for
computing join selectivities and join result sizes, we have to know the number
of R2 tuples that can qualify under the local predicate and also the number
of distinct values that the qualifying tuples contain. We wish to quantify the
effect of the local predicate on the effective cardinality of R2 and the column
cardinalities prior to evaluating the effect of the join predicates on the result
sizes. To do so, we use a probabilistic argument as follows.
Without loss of generality, let dw ≥ dy. Let ||R2|| denote the table cardinality prior to join predicates being applied. We use the containment assumption for join columns (see Section 2) to infer that all the dy distinct values are contained in the dw distinct values. Now, consider any tuple of R2. It has some value, say q, in column R2.y. Assuming the independence and uniformity assumptions for columns R2.w and R2.y, the probability that R2.w has the same value q is 1/dw. Since this is the case for every tuple of R2, the effective cardinality of R2 is given by ||R2||', where

||R2||' = ||R2|| / dw

Once this selection has been performed, only one of the columns needs to be joined since the evaluation of the local predicate has made the other join redundant. Using a simple urn model as in Section 5, we deduce that the column cardinality value that should be used for join selectivity computations is:

⌈dy × (1 − (1 − 1/dy)^||R2||')⌉

that is, we pick the column with the most restrictive effect on the column cardinality. In our example query, ||R2||' = ||R2||/dw = 1000/50 = 20. We pick R2.y and the effective column cardinality in joins is hence

⌈dy × (1 − (1 − 1/dy)^20)⌉ = ⌈10 × (1 − (1 − 1/10)^20)⌉ = 9
The formulas given above can be generalized to the case of three or more join columns, from the same table, that are j-equivalent. Let 1, 2, ..., n be the equivalent join columns of table R, and let the column cardinalities be d1, d2, ..., dn. Let d(1), d(2), ..., d(n) be a rearrangement of d1, d2, ..., dn in increasing order as d(1) ≤ d(2) ≤ ... ≤ d(n). Then,

d(1) = min(d(1), d(2), ..., d(n)) = min(d1, d2, ..., dn)

Using an argument similar to the one above, it follows that

||R||' = ⌈||R|| / (d(2) × d(3) × ... × d(n))⌉

and that the effective column cardinality in joins is

⌈d(1) × (1 − (1 − 1/d(1))^||R||')⌉
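A short sketch of these two formulas (function names ours) reproduces the example's effective cardinalities of 20 and 9.

```python
import math

def jequiv_effective_table_card(table_card, col_cards):
    """||R||' = ceil(||R|| / (d(2) x ... x d(n))) for n j-equivalent
    join columns of the same table."""
    denom = 1
    for d in sorted(col_cards)[1:]:  # every cardinality but the smallest
        denom *= d
    return math.ceil(table_card / denom)

def jequiv_effective_col_card(col_cards, eff_table_card):
    """ceil(d(1) * (1 - (1 - 1/d(1)) ** ||R||')): the urn model of
    Section 5 applied to the smallest column cardinality."""
    d1 = min(col_cards)
    return math.ceil(d1 * (1 - (1 - 1 / d1) ** eff_table_card))

# Example query: ||R2|| = 1000, d_y = 10, d_w = 50.
r2_eff = jequiv_effective_table_card(1000, [10, 50])   # 1000 / 50 = 20
col_eff = jequiv_effective_col_card([10, 50], r2_eff)  # ceil(8.78...) = 9
```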
7 Incremental Computation of Result Sizes
We saw in Section 3 that the "intuitive" rules currently used in query optimizers (Rule M and Rule SS) do not correctly choose the join selectivities in the incremental estimation of join sizes. We present a new rule (called LS) for picking one of the join selectivities for an equivalence class.

LS: Given a choice of join selectivities for a single equivalence class, always pick the largest join selectivity.

Rule LS appears counter-intuitive; a proof is provided in [16].
Multiple equivalence classes can be accommodated by using the independence
assumption. Since the join columns in different equivalence classes are assumed
to satisfy the independence assumption, each equivalence class independently
reduces the join result size. The algorithm then determines the join selectivity
for each equivalence class and multiplies them together to obtain the effective
join selectivity.
Example 4. As in Example 1b, let R2 and R3 be joined first, and then R1. From Example 1b, we have ||R2 ⋈ R3|| = 1000. When R1 is to be joined, Rule LS puts predicates J1 and J3 in a single group, since columns x, y and z are j-equivalent. Since S_J3 < S_J1, it uses only predicate J1 in calculating the join result size. Thus, Rule LS estimates the join result size as follows:

||R2 ⋈ R3 ⋈ R1|| = ||R2 ⋈ R3|| × ||R1|| × S_J1
= 1000 × 100 × 0.01 = 1000 (correct)
As before, denote the result of joining tables R1, R2, ..., Rn by R⋈ⁿ. Let us say that we have calculated the size of R⋈ⁿ. We now wish to join table Rn+1 to obtain the result R⋈ⁿ⁺¹. We now prove that Rule LS is correct by showing that the algorithm calculates the join result size in agreement with the size calculated by Equation 3 for a single equivalence class. Recall that we have already discussed how to handle multiple equivalence classes. The proof is by induction on the number of tables.
8 Experiment
We have presented Algorithm ELS and shown that it correctly estimates join
result sizes. In this section, we illustrate that correct estimation of join sizes can
make a significant difference in query execution times. We use a straightforward
select-project-join query that is shown below. For this query, Algorithm ELS
can result in an order of magnitude improvement in query execution time.
SELECT COUNT(*)
FROM S, M, B, G
WHERE s = m AND m = b AND b = g AND s < 100
The tables used in the query are S (small), M (medium), B (big) and G (giant).
A single column from each table participates in the query and is denoted by the
table name in lower case, e.g., column s from table S. The query after transitive
closure is transformed as follows.
SELECT COUNT(*)
FROM S, M, B, G
WHERE s = m AND m = b AND b = g AND s = b AND s = g AND
m = g AND s < 100 AND m < 100 AND b < 100 AND g < 100
The table and column cardinalities are given below.

||S|| = 1000, ||M|| = 10000, ||B|| = 50000, ||G|| = 100000
ds = 1000, dm = 10000, db = 50000, dg = 100000
The query optimizer in the Starburst DBMS [9] was modified to implement the multiplicative rule (Rule M), the smallest selectivity rule (Rule SS), and our Algorithm ELS (with Rule LS). Predicate transitive closure (PTC) was implemented as a query rewrite rule [11] so that we could disable it as necessary for the experiments. Note that both Rule SS and Rule LS are sensible only when predicate transitive closure has been applied. No other changes were made to the optimizer and the optimizer's entire repertoire was enabled (including the Nested Loops and Sort Merge join methods).
The standard algorithm most commonly in use in current relational systems computes join selectivities independent of the effect of local predicates. Let Algorithm SM denote using Rule M with the standard algorithm. Similarly, let Algorithm SSS denote using Rule SS with the standard algorithm.
We ran four experiments. The original query, before predicate transitive closure,
was run using Algorithm SM. Next, the query after predicate transitive closure
was run with Algorithm SM and with Algorithm SSS. Finally, the original query
was run with Algorithm ELS. The results are shown in the table below, which
identifies the query and rewrite rule used, the algorithm used, the join order
chosen by the algorithm, the result sizes estimated after each join, and the
elapsed time for the chosen query evaluation plan (QEP) in seconds.
Note that all the QEPs were executed using the same buffer size and that the
same access methods and join methods were available to each QEP. Both Nested
Loops and Sort Merge join methods were used in the QEPs.
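The standard-algorithm estimates reported below can be reproduced with a short sketch (our own illustration, assuming join selectivity 1/max(d1, d2) per equality predicate and 100 tuples surviving each local predicate):

```python
import math

# Reproduce the SM and SSS estimates for join order S, B, M, G.
# Join selectivities use the original column cardinalities, since the
# standard algorithm ignores the effect of local predicates on them.
d = {'s': 1000, 'm': 10000, 'b': 50000, 'g': 100000}
card = {c: 100 for c in d}            # tuples surviving "< 100" per table

def estimate(rule, order=('s', 'b', 'm', 'g')):
    size, sizes, joined = card[order[0]], [], [order[0]]
    for col in order[1:]:
        # one eligible equality predicate per already-joined column
        sels = [1.0 / max(d[c], d[col]) for c in joined]
        # Rule SS keeps the smallest selectivity; Rule M multiplies them all
        factor = min(sels) if rule == 'SS' else math.prod(sels)
        size *= card[col] * factor
        sizes.append(size)
        joined.append(col)
    return sizes

print(estimate('M'))    # Rule M:  approx. [0.2, 4e-08, 4e-21]
print(estimate('SS'))   # Rule SS: approx. [0.2, 4e-04, 4e-07]
```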
Query        | Algorithm | Join Order     | Estimated Result Sizes   | Time (s)
Original     | SM        | S ⋈ M ⋈ B ⋈ G  | (100, 100, 100)          | 610
Orig. + PTC  | SM        | S ⋈ B ⋈ M ⋈ G  | (0.2, 4×10^-8, 4×10^-21) | 472
Orig. + PTC  | SSS       | S ⋈ B ⋈ M ⋈ G  | (0.2, 4×10^-4, 4×10^-7)  | 472
Orig.        | ELS       | B ⋈ G ⋈ M ⋈ S  | (100, 100, 100)          | 50
We see that there is an improvement in query execution time when predicate
transitive closure is used (see the first two rows in the above table). Predicate
transitive closure is important because the additional predicates permit early
selection and greater flexibility in the selection of the join order. This may result
in the generation of better QEPs.
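The closure computation itself can be sketched as follows (our own illustration, not the paper's implementation): merge columns linked by equality predicates into equivalence classes, then emit an equality predicate for every pair within a class; a local predicate such as s < 100 likewise propagates to every member of the class.

```python
from itertools import combinations

def transitive_closure(equalities):
    """Group equated columns into equivalence classes and return the
    implied equality predicate for every pair within a class."""
    classes = []                      # disjoint sets of column names
    for a, b in equalities:
        merged = {a, b}
        rest = []
        for cls in classes:
            if cls & merged:          # overlapping class: absorb it
                merged |= cls
            else:
                rest.append(cls)
        classes = rest + [merged]
    return {tuple(sorted(p))
            for cls in classes for p in combinations(sorted(cls), 2)}

preds = transitive_closure([('s', 'm'), ('m', 'b'), ('b', 'g')])
print(sorted(preds))    # all 6 implied pairs over {b, g, m, s}
```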
However, using predicate transitive closure is not enough. After predicate
transitive closure is used, the QEPs chosen differ in the join order. This is due
to the different estimates obtained for the intermediate join result sizes. The
QEP chosen by ELS runs 9 to 12 times faster than the other QEPs.
The correct join result size after any subset of joins has been performed can be
shown to be exactly 100. We can see in the second and third rows of the table that
Algorithm SM and Algorithm SSS greatly underestimate the join result sizes.
The optimizer estimates that the size of the 3-table composite result involving
tables S, M, and B is extremely small. This misleads the optimizer into believing
that table G will be accessed very infrequently. Thus, the optimizer puts table
G last in the join order, which in turn leads to a poor QEP.
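By contrast, Algorithm ELS adjusts the column cardinalities for the local predicates (each join column retains only 100 distinct values) and applies a single join predicate per equivalence class at each step. A sketch of that computation (again our own illustration, under the same 1/max selectivity assumption) recovers the correct estimate of 100 after every join:

```python
# Sketch of the ELS estimate for join order B, G, M, S: local predicates
# reduce both the table and column cardinalities to 100, and all four
# columns lie in one equivalence class, so each newly joined table
# contributes exactly one join predicate.
d_adj = {'s': 100, 'm': 100, 'b': 100, 'g': 100}   # adjusted column cards
card = {'s': 100, 'm': 100, 'b': 100, 'g': 100}    # tuples after "< 100"

order = ['b', 'g', 'm', 's']
size, sizes = card[order[0]], []
for prev, col in zip(order, order[1:]):
    sel = 1.0 / max(d_adj[prev], d_adj[col])       # one predicate per class
    size *= card[col] * sel
    sizes.append(size)
print(sizes)   # each intermediate estimate is 100
```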
9 Summary
Accurate estimation of join result sizes is crucial in query optimization because
the estimates have a significant influence on the query plans chosen. This paper described an algorithm (called ELS) for the correct incremental estimation
of join result sizes. For each table, estimates of the table cardinality and the
column cardinality of each column are computed, while taking into account the
effect of local predicates and the case when two or more join columns in a table belong to the same equivalence class. These estimates are then used in the
computation of join selectivities. For any intermediate result, Algorithm ELS
chooses a correct subset of the eligible join predicates to determine the result
size, while taking into account any dependencies between eligible join predicates.
We proved the correctness of Algorithm ELS. We have shown that using the algorithm can make a significant difference in query execution time (possibly an
order of magnitude or more).
Future work involves relaxing the assumptions that are used in this paper
and in most of the work on join size estimation. We have already relaxed the
uniformity assumption in the case of local predicates. Relaxing the assumption
in the case of join predicates would enable query optimizers to account for important data distributions such as the Zipfian distribution. The independence
assumption needs to be relaxed. All published work on estimation has dealt with
the case of a single SQL query block. Estimation in nested SQL queries is a much
harder problem. Work also needs to be done to extend the results in this paper
to queries involving disjunctions.
Acknowledgements
We wish to thank Paul Bird, Philip Blair, Peter Haas, Richard Hedges, Bruce
Lindsay, Guy Lohman, Sheila Richardson, Pat Selinger and Lori Strain for their
comments on various drafts of this paper.
References
1. S. Christodoulakis. Estimating Block Transfers and Join Sizes. In Proceedings of
ACM-SIGMOD International Conference on Management of Data, pages 40-54,
1983.
2. S. Christodoulakis. Implications of Certain Assumptions in Database Performance
Evaluation. ACM Transactions on Database Systems, 9(2):163-186, June 1984.
3. C. Faloutsos and H. V. Jagadish. On B-tree Indices for Skewed Distributions. In
Proceedings of the Eighteenth International Conference on Very Large Data Bases,
pages 363-374, Vancouver, British Columbia, 1992. Morgan Kaufmann.
4. Y. E. Ioannidis and S. Christodoulakis. On the Propagation of Errors in the Size
of Join Results. In Proceedings of ACM-SIGMOD International Conference on
Management of Data, pages 268-277, Denver, Colorado, 1991.
5. Y. C. Kang. Randomized Algorithms for Query Optimization. PhD thesis, University of Wisconsin-Madison, October 1991. TR 1053.
6. C. A. Lynch. Selectivity Estimation and Query Optimization in Large Databases
with Highly Skewed Distributions of Column Values. In Proceedings of the Fourteenth International Conference on Very Large Data Bases, pages 240-251, Los
Angeles, USA, 1988. Morgan Kaufmann.
7. M. V. Mannino, P. Chu, and T. Sager. Statistical Profile Estimation in Database
Systems. ACM Computing Surveys, 20(3):191-221, September 1988.
8. M. Muralikrishna and D. J. DeWitt. Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries. In Proceedings of ACM-SIGMOD
International Conference on Management of Data, pages 28-36, Chicago, Illinois,
1988.
9. K. Ono and G. M. Lohman. Measuring the Complexity of Join Enumeration in
Query Optimization. In Proceedings of the Sixteenth International Conference on
Very Large Data Bases, pages 314-325, Brisbane, Australia, 1990. Morgan Kaufmann.
10. G. Piatetsky-Shapiro and C. Connell. Accurate Estimation of the Number of Tuples Satisfying a Condition. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 256-276, 1984.
11. H. Pirahesh, J. Hellerstein, and W. Hasan. Extensible/Rule Based Query Rewrite
Optimization in Starburst. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 39-48, San Diego, California, 1992.
12. A. Rosenthal. Note on the Expected Size of a Join. ACM-SIGMOD Record, pages
19-25, July 1981.
13. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price.
Access Path Selection in a Relational Database Management System. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages
23-34, 1979.
14. A. Swami. Optimization of Large Join Queries. PhD thesis, Stanford University,
June 1989. STAN-CS-89-1262.
15. A. Swami and B. Iyer. A Polynomial Time Algorithm for Optimizing Join Queries.
In Proceedings of IEEE Data Engineering Conference, pages 345-354. IEEE Computer Society, April 1993.
16. A. Swami and K. B. Schiefer. On the Estimation of Join Result Sizes. Technical
report, IBM Research Division, October 1993. IBM Research Report RJ 9569.
17. G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley,
Reading, MA, 1949.