
On the estimation of join result sizes

1994, Lecture Notes in Computer Science

Good estimates of join result sizes are critical for query optimization in relational database management systems. We address the problem of incrementally obtaining accurate and consistent estimates of join result sizes. We have invented a new rule for choosing join selectivities for estimating join result sizes. The rule is part of a new unified algorithm called Algorithm ELS (Equivalence and Largest Selectivity). Prior to computing any result sizes, equivalence classes are determined for the join columns. The algorithm also takes into account the effect of local predicates on table and column cardinalities. These computations allow the correct selectivity values for each eligible join predicate to be computed. We show that the algorithm is correct and gives better estimates than current estimation algorithms.

On the Estimation of Join Result Sizes

Arun Swami (1) and K. Bernhard Schiefer (2)
(1) IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099
(2) IBM Toronto Lab, 895 Don Mills Road, North York, Ontario, Canada M3C 1W3

1 Introduction

The join is an important operation in relational database management systems. When a user-generated query involves multiple joins, the cost of executing the query can vary dramatically depending on the query evaluation plan (QEP) chosen by the query optimizer. The join order and the access methods used are important determinants of the lowest-cost QEP. The query optimizer estimates the eventual result size, or cardinality, of joining the specified tables and uses this information to choose between different join orders and access methods. Thus, the estimation of join result sizes in a query is an important problem, as the estimates have a significant influence on the QEP chosen. A survey of the use of statistics and estimation techniques in query optimization is given in [7]. There has been a lot of work on estimating selectivities, e.g., [1, 10, 8].
In [6], a number of new selectivity estimation methods were proposed for highly skewed distributions such as Zipf distributions [17, 3]. Errors in the statistics maintained by the database system can affect the various estimates computed by the query optimizer. An analytical model was used in [4] to study the propagation of errors in the estimated size of join results as the number of joins increases in queries with a single equivalence class. When more than two tables are involved, the query optimization algorithm often needs to estimate the join result sizes incrementally. Incremental estimation is used, for example, in the dynamic programming algorithm [13], the AB algorithm [15], and randomized algorithms [14, 5]. The optimizer first determines the result size of joining the first two tables, then the result size of the join with a third table, and so on. Thus, the query optimizer incrementally estimates the final join result size from the result sizes of the intermediate tables produced. We have identified a number of problems with current practice in join size estimation. The problems include dealing correctly with local predicates, correctly handling multiple columns in a single table that belong to a single equivalence class, and correctly estimating join result sizes incrementally. In this paper, we describe an algorithm called ELS that solves these problems and give arguments for the correctness of the solutions. In the next section, we give a brief overview of the terminology and known results in join size estimation that are used in this paper. The reader knowledgeable in query optimization can skim Section 2. Section 3 describes some of the problems in current join size estimation practice. Algorithm ELS is outlined in Section 4. In [16] we also show how one can handle the case of multiple local predicates on a single column.
In Sections 5, 6, and 7, we describe the techniques used in Algorithm ELS to address the problems described in Section 3. We illustrate the value of our algorithm in terms of execution time in Section 8. In Section 9 we summarize the contributions of this paper and indicate some directions for future work.

2 Background

The following assumptions are made by most query-processing cost models in practice and in this paper:

1. Independence: Within each relation, values chosen from distinct columns that are involved in join predicates are independent.
2. Uniformity of attribute values: The distinct values in a join column appear equifrequently in the column.
3. Containment: When joining two tables, the set of values in the join column with the smaller column cardinality is a subset of the set of values in the join column with the larger column cardinality.

The implications of these and other assumptions are discussed in depth in [2]. Note that we need the uniformity assumption only for the join columns, i.e., we can use data distribution information for local predicate selectivities. Some work has been done [1] to relax some of these assumptions. However, this requires either assuming a certain parametric model of the data (a nonuniform Zipf distribution) and determining its parameter values, or keeping a large amount of metadata (correlation statistics). There is a tradeoff between implementation complexity and possible gains in the accuracy of estimates. Most query optimizers choose to work with the assumptions listed above. As in other work, we focus on conjunctive queries, where the selection condition in the WHERE clause is a conjunction of predicates. These queries constitute the most important class of queries. A table may participate in several join predicates. Let a table R be joined with another (possibly intermediate) table I.
At this point in the join ordering, the query optimizer only needs to consider the predicates that link columns in table R with the corresponding columns in a second table S that is present in table I. These join predicates are termed eligible join predicates. The query may include predicates that involve only a single table. Such predicates are called local predicates. The join result size is typically estimated as the product of the cardinalities of the operand tables after applying the local predicates (if any) and the join predicates [13]. When queries contain equality predicates (either local or join predicates), it is possible to derive additional implied predicates using transitive closure. (Similar derivations are possible for nonequality predicates too, but equality predicates are the most common and important class of predicates that generate implied predicates.) Transitive closure can be applied to two join predicates that share a common join column in order to obtain another join predicate.

Example 1a. Suppose the following SQL statement is entered as a query (in the rest of the paper, we show only the WHERE clause of SQL queries):

SELECT R1.a FROM R1, R2, R3
WHERE (R1.x = R2.y) AND (R2.y = R3.z)

We can use the principle of transitivity with the join predicates J1: (R1.x = R2.y) and J2: (R2.y = R3.z) to obtain the join predicate J3: (R1.x = R3.z). Performing this predicate transitive closure gives the optimizer maximum freedom to vary the join order and ensures that the same QEP is generated for equivalent queries independently of how the queries are specified. In the above example, without predicate J3, most query optimizers would avoid the join order beginning with (R1 ⋈ R3), since this would be evaluated as a cartesian product, which is very expensive. (We use ⋈ as the symbol for the join operation.) Since join predicates can share columns, the effects of applying them may not be independent.
In the above example, once join predicates J1 and J2 have been evaluated, J3 has in effect been evaluated, and hence evaluating it separately can have no further effect on the size of the join result. Equivalence classes of join columns can be used to capture these dependencies. Initially, each column is an equivalence class by itself. When an equality (local or join) predicate is seen during query optimization, the equivalence classes corresponding to the columns on each side of the equality are merged. In the example above, given join predicate J1, columns x and y are in the same equivalence class. When join predicate J2 is seen, the equivalence classes of columns y and z are merged. Thus columns x, y, and z are now in a single equivalence class. As we will see later, we can use the equivalence class structure to correctly take into account the effect of dependencies between join predicates. We will often use the terminology "x and y are j-equivalent (columns)" to denote that x and y belong to the same equivalence class. Multiple equivalence classes can be handled using the independence assumption as follows. Since the join columns in different equivalence classes are assumed to satisfy the independence assumption, the predicates involving columns in one equivalence class reduce the join result size independently of the predicates involving columns in another equivalence class. Thus, in the rest of the paper, we can focus on the case of a single equivalence class. Two kinds of statistics are typically important in query optimization. One statistic is the number of tuples contained in a single table. This value is known as the table cardinality and is denoted by ||R||, where R is the table. The second statistic is the number of distinct values present in a column.
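The equivalence-class bookkeeping described above can be sketched with a standard union-find structure. This is an illustrative sketch, not the paper's implementation; the column-naming scheme is assumed:

```python
class EquivalenceClasses:
    """Union-find over join columns: each column starts in its own class,
    and every equality (local or join) predicate merges two classes."""

    def __init__(self):
        self.parent = {}

    def find(self, col):
        # Path-compressing find; unseen columns start as singletons.
        self.parent.setdefault(col, col)
        while self.parent[col] != col:
            self.parent[col] = self.parent[self.parent[col]]
            col = self.parent[col]
        return col

    def merge(self, a, b):
        # Called once per equality predicate a = b.
        self.parent[self.find(a)] = self.find(b)

    def j_equivalent(self, a, b):
        return self.find(a) == self.find(b)

# Example 1a: J1 merges R1.x with R2.y, J2 merges R2.y with R3.z.
ec = EquivalenceClasses()
ec.merge("R1.x", "R2.y")   # J1
ec.merge("R2.y", "R3.z")   # J2
print(ec.j_equivalent("R1.x", "R3.z"))  # True: x, y, z form one class
```

After both merges, any pair among x, y, and z tests as j-equivalent, which is exactly the dependency information the later sections exploit.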
This value is known as the column cardinality and is denoted by d_x, where x is the column. These two statistics are important because they are used to estimate the size of the results of different operations on the table data, which in turn helps to determine the cost of these operations. Consider a join predicate J: (R1.x1 = R2.x2), where the column cardinality of column x1 of table R1 is d1 and that of column x2 of table R2 is d2. The size of the join result can be estimated [13] as:

||R1 ⋈ R2|| = d' × (||R1|| / d1) × (||R2|| / d2), where d' = min(d1, d2)   (Equation 1)

Since min(d1, d2) / (d1 × d2) = 1 / max(d1, d2), Equation 1 can be rewritten as:

||R1 ⋈ R2|| = ||R1|| × ||R2|| × S_J, where S_J = 1 / max(d1, d2)   (Equation 2)

where S_J is termed the selectivity of the join predicate J. In order to enable a more intuitive understanding of Equation 2, we derive Equation 1 using the assumptions listed above. For each distinct value, there are ||R_i|| / d_i tuples in R_i having that value in the join column (from the uniformity assumption). Using the containment assumption, we get that the number of distinct values that appear in the join columns of both tables is min(d1, d2). But these are precisely the values for which the join takes place. Hence, the number of tuples in the intermediate result (R1 ⋈ R2) is:

||R1 ⋈ R2|| = min(d1, d2) × (||R1|| / d1) × (||R2|| / d2) = ||R1|| × ||R2|| / max(d1, d2)

In [12], Rosenthal showed that Equation 1 holds even if the uniformity assumption is weakened to require expected uniformity for only one of the join columns. Equation 1 can be generalized to n relations to obtain Equation 3 below. Let R^n denote the result of joining tables R1, R2, ..., Rn on columns x1, x2, ..., xn respectively. Here there are equality predicates between each pair of columns, so in the terminology of equivalence classes, all the x_i are in a single equivalence class, i.e., they are j-equivalent. Denote the column cardinality of x_i by d_i for i ∈ 1..n.
Let d(1), d(2), ..., d(n) be a rearrangement of d1, d2, ..., dn in increasing order, so that d(1) ≤ d(2) ≤ ... ≤ d(n). Then d(1) = min(d1, d2, ..., dn). Using the assumptions listed above, it can be shown that the size of the intermediate result R^n is:

||R^n|| = (||R1|| × ||R2|| × ... × ||Rn||) / (d(2) × d(3) × ... × d(n))   (Equation 3)

that is, all column cardinalities except the smallest one appear in the denominator.

Example 1b. Continuing with Example 1a, let the statistics for the tables be as follows:

||R1|| = 100, ||R2|| = 1000, ||R3|| = 1000, d_x = 10, d_y = 100, d_z = 1000

The join selectivities from Equation 2 are:

J1: S_J1 = 1 / max(d_x, d_y) = 1 / max(10, 100) = 0.01
J2: S_J2 = 1 / max(d_y, d_z) = 1 / max(100, 1000) = 0.001
J3: S_J3 = 1 / max(d_x, d_z) = 1 / max(10, 1000) = 0.001

If R2 is joined with R3, the size of the intermediate result table can be estimated using Equation 2:

||R2 ⋈ R3|| = ||R2|| × ||R3|| × S_J2 = 1000 × 1000 × 0.001 = 1000

Using Equation 3, the size of (R1 ⋈ R2 ⋈ R3) is estimated to be:

||R1 ⋈ R2 ⋈ R3|| = (100 × 1000 × 1000) / (100 × 1000) = 1000

which is the correct answer.

3 Problems in Join Size Estimation

There are a number of problems with current practice in join size estimation. The problems include dealing correctly with local predicates, correctly handling multiple columns in a single table that belong to a single equivalence class, and correctly estimating join result sizes incrementally.

3.1 Effect of Local Predicates

If a local predicate is present, it may reduce the cardinality of the table. This reduced cardinality is called the effective cardinality of the table. It can be used in other cardinality calculations, such as for estimating join result sizes.
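As a concrete check, the Example 1b estimates can be reproduced from Equations 2 and 3; this is an illustrative sketch using the statistics given in the example:

```python
# Statistics from Example 1b.
card = {"R1": 100, "R2": 1000, "R3": 1000}
d = {"x": 10, "y": 100, "z": 1000}

def join_selectivity(d1, d2):
    # Equation 2: S_J = 1 / max(d1, d2).
    return 1.0 / max(d1, d2)

def nway_size(table_cards, col_cards):
    # Equation 3: product of table cardinalities divided by the product
    # of all column cardinalities except the smallest one.
    cards = sorted(col_cards)
    size = 1.0
    for t in table_cards:
        size *= t
    for c in cards[1:]:
        size /= c
    return size

S_J2 = join_selectivity(d["y"], d["z"])            # 0.001
two_way = card["R2"] * card["R3"] * S_J2           # Equation 2
three_way = nway_size(card.values(), d.values())   # Equation 3
print(two_way, three_way)  # both estimates come to 1000
```

Both the two-way estimate for (R2 ⋈ R3) and the three-way estimate for (R1 ⋈ R2 ⋈ R3) agree with the values worked out in the example.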
When the local predicate is on a join column, the predicate can also reduce the column cardinality of the join column. Thus, it is clear that local predicates affect join result sizes, since the predicates affect both the number of participating tuples and, possibly, the column cardinality of the join columns. An algorithm is therefore needed for taking into account the effect of local predicates when estimating join result sizes. We do not know of any algorithm that correctly takes into account both local predicates and join predicates. In Section 5, we describe how to correctly take into account the effect of local predicates in join size estimation.

3.2 Single-Table J-Equivalent Columns

During the incremental calculation of join result sizes, it can happen that two or more of the eligible join predicates involve join columns of the next table that are j-equivalent. For example, consider a table R1 joining with another table R2 using the following join predicates: (R1.x = R2.y) and (R1.x = R2.z). By transitive closure, we get the predicate (R2.y = R2.z). When R1 is joined with R2, all three predicates are eligible and involve columns y and z of R2, which are j-equivalent by virtue of the predicate (R2.y = R2.z). Current query optimizers do not treat this as a special case; therefore, by default, the join selectivities of all such join predicates are used in calculating the join result sizes. However, using them all produces an incorrect result because the join columns are not independent. In Section 6, we describe an algorithm that correctly handles the case when two or more of the join columns of a relation are j-equivalent.

3.3 J-Equivalent Columns from Different Tables

Another problem that arises in incremental computation is that the effects of the eligible predicates involving different tables are not independent.
In Example 1b, once join predicates J1 and J2 have been evaluated, J3 has in effect been evaluated, and hence evaluating it separately can have no further effect on the estimated size of the join result. Yet, in current practice, its selectivity will be included in the result size estimate. In [13], the selectivities of all the eligible join predicates are multiplied together along with the product of the individual effective table cardinalities (as reduced by the selectivities of local predicates). We call this the multiplicative rule (M). This rule can compute an incorrect join result size because all the join selectivities are used without accounting for dependencies. Rule M can dramatically underestimate the join result size, as shown in Example 2.

Example 2. As in Example 1b, let R2 and R3 be joined first, and then R1. From Example 1b, we have ||R2 ⋈ R3|| = 1000. When R1 is joined, Rule M estimates the join result size to be:

||R2 ⋈ R3 ⋈ R1|| = ||R2 ⋈ R3|| × ||R1|| × S_J1 × S_J3 = 1000 × 100 × 0.01 × 0.001 = 1 (correct answer: 1000)

The underestimation of join result sizes by Rule M suggests dividing the eligible join predicates into groups, with join predicates involving j-equivalent columns being grouped together, and then choosing a single join selectivity for each group. The problem appears analogous to that of picking among multiple local predicates on a single column [16]. Thus, one would expect to pick the smallest join selectivity value for each group. We refer to this idea as the smallest selectivity rule (SS). Though this seems to be the intuitive choice, this choice of join selectivity can compute an incorrect join result size, as shown in Example 3.

Example 3. As in Example 1b, let R2 and R3 be joined first, and then R1. From Example 1b, we have ||R2 ⋈ R3|| = 1000. When R1 is to be joined, Rule SS puts predicates J1 and J3 in a single group, since columns x, y, and z are j-equivalent.
Since S_J3 < S_J1, it uses only predicate J3 in calculating the join result size. Thus, Rule SS estimates the join result size as follows:

||R2 ⋈ R3 ⋈ R1|| = ||R2 ⋈ R3|| × ||R1|| × S_J3 = 1000 × 100 × 0.001 = 100 (correct answer: 1000)

A third procedure has been proposed but, to our knowledge, has not been implemented in any optimizer. The proposal is to assign a representative join selectivity to each equivalence class and to use that selectivity whenever join predicates in that equivalence class are being used. The problem with this proposal is that there is no certainty that a correct value for this representative join selectivity exists that will work in all cases. In our example query, if the representative selectivity is 0.01, the estimate for the final join result size will be 10000, which is too high. If the representative selectivity is 0.001, the estimate will be 100, which is too low. In Section 7, we describe a new rule for correctly choosing the join selectivities when incrementally estimating join result sizes.

4 Algorithm ELS

We describe an algorithm called ELS (Equivalence and Largest Selectivity) that solves the problems described in the previous section. Algorithm ELS consists of two phases. The first is a preliminary phase that is performed before any join result sizes are computed. In this phase, the predicates implied by transitive closure are generated and join selectivities are calculated. The preliminary phase consists of steps 1 through 5. The second phase computes the join result sizes incrementally. The processing for this final phase is outlined in step 6.

1. Examine each given predicate.
   - Remove any predicate that is identical to another predicate, so that queries involving duplicate predicates such as (R1.x > 500) AND (R1.x > 500) can be handled.
   - Build equivalence classes for all columns that participate in any of the predicates.
2. Generate all implied predicates using transitive closure. There are five variations:
   a. Two join predicates can imply another join predicate: (R1.x = R2.y) AND (R2.y = R3.z) ⇒ (R1.x = R3.z)
   b. Two join predicates can imply a local predicate: (R1.x = R2.y) AND (R1.x = R2.w) ⇒ (R2.y = R2.w)
   c. Two local predicates can imply another local predicate: (R1.x = R1.y) AND (R1.y = R1.z) ⇒ (R1.x = R1.z)
   d. A join predicate and a local predicate can imply another join predicate: (R1.x = R2.y) AND (R1.x = R1.v) ⇒ (R2.y = R1.v)
   e. A join predicate and a local predicate can imply another local predicate, where op denotes a comparison operator and c is a constant: (R1.x = R2.y) AND (R1.x op c) ⇒ (R2.y op c)
3. Assign to each local predicate a selectivity estimate that incorporates any distribution statistics. In [16], we present a complete algorithm to handle the case of multiple local predicates on a single column. In essence, the most restrictive equality predicate is chosen if one exists; otherwise we choose the pair of range predicates that forms the tightest bound.
4. For each table, compute an estimate of the table cardinality and of the column cardinality of each column after all the local predicates have been included. We show that if these new estimates are correctly used in the computation of join selectivities and join result sizes, we do not need to concern ourselves with local predicates after this step. This is discussed further in Section 5.
5. Process each join predicate by computing its join selectivity. If two join columns from the same table are j-equivalent, special care is needed in the computation of the join selectivities. We describe an algorithm for handling this case in Section 6.
6. For each intermediate result of the join order, estimate the result size. We have invented a new rule, described in Section 7, to estimate the result size. We also prove that the rule is correct under the assumptions stated in Section 2.
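Steps 1 and 2 can be sketched as follows. The predicate representation (column pairs for equalities, (column, op, constant) triples for comparisons) is an assumption for illustration, not the paper's data structures:

```python
from itertools import combinations

def transitive_closure(equalities, comparisons):
    """equalities: set of frozensets {col_a, col_b} (local or join).
    comparisons: set of (col, op, const) local predicates.
    Returns the closed sets, covering variations a-e of step 2."""
    # Build equivalence classes from the equality predicates (step 1).
    classes = []
    for eq in equalities:
        a, b = tuple(eq)
        touching = [c for c in classes if a in c or b in c]
        for c in touching:
            classes.remove(c)
        classes.append(set().union({a, b}, *touching))
    # Variations a-d: every pair of j-equivalent columns is a (given or
    # implied) equality predicate.
    closed_eq = {frozenset(p) for c in classes
                 for p in combinations(sorted(c), 2)}
    # Variation e: a comparison on one column applies to every
    # j-equivalent column.
    closed_cmp = set(comparisons)
    for col, op, const in comparisons:
        for c in classes:
            if col in c:
                closed_cmp |= {(other, op, const) for other in c}
    return closed_eq, closed_cmp

# Example 1a plus a local predicate R1.x < 500.
eqs = {frozenset({"R1.x", "R2.y"}), frozenset({"R2.y", "R3.z"})}
cmps = {("R1.x", "<", 500)}
closed_eq, closed_cmp = transitive_closure(eqs, cmps)
print(frozenset({"R1.x", "R3.z"}) in closed_eq)   # implied join predicate J3
print(("R3.z", "<", 500) in closed_cmp)           # implied local predicate
```

On Example 1a the closure yields the implied join predicate J3, and the local predicate on R1.x propagates to every j-equivalent column, matching variations a and e above.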
5 Effect of Local Predicates on Join Result Sizes

In Section 3, we described the need to incorporate the effect of local predicates in the estimation of join result sizes. Let R be one of the tables participating in a join. Let x be a join column of R and let a local predicate involve column y of R; y could be identical to x. Let ||R|| be the cardinality of R before the local predicate is applied, and ||R||' the cardinality of R after the local predicate is applied. Let d_x denote the column cardinality of join column x before the local predicate is applied, and d'_x the column cardinality of x after the local predicate is applied. Similarly, we can define d_y and d'_y. The estimates of ||R||', d'_x, and d'_y depend on the local predicate and the information available about column y. For example, if the local predicate is of the form y = a, where a is a literal, we know that d'_y = 1. If we have distribution statistics on y, they can be used to accurately estimate ||R||'. Otherwise, we can use the uniformity assumption (see Section 2) to estimate ||R||' (i.e., ||R||' = ||R|| / d_y). For some other local predicate L involving y, if the local predicate selectivity S_L is known, we can estimate ||R||' = ||R|| × S_L and d'_y = d_y × S_L. Having estimated ||R||' and d'_y, we can estimate d'_x as follows. If y is identical to x, clearly d'_x = d'_y. If y and x are different columns, we use a simple urn model to estimate d'_x. The process is one of assigning ||R||' balls among d_x urns that are initially empty. Any of the urns can be chosen uniformly. The balls correspond to the selected tuples and the urns correspond to the distinct values in column x. The number of distinct values in x remaining after selection corresponds to the number of non-empty urns. We obtain the expected number of non-empty urns as follows, where n = d_x and k = ||R||'.
Prob{ball is put in urn i} = 1/n
Prob{ball is not put in urn i} = 1 - 1/n
Prob{none of the k balls is put in urn i} = (1 - 1/n)^k
Prob{at least one of the k balls is put in urn i} = 1 - (1 - 1/n)^k

Then the expected number of non-empty urns is given by n × (1 - (1 - 1/n)^k). Hence, in the case where column x is distinct from column y, we have:

d'_x = d_x × (1 - (1 - 1/d_x)^||R||')

Note that if ||R||' is sufficiently large and approaches ||R||, the term (1 - 1/d_x)^||R||' will be close to 0 and d'_x ≈ d_x. This estimate can be quite different from another common estimate, d'_x = d_x × (||R||' / ||R||). This can be seen from the following numerical example. Let d_x = 10000, ||R|| = 100000, and ||R||' = 50000. Then the urn model estimate gives d'_x = 9933, whereas the other estimate gives d'_x = 5000. If ||R||' = ||R||, the estimate according to the urn model is d'_x = 10000. At this point, the table and column cardinality estimates have already incorporated all local predicates. The rest of the estimation algorithm only has to deal with join predicates. We have thus taken into account the effect of local predicates and simplified subsequent processing. We can also take advantage of distribution statistics on columns that are involved in local predicates. This enables us to obtain more accurate estimates of join result sizes. The original, unreduced table and column cardinalities are retained for use in cost calculations before the local predicates have been applied, for example, when estimating the cost of accessing the table. We have assumed that if the effect of a local predicate on the table cardinality and the column cardinalities of join columns is taken into account, then the local predicate is, in effect, applied before or together with the join predicates on this table. This being true, the following argument shows that our approach to size estimation is correct. Consider all the local and join predicates being used to form an intermediate result.
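As a numeric check of the urn-model estimate derived above, using the statistics of the numerical example:

```python
def urn_estimate(d_x, k):
    # Expected number of non-empty urns when k balls are thrown
    # uniformly into d_x urns: d_x * (1 - (1 - 1/d_x)^k).
    return d_x * (1.0 - (1.0 - 1.0 / d_x) ** k)

def proportional_estimate(d_x, card, card_reduced):
    # The common alternative: scale d_x by the surviving tuple fraction.
    return d_x * card_reduced / card

d_x, card, card_reduced = 10000, 100000, 50000
print(round(urn_estimate(d_x, card_reduced)))                 # 9933
print(round(proportional_estimate(d_x, card, card_reduced)))  # 5000
```

With half the tuples surviving, the urn model predicts that almost all distinct values remain (9933 of 10000), whereas proportional scaling predicts only half, which illustrates how different the two estimates can be.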
For the estimation of the result size, it does not matter in which order the predicates are applied. We can assume that the local predicates are applied first, evaluate their effects on the table statistics, and then compute the size obtained by applying the join predicates. Hence, in the following discussion, we will focus on join predicates and assume that local predicates can be handled as described above.

6 J-Equivalent Join Columns in a Single Table

Algorithm ELS needs to deal with the special case described in Section 3.2. Consider the following example query, involving columns from the same table that are j-equivalent:

WHERE (R1.x = R2.y) AND (R1.x = R2.w)

After transitive closure of predicates, we obtain the following transformed query:

WHERE (R1.x = R2.y) AND (R1.x = R2.w) AND (R2.y = R2.w)

where the implied local predicate (R2.y = R2.w) is added using rule 2.b from Section 4. Let the statistics for the tables in the query be as follows: ||R1|| = 100, ||R2|| = 1000, d_x = 100, d_y = 10, d_w = 50. After transitive closure, R2.y participates in every join in which R2.w participates, and vice versa. Hence, for computing join selectivities and join result sizes, we have to know the number of R2 tuples that can qualify under the local predicate and also the number of distinct values that the qualifying tuples contain. We wish to quantify the effect of the local predicate on the effective cardinality of R2 and the column cardinalities prior to evaluating the effect of the join predicates on the result sizes. To do so, we use a probabilistic argument as follows. Without loss of generality, let d_w ≥ d_y. Let ||R2|| denote the table cardinality prior to join predicates being applied. We use the containment assumption for join columns (see Section 2) to infer that all the d_y distinct values are contained among the d_w distinct values. Now consider any tuple of R2. It has some value, say q, in column R2.y.
Assuming the independence and uniformity assumptions for columns R2.w and R2.y, the probability that R2.w has the same value q is 1/d_w. Since this is the case for every tuple of R2, the effective cardinality of R2 is given by:

||R2||' = ||R2|| / d_w

Once this selection has been performed, only one of the columns needs to be joined, since the evaluation of the local predicate has made the other join redundant. Using a simple urn model as in Section 5, we deduce that the column cardinality value that should be used for join selectivity computations is:

⌈d_y × (1 - (1 - 1/d_y)^||R2||')⌉

that is, we pick the column with the most restrictive effect on the column cardinality. In our example query, ||R2||' = ||R2|| / d_w = 1000 / 50 = 20. We pick R2.y, and the effective column cardinality in joins is hence ⌈d_y × (1 - (1 - 1/d_y)^||R2||')⌉ = ⌈10 × (1 - (1 - 1/10)^20)⌉ = 9.

The formulas given above can be generalized to the case of three or more j-equivalent join columns from the same table. Let 1, 2, ..., n be the equivalent join columns of table R, and let the column cardinalities be d1, d2, ..., dn. Let d(1), d(2), ..., d(n) be a rearrangement of d1, d2, ..., dn in increasing order, so that d(1) ≤ d(2) ≤ ... ≤ d(n). Then d(1) = min(d1, d2, ..., dn). Using an argument similar to the one above, it follows that

||R||' = ⌈||R|| / (d(2) × d(3) × ... × d(n))⌉

and that the effective column cardinality in joins is

⌈d(1) × (1 - (1 - 1/d(1))^||R||')⌉

7 Incremental Computation of Result Sizes

We saw in Section 3 that the "intuitive" rules currently used in query optimizers (Rule M and Rule SS) do not correctly choose the join selectivities in the incremental estimation of join sizes. We present a new rule, called LS, for picking one of the join selectivities for an equivalence class.

Rule LS: Given a choice of join selectivities for a single equivalence class, always pick the largest join selectivity.
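On the statistics of Example 1b, Rule LS can be contrasted directly with Rule M and Rule SS. This is an illustrative sketch in which each rule is reduced to its choice of selectivities within one equivalence class:

```python
# Example 1b: ||R2 >< R3|| has been estimated at 1000 tuples; table R1
# (100 tuples) joins next, with eligible predicates J1 (S = 0.01) and
# J3 (S = 0.001), both in a single equivalence class.
intermediate, r1_card = 1000, 100
group = [0.01, 0.001]   # selectivities of J1 and J3

def rule_m(sels):        # multiplicative: use every selectivity
    p = 1.0
    for s in sels:
        p *= s
    return p

def rule_ss(sels):       # smallest selectivity in the group
    return min(sels)

def rule_ls(sels):       # ELS: largest selectivity in the group
    return max(sels)

for name, rule in [("M", rule_m), ("SS", rule_ss), ("LS", rule_ls)]:
    est = intermediate * r1_card * rule(group)
    print(name, est)     # M yields about 1, SS about 100, LS 1000;
                         # the true size is 1000
```

Only Rule LS reproduces the value given by Equation 3; the other two rules underestimate by one and three orders of magnitude respectively.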
Rule LS appears counter-intuitive; a proof is provided in [16]. Multiple equivalence classes can be accommodated by using the independence assumption. Since the join columns in different equivalence classes are assumed to satisfy the independence assumption, each equivalence class independently reduces the join result size. The algorithm then determines the join selectivity for each equivalence class and multiplies them together to obtain the effective join selectivity.

Example 4. As in Example 1b, let R2 and R3 be joined first, and then R1. From Example 1b, we have ||R2 ⋈ R3|| = 1000. When R1 is to be joined, Rule LS puts predicates J1 and J3 in a single group, since columns x, y, and z are j-equivalent. Since S_J3 < S_J1, it uses only predicate J1 in calculating the join result size. Thus, Rule LS estimates the join result size as follows:

||R2 ⋈ R3 ⋈ R1|| = ||R2 ⋈ R3|| × ||R1|| × S_J1 = 1000 × 100 × 0.01 = 1000 (correct)

As before, denote the result of joining tables R1, R2, ..., Rn by R^n. Let us say that we have calculated the size of R^n and now wish to join table R(n+1) to obtain the result R^(n+1). We prove that Rule LS is correct by showing that the algorithm calculates the join result size in agreement with the size calculated by Equation 3 for a single equivalence class. (Recall that we have already discussed how to handle multiple equivalence classes.) The proof is by induction on the number of tables.

8 Experiment

We have presented Algorithm ELS and shown that it correctly estimates join result sizes. In this section, we illustrate that correct estimation of join sizes can make a significant difference in query execution times. We use the straightforward select-project-join query shown below. For this query, Algorithm ELS can result in an order-of-magnitude improvement in query execution time.

SELECT COUNT(*)
FROM S, M, B, G
WHERE s = m AND m = b AND b = g AND s < 100

The tables used in the query are S (small), M (medium), B (big), and G (giant).
A single column from each table participates in the query and is denoted by the table name in lower case, e.g., column s from table S. After transitive closure, the query is transformed as follows:

SELECT COUNT(*)
FROM S, M, B, G
WHERE s = m AND m = b AND b = g AND s = b AND s = g AND m = g
AND s < 100 AND m < 100 AND b < 100 AND g < 100

The table and column cardinalities are given below:

||S|| = 1000, ||M|| = 10000, ||B|| = 50000, ||G|| = 100000
d_s = 1000, d_m = 10000, d_b = 50000, d_g = 100000

The query optimizer in the Starburst DBMS [9] was modified to implement the multiplicative rule (Rule M), the smallest selectivity rule (Rule SS), and our Algorithm ELS (with Rule LS). Predicate transitive closure (PTC) was implemented as a query rewrite rule [11] so that we could disable it as necessary for the experiments. Note that both Rule SS and Rule LS are sensible only when predicate transitive closure has been applied. No other changes were made to the optimizer, and the optimizer's entire repertoire was enabled (including the Nested Loops and Sort Merge join methods). The standard algorithm most commonly in use in current relational systems computes join selectivities independently of the effect of local predicates. Let Algorithm SM denote using Rule M with the standard algorithm, and let Algorithm SSS denote using Rule SS with the standard algorithm. We ran four experiments. The original query, before predicate transitive closure, was run using Algorithm SM. Next, the query after predicate transitive closure was run with Algorithm SM and with Algorithm SSS. Finally, the original query was run with Algorithm ELS. The results are shown in the table below, which identifies the query and rewrite rule used, the algorithm used, the join order chosen by the algorithm, the result sizes estimated after each join, and the elapsed time for the chosen query evaluation plan (QEP) in seconds.
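The estimated result sizes reported in the table can be reproduced from the statistics above. This sketch hard-codes the join orders and the effective cardinalities; the "standard" algorithm is modeled by keeping the unreduced column cardinalities, and ELS by reducing them, which are simplifying assumptions for illustration:

```python
import math

# Column cardinalities from the experiment.
dcol = {"S": 1000, "M": 10000, "B": 50000, "G": 100000}

# After predicate transitive closure, every table carries a local
# predicate col < 100; under uniformity each table keeps 100 tuples.
eff_card = {t: 100 for t in dcol}
# ELS also reduces the column cardinalities (to about 100 distinct
# values); the standard algorithm keeps the original dcol values.
eff_dcol = {t: 100 for t in dcol}

def estimate(order, rule, cols):
    """Incremental estimates along a join order. All join columns are in
    one equivalence class, so every already-joined table contributes an
    eligible predicate; `rule` combines the eligible selectivities."""
    sizes, joined, size = [], [order[0]], eff_card[order[0]]
    for nxt in order[1:]:
        eligible = [1.0 / max(cols[t], cols[nxt]) for t in joined]
        size *= eff_card[nxt] * rule(eligible)
        sizes.append(size)
        joined.append(nxt)
    return sizes

print(estimate(["S", "B", "M", "G"], math.prod, dcol))  # SM:  ~(0.2, 4e-8, 4e-21)
print(estimate(["S", "B", "M", "G"], min, dcol))        # SSS: ~(0.2, 4e-4, 4e-7)
print(estimate(["B", "G", "M", "S"], max, eff_dcol))    # ELS: ~(100, 100, 100)
```

The first two runs reproduce the severe underestimates of Algorithms SM and SSS, while the ELS run keeps every intermediate estimate at the correct value of 100.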
Note that all the QEPs were executed using the same buffer size, and that the access methods and join methods did not differ between the QEPs. Both Nested Loops and Sort Merge join methods were used in the QEPs.

Query         Algorithm  Join Order       Estimated Result Sizes           Time (s)
Original      SM         S ⋈ M ⋈ B ⋈ G    (100, 100, 100)                  610
Orig. + PTC   SM         S ⋈ B ⋈ M ⋈ G    (0.2, 4 × 10^-8, 4 × 10^-21)     472
Orig. + PTC   SSS        S ⋈ B ⋈ M ⋈ G    (0.2, 4 × 10^-4, 4 × 10^-7)      472
Original      ELS        B ⋈ G ⋈ M ⋈ S    (100, 100, 100)                  50

We see that there is an improvement in query execution time when predicate transitive closure is used (see the first two rows in the above table). Predicate transitive closure is important because the additional predicates permit early selection and greater flexibility in the selection of the join order. This may result in the generation of better QEPs. However, using predicate transitive closure is not enough. After predicate transitive closure is used, the QEPs chosen differ in the join order. This is due to the different estimates obtained for the intermediate join result sizes. The QEP chosen by ELS runs 9 to 12 times faster than the other QEPs. The correct join result size after any subset of joins has been performed can be shown to be exactly 100. We can see in the second and third rows of the table that Algorithm SM and Algorithm SSS greatly underestimate the join result sizes. The optimizer estimates that the size of the 3-table composite result involving tables S, M, and B is extremely small. This misleads the optimizer into believing that table G will be accessed very infrequently. Thus, the optimizer puts table G last in the join order, which in turn leads to a poor QEP.

9 Summary

Accurate estimation of join result sizes is crucial in query optimization because the estimates have a significant influence on the query plans chosen. This paper described an algorithm (called ELS) for the correct incremental estimation of join result sizes.
For each table, estimates of the table cardinality and the column cardinality of each column are computed, while taking into account the effect of local predicates and the case when two or more join columns in a table belong to the same equivalence class. These estimates are then used in the computation of join selectivities. For any intermediate result, Algorithm ELS chooses a correct subset of the eligible join predicates to determine the result size, while taking into account any dependencies between eligible join predicates. We proved the correctness of Algorithm ELS. We have shown that using the algorithm can make a significant difference in query execution time (possibly an order of magnitude or more).

Future work involves relaxing the assumptions that are used in this paper and in most of the work on join size estimation. We have already relaxed the uniformity assumption in the case of local predicates. Relaxing the assumption in the case of join predicates would enable query optimizers to account for important data distributions such as the Zipfian distribution. The independence assumption also needs to be relaxed. All published work on estimation has dealt with the case of a single SQL query block; estimation in nested SQL queries is a much harder problem. Work also needs to be done to extend the results in this paper to queries involving disjunctions.

Acknowledgements

We wish to thank Paul Bird, Philip Blair, Peter Haas, Richard Hedges, Bruce Lindsay, Guy Lohman, Sheila Richardson, Pat Selinger and Lori Strain for their comments on various drafts of this paper.

References

1. S. Christodoulakis. Estimating Block Transfers and Join Sizes. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 40-54, 1983.
2. S. Christodoulakis. Implications of Certain Assumptions in Database Performance Evaluation. ACM Transactions on Database Systems, 9(2):163-186, June 1984.
3. C. Faloutsos and H. V. Jagadish.
On B-tree Indices for Skewed Distributions. In Proceedings of the Eighteenth International Conference on Very Large Data Bases, pages 363-374, Vancouver, British Columbia, 1992. Morgan Kaufmann.
4. Y. E. Ioannidis and S. Christodoulakis. On the Propagation of Errors in the Size of Join Results. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 268-277, Denver, Colorado, 1991.
5. Y. C. Kang. Randomized Algorithms for Query Optimization. PhD thesis, University of Wisconsin-Madison, October 1991. TR 1053.
6. C. A. Lynch. Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distributions of Column Values. In Proceedings of the Fourteenth International Conference on Very Large Data Bases, pages 240-251, Los Angeles, USA, 1988. Morgan Kaufmann.
7. M. V. Mannino, P. Chu, and T. Sager. Statistical Profile Estimation in Database Systems. ACM Computing Surveys, 20(3):191-221, September 1988.
8. M. Muralikrishna and D. J. DeWitt. Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 28-36, Chicago, Illinois, 1988.
9. K. Ono and G. M. Lohman. Measuring the Complexity of Join Enumeration in Query Optimization. In Proceedings of the Sixteenth International Conference on Very Large Data Bases, pages 314-325, Brisbane, Australia, 1990. Morgan Kaufmann.
10. G. Piatetsky-Shapiro and C. Connell. Accurate Estimation of the Number of Tuples Satisfying a Condition. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 256-276, 1984.
11. H. Pirahesh, J. Hellerstein, and W. Hasan. Extensible/Rule Based Query Rewrite Optimization in Starburst. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 39-48, San Diego, California, 1992.
12. A. Rosenthal. Note on the Expected Size of a Join. ACM-SIGMOD Record, pages 19-25, July 1981.
13. P. G. Selinger, M. M.
Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access Path Selection in a Relational Database Management System. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 23-34, 1979.
14. A. Swami. Optimization of Large Join Queries. PhD thesis, Stanford University, June 1989. STAN-CS-89-1262.
15. A. Swami and B. Iyer. A Polynomial Time Algorithm for Optimizing Join Queries. In Proceedings of IEEE Data Engineering Conference, pages 345-354. IEEE Computer Society, April 1993.
16. A. Swami and K. B. Schiefer. On the Estimation of Join Result Sizes. Technical report, IBM Research Division, October 1993. IBM Research Report RJ 9569.
17. G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, MA, 1949.