ADB Slides 4


Query Optimization

Using Heuristics in Query Optimization

● Process for heuristic optimization


○ The parser of a high-level query generates an initial internal
representation.
○ Heuristic rules are applied to optimize the internal representation.
○ A query execution plan is generated to execute groups of operations
based on the access paths available on the files involved in the query.

● The main heuristic is to apply first the operations that reduce the size
of intermediate results.
○ E.g., Apply SELECT and PROJECT operations before applying the
JOIN or other binary operations.
Example

● Example:
• For every project located in ‘Stafford’, retrieve the project number, the
controlling department number and the department manager’s last name,
address and birthdate.
● Relational algebra:

● SQL query:
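(The algebra and SQL bodies are images in the original slides. A plausible
formulation, assuming the standard COMPANY schema with PROJECT(Pnumber,
Plocation, Dnum), DEPARTMENT(Dnumber, Mgr_ssn) and EMPLOYEE(Ssn, Lname,
Address, Bdate):

π_{Pnumber, Dnum, Lname, Address, Bdate}
  (((σ_{Plocation='Stafford'}(PROJECT)) ⨝_{Dnum=Dnumber} DEPARTMENT)
    ⨝_{Mgr_ssn=Ssn} EMPLOYEE)

SELECT P.Pnumber, P.Dnum, E.Lname, E.Address, E.Bdate
FROM PROJECT P, DEPARTMENT D, EMPLOYEE E
WHERE P.Dnum = D.Dnumber AND D.Mgr_ssn = E.Ssn
AND P.Plocation = 'Stafford';)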
Example (cont’d)
Selection and projections

Cascading of selections
○ σ_{c1 ∧ c2 ∧ ... ∧ cn}(R) ≡ σ_{c1}(σ_{c2}(...(σ_{cn}(R))...))
Commutativity
○ σ_{c1}(σ_{c2}(R)) ≡ σ_{c2}(σ_{c1}(R))
Cascading of projections
○ π_{a1}(R) ≡ π_{a1}(π_{a2}(...(π_{an}(R))...))
○ iff a_i ⊆ a_{i+1}, i = 1, 2, ..., n − 1
Cartesian products and joins

Commutativity
○ R × S ≡ S × R
○ R ⨝ S ≡ S ⨝ R
Associativity
○ R × (S × T) ≡ (R × S) × T
○ R ⨝ (S ⨝ T) ≡ (R ⨝ S) ⨝ T
Their combination
○ R ⨝ (S ⨝ T) ≡ R ⨝ (T ⨝ S) ≡ (R ⨝ T) ⨝ S ≡ (T ⨝ R) ⨝ S
Other operations

Selection-projection commutativity
○ π_a(σ_c(R)) ≡ σ_c(π_a(R))
○ iff every attribute in c is included in the set of attributes a
Combination (join definition)
○ σ_c(R × S) ≡ R ⨝_c S
Selection-Cartesian/join commutativity
○ σ_c(R × S) ≡ σ_c(R) × S
○ iff the attributes in c appear only in R and not in S
Selection distribution/replacement
○ σ_c(R ⨝ S) ≡ σ_{c1 ∧ c2}(R ⨝ S) ≡ σ_{c1}(σ_{c2}(R ⨝ S)) ≡ σ_{c1}(R) ⨝ σ_{c2}(S)
○ iff c ≡ c1 ∧ c2, where c1 is relevant only to R and c2 is relevant only to S
Other operations (cont.)

Projection-Cartesian product commutativity
○ π_a(R × S) ≡ π_{a1}(R) × π_{a2}(S)
○ iff a1 is the subset of attributes in a appearing in R and a2 is the subset
of attributes in a appearing in S
Projection-join commutativity
○ π_a(R ⨝_c S) ≡ π_{a1}(R) ⨝_c π_{a2}(S)
○ iff same as before and every attribute in c appears in a
Attribute elimination
○ π_a(R ⨝_c S) ≡ π_a(π_{a1}(R) ⨝_c π_{a2}(S))
○ iff a1 is the subset of attributes in R appearing in either a or c and a2 is
the subset of attributes in S appearing in either a or c
Query Optimization using Equivalence Rules

1. Break up any select operations with conjunctive conditions into a cascade of
select operations.
2. Move each select operation as far down the query tree as is permitted by the
attributes involved in the select condition.
3. Rearrange the leaf nodes of the tree so that the leaf node relations with the
most restrictive select operations are executed first in the query tree
representation.
4. Combine a Cartesian product operation with a subsequent select operation in
the tree into a join operation.
5. Break down and move lists of projection attributes down the tree as far as
possible by creating new project operations as needed.
6. Identify subtrees that represent groups of operations that can be executed by
a single algorithm.
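
Illustration (not on the original slides; it assumes the COMPANY schema of the
earlier 'Stafford' example): starting from the canonical tree

π_{Pnumber, Dnum, Lname, Address, Bdate}
  (σ_{Plocation='Stafford' ∧ Dnum=Dnumber ∧ Mgr_ssn=Ssn}
    ((PROJECT × DEPARTMENT) × EMPLOYEE))

steps 1-5 push the selection down and turn the selection-Cartesian product
pairs into joins:

π_{Pnumber, Dnum, Lname, Address, Bdate}
  (((σ_{Plocation='Stafford'}(PROJECT)) ⨝_{Dnum=Dnumber} DEPARTMENT)
    ⨝_{Mgr_ssn=Ssn} EMPLOYEE)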
Query Execution Plans

• An execution plan for a relational algebra query consists of a combination
of the relational algebra query tree and information about the access
methods to be used for each relation, as well as the methods to be used in
computing the relational operators stored in the tree.
• Materialized evaluation: the result of an operation is stored as a
temporary relation.
• Pipelined evaluation: as the result of an operator is produced, it is
forwarded to the next operator in sequence.
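
A minimal sketch (not from the slides) of the two evaluation modes using
Python generators; the relation, predicate and operators are made up for
illustration:

# A scan leaf and two pipelined operators; in pipelined evaluation each tuple
# flows through the whole operator chain without intermediate storage.
def scan(relation):
    for tup in relation:
        yield tup

def select(child, predicate):
    for tup in child:
        if predicate(tup):
            yield tup

def project(child, attrs):
    for tup in child:
        yield {a: tup[a] for a in attrs}

employees = [{"ssn": 1, "salary": 40}, {"ssn": 2, "salary": 90}]

# Pipelined evaluation: no temporary relations are created.
plan = project(select(scan(employees), lambda t: t["salary"] > 50), ["ssn"])
print(list(plan))                                    # [{'ssn': 2}]

# Materialized evaluation: each operator's result is stored before the next runs.
temp1 = list(select(scan(employees), lambda t: t["salary"] > 50))
temp2 = list(project(scan(temp1), ["ssn"]))
print(temp2)                                         # [{'ssn': 2}]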
Using Selectivity and Cost Estimates in
Query Optimization
● Cost-based query optimization:
○ Estimate and compare the costs of executing a query using
different execution strategies and choose the strategy with the
lowest cost estimate.
○ More suitable for compiled queries (e.g., stored procedures).
○ (Compare to heuristic query optimization)

● Issues
○ Cost function
○ Number of execution strategies to be considered
Cost estimation
A plan is a tree of operators
There are two parts to estimating the cost of a plan:
○ For each node, estimate the cost of performing the corresponding operation
○ For each node, estimate the size of the result and any properties it
might have (e.g., sorted)
Combine the estimates to produce an estimate for the entire plan
We have seen various storage methods and algorithms
○ And we know the cost of using each one, depending on the input cardinality
The problem is estimating the output cardinality of the operations
○ Namely, of selections and joins
Selectivity factor

The maximum number of tuples in the result of any query is the product of
the cardinalities of the participating relations
Every predicate in the where-clause eliminates some of these potential results
The selectivity factor of a single predicate is the ratio of the expected
result size to the maximum result size
The total result size is estimated as the maximum size times the product of
the selectivity factors
○ E.g., joining relations of 1,000 and 2,000 tuples under two predicates
with selectivity factors 0.1 and 0.01 yields an estimate of
1,000 × 2,000 × 0.1 × 0.01 = 2,000 tuples
Various selectivity factors

column = value → 1 / #keys(column)
○ Assumes a uniform distribution of the values
○ Is itself an approximation
column1 = column2 → 1 / max(#keys(column1), #keys(column2))
○ Each value in column1 has a matching value in column2; given a value
in column1, the predicate is just a selection
○ Again, an approximation
Various selectivity factors (cont.)

column > value → (high(column) − value) / (high(column) − low(column))
value1 < column < value2 → (value2 − value1) / (high(column) − low(column))
column IN list → (number of items in the list) × (s.f. of column = value)
column IN sub-query → ratio of the subquery's estimated size to the
number of keys in column
NOT (predicate) → 1 − (s.f. of predicate)
P1 ∨ P2 → f_P1 + f_P2 − f_P1 · f_P2
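
A minimal sketch (not from the slides) of these formulas in Python, assuming
simple per-column statistics; the names and figures are made up:

from dataclasses import dataclass

@dataclass
class ColumnStats:
    n_keys: int      # number of distinct values (#keys)
    low: float       # minimum value
    high: float      # maximum value

def sf_eq_value(col):
    # column = value, under the uniform distribution assumption
    return 1.0 / col.n_keys

def sf_eq_columns(c1, c2):
    # column1 = column2
    return 1.0 / max(c1.n_keys, c2.n_keys)

def sf_greater(col, value):
    # column > value
    return (col.high - value) / (col.high - col.low)

def sf_range(col, v1, v2):
    # value1 < column < value2
    return (v2 - v1) / (col.high - col.low)

def sf_in_list(col, n_items):
    # column IN list
    return n_items * sf_eq_value(col)

def sf_not(sf):
    return 1.0 - sf

def sf_or(f1, f2):
    # P1 ∨ P2, assuming independent predicates
    return f1 + f2 - f1 * f2

# Example: a salary column spanning 20,000..120,000 with 500 distinct values
salary = ColumnStats(n_keys=500, low=20_000, high=120_000)
print(sf_range(salary, 30_000, 60_000))   # 0.3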
Key assumptions made

The values across columns are uncorrelated
The values in a single column follow a uniform distribution
Both of these assumptions rarely hold
The first assumption is hard to lift
○ Only recently have researchers started tackling the problem
The uniform distribution assumption can be lifted with better statistical
methods
○ In our case, histograms
Histograms

Elegant data structures to capture value distributions
○ Not affected by the uniform distribution assumption (though this is not
entirely true)
They offer trade-offs between size and accuracy
○ The more memory is dedicated to a histogram, the more accurate it is
○ But also, the more expensive it is to manipulate
Two basic classes: equi-width and equi-depth
Desirable histogram properties

Small
○ Typically, a DBMS will allocate a single page for a histogram!
Accurate
○ Typically, less than 5% error
Fast access
○ Single-lookup access and simple algorithms
Equi-width histogram construction
[Figure: the true distribution over values 0-14 and its equi-width
approximation; the frequency axis runs from 0 to 9.00]

The total range is divided into sub-ranges of equal width
Each sub-range becomes a bucket
The total number of tuples in each bucket is stored

min   max   count
0     2     8
3     5     4
6     8     15
9     11    3
12    14    15
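
A minimal construction sketch (not from the slides), in Python; the data set
is made up so that the resulting buckets match the table above:

def build_equiwidth(values, lo, hi, n_buckets):
    """Divide [lo, hi] into n_buckets sub-ranges of (nearly) equal width
    and count the tuples that fall into each one."""
    width = (hi - lo + 1) // n_buckets
    buckets = []
    for i in range(n_buckets):
        b_lo = lo + i * width
        b_hi = hi if i == n_buckets - 1 else b_lo + width - 1
        count = sum(1 for v in values if b_lo <= v <= b_hi)
        buckets.append((b_lo, b_hi, count))
    return buckets

# Made-up data over the domain 0..14, five buckets of width 3:
data = ([0] * 3 + [1] * 2 + [2] * 3 + [4] * 4 + [6] * 5 + [7] * 6
        + [8] * 4 + [9] * 3 + [12] * 7 + [13] * 8)
print(build_equiwidth(data, 0, 14, 5))
# [(0, 2, 8), (3, 5, 4), (6, 8, 15), (9, 11, 3), (12, 14, 15)]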
Equi-width histogram estimation
To estimate the output cardinality of a range query:
○ The starting bucket is identified
○ The histogram is then scanned forward until the ending bucket is identified
○ The numbers of tuples in the buckets of the range are summed
○ Within each bucket the uniform distribution assumption is made

[Figure and bucket table as on the previous slide]

Example, for 6 ≤ v ≤ 10: (3/3) · 15 + (2/3) · 3 = 17
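
A minimal estimation sketch (not from the slides), operating on the
(min, max, count) buckets built above:

def estimate_range(buckets, v_lo, v_hi):
    """Estimate the number of tuples with v_lo <= v <= v_hi."""
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap_lo = max(b_lo, v_lo)
        overlap_hi = min(b_hi, v_hi)
        if overlap_lo > overlap_hi:
            continue                    # bucket entirely outside the range
        width = b_hi - b_lo + 1         # distinct values covered by the bucket
        fraction = (overlap_hi - overlap_lo + 1) / width
        total += fraction * count       # uniformity assumption within the bucket
    return total

buckets = [(0, 2, 8), (3, 5, 4), (6, 8, 15), (9, 11, 3), (12, 14, 15)]
print(estimate_range(buckets, 6, 10))   # (3/3)*15 + (2/3)*3 = 17.0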
Equi-depth histogram construction and estimation
[Figure: the same true distribution and its equi-depth approximation]

The total range is divided into sub-ranges so that the number of tuples in
each range is (approximately) equal
Each sub-range becomes a bucket
The same schema as in equi-width histograms is used
In fact, the same algorithm is used for estimation (!)

min   max   count
0     3     8
4     7     10
8     9     10
10    13    7
14    14    9

Example, for 6 ≤ v ≤ 10: (2/4) · 10 + (2/2) · 10 + (1/4) · 7 ≈ 17
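
Only construction differs; estimation can reuse estimate_range() unchanged.
A minimal sketch (not from the slides); the exact boundaries depend on
tie-breaking, so real systems usually derive them from a sorted sample:

def build_equidepth(values, n_buckets):
    """Choose bucket boundaries so that each bucket holds roughly the same
    number of tuples, never splitting equal values across buckets."""
    values = sorted(values)
    per_bucket = len(values) / n_buckets
    buckets, start = [], 0
    for i in range(n_buckets):
        end = len(values) if i == n_buckets - 1 else int(round((i + 1) * per_bucket))
        while end < len(values) and values[end] == values[end - 1]:
            end += 1                    # keep duplicates in the same bucket
        if start >= end:
            continue                    # bucket emptied by the adjustment
        buckets.append((values[start], values[end - 1], end - start))
        start = end
    return buckets

print(build_equidepth([1, 1, 2, 3, 5, 5, 5, 8, 9, 9], 3))
# [(1, 2, 3), (3, 5, 4), (8, 9, 3)]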
Using Selectivity and Cost Estimates in
Query Optimization
● Catalog Information Used in Cost Functions
• Information about the size of a file
• number of records (tuples) (r),
• record size (R),
• number of blocks (b)
• blocking factor (bfr)
• Information about indexes and indexing attributes of a file
• Number of levels (x) of each multilevel index
• Number of first-level index blocks (bI1)
• Number of distinct values (d) of an attribute
• Selectivity (sl) of an attribute
• Selection cardinality (s) of an attribute. (s = sl * r)
Using Selectivity and Cost Estimates in
Query Optimization (Cont’d)
● Examples of cost functions for SELECT (in terms of block transfers)
● S1. Linear search (brute force) approach
• CS1a = b
• For an equality condition on a key, CS1a = (b/2) on average if the record
is found; otherwise CS1a = b.
● S2. Binary search:
• CS2 = log2 b + ⌈s/bfr⌉ − 1
• For an equality condition on a unique (key) attribute, CS2 = log2 b
• where s is the selection cardinality
● S3. Using a primary index (S3a) or hash key (S3b) to retrieve a single record
• CS3a = x + 1; CS3b = 1 for static or linear hashing;
• CS3b = 2 for extendible hashing;
• where x is the number of index levels
Using Selectivity and Cost Estimates in
Query Optimization (Cont’d)
● Examples of cost functions for SELECT (cont'd)
● S4. Using an ordering index to retrieve multiple records:
• For a comparison condition on a key field with an ordering index,
• CS4 = x + (b/2)
● S5. Using a clustering index to retrieve multiple records:
• CS5 = x + ⌈s/bfr⌉
● S6. Using a secondary (B+-tree) index:
• For an equality comparison, CS6a = x + s;
• For a comparison condition such as >, <, >=, or <=,
• CS6b = x + (bI1/2) + (r/2)
Using Selectivity and Cost Estimates in
Query Optimization (Cont’d)
● Examples of cost functions for SELECT (cont'd)
● S7. Conjunctive selection:
• Use either S1 or one of the methods S2 to S6.
• For the latter case, use one condition to retrieve the records and then
check in the memory buffer whether each retrieved record satisfies the
remaining conditions in the conjunction.
● S8. Conjunctive selection using a composite index:
• Same as S3a, S5 or S6a, depending on the type of index.
Example of Cost Functions for SELECT

● The EMPLOYEE file has r = 10,000 records stored in b = 2,000 blocks with
bfr = 5 records/block, and the following access paths:
• A clustering index on Salary with xSalary = 3 levels and average
selection cardinality sSalary = 20.
• A secondary index on the key attribute Ssn with xSsn = 4 (sSsn = 1).
● For the operation σSsn=123456789(EMPLOYEE):
● Using S1: Cost = b/2 = 1,000
● Using S6a: Cost = xSsn + sSsn = 4 + 1 = 5
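
A minimal sketch (not from the slides) of the two cost functions used here,
reproducing the numbers above:

def cost_linear_search(b, equality_on_key=False):
    # S1: scan all b blocks; on average half of them if we can stop at a
    # unique match
    return b / 2 if equality_on_key else b

def cost_secondary_index_eq(x, s):
    # S6a: descend x index levels, then fetch the s matching records
    return x + s

b = 2000                       # blocks in the EMPLOYEE file
x_ssn, s_ssn = 4, 1            # secondary index on the key attribute Ssn
print(cost_linear_search(b, equality_on_key=True))   # 1000.0
print(cost_secondary_index_eq(x_ssn, s_ssn))         # 5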
Cost-based query optimisation

The paradigm employed is cost-based query optimisation
○ Simply put: enumerate alternative plans, estimate the cost of each plan,
and pick the plan with the minimum cost
For cost-based optimisation, we need a cost model
○ Since what "hurts" performance is I/O, the cost model should use I/O as
its basis
○ Hence, the cardinality-based cost model
▪ Cardinality is the number of tuples in a relation
Plan enumeration

Plan enumeration consists of two parts (again, not necessarily independent
from one another)
○ Access method selection (i.e., what is the best way to access a relation
that appears in the query?)
○ Join enumeration (i.e., what is the best algorithm to join two relations,
and when should we apply it?)
Access methods, join algorithms and their various combinations define a
search space
○ The search space can be huge
○ Plan enumeration is the exploration of this search space
Search space exploration

As was stated, the search space is huge
○ Exhaustive exploration is out of the question
○ It could be the case that exploring the search space takes longer than
actually evaluating the query
○ The way in which we explore the search space describes a query
optimisation method
▪ Dynamic programming, rule-based optimisation, randomised exploration, ...
Types of plan

[Figure: operator trees illustrating the three shapes: left-deep, right-deep
and bushy]

There are two types of plan, according to their shape
○ Deep (left or right)
○ Bushy
Different shapes for different objectives
Just an idea . . .

A query over five relations, only one access method, only one join
algorithm, only left-deep plans
○ Remember, cost(R ⨝ S) ≠ cost(S ⨝ R)
○ So, the number of possible plans is 5! = 120
○ If we add one extra access method, the number of possible plans becomes
2^5 · 5! = 3840
○ If we add one extra join algorithm, the number of possible plans becomes
2^4 · 2^5 · 5! = 61440
Cardinality-based cost model

A cardinality-based cost model means we need good ways of doing the following
○ Using cardinalities to estimate costs (e.g., accurate cost functions)
○ Estimating output cardinalities after we apply certain operations (e.g.,
after a selection the cardinality will change; it will not change after a
projection)
▪ Because these output cardinalities will be used as inputs to the cost
functions of other operations
Rule-based optimisation

Basically an issue of if-then rules
○ If (condition list), then apply some transformation to the plan
constructed so far
▪ Estimate the cost of the new plan; keep it only if it is cheaper than
the original
○ The order in which the rules are applied is significant
○ As a consequence, rules are applied by precedence
▪ For instance, pushing down selections is given high precedence
▪ Combining two relations with a Cartesian product is given low precedence
Randomised exploration

Mostly useful in big queries (more than 15 joins or so)
The problem is one of exploring a bigger portion of the search space
○ So, every once in a while the optimiser "jumps" to some other part of the
search space with some probability
○ As a consequence, it gets to explore parts of the search space it would
not have explored otherwise
Dynamic programming

In the beginning, there was System R, which had an optimiser
System R's optimiser used dynamic programming
○ An efficient way of exploring the search space
Heuristics: use the equivalence rules to push down selections and
projections, and delay Cartesian products
○ Minimise the input cardinality to, and the memory requirements of, the joins
Constraints: left-deep plans, nested-loops and sort-merge joins only
○ Left-deep plans took better advantage of pipelining
○ Hash join had not been developed back then
Dynamic programming steps

Identify the cheapest way to access every single relation in the query,
applying local predicates
○ For every relation, keep the cheapest access method overall and the
cheapest access method for an interesting order
For every access method, and for every join predicate, find the cheapest way
to join in a second relation
○ For every join result, keep the cheapest plan overall and the cheapest
plan in an interesting order
Join in the rest of the relations using the same principle
Selinger Optimizer
Example: optjoin(ABCD); assume all joins are NL

∂ = 1
A = best way to access A
(e.g., sequential scan, or predicate pushdown into index...)
B = best way to access B
C = best way to access C
D = best way to access D

Cache:
Subplan   Best choice   Cost
A         index         100
B         seq scan      50
…

Total cost computations: choose(N,1), where N is the number of relations
Example (cont'd): optjoin(ABCD)

∂ = 2
{A,B} = AB or BA
(using the previously computed best ways to access A and B)
{B,C} = BC or CB
{C,D} = CD or DC
{A,C} = AC or CA
{A,D} = AD or DA
{B,D} = BD or DB

Cache:
Subplan   Best choice   Cost
A         index         100
B         seq scan      50
{A,B}     BA            156
{B,C}     BC            98
…

Total cost computations: choose(N,2) x 2
Example (cont'd): optjoin(ABCD)

∂ = 3
{A,B,C} = remove A, compare A({B,C}) to ({B,C})A
          remove B, compare B({A,C}) to ({A,C})B
          remove C, compare C({A,B}) to ({A,B})C
{A,B,D} = remove A, compare A({B,D}) to ({B,D})A
          ...
{A,C,D} = ...
{B,C,D} = ...
The best two-relation subplans were already computed; look them up in the cache

Cache:
Subplan   Best choice   Cost
A         index         100
B         seq scan      50
{A,B}     BA            156
{B,C}     BC            98
{A,B,C}   BCA           125
{B,C,D}   BCD           115
…

Total cost computations: choose(N,3) x 3 x 2
Example (cont'd): optjoin(ABCD)

∂ = 4
{A,B,C,D} = remove A, compare A({B,C,D}) to ({B,C,D})A
            remove B, compare B({A,C,D}) to ({A,C,D})B
            remove C, compare C({A,B,D}) to ({A,B,D})C
            remove D, compare D({A,B,C}) to ({A,B,C})D
The best three-relation subplans were already computed; look them up in the cache
The final answer is the plan with the minimum cost of these four

Cache:
Subplan     Best choice   Cost
A           index         100
B           seq scan      50
{A,B}       BA            156
{B,C}       BC            98
{A,B,C}     BCA           125
{B,C,D}     BCD           115
{A,B,C,D}   ABCD          215

Total cost computations: choose(N,4) x 4 x 2
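
A minimal sketch (not from the slides) of this dynamic program. The access
costs and the join cost function are hypothetical stand-ins; a real optimiser
would plug in the cost functions and cardinality estimates discussed earlier:

from itertools import combinations

access_cost = {"A": 100, "B": 50, "C": 80, "D": 60}   # made-up best access costs

def join_cost(left_plan, right_relation):
    # stand-in for the cost of an NL join of a subplan with one more relation
    return left_plan["cost"] + access_cost[right_relation] // 2

def optjoin(relations):
    best = {}                                   # the cache: frozenset -> best plan
    for r in relations:                         # d = 1: best single-relation accesses
        best[frozenset(r)] = {"order": r, "cost": access_cost[r]}
    for size in range(2, len(relations) + 1):   # d = 2 .. N
        for subset in combinations(relations, size):
            key, candidates = frozenset(subset), []
            for r in subset:                    # remove one relation in turn
                sub = best[key - {r}]           # already computed: cache lookup
                candidates.append({"order": sub["order"] + r,
                                   "cost": join_cost(sub, r)})
            best[key] = min(candidates, key=lambda p: p["cost"])
    return best[frozenset(relations)]

print(optjoin("ABCD"))                          # cheapest left-deep join order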


Complexity

choose(n,1) + choose(n,2) + … + choose(n,n) total subsets considered
All subsets of a size-n set (the power set) = 2^n subsets
○ Equivalent to computing all binary strings of size n:
000, 001, 010, 100, 011, 101, 110, 111
○ Each bit represents whether an item is in or out of the set
Complexity (continued)
For each subset, there are
○ k ways to remove one join, with k < n
○ m ways to join one relation with the remainder
Total cost: O(n · m · 2^n) plan evaluations
○ For n = 20, m = 2: about 4.1 × 10^7
Interesting Orders
• Some queries need data in sorted order
– Some plans produce sorted data (e.g., using an index scan or merge join)
• May be a non-optimal way to join the data, but the overall optimal plan
– Avoids a final sort
• In the cache, maintain the best overall plan, plus the best plan for each
interesting order
• At the end, compare the cost of
best plan + sort into order
to
best in-order plan
• Increases complexity by a factor of k+1, where k is the number of
interesting orders
Example

SELECT A.f3, B.f2 FROM A, B WHERE A.f3 = B.f4
ORDER BY A.f3

Subplan   Best choice   Cost   Best in A.f3 order   Cost
A         index         100    index                100
B         seq scan      50     seq scan             50
{A,B}     BA hash       156    AB merge             180

Compare:
cost(sort(output)) + 156
to
180
Overview of Query Optimization in
Oracle
● Oracle DBMS V8
• Rule-based query optimization: the optimizer chooses execution plans
based on heuristically ranked operations.
• (Currently being phased out)
• Cost-based query optimization: the optimizer examines alternative access
paths and operator algorithms and chooses the execution plan with the
lowest estimated cost.
• The query cost is calculated based on the estimated usage of resources
such as I/O, CPU and memory.
• Application developers can specify hints to the Oracle query optimizer.
• The idea is that an application developer might know more about the data
than the optimizer does.
Semantic Query Optimization

● Semantic Query Optimization:
• Uses constraints specified on the database schema in order to modify one
query into another query that is more efficient to execute.
● Consider the following SQL query:
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE E, EMPLOYEE M
WHERE E.SUPERSSN=M.SSN AND E.SALARY>M.SALARY
● Explanation:
• Suppose we had a constraint on the database schema stating that no
employee can earn more than his or her direct supervisor. If the semantic
query optimizer checks for the existence of this constraint, it need not
execute the query at all, because it knows that the result of the query
will be empty. Techniques known as theorem proving can be used for this
purpose.
