ADB Slides 4
ADB Slides 4
ADB Slides 4
● The main heuristic is to apply first the operations that reduce the size
of intermediate results.
○ E.g., Apply SELECT and PROJECT operations before applying the
JOIN or other binary operations.
Example
l Example:
• For every project located in ‘Stafford’, retrieve the project number, the
controlling department number and the department manager’s last name,
address and birthdate.
l Relation algebra:
l SQL query:
Example (cont’d)
Selection and projections
Cascading of selections
I σc1∧c2∧...∧cn ( R ) ≡ σ c1 (σ c2 (. ..(σ cn ( R ))))
Commutativity
I σ c1 (σ c2 ( R )) ≡ σ c2 (σ c1 ( R ))
Cascading of projections
I π a1 ( R ) ≡ π a1 (π a2 (. ..(π an ( R )) ...)
I iff ai ⊆ ai+1, i = 1,2, ...n − 1
Cartesian products and joins
Commutativity
I R ×S ≡ S×R
I R⨝S ≡S⨝R
Assosiativity
I R× ( S × T ) ≡ ( R × S ) ×T
I R⨝(S ⨝T )≡(R⨝S )⨝T
Their combination
I R ⨝ ( S ⨝T )≡ R ⨝ ( T ⨝S )≡( R ⨝T )⨝S
≡( T ⨝R ) ⨝S
Other operations
Selection-projection commutativity
I πa ( σc (R)) ≡ σc ( πa (R))
I iff every attribute in c is included in the set of attributes a
Combination (join definition)
I σc ( R × S ) ≡ R ⨝ c S
Selection-Cartesian/join commutativity
I σc ( R × S ) ≡ σc (R) × S
I iff the attributes in c appear only in R and not in S
Selection distribution/replacement
I σc (R ⨝ S) ≡ σ c1 ∧ c2 ( R ⨝ S ) ≡ σc1 ( σc2 ( R ⨝ S )) ≡
σc1 (R) ⨝ σc2 (S)
I iff c1 is relevant only to R and c2 is relevant only to S
Other operations (cont.)
● Issues
○ Cost function
○ Number of execution strategies to be considered
Cost estimation
A plan is a tree of operators
Two parts to estimating the cost of a plan
I For each node, estimate the cost of performing the corresponding operation
I For each node, estimate the size of the result and any properties it might have
(e.g., sorted)
Combine the estimates and produce an estimate for the entire plan
We have seen various storage methods and algorithms
I And know the cost of using each one, depending on the input
cardinality
The problem is estimating the output cardinality of the operations
I Namely, selections and joins
Selectivity factor
column = value → 1
#keys(column)
I Assumes uniform distribution in the values
I Is itself an approximation
column1 = column2 → 1
max(#keys(column1),#keys(column2))
I Each value in column1 has a matching value in column2; given a value
in column1, the predicate is just a selection
I Again, an approximation
Various selectivity factors (cont.)
(high(column)−value)
column > value → (high(column)−low (column))
(value2−value1)
value1 < column < value2 → (high(column)−low (column))
column in list → number of items in list times s.f. of column = value
column in sub-query → ratio of subquery’s estimated size to the
number of keys in column
not (predicate) → 1 - (s.f. of predicate)
P1 ∨P2 → fP1 + fP2 − fP1 · fP2
Key assumptions made
Small
I Typically, a DBMS will allocate a single page for a histogram!
Accurate
I Typically, less than 5% error
Fast access
I Single lookup access and simple algorithms
Equi-width histogram construction
True distribution Distribution approximation
9.00
6.75
identified 4.50
I The histogram is then
scanned forward until the 2.25
assumption is made
9 11 3
12 14 15
6 ≤ v ≤ 10: 3
3
· 15 + 2
3
· 3 = 17
Equi-depth histogram construction and estimation
The total range is divided into True distribution Distribution approximation
6 ≤ v ≤ 10: 10
14
13
14
7
9
2 · 10 + 2 · 10 + 1 · 7 ≈ 17
4 2 4
Using Selectivity and Cost Estimates in
Query Optimization
l Catalog Information Used in Cost Functions
• Information about the size of a file
• number of records (tuples) (r),
• record size (R),
• number of blocks (b)
• blocking factor (bfr)
• Information about indexes and indexing attributes of a file
• Number of levels (x) of each multilevel index
• Number of first-level index blocks (bI1)
• Number of distinct values (d) of an attribute
• Selectivity (sl) of an attribute
• Selection cardinality (s) of an attribute. (s = sl * r)
Using Selectivity and Cost Estimates in
Query Optimization (Cont’d)
l Examples of Cost Functions for SELECT (in terms of block transfers)
l S1. Linear search (brute force) approach
• CS1a = b;
• For an equality condition on a key, CS1a = (b/2) if the record is found;
otherwise CS1a = b.
l S2. Binary search:
• CS2 = log2b + (s/bfr) –1
• For an equality condition on a unique (key) attribute, CS2 =log2b
• Where s is the selection cardinality
l S3. Using a primary index (S3a) or hash key (S3b) to retrieve a
single record
• CS3a = x + 1; CS3b = 1 for static or linear hashing;
• CS3b = 2 for extendible hashing;
• Where x is number of index levels
Using Selectivity and Cost Estimates in
Query Optimization (Cont’d)
l Examples of Cost Functions for SELECT (contd.)
l S4. Using an ordering index to retrieve multiple records:
• For the comparison condition on a key field with an ordering index,
• CS4 = x + (b/2)
l S5. Using a clustering index to retrieve multiple records:
• CS5 = x + (s/bfr)
l S6. Using a secondary (B+-tree) index:
• For an equality comparison, CS6a = x + s;
• For an comparison condition such as >, <, >=, or <=,
• CS6a = x + (bI1/2) + (r/2)
Using Selectivity and Cost Estimates in
Query Optimization (Cont’d)
l Examples of Cost Functions for SELECT (contd.)
l S7. Conjunctive selection:
• Use either S1 or one of the methods S2 to S6 to solve.
• For the latter case, use one condition to retrieve the records and then
check in the memory buffer whether each retrieved record satisfies the
remaining conditions in the conjunction.
l S8. Conjunctive selection using a composite index:
• Same as S3a, S5 or S6a, depending on the type of index.
Example of Cost Functions for SELECT
op
left-deep
A query over five relations, only one access method, only one join
algorithm, only left-deep plans
I Remember, cost(R ⨝ S) != cost(S ⨝ R)
I So, the number of possible plans is 5! = 120
I If we add one extra access method, the number of possible plans
becomes 25 · 5! = 3840
I If we add one extra join algorithm, the number of possible plans
becomes 24 · 25 · 5! = 61440
Cardinality-based cost model
Identify the cheapest way to access every single relation in the query,
applying local predicates
I For every relation, keep the cheapest access method overall and the
cheapest access method for an interesting order
For every access method, and for every join predicate, find the
cheapest way to join in a second relation
I For every join result keep the cheapest plan overall and the cheapest
plan in an interesting order
Join in the rest of the relations using the same principle
Selinger Optimizer
Cache
A index 100
B seq scan 50
optjoin(ABCD) – assume all joins are NL …
∂=1
A = best way to access A
(e.g., sequential scan, or predicate pushdown into index...)
B = best way to access B
C = best way to access C
D = best way to access D
Example Subplan
Subpla
n
A
Best
choic
choice
e
index
Cost
10
A index 0100
B seq scan 5
B seq scan 050
…
optjoin(ABCD) {A,B} BA 156
{B,C} BC 98
…
∂=2
{A,B} = AB or BA
(using previously computed best way to access A and B)
{B,C} = BC or CB
{C,D} = CD or DC
{A,C} = AC or CA
{A,D} = AD or DA Total cost computations: choose(N,2) x 2
{B,D} = BD or DB
Cache
Example Subplan
Subpla
n
Best
Best
choic
e
choice
Cost
Cost
A index 10
A index 100
0
B seq scan 5
optjoin(ABCD) B seq scan 50
0
{A,B} BA 15
{A,B} BA 156
6
Already computed –
{B,C} BC 9
lookup in cache {B,C} BC 98
8
∂=3 …
{A,B,C} BCA 125
{A,B,C} = remove A, compare A({B,C}) to ({B,C})A …
remove B, compare B({A,C}) to ({A,C})B {B,C,D} BCD 115
Example Subplan
Subpla
n
Best
choic
choice
e
Cost
A index 10
A index 0100
B seq scan 5
optjoin(ABCD) B seq scan 050
Already computed – {A,B} BA 15
{A,B} BA 6156
lookup in cache {B,C} BC 9
∂=4 {B,C} BC 898
{A,B,C} BCA 12
{A,B,C,D} = remove A, compare A({B,C,D}) to ({B,C,D})A {A,B,C} BCA 5125
…
remove B, compare B({A,C,D}) to ({A,C,D})B {B,C,D} BCD 115
{B,C,D} BCD 11
remove C, compare C({A,B,D}) to ({A,B,D})C {A,B,C,D} ABCD 5215
remove D, compare D({A,B,C}) to ({A,B,C})D
• In cache, maintain best overall plan, plus best plan for each
interesting order
compare:
cost(sort(output)) + 156
to
180
Overview of Query Optimization in
Oracle
l Oracle DBMS V8
• Rule-based query optimization: the optimizer chooses execution
plans based on heuristically ranked operations.
• (Currently it is being phased out)
• Cost-based query optimization: the optimizer examines alternative
access paths and operator algorithms and chooses the execution plan
with lowest estimate cost.
• The query cost is calculated based on the estimated usage of resources
such as I/O, CPU and memory needed.
• Application developers could specify hints to the ORACLE query
optimizer.
• The idea is that an application developer might know more information
about the data.
Semantic Query Optimization