DMS-HS20-Query Processing Optimization 2
DMS-HS20-Query Processing Optimization 2
DMS-HS20-Query Processing Optimization 2
• Query Processing
• Execution models Understanding the data
• Optimization I – heuristics & Calculating Costs
rewriting
• Optimization II – cost models Rule Based Optimizer
• Optimization III - Operators Cost Based Optimizer
Gustavo Alonso
Institute of Computing Platforms
Department of Computer Science
ETH Zürich
Query processing Optimization II 1
Query optimization
• Since SQL is declarative, a database engine has many options to
translate a query into an executable program
• Problems to solve:
• Which starting point? Definitely not the query provided by the user
• Queries are normalized, put in capital letters, syntactic sugar removed, etc.
• Queries are rewritten
• How to generate possible plans for the same query
• How to decide which plans are best
• Rule based (use heuristics)
• Cost based (estimate the cost of the plans and choose the cheapest one)
https://docs.oracle.com/cd/B28359_01/s
erver.111/b28274/optimops.htm#i82005
Query processing Optimization II 5
The basics for optimization
• The main information source for query optimization are statistics on
the data
• These statistics are constantly collected on tables, indexes, buffers,
and system and made available (in Oracle, through the “Dictionary”)
• The statistical data is the basis for the decisions the query optimizer
makes when deciding to choose a plan over another and also
regarding which operator implementation to use
SELECT * FROM person WHERE 25 < age < 40; SELECT * FROM person WHERE 25 < age < 40;
60
50
40
30
20
10
0
20 bis 42 42 bis 48 48 bis 53 53 bis 59 59 bis 70
Ranges of values are fixed and equal Same number of tuples per bucket
Tells how many values in each range Helps to partition data evenly
Helps identifying hot-spots The size of a range helps with cardinality estimates
May store distinct values and min/max, etc May store distinct values and min/max, etc.11
Singleton or frequency histogram
• The frequency histogram plots the frequency of every distinct item in
a table
• In essence, how often each value appears in the table
• Very useful to compute the selectivity of queries
• Highly accurate as it gives counts for every possible value
• Can be done if the number of distinct values is not too high
12
Selecting a type of histogram (example)
• NDV: This represents the
number of distinct values in a
column. For example, if a column
only contains the values 100,
200, and 300, then the NDV for
this column is 3.
• n: This variable represents the
number of histogram buckets.
The default is 254.
• p: This variable represents an
internal percentage threshold
that is equal to (1–(1/n)) * 100.
For example, if n = 254, then p is
99.6.
https://docs.oracle.com/database/121/TGSQL/tgsql_histo.htm
Query processing Optimization II
#TGSQL-GUID-FFA0C0AF-3761-4829-995E-9AFA524F96CE 13
Zone Maps
• A zone map is a combination of coarse index and statistics
• For every block in of a table
• Keep the max and min values for some or all columns
• Before reading a block, check the zone map:
• If range does not match the predicate, do not read the block
• It can significantly reduce I/O cost
• In some cases it can replace an index
• Other statistics can be kept in a zone map
• Example of use of the Zone Maps concept is Snowflake (see chapter
on Storage)
https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20b_ops.htm#8157
Query processing Optimization II 31
Heuristics for joins (example Oracle)
With the rule-based approach, the optimizer follows these steps to
choose an execution plan for a statement that joins R tables:
1. The optimizer generates a set of R join orders, each with a different
table as the first table.
2. The optimizer then chooses among the resulting set of execution
plans. The goal of the optimizer's choice is to maximize the number
of nested loops join operations in which the inner table is accessed
using an index scan. Since a nested loops join involves accessing the
inner table many times, an index on the inner table can greatly
improve the performance of a nested loops join.
https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm
Query processing Optimization II 32
How to generate join orders
• To fill each position in the join order, the optimizer chooses the table with the
most highly ranked available access path. The optimizer repeats this step to fill
each subsequent position in the join order.
• For each table in the join order, the optimizer also chooses the operation with
which to join the table to the previous table or row source in the order. The
optimizer does this by "ranking" the sort-merge operation as access path 12 and
applying these rules:
• If the access path for the chosen table is ranked 11 or better, the optimizer chooses a nested
loops operation using the previous table or row source in the join order as the outer table.
• If the access path for the table is ranked lower than 12, and there is an equijoin condition
between the chosen table and the previous table or row source in join order, the optimizer
chooses a sort-merge operation.
• If the access path for the chosen table is ranked lower than 12, and there is not an equijoin
condition, the optimizer chooses a nested loops operation with the previous table or row
source in the join order as the outer table.
https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm
Query processing Optimization II 33
How to chose a plan
Usually, the optimizer does not consider the order in which tables appear in the FROM clause when choosing an
execution plan. The optimizer makes this choice by applying the following rules in order:
a) The optimizer chooses the execution plan with the fewest nested-loops operations in which the inner
table is accessed with a full table scan.
b) If there is a tie, the optimizer chooses the execution plan with the fewest sort-merge operations.
c) If there is still a tie, the optimizer chooses the execution plan for which the first table in the join order
has the most highly ranked access path:
d) If there is a tie among multiple plans whose first tables are accessed by the single-column indexes
access path, the optimizer chooses the plan whose first table is accessed with the most merged indexes.
e) If there is a tie among multiple plans whose first tables are accessed by bounded range scans, the
optimizer chooses the plan whose first table is accessed with the greatest number of leading columns of
the composite index.
f) If there is still a tie, the optimizer chooses the execution plan for which the first table appears later in
the query's FROM clause.
https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm
Query processing Optimization II 34
Cost Based Optimizer
37
Query optimization
SELECT a,b,c
FROM R,S,T
WHERE r.id= s.id AND s.id = T.id AND …
Step 1: find out all possible ways to access tables R,S,T and estimate the cost as well as the selectivity
Step 2: pick the best access methods
Step 3: Generate all possible join orders for the three tables
(R S) T; ( R T) S; ( T S) R
Step 4: estimate the cost for each join
Step 5: pick the best query tree
40
Access Plans for S
• SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b
ORDER BY R.c;
41
Access Plans for T
• SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b
ORDER BY R.c;
42
Join Plans for R join S
• SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b
ORDER BY R.c;
43
Join Plans for three-way joins
• SELECT * FROM R, S, T WHERE R.a = S.a AND R.b = T.b
ORDER BY R.c;
• Consider all combinations of joins (assoc., commut.)
• e.g., (R ⋈ S) ⋈ T, S ⋈ (T ⋈ R), ….
• sometimes even enumerate Cartesian products
D D
C C
A B C D A B B
A
• As the number of joins increases, the number of alternative plans grows rapidly; we need
to restrict the search space.
• System-R: consider only left-deep join trees. Left-deep trees allow us to generate all fully
pipelined plans: Intermediate results not written to temporary files (Not all left-deep
trees are fully pipelined, though).
Enumeration Left-Deep Plans
• Enumerated using N passes (if N relations joined):
• Pass 1: Find best 1-relation plan for each relation.
• Pass 2: Find best way to join result of each 1-relation plan (as outer) to
another relation. (All 2-relation plans.)
• Pass N: Find best way to join result of a (N-1)-relation plan (as outer) to the
N’th relation. (All N-relation plans.)
• For each subset of relations, retain only:
• Cheapest plan overall, plus
• Cheapest plan for each interesting order of the tuples.
Enumeration of Plans (Contd.)
• ORDER BY, GROUP BY, aggregates etc. handled as a final step, using
either an `interestingly ordered’ plan or an additional sorting
operator.
• An n-1 plan is not combined with an additional relation unless there is
a join condition between them, unless all predicates in WHERE have
been used up.
• i.e., avoid Cartesian products if possible.
• In spite of pruning plan space, this approach is still exponential in the
# of tables.
Ways to simplify the problem
• Example from old version of Oracle (for both cost based and rule based):
• The optimizer first determines whether joining two or more of the tables
definitely results in a row source containing at most one row. The optimizer
recognizes such situations based on UNIQUE and PRIMARY KEY constraints
on the tables. If such a situation exists, the optimizer places these tables
first in the join order. The optimizer then optimizes the join of the remaining
set of tables.
• For join statements with outer join conditions, the table with the outer join
operator must come after the other table in the condition in the join order.
The optimizer does not consider join orders that violate this rule.
https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm
Query processing Optimization II 49
Example from systems
With the cost-based approach, the optimizer generates a set of execution plans based on the
possible join orders, join operations, and available access paths. The optimizer then estimates the
cost of each plan and chooses the one with the lowest cost. The optimizer estimates costs in these
ways:
• The cost of a nested loops operation is based on the cost of reading each selected row of the outer table and
each of its matching rows of the inner table into memory. The optimizer estimates these costs using the
statistics in the data dictionary.
• The cost of a sort-merge join is based largely on the cost of reading all the sources into memory and sorting
them.
The optimizer also considers other factors when determining the cost of each operation. For
example:
• A smaller sort area size is likely to increase the cost for a sort-merge join because sorting takes more CPU
time and I/O in a smaller sort area. Sort area size is specified by the initialization parameter SORT_AREA_SIZE.
• A larger multiblock read count is likely to decrease the cost for a sort-merge join in relation to a nested
loops join. If a large number of sequential blocks can be read from disk in a single I/O, an index on the inner
table for the nested loops join is less likely to improve performance over a full table scan. The multiblock read
count is specified by the initialization parameter DB_FILE_MULTIBLOCK_READ_COUNT.
• For join statements with outer join conditions, the table with the outer join operator must come after the
other table in the condition in the join order. The optimizer does not consider join orders that violate this rule.
https://docs.oracle.com/cd/F49540_01/DOC/server.815/a67781/c20c_joi.htm
Query processing Optimization II 50
Complementary reading
• Take a look at some of the manuals indicated in the slides and look at
the following documents to get an idea of how what we have
discussed maps to real systems
• Oracle’s query optimizer (Oracle 19)
• https://www.oracle.com/technetwork/database/bi-datawarehousing/twp-
optimizer-with-oracledb-19c-5324206.pdf
• SQL Server query processing
• https://docs.microsoft.com/en-us/sql/relational-databases/query-processing-
architecture-guide?view=sql-server-ver15