Dbms Chapter 5
Dbms Chapter 5
Dbms Chapter 5
Unit-5
Query Processing and
Optimization
Evaluation
Query output Execution plan
engine
Sequence of primitive
operation Database Catalog
Data Statistics about Data
Parser and translator
A query is translated into SQL and into a relational algebraic expression.
A parser checks the syntax and verifies the relations and the attributes which
are used in the query.
Different possible relational algebra expression for a single query.
Example.
σSalary<5000 Πname,salary
(Employee) (Employee)
Query Evaluation
Query Evaluation(command processor) engine takes a query evaluation plan
executes that plan and returns the answers to the query.
It specify which access path to follow.
Specify which algorithm to use to evaluate operator.
Specify how operators interleave(not to work on or leave blank)
ΠName
Bottom to top
σSalary>5000
Execution
(Employee)
Measures of Query Cost
Cost is generally measured as the total time required to execute a statement/query.
Factors contribute to time cost
Disk accesses (time to process a data request and retrieve the required data from the
storage device)
CPU time to execute a query
Network communication cost
Disk access is the predominant (major) cost, since disk access is slow as compared
to in-memory operation.
Cost to write a block is greater than cost to read a block because data is read back
after being written to ensure that the write was successful.
Mechanical Hard Drive
Slide
http://faculty.salina.k-state.edu/tim/ossg/File_sys/abstractions.html 10
To measure the cost of data access from disk we consider: a) number of block
transfer and number of disk seek from disk.
Best case: when all data or all tables reside into main memory
Worst case: store few blocks or parts of the block and we need many swaps.
Practically, actual disk access <estimated disk cost since we consider worst case
most of the time.
Selection Operator
Symbol: σ (Sigma)
Notation: σ condition (Relation)
Example Display the detail of students belongs to “CE” Branch. Answer σBranch=‘CE’ (Student)
Student Output
RollNo Name Branch SPI RollNo Name Branch SPI
101 Raju CE 8 101 Raju CE 8
103 gaurav CI 9
104 dipen CE 9
Search algorithm for selection operation
Linear search (A1)
Binary search (A2)
Linear search (A1)
It scans each blocks and tests all records to see whether 0 1 2 3 4 5
they satisfy the selection condition.
Cost of linear search (worst case) = ts + br*tT 1 2 3 44 5 6
br denotes number of blocks containing records from
relation r
If the selection condition is there on a (primary) key
attribute, then system can stop searching if the required
record is found.
Linear
cost of linear search (best case) = ts + (br /2) * tT Search
If the selection is on non (primary) key attribute then
multiple block may contains required records, then the
cost of scanning such blocks need to be added to the cost
estimate.
Linear search can be applied regardless of
selection condition or
ordering of records in the file (relation)
This algorithm is slower than binary search algorithm.
Binary search (A2)
Generally, this algorithm is used if selection is an equality comparison on the (primary)
key attribute and file (relation) is ordered (sorted) on (primary) key attribute.
cost of binary search = ts + [log2(br)] * tT
br denotes number of blocks containing records from relation r
This algorithm is faster than linear search algorithm.
0 1 2 3 4 5
1 2 3 44 5 6
(0 + 5)/2 = 2.5==2
(2 + 5)/2 = 3.5==3
Sr. Lec. Sujan Shrestha
9801104103
[email protected]
Sorting
Several of the relational operations, such as joins, can be implemented efficiently
if the input relations are first sorted.
We can sort a relation by building an index on the relation and then using that
index to read the relation in sorted order.
Such a process orders the relation only logically rather than physically.
Hence reading of tuples in the sorted order may lead to disk access for each record,
which can be very expensive.
So it is desirable to order the records physically.
Sorting of relation that fit into main memory, standard sorting techniques such as
quick-sort can be used.
Sorting of relations that do not fit in main memory is called external sorting.
Most commonly used algorithm for this type of sorting is external sort merge
algorithm.
External Sort-Merge (Example)
Blocks=3
24 19 14
24 2
19 24 16
19 3
31 31 19
31 7
33 14 24
33 14
14 16 31
14 14
16 33 33
16 16
16 16 3 16
2
21 21 16 19
3
3 3 21 21
7
2 24
2 2 14
7 create merge merge 31
7 7 16
runs pass-1 pass-2
14 33
14 14 21
initial relation runs runs sorted output
nested-loop join algorithm
r⋈θs
for each tuple tr in r do begin
for each tuple ts in s do begin
test pair (tr, ts) to test if they satisfy the given join condition ?
if test satisfied
add tr . ts to the result;
end inner loop
end outer loop
Block Nested-Loop Join Algorithm
4 records 4 records
Approaches to Query Optimization
Heuristic Based Optimization
Heuristic based optimization uses rule-based optimization approaches for query
optimization.
Performs select and project operations before join operations. This is done by moving the
select and project operations down the query tree. This reduces the number of tuples available
for join.
Avoid cross-product operation because they result in very large-sized intermediate tables.
This algorithms do not necessarily produce the best query plan.
Transformation of relational expressions
Two relational algebra expressions are said to be equivalent if the two expressions
generate the same set of tuples.
Example:
Customer Account
CID ANO Name ANO Balance
C01 A01 Raj A01 3000
Customer
Name
dipen
Jay
Transformation of relational expressions
Combined selection operation can be divided into sequence of individual selections.
This transformation is called cascade of σ.
Example:
Customer
CID ANO Name Balance
σANO<3 Λ Balance<2000 (Customer) Output
C01 1 Raj 3000
CID ANO Name Balance
C02 2 dipen 1000 OUTPUT
C02 2 dipen 1000
C03 3 Jay 2000
Customer
CID ANO Name Balance
σANO<3 (σBalance<2000 (Customer)) Output
C01 1 Raj 3000
CID ANO Name Balance
C02 2 dipen 1000 OUTPUT
C02 2 dipen 1000
C03 3 Jay 2000
Customer Customer
Name
CID ANO Name Balance
ΠName (ΠANO, Name (Customer))
C01 1 Raj 3000 Raj
C02 2 dipen 1000 OUTPUT dipen
C03 3 Jay 2000 Jay
C04 4 Ram 4000 ΠName (Customer) Ram
Customer Account
CID ANO Name ANO Balance
σANO<3 (Customer Account) Output
C01 1 Raj 1 3000 CID ANO Name Balance
C02 2 dipen 2 1000 OUTPUT C01 1 Raj 3000
Customer Account
CID ANO Name ANO Balance
(Account) σANO<3 (Customer) Output
C01 1 Raj 1 3000 CID ANO Name Balance
C02 2 dipen 2 1000 OUTPUT C01 1 Raj 3000
E1 σθ E2 = E2 σθ E1
Transformation of relational expressions
Natural join operations are associative.
Output
Student.RNo Name Branch Sem Result.RNo SPI BL Rank
101 Raju CE 3 103 9 0 1
103 OM CE 3 103 9 0 1
3. Perform projection early
4. Perform most restrictive selection and operation(i.e. with smallest result size)
before other operation.
Evaluation of expressions
Expression may contain more than one operations, solving
expression will be difficult if it contains more than one
operations.
Display customer name who has balance less than 2500. ΠCust_Name
ΠCust_Name ( σBalance<2500 (account) (customer) )
Bottom to top
Execution
To evaluate such expression we need to evaluate each
operations one by one in appropriate order.
Two methods for evaluating an expression carrying
multiple operations are: σBalance<2500 (customer)
Materialization
Pipelining
(account)
Materialization
Materialization evaluates the expression tree of the relational algebra operation
from the bottom and performs the innermost or leaf-level operations first.
The intermediate result of each operation is materialized (store in temporary
relation) and becomes input for subsequent (next) operations.
The cost of materialization is the sum of the individual operations plus the cost of
writing the intermediate results to disk.
The problem with materialization is that
it creates lots of temporary relations
it performs lots of I/O operations
Materialization Student Result
R1
Store into
Hard Disk
Πx,y Store the final
result in hard disk
R2
Materialization
R1
Store into Πx,y Store the final
result in hard disk
Ram/Buffer
R2
Pipelining
Pipelining
Pipe line can be demand driven or producer driven
Demand driven
ΠName Generated the records lazily
Request of records Top to down approach
When receives a request for records, it compute
the next record to be returned.
Request of records
σmarks>85
(Student)
(Result)
Pipelining
Producer driven
Pushing the records
ΠName down to UP approach
Thank you