Dbms Chapter 5

Download as pdf or txt
Download as pdf or txt
You are on page 1of 54

 Database Management Systems (DBMS)

Unit-5
Query Processing and
Optimization

Sr. Lec. Sujan Shrestha


9801104103
[email protected]
 Looping
Outline
• Steps in query processing
• Measures of query cost
• Selection operation
• Sorting and join
• Evaluation of expressions
• Query optimization
• Transformation of relational expressions
Steps in Query Processing

Parser checks the syntax of Translator translates the


query and verifies attribute query into its internal
name and relation name form (relational algebra)

Parser and Relational algebra


Query
translator expression

Choose best execution plan


Optimizer
Execute the query-evaluation
plan and returns output

Evaluation
Query output Execution plan
engine

Sequence of primitive
operation Database Catalog
Data Statistics about Data
Parser and translator
 A query is translated into SQL and into a relational algebraic expression.
 A parser checks the syntax and verifies the relations and the attributes which
are used in the query.
 Different possible relational algebra expression for a single query.
 Example.

 SELECT Ename,salary FROM Employee WHERE salary > 5000;


Translated into Relational Algebra Expression.

∏ Ename,salary (σ salary>5000 (Employee))

σ salary>5000 (∏ Ename,salary (Employee))


Query optimization
 Query optimization is the process of selecting the most efficient query-
evaluation plan from among the many strategies usually possible for
processing a given query especially if the query is complex.
 A relational algebra operation annotated with instructions on how to evaluate
it is called an evaluation primitive.
 A sequence of primitive operations that can be used to evaluate a query is a
query execution plan or query evaluation plan.
Πname,salary σSalary<5000

σSalary<5000 Πname,salary

(Employee) (Employee)
Query Evaluation
 Query Evaluation(command processor) engine takes a query evaluation plan
executes that plan and returns the answers to the query.
 It specify which access path to follow.
 Specify which algorithm to use to evaluate operator.
 Specify how operators interleave(not to work on or leave blank)
ΠName

Bottom to top
σSalary>5000

Execution
(Employee)
Measures of Query Cost
 Cost is generally measured as the total time required to execute a statement/query.
 Factors contribute to time cost
 Disk accesses (time to process a data request and retrieve the required data from the
storage device)
 CPU time to execute a query
 Network communication cost
 Disk access is the predominant (major) cost, since disk access is slow as compared
to in-memory operation.
 Cost to write a block is greater than cost to read a block because data is read back
after being written to ensure that the write was successful.
Mechanical Hard Drive

Slide
http://faculty.salina.k-state.edu/tim/ossg/File_sys/abstractions.html 10
 To measure the cost of data access from disk we consider: a) number of block
transfer and number of disk seek from disk.

 tT : average time to transfer a single block of data in second


 ts : average block access time(disk seek) + rotation latency in sec
 bl: no of blocks
 Se: no of seeks
 The cost (bl * tT + Se * ts) sec

 Block access = block read + block write


where BW> BR
 The cost of all the algorithm that we consider depends on the size of the buffer
memory

 Best case: when all data or all tables reside into main memory
 Worst case: store few blocks or parts of the block and we need many swaps.
 Practically, actual disk access <estimated disk cost since we consider worst case
most of the time.
Selection Operator

 Symbol: σ (Sigma)
 Notation: σ condition (Relation)

 Operation: Selects tuples from a relation that satisfy a given condition.


 Operators: =, <>, <, >, <=, >=, Λ (AND), V (OR)

Example Display the detail of students belongs to “CE” Branch. Answer σBranch=‘CE’ (Student)

Student Output
RollNo Name Branch SPI RollNo Name Branch SPI
101 Raju CE 8 101 Raju CE 8

102 ravi ME 9 104 dipen CE 9

103 gaurav CI 9

104 dipen CE 9
Search algorithm for selection operation
 Linear search (A1)
 Binary search (A2)
Linear search (A1)
 It scans each blocks and tests all records to see whether 0 1 2 3 4 5
they satisfy the selection condition.
 Cost of linear search (worst case) = ts + br*tT 1 2 3 44 5 6
br denotes number of blocks containing records from
relation r
 If the selection condition is there on a (primary) key
attribute, then system can stop searching if the required
record is found.
Linear
 cost of linear search (best case) = ts + (br /2) * tT Search
 If the selection is on non (primary) key attribute then
multiple block may contains required records, then the
cost of scanning such blocks need to be added to the cost
estimate.
 Linear search can be applied regardless of
 selection condition or
 ordering of records in the file (relation)
 This algorithm is slower than binary search algorithm.
Binary search (A2)
 Generally, this algorithm is used if selection is an equality comparison on the (primary)
key attribute and file (relation) is ordered (sorted) on (primary) key attribute.
 cost of binary search = ts + [log2(br)] * tT
 br denotes number of blocks containing records from relation r
 This algorithm is faster than linear search algorithm.

0 1 2 3 4 5

1 2 3 44 5 6

(0 + 5)/2 = 2.5==2

(2 + 5)/2 = 3.5==3
Sr. Lec. Sujan Shrestha
9801104103
[email protected]
Sorting
 Several of the relational operations, such as joins, can be implemented efficiently
if the input relations are first sorted.
 We can sort a relation by building an index on the relation and then using that
index to read the relation in sorted order.
 Such a process orders the relation only logically rather than physically.
 Hence reading of tuples in the sorted order may lead to disk access for each record,
which can be very expensive.
 So it is desirable to order the records physically.
 Sorting of relation that fit into main memory, standard sorting techniques such as
quick-sort can be used.
 Sorting of relations that do not fit in main memory is called external sorting.
 Most commonly used algorithm for this type of sorting is external sort merge
algorithm.
External Sort-Merge (Example)
 Blocks=3
24 19 14
24 2
19 24 16
19 3
31 31 19
31 7
33 14 24
33 14
14 16 31
14 14
16 33 33
16 16
16 16 3 16
2
21 21 16 19
3
3 3 21 21
7
2 24
2 2 14
7 create merge merge 31
7 7 16
runs pass-1 pass-2
14 33
14 14 21
initial relation runs runs sorted output
nested-loop join algorithm
r⋈θs
for each tuple tr in r do begin
for each tuple ts in s do begin
test pair (tr, ts) to test if they satisfy the given join condition ?
if test satisfied
add tr . ts to the result;
end inner loop
end outer loop
Block Nested-Loop Join Algorithm

for each block br of r do begin


for each block bs of s do begin
for each tuple tr in br do begin
for each tuple ts in bs do begin
test pair (tr, ts) to determine if they pass the given join condition
if test passed
add tr . ts to the result;
end
end
end
end
Cost of computing for all joins
 R is the outer and S is the inner relation of the join.
 Number of records of R: (NR)
 Number of records of S: (NS)
 Number of blocks of R: (BR)
 Number of blocks of S: (BS)

Join Worst Case Best Case


Nested-Loop Join BR + NR ∗ BS BR + BS
Block Nested-Loop Join BR + BR ∗ BS BR + BS
Sum (Nested loop join)
 Assuming worst case memory availability and the
following given statistics for the relations customer
and depositor
 Number of records of customer: 10,000 (ncustomer)
 Number of records of depositor: 5,000 (ndepositor)
 Number of blocks of customer: 400 (bcustomer)
 Number of blocks of depositor: 100 (bdepositor)
 Estimate the cost
1. with depositor as outer relation
2. with customer as outer relation
Sum (Nested loop join)
 Assuming worst case memory availability and the
following given statistics for the relations customer and
depositor
 Number of records of customer: 10,000 (ncustomer)
 Number of records of depositor: 5,000 (ndepositor)
 Number of blocks of customer: 400 (bcustomer)
 Number of blocks of depositor: 100 (bdepositor)
 Estimate the cost
1. with depositor as outer relation
No. of blocks access = ndepositor * bcustomer + bdepositor
= 5000 * 400 + 100
= 2000100
Sum (Nested loop join)
 Assuming worst case memory availability and the following given
statistics for the relations customer and depositor
 Number of records of customer: 10,000 (ncustomer)
 Number of records of depositor: 5,000 (ndepositor)
 Number of blocks of customer: 400 (bcustomer)
 Number of blocks of depositor: 100 (bdepositor)
 Estimate the cost
1. with customer as outer relation
No. of blocks access = ncustomer * bdepositor + bcustomer
= 10000 * 100 + 400
= 1000400
Sum (Nested loop join)
 Assuming best case memory availability and the following given
statistics for the relations customer and depositor
 Number of records of customer: 10,000 (ncustomer)
 Number of records of depositor: 5,000 (ndepositor)
 Number of blocks of customer: 400 (bcustomer)
 Number of blocks of depositor: 100 (bdepositor)
 Estimate the cost
1. with customer as outer relation
No. of blocks access = bdepositor + bcustomer
= 100 + 400
= 500
Join Operations
 There are several different algorithms that can be used to implement joins
 Nested-Loop Join
 Block Nested-Loop Join
 Index Nested-Loop Join
 Sort-Merge Join
 Hash-Join
Query optimization
 It is a process of selecting the most efficient query evaluation plan from the available possible plans.

ΠCust_Name ( σBalance<2500 (account) (customer) ) Customer Account


CID ANO Name ANO Balance
Efficient C01 A01 Raj A01 3000
plan 2 records 4 records
C02 A02 dipen A02 1000

C03 A03 Jay A03 2000

ΠCust_Name ( σBalance<2500 (account customer) ) C04 A04 Ram A04 4000

4 records 4 records
Approaches to Query Optimization
 Heuristic Based Optimization
 Heuristic based optimization uses rule-based optimization approaches for query
optimization.
 Performs select and project operations before join operations. This is done by moving the
select and project operations down the query tree. This reduces the number of tuples available
for join.
 Avoid cross-product operation because they result in very large-sized intermediate tables.
 This algorithms do not necessarily produce the best query plan.
Transformation of relational expressions
 Two relational algebra expressions are said to be equivalent if the two expressions
generate the same set of tuples.
 Example:
Customer Account
CID ANO Name ANO Balance
C01 A01 Raj A01 3000

C02 A02 dipen A02 1000

C03 A03 Jay A03 2000

C04 A04 Ram A04 4000

ΠName ( σBalance<2500 (Account) (Customer) ) ΠName ( σBalance<2500 (Account Customer) )

Customer

Name
dipen
Jay
Transformation of relational expressions
 Combined selection operation can be divided into sequence of individual selections.
This transformation is called cascade of σ.
 Example:

Customer
CID ANO Name Balance
σANO<3 Λ Balance<2000 (Customer) Output
C01 1 Raj 3000
CID ANO Name Balance
C02 2 dipen 1000 OUTPUT
C02 2 dipen 1000
C03 3 Jay 2000

C04 4 Ram 4000 σANO<3 (σBalance<2000 (Customer))

σθ1Λθ2 (E) = σθ1 (σθ2 (E))


Transformation of relational expressions
 Selection operations are commutative.
 Example:

Customer
CID ANO Name Balance
σANO<3 (σBalance<2000 (Customer)) Output
C01 1 Raj 3000
CID ANO Name Balance
C02 2 dipen 1000 OUTPUT
C02 2 dipen 1000
C03 3 Jay 2000

C04 4 Ram 4000 σBalance<2000 (σANO<3 (Customer))

σθ1 (σθ2 (E)) = σθ2 (σθ1 (E))


Transformation of relational expressions
 If more than one projection operation is used in expression then only the outer
projection operation is required. So skip all the other inner projection operation.
 Example:

Customer Customer

Name
CID ANO Name Balance
ΠName (ΠANO, Name (Customer))
C01 1 Raj 3000 Raj
C02 2 dipen 1000 OUTPUT dipen
C03 3 Jay 2000 Jay
C04 4 Ram 4000 ΠName (Customer) Ram

ΠL1 (ΠL2 (…(Π Ln (E))…)) = ΠL1 (E)


Transformation of relational expressions
 Selection operation can be joined with Cartesian product and theta join.
 Example:

Customer Account
CID ANO Name ANO Balance
σANO<3 (Customer Account) Output
C01 1 Raj 1 3000 CID ANO Name Balance
C02 2 dipen 2 1000 OUTPUT C01 1 Raj 3000

C03 3 Jay 3 2000 C02 2 dipen 1000

C04 4 Ram 4 4000 (Customer) σANO<3 (Account)

σθ (E1 E2)) = E1 θE2

σθ1 (E1 θ2 E2)) = E1 θ1Λ θ2 E2


Transformation of relational expressions
 Theta operations are commutative.
 Example:

Customer Account
CID ANO Name ANO Balance
(Account) σANO<3 (Customer) Output
C01 1 Raj 1 3000 CID ANO Name Balance
C02 2 dipen 2 1000 OUTPUT C01 1 Raj 3000

C03 3 Jay 3 2000 C02 2 dipen 1000

C04 4 Ram 4 4000 (Customer) σANO<3 (Account)

E1 σθ E2 = E2 σθ E1
Transformation of relational expressions
 Natural join operations are associative.

(E1 E2) E3 = E1 (E2 E3)


 Selection operation distribute over theta join operation under the following condition
 When all the attributes in the selection condition θ0 involves only the attributes of the one of
the expression (says E1) being joined.

σθ0 (E1 θ E2) = (σθ0 (E1)) θ E2


 When the selection condition θ1 involves only the attributes of E1 and θ2 involves only the
attributes of E2.

σθ1Λθ2 (E1 θ E2) = (σθ1(E1) θ (σθ2 (E2)))


Transformation of relational expressions
 Set operations union and intersection are commutative.
Output
Customer Employee Customer U Employee
Name
Cst_Name Emp_Name
Raj
Raj dipen OUTPUT
dipen
dipen Suresh
Suresh
Employee U Customer

Customer Employee Customer ∩ Employee


Output
Cst_Name Emp_Name
Name
Raj dipen OUTPUT
dipen
dipen Suresh
Employee ∩ Customer

E1 U E2 = E2 U E1 & E1 ∩ E2 = E2 ∩ E1 Set difference is not commutative


Transformation of relational expressions
 Set operations union and intersection are associative.
Output
Customer Employee Student (Customer U Employee) U Student
Name
Cst_Name Emp_Name Stu_Name
Raj
Raj dipen Raj OUTPUT
dipen
dipen Suresh dipen
Suresh
Customer U (Employee U Student)

Customer Employee Student (Customer ∩ Employee) ∩ Student


Output
Cst_Name Emp_Name Stu_Name
Name
Raj dipen Raj OUTPUT
dipen
dipen Suresh dipen
Customer ∩ (Employee ∩ Student)

(E1 U E2) U E3 = E1 U (E2 U E3) & (E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)


Transformation of relational expressions
 Selection operation distributes over U, ∩ and –.

σθ(E1 U E2) = σθ(E1) U σθ(E2)

σθ(E1 ∩ E2) = σθ(E1) ∩ σθ(E2)

σθ(E1 – E2) = σθ(E1) – σθ(E2)


Heuristic Processing Strategies
 1. Perform Selection operation as early as possible.

ΠName ( σBalance<2500 (Account) (Customer) ) ΠName ( σBalance<2500 (Account Customer) )

 2. Combine the Cartesian product with a subsequent selection operation who


predicate represents a join condition into a Join operation.
Cartesian Product
Example Perform Cross Product between Student and Result.

Student Result Consider only selected tuples


• Student – Branch=‘CE’ and Sem=3
RNo Name Branch Sem RNo SPI BL Rank • Result – SPI>7 and BL<1
101 Raju CE 3 101 8 1 2
102 Sanjay ME 5 103 9 0 1
103 Om CE 3 105 7 2 3
104 krishna CE 5
σBranch=‘CE’ Λ Sem=3 (Student) ) X σSPI>7 Λ BL<1) (Result) σBranch=‘CE’ Λ Sem=3 (σSPI>7 Λ BL<1 ((Student) X (Result))
Answer

Output
Student.RNo Name Branch Sem Result.RNo SPI BL Rank
101 Raju CE 3 103 9 0 1
103 OM CE 3 103 9 0 1
 3. Perform projection early
 4. Perform most restrictive selection and operation(i.e. with smallest result size)
before other operation.
Evaluation of expressions
 Expression may contain more than one operations, solving
expression will be difficult if it contains more than one
operations.
 Display customer name who has balance less than 2500. ΠCust_Name
ΠCust_Name ( σBalance<2500 (account) (customer) )

Bottom to top
Execution
 To evaluate such expression we need to evaluate each
operations one by one in appropriate order.
Two methods for evaluating an expression carrying

multiple operations are: σBalance<2500 (customer)
 Materialization
 Pipelining

(account)
Materialization
 Materialization evaluates the expression tree of the relational algebra operation
from the bottom and performs the innermost or leaf-level operations first.
 The intermediate result of each operation is materialized (store in temporary
relation) and becomes input for subsequent (next) operations.
 The cost of materialization is the sum of the individual operations plus the cost of
writing the intermediate results to disk.
 The problem with materialization is that
 it creates lots of temporary relations
 it performs lots of I/O operations
Materialization Student Result

RollNo Name RollNo Marks


 List the name of student 101 Raju 101 90
who got more than 85 marks
102 ravi 102 60

ΠName 103 gaurav 103 70

104 dipen 104 32

σmarks>85(Result ) Student (σmarks>85(Result ))


RollNo Marks Name
RollNo Marks Name
101 90 Raju
101 90 Raju
σmarks>85 (Student)
ΠName ( Student (σmarks>85(Result ))

(Result)  It is called materialized evaluation since the result of each intermediate


operation are created(materialized) and then are used to evaluate the next
level operation.
 Cost = cost of all operation + cost of writing intermediate data
Pipelining
 In pipelining, operations form a queue, and results are passed from one
operation to another as they are calculated.
 To reduce number of intermediate temporary relations, we pass results of
one operation to the next operation in the pipelines.
 Combining operations into a pipeline eliminates the cost of reading and
writing temporary relations.

R1
Store into
Hard Disk
Πx,y Store the final
result in hard disk
R2
Materialization
R1
Store into Πx,y Store the final
result in hard disk
Ram/Buffer
R2
Pipelining
Pipelining
 Pipe line can be demand driven or producer driven

 Demand driven
ΠName  Generated the records lazily
Request of records  Top to down approach
 When receives a request for records, it compute
the next record to be returned.
Request of records

σmarks>85
(Student)

(Result)
Pipelining
 Producer driven
 Pushing the records
ΠName  down to UP approach

Pushing the data  Operation do not wait for request to produce


records but instead generate the records eagerly.
 Each operation at the bottom of a pipeline
continually generates output records and put them
in its output buffer, until the buffer is full.
 Over the operation uses a record from a pipelined
σmarks>85 (Student)
input it removes from the buffer

Pushing the data


(Result)
Choice of evaluation plan
 Evaluation plan: A evaluation plan is needed to define exactly what algorithm
should be used for each operation, and “how the execution of the operation
should be co-ordinated
 List the name of student who got more than 85 marks in DBMS
Result Subject
ΠName Student

RollNo Name Address RollNo Sid Marks Sid Sname

101 Raju Balaju 101 S1 90 S1 DBMS


Hash join 102 S2 60 S2 AI
102 ravi Kalimati
Pipelining
103 gaurav Kalanki 103 S1 70 S3 CG
Merge join Student
104 dipen balkhu 104 S4 32 S4 CN

ΠName ( Student (σmarks>85(Result) σSname=DBMS(Subject)) )


σSname=DBMS σmarks>85  Here, first we had selected few tuples with restriction in subject and
result relation using B+ indexing .
B tree
 Then, this selected tuples are pipelined and joined via merge join.
indexing
 Later the resulted tuples are again hash joined with student table
(Subject) (Result) and at last we projected the only name as the output.
Questions asked in GTU
1. Explain query processing steps. OR Discuss various steps of query processing with proper diagram.
2. Explain Heuristics based and cost based optimization.
3. Distinguish between materialized and pipelining.
4. Explain the measures of finding out the cost of a query in query processing.
5. Describe various types of join and sorting algorithm.
 Database Management Systems (DBMS)

Thank you

Sr. Lec. Sujan Shrestha


9801104103
[email protected]

You might also like