Query Processing and Optimization: Chapter - 2

CHAPTER - 2
QUERY PROCESSING AND OPTIMIZATION
COURSE NAME: ADVANCED DATABASE SYSTEM
Compiled by: Temesgen Tilahun

INTRODUCTION
 Query processing requires that the DBMS identify and execute a strategy for retrieving
the results of the query. Query optimization is necessary to determine the optimal
alternative to process a query.
 There are two main techniques for query optimization.
 The first approach is to use a rule based or heuristic method for ordering the
operations in a query execution strategy.
 The second approach estimates the cost of different execution strategies and chooses
the best solution.
 In general most commercial database systems use a combination of both techniques.
Compiled by: Temesgen Tilahun 2

INTRODUCTION
 In this chapter we discuss the techniques used internally by a DBMS to process,

optimize, and execute high-level queries.
 A query expressed in a high-level query language such as SQL must first be

scanned, parsed, and validated.
 The scanner identifies the query tokens such as SQL keywords, attribute names,
and relation names that appear in the text of the query.
 The parser checks the query syntax to determine whether it is formulated

according to the syntax rules (rules of grammar) of the query language.
INTRODUCTION
 The query must also be validated by checking that all attribute and relation names are
valid and semantically meaningful names in the schema of the particular database
being queried.
 An internal representation of the query is then created, usually as a tree data structure
called a query tree.
 It is also possible to represent the query using a graph data structure called a query
graph.
 The DBMS must then devise an execution strategy or query plan for retrieving the
results of the query from the database files.

INTRODUCTION
 A query typically has many possible execution strategies and the process of
choosing a suitable one for processing a query is known as query optimization.
 The query optimizer module has the task of producing a good execution plan.
 The code generator generates the code to execute that plan.
 The runtime database processor has the task of running (executing) the query
code, whether in compiled or interpreted mode, to produce the query result.
 If a runtime error results, an error message is generated by the runtime database

processor.

Translating SQL Queries into Relational Algebra
 In practice, SQL is the query language that is used in most commercial

RDBMSs.
 An SQL query is first translated into an equivalent extended relational algebra

expression represented as a query tree data structure that is then optimized.
 Typically, SQL queries are decomposed into query blocks, which form the
basic units that can be translated into the algebraic operators and optimized.
 A query block contains a single SELECT – FROM – WHERE expression, as well

as GROUP BY and HAVING clauses if these are part of the block.
Query Processing
 Query Processing: It is a procedure of converting a query written in high level language

(Eg. SQL) into a correct and efficient execution plan expressed in low level language, which is
used for data manipulation.
 Execution Plan: Query processing is a stepwise process.
 Query Processor: Query processor is responsible for generating execution plan.
 Before retrieving, updating or deleting data in database, a query goes through a series of query
compilation steps. These steps are known as execution plan.
 In query processing, the first phase is transformation in which parser first checks the syntax of
query and also checks the relations and attributes used in the query that are defined in the
database.

Query Processing
 After checking the syntax and verifying the relations, query is transformed into equivalent
expression that are more efficient to execute.
 Transformation, depends upon various factors like existence of certain database structures,
presence of different indexes, file is sorted or not, cost of transformation, physical
characteristics of data, etc.
 After transformation of query, transformed query is evaluated by using number of strategies

known as access plans.
 The next step is to validate the user privileges and ensure that the query does not disobey the
relevant integrity constraints.
 Finally, execution plan is executed to generate the result.

General Strategy for Query Processing
 (i) Representation of query: Query written by user cannot be processed directly by

system.
 Query processor first checks the syntax and existence of relations and their attributes
in database.
 After validations, query processor transform it into equivalent and more efficient
expression; for example, query will be converted into a standard internal format that
parser can manipulate.
 Internal form may be relational algebra, relational calculus, any low-level language,
operator graphs etc.

General Strategy for Query Processing
 (ii) Operator Graphs: Operator graphs are used to represent query.
 It gives the sequence of operations that can be performed. It is easy to understand the
query represented by operator graphs.
 It is useful to determine redundancy in query expressions, result of transformation,

simplify the view etc.
 (iii) Response time and Data characteristics consideration: Data characteristics like
length of records, expected sizes of both intermediate and final results, size of
relations etc., are also considered for optimizing the query.
 In addition to this overall response time is also determined.

Steps in Query Processing
 (i) Syntax Analysis: Query in high-level language is parsed into

tokens and tokens are analyzed for any syntax error.
 Order of tokens are also maintained to make sure that all the rules
of language grammars are followed.
 In case of any error, query is rejected and an error code with

explanation for rejection is returned to the user.
 (Only syntax is checked in this step).


 (ii) Query Decomposition: In this step, query is decomposed into query blocks
which are the low-level operations.
 It starts with the high-level query that is transformed into low level operations
and checks whether that query is syntactically and semantically correct.
 For example, SQL query is decomposed into blocks like Select block, From
block, Where block, etc.

Steps in Query Decompositions
 (A) Query Analysis: In the query analysis stage, programming language compiler
checks that the query is lexically and syntactically correct.
 A syntactically correct query is analyzed using system catalogues to verify the

existence of relations and attributes used in query.
 After analysis a correct query is converted into some internal representation, which is
more efficient for processing.
 The type specification of the query qualifier and result is also checked at this stage.
 The internal representation may be, query tree or query graph.

Query Tree Notation
 The internal representation may be query tree or query graph.
 Query tree notation: A typical internal representation of query is query tree.
 It is also known as relational algebra tree.
 A query tree is constructed using tree data structure that corresponds to the relational algebra
expression.
 Main Components of Query Tree are:

⧫ Root of tree – represents result of query.
⧫ Internal nodes – represent intermediate relation that is the output of applying an operation in the
algebra.
⧫ Leaf nodes – represent input relations of the query.

Query Tree Notation
 The sequence of operations is directed
from leaves to the root node.
 Example of Query Tree Notation:

Query Graph Notation
 Query Graph Notation: Graph data structure is also used for internal
representation of query.
 Main components of graphs notations are:
⧫ Relation nodes – represent relations by single circle.
⧫ Constant nodes – represent constant values by double circle.
⧫ Edges – represent relation and join conditions.
⧫ Square brackets – represent attributes retrieved from each relation.

Query Graph Notation
 Figure of Query Graph Notation:

 (B) Query Normalization: After query analysis, it is normalized to remove any

redundancy.
 In this phase, query is converted into normalized form that can be easily
manipulated.
 A set of equivalency rules are applied to query and simplify the projection and
selection operations to avoid redundancy.
 Query can be converted into one of the following two normal forms:
 Conjunctive normal form and Disjunctive normal forms.

 (1) Conjunctive normal form: It is a sequence of conjuncts that are connected with ‘AND’ operator.
 A conjunct consists of one or more terms connected with ‘OR’ operator.
 A conjunctive selection consists only those tuples that satisfy all conjuncts.
 Example: (Emp_Job = ‘Analyst’ ∨ salary < 50000) ∧ (Hire_Date > 1–1–2000)
 (2) Disjunctive normal forms: It is a sequence of disjuncts that are connected with ‘OR’ operator.
 A disjunct consists of one or more terms connected with ‘AND’ operator.
 A disjunctive selection contains those tuples that satisfy anyone of the disjunct.
 Example: (Emp_Job= ‘Analyst’ ∧ salary < 5000) ∨ (Hire_Date > 1–1–2000).
 Disjunctive normal form is more useful as it allows the query to break into a series of independent
sub-queries linked by union.

 (C) Semantic Analyzer: The semantic analyzer performs the following tasks :
 It helps in reducing the number of predicates and rejects contradictory

normalized forms.
 In case of missing join specification, components of query do not contribute to

generation of results.
 It identifies these queries and rejects them.
 It makes sure that each object in query is referenced correctly according to its
data type.

 (D) Query Simplifier: The major tasks of query simplifier are as follows:
 It eliminates common sub-expressions and redundant qualification.
 It introduces integrity constraints, view definitions into the query graph

representation.
 It eliminates query that voids any integrity constraint without accessing the database.
 It transforms sub-graphs into semantically equivalent and more efficient form.
 It deals with user access rights.
 (E) Query Restructuring: At the final stage of query decomposition, transformation

rules are applied to restructure the query to give a more efficient implementation.

 (iii) Query Optimization: The aim of the query optimization step is to choose the best
possible query execution plan with minimum resources required to execute that plan.
 (iv) Execution Plan: It is the basic algorithm used for each operation in the query.
 Execution plans are classified into following four types: (Check From Page 6 of the PDF file)
1) Left-deep tree query execution plan
2) Right-deep tree query execution plan
3) Linear tree execution plan and
4) Bushy execution plan

 (v) Query Code Generator: After selecting the best possible execution
plan query is converted into low-level language so that it can be taken as
input by runtime database process.
 (vi) Runtime Database Processor: It deals directly with main database and
do the necessary operation mentioned in query and returns the result to
user.
Query Optimization
 Query Optimization means converting a query into an equivalent form which is more
efficient to execute.

Query Optimization
 The main issues that need to be considered in query optimization are:
 Reduction of data transfer with database.
 Use of available indexes for fast searching.
 Reduction of number of times the database is manipulated.
 The order in which joins should be performed.
 How to store intermediate results?
 There are two main techniques used to implement query optimization.
 These are heuristic query optimization and cost based query optimization.

Transformation Rules for Relational Algebra
 The transformation rules are used to formulate a relational algebra expression

into different ways and query optimizer choose the most efficient equivalent
expression to execute.
 Two expressions are considered to be equivalent if they have same set of

attributes in different order but representing the same information.

Cost based Query Optimization
 In cost based query optimization, optimizer estimates the cost of running of all
alternatives of a query and choose the optimum alternative.
 The alternative which uses the minimum resources is having minimum cost.
 The cost of a query operation is mainly depend on its selectivity i.e., the
proportion of the input relations that forms the output.
 The main components used to determine the cost of execution of a query:
 Access cost of secondary storage, Storage cost, Computation cost, Memory

usage cost and Communication cost.

 (a) Access cost of Secondary Storage: Access cost to secondary storage consists of cost
of database manipulation operations which includes searching, writing, reading of data
blocks stored in the secondary memory.
 The cost of searching depends upon the type of indexes (primary, secondary, hashed),
type of file structure and ordering of relation.
 In addition to physical storage location like file blocks are allocated contiguously on
the same disk or scattered on the disk.
 (b) Storage cost: Storage cost consists of cost of storing intermediate results (tables or
files) that are generated by the execution strategy for the query.

 (c) Computation cost: Computation cost consists of performing in-memory

operations during query execution such as sorting of records in a file, merging of
records, performing computations on field values, searching of records. These are
mainly performed on data buffers.
 (d) Memory usage cost: It consists of cost of relating to the number of memory
buffers needed during query execution.
 (e) Communication cost: It consists of the cost of transferring query and its result
from database location to the location of terminal where the query is originated.

 From all the above components, the most important is access cost to secondary storage
because secondary storage is comparatively slower than other devices.
 Optimizer try to minimize computation cost for small databases as most of the data
files are stored in main memory.
 For large database, it try to minimize the access cost to secondary storage and
 For distributed databases, it tries to minimize the communication cost because various
sites are involved for data transfer.
 To estimate the cost of execution strategies, optimizer access statistical data stored in
DBMS catalog.

The Information Stored in DBMS catalog is given below:
A. Number of records in relation X, given as R.
B. Number of blocks required to store relation X, given as B.
C. Blocking factor of relation X, given as BFR.
D. Primary access method for each file and attributes for each file.
E. Number of levels for each multi-level index for an attribute A given as IA.
F. Number of first-level index blocks for an attribute A, given as BA I 1 .
G. Selection cardinality of attribute A in relation R, given as SA, where SA = R × SLA ,

where SLA is the selectivity of the attributes.
Cost Function for Selection Operation
 Cost Function for Selection Operation: Selection operation works on a
single relation in relation algebra and retrieves the records that satisfy the
given condition.
 Depending upon the structure of file, available indexes, searching
methods, the estimated cost of strategies for selection operation is as
given below.
The Estimated Cost of Strategies for Selection Operation is as given below:

Example of Employee Table
 Consider, there are 10,000 records stored in 2000 blocks.
 Also the following indexes are available.
 A secondary index on the Emp-ID with 4 levels.
 A clustering index on salary with 3 levels and average selection cardinality of

30.
 A secondary index on non-key attributes Age with 2 levels and 4 first level
index blocks.
 There are 200 distinct values for Age.

EXAMPLE – 1 CONT’D . . .

EXAMPLE – 1 CONT’D . . .
(iii) This query has a conjunctive selection condition: To estimate the cost of
use using anyone of the two components of selection condition, to retrieve the
records plus the linear search.
 The linear search costs 2000 and condition salary > 9000 first gives cost
estimate of Isalary + (B/2) = 2 + 1000 = 1002 and
 Cost of condition Age = 20 is 30 + 2 = 32
 So, the total cost is 32 + 1002 = 1034

Cost Function for Join Operation
 Cost Function for Join Operation: Join operation is the most time
consuming operation in databases and
 An accurate cost function for join operations depends upon the
estimate of size of file (number of records) after the join operation.

The Estimated Cost of Strategies for Join Operation is given below:

 Example: Consider we have 600 records in department table.
 BFR for department table is 60 and number of blocks are 600/60 = 10
 For the Join Operation,
 Type of JOINS COST
 Block nested-loop joins 22000 (Buffer has only one block)

2010 (if all blocks of employee can be
read into database buffer)
 Hash Join 6010 (is hash index is held in memory)

EXAMPLE
 Given two unary relation (contains only one attribute), R1 and R2
 R1 = 7, 2, 9, 8, 3, 9, 1, 3, 6
 R2 = 8, 4, 2, 1, 3, 2, 7, 3

Thank You !
?

Query Processing and Optimization: Chapter - 2

Uploaded by

Copyright:

Available Formats

Query Processing and Optimization: Chapter - 2

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Query Processing and Optimization: Chapter - 2

Uploaded by

Copyright:

Available Formats

CHAPTER - 2

QUERY PROCESSING AND OPTIMIZATION

COURSE NAME: ADVANCED DATABASE SYSTEM

Compiled by: Temesgen Tilahun

 There are two main techniques for query optimization.

 In general most commercial database systems use a combination of both techniques.

Compiled by: Temesgen Tilahun 2

 In this chapter we discuss the techniques used internally by a DBMS to process,

 A query expressed in a high-level query language such as SQL must first be

 The parser checks the query syntax to determine whether it is formulated

Compiled by: Temesgen Tilahun 4

 The code generator generates the code to execute that plan.

 If a runtime error results, an error message is generated by the runtime database

Compiled by: Temesgen Tilahun 5

 In practice, SQL is the query language that is used in most commercial

 An SQL query is first translated into an equivalent extended relational algebra

 A query block contains a single SELECT – FROM – WHERE expression, as well

 Query Processing: It is a procedure of converting a query written in high level language

 Execution Plan: Query processing is a stepwise process.

 Query Processor: Query processor is responsible for generating execution plan.

Compiled by: Temesgen Tilahun 7

 After transformation of query, transformed query is evaluated by using number of strategies

 Finally, execution plan is executed to generate the result.

Compiled by: Temesgen Tilahun 8

 (i) Representation of query: Query written by user cannot be processed directly by

Compiled by: Temesgen Tilahun 9

 (ii) Operator Graphs: Operator graphs are used to represent query.

 It is useful to determine redundancy in query expressions, result of transformation,

 In addition to this overall response time is also determined.

Compiled by: Temesgen Tilahun 10

 (i) Syntax Analysis: Query in high-level language is parsed into

 In case of any error, query is rejected and an error code with

 (Only syntax is checked in this step).

Compiled by: Temesgen Tilahun 12

which are the low-level operations.

and checks whether that query is syntactically and semantically correct.

block, Where block, etc.

Compiled by: Temesgen Tilahun 13

 A syntactically correct query is analyzed using system catalogues to verify the

 The internal representation may be, query tree or query graph.

Compiled by: Temesgen Tilahun 14

 The internal representation may be query tree or query graph.

 Query tree notation: A typical internal representation of query is query tree.

 It is also known as relational algebra tree.

 Main Components of Query Tree are:

⧫ Leaf nodes – represent input relations of the query.

Compiled by: Temesgen Tilahun 15

 The sequence of operations is directed

from leaves to the root node.

 Example of Query Tree Notation:

Compiled by: Temesgen Tilahun 16

 Main components of graphs notations are:

⧫ Relation nodes – represent relations by single circle.

⧫ Constant nodes – represent constant values by double circle.

⧫ Edges – represent relation and join conditions.

⧫ Square brackets – represent attributes retrieved from each relation.

Compiled by: Temesgen Tilahun 17

 Figure of Query Graph Notation:

Compiled by: Temesgen Tilahun 18