Query Processing in Distributed Database
Query Processing in Distributed Database
Query Processing in Distributed Database
PROCESSING IN
DISTRIBUTED
DATABASE Submitted to: MRS. ANJU SHARMA
Submitted by:
MCA/25015/22 : SURBHI YADAV
MCA/25016/22 : AYUSH JAJOO
WHAT IS DISTRIBUTIVE DATABASE ?
A distributed database is basically a database that is not limited to one system, it is spread
over different sites that don’t share physical components, that for the users it looks like one
HOMOGENEOUS DISTRIBUTIVE DATABASE single database.
HETEROGENEOUS DISTRIBUTIVE DATABASE
▰ all different sites store database identically. ▰ different sites can use different schema and software
▰ operating system, database management system, and the ▰ a particular site might be completely unaware of the other sites.
data structures – same at all sites.
▰ different operating system, different database application -
▰ easy to manage. different at different sites.
2
Calculus query on distributed relation
QUERY
Global Schema
DECOMPOSITION
GLOBAL
OPTIMISATION Fragment Statistics
1. QUERY DECOMPOSITION
: ▰ The input to this phase is calulus query on a relation.
▰ It decomposes calculus query into algebraic query on distributed relations.
▰ It checks the global directory to retrieve info about the gloabal conceptual schema (GCS).
▰ This phase includes 4 successive steps:
1. NORMALIZATION
2. ANALYSIS
3. SIMPLIFICATION
4. RESTRUCTURE
4
PHASES OF DISTRIBUTED QUERY PROCESSING
2. DATA LOCALIZATION :
▰ The input to this phase is an algebraic query on a global relation.
▰ Localization is done using data distribution info from fragmented schema.
▰ Identifies fragments, tranforms it into such that it can operate on fragments.
▰ Tranforming the query into query on fragments os done in two steps:
1. Algerbraic query is mapped into fragmented query.
2. Further simplified and structured to produce query better in performance.
5
PHASES OF DISTRIBUTED QUERY PROCESSING
3. GLOBAL QUERY
▰ The algebraic query
OPTIMISATION : on fragments is the input to the phase
▰ The main objective in this phase is to determine an execution strategy close to the optimal
solution
▰ Permutation on queries leads to many similar queries.
▰ They can be ordered based on the cost function.
▰ Cost function is decided using terms such as time units, computing resources (disk space,
I/O, CPU cost, communication).
▰ To select best optimum ordering of operators, execution cost is calculated.
▰ The output of this phase is optimised algrebraic query with communication
operators including fragments.
6
PHASES OF DISTRIBUTED QUERY PROCESSING
4. LOCAL OPTIMISATION :
▰ This is performed by all participating sites that have the fragments which are involved in
the query.
▰ Sub queries on a local site becomes a local query on that site. This is optimised using
local conceptual schema and then executed.
▰ The output is then returned to the site where the query is generated.
7
SIMPLE JOIN
• A major decision in the selection of query processing strategy is choosing a
join strategy.
• Consider the following relation algebra exp:
account depositor branch
Result to be S1 S2 S3
produced here
Query
issued
SIMPLE DISTRIBUTED JOIN
PROCESSING
Consider the following relational algebra expression in which the three relations are
neither replicated nor fragmented
account depositor branch
SI
3. Devise strategies similar to the previous one, with the roles of S1, S2, S3 exchanged.
Relation 1 Relation 1 Relation 1
indices indices indices
S1
(Re-create indices at S1)
• This recreation of indices extra processing overhead and extra disk access.
• However, the second strategy has a disadvantage that a potentially large relation
• (customer account ) must be shipped from S2 to S3.
• The relation repeats the address data for a customer once for each account that the customer has . The second
strategy may result in extra network transmission compared to first strategy.
SEMI JOIN
Semi-join is a technique for processing a join between two tables that are stored
sites. The basic idea is to reduce the transfer cost by first sending only the projected
join column(s) to the other site, where it is joined with the second relation.
Let r1 be a relation with schema R1 stores at site S1
Let r2 be a relation with schema R2 stores at site S2
Evaluate the expression r1 r2 and obtain the result at S1.
These factors are interrelated. Consider them all when you work on the physical and logical
design of a database. The following types of parallelism are supported by the Db2 database
system:
• Inter-query parallelism
• Intra-query parallelism
QUERY PARALLELISM
Interquery parallelism :-
Interquery parallelism refers to the ability of the database to accept queries from
multiple applications at the same time. Each query runs independently of the others,
but the database manager runs all of them at the same time. Db2 database products
have always supported this type of parallelism.
Intraquery parallelism :-
Intraquery parallelism refers to the simultaneous processing of parts of a single
query, using either intrapartition parallelism, interpartition parallelism, or both.
INTERPARTITION INTRA -PARTITION
PARALLELISM PARALLELISM
PARALLELISM
Interquery parallelism refers to the ability of the database to accept queries from
multiple applications at the same time. Each query runs independently of the others,
but the database manager runs all of them at the same time.
Consider r1 r2 r3 r4 where relation ri is stored at site Si. The result must be presented at
site S1.
Once tuples of (r1 r2) and (r3 r4) arrive at S1 (r1 r2) (r3 r4) is computed in parallel
with the computation of (r1 r2) at S2 and the computation of (r3 r4) at S4.
THANK YOU.
11