Query Processing in Distributed Database

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 20

QUERY

PROCESSING IN
DISTRIBUTED
DATABASE Submitted to: MRS. ANJU SHARMA

Submitted by:
MCA/25015/22 : SURBHI YADAV
MCA/25016/22 : AYUSH JAJOO
WHAT IS DISTRIBUTIVE DATABASE ?
A distributed database is basically a database that is not limited to one system, it is spread
over different sites that don’t share physical components, that for the users it looks like one
HOMOGENEOUS DISTRIBUTIVE DATABASE single database.
HETEROGENEOUS DISTRIBUTIVE DATABASE
▰ all different sites store database identically. ▰ different sites can use different schema and software
▰ operating system, database management system, and the ▰ a particular site might be completely unaware of the other sites.
data structures – same at all sites.
▰ different operating system, different database application -
▰ easy to manage. different at different sites.

2
Calculus query on distributed relation

QUERY
Global Schema
DECOMPOSITION

Algebraic query on distributed relation

CENTRAL DATA LOCALISATION Fragment Schema


CONTROL
SIGHT
Localized Fragmented Query

GLOBAL
OPTIMISATION Fragment Statistics

Optimized Query with communication operation


LOCAL
CONTROL LOCAL OPTIMISATION Local Schema
SIGHT

Optimized to Local Query 3


PHASES OF DISTRIBUTED QUERY PROCESSING

1. QUERY DECOMPOSITION
: ▰ The input to this phase is calulus query on a relation.
▰ It decomposes calculus query into algebraic query on distributed relations.
▰ It checks the global directory to retrieve info about the gloabal conceptual schema (GCS).
▰ This phase includes 4 successive steps:
1. NORMALIZATION
2. ANALYSIS
3. SIMPLIFICATION
4. RESTRUCTURE

4
PHASES OF DISTRIBUTED QUERY PROCESSING

2. DATA LOCALIZATION :
▰ The input to this phase is an algebraic query on a global relation.
▰ Localization is done using data distribution info from fragmented schema.
▰ Identifies fragments, tranforms it into such that it can operate on fragments.
▰ Tranforming the query into query on fragments os done in two steps:
1. Algerbraic query is mapped into fragmented query.
2. Further simplified and structured to produce query better in performance.

It is still not optimal since fragmentation information is not yet used.

5
PHASES OF DISTRIBUTED QUERY PROCESSING
3. GLOBAL QUERY
▰ The algebraic query
OPTIMISATION : on fragments is the input to the phase
▰ The main objective in this phase is to determine an execution strategy close to the optimal
solution
▰ Permutation on queries leads to many similar queries.
▰ They can be ordered based on the cost function.
▰ Cost function is decided using terms such as time units, computing resources (disk space,
I/O, CPU cost, communication).
▰ To select best optimum ordering of operators, execution cost is calculated.
▰ The output of this phase is optimised algrebraic query with communication
operators including fragments.

6
PHASES OF DISTRIBUTED QUERY PROCESSING
4. LOCAL OPTIMISATION :
▰ This is performed by all participating sites that have the fragments which are involved in
the query.
▰ Sub queries on a local site becomes a local query on that site. This is optimised using
local conceptual schema and then executed.

▰ The output is then returned to the site where the query is generated.

7
SIMPLE JOIN
• A major decision in the selection of query processing strategy is choosing a
join strategy.
• Consider the following relation algebra exp:
account depositor branch
Result to be S1 S2 S3
produced here
Query
issued
SIMPLE DISTRIBUTED JOIN
PROCESSING
 Consider the following relational algebra expression in which the three relations are
neither replicated nor fragmented
 account depositor branch

 account is stored at site S1


 depositor at S2
 branch at S3
 For a query issued at site SI, the system needs to produce the result at site SI
• Among the possible strategies for processing the above query:
1. Ship copies of all the three relations to site SI. Choose a strategy for processing the
entire query locally at site SI.

Relation 1 Relation 2 Relation 3

SI

2. Ship a copy of the account relation to site S2, and compute


temp1 = account depositor at S2.
Ship temp1 from S2 to S3 , compute temp2 = temp 1 branch at S3.
S1
Ship the result temp2 to S1.
Account
relation copy Temp1 Temp2
S1 S2 S3

3. Devise strategies similar to the previous one, with the roles of S1, S2, S3 exchanged.
Relation 1 Relation 1 Relation 1
indices indices indices

S1
(Re-create indices at S1)

• This recreation of indices extra processing overhead and extra disk access.
• However, the second strategy has a disadvantage that a potentially large relation
• (customer account ) must be shipped from S2 to S3.
• The relation repeats the address data for a customer once for each account that the customer has . The second
strategy may result in extra network transmission compared to first strategy.
SEMI JOIN

 Semi-join is a technique for processing a join between two tables that are stored
sites. The basic idea is to reduce the transfer cost by first sending only the projected
join column(s) to the other site, where it is joined with the second relation.
 Let r1 be a relation with schema R1 stores at site S1
Let r2 be a relation with schema R2 stores at site S2
 Evaluate the expression r1 r2 and obtain the result at S1.

1. Compute temp1  R1  R2 (r1) at S1.


2. Ship temp1 from S1 to S2.
3. Compute temp2  r2 temp1 at S2
4. Ship temp2 from S2 to S1.
5. Compute r1 temp2 at S1. This is the same as r1 r2 .
SEMI JOIN
The semi-join of r1 with r2, is denoted by:
r1 r2
it is defined by:
R1 (r1 r2)
Thus, r1 r2 selects those tuples of r1 that contributed to r1 r2 .
In step 3 above, temp2 = r2 r1.
For joins of several relations, the above strategy can be extended to a series of semi-join
steps.
SEMI JOIN EXAMPLE
Departments table
1.CREATE TABLE  "DEPARTMENTS"   
2.   (    "DEPARTMENT_ID" NUMBER(10,0) NOT NULL ENABLE,   
3.    "DEPARTMENT_NAME" VARCHAR2(50) NOT NULL ENABLE,   
4.     CONSTRAINT "DEPARTMENTS_PK" PRIMARY KEY ("DEPARTMENT_ID") ENABLE  
5.   )  
Customer table
6.CREATE TABLE  "CUSTOMER"   
7.   (    "CUSTOMER_ID" NUMBER,   
8.    "FIRST_NAME" VARCHAR2(4000),   
9.    "LAST_NAME" VARCHAR2(4000),   
    "DEPARTMENT_ID" NUMBER  
EXAMPLE
Execute this query
1.SELECT   departments.department_id, departments.department_name  
2.        FROM     departments  
3.        WHERE    EXISTS  
4.                 (  
5.                 SELECT 1  
6.                 FROM   customer  
7.                 WHERE customer.department_id = departments.department_id  
8.                 )  
9.        ORDER BY departments.department_id;  
PARALLELISM
 Components of a task, such as a database query, can be run in parallel to dramatically
enhance performance. The nature of the task, the database configuration, and the hardware
environment, all determine how the Db2® database product will perform a task in parallel.

 These factors are interrelated. Consider them all when you work on the physical and logical
design of a database. The following types of parallelism are supported by the Db2 database
system:
• Inter-query parallelism

• Intra-query parallelism
QUERY PARALLELISM
 Interquery parallelism :-
Interquery parallelism refers to the ability of the database to accept queries from
multiple applications at the same time. Each query runs independently of the others,
but the database manager runs all of them at the same time. Db2 database products
have always supported this type of parallelism.

 Intraquery parallelism :-
Intraquery parallelism refers to the simultaneous processing of parts of a single
query, using either intrapartition parallelism, interpartition parallelism, or both.
INTERPARTITION INTRA -PARTITION
PARALLELISM PARALLELISM
PARALLELISM
 Interquery parallelism refers to the ability of the database to accept queries from
multiple applications at the same time. Each query runs independently of the others,
but the database manager runs all of them at the same time.

 Consider r1 r2 r3 r4 where relation ri is stored at site Si. The result must be presented at

site S1.

 r1 is shipped to S2 and r1 r2 is computed at S2: simultaneously r3 is shipped to S4 and r3 r4 is


computed at S4

S2 ships tuples of (r1 r2) to S1 as they produced;


S4 ships tuples of (r3 r4) to S1

 Once tuples of (r1 r2) and (r3 r4) arrive at S1 (r1 r2) (r3 r4) is computed in parallel
with the computation of (r1 r2) at S2 and the computation of (r3 r4) at S4.
THANK YOU.

11

You might also like