Olap Exp05

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

Department of Information Technology

Academic Year: 2020-21 Semester: V


Class / Branch: TE IT
Subject: OLAP Lab
Experiment No. 5

1. Aim: To study query optimization in Distributed Database.

2. S/W Required :- SQL Server Management Studio, SQL Server 2017

3. Theory :
. Data is one the vital entity in the database is managed by two using two major
database approaches known as centralized database management approach and distributed
database management approach. Centralized database management is one the traditional
approach of database management, in which all of the data in database is sited on central
location. Centralized database approach has overcome several limitations of file oriented
approach of prior times. Since in centralized database approach the data is placed on central
repository hence it is easy to access or extract data from multiple tables as compare to
distributed database approach where data is distributed over several sites. In centralized
database the database query can be easily transformed into set of relational algebra’s
operation, but in distributed database system one has to put more effort to analyze the
amount of data exchange in addition to corresponding set of relational algebra’s operations.
Distributed database system is defined as collection of logically interrelated data distributed
over several sites.
The number of nodes in distributed system is connected either by using wired or
wireless network media. One of the major issues in the distributed database design is the
placement of data and program across the number of computer or site available in the
system. After making placement of data and application program one has to focus on
transforming a distributed query into equivalent low level query so that actual
Prof . Apeksha Mohite Department of Information Technology
implementation and execution strategy of the query can be carried out. In distributed
database system, it is obvious the database query will extract data from several different
sites, so in this case the important factor is to reduce to amount of data transmission to
maximum extent.

Distributed Query Processing :


A distributed database query is processed in stages as follows:
1. Query Mapping : The input query on distributed database is specified formally in a
query language. It is then translated to relational algebra query on global relations.
2. Localization: In distributed database there may be fragments of relations being stored
on multiple sites. This stage maps the distributed query to separate queries on individual
fragments using data distribution and replication information.
3. Global Query Optimization: Optimization consists of selecting a strategy from a list of
candidates that is closest to the optimal. A list of candidate queries can be obtained by
permuting the ordering of operations within the fragment query generated in the
previous step. Time is the preferred unit for measurement. The total cost is a
combination of costs such as CPU cost, I/O costs and communication costs. Since
Distributed databases are connected over a network, often the cost of communication
over network are the most significant. This is most considered when the sites are
connected through a wide area network (WAN).
4. Local Query Optimization: This stage is common to all the sites in distributed
database.
In this experiment we focus on computing and analyzing the performance of joins and
semi joins in distributed database system. The various metrics that will be considered while
analyzing performance of join and semi join in distributed database system are Query Cost,
Memory used, CPU Cost, Input Output Cost, Sort Operations, Data Transmission, Total
Time and Response Time. In short the intention of this study is analyze the performance and
behavior of join and semi-join approach in distributed database system.

Join and Semi joins:

Before proceeding further let us first understand the concept of Join and Semi joins.

Join is one of the most imperative operations in database theory that is used to
extract information from two or more than two tables. Technically join operation is one of

Prof . Apeksha Mohite Department of Information Technology


the special cases of Cartesian product. In join unlike Cartesian product before concatenation
the tuples of the join tables are checked against specified condition. There are various types
of joins like equi-join, self join, inner join, outer join etc. Independent of type all of these
are used to extract data from two or more tables.
A semi-join is one of the important operations in relation theory that is used to
optimize a joins query. Semi join is used to reduce the size of relation that is used as an
operand. A semi-join from Ri to Rj on attribute A can be denoted as Rj⋉ Ri . Research
shows that semi joins are very helpful in optimizing the join query by reducing the quantity
of data exchanged. But one of the darken side of using semi join is that it increases the local
processing cost as well as number of message. It returns rows that match an EXISTS sub-
query without duplicating rows from the left side of the predicate when several rows on the
right side satisfy the norms of the sub-query. The research has shown that Semi-join and
anti-join transformation cannot be done if the sub-query is on an OR branch of the WHERE
clause. The objective of semi join in distributed database is to reduce the data transmission
from one site to another.
Analysis of Joins and semi-joins:
To analyze the working and performance of joins and semi joins operation in
distributed database system the following tables EMP and DEPT are to be considered.
While analyzing the performance in distributed database it is assumed that EMP and DEPT
tables are placed at site1 and site2 respectively. The complete structure of the above said
table is as given below.

EMP
ENO ENAME JOB MGR_NO HIRE_D SALARY P_NO D_NO

DEPT
D_NO DNAME LOC

In EMP table we have 14 tuples and we assume each tuple is 100bytes long.
We assume size of each attribute in EMP as below:
ENO ENAME JOB MGR_NO HIRE_D SALARY P_NO D_NO
(5 bytes) (15 bytes) (10 bytes) (6 bytes) (20 bytes) (20 bytes) (20 bytes) (5 bytes)

Prof . Apeksha Mohite Department of Information Technology


In DEPT table we have 4 tuples and we assume each tuple is 150 bytes long.
We assume size of each attribute in DEPT as below:
D_NO LOC
DNAME
(15 (96
(50 bytes)
bytes) bytes)

It is assumed that both relations are not fragmented. So total memory consumed by EMP and
DEPT table is 1400 bytes and 600 bytes respectively. It is further assumed that the following query
is requested at site3.
Query at site3 : Find the name, Dname, Deptno and job of the employee.
Select ENAME,DNAME,JOB ,D_NO FROM EMP,DEPT;
Query Processing Using Joins in Distributed Database System:
The original query that extract data from two tables EMP and DEPT in distributed
database system can be implemented and executed in three different ways as given below
in Case I, II and III.
After join the result produced at site 3 will have 14 rows and attributes
ENAME,DNAME,JOB,D_NO (total 79 bytes)
Case I: In this case to implement and execute the query one has to transfer both join table
EMP and DEPT to the resultant location i.e. at site 3. The following diagram shows the
query plan of above said case.

Site 1 : EMP Site 2 : DEPT

Site 3

The total number of bytes transferred in this case will be computed as follow:
1. Transmission from S1 to S3 : 14 * 100 = 1400 bytes.
2. Transmission from S2 to S3 : 04 * 150 = 600 bytes.
3. Total transmission to S3 = 1400 + 600 = 2000 bytes.

Prof . Apeksha Mohite Department of Information Technology


Case II: Transfer EMP table to Site 2 where DEPT table is available. Apply join operation
here i.e. at site 2 and transmit the required result at site 3.

The following diagram shows the query plan of above said case.

Site 1 : EMP Site 2 : DEPT

Site 3

The total number of bytes transferred in this case will be computed as follow:
1. Transmission from S1 to S2 : 14 * 100 = 1400 bytes.
2. Transmission from S2 to S3 : 14 * 79 = 1106 bytes.
3. Total transmission to S3 = 1400 + 1106 = 2506 bytes.

Case III: This is just reverse case of Case II; in this case DEPT table will be transmitted at Site 1
where EMP table is already available. Now apply the join operation here and transmit the required
result to site 3. The following diagram shows the query plan of above said case.

Site 1 : EMP Site 2 : DEPT

Site 3

Prof . Apeksha Mohite Department of Information Technology


The total number of bytes transferred in this case will be computed as follow:
1. Transmission from S2 to S1 : 4 * 150 = 600 bytes.
2. Transmission from S2 to S3 : 14 * 79 = 1106 bytes.
3. Total transmission to S3 = 600 + 1106 = 1706 bytes.

Comparative analysis of above cases :

Strategy CASE I CASE II CASE III

Transmission in bytes 2000 2506 1706

We implement the same in SQL SERVER MANAGEMENT STUDIO :


Site 1 : EMP TABLE:

Site 2: DEPT TABLE :

Prof . Apeksha Mohite Department of Information Technology


Query executed at Site 3:
SELECT ENAME,DNAME,JOB,T2.D_NO
FROM [DESKTOP-BBVLSN7\S1].[QPEMP].[dbo].[E_DATA] T1
JOIN [DESKTOP-BBVLSN7\S2].[QPDEPT].[dbo].[D_DATA] T2
ON T1.D_NO=T2.D_NO
Results:

Execution Plan:

The optimizer selects the best strategy and implements the same. Thus, from the above
analysis we can conclude that the case III provides efficient results with best resource
utilization. Thus, we have seen how joins can be used for processing distributed queries.

Prof . Apeksha Mohite Department of Information Technology


Query Processing Using Semi Joins :
We implement the same query using semi joins on Site 1 (Result site) this time. Rest
assumptions are kept same.

Site 1 : EMP Site 2 : DEPT

We analyze the query processing with semi joins in three steps:


1. Project the join attribute D_NO of EMP at site 1 to site 2.
14 tuples from S1 are transferred to S2. D_NO is 4 bytes. So total bytes
transferred = 14 * 4 = 56 bytes.
2. Join this D_NO attribute with DEPT table and transfer required attributes
from result to S1
Here 14 tuples are joined with DEPT so, result also has 14 tuples. Now from
this DNAME(50 bytes) & D_NO (4 bytes) is transferred to S1. Thus, total
transfer in this step is 14 *54 (50+4) = 756 bytes.
3. Join required attributes and project the result.
In this step we already have the required attributes from DEPT table at S2 on
S1. We now only need to join ENAME & JOB attributes in this step and
project the result. Since S1 has EMP table there is no data transfer involved.
Thus, total transmission for result is sum of previous two computations i.e. 56
bytes + 756 bytes = 812 bytes.
Thus, we in all achieve the result using semi joins with data transfer of mere
812 bytes.
We implement the same in SQL SERVER MANAGEMENT STUDIO :
Query at Site 1:
SELECT ENAME,T2.DNAME,
JOB,T2.D_NO
FROM E_DATA,[DESKTOP-BBVLSN7\S2].[QPDEPT].[dbo].[D_DATA] T2
WHERE T2. D_NO
IN
( SELECT T2. D_NO
FROM [DESKTOP-BBVLSN7\S2].[QPDEPT].[dbo].[D_DATA] T2
WHERE E_DATA.D_NO=T2.D_NO)

Prof . Apeksha Mohite Department of Information Technology


Results :

Execution Plan:

JOINS VERSUS SEMI JOINS :


One of the interesting questions is when the query has to be executed with Join
and when with semi join. The selection of join and semi joins in distributed
system is directly depends upon the data transmission from one site to another.
In this study the major fact that came out is that semi joins is found more useful
than join when the data transmission from one site to another is more.

Prof . Apeksha Mohite Department of Information Technology


4. Conclusion:

Prof . Apeksha Mohite Department of Information Technology

You might also like