Lecture 09


Data Warehousing & Data Mining (SE-409)

• Lecture 9
• Need for Speed: Parallelism

Dr. Huma
• Software Engineering Department
University of Engineering and Technology, Taxila
When to parallelize?
Parallelism is useful for operations that access significant amounts of data (sequential vs. parallel), and for operations that can be implemented independently of each other: "divide-and-conquer" [e.g. quarter-wise data].

Parallel execution improves processing for:
• Large table scans and joins [where size does matter]
• Creation of large indexes
• Partitioned index scans
• Bulk inserts, updates, and deletes
• Aggregations and copying
Are you ready to parallelize?
Parallelism can be exploited if there is:
 A symmetric multi-processor (SMP), cluster, or Massively Parallel Processing (MPP) system, AND
 Sufficient I/O bandwidth, AND
 Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%), AND
 Sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers [intermediate results].

Word of caution
Parallelism can reduce system performance on over-utilized systems or systems with small I/O bandwidth.
Scalability: Size is NOT everything
Scalability also depends on:
• Number of concurrent users [I/O bottleneck]
• Index usage: hash-based, B-tree, multiple, bitmapped
• Complexity of the data model
• Complexity of the technique: simple table retrieval, moderate-complexity joins, tendency analysis, clustering
[Figure: data warehousing sits at the high end of all of these dimensions.]
Scalability: Speed-Up & Scale-Up
• Speed-Up: more resources means proportionally less time for a given amount of data.
[Figure: transactions/sec vs. degree of parallelism; the real curve falls below the ideal linear one.]
• Scale-Up: if resources are increased in proportion to the increase in data size, time is constant.
[Figure: secs/transaction vs. degree of parallelism; ideally a flat line.]
Quantifying Speed-Up

Speedup = Ts / Tm

where Ts is the time on a serial processor and Tm is the time on multiple processors.

Example: sequential execution of Task-1, Task-2, and Task-3 takes 18 time units. Ideal parallel execution, including the control work ("overhead") of coordinating the three tasks, takes 6 time units.

Speedup = 18 / 6 = 300%
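The ratio above can be sketched as a one-line helper (a minimal illustration of the worked example; `speedup` is a hypothetical name, not from the slides):

```python
def speedup(ts, tm):
    # Speedup = Ts / Tm: time on a serial processor over time on
    # multiple processors.
    return ts / tm

# Three tasks: 18 time units sequentially vs. 6 units in ideal parallel.
print(f"{speedup(18, 6):.0%}")  # 300%
```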
Speed-Up & Amdahl's Law
Amdahl's Law reveals the maximum expected speedup from a parallel algorithm, given the proportion of the task that must be computed sequentially. It gives the speedup S as

S = 1 / (f + (1 - f) / N)

where f is the fraction of the problem that must be computed sequentially and N is the number of processors.
As f approaches 0, S approaches N; otherwise speedup and processor count are not in a 1:1 ratio.
Example-1: f = 5% and N = 100, then S ≈ 16.8
Example-2: f = 10% and N = 200, then S ≈ 9.57
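The law can be checked directly in code (a minimal sketch; `amdahl` is a hypothetical helper name):

```python
def amdahl(f, n):
    # Amdahl's Law: speedup with sequential fraction f on n processors.
    return 1.0 / (f + (1.0 - f) / n)

print(round(amdahl(0.05, 100), 1))  # Example-1: 16.8
print(round(amdahl(0.10, 200), 2))  # Example-2: 9.57
```

Note how poorly it scales: doubling the processors from 100 to 200 while doubling the sequential fraction cuts the speedup almost in half.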
Parallelization: OLTP vs. DSS
There is a big difference.
• OLTP [how many transactions per second can be done, or how many parallel users can be entertained per second]: parallelization of MULTIPLE queries, or batch updates run in parallel.
• DSS: the number of users is few, but the goal is parallelization of a SINGLE query [complex joins and scans broken up and run in parallel].
Brief Intro to Parallel Processing
• Parallel Hardware Architectures
• Symmetric Multi-Processing (SMP)
• Distributed Memory or Massively Parallel Processing (MPP)
• Non-Uniform Memory Access (NUMA)

• Parallel Software Architectures
• Shared Memory
• Shared Disk (Shared Everything)
• Shared Nothing

• Types of Parallelism
• Data Parallelism
• Spatial Parallelism
Symmetric Multi-Processing (SMP)
• A number of independent I/O units and processors, all sharing access to a single large memory space.

[Figure: I/O units and processors P1-P4 all connected to one main memory.]
Distributed Memory Machines (MPP)
• Composed of a number of self-contained, self-controlled nodes connected through a network interface.
• Each node contains its own processor, memory, and I/O.
• The architecture is better known as Massively Parallel Processing (MPP) or cluster computing.
• Memory is distributed across all nodes.

[Figure: nodes, each with its own I/O, processor, and memory, connected by a bus, switch, or network.]
• Software Architecture

Shared Disk RDBMS Architecture
Multiple nodes share the disks: logically, all the storage appears as one unit. Every processor can modify data on the shared disks through the central interconnect, which means two or more processors can modify the same data at the same time. How is this resolved? Locking {hardware + software}.

Advantages:
• High level of fault tolerance; but if data is cached in multiple places, coherence issues can arise.

Disadvantages:
• Serialization due to locking (a slow process) slows down throughput.
• The interconnect switch can become a bottleneck.

[Figure: clients/users connected to nodes that access the disks over a shared disk interconnect.]
Shared Nothing RDBMS Architecture
Advantages:
• Data ownership changes infrequently.
• There is no locking: no coherence issues, no overlapping memory.

Disadvantages:
• Data availability is low on failure.
• Data distribution must be chosen very carefully [if it does not match the query workload, the only solution is redistribution].
• Redistribution is expensive.

[Figure: clients/users connected to independent nodes, each owning its own memory and disks.]
Data Parallelism: Concept
• Parallel execution of a single data manipulation task across multiple partitions of data.
• Partitions can be static or dynamic.
• Tasks are executed almost independently across partitions [i.e. very little serialization].
• A "query coordinator" must coordinate between the independently executing processes.
Data Parallelism: Example
All query servers work in parallel on the query:

SELECT COUNT(*)
FROM Emp
WHERE age > 50
AND sal > 10,000;

[Figure: the Emp table is split into k partitions; Query Server-1 counts 62 rows in Partition-1, Query Server-2 counts 440 rows in Partition-2, ..., Query Server-k counts 1,123 rows in Partition-k; the query coordinator sums the partial counts.]

Ans = 62 + 440 + ... + 1,123 = 99,000
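The partition-and-sum pattern above can be sketched in Python (a minimal illustration: the round-robin partitioning, the row layout, and the use of threads in place of separate query-server processes are all assumptions, not from the slides):

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(partition):
    # Query server: scan one partition and count qualifying rows.
    return sum(1 for (age, sal) in partition if age > 50 and sal > 10000)

def parallel_count(rows, k):
    # Query coordinator: split rows into k partitions (round-robin here),
    # run one "query server" per partition, then sum the partial counts.
    partitions = [rows[i::k] for i in range(k)]
    with ThreadPoolExecutor(max_workers=k) as pool:
        return sum(pool.map(count_matches, partitions))

rows = [(60, 20000), (40, 30000), (55, 5000), (70, 15000), (51, 10001)]
print(parallel_count(rows, k=2))  # 3 qualifying rows
```

A real warehouse would run the servers on separate processes or nodes against physically partitioned storage; the coordinator's job, summing partial counts, is the same.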
Data Parallelism: Ensuring Speed-Up
To get a speed-up of N with N partitions, it must be ensured that:
• There are enough computing resources.
• The query coordinator is very fast compared to the query servers; otherwise it creates a bottleneck.
• The work done in each partition is almost the same, to avoid performance bottlenecks.
• The same number of records in each partition would not be enough: the records must be uniformly distributed w.r.t. the filter criterion across partitions.
Temporal Parallelism (Pipelining)
Involves taking a complex task and breaking it down into independent subtasks for parallel execution on a stream of data inputs.

[Figure: a task of execution time T is split into a 3-stage pipeline, each stage taking T/3.]
Pipelining: Time Chart
[Figure: time chart of the 3-stage pipeline from T=0 to T=3; a new input enters each T/3, the stages overlap, and once the pipeline is full, one result completes every T/3 time units.]
Pipelining: Speed-Up Calculation
• Time for sequential execution of 1 task = T
• Time for sequential execution of N tasks = N × T
• (Ideal) time for pipelined execution of one task using an M-stage pipeline = T
• (Ideal) time for pipelined execution of N tasks using an M-stage pipeline = T + (N - 1) × (T / M)

Speed-up (S) = (N × T) / (T + (N - 1) × (T / M))

Pipeline parallelism focuses on increasing the throughput of task execution, NOT on decreasing sub-task execution time.
Pipelining: Speed-Up Example
Example: bottling soft drinks in a factory using a 3-stage pipeline (fill bottle, seal bottle, label bottle).

• 10 crate-loads of bottles
Sequential execution = 10 × T
Pipelined execution = T + T × (10 - 1)/3 = 4 × T
Speed-up = 2.50

• 20 crate-loads of bottles
Sequential execution = 20 × T
Pipelined execution = T + T × (20 - 1)/3 ≈ 7.33 × T
Speed-up ≈ 2.73

• 40 crate-loads of bottles
Sequential execution = 40 × T
Pipelined execution = T + T × (40 - 1)/3 = 14 × T
Speed-up ≈ 2.86
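The formula and the three crate examples above can be checked with a small function (a minimal sketch; `pipeline_speedup` is a hypothetical name, with T factored out by setting T = 1):

```python
def pipeline_speedup(n, m):
    # Sequential time n*T vs. ideal pipelined time T + (n-1)*T/m,
    # for n tasks on an m-stage pipeline (with T = 1).
    return n / (1 + (n - 1) / m)

for n in (10, 20, 40):
    print(f"{n} crates: speed-up = {pipeline_speedup(n, 3):.2f}")
```

Doubling the input from 10 to 40 crates barely moves the speed-up, which is the asymptotic behaviour discussed on the next slide.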
Pipelining: Input vs. Speed-Up
[Figure: speed-up S vs. number of inputs N (1 to 19) for a 3-stage pipeline; the curve rises from 1 toward 3.]

• The asymptotic limit on speed-up for an M-stage pipeline is M.
• The speed-up will NEVER reach M, as initially filling the pipeline takes T time units.
Pipelining: Limitations
• Some relational operators do not produce their first output until they have consumed all their inputs. Aggregate and sort operators have this property; one cannot pipeline these operators.
• Often, the execution cost of one operator is much greater than the others, hence skew, e.g. Sum() or Count() vs. Group-By() or Join.
Quiz 1
Marks: 10 + 10    Time: 30 min

• Question 1: (10 Marks) CLO 1
Define a few applications of a Data Warehouse.
• Question 2: (10 Marks) CLO 1
What are the types of physical data extraction?
