Lecture 09


Data Warehousing & Data Mining (SE-409)

• Lecture 9
• Need for Speed: Parallelism

Dr. Huma
• Software Engineering Department
University of Engineering and Technology, Taxila
When to parallelize?
Parallelism is useful for operations that access significant amounts of data (sequential vs. parallel), and for operations that can be implemented independently of each other: "divide-and-conquer" [e.g. quarter-wise data].

Parallel execution improves processing for:
• Large table scans and joins [where size does matter]
• Creation of large indexes
• Partitioned index scans
• Bulk inserts, updates, and deletes
• Aggregations and copying
Are you ready to parallelize?
Parallelism can be exploited if there is:
 A symmetric multi-processor (SMP), cluster, or Massively Parallel Processing (MPP) system, AND
 Sufficient I/O bandwidth, AND
 Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%), AND
 Sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers [intermediate results].

Word of caution
Parallelism can reduce system performance on over-utilized systems or systems with small I/O bandwidth.
Scalability: Size is NOT everything
Scalability also depends on:
• Number of concurrent users [I/O bottleneck]
• Index usage: hash-based, B-tree, multiple, bitmapped
• Complexity of the data model
• Complexity of the technique: simple table retrieval, moderate-complexity joins, tendency analysis, clustering
[Figure: data warehousing sits at the high end of all of these dimensions.]
Scalability: Speed-Up & Scale-Up
• Speed-Up: more resources means proportionally less time for a given amount of data.
[Figure: transactions/sec vs. degree of parallelism; the real curve falls below the ideal linear one.]
• Scale-Up: if resources are increased in proportion to the increase in data size, time is constant.
[Figure: secs/transaction vs. degree of parallelism; ideally a flat line.]
Quantifying Speed-Up

Speedup = Ts / Tm

where Ts is the time on a serial processor and Tm is the time on multiple processors.

Example: sequential execution of Task-1, Task-2, and Task-3 takes 18 time units. Ideal parallel execution, including the control work ("overhead") of coordinating the three tasks, takes 6 time units.

Speedup = 18 / 6 = 300%
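The ratio above can be sketched as a one-line helper (a minimal illustration of the worked example; `speedup` is a hypothetical name, not from the slides):

```python
def speedup(ts, tm):
    # Speedup = Ts / Tm: time on a serial processor over time on
    # multiple processors.
    return ts / tm

# Three tasks: 18 time units sequentially vs. 6 units in ideal parallel.
print(f"{speedup(18, 6):.0%}")  # 300%
```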
Speed-Up & Amdahl's Law
Amdahl's Law reveals the maximum expected speedup from a parallel algorithm, given the proportion of the task that must be computed sequentially. It gives the speedup S as

S = 1 / (f + (1 - f) / N)

where f is the fraction of the problem that must be computed sequentially and N is the number of processors.
As f approaches 0, S approaches N; otherwise speedup and processor count are not in a 1:1 ratio.
Example-1: f = 5% and N = 100, then S ≈ 16.8
Example-2: f = 10% and N = 200, then S ≈ 9.57
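The law can be checked directly in code (a minimal sketch; `amdahl` is a hypothetical helper name):

```python
def amdahl(f, n):
    # Amdahl's Law: speedup with sequential fraction f on n processors.
    return 1.0 / (f + (1.0 - f) / n)

print(round(amdahl(0.05, 100), 1))  # Example-1: 16.8
print(round(amdahl(0.10, 200), 2))  # Example-2: 9.57
```

Note how poorly it scales: doubling the processors from 100 to 200 while doubling the sequential fraction cuts the speedup almost in half.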
Parallelization: OLTP vs. DSS
There is a big difference.
• OLTP [how many transactions per second can be done, or how many parallel users can be entertained per second]: parallelization of MULTIPLE queries, or batch updates run in parallel.
• DSS: the number of users is few, but the goal is parallelization of a SINGLE query [complex joins and scans broken up and run in parallel].
Brief Intro to Parallel Processing
• Parallel Hardware Architectures
• Symmetric Multi-Processing (SMP)
• Distributed Memory or Massively Parallel Processing (MPP)
• Non-Uniform Memory Access (NUMA)

• Parallel Software Architectures
• Shared Memory
• Shared Disk (Shared Everything)
• Shared Nothing

• Types of Parallelism
• Data Parallelism
• Spatial Parallelism
Symmetric Multi-Processing (SMP)
• A number of independent I/O units and processors, all sharing access to a single large memory space.

[Figure: I/O units and processors P1-P4 all connected to one main memory.]
Distributed Memory Machines (MPP)
• Composed of a number of self-contained, self-controlled nodes connected through a network interface.
• Each node contains its own processor, memory, and I/O.
• The architecture is better known as Massively Parallel Processing (MPP) or cluster computing.
• Memory is distributed across all nodes.

[Figure: nodes, each with its own I/O, processor, and memory, connected by a bus, switch, or network.]
• Software Architecture

Shared Disk RDBMS Architecture
Multiple nodes share the disks: logically, all the storage appears as one unit. Every processor can modify data on the shared disks through the central interconnect, which means two or more processors can modify the same data at the same time. How is this resolved? Locking {hardware + software}.

Advantages:
• High level of fault tolerance; but if data is cached in multiple places, coherence issues can arise.

Disadvantages:
• Serialization due to locking (a slow process) slows down throughput.
• The interconnect switch can become a bottleneck.

[Figure: clients/users connected to nodes that access the disks over a shared disk interconnect.]
Shared Nothing RDBMS Architecture
Advantages:
• Data ownership changes infrequently.
• There is no locking: no coherence issues, no overlapping memory.

Disadvantages:
• Data availability is low on failure.
• Data distribution must be chosen very carefully [if it does not match the query workload, the only solution is redistribution].
• Redistribution is expensive.

[Figure: clients/users connected to independent nodes, each owning its own memory and disks.]
Data Parallelism: Concept
• Parallel execution of a single data manipulation task across multiple partitions of data.
• Partitions can be static or dynamic.
• Tasks are executed almost independently across partitions [i.e. very little serialization].
• A "query coordinator" must coordinate between the independently executing processes.
Data Parallelism: Example
All query servers work in parallel on the query:

SELECT COUNT(*)
FROM Emp
WHERE age > 50
AND sal > 10,000;

[Figure: the Emp table is split into k partitions; Query Server-1 counts 62 rows in Partition-1, Query Server-2 counts 440 rows in Partition-2, ..., Query Server-k counts 1,123 rows in Partition-k; the query coordinator sums the partial counts.]

Ans = 62 + 440 + ... + 1,123 = 99,000
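The partition-and-sum pattern above can be sketched in Python (a minimal illustration: the round-robin partitioning, the row layout, and the use of threads in place of separate query-server processes are all assumptions, not from the slides):

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(partition):
    # Query server: scan one partition and count qualifying rows.
    return sum(1 for (age, sal) in partition if age > 50 and sal > 10000)

def parallel_count(rows, k):
    # Query coordinator: split rows into k partitions (round-robin here),
    # run one "query server" per partition, then sum the partial counts.
    partitions = [rows[i::k] for i in range(k)]
    with ThreadPoolExecutor(max_workers=k) as pool:
        return sum(pool.map(count_matches, partitions))

rows = [(60, 20000), (40, 30000), (55, 5000), (70, 15000), (51, 10001)]
print(parallel_count(rows, k=2))  # 3 qualifying rows
```

A real warehouse would run the servers on separate processes or nodes against physically partitioned storage; the coordinator's job, summing partial counts, is the same.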
Data Parallelism: Ensuring Speed-Up
To get a speed-up of N with N partitions, it must be ensured that:
• There are enough computing resources.
• The query coordinator is very fast compared to the query servers; otherwise it creates a bottleneck.
• The work done in each partition is almost the same, to avoid performance bottlenecks.
• The same number of records in each partition would not be enough: the records must be uniformly distributed w.r.t. the filter criterion across partitions.
Temporal Parallelism (Pipelining)
Involves taking a complex task and breaking it down into independent subtasks for parallel execution on a stream of data inputs.

[Figure: a task of execution time T is split into a 3-stage pipeline, each stage taking T/3.]
Pipelining: Time Chart
[Figure: time chart of the 3-stage pipeline from T=0 to T=3; a new input enters each T/3, the stages overlap, and once the pipeline is full, one result completes every T/3 time units.]
Pipelining: Speed-Up Calculation
• Time for sequential execution of 1 task = T
• Time for sequential execution of N tasks = N × T
• (Ideal) time for pipelined execution of one task using an M-stage pipeline = T
• (Ideal) time for pipelined execution of N tasks using an M-stage pipeline = T + (N - 1) × (T / M)

Speed-up (S) = (N × T) / (T + (N - 1) × (T / M))

Pipeline parallelism focuses on increasing the throughput of task execution, NOT on decreasing sub-task execution time.
Pipelining: Speed-Up Example
Example: bottling soft drinks in a factory using a 3-stage pipeline (fill bottle, seal bottle, label bottle).

• 10 crate-loads of bottles
Sequential execution = 10 × T
Pipelined execution = T + T × (10 - 1)/3 = 4 × T
Speed-up = 2.50

• 20 crate-loads of bottles
Sequential execution = 20 × T
Pipelined execution = T + T × (20 - 1)/3 ≈ 7.33 × T
Speed-up ≈ 2.73

• 40 crate-loads of bottles
Sequential execution = 40 × T
Pipelined execution = T + T × (40 - 1)/3 = 14 × T
Speed-up ≈ 2.86
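The formula and the three crate examples above can be checked with a small function (a minimal sketch; `pipeline_speedup` is a hypothetical name, with T factored out by setting T = 1):

```python
def pipeline_speedup(n, m):
    # Sequential time n*T vs. ideal pipelined time T + (n-1)*T/m,
    # for n tasks on an m-stage pipeline (with T = 1).
    return n / (1 + (n - 1) / m)

for n in (10, 20, 40):
    print(f"{n} crates: speed-up = {pipeline_speedup(n, 3):.2f}")
```

Doubling the input from 10 to 40 crates barely moves the speed-up, which is the asymptotic behaviour discussed on the next slide.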
Pipelining: Input vs. Speed-Up
[Figure: speed-up S vs. number of inputs N (1 to 19) for a 3-stage pipeline; the curve rises from 1 toward 3.]

• The asymptotic limit on speed-up for an M-stage pipeline is M.
• The speed-up will NEVER reach M, as initially filling the pipeline takes T time units.
Pipelining: Limitations
• Some relational operators do not produce their first output until they have consumed all their inputs. Aggregate and sort operators have this property; one cannot pipeline these operators.
• Often, the execution cost of one operator is much greater than the others, hence skew, e.g. Sum() or Count() vs. Group-By() or Join.
Quiz 1
Marks: 10 + 10    Time: 30 min

• Question 1: (10 Marks) CLO 1
Define a few applications of a Data Warehouse.
• Question 2: (10 Marks) CLO 1
What are the types of physical data extraction?
