Parallel Algorithm - Introduction
PARALLEL
ALGORITHMS
CS 1406 - ACOA
• What is an Algorithm?
- An algorithm is a sequence of instructions followed to solve a
problem. While designing an algorithm, we should consider the
architecture of the computer on which it will be executed. Based on
architecture, there are two types of computers −
• Sequential Computer
• Parallel Computer
• Depending on the architecture of computers, we have two types of
algorithms −
• Sequential Algorithm − An algorithm in which consecutive steps of
instructions are executed in chronological order to solve a problem.
• Parallel Algorithm − The problem is divided into sub-problems that are
executed in parallel to produce individual outputs. Later on, these
individual outputs are combined to form the final desired output.
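The divide/compute/combine structure of a parallel algorithm can be sketched in Python. This is a minimal sketch with a thread pool; the function names are ours, and CPython threads illustrate the structure rather than true CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Sub-problem: sum one slice of the data.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Divide: split the input into roughly equal chunks.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Conquer: each worker solves its sub-problem independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Combine: merge the individual outputs into the final result.
    return sum(partials)

print(parallel_sum(list(range(1, 101))))  # 5050
```

The same three-phase shape (divide, solve in parallel, combine) underlies most of the parallel algorithms discussed later.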
To design an algorithm properly, we must have a clear idea of the
basic model of computation in a parallel computer.
• Model of Computation:
SIMD Computers
• Here, a single control unit sends instructions to all processing units. During
computation, at each step, all the processors receive a single set of instructions
from the control unit and operate on different sets of data from the memory unit.
• Each of the processing units has its own local memory unit to store both data and
instructions. In SIMD computers, processors need to communicate among
themselves. This is done through shared memory or an interconnection network.
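The SIMD idea of one broadcast instruction applied in lockstep to many data elements can be sketched as follows (the `broadcast` helper is ours, standing in for the control unit):

```python
# Each "processing unit" holds its own local data element; a single
# control unit broadcasts one instruction per step, and every unit
# applies that same instruction to its own element in lockstep.
local_data = [1, 2, 3, 4]          # one element per processing unit

def broadcast(instruction, data):
    # SIMD step: the same instruction runs on every unit's data.
    return [instruction(x) for x in data]

step1 = broadcast(lambda x: x * 2, local_data)   # all units multiply
step2 = broadcast(lambda x: x + 1, step1)        # all units add
print(step2)  # [3, 5, 7, 9]
```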
MISD Computers
• As the name suggests, MISD computers contain multiple control units,
multiple processing units, and one common memory unit.
• Here, each processor has its own control unit, and they share a common
memory unit. All the processors get instructions individually from their own
control units and operate on a single stream of data as per the instructions
they have received. These processors operate simultaneously.
MIMD Computers
• MIMD computers have multiple control units, multiple processing units, and
a shared memory or interconnection network.
• Here, each processor has its own control unit, local memory unit, and
arithmetic and logic unit. They receive different sets of instructions from
their respective control units and operate on different sets of data.
• An MIMD computer that shares a common memory is
known as a multiprocessor, while one that uses an
interconnection network is known as a multicomputer.
• Based on the physical distance of the processors,
multicomputers are of two types −
– Multicomputer − When all the processors are very
close to one another (e.g., in the same room).
– Distributed system − When the processors are far
away from one another (e.g., in different cities).
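In the MIMD model each processor runs its own instruction stream on its own data. A minimal sketch with threads as "processors" (the shared `results` dict plays the role of the common memory of a multiprocessor):

```python
import threading

results = {}   # shared memory: each "processor" writes its own output

def worker(name, instructions, data):
    # Each MIMD processor runs its OWN instruction stream on its OWN data.
    acc = data
    for op in instructions:
        acc = op(acc)
    results[name] = acc

t1 = threading.Thread(target=worker,
                      args=("p1", [lambda x: x + 10, lambda x: x * 2], 1))
t2 = threading.Thread(target=worker,
                      args=("p2", [lambda x: x ** 2], 5))
t1.start(); t2.start()
t1.join(); t2.join()
print(results["p1"], results["p2"])  # 22 25
```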
Parallel Algorithm - Models
• The model of a parallel algorithm is developed by considering a
strategy for dividing the data and the processing method, and by
applying a suitable strategy to reduce interactions. This
chapter discusses several such parallel algorithm models.
• Tasks may be available at the beginning, or may be generated dynamically. If
tasks are generated dynamically and assigned in a decentralized manner, then a
termination detection algorithm is required so that all the processes can actually
detect the completion of the entire program and stop looking for more tasks.
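A simple termination scheme for dynamically generated tasks can be sketched with a task queue: `queue.Queue.join()` detects the point when every task, including ones generated mid-run, has been processed, after which sentinel values tell the workers to stop. The task shape here (squaring a number and spawning its predecessor) is made up for illustration.

```python
import queue
import threading

tasks = queue.Queue()
processed = []

def worker():
    while True:
        item = tasks.get()
        if item is None:                 # sentinel: termination signal
            tasks.task_done()
            break
        processed.append(item * item)
        if item > 1:                     # dynamically generate a new task
            tasks.put(item - 1)
        tasks.task_done()

for n in (3, 4):                         # initial tasks
    tasks.put(n)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

tasks.join()                             # all tasks (incl. generated) done
for _ in threads:
    tasks.put(None)                      # broadcast termination to workers
for t in threads:
    t.join()
print(sorted(processed))  # [1, 1, 4, 4, 9, 9, 16]
```

`tasks.join()` only returns once the unfinished-task count reaches zero, which works here because each worker enqueues any new task before calling `task_done()` on the current one.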
• A process includes:
– program counter
– stack
– data section
(Figure: Process in Memory)
(Figure: Process State Diagram)
• Most processes exhibit a typical execution pattern: an alternating
cycle of CPU bursts and I/O bursts (CPU burst → I/O burst → CPU burst → …).
• In an I/O burst:
– The process may be waiting for input from the user, waiting to read the
contents of a file from secondary storage, waiting for input from an input
device, etc.
– It may produce output, continuously or at some time interval, to an
output device, etc.
(Figure: process state transitions: READY and RUNNING, with an interrupt
returning a RUNNING process to READY)
(Figure: resource-allocation graph: R1 is held by P1, which is requesting R2,
while R2 is held by P2, which is requesting R1, forming a circular wait)
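The resource-allocation scenario above (P1 holds R1 and requests R2; P2 holds R2 and requests R1) can be reduced to a wait-for graph, where a cycle means deadlock. A minimal sketch:

```python
# Wait-for graph for the scenario above: P1 waits for P2 (which holds R2),
# and P2 waits for P1 (which holds R1).
wait_for = {"P1": "P2", "P2": "P1"}   # edge: process -> process it waits on

def has_deadlock(graph):
    # Follow wait-for edges from each process; revisiting a node on the
    # same walk means there is a cycle, i.e., a deadlock.
    for start in graph:
        seen, node = set(), start
        while node in graph:
            if node in seen:
                return True
            seen.add(node)
            node = graph[node]
    return False

print(has_deadlock(wait_for))  # True
```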
WAYS TO DEAL WITH DEADLOCK
• DEADLOCK PREVENTION
• DEADLOCK DETECTION
• DEADLOCK CORRECTION
DEADLOCK PREVENTION
• Deadlock prevention works by preventing one of the four
Coffman conditions (mutual exclusion, hold and wait, no preemption,
circular wait) from occurring.
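One common way to prevent the circular-wait condition is to impose a global ordering on resources and require every process to acquire locks in that order. A minimal sketch (resource and function names are illustrative):

```python
import threading

lock_r1 = threading.Lock()
lock_r2 = threading.Lock()
# Global order: r1 before r2. Enforcing this for every thread breaks
# the "circular wait" Coffman condition, so no deadlock cycle can form.
ORDER = {id(lock_r1): 1, id(lock_r2): 2}

def acquire_in_order(*locks):
    # Always acquire in the global order, regardless of the order requested.
    for lock in sorted(locks, key=lambda l: ORDER[id(l)]):
        lock.acquire()

def release_all(*locks):
    for lock in locks:
        lock.release()

log = []

def p1():
    acquire_in_order(lock_r1, lock_r2)   # wants r1 then r2
    log.append("p1")
    release_all(lock_r1, lock_r2)

def p2():
    acquire_in_order(lock_r2, lock_r1)   # wants r2 then r1: order enforced
    log.append("p2")
    release_all(lock_r2, lock_r1)

t1, t2 = threading.Thread(target=p1), threading.Thread(target=p2)
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(log))  # ['p1', 'p2']
```

Without `acquire_in_order`, p1 and p2 requesting the two locks in opposite orders could deadlock exactly like the P1/P2/R1/R2 cycle shown earlier.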
Cache Coherence Protocol
• SOFTWARE BASED
– Software approaches generally make conservative decisions, leading to inefficient cache utilization.
– Compiler-based cache coherence mechanisms perform an analysis of the code to determine which
data items may become unsafe for caching, and they mark those items accordingly. The operating
system or hardware then prevents the noncacheable items from being cached.
– The simplest approach is to prevent any shared data variables from being cached.
– This is too conservative, because a shared data structure may be exclusively used during some periods
and may be effectively read-only during other periods.
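The conservative compiler-style analysis described above can be sketched as a toy rule: any variable written by more than one processor is marked noncacheable. The variable names and write sets here are hypothetical.

```python
# Hypothetical per-variable write sets gathered by a compiler analysis:
# which processors ever write each variable.
writes = {
    "x":    {"p0"},         # written only by p0 -> safe to cache
    "flag": {"p0", "p1"},   # written by several processors -> unsafe
}

def cacheable(var):
    # Conservative rule: a variable is cacheable only if at most one
    # processor ever writes it.
    return len(writes[var]) <= 1

print({v: cacheable(v) for v in writes})  # {'x': True, 'flag': False}
```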
Cache Coherence Protocol
• HARDWARE BASED
– Hardware solutions provide dynamic recognition at run time of potential inconsistency
conditions.
– The problem is only dealt with when it actually arises, leading to improved performance
over a software approach.
– Hardware schemes can be divided into two categories: i) Directory protocols; ii) Snoopy
protocols.
• Directory Protocols:
– Directory protocols collect and maintain information about where copies of lines reside.
– There is a centralized controller that is part of the main memory controller, and a directory
that is stored in main memory.
– The directory contains global state information about the contents of the various local
caches.
– When an individual cache controller makes a request, the centralized controller checks
and issues the necessary commands for data transfer between memory and caches or
between caches themselves.
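The bookkeeping a directory performs can be sketched as a mapping from each memory line to the set of caches holding a copy; a write makes the centralized controller invalidate every other copy. This is an assumed minimal structure, not a full protocol.

```python
# Directory sketch: line -> set of cache ids currently holding a copy.
directory = {}

def read(line, cache_id):
    # A read miss: record that this cache now holds a copy of the line.
    directory.setdefault(line, set()).add(cache_id)

def write(line, cache_id):
    # On a write, the controller invalidates every OTHER cached copy
    # and records the writer as the sole holder of the line.
    holders = directory.setdefault(line, set())
    invalidated = holders - {cache_id}   # invalidate commands to send
    directory[line] = {cache_id}
    return invalidated

read("A", 0); read("A", 1); read("A", 2)   # line A shared by three caches
print(sorted(write("A", 0)))  # [1, 2]  (caches told to invalidate)
print(directory["A"])         # {0}     (writer holds the only copy)
```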
Cache Coherence Protocol
• Snoopy Protocols:
– Snoopy protocols distribute the responsibility for maintaining cache coherence among all of the cache
controllers in a multiprocessor system.
– A cache must recognize when a line that it holds is shared with other caches.
– When an update action is performed on a shared cache line, it must be announced to all other caches
by a broadcast mechanism.
– Each cache controller is able to “snoop” on the network to observe these broadcast
notifications and react accordingly.
– The two basic approaches to snoopy protocols are write-invalidate and write-update (write-broadcast).
Cache Coherence Protocol
(Snoopy Protocols)
Write-invalidate Protocol:
– There can be multiple readers but only one writer at a time.
– Initially, a line may be shared among several caches for reading purposes.
When one of the caches wants to perform a write to the line, it first issues a
notice that invalidates that line in the other caches, making the line
exclusive to the writing cache.
– Once the line is exclusive, the owning processor can make local writes until
some other processor requires the same line.
Write-update (write-broadcast) Protocol:
– There can be multiple writers as well as multiple readers.
– When a processor wishes to update a shared line, the word to be updated is
distributed to all others, and caches containing that line can update it.
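The write-invalidate behavior can be sketched with caches snooping a shared bus: on a write, the writer broadcasts an invalidate and the other holders drop their copies. Single-line caches are assumed here for brevity.

```python
# Write-invalidate snooping sketch: every cache watches a shared "bus".
class Cache:
    def __init__(self):
        self.line = None          # locally cached value, or None

    def load(self, value):
        self.line = value         # read the line into this cache

    def snoop_invalidate(self):
        self.line = None          # observed a broadcast: drop our copy

def bus_write(writer, caches, value):
    # The writer broadcasts an invalidate notice before writing,
    # making the line exclusive to the writing cache.
    for c in caches:
        if c is not writer:
            c.snoop_invalidate()
    writer.line = value

c0, c1, c2 = Cache(), Cache(), Cache()
for c in (c0, c1, c2):
    c.load(7)                     # line initially shared by all readers
bus_write(c0, [c0, c1, c2], 8)
print(c0.line, c1.line, c2.line)  # 8 None None
```

Under write-update, `snoop_invalidate` would instead be a `snoop_update(value)` that keeps all copies and overwrites them with the broadcast word.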
Task Scheduling Approaches in
Multiprocessor System
• Stochastic- concerns scheduling problems involving random attributes,
such as random processing times, random due dates, random weights,
and stochastic machine breakdowns
• Deterministic- all the parameters of the system and of the tasks are
known in advance
• Probable- all the parameters of the system and of the tasks are presumed
(estimated rather than known exactly)
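The deterministic case, where all processing times are known in advance, can be illustrated with a classic list-scheduling heuristic (Longest Processing Time first); the task times below are made up for the example.

```python
import heapq

def lpt_schedule(times, machines):
    # Deterministic scheduling: all processing times known in advance.
    # LPT: take tasks longest-first and give each to the machine that
    # currently has the smallest total load.
    loads = [(0, m) for m in range(machines)]   # (load, machine id)
    heapq.heapify(loads)
    assignment = {m: [] for m in range(machines)}
    for t in sorted(times, reverse=True):
        load, m = heapq.heappop(loads)          # least-loaded machine
        assignment[m].append(t)
        heapq.heappush(loads, (load + t, m))
    makespan = max(sum(ts) for ts in assignment.values())
    return assignment, makespan

tasks = [7, 5, 4, 3, 3, 2]
plan, makespan = lpt_schedule(tasks, 2)
print(makespan)  # 12
```

With 24 total units of work over 2 machines, the makespan of 12 here happens to be optimal; in general LPT is only a heuristic with a bounded worst-case ratio.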