Distributed Computing


What is parallelism? Explain the level of parallelism.

Parallelism is a computer science and computational concept that refers to the
simultaneous execution of multiple tasks, processes, or operations to improve
computational efficiency and overall performance. Instead of performing tasks
sequentially, parallelism allows multiple tasks to be executed at the same time,
either by breaking a single task into smaller sub-tasks or by performing
unrelated tasks concurrently.

The levels of parallelism are:

1. Instruction-Level Parallelism (ILP): This level of parallelism occurs within
a single processor or core. It involves the execution of multiple machine
instructions at the same time, often through techniques like pipelining
and superscalar architectures.
2. Thread-Level Parallelism (TLP): TLP involves running multiple threads or
processes simultaneously on a multi-core processor or multiple
processors. Each thread can execute different parts of a program,
allowing for parallel computation.
3. Data-Level Parallelism (DLP): In DLP, data is divided into smaller chunks,
and operations on these data chunks are performed concurrently. This is
common in vector processing and SIMD (Single Instruction, Multiple
Data) architectures.
4. Task-Level Parallelism: Task-level parallelism is the concurrent
execution of distinct tasks or processes, which can be either related or
unrelated, to achieve a specific goal. This is often used in distributed
computing and parallel computing environments.
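As an illustration of thread-level parallelism, the sketch below splits one summation task into sub-tasks that run on a pool of worker threads; the function names and chunking scheme are assumptions for the example, not from the text above:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker thread handles one sub-task: summing its own chunk.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Break the task into one chunk per worker, run them concurrently,
    # then combine the partial results.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

For CPU-bound work in CPython, processes (multiprocessing) would typically replace threads because of the global interpreter lock; the structure stays the same.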

Parallelism is used to harness the processing power of modern computer
systems efficiently, as it can lead to significant performance improvements and
reduced execution times for tasks that can be divided and executed
concurrently. It's a fundamental concept in various fields, including high-
performance computing, scientific simulations, graphics processing, and
distributed systems. However, implementing parallelism effectively often
requires careful consideration of synchronization, load balancing, and
communication between parallel units to ensure that tasks are coordinated
and data consistency is maintained.
What is Data and Memory Level Parallelism?

Data level parallelism and memory level parallelism are two concepts related
to parallel computing. They involve exploiting parallelism in data operations
and memory access, respectively, to improve the performance of a computing
system.

Let's define each concept in more detail:

1. Data Level Parallelism (DLP):

Data level parallelism refers to the ability to perform multiple operations on
different data elements simultaneously. It involves breaking down a
computational task into smaller subtasks that can be executed independently
and in parallel. These subtasks operate on different data elements
concurrently, taking advantage of SIMD (Single Instruction, Multiple Data)
architectures or vector processing units.

DLP can be found in various scenarios, such as:

- Vectorized operations: SIMD instructions enable performing the same
operation on multiple data elements simultaneously. For example, adding two
arrays of numbers element-wise using vector instructions.

- Graphics processing: GPUs (Graphics Processing Units) leverage data
parallelism to process large amounts of data simultaneously, such as rendering
multiple pixels in parallel.
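The vectorized-operation idea can be sketched with NumPy, whose element-wise operations map onto SIMD hardware where available; the arrays here are illustrative:

```python
import numpy as np

a = np.arange(8)       # [0, 1, ..., 7]
b = np.arange(8, 16)   # [8, 9, ..., 15]

# One vectorized call applies the same addition to every element pair at once,
vectorized = a + b
# which is equivalent to this explicit sequential element-wise loop.
sequential = np.array([a[i] + b[i] for i in range(len(a))])
```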

2. Memory Level Parallelism (MLP):

Memory level parallelism focuses on exploiting parallelism in memory access
operations. It aims to optimize the use of memory resources and minimize
memory access latencies by performing multiple memory operations
concurrently.

MLP can be achieved through techniques like:

- Memory-level parallel instructions: Modern processors often incorporate
instructions that allow multiple memory accesses to be issued simultaneously.
These instructions enable overlapping memory operations with computation,
improving overall efficiency.

- Caches and prefetching: By using cache memories and prefetching
mechanisms, processors can anticipate future memory accesses and bring data
into the cache ahead of time, reducing memory access latencies. This allows
for more effective parallelism between computation and memory access.

Both data level parallelism and memory level parallelism are crucial for achieving
high-performance computing. They aim to maximize resource utilization and
exploit parallelism at the data and memory levels to speed up computations.

Explain Uniprocessor architecture. Describe parallelism and pipelining within CPU.

1. Uniprocessor Architecture:

A uniprocessor is a system with a single processor, which has three major
components: the main memory (the central storage unit), the central
processing unit (CPU), and an input/output unit (monitor, keyboard,
mouse, etc.). A uniprocessor is a computer system type that has only one central
processing unit, which means that only one instruction is executed at a
time. All tasks and operations are handled by a single processor. This type of
processor is found in personal computers, mobile devices, and small
embedded systems. These systems have limited computing power as they can
execute only one instruction at a time. They are suitable
for common computing tasks such as web browsing, word processing,
and basic gaming.

2. Parallelism and Pipelining within CPU:

In pipelining, several functional units work in sequence to implement a
single computation. These functional units form an assembly line, or pipeline.
Each functional unit performs a specific phase of the computation, and each
computation goes through the entire pipeline.

If there is only a single computation to be executed, the pipeline cannot
extract any parallelism. But when the same computation is to be performed
multiple times, these computations can be overlapped through the functional
units.
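This overlap can be captured in a simple cycle-count model (an assumed sketch, not from the text): a k-stage pipeline completes n computations in k + (n - 1) cycles, since after the first result emerges, one more completes every cycle.

```python
def pipeline_cycles(stages, items):
    # The first computation takes `stages` cycles to flow through the pipeline;
    # every subsequent computation completes one cycle after the previous one.
    return stages + (items - 1)

def sequential_cycles(stages, items):
    # Without overlap, each computation occupies all stages before the next begins.
    return stages * items
```

With a 4-stage pipeline and 10 computations, the pipelined version needs 13 cycles against 40 cycles sequentially.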
Parallel adders can be implemented using techniques such as carry-lookahead
and carry-save. A parallel adder is a digital circuit that adds two binary
numbers by operating on corresponding pairs of bits in parallel, rather than
one bit at a time.

Define parallelism. Explain fine-grained parallelism and medium-grained parallelism.

Parallelism in distributed computing refers to the simultaneous execution of
multiple tasks or processes across distributed nodes or processors, enabling
faster and more efficient computation by breaking down tasks into smaller,
concurrent units.

1. Fine-grain Parallelism
Fine-grain parallelism refers to a level of parallelism in computing where
tasks or operations are divided into very small, fine-grained sub-tasks
that can be executed concurrently. This often involves breaking down a
program or computation into tiny units, which can be executed in
parallel, sometimes even at the level of individual instructions or data
elements. Fine-grain parallelism is typically used in situations where very
high levels of parallelism are required, such as in multi-core processors,
GPU (Graphics Processing Unit) programming, or highly parallel
algorithms.
Example:
Assume there are 100 processors responsible for processing a
10*10 image. Ignoring the communication overhead, the 100 processors
can process the 10*10 image in 1 clock cycle: each processor works
on 1 pixel of the image and then communicates its output to the other
processors. This is an example of fine-grained parallelism.
 In fine-grained parallelism, the program is divided into a large
number of small tasks.
 These tasks are assigned individually to many processors.
 The amount of work associated with a parallel task is low, and the
work is evenly distributed among the processors.
 Therefore, fine-grained parallelism facilitates load balancing.
 Because each task processes little data, the number of processors
required to perform the full computation is high.
 This, in turn, increases communication and synchronization overhead.
 Fine-grained parallelism is best suited to architectures that support
fast communication.
2. Coarse-grain Parallelism
Coarse-grain parallelism refers to a level of parallelism in computing
where tasks or operations are divided into relatively larger, coarse-
grained sub-tasks that can be executed concurrently. In contrast to fine-
grain parallelism, where tasks are broken down into very small units,
coarse-grain parallelism involves larger chunks of work that are
distributed among different processing units. This approach is often
used when managing fine-grain parallelism becomes impractical or when
the overhead of coordinating fine-grained tasks outweighs the benefits.
Coarse-grain parallelism is common in distributed computing, multi-
threaded programming, and high-level parallel algorithms where tasks
are larger and more substantial.
Example:
Further, if we reduce the processors to 2, the processing will take
50 clock cycles. Each processor needs to process 50 elements, which
increases the computation time, but the communication overhead
decreases as fewer processors share data. This
case illustrates coarse-grained parallelism.
 In coarse-grained parallelism, the program is broken down into major
tasks.
 Consequently, a large amount of computation occurs in each processor.
 This can lead to load imbalance, in which certain processors handle the
bulk of the data while others may be idle.
 Further, coarse-grained parallelism fails to exploit much of the
parallelism in the program, as most of the computation is executed
sequentially on a processor.
 The advantage of this type of parallelism is low communication and
synchronization overhead.
 Message-passing architectures take a long time to transfer data between
processes, making them suitable for coarse-grained parallelism.
 The Cray Y-MP is an example of a coarse-grained parallel computer,
with a grain size of about 20s.
3. Medium Grained Parallelism
Medium-grained parallelism, as the name suggests, falls between fine-
grain parallelism and coarse-grain parallelism in terms of the size of
tasks or operations that are parallelized. In medium-grained parallelism,
tasks are broken down into moderately-sized sub-tasks that can be
executed concurrently. This level of granularity aims to strike a balance
between the high level of control and coordination required for fine-
grain parallelism and the reduced parallelism efficiency of coarse-grain
parallelism.
Medium-grained parallelism is often used in scenarios where tasks are of
intermediate complexity and can be divided into chunks that are
substantial enough to make parallelization worthwhile but not so small
that the overhead of managing parallel tasks becomes a significant
performance bottleneck. It provides a compromise between maximizing
parallelism and minimizing coordination overhead, making it suitable for
various parallel computing applications.
 Medium-grained parallelism is defined relative to fine-grained and
coarse-grained parallelism: it is a compromise between the two,
where the task size and communication time are greater than in
fine-grained parallelism and lower than in coarse-grained parallelism.
 Most general-purpose parallel computers fall in this category.
 Example:
Consider a 10*10 image that needs to be processed, given that the
processing of the 100 pixels is independent of each other.
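The cycle counts used in the image examples above (100 processors finishing in 1 cycle, 2 processors in 50) follow from a simple model, sketched here with communication overhead ignored as in the text; the helper name is an assumption:

```python
import math

def processing_cycles(pixels, processors):
    # Each processor handles ceil(pixels / processors) pixels,
    # processing one pixel per clock cycle.
    return math.ceil(pixels / processors)
```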
Illustrate Amdahl’s law and Gustafson’s law.

1. Amdahl’s Law
Amdahl’s Law was named after Gene Amdahl, who presented it in 1967.
In general terms, Amdahl’s Law states that in parallelization, if P is the
proportion of a system or program that can be made parallel, and 1-P is
the proportion that remains serial, then the maximum speedup S(N) that
can be achieved using N processors is:
S(N) = 1 / ((1 - P) + P/N)
As N grows, the speedup tends to 1/(1 - P).
Speedup is limited by the total time needed for the sequential (serial)
part of the program. For 10 hours of computing, if 9 hours can be
parallelized and 1 hour cannot, then the maximum speedup is limited to
10 times as fast, no matter how many processors are used.
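As a numeric sketch of the formula (the helper name is an assumption):

```python
def amdahl_speedup(p, n):
    # p: parallelizable fraction of the program, n: number of processors.
    return 1.0 / ((1.0 - p) + p / n)

# With P = 0.9 (the 9-hours-out-of-10 example), the speedup approaches
# but never reaches 1 / (1 - 0.9) = 10, however large n becomes.
```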
2. Gustafson’s Law
This law says that increasing the problem size on large machines can retain
scalability with respect to the number of processors.
American computer scientist and businessman John L. Gustafson (born
January 19, 1955) found that practical problems show much better
speedup than Amdahl predicted.
Gustafson’s law: the computation time is held constant (instead of the
problem size); an increasing number of CPUs solves a bigger problem and
gets better results in the same time.
 The execution time of the program on the parallel computer is (a + b),
where a is the sequential time and b is the parallel time.
 The total amount of work to be done in parallel varies linearly with
the number of processors p, so b is fixed as p is varied.
 The equivalent serial run time is (a + p*b), so the speedup is
S = (a + p*b)/(a + b).
 Define α = a/(a + b), the sequential fraction of the execution time.
Then any sufficiently large problem can be efficiently parallelized
with a speedup S = α + p(1 - α), where p is the number of processors
and α is the serial portion of the problem.
 Gustafson proposed a fixed-time concept, which leads to scaled
speedup for larger problem sizes.
 Basically, we use larger systems with more processors to solve
larger problems.
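A minimal sketch of Gustafson's scaled speedup (helper name assumed); S = α + p(1 - α) equals (a + p·b)/(a + b) when α = a/(a + b):

```python
def gustafson_speedup(alpha, p):
    # alpha: serial fraction of the parallel execution time,
    # p: number of processors.
    return alpha + p * (1.0 - alpha)
```

With a 10% serial fraction on 10 processors, the scaled speedup is 9.1.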

Explain UMA AND NUMA models of multiprocessor system.

UMA

UMA (Uniform Memory Access) is a shared memory architecture for
multiprocessors. In this model, a single memory is used and accessed by all
the processors present in the multiprocessor system with the help of an
interconnection network. Each processor has the same memory access time
(latency) and access speed. It can employ either a single bus or multiple
buses. As it provides balanced shared memory access, it is also known as an
SMP (Symmetric Multiprocessor) system.

It is suitable for general-purpose and time-sharing applications. It has limited
available bandwidth to the memory.

NUMA

NUMA (Non-uniform Memory Access) is also a multiprocessor model, in which
each processor is connected to dedicated memory. However, these small
memories combine to form a single address space.

The main point to consider here is that, unlike UMA, the memory access time
depends on the distance at which the processor is placed, which means
varying memory access times. It allows access to any memory location
using its physical address.

It is used for real-time and time-critical applications. It has high available
bandwidth to the memory.

Flynn's Taxonomy

M.J. Flynn proposed a classification based on the organization of the
instruction and data streams that are manipulated simultaneously, i.e., how
many instructions are executed at a time and upon how many data items.

The normal operation of a computer is to fetch instructions from memory and
execute them in the processor:

 fetch means the sequence of instructions read from memory
(instruction stream)

 execute means the operations performed on data in the processor, such as
inputs, outputs, and temporary results (data stream)

Parallel processing can occur in instruction stream or data stream or on both.

Flynn's classification divides computers into four major groups that are:

1. Single instruction stream, single data stream (SISD)

2. Single instruction stream, multiple data stream (SIMD)

3. Multiple instruction stream, single data stream (MISD)

4. Multiple instruction stream, multiple data stream (MIMD)


SISD (Single Instruction Single Data Stream)

 Single instruction: only one instruction stream is acted on or
executed by the CPU during one clock cycle.
 Single data stream: only one data stream is used as input during one
clock cycle.
 A classical von Neumann computer comes under this category.
 A single computer containing a control unit and a memory unit.
 Instructions are executed sequentially.
 The system may or may not have internal parallel processing capabilities.
 Parallel processing can be achieved by pipelining or multiple functional
units.
 For example: IBM 370 computers.

Single instruction stream, multiple data stream (SIMD)

It represents a computer organization that includes many processing units
(arithmetic and logic units) under the supervision of a common control unit.
Since it has multiple data items to process, it requires many processing units;
but since there is a single instruction stream, only one control unit is needed.

All processors receive the same instruction from the control unit but operate
on different items of data. It is also known as a vector or array processor
machine. For example: ILLIAC-IV.

Multiple instruction stream, single data stream (MISD)

 It is a theoretical approach.
 An MISD computer is a multiprocessor machine capable of executing
different instructions on its processing elements, but all of them operating
on the same data set.

Multiple instruction stream, multiple data stream (MIMD)

 A MIMD system is a multiprocessor machine that is capable of executing
multiple instructions over multiple data streams.
 It is capable of processing multiple programs at the same time.
 It may be a multiprocessor or multicomputer system.
 If the degree of interaction among the processors is high, the system is
known as tightly coupled; otherwise it is loosely coupled. For example:
the Cray-2 computer.
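A small Python sketch of the MIMD idea: two different instruction streams (functions) run concurrently on two different data streams; the functions are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(text):
    # Instruction stream 1 acting on data stream 1.
    return len(text.split())

def sum_squares(nums):
    # Instruction stream 2 acting on data stream 2.
    return sum(n * n for n in nums)

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(count_words, "multiple instruction multiple data")
    f2 = pool.submit(sum_squares, [1, 2, 3])
    results = (f1.result(), f2.result())
```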

Feng’s classification

The classification is based on the way contents stored in memory are
processed. The contents can be either data or instructions.

Classification is based on:

1. Word Serial and Bit Serial (WSBS)

2. Word Parallel and Bit Serial (WPBS)

3. Word Serial and Bit Parallel (WSBP)

4. Word Parallel and Bit Parallel (WPBP)

 Maximum parallelism: the maximum number of binary digits that can be
processed within a unit time by a computer (degree of parallelism).
 The classification is based on word length and bit-slice length.
 Word length means the number of bits in a word, generally 8, 16, or 32,
depending on the processor. Every word has a unique address.
 Bit-slice length means the number of bits, one taken from each word, at
the same vertical position.

Word Serial and Bit Serial (WSBS)

One bit of one selected word is processed at a time. This represents serial
processing and needs maximum processing time.

Word Serial and Bit Parallel (WSBP)

It is found in most existing computers and has been called word-slice
processing because one word of n bits is processed at a time: all bits of a
selected word are processed together. Bit parallel means all bits of a word.
Word Parallel and Bit Serial (WPBS)

It has been called bit-slice processing because an m-bit slice is processed at a
time. Word parallel signifies selection of all words: one bit from every word
is processed at a time.

Word Parallel and Bit Parallel (WPBP)

It is known as fully parallel processing, in which an array of n × m bits is
processed at one time. Maximum parallelism is achieved here.
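The four classes can be summarized by Feng's degree-of-parallelism product of bits-per-word processed in parallel and words processed in parallel; the word and slice lengths below are example values:

```python
def max_parallelism(bits_per_word, words_in_parallel):
    # Feng's maximum degree of parallelism: binary digits processed per unit time.
    return bits_per_word * words_in_parallel

wsbs = max_parallelism(1, 1)    # one bit of one word at a time
wsbp = max_parallelism(32, 1)   # all 32 bits of one word (word slice)
wpbs = max_parallelism(1, 16)   # one bit from each of 16 words (bit slice)
wpbp = max_parallelism(32, 16)  # the full 32 x 16 bit array
```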

UMA vs NUMA:

 UMA offers limited bandwidth; NUMA offers relatively more bandwidth
than UMA.
 UMA uses a single memory controller; NUMA uses multiple memory
controllers.
 UMA is slower than NUMA; NUMA is faster than UMA.
 In UMA, memory access time is equal (balanced); in NUMA, memory
access time is unequal.
 UMA is mainly used in time-sharing and general-purpose applications;
NUMA offers better performance and speed, making it suitable for
real-time and time-critical applications.
 UMA has lower communication overhead; NUMA has higher communication
overhead.
 UMA has a simpler hardware design; NUMA has a more complex hardware
design.
 UMA is easier to build and maintain; NUMA is complex to manage because
of non-uniform access.
 UMA uses the shared memory programming model with a uniform-access
API; NUMA may require explicit data placement and NUMA-aware
programming.
 UMA is generally more cost-effective for smaller systems; NUMA is
costlier due to its complex memory architecture and interconnect.
RISC vs CISC:

 RISC stands for Reduced Instruction Set Computer; CISC stands for
Complex Instruction Set Computer.
 RISC has simple instruction decoding; CISC has complex instruction
decoding.
 Pipelining is simple to use in RISC; pipelining is difficult to use in
CISC.
 The execution time of RISC instructions is very short; the execution
time of CISC instructions is longer.
 RISC devotes more transistors to memory registers; CISC uses
transistors to store complex instructions.
 RISC has fixed-format instructions; CISC has variable-format
instructions.
 RISC emphasizes software to optimize the instruction set; CISC
emphasizes hardware to optimize the instruction set.
 RISC uses a hard-wired control unit; CISC uses a microprogrammed
control unit.
 RISC requires multiple register sets to store instructions; CISC
requires a single register set.
 RISC programs take more space in memory; CISC programs take less space
in memory.
 RISC examples: ARM, PA-RISC, AVR, ARC, and SPARC. CISC examples: VAX,
Motorola 68000 family, System/360, AMD and Intel x86.
