Distributed Computing
Data-level parallelism and memory-level parallelism are two concepts related
to parallel computing. They involve exploiting parallelism in data operations
and in memory access, respectively, to improve the performance of a computing
system.
1. Fine-grain Parallelism
Fine-grain parallelism refers to a level of parallelism in computing where
tasks or operations are divided into very small, fine-grained sub-tasks
that can be executed concurrently. This often involves breaking down a
program or computation into tiny units, which can be executed in
parallel, sometimes even at the level of individual instructions or data
elements. Fine-grain parallelism is typically used in situations where very
high levels of parallelism are required, such as in multi-core processors,
GPU (Graphics Processing Unit) programming, or highly parallel
algorithms.
Example:
Assume there are 100 processors responsible for processing a 10*10 image.
Ignoring the communication overhead, the 100 processors can process the
10*10 image in 1 clock cycle: each processor works on 1 pixel of the image
and then communicates its output to the other processors. This is an example
of fine-grained parallelism.
In fine-grained parallelism, the program is divided into a large
number of small tasks.
These tasks are assigned individually to many processors.
The amount of work associated with each parallel task is low, and the
work is evenly distributed among the processors.
Therefore, fine-grained parallelism facilitates load balancing.
Since each task processes only a small amount of data, a large number of
processors is required to perform the full computation.
This, in turn, increases the communication and synchronization overhead.
Fine-grained parallelism is therefore best exploited on architectures that
support fast communication.
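To make the idea concrete, here is a minimal Python sketch (not part of the original notes) that mimics the 10*10 image example by turning every pixel into its own tiny task. The helper name process_pixel and the use of a thread pool are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch of fine-grained parallelism: one tiny task per pixel.
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 10, 10
image = [[(x + y) % 256 for x in range(WIDTH)] for y in range(HEIGHT)]

def process_pixel(value):
    # Each task does very little work: invert a single pixel.
    return 255 - value

# One task per pixel -> 100 very small tasks; the scheduling/communication
# overhead is high relative to the useful work done per task.
with ThreadPoolExecutor(max_workers=100) as pool:
    flat = list(pool.map(process_pixel, (p for row in image for p in row)))

processed = [flat[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]
```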
2. Coarse-grain Parallelism
Coarse-grain parallelism refers to a level of parallelism in computing
where tasks or operations are divided into relatively larger, coarse-
grained sub-tasks that can be executed concurrently. In contrast to fine-
grain parallelism, where tasks are broken down into very small units,
coarse-grain parallelism involves larger chunks of work that are
distributed among different processing units. This approach is often
used when managing fine-grain parallelism becomes impractical or when
the overhead of coordinating fine-grained tasks outweighs the benefits.
Coarse-grain parallelism is common in distributed computing, multi-
threaded programming, and high-level parallel algorithms where tasks
are larger and more substantial.
Example:
Further, if we reduce the number of processors to 2, the processing will take
50 clock cycles. Each processor needs to process 50 pixels, which increases
the computation time, but the communication overhead decreases as the number
of processors that share data decreases. This case illustrates coarse-grained
parallelism.
In coarse-grained parallelism, the program is broken down into a small number
of large (major) tasks.
As a result, a large amount of computation takes place in each processor.
This can lead to load imbalance, in which certain processors handle most of
the data while others may be idle.
Further, coarse-grained parallelism fails to exploit much of the parallelism
in the program, as most of the computation is executed sequentially on a
processor.
The advantage of this type of parallelism is low communication and
synchronization overhead.
Message-passing architectures take a long time to transfer data between
processes, which makes them suitable for coarse-grained parallelism.
The Cray Y-MP is an example of a coarse-grained parallel computer, which
has a grain size of about 20 seconds.
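As a rough counterpart to the earlier sketch (again an illustrative assumption, not from the notes), the same 10*10 image can be split into just two large chunks, one per worker, which is the coarse-grained version of the example:

```python
# Minimal sketch of coarse-grained parallelism: two large tasks, one per worker.
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 10, 10
image = [[(x + y) % 256 for x in range(WIDTH)] for y in range(HEIGHT)]

def process_rows(rows):
    # Each task does a lot of work (50 pixels) and communicates only once.
    return [[255 - p for p in row] for row in rows]

halves = [image[:HEIGHT // 2], image[HEIGHT // 2:]]
with ThreadPoolExecutor(max_workers=2) as pool:
    top, bottom = pool.map(process_rows, halves)
processed = top + bottom
```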
3. Medium Grained Parallelism
Medium-grained parallelism, as the name suggests, falls between fine-
grain parallelism and coarse-grain parallelism in terms of the size of
tasks or operations that are parallelized. In medium-grained parallelism,
tasks are broken down into moderately-sized sub-tasks that can be
executed concurrently. This level of granularity aims to strike a balance
between the high level of control and coordination required for fine-
grain parallelism and the reduced parallelism efficiency of coarse-grain
parallelism.
Medium-grained parallelism is often used in scenarios where tasks are of
intermediate complexity and can be divided into chunks that are
substantial enough to make parallelization worthwhile but not so small
that the overhead of managing parallel tasks becomes a significant
performance bottleneck. It provides a compromise between maximizing
parallelism and minimizing coordination overhead, making it suitable for
various parallel computing applications.
Medium-grained parallelism is defined relative to fine-grained and
coarse-grained parallelism. It is a compromise between the two, where the
task size and communication time are greater than in fine-grained
parallelism but lower than in coarse-grained parallelism.
Most general-purpose parallel computers fall in this category.
Example:
Consider a 10*10 image that needs to be processed, given that the processing
of the 100 pixels is independent of each other. If 10 processors are used and
each processor handles one row of 10 pixels, the image is processed in about
10 clock cycles; this illustrates medium-grained parallelism.
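Viewed this way, granularity is just a chunk-size knob. The short sketch below (an illustrative assumption, not from the notes) processes the image one row per task, sitting between the per-pixel and half-image versions shown earlier:

```python
# Minimal sketch of medium-grained parallelism: one task per row (10 pixels).
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 10, 10
image = [[(x + y) % 256 for x in range(WIDTH)] for y in range(HEIGHT)]

def process_row(row):
    # Moderate work per task: 10 pixels, so 10 tasks in total.
    return [255 - p for p in row]

with ThreadPoolExecutor(max_workers=10) as pool:
    processed = list(pool.map(process_row, image))
```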
Illustrate Amdahl's Law and Gustafson's Law.
1. Amdahl’s Law
Amdahl’s Law was named after Gene Amdahl, who presented it in 1967.
In general terms, Amdahl’s Law states that in parallelization, if P is the
proportion of a system or program that can be made parallel, and 1-P is
the proportion that remains serial, then the maximum speedup S(N) that
can be achieved using N processors is:
S(N) = 1 / ((1 - P) + (P / N))
As N grows the speedup tends to 1/(1-P).
Speedup is limited by the total time needed for the sequential (serial)
part of the program. For 10 hours of computing, if 9 hours can be
parallelized and 1 hour cannot, then the maximum speedup is limited to 10
times as fast. No matter how many processors are added, the serial hour
caps the overall speedup at 10.
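A few lines of Python (illustrative, not part of the notes) reproduce this bound numerically using the formula above:

```python
# Amdahl's Law: S(N) = 1 / ((1 - P) + P / N).
def amdahl_speedup(p_parallel, n_procs):
    return 1.0 / ((1.0 - p_parallel) + p_parallel / n_procs)

# The 10-hour example: 9 of 10 hours parallelizable, so P = 0.9.
for n in (1, 2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
# As n grows, the speedup approaches 1 / (1 - 0.9) = 10.
```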
2. Gustafson’s Law
This law says that increasing the problem size on large machines can retain
scalability with respect to the number of processors.
American computer scientist and businessman, John L. Gustafson (born
January 19, 1955) found out that practical problems show much better
speedup than Amdahl predicted.
Gustafson's law: the computation time is held constant (instead of the
problem size); an increasing number of CPUs is used to solve a bigger
problem and obtain better results in the same time.
Let the execution time of the program on a parallel computer be (a + b),
where a is the sequential time and b is the parallel time.
The total amount of work to be done in parallel varies linearly with
the number of processors, so b is held fixed as the number of processors
p is varied. On a single processor the same work would take (a + p*b),
so the scaled speedup is
S(p) = (a + p*b) / (a + b)
Define α = a / (a + b), the sequential fraction of the execution time.
Then any sufficiently large problem can be efficiently parallelized with
a speedup
S(p) = α + p(1 - α) = p - α(p - 1)
where p is the number of processors and α is the serial portion of
the problem.
Gustafson proposed a fixed-time concept, which leads to scaled
speedup for larger problem sizes.
Basically, larger systems with more processors are used to solve
larger problems.
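A companion sketch (again illustrative, not from the notes) evaluates the scaled-speedup formula derived above and shows how, unlike the Amdahl bound, it keeps growing with the number of processors:

```python
# Gustafson's Law: scaled speedup S(p) = p - alpha * (p - 1),
# where alpha = a / (a + b) is the serial fraction of the parallel run time.
def gustafson_speedup(alpha, p):
    return p - alpha * (p - 1)

for p in (1, 2, 16, 1024):
    print(p, gustafson_speedup(0.1, p))
# With alpha = 0.1 the speedup grows roughly as 0.9 * p,
# instead of saturating at 10 as in Amdahl's fixed-size case.
```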
UMA
In UMA (Uniform Memory Access), all processors share the physical memory
uniformly, so each processor has the same access time to any memory location.
NUMA
The main point to consider here is that, unlike UMA, the memory access time
in NUMA (Non-Uniform Memory Access) depends on how far the memory is from the
processor, which means memory access times vary. It still allows access to
any memory location by using its physical address.
3. Flynn's Taxonomy
Flynn's classification divides computers into four major groups: SISD
(Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data),
MISD (Multiple Instruction, Single Data) and MIMD (Multiple Instruction,
Multiple Data).
In an SIMD machine, all processors receive the same instruction from the
control unit but operate on different items of data. Such machines are also
known as vector or array processors. For example, the ILLIAC-IV.
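As a tiny illustration of the SIMD idea (assuming NumPy is available; this snippet is not from the notes), a single vectorized operation is applied to many data elements at once:

```python
# SIMD-style data parallelism: one operation, many data elements.
import numpy as np

data = np.arange(8)      # eight different data items
result = data + 10       # the same "instruction" applied to every element
print(result)            # [10 11 12 13 14 15 16 17]
```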
Feng's classification
Word Serial and Bit Serial (WSBS)
One bit of one selected word is processed at a time. This represents serial
processing and needs the maximum processing time.
Word Serial and Bit Parallel (WSBP)
It is found in most existing computers and has been called word-slice
processing because one word of n bits is processed at a time. All bits of a
selected word are processed at once; bit parallel means all bits of a word.
Word Parallel and Bit Serial (WPBS)
It has been called bit-slice processing because an m-bit slice is processed
at a time. Word parallel signifies selection of all words; one bit from all
words is processed at a time.
Comparison of UMA and NUMA:
UMA offers limited bandwidth; NUMA offers relatively more bandwidth than UMA.
UMA uses a single memory controller; NUMA uses multiple memory controllers.
UMA is slower than NUMA; NUMA is faster than UMA.
In UMA, memory access time is equal (balanced); in NUMA, memory access time
is unequal.
UMA is mainly used in time-sharing and general-purpose applications; NUMA
offers better performance and speed, making it suitable for real-time and
time-critical applications.
UMA involves lower communication overhead; NUMA involves higher
communication overhead.
UMA has a simpler hardware design; NUMA has a more complex hardware design.
UMA is easier to build and maintain; NUMA is complex because of the
non-uniform access it must manage.
UMA follows the shared-memory programming model with a uniform-access API;
NUMA may require explicit data placement and NUMA-aware programming.
UMA is generally more cost-effective for smaller systems; NUMA is costlier
due to its complex memory architecture and interconnect.
Comparison of RISC and CISC:
RISC stands for Reduced Instruction Set Computer; CISC stands for Complex
Instruction Set Computer.
RISC has simple instruction decoding; CISC has complex instruction decoding.
Pipelining is easy to use in RISC; pipelining is difficult to use in CISC.
The execution time of RISC instructions is very short; the execution time of
CISC instructions is longer.
RISC has more transistors devoted to memory registers; CISC uses its
transistors to store complex instructions.
RISC has fixed-format instructions; CISC has variable-format instructions.
RISC emphasizes software to optimize the instruction set; CISC emphasizes
hardware to optimize the instruction set.
RISC processors use a hard-wired control unit; CISC processors use a
microprogrammed control unit.
RISC requires multiple register sets to store the instructions; CISC
requires a single register set to store the instructions.
RISC programs take more space in memory; CISC programs take less space in
memory.
Examples of RISC: ARM, PA-RISC, AVR, ARC, and SPARC. Examples of CISC: VAX,
Motorola 68000 family, System/360, AMD and Intel x86.