HPC Module 4

Parallel programming and parallel algorithms

Parallel programming and parallel algorithms are key concepts in computer
science, especially important for improving the performance of software on
modern hardware.
Parallel Programming
Parallel programming involves dividing a problem into smaller sub-
problems that can be solved concurrently, usually to speed up processing.
It leverages multiple processors or cores in a computer to perform
computations simultaneously. Key concepts in parallel programming
include:
1. Concurrency vs. Parallelism:
o Concurrency is about dealing with multiple tasks at the same
time, but not necessarily simultaneously.
o Parallelism is about performing multiple tasks simultaneously.
2. Threads and Processes:
o Threads are the smallest unit of processing that can be
scheduled by an operating system.
o Processes are instances of running programs that may consist
of multiple threads.
3. Shared Memory vs. Distributed Memory:
o Shared Memory: Multiple processors share the same memory
space, making data exchange between tasks easier but
requiring careful synchronization.
o Distributed Memory: Each processor has its own memory,
necessitating communication between processors, often using
message passing (e.g., MPI).
4. Synchronization:
o To avoid race conditions, deadlocks, and other concurrency
issues, synchronization techniques such as locks, semaphores,
and barriers are used.
5. Programming Models:
o Task Parallelism: Tasks are divided among multiple processors.
o Data Parallelism: Data is divided among processors, and the
same operation is performed on each subset of data.
Parallel Algorithms
Parallel algorithms are designed to exploit parallelism to solve problems
faster than would be possible with a sequential algorithm. Key concepts
include:
1. Speedup and Efficiency:
o Speedup measures how much faster a parallel algorithm runs
compared to its sequential counterpart.
o Efficiency measures how effectively the parallel resources are
used (formal definitions are given after this list).
2. Problem Decomposition:
o Problems can be decomposed into tasks (functional
decomposition) or data elements (data decomposition) that can
be processed in parallel.
3. Load Balancing:
o Ensuring that all processors or cores are equally utilized to avoid
bottlenecks.
4. Parallel Algorithm Design Strategies:
o Divide and Conquer: Problems are recursively divided into
smaller subproblems that can be solved in parallel.
o Pipelining: Different stages of a process are executed in parallel.
o Parallel Sorting: Algorithms like parallel quicksort and parallel
merge sort.
o Parallel Graph Algorithms: Includes algorithms for shortest path,
spanning tree, etc.
5. Common Parallel Algorithms:
o Parallel Sorting: Algorithms that divide the data and sort them in
parallel.
o Matrix Multiplication: Often done in parallel to speed up large
matrix operations.
o Search Algorithms: Parallel versions of depth-first search (DFS)
or breadth-first search (BFS).
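The two measures in point 1 above are standardly defined as follows, where T_1 is the running time of the sequential algorithm, T_p is the running time on p processors, S(p) is the speedup, and E(p) is the efficiency:

S(p) = T_1 / T_p
E(p) = S(p) / p = T_1 / (p * T_p)

Ideal (linear) speedup corresponds to S(p) = p and E(p) = 1; in practice, synchronization and communication overhead keep the efficiency below 1.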
Programming models
Programming models in parallel computing provide a framework to write
programs that can run concurrently on multiple processors or cores. These
models abstract the complexity of parallel execution, making it easier for
developers to implement parallelism in their applications. Here are some
common parallel programming models:
1. Shared Memory Model
 Overview: In the shared memory model, multiple processors or
threads share a common address space. This model simplifies
communication between threads since they can directly access and
modify shared variables.
 Synchronization: Since threads share memory, synchronization
mechanisms (like locks, semaphores, and barriers) are needed to
avoid race conditions and ensure correct execution.
 Examples:
o OpenMP: A popular API that supports multi-platform shared-
memory parallelism in C, C++, and Fortran.
o Pthreads: A POSIX standard for implementing multithreading at
the level of the operating system.
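A minimal shared-memory sketch using OpenMP in C (array size and the computed values are illustrative); both loops are split across threads that read and write the same shared array, and the reduction clause handles the necessary synchronization on the shared sum:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Each thread fills a different chunk of the shared array a[]. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    /* The reduction clause synchronizes updates to the shared sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}

Compiled with an OpenMP-enabled compiler (for example, gcc -fopenmp), the same code runs sequentially if OpenMP support is absent, since the pragmas are then ignored.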
2. Distributed Memory Model
 Overview: In this model, each processor has its own local memory.
Processors communicate with each other by passing messages,
typically over a network. This model is common in cluster computing
and large-scale parallel processing.
 Communication: Message-passing interfaces are used for
communication between processors.
 Examples:
o MPI (Message Passing Interface): A widely-used standard for
communication in distributed memory environments.
o MapReduce: A programming model for processing large
datasets across distributed clusters.
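A minimal message-passing sketch using MPI in C (the value sent and the rank numbers are illustrative); each rank has its own private copy of value, and data moves only through explicit send/receive calls:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            value = 42;   /* lives only in rank 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }
    }
    MPI_Finalize();
    return 0;
}

Such a program is typically compiled with mpicc and launched with mpirun -np <processes>.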
3. Hybrid Model
 Overview: The hybrid model combines elements of both shared
memory and distributed memory models. This approach is used to
leverage the benefits of both models, often in large systems that
consist of multiple nodes with shared memory within each node and
distributed memory between nodes.
 Examples:
o A common approach is to use OpenMP for shared memory
parallelism within a node and MPI for communication between
nodes.
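A minimal sketch of the hybrid approach, assuming both MPI and an OpenMP-capable compiler are available: OpenMP threads parallelize the local loop inside each rank (node), and an MPI reduction combines the per-node results (the workload itself is illustrative):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared-memory parallelism within the node: OpenMP threads. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000; i++)
        local += (rank * 1000 + i) * 1.0;

    /* Distributed-memory parallelism between nodes: MPI reduction. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}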
4. Data Parallel Model
 Overview: In the data parallel model, the same operation is
performed simultaneously on different pieces of distributed data. This
model is effective when the computation can be broken down into
many independent tasks that operate on different parts of the data.
 Key Characteristics: Emphasizes the distribution of data rather than
tasks. Each processor works on a different subset of the data.
 Examples:
o CUDA: A parallel computing platform and programming model
developed by NVIDIA, which enables parallel execution of code
on GPUs.
o OpenCL: An open standard for cross-platform parallel
programming on heterogeneous systems, including CPUs,
GPUs, and other processors.
5. Task Parallel Model
 Overview: In the task parallel model, different tasks (which could be
different functions or parts of a program) are executed in parallel. This
model is useful when the tasks are independent or can be divided into
smaller sub-tasks.
 Load Balancing: An important aspect of task parallelism is ensuring
that all processors are utilized efficiently, which often requires
dynamic scheduling and load balancing.
 Examples:
o Threading Building Blocks (TBB): A C++ library developed by
Intel for task-based parallelism.
o Fork/Join Framework: A framework in Java for dividing a
problem into smaller subproblems (forking) and then combining
their results (joining).
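A minimal task-parallel sketch using OpenMP tasks in C (chosen here instead of TBB or the Java Fork/Join framework to keep the examples in one language); it forks two independent recursive sub-tasks and joins their results, mirroring the fork/join pattern described above:

#include <stdio.h>
#include <omp.h>

/* Naive recursive Fibonacci: each call forks two independent sub-tasks. */
static long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait        /* join: wait for both sub-tasks to finish */
    return x + y;
}

int main(void) {
    long result;
    #pragma omp parallel
    {
        #pragma omp single      /* one thread spawns the initial task tree */
        result = fib(20);
    }
    printf("fib(20) = %ld\n", result);
    return 0;
}

The OpenMP runtime schedules the generated tasks onto idle threads, which is one way dynamic load balancing is achieved in practice.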
6. Pipeline Model
 Overview: The pipeline model is used when a sequence of tasks
needs to be performed in a specific order. Each stage of the pipeline
can be executed in parallel with different data, similar to an assembly
line.
 Example:
o Streaming Data Processing: Where data is continuously
processed in stages, such as in image processing or real-time
data analysis.
7. Bulk Synchronous Parallel (BSP) Model
 Overview: BSP is a parallel computing model where computation is
divided into a sequence of supersteps. In each superstep, processors
perform local computations, exchange data with other processors,
and then synchronize before moving on to the next superstep.
 Examples:
o Used in large-scale graph processing frameworks like Apache
Giraph.
Parallel programming on multiprocessors
Parallel programming on multiprocessors involves designing and
implementing software that can run concurrently on multiple processing
units within a single system. Multiprocessors typically refer to systems with
more than one CPU (Central Processing Unit) or multiple cores within a CPU,
enabling parallel execution of tasks. Here’s an overview of key concepts,
challenges, and approaches in parallel programming on multiprocessors:
1. Types of Multiprocessor Architectures
 Symmetric Multiprocessing (SMP):
o All processors share a single, uniform memory space and have
equal access to I/O devices.
o Common in modern multi-core processors where each core can
access the same memory.
 Non-Uniform Memory Access (NUMA):
o Processors have their own local memory but can access
memory from other processors, though access times may vary
(non-uniform).
o Used in large-scale systems where processors are grouped in
nodes, each with its own memory.
2. Parallel Programming Models for Multiprocessors
 Shared Memory Model:
o Threads or processes running on different processors share the
same memory space.
o Easier communication between threads but requires
synchronization mechanisms to manage access to shared
resources.
 Thread-Level Parallelism (TLP):
o Multiple threads are spawned, each running on a different
processor or core.
o Commonly implemented using libraries like POSIX Threads
(Pthreads), OpenMP, or Java’s multithreading capabilities.
3. Key Concepts in Multiprocessor Parallel Programming
 Synchronization:
o Ensures that concurrent processes or threads do not interfere
with each other when accessing shared resources.
o Mechanisms include locks (mutexes), semaphores, barriers, and
condition variables (a minimal Pthreads example is given after this list).
o Critical for preventing race conditions, deadlocks, and ensuring
correct execution.
 Load Balancing:
o Distributing work evenly across all processors to maximize
resource utilization and minimize execution time.
o Can be static (pre-determined distribution) or dynamic (adjusting
workload distribution at runtime).
 Data Locality:
o In NUMA systems, it’s important to place data close to the
processor that uses it most frequently to reduce memory access
times.
o Techniques include affinity scheduling, where threads are bound
to specific processors to optimize cache usage and memory
access.
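A minimal Pthreads sketch of the synchronization point above: several threads increment a shared counter, and a mutex prevents the race condition that would otherwise corrupt it (thread and iteration counts are illustrative):

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define ITERS    100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);    /* protect the shared counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * ITERS);
    return 0;
}

Removing the mutex calls makes the final counter value unpredictable, which is exactly the kind of race condition the synchronization mechanisms above are meant to prevent.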
4. Challenges in Parallel Programming on Multiprocessors
 Scalability:
o Ensuring that the parallel program scales efficiently as the
number of processors increases. Poor scalability can occur due
to excessive synchronization, load imbalance, or memory
contention.
 Debugging and Testing:
o Concurrency issues like race conditions and deadlocks are
difficult to detect and reproduce, making parallel programs
harder to debug and test compared to sequential ones.
 Performance Tuning:
o Optimizing cache usage, minimizing synchronization overhead,
and ensuring efficient memory access patterns are crucial for
achieving good performance.
Parallel programming on multicomputers
Parallel programming on multicomputers involves designing software that
runs concurrently across multiple computers, often referred to as nodes,
within a distributed computing environment. Each computer in the system
typically has its own local memory, and the computers communicate with
each other via a network. This approach is common in high-performance
computing (HPC), cloud computing, and large-scale distributed systems.
Here’s an overview of key concepts, challenges, and approaches in parallel
programming on multicomputers:
1. Multicomputer Architecture
 Distributed Memory:
o In a multicomputer system, each node has its own private
memory, and nodes do not share memory directly.
o Communication between nodes occurs through message
passing over a network.
 Cluster Computing:
o A common form of multicomputer system where multiple
standalone computers (nodes) are connected via a high-speed
network and work together to perform parallel processing tasks.
 Grid Computing:
o A form of distributed computing that connects geographically
dispersed computers to work on a common task, often used in
scientific research.
2. Parallel Programming Models for Multicomputers
 Message Passing Model:
o The most common programming model for multicomputers,
where processes running on different nodes communicate by
sending and receiving messages.
o Each process has its own local memory, and explicit
communication is required to share data between processes.
 Data Parallelism:
o Similar to data parallelism in multiprocessors, but data is
distributed across nodes, and each node works on its local
portion of the data.
3. Key Concepts in Parallel Programming on Multicomputers
 Message Passing Interface (MPI):
o MPI is a standard library for message passing in parallel
programs, widely used in distributed memory systems.
o It provides a range of functions for point-to-point communication,
collective communication (e.g., broadcasts, reductions), and
synchronization (see the sketch after this list).
 Communication and Synchronization:
o Managing communication latency and synchronization overhead
is crucial in multicomputers.
o Techniques like asynchronous communication, non-blocking
sends/receives, and collective operations are often used to
optimize performance.
 Load Balancing:
o Ensuring an even distribution of work across all nodes to avoid
situations where some nodes are idle while others are
overloaded.
o Load balancing can be static (determined at the start) or dynamic
(adjusted during runtime based on workload).
 Fault Tolerance:
o In a distributed environment, failures are more common, so fault
tolerance is important. Techniques include checkpointing
(periodically saving the state of a computation) and replication
(duplicating tasks across nodes).
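A small MPI sketch of the ideas above (the problem size and the per-node work are illustrative): a broadcast distributes a parameter to all nodes, each node computes on its own portion of the work, and a collective reduction gathers the partial results. Non-blocking variants such as MPI_Isend/MPI_Irecv would be used when communication needs to be overlapped with computation:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Collective communication: rank 0 broadcasts the problem size. */
    if (rank == 0) n = 1000;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Each node computes on its own local portion of the work. */
    long local = 0, total = 0;
    for (int i = rank; i < n; i += size)
        local += i;

    /* Collective reduction combines the partial sums on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum 0..%d = %ld\n", n - 1, total);

    MPI_Finalize();
    return 0;
}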
4. Challenges in Parallel Programming on Multicomputers
 Network Latency and Bandwidth:
o Communication between nodes is slower than memory access,
so minimizing data transfer and optimizing communication
patterns are key to performance.
o Techniques like data locality, minimizing communication
frequency, and overlapping computation with communication are
often employed.
 Scalability:
o As the number of nodes increases, the overhead of
communication and synchronization can grow, making it
challenging to maintain performance.
 Debugging and Profiling:
o Debugging parallel programs on multicomputers is complex due
to the distributed nature of the system and potential issues like
race conditions, deadlocks, and communication errors.
o Profiling tools help analyse performance and identify bottlenecks
in communication and computation.
Matrix algorithms
Matrix algorithms are fundamental in numerous scientific, engineering, and
data processing applications. Below are several key matrix algorithms,
including both sequential and parallel approaches:
1. Matrix Addition
 Problem: Given two matrices A and B of the same dimensions,
compute their sum C=A+B.
 Sequential Approach: Iterate through each element and add the
corresponding elements of A and B to form C.
 Parallel Approach:
o Element-wise Parallelization: Each processor is assigned a
subset of elements, which it adds in parallel. This is a
straightforward approach, as each addition operation is
independent.
 Example:
o For matrices of size n×n, divide the matrix into submatrices and
assign each submatrix to a different processor for parallel
addition.
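A sketch of element-wise parallel matrix addition with OpenMP in C (matrix size is illustrative); every element of C can be computed independently, so the loop nest is simply divided among threads:

#include <stdio.h>
#include <omp.h>

#define N 512

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    /* Initialize A and B (sequentially, for simplicity). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = i - j;
        }

    /* Parallel element-wise addition: iterations are distributed across threads. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = A[i][j] + B[i][j];

    printf("C[0][0]=%f C[%d][%d]=%f\n", C[0][0], N - 1, N - 1, C[N - 1][N - 1]);
    return 0;
}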
2. Matrix Multiplication
 Problem: Given two matrices A (of size m×n) and B (of size n×p),
compute the product C = AB (of size m×p).
 Sequential Approach: Use triple nested loops to compute each
element of C.
 Parallel Approaches:
o Block Multiplication: Divide the matrices into smaller blocks
and perform block-wise multiplication in parallel, then combine
the results.
o Cannon’s Algorithm: An efficient parallel algorithm for matrix
multiplication in distributed memory systems, where each
processor computes a block of the result matrix.
o Strassen’s Algorithm: A divide-and-conquer algorithm that
reduces the number of multiplications required, which can be
parallelized to improve performance.
 Example:
o In block multiplication, the matrix is divided into 4 blocks. Each
processor handles the multiplication of its assigned blocks, and
the results are combined.
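A sketch of parallel matrix multiplication in C using row-wise OpenMP parallelization of the standard triple loop (Cannon's and Strassen's algorithms are more involved and not shown; matrix size and contents are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 256

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = 2.0;
        }

    /* Each thread computes a block of rows of C independently. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    printf("C[0][0] = %f (expected %f)\n", C[0][0], 2.0 * N);
    return 0;
}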
3. Matrix Transposition
 Problem: Given a matrix A of size m×n, produce the transposed
matrix A^T of size n×m.
 Sequential Approach: Iterate through the matrix, swapping elements
Aij with Aji.
 Parallel Approach:
o Element-wise Parallelization: Each processor is assigned a set
of elements to swap in parallel. Care must be taken to ensure
that no two processors write to the same location at the same
time.
 Example:
o For large matrices, divide the matrix into submatrices or rows,
and each processor transposes its assigned section.
4. LU Decomposition
 Problem: Decompose a matrix A into a lower triangular matrix L and
an upper triangular matrix U such that A = LU.
 Sequential Approach: Use Gaussian elimination to perform the
decomposition.
 Parallel Approaches:
o Row-wise Parallelization: Each row of the matrix can be
processed in parallel, especially during the forward elimination
phase.
o Block LU Decomposition: Divide the matrix into blocks, and
perform LU decomposition on each block in parallel.
Synchronization is required after each step to ensure
correctness.
 Example:
o For a matrix of size n×n, the matrix is divided into submatrices.
Each processor handles the LU decomposition of its assigned
block.
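A sketch of in-place LU decomposition (Doolittle form, no pivoting) in C, with the row updates of each elimination step parallelized by OpenMP; this illustrates row-wise parallelization only and is not a numerically robust implementation (the 4×4 matrix is illustrative):

#include <stdio.h>
#include <omp.h>

#define N 4

int main(void) {
    /* A is overwritten in place: U in the upper triangle, L's multipliers below. */
    double A[N][N] = {
        {4, 3, 2, 1},
        {3, 4, 3, 2},
        {2, 3, 4, 3},
        {1, 2, 3, 4}
    };

    for (int k = 0; k < N - 1; k++) {
        /* Rows below the pivot row can be eliminated independently. */
        #pragma omp parallel for
        for (int i = k + 1; i < N; i++) {
            A[i][k] /= A[k][k];               /* multiplier, stored as L(i,k) */
            for (int j = k + 1; j < N; j++)
                A[i][j] -= A[i][k] * A[k][j]; /* update the trailing part of row i */
        }
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%8.4f ", A[i][j]);
        printf("\n");
    }
    return 0;
}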
5. Cholesky Decomposition
 Problem: Decompose a symmetric positive-definite matrix A into the
product of a lower triangular matrix L and its transpose L^T, such that
A = LL^T.
 Sequential Approach: Use a variant of Gaussian elimination tailored
for symmetric matrices.
 Parallel Approaches:
o Blocked Cholesky: Similar to block LU decomposition, the
matrix is divided into blocks, with each block being processed in
parallel. Synchronization is required after each step.
o Pipeline Cholesky: Process different parts of the matrix in a
pipeline fashion, with each processor working on a different
stage of the decomposition.
 Example:
o For large matrices, each block is processed in parallel, with
communication between processors to synchronize the results.
Graph algorithms
Graph algorithms are crucial in computer science, used for solving
problems related to networks, social graphs, data structures, and more.
Here’s an overview of key graph algorithms, including both sequential and
parallel approaches:
1. Breadth-First Search (BFS)
 Problem: Explore all nodes in a graph starting from a given node,
visiting all nodes at the current depth level before moving to the next
level.
 Sequential Approach:
o Use a queue to explore nodes level by level, marking each
visited node.
 Parallel Approach:
o Level-Synchronous BFS: Nodes at the current level are
processed in parallel, with each processor handling a subset of
the nodes. After processing, the next level is queued for the next
iteration.
o Hybrid BFS: Combines level-synchronous and asynchronous
BFS to improve performance on large or irregular graphs.
 Example:
o In parallel BFS on a social network graph, each processor
explores a different subset of a node's neighbors, potentially on
different machines in a distributed system.
2. Depth-First Search (DFS)
 Problem: Explore as far as possible along each branch before
backtracking.
 Sequential Approach:
o Use a stack (either explicitly or via recursion) to explore nodes
deeply before moving to the next sibling.
 Parallel Approach:
o Recursive Parallel DFS: Divide the graph into subgraphs, with
each processor performing DFS on its subgraph. Care must be
taken to handle overlapping nodes at boundaries.
o Task Parallelism: Each processor is assigned a subtree, with
the DFS running independently on different subtrees.
 Example:
o In parallel DFS for a large tree, each processor could explore
different branches of the tree concurrently.
Kruskal’s Algorithm
Kruskal's Algorithm is a classic approach to finding the Minimum Spanning
Tree (MST) of a connected, undirected graph. The MST connects all
vertices in the graph with the minimum possible total edge weight and
without any cycles.
Steps of Kruskal’s Algorithm:
1. Sort the Edges: Begin by sorting all the edges of the graph in non-
decreasing order based on their weights.
2. Initialize the MST: Start with an empty MST and consider each edge
one by one from the sorted list.
3. Select Edges: For each edge, check if it forms a cycle with the edges
already included in the MST:
o Cycle Detection: Use the Union-Find data structure (also known
as Disjoint Set Union, DSU) to efficiently determine if adding the
current edge would form a cycle. If it does not form a cycle, add
it to the MST.
o Union-Find Operations:
 Find: Determine the root of the set containing a vertex.
 Union: Merge two sets if they belong to different
components, effectively connecting them.
4. Repeat: Continue this process until the MST contains exactly V−1
edges (where V is the number of vertices in the graph).
Example:
For a graph with vertices A, B, C, and D and edge weights AB=1, BC=2,
CD=3, and DA=4:
 Sort the edges: (AB,1), (BC,2), (CD,3), (DA,4).
 Add edges one by one to the MST, ensuring no cycles are formed,
resulting in a final MST with edges AB, BC, and CD.
Efficiency:
 Time Complexity: O(E log E), where E is the number of edges,
primarily due to the sorting step.
 Space Complexity: O(V+E), to store the graph and Union-Find
structure.
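A compact sequential C sketch of Kruskal's algorithm with a Union-Find (DSU) structure, hard-coding the small example graph above (vertices A–D mapped to 0–3):

#include <stdio.h>
#include <stdlib.h>

typedef struct { int u, v, w; } Edge;

static int parent[4];

/* Find the root of x's set, with path compression. */
static int find(int x) {
    if (parent[x] != x) parent[x] = find(parent[x]);
    return parent[x];
}

static int cmp(const void *a, const void *b) {
    return ((const Edge *)a)->w - ((const Edge *)b)->w;
}

int main(void) {
    /* Vertices: 0=A, 1=B, 2=C, 3=D. Edges AB=1, BC=2, CD=3, DA=4. */
    Edge edges[] = { {0,1,1}, {1,2,2}, {2,3,3}, {3,0,4} };
    int m = 4, n = 4, total = 0;

    for (int i = 0; i < n; i++) parent[i] = i;
    qsort(edges, m, sizeof(Edge), cmp);          /* sort edges by weight */

    for (int i = 0; i < m; i++) {
        int ru = find(edges[i].u), rv = find(edges[i].v);
        if (ru != rv) {                          /* no cycle: accept the edge */
            parent[ru] = rv;                     /* union the two components */
            total += edges[i].w;
            printf("take edge %d-%d (w=%d)\n", edges[i].u, edges[i].v, edges[i].w);
        }
    }
    printf("MST weight = %d\n", total);
    return 0;
}

For this graph the accepted edges are AB, BC, and CD with total weight 6, matching the example above.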
Dijkstra’s Algorithm
Dijkstra’s Algorithm is a fundamental algorithm used to find the shortest
paths from a single source node to all other nodes in a weighted graph with
non-negative edge weights. It is widely used in various applications,
including routing and network analysis.
Steps of Dijkstra’s Algorithm:
1. Initialization:
o Set the distance to the source node to 0 and to all other nodes
to infinity.
o Use a priority queue (or min-heap) to keep track of nodes to be
processed, starting with the source node.
2. Processing Nodes:
o Extract the node with the smallest distance from the priority
queue. This node is considered to have the shortest path from
the source.
o For each adjacent node of the extracted node, calculate the
tentative distance through the extracted node. If this distance is
shorter than the currently known distance, update the distance
and insert the adjacent node into the priority queue with the new
distance.
3. Update Distances:
o Repeat the process until the priority queue is empty. Each time
a node is processed, its shortest distance from the source is
finalized.
4. Reconstruction:
o After processing all nodes, the algorithm provides the shortest
distance from the source to every other node. To find the shortest
path itself, backtrack from the destination node using a
predecessor array or list.
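A simple O(V^2) C sketch of Dijkstra's algorithm using an adjacency matrix and a linear scan in place of a priority queue (adequate for small, dense graphs; the example graph is illustrative):

#include <stdio.h>
#include <limits.h>

#define V 5
#define INF INT_MAX

int main(void) {
    /* Adjacency matrix; 0 means no edge (except on the diagonal). */
    int g[V][V] = {
        {0, 4, 1, 0, 0},
        {4, 0, 2, 5, 0},
        {1, 2, 0, 8, 0},
        {0, 5, 8, 0, 3},
        {0, 0, 0, 3, 0}
    };
    int dist[V], visited[V] = {0};
    int src = 0;

    for (int i = 0; i < V; i++) dist[i] = INF;
    dist[src] = 0;

    for (int iter = 0; iter < V; iter++) {
        /* Pick the unvisited node with the smallest tentative distance. */
        int u = -1;
        for (int i = 0; i < V; i++)
            if (!visited[i] && (u == -1 || dist[i] < dist[u])) u = i;
        if (dist[u] == INF) break;            /* remaining nodes are unreachable */
        visited[u] = 1;

        /* Relax all edges leaving u. */
        for (int v = 0; v < V; v++)
            if (g[u][v] && !visited[v] && dist[u] + g[u][v] < dist[v])
                dist[v] = dist[u] + g[u][v];
    }

    for (int i = 0; i < V; i++)
        printf("dist(%d -> %d) = %d\n", src, i, dist[i]);
    return 0;
}

With a binary heap instead of the linear scan, the running time drops to O((V + E) log V), which matches the priority-queue formulation described in the steps above.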
Floyd-Warshall Algorithm
The Floyd-Warshall Algorithm is a classic algorithm used to find the
shortest paths between all pairs of nodes in a weighted graph. It is
particularly useful for dense graphs and when you need to compute the
shortest paths between every pair of nodes.
Steps of Floyd-Warshall Algorithm:
1. Initialization:
o Create a distance matrix D where D[i][j] represents the shortest
distance from node i to node j.
o Initialize D[i][j] to the weight of the edge between i and j if there is
an edge; otherwise, set D[i][j] to infinity. Set D[i][i] to 0 for all
nodes i.
2. Algorithm Execution:
o For each node k (considered as an intermediate node), update
the distance matrix:
 For each pair of nodes (i,j), check if the path from i to j through k is
shorter than the direct path from i to j, and update D[i][j] as follows:
 D[i][j] = min(D[i][j], D[i][k] + D[k][j])
o Repeat the above step for each node k as an intermediate node.
3. Output:
o After processing all nodes, the matrix D contains the shortest
path distances between all pairs of nodes. If D[i][j] remains
infinity, it indicates that there is no path from node i to node j.
The Floyd-Warshall Algorithm is especially useful for scenarios where you
need to compute shortest paths between every pair of nodes, and its
straightforward approach makes it suitable for implementation in both
sequential and parallel computing environments. In parallel settings, the
matrix updates can be distributed across multiple processors to speed up
computation, particularly for large graphs.
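A compact C sketch of the Floyd-Warshall algorithm on a small illustrative graph (a large constant stands in for infinity). For a fixed intermediate node k, all (i, j) updates are independent, which is where the parallelization mentioned above applies; the OpenMP pragma is optional and is ignored by compilers without OpenMP support:

#include <stdio.h>

#define V 4
#define INF 1000000    /* "infinity": larger than any real path length here */

int main(void) {
    int D[V][V] = {
        {0,   5,   INF, 10 },
        {INF, 0,   3,   INF},
        {INF, INF, 0,   1  },
        {INF, INF, INF, 0  }
    };

    for (int k = 0; k < V; k++) {
        /* For a fixed k, every (i, j) update is independent of the others. */
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < V; i++)
            for (int j = 0; j < V; j++)
                if (D[i][k] + D[k][j] < D[i][j])
                    D[i][j] = D[i][k] + D[k][j];
    }

    for (int i = 0; i < V; i++) {
        for (int j = 0; j < V; j++) printf("%8d", D[i][j]);
        printf("\n");
    }
    return 0;
}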
Prim’s Algorithm
Prim’s Algorithm is a well-known method for finding the Minimum
Spanning Tree (MST) of a connected, undirected graph. The MST
connects all vertices in the graph with the minimum possible total edge
weight without forming any cycles.
Steps of Prim’s Algorithm:
1. Initialization:
o Start with a single node (chosen arbitrarily) and mark it as part
of the MST.
o Initialize a priority queue (or min-heap) to keep track of edges
connecting the nodes already included in the MST to those not
yet included. Initially, this queue will contain edges that connect
the starting node to its adjacent nodes.
2. Building the MST:
o Extract Minimum Edge: From the priority queue, extract the
edge with the minimum weight that connects a node inside the
MST to a node outside the MST.
o Update MST: Add this edge and the newly connected node to
the MST.
o Add New Edges: For the newly added node, update the priority
queue with all edges connecting this node to nodes not yet in the
MST. This ensures that future edges connecting the MST to the
remaining nodes are considered.
3. Repeat:
o Continue the process of extracting the minimum edge and
adding the corresponding node and edge to the MST until all
nodes are included in the MST.
4. Output:
o Once all nodes are included, the algorithm completes, and the
MST is represented by the edges added during the process.
Prim’s Algorithm is particularly effective for dense graphs, where the
number of edges is close to the square of the number of vertices. It can be
parallelized by distributing the work of extracting minimum edges and
updating the priority queue across multiple processors, enhancing
performance for large-scale graphs.
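A simple O(V^2) C sketch of Prim's algorithm using an adjacency matrix and a linear scan in place of a priority queue (the example graph is illustrative):

#include <stdio.h>
#include <limits.h>

#define V 5
#define INF INT_MAX

int main(void) {
    /* Adjacency matrix of an undirected graph; 0 means no edge. */
    int g[V][V] = {
        {0, 2, 0, 6, 0},
        {2, 0, 3, 8, 5},
        {0, 3, 0, 0, 7},
        {6, 8, 0, 0, 9},
        {0, 5, 7, 9, 0}
    };
    int key[V], parent[V], inMST[V] = {0};

    for (int i = 0; i < V; i++) { key[i] = INF; parent[i] = -1; }
    key[0] = 0;                                  /* start from vertex 0 */

    for (int count = 0; count < V; count++) {
        /* Extract the cheapest edge crossing the cut (linear scan). */
        int u = -1;
        for (int i = 0; i < V; i++)
            if (!inMST[i] && (u == -1 || key[i] < key[u])) u = i;
        inMST[u] = 1;

        /* Update the best known edge to every vertex still outside the MST. */
        for (int v = 0; v < V; v++)
            if (g[u][v] && !inMST[v] && g[u][v] < key[v]) {
                key[v] = g[u][v];
                parent[v] = u;
            }
    }

    for (int v = 1; v < V; v++)
        printf("edge %d - %d (w=%d)\n", parent[v], v, g[v][parent[v]]);
    return 0;
}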
Edmonds-Karp Algorithm
Edmonds-Karp Algorithm is an implementation of the Ford-Fulkerson
method for computing the maximum flow in a flow network. It uses
Breadth-First Search (BFS) to find augmenting paths, which guarantees a
polynomial time complexity of O(VE^2).
Steps of Edmonds-Karp Algorithm:
1. Initialization:
o Start with an initial flow of 0 for all edges in the network.
o Construct a residual graph where each edge (u,v) has a capacity
and a flow, with the initial flow set to 0. The residual capacity of
an edge (u,v) is defined as capacity(u,v)−flow(u,v).
2. Find Augmenting Paths:
o Use Breadth-First Search (BFS) to find the shortest augmenting
path from the source node s to the sink node t in the residual
graph. An augmenting path is a path where the residual capacity
of each edge along the path is positive.
3. Augment Flow:
o Determine the maximum flow that can be sent along the found
augmenting path. This is the minimum residual capacity along
the path.
o Update the flow along the path by adding this maximum flow to
the forward edges and subtracting it from the backward edges in
the residual graph.
4. Repeat:
o Repeat the BFS to find new augmenting paths and update the
flow until no more augmenting paths can be found in the residual
graph.
5. Output:
o The algorithm terminates when no more augmenting paths can
be found. The maximum flow is the sum of flows on all edges
leaving the source node s.
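A compact C sketch of the Edmonds-Karp algorithm on an adjacency-matrix residual graph (the example network and its capacities are illustrative); BFS finds the shortest augmenting path, the bottleneck residual capacity along that path is pushed, and the loop repeats until no augmenting path remains:

#include <stdio.h>
#include <string.h>

#define V 6
#define INF 1000000000

/* Residual capacities; initialized with the original edge capacities. */
static int cap[V][V] = {
    {0, 16, 13, 0,  0,  0},
    {0, 0,  10, 12, 0,  0},
    {0, 4,  0,  0,  14, 0},
    {0, 0,  9,  0,  0,  20},
    {0, 0,  0,  7,  0,  4},
    {0, 0,  0,  0,  0,  0}
};
static int parent[V];

/* BFS in the residual graph: returns 1 if an augmenting path s -> t exists. */
static int bfs(int s, int t) {
    int visited[V] = {0}, queue[V], head = 0, tail = 0;
    memset(parent, -1, sizeof parent);
    visited[s] = 1;
    queue[tail++] = s;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < V; v++)
            if (!visited[v] && cap[u][v] > 0) {
                visited[v] = 1;
                parent[v] = u;
                queue[tail++] = v;
            }
    }
    return visited[t];
}

int main(void) {
    int s = 0, t = 5, maxflow = 0;   /* source 0, sink 5 */
    while (bfs(s, t)) {
        /* Bottleneck: minimum residual capacity along the augmenting path. */
        int bottleneck = INF;
        for (int v = t; v != s; v = parent[v]) {
            int u = parent[v];
            if (cap[u][v] < bottleneck) bottleneck = cap[u][v];
        }
        /* Augment: reduce forward capacities, increase backward ones. */
        for (int v = t; v != s; v = parent[v]) {
            int u = parent[v];
            cap[u][v] -= bottleneck;
            cap[v][u] += bottleneck;
        }
        maxflow += bottleneck;
    }
    printf("maximum flow = %d\n", maxflow);
    return 0;
}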
