
CS6801 MULTICORE ARCHITECTURES AND PROGRAMMING

UNIT V PARALLEL PROGRAM DEVELOPMENT (case studies)

n-Body solvers
1. The N body solver problem
2. Serial solution to n-Body solver problem
3. Parallelizing the n-body solver problem
4. Parallelizing using OpenMP
5. Parallelizing using MPI
Tree search
6. The tree search problem
7. Recursive and Non recursive solution to TSP
8. Parallelizing tree search using MPI static partitioning
9. Parallelizing tree search using MPI dynamic partitioning
10. Parallelizing tree search using OpenMP

1. The n-body solver problem

 It simulates the motion of planets or stars


 Finds the positions and velocities of a collection of interacting particles over a period of time.
 An n-body solver is a program that finds the solution to an n-body problem by simulating the
behavior of the particles

Positions and velocities are determined using the following methods:

 Newton’s second law of motion. ( force(f)=mass(m)*acceleration(a))

 Newton’s law of universal gravitation

Every particle attracts every other particle in the universe with a force which is
directly proportional to the product of their masses and inversely proportional to the square of
the distance between their centers


Applying Newton's law of universal gravitation - To find the force (F)

If particle q has position sq(t) at time t and particle k has position sk(t), then force on q exerted by particle k is given by
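$$ f_{qk}(t) \;=\; -\,\frac{G\, m_q\, m_k}{\left| s_q(t) - s_k(t) \right|^{3}}\,\bigl( s_q(t) - s_k(t) \bigr) $$

(The minus sign makes the force on particle q point toward particle k.)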

Where,

G is the gravitational constant (6.674 × 10⁻¹¹ N·m²/kg²)

mq and mk are the masses of particles q and k

|sq(t) - sk(t)| represents the distance between particles q and k

The total force on particle q exerted by the other particles is calculated by summing the individual pairwise forces:
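$$ F_q(t) \;=\; \sum_{k \neq q} f_{qk}(t) \;=\; -\,G\, m_q \sum_{k \neq q} \frac{m_k}{\left| s_q(t) - s_k(t) \right|^{3}}\,\bigl( s_q(t) - s_k(t) \bigr) $$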

Applying Newton's second law of motion (f = m·a) - To find the acceleration

If the acceleration of particle q is aq(t), then Fq(t) = mq aq(t) = mq sq''(t), where sq''(t) is the second derivative of the position sq(t).

The acceleration of particle q is obtained by the following formula
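$$ a_q(t) \;=\; s_q''(t) \;=\; \frac{F_q(t)}{m_q} \;=\; -\,G \sum_{k \neq q} \frac{m_k}{\left| s_q(t) - s_k(t) \right|^{3}}\,\bigl( s_q(t) - s_k(t) \bigr) $$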

[Note: velocity is the first derivative of the position sq(t).]

Taking the antiderivative of the above equation would give the velocities, but this approach has a problem: the right-hand side contains the unknown functions sq(t) and sk(t), not just the variable t. Hence, we use Euler's method.

Applying Euler’s method - To find the position and velocities at time t0 + Δt

In Euler's method, given the initial value of the function (g(t0)) and of its derivative (g'(t0)), we use the tangent line to approximate the function at the desired time t0 + Δt. The equation of the tangent line is
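$$ g(t) \;\approx\; g(t_0) + (t - t_0)\, g'(t_0), \qquad \text{so} \qquad g(t_0 + \Delta t) \;\approx\; g(t_0) + \Delta t\, g'(t_0) $$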


where
g(t0) is the value of the function at time t0, and
g'(t0) is the slope of the tangent line.

2. Serial solution to n-body solver problem


Basic n-body solver ( pseudo code)
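A reconstructed sketch (the line numbers are assumed to match the references below):

   1  Get input data;
   2  for each timestep {
   3     if (timestep output) Print positions and velocities of particles;
   4     for each particle q
   5        Compute total force on q;
   6     for each particle q
   7        Compute position and velocity of q;
   8  }
   9  Print positions and velocities of particles;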

The total force calculation in lines 4-5 can be refined further as follows:
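(A sketch in C; the array names pos, vel, forces, masses and the constants X, Y, G are assumptions.)

   for (q = 0; q < n; q++) {
      forces[q][X] = forces[q][Y] = 0.0;
      for (k = 0; k < n; k++)
         if (k != q) {
            x_diff = pos[q][X] - pos[k][X];
            y_diff = pos[q][Y] - pos[k][Y];
            dist = sqrt(x_diff*x_diff + y_diff*y_diff);
            dist_cubed = dist*dist*dist;
            /* force on q due to k, using Newton's law of gravitation */
            forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
            forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
         }
   }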


The position and velocity calculation in line 7 is refined as follows:
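(A sketch in C of the Euler update, using the same assumed names.)

   for (q = 0; q < n; q++) {
      pos[q][X] += delta_t*vel[q][X];
      pos[q][Y] += delta_t*vel[q][Y];
      vel[q][X] += delta_t/masses[q]*forces[q][X];
      vel[q][Y] += delta_t/masses[q]*forces[q][Y];
   }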

Reduced n-body solver


The code for the reduced solver is the same as for the basic solver, but the number of operations needed to calculate the total forces differs: the reduced solver exploits Newton's third law (the force of particle k on particle q is equal and opposite to the force of q on k), so each pair of particles is processed only once.

The calculation of the total forces for the reduced solver is given below:
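(A sketch in C; force_qk holds the force on particle q due to particle k, and the names are assumptions.)

   for (q = 0; q < n; q++)
      forces[q][X] = forces[q][Y] = 0.0;
   for (q = 0; q < n; q++)
      for (k = q+1; k < n; k++) {
         x_diff = pos[q][X] - pos[k][X];
         y_diff = pos[q][Y] - pos[k][Y];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[X] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[Y] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
         forces[q][X] += force_qk[X];   /* force on q due to k           */
         forces[q][Y] += force_qk[Y];
         forces[k][X] -= force_qk[X];   /* equal and opposite force on k */
         forces[k][Y] -= force_qk[Y];
      }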


3. Parallelizing n-body solver problem


Apply Foster's methodology to parallelize the n-body solver

(partitioning, communication, aggregation, and mapping)

Partitioning: Initially, we want a lot of tasks. Start by making our tasks the computations of the positions, the velocities, and the total forces at each timestep.

 Communication : Among the tasks in basic n-body solver

 Aggregation: Communication among the aggregated tasks in basic n-body solver


Communication among the aggregated tasks in the reduced n-body solver (shown for a pair of tasks q and r with q < r)

 Mapping

The tasks are assigned to the cores using one of the following methods:

1. Block distribution - for the basic n-body solver (every particle requires the same amount of work, so contiguous blocks of particles balance the load)

2. Cyclic distribution - for the reduced n-body solver (the work per particle decreases as the particle index increases, so assigning particles to cores in round-robin fashion balances the load better)

For this example, the components assigned to the cores are the calculations for the individual particles.


4. Parallelizing n-body solver problem using OpenMP


Parallelizing the basic n-body solver: recall the serial code for the n-body solver given above.

The parallelized code will be:
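(A sketch; the helper names Compute_force and Update_part are assumptions.)

   for (step = 1; step <= n_steps; step++) {
      t = step*delta_t;
#     pragma omp parallel for
      for (q = 0; q < n; q++)
         Compute_force(q, forces, pos, masses, n);            /* total force on q */
#     pragma omp parallel for
      for (q = 0; q < n; q++)
         Update_part(q, forces, pos, vel, masses, n, delta_t);  /* Euler update   */
   }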

There is no loop-carried dependence in either loop. However, the above code forks and joins the team of threads for every loop in every timestep. This overhead can be avoided as follows:
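(A sketch under the same assumptions: one parallel directive now encloses the whole timestep loop, so the team of threads is created only once; step, t, and q must be private.)

#  pragma omp parallel private(step, t, q)
   for (step = 1; step <= n_steps; step++) {
      t = step*delta_t;
#     pragma omp for
      for (q = 0; q < n; q++)
         Compute_force(q, forces, pos, masses, n);
#     pragma omp for
      for (q = 0; q < n; q++)
         Update_part(q, forces, pos, vel, masses, n, delta_t);
      /* the implicit barrier at the end of each omp for keeps the threads
         in step with one another                                          */
   }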


We can make only one thread print the output by adding a single directive, as follows:
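(A sketch; output_freq and Output_state are assumed names. The single block is executed by one thread, and the implicit barrier at its end holds the other threads until it finishes.)

#     pragma omp single
      if (step % output_freq == 0)
         Output_state(t, pos, vel, masses, n);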

Parallelizing the reduced n-body solver

A first attempt is to parallelize the outer loop of the force calculation, as follows:
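(A sketch, under the same assumed names. As the discussion below shows, this version is NOT correct as written.)

#  pragma omp parallel for private(k, x_diff, y_diff, dist, dist_cubed, fx, fy)
   for (q = 0; q < n; q++)
      for (k = q+1; k < n; k++) {
         x_diff = pos[q][X] - pos[k][X];
         y_diff = pos[q][Y] - pos[k][Y];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         fx = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         fy = -G*masses[q]*masses[k]/dist_cubed * y_diff;
         forces[q][X] += fx;   forces[q][Y] += fy;
         forces[k][X] -= fx;   forces[k][Y] -= fy;   /* another thread may be
                                                        updating forces[k] too */
      }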

Let's look closely at the force calculation in the inner loop above.


Loop-carried dependence

In the reduced force calculation, the iteration for particle q also updates forces[k] for every k > q. Two threads working on different values of q can therefore update the same element forces[k] at the same time, which is a race condition.

There are a number of solutions to this problem:

Solution 1:


Solution 2:

Solution 3:
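As an illustration, here is a sketch of one widely used fix (assumed here to be along the lines of the solutions above): each thread accumulates forces into its own block of a loc_forces array, and the per-thread blocks are then summed into the shared forces array. All names are assumptions.

#  pragma omp parallel
   {
      int my_rank   = omp_get_thread_num();
      int thread_ct = omp_get_num_threads();
      double *my_f  = loc_forces + (size_t)my_rank*n*DIM;  /* this thread's block */
      int q, k, t;
      double x_diff, y_diff, dist, dist_cubed, fx, fy;

      for (q = 0; q < n*DIM; q++) my_f[q] = 0.0;

      /* Phase 1: accumulate pair forces into this thread's own block only.
         The cyclic schedule balances the shrinking amount of work per q.   */
#     pragma omp for schedule(static, 1)
      for (q = 0; q < n; q++)
         for (k = q+1; k < n; k++) {
            x_diff = pos[q][X] - pos[k][X];
            y_diff = pos[q][Y] - pos[k][Y];
            dist = sqrt(x_diff*x_diff + y_diff*y_diff);
            dist_cubed = dist*dist*dist;
            fx = -G*masses[q]*masses[k]/dist_cubed * x_diff;
            fy = -G*masses[q]*masses[k]/dist_cubed * y_diff;
            my_f[q*DIM+X] += fx;   my_f[q*DIM+Y] += fy;
            my_f[k*DIM+X] -= fx;   my_f[k*DIM+Y] -= fy;
         }

      /* Phase 2: after the implicit barrier, sum the per-thread blocks.    */
#     pragma omp for
      for (q = 0; q < n; q++) {
         forces[q][X] = forces[q][Y] = 0.0;
         for (t = 0; t < thread_ct; t++) {
            forces[q][X] += loc_forces[(size_t)t*n*DIM + q*DIM + X];
            forces[q][Y] += loc_forces[(size_t)t*n*DIM + q*DIM + Y];
         }
      }
   }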

5. Parallelizing the n-body solver problem using MPI


Parallelizing the basic solver using MPI

Choices with respect to the data structures:

 - Use a basic MPI datatype (e.g., MPI_DOUBLE) for the positions, masses, and velocities rather than an MPI derived datatype built from a struct.

 - Use separate arrays for the positions, masses, and velocities rather than a single array of structs.

 - Each process stores the entire global array of particle masses.

 - Each process only uses a single n-element array for the positions.

 - Each process uses a pointer loc_pos that refers to the start of its block of pos.

 - So on process 0, loc_pos = pos; on process 1, loc_pos = pos + loc_n; and so on.


Pseudo code for MPI Version of basic solver
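A reconstructed sketch (line numbers assumed; loc_q denotes a particle assigned to this process):

    1  Get input data;
    2  for each timestep {
    3     if (timestep output)
    4        Print positions and velocities of particles;
    5     for each local particle loc_q
    6        Compute total force on loc_q;
    7     for each local particle loc_q
    8        Compute position and velocity of loc_q;
    9     Allgather local positions into the global pos array;
   10  }
   11  Print positions and velocities of particles;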

The syntax for MPI_Allgather will be
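Each process contributes its own block and receives the blocks of all the other processes:

   int MPI_Allgather(
         void*         send_buf_p   /* in  */,
         int           send_count   /* in  */,
         MPI_Datatype  send_type    /* in  */,
         void*         recv_buf_p   /* out */,
         int           recv_count   /* in  */,
         MPI_Datatype  recv_type    /* in  */,
         MPI_Comm      comm         /* in  */);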

Line 1 gets the input for all the particles and distributes it to all the processes. It can be written with MPI as follows:
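(A sketch; the variable names and the choice of what is broadcast versus scattered are assumptions.)

   if (my_rank == 0) {
      /* read every particle's mass, initial position, and initial velocity */
   }
   /* n, delta_t, and n_steps would also be broadcast with MPI_Bcast        */
   MPI_Bcast(masses, n, MPI_DOUBLE, 0, comm);
   MPI_Bcast(pos, n*DIM, MPI_DOUBLE, 0, comm);
   MPI_Scatter(vel, loc_n*DIM, MPI_DOUBLE,
               loc_vel, loc_n*DIM, MPI_DOUBLE, 0, comm);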

Parallelizing the reduced solver using MPI

Communication in the MPI reduced solver (the positions and partially computed forces are passed around a logical ring of processes)


Pseudo code for MPI implementation of reduced solver

Note:

MPI_Sendrecv_replace() can be used to combine the send and receive operations at lines 8, 9, 15, and 16 of the pseudocode.
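Its prototype, for reference:

   int MPI_Sendrecv_replace(
         void*         buf        /* in/out */,
         int           count      /* in     */,
         MPI_Datatype  datatype   /* in     */,
         int           dest       /* in     */,
         int           send_tag   /* in     */,
         int           source     /* in     */,
         int           recv_tag   /* in     */,
         MPI_Comm      comm       /* in     */,
         MPI_Status*   status_p   /* out    */);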

TREE SEARCH
6.The Tree Search Problem

A problem that is solved by searching a tree is called a tree-search problem.

Example: Traveling salesman problem (TSP)

What is TSP?

 Given a set of cities and distance between every pair of cities, the problem is to find the shortest
possible route that visits every city exactly once and returns to the starting point.
 It is an NP-Complete problem. This means that there is no algorithm known for solving it that, in all
cases, is significantly better than exhaustive search

How to solve TSP?

Example:


--------------------------------------------------------------------------------------------------------
7. Recursive and Non Recursive solution to TSP

Recursive vs. non-recursive design

• Recursive design
  – Helps understanding of the sequential code
  – Not easy to parallelize
• Non-recursive design
  – Explicit management of a stack data structure
  – Loops instead of recursive calls
  – Better for parallelization:
      – Exposes the traversal of the search tree explicitly
      – Allows scheduling of parallel threads (processes)

Pseudo code for recursive solution to TSP
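A sketch (the helper functions are the ones described below; their implementations, and the tour type, are assumed):

   void Depth_first_search(tour_t tour) {
      int city;

      if (City_count(tour) == n) {           /* a complete tour of n cities      */
         if (Best_tour(tour))                /* cheaper than the best so far?    */
            Update_best_tour(tour);
      } else {
         for (city = 1; city < n; city++)    /* try each non-hometown city       */
            if (Feasible(tour, city)) {      /* unvisited and possibly cheaper   */
               Add_city(tour, city);
               Depth_first_search(tour);
               Remove_last_city(tour);       /* undo before trying the next city */
            }
      }
   }  /* Depth_first_search */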


The algorithm makes use of several global variables:


 n: the total number of cities in the problem
 digraph: a data structure representing the input digraph
 hometown: a data structure representing vertex or city 0, the salesperson’s hometown
 best_tour: a data structure representing the best tour so far

The function City_count examines the partial tour tour to see if there are n cities on the partial tour. If
there are, we know that we simply need to return to the hometown to complete the tour, and we can check to
see if the complete tour has a lower cost than the current “best tour” by calling Best_tour. If it does, we can
replace the current best tour with this tour by calling the function Update_best_tour.

Note that before the first call to Depth_first_search, the best_tour variable should be initialized so that its cost is greater than the cost of any possible least-cost tour. If the partial tour tour hasn't visited n cities, we can continue branching down the tree by "expanding the current node," in other words, by trying to visit the cities that are not yet on the tour. The function Feasible checks whether a city or vertex has already been visited and, if not, whether it can possibly lead to a least-cost tour. If the city is feasible, we add it to the tour and recursively call Depth_first_search. When we return from Depth_first_search, we remove the city from the tour, since it shouldn't be included in the tour used in subsequent recursive calls.

Pseudo code for non-recursive solution to TSP (solution 1)

• A non-recursive implementation is more flexible
  – Shared data structures
  – Freedom to schedule threads/processes flexibly
• How to eliminate recursion
  – Explicit management of a "stack" data structure
  – Loops instead of recursive calls
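A reconstructed sketch (the line numbers are assumed to match the references in the discussion below):

    1  for (city = n-1; city >= 1; city--)
    2     Push(stack, city);
    3  while (!Empty(stack)) {
    4     city = Pop(stack);
    5     if (city == NO_CITY)            /* end of a child list: back up the tree */
    6        Remove_last_city(curr_tour);
    7     else {
    8        Add_city(curr_tour, city);
    9        if (City_count(curr_tour) == n) {
   10           if (Best_tour(curr_tour))
   11              Update_best_tour(curr_tour);
   12           Remove_last_city(curr_tour);
   13        } else {
   14           Push(stack, NO_CITY);     /* marker pushed before the children     */
   15           for (nbr = n-1; nbr >= 1; nbr--)
   16              if (Feasible(curr_tour, nbr))
   17                 Push(stack, nbr);
   18        }
   19     }
   20  }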

The basic idea is that we can try to eliminate recursion by pushing necessary data on our own stack
before branching deeper into the tree, and when we need to go back up the tree—either because we’ve
reached a leaf or because we’ve found a node that can’t lead to a better solution—we can pop the stack.
In this version, a stack record consists of a single city, the city that will be added to the tour when its
record is popped. In the recursive version we continue to make recursive calls until we’ve visited every node
of the tree that corresponds to a feasible partial tour. At this point, the stack won’t have any more activation
records for calls to Depth_first_search, and we'll return to the function that made the original call to Depth_first_search. The main control structure in our iterative version is the while loop extending from Line 3 to Line 20, and the loop termination condition is that our stack is empty. Before the search can proceed, the stack must be nonempty, so in the first two lines we push each of the non-hometown cities. Note that this loop pushes the cities in decreasing order, from n-1 down to 1. Because the stack pops the most recently pushed city first, pushing in decreasing order ensures that the cities are visited in the same (increasing) order as in the recursive function.

Also notice that in Line 5 we check whether the city we’ve popped is the constant NO_CITY. This
constant is used so that we can tell when we’ve visited all of the children of a tree node; if we didn’t use it, we
wouldn’t be able to tell when to back up in the tree. Thus, before pushing all of the children of a node (Lines
15–17), we push the NO_CITY marker.

Pseudo code for non-recursive solution to TSP (solution 2) - used for the parallel implementation

An alternative is to use partial tours as stack records (see the pseudo code below). This gives code that is closer to the recursive function. However, it also results in a slower version, since the function that pushes onto the stack must create a copy of the tour before actually pushing it onto the stack. To emphasize this point, we've called the function Push_copy. The extra memory required will probably not be a problem. However, allocating storage for a new tour and copying the existing tour is time-consuming. To some degree we can mitigate these costs by saving freed tours in our own data structure, and when a freed tour is available we can use it in the Push_copy function instead of calling malloc.
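A sketch (helper names follow the discussion above; details are assumed):

   Push_copy(stack, tour);                   /* tour initially holds just the hometown */
   while (!Empty(stack)) {
      curr_tour = Pop(stack);
      if (City_count(curr_tour) == n) {
         if (Best_tour(curr_tour))
            Update_best_tour(curr_tour);
      } else {
         for (nbr = n-1; nbr >= 1; nbr--)
            if (Feasible(curr_tour, nbr)) {
               Add_city(curr_tour, nbr);
               Push_copy(stack, curr_tour);   /* copies the partial tour before pushing */
               Remove_last_city(curr_tour);
            }
      }
      Free_tour(curr_tour);                   /* or save it for reuse instead of freeing */
   }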


Tree showing the working of the above algorithm


8. Parallelizing tree search using MPI static partitioning


Static partitioning divides the work (the set of initial partial tours) equally among the threads or processes before the search begins.


From thread code to MPI code


The major steps are:
1. Distribute initial partial tours to processes (or partition the tree)
   - Use a loop of MPI_Send(), or
   - Use MPI_Scatterv(), which supports non-uniform message sizes to different destinations.
2. Inform the best tour to all processes
   - A process finds a new best tour if the new cost is lower.
   - Do not use the blocking collective communication MPI_Bcast().
   - Sender: may use MPI_Send() to inform the others; it is safer to use MPI_Bsend() with its own buffer space.
   - Receiver: do not use a blocking MPI_Recv(); use asynchronous, non-blocking receiving with MPI_Iprobe.
3. Print the best tour


1.Distribute initial partial tours to processes


The following MPI functions are used for distributing the work

Sending a different number of objects to each process in the communicator
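MPI_Scatterv() serves this purpose; its prototype is:

   int MPI_Scatterv(
         void*         sendbuf      /* in  */,
         int*          sendcounts   /* in  */,
         int*          displs       /* in  */,
         MPI_Datatype  sendtype     /* in  */,
         void*         recvbuf      /* out */,
         int           recvcount    /* in  */,
         MPI_Datatype  recvtype     /* in  */,
         int           root         /* in  */,
         MPI_Comm      comm         /* in  */);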

Gathering a different number of objects from each process in the communicator
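MPI_Gatherv() is the corresponding gather operation:

   int MPI_Gatherv(
         void*         sendbuf      /* in  */,
         int           sendcount    /* in  */,
         MPI_Datatype  sendtype     /* in  */,
         void*         recvbuf      /* out */,
         int*          recvcounts   /* in  */,
         int*          displs       /* in  */,
         MPI_Datatype  recvtype     /* in  */,
         int           root         /* in  */,
         MPI_Comm      comm         /* in  */);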

2. Inform the best tour to all processes


MPI_Send(), MPI_Iprobe(), and MPI_Recv() are used.

To send the best tour:
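(A sketch; the tag and variable names are assumptions. MPI_Bsend() can be substituted for MPI_Send() if system buffering is a concern.)

   for (dest = 0; dest < comm_sz; dest++)
      if (dest != my_rank)
         MPI_Send(&new_best_cost, 1, MPI_INT, dest, NEW_COST_TAG, comm);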

Checking to see if a message is available (asynchronous, non-blocking probe):
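MPI_Iprobe() returns immediately and sets a flag that tells whether a matching message is waiting:

   int MPI_Iprobe(
         int           source       /* in  */,
         int           tag          /* in  */,
         MPI_Comm      comm         /* in  */,
         int*          msg_avail_p  /* out */,
         MPI_Status*   status_p     /* out */);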


To receive the best tour: if a message is available, use a standard MPI_Recv() to receive it.

MPI code to check for new best tour cost
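(A sketch; the tag and variable names are assumptions.)

   int        msg_avail, received_cost;
   MPI_Status status;

   MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);
   while (msg_avail) {
      MPI_Recv(&received_cost, 1, MPI_INT, status.MPI_SOURCE, NEW_COST_TAG,
               comm, MPI_STATUS_IGNORE);
      if (received_cost < best_tour_cost)
         best_tour_cost = received_cost;        /* adopt the better cost */
      MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);
   }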

Modes and Buffered Sends


MPI provides four modes for sends.
 Standard: MPI_Send()
–Use system buffer. Block if there is no buffer space
 Synchronous: MPI_Ssend()
–Block until a matching receive is posted.
 Ready: MPI_Rsend()
–Error unless a matching receive is posted before sending
 Buffered: MPI_Bsend()
–Supply your own buffer space.
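(A sketch of supplying the buffer for MPI_Bsend(); the buffer size is an illustrative assumption.)

   int   buf_size = 100*(sizeof(int) + MPI_BSEND_OVERHEAD);
   char* buffer   = malloc(buf_size);

   MPI_Buffer_attach(buffer, buf_size);
   /* ... MPI_Bsend() calls now copy their data into this buffer ...        */
   MPI_Buffer_detach(&buffer, &buf_size);
   free(buffer);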
3. Printing the best tour
At the end of the MPI tree search:
 - Gather and print the best tour at the end.
   - Use MPI_Allreduce() to find the lowest cost among all the processes.
   - Process 0 prints the final result.
 - Clean up unreceived messages before shutting down MPI.
   - Some messages won't be received during the parallel search.
   - Use MPI_Iprobe to receive outstanding messages before MPI_Finalize().
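(A sketch of the final gather; the struct layout matches MPI_2INT, and the names are assumptions.)

   struct { int cost; int rank; } my_best, global_best;

   my_best.cost = best_tour_cost;
   my_best.rank = my_rank;
   MPI_Allreduce(&my_best, &global_best, 1, MPI_2INT, MPI_MINLOC, comm);
   /* the process whose rank is global_best.rank sends its tour to process 0,
      and process 0 prints it                                                 */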
Example


--------------------------------------------------------------------------------------------------------
9. Parallelizing tree search using MPI dynamic partitioning
Weakness of static partitioning:
 Load imbalance
-Many paths may be dynamically pruned
-The workload assigned to threads can be uneven.
 How to improve load balancing?
- Schedule computation statically initially.
- Shift workload dynamically when some threads have nothing to do
–Also called work stealing
To address this issue, we can try to dynamically redistribute the work as the computation proceeds. To do this, we can replace the test !Empty(my_stack) controlling execution of the while loop with a call to a more complex function, Terminated().
Dynamic work stealing code for thread my_rank
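(A sketch; the helper names follow the surrounding discussion, and the Terminated interface is an assumption.)

   while (!Terminated(&my_stack, my_rank)) {
      curr_tour = Pop(my_stack);
      if (City_count(curr_tour) == n) {
         if (Best_tour(curr_tour))
            Update_best_tour(curr_tour);
      } else {
         for (nbr = n-1; nbr >= 1; nbr--)
            if (Feasible(curr_tour, nbr)) {
               Add_city(curr_tour, nbr);
               Push_copy(my_stack, curr_tour);
               Remove_last_city(curr_tour);
            }
      }
      Free_tour(curr_tour);
   }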

Dynamic partitioning

 Uses majority of MPI code for static partitioning

 Special handling of distributed termination detection

 Emulate in a distributed memory setting


 Handle a process runs out of work (stack is empty)

–Request work from MyRank+1 first.

–Wait for receiving additional work

–Quit if no more work is available

 A process with work splits its stack and sends work to an idle process.

–Use special MPI message packaging

Pseudo code for Terminated() function

 - The function My_avail_tour_count can simply return the size of the process's stack.

 If a process has enough work so that it can usefully split its stack, it calls Fulfill_request (Line 2).
Fulfill_request uses MPI_Iprobe to check for a request for work from another process. If there is a
request, it receives it, splits its stack, and sends work to the requesting process. If there isn’t a request
for work, the process just returns.

Splitting the stack

A Split_stack function is called by Fulfill_request. The MPI version of Split_stack packs the contents
of the new stack into contiguous memory and sends the block of contiguous memory, which is unpacked by
the receiver into a new stack.

MPI provides a function, MPI_Pack, for packing data into a buffer of contiguous memory. It also provides a function, MPI_Unpack, for unpacking data from a buffer of contiguous memory.
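(A sketch; the buffer size, tags, and stack representation are assumptions, but the MPI_Pack/MPI_Unpack calls are standard.)

   char       buffer[MAX_BUF];
   int        position = 0;
   MPI_Status status;

   /* sender: pack the split-off stack and send it as one message           */
   MPI_Pack(&new_stack_size, 1, MPI_INT, buffer, MAX_BUF, &position, comm);
   MPI_Pack(new_stack_cities, new_stack_size, MPI_INT, buffer, MAX_BUF,
            &position, comm);
   MPI_Send(buffer, position, MPI_PACKED, dest, WORK_TAG, comm);

   /* receiver: unpack the buffer into its own stack                        */
   MPI_Recv(buffer, MAX_BUF, MPI_PACKED, src, WORK_TAG, comm, &status);
   position = 0;
   MPI_Unpack(buffer, MAX_BUF, &position, &new_stack_size, 1, MPI_INT, comm);
   MPI_Unpack(buffer, MAX_BUF, &position, new_stack_cities, new_stack_size,
              MPI_INT, comm);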

Distributed termination detection

The functions Out_of_work and No_work_left (Lines 11 and 15) implement the termination
detection algorithm.

 Each process maintains a variable (oow) as # of out-of-work processes.

 The entire computation quits if oow = n where n is # of processes.

 When a process runs out of work, notify everybody (oow++)

 When a process receives new workload, notify everybody (oow--)

Sending requests for work

Once we’ve decided on which process we plan to send a request to, we can just send a zero-length
message with a “request for work” tag.
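(A sketch; the tag name is an assumption.)

   MPI_Send(NULL, 0, MPI_INT, dest, WORK_REQUEST_TAG, comm);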

Checking for and receiving work

Once a request for work has been sent, it's critical that the sending process repeatedly check for a response from the destination. The Check_for_work function should therefore first probe for a message from the destination indicating that work is available and, if there isn't such a message, it should probe for a message from the destination saying there's no work available. If there is work available, the Receive_work function can receive the message containing the work and unpack the contents of the message buffer into the process's stack.
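(A sketch of Check_for_work; the tags and names are assumptions.)

   MPI_Iprobe(dest, WORK_AVAIL_TAG, comm, &work_avail, &status);
   if (work_avail) {
      Receive_work(dest, comm, my_stack);     /* MPI_Recv + MPI_Unpack      */
   } else {
      MPI_Iprobe(dest, NO_WORK_TAG, comm, &no_work, &status);
      if (no_work) {
         /* the destination had nothing to share: try another process or
            check whether the whole computation has terminated              */
      }
   }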


10. Parallelizing tree search using OpenMP


The issues involved in implementing the static and dynamic parallel tree-search programs using OpenMP are the same as the issues involved in implementing the programs using Pthreads. The example below is the Pthreads implementation of the tree search for TSP.

The Pthreads code for the dynamically parallelized TSP replaces Empty() in the above while loop with Terminated(); the pseudo code for Terminated() is given below.


There are almost no substantive differences between a static implementation that uses OpenMP and
one that uses Pthreads. However, a couple of points should be mentioned:

1. When a single thread executes some code in the Pthreads version, the test

if (my_rank == whatever)
can be replaced by the OpenMP directive
# pragma omp single

This will insure that the following structured block of code will be executed by one thread in the team,
and the other threads in the team will wait in an implicit barrier at the end of the block until the executing
thread is finished.
When whatever is 0 (as it is in each test in the Pthreads program), the test can also be replaced by the
OpenMP directive
# pragma omp master

This will insure that thread 0 executes the following structured block of code. However,
the master directive doesn’t put an implicit barrier at the end of the block, so it may be necessary to also add
a barrier directive after a structured block that has been modified by a master directive.
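A short sketch of the combination:

#  pragma omp master
   {
      /* work done only by thread 0                                          */
   }
#  pragma omp barrier   /* master has no implicit barrier, so add one if the
                           other threads must wait for thread 0              */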
2. The Pthreads mutex that protects the best tour can be replaced by a single critical directive placed either inside the Update_best_tour function or immediately before the call to Update_best_tour. This is the only potential source of a race condition after the distribution of the initial tours, so the simple critical directive won't cause a thread to block unnecessarily.
OpenMP even provides a nonblocking version of omp_set_lock. Recall that OpenMP provides a lock
object omp_lock_t and the following functions for acquiring and relinquishing the lock, respectively:
void omp_set_lock(omp_lock_t* lock_p /* in/out */);
void omp_unset_lock(omp_lock_t* lock_p /* in/out */);
It also provides the function

int omp_test_lock(omp_lock_t* lock_p /* in/out */);


which is analogous to pthread_mutex_trylock; it attempts to acquire the lock lock_p, and if it succeeds it returns true (or nonzero). If the lock is being used by some other thread, it returns immediately with return value false (or zero).
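(A sketch of non-blocking acquisition; the lock name is an assumption.)

   if (omp_test_lock(&best_tour_lock)) {      /* got the lock                 */
      if (Best_tour(curr_tour))
         Update_best_tour(curr_tour);
      omp_unset_lock(&best_tour_lock);
   } else {
      /* the lock is busy: keep searching and try the update again later     */
   }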
If we examine the pseudocode for the Pthreads Terminated function in Program 6.8, we see that in
order to adapt the Pthreads version to OpenMP, we need to emulate the functionality of the Pthreads function
calls
pthread_cond_signal(&term_cond_var);
pthread_cond_broadcast(&term_cond_var);
pthread_cond_wait(&term_cond_var, &term_mutex);
The simplest solution to emulating a condition wait in OpenMP is to use busy-waiting. Since there are
two conditions a waiting thread should test for, we can use two different variables in the busy-wait loop:
