
CS6801 MULTICORE ARCHITECTURES AND PROGRAMMING

UNIT V PARALLEL PROGRAM DEVELOPMENT (case studies)

n-Body solvers
1. The N body solver problem
2. Serial solution to n-Body solver problem
3. Parallelizing the n-body solver problem
4. Parallelizing using OpenMP
5. Parallelizing using MPI
Tree search
6. The tree search problem
7. Recursive and Non recursive solution to TSP
8. Parallelizing tree search using MPI static partitioning
9. Parallelizing tree search using MPI dynamic partitioning
10. Parallelizing tree search using OpenMP

1. The n-body solver problem

 It simulates the motion of planets or stars


 Finds the positions and velocities of a collection of interacting particles over a period of time.
 An n-body solver is a program that finds the solution to an n-body problem by simulating the
behavior of the particles

Positions and velocities are determined using the following methods:

 Newton’s second law of motion. ( force(f)=mass(m)*acceleration(a))

 Newton’s law of universal gravitation

Every particle attracts every other particle in the universe with a force which is
directly proportional to the product of their masses and inversely proportional to the square of
the distance between their centers


Applying Newton's law of universal gravitation - To find the force (F)

If particle q has position sq(t) at time t and particle k has position sk(t), then force on q exerted by particle k is given by
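$$ f_{qk}(t) \;=\; -\,\frac{G\, m_q\, m_k}{\left| s_q(t) - s_k(t) \right|^{3}}\,\bigl( s_q(t) - s_k(t) \bigr) $$

(The minus sign makes the force on particle q point toward particle k.)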

Where,

G is the gravitational constant (6.674 × 10⁻¹¹ N·m²/kg²)

mq and mk are the masses of particles q and k

|sq(t) - sk(t)| represents the distance between particles q and k

The total force on particle q exerted by the other particles is calculated by summing the individual pairwise forces:
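$$ F_q(t) \;=\; \sum_{k \neq q} f_{qk}(t) \;=\; -\,G\, m_q \sum_{k \neq q} \frac{m_k}{\left| s_q(t) - s_k(t) \right|^{3}}\,\bigl( s_q(t) - s_k(t) \bigr) $$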

Applying Newton's second law of motion (f = m·a) - To find the acceleration

If the acceleration of particle q is aq(t), then Fq(t) = mq aq(t) = mq sq''(t), where sq''(t) is the second derivative of the position sq(t).

The acceleration of particle q is obtained by the following formula
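$$ a_q(t) \;=\; s_q''(t) \;=\; \frac{F_q(t)}{m_q} \;=\; -\,G \sum_{k \neq q} \frac{m_k}{\left| s_q(t) - s_k(t) \right|^{3}}\,\bigl( s_q(t) - s_k(t) \bigr) $$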

[Note: velocity is the first derivative of the position sq(t).]

Taking the antiderivative of the above equation would give the velocities, but this approach has a problem: the right-hand side contains the unknown functions sq(t) and sk(t), not just the variable t. Hence, we use Euler's method.

Applying Euler’s method - To find the position and velocities at time t0 + Δt

In Euler's method, given the initial value of the function (g(t0)) and of its derivative (g'(t0)), we use the tangent line to approximate the function at the desired time t0 + Δt. The equation of the tangent line is
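$$ g(t) \;\approx\; g(t_0) + (t - t_0)\, g'(t_0), \qquad \text{so} \qquad g(t_0 + \Delta t) \;\approx\; g(t_0) + \Delta t\, g'(t_0) $$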


where
g(t0) is the value of the function at time t0, and
g'(t0) is the slope of the tangent line.

2. Serial solution to n-body solver problem


Basic n-body solver ( pseudo code)
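A reconstructed sketch (the line numbers are assumed to match the references below):

   1  Get input data;
   2  for each timestep {
   3     if (timestep output) Print positions and velocities of particles;
   4     for each particle q
   5        Compute total force on q;
   6     for each particle q
   7        Compute position and velocity of q;
   8  }
   9  Print positions and velocities of particles;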

The total force calculation in lines 4-5 can be refined further as follows:
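(A sketch in C; the array names pos, vel, forces, masses and the constants X, Y, G are assumptions.)

   for (q = 0; q < n; q++) {
      forces[q][X] = forces[q][Y] = 0.0;
      for (k = 0; k < n; k++)
         if (k != q) {
            x_diff = pos[q][X] - pos[k][X];
            y_diff = pos[q][Y] - pos[k][Y];
            dist = sqrt(x_diff*x_diff + y_diff*y_diff);
            dist_cubed = dist*dist*dist;
            /* force on q due to k, using Newton's law of gravitation */
            forces[q][X] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
            forces[q][Y] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
         }
   }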


The position and velocity calculation in line 7 is refined as follows:
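(A sketch in C of the Euler update, using the same assumed names.)

   for (q = 0; q < n; q++) {
      pos[q][X] += delta_t*vel[q][X];
      pos[q][Y] += delta_t*vel[q][Y];
      vel[q][X] += delta_t/masses[q]*forces[q][X];
      vel[q][Y] += delta_t/masses[q]*forces[q][Y];
   }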

Reduced n-body solver


The code for the reduced solver is the same as for the basic solver, but the number of operations needed to calculate the total forces differs: the reduced solver exploits Newton's third law (the force of particle k on particle q is equal and opposite to the force of q on k), so each pair of particles is processed only once.

The calculation of the total forces for the reduced solver is given below:
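(A sketch in C; force_qk holds the force on particle q due to particle k, and the names are assumptions.)

   for (q = 0; q < n; q++)
      forces[q][X] = forces[q][Y] = 0.0;
   for (q = 0; q < n; q++)
      for (k = q+1; k < n; k++) {
         x_diff = pos[q][X] - pos[k][X];
         y_diff = pos[q][Y] - pos[k][Y];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         force_qk[X] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         force_qk[Y] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
         forces[q][X] += force_qk[X];   /* force on q due to k           */
         forces[q][Y] += force_qk[Y];
         forces[k][X] -= force_qk[X];   /* equal and opposite force on k */
         forces[k][Y] -= force_qk[Y];
      }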


3. Parallelizing n-body solver problem


Apply Foster's methodology to parallelize the n-body solver

(partitioning, communication, aggregation, and mapping)

Partitioning: Initially, we want a lot of tasks. Start by making our tasks the computations of the positions, the velocities, and the total forces at each timestep.

 Communication : Among the tasks in basic n-body solver

 Aggregation: Communication among the aggregated tasks in basic n-body solver


Communication among the aggregated tasks in the reduced n-body solver (shown for a pair of tasks q and r with q < r)

 Mapping

The tasks are assigned to the cores using one of the following methods:

1. Block distribution - for the basic n-body solver (every particle requires the same amount of work, so contiguous blocks of particles balance the load)

2. Cyclic distribution - for the reduced n-body solver (the work per particle decreases as the particle index increases, so assigning particles to cores in round-robin fashion balances the load better)

For this example, the components assigned to the cores are the calculations for the individual particles.


4. Parallelizing n-body solver problem using OpenMP


Parallelizing the basic n-body solver: recall the serial code for the n-body solver given above.

The parallelized code will be:
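(A sketch; the helper names Compute_force and Update_part are assumptions.)

   for (step = 1; step <= n_steps; step++) {
      t = step*delta_t;
#     pragma omp parallel for
      for (q = 0; q < n; q++)
         Compute_force(q, forces, pos, masses, n);            /* total force on q */
#     pragma omp parallel for
      for (q = 0; q < n; q++)
         Update_part(q, forces, pos, vel, masses, n, delta_t);  /* Euler update   */
   }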

There is no loop-carried dependence in either loop. However, the above code forks and joins the team of threads for every loop in every timestep. This overhead can be avoided as follows:
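(A sketch under the same assumptions: one parallel directive now encloses the whole timestep loop, so the team of threads is created only once; step, t, and q must be private.)

#  pragma omp parallel private(step, t, q)
   for (step = 1; step <= n_steps; step++) {
      t = step*delta_t;
#     pragma omp for
      for (q = 0; q < n; q++)
         Compute_force(q, forces, pos, masses, n);
#     pragma omp for
      for (q = 0; q < n; q++)
         Update_part(q, forces, pos, vel, masses, n, delta_t);
      /* the implicit barrier at the end of each omp for keeps the threads
         in step with one another                                          */
   }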


We can make only one thread print the output by adding a single directive, as follows:
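(A sketch; output_freq and Output_state are assumed names. The single block is executed by one thread, and the implicit barrier at its end holds the other threads until it finishes.)

#     pragma omp single
      if (step % output_freq == 0)
         Output_state(t, pos, vel, masses, n);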

Parallelizing the reduced n-body solver

A first attempt is to parallelize the outer loop of the force calculation, as follows:
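(A sketch, under the same assumed names. As the discussion below shows, this version is NOT correct as written.)

#  pragma omp parallel for private(k, x_diff, y_diff, dist, dist_cubed, fx, fy)
   for (q = 0; q < n; q++)
      for (k = q+1; k < n; k++) {
         x_diff = pos[q][X] - pos[k][X];
         y_diff = pos[q][Y] - pos[k][Y];
         dist = sqrt(x_diff*x_diff + y_diff*y_diff);
         dist_cubed = dist*dist*dist;
         fx = -G*masses[q]*masses[k]/dist_cubed * x_diff;
         fy = -G*masses[q]*masses[k]/dist_cubed * y_diff;
         forces[q][X] += fx;   forces[q][Y] += fy;
         forces[k][X] -= fx;   forces[k][Y] -= fy;   /* another thread may be
                                                        updating forces[k] too */
      }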

Let's look closely at the force calculation in the inner loop above.


Loop-carried dependence

In the reduced force calculation, the iteration for particle q also updates forces[k] for every k > q. Two threads working on different values of q can therefore update the same element forces[k] at the same time, which is a race condition.

There are a number of solutions to this problem:

Solution 1:


Solution 2:

Solution 3:
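As an illustration, here is a sketch of one widely used fix (assumed here to be along the lines of the solutions above): each thread accumulates forces into its own block of a loc_forces array, and the per-thread blocks are then summed into the shared forces array. All names are assumptions.

#  pragma omp parallel
   {
      int my_rank   = omp_get_thread_num();
      int thread_ct = omp_get_num_threads();
      double *my_f  = loc_forces + (size_t)my_rank*n*DIM;  /* this thread's block */
      int q, k, t;
      double x_diff, y_diff, dist, dist_cubed, fx, fy;

      for (q = 0; q < n*DIM; q++) my_f[q] = 0.0;

      /* Phase 1: accumulate pair forces into this thread's own block only.
         The cyclic schedule balances the shrinking amount of work per q.   */
#     pragma omp for schedule(static, 1)
      for (q = 0; q < n; q++)
         for (k = q+1; k < n; k++) {
            x_diff = pos[q][X] - pos[k][X];
            y_diff = pos[q][Y] - pos[k][Y];
            dist = sqrt(x_diff*x_diff + y_diff*y_diff);
            dist_cubed = dist*dist*dist;
            fx = -G*masses[q]*masses[k]/dist_cubed * x_diff;
            fy = -G*masses[q]*masses[k]/dist_cubed * y_diff;
            my_f[q*DIM+X] += fx;   my_f[q*DIM+Y] += fy;
            my_f[k*DIM+X] -= fx;   my_f[k*DIM+Y] -= fy;
         }

      /* Phase 2: after the implicit barrier, sum the per-thread blocks.    */
#     pragma omp for
      for (q = 0; q < n; q++) {
         forces[q][X] = forces[q][Y] = 0.0;
         for (t = 0; t < thread_ct; t++) {
            forces[q][X] += loc_forces[(size_t)t*n*DIM + q*DIM + X];
            forces[q][Y] += loc_forces[(size_t)t*n*DIM + q*DIM + Y];
         }
      }
   }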

5. Parallelizing the n-body solver problem using MPI


Parallelizing the basic solver using MPI

Choices with respect to the data structures:

 - Use a basic MPI datatype (e.g., MPI_DOUBLE) for the positions, masses, and velocities rather than an MPI derived datatype built from a struct.

 - Use separate arrays for the positions, masses, and velocities rather than a single array of structs.

 - Each process stores the entire global array of particle masses.

 - Each process only uses a single n-element array for the positions.

 - Each process uses a pointer loc_pos that refers to the start of its block of pos.

 - So on process 0, loc_pos = pos; on process 1, loc_pos = pos + loc_n; and so on.


Pseudo code for MPI Version of basic solver
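A reconstructed sketch (line numbers assumed; loc_q denotes a particle assigned to this process):

    1  Get input data;
    2  for each timestep {
    3     if (timestep output)
    4        Print positions and velocities of particles;
    5     for each local particle loc_q
    6        Compute total force on loc_q;
    7     for each local particle loc_q
    8        Compute position and velocity of loc_q;
    9     Allgather local positions into the global pos array;
   10  }
   11  Print positions and velocities of particles;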

The syntax for MPI_Allgather will be
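Each process contributes its own block and receives the blocks of all the other processes:

   int MPI_Allgather(
         void*         send_buf_p   /* in  */,
         int           send_count   /* in  */,
         MPI_Datatype  send_type    /* in  */,
         void*         recv_buf_p   /* out */,
         int           recv_count   /* in  */,
         MPI_Datatype  recv_type    /* in  */,
         MPI_Comm      comm         /* in  */);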

Line 1 gets the input for all the particles and distributes it to all the processes. It can be written with MPI as follows:
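(A sketch; the variable names and the choice of what is broadcast versus scattered are assumptions.)

   if (my_rank == 0) {
      /* read every particle's mass, initial position, and initial velocity */
   }
   /* n, delta_t, and n_steps would also be broadcast with MPI_Bcast        */
   MPI_Bcast(masses, n, MPI_DOUBLE, 0, comm);
   MPI_Bcast(pos, n*DIM, MPI_DOUBLE, 0, comm);
   MPI_Scatter(vel, loc_n*DIM, MPI_DOUBLE,
               loc_vel, loc_n*DIM, MPI_DOUBLE, 0, comm);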

Parallelizing the reduced solver using MPI

Communication in the MPI reduced solver (the positions and partially computed forces are passed around a logical ring of processes)


Pseudo code for MPI implementation of reduced solver

Note:

MPI_Sendrecv_replace() can be used to combine the send and receive operations at lines 8, 9, 15, and 16 of the pseudocode.
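Its prototype, for reference:

   int MPI_Sendrecv_replace(
         void*         buf        /* in/out */,
         int           count      /* in     */,
         MPI_Datatype  datatype   /* in     */,
         int           dest       /* in     */,
         int           send_tag   /* in     */,
         int           source     /* in     */,
         int           recv_tag   /* in     */,
         MPI_Comm      comm       /* in     */,
         MPI_Status*   status_p   /* out    */);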

TREE SEARCH
6.The Tree Search Problem

A problem that is solved by searching a tree is called a tree-search problem.

Example: Traveling salesman problem (TSP)

What is TSP?

 Given a set of cities and distance between every pair of cities, the problem is to find the shortest
possible route that visits every city exactly once and returns to the starting point.
 It is an NP-Complete problem. This means that there is no algorithm known for solving it that, in all
cases, is significantly better than exhaustive search

How to solve TSP?

Example:


--------------------------------------------------------------------------------------------------------
7. Recursive and Non Recursive solution to TSP

Recursive vs. non-recursive design

• Recursive design
  – Helps understanding of the sequential code
  – Not easy to parallelize
• Non-recursive design
  – Explicit management of a stack data structure
  – Loops instead of recursive calls
  – Better for parallelization:
      – Exposes the traversal of the search tree explicitly
      – Allows scheduling of parallel threads (processes)

Pseudo code for recursive solution to TSP
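A sketch (the helper functions are the ones described below; their implementations, and the tour type, are assumed):

   void Depth_first_search(tour_t tour) {
      int city;

      if (City_count(tour) == n) {           /* a complete tour of n cities      */
         if (Best_tour(tour))                /* cheaper than the best so far?    */
            Update_best_tour(tour);
      } else {
         for (city = 1; city < n; city++)    /* try each non-hometown city       */
            if (Feasible(tour, city)) {      /* unvisited and possibly cheaper   */
               Add_city(tour, city);
               Depth_first_search(tour);
               Remove_last_city(tour);       /* undo before trying the next city */
            }
      }
   }  /* Depth_first_search */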


The algorithm makes use of several global variables:


 n: the total number of cities in the problem
 digraph: a data structure representing the input digraph
 hometown: a data structure representing vertex or city 0, the salesperson’s hometown
 best_tour: a data structure representing the best tour so far

The function City_count examines the partial tour tour to see if there are n cities on the partial tour. If
there are, we know that we simply need to return to the hometown to complete the tour, and we can check to
see if the complete tour has a lower cost than the current “best tour” by calling Best_tour. If it does, we can
replace the current best tour with this tour by calling the function Update_best_tour.

Note that before the first call to Depth_first_search, the best_tour variable should be initialized so that its cost is greater than the cost of any possible least-cost tour. If the partial tour tour hasn't visited n cities, we can continue branching down the tree by "expanding the current node," in other words, by trying to visit the cities that are not yet on the tour. The function Feasible checks whether a city or vertex has already been visited and, if not, whether it can possibly lead to a least-cost tour. If the city is feasible, we add it to the tour and recursively call Depth_first_search. When we return from Depth_first_search, we remove the city from the tour, since it shouldn't be included in the tour used in subsequent recursive calls.

Pseudo code for non-recursive solution to TSP (solution 1)

• A non-recursive implementation is more flexible
  – Shared data structures
  – Freedom to schedule threads/processes flexibly
• How to eliminate recursion
  – Explicit management of a "stack" data structure
  – Loops instead of recursive calls
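A reconstructed sketch (the line numbers are assumed to match the references in the discussion below):

    1  for (city = n-1; city >= 1; city--)
    2     Push(stack, city);
    3  while (!Empty(stack)) {
    4     city = Pop(stack);
    5     if (city == NO_CITY)            /* end of a child list: back up the tree */
    6        Remove_last_city(curr_tour);
    7     else {
    8        Add_city(curr_tour, city);
    9        if (City_count(curr_tour) == n) {
   10           if (Best_tour(curr_tour))
   11              Update_best_tour(curr_tour);
   12           Remove_last_city(curr_tour);
   13        } else {
   14           Push(stack, NO_CITY);     /* marker pushed before the children     */
   15           for (nbr = n-1; nbr >= 1; nbr--)
   16              if (Feasible(curr_tour, nbr))
   17                 Push(stack, nbr);
   18        }
   19     }
   20  }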

The basic idea is that we can try to eliminate recursion by pushing necessary data on our own stack
before branching deeper into the tree, and when we need to go back up the tree—either because we’ve
reached a leaf or because we’ve found a node that can’t lead to a better solution—we can pop the stack.
In this version, a stack record consists of a single city, the city that will be added to the tour when its
record is popped. In the recursive version we continue to make recursive calls until we’ve visited every node
of the tree that corresponds to a feasible partial tour. At this point, the stack won’t have any more activation
records for calls to Depth_first_search, and we'll return to the function that made the original call to Depth_first_search. The main control structure in our iterative version is the while loop extending from Line 3 to Line 20, and the loop termination condition is that our stack is empty. Before the search can proceed, the stack must be nonempty, so in the first two lines we push each of the non-hometown cities. Note that this loop pushes the cities in decreasing order, from n-1 down to 1. Because the stack pops the most recently pushed city first, pushing in decreasing order ensures that the cities are visited in the same (increasing) order as in the recursive function.

Also notice that in Line 5 we check whether the city we’ve popped is the constant NO_CITY. This
constant is used so that we can tell when we’ve visited all of the children of a tree node; if we didn’t use it, we
wouldn’t be able to tell when to back up in the tree. Thus, before pushing all of the children of a node (Lines
15–17), we push the NO_CITY marker.

Pseudo code for non-recursive solution to TSP (solution 2) - used for the parallel implementation

An alternative is to use partial tours as stack records (see the pseudo code below). This gives code that is closer to the recursive function. However, it also results in a slower version, since the function that pushes onto the stack must create a copy of the tour before actually pushing it onto the stack. To emphasize this point, we've called the function Push_copy. The extra memory required will probably not be a problem. However, allocating storage for a new tour and copying the existing tour is time-consuming. To some degree we can mitigate these costs by saving freed tours in our own data structure, and when a freed tour is available we can use it in the Push_copy function instead of calling malloc.
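A sketch (helper names follow the discussion above; details are assumed):

   Push_copy(stack, tour);                   /* tour initially holds just the hometown */
   while (!Empty(stack)) {
      curr_tour = Pop(stack);
      if (City_count(curr_tour) == n) {
         if (Best_tour(curr_tour))
            Update_best_tour(curr_tour);
      } else {
         for (nbr = n-1; nbr >= 1; nbr--)
            if (Feasible(curr_tour, nbr)) {
               Add_city(curr_tour, nbr);
               Push_copy(stack, curr_tour);   /* copies the partial tour before pushing */
               Remove_last_city(curr_tour);
            }
      }
      Free_tour(curr_tour);                   /* or save it for reuse instead of freeing */
   }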


Tree showing the working of the above algorithm


8. Parallelizing tree search using MPI static partitioning


Static partitioning divides the work (the set of initial partial tours) equally among the threads or processes before the search begins.


From thread code to MPI code


The major steps are:
1. Distribute initial partial tours to processes (or partition the tree)
   - Use a loop of MPI_Send(), or
   - Use MPI_Scatterv(), which supports non-uniform message sizes to different destinations.
2. Inform the best tour to all processes
   - A process finds a new best tour if the new cost is lower.
   - Do not use the blocking collective communication MPI_Bcast().
   - Sender: may use MPI_Send() to inform the others; it is safer to use MPI_Bsend() with its own buffer space.
   - Receiver: do not use a blocking MPI_Recv(); use asynchronous, non-blocking receiving with MPI_Iprobe.
3. Print the best tour


1.Distribute initial partial tours to processes


The following MPI functions are used for distributing the work

Sending a different number of objects to each process in the communicator
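MPI_Scatterv() serves this purpose; its prototype is:

   int MPI_Scatterv(
         void*         sendbuf      /* in  */,
         int*          sendcounts   /* in  */,
         int*          displs       /* in  */,
         MPI_Datatype  sendtype     /* in  */,
         void*         recvbuf      /* out */,
         int           recvcount    /* in  */,
         MPI_Datatype  recvtype     /* in  */,
         int           root         /* in  */,
         MPI_Comm      comm         /* in  */);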

Gathering a different number of objects from each process in the communicator
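MPI_Gatherv() is the corresponding gather operation:

   int MPI_Gatherv(
         void*         sendbuf      /* in  */,
         int           sendcount    /* in  */,
         MPI_Datatype  sendtype     /* in  */,
         void*         recvbuf      /* out */,
         int*          recvcounts   /* in  */,
         int*          displs       /* in  */,
         MPI_Datatype  recvtype     /* in  */,
         int           root         /* in  */,
         MPI_Comm      comm         /* in  */);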

2. Inform the best tour to all processes


MPI_Send(), MPI_Iprobe(), and MPI_Recv() are used.

To send the best tour:
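(A sketch; the tag and variable names are assumptions. MPI_Bsend() can be substituted for MPI_Send() if system buffering is a concern.)

   for (dest = 0; dest < comm_sz; dest++)
      if (dest != my_rank)
         MPI_Send(&new_best_cost, 1, MPI_INT, dest, NEW_COST_TAG, comm);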

Checking to see if a message is available (asynchronous, non-blocking probe):
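MPI_Iprobe() returns immediately and sets a flag that tells whether a matching message is waiting:

   int MPI_Iprobe(
         int           source       /* in  */,
         int           tag          /* in  */,
         MPI_Comm      comm         /* in  */,
         int*          msg_avail_p  /* out */,
         MPI_Status*   status_p     /* out */);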


To receive the best tour: if a message is available, use a standard MPI_Recv() to receive it.

MPI code to check for new best tour cost
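(A sketch; the tag and variable names are assumptions.)

   int        msg_avail, received_cost;
   MPI_Status status;

   MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);
   while (msg_avail) {
      MPI_Recv(&received_cost, 1, MPI_INT, status.MPI_SOURCE, NEW_COST_TAG,
               comm, MPI_STATUS_IGNORE);
      if (received_cost < best_tour_cost)
         best_tour_cost = received_cost;        /* adopt the better cost */
      MPI_Iprobe(MPI_ANY_SOURCE, NEW_COST_TAG, comm, &msg_avail, &status);
   }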

Modes and Buffered Sends


MPI provides four modes for sends.
 Standard: MPI_Send()
–Use system buffer. Block if there is no buffer space
 Synchronous: MPI_Ssend()
–Block until a matching receive is posted.
 Ready: MPI_Rsend()
–Error unless a matching receive is posted before sending
 Buffered: MPI_Bsend()
–Supply your own buffer space.
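(A sketch of supplying the buffer for MPI_Bsend(); the buffer size is an illustrative assumption.)

   int   buf_size = 100*(sizeof(int) + MPI_BSEND_OVERHEAD);
   char* buffer   = malloc(buf_size);

   MPI_Buffer_attach(buffer, buf_size);
   /* ... MPI_Bsend() calls now copy their data into this buffer ...        */
   MPI_Buffer_detach(&buffer, &buf_size);
   free(buffer);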
3. Printing the best tour
At the end of the MPI tree search:
 - Gather and print the best tour at the end.
   - Use MPI_Allreduce() to find the lowest cost among all the processes.
   - Process 0 prints the final result.
 - Clean up unreceived messages before shutting down MPI.
   - Some messages won't be received during the parallel search.
   - Use MPI_Iprobe to receive outstanding messages before MPI_Finalize().
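(A sketch of the final gather; the struct layout matches MPI_2INT, and the names are assumptions.)

   struct { int cost; int rank; } my_best, global_best;

   my_best.cost = best_tour_cost;
   my_best.rank = my_rank;
   MPI_Allreduce(&my_best, &global_best, 1, MPI_2INT, MPI_MINLOC, comm);
   /* the process whose rank is global_best.rank sends its tour to process 0,
      and process 0 prints it                                                 */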
Example


--------------------------------------------------------------------------------------------------------
9. Parallelizing tree search using MPI dynamic partitioning
Weakness of static partitioning:
 Load imbalance
-Many paths may be dynamically pruned
-The workload assigned to threads can be uneven.
 How to improve load balancing?
- Schedule computation statically initially.
- Shift workload dynamically when some threads have nothing to do
–Also called work stealing
To address this issue, we can try to dynamically redistribute the work as the computation proceeds. To do this, we can replace the test !Empty(my_stack) controlling execution of the while loop with a call to a more complex function, Terminated().
Dynamic work stealing code for thread my_rank
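(A sketch; the helper names follow the surrounding discussion, and the Terminated interface is an assumption.)

   while (!Terminated(&my_stack, my_rank)) {
      curr_tour = Pop(my_stack);
      if (City_count(curr_tour) == n) {
         if (Best_tour(curr_tour))
            Update_best_tour(curr_tour);
      } else {
         for (nbr = n-1; nbr >= 1; nbr--)
            if (Feasible(curr_tour, nbr)) {
               Add_city(curr_tour, nbr);
               Push_copy(my_stack, curr_tour);
               Remove_last_city(curr_tour);
            }
      }
      Free_tour(curr_tour);
   }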

Dynamic partitioning

 Uses majority of MPI code for static partitioning

 Special handling of distributed termination detection

 Emulate in a distributed memory setting


 Handle a process runs out of work (stack is empty)

–Request work from MyRank+1 first.

–Wait for receiving additional work

–Quit if no more work is available

 A process with work splits its stack and sends work to an idle process.

–Use special MPI message packaging

Pseudo code for Terminated() function

 - The function My_avail_tour_count can simply return the size of the process's stack.

 If a process has enough work so that it can usefully split its stack, it calls Fulfill_request (Line 2).
Fulfill_request uses MPI_Iprobe to check for a request for work from another process. If there is a
request, it receives it, splits its stack, and sends work to the requesting process. If there isn’t a request
for work, the process just returns.

Splitting the stack

A Split_stack function is called by Fulfill_request. The MPI version of Split_stack packs the contents
of the new stack into contiguous memory and sends the block of contiguous memory, which is unpacked by
the receiver into a new stack.

MPI provides a function, MPI_Pack, for packing data into a buffer of contiguous memory. It also provides a function, MPI_Unpack, for unpacking data from a buffer of contiguous memory.
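(A sketch; the buffer size, tags, and stack representation are assumptions, but the MPI_Pack/MPI_Unpack calls are standard.)

   char       buffer[MAX_BUF];
   int        position = 0;
   MPI_Status status;

   /* sender: pack the split-off stack and send it as one message           */
   MPI_Pack(&new_stack_size, 1, MPI_INT, buffer, MAX_BUF, &position, comm);
   MPI_Pack(new_stack_cities, new_stack_size, MPI_INT, buffer, MAX_BUF,
            &position, comm);
   MPI_Send(buffer, position, MPI_PACKED, dest, WORK_TAG, comm);

   /* receiver: unpack the buffer into its own stack                        */
   MPI_Recv(buffer, MAX_BUF, MPI_PACKED, src, WORK_TAG, comm, &status);
   position = 0;
   MPI_Unpack(buffer, MAX_BUF, &position, &new_stack_size, 1, MPI_INT, comm);
   MPI_Unpack(buffer, MAX_BUF, &position, new_stack_cities, new_stack_size,
              MPI_INT, comm);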

Distributed termination detection

The functions Out_of_work and No_work_left (Lines 11 and 15) implement the termination
detection algorithm.

 Each process maintains a variable (oow) as # of out-of-work processes.

 The entire computation quits if oow = n where n is # of processes.

 When a process runs out of work, notify everybody (oow++)

 When a process receives new workload, notify everybody (oow--)

Sending requests for work

Once we’ve decided on which process we plan to send a request to, we can just send a zero-length
message with a “request for work” tag.
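(A sketch; the tag name is an assumption.)

   MPI_Send(NULL, 0, MPI_INT, dest, WORK_REQUEST_TAG, comm);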

Checking for and receiving work

Once a request for work has been sent, it's critical that the sending process repeatedly check for a response from the destination. The Check_for_work function should therefore first probe for a message from the destination indicating that work is available and, if there isn't such a message, it should probe for a message from the destination saying there's no work available. If there is work available, the Receive_work function can receive the message containing the work and unpack the contents of the message buffer into the process's stack.
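(A sketch of Check_for_work; the tags and names are assumptions.)

   MPI_Iprobe(dest, WORK_AVAIL_TAG, comm, &work_avail, &status);
   if (work_avail) {
      Receive_work(dest, comm, my_stack);     /* MPI_Recv + MPI_Unpack      */
   } else {
      MPI_Iprobe(dest, NO_WORK_TAG, comm, &no_work, &status);
      if (no_work) {
         /* the destination had nothing to share: try another process or
            check whether the whole computation has terminated              */
      }
   }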


10. Parallelizing tree search using OpenMP


The issues involved in implementing the static and dynamic parallel tree-search programs using OpenMP are the same as the issues involved in implementing the programs using Pthreads. The example below is the Pthreads implementation of the tree search for TSP.

The Pthreads code for the dynamically parallelized TSP replaces Empty() in the above while loop with Terminated(); the pseudo code for Terminated() is given below.


There are almost no substantive differences between a static implementation that uses OpenMP and
one that uses Pthreads. However, a couple of points should be mentioned:

1. When a single thread executes some code in the Pthreads version, the test

if (my_rank == whatever)
can be replaced by the OpenMP directive
# pragma omp single

This will insure that the following structured block of code will be executed by one thread in the team,
and the other threads in the team will wait in an implicit barrier at the end of the block until the executing
thread is finished.
When whatever is 0 (as it is in each test in the Pthreads program), the test can also be replaced by the
OpenMP directive
# pragma omp master

This will insure that thread 0 executes the following structured block of code. However,
the master directive doesn’t put an implicit barrier at the end of the block, so it may be necessary to also add
a barrier directive after a structured block that has been modified by a master directive.
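A short sketch of the combination:

#  pragma omp master
   {
      /* work done only by thread 0                                          */
   }
#  pragma omp barrier   /* master has no implicit barrier, so add one if the
                           other threads must wait for thread 0              */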
2. The Pthreads mutex that protects the best tour can be replaced by a single critical directive placed either inside the Update_best_tour function or immediately before the call to Update_best_tour. This is the only potential source of a race condition after the distribution of the initial tours, so the simple critical directive won't cause a thread to block unnecessarily.
OpenMP even provides a nonblocking version of omp_set_lock. Recall that OpenMP provides a lock
object omp_lock_t and the following functions for acquiring and relinquishing the lock, respectively:
void omp_set_lock(omp_lock_t* lock_p /* in/out */);
void omp_unset_lock(omp_lock_t* lock_p /* in/out */);
It also provides the function

int omp_test_lock(omp_lock_t* lock_p /* in/out */);


which is analogous to pthread_mutex_trylock; it attempts to acquire the lock lock_p, and if it succeeds it returns true (or nonzero). If the lock is being used by some other thread, it returns immediately with return value false (or zero).
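(A sketch of non-blocking acquisition; the lock name is an assumption.)

   if (omp_test_lock(&best_tour_lock)) {      /* got the lock                 */
      if (Best_tour(curr_tour))
         Update_best_tour(curr_tour);
      omp_unset_lock(&best_tour_lock);
   } else {
      /* the lock is busy: keep searching and try the update again later     */
   }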
If we examine the pseudocode for the Pthreads Terminated function in Program 6.8, we see that in
order to adapt the Pthreads version to OpenMP, we need to emulate the functionality of the Pthreads function
calls
pthread_cond_signal(&term_cond_var);
pthread_cond_broadcast(&term_cond_var);
pthread_cond_wait(&term_cond_var, &term_mutex);
The simplest solution to emulating a condition wait in OpenMP is to use busy-waiting. Since there are
two conditions a waiting thread should test for, we can use two different variables in the busy-wait loop:
