Unit5 RMD PDF
n-Body solvers
1. The n-body solver problem
2. Serial solution to the n-body solver problem
3. Parallelizing the n-body solver problem
4. Parallelizing using OpenMP
5. Parallelizing using MPI
Tree search
6. The tree search problem
7. Recursive and non-recursive solutions to TSP
8. Parallelizing tree search using MPI static partitioning
9. Parallelizing tree search using MPI dynamic partitioning
10. Parallelizing tree search using OpenMP
Every particle attracts every other particle in the universe with a force which is
directly proportional to the product of their masses and inversely proportional to the square of
the distance between their centers
1 RMDEC
Department of CSE CS6801 Multicore Architecture and programming
If particle q has position sq(t) at time t and particle k has position sk(t), then the force on q exerted by particle k is given by

fqk(t) = -G mq mk (sq(t) - sk(t)) / |sq(t) - sk(t)|^3

where G = 6.673 x 10^-11 m^3/(kg s^2) is the gravitational constant and mq, mk are the masses of particles q and k.

If the acceleration of particle q is aq(t), then Fq(t) = mq aq(t) = mq s''q(t), where s''q(t) is the second derivative of the position sq(t) and Fq(t) is the total force on q, i.e., the sum of fqk(t) over all particles k != q.
Taking antiderivatives of this equation would give the velocities, but this approach is problematic because the RHS involves the unknown functions sq(t) and sk(t), not just the variable t. Hence, we use Euler's method.
In Euler's method, given the initial value of the function g(t0) and of its derivative g'(t0), we use the tangent line to approximate the function at the desired point t0 + Δt. The equation of the tangent line is

g(t0 + Δt) ≈ g(t0) + Δt g'(t0)
where g(t0) is the value of the function at time t0 and g'(t0) is the slope of the tangent line.
The total force calculation in Lines 4-5 can be refined further as follows.
Initially, we want a lot of tasks. Start by making our tasks the computations of the positions, the velocities, and the total forces at each timestep.
Mapping
The tasks are assigned to the cores using one of the standard mapping methods. For this example, the components are the calculations for the particles.
There is no loop-carried dependence in either loop. However, the code above forks and joins the threads once per timestep. This overhead can be reduced as follows.
We can have only one thread print the output by adding a single directive, as follows.
Solution 1:
Solution 2:
Solution 3:
Use basic MPI datatypes rather than a derived struct datatype for pos, mass, and velocity.
Each process only uses a single n-element array for the positions.
Each process uses a pointer loc_pos that refers to the start of its block of pos.
Line 1 gets the input for all the particles and distributes it to all the processes. It can be written with MPI as follows.
Note:
MPI_Sendrecv_replace() can be used to send and receive data at line numbers 8, 9, 15, and 16.
TREE SEARCH
6. The Tree Search Problem
A problem that is solved by searching a tree is called a tree search problem.
What is TSP?
Given a set of cities and the distance between every pair of cities, the problem is to find the shortest
possible route that visits every city exactly once and returns to the starting point.
It is an NP-complete problem: no known algorithm for solving it is, in all cases, significantly better
than exhaustive search.
Example:
--------------------------------------------------------------------------------------------------------
7. Recursive and Non-recursive Solutions to TSP
The function City_count examines the partial tour tour to see if there are n cities on the partial tour. If
there are, we know that we simply need to return to the hometown to complete the tour, and we can check to
see if the complete tour has a lower cost than the current “best tour” by calling Best_tour. If it does, we can
replace the current best tour with this tour by calling the function Update_best_tour.
Note that before the first call to Depth_first_search, the best tour variable should be initialized so
that its cost is greater than the cost of any possible least-cost tour. If the partial tour tour hasn't visited n
cities, we can continue branching down the tree by "expanding the current node," in other words, by trying
to visit other cities. The function Feasible checks whether a city or vertex has already been visited, and, if
not, whether it can possibly lead to a least-cost tour. If the city is feasible, we add it to the tour and
recursively call Depth_first_search. When we return from Depth_first_search, we remove the city from the
tour, since it shouldn't be included in the tour used in subsequent recursive calls.
The basic idea is that we can try to eliminate recursion by pushing necessary data on our own stack
before branching deeper into the tree, and when we need to go back up the tree—either because we’ve
reached a leaf or because we’ve found a node that can’t lead to a better solution—we can pop the stack.
In this version, a stack record consists of a single city, the city that will be added to the tour when its
record is popped. In the recursive version we continue to make recursive calls until we’ve visited every node
of the tree that corresponds to a feasible partial tour. At this point, the stack won’t have any more activation
records for calls to Depth_first_search, and we’ll return to the function that made the original call to
Depth_first_search. The main control structure in our iterative version is the while loop extending from
Line 3 to Line 20, and the loop termination condition is that our stack is empty. As long as the search needs
to continue, we need to make sure the stack is nonempty, and, in the first two lines, we push each of the
non-hometown cities. Note that this loop pushes the cities in decreasing order, from n-1 down to 1, because
the stack pops the most recently pushed city first. By reversing the order, we ensure that the cities are
visited in the same order as in the recursive function.
Also notice that in Line 5 we check whether the city we’ve popped is the constant NO_CITY. This
constant is used so that we can tell when we’ve visited all of the children of a tree node; if we didn’t use it, we
wouldn’t be able to tell when to back up in the tree. Thus, before pushing all of the children of a node (Lines
15–17), we push the NO_CITY marker.
Pseudo code for an alternative iterative solution to TSP (solution 2), used for the parallel implementation
An alternative is to use partial tours as stack records (see the pseudo code below). This gives code that
is closer to the recursive function. However, it also results in a slower version, since the function that
pushes onto the stack must create a copy of the tour before actually pushing it onto the stack. To
emphasize this point, we've called the function Push_copy. The extra memory required will probably not
be a problem. However, allocating storage for a new tour and copying the existing tour is time-consuming.
To some degree we can mitigate these costs by saving freed tours in our own data structure, and when a
freed tour is available we can use it in Push_copy instead of calling malloc.
--------------------------------------------------------------------------------------------------------
9. Parallelizing tree search using MPI dynamic partitioning
Weakness of static partitioning:
Load imbalance
- Many paths may be dynamically pruned.
- The workload assigned to threads can be uneven.
How to improve load balancing?
- Schedule computation statically at first.
- Shift workload dynamically when some threads have nothing to do.
- This is also called work stealing.
To address this issue, we can try to dynamically redistribute the work as the computation proceeds. To
do this, we can replace the test !Empty(my_stack) controlling execution of the while loop with more
complex code (i.e., Terminated()).
Dynamic work stealing code for thread my_rank
Dynamic partitioning
A process with work splits its stack and sends work to an idle process.
The function My_avail_tour_count can simply return the size of the process' stack.
If a process has enough work so that it can usefully split its stack, it calls Fulfill_request (Line 2).
Fulfill_request uses MPI_Iprobe to check for a request for work from another process. If there is a
request, it receives it, splits its stack, and sends work to the requesting process. If there isn't a request
for work, the process just returns.
A Split_stack function is called by Fulfill_request. The MPI version of Split_stack packs the contents
of the new stack into contiguous memory and sends the block of contiguous memory, which is unpacked by
the receiver into a new stack.
MPI provides a function, MPI_Pack, for packing data into a buffer of contiguous memory. It also
provides a function, MPI_Unpack, for unpacking data from a buffer of contiguous memory.
The functions Out_of_work and No_work_left (Lines 11 and 15) implement the termination
detection algorithm.
Once we’ve decided on which process we plan to send a request to, we can just send a zero-length
message with a “request for work” tag.
Once a request for work has been sent, it's critical that the sending process repeatedly check for a
response from the destination. The Check_for_work function should therefore first probe for a message
from the destination indicating work is available, and, if there isn't such a message, it should probe for a
message from the destination saying there's no work available. If there is work available, the
Receive_work function can receive the message with work and unpack the contents of the message buffer
into the process' stack.
The Pthreads code of the dynamically parallelized TSP replaces Empty() in the above while loop with
Terminated(); the pseudo code for Terminated() is given below.
--------------------------------------------------------------------------------------------------------
10. Parallelizing tree search using OpenMP
There are almost no substantive differences between a static implementation that uses OpenMP and
one that uses Pthreads. However, a couple of points should be mentioned:
1. When a single thread executes some code in the Pthreads version, the test
if (my_rank == whatever)
can be replaced by the OpenMP directive
# pragma omp single
This will ensure that the following structured block of code is executed by one thread in the team, and
the other threads in the team will wait in an implicit barrier at the end of the block until the executing
thread is finished.
When whatever is 0 (as it is in each test in the Pthreads program), the test can also be replaced by the
OpenMP directive
# pragma omp master
This will ensure that thread 0 executes the following structured block of code. However,
the master directive doesn't put an implicit barrier at the end of the block, so it may be necessary to also
add a barrier directive after a structured block that has been modified by a master directive.
2. The Pthreads mutex that protects the best tour can be replaced by a single critical directive placed
either inside the Update_best_tour function or immediately before the call to Update_best_tour. This is
the only potential source of a race condition after the distribution of the initial tours, so the
simple critical directive won't cause a thread to block unnecessarily.
OpenMP even provides a nonblocking version of omp_set_lock, namely omp_test_lock. Recall that
OpenMP provides a lock object omp_lock_t and the following functions for acquiring and relinquishing
the lock, respectively:
void omp_set_lock(omp_lock_t* lock_p /* in/out */);
void omp_unset_lock(omp_lock_t* lock_p /* in/out */);
It also provides the function
int omp_test_lock(omp_lock_t* lock_p /* in/out */);
which attempts to acquire the lock and returns immediately: nonzero on success, zero if the lock is
already held by another thread.