Intro to MPI
MPI
http://www.oit.duke.edu/scsc
[email protected]
Outline
Overview of Parallel Computing Architectures
Shared Memory versus Distributed Memory
Intro to MPI
Parallel programming with only 6 function calls
Better Performance
Async communication (Latency Hiding)
MPI
MPI is short for the Message Passing Interface, an industry-standard
library for sending and receiving arbitrary messages between the tasks
of a user's program
There are two major versions of the standard (MPI-1 and MPI-2)
It is fairly low-level
Hello World
#include <stdio.h>
#include <mpi.h>
int main( int argc, char** argv ) {
int SelfTID, NumTasks, t, data;
MPI_Status mpistat;
MPI_Init( &argc, &argv );
MPI_Comm_size( MPI_COMM_WORLD, &NumTasks );
MPI_Comm_rank( MPI_COMM_WORLD, &SelfTID );
printf("Hello World from %i of %i\n",SelfTID,NumTasks);
if( SelfTID == 0 ) {
for(t=1;t<NumTasks;t++) {
data = t;
MPI_Send(&data,1,MPI_INT,t,55,MPI_COMM_WORLD);
}
} else {
MPI_Recv(&data,1,MPI_INT,0,55,MPI_COMM_WORLD,&mpistat);
printf("TID%i: received data=%i\n",SelfTID,data);
}
MPI_Finalize();
return( 0 );
}
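A typical way to build and run this program (the exact commands depend on your MPI installation) is something like mpicc hello.c -o hello, followed by mpirun -np 4 ./hello to launch four tasks.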
[Execution diagram: each task (TID#0 through TID#3) independently calls MPI_Init(), MPI_Comm_size(), MPI_Comm_rank(), and printf(); TID#0 then takes the if-branch and loops over MPI_Send() to every other task, while TID#1, TID#2, and TID#3 take the else-branch and call MPI_Recv().]
MPI_Finalize()
Closes network connections and shuts down any other processes
or threads that MPI may have created to assist with
communication
Where am I? Who am I?
MPI_Comm_size returns the size of the Communicator (group of
machines/tasks) that the current task is involved with
MPI_COMM_WORLD means All Machines/All Tasks
MPI_Comm_rank returns the rank (task-ID) of the current task within that
Communicator
These two integers are the only information you get to separate out
what task should do what work within your program
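As a minimal sketch (the loop bound N and the work function process_item are made-up placeholders), the rank and size are typically used to split a loop across tasks:
/* divide N iterations among NumTasks tasks: each task takes every NumTasks-th one */
int i, N = 1000;                   /* N is an assumed problem size */
for( i=SelfTID; i<N; i+=NumTasks ) {
    process_item( i );             /* hypothetical per-iteration work */
}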
MPI_Send Arguments
The MPI_Send function requires a lot of arguments to identify what
data is being sent and where it is going
MPI_Send(dataptr,numitems,datatype,dest,tag,MPI_COMM_WORLD);
The datatype argument is one of the predefined MPI types, e.g.:
MPI_SHORT, MPI_INT, MPI_LONG, MPI_LONG_LONG,
MPI_DOUBLE, MPI_COMPLEX, MPI_DOUBLE_COMPLEX
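For example, a minimal sketch (the destination rank, tag value, and buffer contents are assumptions for illustration) that sends 10 doubles to another task:
double vals[10];          /* data to send; assume it has been filled in */
int dest = 1;             /* assumed destination task */
MPI_Send( vals, 10, MPI_DOUBLE, dest, 99, MPI_COMM_WORLD );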
MPI_Recv Arguments
MPI_Recv has many of the same arguments
MPI_Recv(dataptr,numitems,datatype,src,tag,MPI_COMM_WORLD,&mpistat);
The extra MPI_Status argument can be examined after the call to determine
the actual source, tag, and size of the message that was received
Message Matching
MPI requires that the (size,datatype,tag,other-task,communicator)
match between sender and receiver
Except for MPI_ANY_SOURCE and MPI_ANY_TAG
Except that receive buffer can be bigger than size sent
E.g. Task-2 sending (100,float,tag=44) to Task-5, while Task-5 waits to receive ...
You can then find out how much was actually received with:
MPI_Get_count( &status, MPI_INT, &count );
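A minimal sketch (the 200-element buffer size is arbitrary) of receiving from any sender and then querying the status:
int recv_buf[200], count;
MPI_Status status;
/* accept up to 200 ints from any task, with any tag */
MPI_Recv( recv_buf, 200, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
          MPI_COMM_WORLD, &status );
/* who sent it, with what tag, and how many items actually arrived? */
MPI_Get_count( &status, MPI_INT, &count );
printf("received %i ints from TID%i with tag %i\n",
       count, status.MPI_SOURCE, status.MPI_TAG);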
Error Handling
In C, all MPI functions return an integer error code
char buffer[MPI_MAX_ERROR_STRING];
int err,len;
err = MPI_Send( ... );
if( err != MPI_SUCCESS ) {
MPI_Error_string(err,buffer,&len);
printf("Error %i [%s]\n",err,buffer);
}
You CAN allocate more tasks than you have machines (or CPUs)
Each task is a separate Unix process, so if 2 or more tasks end up on
the same machine, they simply time-share that machine's resources
This can make debugging a little difficult since the timing of
sends and recvs will be significantly changed on a time-shared
system
Blocking Communication
MPI_Send and MPI_Recv are blocking
Your MPI task (program) waits until the message is sent or
received before it proceeds to the next line of code
Note that for Sends, this only guarantees that the message has been
put onto the network, not that the receiver is ready to process it
For large messages, it *MAY* wait for the receiver to be ready
E.g. mismatched tags can leave both tasks blocked:
if( SelfTID == 0 ) {
    MPI_Send( ..., /*tag*/ 55, ... );   /* TID-0 blocks: no matching Recv */
} else {
    MPI_Recv( ..., /*tag*/ 66, ... );   /* TID-1 blocks: no matching Send */
}
A blocking Send/Recv pair only synchronizes the two tasks involved;
ensure that any other needed sync is provided some other way
Be careful with your buffer management -- make sure you don't
re-use the buffer until you know the data has been sent/recv'd!
MPI_Isend is a non-blocking send: it returns a request-ID that allows
you to check later whether the send has completed
MPI_Isend Arguments
MPI_Isend( send_buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD, &mpireq );
MPI_Isend, contd
MPI_Request mpireq;
MPI_Status mpistat;
float send_buf[1024];     /* example buffer; the size, dest, and tag below are assumed values */
int i, n = 1024, dest = 1, tag = 77;
/* initial data for solver */
for(i=0;i<n;i++) {
send_buf[i] = 0.0f;
}
MPI_Isend( send_buf, n, MPI_FLOAT, dest, tag, MPI_COMM_WORLD, &mpireq );
. . .
/* do some real work */
. . .
/* make sure last send completed */
MPI_Wait( &mpireq, &mpistat );
/* post the new data for the next iteration */
MPI_Isend( send_buf, n, MPI_FLOAT, dest, tag, MPI_COMM_WORLD, &mpireq );
MPI_Test
MPI_Test checks whether a given request-ID has completed, without blocking
Very useful if many data items are to be received and you want to
work on each one as soon as it is complete
int flag;
err = MPI_Test( &mpireq, &flag, &mpistat );
if( flag ) {
/* message mpireq is now complete */
} else {
/* message is still pending */
/* do other work? */
}
MPI_Request_free
The Isend/Irecv request-IDs are a large, though limited resource ...
you can run out of them, so you need to either Wait/Test for them or
Free them
err = MPI_Request_free( &mpireq );
Example use-case:
You know (based on your program's logic) that when a certain
Recv is complete, all of your previous Isends must have also
completed
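A minimal sketch of that use-case (the acknowledgment message and its acktag are made up for illustration):
MPI_Request mpireq;
MPI_Status mpistat;
int ack;
MPI_Isend( send_buf, n, MPI_FLOAT, dest, tag, MPI_COMM_WORLD, &mpireq );
MPI_Request_free( &mpireq );   /* we will never Wait/Test on this send */
/* ... later, the receiver sends back an application-level ack ... */
MPI_Recv( &ack, 1, MPI_INT, dest, acktag, MPI_COMM_WORLD, &mpistat );
/* program logic: the data must have arrived, so the Isend has completed */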
By posting Isends/Irecvs early and waiting on them as late as possible,
we can maximize the network's ability to ship the data around
and spend less time waiting
Latency Hiding
One of the keys to avoiding Amdahl's Law is to hide as much of the
network latency (time to transfer messages) as possible
Compare the following
[Two charts plotting Network Usage and CPU Usage over time]
Synchronization
The basic MPI_Recv call is Blocking: the calling task waits until
the message is received
So there is already some synchronization of tasks happening
Another major synchronization construct is a Barrier
Every task must check-in to the barrier before any task can leave
Thus every task will be at the same point in the code when all of
them leave the barrier together
DoWork_X();
MPI_Barrier( MPI_COMM_WORLD );
DoWork_Y();
Every task will complete DoWork_X() before any one of them starts
DoWork_Y()
A barrier makes every task wait for the slowest one, but it is a quick
and easy way to make lots of race conditions and deadlocks go away!
MPI_Recv() implies a certain level of synchronization, maybe that is
all you really need?
Larger barriers == more waiting
If Task#1 and Task#2 need to synchronize, then try to make only
those two synchronize together (not all 10 tasks)
We'll talk about creating custom Communicators later
A synchronous send (MPI_Ssend) forces the calling task to WAIT for the
receiver to post a matching Recv
Only after the Recv is posted will the Sender be released to the next
line of code
The sender thus has some idea of where, in the program, the receiving
task currently is
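A minimal sketch, reusing the buffer variables from the earlier Isend example:
/* does not return until the receiver has posted a matching Recv */
MPI_Ssend( send_buf, n, MPI_FLOAT, dest, tag, MPI_COMM_WORLD );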
Collective Operations
MPI specifies a whole range of group-collective operations within a
Communicator, e.g. MPI_Reduce and MPI_Allreduce
Min, Max, Product, Sum, And/Or (bitwise or logical), Min-location, Max-location
E.g. all tasks provide a partial-sum, then MPI_Reduce computes
the global sum
And, by splitting out the Sends/Recvs, you may be able to find ways
to hide some of the latency
MPI_Reduce
int send_buf[4], recv_buf[4];
/* each task fills in send_buf */
for(i=0;i<4;i++) {
send_buf[i] = ...;
}
err = MPI_Reduce( send_buf, recv_buf, 4,
MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD );
if( SelfTID == 0 ) {
/* recv_buf has the max values */
/* e.g. recv_buf[0] has the max of
send_buf[0] on all tasks */
} else {
/* for all other tasks,
recv_buf has invalid data */
}
[Diagram: the 4-element send_buf on each of TID#0 through TID#3 is combined element-wise, and the result appears in recv_buf on TID#0 only]
MPI_Allreduce
MPI_Reduce sends the final data to the receive buffer of the root
task only
MPI_Allreduce sends the final data to ALL of the receive buffers (on
all of the tasks)
Can be useful for computing and distributing the global sum of a
calculation (MPI_Allreduce with MPI_SUM)
... computing and detecting convergence of a solver (MPI_MAX)
... computing and detecting eureka events (MPI_MINLOC)
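For example, a minimal sketch of a convergence test (compute_local_residual and the 1.0e-6 tolerance are made-up placeholders):
float local_err, global_err;
local_err = compute_local_residual();   /* hypothetical per-task error measure */
/* every task receives the maximum error across all tasks */
MPI_Allreduce( &local_err, &global_err, 1, MPI_FLOAT,
               MPI_MAX, MPI_COMM_WORLD );
if( global_err < 1.0e-6 ) {
    /* all tasks agree that the solver has converged */
}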
Beware that splitting a sum across tasks changes the order of the additions,
and round-off can then change the answer:
1 Task,  TID#0:  1.00 + .001 + .001 + ... + .001  =  1.00 !!
(with finite precision, each individual .001 can be lost when added to 1.00)
2 Tasks, TID#0:  1.00 + .001 + ... + .001  =  1.00
         TID#1:  .001 + .001 + ... + .001  =  .050
         Reduce:  1.00 + .050  =  1.05
Parallel Debugging
Parallel programming has all the usual bugs that sequential
programming has ... bad logic, improper arguments, array overruns
But it also has several new kinds of bugs that can creep in
Deadlock
Race Conditions
Deadlock
A Deadlock condition occurs when all Tasks are stopped at
synchronization points and none of them are able to make progress
Sometimes this can be something simple, e.g. the mismatched Send/Recv tags shown earlier
Deadlock, contd
Symptoms:
Program hangs (!)
Debugging deadlock is often straightforward unless there is also a
race condition involved
Simple approach: put a print statement in front of all
Recv/Barrier/Wait calls
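A hedged sketch of that approach (src and tag here are whatever your code uses):
printf( "TID%i: about to Recv from TID%i, tag %i\n", SelfTID, src, tag );
fflush( stdout );   /* flush so the message appears even if we hang in the Recv */
MPI_Recv( &data, 1, MPI_INT, src, tag, MPI_COMM_WORLD, &mpistat );
printf( "TID%i: Recv complete\n", SelfTID );
fflush( stdout );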
Race Conditions
A Race is a situation where multiple Tasks are making progress
toward some shared or common goal where timing is critical
Generally, this means you've ASSUMED something is synchronized,
but you haven't FORCED it to be synchronized
E.g. all Tasks are trying to find a hit in a database
When a hit is detected, the Task sends a message
Normally, hits are rare events and so sending the message is not a
big deal
But if two hits occur simultaneously, then we have a race condition
E.g. a race condition leads two Tasks to both think they are in charge
of data-item-X; nothing crashes, they just keep writing and
overwriting X (maybe even doing duplicate computations)
Often, race conditions don't cause crashes at the time they actually
occur; the crash occurs much later in the execution, for a
totally unrelated reason
E.g. a race condition leads one Task to think that the solver converged
while the other Task thinks we need another iteration; the crash occurs
later because only one of them tries to compute the global residual
Parallel Debuggers
There are several debuggers designed specifically for parallel
programming
They try to work like a sequential debugger ... when you print a
variable, it prints on all tasks; when you pause the program, it pauses
on all remote machines; etc.
Totalview is the biggest commercial version; Allinea is a recent
contender
Sun's Prism may still be out there
Parallel Tracing
It is often useful to see what is going on in your parallel program
When did a given message get sent? recvd?
How long did a given barrier take?
You can Trace the MPI calls with a number of different tools
Take time-stamps at each MPI call, then visualize the messages as
arrows between different tasks
Technically, all the work of MPI is done in PMPI_* functions; the MPI_*
entry points are just wrappers, so a tracing tool (or you) can intercept
the MPI_* calls, record time-stamps, and pass them through to PMPI_*
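As a minimal sketch of that mechanism (the timing printout is illustrative, not any particular tool's output):
/* user-supplied wrapper: intercepts MPI_Send, then calls the real PMPI_Send */
int MPI_Send( void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm ) {
    double t0 = MPI_Wtime();
    int err = PMPI_Send( buf, count, type, dest, tag, comm );
    printf( "MPI_Send to TID%i took %g sec\n", dest, MPI_Wtime()-t0 );
    return err;
}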
MPI Communicators
A Communicator is a group of MPI-tasks
The Communicators must match between Send/Recv
This is useful for parallel libraries -- your library can create its own
Communicator and isolate its messages from other user-code
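A minimal sketch of the library pattern (MPI_Comm_dup shown here; MPI_Comm_split is the usual way to build smaller sub-groups):
MPI_Comm lib_comm;
/* duplicate MPI_COMM_WORLD: same group of tasks, but a separate message space */
MPI_Comm_dup( MPI_COMM_WORLD, &lib_comm );
/* traffic on lib_comm can never match user Sends/Recvs posted on
   MPI_COMM_WORLD, even if the tags happen to collide */
MPI_Send( &data, 1, MPI_INT, dest, tag, lib_comm );
MPI_Comm_free( &lib_comm );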