PVM and MPI
Outline
Motivation for MPI
The process that produced MPI
What is different about MPI?
  the "usual" send/receive
  the MPI send/receive
  simple collective operations
New in MPI; not in MPI
Some simple complete examples, in Fortran and C
Communication modes, more on collective operations
Implementation status
MPICH - a free, portable implementation
MPI resources on the Net
MPI-2
What is SPMD?
- Single Program, Multiple Data.
- Same program runs everywhere.
- Restriction on the general message-passing model.
- Some vendors only support SPMD parallel programs.
- The general message-passing model can be emulated.
Messages
- Messages are packets of data moving between sub-programs.
- The message-passing system has to be told the following information:
    Sending processor
    Source location
    Data type
    Data length
    Receiving processor(s)
    Destination location
    Destination size
Point-to-Point Communication
- Simplest form of message passing.
- One process sends a message to another.
- Different types of point-to-point communication.
Synchronous Sends
Provide information about the completion of the message.
Asynchronous Sends
Only know when the message has left.
Blocking Operations
- Relate to when the operation has completed.
- Only return from the subroutine call when the operation has completed.

Non-Blocking Operations
- Return straight away and allow the sub-program to continue to perform other work.
- At some later time the sub-program can TEST or WAIT for the completion of the non-blocking operation.
Barriers
Synchronise processes.
Broadcast
A one-to-many communication.
Reduction Operations
Combine data from several processes to produce a single result.
Starting with a large serial application
- Look at the physics: is the problem inherently parallel?
- Examine loop structures: are any independent? Moderately so?
  Tools like Forge90 can be helpful.
- Look for the core linear algebra routines and replace them with parallelized versions.
  This has already been done; check the survey.
Master/Slave
  Master task starts all slave tasks and coordinates their work and I/O.
SPMD (hostless)
  Same program executes on different pieces of the problem.
Functional
  Several programs are written; each performs a different function in the application.
Granularity of tasks
  Key measure is the communication/computation ratio of the machine: number of bytes sent divided by number of flops performed.
  Larger granularity gives higher speedups but often lower parallelism.
Number of messages
  Desirable to keep the number of messages low, but depending on the algorithm it can be more efficient to break large messages up and pipeline the data when this increases parallelism.
Functional vs. data parallelism
  Which better suits the application? PVM allows either or both to be used.
Message latency
  Network latency can be high. Algorithms should be designed to account for this, e.g. send data before it is needed.
Different machine powers
  Virtual machines may be composed of computers whose performance varies over orders of magnitude. The algorithm must be able to handle this.
Fluctuating machine and network loads
  Multiple users and other competing PVM tasks cause the machine and network loads to change dynamically. Load balancing is important.
Static load balancing
  The problem is divided up and tasks are assigned to processors only once. The number or size of tasks may be varied to account for different computational powers of machines.
Dynamic load balancing by pool of tasks
  Typically used with the master/slave scheme. The master keeps a queue of tasks and sends them to idle slaves until the queue is empty. Faster machines end up getting more tasks naturally. (See the xep example in the PVM distribution.)
Dynamic load balancing by coordination
  Typically used in the SPMD scheme. All the tasks synchronize and redistribute their work either at fixed times or if some condition occurs, e.g. the load imbalance exceeds some limit.
Communication Tips
- Limit the size and number of outstanding messages. Can load imbalance cause too many outstanding messages?
- May have to send very large data in parts.

[Figure: a sending task streams a bag of tasks through the running pvmd to a receiving task.]
Complex communication patterns
- The network is deadlock-free, so the program shouldn't hang, but you still have to consider correct data distribution and bottlenecks.
- Consider using a library:
    ScaLAPACK: LAPACK for distributed-memory machines.
    BLACS: communication primitives oriented towards linear algebra; matrix distribution with no send-recv; used by ScaLAPACK.
Possible improvements
- Adjust the size of jobs to the speed of workers and to turnaround time (granularity).
- Start bigger jobs before smaller ones.
- Allow workers to communicate (more complex scheduling).
PVM Is
  A software package that allows a collection of serial, parallel and vector computers on a network to be managed as one large computing resource.
- Poor man's supercomputer
    High performance from a network of workstations
    Off-hours crunching
- Metacomputer linking multiple supercomputers
    Very high performance
    Computing elements adapted to subproblems
    Visualization
- Educational tool
    Simple to install
    Simple to learn
    Available
    Can be modified
[Figure: the logical virtual machine — hosts and multiprocessor hosts, each running a pvmd, with tasks and console(s) attached.]
PVM daemon (pvmd)
  One manages each host of the virtual machine
  Mainly a message router; also has kernel-like functions
  Has message entry points where tasks request service
  Inter-host point of contact
  Authentication
  Creates processes
  Collects output printed by processes
  Fault detection of processes, network
  More robust than application components
Interface library (libpvm)
  Linked with each application component
  1. Functions to compose, send, receive messages
  2. PVM syscalls that send requests to pvmd
  Machine-dependent communication part can be replaced
  Kept as simple as possible
PVM Console
  Interactive control of the virtual machine
  Kind of like a shell
  Normal PVM task; several can be attached, to any host
Programming in PVM
A simple message-passing environment
  Hosts, Tasks, Messages
  No enforced topology
  Virtual machine can be composed of any mix of machine types
Process Control
  Tasks can be spawned/killed anywhere in the virtual machine
Communication
  Any task can communicate with any other
  Data conversion is handled by PVM
Dynamic Process Groups
  Tasks can join/leave one or more groups at any time
Fault Tolerance
  Tasks can request notification of lost/gained resources
Underlying operating system (usually Unix) is visible
Supports C, C++ and Fortran
Can use other languages (must be able to link with C)
Hello World

Parent task:

#include <stdio.h>
#include "pvm3.h"

main()
{
    int cc, tid;                 /* tid of child */
    char buf[100];

    printf("I'm t%x\n", pvm_mytid());
    pvm_spawn("hello2", (char**)0, 0, "", 1, &tid);
    cc = pvm_recv(-1, -1);
    pvm_bufinfo(cc, (int*)0, (int*)0, &tid);
    pvm_upkstr(buf);
    printf("Message from t%x: %s\n", tid, buf);
    pvm_exit();
    exit(0);
}

Child task (hello2):

#include <string.h>
#include "pvm3.h"

main()
{
    int ptid;                    /* tid of parent */
    char buf[100];

    ptid = pvm_parent();
    strcpy(buf, "hello, world from ");
    gethostname(buf + strlen(buf), 64);
    pvm_initsend(PvmDataDefault);
    pvm_pkstr(buf);
    pvm_send(ptid, 1);
    pvm_exit();
    exit(0);
}
Software is highly portable
Allows a fully heterogeneous virtual machine (hosts, network)
Dynamic process and machine configuration
Support for fault tolerant programs
System can be customized
Large existing user base

Some comparable systems
  Portable message-passing: MPI, p4, Express, PICL
  One-of-a-kind: NX, CMMD
  Other types of communication: AM, Linda
  Also DOSs, languages, ...
Portability
PVM runs on: 80386/486 (BSDI, NetBSD, FreeBSD, Linux), Alliant FX/8, BBN Butterfly TC2000, Convex C2/CSPP, Cray T-3D/YMP/2/C90 (Unicos), DEC Alpha (OSF-1)/Mips/uVAX, DG Aviion, Encore Multimax, Fujitsu 780 (UXP/M), HP 68000/PA-Risc, IBM RS-6000/RT/Power-4, Intel Paragon/iPSC-860/iPSC-2, Kendall Square, Maspar, Mips, NEC SX-3, NeXT, Sequent, Silicon Graphics, Stardent Titan, Sun 3/4x (SunOS, Solaris), Thinking Machines CM-2/CM-5.
Very portable across Unix machines; usually just pick options.
Multiprocessors:
  Distributed-memory: T-3D, iPSC/860, Paragon, CM-5, SP-2 (MPI)
  Shared-memory: Convex/HP, SGI, Alpha, Sun, KSR, Symmetry
  Source code largely shared with generic (80%)
PVM is portable to non-Unix machines
  VMS port has been done
  OS/2 port has been done
  Windows NT port in progress
PVM differences are almost transparent to the programmer
  Some options may not be supported
  Program runs in a different environment

PVM source code, the user's guide, examples and related material are published on Netlib, a software repository with several sites around the world. To get started, send email to netlib:

  mail [email protected]
  Subject: send index from pvm3

A list of files and instructions will be automatically mailed back.
Using xnetlib: select directory pvm3
FTP: host netlib2.cs.utk.edu, login anonymous, directory pvm3
URL: http://www.netlib.org/pvm3/index.html
Bug reports, comments and questions can be mailed to: [email protected]
Usenet newsgroup for discussion and support: comp.parallel.pvm
Book: PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, 1994.
Installing PVM
- Package requires a few MB of disk + a few MB per architecture
- Don't need root privilege
- Libraries and executables can be shared between users
- PVM chooses a machine architecture name for you (more than 60 currently defined)
- Environment variable PVM_ROOT points to the installed path, e.g. /usr/local/pvm3.3.4 or $HOME/pvm3

If you use csh, add to your .cshrc:
  setenv PVM_ROOT /usr/local/pvm3

If you use sh or ksh, add to your .profile:
  PVM_ROOT=/usr/local/pvm3
  PVM_DPATH=$PVM_ROOT/lib/pvmd
  export PVM_ROOT PVM_DPATH

The software comes with configurations for most Unix machines.
Installation is easy. After the package is extracted:
  cd $PVM_ROOT
  make
The software automatically
  Determines architecture type
  Creates necessary subdirectories
  Builds pvmd, console, libraries, group server and library
  Installs executables and libraries in lib and bin

Important directories below $PVM_ROOT:
  include    Header files
  man        Manual pages
  lib        Scripts
  lib/ARCH   System executables
  bin/ARCH   System tasks
Starting PVM
Three ways to start PVM:
  pvm -ddebugmask -nhostname hostfile
    PVM console: starts pvmd, or connects to one already running
  xpvm
    Graphical console, same as above
  pvmd -ddebugmask -nhostname hostfile
    Manual start, used mainly for debugging or when it is necessary to enter passwords

Some common error messages:
  Can't start pvmd            Check PVM_ROOT is set, .rhosts correct
  Can't contact local daemon  PVM crashed previously; socket file left over
  Version mismatch            Mixed versions of PVM installed or stale executables
  No such host                Can't resolve IP address
  Duplicate host              Shared /tmp directory between hosts
  (shared-memory errors)      Stale segments left from a crash, or not enough are configured

XPVM
Graphical interface for PVM
  Performs console-like functions
  Real-time graphical monitor with
    View of virtual machine configuration, activity
    Space-time plot of task status
    Host utilization plot
    Call-level debugger, showing the last libpvm call by each task
  Writes SDDF format trace files; can be used for post-mortem analysis
  Built on top of PVM using
    Group library
    Libpvm trace system
    Output collection system
Programming Interface
About 80 functions:
  Message buffer manipulation: create, destroy buffers; pack, unpack data
  Message passing: send, receive, multicast
  Process control: create, destroy tasks; query task tables; find own tid, parent tid
  Dynamic process groups (with optional group library): join, leave group; map group members to tids; broadcast; global reduce
  Machine configuration: add, remove hosts; query host status; start, halt virtual machine
  Miscellaneous: get, set options; request notification; register special tasks; get host time-of-day clock offsets
Process Control
Start new tasks:
  pvm_spawn(file, argv, flags, where, ntask, tids)
Spawn flags control where and how the tasks start:
  PvmTaskDefault   Round-robin
  PvmTaskHost      Named host ("." is local)
  PvmTaskArch      Named architecture class
  PvmHostCompl     Complements host set
  PvmMppFront      Start on MPP service node
  PvmTaskDebug     Enable debugging (dbx)
  PvmTaskTrace     Enable tracing
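A minimal C sketch of the call above (not from the original slides): the executable name "worker", the flag choice, and the task count are assumptions for illustration only.

#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int tids[4];                      /* task ids of the children       */
    int n = pvm_spawn("worker",       /* executable name (assumed)      */
                      (char **)0,     /* no command-line arguments      */
                      PvmTaskDefault, /* let PVM pick the hosts         */
                      "",             /* "where" ignored with default   */
                      4, tids);
    if (n < 4)
        fprintf(stderr, "only %d of 4 tasks started\n", n);
    pvm_exit();
    return 0;
}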
To send:
  pvm_initsend(encoding)
  pvm_pk<type>(data, num_items, stride)
  pvm_send(dest, tag)
Sends the buffer to other tasks; returns when it is safe to clear the buffer.
Packing and sending a contiguous, single-typed data buffer is as fast as native calls on multiprocessor machines.

To receive:
  pvm_recv(source, tag)      blocking receive
  pvm_nrecv(source, tag)     non-blocking receive
  pvm_upk<type>(data, num_items, stride)   unpack the message into user variables
Can also pvm_probe(source, tag) for a message.
Another receive primitive: pvm_trecv(source, tag, timeout)
  Equivalent to pvm_nrecv if timeout is set to zero
  Equivalent to pvm_recv if timeout is set to null
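A hedged C sketch of the send/receive sequence just described; the tag value, the helper function names, and the packed layout are invented for illustration.

#include "pvm3.h"

void send_vector(int dest, double *x, int n)
{
    pvm_initsend(PvmDataDefault);       /* new send buffer, XDR encoding */
    pvm_pkint(&n, 1, 1);                /* pack the length first         */
    pvm_pkdouble(x, n, 1);              /* then the data, stride 1       */
    pvm_send(dest, 99);                 /* message tag 99 (arbitrary)    */
}

void recv_vector(double *x, int maxn)
{
    int n;
    pvm_recv(-1, 99);                   /* blocking receive, any source  */
    pvm_upkint(&n, 1, 1);               /* unpack in the same order      */
    if (n <= maxn)
        pvm_upkdouble(x, n, 1);
}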
Collective Communication
Collective functions operate across all members of a group.
  pvm_barrier(group, count)   synchronize all tasks in a group
  pvm_bcast(group, tag)
  pvm_reduce(*func, data, num_items, data_type, msgtag, group, rootinst)
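A sketch of the group collectives in C, under the assumption that every participating task runs the same code; the group name "workers", the tags, and the fixed group size NPROC are made up here.

#include "pvm3.h"

#define NPROC 4     /* assumed group size for this sketch */

double global_sum(double local)
{
    double value = local;
    int inst = pvm_joingroup("workers");   /* my instance number in the group  */

    pvm_barrier("workers", NPROC);         /* wait until all NPROC have joined */

    /* Combine "value" from all members; the result arrives at instance 0. */
    pvm_reduce(PvmSum, &value, 1, PVM_DOUBLE, 17, "workers", 0);

    /* Instance 0 re-broadcasts the total so every member gets it. */
    if (inst == 0) {
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(&value, 1, 1);
        pvm_bcast("workers", 18);          /* goes to all members except me */
    } else {
        pvm_recv(-1, 18);
        pvm_upkdouble(&value, 1, 1);
    }
    pvm_lvgroup("workers");
    return value;
}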
Example programs illustrate usage and serve as templates:
  hello, hello_other      Hello world
  master, slave           Master/slave program
  spmd                    SPMD program
  gexample                Group and collective operations
  timing, timing_slave    Tests communication performance
  hitc, hitc_slave        Dynamic load balance example
  xep, mtile              Interactive X-Window example
Examples come with Makefile.aimk files.
Both C and Fortran versions exist for some examples.
Compiling Applications
C programs must include the proper header files:
  pvm3.h       Always
  pvmtev.h     To manipulate trace masks
  pvmsdpro.h   For the resource manager interface
Specify the include directory: cc -I$PVM_ROOT/include ...
Fortran: INCLUDE '/usr/local/pvm3/include/fpvm3.h'
C programs must be linked with:
  libpvm3.a    Always
  libgpvm3.a   If using group library functions
  (possibly other libraries for socket or XDR functions)
Aimk
  Shares a single makefile between architectures
  Builds for different architectures in separate directories
  Determines the PVM architecture
  Runs make, passing it PVM_ARCH
  Does one of three things:
    If $PVM_ARCH/Makefile exists: runs make in the subdirectory, using that makefile
    Else if Makefile.aimk exists: creates the subdirectory, runs make using Makefile.aimk
    Otherwise: runs make in the current directory
Load Balancing
Important for application performance (not done automatically yet).
Static
  Assignment of work or placement of tasks is fixed in advance
  Must predict algorithm time
  May have different processor speeds
  Externally imposed static machine loads
Dynamic
  Adapting to changing conditions
  Make a simple scheduler, e.g. Bag of Tasks (sketched below)
    Simple, often works well
    Divide work into small jobs, given to processors as they become idle
    PVM comes with examples: C (xep), Fortran (hitc)
    Can include some fault tolerance
  Work migration: cancel/forward job
    Poll for a cancel message from the master
    Can interrupt with pvm_sendsig
    Kill worker (expensive)
  Task migration: not in PVM yet
Even with load balancing, expect performance to be variable.
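The bag-of-tasks scheme mentioned above, sketched in C with libpvm calls; the tags, the job encoding (just an integer), and the function name are assumptions, not part of the PVM distribution examples.

#include "pvm3.h"

#define TAG_WORK   1
#define TAG_RESULT 2

void master(int *tids, int nworkers, int njobs)
{
    int i, worker, sent = 0, done = 0;

    /* Prime every worker with one job. */
    for (i = 0; i < nworkers && sent < njobs; i++, sent++) {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&sent, 1, 1);       /* a job number stands in for real work data */
        pvm_send(tids[i], TAG_WORK);
    }
    /* Hand out remaining jobs as results come back: faster hosts ask more often. */
    while (done < njobs) {
        pvm_recv(-1, TAG_RESULT);     /* any worker */
        pvm_upkint(&worker, 1, 1);    /* workers pack their own tid first (assumed convention) */
        done++;
        if (sent < njobs) {
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&sent, 1, 1);
            pvm_send(worker, TAG_WORK);
            sent++;
        }
    }
}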
Six Examples
Circular messaging
Inner product
Matrix-vector multiply (row distribution)
Matrix-vector multiply (column distribution)
Integration to evaluate pi
Solve the 1-D heat equation

Circular Messaging
A vector circulates among the processors; each processor fills in a part of the vector.
[Figure: the vector passes around a ring of processors P1 ... P5.]
Solution: SPMD. Uses the following PVM features: spawn, group, barrier, send-recv, pack-unpack.
      program spmd1
      include '/src/icl/pvm/pvm3/include/fpvm3.h'
      PARAMETER( NPROC=4 )
      integer rank, left, right, i, j, ierr
      integer tids(NPROC-1)
      integer data(NPROC)
C   Group Creation
      call pvmfjoingroup( 'foo', rank )
      if( rank .eq. 0 ) then
         call pvmfspawn( 'spmd1', PVMDEFAULT, '*', NPROC-1, tids(1), ierr )
      endif
      call pvmfbarrier( 'foo', NPROC, ierr )
C   compute the neighbours' IDs
      call pvmfgettid( 'foo', MOD(rank+NPROC-1,NPROC), left )
      call pvmfgettid( 'foo', MOD(rank+1,NPROC), right )
      if( rank .eq. 0 ) then
C   I am the first process
         do 10 i=1,NPROC
   10       data(i) = 0
         call pvmfinitsend( PVMDEFAULT, ierr )
         call pvmfpack( INTEGER4, data, NPROC, 1, ierr )
         call pvmfsend( right, 1, ierr )
         call pvmfrecv( left, 1, ierr )
         call pvmfunpack( INTEGER4, data, NPROC, 1, ierr )
         write(*,*) ' Results received :'
         write(*,*) (data(j),j=1,NPROC)
      else
C   I am an intermediate process
         call pvmfrecv( left, 1, ierr )
         call pvmfunpack( INTEGER4, data, NPROC, 1, ierr )
         data(rank+1) = rank
         call pvmfinitsend( PVMDEFAULT, ierr )
         call pvmfpack( INTEGER4, data, NPROC, 1, ierr )
         call pvmfsend( right, 1, ierr )
      endif
      call pvmflvgroup( 'foo', ierr )
      call pvmfexit( ierr )
      stop
      end
Inner Product
$s = x^T y = \sum_{i=1}^{n} x_i y_i$

[Figure: each processor computes a partial ddot of its part of x and y.]
Solution: Master/Slave. Uses the following PVM features: spawn, group, barrier, send-recv, pack-unpack.
The master sends out data, collects the partial solutions and computes the sum.
The slaves receive data, compute the partial inner product and send the results to the master.
program inner include ' src icl pvm pvm3 include fpvm3.h' PARAMETER NPROC=7 PARAMETER N = 100 double precision ddot external ddot integer remain, nb integer rank, i, ierr, bufid integer tidsNPROC-1, slave, master double precision xN,yN double precision result,partial remain = MODN,NPROC-1 nb = N-remain NPROC-1 call pvmfjoingroup 'foo', rank if rank .eq. 0 then call pvmfspawn'inner',PVMDEFAULT,'*',NPROC-1,tids,ierr endif call pvmfbarrier 'foo', NPROC, ierr call pvmfgettid 'foo', 0, master C if rank .eq. 0 then C do 10 i=1,N xi = 1.0d0 yi = 1.0d0 Send the data count = 1 do 20 i=1,NPROC-1 call pvmfinitsend PVMDEFAULT, ierr call pvmfpack REAL8, xcount, nb, 1, ierr call pvmfpack REAL8, ycount, nb, 1, ierr call pvmfgettid 'foo', i, slave call pvmfsend slave, 1, ierr count = count + nb continue Set the values MASTER
Slave:
  receive a part of X
  receive a part of Y
  partial = ddot(part of X, part of Y)
  send partial to the master
result = 0.d0 C Add the remainding part partial = ddotremain,xN-remain+1,1,yN-remain+1,1 result = result + partial Get the result do 30 i =1,NPROC-1 call pvmfrecv-1,1,bufid call pvmfunpack REAL8, partial, 1, 1, ierr result = result + partial continue print *, ' The ddot = ', result SLAVE else C Receive the data call pvmfrecv -1, 1, bufid call pvmfunpack REAL8, x, nb, 1, ierr call pvmfunpack REAL8, y, nb, 1, ierr Compute the partial product partial = ddotnb,x1,1,y1,1 Send back the result call pvmfinitsend PVMDEFAULT, ierr call pvmfpack REAL8, partial, 1, 1, ierr call pvmfsend master, 1, ierr endif
30
      call pvmflvgroup( 'foo', ierr )
      call pvmfexit( ierr )
      stop
      end

Matrix-Vector Multiply (row distribution)

Problem: In parallel compute y = y + Ax, where y is of length m, x is of length n, and A is an m x n matrix.
Solution: Master/Slave. Uses the following PVM features: spawn, group, barrier, send-recv, pack-unpack.

[Figure: A, x and y distributed by rows across P1 ... P4.]
program matvec_row include ' src icl pvm pvm3 include fpvm3.h' C C C C C C C
y --- y + A * x A : MxN visible only on the slaves X : N Y : M PARAMETER NPROC = 4 PARAMETER M = 9, N = 6 PARAMETER NBY = INTM NPROC+1 double precision XN, YM integer tidsNPROC integer mytid, rank, i, ierr, from
Slave
Receive X Receive Y Compute my part of the product and Update my part of Y Send back my part of Y
C
call pvmfmytid mytid call pvmfjoingroup 'foo', rank if rank .eq. 0 then call pvmfspawn'matvecslv_row',PVMDEFAULT,'*',NPROC,tids,ierr endif call pvmfbarrier 'foo', NPROC+1, ierr Data initialize for my part of the data do 10 i = 1,N xi = 1.d0 continue do 15 i = 1,M yi = 1.d0 continue Send X and Y to the slaves call call call call pvmfinitsend PVMDEFAULT, ierr pvmfpackREAL8, X, N, 1, ierr pvmfpackREAL8, Y, M, 1, ierr pvmfbcast'foo', 1, ierr
10
15 C
I get the results do 20 i = 1, NPROC call pvmfrecv-1, 1, ierr call pvmfunpack INTEGER4, from, 1, ierr if from .EQ. NPROC then call pvmfunpack REAL8, Yfrom-1*NBY+1, $ M-NBY*NPROC-1, 1, ierr else call pvmfunpack REAL8, Yfrom-1*NBY+1, $ NBY, 1, ierr endif continue write*,* 'Results received' do 30 i=1,M write*,* 'Y',i,' = ',Yi continue call pvmflvgroup 'foo', ierr call pvmfexitierr stop end C C C C C C C
program matvecslv_row include ' src icl pvm pvm3 include fpvm3.h'
y A X Y
PARAMETER NPROC = 4 PARAMETER M = 9, N = 6 PARAMETER NBY = INTM NPROC+1 double precision ANBY,N double precision XN, YM integer rank, i, ierr, to external dgemv call pvmfjoingroup 'foo', rank call pvmfbarrier 'foo', NPROC+1, ierr Data initialize for my part of the data do 10 j = 1,N do 20 i = 1,NBY Ai,j = 1.d0 continue continue I receive X and Y call pvmfrecv -1, 1, ierr call pvmfunpackREAL8, X, N, 1, ierr call pvmfunpackREAL8, Y, M, 1, ierr C I compute my part if rank .NE. NPROC then call dgemv'N', NBY, N, 1.d0, A, NBY, X, 1, 1.d0, Yrank-1*NBY+1,1 else call dgemv'N', M-NBY*NPROC-1, N, 1.d0, A, NBY, $ X, 1, 1.d0, YNPROC-1*NBY+1, 1 endif $
20
30
20 10 C
I send back my part of Y call pvmfinitsendPVMDEFAULT, ierr call pvmfpack INTEGER4, rank, 1, 1, ierr if rank .NE. NPROC then call pvmfpack REAL8, Yrank-1*NBY+1,NBY, 1, ierr else call pvmfpack REAL8, Yrank-1*NBY+1,M-NBY*NPROC-1,1,ierr endif call pvmfgettid'foo',0,to call pvmfsendto, 1, ierr
Matrix-Vector Multiply (column distribution)

Problem: In parallel compute y = y + Ax, where y is of length m, x is of length n, and A is an m x n matrix.
Solution: Master/Slave. Uses the following PVM features: spawn, group, barrier, reduce, send-recv, pack-unpack.

[Figure: A and x distributed by columns across P1 ... P4; the partial products are summed with a reduce.]
program matvec_col include ' src icl pvm pvm3 include fpvm3.h' C C C C C C C
y --- y + A * x A : MxN visible only on the slaves X : N Y : M PARAMETER NPROC = 4 PARAMETER M = 9, N = 6 double precision XN, YM external PVMSUM integer tidsNPROC integer mytid, rank, i, ierr
Slave
Receive X Compute my Contribution to Y Global Sum on Y leaf
call pvmfmytid mytid call pvmfjoingroup 'foo', rank if rank .eq. 0 then call pvmfspawn'matvecslv_col',PVMDEFAULT,'*',NPROC,tids,ierr endif call pvmfbarrier 'foo', NPROC+1, ierr C do 10 i = 1,N xi = 1.d0 continue do 15 i = 1,M yi = 1.d0 continue Data initialize for my part of the data
10
15
Send X call pvmfinitsend PVMDEFAULT, ierr call pvmfpackREAL8, X, N, 1, ierr call pvmfbcast'foo', 1, ierr C C C C C C
program matvecslv_col include ' src icl pvm pvm3 include fpvm3.h'
I get the results call pvmfreducePVMSUM,Y,M,REAL8,1,'foo',0,ierr write*,* 'Results received' do 30 i=1,M write*,* 'Y',i,' = ',Yi continue call pvmflvgroup 'foo', ierr call pvmfexitierr stop end
y A X Y
PARAMETER NPROC = 4 PARAMETER M = 9, N = 6 PARAMETER NBX = INTN NPROC+1 external PVMSUM double precision AM,NBX double precision XN, YM integer rank, i, ierr external dgemv call pvmfjoingroup 'foo', rank call pvmfbarrier 'foo', NPROC+1, ierr C Data initialize for my part of the data do 10 j = 1,NBX do 20 i = 1,M Ai,j = 1.d0 continue continue I receive X call pvmfrecv -1, 1, ierr call pvmfunpackREAL8, X, N, 1, ierr
30
20 10 C
I compute my part if rank .NE. NPROC then call dgemv'N', M, NBX, 1.d0, A, M, $ Xrank-1*NBX+1, 1, 1.d0, Y,1 else call dgemv'N', M,N-NBX*NPROC-1, 1.d0, A, M, $ XNPROC-1*NBX+1, 1, 1.d0, Y, 1 endif
Integration to evaluate pi

Compute approximations to $\pi$ by numerical integration.
We know $\tan 45^\circ = 1$, i.e. $\tan(\pi/4) = 1$, so that $\pi = 4\tan^{-1}(1)$.
From the integral tables we can find

  $\tan^{-1}(x) = \int \frac{1}{1+x^2}\,dx$,   or   $\tan^{-1}(1) = \int_0^1 \frac{1}{1+x^2}\,dx$.

Use the mid-point rule with panels of uniform length $h = 1/n$, for various values of $n$, evaluating the function at the midpoint $x_i = ih - h/2$ of each subinterval $[x_{i-1}, x_i]$. The formula for the integral is

  $\pi \approx h \sum_{i=1}^{n} f\big(h(i - 1/2)\big)$,   where   $f(x) = \frac{4}{1+x^2}$.
There are a number of ways to divide up the problem, since each part of the sum is independent:
  divide the interval into more or less equal parts and give each process a part, or
  let each processor take every p-th part.
Each process computes part of the integral; the pieces are then summed.
The example given lets each processor take every p-th part.
Uses the following PVM features: spawn, group, barrier, bcast, reduce.
program spmd2 include ' src icl pvm pvm3 include fpvm3.h' PARAMETER NPROC=3 EXTERNAL PVMSUM integer mytid, rank, i, ierr integer tids0:NPROC-1 double precision PI25DT parameter PI25DT = 3.141592653589793238462643d0 double precision mypi,pi,h,sum,x,f,a C fa = 4.d0 1.d0 + a*a function to integrate
call pvmfmytid mytid call pvmfjoingroup 'foo', rank if rank .eq. 0 then call pvmfspawn'spmd2',PVMDEFAULT,'*',NPROC-1,tids1,ierr endif call pvmfbarrier 'foo', NPROC, ierr if rank .eq. 0 then write6,98 format'Enter the number of intervals: 0 quits' read5,99n formati10 if n .GT. 100000 then print *, 'Too large value of pi' print *, 'Using 100000 instead' n = 100000 endif call pvmfinitsend PVMDEFAULT, ierr call pvmfpack INTEGER4, n, 1, 1, ierr call pvmfbcast'foo',0,ierr endif ifrank .ne. 0 then call pvmfrecv-1,-1,ierr call pvmfunpackINTEGER4, n, 1, 1, ierr endif
10
98 99
20 C
97
sum = 0.0d0 do 20 i = rank+1, n, NPROC x = h * dblei - 0.5d0 sum = sum + fx continue mypi = h * sum collect all the partial sums print *,' reduce' call pvmfreducePVMSUM,mypi,1,REAL8,0,'foo',0,ierr node 0 prints the number if rank .eq. 0 then pi = mypi write6,97 pi, abspi - PI25DT format' pi is approximatively: ',F18.16, + ' Error is: ',F18.16 goto 10 endif
Solve the 1-D heat equation

  $\frac{\partial A}{\partial t} = \frac{\partial^2 A}{\partial x^2}$

with a discretization of the form

  $\frac{A_{i+1,j} - A_{i,j}}{\Delta t} = \frac{A_{i,j+1} - 2A_{i,j} + A_{i,j-1}}{\Delta x^2}$,

giving the explicit update

  $A_{i+1,j} = A_{i,j} + \frac{\Delta t}{\Delta x^2}\left(A_{i,j+1} - 2A_{i,j} + A_{i,j-1}\right)$.
Solution: Master/Slave. Slaves communicate their boundary values. Uses the following PVM features: spawn, group, barrier, send-recv, pack-unpack.

[Figure: the wire is split into segments; each slave exchanges boundary values with its neighbours.]

Slave:
  receive my part of the initial values
  for i = 1 to number of time iterations
    send my left bound to my left neighbor
    send my right bound to my right neighbor
    receive my right neighbor's left bound
    receive my left neighbor's right bound
    compute the new temperatures
  end for
  send back my result to the master
C  Use PVM to solve a simple heat diffusion differential equation,
C  using 1 master program and 5 slaves.
C  The master program sets up the data, communicates it to the slaves
C  and waits for the results to be sent from the slaves.
C  Produces xgraph-ready files of the results.
enroll in pvm call pvmfmytid mytid spawn the slave tasks call pvmfspawn'heatslv',PVMDEFAULT,'*',NPROC,task_ids,ierr create the initial data set do 10 i = 1,SIZE initi = SINPI * DBLEi-1 DBLESIZE-1 continue init1 = 0.d0 initSIZE = 0.d0
program heat include ' src icl pvm pvm3 include fpvm3.h' integer NPROC, TIMESTEP, PLOTINC, SIZE double precision PI PARAMETERPI = 3.14159265358979323846 PARAMETERNPROC = 3 PARAMETERTIMESTEP = 10 PARAMETERPLOTINC = 1 PARAMETERSIZE = 100 PARAMETERSLAVENAME = 'heatslv' integer num_data integer mytid, task_idsNPROC, i, j integer left, right, k, l integer step integer ierr external wh integer wh double precision initSIZE double precision resultTIMESTEP*SIZE NPROC double precision solutionTIMESTEP,SIZE character*20 filename4 double precision deltat4, deltax2 real etime real t02 real eltime4 step = TIMESTEP num_data = INTSIZE NPROC filename1 filename2 filename3 filename4 deltat1 = deltat2 = deltat3 = deltat4 = = 'graph1' = 'graph2' = 'graph3' = 'graph4' 5.0E-1 5.0E-3 5.0E-6 5.0E-9
10
C run the problem 4 times for different values of delta t do 20 l=1,4 deltax2 = deltatl 1.0 DBLESIZE**2.0 start timing for this run eltimel = etimet0 C C send the initial data to the slaves. include neighbor info for exchanging boundary data do 30 i =1,NPROC call pvmfinitsendPVMDEFAULT,ierr IF i .EQ. 1 THEN left = 0 ELSE left = task_idsi-1 ENDIF call pvmfpackINTEGER4, left, 1, 1, ierr IF i .EQ. NPROC THEN right = 0 ELSE right = task_idsi+1 ENDIF call pvmfpackINTEGER4, right, 1, 1, ierr call pvmfpackINTEGER4, INTstep, 1, 1, ierr call pvmfpackREAL8, deltax2, 1, 1, ierr call pvmfpackINTEGER4, INTnum_data, 1, 1, ierr call pvmfpackREAL8, initnum_data*i-1+1,num_data,1,ierr call pvmfsendtask_idsi, 4, ierr continue wait for the results do 40 i = 1,NPROC call pvmfrecvtask_idsi, 7, ierr call pvmfunpackREAL8, result, num_data*TIMESTEP, 1, ierr
30 C
update the solution do 50 j = 1, TIMESTEP do 60 k = 1, num_data solutionj,num_data*i-1+1+k-1 = $ resultwhj-1,k-1,num_data+1 60 continue 50 continue 40 continue C stop timing eltimel = etimet0 - eltimel
C  The slaves receive the initial data from the host,
C  exchange boundary information with neighbors,
C  and calculate the heat change in the wire.
C  This is done for a number of iterations, sent by the master.
program heatslv include ' src icl pvm pvm3 include fpvm3.h' PARAMETERMAX1 = 1000 PARAMETERMAX2 = 100000 integer mytid, left, right, i, j, master integer timestep external wh integer wh double precision initMAX1, AMAX2 double precision leftdata, rightdata double precision delta, leftside, rightside C enroll in pvm call pvmfmytidmytid call pvmfparentmaster C receive my data from the master program 10 continue call call call call call call call pvmfrecvmaster,4,ierr pvmfunpackINTEGER4, left, 1, 1, ierr pvmfunpackINTEGER4, right, 1, 1, ierr pvmfunpackINTEGER4, timestep, 1, 1, ierr pvmfunpackREAL8, delta, 1, 1, ierr pvmfunpackINTEGER4, num_data, 1, 1, ierr pvmfunpackREAL8, init, num_data, 1, ierr
produce the output write*,* 'Writing output to file ',filenamel open23, FILE = filenamel write23,* 'TitleText: Wire Heat over Delta Time: ',deltatl write23,* 'XUnitText: Distance' write23,* 'YUnitText: Heat' do 70 i=1,TIMESTEP,PLOTINC write23,* '"Time index: ',i-1 do 80 j = 1,SIZE write23,* j-1,REALsolutioni,j 81 FORMATI5,F10.4 80 continue write23,* '' 70 continue endfile 23 closeUNIT = 23, STATUS = 'KEEP' 20 continue write*,* 'Problem size: ', SIZE do 90 i = 1,4 write*,* 'Time for run ',i-1,': ',eltimei,' sec.' continue
90
C kill the slave processes do 100 i = 1,NPROC call pvmfkilltask_idsi,ierr 100 continue call pvmfexitierr END integer FUNCTION whx,y,z integer x,y,z wh = x * z + y RETURN END
C copy the initial data into my working array do 20 i = 1, num_data Ai = initi continue do 22 i = num_data+1, num_data*timestep Ai = 0 continue
20
22
C perform the calculation do 30 i = 1, timestep-1 C C trade boundary info with my neighbors send left, receive right IF left .NE. 0 THEN call pvmfinitsendPVMDEFAULT, ierr call pvmfpackREAL8, Awhi-1,0,num_data+1, 1, 1, ierr call pvmfsend left, 5, ierr ENDIF IF right .NE. 0 THEN call pvmfrecv right, 5, ierr call pvmfunpackREAL8, rightdata, 1, 1, ierr call pvmfinitsendPVMDEFAULT, ierr call pvmfpackREAL8, Awhi-1, num_data-1,num_data+1, 1, 1, ierr call pvmfsendright, 6, ierr ENDIF IF left .NE. 0 THEN call pvmfrecvleft, 6, ierr call pvmfunpackREAL8, leftdata, 1, 1, ierr ENDIF
C send the results back to the master program call pvmfinitsendPVMDEFAULT, ierr call pvmfpackREAL8, A, num_data*timestep, 1, ierr call pvmfsendmaster, 7, ierr goto 10 C just for good measure call pvmfexitierr END integer FUNCTION whx,y,z integer x,y,z wh = x*z + y RETURN END
C do the calculations for this iteration do 40 j = 1, IF j .EQ. leftside ELSE leftside ENDIF num_data 1 THEN = leftdata = Awhi-1,j-2,num_data+1
IF j .EQ. num_data THEN rightside = rightdata ELSE rightside = Awhi-1,j,num_data+1 ENDIF IF j. EQ. 1 .AND. left. EQ. 0 THEN Awhi,j-1,num_data+1 = 0.d0 ELSE IF j .EQ. num_data .AND. right .EQ. 0 THEN Awhi,j-1,num_data+1 = 0.d0 ELSE Awhi,j-1,num_data+1 = Awhi-1,j-1,num_data+1 + $ delta*rightside - 2*Awhi-1,j-1,num_data+1+leftside ENDIF continue continue
40 30
Motivation (cont.)
Few systems offer the full range of desired features:
  modularity (for libraries)
  access to peak performance
  portability
  heterogeneity
  subgroups
  topologies
  performance measurement tools
MPI Lacks...
  Mechanisms for process creation
  One-sided communication (put, get, active messages)
  Language bindings for Fortran 90 and C++
There are a fixed number of processes from the start to the finish of an application.
Many features were considered and not included because of
  time constraints,
  not enough experience,
  concern that additional features would delay the appearance of implementations.
What is MPI?
A message-passing library specification
  message-passing model
  not a compiler specification
  not a specific product
For parallel computers, clusters, and heterogeneous networks
Full-featured
Designed to permit (unleash?) the development of parallel software libraries
Designed to provide access to advanced parallel hardware for
  end users
  library writers
  tool developers
Header files
  C:       #include <mpi.h>
  Fortran: include 'mpif.h'

MPI function format
  C:       error = MPI_Xxxxx(parameter, ...);
           MPI_Xxxxx(parameter, ...);
  Fortran: CALL MPI_XXXXX(parameter, ..., IERROR)
Initializing MPI
  C:       int MPI_Init(int *argc, char ***argv)
  Fortran: MPI_INIT(IERROR)
           INTEGER IERROR

The MPI_COMM_WORLD communicator

[Figure: the MPI_COMM_WORLD communicator containing all processes, ranks 0-6.]
Rank: how do I identify myself within a communicator?
  MPI_Comm_rank(MPI_Comm comm, int *rank)
  MPI_COMM_RANK(COMM, RANK, IERROR)
    INTEGER COMM, RANK, IERROR

Size: how many processes are in a communicator?
  MPI_Comm_size(MPI_Comm comm, int *size)
  MPI_COMM_SIZE(COMM, SIZE, IERROR)
    INTEGER COMM, SIZE, IERROR
Exiting MPI
  C:       int MPI_Finalize()
  Fortran: MPI_FINALIZE(IERROR)
           INTEGER IERROR
  Must be called last by all processes.
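Putting the last few slides together, a minimal complete MPI program in C (a generic sketch, not taken from the course examples):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                   /* must be called first  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* my rank: 0..size-1    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* number of processes   */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                           /* must be called last   */
    return 0;
}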
Messages
- Derived types can be built up from basic types.
- C types are different from Fortran types.
PointtoPoint Communication
[Figure: a message sent from a source process to a dest process within a communicator.]
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'Process ', rank, ' of ', size, ' is alive'
      dest = size - 1
      src = 0
      if (rank .eq. src) then
         to = dest
         count = 10
         tag = 2001
         do 10 i=1, 10
 10         data(i) = i
         call MPI_SEND( data, count, MPI_DOUBLE_PRECISION, to,
     +                  tag, MPI_COMM_WORLD, ierr )
      else if (rank .eq. dest) then
         tag = MPI_ANY_TAG
         count = 10
         from = MPI_ANY_SOURCE
         call MPI_RECV( data, count, MPI_DOUBLE_PRECISION, from,
     +                  tag, MPI_COMM_WORLD, status, ierr )
Communication between two processes.
The source process sends a message to the destination process.
Communication takes place within a communicator.
The destination process is identified by its rank in the communicator.
Fortran example
      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, rc
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )

 10   if ( myid .eq. 0 ) then
         write(6,98)
 98      format('Enter the number of intervals: (0 quits)')
         read(5,99) n
 99      format(i10)
      endif
      call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
C example
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
C example cont.
    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;

        h   = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;

        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
Communication modes

  Sender mode       Notes
  Synchronous send  Only completes when the receive has started.
  Buffered send     Always completes (unless an error occurs), irrespective of receiver.
  Standard send     Either synchronous or buffered.
  Ready send        Always completes (unless an error occurs), irrespective of whether the receive has completed.
  Receive           Completes when a message has arrived.
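A short C sketch of buffered mode, which is the one mode where the user must supply buffer space explicitly; the helper function name and message size are invented for illustration.

#include <stdlib.h>
#include <mpi.h>

void buffered_send(double *data, int n, int dest, MPI_Comm comm)
{
    int   size = n * sizeof(double) + MPI_BSEND_OVERHEAD;
    void *buf  = malloc(size);

    MPI_Buffer_attach(buf, size);     /* hand MPI the buffer space            */
    MPI_Bsend(data, n, MPI_DOUBLE, dest, 0, comm);
    MPI_Buffer_detach(&buf, &size);   /* blocks until the buffered data left  */
    free(buf);
}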
Sending a message
  C:
    int MPI_Ssend(void *buf, int count, MPI_Datatype datatype,
                  int dest, int tag, MPI_Comm comm)
  Fortran:
    MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, DEST, TAG
    INTEGER COMM, IERROR
Receiving a message
  C:
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
                 int source, int tag, MPI_Comm comm, MPI_Status *status)
  Fortran:
    MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
    <type> BUF(*)
    INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM,
            STATUS(MPI_STATUS_SIZE), IERROR

Synchronous blocking message passing
- Processes synchronize.
- The sender process specifies the synchronous mode.
- Blocking: both processes wait until the transaction has completed.

For a communication to succeed:
- Sender must specify a valid destination rank.
- Receiver must specify a valid source rank.
- The communicator must be the same.
- Tags must match.
- Message types must match.
- Receiver's buffer must be large enough.
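A C sketch of a matched synchronous-send/receive pair that satisfies the rules above; the ranks, tag, and message length are arbitrary choices.

#include <mpi.h>

void exchange(int rank)
{
    double buf[10];
    MPI_Status status;
    int i;

    if (rank == 0) {
        for (i = 0; i < 10; i++)
            buf[i] = (double)i;
        /* Completes only after rank 1 has started the matching receive. */
        MPI_Ssend(buf, 10, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Same communicator, matching tag, same type, big enough buffer. */
        MPI_Recv(buf, 10, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, &status);
    }
}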
[Figure (from Chapter 4 of the MPI standard): schematic of the collective data movement operations — broadcast, scatter, gather, allgather and alltoall — showing how data items A0, B0, ... are distributed among the processes.]
Wildcarding
- The receiver can wildcard.
- To receive from any source: MPI_ANY_SOURCE
- To receive with any tag: MPI_ANY_TAG
- The actual source and tag are returned in the receiver's status parameter.

Messages do not overtake each other. This is true even for non-synchronous sends.
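A small C sketch of a wildcarded receive that then reads the actual source and tag back out of the status parameter (buffer size and type chosen arbitrarily):

#include <stdio.h>
#include <mpi.h>

void receive_from_anyone(void)
{
    double buf[100];
    MPI_Status status;

    MPI_Recv(buf, 100, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    printf("message from rank %d, tag %d\n",
           status.MPI_SOURCE, status.MPI_TAG);
}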
Non-Blocking Communications
- Separate communication into three phases:
  1. Initiate the non-blocking communication.
  2. Do some work (perhaps involving other communications).
  3. Wait for the non-blocking communication to complete.

Non-Blocking Send

[Figure: the send is initiated ("out") and completes later, while the process continues working within the communicator.]
Non-Blocking Receive

[Figure: the receive is initiated ("in") and completed later, within the communicator.]
Non-blocking Receive
  C:
    MPI_Irecv(buf, count, datatype, src, tag, comm, handle)
    MPI_Wait(handle, status)
  Fortran:
    MPI_IRECV(buf, count, datatype, src, tag, comm, handle, ierror)
    MPI_WAIT(handle, status, ierror)

A non-blocking send can be matched by a blocking receive, and vice-versa.
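A C sketch of the three-phase non-blocking pattern using MPI_Irecv and MPI_Wait; the source rank, tag, and buffer size are placeholders.

#include <mpi.h>

void overlapped_receive(int src, MPI_Comm comm)
{
    double buf[100];
    MPI_Request request;
    MPI_Status  status;

    /* Phase 1: initiate the receive. */
    MPI_Irecv(buf, 100, MPI_DOUBLE, src, 7, comm, &request);

    /* Phase 2: do useful work here that does not touch "buf". */

    /* Phase 3: wait for completion before using the data. */
    MPI_Wait(&request, &status);
}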
Communication Modes

  Non-blocking operation   MPI call
  Standard send            MPI_ISEND
  Synchronous send         MPI_ISSEND
  Buffered send            MPI_IBSEND
  Ready send               MPI_IRSEND
  Receive                  MPI_IRECV

Completion
- Waiting versus Testing.
  C:
    MPI_Wait(handle, status)
    MPI_Test(handle, flag, status)
  Fortran:
    MPI_WAIT(handle, status, ierror)
    MPI_TEST(handle, flag, status, ierror)
Barrier Synchronization
  C:       int MPI_Barrier(MPI_Comm comm)
  Fortran: MPI_BARRIER(COMM, IERROR)
           INTEGER COMM, IERROR
Broadcast
  C:       int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
                         int root, MPI_Comm comm)
  Fortran: MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
           <type> BUFFER(*)
           INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR
Scatter

[Figure: the root's array A B C D E is scattered, one element to each process.]

Gather

[Figure: one element from each process is gathered into the root's array A B C D E.]

Global Reduction Operations
- Compute a result involving data distributed over a group of processes.
- Examples:
    global sum or product
    global maximum or minimum
    global user-defined operation
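A C sketch of a scatter followed by a gather over MPI_COMM_WORLD; the root rank and the per-process block size are arbitrary choices for illustration.

#include <stdlib.h>
#include <mpi.h>

void scatter_gather(int rank, int size)
{
    const int nlocal = 4;            /* elements per process (arbitrary) */
    double *full = NULL, part[4];
    int i;

    if (rank == 0) {                 /* only the root needs the full array */
        full = malloc(size * nlocal * sizeof(double));
        for (i = 0; i < size * nlocal; i++)
            full[i] = (double)i;
    }

    /* Root deals out consecutive blocks of nlocal doubles, one per process. */
    MPI_Scatter(full, nlocal, MPI_DOUBLE,
                part, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... each process works on its own block "part" here ... */

    /* Root collects the (possibly modified) blocks back in rank order. */
    MPI_Gather(part, nlocal, MPI_DOUBLE,
               full, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        free(full);
}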
  MPI Name     Function
  MPI_MAX      Maximum
  MPI_MIN      Minimum
  MPI_SUM      Sum
  MPI_PROD     Product
  MPI_LAND     Logical AND
  MPI_BAND     Bitwise AND
  MPI_LOR      Logical OR
  MPI_BOR      Bitwise OR
  MPI_LXOR     Logical exclusive OR
  MPI_BXOR     Bitwise exclusive OR
  MPI_MAXLOC   Maximum and location
  MPI_MINLOC   Minimum and location
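A brief C sketch of MPI_MAXLOC from the table above, which returns both the maximum value and the rank that holds it; the value/rank pair uses the standard MPI_DOUBLE_INT type, while the function name itself is invented here.

#include <stdio.h>
#include <mpi.h>

void where_is_max(double local, int rank)
{
    struct { double value; int rank; } in, out;

    in.value = local;
    in.rank  = rank;

    /* Combine value/rank pairs; the pair with the largest value wins. */
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("maximum %g is on rank %d\n", out.value, out.rank);
}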
MPI_REDUCE

[Figure: MPI_REDUCE — the first elements A, E, I, M, Q from ranks 0-4 are combined with the operation o, and the result AoEoIoMoQ is placed only on the root process.]
MPI_SCAN

[Figure: MPI_SCAN — rank i receives the prefix result of combining the first elements of ranks 0..i: A, AoE, AoEoI, AoEoIoM, AoEoIoMoQ.]

Summary
We have covered
  Background and scope of MPI
  Some characteristic features of MPI (communicators, datatypes)
  Point-to-point communication
    blocking and non-blocking
    multiple modes
  Collective communication
    data movement
    collective computation
Summary
  The parallel computing community has cooperated to develop a full-featured standard message-passing library interface.
  Implementations abound
  Applications beginning to be developed or ported
  MPI-2 process beginning
  Lots of MPI material available

Vendor Implementations
  IBM Research (MPI-F)
  IBM Kingston
  Intel SSD
  Cray Research
  Meiko, Inc.
  SGI
  Kendall Square Research
  NEC
  Fujitsu (AP1000)
  Convex
  Hughes Aircraft

Portable Implementations
  Argonne / Mississippi State (MPICH)
  Ohio Supercomputer Center (LAM)
  University of Edinburgh
  Technical University of Munich
  University of Illinois
  Other interested groups: Sun, Hewlett-Packard, Myricom (makers of high-performance network switches), PALLAS (a German software company), Sandia National Laboratory (Intel Paragon running SUNMOS)
Variety of implementations
  Vendor proprietary
  Free, portable
  World wide
  Real-time, embedded systems
  All MPPs and networks
Implementation strategies
  Specialized
  Abstract message-passing devices
  Active-message devices

MPICH
  Complete MPI implementation
  On MPPs: IBM SP1 and SP2, Intel iPSC/860 and Paragon, TMC CM-5, SGI, Meiko CS-2, NCube, KSR, Sequent Symmetry
  On workstation networks: Sun, DEC, HP, SGI, Linux, FreeBSD, NetBSD
  Includes multiple profiling libraries for timing, event logging, and animation of programs
  Includes the upshot trace visualization program and a graphics library
  Efficiently implemented for shared-memory, high-speed switches, and network environments
  Man pages
  Source included
  Available at ftp.mcs.anl.gov in pub/mpi/mpich.tar.Z
The Standard itself:
  As a technical report: University of Tennessee report
  As PostScript for ftp: at info.mcs.anl.gov in pub/mpi/mpi-report.ps
  As hypertext on the World Wide Web: http://www.mcs.anl.gov/mpi
  As a journal article: in the Fall issue of the Journal of Supercomputing Applications
MPI Forum discussions
  The MPI Forum email discussions and both current and earlier versions of the Standard are available from netlib.
Books:
  Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 1994
  MPI Annotated Reference Manual, by Otto, Dongarra, et al.
Newsgroup:
  comp.parallel.mpi
Mailing lists:
  [email protected]: the MPI Forum discussion list
  [email protected]: the implementors' discussion list
Implementations available by ftp:
  MPICH is available by anonymous ftp from info.mcs.anl.gov in the directory pub/mpi/mpich, file mpich.*.tar.Z
  LAM is available by anonymous ftp from tbag.osc.edu in the directory pub/lam
  The CHIMP version of MPI is available by anonymous ftp from ftp.epcc.ed.ac.uk in the directory pub/chimp/release
Test code repository (new):
  ftp://info.mcs.anl.gov/pub/mpi-test
MPI-2
[Figure: the MPI-2 effort merges new features — context, active messages, system/process control, process creation — into a future MPI.]
MPI available on: IBM SP, Intel Paragon, Cray T3D, Meiko CS2, PVM/p4.
The MPI Forum, with old and new participants, has begun a follow-on series of meetings.
Goals
  clarify the existing draft
  provide features users have requested
  make extensions, not changes
Major topics being considered
  dynamic process management
  client/server
  real-time extensions
  "one-sided" communication (put/get, active messages)
  portable access to MPI system state (for debuggers)
  language bindings for C++ and Fortran-90
Schedule
  Dynamic processes, client/server by SC '95
  MPI-2 complete by SC '96
Conclusions
  MPI being adopted worldwide
  Standard documentation is an adequate guide to implementation
  Implementations abound
  Implementation community working together