Distributed Memory Programming with MPI (Peter Pacheco)


An Introduction to Parallel Programming

Peter Pacheco

Chapter 3
Distributed Memory
Programming with
MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 1


Roadmap

 Writing your first MPI program.


 Using the common MPI functions.
 The Trapezoidal Rule in MPI.
 Collective communication.
 MPI derived datatypes.
 Performance evaluation of MPI programs.
 Parallel sorting.
 Safety in MPI programs.

Copyright © 2010, Elsevier Inc. All rights Reserved 2


A distributed memory system

Copyright © 2010, Elsevier Inc. All rights Reserved 3


A shared memory system

Copyright © 2010, Elsevier Inc. All rights Reserved 4


Hello World!

(a classic)

5
Identifying MPI processes
 Common practice to identify processes by
nonnegative integer ranks.

 p processes are numbered 0, 1, 2, .. p-1

6
Our first MPI program
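The program listing on this slide did not survive extraction. A minimal sketch of a greeting program along these lines (every process other than 0 sends a greeting to process 0, which prints them in rank order):

#include <stdio.h>
#include <string.h>   /* strlen */
#include <mpi.h>      /* MPI declarations */

#define MAX_STRING 100

int main(void) {
   char greeting[MAX_STRING];
   int  comm_sz;   /* number of processes */
   int  my_rank;   /* my process rank     */

   MPI_Init(NULL, NULL);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

   if (my_rank != 0) {
      /* every process other than 0 sends a greeting to process 0 */
      sprintf(greeting, "Greetings from process %d of %d !", my_rank, comm_sz);
      MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
   } else {
      /* process 0 prints its own greeting, then the others' in rank order */
      printf("Greetings from process %d of %d !\n", my_rank, comm_sz);
      for (int q = 1; q < comm_sz; q++) {
         MPI_Recv(greeting, MAX_STRING, MPI_CHAR, q, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         printf("%s\n", greeting);
      }
   }

   MPI_Finalize();
   return 0;
}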

7
Compilation

mpicc -g -Wall -o mpi_hello mpi_hello.c

 mpicc: wrapper script that invokes the C compiler
 -g: produce debugging information
 -Wall: turn on all warnings
 -o mpi_hello: create an executable with this name (as opposed to the default a.out)
 mpi_hello.c: the source file
8
Execution

mpiexec -n <number of processes> <executable>

mpiexec -n 1 ./mpi_hello

run with 1 process

mpiexec -n 4 ./mpi_hello

run with 4 processes

9
Execution
mpiexec -n 1 ./mpi_hello

Greetings from process 0 of 1 !

mpiexec -n 4 ./mpi_hello

Greetings from process 0 of 4 !
Greetings from process 1 of 4 !
Greetings from process 2 of 4 !
Greetings from process 3 of 4 !

10
MPI Programs
 Written in C.
 Has main.
 Uses stdio.h, string.h, etc.
 Need to add mpi.h header file.
 Identifiers defined by MPI start with
“MPI_”.
 First letter following underscore is
uppercase.
 For function names and MPI-defined types.
 Helps to avoid confusion.
11
Copyright © 2010, Elsevier Inc. All rights Reserved 12
MPI Components
 MPI_Init
 Tells MPI to do all the necessary setup.

 MPI_Finalize
 Tells MPI we’re done, so clean up anything
allocated for this program.
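For reference, the prototypes (the parameter names are descriptive, not mandated by the standard):

int MPI_Init(int* argc_p, char*** argv_p);   /* NULL, NULL may be passed if argc/argv aren't needed */
int MPI_Finalize(void);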

13
Basic Outline
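The outline on this slide did not extract; the usual skeleton looks something like this:

#include <mpi.h>
/* other includes */

int main(int argc, char* argv[]) {
   /* no MPI calls before this point */
   MPI_Init(&argc, &argv);

   /* ... computation and communication ... */

   MPI_Finalize();
   /* no MPI calls after this point */
   return 0;
}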

14
Communicators

 A collection of processes that can send messages to each other.
 MPI_Init defines a communicator that consists of all the processes created when the program is started.
 Called MPI_COMM_WORLD.

15
Communicators

 MPI_Comm_size returns the number of processes in the communicator.
 MPI_Comm_rank returns my rank, i.e., the rank of the process making this call.
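Their prototypes:

int MPI_Comm_size(MPI_Comm comm, int* comm_sz_p /* out */);
int MPI_Comm_rank(MPI_Comm comm, int* my_rank_p /* out */);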

16
SPMD
 Single-Program Multiple-Data
 We compile one program.
 Process 0 does something different.
 Receives messages and prints them while the
other processes do the work.

 The if-else construct makes our program SPMD.

17
Communication
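The figure on this slide did not extract; the send operation used in this part of the chapter is MPI_Send, whose prototype is:

int MPI_Send(
      void*        msg_buf_p    /* in: pointer to the data being sent        */,
      int          msg_size     /* in: number of elements in the buffer      */,
      MPI_Datatype msg_type     /* in: type of the elements, e.g. MPI_INT    */,
      int          dest         /* in: rank of the receiving process         */,
      int          tag          /* in: nonnegative int distinguishing msgs   */,
      MPI_Comm     communicator /* in: communicator, e.g. MPI_COMM_WORLD     */);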

Copyright © 2010, Elsevier Inc. All rights Reserved 18


Data types
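The table of predefined MPI datatypes on this slide did not extract. Each predefined datatype names the C type of the buffer elements (MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, and so on). For example:

double x = 3.14159;
/* MPI_DOUBLE tells MPI that the buffer holds one C double */
MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);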

Copyright © 2010, Elsevier Inc. All rights Reserved 19


Communication
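Likewise for the receive side; the MPI_Recv prototype:

int MPI_Recv(
      void*        msg_buf_p    /* out: where to store the received data        */,
      int          buf_size     /* in:  capacity of the buffer, in elements     */,
      MPI_Datatype buf_type     /* in:  type of the elements                    */,
      int          source       /* in:  rank of the sender, or MPI_ANY_SOURCE   */,
      int          tag          /* in:  tag to match, or MPI_ANY_TAG            */,
      MPI_Comm     communicator /* in:  must match the sender's communicator    */,
      MPI_Status*  status_p     /* out: message info, or MPI_STATUS_IGNORE      */);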

Copyright © 2010, Elsevier Inc. All rights Reserved 20


Message matching

 Process q calls MPI_Send with dest = r; process r calls MPI_Recv with src = q.
 The send and the receive match when the communicators and the tags also agree.

Copyright © 2010, Elsevier Inc. All rights Reserved 21


Receiving messages
 A receiver can get a message without
knowing:
 the amount of data in the message,
 the sender of the message,
 or the tag of the message.

Copyright © 2010, Elsevier Inc. All rights Reserved 22


status_p argument

 The last argument of MPI_Recv has type MPI_Status*.
 MPI_Status is a struct with (at least) the members MPI_SOURCE, MPI_TAG, and MPI_ERROR.
 If we declare MPI_Status status; and pass &status to MPI_Recv, the sender and tag of the received message are available as status.MPI_SOURCE and status.MPI_TAG.

Copyright © 2010, Elsevier Inc. All rights Reserved 23


How much data am I receiving?
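The code on this slide did not extract; the usual tool is MPI_Get_count, applied to the status returned by MPI_Recv. A sketch (recv_buf and BUF_SZ are illustrative names):

double     recv_buf[100];
const int  BUF_SZ = 100;
int        count;
MPI_Status status;

/* accept a message from any sender, with any tag */
MPI_Recv(recv_buf, BUF_SZ, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

/* how many MPI_DOUBLEs actually arrived, and from whom? */
MPI_Get_count(&status, MPI_DOUBLE, &count);
printf("Received %d doubles from process %d with tag %d\n",
       count, status.MPI_SOURCE, status.MPI_TAG);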

Copyright © 2010, Elsevier Inc. All rights Reserved 24


Issues with send and receive
 Exact behavior is determined by the MPI
implementation.
 MPI_Send may behave differently with regard to
buffer size, cutoffs and blocking.
 MPI_Recv always blocks until a matching
message is received.
 Know your implementation;
don’t make assumptions!

Copyright © 2010, Elsevier Inc. All rights Reserved 25


TRAPEZOIDAL RULE IN MPI

Copyright © 2010, Elsevier Inc. All rights Reserved 26


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 27


The Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 28


One trapezoid

Copyright © 2010, Elsevier Inc. All rights Reserved 29


Pseudo-code for a serial program
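The pseudo-code on this slide did not extract; a serial version in C might look like this (the integrand f is a placeholder):

double f(double x) { return x * x; }   /* example integrand */

/* estimate the integral of f from a to b using n trapezoids of width h */
double Trap(double a, double b, int n, double h) {
   double approx = (f(a) + f(b)) / 2.0;
   for (int i = 1; i <= n - 1; i++)
      approx += f(a + i * h);
   return h * approx;
}

/* caller:  h = (b - a) / n;  integral = Trap(a, b, n, h); */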

Copyright © 2010, Elsevier Inc. All rights Reserved 30


Parallelizing the Trapezoidal Rule

1. Partition problem solution into tasks.
2. Identify communication channels between tasks.
3. Aggregate tasks into composite tasks.
4. Map composite tasks to cores.

Copyright © 2010, Elsevier Inc. All rights Reserved 31


Parallel pseudo-code

Copyright © 2010, Elsevier Inc. All rights Reserved 32


Tasks and communications for
Trapezoidal Rule

Copyright © 2010, Elsevier Inc. All rights Reserved 33


First version (1)

Copyright © 2010, Elsevier Inc. All rights Reserved 34


First version (2)

Copyright © 2010, Elsevier Inc. All rights Reserved 35


First version (3)
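The listing spread over the three "First version" slides did not extract. A sketch of the whole program (with Trap and f as in the serial sketch above; a, b, and n are hard-wired, as in the first version):

#include <stdio.h>
#include <mpi.h>

double f(double x) { return x * x; }   /* example integrand */

double Trap(double a, double b, int n, double h) {
   double approx = (f(a) + f(b)) / 2.0;
   for (int i = 1; i <= n - 1; i++)
      approx += f(a + i * h);
   return h * approx;
}

int main(void) {
   int    my_rank, comm_sz, n = 1024, local_n;
   double a = 0.0, b = 3.0, h, local_a, local_b;
   double local_int, total_int;

   MPI_Init(NULL, NULL);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

   h = (b - a) / n;          /* same on every process */
   local_n = n / comm_sz;    /* trapezoids per process (assumes comm_sz divides n) */

   local_a = a + my_rank * local_n * h;
   local_b = local_a + local_n * h;
   local_int = Trap(local_a, local_b, local_n, h);

   if (my_rank != 0) {
      MPI_Send(&local_int, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
   } else {
      total_int = local_int;
      for (int source = 1; source < comm_sz; source++) {
         MPI_Recv(&local_int, 1, MPI_DOUBLE, source, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         total_int += local_int;
      }
      printf("With n = %d trapezoids, the estimate of the integral\n", n);
      printf("from %f to %f = %.15e\n", a, b, total_int);
   }

   MPI_Finalize();
   return 0;
}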

Copyright © 2010, Elsevier Inc. All rights Reserved 36


Dealing with I/O

 Each process just prints a message.

Copyright © 2010, Elsevier Inc. All rights Reserved 37


Running with 6 processes

unpredictable output

Copyright © 2010, Elsevier Inc. All rights Reserved 38


Input
 Most MPI implementations only allow process 0
in MPI_COMM_WORLD access to stdin.
 Process 0 must read the data (scanf) and send
to the other processes.

Copyright © 2010, Elsevier Inc. All rights Reserved 39


Function for reading user input
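The function on this slide did not extract; a sketch in which process 0 reads the input and forwards it with point-to-point sends:

void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
   if (my_rank == 0) {
      printf("Enter a, b, and n\n");
      scanf("%lf %lf %d", a_p, b_p, n_p);
      for (int dest = 1; dest < comm_sz; dest++) {
         MPI_Send(a_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
         MPI_Send(b_p, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
         MPI_Send(n_p, 1, MPI_INT,    dest, 0, MPI_COMM_WORLD);
      }
   } else {
      MPI_Recv(a_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(b_p, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(n_p, 1, MPI_INT,    0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
   }
}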

Copyright © 2010, Elsevier Inc. All rights Reserved 40


COLLECTIVE
COMMUNICATION

Copyright © 2010, Elsevier Inc. All rights Reserved 41


A tree-structured global sum

Copyright © 2010, Elsevier Inc. All rights Reserved 42


Tree-structured communication

1. In the first phase:
   (a) Process 1 sends to 0, 3 sends to 2, 5 sends to 4, and 7 sends to 6.
   (b) Processes 0, 2, 4, and 6 add in the received values.
   (c) Processes 2 and 6 send their new values to processes 0 and 4, respectively.
   (d) Processes 0 and 4 add the received values into their new values.
2. In the second phase:
   (a) Process 4 sends its newest value to process 0.
   (b) Process 0 adds the received value to its newest value.

Copyright © 2010, Elsevier Inc. All rights Reserved 43


An alternative tree-structured global sum

44
MPI_Reduce
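The prototype on this slide did not extract:

int MPI_Reduce(
      void*        input_data_p   /* in:  each process's contribution               */,
      void*        output_data_p  /* out: result, significant only on dest_process  */,
      int          count          /* in:  number of elements                        */,
      MPI_Datatype datatype       /* in:  e.g. MPI_DOUBLE                           */,
      MPI_Op       operator       /* in:  e.g. MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN  */,
      int          dest_process   /* in:  rank that receives the result             */,
      MPI_Comm     comm           /* in                                             */);

/* e.g., a global sum of the local integrals onto process 0: */
MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);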

45
 Lines 18-28 of the first version (the receive-and-add loop) are replaced with a single call to MPI_Reduce.

46
Predefined reduction operators in MPI

47
Collective vs. Point-to-Point Communications

 All the processes in the communicator must call the same collective function.
 For example, a program that attempts to match a call to MPI_Reduce on one process with a call to MPI_Recv on another process is erroneous, and, in all likelihood, the program will hang or crash.

48
Collective vs. Point-to-Point Communications

 The arguments passed by each process to an MPI collective communication must be “compatible.”
 For example, if one process passes in 0 as the dest_process and another passes in 1, then the outcome of a call to MPI_Reduce is erroneous, and, once again, the program is likely to hang or crash.

Copyright © 2010, Elsevier Inc. All rights Reserved 49


Collective vs. Point-to-Point Communications

 The output_data_p argument is only used on dest_process.
 However, all of the processes still need to pass in an actual argument corresponding to output_data_p, even if it’s just NULL.

Copyright © 2010, Elsevier Inc. All rights Reserved 50


Collective vs. Point-to-Point Communications

 Point-to-point communications are matched on the basis of tags and communicators.
 Collective communications don’t use tags.
 They’re matched solely on the basis of the communicator and the order in which they’re called.

Copyright © 2010, Elsevier Inc. All rights Reserved 51


Example (1)

 With multiple calls to MPI_Reduce, the order of the calls determines the result.
 In the slide's example, b appears as if it should be 3 but is actually 4, and d appears as if it should be 4 but is actually 5.

52
MPI_Allreduce

 Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger computation.
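Its prototype is the same as MPI_Reduce's, except that there is no dest_process, since every process gets the result:

int MPI_Allreduce(
      void*        input_data_p   /* in  */,
      void*        output_data_p  /* out: result on every process */,
      int          count          /* in  */,
      MPI_Datatype datatype       /* in  */,
      MPI_Op       operator       /* in  */,
      MPI_Comm     comm           /* in  */);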

Copyright © 2010, Elsevier Inc. All rights Reserved 53


A global sum followed
by distribution of the
result.

Copyright © 2010, Elsevier Inc. All rights Reserved 54


A butterfly-structured global sum.

Copyright © 2010, Elsevier Inc. All rights Reserved 55


Broadcast

 Data belonging to a single process is sent to all of the processes in the communicator.
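The prototype:

int MPI_Bcast(
      void*        data_p      /* in/out: sent by source_proc, received by the rest */,
      int          count       /* in */,
      MPI_Datatype datatype    /* in */,
      int          source_proc /* in: rank of the process that owns the data */,
      MPI_Comm     comm        /* in */);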

Copyright © 2010, Elsevier Inc. All rights Reserved 56


A tree-structured broadcast.

Copyright © 2010, Elsevier Inc. All rights Reserved 57


A version of Get_input that uses MPI_Bcast
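The listing on this slide did not extract; a sketch:

void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
   if (my_rank == 0) {
      printf("Enter a, b, and n\n");
      scanf("%lf %lf %d", a_p, b_p, n_p);
   }
   /* every process, including 0, calls MPI_Bcast with process 0 as the source */
   MPI_Bcast(a_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   MPI_Bcast(b_p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
   MPI_Bcast(n_p, 1, MPI_INT,    0, MPI_COMM_WORLD);
}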

Copyright © 2010, Elsevier Inc. All rights Reserved 58


Data distributions
Compute a vector sum.

Serial implementation of vector addition
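The serial listing did not extract; a sketch:

void Vector_sum(double x[], double y[], double z[], int n) {
   for (int i = 0; i < n; i++)
      z[i] = x[i] + y[i];      /* z = x + y, componentwise */
}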

Copyright © 2010, Elsevier Inc. All rights Reserved 59


Different partitions of a 12-component
vector among 3 processes

Copyright © 2010, Elsevier Inc. All rights Reserved 60


Partitioning options
 Block partitioning
 Assign blocks of consecutive components to
each process.
 Cyclic partitioning
 Assign components in a round robin fashion.
 Block-cyclic partitioning
 Use a cyclic distribution of blocks of
components.

Copyright © 2010, Elsevier Inc. All rights Reserved 61


Parallel implementation of
vector addition
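With a block distribution, each process adds only its own local_n = n/comm_sz components; a sketch of the missing listing:

void Parallel_vector_sum(double local_x[], double local_y[],
                         double local_z[], int local_n) {
   for (int local_i = 0; local_i < local_n; local_i++)
      local_z[local_i] = local_x[local_i] + local_y[local_i];
}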

Copyright © 2010, Elsevier Inc. All rights Reserved 62


Scatter
 MPI_Scatter can be used in a function that reads in an
entire vector on process 0 but only sends the needed
components to each of the other processes.

Copyright © 2010, Elsevier Inc. All rights Reserved 63


Reading and distributing a vector
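The listing on this slide did not extract; a sketch in which process 0 reads the whole vector and MPI_Scatter hands each process its block:

#include <stdlib.h>   /* malloc, free */

void Read_vector(double local_a[], int local_n, int n, char vec_name[],
                 int my_rank, MPI_Comm comm) {
   double* a = NULL;

   if (my_rank == 0) {
      a = malloc(n * sizeof(double));
      printf("Enter the vector %s\n", vec_name);
      for (int i = 0; i < n; i++)
         scanf("%lf", &a[i]);
   }
   /* block-distribute: each process receives local_n = n/comm_sz components */
   MPI_Scatter(a, local_n, MPI_DOUBLE, local_a, local_n, MPI_DOUBLE, 0, comm);
   if (my_rank == 0)
      free(a);
}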

64
Gather
 Collect all of the components of the vector onto
process 0, and then process 0 can process all of
the components.

65
Print a distributed vector
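A sketch of the missing listing; MPI_Gather collects the blocks onto process 0, which prints them:

#include <stdlib.h>

void Print_vector(double local_b[], int local_n, int n, char title[],
                  int my_rank, MPI_Comm comm) {
   double* b = NULL;

   if (my_rank == 0)
      b = malloc(n * sizeof(double));
   /* gather local_n components from each process, in rank order, onto process 0 */
   MPI_Gather(local_b, local_n, MPI_DOUBLE, b, local_n, MPI_DOUBLE, 0, comm);
   if (my_rank == 0) {
      printf("%s\n", title);
      for (int i = 0; i < n; i++)
         printf("%f ", b[i]);
      printf("\n");
      free(b);
   }
}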

Copyright © 2010, Elsevier Inc. All rights Reserved 66


Allgather
 Concatenates the contents of each process’
send_buf_p and stores this in each process’
recv_buf_p.
 As usual, recv_count is the amount of data being
received from each process.
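The prototype:

int MPI_Allgather(
      void*        send_buf_p  /* in  */,
      int          send_count  /* in  */,
      MPI_Datatype send_type   /* in  */,
      void*        recv_buf_p  /* out: the concatenation, on every process */,
      int          recv_count  /* in:  data received from each process     */,
      MPI_Datatype recv_type   /* in  */,
      MPI_Comm     comm        /* in  */);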

Copyright © 2010, Elsevier Inc. All rights Reserved 67


Matrix-vector multiplication

Copyright © 2010, Elsevier Inc. All rights Reserved 68


Matrix-vector multiplication

69
Matrix-vector multiplication

 The i-th component of y is the dot product of the i-th row of A with x:
 y[i] = A[i][0]*x[0] + A[i][1]*x[1] + ... + A[i][n-1]*x[n-1]

Copyright © 2010, Elsevier Inc. All rights Reserved 70


Multiply a matrix by a vector
Serial pseudo-code

Copyright © 2010, Elsevier Inc. All rights Reserved 71


C style arrays

 A two-dimensional array is stored as a single one-dimensional array, row by row (row-major order), so element (i, j) of an m × n matrix is stored at offset i*n + j.

Copyright © 2010, Elsevier Inc. All rights Reserved 72


Serial matrix-vector multiplication
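The listing did not extract; a sketch using the flattened (row-major) storage described above:

/* y = A x, where A is m x n and stored row-major in a 1-D array */
void Mat_vect_mult(double A[], double x[], double y[], int m, int n) {
   for (int i = 0; i < m; i++) {
      y[i] = 0.0;
      for (int j = 0; j < n; j++)
         y[i] += A[i*n + j] * x[j];
   }
}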

Copyright © 2010, Elsevier Inc. All rights Reserved 73


An MPI matrix-vector multiplication function
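The listing did not extract; a sketch with a block distribution of the rows of A and of the vectors (local_m = m/comm_sz, local_n = n/comm_sz). MPI_Allgather gives every process the whole input vector:

#include <stdlib.h>

void Mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int local_m, int n, int local_n, MPI_Comm comm) {
   double* x = malloc(n * sizeof(double));

   /* every process needs all of x to form its dot products */
   MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);

   for (int local_i = 0; local_i < local_m; local_i++) {
      local_y[local_i] = 0.0;
      for (int j = 0; j < n; j++)
         local_y[local_i] += local_A[local_i*n + j] * x[j];
   }
   free(x);
}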

Copyright © 2010, Elsevier Inc. All rights Reserved 74


MPI DERIVED DATATYPES

75
Derived datatypes
 Used to represent any collection of data items in
memory by storing both the types of the items
and their relative locations in memory.
 The idea is that if a function that sends data
knows this information about a collection of data
items, it can collect the items from memory
before they are sent.
 Similarly, a function that receives data can
distribute the items into their correct destinations
in memory when they’re received.

Copyright © 2010, Elsevier Inc. All rights Reserved 76


Derived datatypes
 Formally, consists of a sequence of basic
MPI data types together with a
displacement for each of the data types.
 Trapezoidal Rule example:

Copyright © 2010, Elsevier Inc. All rights Reserved 77


MPI_Type_create_struct

 Builds a derived datatype that consists of individual elements that have different basic types.
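The prototype:

int MPI_Type_create_struct(
      int           count                     /* in:  number of blocks in the type  */,
      int           array_of_blocklengths[]   /* in:  elements per block            */,
      MPI_Aint      array_of_displacements[]  /* in:  byte offsets from the start   */,
      MPI_Datatype  array_of_types[]          /* in:  basic type of each block      */,
      MPI_Datatype* new_type_p                /* out */);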

Copyright © 2010, Elsevier Inc. All rights Reserved 78


MPI_Get_address

 Returns the address of the memory location referenced by location_p.
 The special type MPI_Aint is an integer type that is big enough to store an address on the system.

Copyright © 2010, Elsevier Inc. All rights Reserved 79


MPI_Type_commit

 Allows the MPI implementation to optimize its internal representation of the datatype for use in communication functions.

Copyright © 2010, Elsevier Inc. All rights Reserved 80


MPI_Type_free

 When we’re finished with our new type, this frees any additional storage used.

Copyright © 2010, Elsevier Inc. All rights Reserved 81


Get input function with a derived datatype

Copyright © 2010, Elsevier Inc. All rights Reserved 82


Get input function with a derived datatype

Copyright © 2010, Elsevier Inc. All rights Reserved 83


Get input function with a derived datatype
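The listing spread over these three slides did not extract. A sketch that builds a derived type for the trapezoidal-rule input (a, b, n) and broadcasts all three values with a single MPI_Bcast:

void Build_mpi_type(double* a_p, double* b_p, int* n_p,
                    MPI_Datatype* input_mpi_t_p) {
   int          array_of_blocklengths[3] = {1, 1, 1};
   MPI_Datatype array_of_types[3] = {MPI_DOUBLE, MPI_DOUBLE, MPI_INT};
   MPI_Aint     a_addr, b_addr, n_addr;
   MPI_Aint     array_of_displacements[3] = {0};

   MPI_Get_address(a_p, &a_addr);
   MPI_Get_address(b_p, &b_addr);
   MPI_Get_address(n_p, &n_addr);
   array_of_displacements[1] = b_addr - a_addr;   /* offset of b relative to a */
   array_of_displacements[2] = n_addr - a_addr;   /* offset of n relative to a */

   MPI_Type_create_struct(3, array_of_blocklengths, array_of_displacements,
                          array_of_types, input_mpi_t_p);
   MPI_Type_commit(input_mpi_t_p);
}

void Get_input(int my_rank, int comm_sz, double* a_p, double* b_p, int* n_p) {
   MPI_Datatype input_mpi_t;

   Build_mpi_type(a_p, b_p, n_p, &input_mpi_t);

   if (my_rank == 0) {
      printf("Enter a, b, and n\n");
      scanf("%lf %lf %d", a_p, b_p, n_p);
   }
   /* one broadcast moves a, b, and n together */
   MPI_Bcast(a_p, 1, input_mpi_t, 0, MPI_COMM_WORLD);

   MPI_Type_free(&input_mpi_t);
}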

Copyright © 2010, Elsevier Inc. All rights Reserved 84




PERFORMANCE EVALUATION

86
Elapsed parallel time

 MPI_Wtime returns the number of seconds that have elapsed since some time in the past.
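A minimal usage sketch (my_rank is assumed to hold this process's rank):

double start, finish;

start = MPI_Wtime();
/* ... code to be timed ... */
finish = MPI_Wtime();
printf("Proc %d > elapsed time = %e seconds\n", my_rank, finish - start);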

87
Elapsed serial time
 In this case, you don’t need to link in the MPI
libraries.
 Returns time in microseconds elapsed from
some point in the past.

88
Elapsed serial time
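The code on this slide did not extract. The book distributes a small timer header for this purpose; a GET_TIME-style macro built on gettimeofday would look roughly like this:

#include <sys/time.h>

/* store the current time, in seconds, into the double variable `now` */
#define GET_TIME(now) {                        \
   struct timeval t;                           \
   gettimeofday(&t, NULL);                     \
   now = t.tv_sec + t.tv_usec / 1000000.0;     \
}

/* usage:
   double start, finish;
   GET_TIME(start);
   ...code to be timed...
   GET_TIME(finish);
   printf("Elapsed time = %e seconds\n", finish - start);
*/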

89
MPI_Barrier
Ensures that no process will return from calling it
until every process in the communicator has
started calling it.

90
MPI_Barrier
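The code on this slide did not extract; the standard pattern starts every process at (roughly) the same moment with a barrier and reports the slowest process's time (comm and my_rank are assumed to hold the communicator and this process's rank):

double local_start, local_finish, local_elapsed, elapsed;

MPI_Barrier(comm);                 /* line everyone up before timing */
local_start = MPI_Wtime();
/* ... code to be timed ... */
local_finish  = MPI_Wtime();
local_elapsed = local_finish - local_start;

/* the parallel run-time is the time taken by the slowest process */
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
if (my_rank == 0)
   printf("Elapsed time = %e seconds\n", elapsed);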

91
Run-times of serial and parallel
matrix-vector multiplication

(Seconds)

92
Speedup
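The formula on this slide did not extract; speedup is defined as

   S(n, p) = T_serial(n) / T_parallel(n, p)

and linear speedup, S(n, p) = p, is the best we can ordinarily hope for.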

Copyright © 2010, Elsevier Inc. All rights Reserved 93


Efficiency
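The formula on this slide did not extract; efficiency is the speedup per process:

   E(n, p) = S(n, p) / p = T_serial(n) / (p * T_parallel(n, p))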

Copyright © 2010, Elsevier Inc. All rights Reserved 94


Speedups of Parallel Matrix-Vector Multiplication

Copyright © 2010, Elsevier Inc. All rights Reserved 95


Efficiencies of Parallel Matrix-Vector Multiplication

Copyright © 2010, Elsevier Inc. All rights Reserved 96


Scalability

A program is scalable if the problem size can be increased at a rate so that the efficiency doesn’t decrease as the number of processes increases.

Copyright © 2010, Elsevier Inc. All rights Reserved 97


Scalability

Programs that can maintain a constant efficiency without increasing the problem size are sometimes said to be strongly scalable.

Programs that can maintain a constant efficiency if the problem size increases at the same rate as the number of processes are sometimes said to be weakly scalable.

Copyright © 2010, Elsevier Inc. All rights Reserved 98


A PARALLEL SORTING ALGORITHM

Copyright © 2010, Elsevier Inc. All rights Reserved 99


Sorting

 n keys and p = comm_sz processes.
 n/p keys assigned to each process.
 No restrictions on which keys are assigned to
which processes.
 When the algorithm terminates:
 The keys assigned to each process should be sorted
in (say) increasing order.
 If 0 ≤ q < r < p, then each key assigned to process q
should be less than or equal to every key assigned to
process r.

100
Serial bubble sort
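The listing did not extract; a sketch of serial bubble sort:

void Swap(int* x_p, int* y_p) {
   int tmp = *x_p;
   *x_p = *y_p;
   *y_p = tmp;
}

void Bubble_sort(int a[], int n) {
   for (int list_length = n; list_length >= 2; list_length--)
      for (int i = 0; i < list_length - 1; i++)
         if (a[i] > a[i+1])
            Swap(&a[i], &a[i+1]);
}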

101
Odd-even transposition sort

 A sequence of phases.
 Even phases: compare-swap the pairs (a[0], a[1]), (a[2], a[3]), (a[4], a[5]), ...
 Odd phases: compare-swap the pairs (a[1], a[2]), (a[3], a[4]), (a[5], a[6]), ...
102
Example
Start: 5, 9, 4, 3
Even phase: compare-swap (5,9) and (4,3)
getting the list 5, 9, 3, 4
Odd phase: compare-swap (9,3)
getting the list 5, 3, 9, 4
Even phase: compare-swap (5,3) and (9,4)
getting the list 3, 5, 4, 9
Odd phase: compare-swap (5,4)
getting the list 3, 4, 5, 9

103
Serial odd-even transposition sort
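The listing did not extract; a sketch (reusing the Swap helper from the bubble-sort sketch above):

void Odd_even_sort(int a[], int n) {
   for (int phase = 0; phase < n; phase++) {
      if (phase % 2 == 0) {               /* even phase: pairs (0,1), (2,3), ... */
         for (int i = 1; i < n; i += 2)
            if (a[i-1] > a[i])
               Swap(&a[i-1], &a[i]);
      } else {                            /* odd phase: pairs (1,2), (3,4), ...  */
         for (int i = 1; i < n - 1; i += 2)
            if (a[i] > a[i+1])
               Swap(&a[i], &a[i+1]);
      }
   }
}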

104
Communications among tasks in odd-even sort

Tasks determining a[i] are labeled with a[i].

105
Parallel odd-even transposition sort

106
Pseudo-code
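The pseudo-code on this slide did not extract; each process now holds n/p keys, and a compare-swap becomes an exchange-and-keep-half step. Roughly:

/* Pseudo-code (each process keeps n/p keys, sorted locally):

   sort local keys;
   for (phase = 0; phase < comm_sz; phase++) {
      partner = Compute_partner(phase, my_rank);
      if (I am not idle this phase) {
         send all my keys to partner;
         receive all of partner's keys;
         if (my_rank < partner)
            keep the smaller half of the combined keys;
         else
            keep the larger half;
      }
   }
*/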

107
Compute_partner
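The listing did not extract; a sketch that returns MPI_PROC_NULL (sends and receives to it are no-ops) when the process sits out a phase:

int Compute_partner(int phase, int my_rank, int comm_sz) {
   int partner;

   if (phase % 2 == 0)                    /* even phase: odd ranks pair downward */
      partner = (my_rank % 2 != 0) ? my_rank - 1 : my_rank + 1;
   else                                   /* odd phase: odd ranks pair upward    */
      partner = (my_rank % 2 != 0) ? my_rank + 1 : my_rank - 1;

   if (partner == -1 || partner == comm_sz)
      partner = MPI_PROC_NULL;            /* this process is idle this phase */
   return partner;
}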

108
Safety in MPI programs

 The MPI standard allows MPI_Send to behave in two different ways:
 it can simply copy the message into an MPI-managed buffer and return,
 or it can block until the matching call to MPI_Recv starts.

109


Safety in MPI programs

 Many implementations of MPI set a threshold at which the system switches from buffering to blocking.
 Relatively small messages will be buffered by MPI_Send.
 Larger messages will cause it to block.

111
Safety in MPI programs

 If the MPI_Send executed by each process blocks, no process will be able to start executing a call to MPI_Recv, and the program will hang or deadlock.
 Each process is blocked waiting for an event that will never happen.

(see pseudo-code)

112
Safety in MPI programs

 A program that relies on MPI-provided buffering is said to be unsafe.
 Such a program may run without problems for various sets of input, but it may hang or crash with other sets.

113
MPI_Ssend
 An alternative to MPI_Send defined by the MPI
standard.
 The extra “s” stands for synchronous and
MPI_Ssend is guaranteed to block until the
matching receive starts.

114
Restructuring communication
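The listing did not extract. For a ring exchange in which every process sends to rank (my_rank + 1) % comm_sz and receives from rank (my_rank + comm_sz - 1) % comm_sz, one safe restructuring has even ranks send first and odd ranks receive first; a sketch (msg, new_msg, and size are illustrative names for int buffers and their length):

int dest   = (my_rank + 1) % comm_sz;
int source = (my_rank + comm_sz - 1) % comm_sz;

if (my_rank % 2 == 0) {   /* even ranks: send first, then receive */
   MPI_Send(msg,     size, MPI_INT, dest,   0, comm);
   MPI_Recv(new_msg, size, MPI_INT, source, 0, comm, MPI_STATUS_IGNORE);
} else {                  /* odd ranks: receive first, then send  */
   MPI_Recv(new_msg, size, MPI_INT, source, 0, comm, MPI_STATUS_IGNORE);
   MPI_Send(msg,     size, MPI_INT, dest,   0, comm);
}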

115
MPI_Sendrecv
 An alternative to scheduling the communications
ourselves.
 Carries out a blocking send and a receive in a
single call.
 The dest and the source can be the same or
different.
 Especially useful because MPI schedules the
communications so that the program won’t hang
or crash.

Copyright © 2010, Elsevier Inc. All rights Reserved 116


MPI_Sendrecv
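The prototype:

int MPI_Sendrecv(
      void*        send_buf_p    /* in  */,
      int          send_buf_size /* in  */,
      MPI_Datatype send_buf_type /* in  */,
      int          dest          /* in  */,
      int          send_tag      /* in  */,
      void*        recv_buf_p    /* out */,
      int          recv_buf_size /* in  */,
      MPI_Datatype recv_buf_type /* in  */,
      int          source        /* in  */,
      int          recv_tag      /* in  */,
      MPI_Comm     communicator  /* in  */,
      MPI_Status*  status_p      /* out */);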

Copyright © 2010, Elsevier Inc. All rights Reserved 117


Parallel odd-even transposition sort
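The listing did not extract; a sketch of one compare-split phase, with MPI_Sendrecv doing the exchange (Merge_low and Merge_high are assumed helper functions that keep the local_n smallest, respectively largest, of the two sorted blocks):

/* my_keys, recv_keys, temp_keys: int arrays of length local_n, my_keys sorted */
void Odd_even_iter(int my_keys[], int recv_keys[], int temp_keys[],
                   int local_n, int my_rank, int partner, MPI_Comm comm) {
   if (partner == MPI_PROC_NULL) return;   /* idle this phase */

   /* swap whole blocks with the partner in a single, deadlock-free call */
   MPI_Sendrecv(my_keys,   local_n, MPI_INT, partner, 0,
                recv_keys, local_n, MPI_INT, partner, 0,
                comm, MPI_STATUS_IGNORE);

   if (my_rank < partner)
      Merge_low(my_keys, recv_keys, temp_keys, local_n);   /* keep smaller keys */
   else
      Merge_high(my_keys, recv_keys, temp_keys, local_n);  /* keep larger keys  */
}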

Copyright © 2010, Elsevier Inc. All rights Reserved 118


Run-times of parallel odd-even sort

(times are in milliseconds)

Copyright © 2010, Elsevier Inc. All rights Reserved 119


Concluding Remarks (1)
 MPI or the Message-Passing Interface is a
library of functions that can be called from C,
C++, or Fortran programs.
 A communicator is a collection of processes
that can send messages to each other.
 Many parallel programs use the single-program
multiple data or SPMD approach.

Copyright © 2010, Elsevier Inc. All rights Reserved 120


Concluding Remarks (2)
 Most serial programs are deterministic: if we run
the same program with the same input we’ll get
the same output.
 Parallel programs often don’t possess this
property.
 Collective communications involve all the
processes in a communicator.

Copyright © 2010, Elsevier Inc. All rights Reserved 121


Concluding Remarks (3)
 When we time parallel programs, we’re usually
interested in elapsed time or “wall clock time”.
 Speedup is the ratio of the serial run-time to the
parallel run-time.
 Efficiency is the speedup divided by the number of
parallel processes.

Copyright © 2010, Elsevier Inc. All rights Reserved 122


Concluding Remarks (4)
 If it’s possible to increase the problem size (n) so
that the efficiency doesn’t decrease as p is
increased, a parallel program is said to be
scalable.
 An MPI program is unsafe if its correct behavior
depends on the fact that MPI_Send is buffering its
input.

Copyright © 2010, Elsevier Inc. All rights Reserved 123


124
 Parallel Bubblesort
 Parallel Quicksort
 Parallel Mergesort
 Parallel Partitioning

125
