Saint-Petersburg State University
Department of Computational Physics

Sergei A. Nemnyugin

Basics of parallel programming

Saint-Petersburg
2010
S.A. Nemnyugin
Basics of parallel programming – St. Petersburg, 2010
The basics of programming techniques for multiprocessor and multicore computers are presented. OpenMP and the Message Passing Interface are considered. A short review of Fortran 90 is also included.
Table of Contents
1. Introduction
2. OpenMP
3. Message Passing Interface
4. Fortran 90
5. References
6. Appendix 1. Intel compiler for Linux
7. Appendix 2. Compilation and execution of MPI programs in Linux
Part 1
Principles of parallel programming
1. Introduction
Sequential and parallel programming models
A programming model is a set of programming techniques corresponding to the architecture of an abstract computer intended for execution of a definite class of algorithms. A programming model is based on some concept of the logical organization of a computer (its architecture). The variety of computer architectures may be classified in different ways. One of the most popular taxonomies is Flynn's taxonomy, which is based on the number of instruction streams and data streams that interact in a computer (Fig. 1 - 4).
Fig. 1. SISD (Single Instruction Stream Single Data Stream) architecture
Fig. 2. SIMD (Single Instruction Stream Multiple Data Stream) architecture
Fig. 3. MISD (Multiple Instruction Stream Single Data Stream) architecture
Fig. 4. MIMD (Multiple Instruction Stream Multiple Data Stream) architecture
The inherent parallelism of algorithms and computer programs may be represented by an informational graph. An informational graph shows both the execution order of macrooperations and the data flows between them. Nodes of an informational graph correspond to macrooperations, and unidirectional links correspond to data exchanges (fig. 5).
Fig. 5. Informational graph
A node is characterized by two parameters (n, s), where n is the node's identifier and s is its size, which may be measured by the number of elementary operations that constitute it. A link is characterized by two parameters (v, d), where v is the transferred data and d is the time required for data delivery from sender to recipient.
An informational graph consists of both linear sequences and multiply connected contours (loops). The limit cases of an informational graph, which are topologically equivalent to a linear sequence of macrooperations and to a set of linear sequences (Fig. 6), correspond to the purely sequential and purely parallel models of computing.
Fig. 6. Limit cases of informational graph
The sequential programming model may be characterized as follows:
• performance is defined primarily by the hardware;
• high-level programming languages are used to develop computer programs;
• good source-level portability of computer programs.
Main features of the parallel programming model:
• possibility to achieve high performance;
• special programming techniques;
• special software tools;
• program development and verification may be much more laborious than in
sequential model;
• restricted portability of parallel software.
Some specific problems of parallel programming:
• careful planning of simultaneous execution of a set of processes;
• explicit programming of data exchanges;
• possible deadlocks (two processes or threads may each wait for some resource while locking a resource which is required for normal execution of the other process or thread);
• errors are nonlocal and dynamic (processes or threads are executed on different
computing nodes or cores, workloads of computing nodes change in time, so
explicit synchronizations may be required);
• loss of determinism (results of calculations differ for different runs –
consequence of “data races” – simultaneous and asynchronous access to shared
variables);
• care about scalability of software (scalability is a desirable property of a
system, a network, or a process, which indicates its ability to either handle
growing amounts of work in a graceful manner, or to be readily enlarged);
• necessity of balanced workload of different CPUs.
Models of parallel programming
Parallel programming model may be realized in different ways taking into account
computer architectures and software development tools. Some of them are listed
below.
Message passing model
Most important features of the message passing model:
• A program causes execution of a set of processes.
• Each task gets its own unique identifier.
• Processes interact by means of message sends and receives.
• New processes may be created at run time of a parallel program.
Data parallelism model
Most important features of the data parallelism model:
• One operation deals with a set of elements. Program is a sequence of such
operations.
• Fine granularity of computations.
• A programmer describes explicitly how data should be distributed between
subtasks.
Shared memory model
In the shared memory model tasks have access to common memory and share a common address space. Memory access is controlled by various methods, for example by means of semaphores. An explicit description of data transfers is not used. This simplifies programming, but special attention should be paid to determinism, data races, etc.
Amdahl’s laws
Amdahl's laws form the theoretical basis for maximum performance estimates of parallelizable programs. These laws were derived for an idealized model of parallel computations, which does not take into account the latency of communications (the finite time of data transfer between nodes of a computer system) and so on.
First law
The performance of a computer system which consists of interconnected components is defined by its slowest component.
Second law
Let the execution time of a program on a sequential computer be T1, and let this time be the sum of Ts, the execution time of the non-parallelizable part, and Tp, the execution time of the parallelizable part. Let T2 be the execution time of the program on an ideal parallel computer with N CPUs. Then the speedup is

K = T1 / T2 = (Ts + Tp) / (Ts + Tp / N) = 1 / (S + P / N),

where S = Ts / (Ts + Tp) and P = Tp / (Ts + Tp) are the portions of the sequential and parallelized parts of the program (S + P = 1). See fig. 7.
Fig. 7. Speedup according to Amdahl's second law
Third law
Let a computer system consist of N simple identical processing elements. Then in any case the speedup K ≤ 1 / P.
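As a simple illustration (not part of the original text), a short C sketch that evaluates the speedup bound of the second law for an assumed sequential fraction S = 0.1 and a growing number of CPUs:

#include <stdio.h>

/* Speedup from Amdahl's second law: K = 1 / (S + P / N),
   where S is the sequential fraction, P = 1 - S and N is the number of CPUs. */
double amdahl_speedup(double s, int n)
{
    double p = 1.0 - s;
    return 1.0 / (s + p / n);
}

int main(void)
{
    /* With S = 0.1 the speedup never exceeds 1 / S = 10 (third law). */
    for (int n = 1; n <= 1024; n *= 2)
        printf("N = %4d   K = %.2f\n", n, amdahl_speedup(0.1, n));
    return 0;
}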
Two paradigms of parallel programming
The informational graph in fig. 8 displays the coexistence of data and task parallelism in the same program/algorithm.
Fig. 8. Data and task parallelism in the same program
Data parallelism
Multiple arrows between two neighbouring nodes of an informational graph correspond to simultaneous application of a macrooperation to a set of data (an array). Parts of the array may be processed by a vector CPU or by a set of CPUs of a parallel computer system.
Vectorization or parallelization in the data parallelism model is introduced into a program at the compilation stage. A programmer in this case:
• uses compiler options of vector or parallel optimization;
• uses directives of parallel compilation;
• uses both specialized programming languages of parallel computations and
libraries optimized for a given computer architecture.
Main features of the model of data parallelism:
• same program deals with all data;
• address space is global;
• weak synchronization of computations on CPUs of a parallel computer;
• parallel operations are applied to array elements simultaneously on all
available CPUs.
Most widely used software tools in the data parallelism model are special
programming languages or extensions such as DVM Fortran, HPF (High
Performance Fortran) etc.
Realization of the data parallelism model should be supported on the level of
compilation. Such support may be provided by:
• preprocessors which use sequential compilers and specialized libraries of
parallel algorithms;
• pretranslators which perform preliminary analysis of logical structure of a
program, check of dependencies and restricted parallel optimization;
• parallelizing compilers which reveal parallelism in the source code of a program and transform it into parallel structures. To make such transformation easier, special compilation directives may be included in a program.
Task parallelism
Loops of an informational graph consisting of “thick” arrows correspond to task parallelism. Its idea is based on decomposition of a problem into a set of smaller subtasks. The subtasks are processed on different CPUs. This approach is MIMD-oriented.
In the task parallelism model subtasks are realized as separate programs written in a commonly used programming language (for example, Fortran or C). Subtasks should send and receive initial data, intermediate and final results. In practice such interchange may be realized by means of calls to subroutines from a special library.
A programmer is able to control data distribution between different CPUs and
different subtasks as well as data interchange.
Problems of this approach are listed below:
• high laboriousness of development, debugging, testing and verification of a
parallel program;
• a programmer is responsible for equal and dynamically balanced workload of
CPUs of a parallel computer;
• a programmer should minimize data interchange between subtasks because
communications require a lot of time;
• possibility of deadlocks or other situations when a message sent by some subtask may not be delivered to the target subtask.
Attractive features:
• flexibility and more freedom given to a programmer to develop software which
efficiently uses resources of a parallel computer;
• possibility to achieve maximum performance.
Main tools of programming in the task parallelism model are specialized libraries
(e. g. MPI - Message Passing Interface, PVM - Parallel Virtual Machines).
Design of parallel algorithm
In the process of development of a parallel algorithm a programmer should pass
through a sequence of specific stages:
1. Decomposition. This stage includes analysis of the problem and making a decision whether parallelization is necessary at all. The problem and the related data are divided into smaller parts – subtasks. It is not necessary to take into account features of the computer architecture at this stage.
2. Planning of communications (data exchanges between subtasks). The communications necessary both for exchanges of data (initial, intermediate and final results) and for exchanges of control information are defined. The types of communications also must be chosen.
3. Agglomeration. At this stage subtasks are agglomerated into bigger constituents of a parallel program. Sometimes this allows one to increase the efficiency of an algorithm and to reduce its development cost.
4. Planning of computations. Distribution of subtasks among CPUs. The main criterion of the distribution is effective usage of CPUs with minimal time spent on communications.
Let us turn to a more detailed consideration of the listed stages.
Decomposition
There are different approaches to decomposition. Below a short review of main
approaches is given.
Data decomposition
In the data decomposition approach the data must be subdivided into smaller parts first; secondly, the procedures of data processing may be decomposed. The data are divided into parts of nearly equal size. Operations dealing with the data should be bound to data fragments. In such a way subtasks are formed. Then all required communications should be defined. Overlaps of subtasks in computational work should be reduced; this allows one to avoid duplication of computations. The decomposition may be refined in the process of program design: if it is necessary to reduce communications, an increase of the overlap of subtasks is possible.
Analysis begins with the biggest data structures as well as the most often used ones. Different data structures may be used at different stages of computation, so both static and dynamic decompositions are used.
Recursive dichotomy
Recursive dichotomy may be used to divide a domain into subdomains requiring about the same amount of computations, while communications are minimized. At first the domain is divided into two parts; the decomposition is then repeated recursively in each new subdomain as many times as needed to get the required number of subtasks.
Recursive coordinate dichotomy
Recursive coordinate dichotomy may be applied to nonregular grids. At each step the division is performed along the dimension having the largest extension.
Recursive graph dichotomy
Recursive graph dichotomy may be applied to nonregular grids. In this approach information about the grid topology is used to minimize the number of edges crossing the boundaries of subdomains. In such a way the number of communications may be reduced.
Functional decomposition
In functional decomposition the computational algorithm is decomposed first, and afterwards the decomposition of data is adjusted to it. Functional decomposition may be useful in cases where it is hard or even impossible to find data structures which may be parallelized. The efficiency of decomposition may be improved by following some recommendations:
• number of subtasks after decomposition should exceed by an order of
magnitude the number of CPUs;
• extra computations and data exchanges should be avoided;
• subtasks should have approximately same size;
• ideally, decomposition should be performed in such a way that an increase of the problem's size leads to an increase of the number of subtasks (with constant size of a subtask).
The size of a subtask is defined by the granularity of the algorithm. Granularity may be measured by the number of operations in a block. There are three levels of granularity:
1. Fine-grained parallelism – instruction level (no more than 20 instructions per block, 5 on average; the number of parallel subtasks ranges from two to a few thousand).
2. Middle-grained parallelism – subroutine level. The block size is up to 2000 instructions. This kind of parallelism is a bit harder to find because it is necessary to take into account interprocedural dependencies. The requirements on communications are smaller than in the case of instruction-level parallelism.
3. Coarse-grained parallelism – task level. It is realized via simultaneous execution of independent programs on a parallel computer. Coarse-grained parallelism must be supported by the operating system.
The most important condition which makes decomposition possible is the independence of subtasks. The main kinds of independence are listed below:
• Data independence – data which are processed by one subtask are not modified by another subtask.
• Control independence – the order of execution of a program's parts is defined only at execution time (if a control dependency exists, the order of execution is predefined).
• Independence on resources – it may be provided by a sufficient amount of computer resources.
• Dependence on output takes place if two or more subtasks write to the same variable. Input-output independence takes place if the input-output statements of different subtasks do not access the same variable or file.
In practice complete independence is unachievable.
Planning of communications
There are a few basic kinds of communications:
• local – each subtask communicates with a few other subtasks;
• global – each subtask communicates with many other subtasks;
• structured – each subtask and the subtasks which communicate with it may be arranged into a regular structure topologically equivalent (for example) to a lattice;
• unstructured – communications form an arbitrary graph;
• static – communications don't change in time;
• dynamic – communications change during program execution;
• synchronous – sender and receiver coordinate data exchanges;
• asynchronous – data exchanges are not coordinated.
Recommendations on planning of communications:
• a program has good scalability if each subtask has the same number of communications;
• local communications are preferable;
• parallel communications are preferable.
Agglomeration
At the agglomeration stage the architecture of the parallel computer should be taken into account. Subtasks from the two previous stages are combined in such a way as to get as many new subtasks as there are available CPUs. In order to perform agglomeration efficiently, the following recommendations should be taken into account:
• overhead expenses on communications should be reduced;
• if at the agglomeration stage computations or data have to be duplicated, neither scalability nor performance should suffer;
• new subtasks should have approximately equal computational complexity;
• scalability should be kept if possible;
• parallel execution must be kept;
• cost of development should be reduced if possible.
Planning of computations
At the stage of planning of computations the distribution of subtasks among CPUs has to be defined. It should be done in such a way that the execution time of the parallel program is minimized. The most often used approaches to planning of computations are listed below.
Master/slave planning
The main (master) subtask is responsible for the distribution of slave tasks among CPUs (fig. 9). A slave task gets initial data from the master and returns results.
Fig. 9. Simple master/slave schema
Hierarchical master/slave schema
In this approach slave subtasks form a few disjoint sets (fig. 10). Each set has its own master task. The master tasks of the sets are controlled by a single highest-level master task.
Fig. 10. Hierarchical master/slave schema
Decentralized planning
In this approach a master task is absent. Subtasks communicate with each other according to some strategy (fig. 11): the targets may be randomly chosen subtasks or a small number of target subtasks (nearest neighbours). In the hybrid centralized-distributed approach a message is sent to a master task which forwards it to slave tasks according to a round-robin strategy.
Fig. 11. Decentralized planning of computations
Dynamic balancing may be efficiently realized if the following recommendations are
taken into account:
• in the case when each CPU is loaded with a single subtask, the execution time of a parallel program is defined by the slowest subtask, so optimal performance may be achieved when all subtasks have approximately the same size;
• balancing may be provided by loading each CPU with several tasks.
Multithreading
A thread is a single sequential flow of control within a program: a sequence of instructions that is executed.
Relationship of threads with a process:
• A process has the main thread that initializes the process and begins executing the instructions.
• Any thread can create other threads within the process.
• Each thread gets its own stack.
• All threads within the process share code and data segments.
Threading problems:
• data races;
• deadlocks;
• load imbalance;
• livelocks.
Race conditions occur as a result of dependencies in which multiple threads attempt to update the same memory location, or variable, after threading. They may not be apparent at all times. The two possible conflicts that can arise as a result of data races are:
• read/write conflict;
• write/write conflict.
The two ways by which it is possible to prevent data races in multithreaded applications are:
• scope variables to be local to each thread (variables declared within threaded functions, variables allocated on the thread's stack, etc.);
• control concurrent access by using critical regions (examples of synchronization objects that can be used are: mutex, semaphore, event, critical section).
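As a brief illustration (not part of the original text; a minimal C sketch using OpenMP, which is described in Part 2), the following fragment shows a data race on a shared counter and one possible fix with a reduction, which gives every thread its own private copy:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    long racy = 0, reduced = 0;

    /* Data race: all threads read and write racy without synchronization,
       so some increments are lost and the result is unpredictable. */
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++)
        racy++;

    /* Fix: each thread increments a private copy; the copies are summed at the end. */
    #pragma omp parallel for reduction(+:reduced)
    for (long i = 0; i < 1000000; i++)
        reduced++;

    printf("racy = %ld, with reduction = %ld (expected 1000000)\n", racy, reduced);
    return 0;
}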
Race conditions may be hidden behind the programming language syntax. Some examples are given below:
Thread 1          Thread 2          Why data race happens
X += 1            X += 2            Compiler expands += into separate read and write of X
vec[i] += 1       vec[j] += 2       Subscripts i and j may be equal
*p1 += 1          *p2 += 2          Pointers p1 and p2 might point to same location
Func(1)           Func(2)           Func might be adding its argument to a hidden shared variable
add [abc], 1      add [abc], 2      At the instruction level the hardware expands an update of [abc] into separate reads and writes
A deadlock occurs when a thread waits for a condition that never occurs; it most commonly results from competition between threads for system resources held by another thread. Deadlock can occur only if the following conditions take place:
• access to each resource is exclusive;
• a thread is allowed to hold one resource while requesting another;
• no thread is willing to relinquish a resource that it has acquired;
• there is a cycle of threads trying to acquire resources, where each resource is held by one thread and requested by another.
A livelock is a situation when a thread does not make progress in its computations, but the thread is not blocked or waiting: the threads try to overcome an obstacle presented by another thread that is doing the same thing.
Part 2
OpenMP
OpenMP is an API (Application Programming Interface) for shared-memory multiprocessor and multicore computing systems. Multithreaded programming in the C, C++ and Fortran programming languages is supported.
Model of a parallel program in OpenMP
Model of a parallel program in OpenMP may be formulated as follows (fig. 12):
• A program consists of sequential and parallel sections.
• At the start of execution a master thread is created which performs the sequential sections of the program.
• In order to start multithreaded execution of a parallel section, a fork is performed which creates a set of threads. Each thread has its own unique numerical identifier (the master thread has 0). When a loop is parallelized, all threads execute the same code; in general, threads may execute different parts of the code.
• After execution of a parallel section completes, a join operation is performed. All threads except the master stop their execution.
Fig. 12. Model of a parallel program in OpenMP
OpenMP consists of the following components:
• Compiler directives are used to create threads, for worksharing among threads
and their synchronization. Directives are included in a parallel program.
• Runtime subroutines are used for setting and getting attributes of threads. Calls of runtime subroutines are included in a parallel program.
• Environment variables are used to control parallel program execution; they let one set the execution environment of a parallel program. Any operating system and/or command interpreter has its own commands for setting environment variables.
When using compiler directives and runtime libraries, a programmer has to follow rules which may differ between programming languages. A set of such rules is called a programming language binding.
Fortran bindings
Names of subprograms and compiler directives in Fortran, as well as names of environment variables, begin with OMP or OMP_. A compiler directive has the following form:
{!|C|*}$OMP directive [operator_1[, operator_2, …]]
A directive begins at the first position (fixed source format of Fortran 77) or at any position (free format). A directive may be continued on the next line. In this case it is necessary to conform to the standard rules for indicating a statement continuation in the version of the language used to write the program (a non-blank symbol in the 6th position of the continuation line in fixed format, or an ampersand in free format).
Example of OpenMP program (Fortran)
      program omp_example
      integer i, k, N
      real*4 sum, h, x
      print *, "Please, type in N:"
      read *, N
      h = 1.0 / N
      sum = 0.0
C$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(x) REDUCTION(+:sum)
      do i = 1, N
         x = i * h
         sum = sum + 1.e0 * h / (1.e0 + x**2)
      end do
      print *, 4.0 * sum
      end
C bindings
Function names, pragmas and names of OpenMP environment variables in C begin with omp, omp_ or OMP_. A compiler directive has the following form:
#pragma omp directive [operator_1[, operator_2, …]]
In OpenMP programs the header file omp.h has to be included.
Example of OpenMP program (in C)
#include <omp.h>
#include <stdio.h>

double f(double x)
{
    return 4.0 / (1 + x * x);
}

int main(void)
{
    const long N = 100000;
    long i;
    double h, sum, x;
    sum = 0;
    h = 1.0 / N;
#pragma omp parallel shared(h)
    {
#pragma omp for private(x) reduction(+:sum)
        for (i = 0; i < N; i++) {
            x = h * (i + 0.5);
            sum = sum + f(x);
        }
    }
    printf("PI = %f\n", sum / N);
    return 0;
}
OpenMP directives
Descriptions of OpenMP directives (version 2.5) are given.
parallel
…
end parallel
Defines a parallel section of a program. It may be used with the following statements (their descriptions are given later in the text):
• private;
• shared;
• default;
• firstprivate;
• reduction;
• if;
• copyin;
• num_threads.
do
   do-loop
end do

#pragma omp for
   for-loop
Defines a loop which has to be parallelized (the Fortran and C forms are shown above). It may be used with the following statements:
• private;
• firstprivate;
• lastprivate;
• reduction;
• schedule;
• ordered;
• nowait.
sections
…
end sections
Defines parallel sections of a program. The nested sections, defined by section directives, are distributed among threads. It may be used with the following statements:
• private;
• firstprivate;
• lastprivate;
• reduction;
• nowait.
section
Defines part of parallel sections which must be executed in one thread.
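For illustration (not part of the original text), a minimal C sketch in which two independent blocks are executed by different threads of the team via parallel sections:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("section A executed by thread %d\n", omp_get_thread_num());

        #pragma omp section
        printf("section B executed by thread %d\n", omp_get_thread_num());
    }
    return 0;
}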
single
…
end single
Defines section of a program which has to be executed by a single thread. It may be
used with following statements:
• private;
• firstprivate;
• copyprivate;
• nowait.
workshare
…
end workshare
Divides a block of a program into parts, each of which is executed by the threads only once. The block may include only the following constructs:
• array assignments;
• scalar assignments;
• FORALL;
• WHERE;
• atomic;
• critical;
• parallel.
parallel do
   do-loop
end parallel do
Combines directives parallel and do.
parallel sections
…
end parallel sections
Combines directives parallel and sections.
parallel workshare
…
end parallel workshare
Combines directives parallel and workshare.
master
…
end master
Defines block which has to be executed by master thread.
critical[(lock)]
…
end critical[(lock)]
Defines a block of a program which may be executed by only one thread at a time (a critical section). lock is an optional name of the critical section.
barrier
Barrier synchronization of threads. Every thread whose execution reaches the given point is suspended until all other threads reach the same point of execution.
atomic
Defines an operation as atomic (while an atomic operation is executed, simultaneous write access to the memory location from different threads is prohibited). It may be applied only to the statement which immediately follows this directive. The statement has one of the following forms:
• x = x {+|-|*|/|.AND.|.OR.|.EQV.|.NEQV.} scalar_expression_without_x
• x = scalar_expression_without_x {+|-|*|/|.AND.|.OR.|.EQV.|.NEQV.} x
• x = {MAX|MIN|IAND|IOR|IEOR} (x, scalar_expression_without_x)
• x = {MAX|MIN|IAND|IOR|IEOR} (scalar_expression_without_x, x)
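A short illustrative C fragment (not part of the original text) using atomic to protect a single shared update inside a parallel loop:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int hits = 0;
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        if (i % 3 == 0) {
            /* Only this one update of the shared variable is made atomic. */
            #pragma omp atomic
            hits++;
        }
    }
    printf("hits = %d\n", hits);   /* multiples of 3 in [0, 999]: 334 */
    return 0;
}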
flush[(list of variables)]
Sets a synchronization point at which the values of the variables from the list that are accessible from the thread are written to memory. This provides coherence of the memory content accessible from different threads.
ordered
…
end ordered
Ensures that the loop iterations are executed in the order corresponding to sequential execution.
threadprivate(list of common-blocks)
Defines common blocks in the list as local.
OpenMP statements
OpenMP statements are used together with directives.
private(list of variables)
Defines the variables in the list as local to each thread.
firstprivate(list of variables)
Defines the variables in the list as local and initializes them with the values from the block preceding this directive.
lastprivate(list of variables)
Defines the variables in the list as local and assigns to them the values from the block of the program which was executed last.
copyprivate(list of variables)
After execution of a block defined by the single directive ends, the values of the local variables from the list are distributed among the other threads.
nowait
Cancels the barrier synchronization at the end of a parallel construct.
shared(list of variables)
Defines the variables in the list as shared by all threads.
default(private|shared|none)
Changes the default scoping rules for variables. The keyword private may be used only in Fortran.
reduction(operator|builtin function: list of variables)
Reduces the values of the local variables from the list by means of an operator or a built-in function of the language. A reduction is applied to several values and returns a single value.
if(scalar logical expression)
Conditional statement.
num_threads(scalar integer expression)
Sets the number of threads. An alternative method of setting the number of threads is the environment variable OMP_NUM_THREADS.
schedule(method_of_distribution_of_iterations[, number_of_loop_iterations])
Defines the method of distribution of loop iterations among threads:
• static – the number of loop iterations per chunk is fixed, and the chunks are distributed among the threads in a round-robin fashion. If the chunk size is not given, the iterations are divided into chunks of approximately equal size, one per thread;
• dynamic – the number of loop iterations per chunk is fixed. The next chunk of iterations is delivered to a thread which has become free;
• guided – the number of loop iterations per chunk decreases. The next chunk of iterations is delivered to a thread which has become free;
• runtime – the method of worksharing is defined at execution time by means of the environment variable OMP_SCHEDULE.
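An illustrative C fragment (not part of the original text) requesting dynamic scheduling with a chunk of 4 iterations:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Chunks of 4 iterations are handed out to threads as they become free. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < 32; i++)
        printf("iteration %2d done by thread %d\n", i, omp_get_thread_num());
    return 0;
}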
copyin(list of common-blocks)
Data are copied from the master thread to the local common blocks of every other thread at the beginning of a parallel section. The names are placed between «/» symbols.
OpenMP subroutines
Subprograms which form the execution environment of a parallel program
From now on, the C interface of an OpenMP subprogram is given first and the Fortran interface second.
void omp_set_num_threads(int threads);
subroutine omp_set_num_threads(threads)
integer threads
Sets number of threads which are used to execute parallel sections of a program.
int omp_get_num_threads(void);
integer function omp_get_num_threads()
Returns number of threads which are used to execute parallel sections.
int omp_get_max_threads(void);
integer function omp_get_max_threads()
Returns maximum number of threads which may be used to execute parallel sections
of a program.
int omp_get_thread_num(void);
integer function omp_get_thread_num()
Returns the identifier of the thread which called the function.
int omp_get_num_procs(void);
integer function omp_get_num_procs()
Returns number of processors which may be used by a program.
int omp_in_parallel(void);
logical function omp_in_parallel()
Returns true if call is made from an active parallel section of a program.
void omp_set_dynamic(int threads);
subroutine omp_set_dynamic(threads)
logical threads
Turns on or off dynamic adjustment of the number of threads which are used to execute parallel sections of a program. By default this feature is disabled.
int omp_get_dynamic(void);
logical function omp_get_dynamic()
Returns true if dynamic assignment of threads number is allowed.
void omp_set_nested(int nested);
subroutine omp_set_nested(nested)
logical nested
Turns on or off nested parallelism. By default this feature is disabled.
int omp_get_nested(void);
logical function omp_get_nested()
Checks if nested parallelism is allowed.
Subprograms for operations with locks
Locks are used to prevent effects leading to unpredictable behaviour of a program, which may result from data races when two or more threads have access to the same variable.
void omp_init_lock(omp_lock_t *lock);
subroutine omp_init_lock(lock)
integer(kind = omp_lock_kind) :: lock
Initializes lock associated with lock identifier to use it in subsequent calls.
void omp_destroy_lock(omp_lock_t *lock);
subroutine omp_destroy_lock(lock)
integer(kind = omp_lock_kind) :: lock
Makes locks associated with lock identifier undefined.
void omp_set_lock(omp_lock_t *lock);
subroutine omp_set_lock(lock)
integer(kind = omp_lock_kind) :: lock
Changes the state of the thread from execution to waiting until the lock associated with the identifier lock becomes available. The thread then becomes the owner of the available lock.
void omp_unset_lock(omp_lock_t *lock);
subroutine omp_unset_lock(lock)
integer(kind = omp_lock_kind) :: lock
When this call completes, the thread ceases to be the owner of the lock associated with the identifier lock. If the thread was not the owner of the lock, the result is undefined.
int omp_test_lock(omp_lock_t *lock);
logical function omp_test_lock(lock)
integer(kind = omp_lock_kind) :: lock
Tries to set the lock associated with the identifier lock without blocking; returns true if the lock was successfully set.
void omp_init_nest_lock(omp_nest_lock_t *lock);
subroutine omp_init_nest_lock(lock)
integer(kind = omp_nest_lock_kind) :: lock
Initializes nested lock associated with identifier lock.
void omp_destroy_nest_lock(omp_nest_lock_t *lock);
subroutine omp_destroy_nest_lock(lock)
integer(kind = omp_nest_lock_kind) :: lock
Sets nested lock associated with identifier lock as undefined.
void omp_set_nest_lock(omp_nest_lock_t *lock);
subroutine omp_set_nest_lock(lock)
integer(kind = omp_nest_lock_kind) :: lock
Changes the state of the thread from execution to waiting until the nested lock associated with the identifier lock becomes available. The thread then becomes the owner of the available lock.
void omp_unset_nest_lock(omp_nest_lock_t *lock);
subroutine omp_unset_nest_lock(lock)
integer(kind = omp_nest_lock_kind) :: lock
Releases executing thread from being owner of nested lock associated with identifier
lock. If the thread was not owner of the lock result will be undefined.
int omp_test_nest_lock(omp_nest_lock_t *lock);
integer function omp_test_nest_lock(lock)
integer(kind = omp_nest_lock_kind) :: lock
Tries to set the nested lock associated with the identifier lock without blocking. If the lock is set successfully, the new value of the nesting counter is returned; otherwise 0 is returned.
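A minimal illustrative C sketch (not part of the original text) in which a simple lock protects updates of a shared sum:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_lock_t lock;
    omp_init_lock(&lock);            /* create the lock before the parallel region */

    int shared_sum = 0;
    #pragma omp parallel for
    for (int i = 1; i <= 100; i++) {
        omp_set_lock(&lock);         /* only one thread at a time passes this point */
        shared_sum += i;
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    printf("sum = %d\n", shared_sum);   /* 1 + 2 + ... + 100 = 5050 */
    return 0;
}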
Timers
Timers may be used to profile OpenMP programs.
double omp_get_wtime(void);
double precision function omp_get_wtime()
Returns time (in seconds) passed from some arbitrary moment in the past. Reference
point is fixed during execution of the program.
double omp_get_wtick(void);
double precision function omp_get_wtick()
Returns time (in seconds) passed between subsequent ticks. May be used as a
measure of accuracy of the timer.
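An illustrative C fragment (not part of the original text) measuring the wall-clock time of a parallel region with these timers:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double t0 = omp_get_wtime();

    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < 10000000; i++)
        s += 1.0 / (i + 1.0);

    double t1 = omp_get_wtime();
    printf("result %.6f computed in %.3f s (timer tick %.2e s)\n",
           s, t1 - t0, omp_get_wtick());
    return 0;
}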
OpenMP environment variables
Environment variables may be set as follows:
• export VARIABLE=value (in UNIX)
• set VARIABLE=value (in Microsoft Windows)
OMP_NUM_THREADS
Sets number of threads on execution of parallel sections of a program.
OMP_SCHEDULE
Sets method of distribution of a loop iterations among threads. Possible values:
• static;
• dynamic;
• guided.
The chunk size (an optional parameter) is given after one of these keywords, separated by a comma, for example:
export OMP_SCHEDULE="static, 10"
OMP_DYNAMIC
If the variable has the value false, dynamic adjustment of the number of threads is not allowed.
OMP_NESTED
If the variable has the value false, nested parallelism is not allowed.
Part 3
Message Passing Interface
The Message Passing Interface (MPI) is a specification which defines how an implementation of a message passing system should be organized. Below, a description based on a free implementation of MPI 1 – MPICH 1.2.7 – is given.
Fortran bindings
Names of subroutines and named constants in MPI programs written in Fortran begin with the symbols MPI_. The exit code is returned by an additional integer parameter (the last argument). The successful exit code is MPI_SUCCESS. Definitions (for example, definitions of named constants) are in the header file mpif.h, which must be included in an MPI program by the include statement.
In some subroutines a special variable status is used, which is an integer array of size MPI_STATUS_SIZE.
In calls of MPI subroutines, MPI data types have to be used. Most of them have corresponding Fortran data types (see Table 1).
Table 1. MPI data types in Fortran language

Data type in MPI          Data type in Fortran
MPI_INTEGER               Integer
MPI_REAL                  Real
MPI_DOUBLE_PRECISION      Double precision
MPI_DOUBLE_COMPLEX        Double complex
MPI_COMPLEX               Complex
MPI_LOGICAL               Logical
MPI_CHARACTER             Character
MPI_BYTE                  -
MPI_PACKED                -

Data types which may not exist in some MPI realizations:
MPI_INTEGER1              Integer*1
MPI_INTEGER2              Integer*2
MPI_INTEGER4              Integer*4
MPI_REAL4                 Real*4
MPI_REAL8                 Real*8
The data types MPI_Datatype and MPI_Comm are represented by the standard Fortran integer type (integer).
In C programs MPI library functions are used, whereas in Fortran they are subroutines.
C bindings
In C programs the names of subprograms have the following form: MPI_Class_action_subset or MPI_Class_action. In C++, methods of a class are used and their names have the following form: MPI::Class::action_subset. Some actions have special names: Create – creation of a new object; Get – getting information about an object; Set – setting parameters of an object; Delete – removal of information; Is – inquiry whether a given object has given properties.
Names of MPI constants are written in uppercase. Their definitions are included in
the header file mpi.h.
Input parameters are passed by value and output (and INOUT) by reference.
Correspondence between MPI data types and standard data types of C is given in the
table 2.
Table 2. MPI data types in C language

Data type in MPI          Data type in C
MPI_CHAR                  signed char
MPI_SHORT                 signed short int
MPI_INT                   signed int
MPI_LONG                  signed long int
MPI_UNSIGNED_CHAR         unsigned char
MPI_UNSIGNED_SHORT        unsigned short int
MPI_UNSIGNED              unsigned int
MPI_UNSIGNED_LONG         unsigned long int
MPI_FLOAT                 float
MPI_DOUBLE                double
MPI_LONG_DOUBLE           long double
MPI_BYTE                  -
MPI_PACKED                -
Exit codes
In MPI, specific exit codes for subprograms are used. Some of these codes are: MPI_SUCCESS – successful completion; MPI_ERR_OTHER – the most frequent reason is a repeated call of MPI_Init.
In place of numeric codes, named constants may be used:
• MPI_ERR_BUFFER – wrong buffer pointer;
• MPI_ERR_COMM – wrong communicator;
• MPI_ERR_RANK – wrong rank;
• MPI_ERR_OP – wrong operation;
• MPI_ERR_ARG – wrong argument;
• MPI_ERR_UNKNOWN – unknown error;
• MPI_ERR_TRUNCATE – message truncated during receive;
• MPI_ERR_INTERN – internal error. A common reason is lack of memory.
Basic concepts of MPI programming
A communicator is a set of processes: either the whole set of processes of a parallel MPI program at the time of its execution, or a subset with a common context of execution (fig. 13). Only processes in the same communicator may be involved in point-to-point or collective exchanges. Every communicator has a name. There are a few standard communicators:
• MPI_COMM_WORLD – includes all processes;
• MPI_COMM_SELF – includes only the given process;
• MPI_COMM_NULL – an empty communicator.
A new communicator may be created by means of special calls. In this case it may include a subset of processes.
Fig. 13. Communicator
A rank is a unique numeric identifier which is assigned to each process of the same parallel program. It is an integer value from 0 to number_of_processes – 1 (fig. 14).
Fig. 14. Ranks of parallel processes
Message tag is a unique numeric identifier which may be assigned to a message. Tags
are used to distinguish messages. Joker MPI_ANY_TAG may be used if a tag doesn’t
play any role in an exchange.
Common structure of an MPI program:
program para
…
if (process = master) then
master clause
else
slave clause
endif
end
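For illustration (not part of the original text), a minimal complete MPI program in C following this master/slave structure; it is assumed to be compiled with an MPI compiler wrapper such as mpicc and started with mpirun:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("master: %d processes started\n", size);   /* master clause */
    else
        printf("slave %d reporting\n", rank);              /* slave clause */

    MPI_Finalize();
    return 0;
}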
MPI subprograms
Miscellaneous subprograms
Initializing of MPI
int MPI_Init(int *argc, char ***argv)
MPI_INIT(IERR)
The arguments argc and argv are used only in C programs. They are the number of command-line arguments used to run the program and the array of these arguments. This call must precede all other MPI subprogram calls.
Finalizing of MPI
int MPI_Finalize()
MPI_FINALIZE(IERR)
Finalizes MPI. After this call completes, no MPI subprogram may be used. MPI_FINALIZE must be called by every process before it stops its execution.
Getting of a number of processes
int MPI_Comm_size(MPI_Comm comm, int *size)
MPI_COMM_SIZE(COMM, SIZE, IERR)
Input parameter:
• comm – communicator.
Output parameter:
• size – number of processes in the communicator.
Getting of rank of a process
int MPI_Comm_rank(MPI_Comm comm, int *rank)
MPI_COMM_RANK(COMM, RANK, IERR)
Input parameter:
• comm – communicator.
Output parameter:
• rank – rank of the process in a communicator.
Getting the name of the computing node which executes the calling process
int MPI_Get_processor_name(char *name, int *resultlen)
MPI_GET_PROCESSOR_NAME(NAME, RESULTLEN, IERR)
Output parameters:
• name – identifier of the computing node. The array has to have at least MPI_MAX_PROCESSOR_NAME elements;
• resultlen – length of the name.
Time passed from an arbitrary moment in the past
double MPI_Wtime()
DOUBLE PRECISION MPI_WTIME()
Point-to-point exchanges
Point-to-point exchange involves only two processes: source and target (fig. 15). In
this section interfaces of subprograms for point-to-point exchange are described.
Fig. 15. Point-to-point send-receive operation
Standard blocking send
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR)
Input parameters:
• buf – address of the first element in the send buffer;
• count – number of elements in the send buffer;
• datatype – MPI data type of the elements to be sent;
• dest – rank of the target process (an integer from 0 to n – 1, where n is the number of processes in the communicator);
• tag – message tag;
• comm – communicator;
• ierr – exit code.
Standard blocking receive
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS,
IERR)
Input parameters:
• count – maximum number of elements in the receive buffer. The actual number of received elements may be determined by means of the subroutine MPI_Get_count;
• datatype – type of the data to be received. The data types in the send and receive calls have to be the same;
• source – source rank. The special value MPI_ANY_SOURCE corresponds to an arbitrary source rank. An identifier which corresponds to an arbitrary parameter value is called a “joker” (wildcard);
• tag – tag of the message, or the joker MPI_ANY_TAG which corresponds to an arbitrary tag value;
• comm – communicator.
Output parameters:
• buf – address of the receive buffer. The size of the buffer has to be large enough to store the received message entirely, otherwise the receive ends with a fault (buffer overflow);
• status – exchange status.
If the received message is smaller than the buffer, only part of the receive buffer is updated.
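A minimal illustrative exchange in C (not part of the original text): process 0 sends one integer to process 1; the program must be run with at least two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);        /* dest = 1, tag = 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}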
Getting the size of a received message (count)
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERR)
The type given in the datatype argument has to be the same as the one indicated in the receive call.
Synchronous send
int MPI_Ssend(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR)
Parameters of this subprogram are the same as in MPI_Send.
Buffered send
int MPI_Bsend(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
MPI_BSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR)
Parameters of this subprogram are the same as in MPI_Send.
Buffer attachment
int MPI_Buffer_attach(void *buf, int size)
MPI_BUFFER_ATTACH(BUF, SIZE, IERR)
Input parameters:
• buf – buffer; its size is size bytes.
In Fortran the buffer is a variable or an array. At any time only one buffer may be attached to a process.
Buffer detachment
int MPI_Buffer_detach(void *buf, int *size)
MPI_BUFFER_DETACH(BUF, SIZE, IERR)
Output parameters:
• buf address of the buffer;
• size size of the detached buffer.
A call of this subprogram blocks process execution until all messages in the buffer have been handled. In C this call does not free the buffer's memory.
Ready send
int MPI_Rsend(void *buf, int count, MPI_Datatype
datatype, int dest, int tag, MPI_Comm comm)
MPI_RSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR)
Parameters of this subprogram are the same as in MPI_Send.
Blocking test of message delivery
int MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_PROBE(SOURCE, TAG, COMM, STATUS, IERR)
Input parameters:
• source – source rank or joker;
• tag – tag or joker;
• comm – communicator.
Output parameter:
• status – status of the operation.
Nonblocking test of message delivery
int MPI_Iprobe(int source, int tag, MPI_Comm comm, int
*flag, MPI_Status *status)
MPI_IPROBE(SOURCE, TAG, COMM, FLAG, STATUS, IERR)
Input parameters of this subprogram are the same as in MPI_Probe.
Output parameters:
• flag – flag;
• status – status.
If a message has been delivered, the value true is returned in flag.
Blocking send and receive
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype
sendtype, int dest, int sendtag, void *recvbuf, int
recvcount, MPI_Datatype recvtype, int source, int recvtag,
MPI_Comm comm, MPI_Status *status)
MPI_SENDRECV(SENDBUF, SENDCOUNT, SENDTYPE, DEST, SENDTAG,
RECVBUF, RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS,
IERR)
Input parameters:
• sendbuf – address of the send buffer;
• sendcount – number of elements which have to be sent;
• sendtype – data type of the elements which have to be sent;
• dest – rank of the target;
• sendtag – tag of the message which has to be sent;
• recvbuf – address of the receive buffer;
• recvcount – number of elements which have to be received;
• recvtype – data type of the elements which have to be received;
• source – rank of the source;
• recvtag – tag of the message which has to be received;
• comm – communicator.
Output parameter:
• status – status of the receive operation.
The receive and send operations use the same communicator. The send and receive buffers must not overlap. The buffers may have different sizes, and the data types of the data being sent and received may also be different.
Blocking send and receive with common buffer for send and receive
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype
datatype, int dest, int sendtag, int source, int recvtag,
MPI_Comm comm, MPI_Status *status)
MPI_SENDRECV_REPLACE(BUF, COUNT, DATATYPE, DEST, SENDTAG,
SOURCE, RECVTAG, COMM, STATUS, IERR)
Input parameters:
• count – number of elements to be sent and size of the receive buffer;
• datatype – type of the data in the receive and send buffer;
• dest – rank of the target;
• sendtag – tag of the message to be sent;
• source – rank of the source;
• recvtag – tag of the message to be received;
• comm – communicator.
Output parameters:
• buf – address of the send and receive buffer;
• status – status of the receive.
The message to be received must not be larger (in size) than the message being sent. The data types of the elements in send and receive have to be the same. The order of send and receive is chosen automatically by the system.
Initialization of nonblocking standard send
int MPI_Isend(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST,
IERR)
Input parameters of this subprogram are the same as in MPI_Send.
Output parameter:
• request – identifier of the operation.
Initialization of nonblocking synchronous standard send
int MPI_Issend(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_ISSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST,
IERR)
Parameters of this subprogram are the same as in MPI_Send.
Nonblocking send with bufferization
int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IBSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST,
IERR)
Nonblocking ready send
int MPI_Irsend(void* buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IRSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST,
IERR)
Parameters of nonblocking send subprograms are the same as in previously described
subprograms.
Initialization of nonblocking receive
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST,
IERR)
Parameters in this subprogram are the same as in previously described subprograms
with the exception of source which is rank of a source process.
Blocking of a process execution until receive or send is completed
int MPI_Wait(MPI_Request *request, MPI_Status *status)
MPI_WAIT(REQUEST, STATUS, IERR)
Input parameter:
• request – identifier of the message passing operation.
Output parameter:
• status – status of the completed operation.
The status value of a send operation may be obtained by means of MPI_Test_cancelled. The subprogram MPI_Wait may be called with a null or inactive request parameter; in this case the operation completes immediately with an empty status.
Successful completion of MPI_Wait after a call of MPI_Ibsend means that the send buffer may be used again and the data have been sent or copied into the buffer attached by a call of MPI_Buffer_attach. The send cannot be cancelled once the buffer is attached. If the matching receive is not registered, the buffer might not be released; in this case MPI_Cancel releases the memory allocated for the communication subsystem.
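An illustrative C fragment (not part of the original text) in which each of two processes posts a nonblocking receive, then sends, and finally waits for the receive to complete; posting the receives first avoids a deadlock:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, incoming = -1, outgoing;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    outgoing = rank * 10;

    if (rank == 0 || rank == 1) {
        int partner = 1 - rank;
        MPI_Irecv(&incoming, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &request);
        MPI_Send(&outgoing, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
        MPI_Wait(&request, &status);          /* block until the message has arrived */
        printf("process %d got %d\n", rank, incoming);
    }

    MPI_Finalize();
    return 0;
}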
Nonblocking check of message receive or send completion
int MPI_Test(MPI_Request *request, int *flag, MPI_Status
*status)
MPI_TEST(REQUEST, FLAG, STATUS, IERR)
Input parameter:
• request – identifier of the message passing operation.
Output parameters:
• flag – true if the operation associated with the request identifier is completed;
• status – status of the completed operation.
If the call of MPI_Test uses a null or inactive request parameter, the operation returns the flag value true and an empty status.
Test of completion of all exchanges
int MPI_Waitall(int count, MPI_Request requests[],
MPI_Status statuses[])
MPI_WAITALL(COUNT, REQUESTS, STATUSES, IERR)
Execution of the process is blocked until all exchanges associated with active requests in the array requests are completed. The statuses of all operations are returned in the array statuses. count is the number of exchange requests (the size of the arrays requests and statuses).
As a result of MPI_Waitall execution, requests generated by nonblocking exchange operations are released and the corresponding array elements get the value MPI_REQUEST_NULL. The list may include null or inactive requests; each such request gets an empty status value.
If one or more exchanges failed, MPI_Waitall returns the exit code MPI_ERR_IN_STATUS and assigns an error code to the error field of the status of the corresponding operation. If an exchange was successful, the field gets the value MPI_SUCCESS; if an exchange has not yet completed, the field gets the value MPI_ERR_PENDING, which corresponds to a request still waiting for processing.
Nonblocking test of exchanges completion
int MPI_Testall(int count, MPI_Request requests[], int
*flag, MPI_Status statuses[])
MPI_TESTALL(COUNT, REQUESTS, FLAG, STATUSES, IERR)
On return, flag gets the value true if all exchanges associated with active requests in the array requests are completed. If only part of the exchanges is completed, flag gets the value false and the array statuses is undefined. count is the number of requests.
Each status which corresponds to an existing active request gets the status of the corresponding exchange. If a request was issued by a nonblocking exchange operation, it is released and the corresponding array element gets the value MPI_REQUEST_NULL. Each status which corresponds to a null or inactive request gets an empty value.
Blocking test of completion of arbitrary number of exchanges
int MPI_Waitany(int count, MPI_Request requests[], int
*index, MPI_Status *status)
MPI_WAITANY(COUNT, REQUESTS, INDEX, STATUS, IERR)
Execution of the process is blocked until at least one exchange from the array requests is completed.
Input parameters:
• requests – array of requests;
• count – number of elements in the array requests.
Output parameters:
• index – index of the request in the array requests (in C an integer from 0 to count – 1, in Fortran an integer from 1 to count);
• status – exchange status.
If the request was issued by a nonblocking exchange operation, it is released and the corresponding array element gets the value MPI_REQUEST_NULL. The array of requests may include null or inactive requests. If the list does not include active requests or is empty, the subroutine completes immediately with index equal to MPI_UNDEFINED and an empty status.
Test of completion of any previously initialized exchange
int MPI_Testany(int count, MPI_Request requests[], int
*index, int *flag, MPI_Status *status)
MPI_TESTANY(COUNT, REQUESTS, INDEX, FLAG, STATUS, IERR)
The arguments of this subprogram are the same as in MPI_Waitany. The extra argument flag gets the value true if one of the operations is completed. The blocking subprogram MPI_Waitany and the nonblocking subprogram MPI_Testany are interchangeable, as are other similar pairs of subprograms.
The subprograms MPI_Waitsome and MPI_Testsome work similarly to MPI_Waitany and MPI_Testany, except for the case when two or more exchanges are completed. In MPI_Waitany and MPI_Testany one exchange from the list of completed exchanges is chosen arbitrarily and its status is returned, whereas MPI_Waitsome and MPI_Testsome return the statuses of all completed exchanges. These subprograms may be used to determine how many exchanges are completed:
int MPI_Waitsome(int incount, MPI_Request requests[], int
*outcount, int indices[], MPI_Status statuses[])
MPI_WAITSOME(INCOUNT, REQUESTS, OUTCOUNT, INDICES, STATUSES,
IERR)
Here incount is the number of requests. The number of completed requests from the array requests is returned in outcount. The indices of these operations are returned in the first outcount elements of the array indices, and the statuses of the completed operations are returned in the first outcount elements of the array statuses. If a completed request was issued by a nonblocking exchange operation, it is released. If the list does not include active requests, execution of the subprogram completes immediately and the parameter outcount gets the value MPI_UNDEFINED.
Nonblocking check of exchange completion
int MPI_Testsome(int incount, MPI_Request requests[], int
*outcount, int indices[], MPI_Status statuses[])
MPI_TESTSOME(INCOUNT, REQUESTS, OUTCOUNT, INDICES, STATUSES,
IERR)
The arguments are the same as in the subprogram MPI_Waitsome. MPI_Testsome is more efficient than MPI_Testany because the former returns information about all operations in one call, while the latter requires a new call for each completed operation.
Creation of request for standard send
int MPI_Send_init(void *buf, int count, MPI_Datatype
datatype, int dest, int tag, MPI_Comm comm, MPI_Request
*request)
MPI_SEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM,
REQUEST, IERR)
Input parameters:
• buf – address of the send buffer;
• count – number of elements which have to be sent;
• datatype – type of the elements;
• dest – target rank;
• tag – message tag;
• comm – communicator.
Output parameter:
• request – request for the exchange operation.
Initialization of pending exchange
int MPI_Start(MPI_Request *request)
MPI_START(REQUEST, IERR)
Input parameter:
• request – request for the exchange operation.
A call of MPI_Start with a request which was created by MPI_Send_init initializes an exchange with the same properties as an exchange performed by MPI_Isend. A call of MPI_Start with a request which was created by MPI_Bsend_init initializes an exchange with the same properties as one performed by MPI_Ibsend. A message passed by means of an operation initialized by MPI_Start may be received by any receive subprogram.
Initialization of exchanges associated with requests (in array requests) for
execution of nonblocking exchange operation
int MPI_Startall(int count, MPI_Request *requests)
MPI_STARTALL(COUNT, REQUESTS, IERR)
Cancelling of pending nonblocking exchanges
int MPI_Cancel(MPI_Request *request)
MPI_CANCEL(REQUEST, IERR)
MPI_Cancel may be used to cancel exchanges which use both pending and ordinary requests. After a call of MPI_Cancel and subsequent calls of MPI_Wait or MPI_Test, the request for the exchange operation becomes inactive and may be reactivated for a new exchange. Information about the cancelled operation is placed in status.
Check if exchange associated with a given status is cancelled
int MPI_Test_cancelled(MPI_Status *status, int *flag)
MPI_TEST_CANCELLED(STATUS, FLAG, IERR)
Releasing a request (request) for an exchange operation
int MPI_Request_free(MPI_Request *request)
MPI_REQUEST_FREE(REQUEST, IERR)
A call of this subprogram marks the exchange request for release and assigns it the value MPI_REQUEST_NULL. An exchange operation associated with this request is allowed to complete; the request is released only when the exchange is completed.
Collective exchange operations
Collective exchanges involve two or more processes.
Broadcast send (fig. 16)
int MPI_Bcast(void *buffer, int count, MPI_Datatype
datatype, int root, MPI_Comm comm)
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERR)
The arguments of this subprogram are input and output at the same time:
• buffer – address of the buffer;
• count – number of elements which have to be sent/received;
• datatype – MPI data type;
• root – rank of the process which broadcasts the data;
• comm – communicator.
Fig. 16. Broadcast send operation
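An illustrative C fragment (not part of the original text): process 0 broadcasts an array of parameters to all processes of MPI_COMM_WORLD:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double params[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 /* only the root fills the buffer */
        params[0] = 1.0; params[1] = 2.5; params[2] = -3.0;
    }
    /* Every process, the root included, calls MPI_Bcast with the same arguments. */
    MPI_Bcast(params, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("process %d has params %.1f %.1f %.1f\n",
           rank, params[0], params[1], params[2]);
    MPI_Finalize();
    return 0;
}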
Barrier synchronization (fig. 17)
int MPI_Barrier(MPI_Comm comm)
MPI_BARRIER(COMM, IERR)
Fig. 17. Barrier synchronization
Data scattering (fig. 18)
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype
sendtype, void *rcvbuf, int rcvcount, MPI_Datatype rcvtype,
int root, MPI_Comm comm)
MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF, RCVCOUNT,
RCVTYPE, ROOT, COMM, IERR)
Input parameters:
• sendbuf – address of the send buffer;
• sendcount – number of elements which have to be sent to each process (not the total number of elements to be sent);
• sendtype – data type of the elements which have to be sent;
• rcvcount – number of elements which have to be received;
• rcvtype – data type of the elements which have to be received;
• root – rank of the sending process;
• comm – communicator.
Output parameter:
• rcvbuf – address of the receive buffer.
Process with rank root distributes the contents of send buffer sendbuf among all
processes. The content of the buffer is divided into several parts, each of which consists of
sendcount elements. The first part goes to process 0, the second part to process 1 and so
on. The send arguments have meaning only on the root process.
51
Fig. 18. Data scattering
Gathering of messages (fig. 19)
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype
sendtype, void *rcvbuf, int rcvcount, MPI_Datatype rcvtype,
int root, MPI_Comm comm)
MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF, RCVCOUNT,
RCVTYPE, ROOT, COMM, IERR)
Each process in communicator comm sends its buffer sendbuf to the process with rank
root. The root process merges the received data so that data from process 0 are followed
by data from process 1, then by data from process 2 and so on. Arguments rcvbuf,
rcvcount and rcvtype have meaning only on the root process. Argument rcvcount
is equal to the number of data items received from each process (not the total number).
When subprograms MPI_Scatter and MPI_Gather are called, the same root process
must be used in all processes.
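A minimal sketch of a scatter/gather pair; the idea of "one element of work per process" and the local computation are illustrative assumptions:

/* Root scatters one double to each process and gathers the results back. */
#include <mpi.h>
#include <stdlib.h>

void scatter_gather_example(void)
{
    int rank, size;
    double *sendbuf = NULL, *results = NULL, part, result;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        sendbuf = (double *) malloc(size * sizeof(double));
        results = (double *) malloc(size * sizeof(double));
        /* fill sendbuf here */
    }
    /* each process receives one element of sendbuf */
    MPI_Scatter(sendbuf, 1, MPI_DOUBLE, &part, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    result = 2.0 * part;                       /* some local work            */
    /* the root collects one element from each process */
    MPI_Gather(&result, 1, MPI_DOUBLE, results, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) { free(sendbuf); free(results); }
}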
52
Fig. 19. Data gathering
Vector data scattering
int MPI_Scatterv(void *sendbuf, int *sendcounts, int
*displs, MPI_Datatype sendtype, void *rcvbuf, int rcvcount,
MPI_Datatype rcvtype, int root, MPI_Comm comm)
MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RCVBUF,
RCVCOUNT, RCVTYPE, ROOT, COMM, IERR)
Input parameters:
• sendbuf address of the send buffer;
• sendcounts 1-dimensional integer array which contains the number of elements to be sent to each process (index is equal to rank of a target process). Its size is equal to the number of processes in the communicator;
• displs 1-dimensional integer array. Its size is equal to the number of processes in the communicator. The element with index i sets the displacement, relative to the beginning of the send buffer, of the data sent to the process with rank i;
• sendtype data types of elements which have to be sent;
• rcvcount number of elements which have to be received;
• rcvtype data types of elements which have to be received;
• root rank of source process;
• comm communicator.
Output parameter:
• rcvbuf address of the receive buffer.
53
Gathering of data from all processes in a communicator and writing in receive buffer
with given displacement
int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype
sendtype, void *recvbuf, int *recvcounts, int *displs,
MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_GATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF,
RECVCOUNTS, DISPLS, RECVTYPE, ROOT, COMM, IERR)
Arguments of this subprogram are the same as in subprogram MPI_Scatterv.
Exchanges performed by subprograms MPI_Allgather and MPI_Alltoall
have no root process.
Gathering of data from all processes and scattering to all processes
int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype
sendtype, void *rcvbuf, int rcvcount, MPI_Datatype rcvtype,
MPI_Comm comm)
MPI_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF,
RCVCOUNT, RCVTYPE, COMM, IERR)
Input parameters:
• sendbuf address of the send buffer;
• sendcount number of elements which have to be sent;
• sendtype data types of elements which have to be sent;
• rcvcount number of elements which have to be received from each process;
• rcvtype data types of elements which have to be received;
• comm communicator.
Output parameter:
• rcvbuf address of the receive buffer.
A chunk of data sent from the j-th process is received by each process, which places it in the j-th block of the receive buffer rcvbuf.
Send "each to all"
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype
sendtype, void *rcvbuf, int rcvcount, MPI_Datatype rcvtype,
MPI_Comm comm)
MPI_ALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF, RCVCOUNT,
RCVTYPE, COMM, IERR)
Input parameters:
• sendbuf address of the send buffer;
• sendcount number of elements which have to be sent to each process;
54
• sendtype data types of elements which have to be sent;
• rcvcount number of elements which have to be received;
• rcvtype data types of elements which have to be received;
• comm communicator.
Output parameter:
• rcvbuf address of the receive buffer.
Subprograms MPI_Allgatherv and MPI_Alltoallv are vector counterparts of
subprograms MPI_Allgather and MPI_Alltoall.
Gathering data from all processes and sending to all processes
int MPI_Allgatherv(void *sendbuf, int sendcount,
MPI_Datatype sendtype, void *rcvbuf, int *rcvcounts, int
*displs, MPI_Datatype rcvtype, MPI_Comm comm)
MPI_ALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF,
RCVCOUNTS, DISPLS, RCVTYPE, COMM, IERR)
Arguments of this subprogram are the same as in subprogram MPI_Allgather.
The only exception is the input argument displs. It is a 1-dimensional integer array
whose size is equal to the number of processes in the communicator. The element with
index i gives the displacement, relative to the beginning of receive buffer rcvbuf, at
which the data received from process i are placed. A chunk of data sent from the j-th
process is received by each process and is placed in the j-th block of the receive buffer.
All-to-all send with displacement
int MPI_Alltoallv(void *sendbuf, int *sendcounts, int
*sdispls, MPI_Datatype sendtype, void *rcvbuf, int
*rcvcounts, int *rdispls, MPI_Datatype rcvtype, MPI_Comm
comm)
MPI_ALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE,
RCVBUF, RCVCOUNTS, RDISPLS, RCVTYPE, COMM, IERR)
Arguments of this subprogram are the same as in subprogram MPI_Alltoall. The
only exceptions are the arguments:
• sdispls 1-dimensional integer array. Its size is equal to the number of processes in the communicator. The j-th element gives the displacement, relative to the beginning of the send buffer, of the data sent to the j-th process;
• rdispls 1-dimensional integer array. Its size is equal to the number of processes in the communicator. The i-th element gives the displacement, relative to the beginning of the receive buffer, at which the message from the i-th process is placed.
55
Reduction (fig. 20)
int MPI_Reduce(void *buf, void *result, int count,
MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
MPI_REDUCE(BUF, RESULT, COUNT, DATATYPE, OP, ROOT, COMM,
IERR)
Input parameters:
• buf address of the send buffer;
• count number of elements which have to be sent;
• datatype type of data to be sent;
• op reduction operation;
• root rank of master process;
• comm communicator.
MPI_Reduce applies the reduction operation op to operands from buf and places the
result in the buffer result on the root process. MPI_Reduce has to be called by all
processes in communicator comm, and the arguments count, datatype and op must
be the same in all calls.
Fig. 20. Reduction operation
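A minimal sketch of a typical reduction; the function local_work and the choice of process 0 as root are illustrative assumptions:

double partial, total = 0.0;
partial = local_work();                 /* hypothetical local computation */
MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
/* total contains the global sum, but only on process 0 */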
Predefined reduction operations are listed in table 3.
56
Table 3. Predefined reduction operations in MPI
Operation     Description
MPI_MAX       Maximum value of 1-dimensional integer or real array
MPI_MIN       Minimum value of 1-dimensional integer or real array
MPI_SUM       Sum of elements of 1-dimensional integer, real or complex array
MPI_PROD      Product of elements of 1-dimensional integer, real or complex array
MPI_LAND      Logical AND
MPI_BAND      Bitwise AND
MPI_LOR       Logical OR
MPI_BOR       Bitwise OR
MPI_LXOR      Logical exclusive OR
MPI_BXOR      Bitwise exclusive OR
MPI_MAXLOC    Maximum value of 1-dimensional integer or real array and its index
MPI_MINLOC    Minimum value of 1-dimensional integer or real array and its index
Definition of user global operation
int MPI_Op_create(MPI_User_function *function, int commute,
MPI_Op *op)
MPI_OP_CREATE(FUNCTION, COMMUTE, OP, IERR)
Input parameters:
• function user-defined function;
• commute has value true if operation is commutative (result is independent of operands order).
Definition of user function in C has the following form:
typedef void (MPI_User_function)(void *a, void *b, int *len,
MPI_Datatype *dtype)
Operation is defined as follows:
b[I] = a[I] op b[I]
for I = 0, …, len – 1.
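A minimal sketch of a user-defined reduction operation; the use of MPI_DOUBLE operands, the buffer names and the element count n are illustrative assumptions:

/* Commutative user operation: elementwise product of double arrays.
   The dtype argument is ignored here, MPI_DOUBLE is assumed. */
void elementwise_prod(void *a, void *b, int *len, MPI_Datatype *dtype)
{
    int i;
    double *x = (double *) a, *y = (double *) b;
    for (i = 0; i < *len; i++)
        y[i] = x[i] * y[i];                    /* b[i] = a[i] op b[i] */
}

/* ... inside the program (sendbuf, recvbuf and n assumed declared) ... */
MPI_Op myprod;
MPI_Op_create(elementwise_prod, 1, &myprod);   /* 1 means commutative */
MPI_Reduce(sendbuf, recvbuf, n, MPI_DOUBLE, myprod, 0, MPI_COMM_WORLD);
MPI_Op_free(&myprod);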
57
Deletion of user function
int MPI_Op_free(MPI_Op *op)
MPI_OP_FREE(OP, IERR)
When this call is completed op gets value MPI_OP_NULL.
Simultaneous gathering and scattering
int MPI_Reduce_scatter(void *sendbuf, void *rcvbuf, int
*rcvcounts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
MPI_REDUCE_SCATTER(SENDBUF, RCVBUF, RCVCOUNTS, DATATYPE, OP,
COMM, IERR)
Input parameters:
• sendbuf address of the send buffer;
• rcvcounts 1-dimensional integer array which contains the number of elements of the resulting array sent to each process. This array must be the same in all processes which call this subprogram;
• datatype type of data to be received;
• op operation;
• comm communicator.
Output parameter:
• rcvbuf address of the receive buffer.
Each task receives only one chunk of resulting array.
Gathering and writing of result of reduction operation at receive buffer of each
process
int MPI_Allreduce(void *sendbuf, void *rcvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
MPI_ALLREDUCE(SENDBUF, RCVBUF, COUNT, DATATYPE, OP, COMM,
IERR)
Input parameters:
• sendbuf address of the send buffer;
• count number of elements which have to be sent;
• datatype type of data to be sent;
• op reduction operation;
• comm communicator.
Output parameter:
• rcvbuf address of the receive buffer.
58
In case of failure this subprogram may return the error code MPI_ERR_OP (incorrect
operation). This takes place if an operation is used which is neither predefined nor
created by a preceding call of MPI_Op_create.
Partial reduction
int MPI_Scan(void *sendbuf, void *rcvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
MPI_SCAN(SENDBUF, RCVBUF, COUNT, DATATYPE, OP, COMM, IERR)
Input parameters:
• sendbuf address of the send buffer;
• count number of elements in receive buffer;
• datatype type of data to be received;
• op operation;
• comm communicator.
Output parameter:
• rcvbuf address of the receive buffer.
Operations with communicators
Standard communicator MPI_COMM_WORLD is created automatically at the start of
parallel program. Other standard communicators:
• MPI_COMM_SELF communicator which includes only calling process;
• MPI_COMM_NULL null (empty) communicator.
Getting access to group which is associated with communicator comm
int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
MPI_COMM_GROUP(COMM, GROUP, IERR)
Output parameter group. Any operation with a group is possible only after this
call.
Creation of new group newgroup which includes n processes from old group
oldgroup
int MPI_Group_incl(MPI_Group oldgroup, int n, int *ranks,
MPI_Group *newgroup)
MPI_GROUP_INCL(OLDGROUP, N, RANKS, NEWGROUP, IERR)
Ranks of processes are placed in array ranks. New group will include processes
with ranks ranks[0], …, ranks[n — 1]. Rank i in new group corresponds to
rank ranks[i] in old group. If n = 0 empty group MPI_GROUP_EMPTY is created.
59
This subprogram makes it possible not only to create new group but to change order
of processes in existing group.
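A minimal sketch of group manipulation: a communicator built from the even-ranked processes of MPI_COMM_WORLD. The "even ranks" selection is an illustrative assumption.

#include <mpi.h>
#include <stdlib.h>

MPI_Comm even_comm(void)
{
    MPI_Group world_group, even_group;
    MPI_Comm  newcomm;
    int size, i, n, *ranks;

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    n = (size + 1) / 2;                          /* number of even ranks */
    ranks = (int *) malloc(n * sizeof(int));
    for (i = 0; i < n; i++) ranks[i] = 2 * i;    /* 0, 2, 4, ...         */

    MPI_Group_incl(world_group, n, ranks, &even_group);
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &newcomm);
    /* processes with odd rank get MPI_COMM_NULL here */

    free(ranks);
    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    return newcomm;
}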
Creation of new group newgroup by means of exclusion from old group (group)
of processes with ranks ranks[0], …, ranks[n — 1]
int MPI_Group_excl(MPI_Group oldgroup, int n, int *ranks,
MPI_Group *newgroup)
MPI_GROUP_EXCL(OLDGROUP, N, RANKS, NEWGROUP, IERR)
If n = 0 new group is identical to old group.
Creation of new group newgroup from old group oldgroup by including n ranges of
process ranks described by array ranks
int MPI_Group_range_incl(MPI_Group oldgroup, int n, int
ranks[][3], MPI_Group *newgroup)
MPI_GROUP_RANGE_INCL(OLDGROUP, N, RANKS, NEWGROUP, IERR)
Array ranks consists of integer triplets (first_1, last_1, step_1), …,
(first_n, last_n, step_n). New group includes processes with ranks (in old
group) first_1, first_1 + step_1, ….
Creation of new group newgroup from group group by excluding n ranges of
process ranks described by array ranks
int MPI_Group_range_excl(MPI_Group group, int n, int
ranks[][3], MPI_Group *newgroup)
MPI_GROUP_RANGE_EXCL(GROUP, N, RANKS, NEWGROUP, IERR)
Array ranks has the same structure as similar array in subprogram
MPI_Group_range_incl.
Creation of new group newgroup as the difference of groups group1 and
group2
int MPI_Group_difference(MPI_Group group1, MPI_Group group2,
MPI_Group *newgroup)
MPI_GROUP_DIFFERENCE(GROUP1, GROUP2, NEWGROUP, IERR)
Creation of new group newgroup from intersection of groups group1 and
group2
int MPI_Group_intersection(MPI_Group group1, MPI_Group
group2, MPI_Group *newgroup)
MPI_GROUP_INTERSECTION(GROUP1, GROUP2, NEWGROUP, IERR)
60
Creation of new group newgroup from union of groups group1 and group2
int MPI_Group_union(MPI_Group group1, MPI_Group group2,
MPI_Group *newgroup)
MPI_GROUP_UNION(GROUP1, GROUP2, NEWGROUP, IERR)
There are other constructors of new groups.
Deletion of group group
int MPI_Group_free(MPI_Group *group)
MPI_GROUP_FREE(GROUP, IERR)
Getting of number of processes size in group group
int MPI_Group_size(MPI_Group group, int *size)
MPI_GROUP_SIZE(GROUP, SIZE, IERR)
Getting of rank rank of process in group group
int MPI_Group_rank(MPI_Group group, int *rank)
MPI_GROUP_RANK(GROUP, RANK, IERR)
If process is not included in the group this subprogram returns MPI_UNDEFINED.
Transformation of process rank in one group to its rank in other group
int MPI_Group_translate_ranks(MPI_Group group1, int n, int
*ranks1, MPI_Group group2, int *ranks2)
MPI_GROUP_TRANSLATE_RANKS(GROUP1, N, RANKS1, GROUP2, RANKS2,
IERR)
Comparison of groups group1 and group2
int MPI_Group_compare(MPI_Group group1, MPI_Group group2,
int *result)
MPI_GROUP_COMPARE(GROUP1, GROUP2, RESULT, IERR)
Returns MPI_IDENT if both groups are identical. Returns MPI_SIMILAR if
processes in both groups are the same but their ranks differ. Returns MPI_UNEQUAL
if groups include at least one pair of different processes.
Reduplication of existing communicator oldcomm
int MPI_Comm_dup(MPI_Comm oldcomm, MPI_Comm *newcomm)
MPI_COMM_DUP(OLDCOMM, NEWCOMM, IERR)
This call creates a new communicator newcomm with the same group of processes and
the same attributes as the initial communicator but with a different context of exchanges.
It may be applied both to intra- and intercommunicators.
61
Creation of new communicator newcomm from subset of processes group of other
communicator oldcomm
int MPI_Comm_create(MPI_Comm oldcomm, MPI_Group group,
MPI_Comm *newcomm)
MPI_COMM_CREATE(OLDCOMM, GROUP, NEWCOMM, IERR)
This call must be performed by all processes of initial communicator. Arguments
have to be the same. If several communicators are created simultaneously they must
be created in the same order by all processes.
Creation of several communicators by splitting of a given communicator
int MPI_Comm_split(MPI_Comm oldcomm, int split, int rank,
MPI_Comm* newcomm)
MPI_COMM_SPLIT(OLDCOMM, SPLIT, RANK, NEWCOMM, IERR)
The group of processes associated with communicator oldcomm is split into
nonintersecting subgroups, one subgroup for each value of split. Processes with the
same value of split form a new group. The rank in the new group is defined by the
value of rank. If processes A and B call MPI_Comm_split with the same value of
split and the argument rank passed by process A is less than the value passed by
process B, then the rank of A in the group corresponding to the new communicator will
be less than the rank of process B. If the calls use the same value of rank, the system
assigns ranks arbitrarily. For each group its own communicator newcomm is created.
MPI_Comm_split has to be called by all processes of the initial communicator, even
those which will not be included in any new communicator. Such processes must pass
the predefined named constant MPI_UNDEFINED as the value of argument split;
they get MPI_COMM_NULL as the new communicator. New communicators created by
one call of MPI_Comm_split do not intersect, but by repeated calls of
MPI_Comm_split it is possible to create intersecting communicators as well.
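A minimal sketch of a typical split; the grouping by rows of width ncols = 4 is an illustrative assumption:

int rank, ncols = 4;
MPI_Comm row_comm;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* processes with the same value of rank / ncols end up in one communicator */
MPI_Comm_split(MPI_COMM_WORLD, rank / ncols, rank, &row_comm);
/* within row_comm processes are ordered by their old rank */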
Marking communicator comm to be deleted
int MPI_Comm_free(MPI_Comm *comm)
MPI_COMM_FREE(COMM, IERR)
Exchanges associated with this communicator are completed as usual, and the
communicator is deleted only when there are no more active references to it. This
operation may be applied both to intra- and intercommunicators.
Comparison of communicators (comm1) and (comm2)
int MPI_Comm_compare(MPI_Comm comm1, MPI_Comm comm2, int
*result)
MPI_COMM_COMPARE(COMM1, COMM2, RESULT, IERR)
Output parameter:
62
• result integer value which is equal to MPI_IDENT if both the contexts and the groups
associated with the communicators coincide; MPI_CONGRUENT if only the groups
coincide; MPI_SIMILAR or MPI_UNEQUAL if neither the groups nor the contexts are the
same.
Empty communicator MPI_COMM_NULL may not be used as argument.
Assigning to communicator comm a string name name
int MPI_Comm_set_name(MPI_Comm com, char *name)
MPI_COMM_SET_NAME(COM, NAME, IERR)
Getting name of communicator
int MPI_Comm_get_name(MPI_Comm comm, char *name, int
*reslen)
MPI_COMM_GET_NAME(COMM, NAME, RESLEN, IERR)
Output parameters:
• name name of communicator comm;
• reslen length of the name.
Name is an array of characters. Its size must be greater than MPI_MAX_NAME_STRING.
Check if communicator comm (input parameter) is an intercommunicator
int MPI_Comm_test_inter(MPI_Comm comm, int *flag)
MPI_COMM_TEST_INTER(COMM, FLAG, IERR)
Output parameter:
• flag is true if communicator is an intercommunicator.
Creation of intracommunicator newcomm from intercommunicator oldcomm
int MPI_Intercomm_merge(MPI_Comm oldcomm, int high, MPI_Comm
*newcomm)
MPI_INTERCOMM_MERGE(OLDCOMM, HIGH, NEWCOMM, IERR)
Argument high is used to define the order in which the two groups of the
intercommunicator are united into the new intracommunicator.
Getting access to remote group associated with intercommunicator comm
int MPI_Comm_remote_group(MPI_Comm comm, MPI_Group *group)
MPI_COMM_REMOTE_GROUP(COMM, GROUP, IERR)
Output parameter:
63
• group remote group.
Getting size of remote group which is associated with intercommunicator comm
int MPI_Comm_remote_size(MPI_Comm comm, int *size)
MPI_COMM_REMOTE_SIZE(COMM, SIZE, IERR)
Output parameter:
• size number of processes in the remote group associated with intercommunicator comm.
Creation of intercommunicator
int MPI_Intercomm_create(MPI_Comm local_comm, int
local_leader, MPI_Comm peer_comm, int remote_leader, int
tag, MPI_Comm *new_intercomm)
MPI_INTERCOMM_CREATE(LOCAL_COMM, LOCAL_LEADER, PEER_COMM,
REMOTE_LEADER, TAG, NEW_INTERCOMM, IERR)
Input parameters:
• local_comm local intracommunicator;
• local_leader rank of leader in local communicator (usually 0);
• peer_comm remote communicator;
• remote_leader rank of leader in remote communicator (usually 0);
• tag tag of intercommunicator which is used by leaders of both groups for
exchanges using context of parent communicator.
Output parameter:
• new_intercomm intercommunicator.
Wildcards must not be used as arguments. This call has to be performed in both groups of
processes which have to be connected with each other. In each of these calls the local
intracommunicator corresponding to the given group of processes is used. Local and
remote groups should not intersect, otherwise deadlocks could appear.
Virtual topologies
Virtual topologies in MPI make it possible to use more convenient (in some cases)
methods of referencing to processes of a parallel application.
Creation of new communicator comm_cart by supplying initial communicator
comm_old with Cartesian topology (fig. 21)
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims,
int *periods, int reorder, MPI_Comm *comm_cart)
MPI_CART_CREATE(COMM_OLD, NDIMS, DIMS, PERIODS, REORDER,
COMM_CART, IERR)
64
Input parameters:
• comm_old initial communicator;
• ndims dimension of Cartesian grid;
• dims integer array which consists of ndims elements and defines the number of processes along each dimension;
• periods logical array which consists of ndims elements and defines if the grid is periodic (true) along the correspondent dimension;
• reorder logical variable. If it is equal to true the system is allowed to change the order of numeration of processes.
Information about the structure of the Cartesian topology is contained in ndims, dims
and periods. MPI_Cart_create is a collective operation (it must be called by all
processes from the communicator which has to be supplied with the Cartesian topology).
Fig. 21. Cartesian topology
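A minimal sketch of a Cartesian topology; a run with exactly 4 processes and a periodic 2 x 2 grid are illustrative assumptions:

int dims[2] = {2, 2}, periods[2] = {1, 1}, coords[2];
int rank, source, dest;
MPI_Comm cart;

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
MPI_Comm_rank(cart, &rank);
MPI_Cart_coords(cart, rank, 2, coords);        /* my (row, column)           */
MPI_Cart_shift(cart, 0, 1, &source, &dest);    /* neighbours for a shift by 1
                                                  along dimension 0          */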
Getting Cartesian coordinates of process from its rank in group
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims,
int *coords)
MPI_CART_COORDS(COMM, RANK, MAXDIMS, COORDS, IERR)
Input parameters:
• comm communicator which is supplied with Cartesian topology;
• rank rank of a process in comm;
• maxdims number of elements in 1-dimensional array coords in calling program.
Output parameter:
• coords 1-dimensional integer array (consists of ndims elements) which contains Cartesian coordinates of process.
65
Getting of rank of process (rank) from its Cartesian coordinates in communicator
comm
int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)
MPI_CART_RANK(COMM, COORDS, RANK, IERR)
Input parameter:
• coords 1-dimensional integer array (consists of ndims elements) which contains Cartesian coordinates of process.
Both MPI_Cart_rank and MPI_Cart_coords are local operations.
Splitting of communicator comm in subgroups correspondent to Cartesian subgrids of
lower dimension
int MPI_Cart_sub(MPI_Comm comm, int *remain_dims, MPI_Comm
*comm_new)
MPI_CART_SUB(COMM, REMAIN_DIMS, COMM_NEW, IERR)
The i-th element of the array remain_dims defines if the i-th dimension is retained in
the subgrid (true).
Output parameter:
• comm_new communicator which contains the subgrid to which the given process belongs.
Subprogram MPI_Cart_sub may be used only with a communicator supplied with
Cartesian topology.
Getting of information about Cartesian topology associated with communicator comm
int MPI_Cart_get(MPI_Comm comm, int maxdims, int *dims, int
*periods, int *coords)
MPI_CART_GET(COMM, MAXDIMS, DIMS, PERIODS, COORDS, IERR)
Input parameter:
• maxdims number of elements in arrays dims, periods and coords in calling program.
Output parameters:
• dims 1-dimensional integer array which defines number of processes along each dimension;
• periods logical array which consists of ndims elements and defines if grid is periodic (true) along correspondent dimension;
• coords 1-dimensional integer array which contains Cartesian coordinates of the calling process.
66
Getting of rank of process (newrank) in Cartesian topology after reordering
int MPI_Cart_map(MPI_Comm comm_old, int ndims, int *dims,
int *periods, int *newrank)
MPI_CART_MAP(COMM_OLD, NDIMS, DIMS, PERIODS, NEWRANK, IERR)
Input parameters:
• comm communicator;
• ndims dimensionality of Cartesian grid;
• dims integer array which consists of ndims elements and defines number of
processes along each dimension;
• periods logical array which consists of ndims elements and defines if grid
is periodic (true) along correspondent dimension.
If process doesn’t belong to grid subprogram returns value MPI_UNDEFINED.
Getting the source rank (source) from which a message ought to be received and the
destination rank (dest) to which a message should be sent, for a given direction of shift
(direction) and its magnitude (displ)
int MPI_Cart_shift(MPI_Comm comm, int direction, int displ,
int *source, int *dest)
MPI_CART_SHIFT(COMM, DIRECTION, DISPL, SOURCE, DEST, IERR)
For n-dimensional Cartesian grid value of direction has to be in interval from 0
to n – 1.
Getting of dimensionality (ndims) of Cartesian topology which is associated with
communicator comm
int MPI_Cartdim_get(MPI_Comm comm, int *ndims)
MPI_CARTDIM_GET(COMM, NDIMS, IERR)
Creation of new communicator comm_graph which is supplied with graph topology
(fig. 22)
int MPI_Graph_create(MPI_Comm comm, int nnodes, int *index,
int *edges, int reorder, MPI_Comm *comm_graph)
MPI_GRAPH_CREATE(COMM, NNODES, INDEX, EDGES, REORDER,
COMM_GRAPH, IERR)
Input parameters:
• comm initial communicator which is not supplied with topology;
• nnodes number of graph nodes;
• index 1-dimensional integer array which contains degrees of nodes (number of incoming and outgoing arcs);
67
• edges 1-dimensional integer array which contains arcs of the graph;
• reorder true value allows reordering of numeration of processes.
Fig. 22. Graph topology
Getting nodes of graph which are neighbors of given node
int MPI_Graph_neighbors(MPI_Comm comm, int rank, int
maxneighbors, int *neighbors)
MPI_GRAPH_NEIGHBORS(COMM, RANK, MAXNEIGHBORS, NEIGHBORS,
IERR)
Input parameters:
• comm communicator with graph topology;
• rank rank of process in group associated with communicator comm;
• maxneighbors number of elements in array neighbors.
Output parameter:
• neighbors array containing ranks of processes which are neighbors of given process.
Getting number of neighbor nodes (nneighbors) for a given process in a communicator
with graph topology
int MPI_Graph_neighbors_count(MPI_Comm comm, int rank, int
*nneighbors)
MPI_GRAPH_NEIGHBORS_COUNT(COMM, RANK, NNEIGHBORS, IERR)
Input parameters:
• comm communicator;
• rank rank of process which corresponds to the node.
68
Getting information about graph topology associated with communicator comm
int MPI_Graph_get(MPI_Comm comm, int maxindex, int maxedges,
int *index, int *edges)
MPI_GRAPH_GET(COMM, MAXINDEX, MAXEDGES, INDEX, EDGES, IERR)
Input parameters:
• comm communicator;
• maxindex number of elements in array index in calling program;
• maxedges number of elements in array edges in calling program.
Output parameters:
• index 1-dimensional integer array which contains structure of graph (see description of subprogram MPI_Graph_create);
• edges 1-dimensional integer array which contains information about arcs of graph.
Getting rank of process in graph topology after reordering (newrank)
int MPI_Graph_map(MPI_Comm comm, int nnodes, int *index, int
*edges, int *newrank)
MPI_GRAPH_MAP(COMM, NNODES, INDEX, EDGES, NEWRANK, IERR)
Input parameters:
• comm communicator;
• nnodes number of graph nodes;
• index 1-dimensional integer array which contains structure of graph (see description of subprogram MPI_Graph_create);
• edges 1-dimensional integer array which contains information about arcs of graph.
If process does not belong to the graph this subprogram returns MPI_UNDEFINED.
Getting of information on graph topology which is related to communicator comm
int MPI_Graphdims_get(MPI_Comm comm, int *nnodes, int
*nedges)
MPI_GRAPHDIMS_GET(COMM, NNODES, NEDGES, IERR)
Output parameters:
• nnodes number of graph nodes;
• nedges number of graph edges.
69
Getting type of topology (toptype) associated with communicator comm
int MPI_Topo_test(MPI_Comm comm, int *toptype)
MPI_TOPO_TEST(COMM, TOPTYPE, IERR)
Output parameter:
• toptype topology (MPI_CART for Cartesian topology and MPI_GRAPH for
graph topology).
Derived data types
Derived data types of MPI are used to send data whose elements are not contiguous in
memory. A derived type must be created by a call of a constructor and then it has to be
registered (committed). Before the program completes, all derived types should be freed.
Constructor of vector type (fig. 23)
int MPI_Type_vector(int count, int blocklen, int stride,
MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_VECTOR(COUNT, BLOCKLEN, STRIDE, OLDTYPE, NEWTYPE,
IERR)
Input parameters:
• count number of blocks (nonnegative integer);
• blocklen length of a block (number of elements, nonnegative integer);
• stride number of elements between beginning of previous and beginning of
the next block;
• oldtype basic type.
• newtype identifier of a new type.
Initial data must be of the same type.
Fig. 23. Vector derived type
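A minimal sketch of a vector type in use; the matrix size, the column index and the receiving rank are illustrative assumptions:

/* Send one column of a row-major N x N matrix of doubles as one message. */
#define N 8
double a[N][N];
MPI_Datatype column;

MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column); /* N blocks of 1, stride N   */
MPI_Type_commit(&column);
/* send column 2 of a to process 1 (such a process is assumed to exist) */
MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);
MPI_Type_free(&column);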
70
Constructor of vector type
int MPI_Type_hvector(int count, int blocklen, MPI_Aint
stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_HVECTOR(COUNT, BLOCKLEN, STRIDE, OLDTYPE, NEWTYPE,
IERR)
Arguments of this subprogram are the same as in subprogram MPI_Type_vector
except stride value must be given in bytes.
Constructor of structured type
int MPI_Type_struct(int count, int blocklengths[], MPI_Aint
indices[], MPI_Datatype oldtypes[], MPI_Datatype *newtype)
MPI_TYPE_STRUCT(COUNT, BLOCKLENGTHS, INDICES, OLDTYPES,
NEWTYPE, IERR)
Input parameters:
• count number of elements in derived type and number of elements in arrays
oldtypes, indices and blocklengths;
• blocklengths number of elements at each block (array);
• indices displacement of each block in bytes;
• oldtypes type of elements at each block (array).
Output parameter:
• newtype identifier of derived type.
MPI_Aint is the name of a scalar type with the same length as a pointer.
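A minimal sketch of a structured type; the particular struct layout is an illustrative assumption, and the displacements are obtained with MPI_Address (described below):

struct particle { int kind; double x[3]; } p;
int          blocklengths[2] = {1, 3};
MPI_Datatype oldtypes[2]     = {MPI_INT, MPI_DOUBLE};
MPI_Aint     indices[2], base;
MPI_Datatype particle_type;

MPI_Address(&p, &base);                 /* displacements relative to &p      */
MPI_Address(&p.kind, &indices[0]);
MPI_Address(&p.x[0], &indices[1]);
indices[0] -= base;
indices[1] -= base;

MPI_Type_struct(2, blocklengths, indices, oldtypes, &particle_type);
MPI_Type_commit(&particle_type);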
Constructor of indexed type
int MPI_Type_indexed(int count, int blocklens[], int
indices[], MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_INDEXED(COUNT, BLOCKLENS, INDICES, OLDTYPE,
NEWTYPE, IERR)
Input parameters:
• count number of blocks in derived type and number of elements in arrays indices and blocklens;
• blocklens number of elements at each block;
• indices displacements of blocks measured in cells of the basic type (integer array);
• oldtype basic type.
Output parameter:
• newtype identifier of derived type.
71
Constructor of indexed type
int MPI_Type_hindexed(int count, int blocklens[], MPI_Aint
indices[], MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_HINDEXED(COUNT, BLOCKLENS, INDICES, OLDTYPE,
NEWTYPE, IERR)
Displacements indices are given in bytes.
Constructor of derived type with contiguous disposition of elements
int MPI_Type_contiguous(int count, MPI_Datatype oldtype,
MPI_Datatype *newtype)
MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERR)
Input parameters:
• count counter of replicas;
• oldtype basic type.
Output parameter:
• newtype identifier of the new type.
Constructor of indexed type with blocks of equal size
int MPI_Type_create_indexed_block(int count, int
blocklength, int displacements[], MPI_Datatype oldtype,
MPI_Datatype *newtype)
MPI_TYPE_CREATE_INDEXED_BLOCK(COUNT, BLOCKLENGTH,
DISPLACEMENTS, OLDTYPE, NEWTYPE, IERR)
Input parameters:
• count number of blocks in derived type and number of elements in arrays
indices and blocklens;
• blocklength number of elements at each block;
• displacements displacements of blocks measured in units of length of type
oldtype (integer array);
• oldtype basic type.
Output parameter:
• newtype identifier of derived type.
Constructor of derived data type which corresponds to subarray of multidimensional
array
int MPI_Type_create_subarray(int ndims, int *sizes, int
*subsizes, int *starts, int order, MPI_Datatype oldtype,
MPI_Datatype *newtype)
MPI_TYPE_CREATE_SUBARRAY(NDIMS, SIZES, SUBSIZES, STARTS,
ORDER, OLDTYPE, NEWTYPE, IERR)
72
Input parameters:
• ndims dimension of array;
• sizes number of elements having type oldtype at each dimension of the whole array;
• subsizes number of elements having type oldtype at each dimension of the subarray;
• starts initial coordinates of subarray at each dimension;
• order flag which defines the storage order of the array (C or Fortran order);
• oldtype basic type.
Output parameter:
• newtype new type.
Registration of derived type datatype
int MPI_Type_commit(MPI_Datatype *datatype)
MPI_TYPE_COMMIT(DATATYPE, IERR)
Removing of derived type datatype
int MPI_Type_free(MPI_Datatype *datatype)
MPI_TYPE_FREE(DATATYPE, IERR)
Basic (predefined) types may not be removed.
Getting size of the data type datatype in bytes
int MPI_Type_size(MPI_Datatype datatype, int *size)
MPI_TYPE_SIZE(DATATYPE, SIZE, IERR)
Output parameter size.
Getting the extent (in bytes) of a single object of type datatype
int MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent)
MPI_TYPE_EXTENT(DATATYPE, EXTENT, IERR)
Output parameter extent.
Displacements may be given relative to basic address which is contained in constant
MPI_BOTTOM.
Getting address from given location
int MPI_Address(void *location, MPI_Aint *address)
MPI_ADDRESS(LOCATION, ADDRESS, IERR)
73
In C programs this subprogram returns the same address as the & operator (sometimes
this rule may be violated). It is more useful in Fortran programs, because C has its own
tools for getting addresses.
Getting actual parameters used in creation of derived type
int MPI_Type_get_contents(MPI_Datatype datatype, int
max_integers, int max_addresses, int max_datatypes, int
*integers, MPI_Aint *addresses, MPI_Datatype *datatypes)
MPI_TYPE_GET_CONTENTS(DATATYPE, MAX_INTEGERS, MAX_ADDRESSES,
MAX_DATATYPES, INTEGERS, ADDRESSES, DATATYPES, IERR)
Input parameters:
• datatype identifier of derived type;
• max_integers number of elements in array integers;
• max_addresses number of elements in array addresses;
• max_datatypes number of elements in array datatypes.
Output parameters:
• integers contains the integer arguments which were used at creation of the given data type;
• addresses contains the address arguments which were used at creation of the given data type;
• datatypes contains the datatype arguments which were used at creation of the given data type.
Getting low bound of datatype
int MPI_Type_lb(MPI_Datatype datatype, MPI_Aint
*displacement)
MPI_TYPE_LB(DATATYPE, DISPLACEMENT, IERR)
Output parameter:
• displacement — displacement (in bytes) of low bound relative to source.
Getting upper bound of datatype
int MPI_Type_ub(MPI_Datatype datatype, MPI_Aint
*displacement)
MPI_TYPE_UB(DATATYPE, DISPLACEMENT, IERR)
74
Data packing
int MPI_Pack(void *inbuf, int incount, MPI_Datatype
datatype, void *outbuf, int outcount, int *position,
MPI_Comm comm)
MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTCOUNT,
POSITION, COMM, IERR)
When this call is performed incount elements of given type are chosen from input
buffer starting with position.
Input parameters:
• inbuf address of input buffer;
• incount number of input data;
• datatype type of input data;
• outcount size of output buffer in bytes;
• position current position in buffer in bytes;
• comm communicator corresponding to packing message.
Output parameter:
• outbuf address of output buffer.
Data unpacking
int MPI_Unpack(void *inbuf, int insize, int *position, void
*outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm)
MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT,
DATATYPE, COMM, IERR)
Input parameters:
• inbuf address of input buffer;
• insize size of input buffer in bytes;
• position current position in buffer in bytes;
• outcount number of data which must be unpacked;
• datatype type of output data;
• comm communicator corresponding to unpacking message.
Output parameter:
• outbuf address of output buffer.
Getting the memory size (in bytes) which is necessary for packing a message
int MPI_Pack_size(int incount, MPI_Datatype datatype,
MPI_Comm comm, int *size)
75
MPI_PACK_SIZE(INCOUNT, DATATYPE, COMM, SIZE, IERR)
Input parameters:
• incount argument count which was used at packing;
• datatype type of packed data;
• comm communicator.
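A minimal sketch of packing and unpacking; the fixed 256-byte buffer, the message tag and the sender/receiver ranks are illustrative assumptions (MPI_Pack_size is used only to check that the buffer is large enough):

int n = 10, size1, size2, position = 0;
double x = 3.14;
char buffer[256];                       /* assumed large enough             */
MPI_Status status;

MPI_Pack_size(1, MPI_INT,    MPI_COMM_WORLD, &size1);
MPI_Pack_size(1, MPI_DOUBLE, MPI_COMM_WORLD, &size2);
/* size1 + size2 <= 256 is assumed here */

/* sender side */
MPI_Pack(&n, 1, MPI_INT,    buffer, 256, &position, MPI_COMM_WORLD);
MPI_Pack(&x, 1, MPI_DOUBLE, buffer, 256, &position, MPI_COMM_WORLD);
MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);

/* receiver side */
position = 0;
MPI_Recv(buffer, 256, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
MPI_Unpack(buffer, 256, &position, &n, 1, MPI_INT,    MPI_COMM_WORLD);
MPI_Unpack(buffer, 256, &position, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD);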
Attributes
Attributes provide a software developer with an additional mechanism of information
exchange between processes.
Creation of new key keyval for attribute (output parameter)
int MPI_Keyval_create(MPI_Copy_function *copy_fn,
MPI_Delete_function *delete_fn, int *keyval, void
*extra_state)
MPI_KEYVAL_CREATE(COPY_FN, DELETE_FN, KEYVAL, EXTRA_STATE,
IERR)
Keys are unique and are not seen by a programmer though they are explicitly kept as
integer values. A defined key may be used to set attributes and get access to them in
any communicator. Function copy_fn is called when communicator is duplicated
by subprogram MPI_Comm_dup. Function delete_fn is used for removal.
Parameter extra_state sets additional information (state) for copy and delete
functions.
Setting type of function MPI_Copy_function
typedef int MPI_Copy_function(MPI_Comm oldcomm, int keyval,
void *extra_state, void *attribute_val_in, void
*attribute_val_out, int *flag)
SUBROUTINE COPY_FUNCTION(OLDCOMM, KEYVAL, EXTRA_STATE,
ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT, FLAG, IERR)
The copy function is called for each key value in the initial communicator, in arbitrary
order. Each call of the copy function is performed with a key value and the corresponding
attribute. If the returned flag is 0, the attribute is removed from the duplicated
communicator. Otherwise (flag = 1) the new value of the attribute is set to the value
returned in parameter attribute_val_out.
Function copy_fn in C and Fortran may be defined by values
MPI_NULL_COPY_FN or MPI_DUP_FN. MPI_NULL_COPY_FN is function which
doesn’t perform any actions except of returning flag’s value flag = 0 and
MPI_SUCCESS. MPI_DUP_FN is simplest reduplication function. It returns flag’s
76
value flag = 1, attribute’s value by means of attribute_val_out and
completion code MPI_SUCCESS.
Deletion function is similar to copy_fn and may be defined as follows. Function
delete_fn is called when a communicator has to be deleted by means of call
MPI_Comm_free or when MPI_Attr_delete is called. It must be of type
MPI_Delete_function, which is defined as:
typedef int MPI_Delete_function(MPI_Comm comm, int keyval,
void *attribute_val, void *extra_state);
SUBROUTINE DELETE_FUNCTION(COMM, KEYVAL, ATTRIBUTE_VAL,
EXTRA_STATE, IERR)
This function is called by subroutines MPI_Comm_free, MPI_Attr_delete and
MPI_Attr_put. A deletion function may be defined as "null"
MPI_NULL_DELETE_FN. MPI_NULL_DELETE_FN doesn’t perform any actions
but returns MPI_SUCCESS.
The special key value MPI_KEYVAL_INVALID is never returned by subroutine
MPI_Keyval_create. It is used for initialization of keys.
Deletion of a key keyval
int MPI_Keyval_free(int *keyval)
MPI_KEYVAL_FREE(KEYVAL, IERR)
Call of this function assigns to keyval the value MPI_KEYVAL_INVALID. A key
which is still in use may be freed, because its actual deletion takes place only after all
references to it are deleted. References must be deleted explicitly, for example by calls of
MPI_Attr_delete; each such call deletes one copy of the attribute. A call of
MPI_Comm_free deletes all copies of the attribute related to the communicator being
deleted.
Setting of attribute which may be used by subroutine MPI_Attr_get
int MPI_Attr_put(MPI_Comm comm, int keyval, void* attribute)
MPI_ATTR_PUT(COMM, KEYVAL, ATTRIBUTE, IERR)
A call of this subprogram associates the attribute with the key value keyval. If an
attribute value was already set, the result is the same as if MPI_Attr_delete were
called first (invoking delete_fn) and then the new value were saved. The call
completes with an error if a key with value keyval does not exist; in particular, the code
MPI_KEYVAL_INVALID corresponds to a wrong key value. Changing the system
attributes MPI_TAG_UB, MPI_HOST, MPI_IO and MPI_WTIME_IS_GLOBAL is not
allowed.
77
Getting attribute value which corresponds to a key’s value keyval
int MPI_Attr_get(MPI_Comm comm, int keyval, void *attribute,
int *flag)
MPI_ATTR_GET(COMM, KEYVAL, ATTRIBUTE, FLAG, IERR)
The first parameter defines the communicator to which the attribute is attached. If a key
with value keyval does not exist, an error occurs. An error does not arise if the key
exists but the corresponding attribute is not attached to the communicator comm; in this
case the value flag = false is returned.
When MPI_Attr_put is called, the attribute value is passed by means of
attribute_val, and during a call of MPI_Attr_get the address of the returned
attribute value is passed through the attribute_val parameter. An attribute may be
retrieved only from a program written in the same programming language as the
program which called MPI_Attr_put.
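A minimal sketch of caching an attribute on a communicator; the payload value 42 and the use of the predefined null copy/delete functions are illustrative assumptions:

int keyval, *valp, flag;
static int my_value = 42;               /* attribute payload (illustrative) */
MPI_Comm comm;

MPI_Comm_dup(MPI_COMM_WORLD, &comm);
MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, NULL);
MPI_Attr_put(comm, keyval, &my_value);  /* cache the attribute              */

MPI_Attr_get(comm, keyval, &valp, &flag);
if (flag) {
    /* attribute found: *valp == 42 here */
}

MPI_Attr_delete(comm, keyval);
MPI_Keyval_free(&keyval);
MPI_Comm_free(&comm);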
Deletion of attribute with given key’s value
int MPI_Attr_delete(MPI_Comm comm, int keyval)
MPI_ATTR_DELETE(COMM, KEYVAL, IERR)
Deletion of the attribute with the given key value is performed by the attribute deletion
function delete_fn which was defined when keyval was created. Parameter comm
defines the communicator to which the attribute is attached. All parameters of the
subprogram are input parameters. When a communicator is duplicated by the
MPI_Comm_dup subprogram, the copy functions of all attributes set at that time are
invoked, in arbitrary order. The same happens when a communicator is deleted by
MPI_Comm_free, but in this case the deletion functions are called.
Implementations
There are several implementations of the MPI specification. Among them are: MPICH (MPI
CHameleon, www.mcs.anl.gov), a free, open-source MPI implementation;
LAM (Local Area Multicomputer), a high-quality open-source MPI implementation
(www.lam-mpi.org); Microsoft® MPI, Intel® MPI and others.
Some implementations support usage in a Grid environment.
78
From MPI-1 to MPI-2
A lot of parallel software uses MPI-1, but most implementations now also support the
MPI-2 specification. It has enhanced functionality such
as:
• spawning of new tasks in a process of program execution;
• new kinds of point-to-point exchanges (one-sided);
• parallel input-output;
• enhanced collective operations (including operations for intercommunicators)
and so on.
79
Part 4
Fortran 90
80
In this section a short description of the Fortran programming language is given. It is one
of the languages most often used in scientific, applied and parallel programming, and it
remains highly competitive in performance and in convenience of programming of
computational problems.
Format of a source code
Source code of a Fortran program may be written in fixed or free format. The fixed
format corresponds to the old Fortran 77 standard, and free format is used in
Fortran 90 and newer standards. Fortran 90 also supports fixed format.
A line of source code in fixed format consists of 72 positions. The first five
positions may be used only for labels or comments. The sixth position may be blank, or
any non-blank symbol may be placed there; in the latter case the line is considered a
continuation of the previous line. A statement may be placed in any positions from 7 to 72.
In the free format all positions are equivalent and the line length is 132 characters.
Program structure
Program consists of main program and, possibly, subprograms. Subprograms may be
both functions and subroutines, external and internal. A program’s components may
be compiled separately.
The main program begins with the PROGRAM statement followed by the name of the program:
PROGRAM program_name
Name of a program begins with a letter, then letters, digits and underscores may
follow, for example:
PROGRAM SUMMATION
PROGRAM QUADRATIC_EQUATION_SOLVER45
Maximum length of any identifier in Fortran is 31 symbols.
First statement of a subprogram is FUNCTION or SUBROUTINE. Last statement of
any program component is END. Last statement of the main program may be of the
following form:
END[ PROGRAM[ program_name]]
where program_name is an optional part of the statement.
Just after the header, all definitions of variables, constants and other objects which are
used in the (sub)program should be placed. This is the declaration part of a program. Then
the part with executable statements follows.
81
Basic data types
Below list of built-in data types in the order of rank increase is given:
• LOGICAL(1) and BYTE
• LOGICAL(2), LOGICAL(4)
• INTEGER(1), INTEGER(2), INTEGER(4)
• REAL(4), REAL(8)
• COMPLEX(8), COMPLEX(16)
Each built-in data type in Fortran has several kinds, which differ in the range of allowable
values and in precision (for numerical types).
Any value of CHARACTER type is a character string. The length of a string may vary;
it is defined by the value of the LEN parameter in the declaration of a string variable:
CHARACTER(LEN = 430) :: Shakespeare_sonet
Declaration statement
A declaration statement for variables has the following format in Fortran 90:
type[, attributes] :: list_of_variables
Identifiers in the list are separated by commas and type defines the data type, for example:
REAL, PARAMETER :: salary = 2000
Following attributes may be used in Fortran 90:
• PARAMETER for named constants;
• PUBLIC variable is accessible from outside a module;
• PRIVATE variable is not accessible from outside a module;
• POINTER variable is a pointer;
• TARGET variable may be used as a target for pointers;
• ALLOCATABLE for dynamic (allocatable) arrays;
• DIMENSION for arrays;
• INTENT defines the kind of a subprogram's argument (input, output or both input and output);
• OPTIONAL optional argument of a subprogram;
• SAVE to save value of a local variable of a subprogram after return;
• EXTERNAL for external function;
• INTRINSIC for intrinsic function.
82
Literal constants
Literal numerical constants are written as usual. In complex literal constants
parentheses are used:
• (0., 1.) imaginary unit i;
• (2., 1.) complex number 2 + i.
There are two logical literal constants:
• .TRUE.
• .FALSE.
Literal character constant may be written between two quotation marks or two
apostrophes:
“Hello, Fortranner!”
‘Good night, Fortranner!’
Arithmetical and logical operators
Below, the arithmetic operators are listed in order of decreasing priority:
• ** — exponentiation;
• *, / — multiplication, division;
• –, + — subtraction, addition.
”Minus” (–) and “plus” (+) are used also for unary operators:
-2.14
+321
Below relation operators in Fortran are listed:
Notation   Alternative notation   Name
.LT.       <                      Less
.LE.       <=                     Less or equal
.GT.       >                      Greater
.GE.       >=                     Greater or equal
.EQ.       ==                     Equal
.NE.       /=                     Not equal
Only “equal” (.EQ.) and "not equal" (.NE.) relations may be applied to complex
variables and constants.
83
Logical operators:
Operator            Description
.NOT.               Logical negation
.AND.               Logical multiplication (logical AND)
.OR.                Logical addition (logical OR)
.EQV. and .NEQV.    Logical equivalence and nonequivalence (equality and nonequality of logical values)
Arrays
Arrays are described with DIMENSION attribute:
REAL, DIMENSION(1:100) :: C
List of extents may be used with this attribute. Extent is number of elements in a
dimension. The list of extents has the following form:
(extent_1, extent_2, …, extent_n)
Number of extents is equal to array’s rank. Each extent is described as follows:
[low_boundary : ] upper_boundary
Example:
REAL, DIMENSION(0:10, 2, -3:3,11) :: FGRID
If low boundary value is omitted it is supposed to be 1.
For allocatable arrays list of extents has form of colons separated by commas.
Number of colons in this case must be equal to dimension of an array:
REAL, ALLOCATABLE, DIMENSION(:, :) :: BE_LATER
Size of allocatable array may be defined at a time of program execution. Only at that
time memory may be allocated for such array:
PROGRAM dyn_array
IMPLICIT NONE
INTEGER SIZE
REAL, ALLOCATABLE, DIMENSION(:) :: array
WRITE(*, *) 'SIZE?'
READ(*, *) SIZE
IF(SIZE > 0) ALLOCATE(array(SIZE))
…
IF(ALLOCATED(array)) DEALLOCATE(array)
END PROGRAM dyn_array
Array may be described without DIMENSION attribute. In this case array’s extents
must be described after its name:
REAL X(10, 20, 30), Y(100), Z(2, 300, 2, 4)
84
Example
Solution of nonlinear equation by Newton’s method
program newton
implicit none
real(8) :: x, dx, f, df
x = 3.3                            ! initial approximation
do                                 ! Newton's iterations
   dx = f(x) / df(x)               ! step evaluation
   x = x - dx                      ! next approximation
   if(abs(dx) <= spacing(x)) exit  ! loop is finished when the step
                                   ! becomes smaller than the spacing
                                   ! of floating point numbers near x
end do
print *, x                         ! output of the solution
print *, f(x)                      ! output of a function value
print *, df(x)                     ! output of derivative of function
end program newton

real(8) function f(x)
implicit none
real(8) :: x
f = sin(x)
return
end

real(8) function df(x)             ! derivative of f(x)
implicit none
real(8) :: x
df = cos(x)
return
end
Statements of Fortran 90
Here a list of some statements of Fortran 90 is given. Optional elements are
denoted by square brackets []. A blank symbol that is not placed in square brackets is
required in this context.
Nonexecutable statements
Statements of program components
PROGRAM program_name
Main program’s header
85
MODULE module_name
Module’s header
END[ MODULE[module_name]]
Last module’s statement
USE module_name[, ONLY : only_list]
Statement which attaches a module
[RECURSIVE ]SUBROUTINE subroutine_name
[([list_of_formal_parameters])]
Subroutine’s header
[type ][RECURSIVE ]FUNCTION function_name([list_of_formal_parameters])[ RESULT(result_name)]
Function’s header
INTERFACE[ generic_description]
Definition of an interface, header statement
END[ ]INTERFACE
Last statement in an interface definition
CONTAINS
Definition of internal subprogram
ENTRY
Entry statement
86
Descriptions and initializations
type[[, attribute][, attribute]…] :: objects_list
Description statement. type is one from the list:
• INTEGER[([KIND=]kind_parameter)]
• REAL[([KIND=]kind_parameter)]
• LOGICAL[([KIND=]kind_parameter)]
• COMPLEX[([KIND=]kind_parameter)]
• CHARACTER[(list_of_type_parameters)]
• DOUBLE[ ]PRECISION
• TYPE(type_name)
Attributes are any allowable combination of the following:
PARAMETER, PUBLIC, PRIVATE, POINTER, TARGET, ALLOCATABLE,
DIMENSION, INTENT, EXTERNAL, INTRINSIC, OPTIONAL, SAVE
TYPE[, access_attribute ::] name_of_derived_type
Definition of a derived type, header statement. Here access_attribute is
PUBLIC or PRIVATE
END[ ]TYPE[name_of_derived_type]
Definition of a derived type, last statement
IMPLICIT list
where list is type(list_of_letters)[, type(list_of_letters)]
… or NONE
Definition of implicit typing
ALLOCATABLE [::] array_name[(extents_list)][,
array_name[(extents_list)]…]
Definition of allocatable arrays
DIMENSION array_name(extents)[, array_name(extents)…]
Arrays description
PARAMETER (list_of_definitions_of_named_constants)
Definition of named constants
EXTERNAL list_of_external_names
Assigning of attribute EXTERNAL
INTRINSIC list_of_intrinsic_names
Assigning of attribute INTRINSIC
87
INTENT(parameter_of_input/output)
list_of_formal_parameters
Assigning of attribute INTENT
OPTIONAL list_of_formal_parameters
Assigning of attribute OPTIONAL
SAVE[[::]list_of_objects_to_save]
Assigning of attribute SAVE
COMMON /[name_of_common_block]/ list_of_variables [, /
name_of_common_block / list_of_variables]
Definition of common blocks
DATA objects_list /list_of_values /[, objects_list /
list_of_values /…]
Initialization of variables and arrays
FORMAT([list_of_descriptors])
Format specification
Executable operators
Control statements
END[ PROGRAM[ program_name]]
Last statement of a program
END[ subprogram_kind [ subprogram_name]]
where subprogram_kind — SUBROUTINE or FUNCTION
Last statement of a subprogram
CALL subroutine_name[(list_of_actual_parameters)]
Call of subroutine
RETURN
Return statement
88
STOP[ message]
Stop statement
Assignments
variable = expression
Assignment for scalar and array-like objects
reference => target
Attachment of reference to a target
Loops and branchings
IF(scalar_logical_expression) executable_statement
Conditional statement
WHERE(array_logical_expression) array = expression_array
Conditional assignment for arrays
[if_name:] IF(scalar_logical_expression) THEN
ELSE[ IF(scalar_logical_expression) THEN][ if_name]
END[ ]IF[ if_name]
Conditional statement IF_THEN_ELSE
WHERE(array_logical_expression)
ELSEWHERE
END[ ]WHERE
Branching for array assignments
[select_name:] SELECT[ ]CASE (scalar_expression)
CASE (list_of_possible_values)[ select_name]
CASE DEFAULT[ select_name]
END[ ]SELECT[ select_name]
Multibranching SELECT
89
GO[ ]TO label
Unconditional jump to a specified label
[do_name:] DO[ label][,] variable = scalar_integer_expression1, scalar_integer_expression2[, scalar_integer_expression3]
DO-loop header
[do_name:] DO[ label][,] WHILE(scalar_logical_expression)
While-loop (precondition loop) header
CYCLE[ do_name]
Transition to the next iteration of the loop do_name
EXIT[ do_name]
Exit from the loop do_name
CONTINUE
Empty executable statement (often used as a jump target or as the end of a labelled loop)
END[ ]DO[ do_name]
Last statement of a loop do_name
Operations with dynamic memory
ALLOCATE(list_of_objects_to_be_allocated [, STAT=status])
Allocation of memory for listed objects
DEALLOCATE(list_of_allocated_objects[, STAT = status])
Memory deallocation
90
Input-output statements
READ(input_control_list) [input_list]
READ format[,input_list]
Input
WRITE(output_control_list) [output_list]
Output
PRINT format[, output_list]
Write on standard output device
OPEN(descriptors)
Attachment of a file to a logical input-output device
CLOSE(descriptors)
Closing of a file
Intrinsic subprograms
ABS(A)
Absolute value
ACHAR(I)
I-th symbol of a sorting ASCII sequence of a processor
ACOS(X)
Arccosine in radians
AIMAG(Z)
Imaginary part of a complex value
AINT(A[, KIND])
Truncation to integer value (with specified kind parameter)
91
ALLOCATED(ARRAY)
Check if an array is allocated
ANINT(A[, KIND])
Nearest integer value to A
ASIN(X)
Arcsine in radians
ATAN(X)
Arctangent in radians
ATAN2(Y, X)
Argument of a complex value
CALL DATE_AND_TIME([DATE][, TIME] [,ZONE] [,VALUES])
Date and time getting
CALL RANDOM_NUMBER(R)
Uniformly distributed pseudorandom number from interval [0, 1)
CALL RANDOM_SEED([SIZE] [, PUT][, GET])
Getting/setting of random seed
CALL SYSTEM_CLOCK([COUNT] [, COUNT_RATE] [, COUNT_MAX])
Integer count of real timer
CEILING(A)
Minimum integer value which is greater or equal to A
CHAR(I[, KIND])
I-th character in the collating sequence of a processor (with specified kind parameter)
CMPLX(X[, Y] [, KIND])
Constructor of a complex number (with specified kind parameter)
92
CONJG(Z)
Complex conjugation
COS(X)
Trigonometric cosine
COSH(X)
Hyperbolic cosine
COUNT(MASK[, DIM])
Number of masked array elements (along specified dimension)
CSHIFT(ARRAY, SHIFT[, DIM])
Cyclic shift of array elements (along specified dimension)
DIGITS(X)
Number of significant digits in the model of floating point representation of X
DOT_PRODUCT(VECTOR_A, VECTOR_B)
Dot product of one-dimensional arrays
DPROD(X, Y)
Scalar multiplication with double precision
EOSHIFT(ARRAY, SHIFT[, BOUNDARY] [, DIM])
Linear shift of array elements
EPSILON(X)
Minimum number in a model of floating point representation of X such that its sum
with unit is distinguishable from unit
EXP(X)
Exponent
EXPONENT(X)
Exponent in a model of floating point representation of X
93
FLOOR(A)
Minimum integer value that is not greater than A
FRACTION(X)
Fractional part in a model of floating point representation of X
HUGE(X)
Maximum value in a model of floating point representation of X
IACHAR(C)
Index of a character argument in the ASCII collating sequence
IAND(I, J)
Bitwise logical AND
IBCLR(I, POSITION)
Setting zero bit in a specified position
IBITS(I, POSITION, LENGTH)
Extraction of a bit subsequence
IBSET(I, POSITION)
Setting binary unit in a given position
ICHAR(C)
Index of a character argument in a sorting sequence of a processor
IEOR(I, J)
Bitwise XOR
INDEX(STRING, SUBSTRING[, BACK])
Starting index of substring in a string
INT(A[, KIND])
Transformation to integer (with specified kind parameter)
94
IOR(I, J)
Bitwise logical OR
ISHFT(I, SHIFT)
Logical bit shift
ISHFTC(I, SHIFT[, SIZE])
Circular shift of the SIZE rightmost bits
KIND(X)
Kind parameter of an argument
LBOUND(ARRAY[, DIM])
Low bound of array-like argument
LEN(S)
Length of a string
LEN_TRIM(STRING)
Length of a string without trailing blanks
LOG(X)
Natural logarithm
LOG10(X)
Decimal logarithm
LOGICAL(L[, KIND])
Transformation of an argument to logical type (with specified kind parameter)
MATMUL(MATRIX_A, MATRIX_B)
Matrix multiplication
MAX(A1, A2[, A3, …..])
Maximum of arguments in a list
95
MAXLOC(ARRAY[, MASK])
Index of maximum array element
MAXVAL(ARRAY[, DIM ] [, MASK])
Maximum array element (along specified dimension and/or according to a given
mask)
MIN(A1, A2 [, A3, …..])
Minimum element in a list
MINLOC(ARRAY[, MASK])
Index of minimum array element
MINVAL(ARRAY[, DIM ] [, MASK])
Minimum array element (along specified dimension and/or according to a given
mask)
MOD(A, P)
Remainder of A divided by P (result has the sign of A)
MODULO(A, P)
A modulo P (result has the sign of P)
NINT(A[, KIND])
Nearest integer to an argument’s value
NOT(I)
Bitwise logical negation
PRECISION(X)
Decimal precision for real type argument
PRESENT(A)
Test of presence of optional argument
96
PRODUCT(ARRAY[, DIM] [,MASK])
Product of array elements (along specified dimension and/or according to a given
mask)
REAL(A[, KIND])
Transformation to real type (with a kind specified)
RESHAPE(SOURCE, SHAPE[, PAD] [, ORDER])
Change of array shape
SCAN(STRING, SET[, BACK])
Index of the first character of STRING that belongs to SET (the last one if BACK is true)
SELECTED_INT_KIND(R)
Kind parameter of integer type with a given interval
SELECTED_REAL_KIND([P][, R])
Kind parameter of real type with a given precision and interval
SHAPE(SOURCE)
Shape of an argument (array-like argument assumed)
SIGN(A, B)
Absolute value of A with a sign of B
SIN(X)
Trigonometric sine
SINH(X)
Hyperbolic sine
SIZE(ARRAY[, DIM])
Array size (along specified dimension)
SQRT(X)
Square root
97
SUM(ARRAY[, DIM][, MASK])
Sum of array elements (along specified dimension and/or according to a given mask)
TAN(X)
Trigonometric tangent
TANH(X)
Hyperbolic tangent
TINY(X)
Minimum positive floating point number
TRANSPOSE(MATRIX)
Matrix transpose
UBOUND(ARRAY [, DIM])
Upper array boundary
98
References
1. S. Nemnyugin, O. Stesik, Parallel programming for multiprocessor computing
systems. "BHV", Saint-Petersburg, 2002, 396 p.
2. V.V. Voevodin, V.V. Voevodin, Parallel computing. "BHV", Saint-Petersburg,
2002, 599 p.
3. S. Nemnyugin, O. Stesik, Modern Fortran. "BHV", Saint-Petersburg, 2004,
481 p.
Appendix 1
Intel compiler for Linux
A short description of the Intel® Fortran compiler for Linux is given here. The compiler is
invoked as follows:
ifort [options] file1 [file2 ...]
where the options are optional and each fileN is a Fortran source file (with extension .f,
.for, .ftn, .f90, .fpp, .F, .FOR, .F90, .i or .i90), an assembly file (.s, .S), an object file (.o),
a static library (.a), or another linkable file.
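For instance, a single source file might be compiled and linked into an executable as follows (the file name is illustrative):

ifort -o hello hello.f90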
Performance options
• -O1 - optimize for maximum speed, but disable some optimizations which
increase code size for a small speed benefit;
• -O2 - enable optimizations (default);
• -O3 - enable -O2 plus more aggressive optimizations that may not improve
performance for all programs;
• -O0 - disable optimizations;
• -O - same as -O2;
• -fast - enable -xP -O3 -ipo -no-prec-div -static;
• -[no-]prec-div - improve precision of floating-point divides (some speed
impact);
• -mcpu=<cpu> - optimize for a specific cpu: pentium - optimize for
Pentium® processor, pentiumpro - optimize for Pentium® Pro, Pentium®
II and Pentium® III processors, pentium4 - optimize for Pentium® 4
processor (default);
• -march=<cpu> - generate code exclusively for a given <cpu>:
pentiumpro - Pentium® Pro and Pentium® II processor instructions,
pentiumii - MMX(TM) instructions, pentiumiii - Streaming SIMD
Extensions, pentium4 - Pentium® 4 new instructions;
• -x<codes> - generate specialized code to run exclusively on processors
indicated by <codes>: W - Intel Pentium® 4 and compatible Intel processors,
P - Intel® Core(TM) Duo processors, Intel Core(TM) Solo processors, Intel
Pentium® 4 and compatible Intel processors with Streaming SIMD Extensions
3 (SSE3) instruction support;
• -ip - enable single-file interprocedural (IP) optimizations (within files);
• -ipo[n] - enable multi-file IP optimizations (between files);
• -qp - compile and link for function profiling with the UNIX gprof tool;
• -p - same as -qp;
• -opt-report - generate an optimization report to stderr.
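As a sketch, several of these options may be combined in one aggressively optimized build (the program name is illustrative):

ifort -O3 -ipo -o mysim mysim.f90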
Instrumentation options
• -tcheck - generate instrumentation to detect multi-threading bugs (requires
Intel® Thread Checker; cannot be used with compiler alone);
• -tprofile - generate instrumentation to analyze multi-threading
performance (requires Intel® Thread Profiler; cannot be used with compiler
alone);
• -openmp - enable the compiler to generate multi-threaded code based on the
OpenMP directives;
• -openmp-profile - link with instrumented OpenMP runtime library to
generate OpenMP profiling information for use with the OpenMP component
of the VTune(TM) Performance Analyzer;
• -openmp-stubs - enable compilation of OpenMP programs in
sequential mode; the OpenMP directives are ignored and a stub (sequential)
OpenMP library is linked;
• -cluster-openmp - allows the user to run an OpenMP program on a
cluster;
• -parallel - enable the auto-parallelizer to generate multi-threaded code for
loops that can be safely executed in parallel.
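For example, an OpenMP program could be built with multi-threading enabled, or a loop-based program handed to the auto-parallelizer (the file names are illustrative):

ifort -openmp -O2 -o omp_prog omp_prog.f90
ifort -parallel -O2 -o auto_prog auto_prog.f90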
Output and debug options
• -c - compile to object (.o) only, do not link;
• -S - compile to assembly (.s) only, do not link;
• -o <file> - name output file;
• -print-multi-lib - print information about libraries being used.
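A typical separate-compilation sketch with these options (the file names are illustrative):

ifort -c module1.f90
ifort -c main.f90
ifort -o prog main.o module1.o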
Fortran preprocessor options
• -module [path] - specify path where mod files should be placed and first
location to look for mod files;
• -I<dir> - add directory to include file search path.
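For example (the paths and file name are illustrative):

ifort -I/home/user/include -module ./modules -c solver.f90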
Language options
• -i2 - set default KIND of integer variables to 2;
• -i4 - set default KIND of integer variables to 4;
• -i8 - set default KIND of integer variables to 8;
• -integer-size <size> - specify the default size of integer and logical
variables: 16, 32 or 64;
• -r8 - set default size of REAL to 8 bytes;
• -r16 - set default size of REAL to 16 bytes;
• -real-size <size> - specify the size of REAL and COMPLEX
declarations, constants, functions, and intrinsics: 32, 64 or 128;
• -[no]fixed - specify that source files are in fixed format;
• -[no]free - specify that source files are in free format;
• -auto - make all local variables AUTOMATIC;
• -auto-scalar - make scalar local variables AUTOMATIC (default);
• -save - save all variables (static allocation);
• -syntax-only - perform syntax check only.
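For instance, a free-format source file with 8-byte default reals can be checked for syntax errors only (the file name is illustrative):

ifort -free -r8 -syntax-only solver.f90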
Miscellaneous options
• -help - print help message;
• -V - display compiler version information.
Linking options
• -L<dir> - instruct the linker to search <dir> for libraries;
• -i-dynamic - link Intel-provided libraries dynamically;
• -i-static - link Intel-provided libraries statically;
• -dynamic-linker <file> - select a dynamic linker other than the default;
• -static - prevent linking with shared libraries;
• -shared - produce a shared object.
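A linking sketch that searches a non-standard library directory and links the Intel-provided libraries statically (the directory and library names are illustrative, and -lmylib is the usual linker flag for naming a library):

ifort -o prog main.o -L/opt/mylibs -lmylib -i-static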
Appendix 2
Compilation and execution of MPI programs
in Linux
Compilation
Scripts
mpif77, mpif90, mpicc, mpiCC
compile and link MPI programs written in Fortran 77, Fortran 90, C and C++,
respectively. These wrappers supply the options and the special libraries needed to
build MPI programs. Each script invokes a compiler installed on the system, so both
options specific to that underlying compiler and MPI-specific options may be used.
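For example, a Fortran 90 MPI program might be compiled and linked with (the file name is illustrative):

mpif90 -o mpi_prog mpi_prog.f90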
Execution
mpirun is a shell script that is used to execute parallel MPI programs. It typically
works like this:
mpirun -np <number of processes> <options> <program name and its arguments>
Options
The options for mpirun must come before the name of the program to be run:
• -h - short help;
• -machinefile <machinefile name> - take the list of possible
machines to run on from the file <machinefile name>;
• -np <np> - specify the number of processes to run;
• -t - testing (do not actually run, just print what would be executed);
• -v - verbose (throw in some comments).
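A typical invocation, running four processes on the machines listed in a machinefile (the names are illustrative):

mpirun -np 4 -machinefile machines ./mpi_prog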
A machinefile is an ordinary plain-text file with the network names of computers,
one name per line:
pd00
pd01
pd02
pd03
pd99
Additional options may be included in a machinefile; they define, for example, the
maximum number of processes of a parallel program that may be executed on a given
computer (see the sketch below).
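As a sketch, in MPICH-style machinefiles such a limit is often written after the host name, separated by a colon; the exact syntax depends on the MPI implementation and should be checked in its documentation:

pd00:2
pd01:2
pd02:4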