
Basics of parallel programming


Saint-Petersburg State University Department of Computational Physics Sergei A. Nemnyugin Basics of parallel programming . Saint-Petersburg 2010 2 S.A. Nemnyugin Basics of parallel programming – St. Petersburg, 2010 Basics of programming techniques for multiprocessor and multicore computers are given. OpenMP and Message Passing Interface are considered. Short review of Fortran 90 also is included. 3 Table of Contents 1. 2. 3. 4. 5. 6. 7. Introduction OpenMP Message Passing Interface Fortran 90 References Appendix 1. Intel compiler for Linux Appendix 2. Compilation and execution of MPI programs in Linux 4 20 33 79 98 99 102 4 Part 1 Principles of parallel programming 5 1. Introduction Sequential and parallel programming models Programming model is a set of programming techniques corresponding to architecture of abstract computer intended for execution of definite class of algorithms. Programming model is based on some concept of logical organization of computer (its architecture). Variety of computer architectures may be classified in different ways. One of the most popular taxonomies is Flynn’s taxonomy which is based on number of instructions flows and data flows that interact in a computer (Fig. 1 - 4). Fig. 1. SISD (Single Instruction Stream  Single Data Stream) architecture Fig. 2. SIMD (Single Instruction Stream  Multiple Data Stream) architecture 6 Fig. 3. MISD (Multiple Instruction Stream  Single Data Stream) architecture Fig. 4. MIMD (Multiple Instruction Stream  Multiple Data Stream) architecture Inherent parallelism of algorithms and computer programs may be presented by an informational graph. Informational graph shows both execution order of macrooperations and data flows between them. Nodes of an informational graph correspond to macrooperations and unidirectional links correspond to data exchanges (fig. 5). 7 Fig. 5. Informational graph A node is characterized by two parameters (n, s), where n is node’s identifier and s is its size which may be measured by number of elementary operations which constitute it. A link is characterized by two parameters (v, d), where v are transferring data, d is time required for data delivery from sender to recipient. Informational graph consists of both linear sequences and multiply connected contours (loops). Limit cases of informational graph which are topologically equivalent to linear sequence of macrooperations and set of linear sequences (Fig. 6) correspond to purely sequential and parallel models of computing. Fig. 6. Limit cases of informational graph Sequential programming model may be characterized as follows: • performance is defined primarily by a hardware; • high level programming languages are used to develop computer programs; • good source-level portability of computer programs. 8 Main features of parallel programming model: • possibility to achieve high performance; • special programming techniques; • special software tools; • program development and verification may be much more laborious than in sequential model; • restricted portability of parallel software. 
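To make the notion of an informational graph more concrete, the following small C sketch shows one possible way to encode its nodes and links. The type and field names (Node, Link and so on) are illustrative assumptions added here, not notation from the text: a node carries its identifier n and size s, a link carries the amount of transferred data v and the delivery time d.

#include <stdio.h>

/* A node of an informational graph: identifier n and size s
   (the number of elementary operations in the macrooperation). */
typedef struct {
    int  n;        /* node identifier                 */
    long s;        /* size, in elementary operations  */
} Node;

/* A directed link (data exchange) between two nodes. */
typedef struct {
    int    from, to;  /* identifiers of sender and recipient nodes */
    long   v;         /* amount of transferred data                */
    double d;         /* time required for delivery                */
} Link;

int main(void) {
    /* Two macrooperations connected by one data exchange. */
    Node a = { 1, 1000 }, b = { 2, 2000 };
    Link ab = { 1, 2, 256, 1.5e-4 };
    printf("node %d (%ld ops) -> node %d (%ld ops): %ld units in %g s\n",
           a.n, a.s, b.n, b.s, ab.v, ab.d);
    return 0;
}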
Some specific problems of parallel programming:
• careful planning of the simultaneous execution of a set of processes;
• explicit programming of data exchanges;
• possible deadlocks (two processes or threads may each wait for a resource while holding a resource required for the normal execution of the other);
• errors are nonlocal and dynamic (processes or threads run on different computing nodes or cores, and the workloads of the nodes change in time, so explicit synchronization may be required);
• loss of determinism (results of calculations differ from run to run as a consequence of "data races", i.e. simultaneous and asynchronous access to shared variables);
• care about scalability of the software (scalability is a desirable property of a system, a network, or a process which indicates its ability either to handle growing amounts of work gracefully or to be readily enlarged);
• the necessity of a balanced workload across CPUs.

Models of parallel programming

The parallel programming model may be realized in different ways, depending on the computer architecture and the software development tools. Some of these realizations are listed below.

Message passing model

Most important features of the message passing model:
• A program causes the execution of a set of processes.
• Each task gets its own unique identifier.
• Processes interact by sending and receiving messages.
• New processes may be created at run time of a parallel program.

Data parallelism model

Most important features of the data parallelism model:
• One operation deals with a whole set of elements; a program is a sequence of such operations.
• Fine granularity of computations.
• The programmer describes explicitly how data should be distributed between subtasks.

Shared memory model

In the shared memory model tasks have access to a common memory and share a common address space. Memory access is controlled by various methods, for example by means of semaphores. Explicit description of data transfers is not used. This simplifies programming, but special attention should be paid to determinism, data races etc.

Amdahl's laws

Amdahl's laws form the theoretical basis for estimates of the maximum performance of parallelizable programs. These laws were derived for an idealized model of parallel computation which does not take into account the latency of communications (the finite time of data transfer between nodes of a computer system) and similar effects.

First law. The performance of a computer system consisting of interconnected components is determined by its slowest component.

Second law. Let T1 be the execution time of a program on a sequential computer, and let it be the sum of Ts, the execution time of the non-parallelizable part, and Tp, the execution time of the parallelizable part. Let T2 be the execution time of the same program on an ideal parallel computer with N CPUs. Then the speedup is

K = T1 / T2 = (Ts + Tp) / (Ts + Tp/N) = 1 / (S + P/N),

where S = Ts / (Ts + Tp) and P = Tp / (Ts + Tp) are the fractions of the sequential and parallelized parts of the program (S + P = 1). See fig. 7.

Fig. 7. Speedup in the second Amdahl's law

Third law. Let a computer system consist of N simple identical processing elements. Then in any case the speedup does not exceed the reciprocal of the sequential fraction: K ≤ 1/S.

Two paradigms of parallel programming

The informational graph in fig. 8 displays the coexistence of data and task parallelism in the same program/algorithm.

Fig. 8. Data and task parallelism in the same program

Data parallelism

Multiple arrows between two neighbouring nodes of an informational graph correspond to simultaneous application of a macrooperation to a set of data (an array).
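The following C sketch (an illustration added here, not part of the original text) evaluates the speedup given by the second Amdahl's law, K = 1/(S + P/N), for a program whose sequential fraction is S = 0.1, and prints the limiting value 1/S of the third law for comparison.

#include <stdio.h>

/* Speedup according to the second Amdahl's law: K = 1 / (S + P/N). */
static double amdahl_speedup(double s, int n) {
    double p = 1.0 - s;            /* parallelizable fraction          */
    return 1.0 / (s + p / n);      /* ideal N-CPU execution, no latency */
}

int main(void) {
    const double s = 0.1;          /* 10% of the work is sequential    */
    int n;
    for (n = 1; n <= 1024; n *= 2)
        printf("N = %4d   K = %6.3f\n", n, amdahl_speedup(s, n));
    printf("Upper bound (third law): 1/S = %.1f\n", 1.0 / s);
    return 0;
}

Even with only 10% sequential work the speedup saturates quickly: it can never exceed 10, no matter how many CPUs are used.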
Parts of the array may be processed by vector CPU or by a set of CPUs on a parallel computer system. Vectorization or parallelization in the data parallelism model is introduced into program on the stage of compilation. A programmer in this case: • uses compiler options of vector or parallel optimization; • uses directives of parallel compilation; • uses both specialized programming languages of parallel computations and libraries optimized for a given computer architecture. Main features of the model of data parallelism: • same program deals with all data; • address space is global; • weak synchronization of computations on CPUs of a parallel computer; • parallel operations are applied to array elements simultaneously on all available CPUs. Most widely used software tools in the data parallelism model are special programming languages or extensions such as DVM Fortran, HPF (High Performance Fortran) etc. Realization of the data parallelism model should be supported on the level of compilation. Such support may be provided by: • preprocessors which use sequential compilers and specialized libraries of parallel algorithms; • pretranslators which perform preliminary analysis of logical structure of a program, check of dependencies and restricted parallel optimization; • parallelizing compilers which reveal parallelism in a source code of a program and transform it in parallel structures. To make such transformation easier special directives of compilation may be included in a program. Task parallelism Loops of an informational graph consisting of “thick” arrows correspond to task parallelism. Its idea is based on a problem decomposition into a set of smaller subtasks. All subtasks are processed on different CPUs. This approach is MIMDoriented. In the task parallelism model subtasks are realized as separate programs written on commonly used programming language (for example, Fortran or ). Subtasks should send and receive initial data, intermediate and final results. In practice such interchange may be realized by means of calls of subroutines from a special library. 12 A programmer is able to control data distribution between different CPUs and different subtasks as well as data interchange. Problems of this approach are listed below: • high laboriousness of development, debugging, testing and verification of a parallel program; • a programmer is responsible for equal and dynamically balanced workload of CPUs of a parallel computer; • a programmer should minimize data interchange between subtasks because communications require a lot of time; • possibility of deadlocks or other situations when message sent by some subtask may be not delivered to a target subtask. Attractive features: • flexibility and more freedom given to a programmer to develop software which efficiently uses resources of a parallel computer; • possibility to achieve maximum performance. Main tools of programming in the task parallelism model are specialized libraries (e. g. MPI - Message Passing Interface, PVM - Parallel Virtual Machines). Design of parallel algorithm In the process of development of a parallel algorithm a programmer should pass through a sequence of specific stages: 1. Decomposition. This stage includes analysis of a problem and making decision if parallelization is necessary at all. A problem and related data are divided into smaller parts  subtasks. It is not necessary to take into account features of computer architecture at this stage. 2. Planning of communications (data exchanges between subtasks). 
Communications necessary both for data (initial, intermediate and final results) exchanges and exchanges by control information are defined. Types of communications also must be chosen. 3. Agglomeration. At this stage subtasks are agglomerated into bigger constituents of a parallel program. Sometimes it allows to increase an efficiency of an algorithm and to reduce its development cost. 4. Planning of computations. Distribution of tasks between CPUs. Main criterion of distribution  effective usage of CPUs with minimal time spent on communications. Let us turn to more detailed consideration of listed stages. Decomposition There are different approaches to decomposition. Below a short review of main approaches is given. 13 Data decomposition In the data decomposition approach at first data must be subdivided into smaller parts. Secondly procedures of data processing may be decomposed. Data are divided into parts which have nearly equal size. Operations dealing with data should be bound to data fragments. In such a way subtasks are formed. Then all required communications should be defined. Overlaps of subtasks in computational work should be reduced. It allows to avoid doubling of computations. Decomposition may be refined in the process of program design. If it is necessary to reduce communications increase overlap of subtasks is possible. Analysis begins with biggest data structures as well as most often used structures. At different stages of computation may be used different data structures so both static and dynamic decompositions are used. Recursive dichotomy Recursive dichotomy may be used to divide a domain into subdomains, requiring about the same amount of computations. Communications are minimized. At first a domain is divided into two parts along each dimension. Decomposition is repeated recursively in each new subdomain so many times as is needed to get required number of subtasks. Recursive coordinate dichotomy Recursive coordinate dichotomy may be applied to nonregular grids. Division is performed at each step for a dimension having largest extension. Recursive graph dichotomy Recursive graph dichotomy may be applied to nonregular grids. In this approach information about grids topology is used to minimize number of edges crossing boundaries of subdomains. In such a way number of communications may be reduced. Functional decomposition In the functional decomposition a computational algorithm is subjected to decomposition and afterward decomposition of data is adjusted to this decomposition. Functional decomposition may be useful in a case where it is hard or even impossible to find data structures which may be parallelized. Efficiency of decomposition may be improved by following some recommendations: • number of subtasks after decomposition should exceed by an order of magnitude the number of CPUs; • extra computations and data exchanges should be avoided; • subtasks should have approximately same size; 14 ideally, decomposition should be performed in such a way that increase of a problem's size leads to increase of a number of subtasks (with constant size of a subtask). Size of a subtask is defined by granularity of an algorithm. Granularity may be measured by a number of operations in a block. There are three levels of granularity: • 1. Fine-grained parallelism  instruction-level (no more than 20 instructions per block, on average 5, number of parallel subtasks  from two to few thousands). 2. Middle-grained parallelism  subroutine-level. 
Block size is up to 2000 instructions. Such kind of parallelism is a bit harder to find because it is necessary to take into account interprocedural dependencies. Requirements to communications are smaller than in a case of instruction-level parallelism. 3. Coarse-grained parallelism  tasks-level. It is realized via simultaneous execution of independent programs on a parallel computer. Coarse-grained parallelism must be supported by operational system. Most important condition which makes decomposition possible is independence of subtasks. Below are listed main kinds of independency: • Data independence  data which are processed by one subtask are not modified by other subtask. • Control independence  order of execution of a program’s parts may be defined only in the time of execution (if control dependency exists order of execution is predefined). • Independence on resources  it may be provided by sufficient amount of computer resources. • Dependence on output  takes place if two or more subtasks write in the same variable. Input-output independence takes place if statements of input-output of few subtasks have not access to same variable or file. In practice complete independence is unachievable. Planning of communications There are few basic kinds of communications: • local  each subtask communicates with few other subtasks; • global  each subtask communicates with many other subtasks; structured  each subtask and subtasks which communicate with it may be arranged into regular structure topologically equivalent (for example) to a lattice; • • unstructured  communications form arbitrary graph; • static  communications don’t change in time; • dynamic  communications change in time of a program execution; 15 • synchronous  sender and receiver coordinate data exchanges; asynchronous  data exchanges are not coordinated. Recommendations on planning of communications: • program has good scalability if each subtask has the same number of communications; • local communications are preferable; • parallel communications are preferable. • Agglomeration At the agglomeration stage architecture of a parallel computer should be taken into account. Subtasks from two previous stages are combined in such way that to get as much new subtasks as available CPUs. In order to perform agglomeration efficiently following recommendations should be taken into account: • overhead expenses on communications should be reduced; • if at the agglomeration stage computations or data should be duplicated nor scalability nor performance should suffer; • new subtasks should have approximately equal computational complexity; • scalability should be kept if possible; • parallel execution must be kept; • cost of development should be reduced if possible. Planning of computations At the stage of planning of computations distribution of subtasks between CPUs has to be defined. It should be done in such way that execution time of a parallel program was minimized. Most often used approaches to planning of computations are listed below. Master/slave planning Main (master) subtask is responsible for distribution of slave tasks on CPUs (fig. 9). Slave task gets initial data from master and returns results. 16 Fig. 9. Simple master/slave schema Hierarchical master/slave schema In this approach slave subtasks form few disjoint sets (fig. 10). Each set has its own master task. Master tasks of sets are controlled by single highest-level master task. Fig. 10. 
Hierarchical master/slave schema

Decentralized planning

In this approach there is no master task. Subtasks communicate with each other according to some strategy (fig. 11): it may be randomly chosen subtasks or a small number of target subtasks (nearest neighbours). In the hybrid centralized-distributed approach a message is sent to the master task, which forwards it to slave tasks according to a round-robin strategy.

Fig. 11. Decentralized planning of computations

Dynamic balancing may be realized efficiently if the following recommendations are taken into account:
• when each CPU is loaded with a single subtask, the execution time of a parallel program is determined by the slowest subtask, so optimal performance is achieved when all subtasks have approximately the same size;
• balancing may also be provided by loading each CPU with several tasks.

Multithreading

A thread is a single sequential flow of control within a program; it is a sequence of instructions that is executed. Relationship of threads to a process:
• A process has a main thread which initializes the process and begins executing the instructions.
• Any thread can create other threads within the process.
• Each thread gets its own stack.
• All threads within the process share the code and data segments.

Threading problems:
• data races;
• deadlocks;
• load imbalance;
• livelocks.

Race conditions occur as a result of dependencies in which multiple threads attempt to update the same memory location or variable after threading. They may not be apparent at all times. The two possible conflicts that can arise as a result of data races are the read/write conflict and the write/write conflict.

There are two ways to prevent data races in multithreaded applications:
• scope variables to be local to each thread (variables declared within threaded functions, allocated on the thread's stack, etc.);
• control concurrent access by using critical regions (examples of synchronization objects that can be used are mutex, semaphore, event, critical section).

Race conditions may be hidden behind programming language syntax. Some examples:
• Thread 1: X += 1, Thread 2: X += 2 — the compiler expands += into a separate read and write of X.
• Thread 1: vec[i] += 1, Thread 2: vec[j] += 2 — the subscripts i and j may be equal.
• Thread 1: *p1 += 1, Thread 2: *p2 += 2 — the pointers p1 and p2 might point to the same location.
• Thread 1: Func(1), Thread 2: Func(2) — Func might be adding its argument to a hidden shared variable.
• Thread 1: add [abc], 1, Thread 2: add [abc], 2 — at the instruction level the hardware expands an update of [abc] into separate reads and writes.

Deadlock occurs when a thread waits for a condition that never occurs; most commonly it results from competition between threads for system resources held by other threads. Deadlock can occur only if all of the following conditions hold:
• access to each resource is exclusive;
• a thread is allowed to hold one resource while requesting another;
• no thread is willing to relinquish a resource that it has acquired;
• there is a cycle of threads trying to acquire resources, where each resource is held by one thread and requested by another.

Livelock is a situation in which a thread does not make progress on its computations, yet is not blocked or waiting: threads keep trying to overcome an obstacle presented by another thread that is doing the same thing.

Part 2
OpenMP

OpenMP is an API (Application Programming Interface) for shared-memory multiprocessor and multicore computing systems.
Multithreaded programming on C, C++ and Fortran programming languages are supported. Model of a parallel program in OpenMP Model of a parallel program in OpenMP may be formulated as follows (fig. 12): • Program consists of sequential and parallel sections. • At the starting moment of execution master thread is created which perform sequential sections of a program. • In order to start multi-threaded execution of a parallel section fork is performed which creates a set of threads. Each thread has its own unique numerical identifier (master thread has 0). When loop is parallelized all threads execute same code. In general threads may execute different parts of code. • After completion of execution of a parallel section join–operation is performed. All threads except master stop their execution. Fig. 12. Model of a parallel program in OpenMP OpenMP consists of the following components: • Compiler directives are used to create threads, for worksharing among threads and their synchronization. Directives are included in a parallel program. • Runtime subroutines are used for setting and getting of attributes of threads. Calls of runtime subroutines are included in a parallel program. • Environment variables are used to control parallel program execution. Environment variables let to set environment of execution of a parallel program. Any operational system and/or command interpreter has its own commands to set environment variables. 22 Using compiler directives and runtime libraries a programmer has to follow rules which may be different in different programming languages. A set of such rules is called programming language binding. Fortran bindings Names of subprograms and compiler directives in Fortran as well as names of environment variables begin with OMP or OMP_. Compiler directive is following: {!|C|*}$OMP directive [operator_1[, operator_2, …]] Directive begins at first (fixed format of a source code in Fortran 77) or any (free format) position. Directive may be continued to next string. In this case it is necessary to conform to the standard rules of indication of a statement continuation for that version of language which is used to write a program (non-blank symbol in the 6th position of the continuation string in fixed format or ampersand in free-form format). Example of OpenMP program (Fortran) program omp_example integer i, k, N real*4 sum, h, x print *, "Please, type in N:" read *, N h = 1.0 / N sum = 0.0 C$OMP PARALLEL DO SCHEDULE(STATIC) REDUCTION(+:sum) do i = 1, N x = i * h sum = sum + 1.e0 * h / (1.e0 + x**2) end do print *, 4.0 * sum end C bindings Function names, pragmas and names of environment variables OpenMP in C begins form omp, omp_ or OMP_. Compiler directive is following: 23 #pragma omp directive [operator_1[, operator_2, …]] In OpenMP-programs header file omp.h has to be used. Example of OpenMP program (in C) #include "omp.h" #include <stdio.h> double f(double x) { return 4.0 / (1 + x * x); } main () { const long N = 100000; long i; double h, sum, x; sum = 0; h = 1.0 / N; #pragma omp parallel shared(h) { #pragma omp for private(x) reduction(+:sum) for (i = 0; i < N; i++) { x = h * (i + 0.5); sum = sum + f(x); } } printf("PI = %f\n", sum / N); } OpenMP directives Descriptions of OpenMP directives (version 2.5) are given. parallel … end parallel Defines parallel section of a program. It may be used with following statements (their descriptions you’ll find later in the text): 24 • private; • • shared; default; • firstprivate; • reduction; • if; • copyin; • num_threads. 
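As a further illustration of the fork–join model and of the parallel directive described above, the following C sketch creates a team of threads and lets each thread report its identifier; the master thread has identifier 0. The runtime functions omp_get_thread_num and omp_get_num_threads used here are described later in this part; the example itself is an illustration added here, not taken from the original text.

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Sequential section: only the master thread runs here. */
    printf("Before the parallel section\n");

#pragma omp parallel num_threads(4)
    {
        /* Parallel section: executed by every thread of the team. */
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", id, nthreads);
    }

    /* Implicit join: only the master thread continues. */
    printf("After the parallel section\n");
    return 0;
}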
do loop do end do #pragma omp for loop for Defines loop which has to be parallelized (in Fortran and C). It may be used with following statements: • private; • firstprivate; • lastprivate; • reduction; • • schedule; ordered; • nowait. sections … end sections Defines parallel section of a program. Nested sections being defined by section directives are distributed between threads. It may be used with following statements: • private; • firstprivate; • lastprivate; • reduction; • nowait. 25 section Defines part of parallel sections which must be executed in one thread. single … end single Defines section of a program which has to be executed by a single thread. It may be used with following statements: • private; • firstprivate; • copyprivate; • nowait. workshare … end workshare Divides block of a program into parts which may be executed by threads only once. Block may include only following constructs: • arrays assignments; • scalar assignments; • FORALL; • WHERE; • atomic; • critical; • parallel. parallel do loop do end parallel do Combines directives parallel and do. parallel sections … end parallel sections Combines directives parallel and sections. 26 parallel workshare … end parallel workshare Combines directives parallel and workshare. master … end master Defines block which has to be executed by master thread. critical[(lock)] … end critical[(lock)] Defines block of a program which may be accessed by single thread (critical section). Lock – unnecessary name of the critical section. barrier Barrier synchronization of threads. Every thread which execution reaches given point suspends until all other threads reach the same point of execution. atomic Defines operation as atomic (when atomic operation is executed simultaneous access to memory from different threads to write is prohibited). It may be applied only to statement which is situated immediately after this directive. It has following format: • x = x {+|-|*|/|.AND.|.OR.|.EQV.|.NEQV.} scalar_expression_without_x • x = scalar_expression_without_x {+||*|/|.AND.|.OR.|.EQV.|.NEQV.} x • x = {MAX|MIN|IAND|IOR|IEOR} (x, scalar_expression_without_x) • x = {MAX|MIN|IAND|IOR|IEOR} (scalar_expression_without_x, x) flush[(list of variables)] Sets a synchronization point where values of variables from the list and accessible from the thread are written in memory. Provides coherence of memory content which is accessible from different threads. 27 ordered … end ordered Supplies keeping of those execution order of a loop iterations which corresponds to sequential execution order. threadprivate(list of common-blocks) Defines common blocks in the list as local. OpenMP statements OpenMP statements are used together with directives. private(list of variables) Defines variables in the list as local. firstprivate(list of variables) Defines variables in the list as local and initializes them by values from block preceding this directive. lastprivate(list of variables) Defines variables in the list as local and assigns them values from that block of a program which was executed last. copyprivate(list of variables) After end of execution of a block which is defined by single directive values of local variables from the list are distributed among other threads. nowait Cancels barrier synchronization at the end of parallel section. shared(list of variables) Defines variables in the list as shared by all threads. default(private|shared|none) Changes default rules of a scope of variables. Keyword private may be used only in Fortran. 
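A minimal illustrative C sketch combining several of the constructs described above: two independent blocks are distributed among threads with sections/section, a shared counter is updated safely with atomic, and output is serialized with critical. This example is an addition for illustration, not part of the original text.

#include <omp.h>
#include <stdio.h>

int main(void) {
    int counter = 0;

#pragma omp parallel shared(counter)
    {
#pragma omp sections
        {
#pragma omp section
            {
                /* First independent block, executed by one thread. */
#pragma omp atomic
                counter += 1;
            }
#pragma omp section
            {
                /* Second independent block, possibly executed by another thread. */
#pragma omp atomic
                counter += 2;
            }
        }   /* implicit barrier at the end of sections */

#pragma omp critical
        {
            /* Only one thread at a time enters this block. */
            printf("thread %d sees counter = %d\n",
                   omp_get_thread_num(), counter);
        }
    }
    return 0;
}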
28 reduction(operator|builtin function: list of variables) Reduces values of local variables from the list by means of an operator or built-in function of a language. Reduction is applied to few values and returns a single value. if(scalar logical expression) Conditional statement. num_threads(scalar integer expression) Sets number of threads. Alternative method of setting of a number of threads is usage of the environment variable OMP_NUM_THREADS. schedule(method_of_distribution_of_iterations [, number_of_loop_iterations]) Defines method of distribution of loop iterations among threads: • static – number of a loop iterations for each thread is fixed and is distributed among threads according to round robin planning. If number of iterations is not given it is set to 1; • dynamic – number of a loop iterations for each thread is fixed. Next chunk of iterations is delivered to a thread which became free; • guided – number of loop iterations for each thread decreases. Next chunk of iterations is delivered to a thread which became free; • runtime – method of worksharing is defined at the execution time, by means of environment variable OMP_SCHEDULE. copyin(list of common-blocks) Data are copying from the master thread to local common-blocks of every other thread at the beginning of parallel sections. Names are placed between «/» symbols. OpenMP subroutines Subprograms which form execution environment for a parallel program From now on at first place C interface of OpenMP subprograms is given and second is Fortran interface. 29 void omp_set_num_threads(int threads); subroutine omp_set_num_threads(threads) integer threads Sets number of threads which are used to execute parallel sections of a program. int omp_get_num_threads(void); integer function omp_get_num_threads() Returns number of threads which are used to execute parallel sections. int omp_get_max_threads(void); integer function omp_get_max_threads() Returns maximum number of threads which may be used to execute parallel sections of a program. int omp_get_thread_num(void); integer function omp_get_thread_num() Returns identifier of a thread which is called the function. int omp_get_num_procs(void); integer function omp_get_num_procs() Returns number of processors which may be used by a program. int omp_in_parallel(void); logical function omp_in_parallel() Returns true if call is made from an active parallel section of a program. void omp_set_dynamic(int threads); subroutine omp_set_dynamic(threads) logical threads Turns on or out dynamic assignment of threads number which are used to execute parallel sections of a program. By default this opportunity is disabled. int omp_get_dynamic(void); logical function omp_get_dynamic() Returns true if dynamic assignment of threads number is allowed. 30 void omp_set_nested(int nested); subroutine omp_set_nested(nested) integer nested Turns on or out nested parallelism. By default this opportunity is disabled. int omp_get_nested(void); logical function omp_get_nested() Checks if nested parallelism is allowed. Subprograms for operations with locks Locks are used to prevent effects leading to unpredictable behaviour of a program. It may be a result of data races when two or more threads have access to the same variable. void omp_init_lock(omp_lock_t *lock); subroutine omp_init_lock(lock) integer(kind = omp_lock_kind) :: lock Initializes lock associated with lock identifier to use it in subsequent calls. 
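The following illustrative C sketch shows omp_init_lock together with the remaining lock routines described next (omp_set_lock, omp_unset_lock, omp_destroy_lock): a simple lock protects a shared counter against the data races discussed earlier. The example is an addition, not part of the original text.

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_lock_t lock;
    int shared_counter = 0;

    omp_init_lock(&lock);            /* make the lock usable            */

#pragma omp parallel
    {
        int i;
        for (i = 0; i < 1000; i++) {
            omp_set_lock(&lock);     /* wait until the lock is free,
                                        then become its owner           */
            shared_counter++;        /* protected update                */
            omp_unset_lock(&lock);   /* release ownership               */
        }
    }

    omp_destroy_lock(&lock);         /* the lock becomes undefined      */
    printf("counter = %d\n", shared_counter);
    return 0;
}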
void omp_destroy_lock(omp_lock_t *lock); subroutine omp_destroy_lock(lock) integer(kind = omp_lock_kind) :: lock Makes locks associated with lock identifier undefined. void omp_set_lock(omp_lock_t *lock); subroutine omp_set_lock(lock) integer(kind = omp_lock_kind) :: lock Changes state of a thread form execution state to wait until lock associated with identifier lock will be available. Thread becomes owner of available lock. void omp_unset_lock(omp_lock_t *lock); subroutine omp_unset_lock(lock) integer(kind = omp_lock_kind) :: lock When this call is completed the thread stops to be owner of the lock associated with identifier lock. If the thread was not owner of the lock result will be undefined. 31 int omp_test_lock(omp_lock_t *lock); logical function omp_test_lock(lock) integer(kind = omp_lock_kind) :: lock Returns true if the lock is associated with identifier lock. void omp_init_nest_lock(omp_nest_lock_t *lock); subroutine omp_init_nest_lock(lock) integer(kind = omp_nest_lock_kind) :: lock Initializes nested lock associated with identifier lock. void omp_destroy_nest_lock(omp_nest_lock_t *lock); subroutine omp_destroy_nest_lock(lock) integer(kind = omp_nest_lock_kind) :: lock Sets nested lock associated with identifier lock as undefined. void omp_set_nest_lock(omp_nest_lock_t *lock); subroutine omp_set_nest_lock(lock) integer(kind = omp_nest_lock_kind) :: lock Changes state of threads from execution to wait until nested lock associated with identifier lock will be available. Thread becomes owner of available lock. void omp_unset_nest_lock(omp_nest_lock_t *lock); subroutine omp_unset_nest_lock(lock) integer(kind = omp_nest_lock_kind) :: lock Releases executing thread from being owner of nested lock associated with identifier lock. If the thread was not owner of the lock result will be undefined. int omp_test_nest_lock(omp_nest_lock_t *lock); integer function omp_test_nest_lock(lock) integer(kind = omp_nest_lock_kind) :: lock Checks if nested lock is associated with identifier lock. If lock is associated with the identifier counter’s value will be returned, otherwise 0 is returned. Timers Timers may be used to profile OpenMP programs. 32 double omp_get_wtime(void); double precision function omp_get_wtime() Returns time (in seconds) passed from some arbitrary moment in the past. Reference point is fixed during execution of the program. double omp_get_wtick(void); double precision function omp_get_wtick() Returns time (in seconds) passed between subsequent ticks. May be used as a measure of accuracy of the timer. OpenMP environment variables Environment variables may be set as follows: • export VARIABLE=value (in UNIX) • set VARIABLE=value (in Microsoft Windows) OMP_NUM_THREADS Sets number of threads on execution of parallel sections of a program. OMP_SCHEDULE Sets method of distribution of a loop iterations among threads. Possible values: • static; • dynamic; • guided. Number of iterations (optional parameter) is used after one of these keywords separated by comma, for example: export OMP_SCHEDULE=”static, 10” OMP_DYNAMIC If the variable has false value dynamical distribution of loop iterations is not allowed. OMP_NESTED If the variable has value false nested parallelism is not allowed. 33 Part 3 Message Passing Interface 34 Message Passing Interface (MPI) is specification which defines how an implementation of message passing system should be organized. Below description of free realization of MPI 1– MPICH 1.2.7 is given. 
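Before the detailed description of the bindings, the following minimal C sketch (an illustration, not from the original text) shows the overall shape of an MPI program in the master/slave style sketched later in this part; the routines used here — MPI_Init, MPI_Comm_rank, MPI_Comm_size and MPI_Finalize — are described on the following pages.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);                  /* must precede all other MPI calls */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process             */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes              */

    if (rank == 0)
        printf("Master: %d processes started\n", size);   /* master clause */
    else
        printf("Slave %d reporting\n", rank);              /* slave clause  */

    MPI_Finalize();                          /* no MPI calls after this          */
    return 0;
}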
Fortran bindings Names of subroutines and named constants in MPI programs written in Fortran begin with symbols MPI_. Exit code is returned by additional integer parameter (last argument). Successful exit code is MPI_SUCCESS. Definitions (for example, definitions of named constants) are in the header file mpif.h which must be included in MPI program by statement include. In some subroutines special variable status is used which is integer array having size MPI_STATUS_SIZE. In calls of MPI subroutines MPI data types nave to be used. Most of them have correspondent data types of Fortran (see Table 1) Table 1. MPI data types in Fortran language Data type in MPI Data type in Fortran MPI_INTEGER Integer MPI_REAL Real MPI_DOUBLE_PRECISION Double precision MPI_DOUBLE_COMPLEX Double complex MPI_COMPLEX Complex MPI_LOGICAL Logical MPI_CHARACTER Character MPI_BYTE - MPI_PACKED - Data types which may not exist in some MPI realizations MPI_INTEGER1 Integer*1 MPI_INTEGER2 Integer*2 MPI_INTEGER4 Integer*4 MPI_REAL4 Real*4 MPI_REAL8 Real*8 Data types MPI_Datatype and MPI_Comm  are simulated by standard integer type of Fortran (integer). 35 In C programs library MPI functions are used whereas in Fortran  subroutines. C bindings In C programs names of subprograms have the following form: Class_action_subset or Class_action. In C++ methods of some class are used and their names have the following form: MPI::Class::action_subset. Some actions have special names: Create  creation of a new object, Get  getting of information about object, Set  setting of parameters of an object, Delete  removal of information, Is  inquiry if given object has given properties. Names of MPI constants are written in uppercase. Their definitions are included in the header file mpi.h. Input parameters are passed by value and output (and INOUT) by reference. Correspondence between MPI data types and standard data types of C is given in the table 2. Table 2. MPI data types in C language Data type in MPI Data type in C MPI_CHAR signed char MPI_SHORT signed short int MPI_INT signed int MPI_LONG signed long int MPI_UNSIGNED_CHAR unsigned char MPI_UNSIGNED_SHORT unsigned short int MPI_UNSIGNED unsigned int MPI_UNSIGNED_LONG unsigned long int MPI_FLOAT float MPI_DOUBLE double MPI_LONG_DOUBLE long double MPI_BYTE - MPI_PACKED - 36 Exit codes In MPI specific exit codes for subprograms are used. Some of these codes are: MPI_SUCCESS  successful completion, MPI_ERR_OTHER  most often reason is repeated call of MPI_Init. In place of numeric codes named constants may be used: • MPI_ERR_BUFFER  wrong pointer at the buffer; • MPI_ERR_COMM  wrong communicator; • MPI_ERR_RANK  wrong rank; • MPI_ERR_OP  wrong operation; • MPI_ERR_ARG  wrong argument; • MPI_ERR_UNKNOWN  unknown error; • MPI_ERR_TRUNCATE  message truncated during receive; • MPI_ERR_INTERN  internal error. Common reason is lack of memory. Basic concepts of MPI programming Communicator is a set of processes which are a whole set of processes of a parallel MPI program in a time of its execution or a subset with a common context of execution (fig. 13). Only processes in a same communicator may be involved in point-to-point or collective exchanges. Every communicator has name. There are few standard communicators: • MPI_COMM_WORLD – includes all processes; • MPI_COMM_SELF – includes only given process; • MPI_COMM_NULL – empty communicator. A new communicator may be created by means of special calls. In this case it may include a subset of processes. Fig. 13. 
Communicator 37 Rank is a unique numeric identifier which is assigned to each process of the same parallel program. It has integer value from 0 to number_of_processes – 1 (fig. 14). Fig. 14. Ranks of parallel processes Message tag is a unique numeric identifier which may be assigned to a message. Tags are used to distinguish messages. Joker MPI_ANY_TAG may be used if a tag doesn’t play any role in an exchange. Common structure of an MPI program: program para … if (process = master) then master clause else slave clause endif end 38 MPI subprograms Miscellaneous subprograms Initializing of MPI int MPI_Init(int *argc, char **argv) MPI_INIT(IERR) Arguments argc and argv are used only in C programs. In this case they are number of arguments of a command line used to run the program and array of these arguments. This call precedes all other MPI subprograms calls. Finalizing of MPI int MPI_Finalize() MPI_FINALIZE(IERR) Finalizes MPI. After this call is completed no MPI subprogram could be used. MPI_FINALIZE must be called by every process before it will stop its execution. Getting of a number of processes int MPI_Comm_size(MPI_Comm comm, int *size) MPI_COMM_SIZE(COMM., SIZE, IERR) Input parameter: comm  communicator. Output parameters: • • size  number of processes in a communicator. Getting of rank of a process int MPI_Comm_rank(MPI_Comm comm, int *rank) MPI_COMM_RANK(COMM, RANK, IERR) Input parameter: comm  communicator. Output parameter: • • rank  rank of the process in a communicator. 39 Getting name of computing node which executes calling process MPI_Get_processor_name(char *name, int *resultlen) MPI_GET_PROCESSOR_NAME(NAME, RESULTLEN, IERR) Output parameters: • name  identifier of computing MPI_MAX_PROCESSOR_NAME elements; • resultlen  name length. node. Arrays has at least Time passed from an arbitrary moment in the past double MPI_Wtime() MPI_WTIME(TIME, IERR) Point-to-point exchanges Point-to-point exchange involves only two processes: source and target (fig. 15). In this section interfaces of subprograms for point-to-point exchange are described. Fig. 15. Point-to-point send-receive operation Standard block send int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR) Input parameters: • buf  address of first element in send buffer; • count  number of elements in the send buffer; • datatype  MPI data type of elements to be sent; • dest  rank of target process (integer from 0 to n – 1, where n  number of processes in a communicator); 40 • tag  message tag; • comm  communicator; • ierr  exit code. Standard block send int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERR) Input parameters: • count  maximum number of elements in receive buffer. Actual number of elements may be defined by means of subroutine MPI_Get_count; • datatype  type of data to be received. Data types in send and receive calls have to be the same; • source  source rank. Special value MPI_ANY_SOURCE corresponds to arbitrary source rank value. Identifier which corresponds to arbitrary parameter value is called “joker”; • tag  tag of the message or joker MPI_ANY_TAG which corresponds to arbitrary tag value; • comm  communicator. Output parameters: • buf  address of the receive buffer. 
Size of the buffer has to be enough to store received message entirely otherwise receive ends with a fault (buffer overflow); status  exchange status. If received message is less than buffer only part of receive buffer is updated. • Getting size of received message (count) int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count) MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERR) Type of datatype argument has to be the same as those indicated in send call. Synchronous send int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR) Parameters of this subprogram are the same as in MPI_Send. 41 Buffered send int MPI_Bsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) MPI_BSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR) Parameters of this subprogram are the same as in MPI_Send. Buffer attachment int MPI_Buffer_attach(void *buf, size) MPI_BUFFER_ATTACH(BUF, SIZE, IERR) Output parameter: buf  buffer. Its size is size bytes. In Fortran buffer is variable or array. At a time only one buffer may be attached to the process. • Buffer detachment int MPI_Buffer_detach(void *buf, int *size) MPI_BUFFER_DETACH(BUF, SIZE, IERR) Output parameters: • buf  address of the buffer; • size  size of the detached buffer. Call of this subprogram blocks process execution until all messages in receive buffer will be handled. In C this call doesn’t free buffer’s memory. Ready send int MPI_Rsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) MPI_RSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERR) Parameters of this subprogram are the same as in MPI_Send. Blocking test of message delivering int MPI_Probe(int source, int tag, MPI_Status *status) MPI_PROBE(SOURCE, TAG, COMM, STATUS, IERR) Input parameters: • source  source rank or joker; • tag  tag or joker; comm  communicator. Output parameter: • • status  status of operation. MPI_Comm comm, 42 Nonblocking test of message delivering int MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status) MPI_IPROBE(SOURCE, TAG, COMM, FLAG, STATUS, IERR) Input parameters of this subprogram are the same as in MPI_Probe. Output parameters: • flag  flag; • status  status. If message is delivered flag’s value true is returned. Blocking send and receive int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status) MPI_SENDRECV(SENDBUF, SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVBUF, RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS, IERR) Input parameters: • sendbuf  address of the send buffer; • sendcount  number of elements which have to be sent; • sendtype  data types of elements which have to be sent; • dest  rank of the target; • sendtag  tag of message which has to be sent; • recvbuf  address of the receive buffer; • recvcount  number of elements which have to be received; • recvtype  data types of elements which have to be received; • source  rank of the source; • recvtag  tag of message which has to be received; comm  communicator. Output parameter: • status  status of receive operation. Receive and send operations use the same communicator. Send and receive buffers must not overlap. Buffers may have different size. Data types of sending and receiving data also may be different. 
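An illustrative C sketch of a blocking point-to-point exchange built from the calls described above: process 0 sends four integers to process 1, which receives them with MPI_Recv and determines the actual message size with MPI_Get_count. The example assumes the program is started with at least two processes; it is not part of the original text.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = { 1, 2, 3, 4 };
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Standard blocking send: 4 integers, tag 99, to process 1. */
        MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int buf[4], count;
        /* Blocking receive; the buffer is large enough for the message. */
        MPI_Recv(buf, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        /* Actual number of received elements, taken from the status. */
        MPI_Get_count(&status, MPI_INT, &count);
        printf("process 1 received %d integers, first = %d\n", count, buf[0]);
    }

    MPI_Finalize();
    return 0;
}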
• 43 Blocking send and receive with common buffer for send and receive int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status) MPI_SENDRECV_REPLACE(BUF, COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG, COMM, STATUS, IERR) Input parameters: • count  number of elements to be sent and size of receive buffer; • datatype  type of data in receive and send buffer; • dest  rank of the target; • sendtag  tag of message to be sent; • source  rank of the source; • recvtag  tag of message to be received; comm  communicator. Output parameters: • • buf  address of send and receive buffer; status  status of receive. Message which has to be received must not be larger (in size) than message being sent. Data types of elements in send and receive have to be the same. Order of send and receive is chosen automatically by the system. • Initialization of nonblocking standard send int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERR) Input parameters of this subprogram are the same as in MPI_Send. Output parameter: • request  identifier of operation. Initialization of nonblocking synchronous standard send int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) MPI_ISSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERR) Parameters of this subprogram are the same as in MPI_Send. 44 Nonblocking send with bufferization int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) MPI_IBSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERR) Nonblocking ready send int MPI_Irsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_request *request) MPI_IRSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERR) Parameters of nonblocking send subprograms are the same as in previously described subprograms. Initialization of nonblocking receive int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERR) Parameters in this subprogram are the same as in previously described subprograms with the exception of source which is rank of a source process. Blocking of a process execution until receive or send is completed int MPI_Wait(MPI_Request *request, MPI_Status *status) MPI_WAIT(REQUEST, STATUS, IERR) Input parameter: • request  identifier of message passing operation. Output parameter: • status  status of completed operation. Status value of send operation may be obtained by means of MPI_Test_cancelled. Subprogram MPI_Wait may be called with empty or inactive parameter request. In this case operation completes immediately with empty status. Successful completion of MPI_Wait after call of MPI_Ibsend means that send buffer may be used again and data have been sent or copied in send buffer attached by call of MPI_Buffer_attach. Send might not be cancelled if buffer is attached. If receive is not registered buffer might not be released. In this case MPI_Cancel releases memory allocated for communication subsystem. 
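A short illustrative sketch of a nonblocking exchange: each of two processes posts MPI_Irecv and MPI_Isend and later completes both operations with MPI_Wait, so useful computation could be overlapped with communication. It assumes exactly two processes and is not part of the original text.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, other, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other   = 1 - rank;          /* partner process (ranks 0 and 1) */
    sendval = 100 + rank;

    /* Initialize nonblocking receive and send; both calls return immediately. */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Useful computation could be placed here, overlapping the exchange. */

    MPI_Wait(&reqs[0], &status);  /* block until the receive is completed */
    MPI_Wait(&reqs[1], &status);  /* block until the send is completed    */

    printf("process %d got %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}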
45 Nonblocking check of message receive or send completion int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status) MPI_TEST(REQUEST, FLAG, STATUS, IERR) Input parameter: • request  identifier of message passing operation. Output parameters: • flag  true if operation associated with request identifier is completed; • status  status of completed operation. If call of MPI_Test uses empty or inactive parameter request operation returns flag’s value true and empty status. Test of completion of all exchanges int MPI_Waitall(int count, MPI_Request requests[], MPI_Status statuses[]) MPI_WAITALL(COUNT, REQUESTS, STATUSES, IERR) Execution of a process is blocked until all exchanges associated with active requests in array requests will be completed. Status of all operations is returned. It is placed into array statuses. count is number of exchange requests (size of arrays requests and statuses). As a result of MPI_Waitall execution requests generated by nonblocking exchange operations are cancelled and corresponding elements of array get value MPI_REQUEST_NULL. List may include empty or inactive requests. Each request gets empty status value. If one or more exchanges are failed MPI_Waitall returns exit code MPI_ERR_IN_STATUS and assigns error code to error field of status of corresponding operation. If exchange is successful the field gets value MPI_SUCCESS. If exchange wasn’t successful the field gets value MPI_ERR_PENDING. The last case corresponds to existence of requests on execution waiting for processing. Nonblocking test of exchanges completion int MPI_Testall(int count, MPI_Request requests[], int *flag, MPI_Status statuses[]) MPI_TESTALL(COUNT, REQUESTS, FLAG, STATUSES, IERR) On return flag (flag) gets true value if all exchanges associated with active requests in array requests are completed. If only part of exchanges is completed flag gets false value and array statuses is indefinite. count  number of requests. 46 Each status which corresponds to existing active request gets status of corresponding exchange. If request was issued by a nonblocking exchange operation it will be cancelled and corresponding array element gets value MPI_REQUEST_NULL. Each status which corresponds to empty or inactive requests gets empty value. Blocking test of completion of arbitrary number of exchanges int MPI_Waitany(int count, MPI_Request requests[], int *index, MPI_Status *status) MPI_WAITANY(COUNT, REQUESTS, INDEX, STATUS, IERR) Execution of a process is blocked until at least one exchange from array (requests) will be completed. Input parameters: • requests  request; • count  number of elements in array requests. Output parameters: • index  index of the request (in C language an integer number from 0 to count – 1, in Fortran an integer from 1 to count) in array requests; status  exchange status. If request was issued by a nonblocking exchange operation it will be cancelled and corresponding array element gets value MPI_REQUEST_NULL. Array of requests may include empty or inactive requests. If the list does not include active requests or it is empty subroutine completes immediately with index equal to MPI_UNDEFINED and empty status. • Test of completion of any previously initialized exchange int MPI_Testany(int count, MPI_Request requests[], int *index, int *flag, MPI_Status *status) MPI_TESTANY(COUNT, REQUESTS, INDEX, FLAG, STATUS, IERR) Arguments of this subprogram are the same as in MPI_Waitany. Extra argument flag gets value true if one of operations is completed. 
Blocking subprogram MPI_Waitany and nonblocking subprogram MPI_Testany are interchangeable as other similar pairs of subprograms. Subprograms MPI_Waitsome and MPI_Testsome work similarly to MPI_Waitany and MPI_Testany except case when two or more exchanges are completed. In subprograms MPI_Waitany and MPI_Testany exchange from a list of completed exchanges is chosen arbitrary. For this exchange status is returned. MPI_Waitsome and MPI_Testsome return status for all completed exchanges. These subprograms may be used to define how many exchanges are completed: 47 int MPI_Waitsome(int incount, MPI_Request requests[], int *outcount, int indices[], MPI_Status statuses[]) MPI_WAITSOME(INCOUNT, REQUESTS, OUTCOUNT, INDICES, STATUSES, IERR) Here incount is number of requests. In outcount number of completed requests from array requests is returned. In first outcount elements of array indices indices of this operations are returned. In first outcount elements of array statuses statuses of completed operations are returned. If completed request was issued by nonblocking exchange operation it is cancelled. If a list does not include active requests execution of the subprogram will be completed immediately and parameter outcount will get value MPI_UNDEFINED. Nonblocking check of exchange completion int MPI_Testsome(int incount, MPI_Request requests[], int *outcount, int indices[], MPI_Status statuses[]) MPI_TESTSOME(INCOUNT, REQUESTS, OUTCOUNT, INDICES, STATUSES, IERR) Arguments are the same as in subprogram MPI_Waitsome. Subprogram MPI_Testsome is more efficient than MPI_Testany because the first one returns information about all operations in one call but the second requires a new call for each completed operation. Creation of request for standard send int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) MPI_SEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERR) Input parameters: • buf  address of the send buffer; • count  number of elements which have to be sent; • datatype  type of elements; • dest  target rank; • tag  message tag; comm  communicator. Output parameter: • • request  request for exchange operation. 48 Initialization of pending exchange int MPI_Start(MPI_Request *request) MPI_START(REQUEST, IERR) Input parameter: • request  request for exchange operation. Call of MPI_Start with request for exchange which was created by MPI_Send_init initializes exchange with the same properties as exchange performed by MPI_Isend. Call of MPI_Start with request for exchange which was created by MPI_Bsend_init initializes exchange with the same properties as one performed by of MPI_Ibsend. Message passed by means of operation initialized by MPI_Start may be received by any receive subprogram. Initialization of exchanges associated with requests (in array requests) for execution of nonblocking exchange operation int MPI_Startall(int count, MPI_request *requests) MPI_STARTALL(COUNT, REQUESTS, IERR) Cancelling of pending nonblocking exchanges int MPI_Cancel(MPI_request *request) MPI_CANCEL(REQUEST, IERR) MPI_Cancel may be used to cancel exchanges which use both pending and ordinary requests. After call of MPI_Cancel and subsequent calls MPI_Wait and MPI_Test request for exchange operation becomes inactive and may be reactivated for new exchange. Information about cancelled operation is placed in status. 
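An illustrative sketch of a pending (persistent) exchange built from the calls above: the send request is created once with MPI_Send_init, re-activated with MPI_Start on every iteration and completed with MPI_Wait; MPI_Request_free, used at the end, is described just below. Two processes are assumed; the example is not part of the original text.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, step, value = 0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Create the request once; no data are sent yet. */
        MPI_Send_init(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD, &req);
        for (step = 0; step < 3; step++) {
            value = step * step;
            MPI_Start(&req);          /* activate the pending send           */
            MPI_Wait(&req, &status);  /* the request becomes inactive again  */
        }
        MPI_Request_free(&req);       /* release the persistent request      */
    } else if (rank == 1) {
        for (step = 0; step < 3; step++) {
            MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &status);
            printf("step %d: received %d\n", step, value);
        }
    }

    MPI_Finalize();
    return 0;
}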
Check if exchange associated with a given status is cancelled int MPI_Test_cancelled(MPI_Status *status, int *flag) MPI_TEST_CANCELLED(STATUS, FLAG, IERR) Cancelling of a request (request) for exchange operation int MPI_Request_free(MPI_Request *request) MPI_REQUEST_FREE(REQUEST, IERR) Call of this subprogram marks request for exchange to cancel and assign it value MPI_REQUEST_NULL. Exchange operation associated with this request may be completed. The request is cancelled only when exchange is completed. Collective exchange operations Collective exchanges involve two or more processes. 49 Broadcast send (fig. 16) int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm) MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERR) Arguments of this subprogram are input and output at the same time: • buffer  address of the buffer; • count  number of elements which have to be sent/received; • datatype  data type in MPI; • root  rank of process which broadcasts data; • comm  communicator. Fig. 16. Broadcast send operation Barrier synchronization (fig. 17) int MPI_Barrier(MPI_Comm comm) MPI_BARRIER(COMM, IERR) 50 Fig. 17. Barrier synchronization Data scattering (fig. 18) int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *rcvbuf, int rcvcount, MPI_Datatype rcvtype, int root, MPI_Comm comm) MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF, RCVCOUNT, RCVTYPE, ROOT, COMM, IERR) Input parameters: • sendbuf  address of the send buffer; • sendcount  number of elements which have to be sent to each process (not total number of elements to be sent); • sendtype  data types of elements which have to be sent; • rcvcount  number of elements which have to be received; • rcvtype  data types of elements which have to be received; • root  rank of sending process; comm  communicator. Output parameter: • • rcvbuf  address of the receive buffer. Process with rank root distributes send buffer sendbuf among all processes. Content of the buffer is divided into few parts. Each part consists of sendcount elements. First part goes to process 0, second part goes to process 1 etc. Argument send has meaning only on side of main process root. 51 Fig. 18. Data scattering Gathering of messages (fig. 19) int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *rcvbuf, int rcvcount, MPI_Datatype rcvtype, int root, MPI_Comm comm) MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF, RCVCOUNT, RCVTYPE, ROOT, COMM, IERR) Each process in communicator comm sends its buffer sendbuf to process with rank root. Process root merges received data in such a way that after data from process 0 follow data from process 1, then data from process 2 and so on. Arguments rcvbuf, rcvcount and rcvtype have meaning only on side of main process. Argument rcvcount is equal to number of data received from each process (but not total number). When subprograms MPI_Scatter and MPI_Gather are called in different processes it is necessary to use common main process. 52 Fig. 19. Data gathering Vector data scattering int MPI_Scatterv(void *sendbuf, int *sendcounts, int *displs, MPI_Datatype sendtype, void *rcvbuf, int rcvcount, MPI_Datatype rcvtype, int root, MPI_Comm comm) MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RCVBUF, RCVCOUNT, RCVTYPE, ROOT, COMM, IERR) Input parameters: • sendbuf  address of the send buffer; • sendcounts  1-dimensional integer array which contains number of elements to be sent to each process (index is equal to rank of a target process). 
Its size is equal to the number of processes in the communicator;
• displs – 1-dimensional integer array. Its size is equal to the number of processes in the communicator. The element with index i sets the displacement, relative to the beginning of the send buffer, of the data to be sent to the process with rank i;
• sendtype – data type of elements which have to be sent;
• rcvcount – number of elements which have to be received;
• rcvtype – data type of elements which have to be received;
• root – rank of the source process;
• comm – communicator.
Output parameter:
• rcvbuf – address of the receive buffer.

Gathering of data from all processes in a communicator and writing to the receive buffer with given displacements

int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_GATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS, RECVTYPE, ROOT, COMM, IERR)

Arguments of this subprogram are the same as in subprogram MPI_Scatterv. Exchanges performed by subprograms MPI_Allgather and MPI_Alltoall have no root process.

Gathering of data from all processes and scattering to all processes

int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *rcvbuf, int rcvcount, MPI_Datatype rcvtype, MPI_Comm comm)
MPI_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF, RCVCOUNT, RCVTYPE, COMM, IERR)

Input parameters:
• sendbuf – address of the send buffer;
• sendcount – number of elements which have to be sent;
• sendtype – data type of elements which have to be sent;
• rcvcount – number of elements which have to be received from each process;
• rcvtype – data type of elements which have to be received;
• comm – communicator.
Output parameter:
• rcvbuf – address of the receive buffer.

The chunk of data sent from the j-th process is received by each process and placed in the j-th block of the receive buffer rcvbuf.

Send "each to all"

int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *rcvbuf, int rcvcount, MPI_Datatype rcvtype, MPI_Comm comm)
MPI_ALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF, RCVCOUNT, RCVTYPE, COMM, IERR)

Input parameters:
• sendbuf – address of the send buffer;
• sendcount – number of elements which have to be sent to each process;
• sendtype – data type of elements which have to be sent;
• rcvcount – number of elements which have to be received;
• rcvtype – data type of elements which have to be received;
• comm – communicator.
Output parameter:
• rcvbuf – address of the receive buffer.

Subprograms MPI_Allgatherv and MPI_Alltoallv are the vector counterparts of subprograms MPI_Allgather and MPI_Alltoall.

Gathering data from all processes and sending to all processes

int MPI_Allgatherv(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *rcvbuf, int *rcvcounts, int *displs, MPI_Datatype rcvtype, MPI_Comm comm)
MPI_ALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RCVBUF, RCVCOUNTS, DISPLS, RCVTYPE, COMM, IERR)

Arguments of this subprogram are the same as in subprogram MPI_Allgather, except that rcvcounts is an array (number of elements received from each process) and an additional input argument displs is used. It is a 1-dimensional integer array whose size is equal to the number of processes in the communicator; the element with index i gives the displacement, relative to the beginning of the receive buffer rcvbuf, of the data received from process i. The chunk of data sent from the j-th process is received by each process and placed in the j-th block of the receive buffer.
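The collective subprograms described above are most often used together. The following sketch is an illustrative example, not taken from the text; the chunk size and the scaling factor are arbitrary values. It broadcasts a coefficient with MPI_Bcast, distributes an array with MPI_Scatter, and combines partial sums with MPI_Reduce, which is described in the next section.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define CHUNK 4   /* number of elements per process, arbitrary for the example */

int main(int argc, char *argv[])
{
    int rank, size, i;
    double *data = NULL, part[CHUNK], factor = 0.0;
    double local_sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* the root prepares size*CHUNK elements and a common scaling factor */
        data = (double *) malloc(size * CHUNK * sizeof(double));
        for (i = 0; i < size * CHUNK; i++)
            data[i] = i + 1.0;
        factor = 0.5;
    }

    /* broadcast the factor from the root to every process */
    MPI_Bcast(&factor, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* distribute equal chunks of the array among all processes */
    MPI_Scatter(data, CHUNK, MPI_DOUBLE, part, CHUNK, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < CHUNK; i++)            /* each process sums its scaled chunk */
        local_sum += factor * part[i];

    /* combine the partial sums at the root (MPI_Reduce is described below) */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("scaled sum = %f\n", total);
        free(data);
    }

    MPI_Finalize();
    return 0;
}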
All-to-all send with displacements

int MPI_Alltoallv(void *sendbuf, int *sendcounts, int *sdispls, MPI_Datatype sendtype, void *rcvbuf, int *rcvcounts, int *rdispls, MPI_Datatype rcvtype, MPI_Comm comm)
MPI_ALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE, RCVBUF, RCVCOUNTS, RDISPLS, RCVTYPE, COMM, IERR)

Arguments of this subprogram are the same as in subprogram MPI_Alltoall, except that sendcounts and rcvcounts are arrays and two displacement arrays are added:
• sdispls – 1-dimensional integer array. Its size is equal to the number of processes in the communicator. The j-th element gives the displacement, relative to the beginning of the send buffer, of the data sent to the j-th process;
• rdispls – 1-dimensional integer array. Its size is equal to the number of processes in the communicator. The i-th element gives the displacement, relative to the beginning of the receive buffer, of the data received from the i-th process.

Reduction (fig. 20)

int MPI_Reduce(void *buf, void *result, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
MPI_REDUCE(BUF, RESULT, COUNT, DATATYPE, OP, ROOT, COMM, IERR)

Input parameters:
• buf – address of the send buffer;
• count – number of elements which have to be sent;
• datatype – type of data to be sent;
• op – reduction operation;
• root – rank of the root process;
• comm – communicator.
Output parameter:
• result – address of the result buffer (significant only at the root process).

MPI_Reduce applies the reduction operation op to the operands from buf and places the result in the buffer result at the root process. MPI_Reduce has to be called by all processes in communicator comm, and the arguments count, datatype and op must be the same in all calls.

Fig. 20. Reduction operation

Predefined reduction operations are listed in table 3.

Table 3. Predefined reduction operations in MPI
MPI_MAX – maximum value (integer or real operands);
MPI_MIN – minimum value (integer or real operands);
MPI_SUM – sum of elements (integer, real or complex operands);
MPI_PROD – product of elements (integer, real or complex operands);
MPI_LAND – logical AND;
MPI_BAND – bitwise AND;
MPI_LOR – logical OR;
MPI_BOR – bitwise OR;
MPI_LXOR – logical exclusive OR;
MPI_BXOR – bitwise exclusive OR;
MPI_MAXLOC – maximum value and its index;
MPI_MINLOC – minimum value and its index.

Definition of a user global operation

int MPI_Op_create(MPI_User_function *function, int commute, MPI_Op *op)
MPI_OP_CREATE(FUNCTION, COMMUTE, OP, IERR)

Input parameters:
• function – user-defined function;
• commute – has the value true if the operation is commutative (the result does not depend on the order of operands).

The user function in C has the following form:

typedef void (MPI_User_function)(void *a, void *b, int *len, MPI_Datatype *dtype)

The operation is defined as follows: b[i] = a[i] op b[i] for i = 0, …, len – 1.

Deletion of a user operation

int MPI_Op_free(MPI_Op *op)
MPI_OP_FREE(OP, IERR)

When this call is completed op gets the value MPI_OP_NULL.

Simultaneous reduction and scattering

int MPI_Reduce_scatter(void *sendbuf, void *rcvbuf, int *rcvcounts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
MPI_REDUCE_SCATTER(SENDBUF, RCVBUF, RCVCOUNTS, DATATYPE, OP, COMM, IERR)

Input parameters:
• sendbuf – address of the send buffer;
• rcvcounts – 1-dimensional integer array which contains the number of elements of the resulting array sent to each process. This array must be the same in all processes which call this subprogram;
• datatype – type of data;
• op – reduction operation;
• comm – communicator.
Output parameter:
• rcvbuf – address of the receive buffer.
Each task receives only one chunk of resulting array. Gathering and writing of result of reduction operation at receive buffer of each process int MPI_Allreduce(void *sendbuf, void *rcvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) MPI_ALLREDUCE(SENDBUF, RCVBUF, COUNT, DATATYPE, OP, COMM, IERR) Input parameters: • sendbuf  address of the send buffer; • count  number of elements which have to be sent; • datatype  type of data to be sent; • op  reduction operation; comm  communicator. Output parameter: • • rcvbuf  address of the receive buffer. 58 In case of failure this subprogram may return exit code MPI_ERR_OP (incorrect operation). It takes places if operation is used which is not predefined nor created be preceding call of MPI_Op_create. Partial reduction int MPI_Scan(void *sendbuf, void *rcvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) MPI_SCAN(SENDBUF, RCVBUF, COUNT, DATATYPE, OP, COMM, IERR) Input parameters: • sendbuf  address of the send buffer; • count  number of elements in receive buffer; • datatype  type of data to be received; • op  operation; comm  communicator. Output parameter: • • rcvbuf  address of the receive buffer. Operations with communicators Standard communicator MPI_COMM_WORLD is created automatically at the start of parallel program. Other standard communicators: • MPI_COMM_SELF  communicator which includes only calling process; • MPI_COMM_NULL  null (empty) communicator. Getting access to group which is associated with communicator comm int MPI_Comm_group(MPI_Comm comm, MPI_Group *group) MPI_COMM_GROUP(COMM, GROUP, IERR) Output parameter  group. Any operation with a group is possible only after this call. Creation of new group newgroup which includes n processes from old group oldgroup int MPI_Group_incl(MPI_Group oldgroup, int n, int *ranks, MPI_Group *newgroup) MPI_GROUP_INCL(OLDGROUP, N, RANKS, NEWGROUP, IERR) Ranks of processes are placed in array ranks. New group will include processes with ranks ranks[0], …, ranks[n — 1]. Rank i in new group corresponds to rank ranks[i] in old group. If n = 0 empty group MPI_GROUP_EMPTY is created. 59 This subprogram makes it possible not only to create new group but to change order of processes in existing group. Creation of new group newgroup by means of exclusion from old group (group) of processes with ranks ranks[0], …, ranks[n — 1] int MPI_Group_excl(MPI_Group oldgroup, int n, int *ranks, MPI_Group *newgroup) MPI_GROUP_EXCL(OLDGROUP, N, RANKS, NEWGROUP, IERR) If n = 0 new group is identical to old group. Creation of new group newgroup from old group group by means of adding to it n processes with ranks listed in array ranks int MPI_Group_range_incl(MPI_Group oldgroup, int n, int ranks[][3], MPI_Group *newgroup) MPI_GROUP_RANGE_INCL(OLDGROUP, N, RANKS, NEWGROUP, IERR) Array ranks consists of integer triplets (first_1, last_1, step_1), …, (first_n, last_n, step_n). New group includes processes with ranks (in old group) first_1, first_1 + step_1, …. Creation of new group newgroup from group group by means of exclusion of n processes with ranks listed in array ranks int MPI_Group_range_excl(MPI_Group group, int n, int ranks[][3], MPI_Group *newgroup) MPI_GROUP_RANGE_EXCL(GROUP, N, RANKS, NEWGROUP, IERR) Array ranks has the same structure as similar array in subprogram MPI_Group_range_incl. 
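A typical use of the group constructors described above is to build a communicator for a subset of processes. The sketch below is illustrative only (the choice of the even-numbered ranks is arbitrary): it extracts the group of MPI_COMM_WORLD, puts the even ranks into a new group and turns this group into a communicator with MPI_Comm_create, which is described below.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i, n, newrank;
    int *ranks;
    MPI_Group world_group, even_group;
    MPI_Comm even_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* access the group associated with MPI_COMM_WORLD */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* list the ranks of the even-numbered processes */
    n = (size + 1) / 2;
    ranks = (int *) malloc(n * sizeof(int));
    for (i = 0; i < n; i++)
        ranks[i] = 2 * i;

    /* build a new group from the listed ranks */
    MPI_Group_incl(world_group, n, ranks, &even_group);

    /* create a communicator for the new group (MPI_Comm_create is described below);
       processes outside the group receive MPI_COMM_NULL */
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

    if (even_comm != MPI_COMM_NULL) {
        MPI_Comm_rank(even_comm, &newrank);
        printf("world rank %d has rank %d in the even communicator\n", rank, newrank);
        MPI_Comm_free(&even_comm);
    }

    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    free(ranks);

    MPI_Finalize();
    return 0;
}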
Creation of new group newgroup from result of subtraction of group1 and group2 int MPI_Group_difference(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup) MPI_GROUP_DIFFERENCE(GROUP1, GROUP2, NEWGROUP, IERR) Creation of new group newgroup from intersection of groups group1 and group2 int MPI_Group_intersection(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup) MPI_GROUP_INTERSECTION(GROUP1, GROUP2, NEWGROUP, IERR) 60 Creation of new group newgroup from union of groups group1 and group2 int MPI_Group_union(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup) MPI_GROUP_UNION(GROUP1, GROUP2, NEWGROUP, IERR) There are other constructors of new groups. Deletion of group group int MPI_Group_free(MPI_Group *group) MPI_GROUP_FREE(GROUP, IERR) Getting of number of processes size in group group int MPI_Group_size(MPI_Group group, int *size) MPI_GROUP_SIZE(GROUP, SIZE, IERR) Getting of rank rank of process in group group int MPI_Group_rank(MPI_Group group, int *rank) MPI_GROUP_RANK(GROUP, RANK, IERR) If process is not included in the group this subprogram returns MPI_UNDEFINED. Transformation of process rank in one group to its rank in other group int MPI_Group_translate_ranks(MPI_Group group1, int n, int *ranks1, MPI_Group group2, int *ranks2) MPI_GROUP_TRANSLATE_RANKS(GROUP1, N, RANKS1, GROUP2, RANKS2, IERR) Comparison of groups group1 and group2 int MPI_Group_compare(MPI_Group group1, MPI_Group group2, int *result) MPI_GROUP_COMPARE(GROUP1, GROUP2, RESULT, IERR) Returns MPI_IDENT if both groups are identical. Returns MPI_SIMILAR if processes in both groups are the same but their ranks differ. Returns MPI_UNEQUAL if groups include at least one pair of different processes. Reduplication of existing communicator oldcomm int MPI_Comm_dup(MPI_Comm oldcomm, MPI_Comm *newcomm) MPI_COMM_DUP(OLDCOMM, NEWCOMM, IERR) This call creates new communicator newcomm with the same group of processes and the same attributes as initial group but with different context of exchanges. It may be applied both to intra- and intercommunicator. 61 Creation of new communicator newcomm from subset of processes group of other communicator oldcomm int MPI_Comm_create(MPI_Comm oldcomm, MPI_Group group, MPI_Comm *newcomm) MPI_COMM_CREATE(OLDCOMM, GROUP, NEWCOMM, IERR) This call must be performed by all processes of initial communicator. Arguments have to be the same. If several communicators are created simultaneously they must be created in the same order by all processes. Creation of several communicators by splitting of a given communicator int MPI_Comm_split(MPI_Comm oldcomm, int split, int rank, MPI_Comm* newcomm) MPI_COMM_SPLIT(OLDCOMM, SPLIT, RANK, NEWCOMM, IERR) Group of processes associated with communicator oldcomm is splitted into nonintersecting subgroups. One subgroup is for each value of split. Processes with the same value of split form new group. Rank in new group is defined by value of rank. If processes A and B call MPI_Comm_split with the same value of split and argument rank passed by process A is less than value of argument passed by process B as a result rank A in group correspondent to new communicator will be less than rank of process B. If otherwise calls use the same value of rank then system assign ranks arbitrarily. For each group its own communicator newcomm is created. MPI_Comm_split has to be called by all processes of initial communicator even in case when they will not be included in new communicator. 
For that as a value of argument split in the call of this subprogram has to be used predefined named constant MPI_UNDEFINED. Correspondent processes will return MPI_COMM_NULL as new communicator. New communicators created by subprogram MPI_Comm_split do not intersect but by repeated calls of MPI_Comm_split it is possible to create also intersected communicators. Marking communicator comm to be deleted int MPI_Comm_free(MPI_Comm *comm) MPI_COMM_FREE(COMM, IERR) Exchanges associated with this communicator are completed as usual and communicator will be deleted only when it will not has active references to it. This operation may be applied both to intra- and intercommunicator. Comparison of communicators (comm1) and (comm2) int MPI_Comm_compare(MPI_Comm comm1, MPI_Comm comm2, int *result) MPI_COMM_COMPARE(COMM1, COMM2, RESULT, IERR) Output parameter: 62 • result  integer value which is equal to MPI_IDENT if contexts and groups associated with communicators coincide; MPI_CONGRUENT if coincide only groups; MPI_SIMILAR or MPI_UNEQUAL if nor groups nor contexts are the same. Empty communicator MPI_COMM_NULL may not be used as argument. Assigning to communicator comm a string name name int MPI_Comm_set_name(MPI_Comm com, char *name) MPI_COMM_SET_NAME(COM, NAME, IERR) Getting name of communicator int MPI_Comm_get_name(MPI_Comm comm, char *name, int *reslen) MPI_COMM_GET_NAME(COMM, NAME, RESLEN, IERR) Output parameters: • name  name of communicator comm; • reslen  length of the name. Name is array of characters. MPI_MAX_NAME_STRING. Its size must be greater than Check if communicator comm (input parameter) is an intercommunicator int MPI_Comm_test_inter(MPI_Comm comm, int *flag) MPI_COMM_TEST_INTER(COMM, FLAG, IERR) Output parameter: • flag  is true if communicator is an intercommunicator. Creation of intracommunicator newcomm from intercommunicator oldcomm int MPI_Intercomm_merge(MPI_Comm oldcomm, int high, MPI_Comm *newcomm) MPI_INTERCOMM_MERGE(OLDCOMM, HIGH, NEWCOMM, IERR) Argument high is used to unite groups of both intracommunicators in order to create new communicator. Getting access to remote group associated with intercommunicator comm int MPI_Comm_remote_group(MPI_Comm comm, MPI_Group *group) MPI_COMM_REMOTE_GROUP(COMM, GROUP, IERR) Output parameter: 63 • group  remote group. Getting size of remote group which is associated with intercommunicator comm int MPI_Comm_remote_size(MPI_Comm comm, int *size) MPI_COMM_REMOTE_SIZE(COMM, SIZE, IERR) Output parameter: • size  number of processes in communicator comm. Creation of intercommunicator int MPI_Intercomm_create(MPI_Comm local_comm, int local_leader, MPI_Comm peer_comm, int remote_leader, int tag, MPI_Comm *new_intercomm) MPI_INTERCOMM_CREATE(LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER, TAG, NEW_INTERCOMM, IERR) Input parameters: • local_comm  local intracommunicator; • local_leader  rank of leader in local communicator (usually 0); • peer_comm  remote communicator; • remote_leader  rank of leader in remote communicator (usually 0); • tag  tag of intercommunicator which is used by leaders of both groups for exchanges using context of parent communicator. Output parameter: new_intercomm  intercommunicator. Jokers must not be used as arguments. This call has to be performed in both groups of processes which have to be connected with each other. In each of this calls local intracommunicator is used which corresponds to given group of processes. Local and remote groups shouldn’t intersect otherwise deadlocks could appear. 
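Before turning to virtual topologies, a short sketch of MPI_Comm_split described above may be useful (it is illustrative; the splitting rule rank % 2 is an arbitrary choice). Each process selects a color and a key, all processes of MPI_COMM_WORLD call the subprogram, and each obtains a communicator for its own subgroup.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, color, key, subrank, subsize;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    color = rank % 2;   /* even and odd ranks go to different subgroups */
    key   = rank;       /* keep the original ordering inside each subgroup */

    /* every process of the initial communicator must take part in the call */
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &subcomm);

    MPI_Comm_rank(subcomm, &subrank);
    MPI_Comm_size(subcomm, &subsize);
    printf("world rank %d: subgroup %d, rank %d of %d\n",
           rank, color, subrank, subsize);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}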
• Virtual topologies Virtual topologies in MPI make it possible to use more convenient (in some cases) methods of referencing to processes of a parallel application. Creation of new communicator comm_cart by supplying initial communicator comm_old with Cartesian topology (fig. 21) int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart) MPI_CART_CREATE(COMM_OLD, NDIMS, DIMS, PERIODS, REORDER, COMM_CART, IERR) 64 Input parameters: • comm_old  initial communicator; • ndims  dimension of Cartesian grid; • dims integer array which consists of ndims elements and defines number of processes along each dimension; • periods  logical array which consists of ndims elements and defines if grid is periodic (true) along correspondent dimension; reorder  logical variable. If it is equal to true system is allowed to change order of numeration of processes. Information about structure of Cartesian topology is contained in ndims, dims and periods. MPI_Cart_create is collective operation (it must be called by all processes from communicator which has to be supplied by Cartesian topology). • Fig. 21. Cartesian topology Getting Cartesian coordinates of process from its rank in group int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords) MPI_CART_COORDS(COMM, RANK, MAXDIMS, COORDS, IERR) Input parameters: • comm  communicator which is supplied with Cartesian topology; • rank  rank of a process in comm; maxdims  number of elements in 1-dimensional array coords in calling program. Output parameter: • • coords  1-dimensional integer array (consists of ndims elements) which contains Cartesian coordinates of process. 65 Getting of rank of process (rank) from its Cartesian coordinates in communicator comm int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank) MPI_CART_RANK(COMM, COORDS, RANK, IERR) Input parameter: coords  1-dimensional integer array (consists of ndims elements) which contains Cartesian coordinates of process. Both MPI_Cart_rank and MPI_Cart_coords are local. • Splitting of communicator comm in subgroups correspondent to Cartesian subgrids of lower dimension int MPI_Cart_sub(MPI_Comm comm, int *remain_dims, MPI_Comm *comm_new) MPI_CART_SUB(COMM, REMAIN_DIMS, COMM_NEW, IERR) I-th element of the array remain_dims defines if I-th dimension is contained in subgrid (true). Output parameter: newcomm  communicator which contains subgrid to which belongs given process. Subprogram MPI_Cart_sub may be used only with communicator supplied with Cartesian topology. • Getting of information about Cartesian topology associated with communicator comm int MPI_Cart_get(MPI_Comm comm, int maxdims, int *dims, int *periods, int *coords) MPI_CART_GET(COMM, MAXDIMS, DIMS, PERIODS, COORDS, IERR) Input parameter: • maxdims  number of elements in arrays dims, periods and vectors in calling program; Output parameters: • dims  1-dimensional integer array which defines number of processes along each dimension; • periods  logical array which consists of ndims elements and defines if grid is periodic (true) along correspondent dimension; • coords  1-dimensional integer array which contains Cartesian coordinates of process. 
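The following sketch shows a typical use of the Cartesian topology subprograms described above. It is an illustrative example; MPI_Dims_create, a standard MPI helper subprogram not described in this text, is used only to pick a balanced grid shape. Each process obtains its grid coordinates from its rank and converts them back to a rank.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int world_rank, size, cart_rank, back_rank;
    int dims[2] = {0, 0};        /* let MPI_Dims_create choose the grid shape */
    int periods[2] = {1, 1};     /* the grid is periodic in both dimensions */
    int coords[2];
    MPI_Comm cart_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* choose a balanced 2-dimensional decomposition of size processes */
    MPI_Dims_create(size, 2, dims);

    /* supply MPI_COMM_WORLD with a periodic 2-dimensional Cartesian topology;
       reorder = 1 allows the system to renumber the processes */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);

    /* the rank in cart_comm may differ from the rank in MPI_COMM_WORLD */
    MPI_Comm_rank(cart_comm, &cart_rank);

    /* rank -> coordinates and coordinates -> rank */
    MPI_Cart_coords(cart_comm, cart_rank, 2, coords);
    MPI_Cart_rank(cart_comm, coords, &back_rank);

    printf("grid %d x %d: process %d has coordinates (%d, %d), Cartesian rank %d\n",
           dims[0], dims[1], world_rank, coords[0], coords[1], back_rank);

    MPI_Comm_free(&cart_comm);
    MPI_Finalize();
    return 0;
}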
66 Getting of rank of process (newrank) in Cartesian topology after reordering int MPI_Cart_map(MPI_Comm comm_old, int ndims, int *dims, int *periods, int *newrank) MPI_CART_MAP(COMM_OLD, NDIMS, DIMS, PERIODS, NEWRANK, IERR) Input parameters: • comm  communicator; • ndims  dimensionality of Cartesian grid; • dims  integer array which consists of ndims elements and defines number of processes along each dimension; • periods  logical array which consists of ndims elements and defines if grid is periodic (true) along correspondent dimension. If process doesn’t belong to grid subprogram returns value MPI_UNDEFINED. Getting of source rank (source) of message which ought to be received and target process (dest) which should receive message for given direction of shift (direction) as well as its magnitude (disp) int MPI_Cart_shift(MPI_Comm comm, int direction, int displ, int *source, int *dest) MPI_CART_SHIFT(COMM, DIRECTION, DISPL, SOURCE, DEST, IERR) For n-dimensional Cartesian grid value of direction has to be in interval from 0 to n – 1. Getting of dimensionality (ndims) of Cartesian topology which is associated with communicator comm int MPI_Cartdim_get(MPI_Comm comm, int *ndims) MPI_CARTDIM_GET(COMM, NDIMS, IERR) Creation of new communicator comm_graph which is supplied with graph topology (fig. 22) int MPI_Graph_create(MPI_Comm comm, int nnodes, int *index, int *edges, int reorder, MPI_Comm *comm_graph) MPI_GRAPH_CREATE(COMM, NNODES, INDEX, EDGES, REORDER, COMM_GRAPH, IERR) Input parameters: • comm  initial communicator which is not supplied with topology; • nnodes  number of graph nodes; • index  1-dimensional integer array which contains orders of nodes (number of incoming and outcoming arcs); 67 • edges  1-dimensional integer array which contains arcs of the graph; • reorder  true value allows reordering of numeration of processes. Fig. 22. Graph topology Getting nodes of graph which are neighbors of given node int MPI_Graph_neighbors(MPI_Comm comm, int rank, int maxneighbors, int *neighbors) MPI_GRAPH_NEIGHBORS(COMM, RANK, MAXNEIGHBORS, NEIGHBORS, IERR) Input parameters: • comm  communicator with graph topology; • rank  rank of process in group associated with communicator comm; maxneighbors  number of elements in array neighbors. Output parameter: • • neighbors  array containing ranks of processes which are neighbors of given process. Getting number of neighbor nodes (nneighbors) for given in communicator with graph topology int MPI_Graph_neighbors_count(MPI_Comm comm, int rank, int *nneighbors) MPI_GRAPH_NEIGHBORS_COUNT(COMM, RANK, NNEIGHBORS, IERR) Input parameters: • comm  communicator; • rank  rank of process which corresponds to the node. 68 Getting information about graph topology associated with communicator comm int MPI_Graph_get(MPI_Comm comm, int maxindex, int maxedges, int *index, int *edges) MPI_GRAPH_GET(COMM, MAXINDEX, MAXEDGES, INDEX, EDGES, IERR) Input parameters: • comm  communicator; • maxindex  number of elements in array index in calling program; maxedges  number of elements in array edges in calling program. Output parameters: • • index  1-dimensional integer array which contains structure of graph (see description of subprogram MPI_Graph_create); • edges  1-dimensional integer array which contains information about arcs of graph. 
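A minimal sketch of a graph topology is given below (illustrative only; the graph is a simple ring and at least three processes are assumed). Every process is connected to its two ring neighbours, and MPI_Graph_neighbors is used to query them.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i, nneighbors;
    int *index, *edges, *neighbors;
    MPI_Comm ring_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 3) {                 /* the ring below needs at least 3 nodes */
        if (rank == 0) printf("run with at least 3 processes\n");
        MPI_Finalize();
        return 0;
    }

    /* describe a ring: node i is connected to i-1 and i+1 (modulo size) */
    index = (int *) malloc(size * sizeof(int));
    edges = (int *) malloc(2 * size * sizeof(int));
    for (i = 0; i < size; i++) {
        index[i] = 2 * (i + 1);                /* cumulative number of arcs */
        edges[2 * i]     = (i - 1 + size) % size;
        edges[2 * i + 1] = (i + 1) % size;
    }

    /* reorder = 0, so ranks in ring_comm coincide with ranks in MPI_COMM_WORLD */
    MPI_Graph_create(MPI_COMM_WORLD, size, index, edges, 0, &ring_comm);

    /* ask the topology for the neighbors of this process */
    MPI_Graph_neighbors_count(ring_comm, rank, &nneighbors);
    neighbors = (int *) malloc(nneighbors * sizeof(int));
    MPI_Graph_neighbors(ring_comm, rank, nneighbors, neighbors);

    printf("process %d has %d neighbors: %d and %d\n",
           rank, nneighbors, neighbors[0], neighbors[1]);

    free(neighbors); free(edges); free(index);
    MPI_Comm_free(&ring_comm);
    MPI_Finalize();
    return 0;
}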
Getting rank of process in graph topology after reordering (newrank) int MPI_Graph_map(MPI_Comm comm, int nnodes, int *index, int *edges, int *newrank) MPI_GRAPH_MAP(COMM, NNODES, INDEX, EDGES, NEWRANK, IERR) Input parameters: • comm  communicator; • nnodes  number of graph nodes; • index  1-dimensional integer array which contains structure of graph (see description of subprogram MPI_Graph_create); edges  1-dimensional integer array which contains information about arcs of graph. If process does not belong to the graph this subprogram returns MPI_UNDEFINED. • Getting of information on graph topology which is related to communicator comm int MPI_Graphdims_get(MPI_Comm comm, int *nnodes, int *nedges) MPI_GRAPHDIMS_GET(COMM, NNODES, NEDGES, IERR) Output parameters: • nnodes  number of graph nodes; • nedges  number of graph edges. 69 Getting type of topology (toptype) associated with communicator comm int MPI_Topo_test(MPI_Comm comm, int *toptype) MPI_TOPO_TEST(COMM, TOPTYPE, IERR) Output parameter: • toptype  topology (MPI_CART for Cartesian topology and MPI_GRAPH for graph topology). Derived data types Derived data types of MPI are used to send data which elements are not contiguous in memory. Derived type must be created by call of constructor and then it has to be registered. Before program will be completed all derived types should be cancelled. Constructor of vector type (fig. 23) int MPI_Type_vector(int count, int blocklen, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype) MPI_TYPE_VECTOR(COUNT, BLOCKLEN, STRIDE, OLDTYPE, NEWTYPE, IERR) Input parameters: • count  number of blocks (nonnegative integer); • blocklen  length of a block (number of elements, nonnegative integer); • stride  number of elements between beginning of previous and beginning of the next block; • oldtype  basic type. • newtype  identifier of a new type. Initial data must be of the same type. Fig. 23. Vector derived type 70 Constructor of vector type int MPI_Type_hvector(int count, int blocklen, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype) MPI_TYPE_HVECTOR(COUNT, BLOCKLEN, STRIDE, OLDTYPE, NEWTYPE, IERR) Arguments of this subprogram are the same as in subprogram MPI_Type_vector except stride value must be given in bytes. Constructor of structured type int MPI_Type_struct(int count, int blocklengths[], MPI_Aint indices[], MPI_Datatype oldtypes[], MPI_Datatype *newtype) MPI_TYPE_STRUCT(COUNT, BLOCKLENGTHS, INDICES, OLDTYPES, NEWTYPE, IERR) Input parameters: • count  number of elements in derived type and number of elements in arrays oldtypes, indices and blocklengths; • blocklengths  number of elements at each block (array); • indices  displacement of each block in bytes; • oldtypes  type of elements at each block (array). Output parameter: • newtype  identifier of derived type. MPI_Aint  name of scalar type with the same length as length of pointer. Constructor of indexed type int MPI_Type_indexed(int count, int blocklens[], int indices[], MPI_Datatype oldtype, MPI_Datatype *newtype) MPI_TYPE_INDEXED(COUNT, BLOCKLENS, INDICES, OLDTYPE, NEWTYPE, IERR) Input parameters: • count  number of blocks in derived type and number of elements in arrays indices and blocklens; • blocklens  number of elements at each block; • indices  displacements of blocks which is measured in cells of basic type (integer array); oldtype  basic type. Output parameter: • • newtype  identifier of derived type. 
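As an illustration of the vector constructor, the sketch below sends one column of a matrix stored row by row from process 0 to process 1. It is not from the original text; the matrix sizes and the selected column are arbitrary, and at least two processes are assumed. MPI_Type_commit, described below, registers the derived type before it is used.

#include <stdio.h>
#include <mpi.h>

#define ROWS 4
#define COLS 5

int main(int argc, char *argv[])
{
    int rank, i, j;
    double matrix[ROWS][COLS], column[ROWS];
    MPI_Datatype column_type;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ROWS blocks of one element each, with stride COLS elements: a matrix column */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    if (rank == 0) {
        for (i = 0; i < ROWS; i++)
            for (j = 0; j < COLS; j++)
                matrix[i][j] = 10.0 * i + j;
        /* send the third column as one object of the derived type */
        MPI_Send(&matrix[0][2], 1, column_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive the column into a contiguous array of ROWS doubles */
        MPI_Recv(column, ROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        for (i = 0; i < ROWS; i++)
            printf("column[%d] = %f\n", i, column[i]);
    }

    MPI_Type_free(&column_type);
    MPI_Finalize();
    return 0;
}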
71 Constructor of indexed type int MPI_Type_hindexed(int count, int blocklens[], MPI_Aint indices[], MPI_Datatype oldtype, MPI_Datatype *newtype) MPI_TYPE_HINDEXED(COUNT, BLOCKLENS, INDICES, OLDTYPE, NEWTYPE, IERR) Displacements indices are given in bytes. Constructor of derived type with contiguous disposition of elements int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype) MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERR) Input parameters: • count  counter of replicas; oldtype  basic type. Output parameter: • newtype  identifier of the new type. • Constructor of indexed type with blocks of equal size int MPI_Type_create_indexed_block(int count, int blocklength, int displacements[], MPI_Datatype oldtype, MPI_Datatype *newtype) MPI_TYPE_CREATE_INDEXED_BLOCK(COUNT, BLOCKLENGTH, DISPLACEMENTS, OLDTYPE, NEWTYPE, IERR) Input parameters: • count  number of blocks in derived type and number of elements in arrays indices and blocklens; • blocklength  number of elements at each block; • displacements  displacements of blocks measured in units of length of type oldtype (integer array); • oldtype  basic type. • newtype  identifier of derived type. Constructor of derived data type which corresponds to subarray of multidimensional array int MPI_Type_create_subarray(int ndims, int *sizes, int *subsizes, int *starts, int order, MPI_Datatype oldtype, MPI_Datatype *newtype) MPI_TYPE_CREATE_SUBARRAY(NDIMS, SIZES, SUBSIZES, STARTS, ORDER, OLDTYPE, NEWTYPE, IERR) 72 Input parameters: • ndims  dimension of array; • sizes  number of elements having type oldtype at each dimension of the whole array; • subsizes  number of elements having type oldtype at each dimension of the subarray; • starts  initial coordinates of subarray at each dimension; • order  flag which defines reordering; • oldtype  basic type. • newtype  new type. Registration of derived type datatype int MPI_Type_commit(MPI_Datatype *datatype) MPI_TYPE_COMMIT(DATATYPE, IERR) Removing of derived type datatype int MPI_Type_free(MPI_Datatype *datatype) MPI_TYPE_FREE(DATATYPE, IERR) Basic types might not be removed. Getting size of the data type datatype in bytes int MPI_Type_size(MPI_Datatype datatype, int *size) MPI_TYPE_SIZE(DATATYPE, SIZE, IERR) Output parameter  size. Getting number of elements in a single object having type datatype (extent) int MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent) MPI_TYPE_EXTENT(DATATYPE, EXTENT, IERR) Output parameter  extent. Displacements may be given relative to basic address which is contained in constant MPI_BOTTOM. Getting address from given location int MPI_Address(void *location, MPI_Aint *address) MPI_ADDRESS(LOCATION, ADDRESS, IERR) 73 This subprogram in C programs returns the same address as operation & (sometimes this rule may be violated). It is more helpful in Fortran programs because C has own tools to do the same. Getting actual parameters used in creation of derived type int MPI_Type_get_contents(MPI_Datatype datatype, int max_integers, int max_addresses, int max_datatypes, int *integers, MPI_Aint *addresses, MPI_Datatype *datatypes) MPI_TYPE_GET_CONTENTS(DATATYPE, MAX_INTEGERS, MAX_ADDRESSES, MAX_DATATYPES, INTEGERS, ADDRESSES, DATATYPES, IERR) Input parameters: • datatype  identifier of derived type; • max_integers  number of elements in array integers; • max_addresses  number of elements in array addresses; max_datatypes  number of elements in array datatypes. 
Output parameters: • • integers  contains integer arguments which were used at creating of given data type; • addresses  contains arguments address which were used at creating of given data type; • datatypes — contains arguments datatype which were used at creating of given data type. Getting low bound of datatype int MPI_Type_lb(MPI_Datatype datatype, MPI_Aint *displacement) MPI_TYPE_LB(DATATYPE, DISPLACEMENT, IERR) Output parameter: • displacement — displacement (in bytes) of low bound relative to source. Getting upper bound of datatype int MPI_Type_ub(MPI_Datatype datatype, MPI_Aint *displacement) MPI_TYPE_UB(DATATYPE, DISPLACEMENT, IERR) 74 Data packing int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype, void *outbuf, int outcount, int *position, MPI_Comm comm) MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTCOUNT, POSITION, COMM, IERR) When this call is performed incount elements of given type are chosen from input buffer starting with position. Input parameters: • inbuf address of input buffer; • incount  number of input data; • datatype  type of input data; • outcount  size of output buffer in bytes; • position  current position in buffer in bytes; • comm  communicator corresponding to packing message. Output parameter: • outbuf  address of output buffer. Data unpacking int MPI_Unpack(void *inbuf, int insize, int *position, void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm) MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERR) Input parameters: • inbuf  address of input buffer; • insize  size of input buffer in bytes; • position  current position in buffer in bytes; • outcount  number of data which must be unpacked; • datatype  type of output data; comm  communicator corresponding to unpacking message. Output parameter: • • outbuf  address of output buffer. Getting memory size (in bytes) which is necessary for unpacking of message int MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm, int *size) 75 MPI_PACK_SIZE(INCOUNT, DATATYPE, COMM, SIZE, IERR) Input parameters: • incount  argument count which was used at packing; • datatype  type of packed data; • comm  communicator. Attributes Attributes provide a software developer by additional mechanism of information exchange between processes. Creation of new key keyval for attribute (output parameter) int MPI_Keyval_create(MPI_Copy_function *copy_fn, MPI_Delete_function *delete_fn, int *keyval, void *extra_state) MPI_KEYVAL_CREATE(COPY_FN, DELETE_FN, KEYVAL, EXTRA_STATE, IERR) Keys are unique and are not seen by a programmer though they are explicitly kept as integer values. A defined key may be used to set attributes and get access to them in any communicator. Function copy_fn is called when communicator is duplicated by subprogram MPI_Comm_dup. Function delete_fn is used for removal. Parameter extra_state sets additional information (state) for copy and delete functions. Setting type of function MPI_Copy_function typedef int MPI_Copy_function(MPI_Comm oldcomm, int keyval, void *extra_state, void *attribute_val_in, void *attribute_val_out, int *flag) SUBROUTINE COPY_FUNCTION(OLDCOMM, KEYVAL, EXTRA_STATE, ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT, FLAG, IERR) Copy function is called for each keyvalue in initial communicator in any order. Each call of copy function is performed with keyvalue and corresponding attribute. If flag’s value is flag = 0 returned attribute is removed from duplicated communicator. 
Otherwise (flag = 1) new value is set for attribute which is equal to value returned in parameter attribute_val_out. Function copy_fn in C and Fortran may be defined by values MPI_NULL_COPY_FN or MPI_DUP_FN. MPI_NULL_COPY_FN is function which doesn’t perform any actions except of returning flag’s value flag = 0 and MPI_SUCCESS. MPI_DUP_FN is simplest reduplication function. It returns flag’s 76 value flag = 1, attribute’s value by means of attribute_val_out and completion code MPI_SUCCESS. Deletion function is similar to copy_fn and may be defined as follows. Function delete_fn is called when a communicator has to be deleted by means of call MPI_Comm_free or when MPI_Attr_delete is called. It must be of type MPI_Delete_function, which is defined as: typedef int MPI_Delete_function(MPI_Comm comm, int keyval, void *attribute_val, void *extra_state); SUBROUTINE DELETE_FUNCTION(COMM, KEYVAL, ATTRIBUTE VAL, EXTRA STATE, IERR) This function is called by subroutines MPI_Comm_free, MPI_Attr_delete and MPI_Attr_put. A deletion function may be defined as "null"  MPI_NULL_DELETE_FN. MPI_NULL_DELETE_FN doesn’t perform any actions but returns MPI_SUCCESS. Special key’s value MPI_KEYVAL_INVALID may not be returned by subroutine MPI_Keyval_create. It is used for keys initialization. Deletion of a key keyval int MPI_Keyval_free(int *keyval) MPI_KEYVAL_FREE(KEYVAL, IERR) Call of this function assigns to keyval the value MPI_KEYVAL_INVALID. An attribute in use may be deleted because its actual deletion takes place only after deletion of all references to the attribute. All references must be deleted explicitly by means of, for example, call MPI_Attr_delete. Each those call deletes one copy of the attribute. Call of MPI_Comm_free deletes all copies of the attribute that are related to communicator under deletion. Setting of attribute which may be used by subroutine MPI_Attr_get int MPI_Attr_put(MPI_Comm comm, int keyval, void* attribute) MPI_ATTR_PUT(COMM, KEYVAL, ATTRIBUTE, IERR) Call of this subprogram associates key’s value keyval with the attribute. If the attribute’s value was set before the call result is the same as in situation when at first MPI_Attr_delete is used (and call of delete_fn is performed) and then new value is saved. The call will be completed with error if a key with value keyval is absent. In particular, code MPI_KEYVAL_INVALID corresponds to wrong value of a key. Change of system attributes MPI_TAG_UB, MPI_HOST, MPI_IO and MPI_WTIME_IS_GLOBAL is not allowed. 77 Getting attribute value which corresponds to a key’s value keyval int MPI_Attr_get(MPI_Comm comm, int keyval, void *attribute, int *flag) MPI_ATTR_GET(COMM, KEYVAL, ATTRIBUTE, FLAG, IERR) The first parameter defines the communicator to which the attribute is attached. If a key with value keyval is absent, error takes place. Error does not arise if key value is set but corresponding attribute is not attached to the communicator comm. In this case flag value flag = false is returned. When call of MPI_Attr_put is performed, an attribute’s value is passed by means of attribute_val and during a call of MPI_Attr_get address of returned attribute’s value is passed through attribute_val parameter. Attributes may be received only from programs which are written in same programming languages as subprograms or programs called MPI_Attr_put. 
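Returning to the packing subprograms described earlier, the sketch below packs an integer and a double into one message on process 0 and unpacks them in the same order on process 1. It is illustrative only; the buffer size of 100 bytes and the particular values are arbitrary, and at least two processes are assumed.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, position;
    int n = 0;
    double x = 0.0;
    char buffer[100];            /* packing buffer, 100 bytes is enough here */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        n = 7;
        x = 3.14;
        position = 0;            /* pack an int and a double into one message */
        MPI_Pack(&n, 1, MPI_INT, buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(&x, 1, MPI_DOUBLE, buffer, 100, &position, MPI_COMM_WORLD);
        /* position now holds the actual number of packed bytes */
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buffer, 100, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
        position = 0;            /* unpack in the same order as packed */
        MPI_Unpack(buffer, 100, &position, &n, 1, MPI_INT, MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD);
        printf("received n = %d, x = %f\n", n, x);
    }

    MPI_Finalize();
    return 0;
}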
Deletion of attribute with given key’s value int MPI_Attr_delete(MPI_Comm comm, int keyval) MPI_ATTR_DELETE(COMM, KEYVAL, IERR) Deletion of attribute with given key’s value is performed by an attribute’s deletion function delete_fn which has to be defined when keyval is created. Parameter comm defines the communicator to which the attribute is attached. All parameters of the subprogram are input parameters. For any reduplication of a communicator by means of MPI_Comm_dup subprogram all copy functions are invoked for attributes which were set at a given time. Order of invocations is arbitrary. The same actions are performed when a communicator has to be deleted by MPI_Comm_free but all deletion functions are called in this case. Implementations There are few realizations of the MPI specification. Among them are: MPICH (MPI CHameleon, www.mcs.anl.gov) – free, open-source MPI implementation; LAM (Local Area Multicomputer) – high-quality open-source MPI implementation (www.lam-mpi.org); Microsoft® MPI and Intel® MPI etc. There are some implementations which support usage in Grid-environment. 78 From MPI-1 to MPI-2 A lot of parallel software use MPI-1 implementations but now most of implementations support also MPI-2 specification. It has enhanced functionality such as: • spawning of new tasks in a process of program execution; • new kinds of point-to-point exchanges (one-sided); • parallel input-output; • enhanced collective operations (including operations for intercommunicators) and so on. 79 Part 4 Fortran 90 80 In this section short description of Fortran programming language is given. It is one of languages most often used in scientific, applied and parallel programming. Until now it has not competitors in performance and convenience of programming of computational problems. Format of a source code Source code of a Fortran program may be written in fixed or free format. The fixed format corresponds to old standards of Fortran 77, and free format is used in Fortran 90 and newest standards. Fortran 90 also supports fixed format. Any string of a source code in fixed format consists of 72 positions. First five positions may be used only for labels or comments. Sixth position may be blank or may be used to place any non-blank symbol. In the last case a string with a non-blank symbol in the sixth position is considered as a continuation string of a previous one. A statement may be placed in any positions from 7 to 72. In the free format all positions are equivalent and a string length is 132 symbols. Program structure Program consists of main program and, possibly, subprograms. Subprograms may be both functions and subroutines, external and internal. A program’s components may be compiled separately. Main program begins with statement PROGRAM. Then name of the program follows: PROGRAM _ Name of a program begins with a letter, then letters, digits and underscores may follow, for example: PROGRAM SUMMATION PROGRAM QUADRATIC_EQUATION_SOLVER45 Maximum length of any identifier in Fortran is 31 symbols. First statement of a subprogram is FUNCTION or SUBROUTINE. Last statement of any program component is END. Last statement of the main program may be of the following form: END[[ PROGRAM] PROGRAM_NAME] PROGRAM_NAME - unnecessary part of the statement. Just after header all definitions of variables, constants and other objects which are used in the (sub)program should be placed. It is definitions part of a program. Then part of executable statements follows. 
81 Basic data types Below list of built-in data types in the order of rank increase is given: • LOGICAL(1) and BYTE • LOGICAL(2), LOGICAL(4) • INTEGER(1), INTEGER(2), INTEGER(4) • REAL(4), REAL(8) • COMPLEX(8), COMPLEX(16) Each built-in data type in Fortran has few kinds, which differ by interval of allowable values and precision (for numerical types). Any value of CHARACTER type is a character string. Length of a string may be different it is defined by value of LEN parameter in a statement of a string variable description: CHARACTER(LEN = 430) :: Shakespeare_sonet Description sentence Description sentence for variables has following format in Fortran 90: type[, attributes] :: list_of_variables Identifiers in a list are separated by commas, type defines type, for example: REAL, PARAMETER :: salary = 2000 Following attributes may be used in Fortran 90: • PARAMETER  for named constants; • PUBLIC  variable is accessible from outside a module; • PRIVATE  variable is not accessible from outside a module; • POINTER  variable is a pointer; • TARGET  variable may be used as a target for pointers; • ALLOCATABLE  for dynamic (allocatable) arrays; • DIMENSION  for arrays; • INTENT  defines a type of a subprogram’s argument (input, output or input and output at the same time); • OPTIONAL  unnecessary argument of a subprogram; • SAVE  to save value of a local variable of a subprogram after return; • EXTERNAL  for external function; • INTRINSIC  for intrinsic function. 82 Literal constants Literal numerical constants are written as usual. In complex literal constants parenthesizes are used: • (0., 1.)  imaginary unit i; • (2., 1.)  complex number 2 + i. There are two logical literal constants: • .TRUE. • .FALSE. Literal character constant may be written between two quotation marks or two apostrophes: “Hello, Fortranner!” ‘Good night, Fortranner!’ Arithmetical and logical operators Below list of arithmetic operators in the order of priority decrease is given: • ** — exponentiation; • *, / — multiplication, division; • –, + — subtraction, addition. ”Minus” (–) and “plus” (+) are used also for unary operators: -2.14 +321 Below relation operators in Fortran are listed: Notation Alternative notation Name .LT. < Less .LE. <= Less or equal .GT. > Greater .GE. >= Greater or equal .EQ. == Equal .NE. /= Not equal Only “equal” (.EQ.) and "not equal" (.NE.) relations may be applied to complex variables and constants. 83 Logical operators: Operator Description .NOT. Logical negation .AND. Logical multiplication (logical AND) .OR. Logical addition (logical OR) .EQV. and .NEQV. Logical equivalence and nonequivalence (equality and nonequality of logical values) Arrays Arrays are described with DIMENSION attribute: REAL, DIMENSION(1:100) :: C List of extents may be used with this attribute. Extent is number of elements in a dimension. The list of extents has the following form: (extent_1, extent_2, …, extent_n) Number of extents is equal to array’s rank. Each extent is described as follows: [low_boundary : ] upper_boundary Example: REAL, DIMENSION(0:10, 2, -3:3,11) :: FGRID If low boundary value is omitted it is supposed to be 1. For allocatable arrays list of extents has form of colons separated by commas. Number of colons in this case must be equal to dimension of an array: REAL, ALLOCATABLE, DIMENSION(:, :) :: BE_LATER Size of allocatable array may be defined at a time of program execution. 
Only at that time memory may be allocated for such array: PROGRAM dyn_array IMPLICIT NONE INTEGER SIZE REAL, ALLOCATABLE, DIMENSION(:) :: array WRITE(*, *) 'SIZE?' READ(*, *) SIZE IF(SIZE > 0) ALLOCATE(array(SIZE)) … IF(ALLOCATED(array)) DEALLOCATE(array) END PROGRAM dyn_array Array may be described without DIMENSION attribute. In this case array’s extents must be described after its name: REAL X(10, 20, 30), Y(100), Z(2, 300, 2, 4) 84 Example Solution of nonlinear equation by Newton’s method program newton implicit none real(8) :: x, dx, f, df x = 3.3 ! initial approximation do ! newton’s iterations dx = f(x) / df(x) ! step evaluation x = x – dx ! next approximation if(dx <= spacing(x)) exit ! loop is finished when ! step is less than distance between two successive ! iterations end do print *, x ! output of the solution print *, f(x) ! output of a function value print *, df(x) ! output of derivative of function end program newton real(8) function f(x) implicit none real(8) :: x f = sin(x) return end Statements of Fortran 90 Here list of some statements of Fortran 90 is given. Unnecessary elements are denoted by square brackets []. If the blank symbol is not placed in square brackets it is necessary in this context. Nonexecutable statements Statements of program components PROGRAM program_name Main program’s header 85 MODULE module_name Module’s header END[ MODULE[module_name]] Last module’s statement USE module_name[, ONLY only_list] Statement which attaches a module [RECURSIVE ]SUBROUTINE subroutine_name [([list_of_formal_parameters])] Subroutine’s header [type ][RECURSIVE ]FUNCTION function_name ([list_of_formal_parameters]) [ RESULT(result_name)] Function’s header INTERFACE[ generic_description] Definition of an interface, header statement END[ ]INTERFACE Last statement in an interface definition CONTAINS Definition of internal subprogram ENTRY Entry statement 86 Descriptions and initializations type[[, attribute][, attribute:]:.::] objects_list Description statement. type is one from the list: • INTEGER[(KIND=]kind_parameter)] • REAL[(KIND=]kind_parameter)] • LOGICAL[(KIND=]kind_parameter)] • COMPLEX[(KIND=]kind_parameter)] • CHARACTER[list_of_type_parameters] • DOUBLE[ ]PRECISION] • TYPE(type_name) Attributes are any allowable combination of the following: PARAMETER, PUBLIC, PRIVATE, POINTER, TARGET, ALLOCATABLE, DIMENSION, INTENT, EXTERNAL, INTRINSIC, OPTIONAL, SAVE TYPE[, access_attribute ::] name_of_derived_type Definition of a derived type, header ststement. 
Here access_attribute — PUBLIC or PRIVATE END[ ]TYPE[name_of_derived_type] Definition of a derived type, last statement IMPLICIT list where list — type(list_of_letters)[, type(list_of_letters)] … or NONE Definition of implicit tipization ALLOCATABLE [::] array_name[(extents_list)][, array_name[(extents_list)]…] Definition of allocatable arrays DIMENSION array_name(extents)[, array_name(extents)…] Arrays description PARAMETER (list_of_definitions_of_named_constants) Definition of named constants EXTERNAL list_of_external_names Assigning of attribute EXTERNAL INTRINSIC list_of_intrinsic_names Assigning of attribute INTRINSIC 87 INTENT(parameter_of_input/output) list_of_formal_parameters Assigning of attribute INTENT OPTIONAL list_of_formal_parameters Assigning of attribute OPTIONAL SAVE[[::]list_of_objects_to_save] Assigning of attribute SAVE COMMON /[name_of_common_block]/ list_of_variables [, / name_of_common_block / list_of_variables] Definition of common blocks DATA objects_list /list_of_values /[, objects_list / list_of_values /…] Initialization of variables and arrays FORMAT([list_of_descriptors]) Format specification Executable operators Control statements END[ PROGRAM[ program_name]] Last statement of a program END[ subprogram_kind [ subprogram_name]] where subprogram_kind — SUBROUTINE or FUNCTION Last statement of a subprogram CALL subroutine_name[(list_of_actual_parameters)] Call of subroutine RETURN Return statement 88 STOP[ message] Stop statement Assignments variable = expression Assignment for scalar and array-like objects reference => target Attachment of reference to a target Loops and branchings IF(scalar_logical_expression) executable_statement Conditional statement WHERE(array_logical_expression) array = expression_array Conditional assignment for arrays [if_name:] IF(scalar_logical_expression) THEN ELSE[[ ]IF(scalar_logical_expression) THEN[ if_ END[ ]IF[ if_name] Conditional statement IF_THEN_ELSE WHERE(array_logical_expression) ELSEWHERE END[ ]WHERE Branching for array assignments [select_name:] SELECT[ ]CASE (scalar_expression) CASE (list_of_possible_values)[ select_name] CASE DEFAULT[ select_name] END[ ]SELECT[ select_name] Multibranching SELECT ] 89 GO[ ]TO label Unconditional jump to a specified label [do_name:] DO[ label] variable = scalar_integer_expression1, scalar_integer_expression2[, scalar_integer_expression3] DO-loop header [do_name:] DO[ WHILE(scalar_logical_expression) While-loop (precondition loop) header label] [,] CYCLE[ do_name] Transition to the loop’s do_name end EXIT[ do_name] Exit from the loop do_name CONTINUE Jump to next iteration of a loop END[ ]DO[ do_name] Last statement of a loop do_name Operations with dynamic memory ALLOCATE(list_of_objects_to_be_allocated [, STAT=status]) Allocation of memory for listed objects DEALLOCATE(list_of_allocated_objects[, STAT = status]) Memory deallocation 90 Input-output statements READ(input_control_list) [input_list] READ format[,input_list] Input WRITE(output_control_list) [output_list] Output PRINT format[, output_list] Write on standard output device OPEN(descriptors) Attachment of a file to a logical input-output device CLOSE(descriptors) Closing of a file Intrinsic subprograms ABS(A) Absolute value ACHAR(I) I-th symbol of a sorting ASCII sequence of a processor ACOS(X) Arccosine in radians AIMAG(Z) Imaginary part of a complex value AINT(A[, KIND]) Truncation to integer value (with specified kind parameter) 91 ALLOCATED(ARRAY) Check if an array is allocated ANINT(A[, KIND]) Nearest integer value to A 
ASIN(X) Arcsine in radians ATAN(X) Arctangent in radians ATAN2(Y, X) Argument of a complex value CALL DATE_AND_TIME([DATE][, TIME] [,ZONE] [,VALUES]) Date and time getting CALL RANDOM_NUMBER(R) Uniformly distributed pseudorandom number from interval [0, 1) CALL RANDOM_SEED([SIZE] [, PUT][, GET]) Getting/setting of random seed CALL SYSTEM_CLOCK([COUNT] [, COUNT_RATE] [, COUNT_MAX]) Integer count of real timer CEILING(A) Minimum integer value which is greater or equal to A CHAR(I[, KIND]) ICHAR(C) I-th symbol in a sorted sequence of a processor CMPLX(X[, Y] [, KIND]) Constructor of a complex number (with specified kind parameter) 92 CONJG(Z) Complex conjugation COS(X) Trigonometric cosine COSH(X) Hyperbolic cosine COUNT(MASK[, DIM]) Number of masked array elements (along specified dimension) CSHIFT(ARRAY, SHIFT[, DIM]) Cyclic shift of array elements (along specified dimension) DIGITS(X) Number of significant digits in the model of floating point representation of X DOT_PRODUCT(VECTOR_A, VECTOR_B) Dot product of one-dimensional arrays DPROD(X, Y) Scalar multiplication with double precision EOSHIFT(ARRAY, SHIFT[, BOUNDARY] [, DIM]) Linear shift of array elements EPSILON(X) Minimum number in a model of floating point representation of X such that its sum with unit is distinguishable from unit EXP(X) Exponent EXPONENT(X) Exponent in a model of floating point representation of X 93 FLOOR(A) Minimum integer value that is not greater than A FRACTION(X) Fractional part in a model of floating point representation of X HUGE(X) Maximum value in a model of floating point representation of X IACHAR( ) Index of a character argument in a sorting sequence of a processor IAND(I, J) Bitwise logical AND IBCLR(I, POSITION) Setting zero bit in a specified position IBITS(I, POSITION, LENGTH) Extraction of a bit subsequence IBSET(I, POSITION) Setting binary unit in a given position ICHAR(C) Index of a character argument in a sorting sequence of a processor IEOR(I, J) Bitwise XOR INDEX(STRING, SUBSTRING[, BACK]) Starting index of substring in a string INT(A[, KIND]) Transformation to integer (with specified kind parameter) 94 IOR(I, J) Bitwise logical OR ISHIFT(I, SHIFT) Logical bit shift ISHIFTC(I, SHIFT[, SIZE]) Logical right cyclic shift of part of bits KIND(X) Kind parameter of an argument LBOUND(ARRAY[, DIM]) Low bound of array-like argument LEN(S) Length of a string LEN_TRIM(STRING) Length of a string without tale blanks ALOG(X) Natural logarithm ALOG10(X) Decimal logarithm LOGICAL(L[, KIND]) Transformation of an argument to logical type (with specified kind parameter) MATMUL(MATRIX_A, MATRIX_B) Matrix multiplication MAX(A1, A2[, A3, …..]) Maximum of arguments in a list 95 MAXLOC(ARRAY[, MASK]) Index of maximum array element MAXVAL(ARRAY[, DIM ] [, MASK]) Maximum array element (along specified dimension and/or according to a given mask) MIN(A1, A2 [, A3, …..]) Minimum element in a list MINLOC(ARRAY[, MASK]) Index of minimum array element MINVAL(ARRAY[, DIM ] [, MASK]) Minimum array element (along specified dimension and/or according to a given mask) MOD(A, P) Remainder from division modulo P MODULO(A, P) Division modulo P NINT(A[, KIND]) Nearest integer to an argument’s value NOT(I) Bitwise logical negation PRECISION(X) Decimal precision for real type argument PRESENT(A) Test of presence of optional argument 96 PRODUCT(ARRAY[, DIM] [,MASK]) Product of array elements (along specified dimension and/or according to a given mask) REAL(A[, KIND]) Transformation to real type (with a kind specified) RESHAPE(SOURCE, 
SHAPE[, PAD] [, ORDER]) Change of array shape SCAN(STRING, SET[, BACK]) Index of last symbol in string argument STRING in SET SELECTED_INT_KIND(R) Kind parameter of integer type with a given interval SELECTED_REAL_KIND([P][, R]) Kind parameter of real type with a given precision and interval SHAPE(SOURCE) Shape of an argument (array-like argument assumed) SIGN(A, B) Absolute value of A with a sign of B SIN(X) Trigonometric sine SINH(X) Hyperbolic sine SIZE(ARRAY[, DIM]) Array size (along specified dimension) SQRT(X) Square root 97 SUM(ARRAY[, DIM][, MASK]) Sum of array elements (along specified dimension and/or according to a given mask) TAN(X) Trigonometric tangent TANH(X) Hyperbolic tangent TINY(X) Minimum positive floating point number TRANSPOSE(MATRIX) Matrix transpose UBOUND(ARRAY [, DIM]) Upper array boundary 98 References 1. S. Nemnyugin, O. Stesik, Parallel programming for multiprocessor computing systems. "BHV", Saint-Petersburg, 2002, 396 p. 2. V.V. Voevodin, V.V. Voevodin, Parallel computing. "BHV", Saint-Petersburg, 2002, 599 p. 3. S. Nemnyugin, O. Stesik, Modern Fortran. "BHV", Saint-Petersburg, 2004, 481 p. 99 Appendix 1 Intel compiler for Linux Here short description of Intel® Fortran compiler for Linux is given. It has to be used in the following way: ifort [options] file1 [file2 ...] where options are unnecessary and fileN is a Fortran source (with extensions .f .for .ftn .f90 .fpp .F .FOR .F90 .i .i90), assembly (.s .S), object (.o), static library (.a), or other linkable file Performance options • -O1 - optimize for maximum speed, but disable some optimizations which increase code size for a small speed benefit; • -O2 - enable optimizations (default); • -O3 - enable -O2 plus more aggressive optimizations that may not improve performance for all programs; • -O0 - disable optimizations; • -O - same as -O2; • -fast - enable -xP -O3 -ipo -no-prec-div –static; • -[no-]prec-div - improve precision of floating-point divides (some speed impact); • -mcpu=<cpu> - optimize for a specific cpu: pentium - optimize for Pentium® processor, pentiumpro - optimize for Pentium® Pro, Pentium® II and Pentium® III processors, pentium4 - optimize for Pentium® 4 processor (default); • -march=<cpu> - generate code exclusively for a given <cpu>: pentiumpro - Pentium® Pro and Pentium(R) II processor instructions, pentiumii - MMX(TM)instructions, pentiumiii - streaming SIMD extensions, pentium4 - Pentium® 4 new instructions; • -x<codes> - generate specialized code to run exclusively on processors indicated by <codes>: W - Intel Pentium® 4 and compatible Intel processors, P - Intel® Core(TM) Duo processors, Intel Core(TM) Solo processors, Intel Pentium® 4 and compatible Intel processors with Streaming SIMD Extensions 3 (SSE3) instruction support; 100 • • • • • -ip - enable single-file Interprocedural (IP) optimizations (within files); -ipo[n] - enable multi-file IP optimizations (between files); -qp - compile and link for function profiling with UNIX gprof tool; -p - same as –qp; -opt-report - generate an optimization report to stderr. 
Instrumentation options
• -tcheck - generate instrumentation to detect multi-threading bugs (requires Intel® Thread Checker; cannot be used with the compiler alone);
• -tprofile - generate instrumentation to analyze multi-threading performance (requires Intel® Thread Profiler; cannot be used with the compiler alone);
• -openmp - enable the compiler to generate multi-threaded code based on OpenMP directives;
• -openmp-profile - link with an instrumented OpenMP runtime library to generate OpenMP profiling information for use with the OpenMP component of the VTune(TM) Performance Analyzer;
• -openmp-stubs - compile OpenMP programs in sequential mode; OpenMP directives are ignored and a stub (sequential) OpenMP library is linked;
• -cluster-openmp - allows the user to run an OpenMP program on a cluster;
• -parallel - enable the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel.

Output and debug options
• -c - compile to object (.o) only, do not link;
• -S - compile to assembly (.s) only, do not link;
• -o <file> - name the output file;
• -print-multi-lib - print information about libraries being used.

Fortran preprocessor options
• -module [path] - specify the path where .mod files should be placed and the first location to look for .mod files;
• -I<dir> - add a directory to the include file search path.

Language options
• -i2 - set the default KIND of integer variables to 2;
• -i4 - set the default KIND of integer variables to 4;
• -i8 - set the default KIND of integer variables to 8;
• -integer-size <size> - specify the default size of integer and logical variables: 16, 32 or 64 bits;
• -r8 - set the default size of REAL to 8 bytes;
• -r16 - set the default size of REAL to 16 bytes;
• -real-size <size> - specify the size of REAL and COMPLEX declarations, constants, functions and intrinsics: 32, 64 or 128 bits;
• -[no]fixed - specify that source files are in fixed format;
• -[no]free - specify that source files are in free format;
• -auto - make all local variables AUTOMATIC;
• -auto-scalar - make scalar local variables AUTOMATIC (default);
• -save - save all variables (static allocation);
• -syntax-only - perform a syntax check only.

Miscellaneous options
• -help - print a help message;
• -V - display compiler version information.

Linking options
• -L<dir> - instruct the linker to search <dir> for libraries;
• -i-dynamic - link Intel-provided libraries dynamically;
• -i-static - link Intel-provided libraries statically;
• -dynamic-linker <file> - select a dynamic linker other than the default;
• -static - prevent linking with shared libraries;
• -shared - produce a shared object.

Appendix 2
Compilation and execution of MPI programs in Linux

Compilation
The scripts mpif77, mpif90, mpicc and mpiCC compile and link MPI programs written in Fortran 77, Fortran 90, C and C++ respectively. They supply the options and the special libraries needed to compile and link MPI programs. Each script calls a compiler installed in the system, so both options specific to that compiler and MPI-specific options may be used.
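For example (the source file name example.f90 is hypothetical and used only for illustration), a Fortran 90 MPI program could be compiled and linked with:

mpif90 -O2 example.f90 -o example

Options of the underlying Fortran compiler, such as -O2, are passed through by the script.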
Execution
mpirun is a shell script that is used to execute parallel MPI programs. It is typically invoked as follows:

mpirun -np <number of processes> <options> <program name and its arguments>

Options
The options for mpirun must come before the name of the program to be run:
• -h - print short help;
• -machinefile <machinefile name> - take the list of possible machines to run on from the file <machinefile name>;
• -np <np> - specify the number of processes to run;
• -t - test mode (do not actually run, just print what would be executed);
• -v - verbose mode (print additional comments).

A machine file is an ordinary plain text file with the network names of computers, one name per line:

pd00
pd01
pd02
pd03
pd99

Additional options may be included in a machine file; they define, for example, the maximum number of processes of a parallel program that may be executed on a given computer.
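As an illustration (the program name example and the file name machines.txt are hypothetical), four processes of a program could be started on the machines listed in a machine file as follows:

mpirun -np 4 -machinefile machines.txt ./example

The exact syntax of per-host options in a machine file depends on the MPI implementation; in MPICH, for instance, a line such as pd00:2 typically allows up to two processes to be placed on the host pd00.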