IBM System Blue Gene Solution: Blue Gene/P Application Development
Carlos Sosa
Brant Knudson
ibm.com/redbooks
SG24-7287-03
Note: Before using this information and the product it supports, read the information in Notices on
page ix.
Contents
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .x
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
The team who wrote this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Become a published author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Summary of changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
September 2009, Fourth Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
December 2008, Third Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi
September 2008, Second Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Part 1. Blue Gene/P: System and environment overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 1. Hardware overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 System architecture overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 System buildup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Compute and I/O nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Blue Gene/P environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Differences between Blue Gene/L and Blue Gene/P hardware . . . . . . . . . . . . . . . . . . . 7
1.3 Microprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Compute nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 I/O Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.7 Blue Gene/P programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 Blue Gene/P specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.9 Host system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.9.1 Service node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.9.2 Front end nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.9.3 Storage nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.10 Host system software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 2. Software overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Blue Gene/P software at a glance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Compute Node Kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 High-performance computing and High-Throughput Computing modes. . . . . . . . 18
2.2.2 Threading support on Blue Gene/P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Message Passing Interface on Blue Gene/P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Memory considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Memory leaks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Memory management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3 Uninitialized pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Other considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.1 Input/output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2 Linking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Compilers overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6.1 Programming environment overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6.2 GNU Compiler Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 5. Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1 Memory overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Memory management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.1 L1 cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.2 L2 cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.3 L3 cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.4 Double data RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Memory protection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 Persistent memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
Notices
This information was developed for products and services offered in the U.S.A.
IBM might not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service might be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right might be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM might have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement might not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM might
make improvements and/or changes in the product(s) and/or the program(s) described in this publication at
any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM might use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You might copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation
in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence
in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by
IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in
other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AIX
Blue Gene/L
Blue Gene/P
Blue Gene
DB2 Universal Database
DB2
eServer
General Parallel File System
GPFS
IBM
LoadLeveler
POWER
POWER4
POWER5
POWER6
PowerPC
Redbooks
Redbooks (logo)
System p
Tivoli
Preface
This IBM Redbooks publication is one in a series of IBM books written specifically for the
IBM System Blue Gene/P Solution. The Blue Gene/P system is the second generation of a
massively parallel supercomputer from IBM in the IBM System Blue Gene Solution series. In
this book, we provide an overview of the application development environment for the Blue
Gene/P system. We intend to help programmers understand the requirements to develop
applications on this high-performance massively parallel supercomputer.
In this book, we explain instances where the Blue Gene/P system is unique in its
programming environment. We also attempt to look at the differences between the IBM
System Blue Gene/L Solution and the Blue Gene/P Solution. In this book, we do not delve
into great depth about the technologies that are commonly used in the supercomputing
industry, such as Message Passing Interface (MPI) and Open Multi-Processing (OpenMP),
nor do we try to teach parallel programming. References are provided in those instances for
you to find more information if necessary.
Prior to reading this book, you must have a strong background in high-performance
computing (HPC) programming. The high-level programming languages that we use
throughout this book are C/C++ and Fortran95. Previous experience using the Blue Gene/L
system can help you better understand some concepts in this book that we do not extensively
discuss. However, several IBM Redbooks publications about the Blue Gene/L system are
available for you to obtain general information about the Blue Gene/L system. We recommend
that you refer to IBM Redbooks on page 371 for a list of those publications.
We thank the following people and their teams for their contributions to this book:
Tom Liebsch for being the lead source for hardware information
Harold Rodakowski for software information
Thomas M. Gooding for kernel information
Michael Blocksome for parallel paradigms
Michael T. Nelson and Lynn Boger for their help with the compiler
Thomas A. Budnik for his assistance with APIs
Paul Allen for his extensive contributions
We also thank the following people for their contributions to this project:
Gary Lakner
Gary Mullen-Schultz
ITSO, Rochester, MN
Dino Quintero
ITSO, Poughkeepsie, NY
Paul Allen
John Attinella
Mike Blocksome
Lynn Boger
Thomas A. Budnik
Ahmad Faraj
Thomas M. Gooding
Nicholas Goracke
Todd Inglet
Tom Liebsch
Mark Megerian
Sam Miller
Mike Mundy
Tom Musta
Mike Nelson
Jeff Parker
Ruth J. Poole
Joseph Ratterman
Richard Shok
Brian Smith
IBM Rochester
Philip Heidelberg
Sameer Kumar
Martin Ohmacht
James C. Sexton
Robert E. Walkup
Robert Wisniewski
IBM Watson Center
Mark Mendell
IBM Toronto
Ananthanaraya Sugavanam
Enci Zhong
IBM Poughkeepsie
Kirk Jordan
IBM Waltham
Jerrold Heyman
IBM Raleigh
Subba R. Bodda
IBM India
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an e-mail to:
[email protected]
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Summary of changes
This section describes the technical changes made in this edition of the book and in previous
editions. This edition might also include minor corrections and editorial changes that are not
identified.
September 2009, Fourth Edition
New information
The control and I/O daemon creates a /jobs directory, described in 3.3.1, Control
and I/O daemon on page 33.
The Compute Node Kernel now supports multiple application threads per core, described
in Chapter 4, Execution process modes on page 37.
Support for GOMP, described in 8.5, Support for pthreads and OpenMP on page 100.
Package configuration can be forwarded to the Blue Gene/P compute nodes, described in
8.10, Configuring Blue Gene/P builds on page 105.
mpirun displays APPLICATION RAS events with -verbose 2 or higher, described in
Chapter 11, mpirun on page 177.
The Real-time APIs now support RAS events, described in Chapter 14, Real-time
Notification APIs on page 251.
Environment variable for specifying the cores that generate binary core files, described in
Appendix D, Environment variables on page 339.
Two new environment variables that affect latency on broadcast and allreduce have been
added: DCMF_SAFEBCAST and DCMF_SAFEALLREDUCE. See Appendix D,
Environment variables on page 339 for more information.
Modified information
The prefix of several constants was changed from DCMF_ to MPIO_ in 7.3.2,
Configuring MPI algorithms at run time on page 77.
The Python language version is now 2.6.
The IBM XL compilers support the -qsigtrap compiler option, described in Chapter 8,
Developing applications with IBM XL compilers on page 97.
Users can debug HTC applications using submit, described in 12.3, Running a job using
submit on page 202.
December 2008, Third Edition
New information
The IBM Blue Gene/P hardware now supports compute nodes with 4 GB of DDR
memory.
STAR-MPI API in Buffer alignment sensitivity on page 73.
Per-site Message Passing Interface (MPI) configuration API in 7.3.2, Configuring MPI
algorithms at run time on page 77.
Blue Gene/P MPI environment variables in MPI functions on page 80.
Scalable Debug API in 9.2.10, Scalable Debug API on page 161.
mpikill command-line utility in mpikill on page 180.
mpirun -start_tool and -tool_args arguments in Invoking mpirun on page 183.
Tool-launching interface in Tool-launching interface on page 188.
In Examples on page 191, failing mpirun prints out reliability, availability, and
serviceability (RAS) events.
job_started() function in mpirun APIs on page 199.
Immediate High-Throughput Computing (HTC) partition user list modification in Altering
the HTC partition user list on page 206.
Partition options modifications in section Partition-related APIs on page 215.
RM_PartitionBootOptions specification in Field specifications for the rm_get_data() and
rm_set_data() APIs on page 229.
Server-side filtering, notification of HTC events, and new job callbacks in Chapter 14,
Real-time Notification APIs on page 251.
Modified information
In mpiexec on page 179, removing some limitations that are no longer present
Use of the Compute Node Kernel's thread stack protection mechanism in more situations, as
described in Memory protection on page 47
Changes to Dynamic Partition Allocator APIs, described in Chapter 15, Dynamic Partition
Allocator APIs on page 295, in non-backwards compatible ways
HTC partitions support for group permissions, as described in Appendix G, htcpartition
on page 359
September 2008, Second Edition
New information
High Throughput Computing in Chapter 12, High-Throughput Computing (HTC)
paradigm on page 201
Documentation on htcpartition in Appendix G, htcpartition on page 359
Modified information
The book was reorganized to include Part 4, Job scheduler interfaces on page 207. This
section contains the Blue Gene/P APIs. Updates to the API chapters for HTC are included.
Appendix F, Mapping on page 355 is updated to reflect the predefined mapping for
mpirun.
Part 1. Blue Gene/P: System and environment overview
Chapter 1.
Hardware overview
In this chapter, we provide a brief overview of the Blue Gene/P hardware. It is intended both for
programmers who are interested in learning about the Blue Gene/P system and for programmers
who are already familiar with the Blue Gene/L system and want to understand the differences
between the Blue Gene/L and Blue Gene/P systems.
It is important to understand where the Blue Gene/P system fits within the multiple systems
that are currently available in the market. To gain a historical perspective and a perspective
from an applications point-of-view, we recommend that you read the first chapter of the book
Unfolding the IBM eServer Blue Gene Solution, SG24-6686. Although this book is written for
the Blue Gene/L system, these concepts apply to the Blue Gene/P system.
In this chapter, we describe the Blue Gene/P architecture and provide an overview of the
machine with a brief description of some of its components. Specifically, we address the
system architecture, the microprocessor, the compute and I/O nodes, the networks, the Blue
Gene/P specifications, and the host system and its software.
The system is built up from the following units:
Chip            Four PowerPC 450 processors integrated on a single ASIC.
Compute card    One chip is soldered to a small processor card, together with memory (DRAM), to create a compute card (one node). The amount of DRAM per card is 2 GB or 4 GB.
Node card       The compute cards are plugged into a node card. There are two rows of sixteen compute cards on the card (planar). From zero to two I/O nodes per node card can be added.
Rack            A rack holds 32 node cards.
System          A full system consists of 72 racks.
(Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM. Compute card: 1 chip and 40 DRAMs, 13.6 GF/s, 2 or 4 GB DDR. Node card: 32 compute and 0-2 I/O cards, 435 GF/s, up to 128 GB. Rack: 32 node cards, 14 TF/s, up to 4 TB. System: 72 racks cabled 8x8x16, 1 PF/s, up to 288 TB.)
Figure 1-1 Blue Gene/P system overview from the microprocessor to the full system
The I/O node is connected to external devices through an Ethernet port to the 10 gigabit
functional network and can perform file I/O operations. In the next section, we provide an
overview of the Blue Gene environment, including all the components that fully populate the
system.
Front end node      This node provides access for users to submit, compile, and build applications.
Compute node
I/O node            This node provides access to external devices; all I/O requests are routed through this node.
Functional network  This network is used by all components of the Blue Gene/P system except the Compute Node.
Control network
(Figure: the Blue Gene/P environment. The service node, running DB2 and MMCS, controls the racks over the control network, which uses Gigabit Ethernet with JTAG and I2C connections. File servers, the system console, and the scheduler connect through the functional 10 Gbps Ethernet network to the I/O nodes. Each pset pairs one I/O node, which runs Linux with the ciod daemon and a file system client, with a group of compute nodes, which run the CNK and the MPI application; the nodes of a pset are linked by the collective network, and the compute nodes are also connected by the torus.)
Table 1-1 compares selected features between the Blue Gene/L and Blue Gene/P systems.
Table 1-1 Feature comparison between the Blue Gene/L and Blue Gene/P systems
Feature                        Blue Gene/L              Blue Gene/P
Processor frequency            700 MHz                  850 MHz
Cache coherency                Software managed         SMP
Private L1 cache               32 KB per core           32 KB per core
Private L2 cache               14 stream prefetching    14 stream prefetching
Shared L3 cache                4 MB                     8 MB
Main memory per node           512 MB - 1 GB            2 GB or 4 GB
Main memory bandwidth          5.6 GBps                 13.6 GBps
Torus network bandwidth        2.1 GBps                 5.1 GBps
Collective network bandwidth   700 MBps                 1.7 GBps
Collective network latency     5.0 µs                   3.0 µs
1.3 Microprocessor
The microprocessor is a PowerPC 450, Book E compliant, 32-bit microprocessor with a clock
speed of 850 MHz. The PowerPC 450 microprocessor, with its double-precision floating-point
multiply-add unit (double FMA), can deliver four floating-point operations per cycle, or
3.4 GFLOPS per core.
(Figure: the Blue Gene/P node ASIC. Four PowerPC 450 cores, each with a double FPU and private L1 and L2 prefetching caches, connect over an internal bus and multiplexing switches to two 4 MB eDRAM L3 banks and two DDR-2 controllers, with 2 x 16-byte buses to the DDR2 DRAM running at processor speed. A DMA module allows remote direct put and get operations and feeds the torus network (6 bidirectional directions at 4 bits/cycle), the collective network (3 bidirectional ports at 8 bits/cycle), the barrier network (4 bidirectional ports), the 10 Gb Ethernet physical layer, and the JTAG control network.)
(Figure: the node card, with its local DC-DC regulators; six are required, or eight with redundancy.)
1.6 Networks
Five networks are used for various tasks on the Blue Gene/P system:
Three-dimensional torus: point-to-point
The torus network is used for general-purpose, point-to-point message passing and
multicast operations to a selected class of nodes. The topology is a three-dimensional
torus constructed with point-to-point, serial links between routers embedded within the
Blue Gene/P ASICs. Therefore, each ASIC has six nearest-neighbor connections, some of
which can traverse relatively long cables. The target hardware bandwidth for each torus
link is 425 MBps in each direction of the link (425 MBps x 2 directions x 6 links), for a total of
5.1 GBps bidirectional bandwidth per node.
The system software that is provided with each Blue Gene/P core rack or racks includes the
following programs:
IBM DB2 Universal Database Enterprise Server Edition: System administration and
management
Compilers: XL C/C++ Advanced Edition for Linux with OpenMP support and XLF (Fortran)
Advanced Edition for Linux
1.8 Blue Gene/P specifications
Processor frequency              850 MHz
Coherency                        Symmetrical multiprocessing
L1 cache (private)               32 KB per core
L2 cache (private)               14 stream prefetching
L3 cache (shared)                8 MB
Main memory                      2 GB or 4 GB per node
Torus network bandwidth          6 GBps
Torus network hardware latency   3 µs (64 hops)
Collective network bandwidth     2 GBps
Collective network latency       2.5 µs
Peak performance (full system)   1 PFLOPS
Host system                      Service and Front End Nodes, storage system, Ethernet switch, cabling, SLES10
System software                  DB2, XLF/C compilers
Optional HPC software            LoadLeveler, GPFS, ESSL
Chapter 2.
Software overview
In this chapter, we provide an overview of the software that runs on the Blue Gene/P system.
As shown in Chapter 1, Hardware overview on page 3, the Blue Gene/P environment
consists of compute and I/O nodes. It also has an external set of systems where users can
perform system administration and management, partition and job management, application
development, and debugging. In this heterogeneous environment, software must be able to
interact.
Specifically, we cover the Compute Node Kernel, the Message Passing Interface, memory
considerations, input/output, linking, and the compilers.
The Blue Gene/P environment includes the following components:
Compute nodes
I/O Nodes
Front end nodes where users compile and submit jobs
Control management network
Service node, which provides capabilities to manage jobs running in the racks
Hardware in the racks
The Front End Node consists of the interactive resources on which users log on to access the
Blue Gene/P system. Users edit and compile applications, create job control files, launch jobs
on the Blue Gene/P system, post-process output, and perform other interactive activities.
An Ethernet switch is the main communication path for applications that run on the Compute
Node to the external devices. This switch provides high-speed connectivity to the file system,
which is the main disk storage for the Blue Gene/P system. This switch also gives other
resources access to the files on the file system.
A control and management network provides system administrators with a separate
command and control path to the Blue Gene/P system. This private network is not available to
unprivileged users.
The software for the Blue Gene/P system consists of several integrated software subsystems.
(Figure 2-1: the Blue Gene/P system software spans an application layer, a device layer, and a system layer across the compute nodes, which run a lightweight kernel (LWK/CNK), the I/O nodes, which run Linux, the link, service, and node cards, the service node, and the file systems.)
The software environment illustrated in Figure 2-1 relies on a series of header files and
libraries. A selected set is listed in Appendix C, Header files and libraries on page 335.
The Compute Nodes on Blue Gene/P are implemented as quad cores on a single chip with
2 GB or 4 GB of dedicated physical memory in which applications run.
A process is executed on Blue Gene/P nodes in the following three main modes:
Symmetrical Multiprocessing (SMP) node mode
Virtual node (VN) mode
Dual mode (DUAL)
Application programmers see the Compute Node Kernel software as a Linux-like operating
system. This Linux-like environment is accomplished on the Blue Gene/P software stack by
providing a standard set of run-time libraries for C, C++, and Fortran95. To the extent that is
possible, the supported functions maintain open standard POSIX-compliant interfaces. We
discuss the Compute Node Kernel further in Part 2, Kernel overview on page 27.
Applications can access system calls that provide hardware or system features, as illustrated
by the examples in Appendix B, Files on architectural features on page 331.
A function of the MPI-2 standard that is not supported by Blue Gene/P is dynamic process
management (creating new MPI processes). However, the various thread modes are
supported.
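As a quick check of the granted thread level, an application can call MPI_Init_thread(); the following minimal sketch simply reports what the library provides and does not assume any Blue Gene/P-specific behavior.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* Ask for the highest level of thread support and check what was granted. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        printf("MPI thread support granted: %d (MPI_THREAD_MULTIPLE is %d)\n",
               provided, MPI_THREAD_MULTIPLE);
    }

    MPI_Finalize();
    return 0;
}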
The Compute Node Kernel keeps track of collisions of stack and heap as the heap is
expanded with a brk() syscall. The Blue Gene/P system includes stack guard pages.
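The program break that brk() manages can also be observed from user code. The sketch below uses the standard sbrk() wrapper to grow the heap by 1 MB and print how far the break is from a stack variable; it is a generic illustration rather than a Blue Gene/P-specific interface.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char stack_marker;                 /* lives near the top of the stack */
    void *before = sbrk(0);            /* current program break */

    sbrk(1024 * 1024);                 /* extend the heap by 1 MB */
    void *after = sbrk(0);

    printf("break before: %p, after: %p\n", before, after);
    printf("distance from break to stack: %ld bytes\n",
           (long)((char *)&stack_marker - (char *)after));
    return 0;
}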
The Compute Node Kernel and its private data are protected from read/write by the user
process or threads. The code space of the process is protected from writing by the process or
threads. Code and read-only data are shared between the processes in Virtual Node Mode
unlike in the Blue Gene/L system.
In general, give careful consideration to memory when writing applications for the Blue
Gene/P system. At the time this book was written, each node has 2 GB or 4 GB of physical
memory.
As previously mentioned, memory addressing is an important topic in regard to the Blue
Gene/P system. An application that stores data in memory falls into one of the following
classifications:
data
bss
heap
stack
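As a simple illustration, the following C program shows where typical objects fall in these four classifications; the placements noted in the comments are the usual ones for C programs.

#include <stdio.h>
#include <stdlib.h>

int    initialized_global = 42;     /* data: initialized global variables        */
double uninitialized_global[1024];  /* bss: uninitialized (zero-filled) globals  */

int main(void)
{
    int     local = 7;                               /* stack: automatic variables  */
    double *buffer = malloc(1024 * sizeof(double));  /* heap: dynamic allocations   */

    printf("data  : %p\n", (void *)&initialized_global);
    printf("bss   : %p\n", (void *)uninitialized_global);
    printf("stack : %p\n", (void *)&local);
    printf("heap  : %p\n", (void *)buffer);

    free(buffer);
    return 0;
}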
You can use the Linux size command to gain an idea of the memory size of the program.
However, the size command does not provide any information about the run-time memory
usage of the application or about how the data is classified. Figure 2-2 illustrates
memory addressing in HPC based on the different node modes that are available on the Blue
Gene/P system.
Figure 2-2 Memory addressing on the Blue Gene/P system as a function of the different node modes
2.5.1 Input/output
I/O is an area where you must pay special attention in your application. The CNK does not
perform I/O. This is carried out by the I/O Node.
File I/O
A limited set of file I/O is supported. Do not attempt to use asynchronous file I/O because it
results in run-time errors.
Standard input
Standard input (stdin) is supported on the Blue Gene/P system.
Sockets calls
Sockets are supported on the Blue Gene/P system. For additional information, see Chapter 6,
System calls on page 51.
2.5.2 Linking
Dynamic linking is not supported on the Blue Gene/L system. However, it is supported on the
Blue Gene/P system. You can now statically link all code into your application or use dynamic
linking.
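With a dynamically linked application, additional libraries can also be loaded at run time through the standard dlopen() interface. The sketch below only illustrates that mechanism; the library name libexample.so and the symbol compute are hypothetical, and with glibc the program is linked with -ldl.

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* libexample.so and "compute" are placeholders used for illustration only. */
    void *handle = dlopen("libexample.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    double (*compute)(double) = (double (*)(double))dlsym(handle, "compute");
    if (compute != NULL) {
        printf("compute(2.0) = %f\n", compute(2.0));
    }

    dlclose(handle);
    return 0;
}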
Application
XL Libs
GCC Libs
GLIBC
Compute Node Kernel
CIOD (runs on I/O Node)
Figure 2-3 Software stack supporting the execution of Blue Gene/P applications
ciodb
ciodb is now integrated as part of the MMCS server for the Blue Gene/P system, which is
different from the Blue Gene/L system. ciodb is responsible for launching jobs on already
booted blocks. Communication to ciodb occurs through the database and can be initiated
by either mpirun or the Bridge APIs.
MMCS
The MMCS daemon is responsible for configuring and booting blocks. It can be controlled
either by a special console interface (similar to the Blue Gene/L system) or by the Bridge
APIs. The MMCS daemon also is responsible for relaying RAS information into the RAS
database.
mcServer
The mcServer daemon has low-level control of the system, which includes a parallel
efficient environmental monitoring capability as well as a parallel efficient reset and code
load capability for configuring and booting blocks on the system. The diagnostics for the
Blue Gene/P system directly leverage this daemon for greatly improved diagnostic
performance over that of the Blue Gene/L system.
bgpmaster
The bgpmaster daemon monitors the other daemons and restarts any failed components
automatically.
Service actions
Service actions are a suite of administrative shell commands that are used to service
hardware. They are divided into device-specific actions with a begin and end action.
Typically the begin action powers down hardware so it can be removed from the system,
and the end action powers up the replacement hardware. The databases are updated
with these operations, and they coordinate automatically with the scheduling system as
well as the diagnostic system.
Submit server daemon
The submit server daemon is the central resource manager for High-Throughput
Computing (HTC) partitions. When MMCS boots a partition in HTC mode, each partition
registers itself with the submit server daemon before going to the initialized state. This
registration process includes several pieces of information, such as partition mode (SMP,
DUAL, VN), partition size, user who booted the partition, and the list of users who can run
on the partition. This information is maintained in a container and is used to match
resource requests from submit commands based on their job requirements.
mpirund
The mpirund is a daemon process running on the service node whose purpose is to
handle connections from frontend mpirun processes, and fork backend mpirun processes.
Real-time server
The Real-time Notification APIs are designed to eliminate the need for a resource
management system to constantly read in all of the machine state to detect changes. The
APIs allow the caller to be notified in real time of state changes to jobs, blocks, and
hardware, such as base partitions, switches, and node cards. After a resource
management application has obtained an initial snapshot of the machine state using the
Bridge APIs, it can choose to be notified only of changes, and the Real-time Notification
APIs provide that mechanism.
Part 2. Kernel overview
The kernel provides the glue that makes all components in Blue Gene/P work together. In this
part, we provide an overview of the kernel functionality for applications developers. This part
is for those who require information about system-related calls and interaction with the kernel.
This part contains the following chapters:
Chapter 3.
Kernel functionality
In this chapter, we provide an overview of the functionality that is implemented as part of the
Compute Node Kernel and I/O Node kernel. We discuss the following topics:
System software overview
Compute Node Kernel
I/O node kernel
When running a user application, the CNK connects to the I/O Node through the collective
network. This connection communicates to a process that is running on the I/O Node called
the control and I/O daemon (CIOD). All function-shipped system calls are forwarded to the
CIOD process and executed on the I/O Node.
At the user-application level, the Compute Node Kernel supports the following APIs among
others:
Message Passing Interface (MPI) support between nodes using MPI library support
Open Multi-Processing (OpenMP) API
Standard IBM XL family of compilers support with XLC/C++, XLF, and GNU Compiler
Collection
Highly optimized mathematical libraries, such as IBM Engineering and Scientific
Subroutine Library (ESSL)
GNU Compiler Collection (GCC) C Library, or glibc, which is the C standard library and
interface of GCC, for a provider library plugging into another library (system programming
interfaces (SPIs))
CNK provides the following services:
Torus direct memory access (DMA), which provides memory access for reading, writing,
or doing both independently of the processing unit
Shared-memory access on a local node
Hardware configuration
Memory management
MPI topology
File I/O
Sockets connection
Signals
Thread management
Transport layer via collective network
CIOD:
Lightweight proxy between Compute Nodes and the outside world
Debugger access into the Compute Nodes
SMP support
(CIOD, BusyBox and other packages, ntpd, syslogd, and similar services run on the Linux kernel, with its device drivers and /proc files, on top of the Common Node Services.)
Figure 3-2 I/O Node kernel overview
Jobs directory
Before CIOD starts a job it creates a directory called /jobs/<jobId>, where <jobId> is the ID of
the job as assigned by MMCS. This directory can be accessed by jobs running on the
compute nodes that are connected to the I/O node. The directory for each job contains the
following entries:
exe
A symlink to the executable.
cmdline
A file with the list of arguments given to the program.
environ
A file with the list of environment variables given to the program.
wdir
A symlink to the initial working directory for the program.
noderankmap
A file that contains the mapping of node location to MPI rank. This file is created only when
a tool daemon is started.
The /jobs directory is owned by root and has read and execute permission for everybody
(r-xr-xr-x), whereas the individual job directory is owned by the user that started the job and
has read and execute permission for everybody.
When the job completes, CIOD removes the /jobs/<jobId> directory.
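A tool or application can read these files with ordinary file I/O. The following sketch prints the command line recorded for a job; the job ID 12345 is a placeholder used purely for illustration, because a real tool obtains the ID from its launcher.

#include <stdio.h>

int main(void)
{
    /* The job ID 12345 is a placeholder for whatever ID MMCS assigned. */
    const char *path = "/jobs/12345/cmdline";
    char line[4096];

    FILE *f = fopen(path, "r");
    if (f == NULL) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f) != NULL) {
        fputs(line, stdout);      /* echo the recorded program arguments */
    }
    fclose(f);
    return 0;
}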
Chapter 4.
Execution process modes
In VN mode, the four cores of a Compute Node act as different processes. Each has its own
rank in the message layer. The message layer supports VN mode by providing a correct torus
to rank mapping and first in, first out (FIFO) pinning. The hardware FIFOs are shared equally
between the processes. Torus coordinates are expressed by quadruplets instead of triplets. In
VN mode, communication between the four threads in a Compute Node occurs through DMA
local copies.
Each core executes one compute process. Processes that are allocated in the same
Compute Node share memory, which can be reserved at job launch. An application that
wants to run with four tasks per node can dedicate a large portion for shared memory, if the
tasks need to share global data. This data can be read/write, and data coherency is handled
in hardware.
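One common way to use this per-node shared memory is the POSIX shared-memory interface. The sketch below assumes that shm_open() and mmap() are available to the application; the region name /my_region and its size are arbitrary choices for illustration.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t size = 1024 * 1024;           /* 1 MB shared between the node's tasks */

    /* Every task on the node opens the same named region. */
    int fd = shm_open("/my_region", O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("shm_open");
        return 1;
    }
    (void)ftruncate(fd, size);

    char *shared = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (shared == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    strcpy(shared, "hello from one task");     /* visible to the other tasks on the node */

    munmap(shared, size);
    close(fd);
    return 0;
}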
The Blue Gene/P MPI implementation supports VN mode operations by sharing the system's
communications resources of a physical Compute Node between the four compute processes
that execute on that physical node. The low-level communications library of the Blue Gene/P
system, that is the message layer, virtualizes these communications resources into logical
units that each process can use independently.
Figure 4-3 shows two processes per node. Each process can have up to two threads.
OpenMP and pthreads are supported. Shared memory is available between processes.
Threads are pinned to a processor.
In the case of the submit command, you use the following commands:
submit ... -mode VN ...
submit ... -mode DUAL ...
See Chapter 12, High-Throughput Computing (HTC) paradigm on page 201 for more
information about the submit command.
Chapter 5.
Memory
In this chapter, we provide an overview of the memory subsystem and explain how it relates
to the Compute Node Kernel. This chapter includes the following topics:
Memory overview
Memory management
Memory protection
Persistent memory
Memory          Total per node size                Replacement policy     Associativity
L1 instruction  32 KB                              Round-robin            64-way set-associative, 16 sets, 32-byte line size
L1 data         32 KB                              Round-robin            64-way set-associative, 16 sets, 32-byte line size
L2 prefetch     14 x 256 bytes                     Round-robin
L3              2 x 4 MB                           Least recently used    8-way associative, 2-bank interleaved, 128-byte line size
DDR             Minimum 2 x 512 MB, maximum 4 GB   N/A
5.2.1 L1 cache
On the Blue Gene/P system, the PowerPC 450 internal L1 cache does not have automatic
prefetching. Explicit cache touch instructions are supported. Although the L1 instruction
cache was designed with support for prefetches, it was disabled for efficiency reasons.
Figure 1-3 on page 9 shows the L1 caches in the PowerPC 450 architecture. The size of the
L1 cache line is 32 bytes. The L1 cache has two buses toward the L2 cache: one for the
stores and one for the loads. The buses are 128 bits in width and run at full processor
frequency. The theoretical limit is 16 bytes per cycle. However, 4.6 bytes is achieved on L1
load misses, and 5.6 bytes is achieved on all stores (write through). This value of 5.6 bytes is
achieved for the stores but not for the loads. The L1 cache has only a three-line fetch buffer.
Therefore, there are only three outstanding L1 cache line requests. The fourth one waits for
the first one to complete before it can be sent.
The L1 hit latency is four cycles for floating point and three cycles for integer. The L2 hit
latency is at about 12 cycles for floating point and 11 cycles for integer. The 4.6-byte
throughput limitation is a result of the limited number of line fill buffers, L2 hit latency, the
policy when a line fill buffer commits its data to L1, and the penalty of delayed load
confirmation when running fully recoverable.
Because only three outstanding L1 cache line load requests can occur at the same time, at
most three cache lines can be obtained every 18 cycles. The maximum memory bandwidth is
three times 32 bytes divided by 18 cycles, which yields 5.3 bytes per cycle, which written as
an equation looks like this:
(3 x 32 bytes) / 18 cycles = 5.3 bytes per cycle
Important:
Use caution when prefetching data into the L1 cache on the Blue Gene/P system. The
processor can concurrently fill only three L1 cache lines; therefore, it is
mandatory to reduce the number of prefetching streams to three or fewer.
To optimize the floating-point units (FPUs) and feed the floating-point registers, you can
use the XL compiler directives or assembler instructions (dcbt) to prefetch data in the
L1 data cache. The applications that are specially tuned for IBM POWER4 or
POWER5 processors that take advantage of four or eight prefetching engines will
choke the memory subsystem of the Blue Gene/P processor.
To take advantage of the single-instruction, multiple-data (SIMD) instructions, it is
essential to keep the data in the L1 cache as much as possible. Without an intensive
reuse of data from the L1 cache and the registers, because of the number of registers,
the memory subsystem is unable to feed the double FPU and provide two
multiply-addition operations per cycle.
In the worst case, SIMD instructions can hurt the global performance of the application. For
that reason, we advise that you disable the SIMD instructions in the porting phase by
compiling with -qarch=450. Then recompile the code with -qarch=450d and analyze the
performance impact of the SIMD instructions. Perform the analysis with a data set and a
number of processors that is realistic in terms of memory usage.
Optimization tips:
The optimization of the applications must be based on the 32 KB of the L1 cache.
The benefits of the SIMD instructions can be canceled out if data does not fit in the L1
cache.
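As an illustration of these points, a simple unit-stride loop such as the one below is a candidate for double-FPU (SIMD) code generation when it is compiled with -qarch=450d and the compiler can assume 16-byte alignment. The __alignx() assertion and the <builtins.h> header are XL compiler features; treating them as available is an assumption of this sketch.

#ifdef __IBMC__
#include <builtins.h>           /* XL compiler built-ins such as __alignx() */
#endif

#define N 1024

/* An aligned, unit-stride loop like this one is a candidate for double-FPU
 * (SIMD) code generation when it is compiled with -qarch=450d.             */
void daxpy(double *x, double *y, double a)
{
#ifdef __IBMC__
    __alignx(16, x);            /* assert 16-byte alignment to the XL compiler */
    __alignx(16, y);
#endif
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
    }
}

Compiling the same routine once with -qarch=450 and once with -qarch=450d, as suggested above, makes it easy to compare the effect of the SIMD instructions on a realistic data set.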
5.2.2 L2 cache
The L2 cache is the hardware layer that provides the link between the embedded cores and
the Blue Gene/P devices, such as the 8 MB L3-eDRAM and the 32 KB SRAM. The 2 KB L2
cache has a line size of 128 bytes. Each L2 cache is connected to one processor core.
The L2 design and architecture were created to provide optimal support for the PowerPC 450
cores for scientific applications. Thus, the logic for automatic sequential stream detection and
prefetching to the L2 that was added on the PowerPC 440 is still available on the PowerPC 450. The logic
is optimized to perform best on sequential streams with increasing addresses. The L2 boosts
overall performance for almost any application and does not require any special software
provisions. It autonomously detects streams, issues the prefetch requests, and keeps the
prefetched data coherent.
You can achieve latency and bandwidth results close to the theoretical limits (4.6 bytes per
cycle) dictated by the PowerPC 450 core by doing careful programming. The L2 accelerates
memory accesses for one to seven sequential streams.
5.2.3 L3 cache
The L3 cache is 8 MB in size. The line size is 128 bytes. Both banks are directly accessed by
all processor cores and the 10 Gb network, only on the I/O Node, and are used in Compute
Nodes for torus DMA. There are three write queues and three read queues. The read queues
directly access both banks.
Each L3 cache implements two sets of write buffer entries. Into each of the two sets, one
32-byte data line can be deposited per cycle from any queue. In addition, one entry can be
allocated for every cycle in each set. The write rate for random data is much higher in the Blue
Gene/P system than in the Blue Gene/L system. The L3 cache can theoretically complete an
aggregate of four write hits per chip every two cycles. However, banking conflicts reduce this
number in most cases.
Optimization tip: Random access can divide the write sustained bandwidth of the L3
cache by a factor of three on Compute Nodes and more on I/O Nodes.
Measured latency and sustained bandwidth by memory level: an L2 hit costs about 11 cycles and sustains about 4.6 bytes per cycle of sequential bandwidth; an L3 hit costs about 50 cycles and also sustains about 4.6 bytes per cycle; external DDR memory costs about 104 cycles. The latencies correspond to integer loads (floating-point latency is one cycle higher), the bandwidth figures are the maximum sustainable bandwidth for linear sequential access, and random-access bandwidth depends on the access width and the overlap of accesses.
The CNK is strict in terms of TLB setup, for example, the CNK does not create a 256 MB TLB
that covers only 128 MB of real memory. By precisely creating the TLB map, any user-level
page faults (also known as segfaults) are immediately caught.
In the default mode of operation of the Blue Gene/P system, which is SMP node mode, each
physical Compute Node executes a single task per node with a maximum of four threads. The
Blue Gene/P system software treats those four core threads in a Compute Node
symmetrically. Figure 5-1 illustrates how memory is accessed in SMP node mode. The user
space is divided into user space read, execute and user space read/write, execute. The
latter corresponds to global variables, stack, and heap. In this mode, the four threads have
access to the global variables, stack, and heap.
Figure 5-2 shows how memory is accessed in Virtual Node Mode. In this mode, the four core
threads of a Compute Node act as different processes. The CNK reads only sections of an
application from local memory. No user access occurs between processes in the same node.
User space is divided into user-space read, execute and user-space read/write, execute.
The latter corresponds to global variables, stack, and heap. These two sections are designed
to avoid data corruption.
Each task in Dual mode gets half the memory and cores so it can run two threads per task.
Figure 5-3 shows that no user access occurs between the two processes. Although a layer of
shared-memory per node and the user-space read, execute is common to the two tasks, the
two user-spaces read/write, execute are local to each process.
Chapter 6.
System calls
System calls provide an interface between an application and the kernel. In this chapter, we
provide information about the service points through which applications running on the
Compute Node request services from the Compute Node Kernel (CNK). This set of entry
points into the CNK is referred to as system calls (syscall). System calls on the Blue Gene/P
system have substantially changed from system calls on the Blue Gene/L system. In this
chapter, we describe system calls that are defined on the Blue Gene/P system.
We cover the following topics in this chapter:
File I/O
Directory operations
Time
Process information
Signals
Miscellaneous
Sockets
Compute Node Kernel (CNK)
The CNK supports a large set of system calls in these categories. The file I/O calls require headers such as <unistd.h>, <sys/types.h>, <sys/stat.h>, <fcntl.h>, <sys/uio.h>, <stdio.h>, <utime.h>, and <sys/statfs.h>; the directory operations use <unistd.h>, <sys/types.h>, <sys/stat.h>, and <sys/dirent.h>; and the time calls use <sys/time.h> and <time.h>. The process information calls include gid_t getgid(void), pid_t getpid(void), and uid_t getuid(void) in <unistd.h>, along with the resource and timing calls declared in <sys/resource.h> and <sys/times.h>. The miscellaneous calls include int brk(void *end_data_segment) in <unistd.h>, the call in <stdlib.h> that terminates a process, int sched_yield(void) in <sched.h>, and the call declared in <sys/utsname.h>.
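The short program below exercises a few calls from the process information, time, and miscellaneous categories; all of them are ordinary glibc interfaces, so the sketch is generic C rather than anything Blue Gene/P-specific.

#include <stdio.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/utsname.h>
#include <unistd.h>

int main(void)
{
    struct timeval tv;
    struct utsname un;

    /* Process information category. */
    printf("pid=%d uid=%d gid=%d\n", (int)getpid(), (int)getuid(), (int)getgid());

    gettimeofday(&tv, NULL);                    /* time category */
    printf("seconds since the epoch: %ld\n", (long)tv.tv_sec);

    uname(&un);                                 /* miscellaneous category */
    printf("system: %s %s\n", un.sysname, un.machine);

    sched_yield();                              /* voluntarily give up the processor */
    return 0;
}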
System programming interfaces (SPIs) are provided for the following hardware:
Collective network
Torus network
Direct memory access
Global interrupts
Performance counters
Lockbox
L2
Snoop
L3
DDR hardware initialization
serdes
Environmental monitor
This hardware is set up by either the bootloader or Common Node Services. The L1
interfaces, such as TLB miss handlers, are typically extremely operating system specific, and
therefore an SPI is not defined. TOMAL and XEMAC are present in the Linux 10 Gb Ethernet
device driver (and therefore open source), but there are no plans for an explicit SPI.
The CNK likewise supports the standard sockets calls; most of them require the <sys/types.h> and <sys/socket.h> headers.
Signal-related calls for sending and managing signals are also supported; they require the <sys/types.h> and <signal.h> headers.
The following system calls are not supported by the CNK:
acct, adjtimex, afs_syscall, bdflush, break, capget, capset, chroot, clock_getres, clock_gettime, clock_nanosleep, clock_settime, create_module, delete_module, epoll_create, epoll_ctl, epoll_wait, execve, fadvise64, fadvise64_64, fchdir, fdatasync,
fgetxattr, flistxattr, flock, fork, fremovexattr, fsetxattr, ftime, get_kernel_syms, getgroups, getpgrp, getpmsg, getppid, getpriority, gettid, getxattr, gtty, idle, init_module, io_cancel, io_destroy, io_getevents, io_setup, io_submit,
ioperm, iopl, ipc, kexec_load, lgetxattr, listxattr, llistxattr, lock, lookup_dcookie, lremovexattr, lsetxattr, mincore, mknod, modify_lft, mount, mpxmq_getsetattr, mq_notify, mq_open, mq_timedreceive, mq_timedsend, mq_unlink, multiplexer,
nfsservctl, nice, oldfstat, oldlstat, oldolduname, olduname, oldstat, pciconfig_iobase, pciconfig_read, pciconfig_write, personality, pipe, pivot_root, prof, profil, ptrace, putpmsg, query_module, quotactl, readahead, readdir, reboot, remap_file_pages,
removexattr, rtas, rts_device_map, rts_dma, sched_get_priority_max, sched_get_priority_min, sched_getaffinity, sched_getparam, sched_getscheduler, sched_rr_get_interval, sched_setaffinity, sched_setparam, sched_setscheduler, select, sendfile, sendfile64, setdomainname, setgroups, sethostname, setpriority, settimeofday, setxattr,
stime, stty, swapcontext, swapoff, swapon, sync, sys_debug_setcontext, sysfs, syslog, timer_create, timer_delete, timer_getoverrun, timer_gettime, timer_settime, tuxcall, umount, umount2, uselib, ustat, utimes, vfork, vhangup, vm86
You can find additional information about these system calls on the syscalls(2) - Linux man
page on the Web at:
http://linux.die.net/man/2/syscalls
Part 3. Applications environment
In this part, we provide an overview of some of the software that forms part of the applications
environment. Throughout this book, we consider the applications environment as the
collection of programs that are required to develop applications.
This part includes the following chapters:
Chapter 7.
Parallel paradigms
In this chapter, we discuss the parallel paradigms that are offered on the Blue Gene/P
system. These include the Message Passing Interface (MPI) for distributed-memory
architectures and OpenMP for shared-memory architectures; we refer to this paradigm as
High-Performance Computing (HPC). Blue Gene/P also offers a paradigm where applications
do not require communication between tasks and each node runs a different instance of the
application; we refer to this paradigm as High-Throughput Computing (HTC), which is
discussed in Chapter 12.
In this chapter, we address the following topics:
Programming model
IBM Blue Gene/P MPI implementation
Blue Gene/P MPI extensions
MPI functions
Compiling MPI programs on Blue Gene/P
MPI communications performance
OpenMP
Other programming paradigms have been ported to Blue Gene/P using one or more of the
supported software interfaces as illustrated in Figure 7-1. The respective open source
communities provide support for using these alternative paradigms, as described in the rest
of this section.
(Figure 7-1: the application layer, with Charm++, Berkeley UPC, Global Arrays, and MPICH2, sits on top of GASNet, ARMCI, and the dcmfd abstract device interface (ADI); these in turn use CCMI and the Deep Computing Messaging Framework (DCMF), which is layered over the DMA SPI.)
Global Arrays
The Global Arrays (GA) toolkit is an open source project developed and maintained by the
Pacific Northwest National Laboratory. The toolkit provides an efficient and portable
shared-memory programming interface for distributed-memory computers. Each process in
a MIMD parallel program can asynchronously access logical blocks of physically distributed
dense multidimensional arrays without need for explicit cooperation by other processes.
Unlike other shared-memory environments, the GA model exposes to the programmer the
non-uniform memory access (NUMA) characteristics of the high-performance computers and
acknowledges that access to a remote portion of the shared data is slower than access to the
local portion. The locality information for the shared data is available, and direct access to the
local portions of shared data is provided. For information about the GA toolkit, refer to the
following Web site:
http://www.emsl.pnl.gov/docs/global/
You also can obtain information about the GA toolkit by sending an e-mail to
[email protected].
Charm++
Charm++ is an open source project developed and maintained by the Parallel Programming
Laboratory at the University of Illinois at Urbana-Champaign. Charm++ is an explicitly parallel
language based on C++ with a runtime library for supporting parallel computation called the
Charm kernel. It provides a clear separation between sequential and parallel objects. The
execution model of Charm++ is message driven, thus helping one write programs that are
latency tolerant. Charm++ supports dynamic load balancing while creating new work as well
as periodically, based on object migration. Several dynamic load balancing strategies are
provided. Charm++ supports both irregular as well as regular, data-parallel applications. It is
based on the Converse interoperable runtime system for parallel programming. You can
access information from the following Parallel Programming Laboratory Web site:
http://charm.cs.uiuc.edu/
You also can access information about the Parallel Programming Laboratory by sending an
e-mail to [email protected].
Berkeley UPC uses a Single Program Multiple Data (SPMD) model of computation in which the
amount of parallelism is fixed at program startup time, typically with a single thread of
execution per processor. You can access more information at the following Web site:
http://upc.lbl.gov/
You also can access more information by sending an e-mail to [email protected].
In sections 7.2.2, Forcing MPI to allocate too much memory on page 71 through 7.2.7,
Buffer alignment sensitivity on page 73, we discuss several sample MPI codes to explain
some of the implementation-dependent behaviors of the MPI library. Section 7.3.3, Self
Tuned Adaptive Routines for MPI on page 79 discusses an automatic optimization technique
available on the Blue Gene/P MPI implementation.
Collective network
The collective network connects all the Compute Nodes in the shape of a tree. Any node can
be the tree root. The MPI implementation uses the collective network, which is more efficient
than the torus network for collective communication on global communicators, such as
MPI_COMM_WORLD.
Point-to-point network
All MPI point-to-point and subcommunicator communication operations are carried out
through the torus network. The route from a sender to a receiver on a torus network has the
following two possible paths:
Deterministic routing
Packets from a sender to a receiver go along the same path. One advantage of this path is
that the packet order is always maintained without additional logic. However, this
technique also creates network hot spots if several point-to-point communications occur at
the same time and their deterministic routes cross on some node.
Adaptive routing
Different packets from the same sender to the same receiver can travel along different
paths. The exact route is determined at run time depending on the current load. This
technique generates a more balanced network load but introduces a latency penalty.
Selecting deterministic or adaptive routing depends on the protocol used for communication.
The Blue Gene/P MPI implementation supports three different protocols:
MPI short protocol
The MPI short protocol is used for short messages (less than 224 bytes), which consist of
a single packet. These messages are always deterministically routed. The latency for
short messages is around 3.3 μs.
MPI eager protocol
The MPI eager protocol is used for medium-sized messages. It sends a message to the
receiver without negotiating with the receiving side that the other end is ready to receive
the message. This protocol also uses deterministic routes for its packets.
MPI rendezvous protocol
Large messages are sent using the MPI rendezvous protocol. In this case, an initial
connection between the two partners is established. Only after that connection is
established, does the receiver use direct memory access (DMA) to obtain the data from
the sender. This protocol uses adaptive routing and is optimized for maximum bandwidth.
By default, MPI send operations use the rendezvous protocol, instead of the eager
protocol, for messages larger than 1200 bytes. Naturally, the initial rendezvous handshake
increases the latency.
There are two types of rendezvous protocols: default and optimized. The optimized
rendezvous protocol generally has less latency than the default rendezvous protocol, but
does not wait for a receive to be posted first. Therefore, unexpected messages can be
received, consuming storage until the receives are issued. The default rendezvous
protocol waits for a receive to be posted first. Therefore, no unexpected messages will be
received. The optimized rendezvous protocol also avoids filling injection FIFOs which can
cause delays while larger FIFOs are allocated. In general, the optimized rendezvous
protocol should be used with smaller rendezvous messages, while the default rendezvous
protocol should be used for larger rendezvous messages. By default, the default
rendezvous protocol is used, and the optimized rendezvous protocol is disabled, since the
default protocol is guaranteed to not run out of memory with unexpected messages.
Enabling the optimized protocol for smaller rendezvous messages improves performance
in some applications. Enabling the optimized rendezvous protocol is done by setting
environment variables, as described below.
The Blue Gene/P MPI library supports a DCMF_EAGER environment variable (which can be
set using mpirun) to set the message size (in bytes) above which the rendezvous protocol
should be used. Consider the following guidelines:
Decrease the rendezvous threshold if any of the following situations are true:
Many short messages are overloading the network.
Eager messages are creating artificial hot spots.
The program is not latency-sensitive.
Increase the rendezvous threshold if any of the following situations are true:
Most communication is a nearest neighbor or at least close in Manhattan distance,
where this distance is the shortest number of hops between a pair of nodes.
You mainly use relatively long messages.
You need better latency on medium-sized messages.
The DCMF_OPTRZV environment variable specifies the number of bytes on the low end of
the rendezvous range where the optimized rendezvous protocol will be used. That is, the
optimized rendezvous protocol will be used if eager_limit <= message_size < (eager_limit
+ DCMF_OPTRZV). For example, if the eager limit (DCMF_EAGER) is 1200 (the default) and
DCMF_OPTRZV is 1000, the eager protocol will be used for message sizes less than 1200
bytes, the optimized rendezvous protocol will be used for message sizes 1200 - 2199 bytes,
and the default rendezvous protocol will be used for message sizes 2200 bytes or larger. The
default DCMF_OPTRZV value is 0, meaning that the optimized rendezvous protocol is not
used.
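For example, assuming a hypothetical executable path and illustrative values, both thresholds
can be passed to a job on the mpirun command line with the -env option:
mpirun -np 512 -mode SMP -env "DCMF_EAGER=4096 DCMF_OPTRZV=1000" -exe /bgusr/myuser/myapp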
Several other environment variables can be used to customize MPI communications. Refer to
Appendix D, Environment variables on page 339 for descriptions of these environment
variables.
An efficient MPI application on Blue Gene/P observes the following guidelines:
Overlap communication and computation using MPI_Irecv and MPI_Isend, which allow
DMA to work in the background (a sketch of this pattern follows the note below).
DMA and the collective and GI networks: The collective and GI networks do not use
DMA. In this case, operations cannot be completed in the background.
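A minimal sketch of this pattern, using hypothetical buffer names and neighbor ranks, looks
like the following:

#include <mpi.h>

/* Sketch: overlap a point-to-point exchange with local computation.
   Buffer names, counts, and neighbor ranks are illustrative. */
void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* Post the receive and send first so the DMA engine can move data in the background. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, left, 0, comm, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

    /* ... perform computation that does not touch sendbuf or recvbuf ... */

    /* Complete the communication before reusing the buffers. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}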
You can accomplish the same goal and avoid memory allocation issues by recoding as shown
in Example 7-3 and Example 7-4.
Example 7-3 CPU1 MPI code that can avoid excessive memory allocation
MPI_ISend(cpu2, tag1);
MPI_ISend(cpu2, tag2);
...
MPI_ISend(cpu2, tagn);
Example 7-4 CPU2 MPI code that can avoid excessive memory allocation
MPI_Recv(cpu1, tag1);
MPI_Recv(cpu1, tag2);
...
MPI_Recv(cpu1, tagn);
TASK1 code:
MPI_Send(task2, tag1);
MPI_Recv(task2, tag2);
TASK2 code:
72
MPI_Send(task1, tag2);
MPI_Recv(task1, tag1);
In general, this code has a high probability of deadlocking the system. Obviously, you should
not program this way. Make sure that your code conforms to the MPI specification. You can
achieve this by either changing the order of sends and receives or by using non-blocking
communication calls.
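For example, a minimal sketch of the non-blocking approach, with illustrative buffer names and
tags, has each task post its receive with MPI_Irecv before the blocking send:

#include <mpi.h>

/* Sketch: because each task posts its receive first, the blocking sends cannot deadlock. */
void exchange(double *out, double *in, int n, int partner,
              int sendtag, int recvtag, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Irecv(in, n, MPI_DOUBLE, partner, recvtag, comm, &req);
    MPI_Send(out, n, MPI_DOUBLE, partner, sendtag, comm);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

TASK1 would call this with tag1 as the send tag and tag2 as the receive tag, and TASK2 would
reverse the two.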
While you should not rely on the run-time system to correctly handle nonconforming MPI
code, it is easier to debug such situations when you receive a run-time error message than to
try and detect a deadlock and trace it back to its root cause.
Example 7-8 Send buffer modified before MPI_Wait() has completed
req = MPI_Isend(buffer,&req);
buffer[0] = something;
MPI_Wait(req);
The code in Example 7-8 results in a race condition on any message-passing machine.
Depending on run-time factors that are outside the application's control, sometimes the
old buffer[0] is sent and sometimes the new value is sent.
In the last example in this thread, a receive buffer is read before MPI_Wait() confirms that the
asynchronous receive request has completed (see Example 7-9).
Example 7-9 Receive buffer before MPI_Wait() has completed
req = MPI_Irecv(buffer);
z = buffer[0];
MPI_Wait (req);
The code shown in Example 7-9 is also illegal. The contents of the receive buffer are not
guaranteed until after MPI_Wait() is called.
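The conforming pattern completes the request before the buffer is read. A minimal sketch, with
an illustrative source rank and buffer size:

#include <mpi.h>

/* Sketch: the receive buffer is read only after MPI_Wait() completes the request. */
double first_received_value(MPI_Comm comm, int source)
{
    double buffer[16];
    MPI_Request req;

    MPI_Irecv(buffer, 16, MPI_DOUBLE, source, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return buffer[0];   /* defined, because the request has completed */
}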
For buffers that are dynamically allocated (via malloc()), the following techniques can be
used:
Instead of using malloc(), use the following statement and specify 32 for the alignment
parameter:
int posix_memalign(void **memptr, size_t alignment, size_t size)
This statement returns a 32-byte aligned pointer to the allocated memory. You can use
free() to free the memory.
Use malloc(), but request 32 bytes more storage than required, and then round the returned
address up to a 32-byte boundary (a sketch of both dynamic-allocation techniques follows
Example 7-10). Alternatively, data that is declared statically can be aligned at compile time
with the aligned attribute, as shown in Example 7-10.
Example 7-10 Statically declared data aligned with the aligned attribute
struct DataInfo
{
unsigned int iarray[256];
unsigned int count;
} data_info __attribute__ ((aligned (32)));
or
unsigned int data __attribute__ ((aligned (32)));
or
char data_array[512] __attribute__((aligned(32)));
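A minimal sketch of the two dynamic-allocation techniques, with illustrative sizes and names:

#define _POSIX_C_SOURCE 200112L   /* for posix_memalign() */
#include <stdlib.h>
#include <stdint.h>

/* Technique 1: request the alignment directly; release with free(). */
double *aligned_alloc32(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, bytes) != 0)
        return NULL;
    return (double *)p;
}

/* Technique 2: over-allocate by 32 bytes and round the address up to a
   32-byte boundary; keep the original pointer for free(). */
double *rounded_alloc32(size_t bytes, void **to_free)
{
    char *raw = malloc(bytes + 32);
    if (raw == NULL)
        return NULL;
    *to_free = raw;
    return (double *)(((uintptr_t)raw + 31) & ~(uintptr_t)31);
}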
For buffers that are declared in automatic (stack) storage, only up to a 16-byte alignment is
possible. Therefore, use dynamically allocated memory or aligned static (global) storage instead.
7.3.3, Self Tuned Adaptive Routines for MPI on page 79 describes a way to automatically
tune the collective routines used by an application.
(Figures: MPI communicators overlaid on psets. In one arrangement, Comm 1 through Comm 4 map onto Pset 1 through Pset 4 for the application; in the other, Comm 1 through Comm 8 map onto Pset 1 through Pset 4.)
MPIX_Get_property(comm, MPIDO_USE_TORUS_ALLTOALL, &result);
if (result == 0)
/* this causes the following MPI_Alltoall to use torus protocol */
MPIX_Set_property(comm, MPIDO_USE_TORUS_ALLTOALL, 1);
MPIX_Get_property(comm, MPIDO_USE_MPICH_ALLTOALL, &result);
if (result == 1)
/* turn off the mpich protocol */
MPIX_Set_property(comm, MPIDO_USE_MPICH_ALLTOALL, 0);
MPI_Alltoall();
/* this resets the MPI_Alltoall algorithm selection to its previous state */
MPIX_Set_property(comm, MPIDO_USE_TORUS_ALLTOALL, 0);
The Self Tuned Adaptive Routines for MPI (STAR-MPI) capability is controlled by the following
environment variables, which are described in Appendix D, Environment variables on page 339:
DCMF_STAR
DCMF_STAR_THRESHOLD
DCMF_STAR_VERBOSE
DCMF_STAR_NUM_INVOCS
DCMF_STAR_TRACEBACK_LEVEL
DCMF_STAR_CHECK_CALLSITE
The STAR-MPI verbose output can be used to pre-tune the collectives for an application if an
application is called multiple times with similar inputs. The verbose output will show the
algorithm that STAR-MPI determined to be optimal for each MPI collective operation
invocation. The next time the application is run, the caller can indicate to the MPI library the
algorithm to use by setting the DCMF environment variables described in Appendix D,
Environment variables on page 339 or using the techniques described in 7.3.2, Configuring
MPI algorithms at run time on page 77. In this way, the application will avoid STAR-MPI's
less-optimal tuning phase while getting the benefit of using the best algorithms on
subsequent runs.
The MPI Routines page includes MPI calls for C and Fortran. For more information, refer to
the following books about MPI and MPI-2:
MPI: The Complete Reference, 2nd Edition, Volume 1, by Marc Snir, Steve Otto, Steven
Huss-Lederman, David Walker, and Jack Dongarra26
MPI: The Complete Reference, Volume 2: The MPI-2 Extensions, by William Gropp,
Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir,
and Marc Snir27
For general information about MPICH2, refer to the MPICH2 Web page at:
http://www.mcs.anl.gov/research/projects/mpich2/
Because teaching MPI is beyond the scope of this book, refer to the following Web page for
tutorials and extensive information about MPI:
http://www.mcs.anl.gov/research/projects/mpi/learning.html
The following scripts are provided for compiling and linking MPI programs on Blue Gene/P:
mpicc           C compiler
mpicxx          C++ compiler
mpif77          Fortran 77 compiler
mpif90          Fortran 90 compiler
mpixlc          IBM XL C compiler
mpixlc_r        Thread-safe version of mpixlc
mpixlcxx        IBM XL C++ compiler
mpixlcxx_r      Thread-safe version of mpixlcxx
mpixlf2003      IBM XL Fortran 2003 compiler
mpixlf2003_r    Thread-safe version of mpixlf2003
mpixlf77        IBM XL Fortran 77 compiler
mpixlf77_r      Thread-safe version of mpixlf77
mpixlf90        IBM XL Fortran 90 compiler
mpixlf90_r      Thread-safe version of mpixlf90
mpixlf95        IBM XL Fortran 95 compiler
mpixlf95_r      Thread-safe version of mpixlf95
mpich2version   Prints the version of MPICH2 that is being used
Note: When you invoke the previous scripts, if you do not set the optimization level using
-O, the default is set to no optimization (-O0).
The following environment variables can be set to override the compilers used by the scripts:
MPICH_CC    C compiler
MPICH_CXX   C++ compiler
MPICH_FC    Fortran 77 compiler
The IBM XL Fortran 90 compiler is incompatible with the Fortran 90 MPI bindings in the
MPICH library built with GCC. Therefore, the default version of the mpixlf90 script cannot be
used with the Fortran 90 MPI bindings.
Example 7-13 shows how to use the mpixlf77 script in a makefile.
Example 7-13 Use of MPI script mpixlf77
XL    = /bgsys/drivers/ppcfloor/comm/default/bin/mpixlf77
EXE   = fhello
OBJ   = hello.o
SRC   = hello.f
FLAGS = -O3 -qarch=450 -qtune=450
$(EXE): $(OBJ)
${XL} $(FLAGS) -o $@ $^
$(OBJ): $(SRC)
${XL} $(FLAGS) -c $<
clean:
$(RM) $(OBJ) $(EXE)
To build MPI programs for Blue Gene/P, the compilers can be invoked directly rather than
using the above scripts. When invoking the compilers directly you must explicitly include the
required MPI libraries. Example 7-14 shows a makefile that does not use the scripts.
Example 7-14 Makefile with explicit reference to libraries and include files
BGP_FLOOR = /bgsys/drivers/ppcfloor
BGP_IDIRS = -I$(BGP_FLOOR)/arch/include -I$(BGP_FLOOR)/comm/include
BGP_LIBS  = -L$(BGP_FLOOR)/comm/lib -lmpich.cnk \
            -L$(BGP_FLOOR)/comm/lib -ldcmf.cnk -ldcmfcoll.cnk \
            -lpthread -lrt \
            -L$(BGP_FLOOR)/runtime/SPI -lSPI.cna
XL    = /opt/ibmcmp/xlf/bg/11.1/bin/bgxlf
EXE   = fhello
OBJ   = hello.o
SRC   = hello.f
FLAGS = -O3 -qarch=450 -qtune=450 $(BGP_IDIRS)
$(EXE): $(OBJ)
${XL} $(FLAGS) -o $@ $^ $(BGP_LIBS)
$(OBJ): $(SRC)
${XL} $(FLAGS) -c $<
clean:
$(RM) $(OBJ) $(EXE)
Bandwidth: The number of MB of data that can be sent from one node to another node in one second.
Latency: The amount of time it takes for the first byte that is sent from one node to reach its target node.
The values for bandwidth and latency provide information about communication.
Here we illustrate two cases. The first case corresponds to a benchmark that involves a single
transfer. The second case corresponds to a collective as defined in the Intel MPI
Benchmarks (see the URL that follows). Intel MPI Benchmarks was formerly known as
Pallas MPI Benchmarks (PMB-MPI1 for MPI1 standard functions only). Intel MPI
Benchmarks - MPI1 provides a set of elementary MPI benchmark kernels.
For more details, see the product documentation included in the package that you can
download from the following Intel Web page:
http://www.intel.com/cd/software/products/asmo-na/eng/219848.htm
(mpirun -nofree -timeout 300 -verbose 1 -np 512 -mode SMP -partition R01-M1 -cwd
/bgusr/BGTH_BGP/test512nDD2BGP/pallas/pall512DD2SMP/bgpdd2sys1-R01-M1 -exe
/bgusr/BGTH_BGP/test512nDD2BGP/pallas/pall512DD2SMP/bgpdd2sys1-R01-M1/IMB-MPI1.4MB
.perf.rts -args "-msglen 4194304.txt -npmin 512 PingPong" | tee
IMB-MPI1.4MB.perf.PingPong.4194304.512.out) >>
run.IMB-MPI1.4MB.perf.PingPong.4194304.512.out 2>&1
Figure 7-4 shows the bandwidth on the torus network as a function of the message size, for
one simultaneous pair of nearest neighbor communications. The protocol switch from short to
eager is visible in both cases, where the eager to rendezvous switch is most pronounced on
the Blue Gene/L system (see the asterisks (*)). This figure also shows the improved
performance on the Blue Gene/P system (see the diamonds).
(Figure 7-4: bandwidth in MB/s, from 0 to 400, plotted against message sizes from 16 bytes to 4194304 bytes.)
(mpirun -nofree -timeout 300 -verbose 1 -np 512 -mode SMP -partition R01-M1 -cwd
/bgusr/BGTH_BGP/test512nDD2BGP/pallas/pall512DD2SMP/bgpdd2sys1-R01-M1 -exe
/bgusr/BGTH_BGP/test512nDD2BGP/pallas/pall512DD2SMP/bgpdd2sys1-R01-M1/IMB-MPI1.4MB
.perf.rts -args "-msglen 4194304.txt -npmin 512 Allreduce" | tee
IMB-MPI1.4MB.perf.Allreduce.4194304.512.out) >>
run.IMB-MPI1.4MB.perf.Allreduce.4194304.512.out 2>&1
Collective operations are more efficient on the Blue Gene/P system. You should try to use
collective operations instead of point-to-point communication wherever possible. The
overhead for point-to-point communications is much larger than for collectives. Unless all of
your point-to-point communication is purely to the nearest neighbor, it is difficult to avoid
network congestion on the torus network.
Alternatively, collective operations can use the barrier (global interrupt) network or the torus
network. If they run over the torus network, they can still be optimized by using specially
designed communication patterns that achieve optimum performance. Doing this manually
with point-to-point operations is possible in theory, but in general, the implementation in the
Blue Gene/P MPI library offers superior performance.
With point-to-point communication, the goal of reducing the point-to-point Manhattan
distances necessitates a good mapping of MPI tasks to the physical hardware. For
collectives, mapping is equally important because most collective implementations prefer
certain communicator shapes to achieve optimum performance. In general, collectives using
rectangular subcommunicators (with the ranks organized in lines, planes, or cubes) will
outperform irregular subcommunicators (any communicator that is not rectangular). Refer to
Appendix F, Mapping on page 355, which illustrates the technique of mapping.
MPI routine      Communicator                  Data type  Network                                Latency        Bandwidth
MPI_Barrier      MPI_COMM_WORLD                N/A        Global interrupts                      1.25 μs        N/A
                 Rectangular                   N/A        Torus                                  10.96 μs       N/A
                 All other subcommunicators    N/A        Torus                                  22.61 μs       N/A
MPI_Bcast        MPI_COMM_WORLD                Byte       Collective for latency, torus for BW   3.61 μs        2047 MBps (a)
                 Rectangular                   Byte       Torus                                  11.58 μs       2047 MBps (a)
                 All other subcommunicators    Byte       Torus                                  15.36 μs       357 MBps
MPI_Allreduce    MPI_COMM_WORLD                Integer    Collective for latency, torus for BW   3.76 μs        780 MBps (a)
                 MPI_COMM_WORLD                Double     Collective for latency, torus for BW   5.51 μs        363 MBps (a)
                 Rectangular                   Integer    Torus                                  17.66 μs       261 MBps (a)
                 Rectangular                   Double     Torus                                  17.54 μs       363 MBps (a)
                 All other subcommunicators    Integer    Torus                                  38.06 μs       46 MBps
                 All other subcommunicators    Double     Torus                                  37.96 μs       49 MBps
MPI_Alltoallv    All communicators             Byte       Torus                                  355 μs (b)     ~97% of peak torus bisection bandwidth
MPI_Allgatherv   MPI_COMM_WORLD                Byte       Collective for latency, torus for BW   18.79 μs (c)   3.6x MPICH (d)
                 Rectangular                   Byte       Torus                                  276.32 μs (c)  3.6x MPICH (d)
MPI_Gather       MPI_COMM_WORLD                Byte       Collective for BW                      10.77 μs       2.0x MPICH (d)
MPI_Scatter      MPI_COMM_WORLD                Byte       Collective for BW                      9.91 μs        4.2x MPICH (a)
MPI_Scatterv     All                           Byte       Torus for BW                           167 μs         2.6x MPICH (d)
                 MPI_COMM_WORLD                Byte       Collective                             20 μs          3.7x MPICH (e)
MPI_Reduce       MPI_COMM_WORLD                Integer    Collective                             3.82 μs        780 MBps
                 MPI_COMM_WORLD                Double     Torus                                  4.03 μs        304 MBps
                 Rectangular                   Integer    Torus                                  17.27 μs       284 MBps
                 Rectangular                   Double     Torus                                  17.32 μs       304 MBps
                 All other subcommunicators    Integer    Torus                                  8.43 μs        106 MBps
                 All other subcommunicators    Double     Torus                                  8.43 μs        113 MBps
Figure 7-5 and Figure 7-6 on page 88 show a comparison between the IBM Blue Gene/L and
Blue Gene/P systems for the MPI_Allreduce() type of communication on integer data types
with the sum operation.
Figure 7-5 MPI_Allreduce() integer sum wall time performance on 512 nodes.
Figure 7-6 on page 88 shows MPI_Allreduce() integer sum bandwidth performance on 512
nodes.
7.7 OpenMP
The OpenMP API is supported on the Blue Gene/P system for shared-memory parallel
programming in C/C++ and Fortran. This API has been jointly defined by a group of hardware
and software vendors and has evolved as a standard for shared-memory parallel
programming.
OpenMP consists of a collection of compiler directives and a library of functions that can be
invoked within an OpenMP program. This combination provides a simple interface for
developing parallel programs on shared-memory architectures. In the case of the Blue
Gene/P system, it allows the user to exploit the SMP mode on each Compute Node.
Multi-threading is now enabled on the Blue Gene/P system. Using OpenMP, the user can
have access to data parallelism as well as functional parallelism.
For additional information, refer to the official OpenMP Web site at:
http://www.openmp.org/
OpenMP provides work-sharing and synchronization constructs, including the following directives:
for
parallel for
sections
parallel sections
critical
single
Parallel operations are often expressed in C/C++ and Fortran95 programs as for loops as
shown in Example 7-19.
Example 7-19 for loops in Fortran and C
integer i, n, sum
sum = 0
do 5 i = 1, n
sum = sum + i
5  continue
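The same computation can be written in C as a for loop, for example:

/* C equivalent of the Fortran loop above; n is illustrative */
int i, n = 100, sum = 0;
for (i = 1; i <= n; i++)
    sum = sum + i;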
The compiler can automatically locate and, where possible, parallelize all countable loops in
your program code in the following situations:
There is no branching into or out of the loop.
An increment expression is not within a critical section.
A countable loop is automatically parallelized only if all of the following conditions are met:
The order in which loop iterations start or end does not affect the results of the
program.
The loop does not contain I/O operations.
Floating-point reductions inside the loop are not affected by round-off error, unless the
-qnostrict option is in effect.
The -qnostrict_induction compiler option is in effect.
The -qsmp=auto compiler option is in effect.
The compiler is invoked with a thread-safe compiler mode.
In the case of C/C++ programs, OpenMP is invoked via pragmas as shown in Example 7-20.
Pragma: The word pragma is short for pragmatic information.28 Pragma is a way to
communicate information to the compiler:
#pragma omp <rest of pragma>
Example 7-20 pragma usage
The for loop must not contain statements, such as the following examples, that allow the loop
to be exited prematurely:
break
return
exit
go to labels outside the loop
In a for loop, the master thread creates additional threads. The loop is executed by all
threads, where every thread has its own address space that contains all of the variables the
thread can access. Such variables might be:
Static variables
Dynamically allocated data structures in the heap
Variables on the run-time stack
In addition, variables must be defined according to the type. Shared variables have the same
address in the execution context of every thread. It is important to understand that all threads
have access to shared variables. Alternatively, private variables have a different address in
the execution memory of every thread. A thread can access its own private variables, but it
cannot access the private variable of another thread.
Pragma parallel: In the case of the parallel for pragma, variables are shared by default,
with the exception of the loop index.
Example 7-21 shows a simple Fortran95 example that illustrates the difference between
private and shared variables.
Example 7-21 Fortran example using the parallel do directive
program testmem
integer n
parameter (n=2)
parameter (m=1)
integer a(n), b(m)
!$OMP parallel do
do i = 1, n
a(i) = i
enddo
write(6,*)'Done: testmem'
end
In Example 7-21, no variables are explicitly defined as either private or shared. In this case,
by default, the compiler assigns the variable that is used for the do-loop index as private. The
rest of the variables are shared. Figure 7-7 on page 92 illustrates both private and shared
variables as shown in Parallel Programming in C with MPI and OpenMP.29 In this figure, the
blue and yellow arrows indicate which variables are accessible by all the threads.
The OpenMP run-time library also provides functions such as the following:
omp_get_num_threads   Returns the number of threads in the current parallel region
omp_get_thread_num    Returns the number of the calling thread
omp_set_num_threads   Sets the number of threads to use for subsequent parallel regions
7.7.4 Performance
To illustrate the effect of selected OpenMP compiler directives and the implications in terms of
performance of a particular do loop, we chose the programs presented in Parallel Programming
in C with MPI and OpenMP30 and applied them to the Blue Gene/P system. These simple
examples illustrate how to use these directives and some of the implications of selecting a
particular directive over another. Example 7-22 shows a simple program to compute π.
Example 7-22 Sequential version of the pi.c program
}
pi = area / n;
printf ("Estimate of pi: %7.5f\n", pi);
}
The first way to parallelize this code is to include an OpenMP directive to parallelize the for
loop as shown in Example 7-23.
Example 7-23 Simple use of parallel for loop
#include <omp.h>
#include <stdio.h>
long long timebase(void);
int main(argc, argv)
int argc;
char *argv[];
{
int num_threads;
long n, i;
double area, pi, x;
long long time0, time1;
double cycles, sec_per_cycle, factor;
n
= 1000000000;
area = 0.0;
time0 = timebase();
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
x = (i+0.5)/n;
area += 4.0 / (1.0 + x*x);
}
pi = area / n;
printf ("Estimate of pi: %7.5f\n", pi);
time1 = timebase();
cycles = time1 - time0;
factor = 1.0/850000000.0;
sec_per_cycle = cycles * factor;
printf("Total time %lf \n",sec_per_cycle, "Seconds \n");
}
Unfortunately this simple approach creates a race condition when computing the area. While
different threads compute and update the value of the area, other threads might be computing
and updating area as well, therefore producing the wrong results. This particular race
condition can be solved in two ways. One way is to use a critical pragma to ensure mutual
exclusion among the threads, and the other way is to use the reduction clause.
Example 7-24 illustrates use of the critical pragma.
Example 7-24 Usage of critical pragma
#include <omp.h>
#include <stdio.h>
long long timebase(void);
int main(argc, argv)
int argc;
char *argv[];
{
int num_threads;
long n, i;
double area, pi, x;
long long time0, time1;
double cycles, sec_per_cycle, factor;
n
= 1000000000;
area = 0.0;
time0 = timebase();
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
x = (i+0.5)/n;
#pragma omp critical
area += 4.0 / (1.0 + x*x);
}
pi = area / n;
printf ("Estimate of pi: %7.5f\n", pi);
time1 = timebase();
cycles = time1 - time0;
factor = 1.0/850000000.0;
sec_per_cycle = cycles * factor;
printf("Total time %lf \n",sec_per_cycle, "Seconds \n");
}
Example 7-25 corresponds to the reduction clause.
Example 7-25 Usage of the reduction clause
#include <omp.h>
#include <stdio.h>
long long timebase(void);
int main(argc, argv)
int argc;
char *argv[];
{
int num_threads;
long n, i;
double area, pi, x;
long long time0, time1;
double cycles, sec_per_cycle, factor;
n
= 1000000000;
area = 0.0;
time0 = timebase();
#pragma omp parallel for private(x) reduction(+: area)
for (i = 0; i < n; i++) {
x = (i+0.5)/n;
area += 4.0 / (1.0 + x*x);
}
pi = area / n;
printf ("Estimate of pi: %7.5f\n", pi);
time1 = timebase();
cycles = time1 - time0;
factor = 1.0/850000000.0;
sec_per_cycle = cycles * factor;
printf("Total time %lf seconds\n", sec_per_cycle);
}

The following makefile was used to build the pi_critical.c program on the Blue Gene/P system:

BGP_FLOOR = /bgsys/drivers/ppcfloor
BGP_IDIRS = -I$(BGP_FLOOR)/arch/include -I$(BGP_FLOOR)/comm/include
BGP_LIBS  = -L$(BGP_FLOOR)/comm/lib -L$(BGP_FLOOR)/runtime/SPI -lmpich.cnk \
            -ldcmfcoll.cnk -ldcmf.cnk -lrt -lSPI.cna -lpthread
XL        = /opt/ibmcmp/vac/bg/9.0/bin/bgxlc_r
EXE       = pi_critical_bgp
OBJ       = pi_critical.o
SRC       = pi_critical.c
FLAGS     = -O3 -qsmp=omp:noauto -qthreaded -qarch=450 -qtune=450 -I$(BGP_FLOOR)/comm/include
FLD       = -O3 -qarch=450 -qtune=450
$(EXE): $(OBJ)
${XL} $(FLAGS) -o $(EXE) $(OBJ) timebase.o $(BGP_LIBS)
$(OBJ): $(SRC)
${XL} $(FLAGS) $(BGP_IDIRS) -c $(SRC)
clean:
rm pi_critical.o pi_critical_bgp
Table 7-3 illustrates the performance improvement by using the reduction clause.
Table 7-3 Parallel performance using critical pragma versus reduction clause
Execution time in (seconds)
Threads
586.37
20.12
145.03
5.22
180.80
4.78
Blue Gene/P
560.08
12.80
458.84
10.08
374.10
2.70
324.71
2.41
Blue Gene/P
602.62
6.42
552.54
5.09
428.42
1.40
374.51
1.28
Blue Gene/P
582.95
3.24
For more in-depth information with additional examples, we recommend you read Parallel
Programming in C with MPI and OpenMP.31 In this section, we chose to illustrate only the π
program.
Chapter 8.
Developing applications with IBM XL compilers
In this chapter, we discuss specific considerations for developing, compiling, and optimizing
C/C++ and Fortran applications for the IBM Blue Gene/P PowerPC 450 processor and a
single-instruction, multiple-data (SIMD), double-precision floating-point multiply add unit
(double FMA). The following topics are discussed:
Compiler overview
Compiling and linking applications on Blue Gene/P
Default compiler options
Unsupported options
Support for pthreads and OpenMP
Creation of libraries on Blue Gene/P
XL runtime libraries
Mathematical Acceleration Subsystem libraries
IBM Engineering Scientific Subroutine Library
Configuring Blue Gene/P builds
Python
Tuning your code for Blue Gene/P
Scripts are already available that do much of this for you. They reside in the same bin
directory as the compiler binary (/opt/ibmcmp/xlf/bg/11.1/bin or /opt/ibmcmp/vacpp/bg/9.0/bin
or /opt/ibmcmp/vac/bg/9.0/bin). Table 8-1 lists the names.
Table 8-1 Scripts available in the bin directory for compiling and linking
Language   Script
C          bgxlc_r, bgcc_r
C++        bgxlC_r
Fortran    bgxlf_r
Important: The double FPU does not generate exceptions. Therefore, the -qflttrap option is
invalid with the 450d processor. If exception trapping is needed, use -qarch=450 instead of
-qarch=450d.
The thread-safe compiler version should be used with any threaded, OpenMP, or SMP
application.
Thread-safe libraries: Thread-safe libraries ensure that data access and updates are
synchronized between threads.
Usage of -qsmp OpenMP and pthreaded applications:
-qsmp by itself automatically parallelizes loops.
-qsmp=omp automatically parallelizes based on OpenMP directives in the code.
-qsmp=omp:noauto should be used when parallelizing codes manually. It prevents the
compiler from trying to automatically parallelize loops.
Note: -qsmp must be used only with thread-safe compiler mode invocations such as
xlc_r. These invocations ensure that the pthreads, xlsmp, and thread-safe versions of
all default run-time libraries are linked to the resulting executable. See the language
reference for more details about the -qsmp suboptions at:
http://publib.boulder.ibm.com/infocenter/comphelp/v8v101/index.jsp
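For example, an OpenMP C program can be compiled with the thread-safe XL invocation and
these options as follows (the source file name is illustrative):
/opt/ibmcmp/vac/bg/9.0/bin/bgxlc_r -O3 -qsmp=omp:noauto -qthreaded -qarch=450 -qtune=450 -c myprog.c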
OpenMP can be used with the GNU compiler but support for OpenMP requires a newer
compiler than is shipped with Blue Gene/P. Instructions to build the 4.3.2 GNU compiler with
GOMP are provided in /bgsys/drivers/ppcfloor/toolchain/README.toolchain.gomp, which is
shipped with the Blue Gene/P software.
Note: OpenMP uses a thread-private stack. By default, the stack is 4 MB; it can be set at run time
via mpirun, for example, -env XLSMPOPTS=stack=8000000 (values are in bytes). For a discussion of
mpirun, see Chapter 11, mpirun on page 177.
Example 8-2 shows the same procedure using the GNU collection of compilers.
Example 8-2 Static library creation using the GNU compiler
pi.c
main.c
libpi.a pi.o
On the other hand, shared libraries are loaded at execution time and shared among different
executables.
Note: -qnostaticlink, used with the C and C++ compilers, indicates to build a dynamic
binary, but by default the static libgcc.a is linked in. To indicate that the shared version of
libgcc should be linked in, also specify -qnostaticlink=libgcc, for example,
/opt/ibmcmp/vacpp/bg/9.0/bin/bgxlc -o hello hello.c -qnostaticlink -qnostaticlink=libgcc
Example 8-3 Shared library creation using the XL compiler
Example 8-4 illustrates the same procedure with the GNU collection of compilers.
Example 8-4 Shared library creation using the GNU compiler
-fPIC -c libpi.c
-fPIC -c main.c
-shared \
The command line option to the Blue Gene/P XL Fortran compiler to create a dynamic
executable is -Wl,-dy. The order of the -Wl,-dy is significant. It must come before any -L and -l
options, for example, to build a dynamic executable from the Fortran source file hello.f, run
/opt/ibmcmp/xlf/bg/11.1/bin/bgxlf -Wl,-dy -qpic -o hello hello.f. Example 8-5
illustrates creating a shared library using the XL Fortran compiler.
Description
libibmc++.a,
libibmc++.so
libxlf90.a,
libxlf90.so
libxlfmath.a,
libxlfmath.so
IBM XLF stubs for math routines in system library libm, for example, _sin() for
sin(), _cos() for cos(), and so on
libxlfpmt4.a,
libxlfpmt4.so
IBM XLF to be used with -qautobdl=dbl4 (promote floating-point objects that are
single precision)
libxlfpad.a,
libxlfpad.so
libxlfpmt8.a,
libxlfpmt8.so
libxlsmp.a,
libxlsmp.so
libxl.a
libxlopt.a
libmass.a
libmassv.a
libxlomp_ser.a
#include <stdio.h>
#include <sys/utsname.h>
int main(int argc, char** argv)
{
struct utsname uts;
uname(&uts);
printf("sizeof uts: %d\n", sizeof(uts));
printf("sysname: %s\n", uts.sysname);
8.11 Python
Python is a dynamic, object-oriented programming language that can be used on Blue
Gene/P in addition to C, C++ and Fortran. Version 2.61 of the Python interpreter compiled for
Blue Gene/P is installed in /bgsys/drivers/ppcfloor/gnu-linux/bin. Example 8-7 illustrates how
to invoke a simple Python program.
Example 8-7 How to invoke Python on Blue Gene/P
The registers allow the PowerPC 450 processor to operate certain identical operations in
parallel. Load/store instructions can also be issued with a single instruction. For more detailed
information, see the white paper Exploiting the Dual Floating Point Units in Blue Gene/L on
the Web at:
http://www-01.ibm.com/support/docview.wss?uid=swg27007511
The IBM XL compilers leverage this functionality under the following conditions:
Parallel instructions are issued for load/store instructions if the alignment and size are
aligned with natural alignment. This is 16 bytes for a pair of doubles, but only 8 bytes for a
pair of floats.
The compiler can issue parallel instructions when the application vectors have stride-one
memory accesses. However, the compiler via IPA issues parallel instructions with
non-stride-one data in certain loops, if it can be shown to improve performance.
-qhot=simd is the default with -qarch=450d.
-O4 provides analysis at compile time with limited scope analysis and issuing parallel
instructions (SIMD).
-O5 provides analysis for the entire program at link time to propagate alignment
information. You must compile and link with -O5 to obtain the full benefit.
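For example, whole-program SIMD optimization requires -O5 on both the compile and the link
steps (file names are illustrative):
/opt/ibmcmp/vacpp/bg/9.0/bin/bgxlc -O5 -qarch=450d -qtune=450 -c kernel.c
/opt/ibmcmp/vacpp/bg/9.0/bin/bgxlc -O5 -qarch=450d -qtune=450 -o kernel kernel.o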
The compiler generates these instructions with -qtune=450 at optimization levels -O2 and up. For
more information about XL Fortran, see:
http://www-306.ibm.com/software/awdtools/fortran/xlfortran/library/
For some applications, the compiler generates more efficient code without the TPO SIMD
level. If you have statically allocated arrays and a loop in the same routine, call TOBEY with
-qhot or -O4. Nevertheless, on top of SIMD generation from TOBEY, with -qhot, optimizations
are enabled that can alter the semantics of the code and on rare occasions can generate less
efficient code. Also, with -qhot=nosimd, you can suppress some of these optimizations.
To use the SIMD capabilities of the XL compilers:
1. Start to compile:
-O3 -qarch=450d -qtune=450
We recommend that you use -qarch=450d -qtune=450, in this order. The compiler only
generates SIMD instructions from -O2 and up.
2. Increase the optimization level, and call the high-level inter-procedural optimizer:
-O5 (link time, whole-program analysis, and SIMD instruction)
-O4 (compile time, limited scope analysis, and SIMD instructions)
-O3 -qhot=simd (compile time, less optimization, and SIMD instructions)
3. Tune your program:
a. Check the SIMD instruction generation in the object code listing (-qsource -qlist).
b. Use compiler feedback (-qreport -qhot) to guide you.
Tell the compiler not to generate SIMD instructions if it is not profitable (trip count
low).
In C, enter:
#pragma nosimd
In Fortran, enter the following line just before the loop:
!IBM* NOSIMD
Many applications can require modifying algorithms. The previous bullet, which
explains how not to generate SIMD instructions, gives constructs that might help to
modify the code. Here are hints to use when modifying your code:
Generate compiler diagnostics to help you modify and understand how the compiler is
optimizing sections of your applications:
The -qreport compiler option generates a diagnostic report about SIMD instruction
generation. To analyze the generated code and the use of quadword loads and stores,
you must look at the pseudo assembler code within the object file listing. The
diagnostic report provides two types of information about SIMD generation:
Information on success
(simdizable) [feature][version]
[feature] further characterizes the simdizable loop:
misalign
shift
(4 compile time)
Refers to a simdizable loop with 4 stream shift inserted. Shift
refers to the number of misaligned data references that were
found. It has a performance impact because these loops must
be loaded across, and then an extra select instruction must be
inserted.
priv
reduct
relative align
trip count
Information on failure
In case of misalignment: misalign(...):
* Non-natural: non-naturally aligned accesses
* Run time: run-time alignment
About the structure of the loop:
* Irregular loop structure (while-loop).
* Contains control flow: if/then/else.
* Contains function call: Function call bans SIMD instructions.
* Trip count too small.
About dependences: dependence due to aliasing
About array references:
* Access not stride one
* Memory accesses with unsupported alignment
* Contains run-time shift
About pointer references: Non-normalized pointer accesses
Proper I/O utilization tends to be a problem, and in many applications I/O optimization is
required. Identify if that is the case for your application. The IBM toolkit for Blue Gene/P
provides the Modular I/O (MIO) library that can be used for an applications-level I/O
performance improvement (See IBM System Blue Gene Solution: Performance Analysis
Tools, REDP-4256). Memory utilization on Blue Gene/P involves optimal utilization of the
memory hierarchy. This needs to be coordinated with the double floating-point unit to leverage
the execution of instructions in parallel. As part of the IBM toolkit Xprofiler helps analyze
applications by collecting data using the -pg compiler option to identify functions that are most
CPU intensive. gmon profiler also provides similar information. Appendix H, Use of GNU
profiling tool on Blue Gene/P on page 361 provides additional information about gmon. The
IBM toolkit provides the MPI profiler and a tracing library for MPI programs.
The advantage of using complex types in arithmetic operations is that the compiler
automatically uses parallel add, subtract, and multiply instructions when complex types
appear as operands to addition, subtraction, and multiplication operators. Furthermore, the
data that you provide does not need to represent complex numbers. In fact, both elements are
represented internally as two real values. See Complex type manipulation functions on
page 121, for a description of the set of built-in functions that are available for the Blue
Gene/P system. These functions are especially designed to efficiently manipulate
complex-type data and include a function to convert non-complex data to complex types.
Also available from this Web address, the XL C/C++ Language Reference provides
information about the different variations of the inline keyword supported by XL C and
C++, as well as the inlining function attribute extensions.
j = *p;
If *q aliases *p, then the value must be reloaded from memory. If *q does not alias *p, the old
value that is already loaded into i can be used.
To avoid the overhead of reloading values from memory every time they are referenced in the
code, and to allow the compiler to simply manipulate values that are already resident in
registers, you can use several strategies. One approach is to assign input array element
values to local variables and perform computations only on the local variables, as shown in
Example 8-10.
Example 8-10 Array parameters assigned to local variables
#include <math.h>
void reciprocal_roots (const double* x, double* f)
{
double x0 = x[0] ;
double x1 = x[1] ;
double r0 = 1.0/sqrt(x0) ;
double r1 = 1.0/sqrt(x1) ;
f[0] = r0 ;
f[1] = r1 ;
}
If you are certain that two references do not share the same memory address, another
approach is to use the #pragma disjoint directive. This directive asserts that two identifiers
do not share the same storage, within the scope of their use. Specifically, you can use
pragma to inform the compiler that two pointer variables do not point to the same memory
address. The directive in Example 8-11 indicates to the compiler that the pointers-to-arrays of
double x and f do not share memory.
Example 8-11 The #pragma disjoint directive
__inline void ten_reciprocal_roots (double* x, double* f)
{
#pragma disjoint (*x, *f)
int i;
for (i=0; i < 10; i++)
f[i]= 1.0 / sqrt (x[i]);
}
Important: The correct functioning of this directive requires that the two pointers be
disjoint. If they are not, the compiled program cannot run correctly.
The function takes two arguments. The first argument is an integer constant that expresses
the number of alignment bytes (must be a positive power of two). The second argument is the
variable name, typically a pointer to a memory address.
Example 8-14 shows the C/C++ prototype for the function.
Example 8-14 C/C++ prototype
extern
#ifdef __cplusplus
"builtin"
#endif
void __alignx (int n, const void *addr)
In Example 8-14, n is the number of bytes; for example, __alignx(16, y) specifies that the
address y is 16-byte aligned.
In Fortran95, the built-in subroutine is ALIGNX(K,M), where K is of type INTEGER(4), and M
is a variable of any type. When M is an integer pointer, the argument refers to the address of
the pointee.
Example 8-15 asserts that the variables x and f are aligned along 16-byte boundaries.
Example 8-15 Using the __alignx built-in function
#include <math.h>
#include <builtins.h>
__inline void aligned_ten_reciprocal_roots (double* x, double* f)
{
#pragma disjoint (*x, *f)
int i;
__alignx (16, x);
__alignx (16, f);
for (i=0; i < 10; i++)
f[i]= 1.0 / sqrt (x[i]);
}
The __alignx function: The __alignx function does not perform any alignment. It merely
informs the compiler that the variables are aligned as specified. If the variables are not
aligned correctly, the program does not run properly.
After you create a function to handle input variables that are correctly aligned, you then can
create a function that tests for alignment and then calls the appropriate function to perform
the calculations. The function in Example 8-16 checks to see whether the incoming values
are correctly aligned. Then it calls the aligned (Example 8-15) or unaligned (Example 8-12
on page 116) version of the function according to the result.
Example 8-16 Function to test for alignment
void reciprocal_roots (double *x, double *f, int n)
{
/* are both x & f 16 byte aligned? */
if ( ((((int) x) | ((int) f)) & 0xf) == 0) /* This could also be done as:
if (((int) x % 16 == 0) && ((int) f % 16) == 0) */
aligned_ten_reciprocal_roots (x, f, n);
else
ten_reciprocal_roots (x, f, n);
}
The alignment test in Example 8-16 on page 117 provides an optimized method of testing for
16-byte alignment by performing a bit-wise OR on the two incoming addresses and testing
whether the lowest four bits are 0 (that is, 16-byte aligned).
On the Blue Gene/P system, the XL compilers provide a set of built-in functions that are
specifically optimized for the PowerPC 450d dual FPU. These built-in functions provide an
almost one-to-one correspondence with the dual floating-point instruction set.
All of the C/C++ and Fortran built-in functions operate on complex data types, which have an
underlying representation of a two-element array, in which the real part represents the
primary element and the imaginary part represents the second element. The input data that
you provide does not need to represent complex numbers. In fact, both elements are
represented internally as two real values. None of the built-in functions performs complex
arithmetic. A set of built-in functions designed to efficiently manipulate complex-type variables
is also available.
The Blue Gene/P built-in functions perform several types of operations, as explained in the
following paragraphs.
Parallel operations perform SIMD computations on the primary and secondary elements of
one or more input operands. They store the results in the corresponding elements of the
output. As an example, Figure 8-3 illustrates how a parallel-multiply operation is performed.
(Figure 8-3: the primary elements of input operands a and b produce the primary element of the output, and the secondary elements produce the secondary element.)
Cross operations perform SIMD computations on the opposite primary and secondary
elements of one or more input operands. They store the results in the corresponding
elements in the output. As an example, Figure 8-4 on page 119 illustrates how a
cross-multiply operation is performed.
(Figure 8-4: a cross operation combines the opposite elements of input operands a and b and stores the results in the corresponding elements of the output.)
In cross-copy operations, the compiler crosses either the primary or secondary element of the first
operand, so that copy-primary and copy-secondary operations can be used interchangeably to
achieve the same result. The operation is performed on the total value of the first operand. As an
example, Figure 8-7 illustrates the result of a cross-copy multiply operation.
(Figure 8-7: result of a cross-copy multiply operation on input operands a and b.)
In the following paragraphs, we describe the available built-in functions by category. For each
function, the C/C++ prototype is provided. In C, you do not need to include a header file to
obtain the prototypes. The compiler includes them automatically. In C++, you must include the
header file builtins.h.
Fortran does not use prototypes for built-in functions. Therefore, the interfaces for the
Fortran95 functions are provided in textual form. The function names omit the double
underscore (__) in Fortran95.
All of the built-in functions, with the exception of the complex type manipulation functions,
require compilation under -qarch=450d. This is the default setting on the Blue Gene/P system.
To help clarify the English description of each function, the following notation is used:
element(variable)
Here element represents one of primary or secondary, and variable represents input variable
a, b, or c, and the output variable result, for example, consider the following formula:
primary(result) = primary(a) + primary(b)
This formula indicates that the primary element of input variable a is added to the primary
element of input variable b and stored in the primary element of the result.
To optimize your calls to the Blue Gene/P built-in functions, follow the guidelines that we
provide in 8.12, Tuning your code for Blue Gene/P on page 107. Using the alignx built-in
function (described in Checking for data alignment on page 116) and specifying the
disjoint pragma (described in Removing possibilities for aliasing (C/C++) on page 114) are
recommended for code that calls any of the built-in functions.
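As an illustration, the following sketch adds two adjacent pairs of doubles with a single parallel
operation. It assumes the C built-in names __lfpd, __fpadd, and __stfpd for the parallel load,
add, and store operations described in this section, compilation with -qarch=450d, and
illustrative array names:

#ifdef __cplusplus
#include <builtins.h>      /* C++ requires the header; C obtains the prototypes automatically */
#endif

void pair_add(double *a, double *b, double *c)
{
    double _Complex va, vb, vc;

    __alignx(16, a);       /* assert 16-byte alignment, as described earlier in this chapter */
    __alignx(16, b);
    __alignx(16, c);

    va = __lfpd(a);        /* parallel load of a[0] and a[1] */
    vb = __lfpd(b);
    vc = __fpadd(va, vb);  /* primary(vc) = a[0] + b[0]; secondary(vc) = a[1] + b[1] */
    __stfpd(c, vc);        /* parallel store to c[0] and c[1] */
}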
Purpose
Converts two single-precision real values to a single complex value. The real a is
converted to the primary element of the return value, and the real b is converted to
the secondary element of the return value.
Formula
primary(result) =a
secondary(result) = b
C/C++
prototype
Fortran
descriptions
CMPLXF(A,B)
where A is of type REAL(4)
where B is of type REAL(4)
result is of type COMPLEX(4)
Function
Purpose
Converts two double-precision real values to a single complex value. The real a is
converted to the primary element of the return value, and the real b is converted to
the secondary element of the return value.
Formula
primary(result) =a
secondary(result) = b
C/C++
prototype
Fortran
descriptions
CMPLX(A,B)
where A is of type REAL(8)
where B is of type REAL(8)
result is of type COMPLEX(8)
Function
Purpose
Extracts the primary part of a single-precision complex value a, and returns the result
as a single real value.
Formula
result =primary(a)
C/C++
prototype
Fortran
descriptions
CREALF(A)
where A is of type COMPLEX(4)
result is of type REAL(4)
Function
Purpose
Extracts the primary part of a double-precision complex value a, and returns the
result as a single real value.
Formula
result =primary(a)
C/C++
prototype
Fortran
descriptions
CREAL(A)
where A is of type COMPLEX(8)
result is of type REAL(8)
CREALL(A)
where A is of type COMPLEX(16)
result is of type REAL(16)
Function
Purpose
Extracts the secondary part of a single-precision complex value a, and returns the
result as a single real value.
Formula
result =secondary(a)
C/C++
prototype
Fortran
descriptions
CIMAGF(A)
where A is of type COMPLEX(4)
result is of type REAL(4)
Function
Purpose
Extracts the imaginary part of a double-precision complex value a, and returns the
result as a single real value.
Formula
result =secondary(a)
C/C++
prototype
Fortran
descriptions
CIMAG(A)
where A is of type COMPLEX(8)
result is of type REAL(8)
CIMAGL(A)
where A is of type COMPLEX(16)
result is of type REAL(16)
a. 128-bit C/C++ long double types are not supported on Blue Gene/L. Long doubles are treated
as regular double-precision longs.
Purpose
Loads parallel single-precision values from the address of a, and converts the results
to double-precision. The first word in address(a) is loaded into the primary element of
the return value. The next word, at location address(a)+4, is loaded into the secondary
element of the return value.
Formula
primary(result) = a[0]
secondary(result) = a[1]
C/C++
prototype
Fortran
description
LOADFP(A)
where A is of type REAL(4)
result is of type COMPLEX(8)
Function
Purpose
Loads single-precision values that have been converted to double-precision, from the
address of a. The first word in address(a) is loaded into the secondary element of the
return value. The next word, at location address(a)+4, is loaded into the primary
element of the return value.
Formula
primary(result) = a[1]
secondary(result) = a[0]
C/C++
prototype
Fortran
description
LOADFX(A)
where A is of type REAL(4)
result is of type COMPLEX(8)
Function
Purpose
Loads parallel values from the address of a. The first word in address(a) is loaded into
the primary element of the return value. The next word, at location address(a)+8, is
loaded into the secondary element of the return value.
Formula
primary(result) = a[0]
secondary(result) = a[1]
C/C++
prototype
Fortran
description
LOADFP(A)
where A is of type REAL(8)
result is of type COMPLEX(8)
Function
Purpose
Loads values from the address of a. The first word in address(a) is loaded into the
secondary element of the return value. The next word, at location address(a)+8, is
loaded into the primary element of the return value.
Formula
primary(result) = a[1]
secondary(result) = a[0]
C/C++
prototype
Fortran
description
LOADFX(A)
where A is of type REAL(8)
result is of type COMPLEX(8)
Function
Purpose
Formula
b[0] = primary(a)
b[1]= secondary(a)
C/C++
prototype
Fortran
description
STOREFP(B, A)
where B is of type REAL(4)
A is of type COMPLEX(8)
result is none
Function
Purpose
Formula
b[0] = secondary(a)
b[1] = primary(a)
C/C++
prototype
Fortran
description
STOREFX(B, A)
where B is of type REAL(4)
A is of type COMPLEX(8)
result is none
Function
Purpose
Stores in parallel values into address(b). The primary element of a is stored as the first
double word in address(b). The secondary element of a is stored as the next double
word at location address(b)+8.
Formula
b[0] = primary(a)
b[1] = secondary(a)
C/C++
prototype
Fortran
description
STOREFP(B, A)
where B is of type REAL(8)
A is of type COMPLEX(8)
result is none
Function
Purpose
Stores values into address(b). The secondary element of a is stored as the first double
word in address(b). The primary element of a is stored as the next double word at
location address(b)+8.
Formula
b[0] = secondary(a)
b[1] = primary(a)
C/C++
prototype
Fortran
description
STOREFP(B, A)
where B is of type REAL(8)
A is of type COMPLEX(8)
result is none
Function
Purpose
Formula
b[0] = primary(a)
b[1] = secondary(a)
C/C++
prototype
Fortran
description
STOREFP(B, A)
where B is of type INTEGER(4)
A is of type COMPLEX(8)
result is none
Move functions
Table 8-5 lists and explains the parallel move functions that are available.
Table 8-5 Move functions
Function
Purpose
Formula
primary(result) = secondary(a)
secondary(result) = primary(a)
C/C++
prototype
Fortran
description
FXMR(A)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Arithmetic functions
In the following sections, we describe all the arithmetic built-in functions, categorized by their
number of operands.
Unary functions
Unary functions, listed in Table 8-6, operate on a single input operand.
Table 8-6 Unary functions
Function
Purpose
Formula
primary(result) = primary(a)
secondary(result) = secondary(a)
C/C++
prototype
Fortran
purpose
FPCTIW(A)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Formula
primary(result) = primary(a)
secondary(result) = secondary(a)
C/C++
prototype
Fortran
description
FPCTIWZ(A)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Formula
primary(result) = primary(a)
secondary(result) = secondary(a)
C/C++
prototype
Fortran
description
FPRSP(A)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Formula
primary(result) = primary(a)
secondary(result) = secondary(a)
C/C++
prototype
Fortran
description
FPRE(A)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Formula
primary(result) = primary(a)
secondary(result) = secondary(a)
C/C++
prototype
Fortran
description
FPRSQRTE(A)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Calculates in parallel the negative values of the primary and secondary elements of
operand a.
Formula
primary(result) = -primary(a)
secondary(result) = -secondary(a)
C/C++
prototype
Fortran
description
FPNEG(A)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Calculates in parallel the absolute values of the primary and secondary elements of
operand a.
Formula
primary(result) = abs(primary(a))
secondary(result) = abs(secondary(a))
C/C++
prototype
Fortran
description
FPABS(A)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Calculates in parallel the negative absolute values of the primary and secondary
elements of operand a.
Formula
primary(result) = -abs(primary(a))
secondary(result) = -abs(secondary(a))
C/C++
prototype
Fortran
description
FPNABS(A)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Binary functions
Binary functions, listed in Table 8-7, operate on two input operands.
Table 8-7 Binary functions
Function
Purpose
Formula
C/C++
prototype
Fortran
description
FPADD(A,B)
where A is of type COMPLEX(8)
where B is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Subtracts in parallel the primary and secondary elements of operand b from the
corresponding primary and secondary elements of operand a.
Formula
C/C++
prototype
Fortran
description
FPSUB(A,B)
where A is of type COMPLEX(8)
where B is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Formula
C/C++
prototype
Fortran
description
FPMUL(A,B)
where A is of type COMPLEX(8)
where B is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
The product of the secondary element of a and the primary element of b is stored as
the primary element of the return value. The product of the primary element of a and
the secondary element of b is stored as the secondary element of the return value.
Formula
C/C++
prototype
Fortran
description
FXMUL(A,B)
where A is of type COMPLEX(8)
where B is of type COMPLEX(8)
result is of type COMPLEX(8)
Function
Purpose
Both of these functions can be used to achieve the same result. The product of a and
the primary element of b is stored as the primary element of the return value. The
product of a and the secondary element of b is stored as the secondary element of the
return value.
Formula
primary(result) = a x primary(b)
secondary(result) = a x secondary(b)
C/C++
prototype
Fortran
description
FXPMUL(B,A) or FXSMUL(B,A)
where B is of type COMPLEX(8)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Multiply-add functions
Multiply-add functions take three input operands, multiply the first two, and add or subtract the
third. Table 8-8 lists these functions.
Table 8-8 Multiply-add functions
Function: FPMADD
Purpose
The sum of the product of the primary elements of a and b, added to the primary element of c, is stored as the primary element of the return value. The sum of the product of the secondary elements of a and b, added to the secondary element of c, is stored as the secondary element of the return value.
Formula
primary(result) = primary(a) x primary(b) + primary(c)
secondary(result) = secondary(a) x secondary(b) + secondary(c)
Fortran description
FPMADD(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)

Function: FPNMADD
Purpose
The sum of the product of the primary elements of a and b, added to the primary element of c, is negated and stored as the primary element of the return value. The sum of the product of the secondary elements of a and b, added to the secondary element of c, is negated and stored as the secondary element of the return value.
Formula
primary(result) = -(primary(a) x primary(b) + primary(c))
secondary(result) = -(secondary(a) x secondary(b) + secondary(c))
Fortran description
FPNMADD(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)

Function: FPMSUB
Purpose
The difference of the primary element of c, subtracted from the product of the primary elements of a and b, is stored as the primary element of the return value. The difference of the secondary element of c, subtracted from the product of the secondary elements of a and b, is stored as the secondary element of the return value.
Formula
primary(result) = primary(a) x primary(b) - primary(c)
secondary(result) = secondary(a) x secondary(b) - secondary(c)
Fortran description
FPMSUB(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)

Function: FPNMSUB
Purpose
The difference of the primary element of c, subtracted from the product of the primary elements of a and b, is negated and stored as the primary element of the return value. The difference of the secondary element of c, subtracted from the product of the secondary elements of a and b, is negated and stored as the secondary element of the return value.
Formula
primary(result) = -(primary(a) x primary(b) - primary(c))
secondary(result) = -(secondary(a) x secondary(b) - secondary(c))
Fortran description
FPNMSUB(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Function: FXMADD
Purpose
The sum of the product of the primary element of a and the secondary element of b, added to the primary element of c, is stored as the primary element of the return value. The sum of the product of the secondary element of a and the primary element of b, added to the secondary element of c, is stored as the secondary element of the return value.
Formula
primary(result) = primary(a) x secondary(b) + primary(c)
secondary(result) = secondary(a) x primary(b) + secondary(c)
Fortran description
FXMADD(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)

Function: FXNMADD
Purpose
The sum of the product of the primary element of a and the secondary element of b, added to the primary element of c, is negated and stored as the primary element of the return value. The sum of the product of the secondary element of a and the primary element of b, added to the secondary element of c, is negated and stored as the secondary element of the return value.
Formula
primary(result) = -(primary(a) x secondary(b) + primary(c))
secondary(result) = -(secondary(a) x primary(b) + secondary(c))
Fortran description
FXNMADD(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)

Function: FXMSUB
Purpose
The difference of the primary element of c, subtracted from the product of the primary element of a and the secondary element of b, is stored as the primary element of the return value. The difference of the secondary element of c, subtracted from the product of the secondary element of a and the primary element of b, is stored as the secondary element of the return value.
Formula
primary(result) = primary(a) x secondary(b) - primary(c)
secondary(result) = secondary(a) x primary(b) - secondary(c)
Fortran description
FXMSUB(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)

Function: FXNMSUB
Purpose
The difference of the primary element of c, subtracted from the product of the primary element of a and the secondary element of b, is negated and stored as the primary element of the return value. The difference of the secondary element of c, subtracted from the product of the secondary element of a and the primary element of b, is negated and stored as the secondary element of the return value.
Formula
primary(result) = -(primary(a) x secondary(b) - primary(c))
secondary(result) = -(secondary(a) x primary(b) - secondary(c))
Fortran description
FXNMSUB(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type COMPLEX(8)
result is of type COMPLEX(8)
Function: FXCPMADD or FXCSMADD
Purpose
Both of these functions can be used to achieve the same result. The sum of the product of a and the primary element of b, added to the primary element of c, is stored as the primary element of the return value. The sum of the product of a and the secondary element of b, added to the secondary element of c, is stored as the secondary element of the return value.
Formula
primary(result) = a x primary(b) + primary(c)
secondary(result) = a x secondary(b) + secondary(c)
Fortran description
FXCPMADD(C,B,A) or FXCSMADD(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)

Function: FXCPNMADD or FXCSNMADD
Purpose
Both of these functions can be used to achieve the same result. The sum of the product of a and the primary element of b, added to the primary element of c, is negated and stored as the primary element of the return value. The sum of the product of a and the secondary element of b, added to the secondary element of c, is negated and stored as the secondary element of the return value.
Formula
primary(result) = -(a x primary(b) + primary(c))
secondary(result) = -(a x secondary(b) + secondary(c))
Fortran description
FXCPNMADD(C,B,A) or FXCSNMADD(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)

Function: FXCPMSUB or FXCSMSUB
Purpose
Both of these functions can be used to achieve the same result. The difference of the primary element of c, subtracted from the product of a and the primary element of b, is stored as the primary element of the return value. The difference of the secondary element of c, subtracted from the product of a and the secondary element of b, is stored as the secondary element of the return value.
Formula
primary(result) = a x primary(b) - primary(c)
secondary(result) = a x secondary(b) - secondary(c)
Fortran description
FXCPMSUB(C,B,A) or FXCSMSUB(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)

Function: FXCPNMSUB or FXCSNMSUB
Purpose
Both of these functions can be used to achieve the same result. The difference of the primary element of c, subtracted from the product of a and the primary element of b, is negated and stored as the primary element of the return value. The difference of the secondary element of c, subtracted from the product of a and the secondary element of b, is negated and stored as the secondary element of the return value.
Formula
primary(result) = -(a x primary(b) - primary(c))
secondary(result) = -(a x secondary(b) - secondary(c))
Fortran description
FXCPNMSUB(C,B,A) or FXCSNMSUB(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)

Function: FXCPNPMA or FXCSNPMA
Purpose
Both of these functions can be used to achieve the same result. The difference of the primary element of c, subtracted from the product of a and the primary element of b, is negated and stored as the primary element of the return value. The sum of the product of a and the secondary element of b, added to the secondary element of c, is stored as the secondary element of the return value.
Formula
primary(result) = -(a x primary(b)) + primary(c)
secondary(result) = a x secondary(b) + secondary(c)
Fortran description
FXCPNPMA(C,B,A) or FXCSNPMA(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)

Function: FXCPNSMA or FXCSNSMA
Purpose
Both of these functions can be used to achieve the same result. The sum of the product of a and the primary element of b, added to the primary element of c, is stored as the primary element of the return value. The difference of the secondary element of c, subtracted from the product of a and the secondary element of b, is negated and stored as the secondary element of the return value.
Formula
primary(result) = a x primary(b) + primary(c)
secondary(result) = -(a x secondary(b)) + secondary(c)
Fortran description
FXCPNSMA(C,B,A) or FXCSNSMA(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)
Function: FXCXMA
Purpose
The sum of the product of a and the secondary element of b, added to the primary element of c, is stored as the primary element of the return value. The sum of the product of a and the primary element of b, added to the secondary element of c, is stored as the secondary element of the return value.
Formula
primary(result) = a x secondary(b) + primary(c)
secondary(result) = a x primary(b) + secondary(c)
Fortran description
FXCXMA(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)

Function: FXCXNMS
Purpose
The difference of the primary element of c, subtracted from the product of a and the secondary element of b, is negated and stored as the primary element of the return value. The difference of the secondary element of c, subtracted from the product of a and the primary element of b, is negated and stored as the secondary element of the return value.
Formula
primary(result) = -(a x secondary(b) - primary(c))
secondary(result) = -(a x primary(b) - secondary(c))
Fortran description
FXCXNMS(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)

Function: FXCXNPMA
Purpose
The difference of the primary element of c, subtracted from the product of a and the secondary element of b, is negated and stored as the primary element of the return value. The sum of the product of a and the primary element of b, added to the secondary element of c, is stored as the secondary element of the return value.
Formula
primary(result) = -(a x secondary(b)) + primary(c)
secondary(result) = a x primary(b) + secondary(c)
Fortran description
FXCXNPMA(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)

Function: FXCXNSMA
Purpose
The sum of the product of a and the secondary element of b, added to the primary element of c, is stored as the primary element of the return value. The difference of the secondary element of c, subtracted from the product of a and the primary element of b, is negated and stored as the secondary element of the return value.
Formula
primary(result) = a x secondary(b) + primary(c)
secondary(result) = -(a x primary(b)) + secondary(c)
Fortran description
FXCXNSMA(C,B,A)
where C is of type COMPLEX(8)
where B is of type COMPLEX(8)
where A is of type REAL(8)
result is of type COMPLEX(8)
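Taken together, the copy-primary and cross-mixed forms implement a full complex multiply-add in two operations. The following C sketch is illustrative only; it assumes the XL built-in names __lfpd, __stfpd, __fxcpmadd, and __fxcxnpma with the same (C,B,A) argument order as the Fortran descriptions above, and it computes c = c + a x b for one double-precision complex value stored as a primary/secondary pair.

void complex_madd(double *c, double *a, double *b)
{
    /* each argument points to one 16-byte aligned complex value: [real, imaginary] */
    double _Complex bv = __lfpd(b);
    double _Complex cv = __lfpd(c);
    cv = __fxcpmadd(cv, bv, a[0]);   /* c.re += a.re x b.re ; c.im += a.re x b.im */
    cv = __fxcxnpma(cv, bv, a[1]);   /* c.re -= a.im x b.im ; c.im += a.im x b.re */
    __stfpd(c, cv);
}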
Select functions
Table 8-9 lists and explains the parallel select functions that are available.
Table 8-9 Select functions
Function: FPSEL
Purpose
The value of the primary element of a is compared to zero. If its value is equal to or greater than zero, the primary element of c is stored in the primary element of the return value. Otherwise, the primary element of b is stored in the primary element of the return value. The value of the secondary element of a is compared to zero. If its value is equal to or greater than zero, the secondary element of c is stored in the secondary element of the return value. Otherwise, the secondary element of b is stored in the secondary element of the return value.
Formula
primary(result) = primary(c) if primary(a) >= 0, otherwise primary(b)
secondary(result) = secondary(c) if secondary(a) >= 0, otherwise secondary(b)
Fortran description
FPSEL(A,B,C)
where A is of type COMPLEX(8)
where B is of type COMPLEX(8)
where C is of type COMPLEX(8)
result is of type COMPLEX(8)
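In C, the corresponding built-in enables branch-free, element-wise selection. The sketch below is illustrative only; it assumes the built-in names __lfpd, __fpsub, __fpsel, and __stfpd and 16-byte aligned arrays, and it computes the element-wise maximum of two arrays.

void parallel_max(double *z, double *x, double *y, int n)
{
    int i;
    for (i = 0; i < n; i += 2) {
        double _Complex a = __lfpd(&x[i]);
        double _Complex b = __lfpd(&y[i]);
        /* if (a - b) >= 0 in an element, select that element from a, otherwise from b */
        __stfpd(&z[i], __fpsel(__fpsub(a, b), b, a));
    }
}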
      do ii = 1, n, nb
         if ((ii + nb - 1) .lt. n) then
            istop = (ii + nb - 1)
         else
            istop = n
         endif
!--------------------------------
! initialize a block of c to zero
!--------------------------------
         do j = jj, jstop - 5, 6
            do i = ii, istop - 1, 2
               call storefp(c(i,j)  , zero)
               call storefp(c(i,j+1), zero)
               call storefp(c(i,j+2), zero)
               call storefp(c(i,j+3), zero)
               call storefp(c(i,j+4), zero)
               call storefp(c(i,j+5), zero)
            end do
         end do
!--------------------------------------------------------
! multiply block by block with 6x4 outer loop un-rolling
!--------------------------------------------------------
         do kk = 1, n, nb
            if ((kk + nb - 1) .lt. n) then
               kstop = (kk + nb - 1)
            else
               kstop = n
            endif
            do j = jj, jstop - 5, 6
               do i = ii, istop - 3, 4
                  c00 = loadfp(c(i,j  ))
                  c01 = loadfp(c(i,j+1))
                  c02 = loadfp(c(i,j+2))
                  c03 = loadfp(c(i,j+3))
                  c04 = loadfp(c(i,j+4))
                  c05 = loadfp(c(i,j+5))
                  c20 = loadfp(c(i+2,j  ))
                  c21 = loadfp(c(i+2,j+1))
                  c22 = loadfp(c(i+2,j+2))
                  c23 = loadfp(c(i+2,j+3))
                  c24 = loadfp(c(i+2,j+4))
                  c25 = loadfp(c(i+2,j+5))
                  a00 = loadfp(a(i,kk  ))
                  a20 = loadfp(a(i+2,kk  ))
                  a01 = loadfp(a(i,kk+1))
                  a21 = loadfp(a(i+2,kk+1))
                  do k = kk, kstop - 1, 2
                     b0 = loadfp(b(k,j  ))
                     b1 = loadfp(b(k,j+1))
                     b2 = loadfp(b(k,j+2))
                     b3 = loadfp(b(k,j+3))
                     b4 = loadfp(b(k,j+4))
                     b5 = loadfp(b(k,j+5))
                     c00 = fxcpmadd(c00, a00, real(b0))
                     c01 = fxcpmadd(c01, a00, real(b1))
                     c02 = fxcpmadd(c02, a00, real(b2))
                     c03 = fxcpmadd(c03, a00, real(b3))
                     c04 = fxcpmadd(c04, a00, real(b4))
                     c05 = fxcpmadd(c05, a00, real(b5))
                     c20 = fxcpmadd(c20, a20, real(b0))
                     c21 = fxcpmadd(c21, a20, real(b1))
                     c22 = fxcpmadd(c22, a20, real(b2))
                     c23 = fxcpmadd(c23, a20, real(b3))
                     c24 = fxcpmadd(c24, a20, real(b4))
                     c25 = fxcpmadd(c25, a20, real(b5))
                     a00 = loadfp(a(i,k+2))
                     a20 = loadfp(a(i+2,k+2))
                     c00 = fxcpmadd(c00, a01, imag(b0))
                     c01 = fxcpmadd(c01, a01, imag(b1))
                     c02 = fxcpmadd(c02, a01, imag(b2))
                     c03 = fxcpmadd(c03, a01, imag(b3))
                     c04 = fxcpmadd(c04, a01, imag(b4))
                     c05 = fxcpmadd(c05, a01, imag(b5))
                     c20 = fxcpmadd(c20, a21, imag(b0))
                     c21 = fxcpmadd(c21, a21, imag(b1))
                     c22 = fxcpmadd(c22, a21, imag(b2))
                     c23 = fxcpmadd(c23, a21, imag(b3))
                     c24 = fxcpmadd(c24, a21, imag(b4))
                     c25 = fxcpmadd(c25, a21, imag(b5))
                     a01 = loadfp(a(i,k+3))
                     a21 = loadfp(a(i+2,k+3))
                  end do
                  call storefp(c(i  ,j  ), c00)
                  call storefp(c(i  ,j+1), c01)
                  call storefp(c(i  ,j+2), c02)
                  call storefp(c(i  ,j+3), c03)
                  call storefp(c(i  ,j+4), c04)
                  call storefp(c(i  ,j+5), c05)
                  call storefp(c(i+2,j  ), c20)
                  call storefp(c(i+2,j+1), c21)
                  call storefp(c(i+2,j+2), c22)
                  call storefp(c(i+2,j+3), c23)
                  call storefp(c(i+2,j+4), c24)
                  call storefp(c(i+2,j+5), c25)
               end do
            end do
         end do !kk
      end do !ii
      end do !jj
      end
Chapter 9.
connecting to mmcs_server
connected to mmcs_server
connected to DB2
mmcs$ list_blocks
OK
N00_64_1  B manojd  (1) connected
N02_32_1  I walkup  (0) connected
N04_32_1  B manojd  (1) connected
N05_32_1  B manojd  (1) connected
N06_32_1  I sameer77(1) connected
N07_32_1  I gdozsa  (1) connected
N08_64_1  I vezolle (1) connected
N12_32_1  I vezolle (0) connected
mmcs$ allocate N14_32_1
OK
mmcs$ list_blocks
OK
N00_64_1  B manojd  (1) connected
N02_32_1  I walkup  (0) connected
N04_32_1  B manojd  (1) connected
N05_32_1  B manojd  (1) connected
N06_32_1  I sameer77(1) connected
N07_32_1  I gdozsa  (1) connected
N08_64_1  I vezolle (1) connected
N12_32_1  I vezolle (0) connected
N14_32_1  I cpsosa  (1) connected
mmcs$ submitjob N14_32_1 /bgusr/cpsosa/hello/c/omp_hello_bgp /bgusr/cpsosa/hello/c
OK
jobId=14008
mmcs$ free N14_32_1
OK
mmcs$ quit
OK
mmcs_db_console is terminating, please wait...
mmcs_db_console: closing database connection
mmcs_db_console: closed database connection
mmcs_db_console: closing console port
mmcs_db_console: closed console port
For more information about using the MMCS console, see IBM System Blue Gene Solution:
Blue Gene/P System Administration, SG24-7417.
9.1.2 mpirun
In the absence of a scheduling application, we recommend that you use mpirun to run
Blue Gene/P applications on statically allocated partitions. Users can access this application
from the Front End Node, which provides better security protection than using the MMCS
console. For more complete information about using mpirun, see Chapter 11, mpirun on
page 177.
With mpirun, you can select and allocate a block and run a Message Passing Interface (MPI)
application, all in one step as shown in Example 9-2.
Example 9-2 Using mpirun
cpsosa@descartes:/bgusr/cpsosa/red/pi/c> csh
descartes pi/c> set MPIRUN="/bgsys/drivers/ppcfloor/bin/mpirun"
descartes pi/c> set MPIOPT="-np 1"
descartes pi/c> set MODE="-mode SMP"
descartes pi/c> set PARTITION="-partition N14_32_1"
descartes pi/c> set WDIR="-cwd /bgusr/cpsosa/red/pi/c"
descartes pi/c> set EXE="-exe /bgusr/cpsosa/red/pi/c/pi_critical_bgp"
descartes pi/c> $MPIRUN $PARTITION $MPIOPT $MODE $WDIR $EXE -env "OMP_NUM_THREADS=1"
Estimate of pi: 3.14159
Total time 560.055988
All output in this example is sent to the display. To specify that you want this information sent
to a file, you must add the following line, for example, to the end of the mpirun command:
>/bgusr/cpsosa/red/pi/c/pi_critical.stdout 2>/bgusr/cpsosa/red/pi/c/pi_critical.stderr
This line sends standard output to the pi_critical.stdout file and standard error to the
pi_critical.stderr file. Both files are in the /bgusr/cpsosa/red/pi/c directory.
9.1.3 submit
In HTC mode you must use the submit command, which is analogous to mpirun because its
purpose is to act as a shadow of the job. It transparently forwards stdin, and receives stdout
and stderr. More detailed usage information is available in Chapter 12, High-Throughput
Computing (HTC) paradigm on page 201.
A great amount of documentation is available about the GDB. Because we do not discuss
how to use it in this book, refer to the following Web site for details:
http://www.gnu.org/software/gdb/documentation/
Support has been added to the Blue Gene/P system so that GDB can work with
applications that run on Compute Nodes. IBM provides a simple debug server called
gdbserver. Each running instance of GDB is associated with one, and only one, Compute
Node. If you must debug an MPI application that runs on multiple Compute Nodes and, for
example, view variables that are associated with more than one instance of the
application, you must run multiple instances of GDB.
Most people use GDB to debug local processes that run on the same machine on which they
are running GDB. With GDB, you also have the ability to debug remotely via a GDB server on
the remote machine. GDB on the Blue Gene/P system is used in this mode. We refer to GDB
as the GDB client, although most users recognize it as GDB used in a slightly different
manner.
Limitations
Gdbserver implements the minimum number of primitives required by the GDB remote
protocol specification. As such, advanced features that might be available in other
implementations are not available in this implementation. However, sufficient features are
implemented to make it a useful tool. This implementation has the following
limitations:
Each instance of a GDB client can connect to and debug one Compute Node. To debug
multiple Compute Nodes at the same time, you must run multiple GDB clients at the same
time. Although you might need multiple GDB clients for multiple Compute Nodes, one
gdbserver on each I/O Node is all that is required. The Blue Gene/P control system
manages that part.
IBM does not ship a GDB client with the Blue Gene/P system. The user can use an
existing GDB client to connect to the IBM-supplied gdbserver. Most functions work, but
standard GDB clients are not aware of the full double hummer floating-point register set
that Blue Gene/P provides. The GDB clients that come with SUSE Linux Enterprise Server
(SLES) 10 for IBM PowerPC are known to work.
To debug an application, the debug server must be started and running before you attempt
to debug. Using an option on the mpirun or submit command, you can get the debug
server running before your application does. If you do not use this option and you must
debug your application, you do not have a mechanism to start the debug server and thus
have no way to debug your application.
Gdbserver is not aware of user-specified MPI topologies. You still can debug your
application, but the connection information given to you by mpirun for each MPI rank can
be incorrect.
Prerequisite software
The GDB should have been installed during the installation procedure. You can verify the
installation by seeing whether the /bgsys/drivers/ppcfloor/gnu-linux/bin/gdb file exists on your
Front End Node.
The rest of the software support required for GDB should be installed as part of the control
programs.
BGP_FLOOR  = /bgsys/drivers/ppcfloor
BGP_IDIRS  = -I$(BGP_FLOOR)/arch/include -I$(BGP_FLOOR)/comm/include
BGP_LIBS   = -L$(BGP_FLOOR)/comm/lib -lmpich.cnk -L$(BGP_FLOOR)/comm/lib -ldcmfcoll.cnk \
             -ldcmf.cnk -lpthread -lrt -L$(BGP_FLOOR)/runtime/SPI -lSPI.cna

XL    = /opt/ibmcmp/xlf/bg/11.1/bin/bgxlf90

EXE   = example_9_4_bgp
OBJ   = example_9_4.o
SRC   = example_9_4.f
FLAGS = -g -O0 -qarch=450 -qtune=450 -I$(BGP_FLOOR)/comm/include

$(EXE): $(OBJ)
        ${XL} $(FLAGS) -o $(EXE) $(OBJ) $(BGP_LIBS)

$(OBJ): $(SRC)
        ${XL} $(FLAGS) $(BGP_IDIRS) -c $(SRC)

clean:
        rm *.o example_9_4_bgp
cpsosa@descartes:/bgusr/cpsosa/red/debug> make
/opt/ibmcmp/xlf/bg/11.1/bin/bgxlf90 -g -O0 -qarch=450 -qtune=450
-I/bgsys/drivers/ppcfloor/comm/include -I/bgsys/drivers/ppcfloor/arch/include
-I/bgsys/drivers/ppcfloor/comm/include -c example_9_4.f
** nooffset
=== End of Compilation 1 ===
1501-510 Compilation successful for file example_9_4.f.
/opt/ibmcmp/xlf/bg/11.1/bin/bgxlf90 -g -O0 -qarch=450 -qtune=450
-I/bgsys/drivers/ppcfloor/comm/include -o example_9_4_bgp example_9_4.o
-L/bgsys/drivers/ppcfloor/comm/lib -lmpich.cnk -L/bgsys/drivers/ppcfloor/comm/lib -ldcmfcoll.cnk
-ldcmf.cnk -lpthread -lrt -L/bgsys/drivers/ppcfloor/runtime/SPI -lSPI.cna
The -g switch tells the compiler to include debug information. The -O0 (the letter capital O
followed by a zero) switch tells it to disable optimization.
For more information about the IBM XL compilers for the Blue Gene/P system, see Chapter 8,
Developing applications with IBM XL compilers on page 97.
Important: Make sure that the text file that contains the source for your program is located
in the same directory as the program itself and has the same file name (different
extension).
Debugging
Follow the steps in this section to start debugging your application. In this example, the MPI
program's name is example_9_4_bgp, as illustrated in Example 9-4 on page 146 (source code
not shown), and the source code file is example_9_4.f. The partition (block) used is called
N14_32_1.
An extra parameter (-start_gdbserver...) is passed in on the mpirun or submit command. In
this example the application uses MPI so mpirun is used, but the process for submit is the
same. The extra option changes the way mpirun loads and executes your code. Here is a brief
summary of the changes:
1. The code is loaded onto the Compute Nodes (in our example, the executable is
example_9_4_bgp), but it does not start running immediately.
2. The control system starts the specified debug server (gdbserver) on all of the I/O Nodes in
the partition that is running your job, which in our example is N14_32_1.
3. The mpirun command pauses, so that you get a chance to connect GDB clients to the
Compute Nodes that you are going to debug.
4. When you are finished connecting GDB clients to Compute Nodes, you press Enter to
signal the mpirun command, and then the application starts running on the Compute
Nodes.
During the pause in step 3, you have an opportunity to connect the GDB clients to the
Compute Nodes before the application runs, which is desirable if you must start the
application under debugger control. This step is optional. If you do not connect before the
application starts running on the Compute Nodes, you can still connect later because the
debugger server was started on the I/O Nodes.
To start debugging your application:
1. Open two separate console shells.
2. Go to the first shell window:
a. Change to the directory (cd) that contains your program executable. In our example,
the directory is /bgusr/cpsosa/red/debug.
b. Start your application using mpirun with a command similar to the one shown in
Example 9-4. You should see messages in the console, similar to those shown in
Example 9-4.
Example 9-4 Messages in the console
set MPIRUN="/bgsys/drivers/ppcfloor/bin/mpirun"
set MPIOPT="-np 1"
set MODE="-mode SMP"
set PARTITION="-partition N14_32_1"
set WDIR="-cwd /bgusr/cpsosa/red/debug"
set EXE="-exe /bgusr/cpsosa/red/debug/example_9_4_bgp"
#
$MPIRUN $PARTITION $MPIOPT $MODE $WDIR $EXE -env "OMP_NUM_THREADS=4" -start_gdbserver
/sbin.rd/gdbserver -verbose 1
#
echo "That's all folks!!"
descartes red/debug> set EXE="-exe /bgusr/cpsosa/red/debug/example_9_4_bgp"
descartes red/debug> $MPIRUN $PARTITION $MPIOPT $MODE $WDIR $EXE -env "OMP_NUM_THREADS=4"
-start_gdbserver /bgsys/drivers/ppcfloor/ramdisk/sbin/gdbserver -verbose 1
<Sep 15 10:14:58.642369> FE_MPI (Info) : Invoking mpirun backend
<Sep 15 10:14:05.741121> BRIDGE (Info) : rm_set_serial() - The machine serial number (alias) is
BGP
<Sep 15 10:15:00.461655> FE_MPI (Info) : Preparing partition
<Sep 15 10:14:05.821585> BE_MPI (Info) : Examining specified partition
<Sep 15 10:14:10.085997> BE_MPI (Info) : Checking partition N14_32_1 initial state ...
<Sep 15 10:14:10.086041> BE_MPI (Info) : Partition N14_32_1 initial state = READY ('I')
<Sep 15 10:14:10.086059> BE_MPI (Info) : Checking partition owner...
<Sep 15 10:14:10.086087> BE_MPI (Info) : partition N14_32_1 owner is 'cpsosa'
<Sep 15 10:14:10.088375> BE_MPI (Info) : Partition owner matches the current user
<Sep 15 10:14:10.088470> BE_MPI (Info) : Done preparing partition
<Sep 15 10:15:04.804078> FE_MPI (Info) : Adding job
<Sep 15 10:14:10.127380> BE_MPI (Info) : Adding job to database...
<Sep 15 10:15:06.104035> FE_MPI (Info) : Job added with the following id: 14035
<Sep 15 10:15:06.104096> FE_MPI (Info) : Loading Blue Gene job
<Sep 15 10:14:11.426987> BE_MPI (Info) : Loading job 14035 ...
<Sep 15 10:14:11.450495> BE_MPI (Info) : Job load command successful
<Sep 15 10:14:11.450525> BE_MPI (Info) : Waiting for job 14035 to get to Loaded/Running state
...
<Sep 15 10:14:16.458474> BE_MPI (Info) : Job 14035 switched to state LOADED
<Sep 15 10:14:21.467401> BE_MPI (Info) : Job loaded successfully
<Sep 15 10:15:16.179023> FE_MPI (Info) : Starting debugger setup for job 14035
... -- verbose output
Make your connections to the compute nodes now - press [Enter] when you
are ready to run the app. To see the IP connection information for a
specific compute node, enter its MPI rank and press [Enter]. To see
all of the compute nodes, type 'dump_proctable'.
>
... -- verbose output
c. Find the IP address and port of the Compute Node that you want to debug. You can do
this using either of the following ways:
Enter the rank of the program instance that you want to debug and press Enter.
Dump the address or port of each node by typing dump_proctable and press Enter.
> 2
MPI Rank 2: Connect to 172.30.255.85:7302
> 4
MPI Rank 4: Connect to 172.30.255.85:7304
>
or
> dump_proctable
MPI Rank 0: Connect to 172.24.101.128:7310
>
3. From the second shell, complete the following steps:
a. Change to the directory (cd) that contains your program executable.
b. Type the following command, using the name of your own executable instead of
example_9_4_bgp:
/bgsys/drivers/ppcfloor/gnu-linux/bin/gdb example_9_4_bgp
c. Enter the following command, using the address of the Compute Node that you want to
debug, as determined in step 2:
target remote ipaddr:port
You are now debugging the specified application on the configured Compute Node.
4. Set one or more breakpoints (using the GDB break command). Press Enter from the first
shell to continue that application.
If successful, your breakpoint should eventually be reached in the second shell, and you
can use standard GDB commands to continue.
The GDB info shared command can be used to display where GDB found dynamic libraries
when debugging a dynamically linked application. Example 9-7 shows sample output using
the GDB info shared command with the sample .gdbinit file from Example 9-6 on page 148.
Example 9-7 Sample GDB info shared output
Program received signal SIGINT, Interrupt.
0x010024f0 in _start ()
(gdb) info shared
From         To           Syms Read
0x006022f0   0x0061a370   Yes
0x81830780   0x81930730   Yes
0x81a36850   0x81a437d0   Yes
0x81b36da0   0x81b38300   Yes
0x81c38c00   0x81c39ac0   Yes
0x81d47e10   0x81d95fd0   Yes
0x81e5d6d0   0x81f5e8d0   Yes
Figure 9-1 shows how the Core Processor tool GUI looks after the Perl script is invoked. The
Core Processor windows do not provide any initial information. You must explicitly select a
task that is provided via the GUI.
3. In the Attach Coreprocessor window (see Figure 9-2), supply the following information:
Session Name: You can run more than one session at a time, so use this option to
distinguish between multiple sessions.
Block name.
CNK binary (with path): To see both your application and the Compute Node Kernel in
the stack, specify your application binary and the Compute Node Kernel image
separated by a colon (:) as shown in the following example:
/bgsys/drivers/ppcfloor/cnk/bgp_kernel.cn:/bguser/bgpuser/hello_mpi_loop.rts
User name or owner of the Midplane Management Control System (MMCS) block.
Port: TCP port on which the MMCS server is listening for console connections, which is
probably 32031.
Host name or TCP/IP address for the MMCS server: Typically it is localhost or the
Service Node's TCP/IP address.
Click the Attach button.
4. At this point, you have not yet affected the state of the processors. Choose Select
Grouping Mode → Processor Status.
Notice the text in the upper-left pane (Figure 9-3). The Core Processor tool posts the
status ?RUN? because it does not yet know the state of the processors. (2048) is the
number of nodes in the block that are in that state. The number in parentheses always
indicates the number of nodes that share the attribute displayed on the line, which is the
processor state in this case.
5. Back at the main window (refer to Figure 9-1 on page 150), click the Select Grouping
Mode button.
6. Choose one of the Stack Traceback options. The Core Processor tool halts all the
Compute Node processor cores and displays the requested information. Choose each of
the options on that menu in turn so that you can see the variety of data formats available.
Refer to the following points to help you use the tool more effectively:
The number at the far left, before the colon, indicates the depth within the stack.
The number in parentheses at the end of each line indicates the number of nodes that
share the same stack frame.
If you click any line in the stack dump, the pane on the right (labeled Common nodes)
shows the list of nodes that share that stack frame. See Figure 9-7 on page 156.
When you click one of the stack frames and then select Control → Run, the action is
performed for all nodes that share that stack frame. A new Processor Status summary is
displayed. If you again chose a Stack Traceback option, the running processors are halted
and the stacks are refetched.
You can hold down the Shift key and click several stack frames if you want to control all
procedures that are at a range of stack frames.
From the Filter menu option, you can select Group Selection → Create Filter to add a
filter with the name that you specify in the Filter pull-down. When the box for your filter is
highlighted, only the data for those processors is displayed in the upper-left window. You
can create several filters if you want.
Set Group Mode to Ungrouped or Ungrouped with Traceback to control one processor at a
time.
The Survey option is less useful for core files because speed is not such a concern.
When you select a stack frame in the Traceback output (Figure 9-10), two additional pieces of
information are displayed. The core files that share that stack frame are displayed in the
Common nodes pane. The Location field under the Traceback pane displays the location of
that function and the line number represented by the stack frame. If you select one of the core
files in the Common nodes pane, the contents of that core file are displayed in the bottom
pane.
vi core.0, select the addresses between +++STACK and ---STACK, and use them as input for
addr2line:
+++STACK
0x01342cb8
0x0134653c
0x0106e5f8
0x010841ec
0x0103946c
0x010af40c
0x010b5e44
0x01004fa0
0x010027cc
0x0100c028
0x0100133c
0x013227ec
0x01322a4c
0xfffffffc
---STACK

Run addr2line with your executable:
$ addr2line -e a.out
0x01342cb8
0x0134653c
0x0106e5f8
0x010841ec
0x0103946c
0x010af40c
0x010b5e44
0x01004fa0
0x010027cc
0x0100c028
0x0100133c
0x013227ec
0x01322a4c
0xfffffffc

/bglhome/usr6/bgbuild/DRV360_2007-070906P-SLES10-DD2-GNU10/ppc/bgp/gnu/glibc-2.4/malloc/malloc.c:3377
/bglhome/usr6/bgbuild/DRV360_2007-070906P-SLES10-DD2-GNU10/ppc/bgp/gnu/glibc-2.4/malloc/malloc.c:3525
modify.cpp:0
??:0
??:0
??:0
??:0
main.cpp:0
main.cpp:0
main.cpp:0
??:0
../csu/libc-start.c:231
../sysdeps/unix/sysv/linux/powerpc/libc-start.c:127
API requirements
The following requirements are for writing programs using the Scalable Debug API:
Currently, SUSE Linux Enterprise Server (SLES) 10 for PowerPC is the only supported
platform.
C and C++ are supported with the GNU gcc V4.1.2 level compilers. For more information
and downloads, refer to the following Web address:
http://gcc.gnu.org/
API specification
This section describes the functions, return codes, and data structures that make up the
Scalable Debug API.
Functions
This section describes the functions in the Scalable Debug API.
The functions return a status code that indicates success or failure, along with the reason for
the failure. An exit value of SDBG_NO_ERROR or 0 (zero) indicates that the function was
successful, while a non-zero return code indicates failure.
These functions are not thread safe. They must be called only from a single thread.
The following functions are in the Scalable Debug API:
int SDBG_Init();
This initialization function must be called before any other functions in the Scalable Debug
API.
This function returns the following values:
SDBG_NO_ERROR
SDBG_FORK_FE_ERROR
SDBG_EXEC_FE_ERROR
SDBG_CONNECT_FE_ERROR
SDBG_NO_ERROR
SDBG_INIT_NOT_CALLED
SDBG_CONNECT_FE_ERROR
SDBG_TIMEOUT_ERROR
SDBG_JOB_NOT_RUNNING
SDBG_JOB_HTC
SDBG_JOB_USER_DIFF
SDBG_TOOL_SETUP_FAIL
SDBG_TOOL_LAUNCH_FAIL
SDBG_PROC_TABLE
int SDBG_DetachJob();
Detach from a running job.
This function returns the following values:
SDBG_NO_ERROR
SDBG_CONNECT_FE_ERROR
int SDBG_GetStackData(uint32_t *numMsgs, SDBG_StackMsg_t **stackMsgPtr);
Get the stack data for the job. numMsgs is set to the number of stack data messages put in
stackMsgPtr. When the application is finished using the stack data, it must free the stack
data using SDBG_FreeStackData() to prevent a memory leak.
This function returns the following values:
SDBG_NO_ERROR
SDBG_NO_MEMORY
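The calling sequence can be sketched as follows. The attach step and the argument list of SDBG_FreeStackData() are assumptions (they are not part of the function list above), so treat this as an outline rather than a complete program.

#include <stdio.h>
#include <stdint.h>
#include <ScalableDebug.h>

int print_stacks(void)
{
    int rc = SDBG_Init();                     /* must be called before any other API function */
    if (rc != SDBG_NO_ERROR)
        return rc;

    /* ... attach to the running job here ... */

    uint32_t numMsgs = 0;
    SDBG_StackMsg_t *msgs = NULL;
    rc = SDBG_GetStackData(&numMsgs, &msgs);
    if (rc == SDBG_NO_ERROR) {
        uint32_t i;
        for (i = 0; i < numMsgs; i++)
            printf("rank %u thread %u: %u stack frames\n",
                   msgs[i].rank, msgs[i].threadId, msgs[i].numStackFrames);
        SDBG_FreeStackData(msgs);             /* assumed argument list */
    }

    SDBG_DetachJob();
    return rc;
}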
Return codes
This section summarizes the following return codes used by the Scalable Debug API:
SDBG_NO_ERROR: No error.
SDBG_TIMEOUT_ERROR: Timeout communicating with front-end process.
SDBG_JOBID_NOT_FOUND: Job ID passed not found.
Structures
This section describes the structures defined by the Scalable Debug API.
The SDBG_StackMsg_t structure contains the data returned when getting stack data. The
following fields are in the structure:
uint32_t node;
Node or pid.
uint32_t rank;
Rank as defined in proctable.
uint32_t threadId;
Thread ID for this node.
uint32_t linkReg;
Current link register.
uint32_t iar;
Current instruction register.
uint32_t currentFrame;
Current stack frame (R1).
uint32_t numStackFrames;
Number of stack frames in stackFramesPtr.
SDBG_StackFrame_t *stackFramesPtr;
Pointer to array of stack frames. This structure is NULL if there is no stack data.
The SDBG_StackFrame_t structure contains the saved frame address and saved link register
when looking at stack data. The following fields are in the structure:
uint32_t frameAddr;
Stack frame address.
uint32_t savedLR;
Saved link register for this stack frame address.
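Expressed as C declarations, the two structures have the following shape; the field names and types are taken from the lists above, and only the typedef form is an assumption.

typedef struct SDBG_StackFrame {
    uint32_t frameAddr;                 /* stack frame address */
    uint32_t savedLR;                   /* saved link register for this stack frame address */
} SDBG_StackFrame_t;

typedef struct SDBG_StackMsg {
    uint32_t node;                      /* node or pid */
    uint32_t rank;                      /* rank as defined in the proctable */
    uint32_t threadId;                  /* thread ID for this node */
    uint32_t linkReg;                   /* current link register */
    uint32_t iar;                       /* current instruction register */
    uint32_t currentFrame;              /* current stack frame (R1) */
    uint32_t numStackFrames;            /* number of stack frames in stackFramesPtr */
    SDBG_StackFrame_t *stackFramesPtr;  /* NULL if there is no stack data */
} SDBG_StackMsg_t;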
Environment variable
One environment variable affects the operation of the Scalable Debug API. If the
MPIRUN_CONFIG_FILE environment variable is set, its value is used as the mpirun
configuration file name. The mpirun configuration file contains the shared secret needed for
the API to authenticate with the mpirun daemon on the service node. If not specified, the
mpirun configuration file is located by looking for these files in order: /etc/mpirun.cfg or
<release-dir>/bin/mpirun.cfg (where <release-dir> is the Blue Gene/P system software
directory, for example, /bgsys/drivers/V1R2M0_200_2008-080513P/ppc).
Example code
Example 9-9 illustrates use of the Scalable Debug API.
Example 9-9 Sample code using the Scalable Debug API
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <ScalableDebug.h>
#include <unistd.h>
#include <attach_bg.h>
Chapter 10.
open()
close()
read()
write()
lseek()
The function name open() is a weak alias that maps to the _libc_open function. The
checkpoint library intercepts this call and provides its own implementation of open() that
internally uses the _libc_open function.
The library maintains a file state table that stores the file name, current file position, and the
mode of all the files that are currently open. The table also maintains a translation that
translates the file descriptors used by the Compute Node Kernel to another set of file
descriptors to be used by the application. While taking a checkpoint, the file state table is also
stored in the checkpoint file. Upon a restart, these tables are read. Also the corresponding
files are opened in the required mode, and the file pointers are positioned at the desired
locations as given in the checkpoint file.
The current design assumes that the programs either always read the file or write the files
sequentially. A read followed by an overlapping write, or a write followed by an overlapping
read, is not supported.
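The interception itself can be sketched in a few lines of C. The helpers record_open_state() and translate_fd() below are hypothetical stand-ins for the file state table logic, and the real library applies the same treatment to close(), read(), write(), and lseek().

#include <fcntl.h>
#include <stdarg.h>
#include <sys/types.h>

extern int _libc_open(const char *path, int flags, ...);   /* underlying entry point */

extern void record_open_state(const char *path, int flags, int kernel_fd);  /* hypothetical */
extern int  translate_fd(int kernel_fd);                                    /* hypothetical */

/* Because open() is a weak alias, this definition overrides it while the
 * original implementation remains reachable through _libc_open(). */
int open(const char *path, int flags, ...)
{
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t) va_arg(ap, int);
        va_end(ap);
    }

    int kernel_fd = _libc_open(path, flags, mode);
    if (kernel_fd < 0)
        return kernel_fd;

    record_open_state(path, flags, kernel_fd);   /* file name, mode, and position */
    return translate_fd(kernel_fd);              /* descriptor seen by the application */
}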
All signals are classified into one of these two categories as shown in Table 10-1. If the signal
must be delivered immediately, the memory state of the application might change, making the
current checkpoint file inconsistent. Therefore, the current checkpoint must be aborted. The
checkpoint routine periodically checks if a signal has been delivered since the current
checkpoint began. In case a signal has been delivered, it aborts the current checkpoint and
returns to the application.
For signals that are to be postponed, the checkpoint handler simply saves the signal
information in a pending signal list. When the checkpoint is complete, the library calls
application handlers for all the signals in the pending signal list. If more than one signal of the
same type is raised while the checkpoint is in progress, the checkpoint library ensures that
the handler registered by the application is called at least once. However, it does not
guarantee in-order-delivery of signals.
Table 10-1 Action taken on signal
Signal name         Signal type     Action to be taken
SIGINT              Critical        Deliver
SIGXCPU             Critical        Deliver
SIGILL              Critical        Deliver
SIGABRT/SIGIOT      Critical        Deliver
SIGBUS              Critical        Deliver
SIGFPE              Critical        Deliver
SIGSTP              Critical        Deliver
SIGSEGV             Critical        Deliver
SIGPIPE             Critical        Deliver
SIGSTP              Critical        Deliver
SIGSTKFLT           Critical        Deliver
SIGTERM             Critical        Deliver
SIGHUP              Non-critical    Postpone
SIGALRM             Non-critical    Postpone
SIGUSR1             Non-critical    Postpone
SIGUSR2             Non-critical    Postpone
SIGTSTP             Non-critical    Postpone
SIGVTALRM           Non-critical    Postpone
SIGPROF             Non-critical    Postpone
SIGPOLL/SIGIO       Non-critical    Postpone
SIGSYS/SIGUNUSED    Non-critical    Postpone
SIGTRAP             Non-critical    Postpone
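The postponement logic described above can be summarized in a short C sketch. The names in_checkpoint, pending, is_critical(), and deliver_to_application() are hypothetical stand-ins for the library's internal state, the classification in Table 10-1, and the handler registered by the application.

#include <signal.h>

#define MAX_SIGNO 64

static volatile sig_atomic_t in_checkpoint;        /* set while a checkpoint is being written */
static volatile sig_atomic_t pending[MAX_SIGNO];   /* non-critical signals seen in the meantime */

extern int  is_critical(int signo);                /* hypothetical: classification per Table 10-1 */
extern void deliver_to_application(int signo);     /* hypothetical: invokes the registered handler */

static void checkpoint_signal_handler(int signo)
{
    if (in_checkpoint && !is_critical(signo)) {
        pending[signo] = 1;                        /* postpone until the checkpoint completes */
        return;
    }
    /* Critical signals are delivered immediately; the checkpoint routine then
     * notices that a signal arrived and aborts the current checkpoint. */
    deliver_to_application(signo);
}

static void flush_pending_signals(void)
{
    int signo;
    for (signo = 1; signo < MAX_SIGNO; signo++) {
        if (pending[signo]) {
            pending[signo] = 0;
            deliver_to_application(signo);         /* at least once; order is not guaranteed */
        }
    }
}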
int BGCheckpoint()
BGCheckpoint() takes a snapshot of the program state at the instant at which it is called. All
the processes of the application must make a call to BGCheckpoint() to take a consistent
global checkpoint.
When a process makes a call to BGCheckpoint(), no outstanding messages should be in the
network or buffers. That is, the recv that corresponds to all the send calls should have
occurred. In addition, after a process has made a call to BGCheckpoint(), other processes
must not send messages to the process until their call to BGCheckpoint() is complete.
Typically, applications are expected to place calls to BGCheckpoint() immediately after a
barrier operation, such as MPI_Barrier(), or after a collective operation, such as
MPI_Allreduce(), when no outstanding message is in the MPI buffers and the network.
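The placement described above can be illustrated with a minimal MPI sketch; the header name checkpoint.h and the checkpoint interval are assumptions, while BGCheckpointInit() and BGCheckpoint() are the calls described in this chapter.

#include <mpi.h>
#include "checkpoint.h"              /* assumed header name for the checkpoint library */

#define CKPT_INTERVAL 100            /* example value: checkpoint every 100 steps */

int main(int argc, char *argv[])
{
    int step;

    MPI_Init(&argc, &argv);
    BGCheckpointInit(NULL);          /* NULL: use the default checkpoint directory rules */

    for (step = 1; step <= 1000; step++) {
        /* ... compute and exchange messages ... */

        MPI_Barrier(MPI_COMM_WORLD);             /* no outstanding messages past this point */
        if (step % CKPT_INTERVAL == 0)
            BGCheckpoint();                      /* consistent global checkpoint */
    }

    MPI_Finalize();
    return 0;
}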
The state that corresponds to each application process is stored in a separate file. The
location of checkpoint files is specified by ckptDirPath in the call to BGCheckpointInit(). If
ckptDirPath is NULL, then the checkpoint file location is decided by the storage rules
mentioned in 10.4, Directory and file-naming conventions on page 175.
<xxx-yyy-zzz>
<seqNo>
The checkpoint sequence number starts at one and is incremented after every successful
checkpoint.
10.5 Restart
A transparent restart mechanism is provided through the use of the BGCheckpointInit()
function and the BG_CHKPTRESTARTSEQNO environment variable. Upon startup, an application is
expected to make a call to BGCheckpointInit(). The BGCheckpointInit() function initializes the
checkpoint library data structures.
Moreover the BGCheckpointInit() function checks for the environment variable
BG_CHKPTRESTARTSEQNO. If the variable is not set, a job launch is assumed, and the function
returns normally. In case the environment variable is set to zero, the individual processes
restart from their individual latest consistent global checkpoint. If the variable is set to a
positive integer, the application is started from the specified checkpoint sequence number.
It is the responsibility of the job launch subsystem to make sure that BG_CHKPTRESTARTSEQNO
corresponds to a consistent global checkpoint. In case BG_CHKPTRESTARTSEQNO is set to zero,
the job launch subsystem must make sure that files with the highest checkpoint sequence
number correspond to a consistent global checkpoint. The behavior of the checkpoint library
is undefined if BG_CHKPTRESTARTSEQNO does not correspond to a global consistent checkpoint.
Usage
BGCheckpointInit(char *ckptDirPath);
BGCheckpoint();
BGCheckpointRestart(int rstartSqNo);
BGCheckpointExcludeRegion(void *addr, size_t len);

Environment variables
BG_CHKPTENABLED
BG_CHKPTDIRPATH
BG_CHKPTRESTARTSEQNO
Chapter 11. mpirun
mpirun is a software utility for launching, monitoring, and controlling programs (applications)
that run on the Blue Gene/P system. mpirun on the Blue Gene/P system serves the same
function as on the Blue Gene/L system.
The name mpirun comes from Message Passing Interface (MPI) because its primary use is
to launch parallel jobs. mpirun can be used as a standalone program by providing parameters
either directly through a command line or from environment variable arguments, or indirectly
through the framework of a scheduler that submits the job on the user's behalf. In the former
case, mpirun can be invoked as a shell command. It allows you to interact with the running job
through the job's standard input, standard output, and standard error. The mpirun software
utility acts as a shadow of the actual IBM Blue Gene/P job by monitoring its status and
providing access to standard input, output, and errors. After the job terminates, mpirun
terminates as well. If the user wants to end the job prematurely, mpirun provides a
mechanism to do so explicitly or through a timeout period.
The mpirun software utility provides the capability to debug the job. In this chapter, we
describe the standalone interactive use of mpirun. We also provide a brief overview of mpirun
on the Blue Gene/P system. In addition, we define a list of APIs that allow interaction with the
mpirun program. These APIs are used by applications, such as external resource managers,
that want to programmatically invoke jobs using mpirun.
We address the following topics in this chapter and provide examples:
Figure 11-1 mpirun interacting with the rest of the control system on the Blue Gene/P system
After mpirun_be is forked, the sequence of events for booting partitions, starting jobs, and
collecting stdout/stderr is similar to the use of mpirun on the Blue Gene/L system.
The freepartition program was integrated as an option in mpirun for the Blue Gene/P
system. Example 11-1 shows how the free option is now used as part of mpirun on the
Blue Gene/P system.
Example 11-1 mpirun example with -free option
$ mpirun -partition N01_32_1 -free wait -verbose 1
<Jul 06 15:10:48.401421> FE_MPI (Info) : Invoking free partition
<Jul 06 15:10:48.414677> FE_MPI (Info) : freePartition() - connected to mpirun server at spinoza
<Jul 06 15:10:48.414768> FE_MPI (Info) : freePartition() - sent free partition request
<Jul 06 15:11:19.202335> FE_MPI (Info) : freePartition() - partition N01_32_1 was freed successfully
<Jul 06 15:11:19.202746> FE_MPI (Info) : == FE completed
==
<Jul 06 15:11:19.202790> FE_MPI (Info) : == Exit status:
0 ==
Also new in mpirun for the Blue Gene/P system is the support for multiple program,
multiple data (MPMD) style jobs, where a different executable, arguments, environment,
and current working directory can be supplied for a single job on a processor set (pset)
basis. For example, with this capability, a user can run four different executables on a
partition with four psets.
This capability is handled by a new tool called mpiexec, which is not to be confused with
the mpiexec style of submitting a Single Program Multiple Data (SPMD) parallel MPI job.
11.1.1 mpiexec
mpiexec is the method for launching and interacting with parallel Multiple Program Multiple
Data (MPMD) jobs on Blue Gene/P. It is very similar to mpirun, except that the set of
arguments supported by mpiexec differs slightly.
Unsupported parameters
The parameters listed in Table 11-1 are supported by mpirun but not supported by mpiexec
because they do not apply to MPMD.
Table 11-1 Unsupported parameters
Parameter     Environment variables
-exe          MPIRUN_EXE
-cwd          MPIRUN_CWD MPIRUN_WDIR
-env          MPIRUN_ENV
-exp_env      MPIRUN_EXP_ENV
-env_all      MPIRUN_EXP_ENV_ALL
-mapfile      MPIRUN_MAPFILE
-args         MPIRUN_ARGS
New parameters
The only parameter that mpiexec supports that is not supported by mpirun is the -configfile
argument. See "mpiexec example" on page 180 for sample usage.
-configfile MPIRUN_MPMD_CONFIGFILE
The MPMD configuration file must end with a newline character.
Limitations
Due to some underlying designs in the Blue Gene/P software stack, when using MPMD, the
following limitations are applicable:
A pset is the smallest granularity for each executable, though one executable can span
multiple psets.
You must use every compute node of each pset; specifically different -np values are not
supported.
The job mode (SMP, DUAL, or VNM) must be uniform across all psets.
mpiexec example
Example 11-2 illustrates running /bin/hostname on a single 32-node pset, helloworld.sh on
another 32-node pset and goodbyeworld.sh on two 32-node psets. The partition bar consists
of 128 nodes, with 4 I/O nodes.
Example 11-2 mpiexec example
11.1.2 mpikill
The mpikill command sends a signal to an MPI job running on the compute nodes. A signal
can cause a job to terminate or an application might catch and handle signals to affect its
behavior.
The format of the mpikill command is:
mpikill [options] <pid> | --job <jobId>
The job to receive the signal can be specified by either the PID of the mpirun process or the
job database ID. The PID can be used only if the mpirun process is on the same system that
mpikill is run on. By default, the signal sent to the job is KILL. Only the user that the job is
running as can signal a job using mpikill. Table 11-2 lists the options that can be used with
the mpikill command.
Table 11-2 mpikill command options
Option
Description
-s <signal> or -SIGNAL
The signal to send to the job. The signal can be a signal name, such as
TERM, or a signal number, such as 15. The default signal is KILL.
-h or --help
--hostname <hostname>
Specifies the Service Node to use. The default is the value of the
MMCS_SERVER_IP environment variable, if that environment variable is
set, or 127.0.0.1.
--port <port>
--trace <0-7>
--config <filename>
mpirun configuration file, which contains the shared secret needed for
mpikill to authenticate with the mpirun daemon on the service node. If not
specified, the mpirun configuration file is located by looking for these files
in order: /etc/mpirun.cfg or <release-dir>/bin/mpirun.cfg (where
<release-dir> is the Blue Gene/P system software directory, for example,
/bgsys/drivers/V1R2M0_200_2008-080513P/ppc).
Example 11-3 illustrates signaling a job running on the same front end node by providing the
PID of the mpirun process.
Example 11-3 Use mpikill to signal a job using the PID of mpirun
Start a mpirun job on FEN 1, using the verbose output to display the job ID. In this case the
job ID is 21203:
FEN1$ mpirun -partition MYPARTITION -verbose 1 sleeper.bg
... -- verbose output
<Jul 06 15:18:10.547452> FE_MPI (Info) : Job added with the following id: 21203
... -- verbose output
On FEN 2, use mpikill to signal the job with SIGINT:
FEN2$ mpikill -INT --job 21203
The job receives the signal and exits, causing the following output from mpirun in shell 1:
... -- verbose output
<Jul 06 15:19:06.745821> BE_MPI (ERROR): The error message in the job record is as
follows:
<Jul 06 15:19:06.745856> BE_MPI (ERROR):
"killed with signal 2"
<Jul 06 15:19:07.106672> FE_MPI (Info) : ==
FE completed
==
<Jul 06 15:19:07.106731> FE_MPI (Info) : == Exit status: 130 ==
bridge.config
Contains locations of the default I/O Node and Compute Node images
when allocating partitions
mpirun.cfg
BGP_MACHINE_SN            BGP
BGP_MLOADER_IMAGE         /bgsys/drivers/ppcfloor/boot/uloader
BGP_CNLOAD_IMAGE          /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/cnk
BGP_IOLOAD_IMAGE          /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
BGP_LINUX_MLOADER_IMAGE   /bgsys/drivers/ppcfloor/boot/uloader
BGP_LINUX_CNLOAD_IMAGE    /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
BGP_LINUX_IOLOAD_IMAGE    /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
BGP_BOOT_OPTIONS
BGP_DEFAULT_CWD           $PWD
BGP_ENFORCE_DRAIN
BGP_DEFAULT_CWD is used for mpirun jobs when a user does not give the -cwd argument or one
of its environment variables. You can change this value to something more site specific, such
as /bgp/users, /gpfs/, and so on. The special keyword $PWD is expanded to the user's current
working directory from where the user executed mpirun.
Challenge protocol
The challenge protocol, which is used to authenticate the mpirun front end when connecting
to the mpirun daemon on the Service Node, is a challenge/response protocol. It uses a
shared secret to create a hash of a random number, thereby verifying that the mpirun front
end has access to the secret.
To protect the secret, it is stored in a configuration file that is accessible
only to the bgpadmin user on the Service Node and to a special mpirun user on the front end
nodes. The front end mpirun binary has its setuid flag enabled so that it can change its uid to
match the mpirun user and read the configuration file to access the secret.
Specifying parameters
You can specify most parameters for the mpirun program in the following different ways:
Command-line arguments
Environment variables
Scheduler interface plug-in
In general, users normally use the command-line arguments and the environment variables.
Certain schedulers use the scheduler interface plug-in to restrict or enable mpirun features
according to their environment. For example, the scheduler might have a policy where
interactive job submission with mpirun is allowed only during certain hours of the day.
Command-line arguments
The mpirun arguments consist of the following categories:
Job control
Block control
Output
Other
Description
-env "ENVVAR=value"
-exp_env <ENVVAR>
-env_all
-np <n>
Creates exactly n MPI ranks for the job. Aliases are -nodes and -n.
Specifies the mode in which the job will run. Choices are SMP (1 rank,
4 threads), DUAL (2 ranks, 2 threads each), or Virtual Node Mode
(4 ranks, 1 thread each).
-exe <executable>
Specifies the full path to the executable to run on the Compute Nodes.
The path is specified as seen by the I/O and Compute Nodes.
-cwd <path>
Specifies the full path to use as the current working directory on the
Compute Nodes. The path is specified as seen by the I/O and
Compute Nodes.
-mapfile <mapfile>
-timeout <n>
a. For additional information about mapping, see Appendix F, Mapping on page 355.
Description
-partition <block>
-nofree
If mpirun booted the block, it does not deallocate the block when the job
is done. This is useful when you want to run a string of jobs
back-to-back on a block but do not want mpirun to boot and deallocate
the block each time (which happens if you did not boot the block first
using the console). When your string of jobs is finally done, use the
freepartition command to deallocate the block.
-free <wait|nowait>
Frees the partition specified with -partition. No job is run. The wait
parameter does not return control until the partition has changed state
to free. The nowait parameter returns control immediately after
submitting the free partition request.
-noallocate
This option is more interesting for job schedulers. It tells mpirun not to
use a block that is not already booted.
-shape <XxYxZ>
-psets_per_bp <n>
Specifies the I/O Node to Compute Node ratio. The default is to use the
best possible ratio of I/O Nodes to Compute Nodes. Specifying a higher
number of I/O Nodes than what is available results in an error.
-connect <MESH|TORUS>
-reboot
-boot_options <options>
Output options
The output options in Table 11-5 on page 186 control information that is sent to STDIN,
STDOUT, and STDERR.
Description
-verbose [0-4]
Sets the verbosity level. The default is 0, which means that mpirun does
not output any status or diagnostic messages unless a severe error
occurs. If you are curious about what is happening, try levels 1 or 2. All
mpirun generated status and error messages appear on STDERR.
-label
Use this option to have mpirun label the source of each line of output. The
source is the MPI rank, and STDERR or STDOUT from which the output
originated.
-enable_tty_reporting
By default, mpirun tells the control system and the C run time on the
Compute Nodes that STDIN, STDOUT, and STDERR are tied to TTY type
devices. While semantically correct for the Blue Gene system, this
prevents blocked I/O to these file descriptors, which can slow down
operations. If you use this option, mpirun senses whether these file
descriptors are tied to TTYs and reports the results accurately to the
control system.
-strace <all|none|n>
Other options
Table 11-6 lists other options. These options provide general information about selected
software and hardware features.
Table 11-6 Other options
Arguments
Description
-h
-version
-host <host_name>
-port <port>
-start_gdbserver <path_to_gdbserver>
-start_tool <path>
-config <path>
Arguments
Description
-nw
-only_test_protocol
Environment variables
-partition
MPIRUN_PARTITION
-nodes
-mode
MPIRUN_MODE
-exe
MPIRUN_EXE
-cwd
MPIRUN_CWD MPIRUN_WDIR
-host
MMCS_SERVER_IP MPIRUN_SERVER_HOSTNAME
-port
MPIRUN_SERVER_PORT
-env
MPIRUN_ENV
-exp_env
MPIRUN_EXP_ENV
-env_all
MPIRUN_EXP_ENV_ALL
-mapfile
MPIRUN_MAPFILE
-args
MPIRUN_ARGS
-timeout
MPIRUN_TIMEOUT
-start_gdbserver
MPIRUN_START_GDBSERVER
-label
MPIRUN_LABEL
-nw
MPIRUN_NW
-nofree
MPIRUN_NOFREE
-noallocate
MPIRUN_NOALLOCATE
-reboot
MPIRUN_REBOOT
-boot_options
MPIRUN_BOOT_OPTIONS MPIRUN_KERNEL_OPTIONS
Arguments
Environment variables
-verbose
MPIRUN_VERBOSE
-only_test_protocol
MPIRUN_ONLY_TEST_PROTOCOL
-shape
MPIRUN_SHAPE
-psets_per_bp
MPIRUN_PSETS_PER_BP
-connect
MPIRUN_CONNECTION
-enable_tty_reporting
MPIRUN_ENABLE_TTY_REPORTING
-config
MPIRUN_CONFIG_FILE
Return code
Description
OK; successful
10
Communication error
11
12
Return code
Description
13
14
15
16
Failed to get the machine serial number (bridge configuration file not found?)
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
Failed while checking to see if the user is in the partitions user list
33
A user does not have permission to run the job on the specified partition
34
35
36
Kernel options were specified but the partition is not in a FREE state
37
38
39
40
41
42
43
44
45
46
Return code
Description
47
An error occurred while mpirun was waiting for the job to terminate
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
Unexpected message
69
70
Out of memory
71
11.7 Examples
In this section, we present various examples of mpirun commands.
Display information
Example 11-6 shows how to display information using the -h flag.
Example 11-6 Invoking mpirun -h or -help to list all the options available
$ mpirun -h
Usage:
mpirun [options]
or
mpirun [options] binary [arg1 arg2 ... argn]
Options:
-h
-version
-partition <partition_id>
-np <compute_nodes>
-mode <SMP|DUAL|VN>
-exe <binary>
-cwd <path>
-host <service_node_host>
-port <service_node_port>
-env <env=val>
-exp_env <env vars>
-env_all
-mapfile <mapfile|mapping>
-args <"<arguments>">
-timeout <seconds>
-start_gdbserver <path>
-label
-nw
-nofree
-free <wait|nowait>
-noallocate
-reboot
-backend
-boot_options <options>
-verbose <0|1|2|3|4>
-trace <0-7>
-only_test_protocol
-strace <all|none|n>
-shape <XxYxZ>
-psets_per_bp <n>
-connect <TORUS|MESH>
-enable_tty_reporting
-config <path>
Using dynamic allocation
With dynamic allocation, mpirun itself allocates a partition if enough resources are found. Upon
job completion, mpirun deallocates the partition if the user has not specified -nofree.
Example 11-7 Dynamic allocation
Using -psets_per_bp
Example 11-8 illustrates the usage of -psets_per_bp. The number of psets per base partition
is defined in the db.properties file. The value can be overridden with the -psets_per_bp
option.
Example 11-8 psets_per_bp
$ mpirun -partition N00_32_1 -np 32 -mode SMP -cwd /bgusr/cpsosa -exe a.out -env
OMP_NUM_THREADS=4
<Aug 11 15:33:44.021233> FE_MPI (WARN) : SignalHandler() - ! occupied by this job. This might
take a while... !
<Aug 11 15:33:44.021261> FE_MPI (WARN) : SignalHandler() !------------------------------------------------!
<Aug 11 15:33:44.021276> FE_MPI (WARN) : SignalHandler()
<Aug 11 15:33:44.050365> BE_MPI (WARN) : Received a message from frontend
<Aug 11 15:33:44.050465> BE_MPI (WARN) : Execution of the current command interrupted
<Aug 11 15:33:59.532817> FE_MPI (ERROR): Failure list:
<Aug 11 15:33:59.532899> FE_MPI (ERROR):
- 1. Execution interrupted by signal (failure #71)
dd2sys1fen3:~/bgp/control/mpirun/new>
<Oct 18 13:22:10.804794> FE_MPI (WARN) : SignalHandler() - ! to terminate the job and to free
the resources !
<Oct 18 13:22:10.804818> FE_MPI (WARN) : SignalHandler() - ! occupied by this job. This might
take a while... !
<Oct 18 13:22:10.804841> FE_MPI (WARN) : SignalHandler() !------------------------------------------------!
<Oct 18 13:22:10.804865> FE_MPI (WARN) : SignalHandler()
131072
320
357.97
357.97
357.97
698.38
<Oct 18 13:21:10.936378> BE_MPI (WARN) : Received a message from frontend
<Oct 18 13:21:10.936449> BE_MPI (WARN) : Execution of the current command interrupted
<Oct 18 13:21:16.140631> BE_MPI (ERROR): The error message in the job record is as follows:
<Oct 18 13:21:16.140678> BE_MPI (ERROR):
"killed with signal 9"
<Oct 18 13:22:16.320232> FE_MPI (ERROR): Failure list:
<Oct 18 13:22:16.320406> FE_MPI (ERROR):
- 1. Execution interrupted by signal (failure #71)
Chapter 12. High-Throughput Computing (HTC) paradigm
In Chapter 7, Parallel paradigms on page 65, we described the High-Performance
Computing (HPC) paradigms. Applications that run in an HPC environment make use of the
network to share data among MPI tasks. In other words, the MPI tasks are tightly coupled.
In this chapter we describe a paradigm that complements the HPC environment. This mode
of running applications emphasizes IBM Blue Gene/P capacity. An application runs loosely
coupled; that is, multiple instances of the application do not require data communication.
The concept of High-Throughput Computing (HTC) has been defined by Condor and others
(see the following URL):
http://www.cs.wisc.edu/condor/htc.html
In this chapter we cover the implementation of HTC as part of Blue Gene/P functionality. We
provide an overview of how HTC is implemented and how applications can take advantage of
it. We cover the following topics:
HTC design
Booting a partition in HTC mode
Running a job using submit
Checking HTC mode
submit API
Altering the HTC partition user list
For a more detailed description of HTC, and how it is integrated into the control system of
Blue Gene/P, see IBM System Blue Gene Solution: Blue Gene/P System Administration,
SG24-7417.
mmcs_db_console
Bridge APIs
Table 12-1 contains the available options for the submit command.
Table 12-1 Options available for the submit command
Job options (and syntax)                  Description
-exe <exe>                                Executable to run.
-env <env=value>
-exp_env <env>
-env_all
-cwd <cwd>
-timeout <seconds>
-strace
-start_gdbserver <path>

Resource options
-mode <SMP or DUAL or VN or LINUX_SMP>
-location <Rxx-Mx-Nxx-Jxx-Cxx>
-pool <id>

General options
-port <port>
-enable_tty_reporting
-raise
Environment variables
Some arguments have a corresponding environment variable. If both an environment variable
and an argument are given, precedence is given to the argument:
--pool SUBMIT_POOL
--cwd SUBMIT_CWD
--port SUBMIT_PORT
Using submit
This section provides selected examples on how to invoke the command and how to use the
location and pool arguments.
location argument
The --location argument requests a specific compute core location to run the job; the syntax is
in the form of RXX-MX-NXX-JXX-CXX. The rack numbers (RXX) can range between R00 and
RFF. The midplane numbers (MX) can range between M0 (bottom) and M1 (top). The node
card numbers (NXX) can range between N00 and N15. The compute card numbers (JXX) can
range between J04 and J35. Note that J00 and J01 are I/O nodes, and J02 and J03 are
unused. The compute core numbers (CXX) can range between C00 and C03. Note that C00
is valid for SMP, DUAL, and VN modes. Core C01 is valid only for VN mode. Core C02 is valid for
VN and DUAL modes. Core C03 is valid only for VN mode.
The --location argument is combined with your user ID and --mode argument to find an
available location to run the job. If any of these parameters do not match the list of what is
available, the job is not started and an error message is returned. See Example 12-1.
It is also possible to omit a portion of the location. If the core is omitted (for example,
--location R00-M0-N14-J09), one of the cores in the compute node is chosen. If the compute
card is omitted (for example, --location R00-M0-N14), a core on a compute node on the node
card is chosen.
Example 12-1 Requesting specific location
If the location you request is busy (job already running), you see an error message (like that
shown in Example 12-2).
Example 12-2 Job already running
If the location you request was booted in a mode different than the --mode argument you give,
you see an error message (Example 12-3).
Example 12-3 Node mode conflict
$ whoami
bgpadmin
$ submit --cwd /bgusr/tests --exe hello --location R00-M0-N14-J09-C00 --mode vn
pool argument
A pool is a collection of compute nodes and is represented by an ID just as a partition is. A
pool consists of one or more partitions. By default, each partitions pool ID is its partition ID.
Outside the framework of a job scheduler, this should always be the case. Thus,
Example 12-5 shows how to run a job on any available compute node in partition
CHEMISTRY.
Example 12-5 Pool argument
If no compute nodes are available, an error message is displayed as shown in Example 12-5.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>
#include <spi/kernel_interface.h>

int
main()
{
    // get our personality
    _BGP_Personality_t pers;
    if (Kernel_GetPersonality(&pers, sizeof(pers)) == -1) {
        fprintf(stderr, "could not get personality\n");
        exit(EXIT_FAILURE);
    }

    // check HTC mode
    if (pers.Kernel_Config.NodeConfig & _BGP_PERS_ENABLE_HighThroughput) {
        // do something HTC specific
    } else {
        // do something else
    }

    return EXIT_SUCCESS;
}
Part 4. Job scheduler interfaces
In this part, we provide information about the job scheduler APIs:
Chapter 13, Control system (Bridge) APIs on page 209
Chapter 14, Real-time Notification APIs on page 251
Chapter 15, Dynamic Partition Allocator APIs on page 295
Chapter 13. Control system (Bridge) APIs
API requirements
APIs
Small partition allocation
API examples
All required include files are installed in the /bgsys/drivers/ppcfloor/include directory. See
Appendix C, Header files and libraries on page 335 for additional information about
include files. The include file for the Bridge APIs is rm_api.h.
The Bridge APIs support 64-bit applications that use dynamic linking with shared objects.
The required library files are installed in the /bgsys/drivers/ppcfloor/lib64 directory.
The shared object for linking to the Bridge APIs is libbgpbridge.so. The libbgpbridge.so
library has dependencies on other libraries that are included with the Blue Gene/P
software, including:
libbgpconfig.so
libbgpdb.so
libsaymessage.so
libtableapi.so
These files are installed with the standard system installation procedure. They are
contained in the bgpbase.rpm file.
The requirements for writing programs to the Bridge APIs are explained in the following
sections.
Environment variable    Required    Description
DB_PROPERTY             Yes         Specifies the location of the db.properties file.
BRIDGE_CONFIG           Yes         Specifies the location of the bridge.config file.
BRIDGE_DUMP_XML         No          When set to any value, this variable causes the Bridge APIs
                                    to dump in-memory XML streams to files in /tmp for debugging.
                                    When this variable is not set, the Bridge APIs do not dump
                                    in-memory XML streams.
For more information about the db.properties and bridge.config files, see IBM System Blue
Gene Solution: Blue Gene/P System Administration, SG24-7417.
// Assumed setup for this example: declare the variables used below and
// retrieve the list of all partitions with rm_get_partitions() (see 13.2).
rm_partition_list_t *bgp_part_list = NULL;
rm_partition_t *bgp_part = NULL;
int list_size = 0;
status_t stat = rm_get_partitions(PARTITION_ALL_FLAG, &bgp_part_list);
if (stat != STATUS_OK) {
// Do some error handling here...
return;
}
// How much data (# of partitions) did we get back?
rm_get_data(bgp_part_list, RM_PartListSize, &list_size);
for (int i = 0; i < list_size; i++) {
// If this is the first time through, use RM_PartListFirstPart
if (i == 0){
rm_get_data(bgp_part_list, RM_PartListFirstPart, &bgp_part);
}
// Otherwise, use RM_PartListNextPart
else {
rm_get_data(bgp_part_list, RM_PartListNextPart, &bgp_part);
}
}
// Make sure we free the memory when finished
stat = rm_free_partition_list(bgp_part_list);
if (stat != STATUS_OK) {
// Do some error handling here...
return;
}
13.2 APIs
In the following sections, we provide details about the APIs.
INCOMPATIBLE_STATE: The state of the partition or job prohibits the specific action. See
Figure 13-1 on page 221, Figure 13-2 on page 226, Figure 13-3 on page 227, and
Figure 13-4 on page 228 for state diagrams.
INCONSISTENT_DATA: The data retrieved from the control system is not valid.
INTERNAL_ERROR: Errors that do not belong to any of the previously listed categories, for
example, a memory allocation problem or a failure during the manipulation of internal XML
streams.
INTERNAL_ERROR
status_t rm_get_data(rm_element_t *rme, enum RMSpecification spec, void *
result);
This function returns the content of the requested field from a valid rm_element_t (Blue
Gene/P object, base partition object, wire object, switch object, and so on). The
specifications that are available when using rm_get_data() are listed in 13.2.8, Field
specifications for the rm_get_data() and rm_set_data() APIs on page 229, and are
grouped by the object type that is being accessed.
The following return codes are possible:
STATUS_OK
INVALID_INPUT
INTERNAL_ERROR
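For illustration, a minimal sketch that reads two fields from a partition object; the partition ID
MYPART is hypothetical, the specifications are taken from 13.2.8, and the final free call is an
assumption:

rm_partition_t *part = NULL;
if (rm_get_partition("MYPART", &part) == STATUS_OK) {
    rm_partition_state_t state;
    char *owner = NULL;                   // RM_PartitionUserName requires free()
    rm_get_data(part, RM_PartitionState, &state);
    rm_get_data(part, RM_PartitionUserName, &owner);
    free(owner);
    rm_free_partition(part);              // assumption: matching free call for the partition object
}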
status_t rm_get_nodecards(rm_bp_id_t bpid, rm_nodecard_list_t **nc_list);
This function returns all node cards in the specified base partition.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INCONSISTENT_DATA
The base partition was not found.
INTERNAL_ERROR
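A minimal sketch of calling rm_get_nodecards(); the base partition ID R00-M0 follows the
location format used elsewhere in this book, and the free call for the list is an assumption:

rm_nodecard_list_t *nc_list = NULL;
if (rm_get_nodecards("R00-M0", &nc_list) == STATUS_OK) {
    int nc_count = 0;
    rm_get_data(nc_list, RM_NodeCardListSize, &nc_count);
    // iterate with RM_NodeCardListFirst and RM_NodeCardListNext
    rm_free_nodecard_list(nc_list);       // assumption: matching free call for the node card list
}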
INTERNAL_ERROR
status_t rm_set_serial(rm_serial_t serial);
This function sets the machine serial number to be used in all the API calls following this
call. The database can contain more than one machine. Therefore, it is necessary to
specify which machine to work with.
The following return codes are possible:
STATUS_OK
INVALID_INPUT
BP_NOT_FOUND:
One or more of the base partitions in the rm_partition_t structure does not exist.
SWITCH_NOT_FOUND:
One or more of the switches in the rm_partition_t structure does not exist.
INTERNAL_ERROR
status_t rm_add_part_user (pm_partition_id_t partition_id, const char *user);
This function adds a new user to the partition. If a partition is in the free state, any user can
add users. If the partition is in any other state, only the partition's owner can add users.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INVALID_INPUT:
partition_id is NULL or the length exceeds the limitations of the control system.
user is NULL or the length exceeds the limitations of the control system.
user is already defined as the partition's user.
INTERNAL_ERROR
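As a short illustration (the partition name MYPART and user name alice are hypothetical):

status_t st = rm_add_part_user("MYPART", "alice");
if (st != STATUS_OK) {
    // handle CONNECTION_ERROR, INVALID_INPUT, or INTERNAL_ERROR here
}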
status_t rm_assign_job(pm_partition_id_t partition_id, db_job_id_t jid);
This function assigns a job to a partition. A job can be created and simultaneously
assigned to a partition by calling rm_add_job() with a partition ID. If a job is created and
not assigned to a specific partition, it can be assigned later by calling rm_assign_job().
Note: rm_assign_job() is not supported for HTC jobs.
PARTITION_NOT_FOUND
INCOMPATIBLE_STATE
The current state of the partition prohibits its creation. See Figure 13-1 on page 221.
INTERNAL_ERROR
status_t pm_destroy_partition(pm_partition_id_t partition_id);
This function shuts down a currently booted partition and updates the database
accordingly.
Note: This API is asynchronous. Control returns to your application before the
operation requested is complete.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INVALID_INPUT
partition_id is NULL or the length exceeds the limitations of the control system.
PARTITION_NOT_FOUND
INCOMPATIBLE_STATE
The state of the partition prohibits its destruction. See Figure 13-1 on page 221.
INTERNAL_ERROR
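For illustration, a minimal sketch of destroying a partition; MYPART is a hypothetical partition ID:

// pm_destroy_partition() is asynchronous, so poll the partition state if the
// caller needs to know when deallocation has completed.
status_t st = pm_destroy_partition("MYPART");
if (st == STATUS_OK) {
    // the partition moves through RM_PARTITION_DEALLOCATING back to
    // RM_PARTITION_FREE (see Figure 13-1)
}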
status_t rm_get_partition(pm_partition_id_t partition_id, rm_partition_t **p);
This function retrieves a partition, according to its ID.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INVALID_INPUT
partition_id is NULL or the length exceeds the limitations of the control system.
PARTITION_NOT_FOUND
INCONSISTENT_DATA
The base partition or switch list of the partition is empty.
INTERNAL_ERROR
status_t rm_get_partitions(rm_partition_state_t_flag_t flag,
rm_partition_list_t **part_list);
This function is useful for status reports and diagnostics. It returns a list of partitions
whose current state matches the flag. The possible flags are contained in the rm_api.h
include file and listed in Table 13-2 on page 218. You can use OR on these values to
create a flag for including partitions with different states.
Table 13-2 Partition state flags
Flag                           Value
PARTITION_FREE_FLAG            0x01
PARTITION_CONFIGURING_FLAG     0x02
PARTITION_READY_FLAG           0x04
PARTITION_DEALLOCATING_FLAG    0x10
PARTITION_ERROR_FLAG           0x20
PARTITION_REBOOTING_FLAG       0x40
PARTITION_ALL_FLAG             0xFF
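For example, a minimal sketch that combines two of the flags above to list partitions that are
booted or currently booting:

rm_partition_list_t *plist = NULL;
status_t st = rm_get_partitions(PARTITION_READY_FLAG | PARTITION_CONFIGURING_FLAG, &plist);
if (st == STATUS_OK) {
    int n = 0;
    rm_get_data(plist, RM_PartListSize, &n);
    // iterate with RM_PartListFirstPart and RM_PartListNextPart as in the
    // partition list example earlier in this chapter
    rm_free_partition_list(plist);
}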
partition_id is NULL, or the length exceeds the limitations of the control system.
The value for the modify_option parameter is not valid.
PARTITION_NOT_FOUND
INCOMPATIBLE_STATE
The partition's current state forbids its modification. See Figure 13-1 on page 221.
INTERNAL_ERROR
status_t pm_reboot_partition(pm_partition_id_t partition_id);
This function sends a request to reboot a partition and update the resulting status in the
database.
Note: This API is asynchronous. Control returns to your application before the
operation requested is complete.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INVALID_INPUT
partition_id is NULL, or the length exceeds the limitations of the control system.
This API is not supported for HTC partitions.
PARTITION_NOT_FOUND
INCOMPATIBLE_STATE
The partition's current state forbids it from being rebooted. See Figure 13-1 on page 221.
INTERNAL_ERROR
status_t rm_release_partition(pm_partition_id_t partition_id);
This function is the opposite of rm_assign_job() because it releases the partition from all
jobs. Only jobs that are in an RM_JOB_IDLE state have their partition reference removed.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INVALID_INPUT
partition_id is NULL, or the length exceeds the limitations of the control system
(configuration parameter).
PARTITION_NOT_FOUND
INCOMPATIBLE_STATE
The current state of one or more jobs assigned to the partition prevents this release.
See Figure 13-1 on page 221 and Figure 13-2 on page 226.
INTERNAL_ERROR
status_t rm_remove_partition(pm_partition_id_t partition_id);
This function removes the specified partition record from MMCS.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INVALID_INPUT
partition_id is NULL, or the length exceeds the limitations of the control system
(configuration parameter).
PARTITION_NOT_FOUND
INCOMPATIBLE_STATE
The partition's current state forbids its removal. See Figure 13-1 on page 221 and
Figure 13-2 on page 226.
INTERNAL_ERROR
status_t rm_remove_part_user(pm_partition_id_t partition_id, const char *user);
This function removes a user from a partition. Removing a user from a partition can be
done only by the partition owner. A user can be removed from a partition that is in any
state. Once an HTC partition is booted, this API can still be used, but the submit server
daemon running on the service node ignores the removal; users removed after boot are
still allowed to run jobs on the partition.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INVALID_INPUT
partition_id is NULL, or the length exceeds the limitations of the control system
(configuration parameter).
user is NULL, or the length exceeds the limitations of the control system.
user is already defined as the partition's user.
Current user is not the partition owner.
INTERNAL_ERROR
status_t rm_set_part_owner(pm_partition_id_t partition_id, const char *user);
This function sets the new owner of the partition. Changing the partition's owner can be
done only to a partition in the RM_PARTITION_FREE state.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INVALID_INPUT
partition_id is NULL, or the length exceeds the limitations of the control system
(configuration parameter).
owner is NULL, or the length exceeds the limitations of the control system.
INTERNAL_ERROR
status_t rm_get_htc_pool(pm_pool_id_t pid, rm_partition_list_t **p)
This function is useful for status reports and diagnostics. It returns a list of partitions
whose HTC pool id matches the parameter.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INCONSISTENT_DATA
At least one of the partitions has an empty base partition list.
INTERNAL_ERROR
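As a minimal sketch, reusing the pool name CHEMISTRY from Chapter 12 as a hypothetical HTC
pool ID:

rm_partition_list_t *pool_parts = NULL;
if (rm_get_htc_pool("CHEMISTRY", &pool_parts) == STATUS_OK) {
    int n = 0;
    rm_get_data(pool_parts, RM_PartListSize, &n);
    // inspect or iterate the returned partition list here
    rm_free_partition_list(pool_parts);
}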
Figure 13-1 Partition state diagram (states RM_PARTITION_FREE, RM_PARTITION_CONFIGURING,
RM_PARTITION_REBOOTING, RM_PARTITION_READY, RM_PARTITION_DEALLOCATING, and
RM_PARTITION_ERROR; transitions driven by rm_add_partition(), rm_remove_partition(),
pm_create_partition(), pm_reboot_partition(), and pm_destroy_partition())
JOB_ALREADY_DEFINED
A job with the same name already exists.
INTERNAL_ERROR
status_t jm_attach_job(db_job_id_t jid);
This function initiates the spawn of debug servers to a job in the RM_JOB_LOADED state.
Note: jm_attach_job() is not supported for HTC jobs.
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INCOMPATIBLE_STATE
The job's state prevents it from being attached. See Figure 13-2 on page 226.
INTERNAL_ERROR
status_t jm_begin_job(db_job_id_t jid);
This function begins a job that is already loaded.
Note: jm_begin_job() is not supported for HTC jobs.
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INCOMPATIBLE_STATE
The job's state prevents it from beginning. See Figure 13-2 on page 226.
INTERNAL_ERROR
status_t jm_cancel_job(db_job_id_t jid);
This function sends a request to cancel the job identified by the jid parameter.
Note: This API is asynchronous. Control returns to your application before the
operation requested is complete.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INCOMPATIBLE_STATE
The job's state prevents it from being canceled. See Figure 13-2 on page 226.
INTERNAL_ERROR
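As a minimal sketch (the job ID 10030 is reused from the filtering example in Chapter 14, and
db_job_id_t is assumed to be an integer type):

status_t st = jm_cancel_job(10030);
if (st == STATUS_OK) {
    // the request is asynchronous; the job moves to RM_JOB_DYING and then to
    // RM_JOB_TERMINATED (see Figure 13-2)
}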
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INCOMPATIBLE_STATE
The job's state prevents it from being debugged. See Figure 13-2 on page 226.
INTERNAL_ERROR
status_t rm_get_job(db_job_id_t jid, rm_job_t **job);
This function retrieves the specified job object.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INTERNAL_ERROR
Flag                    Value
JOB_IDLE_FLAG           0x001
JOB_STARTING_FLAG       0x002
JOB_RUNNING_FLAG        0x004
JOB_TERMINATED_FLAG     0x008
JOB_ERROR_FLAG          0x010
JOB_DYING_FLAG          0x020
JOB_DEBUG_FLAG          0x040
JOB_LOAD_FLAG           0x080
JOB_LOADED_FLAG         0x100
JOB_BEGIN_FLAG          0x200
JOB_ATTACH_FLAG         0x400
JOB_KILLED_FLAG         0x800
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INCOMPATIBLE_STATE
The job's state prevents it from being loaded. See Figure 13-2 on page 226.
INTERNAL_ERROR
status_t rm_query_job(db_job_id_t db_job_id, MPIR_PROCDESC **proc_table, int *
proc_table_size);
This function fills the proc_table with information about the specified job.
Note: rm_query_job() is not supported for HTC jobs.
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INTERNAL_ERROR
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INCOMPATIBLE_STATE
The job's state prevents its removal. See Figure 13-2 on page 226.
INTERNAL_ERROR
status_t jm_signal_job(db_job_id_t jid, rm_signal_t signal);
This function sends a request to signal the job identified by the jid parameter.
Note: This API is asynchronous. Control returns to your application before the
operation requested is complete.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INCOMPATIBLE_STATE
The job's state prevents it from being signaled.
INTERNAL_ERROR
status_t jm_start_job(db_job_id_t jid);
This function starts the job identified by the jid parameter. Note that the partition
information is referenced from the job record in MMCS.
Note: jm_start_job() is not supported for HTC jobs.
Note: This API is asynchronous. Control returns to your application before the
operation requested is complete.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
JOB_NOT_FOUND
INCOMPATIBLE_STATE
The job's state prevents its execution. See Figure 13-2 on page 226.
INTERNAL_ERROR
status_t rm_get_filtered_jobs(rm_job_filter_t query_parms, rm_job_list_t **job_list);
This function returns a list of jobs whose attributes or states (or both) match the fields
specified in the filter provided in the rm_job_filter object.
The following return codes are possible:
STATUS_OK
CONNECTION_ERROR
INTERNAL_ERROR
Figure 13-2 Job state diagram for running a Blue Gene/P job (states RM_JOB_IDLE,
RM_JOB_STARTING, RM_JOB_RUNNING, RM_JOB_DYING, RM_JOB_TERMINATED, and
RM_JOB_ERROR; transitions driven by rm_add_job(), jm_start_job(), and jm_cancel_job())
Figure 13-3 illustrates the main states that a job goes through when debugging a new job.
Figure 13-3 Job state diagram for debugging a new job (states RM_JOB_IDLE, RM_JOB_LOAD,
RM_JOB_LOADED, RM_JOB_ATTACH, RM_JOB_BEGIN, RM_JOB_RUNNING, RM_JOB_DYING,
RM_JOB_TERMINATED, and RM_JOB_ERROR; transitions driven by rm_add_job(), jm_load_job(),
jm_attach_job(), jm_begin_job(), and jm_cancel_job())
Figure 13-4 illustrates the states a job goes through when debugging an already running job.
Figure 13-4 Job state diagram for debugging an already running job (same states and transitions
as in Figure 13-3)
Figure 13-5 illustrates the states that a job goes through during its life cycle in HTC mode. It
also illustrates that the submit command is required.
13.2.8 Field specifications for the rm_get_data() and rm_set_data() APIs
Table 13-4 Values retrieved from a Blue Gene object using rm_get_data()
Specification     Argument type
RM_BPsize         rm_size3D_t *
RM_Msize          rm_size3D_t *
RM_BPNum          int *
RM_FirstBP        rm_BP_t **
RM_NextBP         rm_BP_t **
RM_SwitchNum      int *
RM_FirstSwitch    rm_switch_t **
RM_NextSwitch     rm_switch_t **
RM_WireNum        int *
RM_FirstWire      rm_wire_t **
RM_NextWire       rm_wire_t **
Table 13-5 Values retrieved from a base partition object using rm_get_data()
Specification              Argument type                   Notes
RM_BPID                    rm_bp_id_t *                    free required
RM_BPState                 rm_BP_state_t *
RM_BPStateSeqID            rm_sequence_id_t *
RM_BPLoc                   rm_location_t *
RM_BPPartID                pm_partition_id_t *             free required. If no partition is associated, NULL is returned.
RM_BPPartState             rm_partition_state_t *
RM_BPStateSeqID            rm_sequence_id_t *
RM_BPSDB                   int *                           0=No, 1=Yes
RM_BPSD                    int *                           0=No, 1=Yes
RM_BPComputeNodeMemory     rm_BP_computenode_memory_t *
RM_BPAvailableNodeCards    int *
RM_BPNumberIONodes         int *
Table 13-6 shows the values that are set in the base partition object using rm_set_data().
Table 13-6 Values set in a base partition object using rm_set_data()
Specification    Argument type    Notes
RM_BPID          rm_bp_id_t       free required
Table 13-7 Values retrieved from a node card list object using rm_get_data()
Specification           Argument type
RM_NodeCardListSize     int *
RM_NodeCardListFirst    rm_nodecard_t **
RM_NodeCardListNext     rm_nodecard_t **
Table 13-8 Values retrieved from a node card object using rm_get_data()
Specification                Argument type              Notes
RM_NodeCardID                rm_nodecard_id_t *         free required; possible values: N00..N15
RM_NodeCardQuarter           rm_quarter_t *
RM_NodeCardState             rm_nodecard_state_t *
RM_NodeCardStateSeqID        rm_sequence_id_t *
RM_NodeCardIONodes           int *
RM_NodeCardPartID            pm_partition_id_t *        free required. If no partition is associated, NULL is returned.
RM_NodeCardPartState         rm_partition_state_t *
RM_NodeCardPartStateSeqID    rm_sequence_id_t *
RM_NodeCardSDB               int *                      0=No, 1=Yes
RM_NodeCardIONodeNum         int *
RM_NodeCardFirstIONode       rm_ionode_t **
RM_NodeCardNextIONode        rm_ionode_t **
Table 13-9 shows the values that are set in a node card object when using rm_set_data().
Table 13-9 Values set in a node card object using rm_set_data()
Specification             Argument type
RM_NodeCardID             rm_nodecard_id_t
RM_NodeCardIONodeNum      int *
RM_NodeCardFirstIONode    rm_ionode_t *
RM_NodeCardNextIONode     rm_ionode_t *
Table 13-10 Values retrieved from an I/O Node object using rm_get_data()
Specification              Argument type             Notes
RM_IONodeID                rm_ionode_id_t *          Possible values: J00, J01; free required
RM_IONodeNodeCardID        rm_nodecard_id_t *        Possible values: N00..N15; free required
RM_IONodeIPAddress         char **                   IP address; free required
RM_IONodeMacAddress        char **                   MAC address; free required
RM_IONodePartID            pm_partition_id_t *       free required. If no partition is associated with this I/O Node, NULL is returned.
RM_IONodePartState         rm_partition_state_t *
RM_IONodePartStateSeqID    rm_sequence_id_t *
Table 13-11 shows the values that are set in an I/O Node object by using rm_set_data().
Table 13-11 Values set in an I/O Node object using rm_set_data()
Specification    Argument type
RM_IONodeID      rm_ionode_id_t
Switch object
The switch object (rm_switch_t) represents a switch in the Blue Gene/P system. The switch
object is retrieved from the following specifications:
The Blue Gene object using the RM_FirstSwitch and RM_NextSwitch specifications
The partition object using the RM_PartitionFirstSwitch and RM_PartitionNextSwitch
specifications
Table 13-12 shows the values that are retrieved from a switch object using rm_get_data().
Table 13-12 Values retrieved from a switch object using rm_get_data()
Specification               Argument type           Notes
RM_SwitchID                 rm_switch_id_t *        Switch identifier; free required
RM_SwitchBPID               rm_BP_id_t *            free required
RM_SwitchState              rm_switch_state_t *     Switch state
RM_SwitchStateSeqID         rm_sequence_id_t *
RM_SwitchDim                rm_dimension_t *        Switch dimension; values: RM_DIM_X, RM_DIM_Y, RM_DIM_Z
RM_SwitchConnNum            int *                   A connection is a pair of ports that are connected internally in the switch.
RM_SwitchFirstConnection    rm_connection_t *
RM_SwitchNextConnection     rm_connection_t *
Table 13-13 shows the values that are set in a switch object using rm_set_data().
Table 13-13 Values set in a switch object using rm_set_data()
Specification               Argument type        Notes
RM_SwitchID                 rm_switch_id_t *     Switch identifier
RM_SwitchConnNum            int *                A connection is a pair of ports that are connected internally in the switch.
RM_SwitchFirstConnection    rm_connection_t *
RM_SwitchNextConnection     rm_connection_t *
Wire object
The wire object (rm_wire_t) represents a wire in the Blue Gene/P system. The wire object is
retrieved from the Blue Gene/P object using the RM_FirstWire and RM_NextWire
specifications. See Table 13-14 on page 235.
Table 13-14 Values retrieved from a wire object using rm_get_data()
Specification            Argument type             Notes
RM_WireID                rm_wire_id_t *            Wire identifier; free required
RM_WireState             rm_wire_state_t *         Wire state
RM_WireFromPort          rm_port_t **              Source port
RM_WireToPort            rm_port_t **              Destination port
RM_WirePartID            pm_partition_id_t *       free required. If no partition is associated, NULL is returned.
RM_WirePartState         rm_partition_state_t *
RM_WirePartStateSeqID    rm_sequence_id_t *
Port object
The port object (rm_port_t) represents a port for a switch in the Blue Gene. The port object is
retrieved from the wire object using the RM_WireFromPort and RM_WireToPort specifications.
See Table 13-15.
Table 13-15 Values retrieved from a port object using rm_get_data()
Specification         Argument type          Notes
RM_PortComponentID    rm_component_id_t *    free required
RM_PortID             rm_port_id_t *         Port identifier
Table 13-16 Values retrieved from a partition list object using rm_get_data()
Specification           Argument type
RM_PartListSize         int *
RM_PartListFirstPart    rm_partition_t **
RM_PartListNextPart     rm_partition_t **
Partition object
The partition object (rm_partition_t) represents a partition that is defined in the Blue Gene
system. The partition object is retrieved from the partition list object using the
RM_PartListFirstPart and RM_PartListNextPart specifications. A new partition object is
created using the rm_new_partition() API. After setting the appropriate fields in a new
partition object, the partition can be added to the system using the rm_add_partition() API.
See Table 13-17 on page 236.
Table 13-17 Values retrieved from a partition object using rm_get_data()
Specification                Argument type              Notes
RM_PartitionID               pm_partition_id_t *        Partition identifier; free required
RM_PartitionState            rm_partition_state_t *     Partition state
RM_PartitionStateSeqID       rm_sequence_id_t *
RM_PartitionConnection       rm_connection_type_t *     Values: TORUS or MESH
RM_PartitionDescription      char **                    Partition description; free required
RM_PartitionSmall            int *                      0=No, 1=Yes
RM_PartitionPsetsPerBP       int *
RM_PartitionJobID            int *                      If no job is currently on the partition, 0 is returned; for HTC partitions it always returns 0 even when HTC jobs are running.
RM_PartitionUserName         char **                    Partition owner; free required
RM_PartitionOptions          char **                    Partition options; free required
RM_PartitionMloaderImg       char **                    free required
RM_PartitionCnloadImg        char **                    free required
RM_PartitionIoloadImg        char **                    free required
RM_PartitionBPNum            int *
RM_PartitionFirstBP          rm_BP_t **
RM_PartitionNextBP           rm_BP_t **
RM_PartitionSwitchNum        int *
RM_PartitionFirstSwitch      rm_switch_t **
RM_PartitionNextSwitch       rm_switch_t **
RM_PartitionNodeCardNum      int *
RM_PartitionFirstNodeCard    rm_nodecard_t **
RM_PartitionNextNodeCard     rm_nodecard_t **
RM_PartitionUsersNum         int *
RM_PartitionFirstUser        char **                    free required
RM_PartitionNextUser         char **                    free required
RM_PartitionHTCPoolID        pm_pool_id_t *             Value will be NULL for an HPC partition; free required
RM_PartitionSize             int *
RM_PartitionBootOptions      char **                    Boot options; free required
Table 13-18 shows the values that are set in a partition object using rm_set_data().
Table 13-18 Values set in a partition object using rm_set_data()
Specification                Argument type             Notes
RM_PartitionID               pm_partition_id_t         Partition identifier; up to 32 characters for a new partition ID, or up to 16 characters followed by an asterisk (*) for a prefix for a unique name
RM_PartitionConnection       rm_connection_type_t *
RM_PartitionDescription      char *                    Partition description
RM_PartitionSmall            int *                     0=No, 1=Yes
RM_PartitionPsetsPerBP       int *
RM_PartitionUserName         char *                    Partition owner
RM_PartitionMloaderImg       char *
RM_PartitionCnloadImg        char *                    Comma-separated list of images to load on the Compute Nodes
RM_PartitionIoloadImg        char *                    Comma-separated list of images to load on the I/O Nodes
RM_PartitionBPNum            int *
RM_PartitionFirstBP          rm_BP_t *
RM_PartitionNextBP           rm_BP_t *
RM_PartitionSwitchNum        int *
RM_PartitionFirstSwitch      rm_switch_t *
RM_PartitionNextSwitch       rm_switch_t *
RM_PartitionNodeCardNum      int *
RM_PartitionFirstNodecard    rm_nodecard_t *
RM_PartitionNextNodecard     rm_nodecard_t *
RM_PartitionBootOptions      char *                    Boot options
Table 13-19 Values retrieved from a job list object using rm_get_data()
Specification         Argument type
RM_JobListSize        int *
RM_JobListFirstJob    rm_job_t **
RM_JobListNextJob     rm_job_t **
Job object
The job object (rm_job_t) represents a job defined in the Blue Gene system. The job object is
retrieved from the job list object using the RM_JobListFirstJob and RM_JobListNextJob
specifications. A new job object is created using the rm_new_job() API. After setting the
appropriate fields in a new job object, the job can be added to the system using the
rm_add_job() API. See Table 13-20.
Table 13-20 Values retrieved from a job object using rm_get_data()
Specification             Argument type                   Notes
RM_JobID                  rm_job_id_t *                   Job identifier; free required. The identifier is unique across all jobs on the system.
RM_JobPartitionID         pm_partition_id_t *             free required
RM_JobState               rm_job_state_t *                Job state
RM_JobStateSeqID          rm_sequence_id_t *
RM_JobExecutable          char **                         free required
RM_JobUserName            char **                         free required
RM_JobDBJobID             db_job_id_t *
RM_JobOutFile             char **                         free required
RM_JobErrFile             char **                         free required
RM_JobOutDir              char **                         free required. This directory contains the output files if a full path is not given.
RM_JobErrText             char **                         free required
RM_JobArgs                char **                         free required
RM_JobEnvs                char **                         free required
RM_JobInHist              int *                           0=No, 1=Yes
RM_JobMode                rm_job_mode_t *                 Job mode
RM_JobStrace              rm_job_strace_t *
RM_JobStartTime           char **                         free required
RM_JobEndTime             char **                         free required
RM_JobRunTime             rm_job_runtime_t *
RM_JobComputeNodesUsed    rm_job_computenodes_used_t *
RM_JobExitStatus          rm_job_exitstatus_t *
RM_JobUserUid             rm_job_user_uid_t *             User UID
RM_JobUserGid             rm_job_user_gid_t *             User GID
RM_JobLocation            rm_job_location_t *             Job location
RM_JobPooID               pm_pool_id_t *
Table 13-21 shows the values that are set in a job object using rm_set_data().
Table 13-21 Values set in a job object using rm_set_data()
Specification        Argument type          Notes
RM_JobID             rm_job_id_t            Job identifier
RM_JobPartitionID    pm_partition_id_t
RM_JobExecutable     char *
RM_JobUserName       char *
RM_JobOutFile        char *
RM_JobErrFile        char *
RM_JobOutDir         char *
RM_JobArgs           char *
RM_JobEnvs           char *
RM_JobMode           rm_job_mode_t *        Job mode
RM_JobStrace         rm_job_strace_t *
RM_JobUserUid        rm_job_user_uid_t *    User UID
RM_JobUserGid        rm_job_user_gid_t *    User GID
Specification              Argument type           Notes
RM_JobFilterID             rm_job_id_t *           Job identifier
RM_JobFilterPartitionID    pm_partition_id_t *     Partition identifier assigned for the job; free required
RM_JobFilterState          rm_job_state_t *        Job state
RM_JobFilterExecutable     char **                 free required
RM_JobFilterUserName       char **                 free required
RM_JobFilterDBJobID        db_job_id_t *
RM_JobFilterOutDir         char **
RM_JobFilterMode           rm_job_mode_t *         Job mode
RM_JobFilterStartTime      char **                 free required
RM_JobFilterLocation       rm_job_location_t *     Job location
RM_JobFilterPoolID         pm_pool_id_t *
RM_JobFilterType           rm_job_state_flag_t     Job type
Job mode                 Value
RM_SMP_MODE              0x0000
RM_DUAL_MODE             0x0001
RM_VIRTUAL_NODE_MODE     0x0002

Job type flag            Value
JOB_TYPE_HPC_FLAG        0x0001
JOB_TYPE_HTC_FLAG        0x0002
JOB_TYPE_ALL_FLAG        0x0003
Here is an example:
<Mar 9 04:24:30> BRIDGE (Debug): rm_get_BG()- Completed Successfully
The message can be one of the following types:
The following verbosity levels, to which the messaging APIs can be configured, define the
policy:
By default, only error and warning messages are written. To have informational and minimal
debug messages written, set the verbosity level to 2. To obtain more detailed debug
messages, set the verbosity level to 3 or 4.
In the following list, we describe the messaging APIs:
int isSayMessageLevel(message_type_t m_type);
Tests the current messaging level. Returns 1 if the specified message type is included in
the current messaging level; otherwise returns 0.
void closeSayMessageFile();
Closes the messaging log file.
Note: Any messaging output after calling this method is sent to stderr.
int sayFormattedMessage(FILE * curr_stream, const void * buf, size_t bytes);
Logs a preformatted message to the messaging output without a time stamp.
void sayMessage(const char * component, message_type_t m_type, const char *
curr_func, const char * format, ...);
Logs a message to the messaging output.
The format parameter is a format string that specifies how subsequent arguments are
converted for output. This value must be compatible with printf format string
requirements.
int sayPlainMessage(FILE * curr_stream, const char * format, ...);
Logs a message to the messaging output without a time stamp.
244
The format parameter is a format string that specifies how subsequent arguments are
converted for output. This value must be compatible with the printf format string
requirements.
void setSayMessageFile(const char* oldfilename, const char* newfilename);
Opens a new file for message logging.
Note: This method can be used to atomically rotate log files.
void setSayMessageLevel(unsigned int level);
Sets the messaging verbose level.
void setSayMessageParams(FILE * stream, unsigned int level);
Uses the provided file for message logging and sets the logging level.
Note: This method has been deprecated in favor of the setSayMessageFile() and
setSayMessageLevel() methods.
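As a minimal sketch of two of these calls (the message text and partition name are hypothetical):

setSayMessageLevel(3);   // write error, warning, informational, and more detailed debug messages
sayPlainMessage(stderr, "partition %s is ready\n", "MYPART");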
getOption = RM_NextBP;
}
rm_free_BG(rmbg); // Deallocate memory from rm_get_BG()
}
The example code can be compiled and linked with the commands shown in Figure 13-6.
g++ -m64 -pthread -I/bgsys/drivers/ppcfloor/include -c sample1.cc -o sample1.o_64
g++ -m64 -pthread -o sample1 sample1.o_64 -L/bgsys/drivers/ppcfloor/lib64 -lbgpbridge
Figure 13-6 Example compile and link commands
Chapter 14. Real-time Notification APIs

API support
Real-time callback functions
Real-time elements
Server-side event filtering
Real-time Notification API status codes
Sample real-time application code
14.1.1 Requirements
The requirements for writing programs to the Real-time Notification APIs are as follows:
Currently, SUSE Linux Enterprise Server (SLES) 10 for PowerPC is the only supported
platform. The application must run on the IBM Blue Gene service node.
When the application calls rt_init(), the API looks for the DB_PROPERTY environment
variable. The corresponding db.properties file indicates the port on which the real-time
server is listening and that the real-time client uses to connect to the server. The
environment variable should be set to point to the actual db.properties file location as
follows:
On a bash shell
export DB_PROPERTY=/bgsys/drivers/ppcfloor/bin/db.properties
On a csh shell
setenv DB_PROPERTY /bgsys/drivers/ppcfloor/bin/db.properties
C and C++ are supported with the GNU gcc V4.1.2-level compilers. For more information
and downloads, refer to the following Web address:
http://gcc.gnu.org/
The include file is /bgsys/drivers/ppcfloor/include/rt_api.h.
Only 64-bit shared library support is provided. Link your real-time application with the file
/bgsys/drivers/ppcfloor/lib64/libbgrealtime.so.
Both the include and shared library files are installed as part of the standard system
installation. They are contained in the bgpbase.rpm file.
Example 14-1 shows a possible excerpt from a makefile that you can create to help automate
builds of your application. This sample is shipped in the directory
/bgsys/drivers/ppcfloor/doc/realtime/sample/simple/Makefile. In this makefile, the program that
is being built is rt_sample_app, and the source is in the rt_sample_app.cc file.
Example 14-1 Makefile excerpt
ALL_APPS = rt_sample_app
CXXFLAGS += -w -Wall -g -m64 -pthread
CXXFLAGS += -I/bgsys/drivers/ppcfloor/include
LDFLAGS += -L/bgsys/drivers/ppcfloor/lib64 -lbgrealtime
LDFLAGS += -pthread
.PHONY: all clean default distclean
default: $(ALL_APPS)
all: $(ALL_APPS)
clean:
$(RM) $(ALL_APPS) *.o
252
distclean: clean
...
Filtering events
Prior to IBM Blue Gene/P release V1R3M0, filtering of real-time events was performed only
on the client. With V1R3M0, filtering of real-time events can be done by the server, which is
more efficient because the messages are sent only if the client wants to receive them. For
more information about server-side filtering, refer to 14.5, Server-side filtering on page 272.
A real-time handle can be configured so that only partition events that affect certain partitions,
job events, or both, are passed to the application.
Setting the client-side partition filter is done by using the rt_set_filter() API with RT_PARTITION
as the filter_type parameter. The filter_names parameter can specify one or more partition
IDs separated by spaces. When rt_get_msgs() is called, partition events are delivered only to
the application if the partition ID matches any of the partition IDs in the filter. If the
filter_names parameter is set to NULL, the partition filter is removed, and all partition events
are delivered to the application. An example of the value to use for the filter_names parameter
for partition IDs R00-M0 and R00-M1 is R00-M0 R00-M1.
You can set the client-side job filter by using the rt_set_filter() API with RT_JOB as the
filter_type parameter. The filter_names parameter can specify one or more job IDs (as
strings) separated by spaces. When the rt_get_msgs() API is called, job events are delivered
only to the application if the job ID matches any of the job IDs in the filter. If the filter_names
parameter is set to NULL, the job filter is removed, and all job events are delivered to the
application. An example of the value to use for the filter_names parameter for job IDs 10030
and 10031 is 10030 10031.
The other use of the rt_set_filter() API is to remove both types of filter by passing
RT_CLEAR_ALL in the filter_type parameter.
can use the RT_CALLBACK_VERSION_CURRENT macro, which is the current version when the
application is compiled.
From inside your callback function, you cannot call a real-time API using the same handle on
which the event occurred; otherwise, your application deadlocks.
The return type of the callback functions is an indicator of whether rt_read_msgs() continues
to attempt to receive another real-time event on the handle or whether it stops. If the callback
function returns RT_CALLBACK_CONTINUE, rt_read_msgs() continues to attempt to receive
real-time events. If the callback function returns RT_CALLBACK_QUIT, rt_read_msgs() does not
attempt to receive another real-time event but returns RT_STATUS_OK.
Sequence identifiers (IDs) are associated with the state of each partition, job, base partition,
node card, wire, and switch. A state with a higher sequence ID is newer. If your application
gets the state for an object from the Bridge APIs in addition to the real-time APIs, you must
discard any state that has a lower sequence ID for the same object.
These APIs provide the raw state for partitions, jobs, base partitions, node cards, wires and
switches in addition to providing the state. The raw state is the status value that is stored in
the Blue Gene/P database as a single character, rather than the state enumeration that the
Bridge APIs use. Several raw state values map to a single state value so your application
might receive real-time event notifications where the state does not change but the raw state
does, for example, the partition raw states of A (allocating), C (configuring), and B
(booting) all map to the Bridge enumerated state of RM_PARTITION_CONFIGURING.
Field end_cb
The end_cb callback function is called when a real-time ended event occurs. Your application
does not receive any more real-time events on this handle until you request real-time events
from the server again by calling the rt_request_realtime API.
The function uses the following signature:
cb_ret_t my_rt_end(rt_handle_t **handle, void *extended_args, void *data);
Table 14-1 lists the arguments to the end_cb callback function.
Table 14-1 Field end_cb
Argument
Description
handle
extended_args
data
Field partition_added_cb
The partition_added_cb function is called when a partition added event occurs.
The function uses the following signature:
cb_ret_t my_rt_partition_added(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
pm_partition_id_t partition_id,
rm_partition_state_t partition_new_state,
rt_raw_state_t partition_raw_new_state,
void *extended_args,
void *data);
Table 14-2 lists the arguments to the partition_added_cb function.
Table 14-2 Field partition_added_cb
Argument
Description
handle
seq_id
partition_id
The partition's ID
partition_new_state
partition_raw_new_state
extended_args
data
Field partition_state_changed_cb
The partition_state_changed_cb function is called when a partition state changed event
occurs.
The function uses the following signature:
cb_ret_t my_rt_partition_state_changed(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t previous_seq_id,
pm_partition_id_t partition_id,
rm_partition_state_t partition_new_state,
rm_partition_state_t partition_old_state,
rt_raw_state_t partition_raw_new_state,
rt_raw_state_t partition_raw_old_state,
void *extended_args,
void *data);
Description
handle
seq_id
previous_seq_id
partition_id
The partition's ID
partition_new_state
partition_old_state
partition_raw_new_state
partition_raw_old_state
extended_args
data
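As an illustration, a minimal sketch of a partition_state_changed_cb implementation that applies
the sequence ID rule described earlier in this chapter; my_cached_seq_id() and my_cache_update()
stand for hypothetical application code, and rm_sequence_id_t is assumed to be directly
comparable:

cb_ret_t my_rt_partition_state_changed(
    rt_handle_t **handle,
    rm_sequence_id_t seq_id,
    rm_sequence_id_t previous_seq_id,
    pm_partition_id_t partition_id,
    rm_partition_state_t partition_new_state,
    rm_partition_state_t partition_old_state,
    rt_raw_state_t partition_raw_new_state,
    rt_raw_state_t partition_raw_old_state,
    void *extended_args,
    void *data)
{
    // discard anything older than the state already cached for this partition
    if (seq_id > my_cached_seq_id(partition_id)) {
        my_cache_update(partition_id, partition_new_state, seq_id);
    }
    // keep receiving real-time events on this handle
    return RT_CALLBACK_CONTINUE;
}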
Field partition_deleted_cb
The partition_deleted_cb function is called when a partition deleted event occurs.
The function uses the following signature:
cb_ret_t my_rt_partition_deleted(
rt_handle_t **handle,
rm_sequence_id_t previous_seq_id,
pm_partition_id_t partition_id,
void *extended_args,
void *data);
Table 14-4 lists the arguments to the partition_deleted_cb function.
Table 14-4 Field partition_deleted_cb
Argument
Description
handle
previous_seq_id
partition_id
The partition's ID
extended_args
data
Field job_added_cb
The job_added_cb function is called when a job added event occurs.
Note that this function is not called if the version field is RT_CALLBACK_VERSION1 and the
job_added_v1_cb field is not NULL. The job_added_v1_cb callback provides more information.
Description
handle
seq_id
job_id
partition_id
job_new_state
job_raw_new_state
extended_args
data
Field job_state_changed_cb
The job_state_changed_cb function is called when a job state changed event occurs.
Note that this function is not called if the version field is RT_CALLBACK_VERSION1 and the
job_state_changed_v1_cb field is not NULL. The job_state_changed_v1_cb callback provides
more information.
The function uses the following signature:
cb_ret_t my_rt_job_state_changed(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t previous_seq_id,
db_job_id_t job_id,
pm_partition_id_t partition_id,
rm_job_state_t job_new_state,
rm_job_state_t job_old_state,
rt_raw_state_t job_raw_new_state,
rt_raw_state_t job_raw_old_state,
void *extended_args,
void *data);
Description
handle
seq_id
previous_seq_id
job_id
The job's ID
partition_id
job_new_state
job_old_state
job_raw_new_state
job_raw_old_state
extended_args
data
Field job_deleted_cb
The job_deleted_cb function is called when a job-deleted event occurs.
Note that this function is not called if the version field is RT_CALLBACK_VERSION1 and the
job_deleted_v1_cb field is not NULL. The job_deleted_v1_cb callback provides more
information.
The function uses the following signature:
cb_ret_t my_rt_job_deleted(
rt_handle_t **handle,
rm_sequence_id_t previous_seq_id,
db_job_id_t job_id,
pm_partition_id_t partition_id,
void *extended_args,
void *data);
Table 14-7 lists the arguments to the job_deleted_cb function.
Table 14-7 Field job_deleted_cb
Argument
Description
handle
previous_seq_id
job_id
Deleted job's ID
partition_id
extended_args
data
Field bp_state_changed_cb
The bp_state_changed_cb is called when a base partition state changed event occurs.
The function uses the following signature:
cb_ret_t my_rt_BP_state_changed_fn(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t previous_seq_id,
rm_bp_id_t bp_id,
rm_BP_state_t BP_new_state,
rm_BP_state_t BP_old_state,
rt_raw_state_t BP_raw_new_state,
rt_raw_state_t BP_raw_old_state,
void *extended_args,
void *data);
Table 14-8 lists the arguments to the bp_state_changed_cb function.
Table 14-8 Field bp_state_changed_cb
Argument
Description
handle
seq_id
previous_seq_id
bp_id
BP_new_state
BP_old_state
BP_raw_new_state
BP_raw_old_state
extended_args
data
Field switch_state_changed_cb
The switch_state_changed_cb is called when a switch state changed event occurs.
The function uses the following signature:
cb_ret_t my_rt_switch_state_changed(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t previous_seq_id,
rm_switch_id_t switch_id,
rm_bp_id_t bp_id,
rm_switch_state_t switch_new_state,
rm_switch_state_t switch_old_state,
rt_raw_state_t switch_raw_new_state,
rt_raw_state_t switch_raw_old_state,
void *extended_args,
void *data);
Description
handle
seq_id
previous_seq_id
switch_id
The switch's ID
bp_id
switch_new_state
switch_old_state
switch_raw_new_state
switch_raw_old_state
extended_args
data
Field nodecard_state_changed_cb
The nodecard_state_changed_cb is called when a node card state changed event occurs.
The function uses the following signature:
cb_ret_t my_rt_nodecard_state_changed(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t previous_seq_id,
rm_nodecard_id_t nodecard_id,
rm_bp_id_t bp_id,
rm_nodecard_state_t nodecard_new_state,
rm_nodecard_state_t nodecard_old_state,
rt_raw_state_t nodecard_raw_new_state,
rt_raw_state_t nodecard_raw_old_state,
void *extended_args,
void *data);
Description
handle
seq_id
previous_seq_id
nodecard_id
bp_id
nodecard_new_state
nodecard_old_state
nodecard_raw_new_state
nodecard_raw_old_state
extended_args
data
Field job_added_v1_cb
The job_added_v1_cb function is called when a job added event occurs.
Note that this function is called only if the version field is RT_CALLBACK_VERSION_1 or later.
The function uses the following signature:
cb_ret_t my_rt_job_added(
rt_handle_t **handle,
rm_sequence_id_t previous_seq_id,
jm_job_id_t job_id,
db_job_id_t db_job_id,
pm_partition_id_t partition_id,
rm_job_state_t job_new_state,
rt_raw_state_t job_raw_new_state,
void *extended_args,
void *data);
Table 14-11 lists the arguments to the job_added_v1_cb function.
Table 14-11 Field job_added_v1_cb
Argument
Description
handle
seq_id
job_id
Job identifier
db_job_id
partition_id
job_new_state
job_raw_new_state
extended_args
data
Field job_state_changed_v1_cb
The job_state_changed_v1_cb function is called when a job state changed event occurs.
Note that this function is called only if the version field is RT_CALLBACK_VERSION_1 or later.
The function uses the following signature:
cb_ret_t my_rt_job_state_changed(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t previous_seq_id,
jm_job_id_t job_id,
db_job_id_t db_job_id,
pm_partition_id_t partition_id,
rm_job_state_t job_new_state,
rm_job_state_t job_old_state,
rt_raw_state_t job_raw_new_state,
rt_raw_state_t job_raw_old_state,
void *extended_args,
void *data);
Table 14-12 lists the arguments to the job_state_changed_v1_cb function.
Table 14-12 Field job_state_changed_v1_cb
Argument
Description
handle
seq_id
previous_seq_id
job_id
Job identifier
db_job_id
partition_id
job_new_state
job_old_state
job_raw_new_state
job_raw_old_state
extended_args
data
Field job_deleted_v1_cb
The job_deleted_v1_cb function is called when a job-deleted event occurs.
Note that this function is called only if the version field is RT_CALLBACK_VERSION_1 or later.
The function uses the following signature:
cb_ret_t my_rt_job_deleted(
rt_handle_t **handle,
rm_sequence_id_t previous_seq_id,
jm_job_id_t job_id,
db_job_id_t db_job_id,
pm_partition_id_t partition_id,
void *extended_args,
void *data);
Table 14-13 lists the arguments to the job_deleted_v1_cb function.
Table 14-13 Field job_deleted_v1_cb
Argument
Description
handle
previous_seq_id
job_id
Job identifier
db_job_id
partition_id
extended_args
data
Field wire_state_changed_cb
The wire_state_changed_cb function is called when a wire state changed event occurs.
Note that this function is called only if the version field is RT_CALLBACK_VERSION_1 or later.
The function uses the following signature:
cb_ret_t my_rt_wire_state_changed(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t previous_seq_id,
rm_wire_id_t wire_id,
rm_wire_state_t wire_new_state,
rm_wire_state_t wire_old_state,
rt_raw_state_t wire_raw_new_state,
rt_raw_state_t wire_raw_old_state,
void *extended_args,
void *data);
Description
handle
seq_id
previous_seq_id
wire_id
Wire identifier
wire_new_state
wire_old_state
wire_raw_new_state
wire_raw_old_state
extended_args
data
Field filter_acknowledge_cb
The filter_acknowledge_cb function is called when a filter acknowledged event occurs.
Note that this function is called only if the version field is RT_CALLBACK_VERSION_1 or later.
The function uses the following signature:
cb_ret_t my_filter_acknowledged(
rt_handle_t **handle,
rt_filter_id_t filter_id,
void *extended_args,
void *data);
Table 14-15 lists the arguments to the filter_acknowledge_cb function.
Table 14-15 Field filter_acknowledge_cb
Argument
Description
handle
filter_id
Filter identifier
extended_args
data
Field htc_compute_node_failed_cb
The htc_compute_node_failed_cb function is called when an HTC compute node failed event
occurs.
Note that this function is called only if the version field is RT_CALLBACK_VERSION_1 or later.
The function uses the following signature:
cb_ret_t my_htc_compute_node_failed(
rt_handle_t **handle,
rt_compute_node_fail_info_t *compute_node_fail_info,
void *extended_args,
void *data);
Table 14-16 lists the arguments to the htc_compute_node_failed_cb function.
Table 14-16 Field htc_compute_node_failed_cb
Argument
Description
handle
compute_node_fail_info
extended_args
data
Field htc_io_node_failed_cb
The htc_io_node_failed_cb function is called when an HTC I/O node failed event occurs.
Note that this function is called only if the version field is RT_CALLBACK_VERSION_1 or later.
The function uses the following signature:
cb_ret_t my_htc_io_node_failed(
rt_handle_t **handle,
rt_io_node_fail_info_t *io_node_fail_info,
void *extended_args,
void *data);
Table 14-17 lists the arguments to the htc_io_node_failed_cb function.
Table 14-17 Field htc_io_node_failed_cb
Argument
Description
handle
io_node_fail_info
extended_args
data
Field ras_event_cb
The ras_event_cb function is called when a RAS event occurs.
Note that this function is called only if the version field is RT_CALLBACK_VERSION_2.
The function uses the following signature:
cb_ret_t my_rt_ras_event(
rt_handle_t **handle,
rt_ras_event_t *ras_event_info,
void *extended_args,
void *data);
Description
handle
ras_event_info
extended_args
data
Description
Data type
rm_component_id_t
RT_SPEC_DB_JOB_ID
Description
Data type
db_job_id_t
RT_SPEC_REASON
Description
Text explaining why the compute node became unavailable (might return
RT_NO_DATA)
Data type
char*
Data type
rm_ionode_id_t
RT_SPEC_COMPUTE_NODE_INFOS
Description
Data type
rt_ionode_fail_info_compute_node_infos_t*
RT_SPEC_REASON
Description
Text explaining why the I/O node became unavailable (might return
RT_NO_DATA)
Data type
char*
The following list describes each of the fields in the I/O node failure compute node
information list element.
RT_SPEC_LIST_FIRST
Description
Data type
rt_ionode_fail_info_compute_node_info_t*
RT_SPEC_LIST_NEXT
Description
Next compute node info. rt_get_data() returns RT_NO_DATA if there are no more
elements
Data type
rt_ionode_fail_info_compute_node_info_t*
ID of the compute node associated with the I/O node that failed
Data type
rm_component_id_t
Data type
rt_ras_record_id_t
RT_SPEC_MESSAGE_ID
Description
Data type
char*
RT_SPEC_SEVERITY
Description
Data type
rt_ras_severity_t
14.4.2 Example
Example 14-2 illustrates the use of real-time elements in a callback function that prints out the
information in the I/O node failure information element.
Example 14-2 Accessing the fields of a real-time element
cb_ret_t rt_htc_io_node_failed_callback(
rt_handle_t** handle,
rt_io_node_fail_info_t* io_node_fail_info,
void* extended_args,
void* data
)
{
rt_status rc;
rm_ionode_id_t io_node_id;
rc = rt_get_data( (rt_element_t*) io_node_fail_info, RT_SPEC_ID, &io_node_id );
const char *reason_buf = "";
const char *reason_p(NULL);
rc = rt_get_data( (rt_element_t*) io_node_fail_info, RT_SPEC_REASON, &reason_buf );
if ( rc == RT_STATUS_OK ) {
reason_p = reason_buf;
} else if ( rc == RT_NO_DATA ) {
reason_p = NULL;
rc = RT_STATUS_OK;
}
ostringstream sstr;
sstr << "[";
rt_ionode_fail_info_compute_node_infos_t *cn_infos(NULL);
rc = rt_get_data( (rt_element_t*) io_node_fail_info, RT_SPEC_COMPUTE_NODE_INFOS, &cn_infos
);
int i(0);
while ( true ) {
rt_ionode_fail_info_compute_node_info_t *cn_info_p(NULL);
rt_specification_t spec(i == 0 ? RT_SPEC_LIST_FIRST : RT_SPEC_LIST_NEXT);
rc = rt_get_data( (rt_element_t*) cn_infos, spec, &cn_info_p );
if ( rc == RT_NO_DATA ) {
rc = RT_STATUS_OK;
break;
}
rm_component_id_t compute_node_id(NULL);
rc = rt_get_data( (rt_element_t*) cn_info_p, RT_SPEC_ID, &compute_node_id );
if ( i++ > 0 ) {
Indicates whether the application wants any job callbacks called. The value is an
integer, where non-zero indicates that job callbacks are called, and 0 indicates
that job callbacks are not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_JOB_ID
Description
A pattern specifying the job IDs that the job callbacks are called for.
Default value
Argument type
RT_FILTER_PROPERTY_JOB_STATES
Description
The states that jobs are changing to that the job callbacks are called for.
Default value
Argument type
RT_FILTER_PROPERTY_JOB_DELETED
Description
Indicates whether the application wants the job deletion callback called. The value
is an integer, where non-zero indicates that the job deletion callback is called, and
0 indicates that the job deletion callback is not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_JOB_TYPE
Description
Indicates the type of jobs that the application wants the job callbacks called for.
The value is one of the rt_filter_property_partition_type_t enum values.
- RT_FILTER_PARTITION_TYPE_HPC_ONLY: only send events for HPC jobs
- RT_FILTER_PARTITION_TYPE_HTC_ONLY: only send events for HTC jobs
- RT_FILTER_PARTITION_TYPE_ANY: send events for any type of job
Default value
RT_FILTER_PARTITION_TYPE_HPC_ONLY
Argument type
rt_filter_property_partition_type_t*
RT_FILTER_PROPERTY_JOB_PARTITION
Description
A pattern specifying the IDs of the partitions for the jobs that the application wants
the job callbacks called for.
Default value
Argument type
RT_FILTER_PROPERTY_PARTITIONS
Description
Indicates whether the application wants any partition callbacks called. The value
is an integer, where non-zero indicates that partition callbacks are called, and 0
indicates that partition callbacks are not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_PARTITION_ID
Description
A pattern specifying the partition IDs that the partition callbacks are called for.
Default value
Argument type
RT_FILTER_PROPERTY_PARTITION_STATES
Description
The states that partitions are changing to that the partition callbacks are called for.
Default value
Argument type
RT_FILTER_PROPERTY_PARTITION_DELETED
Description
Indicates whether the application wants the partition deletion callback called. The
value is an integer, where non-zero indicates that the partition deletion callback is
called, and 0 indicates that the partition deletion callback is not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_PARTITION_TYPE
Description
Indicates the type of partitions that the application wants the partition callbacks called for.
The value is one of the rt_filter_property_partition_type_t enum values.
- RT_FILTER_PARTITION_TYPE_HPC_ONLY: only send events for HPC partitions
- RT_FILTER_PARTITION_TYPE_HTC_ONLY: only send events for HTC partitions
- RT_FILTER_PARTITION_TYPE_ANY: send events for any type of partition
Default value
RT_FILTER_PARTITION_TYPE_ANY
Argument type
rt_filter_property_partition_type_t*
Indicates whether the application wants any base partition callbacks called. The
value is an integer, where non-zero indicates that base partition callbacks are
called, and 0 indicates that base partition callbacks are not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_BP_ID
Description
Default value
Argument type
RT_FILTER_PROPERTY_BP_STATES
Description
The states that base partitions are changing to that the base partition callbacks
are called for.
Default value
Argument type
Indicates whether the application wants any node card callbacks called. The value
is an integer, where non-zero indicates that node card callbacks are called, and 0
indicates that node card callbacks are not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_NODE_CARD_ID
Description
Default value
Argument type
RT_FILTER_PROPERTY_NODE_CARD_STATES
Description
The states that node cards are changing to that the node card callbacks are called
for.
Default value
Argument type
Indicates whether the application wants any switch callbacks called. The value is
an integer, where non-zero indicates that switch callbacks are called, and 0
indicates that switch callbacks are not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_SWITCH_ID
Description
Default value
Argument type
RT_FILTER_PROPERTY_SWITCH_STATES
Description
The states that switches are changing to that the switch callbacks are called for.
Default value
Argument type
Indicates whether the application wants any wire callbacks called. The value is an
integer, where non-zero indicates that wire callbacks are called, and 0 indicates
that wire callbacks are not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_WIRE_ID
Description
Default value
Argument type
RT_FILTER_PROPERTY_WIRE_STATES
Description
The states that wires are changing to that the wire callbacks are called for.
Default value
Argument type
Indicates whether the application wants any HTC callbacks called. The value is
an integer, where non-zero indicates that HTC callbacks are called, and 0
indicates that HTC callbacks are not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_HTC_COMPUTE_NODE_FAIL
Description
Indicates whether the application wants the HTC compute node failure callback
called. The value is an integer, where non-zero indicates that the HTC compute
node failure callback is called, and 0 indicates that the HTC compute node failure
callback is not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_HTC_IO_NODE_FAIL
Description
Indicates whether the application wants the HTC I/O node failure callback called.
The value is an integer, where non-zero indicates that the HTC I/O node failure
callback is called, and 0 indicates that the HTC I/O node failure callback is not
called.
Default value
Argument type
int*
Indicates whether the application wants any RAS event callbacks called. The
value is an integer, where non-zero indicates that RAS event callbacks are called,
and 0 indicates that RAS event callbacks are not called.
Default value
Argument type
int*
RT_FILTER_PROPERTY_RAS_MESSAGE_ID
Description
Default value
Argument type
RT_FILTER_PROPERTY_RAS_SEVERITIES
Description
Default value
Argument type
RT_FILTER_PROPERTY_RAS_JOB_DB_IDS
Description
Default value
Argument type
RT_FILTER_PROPERTY_RAS_PARTITION_ID
Description
Default value
Argument type
14.5.3 Example
Example 14-3 illustrates the use of the real-time server-side filtering APIs.
Example 14-3 Using the real-time server-side filtering APIs
#include <rt_api.h>
#include <iostream>
using namespace std;
cb_ret_t rt_filter_acknowledge_callback(
rt_handle_t **handle,
rt_filter_id_t filter_id,
void* extended_args, void* data
)
{
cout << "Received callback for filter acknowledged for filter ID " << filter_id << endl;
return RT_CALLBACK_CONTINUE;
}
int main( int argc, char *argv[] ) {
rt_filter_t *filter_handle(NULL);
rt_create_server_filter( &filter_handle );
char job_name_pattern[] = "^MYPREFIX.*$";
rt_server_filter_set_property( filter_handle, RT_FILTER_PROPERTY_JOB_ID,
(void*) job_name_pattern );
int filter_parts( 0 );
rt_server_filter_set_property( filter_handle, RT_FILTER_PROPERTY_PARTITIONS,
(void*) &filter_parts );
rm_BP_state_t bp_states[] = { RM_BP_UP, RM_BP_ERROR, RM_BP_NAV };
rt_server_filter_set_property( filter_handle, RT_FILTER_PROPERTY_BP_STATES,
(void*) bp_states );
rt_filter_id_t filter_id;
rt_handle_t *rt_handle;
rt_callbacks_t rt_callbacks;
rt_callbacks.version = RT_CALLBACK_VERSION_CURRENT;
rt_callbacks.filter_acknowledge_cb = &rt_filter_acknowledge_callback;
rt_init( &rt_handle, RT_BLOCKING, &rt_callbacks );
rt_set_server_filter( &rt_handle, filter_handle, &filter_id );
rt_request_realtime( &rt_handle );
rt_read_msgs( &rt_handle, NULL);
rt_close( &rt_handle );
}
#include <rt_api.h>
#include <sayMessage.h>
#include <stdio.h>
#include <unistd.h>
#include <iostream>
#include <sstream>
using namespace std;
return "Dying";
case RM_JOB_DEBUG:
return "Debug";
case RM_JOB_LOAD:
return "Load";
case RM_JOB_LOADED:
return "Loaded";
case RM_JOB_BEGIN:
return "Begin";
case RM_JOB_ATTACH:
return "Attach";
case RM_JOB_NAV:
return "Not a value (NAV)";
}
return "Unknown";
} // job_state_to_msg()
cb_ret_t rt_job_state_changed_v1_callback(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t previous_seq_id,
jm_job_id_t job_id,
db_job_id_t db_job_id,
pm_partition_id_t partition_id,
rm_job_state_t job_new_state,
rm_job_state_t job_old_state,
rt_raw_state_t job_raw_new_state,
rt_raw_state_t job_raw_old_state,
void* extended_args,
void* data
)
{
cout << "Received callback for job " << job_id << " with id " << db_job_id
<< " state change on partition " << partition_id
<< ", old state is " << job_state_to_msg( job_old_state ) << ", new state is "
<< job_state_to_msg( job_new_state ) << endl << "Raw old state=" <<
job_raw_old_state
<< " Raw new state=" << job_raw_new_state << " New sequence ID=" << seq_id
<< " Previous sequence ID=" << previous_seq_id << endl;
return RT_CALLBACK_CONTINUE;
}
cb_ret_t rt_job_deleted_v1_callback(
rt_handle_t **handle,
rm_sequence_id_t previous_seq_id,
jm_job_id_t job_id,
db_job_id_t db_job_id,
pm_partition_id_t partition_id,
void* extended_args,
void* data
)
{
cout << "Received callback for delete of job " << job_id << " with id " << db_job_id
<< " on partition " << partition_id
<< " Previous sequence ID=" << previous_seq_id << endl;
return RT_CALLBACK_CONTINUE;
}
cb_ret_t rt_BP_state_changed_callback(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t prev_seq_id,
rm_bp_id_t bp_id,
rm_BP_state_t BP_new_state,
rm_BP_state_t BP_old_state,
rt_raw_state_t BP_raw_new_state,
rt_raw_state_t BP_raw_old_state,
void* extended_args,
void* data
)
{
cout << "Received callback for BP " << bp_id << " state change,"
" old state is " << BP_state_to_msg( BP_old_state ) << ","
" new state is " << BP_state_to_msg( BP_new_state ) << "\n"
"Raw old state=" << BP_raw_old_state <<
" Raw new state=" << BP_raw_new_state <<
" New sequence ID=" << seq_id <<
" Previous sequence ID=" << prev_seq_id << endl;
return RT_CALLBACK_CONTINUE;
}
cb_ret_t rt_switch_state_changed_callback(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t prev_seq_id,
rm_switch_id_t switch_id,
rm_bp_id_t bp_id,
rm_switch_state_t switch_new_state,
rm_switch_state_t switch_old_state,
rt_raw_state_t switch_raw_new_state,
rt_raw_state_t switch_raw_old_state,
void* extended_args,
void* data
)
{
cout << "Received callback for switch " << switch_id << " state change on BP " << bp_id <<
","
" old state is " << switch_state_to_msg( switch_old_state ) << ","
" new state is " << switch_state_to_msg( switch_new_state ) << "\n"
"Raw old state=" << switch_raw_old_state <<
" Raw new state=" << switch_raw_new_state <<
" New sequence ID=" << seq_id <<
" Previous sequence ID=" << prev_seq_id << endl;
return RT_CALLBACK_CONTINUE;
}
cb_ret_t rt_nodecard_state_changed_callback(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t prev_seq_id,
rm_nodecard_id_t nodecard_id,
rm_bp_id_t bp_id,
rm_nodecard_state_t nodecard_new_state,
rm_nodecard_state_t nodecard_old_state,
rt_raw_state_t nodecard_raw_new_state,
rt_raw_state_t nodecard_raw_old_state,
void* extended_args,
void* data
)
{
cout << "Received callback for node card " << nodecard_id <<
" state change on BP " << bp_id << ","
" old state is " << nodecard_state_to_msg( nodecard_old_state ) << ","
" new state is " << nodecard_state_to_msg( nodecard_new_state ) << "\n"
"Raw old state=" << nodecard_raw_old_state <<
" Raw new state=" << nodecard_raw_new_state <<
" New sequence ID=" << seq_id <<
" Previous sequence ID=" << prev_seq_id << endl;
return RT_CALLBACK_CONTINUE;
}
cb_ret_t rt_wire_state_changed_callback(
rt_handle_t **handle,
rm_sequence_id_t seq_id,
rm_sequence_id_t previous_seq_id,
rm_wire_id_t wire_id,
rm_wire_state_t wire_new_state,
rm_wire_state_t wire_old_state,
rt_raw_state_t wire_raw_new_state,
rt_raw_state_t wire_raw_old_state,
void* extended_args,
void* data
)
{
cout << "Received callback for wire '" << wire_id << "',"
" old state is " << wire_state_to_msg( wire_old_state ) << ","
" new state is " << wire_state_to_msg( wire_new_state ) << "\n"
"Raw old state=" << wire_raw_old_state <<
" Raw new state=" << wire_raw_new_state <<
" New sequence ID=" << seq_id <<
" Previous sequence ID=" << previous_seq_id << endl;
return RT_CALLBACK_CONTINUE;
}
cb_ret_t rt_filter_acknowledge_callback(
rt_handle_t **handle,
rt_filter_id_t filter_id,
void* extended_args, void* data
)
{
cout << "Received callback for filter acknowledged for filter ID " << filter_id << endl;
return RT_CALLBACK_CONTINUE;
}
cb_ret_t rt_htc_compute_node_failed_callback(
rt_handle_t** handle,
rt_compute_node_fail_info_t* compute_node_fail_info,
void* extended_args,
void* data
)
{
rt_status rc;
rm_component_id_t compute_node_id;
rc = rt_get_data( (rt_element_t*) compute_node_fail_info, RT_SPEC_ID, &compute_node_id );
if ( rc != RT_STATUS_OK ) {
cerr << "rt_get_data failed in " << __FUNCTION__ << " getting RT_SPEC_ID\n";
return RT_CALLBACK_CONTINUE;
}
db_job_id_t db_job_id;
rc = rt_get_data( (rt_element_t*) compute_node_fail_info, RT_SPEC_DB_JOB_ID, &db_job_id );
if ( rc != RT_STATUS_OK ) {
cerr << "rt_get_data failed in " << __FUNCTION__ << " getting RT_SPEC_DB_JOB_ID\n";
return RT_CALLBACK_CONTINUE;
}
const char *reason_buf = "";
const char *reason_p;
rc = rt_get_data( (rt_element_t*) compute_node_fail_info, RT_SPEC_REASON, &reason_buf );
if ( rc == RT_STATUS_OK ) {
reason_p = reason_buf;
} else if ( rc == RT_NO_DATA ) {
reason_p = NULL;
rc = RT_STATUS_OK;
} else {
cerr << "rt_get_data failed in " << __FUNCTION__ << " getting RT_SPEC_REASON\n";
return RT_CALLBACK_CONTINUE;
}
cout << "Received callback for HTC compute node failed.\n"
" compute_node=" << compute_node_id <<
" db_job_id=" << db_job_id;
if ( reason_p ) {
cout << " reason='" << reason_p << "'";
}
cout << endl;
return RT_CALLBACK_CONTINUE;
}
cb_ret_t rt_htc_io_node_failed_callback(
rt_handle_t** handle,
rt_io_node_fail_info_t* io_node_fail_info,
void* extended_args,
void* data
)
{
rt_status rc;
rm_ionode_id_t io_node_id;
rc = rt_get_data( (rt_element_t*) io_node_fail_info, RT_SPEC_ID, &io_node_id );
if ( rc != RT_STATUS_OK ) {
cerr << "rt_get_data failed in " << __FUNCTION__ << " getting RT_SPEC_ID.\n";
return RT_CALLBACK_CONTINUE;
}
const char *reason_buf = "";
const char *reason_p(NULL);
rc = rt_get_data( (rt_element_t*) io_node_fail_info, RT_SPEC_REASON, &reason_buf );
if ( rc == RT_STATUS_OK ) {
reason_p = reason_buf;
} else if ( rc == RT_NO_DATA ) {
reason_p = NULL;
rc = RT_STATUS_OK;
} else {
cerr << "rt_get_data failed in " << __FUNCTION__ << " getting RT_SPEC_REASON\n";
return RT_CALLBACK_CONTINUE;
}
ostringstream sstr;
sstr << "[";
rt_ionode_fail_info_compute_node_infos *cn_infos(NULL);
rc = rt_get_data( (rt_element_t*) io_node_fail_info, RT_SPEC_COMPUTE_NODE_INFOS, &cn_infos
);
if ( rc != RT_STATUS_OK ) {
cerr << "rt_get_data failed in " << __FUNCTION__ << " getting
RT_SPEC_COMPUTE_NODE_INFOS.\n";
return RT_CALLBACK_CONTINUE;
}
int i(0);
while ( true ) {
rt_ionode_fail_info_compute_node_info *cn_info_p(NULL);
rt_specification_t spec(i == 0 ? RT_SPEC_LIST_FIRST : RT_SPEC_LIST_NEXT);
rc = rt_get_data( (rt_element_t*) cn_infos, spec, &cn_info_p );
if ( rc == RT_NO_DATA ) {
rc = RT_STATUS_OK;
break;
}
if ( rc != RT_STATUS_OK ) {
cerr << "rt_get_data failed in " << __FUNCTION__ << " getting compute node info list
element.\n";
return RT_CALLBACK_CONTINUE;
}
rm_component_id_t compute_node_id(NULL);
rc = rt_get_data( (rt_element_t*) cn_info_p, RT_SPEC_ID, &compute_node_id );
if ( rc != RT_STATUS_OK ) {
cerr << "rt_get_data failed in " << __FUNCTION__ << " getting compute node info
RT_SPEC_ID.\n";
return RT_CALLBACK_CONTINUE;
}
if ( i++ > 0 ) {
sstr << ",";
}
sstr << compute_node_id;
}
sstr << "]";
string cn_ids_str(sstr.str());
cout << "Received callback for HTC I/O node failed.\n"
" io_node=" << io_node_id <<
" cns=" << cn_ids_str;
if ( reason_p ) {
cout << " reason='" << reason_p << "'";
}
cout << endl;
return RT_CALLBACK_CONTINUE;
}
Chapter 15. Dynamic Partition Allocator APIs
15.2 Requirements
When writing programs that use the Dynamic Partition Allocator APIs, you must meet the following
requirements:
Operating system supported
Currently, SUSE Linux Enterprise Server (SLES) 10 for PowerPC is the only supported
platform.
Languages supported
C and C++ are supported with the GNU gcc 4.1.2 level compilers. For more information
and downloads, refer to the following Web address:
http://gcc.gnu.org/
Include files
All required include files are installed in the /bgsys/drivers/ppcfloor/include directory. The
include file for the dynamic allocator API is allocator_api.h.
Library files
The Dynamic Partition Allocator APIs support 64-bit applications using dynamic linking
with shared objects.
Sixty-four bit libraries: The required library files are installed in the
/bgsys/drivers/ppcfloor/lib64 directory. The shared object for linking to the Dynamic
Partition Allocator APIs is libbgpallocator.so.
The libbgpallocator.so library has dependencies on other libraries included with the
IBM Blue Gene/P software, including the following objects:
libbgpbridge.so
libbgpconfig.so
libbgpdb.so
libsaymessage.so
libtableapi.so
These files are installed with the standard system installation procedure. They are
contained in the bgpbase.rpm file.
15.3.1 APIs
The following APIs are used for dynamic partition allocation and are all thread safe:
BGALLOC_STATUS rm_init_allocator(const char * caller_desc, const char *
drain_list);
A program should call rm_init_allocator() and pass a description that will be used as the
text description for all partitions used by subsequent rm_allocate_partition() calls, for
example, passing in ABC job scheduler causes any partitions that are created by
rm_allocate_partition() to have ABC job scheduler as the partition description.
The caller can also optionally specify a drain list file name that identifies the base
partitions (midplanes) that will be excluded from the list of resources to consider when
allocating new partitions. If NULL is passed in for the drain list file name, a default drain list
is set first from the following locations:
The path in the environment variable ALLOCATOR_DRAIN_LIST if it exists
The /etc/allocator_drain.lst file if it exists
If no drain list file is established, no base partitions are excluded. If an invalid file name is
passed in, the call fails, for example, a drain list file with the following content excludes
base partitions R00-M0, R00-M1, and R01-M0 when allocating resources for a partition:
R00-M0
R00-M1
R01-M0
The list of resources can contain items separated by any white-space character (space,
tab, new line, vertical tab, or form feed). Items found that do not match an existing
resource are ignored, but an error message is logged.
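A hedged sketch of this call is shown below; the caller description and the drain list file name are arbitrary example values, not values required by the API.
#include "allocator_api.h"
int init_allocator_example()
{
/* The description ("ABC job scheduler") becomes the partition description for
partitions created by later rm_allocate_partition() calls. The second argument
names a drain list file; passing NULL instead would fall back to
ALLOCATOR_DRAIN_LIST or /etc/allocator_drain.lst, as described above. */
BGALLOC_STATUS rc = rm_init_allocator( "ABC job scheduler", "/etc/my_drain_list.lst" );
return ( rc == BGALLOC_OK ) ? 0 : 1;
}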
BGALLOC_STATUS rm_allocate_partition(
const rm_size_t size,
const rm_connection_type_t conn,
const rm_size3D_t shape,
const rm_job_mode_t mode,
const rm_psetsPerBP_t psetsPerBP,
const char * user_name,
const char * caller_desc,
const char * options,
const char * ignoreBPs,
const char * partition_id,
char ** newpartition_id,
const char * bootOptions);
The caller to rm_allocate_partition() provides input parameters that describe the
characteristics of the partition that should be created from available Blue Gene/P machine
resources. If resources are available that match the requirements, a partition is created
and allocated, and the partition name is returned to the caller along with a return code of
BGALLOC_OK.
If both size and shape values are provided, the allocation is based on the shape value
only.
The user_name parameter is required.
If the caller_desc value is NULL, the caller description specified on the call to
rm_init_allocator() is used.
The options parameter is optional and can be NULL.
If the ignoreBPs parameter is not NULL, it must be a string of blank-separated base
partition identifiers to be ignored. The base partitions listed in the parameter are ignored
as though the partitions were included in the drain list file currently in effect.
If the partition_id parameter is not NULL, it can specify one of the following options:
The name of the new partition
The name can be from 1 to 32 characters. Valid characters are a...z, A...Z, 0...9, - (hyphen), and _ (underscore).
The prefix to be used for generating a unique partition name
The prefix can be from 1 to 16 characters, followed by an asterisk (*). Valid characters
are the same as those for a new partition name, for example, if ABC-Scheduler* is
specified as a prefix, the resulting unique partition name can be
ABC-Scheduler-27Sep1519514155.
The bootOptions parameter is optional and can be NULL. Otherwise it specifies the initial
boot options for the partition and typically is used when booting alternate images.
Important: The returned char * value for newpartition_id should be freed by the caller
when it is no longer needed to avoid memory leaks.
BGALLOC_STATUS rm_allocate_htc_pool( ... );
The user_name parameter is required. If the caller_desc value is NULL, the caller
description specified on the call to rm_init_allocator is used. If the ignoreBPs parameter is
not NULL, it must be a string of blank-separated base partition identifiers. The base
partitions listed in the parameter are ignored as though the partitions were included in the
drain list file currently in effect.
The pool_id is used as a prefix for generating unique partition names. It must be from 1 to
32 characters. Valid characters are a...z, A...Z, 0...9, -(hyphen), and _ (underscore). If the
user_list parameter is not NULL, the user IDs specified are permitted to run jobs in the
pool.
The bootOptions parameter is optional and can be NULL. Otherwise it specifies the initial
boot options for the partition and is typically used when booting alternate images.
Multiple calls can be made to rm_allocate_htc_pool() with the same pool ID; these calls
allocate additional resources to the pool. The additional resources can use different
parameters such as job mode and users.
BGALLOC_STATUS rm_deallocate_htc_pool(
const unsigned int in_removed,
const char * pool_id,
unsigned * num_removed,
const rm_mode_pref_t mode_pref);
This API deallocates the specified number of nodes from an HTC pool.
The pool_id parameter specifies the name of the pool.
The mode_pref parameter specifies the job mode of the partitions to be deallocated from
the pool. The possible values for this parameter are described in Table 15-1.
Table 15-1 rm_mode_pref_t values
Name                         Description
RM_PREF_SMP_MODE             SMP mode
RM_PREF_DUAL_MODE            Dual mode
RM_PREF_VIRTUAL_NODE_MODE    Virtual node mode
RM_PREF_LINUX_SMP_MODE       Linux/SMP mode
RM_PREF_ANY_MODE             Any mode
The in_removed parameter specifies the number of nodes to remove from the pool. If the
number of nodes to remove is not a multiple of the size of the partitions in the pool
allocated using the specified mode, or if the number is greater than the number of nodes
available to be removed, fewer nodes than in_removed are removed. If zero is specified,
all the nodes in the pool allocated with the specified mode are deallocated.
The value returned in num_removed is the actual number of nodes removed from the pool.
This number might be less than the number of nodes specified by in_removed.
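A hedged sketch of a call that drains all SMP mode nodes from a pool is shown below; the pool name ABC-pool is a hypothetical example, and only the signature and the Table 15-1 values shown above are assumed.
#include <iostream>
#include "allocator_api.h"
int drain_pool_example()
{
unsigned num_removed = 0;
/* in_removed == 0 requests that all nodes allocated to the pool with the
specified mode preference be deallocated. */
BGALLOC_STATUS rc = rm_deallocate_htc_pool( 0, "ABC-pool", &num_removed, RM_PREF_SMP_MODE );
if ( rc != BGALLOC_OK ) {
std::cerr << "rm_deallocate_htc_pool failed" << std::endl;
return 1;
}
std::cout << "removed " << num_removed << " nodes from the pool" << std::endl;
return 0;
}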
The BGALLOC_STATUS return codes for the Dynamic Partition Allocator can be one of the
following types:
BGALLOC_OK: Invocation completed successfully.
BGALLOC_ILLEGAL_INPUT: The input to the API invocation is invalid. This result is due to
missing required data, illegal data, and similar problems.
BGALLOC_ERROR: An error occurred, such as a memory allocation problem or failure on a
low-level call.
BGALLOC_NOT_FOUND: The request to dynamically create a partition failed because required
resources are not available.
BGALLOC_ALREADY_EXISTS: A partition already exists with the name specified. This error
occurs only when the caller indicates a specific name for the new partition.
The following environment variables affect the Dynamic Partition Allocator APIs:
Environment variable    Required  Description
DB_PROPERTY             Yes       Specifies the db.properties file to use.
BRIDGE_CONFIG           Yes       Specifies the bridge.config file to use.
ALLOCATOR_DRAIN_LIST    No        This variable can be set to the path of the base partition drain
                                  list to be used if one is not specified on the call to
                                  rm_init_allocator(). When this variable is not set, the file
                                  /etc/allocator_drain.lst is used as a default if it exists.
BRIDGE_DUMP_XML         No        When set to any value, this variable causes the Bridge APIs to
                                  dump in-memory XML streams to files in /tmp for debugging.
                                  When this variable is not set, the Bridge APIs do not dump
                                  in-memory XML streams.
#include <iostream>
#include <sstream>
#include <cstring>
#include "allocator_api.h"
using std::cout;
using std::cerr;
using std::endl;
int main() {
rm_size3D_t shape;
rm_connection_type_t conn = RM_MESH;
char * ignoreBPs = "R00-M0";
char* new_partition_id;
shape.X = 0;
shape.Y = 0;
shape.Z = 0;
BGALLOC_STATUS alloc_rc;
//set lowest level of verbosity
setSayMessageParams(stderr, MESSAGE_DEBUG1);
alloc_rc = rm_init_allocator("test", NULL);
alloc_rc = rm_allocate_partition(256, conn, shape, RM_SMP_MODE, 0,
"user1",
"New partition description",
NULL,
ignoreBPs,
"ABC-Scheduler*",
&new_partition_id, NULL);
if (alloc_rc == BGALLOC_OK) {
cout << "successfully allocated partition: " << new_partition_id << endl;
free(new_partition_id);
} else {
cerr << "could not allocate partition: " << endl;
if (alloc_rc == BGALLOC_ILLEGAL_INPUT) {
cerr << "illegal input" << endl;
} else if (alloc_rc == BGALLOC_ERROR) {
cerr << "unknown error" << endl;
} else if (alloc_rc == BGALLOC_NOT_FOUND) {
cerr << "not found" << endl;
} else if (alloc_rc == BGALLOC_ALREADY_EXISTS) {
cerr << "partition already exists" << endl;
} else {
cerr << "internal error" << endl;
}
}
}
Example 15-2 shows the commands used to compile and link the sample program.
Example 15-2 compile and link commands
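The exact commands depend on the installation. The following is only a minimal sketch, assuming the GNU g++ compiler on the front end node, the include directory and 64-bit library directory listed in 15.2, and a hypothetical source file named allocator_example.cc:
g++ -m64 -I/bgsys/drivers/ppcfloor/include -c allocator_example.cc -o allocator_example.o
g++ -m64 -o allocator_example allocator_example.o -L/bgsys/drivers/ppcfloor/lib64 -lbgpallocator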
Part 5. Applications
In this part, we discuss applications that are being used on the IBM Blue Gene/L or IBM Blue
Gene/P system. This part includes Chapter 16, Performance overview of engineering and
scientific applications on page 305.
Chapter 16. Performance overview of engineering and scientific applications
In this chapter, we briefly describe a series of scientific and engineering applications that are
currently being used on either the Blue Gene/L or Blue Gene/P system. For a comprehensive
list of applications, refer to the IBM Blue Gene Web page at:
http://www-03.ibm.com/servers/deepcomputing/bluegene/siapps.html
The examples in this chapter emphasize the benefits of using the Blue Gene supercomputer
as a highly scalable parallel system. They present results for running applications in various
modes that exploit the architecture of the system. We discuss the following topics:
IBM Blue Gene/P system from an applications perspective
Chemistry and life sciences applications
Figure 16-1 High-performance computing landscape for selected scientific and engineering
applications
In the rest of this chapter, we summarize the performance that has been recorded in the
literature for a series of applications in life sciences and materials science. A comprehensive
list of applications is available for the Blue Gene/L and Blue Gene/P systems. For more
information, see the IBM Blue Gene Applications Web page at:
http://www-03.ibm.com/servers/deepcomputing/bluegene/siapps.html
AMBER
AMBER47 is the collective name for a suite of programs that are developed by the Scripps
Research Institute. With these programs, users can carry out molecular dynamics
simulations, particularly on biomolecules. The primary AMBER module, called sander, was
designed to run on parallel systems and provides direct support for several force fields for
proteins and nucleic acids. AMBER includes an extensively modified version of sander, called
pmemd (particle mesh). For complete information about AMBER as well as benchmarks, refer
to the AMBER Web site at:
http://amber.scripps.edu/
For implicit solvent (continuum) models, which rely on variations of the Poisson equation of
classical electrostatics, AMBER offers the Generalized Born (GB) method. This method uses
an approximation to the Poisson equation that can be solved analytically and allows for good
scaling. In Figure 16-3, the experiment is with an implicit solvent (GB) model of 120,000
atoms (Aon benchmark).
Figure 16-3 Parallel scaling of AMBER on the IBM Blue Gene/L system (parallel speedup as a function of the number of processors for the GB model, compared to linear scaling)
AMBER also incorporates the PME algorithm, which takes the full electrostatic interactions
into account to improve the performance of electrostatic force evaluation (see Figure 16-4). In
Figure 16-4, the experiment is with an explicit solvent (PME) model of 290,000 atoms
(Rubisco).
Figure 16-4 Parallel scaling of AMBER with the PME method on the IBM Blue Gene/L system (parallel speedup as a function of the number of processors, compared to linear scaling)
Blue Matter
Blue Matter48 is a classical molecular dynamics application that has been under development
as part of the IBM Blue Gene project. The effort serves two purposes:
Enables scientific work in the area of biomolecular simulation that IBM announced in
December 1999.
Acts as an experimental platform for the exploration of programming models and
algorithms for massively parallel machines in the context of a real application.
Blue Matter has been implemented via spatial-force decomposition for N-body simulations
using the PME method for handling electrostatic interactions. The Ewald summation method
and particle mesh techniques are approximated by a finite range cut-off and a reciprocal
space portion for the charge distribution. This is done in Blue Matter via the
Particle-Particle-Particle-Mesh (P3ME) method.49
The results presented by Fitch et al.50 show impressive scalability on the Blue Gene/L
system. Figure 16-5 on page 311 shows scalability as a function of the number of nodes. It
illustrates the performance in time per time step as a function of the number of processors for
four systems. The β-hairpin system contains a total of 5,239 atoms. SOPE contains 13,758 atoms;
in this case, the timings that are reported here correspond to a 64³ FFT. Rhodopsin contains
43,222 atoms, and ApoA1 contains 92,224 atoms. All runs were carried out using the P3ME method,
which was implemented in Blue Matter at constant particle number, volume, and energy (NVE).52
Figure 16-5 Performance in time/time step as a function of number of processors (from Fitch, et al.51 )
LAMMPS
Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS)53 is an MD program
from Sandia National Laboratories that is designed specifically for MPP. LAMMPS is
implemented in C++ and is distributed freely as open-source software under the GNU Public
License (GPL).54 LAMMPS can model atomic, polymeric, biological, metallic, or granular
systems using a variety of force fields and boundary conditions. The parallel efficiency of
LAMMPS varies with the size of the benchmark data and the number of steps being
simulated. In general, LAMMPS can scale to more processors on larger systems (see
Figure 16-6).
Figure 16-6 Parallel scaling of LAMMPS on Blue Gene/L (1M System: 1-million atom scaled rhodopsin;
4M System: 4-million atom scaled rhodopsin)
For a one-million atom system, LAMMPS can scale up to 4,096 nodes. A larger system, such as a
four-million atom system, can also scale up to 4,096 nodes. As the size of the system increases,
scalability improves.
NAMD
NAMD is a parallel molecular dynamics application that was developed for high-performance
calculations of large biological molecular systems.55 NAMD supports the force fields used by
AMBER, CHARMM,56 and X-PLOR57 and is also file compatible with these programs. This
commonality allows simulations to migrate between these four programs. The C++ source for
NAMD and Charm++ are freely available from UIUC. For additional information about NAMD,
see the official NAMD Web site at:
http://www.ks.uiuc.edu/Research/namd/
NAMD incorporates the PME algorithm, which takes the full electrostatic interactions into
account and reduces computational complexity. To further reduce the cost of the evaluation of
long-range electrostatic forces, a multiple time step scheme is employed. The local
interactions (bonded, van der Waals, and electrostatic interactions within a specified distance)
are calculated at each time step. The longer range interactions (electrostatic interactions
beyond the specified distance) are computed less often. An incremental load balancer
monitors and adjusts the load during the simulation.
Due to the good balance of network and processor speed of the Blue Gene system, NAMD is
able to scale to large processor counts (see Figure 16-7). While scalability is affected by
many factors, many simulations can make use of multiple Blue Gene racks. Work by Kumar et
al.58 has reported scaling up to 8192 processors. Timing comparisons often use the
benchmark time metric instead of wall clock time to completion. The benchmark time metric
omits setup, I/O, and load balance overhead. While benchmark scaling can be considered a
guide to what is possible, ideal load balance and I/O parameters for each case must be found
for the wall clock time to scale similarly. Careful consideration of these parameters might be
necessary to achieve the best scalability.
Figure 16-7 Parallel speedup on the Blue Gene/L system for the NAMD standard apoA1 benchmark
Docking programs place molecules into the active site of the receptor (or target biomolecule)
in a noncovalent fashion and then rank them by the ability of the small molecules to interact
with the receptor.61 An extensive family of molecular docking software packages is
available.62
DOCK is an open-source molecular docking software package that is frequently used in
structure-based drug design.63 The computational aspects of this program can be divided
into two parts. The first part consists of the ligand atoms located inside the cavity or binding
pocket of a receptor, which is a large biomolecule. This step is carried out by a search
algorithm.64 The second part corresponds to scoring or identifying the most favorable
interactions, which is normally done by means of a scoring function.65
The latest version of the DOCK software package is Version 6.1. However, in our work, we
used Version 6.0. This version is written in C++ to exploit code modularity and has been
parallelized using the Message Passing Interface (MPI) paradigm. DOCK V6.0 is parallelized
using a master-worker scheme.66 The master handles I/O and tasks management, while
each worker is given an individual molecule to perform simultaneous independent docking.67
Recently, Peters, et al. have shown that DOCK6 is well suited for doing virtual screening on
the Blue Gene/L or Blue Gene/P system.68 Figure 16-8 shows the receptor HIV-1 reverse
transcriptase in complex with nevirapine as used and described in the Official UCSF DOCK
Web site. The ligand library corresponds to a subset of 27,005 drug-like ligands from the
ZINC database.69 The scalability of the parallel version of the code is illustrated by
constructing a set of ligands with 128,000 copies of nevirapine as recommended in the
Official UCSF DOCK Web site to remove dependence on the order and size of the
compound. You can find this Web site at:
http://dock.compbio.ucsf.edu
In Figure 16-8, the original code is the dark bar. Sorting by total number of atoms per ligand is
represented by the bar with horizontal lines. Sorting by total number of rotatable bonds per
ligand is represented by the white bar.70
Figure 16-8 The effect of load-balancing optimization for 27,005 ligands on 2048 processors
HMMER
For a complete discussion of hidden Markov models, refer to the work by Krogh et al.77
HMMER V2.3.2 consists of nine different programs: hmmalign, hmmbuild, hmmcalibrate,
hmmconvert, hmmemit, hmmfetch, hmmindex, hmmpfam, and hmmsearch.78 Out of these
nine programs, hmmcalibrate, hmmpfam, and hmmsearch have been parallelized.
hmmcalibrate is used to identify statistical significance parameters for profile HMM. hmmpfam
is used to search a profile HMM database, and hmmsearch is used to carry out sequence
database searches.79
The first module tested corresponds to hmmcalibrate. Figure 16-9 summarizes the
performance of this module up to 2048 nodes.80 Although this module was not optimized, the
parallel efficiency is still 75% on 2048 nodes. The graph in Figure 16-9 illustrates the
performance of hmmcalibrate using only the first 327 entries in the Pfam database.81
Figure 16-9 hmmcalibrate parallel performance using the first 327 entries of the Pfam database
Figure 16-10 on page 316 illustrates the work presented by Jiang, et al.80 for optimizing
hmmsearch parallel performance using 50 proteins of the globin family from different
organisms and the UniProt release 8 database. For each processor count, the left bar shows
the original PVM to MPI port. Notice scaling stops at 64 nodes. The second bar shows the
multiple master implementation. The third bar shows the dynamic data collection
implementation, and the right bar shows the load balancing implementation.
mpiBLAST-PIO
mpiBLAST is an open-source parallelization of BLAST that uses MPI.83 One of the key
features of the initial parallelization of mpiBLAST is its ability to fragment and distribute
databases.
Thorsen et al.84 compared queries derived from Arabidopsis thaliana, a model organism for
studying plant genetics. The query was further subdivided into small, medium, and large
query sets that contain 200, 1,168, and 28,014 sequences, respectively.
Figure 16-11 on page 317 illustrates the results of comparing three queries of three different
sizes. We labeled them small, medium, and large. The database corresponds to NR. This
figure shows that scalability is a function of the query size. The small query scales to
approximately 1,024 nodes in coprocessor mode with a parallel efficiency of 72%, whereas the
large query scales to 8,192 nodes with a parallel efficiency of 74%.
From the top of Figure 16-11, the thick solid line corresponds to ideal scaling. The thin solid
line corresponds to the large query. The dashed line corresponds to the medium query. The
dotted line corresponds to the small query.
Figure 16-11 Scaling chart for queries run versus the nr database
Latency
The amount of time it takes for the first byte sent from one node to
reach its target node
These two values provide information about communication. In this section, we illustrate two
simple cases. The first case corresponds to a benchmark that involves a single transfer. The
second case corresponds to a collective as defined in the Intel MPI Benchmarks. Intel MPI
Benchmarks is formerly known as Pallas MPI Benchmarks - PMB-MPI1 (for MPI1 standard
functions only). Intel MPI Benchmarks - MPI1 provides a set of elementary MPI benchmark
kernels.
For more details, see the product documentation included in the package that you can
download from the Web at:
http://www.intel.com/cd/software/products/asmo-na/eng/219848.htm
The Intel MPI Benchmarks kernel or elementary set of benchmarks was reported as part of
Unfolding the IBM eServer Blue Gene Solution, SG24-6686. Here we describe and perform
the same benchmarks. You can run all of the supported benchmarks, or just a subset,
specified through the command line. The rules, such as time measurement, message
lengths, selection of communicators to run a particular benchmark, are program parameters.
For more information, see the product documentation that is included in the package, which
you can download from the Web at:
http://www.intel.com/software/products/cluster/mpi/mpi_benchmarks_lic.htm
This set of benchmarks has the following objectives:
Provide a concise set of benchmarks targeted at measuring important MPI functions:
point-to-point message-passing, global data movement and computation routines, and
one-sided communications and file I/O
Set forth precise benchmark procedures: run rules, set of required results, repetition
factors, and message lengths
Avoid imposing an interpretation on the measured results: execution time, throughput, and
global operations performance
(mpirun -nofree -timeout 300 -verbose 1 -np 512 -mode SMP -partition R01-M1 -cwd
/bgusr/BGTH_BGP/test512nDD2BGP/pallas/pall512DD2SMP/bgpdd2sys1-R01-M1 -exe
/bgusr/BGTH_BGP/test512nDD2BGP/pallas/pall512DD2SMP/bgpdd2sys1-R01-M1/IMB-MPI1.4MB
.perf.rts -args "-msglen 4194304.txt -npmin 512 PingPong" | tee
IMB-MPI1.4MB.perf.PingPong.4194304.512.out) >>
run.IMB-MPI1.4MB.perf.PingPong.4194304.512.out 2>&1
Figure 16-12 on page 319 shows the bandwidth on the torus network as a function of the
message size, for one simultaneous pair of nearest neighbor communications. The protocol
switch from short to eager is visible in these two cases, where the eager to rendezvous switch
is most pronounced on the Blue Gene/L system. This figure also shows the improved
performance on the Blue Gene/P system. Notice also in Figure 16-12 that the diamonds
correspond to the Blue Gene/P system and the asterisks (*) correspond to the Blue Gene/L
system.
Figure 16-12 Bandwidth in MB/s as a function of message size for a single pair of nearest-neighbor communications on the Blue Gene/L and Blue Gene/P systems
(mpirun -nofree -timeout 300 -verbose 1 -np 512 -mode SMP -partition R01-M1 -cwd
/bgusr/BGTH_BGP/test512nDD2BGP/pallas/pall512DD2SMP/bgpdd2sys1-R01-M1 -exe
/bgusr/BGTH_BGP/test512nDD2BGP/pallas/pall512DD2SMP/bgpdd2sys1-R01-M1/IMB-MPI1.4MB
.perf.rts -args "-msglen 4194304.txt -npmin 512 Allreduce" | tee
IMB-MPI1.4MB.perf.Allreduce.4194304.512.out) >>
run.IMB-MPI1.4MB.perf.Allreduce.4194304.512.out 2>&1
Collective operations are more efficient on the Blue Gene/P system. You should try to use
these operations instead of point-to-point communication wherever possible. The overhead
for point-to-point communications is much larger than those for collectives. Unless all your
point-to-point communication is purely the nearest neighbor, it is also difficult to avoid network
congestion on the torus network.
Alternatively, collective operations can use the barrier (global interrupt) network or the torus
network. If they run over the torus network, they can still be optimized by using specially
designed communication patterns that achieve optimum performance. Doing this manually
with point-to-point operations is possible in theory, but in general, the implementation in the
Blue Gene/P MPI library offers superior performance.
With point-to-point communication, the goal of reducing the point-to-point Manhattan
distances necessitates a good mapping of MPI tasks to the physical hardware. For
collectives, mapping is equally important because most collective implementations prefer
certain communicator shapes to achieve optimum performance. The technique of mapping is
illustrated in Appendix F, Mapping on page 355.
Similar to point-to-point communications, collective communications also work best if you do
not use complicated derived data types and if your buffers are aligned to 16-byte boundaries.
While the MPI standard explicitly allows for MPI collective communications to occur at the
same time as point-to-point communications (on the same communicator), we generally do
not recommend that you allow this to happen for performance reasons.
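The following small sketch illustrates these guidelines: a built-in datatype, MPI_COMM_WORLD, and buffers aligned on 16-byte boundaries. It is an illustration only and assumes that posix_memalign() is available to the application; error checking is omitted.
#include <mpi.h>
#include <stdlib.h>
/* Sum n doubles across all ranks using a built-in datatype on MPI_COMM_WORLD,
with both buffers aligned on 16-byte boundaries. */
void sum_across_ranks(int n)
{
double *local, *global;
int i;
posix_memalign((void **) &local, 16, n * sizeof(double));
posix_memalign((void **) &global, 16, n * sizeof(double));
for (i = 0; i < n; i++) {
local[i] = 1.0;
}
MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
free(local);
free(global);
}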
Table 16-1 summarizes the MPI collectives that have been optimized on the Blue Gene/P
system, together with their performance characteristics when executed on the various
networks of the Blue Gene/P system.
Table 16-1 MPI collectives optimized on the Blue Gene/P system
MPI routine      Condition                        Network                             Performance
MPI_Barrier      MPI_COMM_WORLD                   Barrier (global interrupt) network  1.2 µs
MPI_Barrier      Any communicator                 Torus network                       30 µs
MPI_Broadcast    MPI_COMM_WORLD                   Collective network                  817 MBps
MPI_Broadcast    Rectangular communicator         Torus network                       934 MBps
MPI_Allreduce    MPI_COMM_WORLD, fixed point      Collective network                  778 MBps
MPI_Allreduce    MPI_COMM_WORLD, floating point   Collective network                  98 MBps
MPI_Alltoall[v]  Any communicator                 Torus network                       84-97% of peak
MPI_Allgatherv   N/A                              Torus network                       Same as broadcast
Figure 16-13 Comparison of MPI_Allreduce() on the Blue Gene/L and Blue Gene/P systems (time in µs as a function of message size)
Figure 16-13 shows a comparison between the Blue Gene/L and Blue Gene/P systems for
the MPI_Allreduce() type of communication.
Figure 16-14 illustrates the performance of the barrier on Blue Gene/P for up to 32 nodes.
Figure 16-14 Barrier performance on the Blue Gene/P system
Part 6. Appendixes
In this part, we provide additional information about system administration for the IBM Blue
Gene/P system. This part includes the following appendixes:
Appendix A.
Figure A-1 shows the conventions used when assigning locations to all hardware except the
various cards in a Blue Gene/P system. Using the charts and diagrams that follow, consider
an example where you have an error in the fan named R23-M1-A3-0. This naming convention
tells you where to look for the error. In the upper-left corner of Figure A-1, you see that racks
use the convention Rxx. Looking at our error message, we can see that the rack involved is
R23. From the chart in Figure A-1, we see that R23 is the fourth rack in row two. (Remember
that all numbering starts with 0). The bottom midplane of any rack is 0. Therefore, we are
dealing with the top midplane (R23-M1).
In the chart, you can see in the fan assemblies description that assemblies 0-4 are on the
front of the rack, bottom to top, respectively. Therefore, we check for an attention light (Amber
LED) on the fan assembly second from the top, because the front-most fan is the one that is
causing the error message to surface. Service, link, and node cards use a similar form of
addressing.
Figure A-1 Naming conventions for hardware locations (except cards):
- Racks: Rxx, where the first x is the rack row (0-F) and the second x is the rack column (0-F)
- Midplanes: Rxx-Mx, where M is the midplane (0 = bottom, 1 = top)
- Power cables: Rxx-B-C, where B is the bulk power supply and C is the power cable
- Power modules: Rxx-B-Px, where B is the bulk power supply and P is the power module (0-7;
  0-3 left to right facing the front, 4-7 left to right facing the rear)
- Fans: Rxx-Mx-Ax-Fx, where A is the fan assembly and F is the fan
Figure A-2 shows the conventions used for the various card locations.
Figure A-2 Naming conventions for card locations:
- Service cards: Rxx-Mx-S, where M is the midplane (0 = bottom, 1 = top) and S is the service card
- Compute cards: Rxx-Mx-Nxx-Jxx, where N is the node card (00-15) and J is the compute card
  (04 through 35)
- I/O cards: Rxx-Mx-Nxx-Jxx
- Node cards: Rxx-Mx-Nxx, where N is the node card (00-15; 00 = bottom front, 07 = top front,
  08 = bottom rear, 15 = top rear)
- Link cards: Rxx-Mx-Lx, where L is the link card (0 = bottom front, 1 = top front,
  2 = bottom rear, 3 = top rear)
Table A-1 contains examples of various hardware conventions. The figures that follow the
table provide illustrations of the actual hardware.
Table A-1 Examples of hardware-naming conventions
Card            Element     Name             Example
Compute         Card                         R23-M10-N02-J09
I/O             Card                         R57-M1-N04-J00
Compute         Module      U00              R23-M0-N13-J08-U00
Link            Module                       R32-M0-L2_U03
Link            Port        TA through TF    R01-M0-L1-U02-TC
Link            Connector                    R21-M1-L2-J13
Node Ethernet   Connector   EN0, EN1         R16-M1-N14-EN1
Service         Connector                    R05-M0-S-Control FPGA
Clock           Connector                    R13-K-Output 3
Note: The fact that Figure A-3 shows numbers 00 through 77 does not imply that this
configuration is the largest possible. The largest configuration possible is 256 racks
numbered 00 through FF.
Figure A-4 identifies each of the cards in a single midplane.
Figure A-4 Cards in a single midplane (node cards N00 through N15, link cards L0 through L3, and the service card S)
Figure A-5 shows a diagram of a node card. On the front of the card are Ethernet ports EN0
and EN1. The first nodes behind the Ethernet ports are the I/O Nodes. In this diagram, the
node card is fully populated with I/O Nodes, meaning that it has two I/O Nodes. Behind the
I/O Nodes are the Compute Nodes.
Figure A-5 Node card (Ethernet ports EN0 and EN1, node positions J00 through J35, and the control network, control FPGA, and clock connectors)
Figure A-7 shows the link card. The locations identified as J00 through J15 are the link card
connectors. The link cables are routed from one link card to another to form the torus network
between the midplanes.
Figure A-7 Link card (link modules U00 through U05 and link connectors J00 through J15)
Figure A-8 shows the clock card. If the clock is a secondary or tertiary clock, a cable comes to
the input connector on the far right. Next to the input (just to the left) is the master and worker
toggle switch. All clock cards are built with the capability of filling either role. If the clock is a
secondary or tertiary clock, this must be set to worker. Output zero through nine can be used
to send signals to midplanes throughout the system.
Figure A-8 Clock card
Appendix B. Files on architectural features
#include <mpi.h>
#include <stdio.h>
#include <spi/kernel_interface.h>
#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>
int main(int argc, char * argv[])
{
int taskid, ntasks;
int memory_size_MBytes;
_BGP_Personality_t personality;
int torus_x, torus_y, torus_z;
int pset_size, pset_rank, node_config;
int xsize, ysize, zsize, procid;
char location[128];
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
Kernel_GetPersonality(&personality, sizeof(personality));
if (taskid == 0)
{
memory_size_MBytes = personality.DDR_Config.DDRSizeMB;
printf("Memory size = %d MBytes\n", memory_size_MBytes);
node_config = personality.Kernel_Config.ProcessConfig;
if (node_config == _BGP_PERS_PROCESSCONFIG_SMP) printf("SMP mode\n");
else if (node_config == _BGP_PERS_PROCESSCONFIG_VNM) printf("Virtual-node mode\n");
else if (node_config == _BGP_PERS_PROCESSCONFIG_2x2) printf("Dual mode\n");
else
printf("Unknown mode\n");
printf("number of MPI tasks = %d\n", ntasks);
xsize = personality.Network_Config.Xnodes;
ysize = personality.Network_Config.Ynodes;
zsize = personality.Network_Config.Znodes;
pset_size = personality.Network_Config.PSetSize;
pset_rank = personality.Network_Config.RankInPSet;
printf("number of processors in the pset = %d\n", pset_size);
printf("torus dimensions = <%d,%d,%d>\n", xsize, ysize, zsize);
}
torus_x = personality.Network_Config.Xcoord;
torus_y = personality.Network_Config.Ycoord;
torus_z = personality.Network_Config.Zcoord;
BGP_Personality_getLocationString(&personality, location);
procid = Kernel_PhysicalProcessorID();
/*-----------------------------------------------*/
/* print torus coordinates and the node location */
/*-----------------------------------------------*/
printf("MPI rank %d has torus coords <%d,%d,%d> cpu = %d, location = %s\n",
taskid, torus_x, torus_y, torus_z, procid, location);
MPI_Finalize();
return 0;
}
Example B-2 illustrates the makefile that is used to build personality.c. This particular file uses
the GNU compiler.
Example: B-2 Makefile to build the personality.c program
BGP_FLOOR = /bgsys/drivers/ppcfloor
BGP_IDIRS = -I$(BGP_FLOOR)/arch/include

CC        = /bgsys/drivers/ppcfloor/comm/bin/mpicc

EXE       = personality
OBJ       = personality.o
SRC       = personality.c
FLAGS     =
FLD       =

$(EXE): $(OBJ)
	${CC} $(FLAGS) -o $(EXE) $(OBJ) $(BGP_LIBS)

$(OBJ): $(SRC)
	${CC} $(FLAGS) $(BGP_IDIRS) -c $(SRC)
pset = 32
<0,0,0> cpu = 0, location = R00-M0-N04-J23
<0,0,0> cpu = 1, location = R00-M0-N04-J23
<0,0,0> cpu = 2, location = R00-M0-N04-J23
<0,0,0> cpu = 3, location = R00-M0-N04-J23
<1,0,0> cpu = 0, location = R00-M0-N04-J04
<1,0,0> cpu = 1, location = R00-M0-N04-J04
<1,0,0> cpu = 2, location = R00-M0-N04-J04
<1,0,0> cpu = 3, location = R00-M0-N04-J04
Example B-4 illustrates running personality with XYZT mapping for a comparison. Notice
that the output has been ordered by MPI rank for readability.
Example: B-4 Output generated with XYZT mapping
pset = 32
<0,0,0> cpu = 0, location = R00-M0-N04-J23
<1,0,0> cpu = 0, location = R00-M0-N04-J04
<2,0,0> cpu = 0, location = R00-M0-N04-J12
<3,0,0> cpu = 0, location = R00-M0-N04-J31
<0,1,0> cpu = 0, location = R00-M0-N04-J22
<1,1,0> cpu = 0, location = R00-M0-N04-J05
<2,1,0> cpu = 0, location = R00-M0-N04-J13
<3,1,0> cpu = 0, location = R00-M0-N04-J30
Appendix C.
Description
mpe_thread.h
mpicxx.h
mpif.h
mpi.h
MPI C defines
mpiof.h
mpio.h
mpix.h
mpido_properties.h
mpi.mod,
mpi_base.mod,
mpi_constants.mod,
mpi_sizeofs.mod
F90 bindings
opa_config.h,
opa_primitives.h,
opa_queue.h,
opa_util.h
Description
dcmf.h
dcmf_collectives.h
File name
Description
bgp_personality.h
Defines personality
bgp_personality_inlines.h
bgp_personalityP.h
Table C-4 describes the 32-bit static and dynamic libraries in the
/bgsys/drivers/ppcfloor/comm/default/lib directory and the
/bgsys/drivers/ppcfloor/comm/fast/lib directory. There are links to the default version of the
libraries in /bgsys/drivers/ppcfloor/comm/lib for compatibility with previous releases of the
Blue Gene/P software.
Table C-4 32-bit static and dynamic libraries in /bgsys/drivers/ppcfloor/comm/default/lib/
File name
Description
libmpich.cnk.a,
libmpich.cnk.so
libcxxmpich.cnk.a,
libcxxmpich.cnk.so
libfmpich.cnk.a,
libfmpich.cnk.so
libfmpich_.cnk.a
libmpich.cnkf90.a,
libmpich.cnkf90.so
Fortran 90 bindings
libopa.a
libtvmpich2.so
Table C-5 describes the 32-bit static and dynamic libraries in the
/bgsys/drivers/ppcfloor/comm/sys directory. There are links to the default version of the
libraries in /bgsys/drivers/ppcfloor/comm/lib for compatibility with previous releases of the
Blue Gene/P software.
Table C-5 32-bit static and dynamic libraries in /bgsys/drivers/ppcfloor/comm/sys
File name
Description
libdcmf.cnk.a,
libdcmf.cnk.so
libdcmfcoll.cnk.a,
libdcmfcoll.cnk.so
libdcmf-fast.cnk.a
libdcmfcoll-fast.cnk.a
Fast version of the common BGP message layer interface for general
collectives in C
Description
allocator_api.h
File name
Description
attach_bg.h
rm_api.h
rt_api.h
sayMessage.h
sched_api.h
submit_api.h
Table C-7 describes the 64-bit dynamic libraries available to resource management
applications. They are located in the /bgsys/drivers/ppcfloor/lib64 directory.
Table C-7 64-bit dynamic libraries for resource management APIs
File Name
Description
libbgpallocator.so
libbgrealtime.so
libbgpbridge.so
libsaymessage.so
Appendix D. Environment variables
In this appendix, we describe the environment variables that the user can change to affect the
run time characteristics of a program that is running on the IBM Blue Gene/P compute nodes.
Changes are usually made in an attempt to improve performance, although on occasion the
goal is to modify functional attributes of the application.
In this appendix, we discuss the following topics:
Setting environment variables
Blue Gene/P MPI environment variables
Compute Node Kernel environment variables
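As a quick illustration of the first topic, environment variables are typically passed to the job on the mpirun command line. The following is a hedged sketch only: it assumes that the installed mpirun supports the -env option, and the partition name, executable path, and variable setting are placeholder values.
mpirun -partition R01-M1 -np 512 -mode SMP \
   -env DCMF_SAFEALLGATHER=Y \
   -exe /bgusr/username/myapp.rts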
root's buffer be misaligned from the rest of the nodes. Therefore, by default we must do an
allreduce before dput bcasts to ensure all nodes have the same alignment. If you know all
of your buffers are 16 byte aligned, turning on this option will skip the allreduce step.
Possible values:
N: Perform the allreduce
Y: Bypass the allreduce. If you have mismatched alignment, you will likely get weird
behavior or asserts.
Default is N.
DCMF_SAFEALLGATHER: The optimized allgather protocols require contiguous
datatypes and similar datatypes on all nodes. To verify this is true, we must do an
allreduce at the beginning of the allgather call. If the application uses well-behaved
datatypes, you can set this option to skip over the allreduce. This is most useful in irregular
subcommunicators where the allreduce can be expensive. Possible values:
N: Perform the allreduce.
Y: Skip the allreduce. Setting this with unsafe datatypes will yield unpredictable results,
usually hangs.
Default is N.
DCMF_SAFEALLGATHERV: The optimized allgatherv protocols require contiguous
datatypes and similar datatypes on all nodes. Allgatherv also requires continuous
displacements. To verify this is true, we must do an allreduce at the beginning of the
allgatherv call. If the application uses well-behaved datatypes and displacements, you can
set this option to skip over the allreduce. This is most useful in irregular subcommunicators
where the allreduce can be expensive. Possible values:
N: Perform the allreduce.
Y: Skip the allreduce. Setting this with unsafe datatypes will yield unpredictable results,
usually hangs.
Default is N.
DCMF_SAFESCATTERV: The optimized scatterv protocol requires contiguous datatypes
and similar datatypes on all nodes. It also requires continuous displacements. To verify
this is true, we must do an allreduce at the beginning of the scatterv call. If the application
uses well-behaved datatypes and displacements, you can set this option to skip over the
allreduce. This is most useful in irregular subcommunicators where the allreduce can be
expensive. Possible values:
N: Perform the allreduce.
Y: Skip the allreduce. Setting this with unsafe datatypes will yield unpredictable results,
usually hangs.
Default is N.
DCMF_ALLTOALL, DCMF_ALLTOALLV, or DCMF_ALLTOALLW: Controls the protocol
used for alltoall/alltoallv/alltoallw. Possible values:
MPICH: Turn off all optimizations and use the MPICH point-to-point protocol.
Default (or if anything else is specified) is to use an optimized alltoall/alltoallv/alltoallw.
DCMF_ALLTOALL_PREMALLOC, DCMF_ALLTOALLV_PREMALLOC, or
DCMF_ALLTOALLW_PREMALLOC: These are equivalent options. The alltoall protocols
require six arrays to be set up before communication begins. Each of these arrays is of size
(comm_size), so they can be sizeable on large machines. If your application does not use
alltoall, or you need as much memory as possible, you can turn off pre-allocating these
arrays. By default, we allocate them once per communicator creation. There is only one
set, regardless of whether you are using alltoall, alltoallv, or alltoallw. Possible values:
Y: Pre-allocate the arrays when a communicator is created.
N: Do not pre-allocate the arrays.
Default is Y.
DCMF_BCAST: Controls the protocol used for broadcast. Possible values:
TDPUT: Use the tree dput protocol. This is the default in virtual node mode on
MPI_COMM_WORLD
TREE: Use the tree. This is the default in SMP mode on MPI_COMM_WORLD
DPUT: Use the rectangular direct put protocol
PIPE: Use the pipelined CCMI tree protocol
BINOM: Use a binomial protocol
DCMF_ALLREDUCE: Controls the protocol used for allreduce. Possible values:
MPICH: Turn off all optimizations for allreduce and use the MPICH point-to-point
protocol.
RING: Use a rectangular ring protocol. This is the default for rectangular
subcommunicators.
RECT: Use a rectangular/binomial protocol. This is off by default.
BINOM: Use a binomial protocol. This is the default for irregular subcommunicators.
TREE: Use the collective network. This is the default (except for GLOBAL between 512
and 8K) for MPI_COMM_WORLD and duplicates of MPI_COMM_WORLD in
MPI_THREAD_SINGLE mode.
GLOBAL: Use the global collective network protocol for sizes between 512 and 8K.
Otherwise this defaults the same as TREE.
CCMI: Use the CCMI collective network protocol. This is off by default.
PIPE: Use the pipelined CCMI collective network protocol. This is off by default.
ARECT: Enable the asynchronous rectangle protocol.
ABINOM: Enable the asynchronous binomial protocol.
ARING: Enable the asynchronous version of the rectangular ring protocol.
TDPUT: Use the tree+direct put protocol. This is the default for VNM on
MPI_COMM_WORLD
DPUT: Use the rectangular direct put protocol. This is the default for large messages
on rectangular subcomms and MPI_COMM_WORLD
Default varies based on the communicator and message size and if the
operation/datatype pair is supported on the tree hardware.
DCMF_ALLREDUCE_REUSE_STORAGE: This allows the lower level protocols to reuse
some storage instead of malloc/free on every allreduce call. Possible values:
Y: Does not malloc/free on every allreduce call. This improves performance, but retains
malloc'd memory between allreduce calls.
N: Malloc/free on every allreduce call. This frees up storage for use between allreduce
calls.
Default is Y.
DCMF_ALLREDUCE_REUSE_STORAGE_LIMIT: This specifies the upper limit of storage
to save and reuse across allreduce calls when DCMF_ALLREDUCE_REUSE_STORAGE
is set to Y. (This environment variable is processed within the DCMF_Allreduce_register()
API, not in MPIDI_Env_setup().):
Default is 1048576 bytes.
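For orientation, the following illustrative sketch shows a typical allreduce call; which of the
protocols above is chosen for it depends, as noted, on the communicator, the message size, and
whether the operation/datatype pair (here MPI_SUM on MPI_DOUBLE) is supported on the tree
hardware:

   #include <mpi.h>

   /* Global sum across MPI_COMM_WORLD; the protocol used for this call is
    * chosen according to the rules described above. */
   double global_sum(double local_value)
   {
       double global_value = 0.0;
       MPI_Allreduce(&local_value, &global_value, 1,
                     MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
       return global_value;
   }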
DCMF_STAR_CHECK_CALLSITE: Turns on a sanity check that makes sure all ranks are
involved in the same collective call site. This is important for rooted collectives (Bcast,
Reduce, Gather, and so on), where the call site on the root might differ from the call site on
non-root ranks (for example, different if statements).
Possible values: 0 (off) or 1 (on). Default is 1 (on).
DCMF_STAR_VERBOSE: Turns on verbose output from STAR-MPI by writing to an output file
named in the form "exec_name-star-rank#.log". This is turned off by default.
Possible values: 0 (off) or 1 (on). Default is 0 (off).
DCMF_RECFIFO: The size, in bytes, of each DMA reception FIFO. Incoming torus
packets are stored in this fifo until DCMF Messaging can process them. Making this larger
can reduce torus network congestion. Making this smaller leaves more memory available
to the application. DCMF Messaging uses one reception FIFO. The value specified is
rounded up to the nearest 32-byte boundary:
Default is 8388608 bytes (8 megabytes).
DCMF_INJFIFO: The size, in bytes, of each DMA injection FIFO. These FIFOs store
32-byte descriptors, each describing a memory buffer to be sent on the torus. Making this
larger can reduce overhead when there are many outstanding messages. Making this
smaller can increase that overhead. DCMF Messaging uses 15 injection FIFOs in
DEFAULT and RZVANY mode, and 25 injection FIFOs in ALLTOALL mode (refer to
DCMF_FIFOMODE). The value given is rounded up to the nearest 32-byte boundary:
Default is 32768 (32 kilobytes).
DCMF_RGETFIFO: The size, in bytes, of each DMA remote get FIFO. These FIFOs store
32-byte descriptors, each describing a memory buffer to be sent on the torus, and are
used to queue requests for data (remote gets). Making this larger can reduce torus
network congestion and reduce overhead. Making this smaller can increase that
congestion and overhead. DCMF Messaging uses 7 remote get FIFOs in DEFAULT and
ALLTOALL mode, and 13 remote get FIFOs in RZVANY mode (refer to
DCMF_FIFOMODE). The value given is rounded up to the nearest 32-byte boundary:
Default is 32768 (32 kilobytes).
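The 32-byte rounding that applies to DCMF_RECFIFO, DCMF_INJFIFO, and DCMF_RGETFIFO is the
usual round-up-to-a-multiple idiom; the following small C sketch (illustrative only) shows the
effect:

   #include <stdio.h>

   /* Round a byte count up to the next multiple of 32. */
   static unsigned long round_up_32(unsigned long bytes)
   {
       return (bytes + 31UL) & ~31UL;
   }

   int main(void)
   {
       /* For example, a requested size of 1000 bytes becomes 1024 bytes. */
       printf("%lu -> %lu\n", 1000UL, round_up_32(1000UL));
       return 0;
   }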
DCMF_POLLLIMIT: The limit on the number of consecutive non-empty polls of the
reception fifo before exiting the poll function so other processing can be performed.
Making this larger might help performance because polling overhead is smaller. Making
this smaller might be necessary for applications that continuously send to a node that
needs to perform processing. Special values:
0: There is no limit.
Default is 16 polls.
DCMF_INJCOUNTER: The number of DMA injection counter subgroups that DCMF will
allocate during MPI_Init or DCMF_Messager_Initialize. There are 8 DMA counters in a
subgroup. This is useful for applications that access the DMA directly and need to limit the
number of injection counters used for messaging. Possible values:
1..8: The specified value can range from 1 to 8.
Default is 8.
DCMF_RECCOUNTER: The number of DMA reception counter subgroups that DCMF will
allocate during MPI_Init or DCMF_Messager_Initialize. There are 8 DMA counters in a
subgroup. This is useful for applications that access the DMA directly and need to limit the
number of reception counters used for messaging. Possible values:
1..8: The specified value can range from 1 to 8.
Default is 8.
DCMF_FIFOMODE: The FIFO mode to use. This determines how many injection FIFOs are
used by messaging and what they are used for. Possible values:
DEFAULT: The default FIFO mode. Uses 22 injection FIFOs.
RZVANY: Similar to DEFAULT, except it is optimized for sending messages that use the
rendezvous protocol. It has 6 more remote get FIFOs optimized for sending around
corners.
ALLTOALL: Optimized for alltoall communication. Uses additional injection FIFOs (refer to
DCMF_INJFIFO and DCMF_RGETFIFO).
Default is DEFAULT.
Compute Node Kernel environment variables
BG_PERSISTMEMRESET
Boolean indicating that the persistent memory region must be cleared before the job
starts. Default is 0. To enable, the value must be set to 1.
BG_COREDUMPONEXIT
Boolean that controls the creation of core files when the application exits. This variable is
useful when the application performed an exit() operation and the cause and location of
the exit() is not known.
BG_COREDUMPONERROR
Boolean that controls the creation of core files when the application exits with a non-zero
exit status. This variable is useful when the application performed an exit(1) operation and
the cause and location of the exit(1) is not known.
BG_COREDUMPDISABLED
Boolean. Disables creation of core files if set.
BG_COREDUMP_FILEPREFIX
Sets the file name prefix of the core files. The default is core. The MPI task number is
appended to this prefix to form the file name.
BG_COREDUMP_PATH
Sets the directory for the core files.
BG_COREDUMP_REGS
One of the Booleans that control whether register information is included in the core
files. BG_COREDUMP_REGS is the master switch.
BG_COREDUMP_GPR
One of the Booleans that control whether register information is included in the core
files. BG_COREDUMP_GPR controls output of the GPR (integer) registers.
BG_COREDUMP_FPR
One of the Booleans that control whether register information is included in the core
files. BG_COREDUMP_FPR controls output of the FPR (floating-point) registers.
BG_COREDUMP_SPR
One of the Booleans that control whether register information is included in the core
files. BG_COREDUMP_SPR controls output of the SPR (special purpose) registers.
BG_COREDUMP_PERS
Boolean that controls whether the node's personality information (XYZ dimension location,
memory size, and so on) is included in the core files.
BG_COREDUMP_INTCOUNT
Boolean that controls whether the number of interrupts handled by the node is included
in the core file.
BG_COREDUMP_TLBS
Boolean that controls whether the TLB layout at the time of the core is to be included in the
core file.
BG_COREDUMP_STACK
Boolean that controls whether the application stack addresses are to be included in the
core file.
BG_COREDUMP_SYSCALL
Boolean that controls whether a histogram of the number of system calls performed by the
application is to be included in the core file.
BG_COREDUMP_BINARY
Specifies the MPI ranks for which a binary core file will be generated rather than a
lightweight core file. This type of core file can be used with the GNU Project Debugger
(GDB) but not the Blue Gene/P Core Processor utility. If this variable is not set then all
ranks will generate a lightweight core file. The variable must be set to a comma-separated
list of the ranks that will generate a binary core file or * (an asterisk) to have all ranks
generate a binary core file.
BG_APPTHREADDEPTH
Integer that controls the number of application threads per core. Default is 1. The value
can be between 1 and 3.
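One way for an application to adapt to this setting is to size its own thread pool from the
variable. The following C sketch is an illustration only (it treats the main thread as one of the
application threads, which is an assumption of the sketch, not a statement about how the kernel
enforces the limit):

   #include <pthread.h>
   #include <stdio.h>
   #include <stdlib.h>

   static void *worker(void *arg)
   {
       /* ... per-thread application work ... */
       return arg;
   }

   int main(void)
   {
       const char *s = getenv("BG_APPTHREADDEPTH");
       int depth = s ? atoi(s) : 1;      /* default is 1 */
       pthread_t tid[2];
       int i;

       if (depth < 1) depth = 1;
       if (depth > 3) depth = 3;

       /* The main thread counts as one; start depth - 1 additional threads. */
       for (i = 0; i < depth - 1; i++)
           pthread_create(&tid[i], NULL, worker, NULL);
       for (i = 0; i < depth - 1; i++)
           pthread_join(tid[i], NULL);

       printf("using %d application thread(s) on this core\n", depth);
       return 0;
   }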
Appendix E.
Porting applications
In this appendix, we summarize Appendix A, "BG/L prior to porting code," in Unfolding the
IBM eServer Blue Gene Solution, SG24-6686. Porting applications to massively parallel
systems requires special considerations to take full advantage of this specialized architecture.
Never underestimate the effort required to port a code to any new hardware. The amount of
effort depends on how the code has been implemented.
Answer the following questions to help you decide whether to port an application and to gauge
the level of effort required (answering yes to most of the questions indicates that your
code is already enabled for distributed-memory systems and is a good candidate for Blue Gene/P):
1. Is the code already running in parallel?
2. Is the application addressing 32-bit?
3. Does the application rely on system calls, for example, system?
4. Does the code use the Message Passing Interface (MPI), specifically MPICH? Of the
several parallel programming APIs, the only one supported on the Blue Gene/P system
that is portable is MPICH. OpenMP is supported only on individual nodes.
5. Is the memory requirement per MPI task less than 4 GB?
6. Is the code computationally intensive? That is, is there a small amount of I/O compared to
computation?
7. Is the code floating-point intensive? This allows the double floating-point capability of the
Blue Gene/P system to be exploited.
8. Does the algorithm allow for distributing the work to a large number of nodes?
9. Have you ensured that the code does not use flex_lm licensing? At present, flex_lm
library support for Linux on IBM System p is not available.
If you answered yes to all of these questions, answer the following questions:
Has the code been ported to Linux on System p?
Is the code Open Source Software (OSS)? These types of applications require the use of
the GNU standard configure utility, and special considerations apply.85
Can the problem size be increased with increased numbers of processors?
Do you use standard input? If yes, can this be changed to single file input?
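When the porting work starts, a minimal MPI program is a useful first test. The following
illustrative sketch confirms that the code compiles with the MPI toolchain, launches in parallel,
and reports the address size seen by each task, which relates to the 32-bit addressing question
above:

   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char *argv[])
   {
       int rank, size;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       printf("task %d of %d: %u-bit addressing\n",
              rank, size, (unsigned)(8 * sizeof(void *)));
       MPI_Finalize();
       return 0;
   }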
Appendix F.
Mapping
In this appendix, we summarize and discuss mapping of tasks with respect to the Blue
Gene/P system. We define mapping as an assignment of MPI rank onto IBM Blue Gene
processors. As with IBM Blue Gene/L, the network topology for IBM Blue Gene/P is a
three-dimensional (3D) torus or mesh, with direct links between the nearest neighbors in the
+/-x, +/-y, and +/-z directions. When communication involves the nearest neighbors on the
torus network, you can obtain a large fraction of the theoretical peak bandwidth. However,
when MPI ranks communicate with many hops between the neighbors, the effective
bandwidth is reduced by a factor that is equal to the average number of hops that messages
take on the torus network. In a number of cases, it is possible to control the placement of MPI
ranks so that communication remains local. This can significantly improve scaling for a
number of applications, particularly at large processor counts.
The default mapping in symmetrical multiprocessing (SMP) node mode is to place MPI ranks
on the system in XYZT order, where <X,Y,Z> are torus coordinates and T is the processor
number within each node (T=0,1,2,3). If the job uses SMP mode on the Blue Gene/P system,
only one MPI rank is assigned to each node using processor 0. For SMP Node mode and the
default mapping, we get the following results:
MPI rank 0 is assigned to <X,Y,Z,T> coordinates <0,0,0,0>.
MPI rank 1 is assigned to <X,Y,Z,T> coordinates <1,0,0,0>.
MPI rank 2 is assigned to <X,Y,Z,T> coordinates <2,0,0,0>.
The results continue like this, first incrementing the X coordinate, then the Y coordinate, and
then the Z coordinate. In Virtual Node Mode and in Dual mode the mapping defaults to the
TXYZ order. For example, in Virtual Node Mode, the first four MPI ranks use processors
0,1,2,3 on the first node, then the next four ranks use processors 0,1,2,3 on the second node,
where the nodes are populated in XYZ order (first increment T, then X, then Y, and then Z).
The predefined mappings available on Blue Gene/P are the same as those available on Blue
Gene/L: XYZT, XZYT, YZXT, YXZT, ZXYT, ZYXT, TXYZ, TXZY, TYZX, TYXZ, TZXY, TZYX.
Table F-1 illustrates this type of mapping using the output from the personality program
presented in Appendix B, Files on architectural features on page 331.
Table F-1 Topology mapping 4x4x2 with TXYZ and XYZT

MPI rank   TXYZ coordinates <X,Y,Z>   Processor (T)   XYZT coordinates <X,Y,Z>   Processor (T)
0          0,0,0                      0               0,0,0                      0
1          0,0,0                      1               1,0,0                      0
2          0,0,0                      2               2,0,0                      0
3          0,0,0                      3               3,0,0                      0
4          1,0,0                      0               0,1,0                      0
5          1,0,0                      1               1,1,0                      0
6          1,0,0                      2               2,1,0                      0
7          1,0,0                      3               3,1,0                      0

Both mappings use the 4x4x2 topology; only the first eight MPI ranks are shown.
The way to specify a mapping depends on the method that is used for job submission. The
mpirun command for the Blue Gene/P system includes two methods to specify the mapping.
You can add -mapfile TXYZ to request TXYZ order. Other permutations of XYZT are also
permitted. You can also create a map file, and use -mapfile my.map, where my.map is the
name of your map file. Alternatively, you can set the environment variable BG_MAPPING (for
example, -env BG_MAPPING=TXYZ) to obtain one of the predefined non-default mappings.
Using a customized map file provides the most flexibility. The syntax for the map file is simple. It
must contain one line for each MPI rank in the Blue Gene/P partition, with four integers on
each line separated by spaces, where the four integers specify the <X,Y,Z,T> coordinates for
each MPI rank. The first line in the map file assigns MPI rank 0, the second line assigns MPI
rank 1, and so forth. It is important to ensure that your map file is consistent, with a unique
relationship between MPI rank and <X,Y,Z,T> location.
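For example, the first four lines of a map file that places MPI ranks 0 through 3 on the first four
nodes along the X axis, each on processor 0 (the same placement that the default XYZT mapping
produces in SMP mode), are:

   0 0 0 0
   1 0 0 0
   2 0 0 0
   3 0 0 0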
General guidance
For applications that use a 1D, 2D, 3D, or 4D (D for dimensional) logical decomposition
scheme, it is often possible to map MPI ranks onto the Blue Gene/P torus network in a way
that preserves locality for nearest-neighbor communication, for example, in a
one-dimensional processor topology, where each MPI rank communicates with its rank +/- 1,
the default XYZT mapping is sufficient at least for partitions large enough to use torus
wrap-around.
Torus wrap-around is enabled for partitions that are one midplane (8x8x8 = 512 nodes) or
multiples of one midplane. With torus wrap-around, the XYZT order keeps communication
local, except for one extra hop at the torus edges. For smaller partitions, such as a 64-node
partition with a 4x4x4 mesh topology, it is better to create a map file that assigns ranks that go
down the X-axis in the +x direction, and then for the next Y-value, fold the line to return in the
-x direction, making a snake-like pattern that winds back and forth, filling out the 4x4x4 mesh.
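A small program can generate such a map file. The following C sketch is an illustration for a
4x4x4 mesh used in SMP mode (so T is always 0); it writes one line per rank in the winding order
just described, folding the X sweep on every line and the Y sweep on every Z plane so that
consecutive ranks remain nearest neighbors:

   #include <stdio.h>

   int main(void)
   {
       const int NX = 4, NY = 4, NZ = 4;   /* 64-node 4x4x4 mesh partition */
       int x, j, z;

       for (z = 0; z < NZ; z++) {
           for (j = 0; j < NY; j++) {
               int y    = (z % 2 == 0) ? j : NY - 1 - j;      /* fold the Y sweep per Z plane */
               int line = z * NY + j;                         /* running index of this X line */
               for (x = 0; x < NX; x++) {
                   int xx = (line % 2 == 0) ? x : NX - 1 - x; /* fold the X sweep per line */
                   printf("%d %d %d 0\n", xx, y, z);          /* X Y Z T, with T = 0 in SMP mode */
               }
           }
       }
       return 0;
   }

Redirect the output to a file, for example my.map, and pass it to mpirun with -mapfile my.map.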
It is worthwhile to note that for a random placement of MPI ranks onto a 3D torus network, the
average number of hops is one-quarter of the torus length, in each of the three dimensions.
Thus mapping is generally more important for large or elongated torus configurations.
Two-dimensional logical process topologies are more challenging. In some cases, it is
possible to choose the dimensions of the logical 2D process mesh so that one can fold the
logical 2D mesh to fit perfectly in the 3D Blue Gene/P torus network, for example, if you want
to use one midplane (8x8x8 nodes) in virtual node mode, a total of 2048 CPUs are available.
A 2D process mesh is 32x64 for this problem. The 32 dimension can be lined up along one
edge of the torus, say the X-axis, using TX order to fill up processors (0,1,2,3) on each of the
eight nodes going down the X-axis, resulting in 32 MPI ranks going down the X-axis.
The simplest good mapping, in this case, is to specify -mapfile TXYZ. This keeps
nearest-neighbor communication local on the torus, except for one extra hop at the torus
edges. You can do slightly better by taking the 32x64 logical 2D process mesh, aligning one
edge along the X-axis with TX order and then folding the 64 dimension back and forth to fill
the 3D torus in a seamless manner. It is straightforward to construct small scripts or programs
to generate the appropriate map file. Not all 2D process topologies can be neatly folded onto
the 3D torus.
For 3D logical process topologies, it is best to choose a decomposition or mapping that fits
perfectly onto the 3D torus if possible, for example, if your application uses SMP Node mode
on one Blue Gene/P rack (8x8x16 torus); then it is best to choose a 3D decomposition with 8
ranks in the X-direction, 8 ranks in the Y-direction, and 16 ranks in the Z-direction. If the
application requires a different decomposition - for example, 16x8x8 - you might be able to
use mapping to maintain locality for nearest-neighbor communication. In this case, ZXY order
works.
Quantum chromodynamics (QCD) applications often use a 4D process topology. This can fit
perfectly onto Blue Gene/P using virtual node mode, for example, with one full rack, there are
4096 CPUs in virtual node mode, with a natural layout of 8x8x16x4 (X,Y,Z,T order). By
choosing a decomposition of 8x8x16x4, communication remains entirely local for nearest
neighbors in the logical 4D process mesh. In contrast, a more balanced decomposition of
8x8x8x8 results in a significant amount of link sharing, and thus degraded bandwidth in one of
the dimensions.
In summary, it is often possible to choose a mapping that keeps communication local on the
Blue Gene/P torus network. This is recommended for cases where a natural mapping can be
identified based on the parallel decomposition strategy used by the application. The mapping
can be specified using the -mapfile argument for the mpirun command.
Appendix G.
htcpartition
The htcpartition utility, the subject of this appendix, boots or frees an HTC partition from a
Front End Node or service node. The htcpartition utility is similar to mpirun in two ways.
First, both communicate with the mpirun daemon on the service node; however, htcpartition
cannot run a job. Second, the mpirun scheduler plug-in interface is also called when
htcpartition is executed. The plug-in interface provides a method for the resource scheduler
to specify the partition to boot or free, and if the resource scheduler does not allow mpirun
outside its framework, that policy is also enforced with htcpartition.
The htcpartition utility is located in /bgsys/drivers/ppcfloor/bin along with the other IBM Blue
Gene/P executables. Its return status indicates whether or not the request succeeded; zero
indicates success and non-zero means failure. Table G-1 provides a complete list of options
for the htcpartition command.
Table G-1 htcpartition parameters

--help
Displays help information.
--version
Version information.
--boot | --free
Indicates whether the partition is to be booted or freed.
--partition <partition>
Name of the partition to boot or free.
--mode <smp | dual | vn | linux_smp>
Mode in which to boot the partition.
--host <hostname>
Service node host name that the mpirun server listens on. If not
specified, the host name must be in the MMCS_SERVER_IP
environment variable.
--port <port>
Service node TCP/IP port number that the mpirun server listens
on. The default port is 9874.
--config <path>
Configuration file to use.
--trace <0-7>
Tracing level.
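Because the exit status is the only programmatic result, callers typically just test it. The
following C sketch is a hypothetical wrapper (the partition name MY_PARTITION is a placeholder)
that boots an HTC partition in SMP mode and checks for success:

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/wait.h>

   int main(void)
   {
       /* Boot an HTC partition in SMP mode; htcpartition returns zero on
        * success and non-zero on failure. */
       int rc = system("/bgsys/drivers/ppcfloor/bin/htcpartition "
                       "--boot --partition MY_PARTITION --mode smp");

       if (rc == -1 || !WIFEXITED(rc) || WEXITSTATUS(rc) != 0) {
           fprintf(stderr, "htcpartition failed\n");
           return 1;
       }
       printf("partition booted\n");
       return 0;
   }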
Appendix H.
GNU profiling tool
This command searches the current directory for all gmon.out files of the form gmon.out.x
where x is an integer value, starting with 0 until a file in the sequence cannot be found. The
data in these files is summed in the same way as gprof normally does.
As in the previous case, the following command searches the current directory for all
gmon.sample files of the form gmon.sample.x, where x is an integer value starting with 0, until
a file in the sequence cannot be found. A gmon histogram is generated by summing the data
found in each individual file, and the output goes to gmon.sum.
> /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gprof -sumbg=gmon.sample
pgm
Appendix I.
Statement of completion
IBM considers the IBM Blue Gene/P installation to be complete when the following activities
have taken place:
The Blue Gene/P rack or racks have been physically placed in position.
The cabling is complete, including power, Ethernet, and torus cables.
The Blue Gene/P racks can be powered on.
All hardware is displayed in the Navigator and is available.
References
1. TOP500 Supercomputer sites:
http://www.top500.org/
2. The MPI Forum. The MPI message-passing interface standard. May 1995:
http://www.mcs.anl.gov/mpi/standard.html
3. OpenMP application programming interface (API):
http://www.openmp.org
4. IBM XL family of compilers:
XL C/C++
http://www-306.ibm.com/software/awdtools/xlcpp/
XL Fortran
http://www-306.ibm.com/software/awdtools/fortran/xlfortran/features/bg/
5. GCC, the GNU Compiler Collection:
http://gcc.gnu.org/
6. IBM System Blue Gene Solution: Configuring and Maintaining Your Environment,
SG24-7352.
7. GPFS Multicluster with the IBM System Blue Gene Solution and eHPS Clusters,
REDP-4168.
8. Engineering and Scientific Subroutine Library (ESSL):
http://www.ibm.com/systems/p/software/essl.html
9. See note 2.
10.See note 3.
11.See note 4.
12.See note 5.
13.See note 6.
14.See note 7.
15.See note 8.
16.Gropp, W. and Lusk, E. Dynamic Process Management in an MPI Setting. 7th IEEE
Symposium on Parallel and Distributed Processing. p. 530, 1995:
http://www.cs.uiuc.edu/homes/wgropp/bib/papers/1995/sanantonio.pdf
17.See note 2.
18.See note 3.
19.See note 5.
20.See note 8.
21.Ganier, C. J. What is Direct Memory Access (DMA)?
http://cnx.org/content/m11867/latest/
22.See note 2.
23.See note 3.
24.A. Faraj, X. Yuan, and D. K. Lowenthal. STAR-MPI: Self Tuned Adaptive Routines for MPI
Collective Operations. The 20th ACM International Conference on Supercomputing
(ICS 06), Queensland, Australia, June 28-July 1, 2006.
25.Quinn, Michael J. Parallel Programming in C with MPI and OpenMP. McGraw-Hill, New
York, 2004. ISBN 0-072-82256-2.
26.Snir, Marc, et al. MPI: The Complete Reference, 2nd Edition, Volume 1. MIT Press,
Cambridge, Massachusetts, 1998. ISBN 0-262-69215-5.
27.Gropp, William, et al. MPI: The Complete Reference, Volume 2 - The MPI-2 Extensions.
MIT Press, Cambridge, Massachusetts, 1998. ISBN 0-262-69216-3.
28.See note 3.
29.See note 25.
30.Ibid.
31.Ibid.
32.See note 3.
33.Flynn's taxonomy in Wikipedia:
http://en.wikipedia.org/wiki/Flynn%27s_Taxonomy
34.Rennie, Gabriele. Keeping an Eye on the Prize. Science and Technology Review,
July/August 2006:
http://www.llnl.gov/str/JulAug06/pdfs/07_06.3.pdf
35.Rennie, Gabriele. Simulating Materials for Nanostructural Designs. Science and
Technology Review, January/February 2006:
http://www.llnl.gov/str/JanFeb06/Schwegler.html
36.SC06 Supercomputing Web site, press release from 16 November 2006:
http://sc06.supercomputing.org/news/press_release.php?id=14
37.Unfolding the IBM eServer Blue Gene Solution, SG24-6686
38.Sebastiani, D. and Rothlisberger, U. Advances in Density-functional-based Modeling
Techniques of the Car-Parrinello Approach, chapter in Quantum Medicinal Chemistry,
P. Carloni and F. Alber, eds. Wiley-VCH, Germany, 2003. ISBN 9-783-52730-456-1.
39.Car, R. and Parrinello, M. Unified Approach for Molecular Dynamics and
Density-Functional Theory. Physical Review Letters 55, 2471 (1985):
http://prola.aps.org/abstract/PRL/v55/i22/p2471_1
40.See note 34.
41.Suits, F., et al. Overview of molecular dynamics techniques and early scientific results
from the Blue Gene Project. IBM Journal of Research and Development. 49, 475 (2005):
http://www.research.ibm.com/journal/rd/492/suits.pdf
42.Ibid.
43.Case, D. A., et al. The Amber biomolecular simulation programs. Journal of
Computational Chemistry. 26, 1668 (2005).
44.Fitch, B. G., et al. Blue Matter, an application framework for molecular simulation on Blue
Gene. Journal of Parallel and Distributed Computing. 63, 759 (2003):
http://portal.acm.org/citation.cfm?id=952903.952912&dl=GUIDE&dl=ACM
45.Plimpton, S. Fast parallel algorithms for short-range molecular dynamics. Journal of
Computational Physics. 117, 1 (1995).
46.Phillips, J., et al. Scalable molecular dynamics with NAMD. Journal of Computational
Chemistry. 26, 1781 (2005).
47.See note 43.
48.See note 44.
49.Ibid.
50.Ibid.
51.Ibid.
52.Ibid.
53.See note 45.
54.LAMMPS Molecular Dynamics Simulator:
http://lammps.sandia.gov/
55.See note 46.
56.Brooks, B. R., et al. CHARMM: A Program for Macromolecular Energy, Minimization, and
Dynamics Calculations. Journal of Computational Chemistry. 4, 187 (1983).
57.Brünger, A. T. X-PLOR, Version 3.1, A System for X-ray Crystallography and NMR. 1992:
The Howard Hughes Medical Institute and Department of Molecular Biophysics and
Biochemistry, Yale University. 405.
58.Kumar, S., et al. Achieving Strong Scaling with NAMD on Blue Gene/L. Proceedings of
IEEE International Parallel & Distributed Processing Symposium, 2006.
59.Waszkowycz, B., et al. Large-scale Virtual Screening for Discovering Leads in the
Postgenomic Era. IBM Systems Journal. 40, 360 (2001).
60.Patrick, G. L. An Introduction to Medicinal Chemistry, 3rd Edition. Oxford University Press,
Oxford, UK, 2005. ISBN 0-199-27500-9.
61.Kontoyianni, M., et al. Evaluation of Docking Performance: Comparative Data on Docking
Algorithms. Journal of Medicinal Chemistry. 47, 558 (2004).
62.Kuntz, I. D., et al. A Geometric Approach to Macromolecule-ligand Interactions. Journal of
Molecular Biology. 161, 269 (1982); Morris, G. M., et al. Automated Docking Using a
Lamarckian Genetic Algorithm and Empirical Binding Free Energy Function. Journal of
Computational Chemistry. 19, 1639 (1998); Jones, G., et al. Development and Validation
of a Genetic Algorithm to Flexible Docking. Journal of Molecular Biology. 267, 904 (1997);
Rarey, M., et al. A Fast Flexible Docking Method Using an Incremental Construction
Algorithm. Journal of Molecular Biology. 261, 470 (1996); Schrödinger, Portland, OR
97201; Pang, Y. P., et al. EUDOC: A Computer Program for Identification of Drug
Interaction Sites in Macromolecules and Drug Leads from Chemical Databases. Journal
of Computational Chemistry. 22, 1750 (2001).
63.(a) http://dock.compbio.ucsf.edu; (b) Moustakas, D. T., et al. Development and
Validation of a Modular, Extensible Docking Program: DOCK5. Journal of Computational
Aided Molecular Design. 20, 601 (2006).
64.Ibid.
65.Ibid.
66.Ibid.
67.Ibid.
68.Peters, A., et al., High Throughput Computing Validation for Drug Discovery using the
DOCK Program on a Massively Parallel System. 1st Annual MSCBB. Northwestern
University, Evanston, IL, September, 2007.
69.Irwin, J. J. and Shoichet, B. K. ZINC - A Free Database of Commercially Available
Compounds for Virtual Screening. Journal of Chemical Information and Modeling. 45, 177
(2005).
70.Ibid.
71.Pople, J. A. Approximate Molecular Orbital Theory (Advanced Chemistry). McGraw-Hill,
NY. June 1970. ISBN 0-070-50512-8.
72.See note 39.
73.(a) CPMD V3.9, Copyright IBM Corp. 1990-2003, Copyright MPI für Festkörperforschung,
Stuttgart, 1997-2001. (b) See also:
http://www.cpmd.org
74.Marx, D. and Hutter, J. Ab-initio molecular dynamics: Theory and implementation in
Modern Methods and Algorithms of Quantum Chemistry. J. Grotendorst (ed.), NIC Series,
1, FZ Jülich, Germany, 2000. See also the following URL and references therein:
http://www.fz-juelich.de/nic-series/Volume3/marx.pdf
75.Vanderbilt, D. Soft self-consistent pseudopotentials in a generalized eigenvalue
formalism. Physical Review B. 1990, 41, 7892 (1990):
http://prola.aps.org/abstract/PRB/v41/i11/p7892_1
76.See note 73.
77.Eddy, S. R., HMMER User's Guide. Biological Sequence Analysis Using Profile Hidden
Markov Models, Version 2.3.2, October 1998.
78.Ibid.
79.Ibid.
80.Jiang, K., et al. An Efficient Parallel Implementation of the Hidden Markov Methods for
Genomic Sequence Search on a Massively Parallel System. IEEE Transactions on
Parallel and Distributed Systems. 19, 1 (2008).
81.Bateman, A., et al. The Pfam Protein Families Database. Nucleic Acids Research. 30,
276 (2002).
82.Ibid.
83.Darling, A., et al. The Design, Implementation, and Evaluation of mpiBLAST.
Proceedings of 4th International Conference on Linux Clusters (in conjunction with
ClusterWorld Conference & Expo), 2003.
84.Thorsen, O., et al. Parallel genomic sequence-search on a massively parallel system.
Conference on Computing Frontiers: Proceedings of the 4th International Conference on
Computing Frontiers. ACM, 2007, pp. 59-68.
85.Heyman, J. Recommendations for Porting Open Source Software (OSS) to Blue Gene/P,
white paper WP101152:
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP101152
Related publications
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this book.
IBM Redbooks
For information about ordering these publications, see How to get IBM Redbooks on
page 374. Note that some of the documents referenced here might be available in softcopy
only:
IBM System Blue Gene Solution: Blue Gene/P Safety Considerations, REDP-4257
Blue Gene/L: Hardware Overview and Planning, SG24-6796
Blue Gene/L: Performance Analysis Tools, SG24-7278
Evolution of the IBM System Blue Gene Solution, REDP-4247
GPFS Multicluster with the IBM System Blue Gene Solution and eHPS Clusters,
REDP-4168
IBM System Blue Gene Solution: Application Development, SG24-7179
IBM System Blue Gene Solution: Configuring and Maintaining Your Environment,
SG24-7352
IBM System Blue Gene Solution: Hardware Installation and Serviceability, SG24-6743
IBM System Blue Gene Solution Problem Determination Guide, SG24-7211
IBM System Blue Gene Solution: System Administration, SG24-7178
Unfolding the IBM eServer Blue Gene Solution, SG24-6686
Other publications
These publications are also relevant as further information sources:
Bateman, A., et al. The Pfam Protein Families Database. Nucleic Acids Research. 30,
276 (2002).
Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M.
CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics
Calculations. Journal of Computational Chemistry. 4, 187 (1983).
Brünger, A. T. X-PLOR, Version 3.1, A System for X-ray Crystallography and NMR. 1992:
The Howard Hughes Medical Institute and Department of Molecular Biophysics and
Biochemistry, Yale University. 405.
Car, R. and Parrinello, M. Unified Approach for Molecular Dynamics and
Density-Functional Theory. Physical Review Letters 55, 2471 (1985):
http://prola.aps.org/abstract/PRL/v55/i22/p2471_1
Case, D. A., et al. The Amber biomolecular simulation programs. Journal of
Computational Chemistry. 26, 1668 (2005).
Online resources
These Web sites are also relevant as further information sources:
Compiler-related topics:
XL C/C++
http://www-306.ibm.com/software/awdtools/xlcpp/
XL C/C++ library
http://www.ibm.com/software/awdtools/xlcpp/library/
XL Fortran Advanced Edition for Blue Gene
http://www-306.ibm.com/software/awdtools/fortran/xlfortran/features/bg/
XL Fortran library
http://www-306.ibm.com/software/awdtools/fortran/xlfortran/library/
Debugger-related topics:
GDB: The GNU Project Debugger
http://www.gnu.org/software/gdb/gdb.html
GDB documentation:
http://www.gnu.org/software/gdb/documentation/
Engineering and Scientific Subroutine Library (ESSL) and Parallel ESSL
http://www.ibm.com/systems/p/software/essl.html
GCC, the GNU Compiler Collection
http://gcc.gnu.org/
Intel MPI Benchmarks (formerly known as Pallas MPI Benchmarks)
http://www.intel.com/cd/software/products/asmo-na/eng/219848.htm
Mathematical Acceleration Subsystem
http://www-306.ibm.com/software/awdtools/mass/index.html
Message Passing Interface Forum
http://www.mpi-forum.org/
MPI Performance Topics
http://www.llnl.gov/computing/tutorials/mpi_performance/
The OpenMP API Specification:
http://www.openmp.org
Ganier, C. J. What is Direct Memory Access (DMA)?
http://cnx.org/content/m11867/latest/
Index
Numerics
10 Gb Ethernet network 11
3.2 C, GNU 21
32-bit static link files 337
3D torus network 10
A
Ab Initio method 307
abstract device interface (ADI) 68
adaptive routing 69
addr2line utility 159
address space 20
ADI (abstract device interface) 68
Aggregate Remote Memory Copy Interface (ARMCI) 67
ALIGNX 116
__alignx function 116
allocate block 140
AMBER 309
ANSI-C 59
APIs
Bridge. See Bridge APIs
Control System. See Bridge APIs
Dynamic Partition Allocator APIs See Dynamic Partition Allocator APIs
Real-time Notification APIs See Real-time Notification
APIs
applications
checkpoint and restart support for 169176
chemistry and life sciences 307321
compiling and linking 98
debugging 143167
See also GPD (GNU Project debugger), Scalable
Debug API
developing with XL compilers 97138
optimizing 111138
porting 353
running 140142
SIMD instructions in 109111
architecture 46
CIOD threading 34
Argonne National Labs 18
arithmetic functions 126135
ARMCI (Aggregate Remote Memory Copy Interface) 67
asynchronous APIs, Bridge APIs 213
asynchronous file I/O 20
__attribute__(always_inline) extension 114
B
bandwidth, MPI 83
base partition 245
Berkeley Unified Parallel C (Berkeley UPC) 67
Berkeley UPC (Berkeley Unified Parallel C) 67
BG_CHKPTENABLED 176
BG_SHAREDMEMPOOLSIZE 40
BGLAtCheckpoint 174
BGLAtContinue 174
BGLAtRestart 174
BGLCheckpoint 173
BGLCheckpointExcludeRegion 174
BGLCheckpointInit 173
BGLCheckpointRestart 174
bgpmaster daemon 24
binary functions 128
binutils 98
block 140
blrts_xlc 100
blrts_xlc++ 100
blrts_xlf 100
Blue Gene specifications 12
Blue Gene XL compilers, developing applications with 97
Blue Gene/L PowerPC 440d processor 97
Blue Gene/P
software programs 11
V1R3M0 78
Blue Gene/P MPI, environment variables 340
Blue Matter 310
boot sequence, compute node 31
Bridge APIs 23, 161, 209249
asynchronous APIs 213
environment variables 210
examples 246249
first and next calls 211
functions 212
HTC paradigm and 202
invalid pointers 211
memory allocation and deallocation 211
messaging APIs 244
MMCS API 212
partition state flags 218
requirements 210212
return codes 213
small partition allocation 245
bridge.config 182
bss, applications storing data 19
buffer alignment 73
built-in floating-point functions 118
C
C++, GNU 21
cache 44
Car-Parrinello Molecuar Dynamics (CPMD) 308
Cartesian
communicator functions 75
optimized functions 68
Charm++ 67
checkpoint and restart application support 169176
directory and file-naming conventions 175
D
data, applications storing 19
DB_PROPERTY 252
db.properties 182
DCMF_EAGER 70
DDR (double data RAM) 47
debug client, debug server 143
debugging applications 143167
live debug 150155
Scalable Debug API 161167
See also GPD (GNU Project debugger)
deterministic routing 69
direct memory access (DMA) 69
directory names, checkpoint and restarting conventions
175
DMA (direct memory access) 69
DOCK6 313
double data RAM (DDR) 47
Double Hummer FPU 100
double-precision square matrix multiply example 136
Dual mode 17, 299
memory access in 49
dynamic linking 21
Dynamic Partition Allocator APIs 295301
library files 296
requirements 296
E
eager protocol 69
electronic correlation 314
electronic structure method 307
Engineering and Scientific Subroutine Library (ESSL)
105
environment variables 339351
Blue Gene/P MPI 340
Bridge APIs 210
Compute Node Kernel 349
mpirun 187
ESSL (Engineering and Scientific Subroutine Library)
105
Ewald sums 308
extended basic blocks 113
F
fault recovery 170
See also checkpoint and restart application support
file I/O 20
files
on architectural features 331334
checkpoint and restart naming conventions 175
Fortran77, GNU 21
FPABS 127
__fpabs 127
FPADD 128
__fpadd 128
FPCTIW 126
__fpctiw 126
FPCTIWZ 126
__fpctiwz 126
FPMADD 129
__fpmadd 129
FPMSUB 130
__fpmsub 130
FPMUL 128
__fpmul 128
FPNABS 128
__fpnabs 127
FPNEG 127
__fpneg 127
FPNMADD 130
__fpnmadd 129
FPNMSUB 130
__fpnmsub 130
FPRE 127
__fpre 127
FPRSP 126
__fprsp 126
FPRSQRTE 127
__fprsqrte 127
FPSEL 135
__fpsel 135
FPSUB 128
__fpsub 128
freepartition 178
front end node 6, 13
function network 6, 11
functions
Bridge APIs 209249
built-in floating-point, IX compilers 118
built-in, XL compilers 135138
Dynamic Partition Allocator APIs 295301
inline, XL compilers 114
load and store, XL compilers 123
move, XL compilers 125
MPI 80
Real-time Notification APIs 255268
select, XL compilers 135
unary 126128
FXCPMADD 132
__fxcpmadd 132
FXCPMSUB 132
__fxcpmsub 132
FXCPNMADD 132
__fxcpnmadd 132
FXCPNMSUB 133
__fxcpnmsub 133
FXCPNPMA 133
__fxcpnpma 133
__fxcpnsma 133
FXCSMADD 132
__fxcsmadd 132
FXCSMSUB 132
__fxcsmsub 132
FXCSNMADD 132
__fxcsnmadd 132
FXCSNMSUB 133
__fxcsnmsub 133
FXCSNPMA 133
__fxcsnpma 133
__fxcsnsma 133
FXCXMA 134
__fxcxma 134
FXCXNMS 134
__fxcxnms 134
FXCXNPMA 134
__fxcxnpma 134
FXCXNSMA 135
__fxcxnsma 135
FXMADD 130
__fxmadd 130
FXMR 125
__fxmr 125
FXMSUB 131
__fxmsub 131
FXMUL 129
__fxmul 129
FXNMADD 131
__fxnmadd 131
FXNMSUB 131
__fxnmsub 131
FXPMUL 129
__fxpmul 129
FXSMUL 129
__fxsmul 129
G
GA toolkit (Global Arrays toolkit) 67
GASNet (Global-Address Space Networking) 68
GDB (GNU Project debugger) 143149
gdbserver 143
General Parallel File System (GPFS) 13
get_parameters() 199
gid 52
Global Arrays (GA) toolkit 67
global collective network 11
global interrupt network 11, 69
Global-Address Space Networking (GASNet) 68
GNU Compiler Collection V4.1.1 21
GNU profiling tool 361364
GNU Project debugger (GDB) 143149
GPFS (General Parallel File System) 13
H
hardware 314
naming conventions 325330
header files 335338
heap 19
high-performance computing mode 18
high-performance network 69
High-Throughput Computing mode 18
HMMER 315
host system 13
host system software 14
HTC 65
HTC paradigm 201206
htcpartition 202, 359
I
I/O (input/output) 20
I/O node 56, 10
daemons 23
debugging 156
features 12
file system services 22
kernel boot 22
software 2224
I/O node kernel 3235
IBM LoadLeveler 142
IBM XL compilers 22
arithmetic functions 126135
basic blocks 113
batching computations 115
built-in floating-point functions 118
built-in functions, using 135138
complex type manipulation functions 121
complex types, using 113
cross operations 119
data alignment 116
data objects, defining 112
default options 99
developing applications with 97138
inline functions 114
load and store functions 123
move functions 125
optimization 107
parallel operations 118
pointer aliasing 114
scripts 100
select functions 135
SIMD 118
vectorizable basic blocks 113
input/output (I/O) 20
Intel MPI Benchmarks 83
J
jm_attach_job() 222
jm_begin_job() 222
jm_cancel_job 222
jm_debug_job() 223
jm_load_job() 224
jm_signal_job() 225
jm_start_job() 225
job modes 3742
job state flags 223
K
kernel functionality 2935
L
L1 cache 4445, 73
L2 cache 44, 46
L3 cache 44, 46
LAMMPS 311
latency, MPI 83
__lfpd 123
__lfps 123
__lfxd 124
__lfxs 123
libbgrealtime.so 252
libraries 335338
XL 104
ligand atoms 313
Linux/SMP mode 299
load and store functions 123
LOADFP 123
LOADFX 123124
LoadLeveler 142
M
mapping 355357
MASS (Mathematical Acceleration Subsystem) 104
Mathematical Acceleration Subsystem (MASS) 104
mcServer daemon 24
memory 1820, 4349
address space 20
addressing 19
considerations 9
distributed 44, 66
leaks 20
management 20, 4547
MPI and 71
persistent 49
protection 4749
shared 40
virtual 44
message layer 39
Message Passing Interface. See MPI
messages, flooding of 72
microprocessor 8
midplane 7
Midplane Management Control System (MMCS) 23, 25
Midplane Management Control System APIs 295
MM/MD (Classical Molecular Mechanics/Molecular Dynamics) 307
mmap 40
MMCS (Midplane Management Control System) 23, 25,
33
MMCS console 140
MMCS daemon 24
mmcs_db_console 202
modes, specifying 41
move functions 125
MPI (Message Passing Interface) 18, 65, 68
bandwidth 83
Blue Gene/P extensions 7480
Blue Gene/P implementation, protocols 69
buffer ownership, violating 73
collective 85, 320
communications 74
communications performance 8388
compiling programs on Blue Gene/P 8283
eager protocol 69
functions 80
latency 83
memory, too much 71
N
NAMD 312
natural alignment 108
network 10
10 Gb Ethernet 11
3D torus 10
collective 11, 69
control 11
functional 11
global collective 11
global interrupt 11, 69
high-performance 69
point-to-point 69
torus 10
networks
function 11
node card 4
retrieving information 247
node services, common 32
O
OpenMP 8996, 100
GPD 144
OpenMP, HPC (High-Performance Computing) 65
P
parallel execution 69
parallel operations 118
parallel paradigms 6596
See also MPI (Message Passing Interface)
Parallel Programming Laboratory 67
particle mesh Ewald (PME) method 308
performance
application efficiency 71
collective operations and 85
data alignment and 116
engineering and scientific applications 305321
L2 cache and 46
memory and 45
MPI algorithms 77
MPI communications 8388
persistent memory 49
personality 31
PingPong 318
pm_create_partition() 216
pm_destroy_partition() 217
PME (particle mesh Ewald) method 308
pmemd 309
PMI_Cart_comm_create() 75
PMI_Pset_diff_comm_create() 76
PMI_Pset_same_comm_create() 75
pointer aliasing 114
pointers, uninitialized 20
point-to-point MPI 84
point-to-point network 69
pool, HTC 205
porting applications 353
PowerPC 440d Double Hummer dual FPU 118
PowerPC 440d processor 97
PowerPC 450 microprocessor 8
PowerPC 450, parallel operations on 107
#pragma disjoint directive 115
processor set (pset) 75
pset (processor set) 75
-psets_per_bp 193
pthreads 100
Python 106
Q
q64 100
qaltivec 100
qarch 99
qcache 99
qflttrap 100
qinline 114
qipa 114
QM/MM (Quantum Mechanical/Molecular Mechanical)
308
qmkshrobj 100
qnoautoconfig 99
qpic 100
qtune 99
Quantum Mechanical/Molecular Mechanical (QM/MM)
308
R
rack component 4
raw state 256
real-time application code 284293
Real-time Notification APIs 251293
blocking or nonblocking 253
functions 255268
libbgrealtime.so 252
library files 252
requirements 252
return codes 280283
sample makefile 252
status codes 281
Redbooks Web site 374
Contact us xiii
reduction clause 93
rendezvous protocol 69
rm_add_job() 221
rm_add_part_user() 216, 255, 282
rm_add_partition() 215, 254, 281
rm_assign_job() 216
rm_free_BG() 243
rm_free_BP() 243
rm_free_job_list() 243
rm_free_job() 243
rm_free_nodecard_list() 243
rm_free_nodecard() 243
rm_free_partition_list() 243
rm_free_partition() 243
rm_free_switch() 243
rm_get_BG() 214
rm_get_data() 212, 214
rm_get_job() 223
rm_get_jobs() 223
rm_get_partitions_info() 218, 255, 283
rm_get_partitions() 217, 254, 281
rm_get_serial() 215
rm_modify_partition() 218
rm_new_BP() 242
rm_new_job() 243
rm_new_nodecard() 243
rm_new_partition() 243
rm_new_switch() 243
rm_query_job() 224
rm_release_partition() 219
rm_remove_job() 224
rm_remove_part_user() 220, 255, 282
rm_remove_partition() 219
rm_set_data() 212, 215
rm_set_part_owner() 220
rm_set_serial() 215
rt_api.h 252
RT_CALLBACK_CONTINUE 256
RT_CALLBACK_QUIT 256
RT_CALLBACK_VERSION_0 255
rt_callbacks_t() 255
RT_CONNECTION_ERROR 283
RT_DB_PROPERTY_ERROR 281
rt_get_msgs() 253
rt_handle_t() 253
rt_init() 252
RT_INVALID_INPUT_ERROR 282283
rt_set_blocking() 253
rt_set_filter() 254
rt_set_nonblocking() 253
RT_STATUS_OK 256
RT_WOULD_BLOCK 282
S
Scalable Debug API 161167
scripts, XL compilers 100
security, mpirun and 141, 178
segfaults 48
Self Tuned Adaptive Routines for MPI (STAR-MP) 79
service actions 24
service node 6, 13
shared libraries 101
shared memory 40
shm_open() 40
signal support, system calls 59
SIMD (single-instruction, multiple-data) 46, 107
SIMD computation 118
SIMD instructions in applications 109111
Single Program Multiple Data. See SPMD 68
single-instruction, multiple-data See SIMD
size command 19
small partition
allocation 245, 248, 284
defining new 248
querying 248
SMP mode 38
as default mode 48
socket support, system calls 58
sockets calls 21
software 1525
SPI (System Programming Interface) 57
SPMD (Single Program Multiple Data) 68, 179
stack 19
standard input 21
STAR-MPI (Self Tuned Adaptive Routines for MPI) 79
static libraries 101
stdin 21
__stfpd 124
__stfpiw 125
__stfps 124
__stfxd 125
__stfxs 124
storage node 13
STOREFP 124125
STOREFX 124
structure alignment 112
submit 141
HTC paradigm 202
submit APIs, HTC paradigm 206
SUBMIT_CWD 203
SUBMIT_POOL 203
SUBMIT_PORT 203
Symmetrical Multiprocessor (SMP) mode 17
system architecture 46
system calls 5161
return codes 52
signal support 59
socket support 58
unsupported 60
System Programming Interface (SPI) 57
T
threading support 18
TLB (translation look-aside buffer) 47
torus communications 74
torus wrap-around 357
translation look-aside buffer (TLB) 47
TXYZ order 355
U
uid 52
unary functions 126128
uninitialized pointers 20
V
vectorizable basic blocks 113
virtual FIFO 39
virtual memory 44
Virtual node mode 17, 38, 299
memory access in 48
virtual paging 20
X
XL C/C++ Advanced Edition V8.0 for Blue Gene 97
XL compilers. See IBM XL compilers
XL Fortran Advanced Edition V10.1 for Blue Gene 97
XL libraries 104
XYZT order 355
Back cover
ISBN 0738433330