
LLNL-CONF-699263

Caliper: Performance Introspection for HPC Software Stacks

D. Boehme, T. Gamblin, D. Beckingsale, P. Bremer, A. Gimenez, M. LeGendre, O. Pearce, M. Schulz

August 2, 2016

The International Conference for High Performance Computing, Networking, Storage and Analysis
Salt Lake City, UT, United States
November 13, 2016 through November 18, 2016
Disclaimer

This document was prepared as an account of work sponsored by an agency of the United States
government. Neither the United States government nor Lawrence Livermore National Security, LLC,
nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or
responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or
process disclosed, or represents that its use would not infringe privately owned rights. Reference herein
to any specific commercial product, process, or service by trade name, trademark, manufacturer, or
otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the
United States government or Lawrence Livermore National Security, LLC. The views and opinions of
authors expressed herein do not necessarily state or reflect those of the United States government or
Lawrence Livermore National Security, LLC, and shall not be used for advertising or product
endorsement purposes.

Caliper: Performance Introspection for HPC Software Stacks

David Boehme, Todd Gamblin, David Beckingsale, Peer-Timo Bremer, Alfredo Gimenez, Matthew LeGendre, Olga Pearce and Martin Schulz
Lawrence Livermore National Laboratory, Livermore, California 94551

Abstract—Many performance engineering tasks, from long-term performance monitoring to post-mortem analysis and online tuning, require efficient runtime methods for introspection and performance data collection. To understand interactions between components in increasingly modular HPC software, performance introspection hooks must be integrated into runtime systems, libraries, and application codes across the software stack. This requires an interoperable, cross-stack, general-purpose approach to performance data collection, which neither application-specific performance measurement nor traditional profile or trace analysis tools provide. With Caliper, we have developed a general abstraction layer to provide performance data collection as a service to applications, runtime systems, libraries, and tools. Individual software components connect to Caliper in independent data producer, data consumer, and measurement control roles, which allows them to share performance data across software stack boundaries. We demonstrate Caliper's performance analysis capabilities with two case studies of production scenarios.

Keywords—Performance analysis; High performance computing; Computer performance; Software tools; Software performance; Software reusability; Parallel processing.

I. MOTIVATION

Increasing on-node complexity, coupled with thread- and task-parallel runtimes, will make understanding the performance of exascale applications more difficult. Performance problems in modern systems can be caused by many different factors across many different levels in the software stack, from the operating and runtime system to libraries and application code. To diagnose problems, we need introspection capabilities in the form of annotation APIs that allow application and system developers to expose semantic information from their software components. We must also couple this semantic information with comprehensive performance measurement techniques. Finally, to understand the relationships between application code, runtime systems, the memory hierarchy, and the network, we must be able to configure tools to correlate data across software layers.

A wide range of sophisticated profiling and tracing tools can provide such information. Many of them also provide their own APIs to collect application and runtime information. However, these interfaces are tool-focused and require users to add tool-specific actions (like start/stop timer) to their codes, instead of allowing developers to express generalizable semantics of their code in a reusable way. Further, once added, the instrumented code is then typically limited to recording tool-specific profile or trace information, and can no longer be used for other purposes. This discourages or even prevents users from leaving annotations in the code. As a consequence, most pre-installed HPC production software today is not amenable to stack-wide profiling or tracing. Developers who wish to use instrumentation-based tools not only have to instrument their own code, but possibly the entire software stack all the way to the operating system.

In order to overcome this gap, we need a new instrumentation approach that separates the concerns of a) software developers, who expose application/library/runtime semantics through annotations; b) tool developers, who provide measurements to be correlated with application information; and c) tool users, who decide what information is collected, correlated, and/or filtered, based on the specific analysis use case, available measurements, and application state. Such an approach must emphasize ease of use for the developer, incur minimal overhead when used during production runs, and be composable across the entire software stack.

In this paper, we present Caliper, a library that addresses these requirements. Caliper is a cross-stack, general-purpose introspection framework that explicitly separates the concerns of software developers, tool developers, and users. On the surface, Caliper provides a simple annotation API for application developers, similar to timer libraries found in many codes; in fact, we can wrap Caliper calls inside existing timers. Under the hood, though, Caliper combines information provided through its API from all software layers into a single program context, and it offers interfaces for tool developers to extract this information and augment it with measurements. For this purpose, Caliper includes basic measurement services, but it also offers an interface for existing tools to exploit the stack-wide information. Finally, Caliper users can define a policy that links mechanisms to instrumentation points at runtime. By separating annotations, measurement mechanisms, and policy, instrumentation annotations can permanently remain in production codes to be used for a wide range of performance engineering applications, from post-mortem analysis to runtime adaptation. Moreover, software teams can add instrumentation in different software components independently of each other, allowing correlation of information across software layers. Components are loosely coupled; modifying annotations does not require re-working measurement services, or vice versa.

Overall, our contributions through Caliper include:
• A novel performance analysis methodology that decouples the roles of application developers, tool users, and tool developers by separating these concerns through separate interfaces:
  – A low-overhead annotation interface that allows developers to convey application information through a flexible data model
  – A robust and efficient interface to provide collected information to tools
  – A flexible mechanism to specify data correlation and filtering policies
• A mechanism to combine the annotation information and performance data across software-stack boundaries by creating a stack-wide context
• An efficient way of representing combined context information in a generalized context tree

In two case studies, we demonstrate how Caliper allows application developers to diagnose performance interactions between separate components in a large multi-physics code, and how tool developers can build upon Caliper to access application context information at runtime. Both case studies demonstrate the previously unavailable composability and flexibility Caliper brings to the performance analysis process.

II. THE CALIPER APPROACH: INTROSPECTION AS A SERVICE

Caliper is a universal abstraction layer for general-purpose, cross-stack performance introspection. It is built around the principle of separating mechanism and policy: Caliper provides data-collection mechanisms (e.g. an instrumentation API, interrupt-based sampling timers, hardware-performance counter access, call-stack unwinding, or trace recording) as independent building blocks, which can be combined into a specific performance engineering solution by a user-provided data-collection policy at runtime.

A. Usage Model

Caliper is aimed at HPC software developers, tool developers, and performance engineers alike. Software developers instrument their software components with Caliper annotation commands to mark and describe high-level abstractions, such as kernel names, rank/thread identifiers, timesteps, iteration numbers, and so on. At runtime, the annotations provide context information to performance engineering applications, in terms of abstractions defined by the software developer. Caliper annotations are independent of each other, and developers can add them incrementally anywhere in the software stack. In the long term, we expect these annotations to be kept and maintained permanently in the target components by the component's developers.

Internally, the annotations add or remove entries on a virtual blackboard. The combined blackboard contents provide a holistic view of a program's execution context as described by the annotations, across all software stack layers. In addition, the annotations serve as event hooks that can trigger additional user-defined actions at runtime, e.g. for timing a specific code region.

Performance engineers use Caliper as a platform for building measurement analysis tools. In addition to the global context information generated by the source-code annotations, Caliper puts a variety of methodologies at the engineer's disposal, including interrupt-based timers, trace and profile recording, and timer or hardware-counter measurements. Each of these methodologies is an independent building block. Performance engineers can create runtime configuration profiles that combine the building blocks as needed for specific performance engineering use cases.

B. Basic Concepts

Caliper is centered around three basic abstractions: attributes, a blackboard, and snapshots. Attributes represent individual data elements that can be accessed through a unique key. The blackboard is a global buffer that collects attributes from independent sources. Blackboard updates also serve as event hooks that can be configured to trigger additional actions. Finally, snapshots represent measurement events. When a snapshot is triggered, Caliper creates a snapshot record that contains the current blackboard contents, and tells connected measurement providers to take measurements (e.g., a timestamp) and add them to the snapshot record.

Figure 1 illustrates the relationships between those components. Applications, libraries, runtime systems, and tools connect to Caliper in one or more different roles. As data producers, they create event hooks and update attributes on the blackboard, or provide measurement attributes for snapshot records. As data consumers, they can read the blackboard contents or process snapshot records. They can also perform control tasks, such as triggering snapshots. In this regard, the blackboard and snapshot concepts fulfill two purposes:
• they transparently combine attributes provided by different data producers across the software stack, and
• they enable separation between data producer, data consumer, and measurement control roles.
Caliper acts as the glue that connects independent components with similar or different roles across different software stack layers.
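To make the three abstractions concrete, the following toy model sketches them in plain C++. All names here are illustrative assumptions, not Caliper's actual types: the blackboard is modeled as a key-value map, and a snapshot copies its contents and appends attributes supplied by registered measurement providers.

#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model of the three abstractions (names are illustrative only).
using Attribute  = std::pair<std::string, std::string>; // key-value pair
using Blackboard = std::map<std::string, std::string>;  // current context

struct SnapshotRecord {
    std::vector<Attribute> entries; // blackboard contents + measurements
};

// Taking a snapshot copies the blackboard and asks each connected
// measurement provider for an additional attribute (e.g., a timestamp).
SnapshotRecord take_snapshot(
    const Blackboard& blackboard,
    const std::vector<std::function<Attribute()>>& providers)
{
    SnapshotRecord rec;
    for (const auto& kv : blackboard)
        rec.entries.emplace_back(kv.first, kv.second);
    for (const auto& measure : providers)
        rec.entries.push_back(measure());
    return rec;
}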
III. DATA MODEL

Caliper provides a generic data model that can capture a wide range of information, including user-defined, application-specific concepts. Capturing arbitrary introspection information and storing it so it can be both effectively analyzed at runtime and efficiently streamed to disk is key to making Caliper a viable and useful tool for large-scale performance engineering. To this end, we have developed a data model that is both flexible and efficient. In contrast to traditional, single-purpose performance profile or trace data formats, our data model describes the data's layout and structure rather than its content or semantics. Our model facilitates general-purpose data storage and access similar to non-relational databases, but is optimized for storing the contextual information needed for typical performance engineering use cases.

[Figure 1: architecture diagram. Data producers (applications, libraries, runtimes) publish to the Caliper blackboard; measurement services (e.g., timer, PAPI), snapshot triggers (e.g., sample timer), and processing services (e.g., trace recorder) connect snapshots to data consumers such as auto-tuning or runtime adaptation modules.]

Figure 1. Caliper abstractions enable separation of concerns for performance introspection: Software developers publish program state information on the Caliper blackboard (left column). Tool developers implement data collection, measurement control, and data-processing mechanisms (bottom layer). Based on a user-provided configuration profile, Caliper activates snapshots to connect the individual introspection building blocks at runtime (middle layer). Snapshot records with the combined program context and measurement information can be processed at runtime or stored for post-mortem analysis (right column).

A. Attributes

The basic elements in our data model are attributes in the form of key-value pairs. Attribute keys serve as identifiers for attributes: in addition to a unique name, they store the attribute's data type, as well as optional additional metadata (e.g., the measurement unit) associated with the attribute. In addition to integers, floating-point numbers, and strings, Caliper also supports user-defined datatypes for attributes.

B. Generalized Context Tree

While the conceptual representation of attributes as key-value pairs is flexible and generic, a naive implementation of blackboard buffers and snapshot records using this representation directly could result in significant storage requirements. Therefore, internally, we encode attributes more efficiently. In particular, we exploit typical characteristics of performance context information, such as call paths or named kernels: (1) only a subset of the attributes changes between subsequent snapshots, and (2), due to the iterative nature of typical HPC codes, many attribute combinations likely occur repeatedly during the execution of a program (e.g., the same call paths are visited many times during the execution of a loop).

We use these characteristics to build up a generalized context tree at runtime and add to it whenever introspection information is updated. A snapshot can then be represented by a single tree node ID. Conceptually, call-path profilers use a similar concept by storing a call-path ID pointing to a node in the call tree, instead of storing the entire call path with each measurement. However, our approach merges different attributes into a single tree, such that the combined information composed from any number of attributes can be represented by a single tree node.

Figure 2 illustrates the concept. Here, a generalized context tree is built out of three attributes: phase, representing a user-defined phase hierarchy as a list; state, representing a "parallel" or "serial" execution state; and iteration, representing an iteration number. Each attribute can be updated independently; for example, the module that sets the iteration attribute does not need to know that the phase attribute exists. Caliper builds up the merged tree structure internally whenever the attributes are updated.

[Figure 2: a generalized context tree combining the phase hierarchy (main, init, loop), the state attribute (serial, par.), the iteration attribute (1, 2), and a work node.]

Figure 2. Efficient context information representation. A single pointer to the 'work' node represents the program state described by all nodes on the path to root, composed of three different, independent attributes.

The path from any tree node to the root represents the information composed of all attributes on the path. Thus, a single tree node reference is sufficient to encode an entire snapshot. In the example in Figure 2, the work node represents a snapshot that expands to phase=main/loop/work, iteration=1, and state=parallel.
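The encoding can be sketched with a minimal node type (hypothetical, not Caliper's actual data structures): each node holds one attribute entry and a parent pointer, so expanding a single node handle by walking to the root recovers the full combined context.

#include <string>
#include <utility>
#include <vector>

// Hypothetical context tree node: one attribute entry plus a parent link.
struct ContextTreeNode {
    std::string            key;    // e.g., "phase"
    std::string            value;  // e.g., "loop"
    const ContextTreeNode* parent; // nullptr at the root
};

// Walking from a node to the root recovers all attributes it encodes;
// for the 'work' node in Figure 2 this yields phase=main/loop/work,
// iteration=1, and state=parallel.
std::vector<std::pair<std::string, std::string>>
expand(const ContextTreeNode* node)
{
    std::vector<std::pair<std::string, std::string>> path;
    for (; node != nullptr; node = node->parent)
        path.emplace_back(node->key, node->value);
    return path; // ordered leaf to root
}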
Note that the context tree also provides a straightforward way to encode hierarchical attributes, as demonstrated by the phase attribute in the example. We use the flexible design of the generalized context tree to store attribute metadata in it as well, which allows us to encode all metadata in a single structure.

C. Snapshot Records

Snapshots create a list of attributes with the blackboard contents and measurement data at a specific point in the program's execution, which is stored in a snapshot record. In contrast to traditional HPC profiling and trace data formats, our data model does not place any restrictions on the kind, number, or composition of different attributes in a record. While there is a semantic distinction between context and measurement attributes, we do not explicitly separate them in our data model: both are saved together in a snapshot as generic key-value pairs. For subsequent analysis, the contents of a record receive their meaning from the semantics of the attributes, which we presume to be known to the user who uses or created them.

By encoding snapshots through context tree references, we reduce the space needed to store snapshot records compared to writing full key-value lists. This also provides a convenient and efficient mechanism to pass a snapshot to measurement tools without having to duplicate the actual introspection information. However, this also means that we need to store the context tree structure itself in order to reconstruct the information. Attributes whose values change often or rarely reoccur in subsequent snapshots would needlessly increase the context tree size. For example, recording the global time can be very useful, but its value never repeats. Therefore, we use a hybrid approach to store snapshot records: we write a context-tree node ID to encode attributes included in the tree, and we write explicit key-value pairs for the remaining attributes. A flag can be given to each attribute key via the Caliper API to decide whether the attribute should be included in the context tree or written explicitly.
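A sketch of the resulting record layout (field names here are hypothetical): attributes that repeat across snapshots collapse into a single context-tree node reference, while unique values such as timestamps remain explicit key-value pairs.

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical hybrid snapshot record layout.
struct HybridSnapshotRecord {
    std::uint64_t context_node_id; // encodes all tree-resident attributes
    std::vector<std::pair<std::string, std::string>>
        explicit_entries;          // values that rarely repeat, e.g. the
                                   // global time, stay out of the tree
};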
IV. DESIGN AND IMPLEMENTATION

In the following, we discuss the design and implementation of the Caliper framework. Caliper consists of the source-code annotation API, a backend runtime component that manages blackboard buffers and the generalized context tree, add-on support services for control tasks (such as I/O and controlling snapshots), and additional data producer, measurement control, and data consumer services. The framework also includes libraries and basic post-processing tools for Caliper data streams. Caliper is written in standard C++ and has been tested on a variety of HPC platforms, including standard Linux clusters as well as BlueGene/Q and Cray XC systems.

A. Architecture

As shown in Figure 1, the blackboard and the snapshot mechanism are the basic methods to collect data from data producers and provide it to data consumers (including the option to write it out for offline analysis). Data producers and consumers interact with Caliper through annotation and control APIs, or by registering callback functions for certain events. The APIs allow them to
• create attribute keys,
• set, update, or remove attributes on the blackboard,
• query attributes on the blackboard,
• take snapshots, and
• register callback functions.
The callback interface provides notifications about
• the creation of new attribute keys,
• updates or removals of attributes on the blackboard,
• snapshot generation,
• snapshot processing, and
• I/O requests.
With these interfaces, Caliper enables functional and physical separation of data producers and consumers, as both only interact with the Caliper runtime and not directly with each other.

B. Blackboard Buffer Management

Caliper's runtime component manages the blackboard buffers, which store the current values of attributes, and maintains the process-global generalized context tree. This runtime component lives in a single shared object instance per process that is created automatically on the first invocation of the Caliper API.

Caliper maintains separate blackboard buffers for each thread within the same process. To do that, attributes have a scope parameter, which is either thread or process. For runtimes which support lightweight tasks, a task scope is also available. The scope can be set at the creation of the attribute. By default, attributes receive the thread scope. Updates of thread-scope attributes go into the thread's local blackboard buffer. Consequently, when two different threads update a thread-scope attribute, each thread updates its own copy. In contrast, when different threads update a process-scope attribute, they overwrite each other's values. The blackboard buffer management is completely transparent to the user; she only needs to specify the scope.

Caliper does not explicitly address cross-process data integration. Thereby, we avoid dependencies on specific parallel runtime systems or programming environments, which would be needed to communicate across processes. Snapshots are solely managed on a per-process basis, and it is up to analysis tools to manage snapshot records from multiple processes. Snapshot records are compatible between processes of the same application, since attribute names match. Further, tools can use Caliper to store information about the parallel structure of an application, e.g. by adding process ranks or other task identifiers as attributes to the context of each process, thread, or task. This enables analysis tools to provide comprehensive analysis operations across information gathered from large-scale parallel applications.
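The thread/process scope behavior described above can be pictured with standard C++ thread-local storage (a conceptual sketch, not Caliper's actual buffer implementation):

#include <map>
#include <mutex>
#include <string>

enum class Scope { Thread, Process };

// Conceptual sketch: thread-scope updates go to per-thread storage,
// process-scope updates to storage shared by all threads.
thread_local std::map<std::string, std::string> thread_blackboard;
std::map<std::string, std::string>              process_blackboard;
std::mutex                                      process_mutex;

void update(Scope scope, const std::string& key, const std::string& value)
{
    if (scope == Scope::Thread) {
        thread_blackboard[key] = value;  // each thread sees its own copy
    } else {
        std::lock_guard<std::mutex> lock(process_mutex);
        process_blackboard[key] = value; // threads overwrite each other
    }
}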

C. Snapshots

A snapshot saves the states of all attributes (i.e., the blackboard contents) at a specific point in the execution of the target program. Snapshots can be triggered via the push_snapshot or pull_snapshot API calls. The pull_snapshot call returns the snapshot record to the caller, while the push_snapshot call forwards the snapshot to other modules through the callback notification interface. Snapshots can be triggered asynchronously at any time, independent of blackboard updates. However, it is possible to trigger a snapshot for some or all blackboard updates by registering a callback function to do so, enabling event-driven data collection workflows. Caliper provides an event service module for this purpose.

In addition to the blackboard contents, Caliper can add transient attributes to a snapshot. This is done through a callback method that is invoked when a snapshot is triggered, and is used to add measurements such as timestamps to a snapshot.

Snapshots are thread-local; that is, they capture the contents of a single thread and process blackboard buffer. Thus, a snapshot record contains all process-scope attributes, and the thread-scope attributes from the triggering thread. The task of combining data from multiple threads or tasks for analysis purposes is left to the analysis tool.

D. Services and Runtime Configuration

Much of Caliper's functionality is provided by modules called services, each of which performs a specific task. Services can perform data producer, consumer, or measurement control roles (e.g., triggering snapshots).

In general, services are independent of each other, but they can be combined to form complex processing chains. For example, an event-driven snapshot trace of an annotated target program can be collected by enabling the event service (a measurement control service which triggers a snapshot when the blackboard is updated), the trace service (a data consumer service that maintains Caliper's snapshot record output buffer), and the recorder service (which writes output buffers to disk). Users specify the services that should be enabled via a configuration file or an environment variable before running the target application. By default, no services are active. The following setting loads the event-trace configuration described above:

CALI_SERVICES_ENABLE=event:recorder:trace

The functionality can be enhanced or modified by adding or replacing services. For example, adding the timestamp or callpath service adds timestamps or call stack information to the snapshots, respectively. Similarly, the event service can be replaced with a sample service for triggering snapshots based on timer interrupts instead of attribute updates. The two services could even be combined to create a hybrid event-driven and sampled trace. Many of the services also have their own configuration options to customize their behavior further. Caliper provides pre-defined configuration profiles for common tasks, which users can adapt to their needs. They can also create their own configuration profiles. Overall, the ability to add and remove services as needed gives users great flexibility in creating customized data collection solutions without having to modify the target application or the instrumentation system.
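As a concrete illustration, adding the timestamp service to the event-trace setting above would yield a timestamped trace. This is a sketch assuming the service names introduced in this section; exact names may vary across Caliper versions:

CALI_SERVICES_ENABLE=event:recorder:timestamp:trace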
E. Annotation Interface

Caliper provides easy-to-use high-level C, Fortran, and C++ interfaces for source-code annotations. It provides functions to create and update attributes via begin/end annotations that outline temporal or spatial regions in the code or execution, or by directly setting a value. As a typical use case, consider a simple legacy profiling interface with TimerStart and TimerStop calls created by an application developer for basic performance monitoring. We can easily wrap Caliper annotations within this interface:

const struct Timer {
  const char* name;
  // ...
} timers[] = { { "outer phase", /* ... */ },
               /* ... */ };

void TimerStart(int id) {
  // ...
  cali::Annotation(timers[id].name).begin();
}

void TimerStop(int id) {
  // ...
  cali::Annotation(timers[id].name).end();
}

Here, we create a cali::Annotation object to access an attribute with the given name (which is taken from the legacy Timer struct in the example) on the Caliper blackboard. In the example, we use the begin() method to add a Caliper attribute with the given timer name in TimerStart, and the end() method to remove it in TimerStop. Note that by default, the annotations only update the blackboard, but do not take time measurements. However, we can replicate the original timekeeping functionality by providing a runtime configuration profile that takes snapshots connected with a timestamp source for each timer attribute update. The specific functionality of the profiling interface can now be configured at runtime.

We can also use the annotation interface to export other kinds of execution context information as attributes, for example, the iteration number in a loop:

cali::Annotation iter_ann("iteration");

for (int i = 0; i < MAX; ++i) {
  iter_ann.set(i);
  // ...
}

Here, we create the iter_ann annotation object for the "iteration" attribute in advance. This way, we avoid having to look up the attribute by name within the loop. We then use the set() method at the beginning of each iteration to update the iteration information on the Caliper blackboard. In contrast to begin(), which would add a new entry, set() overwrites the current value on the blackboard. Caliper automatically combines the iteration counter information with the information from all other annotations within the program.

F. Data Streaming and Processing

A performance experiment typically produces a large, distributed set of snapshot records. Caliper's I/O services write the snapshot records, together with the metadata information (i.e., the generalized context tree structure), into a data stream.

In a distributed-memory environment, Caliper currently produces one context stream per process. Because snapshots are self-contained and have no inherent ordering, multiple data streams naturally lend themselves to parallel processing. In particular, search and filter operations are embarrassingly parallel problems. Moreover, it is possible to perform aggregation operations on one or more attributes across all snapshot records in a stream. As a result, data streams in our model can cover a range of possible performance experiments, from runtime traces to profiles.

V. PERFORMANCE EVALUATION

To evaluate the performance impact of Caliper annotations in a real-world scenario, we compare annotated and non-annotated versions of the LULESH [1] hydrodynamics proxy application. Specifically, we use a modified version of LULESH that uses the RAJA C++ library [2], which decouples the loops in the application from the loop execution parameters to enable static tuning of the loop execution policy. We discuss RAJA in more detail in our case study in Section VI-B.

For our performance evaluation, we added Caliper annotations in the RAJA library and in the LULESH application code to collect one attribute for each RAJA loop invocation, and three attributes for each LULESH main loop iteration. Our experiments ran on the Cab cluster at LLNL (see Table I). We ran annotated and non-annotated versions of LULESH in its default configuration (problem size 30, run to completion) with 16 OpenMP threads on a single cluster node, and we report the wall-clock execution time of its main loop (excluding job setup and I/O). We compare the following Caliper runtime configuration profiles:
• Replacing Caliper with a no-op stub library ("Stub"),
• Performing blackboard updates only, without taking snapshots ("Blackboard"),
• Recording a snapshot trace (one snapshot per blackboard update) without performance measurements ("Snapshot"),
• Recording a snapshot trace with timestamps ("w/ Timestamp").

               Cab                 rzAlastor
Nodes:
CPU            Dual 8-core Xeon    Dual 10-core Xeon
RAM            32 GiB              48 GiB
Interconnect   InfiniBand DDR      InfiniBand DDR
Software:
OS             Linux 2.6.32        Linux 2.6.32
Compiler       Intel C++ 16.0.150  GCC 4.9.2

Table I. LLNL cluster configurations.

To mitigate the impact of external effects (e.g., OS noise) on our results, we ran each configuration 5 times, and report the fastest time for each configuration. The run-to-run variation was less than 0.3 seconds (or 3% of the execution time) for each configuration. In the Caliper-enabled configurations, Caliper performs 1,123,087 blackboard updates and snapshots. Snapshots are pushed into Caliper's trace buffer, but not written to disk. In a typical run, data is written to disk when the program exits to minimize perturbation.

Figure 3 shows the results. The stub library configuration shows a measurable but small (1% of the execution time) effect of adding additional function calls in the RAJA loop invocations. When Caliper performs blackboard updates but no snapshots, the main loop's execution time increases by 0.46 seconds compared to the uninstrumented version, indicating an average cost of about 0.4 microseconds per blackboard update. Snapshots without additional measurement take about 0.85 microseconds each on average; when adding timestamps, they take 1.3 microseconds. Table II shows Caliper's memory consumption during the Caliper-enabled runs. With less than 40 KiB in each configuration, the memory requirements for the generalized context tree and blackboard buffers are small. Memory requirements for Caliper's trace buffers depend on the snapshot granularity and size. Caliper compresses snapshot records internally to minimize trace buffer sizes. In our LULESH example, Caliper used 8 MiB of memory to store 1,123,087 timestamped snapshots – 7.5 bytes per snapshot on average.

[Figure 3: bar chart of the annotated LULESH main loop runtime for the Stub, Blackboard, Snapshot, and w/ Timestamp configurations.]

Figure 3. Runtime of the annotated LULESH proxy application for different Caliper runtime configurations. The red line highlights the uninstrumented baseline version.

                              Blackboard     Snapshots   Snapshots
                              updates only               with timestamps
Context tree and blackboards  24.2 KiB       37.2 KiB    38.2 KiB
Trace buffer                  n/a            6.4 MiB     8.0 MiB

Table II. Caliper memory consumption per process in the annotated LULESH proxy application for different Caliper runtime configurations.
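As a quick cross-check, the per-event costs quoted above follow directly from the aggregate numbers (8 MiB = 8,388,608 bytes):

\[
\frac{0.46\,\mathrm{s}}{1{,}123{,}087\ \text{updates}} \approx 0.41\,\mu\mathrm{s}\ \text{per update},
\qquad
\frac{8{,}388{,}608\ \text{bytes}}{1{,}123{,}087\ \text{snapshots}} \approx 7.5\ \text{bytes per snapshot}.
\]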

The separation of mechanism and policy allows great flexibility in balancing measurement overheads with application insight. In contrast to traditional introspection approaches, instrumentation and measurement costs, e.g. for taking timestamps or saving snapshot records, occur only when needed. For all but the most fine-grained semantic abstractions (e.g., individual iterations of inner loops), blackboard updates themselves are sufficiently efficient to keep Caliper annotations permanently enabled.

VI. CASE STUDIES

In the following, we demonstrate two Caliper use cases. First, we show how Caliper was used to explain performance behavior in a large multi-physics code by correlating attributes from different libraries. In the second example, we show how we collect Caliper attributes to create predictive performance models, which are then used together with attributes collected online to select optimal settings at runtime.

A. Using Caliper to Instrument a Large Parallel Multi-Physics Code

Here, we study a large radiation hydrodynamics code capable of running small serial to large massively parallel simulations. It is used primarily in munitions modeling and inertial confinement fusion simulations.

This code relies on several numeric libraries, including the structured AMR library SAMRAI [3] and the linear solver library HYPRE [4]. We use Caliper to instrument the individual components and the application itself. While each component is instrumented separately¹, Caliper's shared context allows us to look at how the components impact each other. For example, the ability to annotate the regrid phases in SAMRAI will enable the application developers to anticipate the update of data structures needed for HYPRE, along with increased timestep length of the application. This goes beyond classical performance analysis, which provides simple metrics like time spent in a given function, and is a step towards true algorithmic performance debugging. Moreover, any instrumentation in a commonly shared library, such as the widely used HYPRE package, can be reused for studying other simulation applications built on top of the same library. This enables enhanced tool support in applications that make no direct use of Caliper in their main code, without any further effort on the side of the user. Further, existing Caliper instrumentation does not interfere with new Caliper instrumentation, making it easy to add and remove annotations as application developers see fit. Such instrumentation reusability and composability is key for large software.

¹In practice, the development team for each library as well as for the application itself would insert Caliper annotations independently of each other, enabling true modularity.

To demonstrate these capabilities, we annotated the main application and its libraries. In the main application, we collect the following attributes:
1) Application phase ("main", "loop");
2) MPI rank;
3) Main loop iteration number ("iteration").
Because the HYPRE library is written in C, we used Caliper's C wrapper layer to annotate HYPRE. Here, we collect:
1) Execution phase (hypre phase, "vcycle" and "loop");
2) The vcycle level ("vcycle level").
In SAMRAI, we collect:
1) Execution phase (amr phase, "regrid" and "loop");
2) The regrid level ("regrid level").

We then examined a simulation run that simulates a simplified inertial confinement fusion capsule. All experiments ran on the rzAlastor cluster at LLNL (see Table I). We ran each experiment ten times and averaged the results.

We enabled Caliper's timestamp service to automatically obtain phase duration information. With this configuration, Caliper recorded about 92,000 snapshots per process. In our configuration, we observe between 1 and 5% measurement overhead with Caliper snapshots enabled.

[Figure 4: average time (µs) per timestep over 100 timesteps for iteration length, AMR regrid, and HYPRE solve.]

Figure 4. By relating data from independent Caliper context annotations in the HYPRE and SAMRAI libraries, we learn that regrid phases in SAMRAI cause re-setup of HYPRE matrices in the following time step and a longer timestep length as a result.

Table III shows an excerpt from the snapshot record stream produced by Caliper. We selected records from a single simulation iteration on a single process; therefore, we omitted the iteration number and MPI rank attributes in the table, as they are identical in all of the selected records. A line represents a single snapshot record, and each record contains all context attributes set by the instrumented components, as well as the duration of the phase in microseconds as measured by the timestamp service. Note that some attributes are not necessarily active in all of the snapshots.

The combined information from all instrumented components allows us to study the influence of parameters of some component on parameters or execution time of other components. For example, we can extract whether the refinement of the AMR grid in SAMRAI impacts the execution time of HYPRE.

amr phase=regrid duration=36
amr phase=regrid/loop duration=76019
amr phase=regrid/loop regrid level=1 duration=40
amr phase=regrid/loop duration=6
amr phase=regrid duration=361714
amr phase=regrid phase=main/loop duration=143816
amr phase=regrid hypre phase=vcycle phase=main/loop duration=20
amr phase=regrid hypre phase=vcycle/loop phase=main/loop duration=435
amr phase=regrid vcycle level=1 hypre phase=vcycle/loop phase=main/loop duration=358
amr phase=regrid vcycle level=2 hypre phase=vcycle/loop phase=main/loop duration=146
amr phase=regrid vcycle level=3 hypre phase=vcycle/loop phase=main/loop duration=64
amr phase=regrid vcycle level=4 hypre phase=vcycle/loop phase=main/loop duration=39
amr phase=regrid vcycle level=5 hypre phase=vcycle/loop phase=main/loop duration=18
amr phase=regrid vcycle level=4 hypre phase=vcycle/loop phase=main/loop duration=29
amr phase=regrid vcycle level=3 hypre phase=vcycle/loop phase=main/loop duration=60
amr phase=regrid vcycle level=2 hypre phase=vcycle/loop phase=main/loop duration=120
amr phase=regrid vcycle level=1 hypre phase=vcycle/loop phase=main/loop duration=234
amr phase=regrid vcycle level=0 hypre phase=vcycle/loop phase=main/loop duration=243
amr phase=regrid hypre phase=vcycle/loop phase=main/loop duration=8
amr phase=regrid hypre phase=vcycle phase=main/loop duration=8
amr phase=regrid phase=main/loop duration=286

Table III. Sample output of Caliper instrumentation in our hydrodynamics application, HYPRE, and SAMRAI. Rows represent individual snapshots; columns represent the collected attributes.

Using a simple Python script, we parse and cross-compare data from raw context streams as presented in Table III. Figure 4 shows that the timesteps in which the SAMRAI library regrids (shown by red markers) are always followed by a re-setup of data structures for HYPRE, and therefore a longer HYPRE solve time (black markers), in the immediately following timesteps. The outliers in the early timesteps are attributed to cold caches. Both of those operations increase the runtime of the application, which is clearly correlated with the total timestep length of the application. We are able to gather these correlations because we have instrumented SAMRAI, HYPRE, and the main application, and are able to see the information from all three in the same context.
The experience of instrumenting this large hydrodynamics application and its components with Caliper annotations was straightforward, and it will enable us to study correlations between parameters in different components and their influence on other components' runtime and accuracy. From the software engineering perspective, the ability to instrument a large code incrementally is crucial. Additionally, composability of context annotations means that there is no need to disable subsets of the instrumentation in order to look for features in another part of the code, enabling us to leave the instrumentation in place.

B. Using Caliper for Runtime Tuning with Supervised Machine Learning

Complex scientific applications exhibit dynamic and data-dependent behavior. This behavior determines the parameter selections for tuning the application's runtime. Existing tuning approaches are applied statically, or assume that code and data change slowly, interleaving tuning phases with application execution. However, these slow changes are rarely the case in production scientific applications, where data can vary dramatically each time control passes over a loop. To obtain the best performance, we use lightweight decision models to dynamically tune application parameters at runtime on a loop-by-loop basis in response to changing application data. To achieve this, we require cross-stack data collection and, furthermore, runtime access to that data, both of which are provided by Caliper.

To dynamically adjust the application execution configuration at runtime, we use RAJA, a C++ library that provides flexible parallel execution methods over defined indices using a combination of template execution methods and lambda functions [2]. RAJA is a programming framework that decouples the loops in the application from the loop execution parameters, enabling static tuning of the policy used to perform the loop iterations. For example, the following application loop:

for (int i = begin; i < end; ++i) {
  sigxx[i] = sigyy[i] = sigzz[i] = -p(i) - q(i);
}

is transformed into a RAJA loop:

RAJA::forall<exec_policy>(IndexSet, [=](int i) {
  sigxx[i] = sigyy[i] = sigzz[i] = -p(i) - q(i);
});

We focus on tuning the execution policy, that is, how each loop is mapped to hardware by the RAJA library. Whilst typically most thread-safe loops in an application should be run using threads, in some cases, such as when the workload is small, the overhead of threads means it is actually faster to run in serial. Our model is designed to identify these cases and execute them accordingly.

Our tuning model is built using a decision tree algorithm that infers a function from a training set to a set of target labels.

[Figure 5: workflow diagram. Caliper annotations in the RAJA library and the application produce feature vectors; model generation emits a C++ model that is compiled into the application and evaluated during execution.]

Figure 5. Workflow for automatically generating models to tune application parameters based on Caliper annotations.

Listing 1. Caliper annotations at the RAJA library level.

template <typename EXEC_POLICY_T,
          typename LOOP_BODY>
RAJA_INLINE
void forall(const IndexSet& iset,
            LOOP_BODY loop_body)
{
  // Begin Caliper annotations, e.g.
  cali::Annotation("num_indices")
      .set(iset.getLength());

  forall(EXEC_POLICY_T(),
         iset,
         loop_body);

  // End Caliper annotations
}

The training set is a subset of the data collected at runtime by Caliper annotations, consisting of a number of samples; each sample describes one invocation of some loop in the application execution. Each sample is labelled with the best parameter value, that is, the one with the lowest observed execution time. The correct label for each sample is determined by inspection: we assign to each sample a label based on the execution policy that provided the lowest runtime for that particular lambda invocation. The samples not used as part of the training set are reserved for model evaluation. The classifier built during the training step can then be used to predict the output of a test set of samples without any labels. The labels predicted by the classifier for the test set can be used to evaluate the accuracy of the learned function. We then convert the decision trees learned by our model generator into C++ code, which can be compiled and then used at runtime. Figure 5 shows our integrated workflow for collecting features and generating models using Caliper. There are two aspects of the workflow that are enabled by the cross-application flexibility provided by Caliper: data collection, and online data access.

1) Data Collection: A key challenge for generating models using supervised learning is collecting features that may impact application performance, such that they can be used to construct a model for predicting which parameter value is best for a given loop. To collect data for model generation, we compile the application with an instrumented version of the RAJA library. Caliper annotations are inserted within the RAJA template execution method (see Listing 1). The design of RAJA naturally decouples the application-specific loop body from the platform-, architecture-, and programming-model-specific loop execution. This means that the features for each loop can be recorded without modifying the application code.

We also measure application-level features, such as the current timestep and the global problem size, using Caliper annotations inserted into the application source. Caliper allows us to combine all features across the application stack, generating a feature vector for each loop invocation. Table IV describes some of the features we collect.

Feature           Description
callpath.address  Call stack of current function.
func              Address of lambda function.
func size         Total number of instructions in loop body.
index type        Type of RAJA IndexSet.
num indices       Total number of indices in IndexSet.
num segments      Number of segments in the IndexSet.
problem size      Global problem size (application specific).
stride            Stride of IndexSet segments.
timestep          Current timestep (application specific).

Table IV. Features gathered using RAJA and Caliper.

Using Caliper to automatically collect and merge features from all software components, we are able to include features deemed important by the application developers in our model. This functionality is essential in capturing application- and domain-specific data that might be deemed unimportant by a computer scientist, but that has a large impact on application performance. Whilst features such as the global problem size and the current timestep can be considered application specific, since they are accessed using a string identifier (e.g. "timestep"), we can use a common naming scheme to allow cross-application comparison of these attributes.

2) On-line Data Querying for Runtime Adaptation: Our modeling workflow generates C++ code that implements the decision process found in the models. This code can be compiled with the application and used to dynamically alter the execution policy at runtime. As input, the model requires a sample which contains values for all the features used when training the model. Using Caliper's on-line query interface, we are able to access these feature values from across the application stack within our generated code, and use these values to evaluate the model. A key advantage of using Caliper is that annotations are re-used: the annotations used to record features and create the model are used again when the model is being evaluated online.
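As an illustration, the generated code might look like the following sketch; the policy names, feature arguments, and thresholds here are hypothetical, while real generated trees branch on the features listed in Table IV.

// Hypothetical example of C++ code generated from a learned decision tree.
enum class ExecPolicy { Serial, OpenMP };

ExecPolicy select_policy(long num_indices, long num_segments)
{
    // Thresholds are made up for illustration only.
    if (num_indices < 4096)
        return ExecPolicy::Serial; // small workload: thread overhead dominates
    if (num_segments == 1 && num_indices < 16384)
        return ExecPolicy::Serial;
    return ExecPolicy::OpenMP;     // large workload: threading pays off
}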
3) Results: Whilst this work is still in its preliminary stages, by using Caliper's online query interface and our generated data-dependent models, we are able to improve application performance in simple cases. Figure 6 shows the runtime at a range of processor counts on LLNL's Cab cluster (Table I) using one of our generated data-dependent models in the CleverLeaf mini-application [5]. At small problem sizes the default serial execution is sufficient, but at larger problem sizes, threading is enabled by the model to improve performance.

[Figure 6: execution time in seconds vs. node count for the generated model and the default policy in CleverLeaf.]

Figure 6. Runtime improvements using the dynamically auto-generated model on up to 256 cores.

VII. RELATED WORK

A. Performance Introspection

The HPC community has developed a wealth of performance introspection and data collection frameworks that cover a wide range of methods, systems, and use cases. In general, performance introspection systems are based on a few basic techniques: instrumentation, i.e., inserting commands to observe a program section of interest into the target code; statistical sampling, which probes the program's state and collects measurement data at regular intervals, e.g. via a timer interrupt; call-stack unwinding, which provides the chain of function calls leading to a performance probe; and interfaces within the software stack that serve as hooks for measurement probes (e.g., the OpenMP tools interface [6]) or provide certain information (e.g., the PAPI [7] interface for accessing CPU performance counter data).

For basic performance monitoring tasks, many HPC applications use home-grown instrumentation solutions with simple time-measurement "calipers" around code sections of interest. On the other end of the spectrum, frameworks like Score-P [8], TAU [9], HPCToolkit [10], or Open|SpeedShop [11] collect detailed per-thread execution profiles or traces for in-depth post-mortem analyses, such as automatic bottleneck detection [12], profile analysis [13], [14], or trace visualization [15]. Score-P and TAU primarily use instrumentation to collect measurement data, but also support sampling. In addition to source-to-source translation or compiler-based mechanisms to instrument user functions automatically, they also provide APIs for manual source-code annotation. However, their instrumentation approach creates tool-specific code, requiring users to recompile their code, or potentially an entire software stack, specifically for performance analysis. In contrast, Caliper's general-purpose annotations can be left in the source code and activated for performance analysis use cases as needed by a runtime configuration profile.

HPCToolkit and Open|SpeedShop use statistical sampling and call-stack unwinding to generate call-path profiles. While this approach enables some form of cross-stack analysis, the call path does not capture application semantics on a relatable level (e.g., kernel names) that can only be obtained through source-code annotations. The Caliper blackboard makes it possible to correlate samples with high-level program context.

B. Blackboard Systems

The general concept of a blackboard originated with the Hearsay-II speech understanding system [16], in which many disparate knowledge sources stored observations in a common global database. Blackboard systems later grew into a major sub-field of artificial intelligence [17], [18]. With the shift towards complex node architectures, blackboard-like designs have become popular for performance introspection tools.

The Autonomic Performance Environment for Exascale (APEX) [19] – originally developed as an introspection interface for the HPX programming model – is a data-sharing layer for cross-stack performance analysis and runtime tuning. APEX shares Caliper's design philosophy regarding the blackboard and separation of concerns. However, the TAU-like data format underlying APEX limits its expressivity to start/stop regions, whereas Caliper's data model captures arbitrary kinds of context information. APEX also hard-wires some of its profiling activities (e.g. taking timestamps). The Resource-Centric Reflection Toolkit [20] (RCRToolkit) also uses a blackboard-like design to monitor parallel runtime systems. RCRToolkit's shared-memory blackboard is managed by a per-node daemon process. Clients communicate with the daemon using data in Google Protocol Buffers [21] with strict, pre-specified schemas.

In contrast to these tools, Caliper's snapshot mechanism provides a more flexible, runtime-configurable way to connect data producers and consumers. Its data model is unstructured and schema-less, which allows component designers and application developers more freedom to quickly extend and modify their annotations and measurements. Additionally, Caliper uses the generalized context tree to represent attribute values efficiently as node handles.

VIII. CONCLUSIONS

Faced with rising architecture and application complexity, paired with ever-growing system scales, the ability to understand and optimize application performance is more critical than ever before. While a wide range of performance analysis solutions exist that enable us to collect the performance data for this task, many of them require tool-specific annotation solutions tied to a particular data model that cannot be combined across tools and/or independently annotated software modules.

With Caliper, we address this combinatorial challenge by separating the concerns of software developers, tool developers, and users. With the Caliper annotation API, software developers can expose high-level application semantics for performance analysis in a general-purpose, reusable way – a crucial requirement for our long-term goal to integrate performance introspection across the entire HPC production software stack from the start.

Our two case studies already demonstrate the benefits of this approach: instrumenting a large radiation hydrodynamics code as well as its libraries shows how we can combine context information from independent annotations, and how this helped us to dissect performance information. In the RAJA use case, we demonstrate how Caliper annotations support runtime adaptation based on supervised machine learning, a

novel performance engineering approach that requires both off-line and on-line access to context information. Combined, these case studies illustrate how the flexible introspection offered by Caliper opens a new path towards effective and insightful performance analysis for complex HPC applications.

SOURCE CODE

Caliper is available at https://github.com/LLNL/Caliper. Documentation is available at software.llnl.gov/Caliper.

ACKNOWLEDGMENT

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 and supported by the Office of Science, Office of Advanced Scientific Computing Research, as well as the Advanced Simulation and Computing (ASC) program.

LLNL release LLNL-CONF-699263.

REFERENCES

[1] I. Karlin, A. Bhatele, B. L. Chamberlain, J. Cohen, Z. DeVito, M. Gokhale, R. Haque, R. Hornung, J. Keasler, D. Laney, E. Luke, S. Lloyd, J. McGraw, R. Neely, D. Richards, M. Schulz, C. H. Still, F. Wang, and D. Wong, "LULESH programming model and performance ports overview," Tech. Rep. LLNL-TR-608824, Dec. 2012.
[2] R. D. Hornung and J. A. Keasler, "The RAJA Portability Layer: Overview and Status," Lawrence Livermore National Laboratory, Tech. Rep. LLNL-TR-661403, Sep. 2014.
[3] B. T. N. Gunney, A. M. Wissink, and D. A. Hysom, "Parallel Clustering Algorithms for Structured AMR," Journal of Parallel and Distributed Computing, vol. 66, no. 11, pp. 1419–1430, 2006.
[4] R. Falgout, J. Jones, and U. Yang, "The Design and Implementation of HYPRE, a Library of Parallel High Performance Preconditioners," in Numerical Solution of Partial Differential Equations on Parallel Computers, A. M. Bruaset and A. Tveito, Eds., vol. 51, no. 4, pp. 267–294, 2006.
[5] D. A. Beckingsale, W. Gaudin, A. Herdman, and S. Jarvis, "Resident Block-Structured Adaptive Mesh Refinement on Thousands of Graphics Processing Units," in Proceedings of the 44th International Conference on Parallel Processing. IEEE, Aug. 2015, pp. 61–70.
[6] A. E. Eichenberger, J. M. Mellor-Crummey, M. Schulz, M. Wong, N. Copty, J. DelSignore, R. Dietrich, X. Liu, E. Loh, and D. Lorenz, "OMPT: OpenMP tools application programming interfaces for performance analysis," in Proc. of the 9th International Workshop on OpenMP (IWOMP), Canberra, Australia, ser. LNCS, no. 8122. Berlin/Heidelberg: Springer, 2013, pp. 171–185.
[7] P. J. Mucci, S. Browne, C. Deane, and G. Ho, "PAPI: A portable interface to hardware performance counters," in Proc. Department of Defense HPCMP User Group Conference, Jun. 1999.
[8] A. Knüpfer, C. Rössel, D. an Mey, S. Biersdorff, K. Diethelm, D. Eschweiler, M. Geimer, M. Gerndt, D. Lorenz, A. Malony, W. E. Nagel, Y. Oleynik, P. Philippen, P. Saviankou, D. Schmidl, S. Shende, R. Tschüter, M. Wagner, B. Wesarg, and F. Wolf, "Score-P: A joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir," in Tools for High Performance Computing 2011, H. Brunst, M. S. Müller, W. E. Nagel, and M. M. Resch, Eds. Springer Berlin Heidelberg, 2011, pp. 79–91.
[9] S. Shende and A. D. Malony, "The TAU parallel performance system," International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287–311, 2006.
[10] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, "HPCToolkit: Tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685–701, 2010.
[11] M. Schulz, J. Galarowicz, D. Maghrak, W. Hachfeld, D. Montoya, and S. Cranford, "Open|SpeedShop: An open source infrastructure for parallel performance analysis," Scientific Programming, vol. 16, no. 2-3, pp. 105–121, 2008.
[12] M. Geimer, F. Wolf, B. J. N. Wylie, E. Ábrahám, D. Becker, and B. Mohr, "The Scalasca performance toolset architecture," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 702–719, Apr. 2010. [Online]. Available: http://apps.fz-juelich.de/jsc-pubsystem/pub-webpages/general/get_attach.php?pubid=142
[13] J. Mellor-Crummey, R. Fowler, and G. Marin, "HPCView: A tool for top-down analysis of node performance," The Journal of Supercomputing, vol. 23, pp. 81–101, 2002.
[14] K. A. Huck and A. D. Malony, "PerfExplorer: A performance data mining framework for large-scale parallel computing," in Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, ser. SC '05. Washington, DC, USA: IEEE Computer Society, 2005. [Online]. Available: http://dx.doi.org/10.1109/SC.2005.55
[15] W. E. Nagel, A. Arnold, M. Weber, H. C. Hoppe, and K. Solchenbach, "VAMPIR: Visualization and analysis of MPI resources," Supercomputer, vol. 12, no. 1, pp. 69–80, 1996.
[16] L. D. Erman, F. Hayes-Roth, V. R. Lesser, and D. R. Reddy, "The Hearsay-II speech-understanding system: Integrating knowledge to resolve uncertainty," ACM Computing Surveys, vol. 12, no. 2, pp. 213–253, 1980.
[17] H. P. Nii, "Blackboard application systems, blackboard systems and a knowledge engineering perspective," AI Magazine, vol. 7, no. 3, p. 82, 1986.
[18] D. D. Corkill, "Blackboard systems," AI Expert, vol. 6, no. 9, pp. 40–47, 1991.
[19] K. Huck, A. Porterfield, N. Chaimov, H. Kaiser, A. D. Malony, T. Sterling, and R. Fowler, "An Autonomic Performance Environment for Exascale," Supercomputing Frontiers and Innovations, vol. 2, no. 3, 2015.
[20] A. Mandal, R. Fowler, and A. Porterfield, "System-wide introspection for accurate attribution of performance bottlenecks," in Workshop on High-performance Infrastructure for Scalable Tools (WHIST), Venice, Italy, Jun. 2012.
[21] K. Varda, "Google's data interchange format," Online, Jul. 7, 2008, https://developers.google.com/protocol-buffers/.
