Caliper: Performance Introspection for HPC Software Stacks
August 2, 2016
This document was prepared as an account of work sponsored by an agency of the United States
government. Neither the United States government nor Lawrence Livermore National Security, LLC,
nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or
responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or
process disclosed, or represents that its use would not infringe privately owned rights. Reference herein
to any specific commercial product, process, or service by trade name, trademark, manufacturer, or
otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the
United States government or Lawrence Livermore National Security, LLC. The views and opinions of
authors expressed herein do not necessarily state or reflect those of the United States government or
Lawrence Livermore National Security, LLC, and shall not be used for advertising or product
endorsement purposes.
Abstract—Many performance engineering tasks, from long-term performance monitoring to post-mortem analysis and on-line tuning, require efficient runtime methods for introspection and performance data collection. To understand interactions between components in increasingly modular HPC software, performance introspection hooks must be integrated into runtime systems, libraries, and application codes across the software stack. This requires an interoperable, cross-stack, general-purpose approach to performance data collection, which neither application-specific performance measurement nor traditional profile or trace analysis tools provide. With Caliper, we have developed a general abstraction layer to provide performance data collection as a service to applications, runtime systems, libraries, and tools. Individual software components connect to Caliper in independent data producer, data consumer, and measurement control roles, which allows them to share performance data across software stack boundaries. We demonstrate Caliper's performance analysis capabilities with two case studies of production scenarios.

Keywords—Performance analysis; High performance computing; Computer performance; Software tools; Software performance; Software reusability; Parallel processing.

I. MOTIVATION

Increasing on-node complexity, coupled with thread- and task-parallel runtimes, will make understanding the performance of exascale applications more difficult. Performance problems in modern systems can be caused by many different factors across many different levels in the software stack, from the operating and runtime system to libraries and application code. To diagnose problems, we need introspection capabilities in the form of annotation APIs that allow application and system developers to expose semantic information from their software components. We must also couple this semantic information with comprehensive performance measurement techniques. Finally, to understand the relationships between application code, runtime systems, the memory hierarchy, and the network, we must be able to configure tools to correlate data across software layers.

A wide range of sophisticated profiling and tracing tools can provide such information. Many of them also provide their own APIs to collect application and runtime information. However, these interfaces are tool-focused and require users to add tool-specific actions (like start/stop timer) to their codes, instead of allowing developers to express generalizable semantics of their code in a reusable way. Further, once added, the instrumented code is then typically limited to recording tool-specific profile or trace information, and can no longer be used for other purposes. This discourages or even prevents users from leaving annotations in the code. As a consequence, most pre-installed HPC production software today is not amenable to stack-wide profiling or tracing. Developers who wish to use instrumentation-based tools not only have to instrument their own code, but possibly the entire software stack all the way to the operating system.

In order to overcome this gap, we need a new instrumentation approach that separates the concerns of a) software developers, who expose application/library/runtime semantics through annotations; b) tool developers, who provide measurements to be correlated with application information; and c) tool users, who decide what information is collected, correlated, and/or filtered, based on the specific analysis use case, available measurements, and application state. Such an approach must emphasize ease of use for the developer, incur minimal overhead when used during production runs, and be composable across the entire software stack.

In this paper, we present Caliper, a library that addresses these requirements. Caliper is a cross-stack, general-purpose introspection framework that explicitly separates the concerns of software developers, tool developers, and users. On the surface, Caliper provides a simple annotation API for application developers, similar to timer libraries found in many codes. In fact, we can wrap Caliper calls inside existing timers. Under the hood, though, Caliper combines information provided through its API from all software layers into a single program context, and it offers interfaces for tool developers to extract this information and augment it with measurements. For this purpose, Caliper includes basic measurement services, but it also offers an interface for existing tools to exploit the stack-wide information. Finally, Caliper users can define a policy that links mechanisms to instrumentation points at runtime. By separating annotations, measurement mechanisms, and policy, instrumentation annotations can remain permanently in production codes to be used for a wide range of performance engineering applications, from post-mortem analysis to runtime adaption. Moreover, software teams can add instrumentation in different software components independently of each other, allowing correlation of information across software layers. Components are loosely coupled; modifying annotations does not require re-working measurement services, or vice versa.
Overall, our contributions through Caliper include:
• A novel performance analysis methodology that decouples the roles of application developers, tool users, and tool developers by separating these concerns through separate interfaces:
  – A low-overhead annotation interface that allows developers to convey application information through a flexible data model
  – A robust and efficient interface to provide collected information to tools
  – A flexible mechanism to specify data correlation and filtering policies
• A mechanism to combine the annotation information and performance data across software-stack boundaries by creating a stack-wide context
• An efficient way of representing combined context information in a generalized context tree

In two case studies, we demonstrate how Caliper allows application developers to diagnose performance interactions between separate components in a large multi-physics code, and how tool developers can build upon Caliper to access application context information at runtime. Both case studies demonstrate the previously unavailable composability and flexibility Caliper brings to the performance analysis process.

II. THE CALIPER APPROACH: INTROSPECTION AS A SERVICE

Caliper is a universal abstraction layer for general-purpose, cross-stack performance introspection. It is built around the principle of separating mechanism and policy: Caliper provides data-collection mechanisms (e.g., an instrumentation API, interrupt-based sampling timers, hardware-performance counter access, call-stack unwinding, or trace recording) as independent building blocks, which can be combined into a specific performance engineering solution by a user-provided data-collection policy at runtime.

A. Usage Model

Caliper is aimed at HPC software developers, tool developers, and performance engineers alike. Software developers instrument their software components with Caliper annotation commands to mark and describe high-level abstractions, such as kernel names, rank/thread identifiers, timesteps, iteration numbers, and so on. At runtime, the annotations provide context information to performance engineering applications, in terms of abstractions defined by the software developer. Caliper annotations are independent of each other, and developers can add them incrementally anywhere in the software stack. In the long term, we expect these annotations to be kept and maintained permanently in the target components by the component's developers.

Internally, the annotations add or remove entries on a virtual blackboard. The combined blackboard contents provide a holistic view of a program's execution context as described by the annotations, across all software stack layers. In addition, the annotations serve as event hooks that can trigger additional user-defined actions at runtime, e.g., for timing a specific code region.

Performance engineers use Caliper as a platform for building measurement analysis tools. In addition to the global context information generated by the source-code annotations, Caliper puts a variety of methodologies at the engineer's disposal, including interrupt-based timers, trace and profile recording, and timer or hardware-counter measurements. Each of these methodologies is an independent building block. Performance engineers can create runtime configuration profiles that combine the building blocks as needed for specific performance engineering use cases.

B. Basic Concepts

Caliper is centered around three basic abstractions: attributes, a blackboard, and snapshots. Attributes represent individual data elements that can be accessed through a unique key. The blackboard is a global buffer that collects attributes from independent sources. Blackboard updates also serve as event hooks that can be configured to trigger additional actions. Finally, snapshots represent measurement events. When a snapshot is triggered, Caliper creates a snapshot record that contains the current blackboard contents, and tells connected measurement providers to take measurements (e.g., a timestamp) and add them to the snapshot record.

Figure 1 illustrates the relationships between those components. Applications, libraries, runtime systems, and tools connect to Caliper in one or more different roles. As data producers, they create event hooks and update attributes on the blackboard, or provide measurement attributes for snapshot records. As data consumers, they can read the blackboard contents or process snapshot records. They can also perform control tasks, such as triggering snapshots. In this regard, the blackboard and snapshot concepts fulfill two purposes:
• they transparently combine attributes provided by different data producers across the software stack, and
• they enable separation between data producer, data consumer, and measurement control roles.
Caliper acts as the glue that connects independent components with similar or different roles across different software stack layers.
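To make these abstractions concrete, the following C++ sketch models the blackboard and snapshot concepts in miniature. It is an illustration of the semantics described above only, not Caliper's implementation or API; all names in it are ours.

#include <map>
#include <string>

// Conceptual model of the abstractions described above (illustration
// only, not Caliper's actual implementation).
struct Blackboard {
    std::map<std::string, std::string> contents; // attribute key -> value

    // Data producers update attributes; in Caliper, updates can also
    // fire event hooks that trigger further actions (e.g., snapshots).
    void update(const std::string& key, const std::string& value) {
        contents[key] = value;
    }
};

// A snapshot record: the blackboard contents at the trigger point,
// augmented with measurement attributes from connected providers.
std::map<std::string, std::string>
take_snapshot(const Blackboard& bb, long timestamp_us)
{
    std::map<std::string, std::string> record = bb.contents;
    record["time"] = std::to_string(timestamp_us); // measurement
    return record;
}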
III. DATA MODEL

Caliper provides a generic data model that can capture a wide range of information, including user-defined, application-specific concepts. Capturing arbitrary introspection information and storing it so it can be both effectively analyzed at runtime and efficiently streamed to disk is key to making Caliper a viable and useful tool for large-scale performance engineering. To this end, we have developed a data model that is both flexible and efficient. In contrast to traditional, single-purpose performance profile or trace data formats, our data model describes the data's layout and structure rather than its content or semantics. Our model facilitates general-purpose data storage and access similar to non-relational databases, but is optimized for storing the contextual information needed for typical performance engineering use cases.
Figure 1. Caliper abstractions enable separation of concerns for performance introspection: Software developers publish program state information on the
Caliper blackboard (left column). Tool developers implement data collection, measurement control, and data-processing mechanisms (bottom layer). Based on
a user-provided configuration profile, Caliper activates snapshots to connect the individual introspection building blocks at runtime (middle layer). Snapshot
records with the combined program context and measurement information can be processed at runtime or stored for post-mortem analysis (right column).
A. Attributes

The basic elements in our data model are attributes in the form of key-value pairs. Attribute keys serve as identifiers for attributes: in addition to a unique name, they store the attribute's data type, as well as optional additional metadata (e.g., the measurement unit) associated with the attribute. In addition to integers, floating-point numbers, and strings, Caliper also supports user-defined datatypes for attributes.

[Figure 2: Example of a generalized context tree, with attribute nodes such as main, phase (with values init and loop), state (with values serial and par.), and iteration.]
[...] demonstrated by the phase attribute in the example. We use the flexible design of the generalized context tree to store attribute metadata in it as well, which allows us to encode all metadata in a single structure.

In contrast to traditional HPC profiling and trace data formats, our data model does not place any restrictions on the kind of data that can be recorded. Since snapshot records combine context and measurement attributes, we do not explicitly separate them in our data model: both are saved together in a snapshot as generic key-value pairs. For subsequent analysis, the contents of the record receive their meaning from the semantics of the attributes, which we presume to be known to the user who uses or created them.

By encoding snapshots through context tree references, we reduce the space needed to store snapshot records compared to writing full key-value lists. This also provides a convenient and efficient mechanism to pass a snapshot to measurement tools without having to duplicate the actual introspection information. However, this also means that we need to store the context tree structure itself in order to reconstruct the information. Attributes whose values change often or rarely reoccur in subsequent snapshots would needlessly increase the context tree size. For example, recording the global time can be very useful, but its value never repeats. Therefore, we use a hybrid approach to store snapshot records: we write a context-tree node ID to encode attributes included in the tree, and we write explicit key-value pairs for the remaining attributes. A flag can be given to each attribute key via the Caliper API to decide whether the attribute should be included in the context tree or written explicitly.
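As a hypothetical illustration of this flag, an attribute key for a frequently changing measurement value might be created with an explicit-storage property. The names below are illustrative, not necessarily Caliper's exact API:

// Hypothetical sketch, not Caliper's verbatim API: creating an attribute
// key with an explicit-storage flag. An attribute key carries a unique
// name, a data type, and optional properties; the flag below requests
// explicit key-value storage instead of context-tree encoding.
cali::Caliper c;                            // runtime handle (assumed)
cali::Attribute time_attr =
    c.create_attribute("time.timestamp",    // unique key name
                       CALI_TYPE_UINT,      // data type
                       CALI_ATTR_ASVALUE);  // store as explicit value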
IV. DESIGN AND IMPLEMENTATION

In the following, we discuss the design and implementation of the Caliper framework. Caliper consists of the source-code annotation API, a backend runtime component that manages blackboard buffers and the generalized context tree, add-on support services for control tasks (such as I/O and controlling snapshots), and additional data producer, measurement control, and data consumer services. The framework also includes libraries and basic post-processing tools for Caliper data streams. Caliper is written in standard C++ and has been tested on a variety of HPC platforms, including standard Linux clusters as well as BlueGene/Q and Cray XC systems.

A. Architecture

As shown in Figure 1, the blackboard and the snapshot mechanism are the basic methods to collect data from data producers and provide it to data consumers (including the option to write it out for offline analysis). Data producers and consumers interact with Caliper through annotation and control APIs, or by registering callback functions for certain events. The APIs allow them to
• create attribute keys,
• set, update, or remove attributes on the blackboard,
• query attributes on the blackboard, and
• register callback functions for events such as updating or removing attributes on the blackboard, or snapshot generation.
With these interfaces, Caliper enables functional and physical separation of data producers and consumers, as both only interact with the Caliper runtime and not directly with each other.

B. Blackboard Buffer Management

Caliper's runtime component manages the blackboard buffers which store the current values of attributes, and maintains the process-global generalized context tree. This runtime component lives in a single shared object instance per process that is created automatically on the first invocation of the Caliper API.

Caliper maintains separate blackboard buffers for each thread within the same process. To do that, attributes have a scope parameter, which is either thread or process. For runtimes which support lightweight tasks, a task scope is also available. The scope can be set at the creation of the attribute. By default, attributes receive the thread scope. Updates of thread-scope attributes go into the thread's local blackboard buffer. Consequently, when two different threads update a thread-scope attribute, each thread updates its own copy. In contrast, when different threads update a process-scope attribute, they overwrite each other's values. The blackboard buffer management is completely transparent to the user; she only needs to specify the scope.

Caliper does not explicitly address cross-process data integration. Thereby, we avoid dependencies on specific parallel runtime systems or programming environments, which would be needed to communicate across processes. Snapshots are solely managed on a per-process basis, and it is up to analysis tools to manage snapshot records from multiple processes. Snapshot records are compatible between processes of the same application, since attribute names match. Further, tools can use Caliper to store information about the parallel structure of an application, e.g., by adding process ranks or other task identifiers as attributes to the context of each process, thread, or task. This enables analysis tools to provide comprehensive analysis operations across information gathered from large-scale parallel applications.
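For instance, an MPI-based tool could publish each process's rank on the blackboard so that offline analysis can group snapshot records by process. The attribute name mpi.rank and the header path here are our own choices for illustration, not names mandated by Caliper:

#include <mpi.h>
#include <caliper/Annotation.h>   // header name is illustrative

// Publish this process's MPI rank on the Caliper blackboard, so that
// every snapshot taken in this process carries the rank as context.
void annotate_rank()
{
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cali::Annotation("mpi.rank").set(rank);
}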
C. Snapshots
A snapshot saves the states of all attributes (i.e., the blackboard contents) at a specific point in the execution of the target program. Snapshots can be triggered via the push_snapshot or pull_snapshot API calls. The pull_snapshot call returns the snapshot record to the caller, while the push_snapshot call forwards the snapshot to other modules through the callback notification interface. Snapshots can be triggered asynchronously at any time, independent of blackboard updates. However, it is possible to trigger a snapshot for some or all blackboard updates by registering a callback function to do so, enabling event-driven data collection workflows. Caliper provides an event service module for this purpose.

In addition to the blackboard contents, Caliper can add transient attributes to a snapshot. This is done through a callback method that is invoked when a snapshot is triggered, and is used to add measurements such as timestamps to a snapshot.

Snapshots are thread-local; that is, they capture the contents of a single thread and process blackboard buffer. Thus, a snapshot record contains all process-scope attributes, and the thread-scope attributes from the triggering thread. The task of combining data from multiple threads or tasks for analysis purposes is left to the analysis tool.
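The distinction between the two trigger modes can be illustrated with the following conceptual sketch. It uses our own types, not Caliper's API: pull returns the record to the caller, push forwards it to registered consumer callbacks.

#include <functional>
#include <map>
#include <string>
#include <vector>

using SnapshotRecord = std::map<std::string, std::string>;

struct SnapshotChannel {
    std::vector<std::function<void(const SnapshotRecord&)>> consumers;
    SnapshotRecord current; // stands in for blackboard + measurements

    SnapshotRecord pull_snapshot() const {
        return current;                 // caller analyzes the record
    }
    void push_snapshot() const {
        for (const auto& consume : consumers)
            consume(current);           // e.g., a trace service buffers it
    }
};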
D. Services and runtime configuration

Much of Caliper's functionality is provided by modules called services, each of which performs a specific task. Services can perform data producer, consumer, or measurement control roles (e.g., triggering snapshots).

In general, services are independent of each other, but they can be combined to form complex processing chains. For example, an event-driven snapshot trace of an annotated target program can be collected by enabling the event service (a measurement control service which triggers a snapshot when the blackboard is updated), the trace service (a data consumer service that maintains Caliper's snapshot record output buffer), and the recorder service (which writes output buffers to disk). Users specify the services that should be enabled via a configuration file or an environment variable before running the target application. By default, no services are active. The following setting loads the event-trace configuration described above:

CALI_SERVICES_ENABLE=event:recorder:trace

The functionality can be enhanced or modified by adding or replacing services. For example, adding the timestamp or callpath service adds timestamps or call stack information to the snapshots, respectively. Similarly, the event service can be replaced with a sample service for triggering snapshots based on timer interrupts instead of attribute updates. The two services could even be combined to create a hybrid event-driven and sampled trace. Many of the services also have their own configuration options to customize their behavior further. Caliper provides pre-defined configuration profiles for common tasks, which users can adapt to their needs. They can also create their own configuration profiles. Overall, the ability to add and remove services as needed gives users great flexibility in creating customized data collection solutions without having to modify the target application or the instrumentation system.
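Since services compose through this colon-separated list, a timestamped event trace could plausibly be requested by also enabling the timestamp service named above; the exact combination below is illustrative:

CALI_SERVICES_ENABLE=event:timestamp:recorder:trace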
E. Annotation Interface

Caliper provides easy-to-use high-level C, Fortran, and C++ interfaces for source-code annotations. It provides functions to create and update attributes via begin/end annotations to outline temporal or spatial regions in the code or execution, or by directly setting a value. As a typical use case, consider a simple legacy profiling interface with TimerStart and TimerStop calls created by an application developer for basic performance monitoring. We can easily wrap Caliper annotations within this interface:

const struct Timer {
    const char* name;
    // ...
} timers[] = { { "outer phase", /* ... */ },
               /* ... */ };

void TimerStart(int id) {
    // ...
    cali::Annotation(timers[id].name).begin();
}

void TimerStop(int id) {
    // ...
    cali::Annotation(timers[id].name).end();
}

Here, we create a cali::Annotation object to access an attribute with the given name (which is taken from the legacy Timer struct in the example) on the Caliper blackboard. In the example, we use the begin() method to add a Caliper attribute with the given timer name in TimerStart, and the end() method to remove it in TimerStop. Note that by default, the annotations only update the blackboard, but do not take time measurements. However, we can replicate the original timekeeping functionality by providing a runtime configuration profile that takes snapshots connected with a timestamp source for each timer attribute update. The specific functionality of the profiling interface can now be configured at runtime.

We can also use the annotation interface to export other kinds of execution context information as attributes, for example, the iteration number in a loop:

cali::Annotation iter_ann("iteration");

for (int i = 0; i < MAX; ++i) {
    iter_ann.set(i);
    // ...
}

Here, we create the iter_ann annotation object for the "iteration" attribute in advance. This way, we avoid having to look up the attribute by name within the loop. We then use the set() method at the beginning of each iteration to update the iteration information on the Caliper blackboard. In contrast to begin(), which would add a new entry, set() overwrites the current value on the blackboard. Caliper automatically combines the iteration counter information with the information from all other annotations within the program.
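The begin/end annotations also nest, which is how hierarchical program phases such as those in Figure 2 can be expressed. A brief sketch, assuming a begin() overload that takes the new value (analogous to set() above):

cali::Annotation phase_ann("phase");

phase_ann.begin("init");    // blackboard now holds phase=init
// ... initialization ...
phase_ann.end();            // removes the innermost "phase" entry

phase_ann.begin("loop");    // blackboard now holds phase=loop
// ... main loop ...
phase_ann.end();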
[...] (i.e., the generalized context tree structure), into a data stream. In a distributed-memory environment, Caliper currently produces one context stream per process. Because snapshots are self-contained and have no inherent ordering, multiple data streams naturally lend themselves to parallel processing. In particular, search and filter operations are embarrassingly parallel problems. Moreover, it is possible to perform aggregation operations on one or more attributes across all snapshot records in a stream. As a result, data streams in our model can cover a range of possible performance experiments, from runtime traces to profiles.

V. PERFORMANCE EVALUATION

To evaluate the performance impact of Caliper annotations in a real-world scenario, we compare annotated and non-annotated versions of the LULESH [1] hydrodynamics proxy application. Specifically, we use a modified version of LULESH that uses the RAJA C++ library [2], which decouples the loops in the application from the loop execution parameters to enable static tuning of the loop execution policy. We discuss RAJA in more detail in our case study in Section VI-B.

For our performance evaluation, we added Caliper annotations in the RAJA library and in the LULESH application code to collect one attribute for each RAJA loop invocation, and three attributes for each LULESH main loop iteration. Our experiments ran on the Cab cluster at LLNL (see Table I). We ran annotated and non-annotated versions of LULESH in its default configuration (problem size 30, run to completion) with 16 OpenMP threads on a single cluster node, and we report the wall-clock execution time of its main loop (excluding job setup and I/O). We compare the following Caliper runtime configuration profiles:
• Replacing Caliper with a no-op stub library ("Stub"),
• Performing blackboard updates only, without taking snapshots ("Blackboard"),
• Recording a snapshot trace (one snapshot per blackboard update) without performance measurements ("Snapshot"),
• Recording a snapshot trace with timestamps ("w/ Timestamp").

[Table I. LLNL CLUSTER CONFIGURATIONS. Compiler: Intel C++ 16.0.150, GCC 4.9.2.]

To mitigate the impact of external effects (e.g., OS noise) on our results, we ran each configuration 5 times, and report the fastest time for each configuration. The run-to-run variation was less than 0.3 seconds (or 3% of the execution time) for each configuration. In the Caliper-enabled configurations, Caliper performs 1,123,087 blackboard updates and snapshots. Snapshots are pushed into Caliper's trace buffer, but not written to disk. In a typical run, data is written to disk when the program exits to minimize perturbation.

[Figure 3. Runtime of the annotated LULESH proxy application for different Caliper runtime configurations. The red line highlights the uninstrumented baseline version.]

Figure 3 shows the results. The stub library configuration shows a measurable but small (1% of the execution time) effect of adding additional function calls in the RAJA loop invocations. When Caliper performs blackboard updates but no snapshots, the main loop's execution time increases by 0.46 seconds compared to the uninstrumented version, indicating an average cost of about 0.4 microseconds per blackboard update.
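The reported per-update cost follows directly from these totals:

    0.46 s / 1,123,087 updates ≈ 0.41 µs per blackboard update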
Snapshots without additional measurement take about 0.85 microseconds each on average; when adding timestamps, they take 1.3 microseconds. Table II shows Caliper's memory consumption during the Caliper-enabled runs. With less than 40 KiB in each configuration, the memory requirements for the generalized context tree and blackboard buffers are small. Memory requirements for Caliper's trace buffers depend on the snapshot granularity and size. Caliper compresses snapshot records internally to minimize trace buffer sizes. In our LULESH example, Caliper used 8 MiB of memory to store [...]

Table II. CALIPER MEMORY CONSUMPTION PER PROCESS IN THE ANNOTATED LULESH PROXY APPLICATION FOR DIFFERENT CALIPER RUNTIME CONFIGURATIONS.

                               Blackboard     Snapshots   Snapshots with
                               updates only               Timestamps
Context tree and blackboards   24.2 KiB       37.2 KiB    38.2 KiB
Trace buffer                   n/a            6.4 MiB     8.0 MiB
[...] cross-compare data from raw context streams as presented in Table III. Figure 4 shows that the timesteps in which the SAMRAI library regrids (shown by red markers) are always followed by a re-setup of data structures for HYPRE and therefore longer HYPRE solve times (black markers) in the immediately following timesteps. The outliers in the early timesteps are attributed to cold caches. Both of those operations increase the runtime of the application, which is clearly correlated with the total timestep length of the application. We are able to gather these correlations because we have instrumented SAMRAI, HYPRE, and the main application, and are able to see the information from all three in the same context.
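Conceptually, each snapshot record in this experiment carries context from all three layers side by side, along the lines of the following illustration (the attribute names and values here are invented for exposition):

timestep=1024  samrai.regrid=true  hypre.setup=true  time.duration=0.18s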
The experience of instrumenting this large hydrodynamics application and its components with Caliper annotations was straightforward and will enable us to study correlations between parameters in different components and their influence on other components' runtime and accuracy. From the software engineering perspective, the ability to instrument a large code incrementally is crucial. Additionally, composability of context annotations means that there is no need to disable subsets of the instrumentation in order to look for features in another part of the code, enabling us to leave the instrumentation in place.

B. Using Caliper for Runtime Tuning with Supervised Machine Learning

Complex scientific applications exhibit dynamic and data-dependent behavior. This behavior determines the parameter selections for tuning the application's runtime. Existing tuning approaches are applied statically, or assume code and data change slowly, interleaving tuning phases with application execution. However, these slow changes are rarely the case in production scientific applications, where data can vary dramatically each time control passes over a loop. To obtain the best performance, we use lightweight decision models to dynamically tune application parameters at runtime on a loop-by-loop basis in response to changing application data. To achieve this, we require cross-stack data collection and, furthermore, runtime access to that data, both of which are provided by Caliper.

To dynamically adjust the application execution configuration at runtime we use RAJA, a C++ library that provides flexible parallel execution methods over defined indices using a combination of template execution methods and lambda functions [2]. RAJA is a programming framework that decouples the loops in the application from the loop execution parameters, enabling static tuning of the policy used to perform the loop iterations. For example, the following application loop:

for (int i = begin; i < end; ++i) {
    sigxx[i] = sigyy[i] = sigzz[i] = -p(i) - q(i);
}

is transformed into a RAJA loop:

RAJA::forall<exec_policy>(IndexSet, [=] (int i) {
    sigxx[i] = sigyy[i] = sigzz[i] = -p(i) - q(i);
});

We focus on tuning the execution policy, that is, how each loop is mapped to hardware by the RAJA library. Whilst typically most thread-safe loops in an application should be run using threads, in some cases, such as when the workload is small, the overhead of threads means it is actually faster to run in serial. Our model is designed to identify these cases and execute them accordingly.

Our tuning model is built using a decision tree algorithm that infers a function from a training set to a set of target labels. The training set is a subset of the data collected at
runtime by Caliper annotations, consisting of a number of samples, each of which describes one invocation of some loop in the application execution. Each sample is labelled with the best parameter value, that is, the one with the lowest observed execution time. The correct label for each sample is determined by inspection. We assign to each sample a label based on the execution policy that provided the lowest runtime for that particular lambda invocation. The samples not used as part of the training set are reserved for model evaluation. The classifier built during the training step can then be used to predict the output of a test set of samples without any labels. The labels predicted by the classifier for the test set can be used to evaluate the accuracy of the learned function. We then convert the decision trees learned by our model generator into C++ code, which can be compiled and then used at runtime. Figure 5 shows our integrated workflow for collecting features and generating models using Caliper. There are two aspects of the workflow that are enabled by the cross-application flexibility provided by Caliper: data collection, and online data access.

1) Data Collection: A key challenge for generating models using supervised learning is collecting features that may impact application performance, such that they can be used to construct a model for predicting which parameter value is best for a given loop. To collect data for model generation, we compile the application with an instrumented version of the RAJA library. Caliper annotations are inserted within the RAJA template execution method (see Listing 1). The design of RAJA naturally decouples the application-specific loop body from the platform-, architecture-, and programming-model-specific loop execution. This means that the features for each loop can be recorded without modifying the application code, as the sketch below illustrates.
access. 2) On-line Data Querying for Runtime Adaption: Our
1) Data Collection: A key challenge for generating models modeling workflow generates C++ code that implements the
using supervised learning is collecting features that may decision process found in the models. This code can be
impact application performance, such that they can be used to compiled with the application and used to dynamically alter
construct a model for predicting which parameter value is the execution policy at runtime. As input, the model requires a
best for a given loop. To collect data for model generation, sample which contains values for all the features used when
we compile the application with an instrumented version of training the model. Using Caliper’s on-line query interface,
the RAJA library. Caliper annotations are inserted within the we are able to access these feature values from across the
RAJA template execution method (see Listing 1). The design application stack within our generated code, and use these
of RAJA naturally decouples the application specific loop values to evaluate the model. A key advantage of using Caliper
body from the platform, architecture, and programming model is that annotations are re-used; the annotations used to record
specific loop execution. This means that the features for each features and create the model are used when the model is being
loop can be recorded without modifying the application code. evaluated online.
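A hypothetical sketch of what such generated code might look like follows: a decision tree flattened into branches over Table IV features. The thresholds and the two-policy choice are invented for exposition and are not taken from the models trained in this study.

// Illustrative only: a generated decision function that selects a loop
// execution policy from runtime feature values.
enum class ExecChoice { Serial, Threaded };

ExecChoice select_policy(long num_indices, long func_size)
{
    if (num_indices < 4096) {
        // Small workloads: thread start-up overhead dominates.
        if (func_size < 128)
            return ExecChoice::Serial;
        return ExecChoice::Threaded;
    }
    return ExecChoice::Threaded;
}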
3) Results: Whilst this work is still in its preliminary stages, by using Caliper's online query interface and our generated data-dependent models, we are able to improve application performance in simple cases. Figure 6 shows the runtime at a range of processor counts on LLNL's Cab cluster (Table I) using one of our generated data-dependent models in the CleverLeaf mini-application [5]. At small problem sizes the default serial execution is sufficient, but at larger problem sizes, threading is enabled by the model to improve performance.
[...] novel performance engineering approach that requires both off-line and on-line access to context information. Combined, these case studies illustrate how the flexible introspection offered by Caliper provides a new path towards effective and insightful performance analysis for complex HPC applications.

SOURCE CODE

Caliper is available at https://github.com/LLNL/Caliper. Documentation is available at software.llnl.gov/Caliper.

ACKNOWLEDGMENT

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 and supported by the Office of Science, Office of Advanced Scientific Computing Research, as well as the Advanced Simulation and Computing (ASC) program.

LLNL release LLNL-CONF-699263.

REFERENCES

[1] I. Karlin, A. Bhatele, B. L. Chamberlain, J. Cohen, Z. DeVito, M. Gokhale, R. Haque, R. Hornung, J. Keasler, D. Laney, E. Luke, S. Lloyd, J. McGraw, R. Neely, D. Richards, M. Schulz, C. H. Still, F. Wang, and D. Wong, "LULESH programming model and performance ports overview," Tech. Rep. LLNL-TR-608824, December 2012.
[2] R. D. Hornung and J. A. Keasler, "The RAJA Portability Layer: Overview and Status," Lawrence Livermore National Laboratory, Tech. Rep. LLNL-TR-661403, Sep. 2014.
[3] B. T. N. Gunney, A. M. Wissink, and D. A. Hysom, "Parallel Clustering Algorithms for Structured AMR," Journal of Parallel and Distributed Computing, vol. 66, no. 11, pp. 1419–1430, 2006.
[4] R. Falgout, J. Jones, and U. Yang, "The Design and Implementation of HYPRE, a Library of Parallel High Performance Preconditioners," in Numerical Solution of Partial Differential Equations on Parallel Computers, A. M. Bruaset and A. Tveito, Eds., vol. 51, pp. 267–294, 2006.
[5] D. A. Beckingsale, W. Gaudin, A. Herdman, and S. Jarvis, "Resident Block-Structured Adaptive Mesh Refinement on Thousands of Graphics Processing Units," in Proceedings of the 44th International Conference on Parallel Processing. IEEE, Aug. 2015, pp. 61–70.
[6] A. E. Eichenberger, J. M. Mellor-Crummey, M. Schulz, M. Wong, N. Copty, J. DelSignore, R. Dietrich, X. Liu, E. Loh, and D. Lorenz, "OMPT: OpenMP tools application programming interfaces for performance analysis," in Proc. of the 9th International Workshop on OpenMP (IWOMP), Canberra, Australia, ser. LNCS, no. 8122. Berlin/Heidelberg: Springer, 2013, pp. 171–185.
[7] P. J. Mucci, S. Browne, C. Deane, and G. Ho, "PAPI: A portable interface to hardware performance counters," in Proc. Department of Defense HPCMP User Group Conference, Jun. 1999.
[8] A. Knüpfer, C. Rössel, D. an Mey, S. Biersdorff, K. Diethelm, D. Eschweiler, M. Geimer, M. Gerndt, D. Lorenz, A. Malony, W. E. Nagel, Y. Oleynik, P. Philippen, P. Saviankou, D. Schmidl, S. Shende, R. Tschüter, M. Wagner, B. Wesarg, and F. Wolf, "Score-P: A joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir," in Tools for High Performance Computing 2011, H. Brunst, M. S. Müller, W. E. Nagel, and M. M. Resch, Eds. Springer Berlin Heidelberg, 2011, pp. 79–91.
[9] S. Shende and A. D. Malony, "The TAU parallel performance system," International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287–311, 2006.
[10] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, "HPCToolkit: Tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 685–701, 2010.
[11] M. Schulz, J. Galarowicz, D. Maghrak, W. Hachfeld, D. Montoya, and S. Cranford, "Open|SpeedShop: An open source infrastructure for parallel performance analysis," Scientific Programming, vol. 16, no. 2-3, pp. 105–121, 2008.
[12] M. Geimer, F. Wolf, B. J. N. Wylie, E. Ábrahám, D. Becker, and B. Mohr, "The Scalasca performance toolset architecture," Concurrency and Computation: Practice and Experience, vol. 22, no. 6, pp. 702–719, Apr. 2010. [Online]. Available: http://apps.fz-juelich.de/jsc-pubsystem/pub-webpages/general/get_attach.php?pubid=142
[13] J. Mellor-Crummey, R. Fowler, and G. Marin, "HPCView: A tool for top-down analysis of node performance," The Journal of Supercomputing, vol. 23, pp. 81–101, 2002.
[14] K. A. Huck and A. D. Malony, "PerfExplorer: A performance data mining framework for large-scale parallel computing," in Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, ser. SC '05. Washington, DC, USA: IEEE Computer Society, 2005. [Online]. Available: http://dx.doi.org/10.1109/SC.2005.55
[15] W. E. Nagel, A. Arnold, M. Weber, H. C. Hoppe, and K. Solchenbach, "VAMPIR: Visualization and analysis of MPI resources," Supercomputer, vol. 12, no. 1, pp. 69–80, 1996.
[16] L. D. Erman, F. Hayes-Roth, V. R. Lesser, and D. R. Reddy, "The Hearsay-II speech-understanding system: Integrating knowledge to resolve uncertainty," ACM Computing Surveys (CSUR), vol. 12, no. 2, pp. 213–253, 1980.
[17] H. P. Nii, "Blackboard application systems, blackboard systems and a knowledge engineering perspective," AI Magazine, vol. 7, no. 3, p. 82, 1986.
[18] D. D. Corkill, "Blackboard systems," AI Expert, vol. 6, no. 9, pp. 40–47, 1991.
[19] K. Huck, A. Porterfield, N. Chaimov, H. Kaiser, A. D. Malony, T. Sterling, and R. Fowler, "An Autonomic Performance Environment for Exascale," Supercomputing Frontiers and Innovations, vol. 2, no. 3, 2015.
[20] A. Mandal, R. Fowler, and A. Porterfield, "System-wide introspection for accurate attribution of performance bottlenecks," in Workshop on High-performance Infrastructure for Scalable Tools (WHIST), Venice, Italy, Jun. 2012.
[21] K. Varda, "Google's data interchange format," Online, Jul. 2008. Available: https://developers.google.com/protocol-buffers/