HPCToolkit Users Manual
Version 2023.02.20
John Mellor-Crummey,
Laksono Adhianto, Jonathon Anderson, Mike Fagan, Dragana Grbic, Marty Itzkowitz,
Mark Krentel, Xiaozhu Meng, Nathan Tallent, Keren Zhou
Rice University
Contents
1 Introduction
2 HPCToolkit Overview
2.1 Asynchronous Sampling and Call Path Profiling
2.2 Recovering Static Program Structure
2.3 Reducing Performance Measurements
2.4 Presenting Performance Measurements
3 Quick Start
3.1 Guided Tour
3.1.1 Compiling an Application
3.1.2 Measuring Application Performance
3.1.3 Recovering Program Structure
3.1.4 Analyzing Measurements & Attributing Them to Source Code
3.1.5 Presenting Performance Measurements for Interactive Analysis
3.1.6 Effective Performance Analysis Techniques
3.2 Additional Guidance
5.4 Experimental Python Support
5.4.1 Known Limitations
5.5 Process Fraction
5.6 Starting and Stopping Sampling
5.7 Environment Variables for hpcrun
5.8 Cray System Specific Notes
11 Known Issues
11.1 When using Intel GPUs, hpcrun may alter program behavior when using instruction-level performance measurement
11.2 When using Intel GPUs, hpcrun may report that substantial time is spent in a partial call path consisting of only an unknown procedure
11.3 hpcrun reports partial call paths for code executed by a constructor prior to entering main
11.4 hpcrun may fail to measure a program execution on a CPU with hardware performance counters
11.5 hpcrun may associate several profiles and traces with rank 0, thread 0
11.6 hpcrun sometimes enables writing of read-only data
11.7 A confusing label for GPU theoretical occupancy
11.8 Deadlock when using Darshan
Chapter 1
Introduction
HPCToolkit [1, 14] is an integrated suite of tools for measurement and analysis of
program performance on computers ranging from multicore desktop systems to the world’s
largest supercomputers. HPCToolkit provides accurate measurements of a program’s
work, resource consumption, and inefficiency, correlates these metrics with the program’s
source code, works with multilingual, fully optimized binaries, has low measurement over-
head, and scales to large parallel systems. HPCToolkit’s measurements provide support
for analyzing a program's execution cost, inefficiency, and scaling characteristics both within
and across nodes of a parallel system.
HPCToolkit principally monitors an execution of a multithreaded and/or multipro-
cess program using asynchronous sampling, unwinding thread call stacks, and attributing
the metric value associated with a sample event in a thread to the calling context of the
thread/process in which the event occurred. HPCToolkit's asynchronous sampling is typ-
ically triggered by the expiration of a Linux timer or a hardware performance monitoring
unit event, such as reaching a threshold value for a hardware performance counter. Sampling
has several advantages over instrumentation for measuring program performance: it requires
no modification of source code, it avoids potential blind spots (such as code available in
only binary form), and it has lower overhead. HPCToolkit typically adds measurement
overhead of only a few percent to an execution for reasonable sampling rates [18]. Sam-
pling enables fine-grain measurement and attribution of costs in both serial and parallel
programs.
For parallel programs, one can use HPCToolkit to measure the fraction of time threads
are idle, working, or communicating. To obtain detailed information about a program’s
computation performance, one can collect samples using a processor’s built-in performance
monitoring units to measure metrics such as operation counts, pipeline stalls, cache misses,
and data movement between processor sockets. Such detailed measurements are essential
to understand the performance characteristics of applications on modern multicore micro-
processors that employ instruction-level parallelism, out-of-order execution, and complex
memory hierarchies. With HPCToolkit, one can also easily compute derived metrics
such as cycles per instruction, waste, and relative efficiency to provide insight into a pro-
gram’s shortcomings.
A unique capability of HPCToolkit is its ability to unwind the call stack of a thread
executing highly optimized code to attribute time, hardware counter metrics, and
software metrics (e.g., context switches) to a full calling context. Call stack unwinding is
often difficult for highly optimized code [18]. For accurate call stack unwinding, HPCToolkit
employs two strategies: interpreting compiler-recorded information in DWARF Frame De-
scriptor Entries (FDEs) and binary analysis to compute unwind recipes directly from an
application's machine instructions. On ARM processors, HPCToolkit uses libunwind ex-
clusively. On Power processors, HPCToolkit uses binary analysis exclusively. On x86_64
processors, HPCToolkit employs both strategies in an integrated fashion.
Figure 1.1: A code-centric view of an execution of the University of Chicago's FLASH
code executing on 8192 cores of a Blue Gene/P. This bottom-up view shows that 16% of the
execution time was spent in IBM's DCMF messaging layer. By tracking these costs up the
call chain, we can see that most of this time was spent on behalf of calls to pmpi_allreduce
on line 419 of amr_comm_setup.
HPCToolkit assembles performance measurements into a call path profile that asso-
ciates the costs of each function call with its full calling context. In addition, HPCToolkit
uses binary analysis to attribute program performance metrics with detailed precision – full
dynamic calling contexts augmented with information about call sites, inlined functions and
templates, loops, and source lines. Measurements can be analyzed in a variety of ways: top-
down in a calling context tree, which associates costs with the full calling context in which
they are incurred; bottom-up in a view that apportions costs associated with a function to
each of the contexts in which the function is called; and in a flat view that aggregates all
costs associated with a function independent of calling context. This multiplicity of code-
centric perspectives is essential to understanding a program’s performance for tuning under
various circumstances. HPCToolkit also supports a thread-centric perspective, which
enables one to see how a performance metric for a calling context differs across threads, and
a time-centric perspective, which enables a user to see how an execution unfolds over time.
Figure 1.2: A thread-centric view of the performance of a parallel radix sort application
executing on 960 cores of a Cray XE6. The bottom pane shows a calling context for usort
in the execution. The top pane shows a graph of how much time each thread spent executing
calls to usort from the highlighted context. On a Cray XE6, there is one MPI helper thread
for each compute node in the system; these helper threads spent no time executing usort.
The graph shows that some of the MPI ranks spent twice as much time in usort as others.
This happens because the radix sort divides up the work into 1024 buckets. In an execution
on 960 cores, 896 cores work on one bucket and 64 cores work on two. The middle pane
shows an alternate view of the thread-centric data as a histogram.
quantify scalability losses and pinpoint their causes to individual lines of code executed
in particular calling contexts [5]. We have used this technique to quantify scaling losses in
leading science applications across thousands of processor cores on Cray and IBM Blue Gene
systems, associate them with individual lines of source code in full calling context [16, 19],
and quantify scaling losses in science applications within compute nodes at the loop nest
level due to competition for memory bandwidth in multicore processors [15]. We have also
developed techniques for efficiently attributing the idleness in one thread to its cause in
another thread [17, 21].
HPCToolkit is deployed on many DOE supercomputers, including the Sierra super-
computer (IBM Power9 + NVIDIA V100 GPUs) at Lawrence Livermore National Labo-
ratory; Cray XC40 systems at Argonne's Leadership Computing Facility and the National
Energy Research Scientific Computing Center; and the Summit supercomputer (IBM Power9 +
NVIDIA V100 GPUs) at the Oak Ridge Leadership Computing Facility, as well as other clusters
and supercomputers based on x86_64, Power, and ARM processors.
Figure 1.3: A time-centric view of part of an execution of the University of Chicago’s
FLASH code on 256 cores of a Blue Gene/P. The figure shows a detail from the end of the
initialization phase and part of the first iteration of the solve phase. The largest pane in
the figure shows the activity of cores 2–95 in the execution during a time interval ranging
from 69.376s–85.58s during the execution. Time lines for threads are arranged from top
to bottom and time flows from left to right. The color at any point in time for a thread
indicates the procedure that the thread is executing at that time. The right pane shows
the full call stack of thread 85 at 84.82s into the execution, corresponding to the selection
shown by the white crosshair; the outermost procedure frame of the call stack is shown at
the top of the pane and the innermost frame is shown at the bottom. This view highlights
that even though FLASH is an SPMD program, the behavior of threads over time can be
quite different. The purple region highlighted by the cursor, which represents a call by
all processors to mpi_allreduce, shows that the time spent in this call varies across the
processors. The variation in time spent waiting in mpi_allreduce is readily explained by an
imbalance in the time processes spend in a prior prolongation step, shown in yellow. Further
left in the figure, one can see differences among ranks executing on different cores in each
node as they await the completion of an mpi_allreduce. A rank executing on one core
of each node waits in DCMF_Messager_advance (which appears as blue stripes) while ranks
executing on other cores in each node wait in a helper function (shown in green). In this
phase, ranks await the delayed arrival of a few of their peers who have extra work to do
inside simulation_initblock before they call mpi_allreduce.
Chapter 2
HPCToolkit Overview
2. binary analysis to recover program structure from the application binary and the
shared libraries and GPU binaries used in the run;
Figure 2.1: Overview of HPCToolkit’s tool work flow.
For all but the most trivially structured programs, it is important to associate the costs
incurred by each procedure with the contexts in which the procedure is called. Know-
ing the context in which each cost is incurred is essential for understanding why the code
performs as it does. This is particularly important for code based on application frame-
works and libraries. For instance, costs incurred for calls to communication primitives (e.g.,
MPI_Wait) or code that results from instantiating C++ templates for data structures can
vary widely depending on how they are used in a particular context. Because there are often
layered implementations within applications and libraries, it is insufficient either to insert
instrumentation at any one level or to distinguish costs based only upon the immediate
caller. For this reason, HPCToolkit uses call path profiling to attribute costs to the full
calling contexts in which they are incurred.
HPCToolkit’s hpcrun call path profiler uses call stack unwinding to attribute execu-
tion costs of optimized executables to the full calling context in which they occur. Unlike
other tools, to support asynchronous call stack unwinding during execution of optimized
code, hpcrun uses on-line binary analysis to locate procedure bounds and compute an un-
wind recipe for each code range within each procedure [18]. These analyses enable hpcrun
to unwind call stacks for optimized code with little or no information other than an appli-
cation’s machine code.
The output of a run with hpcrun is a measurements directory containing the measurement
data and the information necessary to recover the names of all shared libraries and GPU binaries.
Similarly, it exposes calls to compiler support routines and wait loops in communication libraries
of which one would otherwise be unaware.
Chapter 3
Quick Start
Figure 3.1: Overview of HPCToolkit's tool work flow.
code. This usually means compiling with options similar to ‘-g -O3’. Check your com-
piler’s documentation for information about the right set of options to have the compiler
record information about inlining and the mapping of machine instructions to source lines.
We advise picking options that indicate they will record information that relates machine
instructions to source code without compromising optimization. For instance, for the Portland
Group (PGI) compilers, use -gopt in place of -g to collect information without interfering
with optimization.
While HPCToolkit does not need information about the mapping between machine
instructions and source code to function, having such information included in the binary
code by the compiler can be helpful to users trying to interpret performance measurements.
Since compilers can usually provide information about line mappings and inlining for fully-
optimized code, this requirement usually involves a one-time trivial adjustment to an
application’s build scripts to provide a better experience with tools. Such mapping infor-
mation enables tools such as HPCToolkit, race detectors, and memory analysis tools to
attribute information more precisely.
For statically linked executables, such as those often used on Cray supercomputers, the
final link step is done with hpclink.
[<mpi-launcher>] hpcrun [hpcrun-options] app [app-arguments]
Of course, <mpi-launcher> is only needed for MPI programs; it is typically a
program like mpiexec or mpirun, or a workload manager utility such as Slurm's
srun or IBM's Job Step Manager jsrun.
• Statically linked applications:
First, link hpcrun’s monitoring code into app, using hpclink:
hpclink <linker> -o app <linker-arguments>
Then monitor app by passing hpcrun options through environment variables. For
instance:
export HPCRUN_EVENT_LIST="CYCLES"
[<mpi-launcher>] app [app-arguments]
hpclink’s --help option gives a list of environment variables that affect monitoring.
See Chapter 6 for more information.
Any of these commands will produce a measurements database that contains separate mea-
surement information for each MPI rank and thread in the application. The database is
named according to the form:
hpctoolkit-app-measurements[-<jobid>]
If the application app is run under control of a recognized batch job scheduler (such as Slurm,
Cobalt, or IBM’s Job Manager), the name of the measurements directory will contain the
corresponding job identifier <jobid>. Currently, the database contains measurement files
for each thread that are named using the following templates:
app-<mpi-rank>-<thread-id>-<host-id>-<process-id>.<generation-id>.hpcrun
app-<mpi-rank>-<thread-id>-<host-id>-<process-id>.<generation-id>.hpctrace
Measuring GPU Computations
One can simply profile and optionally trace computations offloaded onto AMD, Intel,
and NVIDIA GPUs by using one of the following event specifiers:
• -e gpu=nvidia is used with CUDA and OpenMP on NVIDIA GPUs
• -e gpu=amd is used with HIP and OpenMP on AMD GPUs
• -e gpu=level0 is used with Intel’s Level Zero runtime for Data Parallel C++ and
OpenMP
• -e gpu=opencl can be used on any of the GPU platforms.
Adding a -t to hpcrun’s command line when profiling GPU computations will trace
them as well.
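For example, to both profile and trace a program app that offloads computation onto an
NVIDIA GPU, one could use:
hpcrun -e gpu=nvidia -t app [app-arguments]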
For more information about how to use PC sampling (NVIDIA GPUs only) or binary in-
strumentation (Intel GPUs) for instruction-level performance measurement of GPU kernels,
see Chapter 8.
hpcprof usually completes this analysis in a matter of minutes. For especially large
experiments (applications using thousands of threads and/or GPU streams), the sibling
hpcprof-mpi may produce results faster by exploiting additional compute nodes. Typically
hpcprof-mpi is invoked as follows, using 8 ranks spread across the compute nodes:
<mpi-launcher> -n 8 hpcprof-mpi hpctoolkit-app-measurements
Note that additional options may be needed to grant hpcprof-mpi access to all threads on
each node; check the documentation for your scheduler and MPI implementation for details.
If possible, hpcprof will copy the sources for the application and any libraries into
the resulting database. If the source code has moved or is mounted at a different
location than when the application was compiled, the resulting database may be missing
some important source files. In these cases, the -R/--replace-path option may be specified
to provide substitute paths based on prefixes. For example, if the application was compiled
from source at /home/joe/app/src/ but that directory is mounted at /extern/homes/joe/app/src/
when running hpcprof, the source files can be made available by invoking hpcprof as
follows:
hpcprof -R '/home/joe/app/src/=/extern/homes/joe/app/src/' \
hpctoolkit-app-measurements
Note that on systems where MPI applications are restricted to a scratch file system, it is the
user's responsibility to copy any wanted source files and make them available to hpcprof.
3.2 Additional Guidance
For additional information, consult the rest of this manual and the other available documentation:
Command-line help.
Each of HPCToolkit's command-line tools can generate a help message summariz-
ing the tool's usage, arguments, and options. To generate this help message, invoke
the tool with -h or --help.
Man pages.
Man pages are available either via the Internet (http://hpctoolkit.org/documentation.html)
or from a local HPCToolkit installation (<hpctoolkit-installation>/share/man).
Manuals.
Manuals are available either via the Internet (http://hpctoolkit.org/documentation.html)
or from a local HPCToolkit installation
(<hpctoolkit-installation>/share/doc/hpctoolkit/documentation.html).
Chapter 4
Effective Strategies for Analyzing Program Performance
This chapter describes some proven strategies for using performance measurements to
identify performance bottlenecks in both serial and parallel codes.
Figure 4.1: Computing a derived metric (cycles per instruction) in hpcviewer.
Figure 4.2: Displaying the new cycles/instruction derived metric in hpcviewer.
The metrics you use in your formula can be determined using the Metric pull-down menu in the
pane. If you select your metric of choice using the pull-down, you can insert its positional
name into the formula using the insert metric button, or you can simply type the positional
name directly into the formula.
At the bottom of the derived metric pane, one can specify a name for the new metric.
One also has the option to indicate that the derived metric column should report for each
scope what percent of the total its quantity represents; for a metric that is a ratio, computing
a percent of the total is not meaningful, so we leave the box unchecked. After clicking the
OK button, the derived metric pane will disappear and the new metric will appear as the
rightmost column in the metric pane. If the metric pane is already filled with other columns
of metrics, you may need to scroll right in the pane to see the new metric. Alternatively, you
can use the metric check-box pane (selected by depressing the button to the right of f(x)
above the metric pane) to hide some of the existing metrics so that there will be enough
room on the screen to display the new metric. Figure 4.2 shows the resulting hpcviewer
display after clicking OK to add the derived metric.
(A note on metrics: inclusive metrics account for a function together with any functions it calls. In
hpcviewer, inclusive metric columns are marked with “(I)” and exclusive metric
columns are marked with “(E).”)
The following sections describe several types of derived metrics that are of particular
use to gain insight into performance bottlenecks and opportunities for tuning.
While knowing where a program spends most of its time or executes most of its floating
point operations may be interesting, such information may not suffice to identify the biggest
targets of opportunity for improving program performance. For program tuning, it is less
important to know how many resources (e.g., time, instructions) were consumed in each
program context than to know where resources were consumed inefficiently.
To identify performance problems, it might initially seem appealing to compute ratios
to see how many events per cycle occur in each program context. For instance, one might
compute ratios such as FLOPs/cycle, instructions/cycle, or cache miss ratios. However,
using such ratios as a sorting key to identify inefficient program contexts can misdirect
a user’s attention. There may be program contexts (e.g., loops) in which computation is
terribly inefficient (e.g., with low operation counts per cycle); however, some or all of the
least efficient contexts may not account for a significant amount of execution time. Just
because a loop is inefficient doesn’t mean that it is important for tuning.
The best opportunities for tuning are where the aggregate performance losses are great-
est. For instance, consider a program with two loops. The first loop might account for 90%
of the execution time and run at 50% of peak performance. The second loop might account
for 10% of the execution time, but only achieve 12% of peak performance. In this case, the
total performance loss in the first loop accounts for 50% of the first loop’s execution time,
which corresponds to 45% of the total program execution time. The 88% performance loss
in the second loop would account for only 8.8% of the program’s execution time. In this
case, tuning the first loop has a greater potential for improving the program performance
even though the second loop is less efficient.
A good way to focus on inefficiency directly is with a derived waste metric. Fortunately,
it is easy to compute such useful metrics. However, there is no one right measure of waste
for all codes. Depending upon what one expects as the rate-limiting resource (e.g., floating-
point computation, memory bandwidth, etc.), one can define an appropriate waste metric
(e.g., FLOP opportunities missed, bandwidth not consumed) and sort by that.
For instance, in a floating-point intensive code, one might consider keeping the floating
point pipeline full as a metric of success. One can directly quantify and pinpoint losses
from failing to keep the floating point pipeline full regardless of why this occurs. One
can pinpoint and quantify losses of this nature by computing a floating-point waste metric
that is calculated as the difference between the potential number of calculations that could
have been performed if the computation had run at its peak rate and the actual
number that were performed. To compute the number of calculations that could have been
completed in each scope, multiply the total number of cycles spent in the scope by the
peak rate of operations per cycle. Using hpcviewer, one can specify a formula to compute
such a derived metric and it will compute the value of the derived metric for every scope.
Figure 4.3 shows the specification of this floating-point waste metric for a code.3
Figure 4.3: Computing a floating point waste metric in hpcviewer.
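As a concrete sketch (with hypothetical positional names: suppose metric $1 holds total
cycles and metric $2 holds measured floating-point operations, on a machine whose peak
rate is 4 FLOPs per cycle), the waste formula entered in hpcviewer's derived-metric pane
would be:
4 * $1 - $2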
Sorting by a waste metric will rank order scopes to show the scopes with the greatest
waste. Such scopes correspond directly to those that contain the greatest opportunities for
improving overall program performance. A waste metric will typically highlight loops where
• a lot of time is spent computing efficiently, but the aggregate inefficiencies accumulate,
• less time is spent computing, but the computation is rather inefficient, and
• no computation occurs at all (e.g., copy loops), which represents a
complete waste according to a metric such as floating point waste.
Beyond identifying and quantifying opportunities for tuning with a waste metric, one
can compute a companion relative efficiency metric to help understand how
easy it might be to improve performance. A scope running at very high efficiency will
typically be much harder to tune than one running at low efficiency.
3 Many recent processors have trouble counting floating-point operations accurately, which is
unfortunate. If your processor can't accurately count floating-point operations, a floating-point waste metric
will be less useful.
Figure 4.4: Computing floating point efficiency in percent using hpcviewer.
For our floating-point waste metric, one can compute a companion floating point efficiency
metric by dividing measured FLOPs by potential peak FLOPs and multiplying the quantity
by 100. Figure 4.4 shows the specification of this floating-point efficiency metric for a code.
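Using the same hypothetical positional names ($1 for cycles, $2 for measured FLOPs, and
a peak rate of 4 FLOPs per cycle), the corresponding efficiency formula would be:
100 * $2 / (4 * $1)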
Scopes that rank high according to a waste metric and low according to a companion
relative efficiency metric often make the best targets for optimization. Figure 4.5 shows an
hpcviewer display of the top two routines that collectively account for 32.2%
of the floating point waste in a reactive turbulent combustion code. The second routine
(ratt) is expanded to show the loops and statements within. While the overall floating
point efficiency for ratt is at 6.6% of peak (shown in scientific notation in the hpcviewer
display), the most costly loop in ratt, which accounts for 7.3% of the floating point waste, is
executing at only 0.114% efficiency. Identifying such sources of inefficiency is the first step
towards improving performance via tuning.
Figure 4.5: Using floating point waste and the percent of floating point efficiency to
evaluate opportunities for optimization.
HPCToolkit can be used to readily pinpoint both kinds of bottlenecks. Using call path
profiles collected by hpcrun, it is possible to quantify and pinpoint scalability bottlenecks
of any kind, regardless of cause.
To pinpoint scalability bottlenecks in parallel programs, we use differential profiling —
mathematically combining corresponding buckets of two or more execution profiles. Dif-
ferential profiling was first described by McKenney [11]; he used differential profiling to
compare two flat execution profiles. Differencing of flat profiles is useful for identifying
what parts of a program incur different costs in two executions. Building upon McKenney’s
idea of differential profiling, we compare call path profiles of parallel executions at different
scales to pinpoint scalability bottlenecks. Differential analysis of call path profiles pinpoints
not only differences between two executions (in this case scalability losses), but the con-
texts in which those differences occur. Associating changes in cost with full calling contexts
is particularly important for pinpointing context-dependent behavior. Context-dependent
behavior is common in parallel programs. For instance, in message passing programs, the
time spent by a call to MPI_Wait depends upon the context in which it is called. Similarly,
how the performance of a communication event scales as the number of processors in a
parallel execution increases depends upon a variety of factors such as whether the size of
the data transferred increases and whether the communication is collective or not.
• when different numbers of processors are used to solve the same problem (strong
scaling), one expects an execution’s speedup to increase linearly with the number of
processors employed;
• when different numbers of processors are used but the amount of computation per
processor is held constant (weak scaling), one expects the execution time on a different
number of processors to be the same.
In both of these situations, a code developer can express their expectations for how
performance will scale as a formula that can be used to predict execution performance
on a different number of processors. One’s expectations about how overall application
performance should scale can be applied to each context in a program to pinpoint and
quantify deviations from expected scaling. Specifically, one can scale and difference the
performance of an application on different numbers of processors to pinpoint contexts that
are not scaling ideally.
To pinpoint and quantify scalability bottlenecks in a parallel application, we first use
hpcrun to collect call path profiles for an application on two different numbers of processors.
Let Ep be an execution on p processors and Eq be an execution on q processors. Without
loss of generality, assume that q > p.
In our analysis, we consider both inclusive and exclusive costs for CCT nodes. The
inclusive cost at n represents the sum of all costs attributed to n and any of its descendants
in the CCT, and is denoted by I(n). The exclusive cost at n represents the sum of all costs
attributed strictly to n, and we denote it by E(n). If n is an interior node in a CCT, it
represents an invocation of a procedure. If n is a leaf in a CCT, it represents a statement
inside some procedure. For leaves, their inclusive and exclusive costs are equal.
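Stated as a formula, for a CCT node n with children C(n):
I(n) = E(n) + Σ_{c ∈ C(n)} I(c)
which reduces to I(n) = E(n) when n is a leaf.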
It is useful to perform scalability analysis for both inclusive and exclusive costs; if the
loss of scalability attributed to the inclusive costs of a function invocation is roughly equal
to the loss of scalability due to its exclusive costs, then we know that the computation
in that function invocation does not scale. However, if the loss of scalability attributed
to a function invocation's inclusive costs outweighs the loss of scalability accounted for by
exclusive costs, we need to explore the scalability of the function's callees.
Figure 4.6: Computing the scaling loss when weak scaling a white dwarf detonation
simulation with FLASH3 from 256 to 8192 cores. For weak scaling, the time on an MPI rank
in each of the simulations will be the same. In the figure, column 0 represents the inclusive
cost for one MPI rank in a 256-core simulation; column 2 represents the inclusive cost
for one MPI rank in an 8192-core simulation. The difference between these two columns,
computed as $2-$0, represents the excess work present in the larger simulation for each
unique program context in the calling context tree. Dividing that by the total time in
the 8192-core execution (@2) gives the fraction of wasted time. Multiplying through by 100
gives the percent of the time wasted in the 8192-core execution, which corresponds to the
% scalability loss.
Given CCTs for an ensemble of executions, the next step to analyzing the scalability
of their performance is to clearly define our expectations. Next, we describe performance
expectations for weak scaling and intuitive metrics that represent how much performance
deviates from our expectations. More information about our scalability analysis technique
can be found elsewhere [5, 19].
Weak Scaling
Consider two weak scaling experiments executed on p and q processors, respectively,
p < q. Figure 4.6 shows how we can use a derived metric to compute and attribute
scalability losses. Here, we compute the difference in inclusive cycles spent on one core of an
8192-core run and one core of a 256-core run in a weak scaling experiment. If the code had
perfect weak scaling, the time for an MPI rank in each of the executions would be identical.
Figure 4.7: Using the scalability loss metric of Figure 4.6 to rank order loop
nests by their scaling loss.
In this case, they are not. We compute the excess work as the difference for
each scope between the time on the 8192-core run and the time on the 256-core run.
We normalize the differences of the time spent in the two runs by dividing them by the total
time spent on the 8192-core run. This yields the fraction of wasted effort for each scope
when scaling from 256 to 8192 cores. Finally, we multiply these results by 100 to compute
the % scalability loss. This example shows how one can compute a derived metric that
pinpoints and quantifies scaling losses across different node counts of a Blue Gene/P system.
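In the inclusive-cost notation introduced earlier, the derived metric of Figure 4.6 can be
summarized for a CCT node n as:
% scalability loss(n) = 100 × (I_8192(n) − I_256(n)) / T_8192
where I_256(n) and I_8192(n) are the inclusive costs of n in the 256-core and 8192-core
runs, and T_8192 is the total time of the 8192-core execution.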
A similar analysis can be applied to compute scaling losses between jobs that use different
numbers of cores on individual processors. Figure 4.7 shows the result of computing
the scaling loss for each loop nest when scaling from one to eight cores on a multicore node
and rank ordering loop nests by their scaling loss metric. Here, we simply compute the scaling
loss as the difference between the cycle counts of the eight-core and the one-core runs,
divided through by the aggregate cost of the process executing on eight cores. This figure
shows the scaling loss written in scientific notation as a fraction rather than multiplying
through by 100 to yield a percent. In this figure, we examine scaling losses in the flat view,
showing them for each loop nest. The source pane shows the loop nest responsible for the
greatest scaling loss when scaling from one to eight cores. Unsurprisingly, the loop with the
worst scaling loss is very memory intensive. Memory bandwidth is a precious commodity
on multicore processors.
While we have shown how to compute and attribute the fraction of excess work in a weak
scaling experiment, one can compute a similar quantity for experiments with strong scaling.
When differencing the costs summed across all of the threads in a pair of strong-scaling
experiments, one uses exactly the same approach as shown in Figure 4.6. If comparing
weak scaling costs summed across all ranks in p and q core executions, one can simply scale
the aggregate costs by 1/p and 1/q respectively before differencing them.
• Top-down view. This view represents the dynamic calling contexts (call paths) in
which costs were incurred.
• Bottom-up view. This view enables one to look upward along call paths. This view
is particularly useful for understanding the performance of software components or
procedures that are used in more than one context, such as communication library
routines.
• Flat view. This view organizes performance measurement data according to the static
structure of an application. All costs incurred in any calling context by a procedure
are aggregated together in the flat view.
hpcviewer enables developers to explore top-down, bottom-up, and flat views of CCTs
annotated with costs, helping to quickly pinpoint performance bottlenecks. Typically, one
begins analyzing an application’s scalability and performance using the top-down calling
context tree view. Using this view, one can readily see how costs and scalability losses are
associated with different calling contexts. If costs or scalability losses are associated with
only a few calling contexts, then this view suffices for identifying the bottlenecks. When
scalability losses are spread among many calling contexts, e.g., among different invocations
of MPI_Wait, it is often useful to switch to the bottom-up view of the data to see if many losses
are due to the same underlying cause. In the bottom-up view, one can sort routines by
their exclusive scalability losses and then look upward to see how these losses accumulate
from the different calling contexts in which the routine was invoked.
Scaling loss based on excess work is intuitive; perfect scaling corresponds to an excess work
value of 0, sublinear scaling yields positive values, and superlinear scaling yields negative
values. Typically, CCTs for SPMD programs have similar structure. If CCTs for different
executions diverge, using hpcviewer to compute and report excess work will highlight these
program regions.
Inclusive excess work and exclusive excess work serve as useful measures of scalability
associated with nodes in a calling context tree (CCT). By computing both metrics, one can
determine whether the application scales well or not at a CCT node and also pinpoint the
cause of any lack of scaling. If a node for a function in the CCT has comparable positive
values for both inclusive excess work and exclusive excess work, then the loss of scaling
is due to computation in the function itself. However, if the inclusive excess work for the
function outweighs that accounted for by its exclusive costs, then one should explore the
scalability of its callees. To isolate code that is an impediment to scalable performance, one
can use the hot path button in hpcviewer to trace a path down through the CCT to see
where the cost is incurred.
Chapter 5
Monitoring Dynamically-linked
Applications with hpcrun
This chapter describes the mechanics of using hpcrun and hpclink to profile an appli-
cation and collect performance data. For advice on how to choose events, perform scaling
studies, etc., see Chapter 4, Effective Strategies for Analyzing Program Performance.
By default, hpcrun will measure the program's execution by sampling its CPUTIME and collect a call
path profile for each thread in the execution. More about the CPUTIME metric can be
found in Section 5.3.3.
In addition to a call path profile, hpcrun can collect a call path trace of an execution if
the -t (or --trace) option is used to turn on tracing. The following use of hpcrun will collect
both a call path profile and a call path trace of CPU execution using the default CPUTIME
sample source:
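hpcrun -t app [app-arguments]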
Traces are most useful for understanding the execution dynamics of multithreaded or multi-
process applications; however, you may find a trace of a single-threaded application to be
useful to understand how an execution unfolds over time.
While CPUTIME is used as the default sample source if no other sample source is spec-
ified, many other sample sources are available. Typically, one uses the -e (or --event) option to
specify a sample source and sampling rate.1 Sample sources are specified as ‘event@howoften’
where event is the name of the source and howoften is either a number specifying the pe-
riod (threshold) for that event, or f followed by a number, e.g., @f100 specifying a target
sampling frequency for the event in samples/second.2 Note that a higher period implies a
lower rate of sampling. The -e option may be used multiple times to specify that multiple
sample sources be used for measuring an execution.
The basic syntax for profiling an application with hpcrun is:
hpcrun [-t] -e event@howoften ... app arg ...
For example, to profile an application using hardware counter sample sources provided
by Linux perf_events and sample cycles at 300 times/second (the default sampling fre-
quency) and sample every 4,000,000 instructions, you would use:
hpcrun -e CYCLES -e INSTRUCTIONS@4000000 app arg ...
The units for timer-based sample sources (CPUTIME and REALTIME) are microseconds, so
to sample an application with tracing every 5,000 microseconds (200 times/second), you
would use:
hpcrun -t -e CPUTIME@5000 app arg ...
hpcrun stores its raw performance data in a measurements directory with the program
name in the directory name. On systems with a batch job scheduler (e.g., PBS), the job
id is appended to the directory name:
hpctoolkit-app-measurements[-jobid]
It is best to use a different measurements directory for each run. If you're using
hpcrun on a local workstation without a job launcher, you can use the '-o dirname' option
to specify an alternate directory name.
For programs that use their own launch script (e.g., mpirun or mpiexec for MPI), put
the application’s run script on the outside (first) and hpcrun on the inside (second) on the
command line. For example,
mpirun -n 4 hpcrun -e CYCLES mpiapp arg ...
Note that hpcrun is intended for profiling dynamically linked binaries. It will not work
well if used to profile a shell script. At best, you would be profiling the shell interpreter,
not the script commands, and sometimes this will fail outright.
It is possible to use hpcrun to launch a statically linked binary, but there are two prob-
lems with this. First, it is still necessary to build the binary with hpclink. Second, static
binaries are commonly used on parallel clusters that require running the binary directly
and do not accept a launch script. However, if your system allows it, and if the binary
was produced with hpclink, then hpcrun will set the correct environment variables for
profiling statically or dynamically linked binaries. All that hpcrun really does is set some
environment variables (including LD_PRELOAD) and exec the binary.
1 GPU and OpenMP measurement events don't accept a rate.
2 Frequency-based sampling and the frequency-based notation for howoften are only available for sample
sources managed by Linux perf_events. For Linux perf_events, HPCToolkit uses a default sampling
frequency of 300 samples/second.
5.1.1 If hpcrun causes your application to fail
hpcrun can cause applications to fail in certain circumstances. Here, we describe several
kinds of failures that may arise and how to sidestep them.
• Until Glibc 2.35, most applications running on ARM will crash. This was caused by
a fatal flaw in Glibc’s PLT handler for ARM, where an argument register that should
have been saved was instead replaced with a junk pointer value. This register is used
to return C/C++ struct values from functions and methods, including some C++
constructors.
• Until Glibc 2.35, applications and libraries using dlmopen will crash. While most
applications do not use dlmopen, an example of a library that does is Intel’s GTPin,
which hpcrun uses to instrument Intel GPU code.
• Applications and libraries using significant amounts of static TLS space may crash
with the message “cannot allocate memory in static TLS block.” This is caused
by a flaw in Glibc causing it to allocate insufficient static TLS space when LD_AUDIT
is enabled. For Glibc 2.35 and newer, setting the environment variable
export GLIBC_TUNABLES=glibc.rtld.optional_static_tls=0x400000000
will instruct Glibc to allocate 16MB of static TLS memory per thread; in our experi-
ence, this is far more than any application will use (however, the value can be adjusted
freely). For older Glibc, the only option is to disable hpcrun's use of LD_AUDIT.
The following options direct hpcrun to adjust the strategy it uses for monitoring dynamic
libraries. We suggest not using any of these options unless your program
fails with hpcrun's defaults.
--enable-auditor This option is the default, except on ARM or when Intel GTPin instru-
mentation is enabled. Passing this option instructs hpcrun to use LD_AUDIT in all
cases.
If your application fails to find libraries when hpcrun monitors your code by wrapping dlopen
and dlclose rather than using LD_AUDIT, you can sidestep this problem by adding any
library paths listed in the RUNPATH of your application or library to your LD_LIBRARY_PATH
environment variable before launching hpcrun.
[pmu::][event_name][:unit_mask][:modifier|:modifier=val]
• pmu. Optional name of the PMU (group of events) to which the event belongs.
This is useful to disambiguate events in case events from different sources have the
same name. If no pmu is specified, the first matching event is used.
• event_name. The name of the event. It must be the complete name; partial matches
are not accepted.
• unit_mask. Some events can be refined using sub-events. A unit mask designates
an optional sub-event. An event may have multiple unit masks and it is possible to
combine them (for some events) by repeating the :unit_mask pattern.
• modifier. A modifier is an optional filter that restricts when an event counts. The
form of a modifier may be either :modifier or :modifier=val. For modifiers without
a value, the presence of the modifier is interpreted as a restriction. Events may allow
use of multiple modifiers at the same time.
– hardware event modifiers. Some hardware events support one or more modi-
fiers that restrict counting to a subset of events. For instance, on an Intel Broad-
well EP, one can add a modifier to MEM_LOAD_UOPS_RETIRED to count only load
operations that are an L2_HIT or an L2_MISS. For information about all modifiers
for hardware events, one can direct HPCToolkit’s measurement subsystem to
list all native events and their modifiers as described in Section 5.3.
– precise_ip. For some events, it is possible to control the amount of skid. Skid
is a measure of how many instructions may execute between an event and the
PC where the event is reported. Smaller skid enables more accurate attribu-
tion of events to instructions. Without a skid modifier, hpcrun allows arbitrary
skid because some architectures don’t support anything more precise. One may
optionally specify one of the following as a skid modifier:
∗ :p : a sample must have constant skid.
∗ :pp : a sample is requested to have 0 skid.
∗ :ppp : a sample must have 0 skid.
∗ :P : autodetect the least skid possible.
NOTE: If the kernel or the hardware does not support the specified value of the
skid, no error message will be reported but no samples will be recorded.
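For example, to sample the Broadwell event mentioned above, restricted to L2 misses and
requesting the least possible skid (a sketch; event and unit mask names vary by processor),
one might use:
hpcrun -e MEM_LOAD_UOPS_RETIRED:L2_MISS:P@1000000 app arg ...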
and stall cycles. Using instrumentation built into the Linux kernel, the perf_events inter-
face can also measure software events. Examples of software events include page faults, context
switches, and CPU migrations.
Multiplexing. Multiplexing enables one to monitor more events in a single execu-
tion than the number of hardware counters a processor can support for each thread. The
number of events that can be monitored in a single execution is limited only by the maxi-
mum number of concurrent events that the kernel will allow a user to multiplex using the
perf_events interface.
When more events are specified than can be monitored simultaneously using a thread’s
hardware counters,4 the kernel will employ multiplexing and divide the set of events to be
monitored into groups, monitor only one group of events at a time, and cycle repeatedly
through the groups as a program executes.
For applications that have very regular, steady state behavior, e.g., an iterative code
with lots of iterations, multiplexing will yield results that are suitably representative of
execution behavior. However, for executions that consist of unique short phases, measure-
ments collected using multiplexing may not accurately represent the execution behavior.
To obtain more accurate measurements, one can run an application multiple times and in
each run collect a subset of events that can be measured without multiplexing. Results
from several such executions can be imported into HPCToolkit’s hpcviewer and analyzed
together.
3 The kernel may be unable to deliver the desired frequency if there are fewer events per second than the
desired frequency.
4 How many events can be monitored simultaneously on a particular processor may depend on the events
specified.
Thread blocking. When a program executes, a thread may block waiting for the kernel
to complete some operation on its behalf. For instance, a thread may block waiting for data
to become available so that a read operation can complete. On systems running Linux 4.3
or newer, one can use the perf_events sample source to monitor how much time a thread
is blocked and where the blocking occurs. To measure the time a thread spends blocked,
one can profile with the BLOCKTIME event and another time-based event, such as CYCLES. The
BLOCKTIME event shouldn't have any frequency or period specified, whereas CYCLES may
have a frequency or period specified.
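For example, one could pair the two events as follows:
hpcrun -e BLOCKTIME -e CYCLES app arg ...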
Launching
When sampling with native events, by default hpcrun will profile using perf_events.
To force HPCToolkit to use PAPI rather than perf_events to oversee monitoring of a PMU
event (assuming that HPCToolkit has been configured to include support for PAPI), one
must prefix the event with 'papi::' as follows:
hpcrun -e papi::CYCLES
For PAPI presets, there is no need to prefix the event with 'papi::'. For instance, it is
sufficient to specify the PAPI_TOT_CYC event without any prefix to profile using PAPI. For more
information about using PAPI, see Section 5.3.2.
Below, we provide some examples of various ways to measure CYCLES and INSTRUCTIONS
using HPCToolkit's perf_events measurement substrate.
To sample an execution 100 times per second (frequency-based sampling) counting
CYCLES and 100 times a second counting INSTRUCTIONS:
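hpcrun -e CYCLES@f100 -e INSTRUCTIONS@f100 app arg ...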
To sample an execution every 1,000,000 cycles and every 1,000,000 instructions using
period-based sampling:
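hpcrun -e CYCLES@1000000 -e INSTRUCTIONS@1000000 app arg ...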
By default, hpcrun uses frequency-based sampling at a rate of 300 samples per second
per event type. Hence, the following command causes HPCToolkit to sample CYCLES at
300 samples per second and INSTRUCTIONS at 300 samples per second:
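hpcrun -e CYCLES -e INSTRUCTIONS app arg ...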
One can specify a different default sampling period or frequency using the -c option.
The command below will sample CYCLES and INSTRUCTIONS at 200 samples per second
each:
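hpcrun -c f200 -e CYCLES -e INSTRUCTIONS app arg ...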
Notes
• Linux perf_events uses one file descriptor for each event monitored in each thread. Fur-
thermore, hpcrun generates one hpcrun file for each thread, plus an additional
hpctrace file per thread if tracing is enabled. Hence, for e events and t threads, the required number
of file descriptors is:
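t × e (event file descriptors) + t (hpcrun files) + t (hpctrace files) = t × (e + 2)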
For instance, if one profiles a multi-threaded program that executes with 500 threads
using 4 events, then the required number of file descriptors is
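500 × (4 + 2) = 3,000 file descriptors.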
If the number of file descriptors exceeds the maximum number of open files,
then the program will crash. To remedy this issue, one needs to increase the limit
on the number of open files allowed (e.g., with ulimit -n).
• When a system is configured with suitable permissions, HPCToolkit will sample call
stacks within the Linux kernel in addition to application-level call stacks. This
feature can be useful to measure kernel activity on behalf of a thread (e.g., zero-
filling allocated pages when they are first touched) or to observe where, why, and
how long a thread blocks. For a user to be able to sample kernel call stacks,
the configuration file /proc/sys/kernel/perf_event_paranoid must have a value
≤ 1. To associate addresses in kernel call paths with function names, the value of
/proc/sys/kernel/kptr_restrict must be 0 (number zero). If these settings are
not configured in this way on your system, you will need someone with administrator
privileges to change them for you to be able to sample call stacks within the kernel.
• Due to a limitation present in all Linux kernel versions currently available, HPC-
Toolkit’s measurement subsystem can only approximate a thread’s blocking time.
At present, Linux reports when a thread blocks but does not report when a thread
resumes execution. For that reason, HPCToolkit’s measurement subsystem approxi-
mates the time a thread spends blocked using sampling as the time between when the
thread blocks and when the thread receives its first sample after resuming execution.
• Users need to be cautious when considering measured counts of events that have been
collected using hardware counter multiplexing. Currently, it is not obvious to a user
if a metric was measured using a multiplexed counter. This information is present in
the measurements but is not currently visible in hpcviewer.
5.3.2 PAPI
PAPI, the Performance API, is a library that provides access to hardware perfor-
mance counters. PAPI aims to provide a consistent, high-level interface that consists of a
universal set of event names that can be used to measure performance on any processor,
independent of any processor-specific event names. In some cases, PAPI event names rep-
resent quantities synthesized by combining measurements based on multiple native events
PAPI_BR_INS Branch instructions
PAPI_BR_MSP Conditional branch instructions mispredicted
PAPI_FP_INS Floating point instructions
PAPI_FP_OPS Floating point operations
PAPI_L1_DCA Level 1 data cache accesses
PAPI_L1_DCM Level 1 data cache misses
PAPI_L1_ICH Level 1 instruction cache hits
PAPI_L1_ICM Level 1 instruction cache misses
PAPI_L2_DCA Level 2 data cache accesses
PAPI_L2_ICM Level 2 instruction cache misses
PAPI_L2_TCM Level 2 cache misses
PAPI_LD_INS Load instructions
PAPI_SR_INS Store instructions
PAPI_TLB_DM Data translation lookaside buffer misses
PAPI_TOT_CYC Total cycles
PAPI_TOT_IIS Instructions issued
PAPI_TOT_INS Instructions completed
Table 5.1: Some commonly available PAPI events. The exact set of available events is
system dependent.
available on a particular processor. For instance, in some cases PAPI reports total cache
misses by measuring and combining data misses and instruction misses. PAPI is available
from the University of Tennessee at http://icl.cs.utk.edu/papi.
PAPI focuses mostly on in-core CPU events: cycles, cache misses, floating point opera-
tions, mispredicted branches, etc. For example, the following command samples total cycles
and L2 cache misses:
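hpcrun -e PAPI_TOT_CYC@8000000 -e PAPI_L2_TCM@400000 app arg ...
(The periods shown here are illustrative; see the guidance on choosing periods below.)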
The precise set of PAPI preset and native events is highly system dependent. Commonly,
there are events for machine cycles, cache misses, floating point operations and other more
system specific events. However, there are restrictions both on how many events can be
sampled at one time and on what events may be sampled together and both restrictions are
system dependent. Table 5.1 contains a list of commonly available PAPI events.
To see what PAPI events are available on your system, use the papi_avail command
from the bin directory in your PAPI installation. The event must be both available and
not derived to be usable for sampling. The command papi_native_avail displays the
machine’s native events. Note that on systems with separate compute nodes, you normally
need to run papi_avail on one of the compute nodes.
When selecting the period for PAPI events, aim for a rate of approximately a few
hundred samples per second. So, choose roughly several million or tens of millions for total cycles
or a few hundred thousand for cache misses. PAPI and hpcrun will tolerate sampling rates
as high as 1,000 or even 10,000 samples per second (or more). However, rates higher than
a few hundred samples per second will only increase measurement overhead and distort the
execution of your program; they won’t yield more accurate results.
Beginning with Linux kernel version 2.6.32, support for accessing performance counters
using the Linux perf_events performance monitoring subsystem is built into the kernel.
perf_events provides a measurement substrate for PAPI on Linux.
On modern Linux systems that include support for perf_events, PAPI is only recom-
mended for monitoring events outside the scope of the perf_events interface.
Proxy Sampling HPCToolkit supports proxy sampling for derived PAPI events. For
HPCToolkit to sample a PAPI event directly, the event must not be derived and must
trigger hardware interrupts when a threshold is exceeded. For events that cannot trigger
interrupts directly, HPCToolkit's proxy sampling samples on another event that is supported
directly and then reads the counter for the derived event. In this case, a native event can
serve as a proxy for one or more derived events.
To use proxy sampling, specify the hpcrun command line as usual and be sure to include
at least one non-derived PAPI event. The derived events will be accumulated automatically
when processing a sample trigger for a native event. We recommend adding PAPI_TOT_CYC
as a native event when using proxy sampling, but proxy sampling will gather data as long
as the event set contains at least one non-derived PAPI event. Proxy sampling requires one
non-derived PAPI event to serve as the proxy; a Linux timer can’t serve as the proxy for a
PAPI derived event.
For example, on newer Intel CPUs, often PAPI floating point events are all derived and
cannot be sampled directly. In that case, you could count FLOPs by using cycles a proxy
event with a command line such as the following. The period for derived events is ignored
and may be omitted.
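(A sketch; PAPI_FP_OPS is the PAPI preset for floating-point operations, and the cycles
period is illustrative.)
(dynamic) hpcrun -e PAPI_TOT_CYC@15000000 -e PAPI_FP_OPS app arg ...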
5.3.3 Linux Timers
The CPUTIME and REALTIME sample sources are based on the POSIX timers
CLOCK_THREAD_CPUTIME_ID and CLOCK_REALTIME with the Linux SIGEV_THREAD_ID exten-
sion. CPUTIME only counts time when the CPU is running; REALTIME counts real (wall
clock) time, whether the process is running or not. Signal delivery for these timers is
thread-specific, so these timers are suitable for profiling multithreaded programs. Sam-
pling using the REALTIME sample source may break some applications that don’t handle
interrupted syscalls well. In that case, consider using CPUTIME instead.
The following example, which specifies a period of 5000 microseconds, will sample each
thread in app at a rate of approximately 200 times per second.
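(A sketch; the CPUTIME period is given in microseconds.)
(dynamic) hpcrun -e CPUTIME@5000 app arg ...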
Note: do not use more than one timer-based sample source to monitor a program execution.
When using a sample source such as CPUTIME or REALTIME, we recommend not using another
time-based sampling source such as Linux perf events CYCLES or PAPI’s PAPI_TOT_CYC.
Technically, this is feasible and hpcrun won’t die. However, multiple time-based sample
sources would compete with one another to measure the execution and likely lead to dropped
samples and possibly distorted results.
5.3.4 IO
The IO sample source counts the number of bytes read and written. This displays two
metrics in the viewer: “IO Bytes Read” and “IO Bytes Written.” The IO source is a
synchronous sample source. It overrides the functions read, write, fread and fwrite and
records the number of bytes read or written along with their dynamic context synchronously
rather than relying on data collection triggered by interrupts.
To include this source, use the IO event (no period). In the static case, two steps are
needed. Use the --io option for hpclink to link in the IO library and use the IO event to
activate the IO source at runtime. For example,
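(A sketch of both cases; the compiler and file names are illustrative.)
(dynamic) hpcrun -e IO app arg ...
(static) hpclink --io mpicc -o app file.o ...
export HPCRUN_EVENT_LIST=IO
app arg ...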
The IO source is mainly used to find where your program reads or writes large amounts
of data. However, it is also useful for tracing a program that spends much time in read and
write. The hardware performance counters do not advance while running in the kernel, so
the trace viewer may misrepresent the amount of time spent in syscalls such as read and
write. By adding the IO source, hpcrun overrides read and write and thus is able to more
accurately count the time spent in these functions.
5.3.5 MEMLEAK
The MEMLEAK sample source counts the number of bytes allocated and freed. Like IO,
MEMLEAK is a synchronous sample source and does not generate asynchronous interrupts.
Instead, it overrides the malloc family of functions (malloc, calloc, realloc and free
plus memalign, posix_memalign and valloc) and records the number of bytes allocated
and freed along with their dynamic context.
MEMLEAK allows you to find locations in your program that allocate memory that is
never freed. But note that failure to free a memory location does not necessarily imply
that location has leaked (missing a pointer to the memory). It is common for programs to
allocate memory that is used throughout the lifetime of the process and not explicitly free
it.
To include this source, use the MEMLEAK event (no period). Again, two steps are needed
in the static case. Use the --memleak option for hpclink to link in the MEMLEAK library
and use the MEMLEAK event to activate it at runtime. For example,
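(A sketch paralleling the IO example above.)
(dynamic) hpcrun -e MEMLEAK app arg ...
(static) hpclink --memleak mpicc -o app file.o ...
export HPCRUN_EVENT_LIST=MEMLEAK
app arg ...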
If a program allocates and frees many small regions, the MEMLEAK source may result in a
high overhead. In this case, you may reduce the overhead by using the memleak probability
option to record only a fraction of the mallocs. For example, to monitor 10% of the mallocs,
use:
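(A sketch; we believe the probability option is spelled -mp for the dynamic case and
HPCRUN_MEMLEAK_PROB for the static case, but check hpcrun's help output to confirm.)
(dynamic) hpcrun -e MEMLEAK -mp 0.10 app arg ...
(static) export HPCRUN_EVENT_LIST=MEMLEAK
export HPCRUN_MEMLEAK_PROB=0.10
app arg ...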
It might appear that if you monitor only 10% of the program’s mallocs, then you would
have only a 10% chance of finding the leak. But if a program leaks memory, then it’s likely
that it does so many times, all from the same source location. And you only have to find
that location once. So, this option can be a useful tool if the overhead of recording all
mallocs is prohibitive.
Rarely, for some programs with complicated memory usage patterns, the MEMLEAK source
can interfere with the application’s memory allocation causing the program to segfault. If
this happens, use the hpcrun debug (dd) variable MEMLEAK_NO_HEADER as a workaround.
The MEMLEAK source works by attaching a header or a footer to the application’s malloc’d
regions. Headers are faster but have a greater potential for interfering with an application.
Footers have higher overhead (require an external lookup) but have almost no chance of
interfering with an application. The MEMLEAK_NO_HEADER variable disables headers and uses
only footers.
5.4 Experimental Python Support
This section provides a brief overview of how to use HPCToolkit to analyze the
performance of Python-based applications. Normally, hpcrun will attribute performance to
the CPython implementation, not to the application Python code, as shown in Figure 5.1.
This usually is of little interest to an application developer, so HPCToolkit provides
experimental support for attributing to Python callstacks.
NOTE: This feature is in an experimental state. Many cases may not work
as expected; crashes and corrupted performance data are likely. Use at your
own risk.
Figure 5.1: Example of a simple Python application measured without (left) and with
(right) Python support enabled via hpcrun -a python. The left database has no source
code, since sources were not provided for the CPython implementation.
If HPCToolkit has been compiled with Python support enabled, hpcrun is able to
replace segments of the C callstacks with the Python code running in those frames. To
enable this transformation, profile your application with the additional -a python flag:
(dynamic) hpcrun -a python -e event@howoften python3 app arg ...
As shown in Figure 5.1, passing this flag removes the CPython implementation details,
replacing them with the much smaller Python callstack. When Python calls an external C
library, HPCToolkit will report both the name of the Python function object and the C
function being called; in this example, sleep and Glibc's clock_nanosleep, respectively.
5.5 Process Fraction
To measure only a fraction of an application's processes, specify the process fraction with
hpcrun's -f option (dynamic) or the HPCRUN_PROCESS_FRACTION environment variable
(static). For example:
(dynamic) hpcrun -f 0.10 -e event@howoften app arg ...
(dynamic) hpcrun -f 1/10 -e event@howoften app arg ...
(static) export HPCRUN_EVENT_LIST=’event@howoften’
export HPCRUN_PROCESS_FRACTION=0.10
app arg ...
With this option, each process generates a random number and records its measurement
data with the given probability. The process fraction (probability) may be written as a
decimal number (0.10) or as a fraction (1/10) between 0 and 1. So, in the above example,
all three cases would record data for approximately 10% of the processes. Aim for a number
of processes in the hundreds.
5.6 Starting and Stopping Sampling
HPCToolkit supports an API for the application to start and stop sampling:
void hpctoolkit_sampling_start(void);
void hpctoolkit_sampling_stop(void);
For example, suppose that your program has three major phases: it reads input from
a file, performs some numerical computation on the data and then writes the output to
another file. And suppose that you want to profile only the compute phase and skip the
read and write phases. In that case, you could stop sampling at the beginning of the
program, restart it before the compute phase and stop it again at the end of the compute
phase.
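A minimal sketch of this pattern (read_input, compute, and write_output are hypothetical
stand-ins for the program's phases):
#include <hpctoolkit.h>

static void read_input(void)   { /* ... read data from a file ... */ }
static void compute(void)      { /* ... numerical computation ... */ }
static void write_output(void) { /* ... write results to a file ... */ }

int main(void)
{
    hpctoolkit_sampling_stop();    /* sampling is initially on; turn it off */
    read_input();                  /* unmeasured read phase */
    hpctoolkit_sampling_start();   /* measure only the compute phase */
    compute();
    hpctoolkit_sampling_stop();
    write_output();                /* unmeasured write phase */
    return 0;
}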
This interface is process wide, not thread specific. That is, it affects all threads of a
process. Note that when you turn sampling on or off, you should do so uniformly across all
processes, normally at the same point in the program. Enabling sampling in only a subset
of the processes would likely produce skewed and misleading results.
And for technical reasons, when sampling is turned off in a threaded process, interrupts
are disabled only for the current thread. Other threads continue to receive interrupts, but
they don’t unwind the call stack or record samples. So, another use for this interface is
to protect syscalls that are sensitive to being interrupted with signals. For example, some
Gemini interconnect (GNI) functions called from inside gasnet_init() or MPI_Init() on
Cray XE systems will fail if they are interrupted by a signal. As a workaround, you could
turn sampling off around those functions.
Also, you should use this interface only at the top level for major phases of your program.
That is, the granularity of turning sampling on and off should be much larger than the time
between samples. Turning sampling on and off down inside an inner loop will likely produce
skewed and misleading results.
To use this interface, put the above function calls into your program where you want
sampling to start and stop. Remember, starting and stopping apply process wide. For
C/C++, include the following header file from the HPCToolkit include directory.
#include <hpctoolkit.h>
43
Compile your application with libhpctoolkit with -I and -L options for the include
and library paths. For example,
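(A sketch; the installation path is illustrative.)
gcc -o app app.c -I /path/to/hpctoolkit/include \
    -L /path/to/hpctoolkit/lib/hpctoolkit -lhpctoolkit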
The libhpctoolkit library provides weak symbol no-op definitions for the start and
stop functions. For dynamically linked programs, be sure to include -lhpctoolkit on the
link line (otherwise your program won’t link). For statically linked programs, hpclink
adds strong symbol definitions for these functions. So, -lhpctoolkit is not necessary in
the static case, but it doesn’t hurt.
To run the program, set the LD_LIBRARY_PATH environment variable to include the
HPCToolkit lib/hpctoolkit directory. This step is only needed for dynamically linked
programs.
export LD_LIBRARY_PATH=/path/to/hpctoolkit/lib/hpctoolkit
Note that sampling is initially turned on until the program turns it off. If you want it
initially turned off, then use the -ds (or --delay-sampling) option for hpcrun (dynamic)
or set the HPCRUN_DELAY_SAMPLING environment variable (static).
5.7 Environment Variables for hpcrun
HPCTOOLKIT To function correctly, hpcrun must know the location of the HPCToolkit
top-level installation directory. The hpcrun script uses elements of the installation
lib and libexec subdirectories. On most systems, hpcrun can find the requisite
components relative to its own location in the file system. However, some parallel job
launchers copy the hpcrun script to a different location as they launch a job. If your
system does this, you must set the HPCTOOLKIT environment variable to the location
of the HPCToolkit top-level installation directory before launching a job.
Note to system administrators: if your system provides a module system for con-
figuring software packages, then constructing a module for HPCToolkit to initialize these
environment variables to appropriate settings would be convenient for users.
5.8 Cray System Specific Notes
#!/bin/sh
#PBS -l mppwidth=#nodes
#PBS -l walltime=00:30:00
#PBS -V
export HPCTOOLKIT=/path/to/hpctoolkit/install/directory
export CRAY_ROOTFS=DSL
cd $PBS_O_WORKDIR
aprun -n #nodes hpcrun -e event@howoften dynamic-app arg ...
Figure 5.2: A sketch of how to help HPCToolkit find its dynamic libraries when using
Cray’s ALPS job launcher.
in your job’s error log then read on. Otherwise, skip this section.
The problem is that the Cray job launcher copies HPCToolkit’s hpcrun script to a
directory somewhere below /var/spool/alps/ and runs it from there. By moving hpcrun
to a different directory, this breaks hpcrun’s method for finding HPCToolkit’s install
directory.
To fix this problem, in your job script, set HPCTOOLKIT to the top-level HPCToolkit
installation directory (the directory containing the bin, lib and libexec subdirectories)
and export it to the environment. (If launching statically-linked binaries created using
hpclink, this step is unnecessary, but harmless.) Figure 5.2 shows a skeletal job script
that sets the HPCTOOLKIT environment variable before monitoring a dynamically-linked
executable with hpcrun:
Your system may have a module installed for hpctoolkit with the correct settings for
PATH, HPCTOOLKIT, etc. In that case, the easiest solution is to load the hpctoolkit module.
Try “module show hpctoolkit” to see if it sets HPCTOOLKIT.
Chapter 6
On modern Linux systems, dynamically linked executables are the default. With dynam-
ically linked executables, HPCToolkit’s hpcrun script uses library preloading to inject
HPCToolkit’s monitoring code into an application’s address space. However, in some
cases, statically-linked executables are necessary or desirable.
• One might prefer statically linked executables because they are generally faster if the
executable spends a significant amount of time calling functions in libraries.
To build a version of your executable with HPCToolkit’s monitoring code linked in, you
would use the following command line:
hpclink mpicc -o app -static file.o ... -l<lib> ...
In practice, you may want to edit your Makefile to always build two versions of your
program, perhaps naming them app and app.hpc.
#!/bin/sh
#PBS -l size=64
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
export HPCRUN_EVENT_LIST="CYCLES@f200 PERF_COUNT_HW_CACHE_MISSES@f200"
aprun -n 64 ./app arg ...
6.3 Troubleshooting
With some compilers you need to disable interprocedural optimization to use hpclink.
To instrument your statically linked executable at link time, hpclink uses the ld option
--wrap (see the ld(1) man page) to interpose monitoring code between your application
and various process, thread, and signal control operations, e.g., fork, pthread_create, and
sigprocmask to name a few. For some compilers, e.g., IBM’s XL compilers, interprocedural
optimization interferes with the --wrap option and prevents hpclink from working properly.
If this is the case, hpclink will emit error messages and fail. If you want to use hpclink
with such compilers, sadly, you must turn off interprocedural optimization.
Note that interprocedural optimization may not be explicitly enabled during your com-
piles; it might be implicitly enabled when using a compiler optimization option such as
-fast. In cases such as this, you can often specify -fast along with an option such as
-no-ipa; this option combination will provide the benefit of all of -fast’s optimizations
except interprocedural optimization.
Chapter 7
A: In this example, s3d_f90.x is the Fortran S3D program compiled with OpenMPI and
run with the command line
mpiexec -n 4 hpcrun -e PAPI_TOT_CYC:2500000 ./s3d_f90.x
This produced 12 files in the following abbreviated ls listing:
krentel 1889240 Feb 18 s3d_f90.x-000000-000-72815673-21063.hpcrun
krentel 9848 Feb 18 s3d_f90.x-000000-001-72815673-21063.hpcrun
krentel 1914680 Feb 18 s3d_f90.x-000001-000-72815673-21064.hpcrun
krentel 9848 Feb 18 s3d_f90.x-000001-001-72815673-21064.hpcrun
krentel 1908030 Feb 18 s3d_f90.x-000002-000-72815673-21065.hpcrun
krentel 7974 Feb 18 s3d_f90.x-000002-001-72815673-21065.hpcrun
krentel 1912220 Feb 18 s3d_f90.x-000003-000-72815673-21066.hpcrun
krentel 9848 Feb 18 s3d_f90.x-000003-001-72815673-21066.hpcrun
krentel 147635 Feb 18 s3d_f90.x-72815673-21063.log
krentel 142777 Feb 18 s3d_f90.x-72815673-21064.log
krentel 161266 Feb 18 s3d_f90.x-72815673-21065.log
krentel 143335 Feb 18 s3d_f90.x-72815673-21066.log
Here, there are four processes and two threads per process. Looking at the file names,
s3d_f90.x is the name of the program binary, 000000-000 through 000003-001 are the
MPI rank and thread numbers, and 21063 through 21066 are the process IDs.
We see from the file sizes that OpenMPI is spawning one helper thread per process.
Technically, the smaller .hpcrun files imply only a smaller calling-context tree (CCT), not
necessarily fewer samples. But in this case, the helper threads are not doing much work.
int main(int argc, char **argv)
{
    int size, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ...
}
Note: The first call to MPI_Comm_rank() should use MPI_COMM_WORLD. This sets the
process’s MPI rank in the eyes of hpcrun. Other communicators are allowed, but the first
call should use MPI_COMM_WORLD.
Also, the call to MPI_Comm_rank() should be unconditional; that is, all processes should
make this call. Actually, the call to MPI_Comm_size() is not necessary (for hpcrun), al-
though most MPI programs normally call both MPI_Comm_size() and MPI_Comm_rank().
Q: What MPI implementations are supported?
A: Although the matrix of all possible MPI variants, versions, compilers, architectures and
systems is very large, HPCToolkit has been tested successfully with MPICH, MVAPICH
and OpenMPI and should work with most MPI implementations.
Chapter 8
HPCToolkit can measure both the CPU and GPU performance of GPU-accelerated
applications. It can measure CPU performance using asynchronous sampling triggered by
Linux timers or hardware counter events as described in Section 5.3 and it can monitor
GPU performance using tool support libraries provided by GPU vendors.
In the following sections, we describe a generic substrate in HPCToolkit to interact with
vendor specific runtime systems and libraries and the vendor specific details for measuring
performance for NVIDIA, AMD, and Intel GPUs.
While a single version of HPCToolkit can be built that supports GPUs from multiple
vendors and programming models, using HPCToolkit to collect GPU metrics using GPUs
from multiple vendors in a single execution or using multiple GPU programming models
(e.g. CUDA + OpenCL) in a single execution is unsupported. It is unlikely to produce
correct measurements and likely to crash.
Metric Description
GKER (sec) GPU time: kernel execution (seconds)
GMEM (sec) GPU time: memory allocation/deallocation (seconds)
GMSET (sec) GPU time: memory set (seconds)
GXCOPY (sec) GPU time: explicit data copy (seconds)
GSYNC (sec) GPU time: synchronization (seconds)
GPUOP (sec) Total GPU operation time: sum of all metrics above
Table 8.1: GPU operation timings.
HPCToolkit supports two levels of performance monitoring for GPU accelerated appli-
cations: coarse-grain profiling and tracing of GPU activities at the operation level (e.g.,
kernel launches, data allocations, memory copies, ...), and fine-grain measurement of GPU
computations using PC sampling or instrumentation, which measure GPU computations at
the granularity of individual machine instructions.
Coarse-grain profiling attributes to each calling context the total time of all GPU oper-
ations initiated in that context. Table 8.1 shows the classes of GPU operations for which
timings are collected. In addition, HPCToolkit records metrics for operations performed
including memory allocation and deallocation (Table 8.2), memory set (Table 8.3), explicit
memory copies (Table 8.4), and synchronization (Table 8.5). These operation metrics are
available for GPUs from all three vendors. For NVIDIA GPUs, HPCToolkit also reports
GPU kernel characteristics, including register usage, thread count per block, and
theoretical occupancy as shown in Table 8.6. HPCToolkit derives a theoretical GPU oc-
cupancy metric as the ratio of the active threads in a streaming multiprocessor to the
maximum active threads supported by the hardware in one streaming multiprocessor.
Table 8.7 shows fine-grain metrics for GPU instruction execution. When possible, HPC-
Toolkit attributes fine-grain GPU metrics to both GPU calling contexts and CPU calling
contexts. To our knowledge, no GPU has hardware support for attributing metrics directly
to GPU calling contexts. To compensate, HPCToolkit approximately attributes metrics to
GPU calling contexts. It reconstructs GPU calling contexts from static GPU call graphs for
NVIDIA GPUs (See Section 8.2.4) and uses measurements of call sites and data flow anal-
ysis on static call graphs to apportion metrics among call paths in a GPU calling context
tree. We expect to add similar functionality for GPUs from other vendors in the future.
The performance metrics above are reported in a vendor-neutral way. Not every metric
is available for all GPUs. Coarse-grain profiling and tracing are supported for AMD, Intel,
and NVIDIA GPUs. HPCToolkit supports fine-grain measurements on NVIDIA GPUs
using PC sampling and provides some simple fine-grain measurements on Intel GPUs using
instrumentation. Currently, AMD GPUs lack both hardware and software support for fine-
grain measurement. The next few sections describe specific measurement capabilities for
NVIDIA, AMD, and Intel GPUs, respectively.
Metric Description
GMEM:UNK (B) GPU memory alloc/free: unknown memory kind (bytes)
GMEM:PAG (B) GPU memory alloc/free: pageable memory (bytes)
GMEM:PIN (B) GPU memory alloc/free: pinned memory (bytes)
GMEM:DEV (B) GPU memory alloc/free: device memory (bytes)
GMEM:ARY (B) GPU memory alloc/free: array memory (bytes)
GMEM:MAN (B) GPU memory alloc/free: managed memory (bytes)
GMEM:DST (B) GPU memory alloc/free: device static memory (bytes)
GMEM:MST (B) GPU memory alloc/free: managed static memory (bytes)
GMEM:COUNT GPU memory alloc/free: count
Table 8.2: GPU memory allocation and deallocation.
Metric Description
GMSET:UNK (B) GPU memory set: unknown memory kind (bytes)
GMSET:PAG (B) GPU memory set: pageable memory (bytes)
GMSET:PIN (B) GPU memory set: pinned memory (bytes)
GMSET:DEV (B) GPU memory set: device memory (bytes)
GMSET:ARY (B) GPU memory set: array memory (bytes)
GMSET:MAN (B) GPU memory set: managed memory (bytes)
GMSET:DST (B) GPU memory set: device static memory (bytes)
GMSET:MST (B) GPU memory set: managed static memory (bytes)
GMSET:COUNT GPU memory set: count
Table 8.3: GPU memory set operations.
HPCToolkit also supports tracing of activities on GPU streams on NVIDIA, AMD, and
Intel GPUs.1 Tracing of GPU activities will be enabled any time GPU monitoring is enabled
and hpcrun’s tracing is enabled with -t or --trace.
It is important to know that hpcrun creates CPU tracing threads to record a trace of
GPU activities. By default, it creates one tracing thread per four GPU streams. To adjust
the number of GPU streams per tracing thread, see the settings for HPCRUN_CONTROL_KNOBS
in Appendix A. When mapping a GPU-accelerated node program onto a node, you may need
to consider provisioning additional hardware threads or cores to accommodate these tracing
threads; otherwise, they may compete against application threads for CPU resources, which
may degrade the performance of your execution.
1 Tracing of GPU activities on Intel GPUs is currently supported only for Intel's OpenCL runtime. We
plan to add tracing support for Intel's Level 0 runtime in a future release.
Metric Description
GXCOPY:UNK (B) GPU explicit memory copy: unknown kind (bytes)
GXCOPY:H2D (B) GPU explicit memory copy: host to device (bytes)
GXCOPY:D2H (B) GPU explicit memory copy: device to host (bytes)
GXCOPY:H2A (B) GPU explicit memory copy: host to array (bytes)
GXCOPY:A2H (B) GPU explicit memory copy: array to host (bytes)
GXCOPY:A2A (B) GPU explicit memory copy: array to array (bytes)
GXCOPY:A2D (B) GPU explicit memory copy: array to device (bytes)
GXCOPY:D2A (B) GPU explicit memory copy: device to array (bytes)
GXCOPY:D2D (B) GPU explicit memory copy: device to device (bytes)
GXCOPY:H2H (B) GPU explicit memory copy: host to host (bytes)
GXCOPY:P2P (B) GPU explicit memory copy: peer to peer (bytes)
GXCOPY:COUNT GPU explicit memory copy: count
Table 8.4: GPU explicit memory copies.
Metric Description
GSYNC:UNK (sec) GPU synchronizations: unknown kind
GSYNC:EVT (sec) GPU synchronizations: event
GSYNC:STRE (sec) GPU synchronizations: stream event wait
GSYNC:STR (sec) GPU synchronizations: stream
GSYNC:CTX (sec) GPU synchronizations: context
GSYNC:COUNT GPU synchronizations: count
Table 8.5: GPU synchronization.
Metric Description
GKER:STMEM (B) GPU kernel: static memory (bytes)
GKER:DYMEM (B) GPU kernel: dynamic memory (bytes)
GKER:LMEM (B) GPU kernel: local memory (bytes)
GKER:FGP ACT GPU kernel: fine-grain parallelism, actual
GKER:FGP MAX GPU kernel: fine-grain parallelism, maximum
GKER:THR REG GPU kernel: thread register count
GKER:BLK THR GPU kernel: thread count
GKER:BLK GPU kernel: block count
GKER:BLK SM (B) GPU kernel: block local memory (bytes)
GKER:COUNT GPU kernel: launch count
GKER:OCC THR GPU kernel: theoretical occupancy
Table 8.6: GPU kernel characteristics.
Table 8.8 shows the command-line arguments to hpcrun that will enable different levels of
monitoring for NVIDIA GPUs for GPU-accelerated code implemented using CUDA. When
fine-grain monitoring using PC
sampling is enabled, coarse-grain profiling is also performed, so tracing is available in this
mode as well. However, since PC sampling dilates the CPU overhead of GPU-accelerated
codes, tracing is not recommended when PC sampling is enabled.
Besides the standard metrics for GPU operation timings (Table 8.1), memory allocation
and deallocation (Table 8.2), memory set (Table 8.3), explicit memory copies (Table 8.4),
and synchronization (Table 8.5), HPCToolkit reports GPU kernel characteristics, including
register usage, thread count per block, and theoretical occupancy as shown in
Table 8.6. NVIDIA defines theoretical occupancy as the ratio of the active threads in a
streaming multiprocessor to the maximum active threads supported by the hardware in one
streaming multiprocessor.
At present, using NVIDIA’s CUPTI library adds substantial measurement overhead.
Unlike CPU monitoring based on asynchronous sampling, GPU performance monitoring
uses vendor-provided callback interfaces to intercept the initiation of each GPU operation.
Accordingly, the overhead of GPU performance monitoring depends upon how frequently
GPU operations are initiated. In our experience to date, profiling (and if requested, tracing)
on NVIDIA GPUs using NVIDIA’s CUPTI interface roughly doubles the execution time of
a GPU-accelerated application. In our experience, we have seen NVIDIA’s PC sampling
dilate the execution time of a GPU-accelerated program by 30× using CUDA 10 or earlier.
Our early experience with CUDA 11 indicates that overhead using PC sampling is much
lower and less than 5×. The overhead of GPU monitoring is principally on the host side.
As measured by CUPTI, the time spent in GPU operations or PC samples is expected to
be relatively accurate. However, since execution as a whole is slowed while measuring GPU
performance, when evaluating GPU activity reported by HPCToolkit, one must be careful.
For instance, if a GPU-accelerated program runs in 1000 seconds without HPCToolkit
monitoring GPU activity but slows to 2000 seconds when GPU profiling and tracing is en-
abled, then if GPU profiles and traces show that the GPU is active for 25% of the execution
time, one should re-scale the accurate measurements of GPU activity by considering the 2×
dilation when monitoring GPU activity. Without monitoring, one would expect the same
Metric Description
GINST GPU instructions executed
GINST:STL ANY GPU instruction stalls: any
GINST:STL NONE GPU instruction stalls: no stall
GINST:STL IFET GPU instruction stalls: await availability of next instruction (fetch or branch delay)
GINST:STL IDEP GPU instruction stalls: await satisfaction of instruction input dependence
GINST:STL GMEM GPU instruction stalls: await completion of global memory access
GINST:STL TMEM GPU instruction stalls: texture memory request queue full
GINST:STL SYNC GPU instruction stalls: await completion of thread or memory synchronization
GINST:STL CMEM GPU instruction stalls: await completion of constant or immediate memory access
GINST:STL PIPE GPU instruction stalls: await completion of required compute resources
GINST:STL MTHR GPU instruction stalls: global memory request queue full
GINST:STL NSEL GPU instruction stalls: not selected for issue but ready
GINST:STL OTHR GPU instruction stalls: other
GINST:STL SLP GPU instruction stalls: sleep
Table 8.7: Fine-grain metrics for GPU instruction execution.
Argument to hpcrun What is monitored
-e gpu=nvidia coarse-grain profiling of NVIDIA GPU operations
-e gpu=nvidia -t coarse-grain profiling and tracing of NVIDIA GPU operations
-e gpu=nvidia,pc coarse-grain profiling of NVIDIA GPU operations; fine-grain measurement of GPU kernel executions using PC sampling
Table 8.8: Monitoring performance on NVIDIA GPUs when using NVIDIA's CUDA
programming model and runtime.
level of GPU activity, but the host time would be twice as fast. Thus, without monitoring,
the ratio of GPU activity to host activity would be roughly double.
Figure 8.1: NVIDIA's GPU PC sampling example on an SM. P-6P represent six sample
periods P cycles apart. S1-S4 represent four schedulers on an SM.
Metric Description
GSAMP:DRP GPU PC samples: dropped
GSAMP:EXP GPU PC samples: expected
GSAMP:TOT GPU PC samples: measured
GSAMP:PER (cyc) GPU PC samples: period (GPU cycles)
GSAMP:UTIL (%) GPU utilization computed using PC sampling
Table 8.9: GPU PC sampling statistics.
SM’s warp schedulers in a round robin fashion. When an instruction is sampled, its stall
reason (if any) is recorded. If all warps on a scheduler are stalled when a sample is taken,
the sample is marked as a latency sample, meaning no instruction will be issued by the warp
scheduler in the next cycle. Figure 8.1 shows a PC sampling example on an SM with four
schedulers. Among the six collected samples, four are latency samples, so the estimated
stall ratio is 4/6.
Table 8.7 shows the stall metrics recorded by HPCToolkit using CUPTI's PC sampling.
Table 8.9 shows PC sampling summary statistics recorded by HPCToolkit. Of
particular note is the metric GSAMP:UTIL. HPCToolkit computes approximate GPU uti-
lization using information gathered using PC sampling. Given the average clock frequency
and the sampling rate, if all SMs are active, then HPCToolkit knows how many instruc-
tion samples would be expected (GSAMP:EXP) if the GPU was fully active for the inter-
val when it was in use. HPCToolkit approximates the percentage of GPU utilization by
comparing the measured samples with the expected samples using the following formula:
100 × GSAMP:TOT / GSAMP:EXP.
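For example (with hypothetical numbers), if GSAMP:TOT is 1,000,000 measured samples
and GSAMP:EXP is 4,000,000 expected samples, the estimated GPU utilization is 25%.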
For CUDA 10, measurement using PC sampling with CUPTI serializes the execution of
GPU kernels. Thus, measurement of GPU kernels using PC sampling will distort the exe-
cution of a GPU-accelerated application by blocking concurrent execution of GPU kernels.
For applications that rely on concurrent kernel execution to keep the GPU busy, this will
significantly distort execution and PC sampling measurements will only reflect the GPU
activity of kernels running in isolation.
8.2.3 Attributing Measurements to Source Code for NVIDIA GPUs
NVIDIA’s nvcc compiler doesn’t record information about how GPU machine code
maps to CUDA source without proper compiler arguments. Using the -G compiler option
to nvcc, one may generate NVIDIA CUBINs with full DWARF information that includes
not only line maps, which map each machine instruction back to a program source line,
but also detailed information about inlined code. However, the price of turning on -G
is that optimization by nvcc will be disabled. For that reason, code compiled with -G
runs vastly slower. While a developer of a template-based programming model
may find this option useful to see how a program employs templates to instantiate GPU
code, measurements of code compiled with -G should be viewed with a skeptical eye.
One can use nvcc’s -lineinfo option to instruct nvcc to record line map information
during compilation.2 The -lineinfo option can be used in conjunction with nvcc opti-
mization. Using -lineinfo, one can measure and interpret the performance of optimized
code. However, line map information is a poor substitute for full DWARF information.
When nvcc inlines code during optimization, the resulting line map information simply
shows the source lines that were compiled into a GPU function. A developer examining
performance measurements for a function must reason on their own about how any source
lines from outside the function got there as the result of inlining and/or macro expansion.
When HPCToolkit uses NVIDIA’s CUPTI to monitor a GPU-accelerated application,
CUPTI notifies HPCToolkit every time it loads a CUDA binary, known as a CUBIN, into a
GPU. At runtime, HPCToolkit computes a cryptographic hash of a CUBIN’s contents and
records the CUBIN into the execution’s measurement directory. For instance, if a GPU-
accelerated application loaded a CUBIN into a GPU, NVIDIA's CUPTI informed HPCToolkit
that the CUBIN was being loaded, and HPCToolkit computed its cryptographic hash as
972349aed8, then HPCToolkit would record 972349aed8.gpubin inside a gpubins subdi-
rectory of an HPCToolkit measurement directory.
To attribute GPU performance measurements back to source, HPCToolkit’s hpcstruct
supports analysis of NVIDIA CUBIN binaries. Since many CUBIN binaries may be loaded
by a GPU-accelerated application during execution, an application’s measurements direc-
tory may contain a gpubins subdirectory populated with many CUBINs.
To conveniently analyze all of the CPU and GPU binaries associated with an execution,
we have extended HPCToolkit’s hpcstruct binary analyzer so that it can be applied to
a measurement directory rather than just individual binaries. So, for a measurements
directory hpctoolkit-laghos-measurements collected during an execution of the GPU-
accelerated laghos mini-app [8], one can analyze all of CPU and GPU binaries associated
with the measured execution by using the following command:
hpcstruct hpctoolkit-laghos-measurements
When applied in this fashion, hpcstruct runs in parallel by default. It uses half of the
threads in the CPU set in which it is launched to analyze binaries in parallel. hpcstruct
analyzes large CPU or GPU binaries (100MB or more) using 16 threads, and it analyzes
multiple smaller binaries concurrently, using two threads for the analysis of each.
2 Line maps relate each machine instruction back to the program source line from where it came.
By default, when applied to a measurements directory, hpcstruct performs only
lightweight analysis of the GPU functions in each CUBIN. When a measurements direc-
tory contains fine-grain measurements collected using PC sampling, it is useful to perform
a more detailed analysis to recover information about the loops and call sites of GPU func-
tions in an NVIDIA CUBIN. Unfortunately, NVIDIA has refused to provide an API that
would enable HPCToolkit to perform instruction-level analysis of CUBINs directly. Instead,
HPCToolkit must invoke NVIDIA’s nvdisasm command line utility to compute control flow
graphs for functions in a CUBIN. The version of nvdisasm in CUDA 10 is VERY SLOW
and fails to compute control flow graphs for some GPU functions. In such cases, hpcstruct
reverts to lightweight analysis of GPU functions that considers only line map information.
Because analysis of CUBINs using nvdisasm is VERY SLOW, it is not performed by
default.3 To enable detailed analysis of GPU functions, use the --gpucfg yes option to
hpcstruct, as shown below:
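(Reusing the laghos measurement directory from the earlier example:)
hpcstruct --gpucfg yes hpctoolkit-laghos-measurements
Given the resulting control flow information, the measured costs of a GPU function are
attributed to its call sites as follows: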
• If a GPU function G can only be invoked from a single call site, all of the measured
cost of G will be attributed to its call site.
3 Before using the --gpucfg yes option, see the notes in the FAQ and Troubleshooting guide in
Section 12.5.
Figure 8.2: A screenshot of hpcviewer for the GPU-accelerated Quicksilver proxy app
without GPU CCT reconstruction.
• If a GPU function G can be called from multiple call sites and PC samples have
been collected for one or more of the call instructions for G, the costs for G are
proportionally divided among G’s call sites according to the distribution of PC samples
for calls that invoke G. For instance, consider the case where there are three call sites
where G may be invoked, 5 samples are recorded for the first call instruction, 10
samples are recorded for the second call instruction, and no samples are recorded for
the third call. In this case, HPCToolkit divides the costs for G among the first two
call sites, attributing 5/15 of G’s costs to the first call site and 10/15 of G’s costs to
the second call site.
• If no call instructions for a GPU function G have been sampled, the costs of G are
apportioned evenly among each of G’s call sites.
Figure 8.3: Reconstruction of a GPU calling context tree. A-F represent GPU functions.
Each subscript denotes the number of samples associated with the function. Each (a, c)
pair indicates an edge at address a has c call instruction samples.
IHPCToolkit’s hpcprof analyzes the static call graph associated with each GPU kernel
invocation. If the static call graph for the GPU kernel contains cycles, which arise from
recursive or mutually-recursive calls, hpcprof replaces each cycle with a strongly connected
component (SCC). In this case, hpcprof unlinks call graph edges between vertices within
the SCC and adds an SCC vertex to enclose the set of vertices in each SCC. The rest of
hpcprof’s analysis treats an SCC vertex as a normal “function” in the call graph.
Figure 8.3 illustrates the reconstruction of an approximate calling context tree for
a GPU computation given the static call graph (computed by hpcstruct from a CU-
BIN’s machine instructions) and PC sample counts for some or all GPU instructions
in the CUBIN. Figure 8.4 shows an hpcviewer screenshot for the GPU-accelerated
Quicksilver proxy app following reconstruction of GPU calling contexts using the algo-
rithm described in this section. Notice that after the reconstruction, one can see that
CycleTrackingKernel calls CycleTrackingGuts, which calls CollisionEvent, which even-
tually calls macroscopicCrossSection and NuclearData::getNumberOfReactions. The
rich approximate GPU calling context tree reconstructed by hpcprof also shows loop
nests and inlined code.4
4 The control flow graph used to produce this reconstruction for Quicksilver was computed with CUDA
11. You will not be able to reproduce these results with earlier versions of CUDA due to weaknesses in
nvdisasm prior to CUDA 11.
Figure 8.4: A screenshot of hpcviewer for the GPU-accelerated Quicksilver proxy app
with GPU CCT reconstruction.
Argument to hpcrun What is monitored
-e gpu=amd coarse-grain profiling of AMD GPU operations
-e gpu=amd -t coarse-grain profiling and tracing of AMD GPU operations
Table 8.10: Monitoring performance on AMD GPUs when using AMD's HIP and OpenMP
programming models and runtimes.
Argument to hpcrun What is monitored
-e gpu=level0 coarse-grain profiling of Intel GPU operations using Intel's Level 0 runtime
-e gpu=level0 -t coarse-grain profiling and tracing of Intel GPU operations using Intel's Level 0 runtime
-e gpu=level0,inst=count coarse-grain profiling of Intel GPU operations using Intel's Level 0 runtime; fine-grain measurement of Intel GPU kernel executions using Intel's GT-Pin for instruction counting
-e gpu=level0,inst=count -t coarse-grain profiling and tracing of Intel GPU operations using Intel's Level 0 runtime; fine-grain measurement of Intel GPU kernel executions using Intel's GT-Pin for instruction counting
Table 8.11: Monitoring performance on Intel GPUs when using Intel's Level 0 runtime.
Argument to hpcrun What is monitored
-e gpu=opencl coarse-grain profiling of GPU operations performed using OpenCL
-e gpu=opencl -t coarse-grain profiling and tracing of GPU operations performed using OpenCL
Table 8.12: Monitoring performance on GPUs when using the OpenCL programming
model.
Table 8.12 shows the possible command-line arguments to hpcrun for monitoring
OpenCL programs. There are two levels of monitoring: profiling, or profiling + trac-
ing. When tracing is enabled, HPCToolkit will collect a trace of activity for each OpenCL
command queue.
Chapter 9
When an OpenMP runtime supports the OMPT interface, by registering callbacks using
the OMPT interface and making calls to OMPT interface operations in the runtime API,
HPCToolkit can gather information that enables it to reconstruct a global, user-level view
of the parallelism. Using the OMPT interface, HPCToolkit can attribute metrics for costs
incurred by worker threads in parallel regions back to the calling contexts in which those
parallel regions were invoked. In such cases, most or all of the work performed is attributed back
to global user-level calling contexts that are descendants of <program root>. When using
the OMPT interface, there may be some costs that cannot be attributed back to a global
user-level calling context in an OpenMP program. For instance, costs associated with
idle worker threads that can’t be associated with any parallel region may be attributed
to <omp idle>. Even when using the OMPT interface, some costs may be attributed
to <thread root>; however, such costs are typically small and are often associated with
runtime startup.
HPCToolkit includes support for using the OMPT interface to monitor offloading of
computations specified with OpenMP TARGET to GPUs and attributing them back to the
host calling contexts from which they were offloaded.
9.2.2 AMD GPUs
OpenMP computations executing on AMD GPUs are monitored whenever hpcrun’s
command-line switches are configured to monitor operations on AMD GPUs, as described
in Section 8.3.
AMD’s ROCm 5.1 and later releases contains OMPT support for monitoring and
attributing host computations as well as computations offloaded to AMD GPUs using
OpenMP TARGET. When compiled with amdclang or amdclang++, both host compu-
tations and computations offloaded to AMD GPUs can be associated with global user-level
calling contexts that are children of <program root>.
Cray’s compilers only have partial support for the OMPT interface, which renders HPC-
Toolkit unable to elide implementation-level details of parallel regions. For everyone but
compiler or runtime developers, such details are unnecessary and make it harder for appli-
cation developers to understand their code with no added value.
Chapter 10
HPCToolkit provides the hpcviewer [2, 20] performance presentation tool for inter-
active examination of performance databases. hpcviewer presents a heterogeneous calling
context tree that spans both CPU and GPU contexts, annotated with measured or derived
metrics to help users assess code performance and identify bottlenecks.
The database generated by hpcprof consists of four dimensions: profile, time, context,
and metric. We employ the term profile to include any logical threads (such as OpenMP,
pthread and C++ threads), and also MPI processes and GPU streams. The time dimension
represents the timeline of the program's execution, context depicts the path in the calling-
context tree, and metric constitutes program measurements performed by hpcrun such as
cycles, number of instructions, stall percentages and ratio of idleness. The time dimension
is available if the application is profiled with traces enabled (hpcrun -t option).
To simplify performance data visualization, hpcviewer restricts the display to two dimensions
at a time: the Profile view (Section 10.2) displays pairs of ⟨context, metric⟩ or ⟨profile,
metric⟩ dimensions; and the Trace viewer (Section 10.9) visualizes the behavior of threads
or streams over time.
Note: Currently GPU stream execution contexts are not shown in this view; metrics for
a GPU operation are associated with the calling context in the thread that initiated the
GPU operation.
10.1 Launching
Requirements to launch hpcviewer:
• On all platforms: Java 11 or newer (up to Java 17).
• On Linux: GTK 3.20 or newer.
hpcviewer can either be launched from a command line (Linux platforms) or by clicking
the hpcviewer icon (for Windows, Mac OS X and Linux platforms). The command line
syntax is as follows:
hpcviewer [options] [<hpctoolkit-database>]
Here, <hpctoolkit-database> is an optional argument to load a database automatically.
Without this argument, hpcviewer will prompt for the location of a database. Possible
options for hpcviewer are shown in the table below:
On Linux, hpcviewer's install.sh script chooses a default maximum size for the Java
heap on the current platform. When analyzing measurements for large and complex
applications, it may be necessary to use the --java-heap
option to specify a larger heap size for hpcviewer to accommodate many metrics for many
contexts.
On MacOS and Windows, the maximum JVM heap size is stored in the
hpcviewer.ini file, specified with the -Xmx option. On MacOS, this file is located at
hpcviewer.app/Contents/Eclipse/hpcviewer.ini.
• Top-down View. This top-down view shows the dynamic calling contexts (call
paths) in which costs were incurred. Using this view, one can explore performance
measurements of an application in a top-down fashion to understand the costs incurred
by calls to a procedure in a particular calling context. We use the term cost rather
than simply time since hpcviewer can present a multiplicity of metrics (such as cycles
or cache misses) or derived metrics (e.g., cache miss rates or bandwidth consumed)
that are other indicators of execution cost.
Figure 10.1: An annotated screenshot of hpcviewer’s interface.
A calling context for a procedure f consists of the stack of procedure frames active
when the call was made to f. Using this view, one can readily see how much of the ap-
plication’s cost was incurred by f when called from a particular calling context. If finer
detail is of interest, one can explore how the costs incurred by a call to f in a partic-
ular context are divided between f itself and the procedures it calls. HPCToolkit’s
call path profiler hpcrun and the hpcviewer user interface distinguish calling context
precisely by individual call sites; this means that if a procedure g contains calls to
procedure f in different places, these represent separate calling contexts.
• Bottom-up View. This bottom-up view enables one to look upward along call paths.
The view apportions a procedure’s costs to its callers and, more generally, its calling
contexts. This view is particularly useful for understanding the performance of soft-
ware components or procedures that are used in more than one context. For instance,
a message-passing program may call MPI_Wait in many different calling contexts.
The cost of any particular call will depend upon the structure of the parallelization
in which the call is made. Serialization or load imbalance may cause long waits in
some calling contexts while other parts of the program may have short waits because
computation is balanced and communication is overlapped with computation.
When several levels of the Bottom-up View are expanded, saying that the Bottom-up
View apportions metrics of a callee on behalf of its callers can be confusing. More
precisely, the Bottom-up View apportions the metrics of a procedure on behalf of the
various calling contexts that reach it.
• Flat View. This view organizes performance measurement data according to the
static structure of an application. All costs incurred in any calling context by a
procedure are aggregated together in the Flat View. This complements the Top-down
View, in which the costs incurred by a particular procedure are represented separately
for each call to the procedure from a different calling context.
10.3 Panes
hpcviewer’s browser window is divided into three panes: the Navigation pane, Source
pane, and the Metrics pane. We briefly describe the role of each pane.
• In the Bottom-up View, entities in the navigation tree are procedure activations.
Unlike procedure activations in the top-down view in which call sites are paired with
the called procedure, in the bottom-up view, call sites are paired with the calling
procedure to facilitate attribution of costs for a called procedure to multiple different
call sites and callers.
• In the Flat View, entities in the navigation tree correspond to source files, procedure
call sites (which are rendered the same way as procedure activations), loops, and
source lines.
Navigation Control
The header above the navigation pane contains some controls for the navigation and
metric view. In Figure 10.1, they are labeled as “navigation/metric control.”
• Zoom-in / Zoom-out:
Depressing the up arrow button will zoom in to show only information for the selected
line and its descendants. One can zoom out (reversing a prior zoom operation) by
depressing the down arrow button.
• Hide/show metrics:
Show or hide metric columns. A dialog box will appear and the user can select which
metric columns should be shown. See Section 10.8.2 for more details.
• Resizing metric columns:
Resize the metric columns based on either the width of the data, or the width of both
of the data and the column’s label.
Context menus
Navigation control also provides several context menus, accessed by right-clicking in
the navigation pane.
• Copy: Copy the selected line in the navigation pane into the clipboard, including the
name of the node in the tree and the values of visible metrics in the metric pane (Section
10.3.3). The values of hidden metrics will not be copied.
• Find: Display the Find window to allow the user to search for text within the Scope
column of the current table. The window has several options such as case sensitivity,
whole word search and using regular expressions.
The metric pane provides scroll bars for horizontal scrolling (to reveal other metrics) and
vertical scrolling (to reveal other scopes). Vertical scrolling of the metric and navigation
panes is synchronized.
10.4.2 Example
Figure 10.2 shows an example of a recursive program separated into two files, file1.c
and file2.c. In Figure 10.3, we use numerical subscripts to distinguish between different
instances of the same procedure; in the other views (Figures 10.4 and 10.5), we use alphabetic
subscripts. We use different labels because there is no natural one-to-one correspondence
between the instances in the different views.
Routine g can behave as a recursive function depending on the value of the condition
branch (lines 3–4). Figure 10.3 shows an example of the call chain execution of the program
annotated with both inclusive and exclusive costs. Computation of inclusive costs from
exclusive costs in the Top-down View involves simply summing up all of the costs in the
subtree below.
Figure 10.2: An example recursive program separated into two files, file1.c and file2.c.
Figure 10.3: Top-down View. Each node of the tree has three boxes: the left-most is the
name of the node (in this case the name of the routine), the center is the inclusive value,
and the right is the exclusive value.
In this figure, we can see that on the right path of the routine m, routine g (instantiated
in the diagram as g1 ) performed a recursive call (g2 ) before calling routine h. Although
g1 , g2 and g3 are all instances from the same routine (i.e., g), we attribute a different cost
for each instance. This separation of cost can be critical to identify which instance has a
performance problem.
Figure 10.4 shows the corresponding scope structure for the Bottom-up View and the
costs we compute for this recursive program. The procedure g, noted as ga (a root
node in the diagram), has a different cost than g as a call site, noted as gb, gc and gd. For
instance, on the first tree of this figure, the inclusive cost of ga is 9, which is the sum of the
highest cost for each path in the calling context tree shown in Figure 10.3 that includes g:
the inclusive cost of g3 (which is 3) and g1 (which is 6). We do not attribute the cost of g2
here since it is a descendant of g1 (in other terms, the cost of g2 is included in g1).
Inclusive costs need to be computed similarly in the Flat View. The inclusive cost of a
recursive routine is the sum of the highest cost for each branch in calling context tree. For
Figure 10.4: Bottom-up View
instance, in Figure 10.5, the inclusive cost of gx, defined as the total cost of all instances of
g, is 9, which is consistent with the cost in the bottom-up tree. The advantage
of attributing different costs for each instance of g is that it enables a user to identify which
instance of the call to g is responsible for performance losses.
10.5.1 Formulae
The formula syntax supported by hpcviewer is inspired by spreadsheet-like infix math-
ematical formulae. Operators have standard algebraic precedence.
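For example, assuming the database records total cycles as metric 0 and floating-point
operations as metric 1 (an assumption for illustration), and writing $n for the point-wise
value of the metric with ID n, a FLOPs-per-cycle derived metric could be written as:
$1 / $0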
Figure 10.6: Derived metric dialog box
10.5.2 Examples
Suppose the database contains information from five executions, where the same two
metrics were recorded for each:
1. Metric 0, 2, 4, 6 and 8: total number of cycles
– New name for the derived metric. Supply a string that will be used as the column
header for the derived metric. If you don’t supply one, the metric will have no
name.
– Formula definition field. In this field the user can define a spreadsheet-like
mathematical formula. This field must be filled. A user can type a formula into
this field, or use the buttons in the Assistance pane below to help insert metric
terms or function templates.
– Metrics. This is used to find the ID of a metric. For instance, in this snapshot,
the metric WALLCLOCK has the ID 2. By clicking the button Insert metric,
the metric ID will be inserted in the formula definition field. A metric may refer to
the value at an individual node in the calling context tree (point-wise) or the
value at the root of the calling context tree (aggregate).
– Functions. This is to guide the user who wants to insert functions in the formula
definition field. Some functions require only one metric as the argument, but
some can have two or more arguments. For instance, the function avg() which
computes the average of some metrics, needs at least two arguments.
• Advanced options:
– Augment metric value display with a percentage relative to column total. When
this box is checked, each scope’s derived metric value will be augmented with a
percentage value, which for scope s is computed as 100 * (s's derived metric
value) / (the derived metric value computed by applying the metric formula
to the aggregate values of the input metrics for the entire execution). Such a
computation can lead to nonsensical results for some derived metric formulae.
For instance, if the derived metric is computed as a ratio of two other metrics,
the aforementioned computation that compares the scope’s ratio with the ratio
for the entire program won’t yield a meaningful result. To avoid a confusing
metric display, think before you use this button to annotate a metric with its
percent of total.
– Default format. This option will display the metric value using scientific notation
with three digits of precision, which is the default format.
– Display metric value as percent. This option will display the metric value for-
matted as a percent with two decimal digits. For instance, if the metric has a
value 12.3415678, with this option, it will be displayed as 12.34%.
– Custom format. This option will present the metric value with your customized
format. The format is equivalent to Java’s Formatter class, or similar to C’s printf
format. For example, the format ”%6.2f” will display six digit floating-points
with two digits to the right of the decimal point.
Note that the entered formula and the metric name will be stored automatically. One
can review the formula (or metric name) again by clicking the small triangle of the
combo box.
Figure 10.7: Plot graph view of a procedure in GAMESS MPI+OpenMP application
showing an imbalance where a group of execution contexts performs far more GPU operations
than others.
To create a graph, first select a scope in the Top-down View; in Figure 10.7, the
procedure gpu_tdhf_apb_j06_pppp is selected. Then, click the graph button to show the
associated sub-menus. At the bottom of the sub-menu is a list of metrics that hpcviewer
can graph. Each metric contains a sub-menu that lists the three different types of graphs
hpcviewer can plot.
• Plot graph. This standard graph plots metric values ordered by their execution
context.
• Sorted plot graph. This graph plots metric values in ascending order.
• Histogram graph. This graph is a histogram of metric values. It divides the range
of metric values into a small number of sub-ranges. The graph plots the frequency
that a metric value falls into a particular sub-range.
Note that the plot graph’s execution context have the following notation:
<process_id> . <thread_id>
Hence, if the ranks are 0.0, 0.1, . . . 31.0, 31.1 it means MPI process 0 has two threads:
thread 0 and thread 1 (similarly with MPI process 31).
Currently, it is only possible to generate scatter plots for metrics directly collected by
hpcrun, which excludes derived metrics created within hpcviewer.
Figure 10.8: A snapshot of a thread filter dialog. Users can refine the list of threads using
regular expression by selecting the Regular expression checkbox.
Figure 10.9: Example of a Thread View which displays thread-level metrics of a set of
threads. The first column is a CCT equivalent to the CCT in the Top-down View; the
second and third columns represent the metrics of the selected threads (in this case the
sum of metrics from threads 0.1 to 7.1).
hpcviewer provides filtering to elide nodes that match a filter pattern. hpcviewer allows
users to define multiple filters; each filter is associated with a glob pattern1 and a type.
There are three types of filter: “self only” to omit matched nodes, “descendants only” to
exclude only the subtree of the matched nodes, and “self and descendants” to remove
matched nodes and their descendants.
1 A glob pattern specifies which names to remove using wildcard characters such as *, ? and +.
Figure 10.10: Different results of filtering on node C from Figure 10.10a (the original
CCT tree). Figure 10.10b shows the result of the self only filter: node C is elided, its
children (nodes D and E) are attached to node C's parent, and node C's exclusive cost is
augmented to node A. Figure 10.10c shows the result of the descendants only filter, and
Figure 10.10d shows the result of the self and descendants filter. Each node is annotated
with two boxes on its right: the left box represents the node's inclusive cost, while the
right box represents its exclusive cost.
Self only : This filter is useful for hiding intermediary runtime functions such as pthread
or OpenMP runtime functions. All nodes that match the filter pattern are removed, and
their children are attached to the parent of the elided nodes. The exclusive cost of the
elided nodes is also added to the exclusive cost of their parent. Figure 10.10b shows the
result of filtering node C of the CCT from Figure 10.10a. After filtering, node C is elided
and its exclusive cost is added to the exclusive cost of its parent (node A). The children
of node C (nodes D and E) become children of node A.
Descendants only : This filter elides only the subtree of the matched node; the
matched node itself is not removed. A common use of this filter is to exclude call
chains below MPI functions. As shown in Figure 10.10c, filtering node C causes nodes D and
E to be elided, with their exclusive cost added to node C.
Self and descendants : This filter elides both the matched node and its subtree. This
type is useful for excluding unnecessary details such as glibc or malloc functions. Fig-
ure 10.10d shows that filtering node C elides the node and its children (nodes D and E).
The total exclusive cost of the elided nodes is added to the exclusive cost of
node A.
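For example, a filter with the glob pattern pthread_* and the self only type (a hypothetical
configuration) would splice frames such as pthread_mutex_lock out of the CCT, reattach their
children to their callers, and charge the elided frames' exclusive costs to their callers.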
The filter feature can be accessed by clicking the "Filter" menu and then the "Show filter
property" submenu, which shows a Filter property window (Figure 10.11). The window
consists of a table of filters and a group of action buttons: add to create a new filter,
edit to modify the selected filter, and delete to remove the selected filters. The table
comprises two columns: the left column displays a checkbox indicating whether the filter
is enabled, together with the filter's glob pattern; the right column shows the type of
the filter (self only, descendants only, or self and descendants). If a checkbox is
checked, the filter is enabled; otherwise the filter is disabled.
Caution is needed when using the filter feature since it can change the shape of the tree
and thus affect the interpretation of performance analyses. Furthermore, if the filtered
nodes are children of "fake" procedures (such as <program root> and <thread root>),
the exclusive metrics in the Bottom-up and Flat views can be misleading because these
views do not show "fake" procedures.
• Find. To search for a string in the current source pane, <ctrl>-f (Linux and
Windows) or <command>-f (Mac) will bring up a find dialog that enables you to
enter the target string.
Figure 10.12: Logical view of trace call path samples in three dimensions: time, execution
context (rank/thread/GPU), and call path depth.
Trace view can interactively present a large-scale execution trace without concern for the
scale of parallelism it represents.
To collect a trace for a program execution, one must instruct HPCToolkit’s mea-
surement system to collect a trace. When launching a dynamically-linked executable with
hpcrun, add the -t flag to enable tracing. When launching a statically-linked executable,
set the environment variable HPCRUN_TRACE=1 to enable tracing. When collecting a trace,
one must also specify a metric to measure. The best way to collect a useful trace is to
asynchronously sample the execution with a time-based metric such as REALTIME, CYCLES,
or CPUTIME.
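For instance, the following sketch shows both approaches for a hypothetical application
app. For a dynamically-linked executable:

[<mpi-launcher>] hpcrun -t -e CPUTIME app [app-arguments]

For a statically-linked executable built with hpclink:

export HPCRUN_TRACE=1
export HPCRUN_EVENT_LIST="CPUTIME"
[<mpi-launcher>] app [app-arguments]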
As shown in Figure 10.12, call path traces consist of data in three dimensions: profile
(process/thread rank), time, and call path depth. A crosshair in Trace view is defined by a
triplet (p, t, d) where p is the selected process/thread rank, t is the selected time, and d is
the selected call path depth.
Trace view renders a view of processes and threads over time. The Depth View (Sec-
tion 10.9.2) shows the call path depth over time for the thread selected by the cursor. Trace
view’s Call Stack View (Section 10.9.4) shows the call path associated with the thread and
time pair specified by the cursor. Each of these views plays a role in understanding an
application's performance.
In Trace view, each procedure is assigned a specific color. Figure 10.12 shows that at
depth 1 each call path has the same color: blue. This node represents the main program
that serves as the root of the call chain in all processes at all times. At depth 2, all processes
have a green node, which indicates another procedure. At depth 3, in the first time step all
processes have a yellow node; in subsequent time steps they have purple nodes. This might
indicate that the processes first are observed in an initialization procedure (represented by
yellow) and later observed in a solve procedure (represented by purple). The pattern of
colors that appears in a particular depth slice of the Main View enables a user to visually
identify inefficiencies such as load imbalance and serialization.

Figure 10.13: A screenshot of hpcviewer's Trace view.

Figure 10.14: A screenshot of hpcviewer's Trace view showing the Summary View and
Statistics View.
Figures 10.13 and 10.14 show screenshots of Trace view's capabilities in presenting
call path traces. Figure 10.13 highlights Trace view's four principal window panes: Main
View, Depth View, Call Stack View, and Mini Map View, while Figure 10.14 shows two
additional window panes: Summary View and Statistics View.
• Main View (top, left pane): This is Trace view’s primary view. This view, which
is similar to a conventional process/time (or space/time) view, shows time on the
horizontal axis and process (or thread) rank on the vertical axis; time moves from
left to right. Compared to typical process/time views, there is one key difference.
To show call path hierarchy, the view is actually a user-controllable slice of the
process/time/call-path space. Given a call path depth, the view shows the color
of the currently active procedure at a given time and process rank. (If the requested
depth is deeper than a particular call path, then Trace view simply displays the deep-
est procedure frame and, space permitting, overlays an annotation indicating the fact
that this frame represents a shallower depth.)
Trace view assigns colors to procedures based on (static) source code procedures.
Although the color assignment is currently random, it is consistent across the different
views. Thus, the same color within the Trace and Depth Views refers to the same
procedure.
The Trace View has a white crosshair that represents a selected point in time and
process space. For this selected point, the Call Path View shows the corresponding
call path. The Depth View shows the selected process.
• Depth View (tab in bottom, left pane): This is a call-path/time view for the process
rank selected by the Main View’s crosshair. Given a process rank, the view shows for
each virtual time along the horizontal axis a stylized call path along the vertical axis,
where ‘main’ is at the top and leaves (samples) are at the bottom. In other words,
this view shows for the whole time range, in qualitative fashion, what the Call Path
View shows for a selected point. The horizontal time axis is exactly aligned with the
Trace View’s time axis; and the colors are consistent across both views. This view has
its own crosshair that corresponds to the currently selected time and call path depth.
• Summary View (tab in bottom, left pane): This view shows, across the whole time range
displayed, the proportion of each procedure at each point in time. As with the Depth
View, the time range shown in the Summary View matches the time range shown in the
Trace view.
• Call Stack View (tab in top, right pane): This view shows two things: (1) the
current call path depth that defines the hierarchical slice shown in the Trace View;
and (2) the actual call path for the point selected by the Trace View’s crosshair.
(To easily coordinate the call path depth value with the call path, the Call Path
View currently suppresses details such as loop structure and call sites; we may use
indentation or other techniques to display this in the future.)
• Statistics View (tab in top, right pane): This view shows the list of procedures active
in the space-time region shown in the Trace View at the current Call Path Depth.
Each procedure’s percentage in the Statistics View indicates the percentage of pixels
in the Trace View pane that are filled with this procedure’s color at the current Call
Path Depth. When the Trace View is navigated to show a new time-space interval or
the Call Path Depth is changed, the statistics view will update its list of procedures
and the percentage of execution time to reflect the new space-time interval or depth
selection.
• GPU Idleness Blame View (tab in top, right pane): This view shows the list of
procedures that cause the GPU idleness displayed in the trace view. If the trace view
displays one CPU thread and multiple GPU streams, then the CPU thread is blamed
for the idleness of those GPU streams. If the view contains more than one CPU thread
and multiple GPU streams, then the cost of idleness is shared among the CPU threads.
• Mini Map View (right, bottom): The Mini Map shows, relative to the process/time
dimensions, the portion of the execution shown by the Trace View. The Mini Map
enables one to zoom and to move from one close-up to another quickly.
• Home : Reset the view configuration to the original view, i.e., viewing traces
for all times and processes.
• Horizontal zoom in / out : Zoom in or out along the time dimension of the traces.
• Vertical zoom in / out : Zoom in or out along the process dimension of the traces.
• Undo : Cancel the last zoom or navigation action and return to the previous view
configuration.
At the top of an execution’s Main View pane is some information about the data shown
in the pane.
• Time Range. The time interval shown along the horizontal dimension.
• Cross Hair. The crosshair indicates the current cursor position in the time and
execution-context dimensions.
10.9.2 Depth View
Depth View shows all the call paths for a certain time range [t1 , t2 ] = {t | t1 ≤ t ≤ t2 }
in a specified process rank p. The content of Depth View is always consistent with the
position of the crosshair in Main View. For instance, once the user clicks on process p at
time t while the current call path depth is d, the Depth View's content is updated to
display all the call paths of process p, with its crosshair placed at time t and call path
depth d.
On the other hand, any user action such as crosshair or time range selection in Depth
View updates the content of Main View. Similarly, selecting a new call path depth in
Call Stack View updates the crosshair position in Depth View.
In Depth View a user can specify a new crosshair time and a new time range.
Specifying a new crosshair time. Selecting a new crosshair time t can be performed
by clicking a pixel within Depth View. This will update the crosshair in Main View and
the call path in Call Stack View.
Selecting a new time range. Selecting a new time range [tm , tn ] = {t | tm ≤ t ≤ tn }
is performed by clicking the position of tm and dragging the cursor to the position of tn .
The content of Depth View and Main View is then updated. Note that this action will
not update the call path in Call Stack View since it does not change the position of the
crosshair.
10.10 Menus
hpcviewer provides four main menus:
10.10.1 File
This menu includes several menu items for controlling basic viewer operations.
• New window Open a new hpcviewer window that is independent of the existing
one. Note, however, that the CCT node filtering operation (Section 10.7) affects all
hpcviewer windows.
• Open database Open a database without replacing the existing one. This menu item
can be used to compare two databases. Currently hpcviewer restricts the number of
open databases to a maximum of two at a time.
• Switch database Load a performance database into the current hpcviewer window,
replacing the currently opened databases.
• Merge databases Merge the two databases that are currently open in the viewer. At the
moment hpcviewer doesn't support storing a merged database in a file.
• Preferences Display the settings dialog box, which consists of three sections.
10.10.2 Filter
This menu contains two items:
• Filter CCT nodes Open a filter property window, which lists a set of filters and their
properties (Section 10.7).
• Filter execution contexts (Trace view only) Open a window for selecting which
execution contexts will be hidden. Note that filtering CCT nodes affects only the Profile
view; it does not affect the Trace view.
Figure 10.15: Procedure-color mapping dialog box. This window shows that any procedure
names matching the "MPI*" pattern are assigned the color red, while procedures matching
the "PMPI*" pattern are assigned the color black.
10.10.3 View
This menu is visible only after at least one database has been loaded; it is hidden by
default. All actions in this menu are intended primarily for tool developer use.
• Show metrics (Profile view only) Display a list of (metric name, metric name descrip-
tion) pairs in a window. For GPU metrics, the descriptions are useful for explaining
what the short and somewhat cryptic metric names mean. From this window, you
can use the edit button to modify the name of the selected metric. When editing a
derived metric, the metric editor will allow you to modify the formula for the metric
in addition to the name. Once you modify a metric and exit this window by selecting
the OK button, the metric pane will refresh the display of any metrics whose name
or formula was modified.
• Show color mapping (Trace view only) Open a window that shows the customized
mapping between procedure patterns and colors (Figure 10.15). Trace view allows
users to assign a specific color to procedure names matching a given pattern.
10.10.4 Help
This menu displays information about the viewer. The menu contains only one menu
item:
• About. Displays brief information about the viewer, including JVM and Eclipse
variables, and error log files.
10.11 Limitations
Some important hpcviewer limitations are listed below:
• Dark theme on Linux platforms. We have received reports that hpcviewer is hard
to read on Linux with a dark theme. Support for dark themes on Linux is still a work
in progress.
• Linux TWM window manager is not supported. Reason: this window manager
is too ancient.
Chapter 11
Known Issues
This section lists some known issues and potential workarounds. Other known issues
can be found on the project's GitLab issues pages.
11.1 When using Intel GPUs, hpcrun may alter program behavior when using
instruction-level performance measurement
Description: Binary instrumentation on Intel GPUs uses Intel's GTPin. For some pro-
grams, instruction counting, latency instrumentation, and/or SIMD instrumentation with
GTPin has been observed to affect program behavior in undesirable ways, e.g. chang-
ing some program floating point values to NaNs. Testing has confirmed that this is a GTPin
issue rather than an hpcrun issue. Unfortunately, GTPin is closed source, so this problem
awaits a resolution by Intel.
11.2 When using Intel GPUs, hpcrun may report that sub-
stantial time is spent in a partial call path consisting of
only an unknown procedure
Description: Binary instrumentation on Intel GPUs uses Intel’s GTPin. GTPin runs
in its own private namespace. Asynchronous samples collected in response to Linux timer
or hardware counter events may often occur when GTPin is executing. With GTPin in a
private namespace, its code and symbols are invisible to hpcrun, which causes a degenerate
unwind consisting of only an unknown procedure.
Workaround: Don’t collect Linux timer or hardware counter events on the CPU when us-
ing binary instrumentation to collect instruction-level performance measurements of kernels
executing on Intel GPUs.
Development Plan: A future version of HPCToolkit will recognize that these unwinds
are indeed full call paths and attribute them as such.
measurement subsystem using -e PAPI_TOT_CYC. Of course, one can configure PAPI to
measure other hardware events, such as graduated instructions and cache misses.
Development Plan: Identify why the use of the Linux perf_events subsystem by the
LDMS syspapi sampler conflicts with the direct use of Linux perf_events by HPCToolkit
and the Linux perf tool, but not with the use of Linux perf_events by PAPI.
Workaround: In our experience, the hpcrun files in the measurement directory for the
daemon tagged with rank 0, thread 0 are very small. In experiments we ran, they were
about 2K. You can remove these profiles and their matching trace files before processing a
measurement database with hpcprof. The correspondence between a profile and its trace
can be determined because their names differ only in suffix (hpcrun vs. hpctrace).
11.7 A confusing label for GPU theoretical occupancy
Sum over rank/thread of exclusive 'GPU kernel: theoretical occupancy
(FGP_ACT / FGP_MAX)'
The metric is computed correctly by summing the fine-grain parallelism used in each
kernel launch across all threads and ranks, dividing it by the sum of the maximum fine-
grain parallelism available to each kernel launch across all threads and ranks, and
presenting the value as a percent.
Explanation: This metric is unlike others computed by HPCToolkit. Rather than being
computed by hpcprof, it is computed by having hpcviewer interpret a formula.
Workaround: Pay attention to the metric value, which is computed correctly, and ignore
its awkward label.
Development Plan: Add additional support to hpcrun and hpcprof to understand how
derived metrics are computed and avoid spoiling their labels.
11.8 Deadlock when using Darshan
Description: Darshan is a library for monitoring POSIX I/O. When using asynchronous
sampling on the CPU to monitor a program that is also being monitored with Darshan,
your program may deadlock.
Explanation: Darshan hijacks calls to open. HPCToolkit uses the libunwind library.
Under certain circumstances, libunwind uses open to inspect an application's executable
or one of the shared libraries it uses to look for unwinding information recorded by the
compiler. The following sequence of actions leads to a problem:
1. A user application calls malloc and acquires a mutex lock on an allocator data struc-
ture.
2. While the lock is held, an asynchronous sample is delivered to the thread.
3. HPCToolkit's signal handler invokes libunwind to unwind the call stack.
4. libunwind calls open to inspect the application's executable or one of its shared
libraries for unwinding information.
5. Darshan's wrapper intercepts the call to open.
6. The Darshan wrapper for open may try to allocate data to record statistics for the ap-
plication's calls to open, deadlocking because a non-reentrant allocator lock is already
held by this thread.
Workaround: Unload the Darshan module before compiling a statically-linked applica-
tion or running a dynamically-linked application.
Development Plan: Ensure that libunwind’s calls to open are never intercepted by
Darshan.
Chapter 12
if(NOT CMAKE_CXX_LINK_EXECUTABLE)
set(CMAKE_CXX_LINK_EXECUTABLE
"<CMAKE_CXX_COMPILER> <FLAGS> <CMAKE_CXX_LINK_FLAGS> <LINK_FLAGS>
<OBJECTS> -o <TARGET> <LINK_LIBRARIES>")
endif()
As the rule shows, by default, the C++ compiler is used to link C++ executables. One way
to change this is to override the definition for CMAKE_CXX_LINK_EXECUTABLE on the cmake
command line so that it includes the necessary hpclink prefix, as shown below:
cmake srcdir ... \
-DCMAKE_CXX_LINK_EXECUTABLE="hpclink <CMAKE_CXX_COMPILER> \
<FLAGS> <CMAKE_CXX_LINK_FLAGS> <LINK_FLAGS> <OBJECTS> -o <TARGET> \
<LINK_LIBRARIES>" ...
If your project has executables linked with a C or Fortran compiler, you will need analogous
redefinitions for CMAKE_C_LINK_EXECUTABLE or CMAKE_Fortran_LINK_EXECUTABLE as well.
Rather than adding the redefinitions of these linker rules to the cmake command line,
you may find it more convenient to add definitions of these rules to your CMakeLists.txt
file.
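As a sketch, prefixing the default rule shown earlier with hpclink inside CMakeLists.txt
might look like this:

set(CMAKE_CXX_LINK_EXECUTABLE
    "hpclink <CMAKE_CXX_COMPILER> <FLAGS> <CMAKE_CXX_LINK_FLAGS> <LINK_FLAGS> <OBJECTS> -o <TARGET> <LINK_LIBRARIES>")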
Measuring application performance with HPCToolkit's measurement subsystem and gprof
instrumentation active in the same execution may cause the execution to abort. One can
detect the presence of gprof instrumentation in an application by the presence of the
__monstartup and _mcleanup symbols in an executable. You can recompile your code
without the -pg compiler flag and measure again. Alternatively, you can use the
--disable-gprof argument to hpcrun or hpclink to disable gprof instrumentation while
measuring performance with HPCToolkit.
To cope with gprof instrumentation in dynamically-linked programs, you can use
hpcrun's --disable-gprof option.
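For instance, one can check a binary for these symbols with nm (the executable name app
is a placeholder):

nm app | grep -E '__monstartup|_mcleanup'

If either symbol appears in the output, the binary contains gprof instrumentation.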
1. cd /etc/modprobe.d
2. grep NVreg_RestrictProfilingToAdminUsers *
Generally, if non-root user access to GPU performance counters is enabled, the grep com-
mand above should yield a line that contains NVreg RestrictProfilingToAdminUsers=0.
Note: if you are on a cluster, access to GPU performance counters may be disabled on a
login node, but enabled on a compute node. You should run an interactive job on a compute
node and perform the checks there.
If access to GPU hardware performance counters is not enabled, one option you have
is to use hpcrun without PC sampling, i.e., with the -e gpu=nvidia option instead of -e
gpu=nvidia,pc.
If PC sampling is a must, you have two options:
1. Run the tool or application being profiled with administrative privileges. On Linux,
launch HPCToolkit with sudo or as a user with the CAP_SYS_ADMIN capability set.
2. Have a system administrator enable access to the NVIDIA performance counters using
the instructions on the following web page: https://developer.nvidia.com/ERR_
NVGPUCTRPERM.
12.3.5 Avoiding the error CUPTI ERROR HARDWARE BUSY
When trying to use PC sampling to measure computation on an NVIDIA GPU, you
may encounter the following error: ‘function cuptiActivityConfigurePCSampling failed
with error CUPTI ERROR HARDWARE BUSY’.
For all versions of CUDA to date (through CUDA 11), NVIDIA's CUPTI library supports
PC sampling for only one process per GPU. If multiple MPI ranks in your application
run CUDA on the same GPU, you may see this error.2
You have two alternatives:
1. Measure the execution in which multiple MPI ranks share a GPU using only -e
gpu=nvidia without PC sampling.
2. Launch your program so that there is only a single MPI rank per GPU.
(a) jsrun advice: if using -g1 for a resource set, don’t use anything other than -a1.
For reasonable accuracy (±5%), there should be at least 20 samples in each context
that is important with respect to performance. Since unimportant contexts are irrelevant to
performance, as long as this condition is met (and as long as samples are not correlated, etc.),
HPCToolkit’s performance data should be accurate enough to guide program tuning.
We typically recommend targeting a frequency of hundreds of samples per second. For
very short runs, you may need to collect thousands of samples per second to record an
adequate number of samples. For long runs, tens of samples per second may suffice for
performance diagnosis.
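As a concrete illustration (numbers chosen only for this example): sampling at 200 samples
per second during a 100-second execution yields 20,000 samples per thread, so any context
accounting for at least 0.1% of execution time would be expected to receive the 20 samples
suggested above.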
Choosing sampling periods for some events, such as Linux timers, cycles and instruc-
tions, is easy given a target sampling frequency. Choosing sampling periods for other
events such as cache misses is harder. In principle, an architectural expert can easily derive
reasonable sampling periods by working backwards from (a) a maximum target sampling
frequency and (b) hardware resource saturation points. In practice, this may require some
experimentation.
See also the hpcrun man page.
• Analyze the execution using information from partial unwinds. Often knowing several
levels of calling context is enough for analysis without full calling context for sample
events.
• Recompile the binary or shared library causing the problem and add -g to the list of
its compiler options.
• Your sampling frequency is too high. Recall that the goal is to obtain a representative
set of performance data. For this, we typically recommend targeting a frequency of
hundreds of samples per second. For very short runs, you may need to try thousands
of samples per second. For very long runs, tens of samples per second can be quite
reasonable. See also Section 12.4.1.
• hpcrun has a problem unwinding. This causes overhead in two forms. First, hpcrun
will resort to more expensive unwind heuristics and possibly have to recover from
self-generated segmentation faults. Second, when these exceptional behaviors occur,
hpcrun writes some information to a log file. In the context of a parallel application
and overloaded parallel file system, this can perturb the execution significantly. To
diagnose this, execute the following command and look for “Errant Samples”:
• You have very long call paths where long is in the hundreds or thousands. On x86-
based architectures, try additionally using hpcrun’s RETCNT event. This has two
effects: It causes hpcrun to collect function return counts and to memoize common
unwind prefixes between samples.
• Currently, on very large runs the process of writing profile data can take a long
time. However, because this occurs after the application has finished executing, it is
relatively benign overhead. (We plan to address this issue in a future release.)
Use the --disable-gprof argument to hpcrun or hpclink to disable gprof instrumentation
while measuring performance with HPCToolkit.
12.6.3 Mac only: hpcviewer runs on Java X instead of Java 11
hpcviewer has mainly been tested on Java 11. If you are running a version of Java older
than Java 11 or newer than Java 17, obtain a version of Java 11 or 17 from
https://adoptopenjdk.net or https://adoptium.net/.
If your system has multiple versions of Java and Java 11 is not the newest version, you
need to set Java 11 as the default JVM; on MacOS, you need to exclude older Java
versions. A typical hpcviewer.ini file resembles the following:
-startup
plugins/org.eclipse.equinox.launcher_1.6.200.v20210416-2027.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.2.200.v20210429-1609
-clearPersistedState
-vmargs
-Xmx2048m
-Dosgi.locking=none
You can decrease the maximum size of the Java heap from 2048MB to 1GB by changing
the Xmx specification in the hpcviewer.ini file as follows:
-Xmx1024m
12.6.6 hpcviewer fails due to a java.lang.OutOfMemoryError exception.
If you see this error, the memory footprint that hpcviewer needs to store the data and
metrics for your measured program execution exceeds the maximum size for the Java heap
specified at program launch. On Linux, hpcviewer accepts a command-line option
--java-heap that enables you to specify a larger non-default value for the maximum size
of the Java heap. Run hpcviewer --help for the details of how to use this option.
12.6.7 hpcviewer writes a long list of Java error messages to the terminal!
The Eclipse Java framework that serves as the foundation for hpcviewer can be some-
what temperamental. If the persistent state maintained by Eclipse for hpcviewer gets
corrupted, hpcviewer may spew a list of errors deep within call chains of the Eclipse frame-
work.
On MacOS and Linux, try removing your hpcviewer Eclipse workspace at its default
location:
$HOME/.hpctoolkit/hpcviewer
and run hpcviewer again.
We generally recommend adding optimization options after debugging options — e.g., ‘-g
-O2’ — to minimize any potential effects of adding debugging information.4 Also, be careful
not to strip the binary as that would remove the debugging information. (Adding debugging
information to a binary does not make a program run slower; likewise, stripping a binary
does not make a program run faster.)
4 In general, debugging information is compatible with compiler optimization. However, in a few cases,
compiling with debugging information will disable some optimization. We recommend placing optimization
options after debugging options because compilers usually resolve option incompatibilities in favor of the
last option.
Please note that at high optimization levels, a compiler may make significant program
transformations that do not cleanly map to line numbers in the original source code. Even
so, the performance attribution is usually very informative.
These options instruct hpcprof/mpi to search for source files that live within any of the
source directories <dir1> through <dirN>. Each directory argument can be either absolute
or relative to the current working directory.
It will be instructive to unpack the rationale behind this recommendation. hpcprof/mpi
obtains source file names from your application binary’s debugging information. These
source file paths may be either absolute or relative. Without any -I/--include options,
hpcprof/mpi can find source files that either (1) have absolute paths (and that still exist on
the file system) or (2) are relative to the current working directory. However, because the
nature of these paths depends on your compiler and the way you built your application, it
is not wise to depend on either of these default path resolution techniques. For this reason,
we always recommend supplying at least one -I/--include option.
There are two basic forms in which the search directory can be specified: non-recursive
and recursive. In most cases, the most useful form is the recursive search directory, which
means that the directory should be searched along with all of its descendants. A non-
recursive search directory dir is simply specified as dir. A recursive search directory dir is
specified as the base search directory followed by the special suffix ‘/+’: dir/+. The paths
above use the recursive form.
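For example (the directory and measurement-database names are hypothetical), the following
invocation searches an application's source tree recursively:

hpcprof -I /home/user/myapp/src/+ hpctoolkit-myapp-measurements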
• If a source file path is absolute and the source file can be found on the file system,
then hpcprof/mpi will find it.
• If a source file path is relative, hpcprof/mpi can only find it if the source file can be
found from the current working directory or within a search directory (specified with
the -I/--include option).
• Finally, if a source file path is absolute and cannot be found by its absolute path,
hpcprof/mpi uses a special search mode. Let the source file path be p/f . If the
path’s base file name f is found within a search directory, then that is considered a
match. This special search mode accommodates common complexities such as: (1)
source file paths that are relative not to your source code tree but to the directory
where the source was compiled; (2) source file paths to source code that is later moved;
and (3) source file paths that are relative to file system that is no longer mounted.
5 Having a system administrator download the associated devel package for a library can enable visibility
into the source code of system libraries.
Note that given a source file path p/f (where p may be relative or absolute), it may be the
case that there are multiple instances of a file's base name f within one search directory,
e.g., p1/f through pn/f, where pi refers to the i-th path to f. Similarly, with multiple search-
directory arguments, f may exist within more than one search directory. If this is the case,
the source file p/f is resolved to the first instance p′/f such that p′ best corresponds to p,
where instances are ordered by the order of search directories on the command line.
For any functions whose source code is not found (such as functions within system
libraries), hpcviewer will generate a synopsis that shows the presence of the function and
its line extents (if known).
12.6.13 hpcviewer claims that there are several calls to a function within
a particular source code scope, but my source code only has one!
Why?
In the course of code optimization, compilers often replicate code blocks. For instance,
as it generates code, a compiler may peel iterations from a loop or split the iteration space of
a loop into two or more loops. In such cases, one call in the source code may be transformed
into multiple distinct calls that reside at different code addresses in the executable.
When analyzing applications at the binary level, it is difficult to determine whether two
distinct calls to the same function that appear in the machine code were derived from the
same call in the source code. Even if both calls map to the same source line, it may be
wrong to coalesce them; the source code might contain multiple calls to the same function on
the same line. By design, HPCToolkit does not attempt to coalesce distinct calls to the
same function because it might be incorrect to do so; instead, it independently reports each
call site that appears in the machine code. If the compiler duplicated calls as it replicated
code during optimization, multiple call sites may be reported by hpcviewer when only one
appeared in the source code.
12.6.14 Trace view shows lots of white space on the left. Why?
At startup, Trace view renders traces for the time interval between the minimum and
maximum times recorded for any process or thread in the execution. The minimum time for
each process or thread is recorded when its trace file is opened as HPCToolkit’s monitoring
facilities are initialized at the beginning of its execution. The maximum time for a process
or thread is recorded when the process or thread is finalized and its trace file is closed.
When an application uses the hpctoolkit_start and hpctoolkit_stop primitives, the
minimum and maximum time recorded for a process/thread are at the beginning and end of
its execution, which may be distant from the start/stop interval. This can cause significant
white space to appear in Trace view’s display to the left and right of the region (or regions)
of interest demarcated in an execution by start/stop calls.
12.7 Debugging
12.7.1 How do I debug HPCToolkit’s measurement?
Assume you want to debug HPCToolkit’s measurement subsystem when collecting
measurements for an application named app.
<externals-install>/libmonitor/bin
export MONITOR_DEBUG=1
[<mpi-launcher>] app [app-arguments]
The difference between the two is that the former uses the --event NONE or HPCRUN_EVENT_LIST="NONE"
option (shown below) whereas the latter does not (which enables the default CPUTIME
sample source). With this in mind, to collect a debug trace for either of these levels, use
commands similar to the following:
[<mpi-launcher>] \
hpcrun --monitor-debug --dynamic-debug ALL --event NONE \
app [app-arguments]
export MONITOR_DEBUG=1
export HPCRUN_EVENT_LIST="NONE"
export HPCRUN_DEBUG_FLAGS="ALL"
[<mpi-launcher>] app [app-arguments]
Note that the *debug* flags are optional. The --monitor-debug/MONITOR_DEBUG flag
enables libmonitor tracing. The --dynamic-debug/HPCRUN_DEBUG_FLAGS flag enables
hpcrun tracing.
There are two ways to use a debugger with hpcrun. To attach a debugger when monitoring
an application using hpcrun, add hpcrun's --debug option. To debug hpcrun itself with a
debugger, use the following approach.
1. Set up the environment and launch your application, for example:
export HPCRUN_WAIT=1
export HPCRUN_EVENT_LIST="... the metric(s) you want to measure ..."
app [app-arguments]
or
export HPCRUN_WAIT=1
export HPCRUN_EVENT_LIST="REALTIME@0"
app [app-arguments]
2. Attach a debugger. The debugger should be spinning in a loop whose exit is condi-
tioned by the HPCRUN_DEBUGGER_WAIT variable.
3. Set any desired breakpoints. To send a sampling signal at a particular point, make
sure to stop at that point with a one-time or temporary breakpoint (tbreak in GDB).
5. To raise a controlled sampling signal, raise a SIGPROF, e.g., using GDB’s command
signal SIGPROF.
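A minimal GDB sketch of these steps follows; the breakpoint location is hypothetical, and
treating HPCRUN_DEBUGGER_WAIT as a settable variable inside hpcrun is an assumption based
on the description above:

gdb -p <application-pid>
(gdb) set var HPCRUN_DEBUGGER_WAIT = 0
(gdb) tbreak solver_kernel
(gdb) continue
(gdb) signal SIGPROF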
Appendix A
Environment Variables
HPCToolkit’s measurement subsystem decides what and how to measure using infor-
mation it obtains from environment variables. This chapter describes all of the environment
variables that control HPCToolkit’s measurement subsystem.
When using HPCToolkit’s hpcrun script to measure the performance of dynamically-
linked executables, hpcrun takes information passed to it in command-line arguments and
communicates it to HPCToolkit’s measurement subsystem by appropriately setting envi-
ronment variables. To measure statically-linked executables, one first adds HPCToolkit’s
measurement subsystem to a binary as it is linked by using HPCToolkit’s hpclink script.
Prior to launching a statically-linked binary that includes HPCToolkit’s measurement
subsystem, a user must manually set environment variables.
Section A.1 describes environment variables of interest to users. Section A.3 describes
environment variables designed for use by HPCToolkit developers. In some cases, HPC-
Toolkit’s developers will ask a user to set some of the environment variables described in
Section A.3 to generate a detailed error report when problems arise.
• If you launch the hpcrun script via a file system link, you must set the HPCTOOLKIT
environment variable to HPCToolkit’s top-level installation directory.
• Some parallel job launchers (e.g., Cray’s aprun) may copy the hpcrun script to a
different location. If this is the case, you will need to set the HPCTOOLKIT environment
variable to HPCToolkit’s top-level installation directory.
HPCRUN_EVENT_LIST. This environment variable is used to provide a set of (event,
period) pairs that configure HPCToolkit's measurement subsystem to perform
asynchronous sampling. The HPCRUN_EVENT_LIST environment variable must be set;
otherwise HPCToolkit's measurement subsystem will terminate execution. If an
application should run with sampling disabled, HPCRUN_EVENT_LIST should be set to
NONE. Otherwise, HPCToolkit's measurement subsystem expects an event list of the form
shown below.
event1[@period1]; ...; eventN[@periodN]
As denoted by the square brackets, periods are optional. The default period is 1 million.
Flags to add an event with hpcrun: -e/--event event1[@period1]
Multiple events may be specified using multiple instances of -e/--event options.
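For example, the following hypothetical settings sample cycles with a period of 4,000,000
and a PAPI cache-miss event with the default period:

export HPCRUN_EVENT_LIST="CYCLES@4000000;PAPI_L2_TCM"

or, equivalently, when launching with hpcrun:

hpcrun -e CYCLES@4000000 -e PAPI_L2_TCM app [app-arguments]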
Table A.1: hpcrun control knob names, default values, and descriptions.

Name                                 Default Value   Description
MAX_COMPLETION_CALLBACK_THREADS      1000            See Note 1.
STREAMS_PER_TRACING_THREAD           4               See Note 2.
HPCRUN_CUDA_DEVICE_BUFFER_SIZE       8388608         See Note 3.
HPCRUN_CUDA_DEVICE_SEMAPHORE_SIZE    65536           See Note 4.
Note 1: OpenCL may execute callbacks on helper threads created by the OpenCL
runtime. This knob specifies the maximum number of helper threads that can be handled
by hpcrun’s OpenCL tracing implementation.
Note 2: GPU stream traces are recorded by tracing threads created by hpcrun. Reducing
the number of streams per hpcrun tracing thread may make monitoring faster, though it
will use more resources.
Note 3: Value used as CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE. See
https://docs.nvidia.com/cuda/cupti/group__CUPTI__ACTIVITY__API.html.
Note 4: Value used as CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE. See
https://docs.nvidia.com/cuda/cupti/group__CUPTI__ACTIVITY__API.html.
HPCRUN_CONTROL_KNOBS. hpcrun has some settings, known as control knobs,
that can be adjusted by a knowledgeable user to tune the operation of hpcrun's measure-
ment subsystem. Names and default values of the control knobs are shown in Table A.1.
Flags to set a control knob for hpcrun: -ck/--control-knob name=setting.
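As a hypothetical example, the following reduces the number of GPU streams handled by
each hpcrun tracing thread while monitoring an NVIDIA GPU:

hpcrun -ck STREAMS_PER_TRACING_THREAD=1 -e gpu=nvidia app [app-arguments]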
quicksort. Setting this environment variable may dramatically increase the size of calling
context trees for applications that employ bushy subtrees of recursive calls.
Flags to retain recursion with hpcrun: -r/--retain-recursion
HPCRUN_AUDIT_DISABLE_PLT_CALL_OPT. By default, hpcrun will use
libc's LD_AUDIT feature to monitor dynamic library operations. The LD_AUDIT facility
has the unfortunate behavior of intercepting each call to a shared library. Each call to a
shared library is dispatched through the Procedure Linkage Table (PLT). We have observed
that allowing the LD_AUDIT facility to intercept each call to a shared library is costly: on
x86_64 we measured a slowdown of 68× for a call to an empty shared library routine.
To avoid this overhead, hpcrun sidesteps LD_AUDIT’s monitoring of a load module’s calls
to a shared library routine by allowing the address of the routine to be cached in the load
module’s Global Offset Table (GOT). The mechanism for this optimization is complex. If
you suspect that this optimization is causing your program to crash, this optimization can
be disabled. If your program is not crashing, don’t even consider adjusting this!
Flag to disable optimization of PLT calls when using LD_AUDIT to monitor shared library
operations with hpcrun: --disable-auditor-got-rewriting.
Caution: turning on debugging flags will typically result in voluminous log messages, which
will dramatically slow measurement of the execution under study.
Flags to set debug flags with hpcrun: -dd/--dynamic-debug flag
known as hpcfnbounds2, was designed to compute the same set of addresses for a load
module using only a lightweight inspection of the load module's symbol table and DWARF
information. hpcfnbounds2 is over a factor of ten faster and uses over a factor of ten less
memory than the original. hpcfnbounds2 is the default. If hpcfnbounds2 delivers an
unsatisfactory result, a user can employ hpcfnbounds instead by setting this environment
variable or by using the --fnbounds command-line argument to hpcrun.