Tracy Profiler
The user manual
https://github.com/wolfpld/tracy
Quick overview
Hello and welcome to the Tracy Profiler user manual! Here you will find all the information you need to
start using the profiler. This manual has the following layout:
• Chapter 1, A quick look at Tracy Profiler, gives a short description of what Tracy is and how it works.
• Chapter 2, First steps, shows how you can integrate the profiler into your application and how to build
the graphical user interface (section 2.3). At this point, you will be able to establish a connection from
the profiler to your application.
• Chapter 3, Client markup, provides information on how to instrument your application, in order to
retrieve useful profiling data. This includes a description of the C API (section 3.13), which enables
usage of Tracy in any programming language.
• Chapter 4, Capturing the data, goes into more detail on how the profiling information can be captured
and stored on disk.
• Chapter 5, Analyzing captured data, guides you through the graphical user interface of the profiler.
• Chapter 6, Exporting zone statistics to CSV, explains how to export some zone timing statistics into a CSV
format.
• Chapter 7, Importing external profiling data, documents how to import data from other profilers.
Quick-start guide
For Tracy to profile your application, you will need to integrate the profiler into your application and run an
independent executable that will act both as a server with which your application will communicate and as a
profiling viewer. The most basic integration looks like this:
• Add the macro ZoneScoped as the first line of your function definitions to include them in the profile.
• Compile and run both your application and the profiler server.
There’s much more Tracy can do, which can be explored by carefully reading this manual. In case
any problems should surface, refer to section 2.1 to ensure you’ve correctly included Tracy in your project.
Additionally, you should refer to section 3 to make sure you are using FrameMark, ZoneScoped, and any other
Tracy constructs correctly.
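The basic integration described above can be sketched as follows. This is a minimal illustration, not code taken from the Tracy repository: the include path tracy/Tracy.hpp is an assumption about how the headers are exposed in your build, the functions SumUpTo and RunFrames are hypothetical, and the fallback macro definitions exist only so that the file still compiles when the header is not found.

```cpp
// Minimal sketch: one instrumented function plus a frame marker.
// Assumption: Tracy's public headers are reachable as "tracy/Tracy.hpp".
// If they are not found, the macros fall back to no-ops so this file still builds.
#if defined(__has_include)
#  if __has_include("tracy/Tracy.hpp")
#    include "tracy/Tracy.hpp"
#  endif
#endif
#ifndef ZoneScoped
#  define ZoneScoped
#endif
#ifndef FrameMark
#  define FrameMark
#endif

// Hypothetical workload: sums the integers 1..n.
int SumUpTo(int n)
{
    ZoneScoped;  // first line of the function: includes this call in the profile
    int total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}

// Hypothetical main loop: runs a few "frames" and marks the end of each one.
int RunFrames(int frames)
{
    ZoneScoped;
    int checksum = 0;
    for (int f = 0; f < frames; ++f)
    {
        checksum += SumUpTo(100);
        FrameMark;  // marks the end of a frame
    }
    return checksum;
}
```

For the zones to actually be recorded, the build still has to define TRACY_ENABLE and compile public/TracyClient.cpp, as described in section 2.1.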
Contents
2 First steps
2.1 Initial client setup
2.1.1 Static library
2.1.2 CMake integration
2.1.3 Meson integration
2.1.4 Short-lived applications
2.1.5 On-demand profiling
2.1.6 Client discovery
2.1.7 Client network interface
2.1.8 Setup for multi-DLL projects
2.1.9 Problematic platforms
2.1.9.1 Microsoft Visual Studio
2.1.9.2 Universal Windows Platform
2.1.9.3 Apple woes
2.1.9.4 Android lunacy
2.1.9.5 Virtual machines
2.1.9.6 Docker on Linux
2.1.10 Troubleshooting
2.1.11 Changing network port
2.1.12 Limitations
2.2 Check your environment
2.2.1 Operating system
2.2.2 CPU design
2.2.2.1 Superscalar out-of-order speculative execution
2.2.2.2 Simultaneous multithreading
2.2.2.3 Turbo mode frequency scaling
2.2.2.4 Power saving
2.2.2.5 AVX offset and power licenses
2.2.2.6 Summing it up
2.3 Building the server
2.3.1 Required libraries
2.3.1.1 Windows
2.3.1.2 Unix
2.3.1.3 Linux
2.3.2 Using an IDE
3 Client markup
3.1 Handling text strings
3.1.1 Program data lifetime
3.1.2 Unique pointers
3.2 Specifying colors
3.3 Marking frames
3.3.1 Secondary frame sets
3.3.2 Discontinuous frames
3.3.3 Frame images
3.3.3.1 OpenGL screen capture code example
3.4 Marking zones
3.4.1 Manual management of zone scope
3.4.2 Multiple zones in one scope
3.4.3 Filtering zones
3.4.4 Transient zones
3.4.5 Variable shadowing
3.4.6 Exiting program from within a zone
3.5 Marking locks
3.5.1 Custom locks
3.6 Plotting data
3.7 Message log
3.7.1 Application information
3.8 Memory profiling
3.8.1 Memory pools
3.9 GPU profiling
3.9.1 OpenGL
3.9.2 Vulkan
3.9.3 Direct3D 11
3.9.4 Direct3D 12
3.9.5 OpenCL
3.9.6 Multiple zones in one scope
3.9.7 Transient GPU zones
3.10 Fibers
3.11 Collecting call stacks
3.11.1 Debugging symbols
3.11.1.1 External libraries
3.11.1.2 Using the dbghelp library on Windows
3.11.1.3 Disabling resolution of inline frames
3.11.1.4 Offline symbol resolution
3.12 Lua support
3.12.1 Call stacks
3.12.2 Instrumentation cleanup
3.13 C API
3.13.1 Setting thread names
3.13.2 Frame markup
8 Configuration files
8.1 Root directory
8.2 Trace specific settings
A License
1.1 Real-time
The concept of Tracy being a real-time profiler may be explained in a couple of different ways:
1. The profiled application is not slowed down by profiling³. The act of recording a profiling event has
virtually zero cost – it only takes a few nanoseconds. Even on low-power mobile devices, there is no
noticeable impact on execution speed.
2. The profiler itself works in real-time, without the need to process collected data in a complex way.
Actually, it is pretty inefficient in how it works because it recalculates the data it presents each frame
anew. And yet, it can run at 60 frames per second.
3. The profiler has full functionality when the profiled application runs and the data is still collected. You
may interact with your application and immediately switch to the profiler when a performance drop
occurs.
1 Direct support is provided for C, C++, and Lua integration. At the same time, third-party bindings to many other languages exist
on the internet, such as Rust, Zig, C#, OCaml, Odin, etc.
2 All major graphic APIs: OpenGL, Vulkan, Direct3D 11/12, OpenCL.
3 See section 1.7 for a benchmark.
Tracy can achieve single-digit nanosecond measurement resolution thanks to its use of hardware timing
mechanisms on the x86 and ARM architectures⁴. Other profilers may rely on the timers provided by the
operating system, which have significantly lower resolution (about 300 ns – 1 µs). Such a coarse timer is
enough to hide the subtle impact of cache access optimization, etc.
[Figure 1: Low precision (300 ns) timer. Discrete timer ticks are marked along the time axis; the range
endpoints appear in the order C1, B1, D1, D2, A1, A2, B2, C2.]
• The 𝐴 and 𝐷 ranges both take a very short amount of time (10 ns), but the 𝐴 range is reported as 300 ns,
and the 𝐷 range is reported as 0 ns.
• The 𝐵 range takes a considerable amount of time (590 ns), but according to the timer readings, it took
the same time (300 ns) as the short-lived 𝐴 range.
• The 𝐶 range (610 ns) is only 20 ns longer than the 𝐵 range, but it is reported as 900 ns, a 600 ns difference!
Here, you can see why using a high-precision timer is essential. While there is no escape from the
measurement errors, a profiler can reduce their impact by increasing the timer accuracy.
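The arithmetic behind the figure can be reproduced with a short sketch. It assumes a timer that rounds every reading down to the most recent 300 ns tick. The begin and end timestamps are hypothetical values chosen to match the durations quoted above; they are not read from the figure.

```cpp
#include <cstdint>

// A coarse timer can only report whole ticks:
// every reading is rounded down to the most recent tick.
constexpr std::int64_t ReadCoarseTimer(std::int64_t trueTimeNs, std::int64_t tickNs)
{
    return (trueTimeNs / tickNs) * tickNs;
}

// Duration as seen through the coarse timer: difference of two rounded readings.
constexpr std::int64_t ReportedNs(std::int64_t beginNs, std::int64_t endNs, std::int64_t tickNs)
{
    return ReadCoarseTimer(endNs, tickNs) - ReadCoarseTimer(beginNs, tickNs);
}

constexpr std::int64_t kTickNs = 300;

// Assumed timestamps in nanoseconds (hypothetical, chosen to fit the quoted durations).
constexpr std::int64_t kReportedA = ReportedNs(595, 605, kTickNs);  // true 10 ns, crosses one tick
constexpr std::int64_t kReportedD = ReportedNs(310, 320, kTickNs);  // true 10 ns, crosses no tick
constexpr std::int64_t kReportedB = ReportedNs(305, 895, kTickNs);  // true 590 ns, crosses one tick
constexpr std::int64_t kReportedC = ReportedNs(295, 905, kTickNs);  // true 610 ns, crosses three ticks
```

With these inputs the reported durations are 300 ns for A, 0 ns for D, 300 ns for B, and 900 ns for C: the reported value depends on how many ticks a range happens to cross, not on how long it really took.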
Time Stamp Counter readings’ resolution may depend on the hardware used and its design decisions related to how TSC synchronization
is handled between different CPU sockets, etc. On ARM-based systems Tracy will try to use the timer register (~40 ns resolution). If it
fails (due to kernel configuration), Tracy falls back to the system-provided timer, which can range in resolution from 250 ns to 1 µs.
5 Interestingly, the std::chrono::high_resolution_clock is not really a high-resolution clock.
6 This is a real optimization case. The values are median function run times and do not reflect the real execution time, which explains
second to achieve smooth animation. You can also think about physics update frames, audio processing frames, etc.
8 Frame usage is not required. See section 3.3 for more information.
In Tracy terminology, the profiled application is a client, and the profiler itself is a server. It was named this
way because the client is a thin layer that just collects events and sends them for processing and long-term
storage on the server. The fact that the server needs to connect to the client to begin the profiling session may
be a bit confusing at first.
• Tracy is free and open-source (BSD license), while RAD Telemetry costs about $8000 per year.
• Tracy provides out-of-the-box Lua bindings. It has been successfully integrated with other native and
interpreted languages (Rust, Arma scripting language) using the C API (see chapter 3.13 for reference).
• Tracy has a wide variety of profiling options. For example, you can profile CPU, GPU, locks, memory
allocations, context switches, and more.
• Tracy is feature-rich. For example, statistical information about zones, trace comparisons, or inclusion
of inline function frames in call stacks (even in statistics of sampled stacks) are features unique to Tracy.
• Tracy focuses on performance. It uses many tricks to reduce memory requirements and network
bandwidth. As a result, the impact on the client execution speed is minimal, while other profilers
perform heavy data processing within the profiled application (and then claim to be lightweight).
9 See section 2.3.3 for guidelines.
• Tracy uses low-level kernel APIs, or even raw assembly, where other profilers rely on layers of
abstraction.
• Tracy has been multi-platform from the very beginning, on both the client and the server side. Other
profilers tend to have Windows-specific graphical interfaces.
• Tracy can handle millions of frames, zones, memory events, and so on, while other profilers tend to
target very short captures.
• Tracy doesn’t require manual markup of interesting areas in your code to start profiling. Instead, you
may rely on automated call stack sampling and add instrumentation later when you know where it’s
needed.
• Tracy provides a mapping of source code to the assembly, with detailed information about the cost of
executing each instruction on the CPU.
Mode  Zones (total)  Zones (single image)  Clean run  Profiling run  Difference
ETC1  201,326,592    16,777,216            110.9 ms   148.2 ms       +37.3 ms
ETC2  201,326,592    16,777,216            212.4 ms   250.5 ms       +38.1 ms
10 https://github.com/wolfpld/etcpak
The second code block, responsible for ending a zone, is similar but smaller, as it can reuse some variables
retrieved in the above code.
1.8 Examples
To see how to integrate Tracy into your application, you may look at example programs in the examples
directory. Looking at the commit history might be the best way to do that.
• Homepage – https://github.com/wolfpld/tracy
2 First steps
Tracy Profiler supports MSVC, GCC, and clang. You will need to use a reasonably recent version of the
compiler due to the C++11 requirement. The following platforms are confirmed to be working (this is not a
complete list):
• FreeBSD (x64)
• WSL (x64)
• OSX (x64)
• Using the last-version-tagged revision will give you a stable platform to work with. You won’t
experience any breakages, major UI overhauls, or network protocol changes. Unfortunately, you
also won’t be getting any bug fixes.
• Working with the bleeding edge master development branch will give you access to all the new
improvements and features added to the profiler. While it is generally expected that master
should always be usable, there are no guarantees that it will be so.
Do note that all bug fixes and pull requests are made against the master branch.
With the source code included in your project, add the public/TracyClient.cpp source file to the IDE
project or makefile. You’re done. Tracy is now integrated into the application.
In the default configuration, Tracy is disabled. This way, you don’t have to worry that the production
builds will collect profiling data. To enable profiling, you will probably want to create a separate build
configuration, with the TRACY_ENABLE define.
Important
• Double-check that the define name is entered correctly (as TRACY_ENABLE); don’t make the mistake
of adding an additional D at the end. Make sure that this macro is defined for all files across your
project (e.g. it should be specified in the CFLAGS variable, which is always passed to the compiler,
or in an equivalent way), and not as a #define in just some of the source files.
• Tracy does not consider the value of the definition, only whether the macro is defined or not
(unless specified otherwise). Be careful not to make the mistake of assigning numeric values to
Tracy defines, which could lead you to be puzzled why constructs such as TRACY_ENABLE=0 don’t
behave as you expect.
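The difference between checking whether a macro is defined and checking its value can be shown with plain preprocessor conditionals. This is a sketch of general C++ preprocessor behavior, not an excerpt from Tracy's headers. DEMO_FEATURE is a hypothetical macro used only for illustration.

```cpp
// DEMO_FEATURE is a hypothetical macro used only for this illustration.
// It is deliberately given the value 0, mimicking -DDEMO_FEATURE=0 on the command line.
#define DEMO_FEATURE 0

// Presence check: true whenever the macro exists, regardless of its value.
#ifdef DEMO_FEATURE
constexpr bool kSeenByIfdef = true;
#else
constexpr bool kSeenByIfdef = false;
#endif

// Value check: true only when the macro expands to a non-zero value.
#if DEMO_FEATURE
constexpr bool kSeenByIf = true;
#else
constexpr bool kSeenByIf = false;
#endif
```

Here kSeenByIfdef is true while kSeenByIf is false. A library that only tests for the presence of a macro therefore treats a define with the value 0 as enabled. To turn such a feature off, leave the macro undefined.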
You should compile the application you want to profile with all the usual optimization options enabled
(i.e. make a release build). Profiling debug builds makes little sense, as the unoptimized code and
additional checks (asserts, etc.) completely change how the program behaves. In addition, you should enable
usage of the native architecture of your CPU (e.g. -march=native) to leverage the expanded instruction sets,
which may not be available in the default baseline target configuration.
Finally, on Unix, make sure that the application is linked with libraries libpthread and libdl. BSD
systems will also need to be linked with libexecinfo.
Link Tracy::TracyClient to any target where you use Tracy for profiling:
CMake FetchContent
When using CMake 3.11 or newer, you can use Tracy via CMake FetchContent (note that the
FetchContent_MakeAvailable command used here requires CMake 3.14 or newer). In this case, you do not
need to add a git submodule for Tracy manually. Add this to your CMakeLists.txt:
FetchContent_Declare(
    tracy
    GIT_REPOSITORY https://github.com/wolfpld/tracy.git
    GIT_TAG master
    GIT_SHALLOW TRUE
    GIT_PROGRESS TRUE
)
FetchContent_MakeAvailable(tracy)
Then add this to any target where you use Tracy for profiling:
If you are using the Meson build system, you can add Tracy using the Wrap dependency system. To do this,
place the tracy.wrap file in the subprojects directory of your project, with the following content. The head
revision field tracks Tracy’s master branch. If you want to lock to a specific version of Tracy instead, you
can just set the revision field to an appropriate git tag.
[wrap-git]
url = https://github.com/wolfpld/tracy.git
revision = head
depth = 1
Then, add the following option entry to the meson.options file. Use the name tracy_enable as shown,
because the Tracy subproject options inherit it.
Next, add the Tracy dependency to the meson.build project definition file. Don’t forget to include this
dependency in the appropriate executable or library definitions. This dependency will set all the appropriate
definitions (such as TRACY_ENABLE) in your program, so you don’t have to do it manually.
Finally, let’s check if the debugoptimized build type is enabled, and print a little reminder message if it is
not. For profiling we want the debug annotations to be present, but we also want to have the code to be
optimized.
Here’s a sample command to set up a build directory with profiling enabled. The last option,
tracy:on_demand, is used to demonstrate how to set options in the Tracy subproject.
meson setup build --buildtype=debugoptimized -Dtracy_enable=true -Dtracy:on_demand=true
In case you want to profile a short-lived program (for example, a compression utility that finishes its work in
one second), set the TRACY_NO_EXIT environment variable to 1. With this option enabled, Tracy will not exit
until an incoming connection is made, even if the application has already finished executing. If your platform
doesn’t support an easy setup of environment variables, you may also add the TRACY_NO_EXIT define to your
build configuration, which has the same effect.
By default, Tracy will begin profiling even before the program enters the main function. However, suppose
you don’t want to perform a full capture of the application lifetime. In that case, you may define the
TRACY_ON_DEMAND macro, which will enable profiling only when there’s an established connection with the
server.
You should note that if on-demand profiling is disabled (which is the default), then the recorded events
will be stored in the system memory until a server connection is made and the data can be uploaded¹¹.
Depending on how much is being profiled, the requirements for event storage can quickly grow to a
couple of gigabytes. Furthermore, since this data is no longer available after the initial connection, you won’t
be able to perform a second connection to a client unless the on-demand mode is used.
Caveats
The client with on-demand profiling enabled needs to perform additional bookkeeping to present a
coherent application state to the profiler. This incurs additional time costs for each profiling event.
11 This memory is never released, but the profiler reuses it for collection of other events.
12 Additional configuration may be required to achieve full functionality, depending on your network layout. Read about UDP
broadcasts for more information.
13 You may also look at the library directory in the profiler source tree.
In the case of some programming environments, you may need to take extra steps to ensure Tracy can work
correctly.
If you are using MSVC, you will need to disable the Edit And Continue feature, as it makes the compiler
non-conformant to some aspects of the C++ standard. In order to do so, open the project properties, go
to C/C++ → General → Debug Information Format, and make sure Program Database for Edit And Continue (/ZI)
is not selected.
Due to restricted access to Win32 APIs and other sandboxing issues (like network isolation), several
limitations apply to using Tracy in a UWP application compared to Windows Desktop:
• To be able to connect from another machine on the local network, the app needs the
privateNetworkClientServer capability. To connect from localhost, an active inbound loopback exemption
is also necessary¹⁴.
Because Apple has to think different, there are some problems with using Tracy on OSX and iOS. First, the
performance hit due to profiling is higher than on other platforms. Second, some critical features are missing
and won’t be possible to achieve:
• Profiling is interrupted when the application exits. This will result in missing zones, memory allocations,
or even source location names.
setenforce 0
mount -o remount,hidepid=0 /proc
echo -1 > /proc/sys/kernel/perf_event_paranoid
echo 0 > /proc/sys/kernel/kptr_restrict
The first command will allow access to system CPU statistics. The second one will enable inspection of
foreign processes (required for context switch capture). The third one will lower restrictions on access to
performance counters. The last one will allow retrieval of kernel symbol pointers. Be sure that you are fully
aware of the consequences of making these changes.
• Inability to obtain precise timestamps, resulting in error messages such as CPU doesn’t support RDTSC
instruction, or CPU doesn’t support invariant TSC. On Windows, you can work around this by rebuilding
the profiled application with the TRACY_TIMER_QPC define, which severely lowers the resolution of time
readings.
• Call stack sampling might lack time stamps. While you can use such a reduced data set to perform
statistical analysis, you won’t be able to limit the time range or see the sampling zones on the timeline.
• --mount "type=bind,source=/sys/kernel/debug,target=/sys/kernel/debug,readonly"
• --user 0:0
• --pid=host
2.1.10 Troubleshooting
Setting the TRACY_VERBOSE variable will make the client display advanced information about the detected
features. By matching those debug prints to the source code, you might be able to uncover why some of the
features are missing on your platform.
15 Tested on Ubuntu 22.04.3, docker 24.0.4
Important
To enable network communication, Tracy needs to open a listening port. Make sure it is not blocked by
an overzealous firewall or anti-virus program.
2.1.12 Limitations
When using Tracy Profiler, keep in mind the following requirements:
• The application may use each lock in no more than 64 unique threads.
• There can be no more than 65534 unique source locations¹⁷. This number is further split in half between
native code source locations and dynamic source locations (for example, when Lua instrumentation is
used).
• If there are recursive zones at any point in a zone stack, each unique zone source location should not
appear more than 255 times.
• A profiling session cannot be longer than 1.6 days (2⁴⁷ ns). This also includes on-demand sessions.
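The numeric limits in the list above can be checked with a few compile-time constants. This is a minimal sketch, assuming the session limit is 2^47 ns (which agrees with the quoted 1.6 days). The constant names are hypothetical and do not come from the Tracy source code.

```cpp
#include <cstdint>

// Limits quoted in the list above (constant names are hypothetical).
constexpr std::uint64_t kMaxSessionNs = std::uint64_t(1) << 47;  // 2^47 ns
constexpr std::uint32_t kMaxSourceLocations = 65534;

// Unit conversions for the maximum session length.
constexpr std::uint64_t kNsPerSecond = 1000000000ULL;
constexpr std::uint64_t kMaxSessionSeconds = kMaxSessionNs / kNsPerSecond;  // whole seconds
constexpr std::uint64_t kMaxSessionHours = kMaxSessionSeconds / 3600;       // whole hours
constexpr double kMaxSessionDays =
    static_cast<double>(kMaxSessionNs) / 1e9 / 86400.0;                     // fractional days

// "Split in half" between native and dynamic source locations.
constexpr std::uint32_t kLocationsPerKind = kMaxSourceLocations / 2;
```

The session limit works out to 140,737 whole seconds, or a little over 39 hours, which is the roughly 1.6 days stated above. Each kind of source location gets 32767 entries.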
The following conditions also need to apply but don’t trouble yourself with them too much. You would
probably already know if you’d be breaking any.
• The Tracy server requires a CPU that can handle misaligned memory accesses.
zone.
In a multitasking operating system, applications compete for system resources with each other. This has a
visible effect on the measurements performed by the profiler, which you may or may not accept.
To get the most accurate profiling results, you should minimize interference caused by other programs
running on the same machine. Before starting a profile session, close all web browsers, music players, instant
messengers, and all other non-essential applications like Steam, Uplay, etc. Make sure you don’t have the
debugger hooked into the profiled program, as it also impacts the timing results.
Interference caused by other programs can be seen in the profiler if context switch capture (section 3.15.3)
is enabled.
Where to even begin here? Modern processors are such complex beasts that it’s almost impossible to
say anything certain about how they will behave. Cache configuration, prefetcher logic, memory timings,
branch predictor, and execution unit counts are all the drivers of instructions-per-cycle uplift nowadays after the
megahertz race had hit the wall. Not only is it challenging to reason about, but you also need to take into
account how the CPU topology affects things, which is described in more detail in section 3.15.4.
Nevertheless, let’s look at how we can try to stabilize the profiling data.
Also known as: the spectre thing we have to deal with now.
You must be aware that most processors available on the market¹⁸ do not execute machine code linearly, as
laid out in the source code. This can lead to counterintuitive timing results reported by Tracy. Trying to
get more ’reliable’ readings¹⁹ would require a change in the behavior of the code, and this is not a thing a
profiler should do. So instead, Tracy shows you what the hardware is really doing.
This is a complex subject, and the details vary from one CPU to another. You can read a brief rundown of the
topic at the following address: https://travisdowns.github.io/blog/2019/06/11/speed-limits.html.
Also known as: Hyper-threading. Typically present on Intel and AMD processors.
To get the most reliable results, you should have all the CPU core resources dedicated to a single thread
of your program. Otherwise, you’re no longer measuring the behavior of your code but rather how it keeps
up when its computing resources are randomly taken away by some other thing running on another pipeline
within the same physical core.
Note that you might want to observe this behavior if you plan to deploy your application on a machine
with simultaneous multithreading enabled. This would require careful examination of what else is running
on the machine, or even how the operating system schedules the threads of your own program, as various
combinations of competing workloads (e.g., integer/floating-point operations) will be impacted differently.
• How many cores are in use? Just one, or all 8? All 16?
• What type of work is being performed? Integer? Floating-point? 128-wide SIMD? 256-wide SIMD?
512-wide SIMD?
• Were you lucky in the silicon lottery? Some dies are just better made and can achieve higher frequencies.
• Are you running on the best-rated core or at the worst-rated core? Some cores may be unable to match
the performance of other cores in the same processor.
• What kind of cooling solution are you using? The cheap one bundled with the CPU or a hefty chunk of
metal that has no problem with heat dissipation?
• Do you have complete control over the power profile? Spoiler alert: no. The operating system may run
anything at any time on any of the other cores, which will impact the turbo frequency you’re able to
achieve.
As you can see, this feature basically screams ’unreliable results!’ Best keep it disabled and run at the
base frequency. Otherwise, your timings won’t make much sense. A true example: a branchless compression
function executing multiple times with the same input data was measured executing at four different speeds.
Keep in mind that even at the base frequency, you may hit the thermal limits of the silicon and be
throttled down.
2.2.2.6 Summing it up
Power management schemes employed in various CPUs make it hard to reason about the true performance of
the code. For example, figure 3 contains a histogram of function execution times (as described in chapter 5.7),
as measured on an AMD Ryzen CPU. The results ranged from 13.05 µs to 61.25 µs (extreme outliers were not
included on the graph, limiting the longest displayed time to 36.04 µs).
We can immediately see that there are two distinct peaks, at 13.4 µs and 15.3 µs. A reasonable assumption
would be that there are two paths in the code, one that can omit some work, and the second one which must
do some additional job. But here’s a catch – the measured code is actually branchless and always executes
the same way. The two peaks represent two turbo frequencies between which the CPU was aggressively
switching.
We can also see that the graph gradually falls off to the right (representing longer times), with a slight
bump near the end. Again, this can be attributed to running in power-saving mode, with different reaction
times to the required operating frequency boost to full power.
Now that you have a build directory, you can actually compile the program. For example, you could
run the following command:
The build directory can be reused if you want to compile the program in the future, for example if
there have been some updates to the source code, and usually does not need to be regenerated. Note
that all build artifacts are contained in the build directory.
Important
Due to the memory requirements for data storage, the Tracy server is only supposed to run on 64-bit
platforms. While nothing prevents the program from building and executing in a 32-bit environment,
doing so is not supported.
• capstone
• glfw
• freetype
The capstone library will always be downloaded from GitHub when the CMake build directory is created,
unless you have it installed on your system and set the specific build option. You must have git installed for
this download to work. Using the capstone library provided by package managers is not recommended,
as these packages are typically slow to provide up-to-date versions of the library, and the API may be
incompatible.
It is recommended that you install the glfw and freetype libraries on your system so that Tracy can find
them with pkg-config. However, if these libraries are not available, they will be downloaded from GitHub.
2.3.1.1 Windows
There is no need to install external libraries (e.g. with vcpkg). All libraries are downloaded automatically by
CMake. You still need git, though.
2.3.1.2 Unix
On Unix systems (including Linux), you will need to install the pkg-config utility to provide information
about libraries.
Due to some questionable design decisions by the compiler developers, you will most likely also need the
tbb library [23]. If not found, this library is downloaded automatically.
Installation of the libraries on OSX can be facilitated using the brew package manager.
2.3.1.3 Linux
There are some Linux-specific libraries that you need to have installed on your system. These won’t be
downloaded automatically.
For XDG Portal support in the file selector, you need to install the dbus library. If you’re one of those
weird people who don’t like modern things, you can install gtk3 instead and force the GTK file selector
with a build option.
Linux builds of Tracy use the Wayland protocol by default, which allows proper support for Hi-DPI
scaling and high-precision input devices such as touchpads. As such, the glfw library is no longer needed,
but you will need to install libxkbcommon, wayland, wayland-protocols, libglvnd (or libegl on some
distributions).
[23] Technically, this is not a Tracy dependency, but rather a libstdc++ dependency, but it may still not be installed by default.
If you want to use X11 instead, you can enable the LEGACY option in CMake build settings.
Linux distributions
Some Linux distributions require you to add a lib prefix and a -dev or -devel postfix to library names.
You may also need to add a seemingly random number to the library name (for example: freetype2,
or freetype6).
Some Linux distributions ship outdated versions of libraries that are too old for Tracy to build, and
do not provide new versions by design. Please reconsider your choice of distribution in this case, as the
only function of a Linux distribution is to provide packages, and the one you have chosen is clearly
failing at this task.
Window decorations
Please don’t ask about window decorations in Gnome. The current behavior is the intended behavior.
Gnome does not want windows to have decorations, and Tracy respects that choice. If you find this
problematic, use a desktop environment that actually listens to its users.
The recommended development environment is Visual Studio Code [24]. This is a cross-platform solution, so
you always get the same experience, no matter what OS you are using.
VS Code is highly modular, and unlike some other IDEs, it does not come with a compiler. You will need
to have one, such as gcc or clang, already installed on your system. On Windows, you should have MSVC
2022 installed in order to have access to its build tools.
When you open the Tracy directory in VS Code, it will prompt you to install some recommended
extensions: clangd, CodeLLDB, and CMake Tools. You should do this if you don’t already have them.
The CMake build configuration will begin immediately. It is likely that you will be prompted to select a
development kit to use; for example, you may have a preference as to whether you want to use gcc or clang,
and CMake will need to be told about it.
After the build configuration phase is over, you may want to make some further adjustments to what is
being built. The primary place to do this is in the Project Status section of the CMake side panel. The two key
settings there are also available in the status bar at the bottom of the window:
• The Folder setting allows you to choose which Tracy utility you want to work with. Select "profiler" for
the profiler’s GUI.
• The Build variant setting is used to toggle between the debug and release build configurations.
With all this taken care of, you can now start the program with the F5 key, set breakpoints, get code
completion and navigation [25], and so on.
[24] https://code.visualstudio.com/
[25] To get the Intellisense experience if you are using the MSVC compiler, you need to do some additional setup. First, you need to
• TRACY_NO_FILESELECTOR – controls whether a system load/save dialog is compiled in. If it’s enabled,
the saved traces will be named trace.tracy.
• TRACY_NO_STATISTICS – Tracy will perform statistical data collection on the fly, if this macro is not
defined. This allows extended trace analysis (for example, you can perform a live search for matching
zones) at a small CPU processing cost and a considerable memory usage increase (at least 8 bytes per
zone).
• TRACY_NO_ROOT_WINDOW – the main profiler view won’t occupy the whole window if this macro is
defined. Additional setup is required for this to work. If you want to embed the server into your
application, you probably should enable this option.
...
Caveats
• On MSVC the debugger has priority over the application in handling exceptions. If you want to
finish the profiler data collection with the debugger hooked-up, select the continue option in the
debugger pop-up dialog.
• On Linux, crashes are handled with signals. Tracy needs to have SIGPWR available, which is rather
rarely used. Still, the program you are profiling may expect to employ it for its purposes, which
would cause a conflict [a]. To work around such cases, you may set the TRACY_CRASH_SIGNAL macro
value to some other signal (see man 7 signal for a list of signals). Ensure that you avoid conflicts
by selecting a signal that the application wouldn’t usually receive or emit.
[a] For example, Mono may use it to trigger garbage collection.
3 Client markup
With the steps mentioned above, you will be able to connect to the profiled program, but there probably
won’t be any data collection performed [29]. Unless you’re able to perform automatic call stack sampling
(see chapter 3.15.5), you will have to instrument the application manually. All the user-facing interface is
contained in the public/tracy/Tracy.hpp header file [30].
Manual instrumentation is best started with adding markup to the application’s main loop, along with
a few functions that the loop calls. Such an approach will give you a rough outline of the function’s time
cost, which you may then further refine by instrumenting functions deeper in the call stack. Alternatively,
automated sampling might guide you more quickly to places of interest.
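As a rough illustration of this approach, the sketch below marks a main loop and two functions it calls. ZoneScoped and FrameMark are Tracy macros; UpdatePhysics, RenderScene, and RunMainLoop are hypothetical names chosen for the example, and the fallback definitions are an assumption that only exists so the sketch builds where the Tracy headers are not available.

```cpp
#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: no-op stand-ins so that the sketch builds without the Tracy headers.
#  define ZoneScoped
#  define FrameMark
#endif

// Hypothetical workload functions called from the main loop.
int UpdatePhysics(int state)
{
    ZoneScoped;             // time spent in this function is recorded as a zone
    return state + 1;
}

int RenderScene(int state)
{
    ZoneScoped;
    return state * 2;
}

// Hypothetical main loop: one iteration corresponds to one frame.
int RunMainLoop(int frames)
{
    int state = 0;
    for (int i = 0; i < frames; i++)
    {
        ZoneScoped;         // zone covering the whole loop iteration
        state = UpdatePhysics(state);
        state = RenderScene(state);
        FrameMark;          // marks the end of the frame
    }
    return state;
}
```

Once the outline is visible in the profiler, further zones can be added inside whichever of the called functions turns out to dominate the frame time.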
[28] For example, invalid memory accesses (’segmentation faults’, ’null pointer exceptions’), divisions by zero, etc.
[29] With some small exceptions, see section 3.15.
[30] You should add either public or public/tracy directory from the Tracy root to the include directories list in your project. Then
In some cases marked in the manual, Tracy expects every occurrence of the same string literal to be
represented by one unique pointer. This can be exemplified in the following listing:
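A minimal sketch of such a situation, assuming a discontinuous frame set named "Audio processing". ProcessAudio is a hypothetical function, and the fallback macro definitions are stand-ins that only let the sketch build without the Tracy headers.

```cpp
#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: no-op stand-ins so that the sketch builds without the Tracy headers.
#  define FrameMarkStart(name) ((void)(name))
#  define FrameMarkEnd(name) ((void)(name))
#endif

// Hypothetical audio callback, bracketed by a discontinuous frame.
int ProcessAudio(int samples)
{
    FrameMarkStart("Audio processing");   // first string literal
    const int processed = samples * 2;    // placeholder for the actual work
    FrameMarkEnd("Audio processing");     // second literal with identical contents
    return processed;
}
```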
Here, we pass two string literals with identical contents to two different macros. It is entirely up to
the compiler to decide if it will pool these two strings into one pointer or if there will be two instances
present in the executable image [32]. For example, on MSVC, this is controlled by the Configuration Properties →
C/C++ → Code Generation → Enable String Pooling option in the project properties (optimized builds enable it
automatically). Note that even if string pooling is used on the compilation unit level, it is still up to the linker
to implement pooling across object files.
As you can see, making sure that string literals are properly pooled can be surprisingly tricky. To work
around this problem, you may employ the following technique. In one source file create the unique pointer
for a string literal, for example:
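A minimal sketch of this technique follows. The variable name sl_AudioProcessing matches the usage shown further on; the split into a shared declaration and a single definition, as well as the AudioFrameName helper, are assumptions about how you might organize the code.

```cpp
// Shared header (hypothetical, e.g. StringLiterals.hpp): declaration only.
extern const char* const sl_AudioProcessing;

// Exactly one source file holds the definition, so there is exactly one pointer.
const char* const sl_AudioProcessing = "Audio processing";

// Any other code refers to the variable, so the same pointer is passed everywhere.
const char* AudioFrameName()
{
    return sl_AudioProcessing;
}
```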
Then in each file where you want to use the literal, use the variable name instead. Notice that if you’d like
to change a name passed to Tracy, you’d need to do it only in one place with such an approach.
FrameMarkStart(sl_AudioProcessing);
...
FrameMarkEnd(sl_AudioProcessing);
In some cases, you may want to have semi-dynamic strings. For example, you may want to enumerate
workers but don’t know how many will be used. You can handle this by allocating a never-freed char buffer,
which you can then propagate where it’s needed. For example:
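One possible way to build such a buffer is sketched below. MakeWorkerName is a hypothetical helper, and the buffer size of 32 bytes is an arbitrary assumption that is large enough for the "Worker N" format used here.

```cpp
#include <cstdio>

// Hypothetical helper: builds a name for worker number `id`.
// The buffer is deliberately never freed, so the pointer stays valid
// for as long as the profiler may need it.
const char* MakeWorkerName(int id)
{
    constexpr int size = 32;
    char* name = new char[size];
    std::snprintf(name, size, "Worker %i", id);
    return name;
}
```

Call the helper once per worker and keep the returned pointer, since every call allocates a new buffer and therefore yields a different pointer.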
You have to make sure it’s initialized only once, before passing it to any Tracy API, that it is not overwritten
by new data, etc. In the end, this is just a pointer to character-string data. It doesn’t matter if the memory
was loaded from the program image or allocated on the heap.
string in memory.
[32] [ISO12] §2.14.5.12: "Whether all string literals are distinct (that is, are stored in nonoverlapping objects) is implementation-defined."
Do I need this?
This step is optional, as some applications do not use the concept of a frame.
Important
• Frame types must not be mixed. For each frame set, identified by a unique name, use either
continuous or discontinuous frames only!
• You must issue the FrameMarkStart and FrameMarkEnd macros in proper order. Be extra careful,
especially if multi-threading is involved.
• String literals passed as frame names must be properly pooled, as described in section 3.1.2.
Table 3: Client compression time of 320 × 180 image. x86: Ryzen 9 3900X (MSVC); ARM: ODROID-C2 (gcc).
Caveats
• Frame images are compressed on a second client profiler thread [a], to reduce memory usage of
queued images. This might have an impact on the performance of the profiled application.
• This second thread will be periodically woken up, even if there are no frame images to compress [b].
If you are not using the frame image capture functionality and you don’t wish this thread to be
running, you can define the TRACY_NO_FRAME_IMAGE macro.
• Due to implementation details of the network buffer, a single frame image cannot be greater than
256 KB after compression. Note that a 960 × 540 image fits in this limit.
[a] A small part of the compression task is offloaded to the server.
[b] This way of doing things is required to prevent a deadlock in specific circumstances.
Everything needs to be correctly initialized (the cleanup is left for the reader to figure out).
glGenTextures(4, m_fiTexture);
glGenFramebuffers(4, m_fiFramebuffer);
glGenBuffers(4, m_fiPbo);
for(int i=0; i<4; i++)
{
    glBindTexture(GL_TEXTURE_2D, m_fiTexture[i]);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 320, 180, 0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
    glBindFramebuffer(GL_FRAMEBUFFER, m_fiFramebuffer[i]);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, m_fiTexture[i], 0);
    // Allocate the pixel buffer object that glReadPixels will write into (320 x 180 RGBA).
    glBindBuffer(GL_PIXEL_PACK_BUFFER, m_fiPbo[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, 320*180*4, nullptr, GL_STREAM_READ);
}
We will now set up a screen capture, which will downscale the screen contents to 320 × 180 pixels and
copy the resulting image to a buffer accessible by the CPU when the operation is done. This should be placed
right before swap buffers or present call.
assert(m_fiQueue.empty() || m_fiQueue.front() != m_fiIdx);   // check for buffer overrun
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, m_fiFramebuffer[m_fiIdx]);
glBlitFramebuffer(0, 0, res.x, res.y, 0, 0, 320, 180, GL_COLOR_BUFFER_BIT, GL_LINEAR);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
glBindFramebuffer(GL_READ_FRAMEBUFFER, m_fiFramebuffer[m_fiIdx]);
glBindBuffer(GL_PIXEL_PACK_BUFFER, m_fiPbo[m_fiIdx]);
glReadPixels(0, 0, 320, 180, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);
m_fiFence[m_fiIdx] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
m_fiQueue.emplace_back(m_fiIdx);
m_fiIdx = (m_fiIdx + 1) % 4;
And lastly, just before the capture setup code that was just added [39], we need to have the image retrieval
code. We are checking if the capture operation has finished. If it has, we map the pixel buffer object to memory,
inform the profiler that there are image data to be handled, unmap the buffer and go to check the next queue
item. If capture is still pending, we break out of the loop. We will have to wait until the next frame to check if
the GPU has finished performing the capture.
Notice that in the call to FrameImage we are passing the remaining queue size as the offset parameter.
Queue size represents how many frames ahead our program is relative to the GPU. Since we are sending
past frame images, we need to specify how many frames behind the images are. Of course, if this were a
synchronous capture (without the use of fences and with retrieval code after the capture setup), we would
set offset to zero, as there would be no frame lag.
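The control flow of this retrieval step can be sketched without any OpenGL calls. In the sketch below, RetrieveFrameImages is a hypothetical function, the queue is modeled as a std::deque (an assumption), fenceSignaled stands in for a non-blocking check of the fence, and sendImage stands in for mapping the pixel buffer object, calling FrameImage with the given offset, and unmapping the buffer.

```cpp
#include <cstddef>
#include <deque>

// Processes finished captures from oldest to newest and stops at the first
// one that the GPU has not completed yet.
template <typename FenceSignaled, typename SendImage>
void RetrieveFrameImages(std::deque<int>& queue, FenceSignaled fenceSignaled, SendImage sendImage)
{
    while (!queue.empty())
    {
        const int idx = queue.front();
        if (!fenceSignaled(idx)) break;   // still pending, check again next frame
        sendImage(idx, queue.size());     // offset: how many frames behind this image is
        queue.pop_front();                // go on to the next queue item
    }
}
```

With three captures in flight and only the two oldest finished, the images would be sent with offsets 3 and 2, and the third capture stays queued until a later frame.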
High quality capture The code above uses the glBlitFramebuffer function, which offers only nearest-neighbor or
bilinear filtering. When downscaling by a large factor, such filtering can result in low-quality screenshots, as shown in figure 4.
However, with a bit more work, it is possible to obtain nicer-looking screenshots, as presented in figure 5.
Unfortunately, you will need to set up a complete rendering pipeline for this to work.
First, you need to allocate an additional set of intermediate frame buffers and textures, sized the same as
the screen. These new textures should have a minification filter set to GL_LINEAR_MIPMAP_LINEAR. You will
also need to set up everything needed to render a full-screen quad: a simple texturing shader and vertex
buffer with appropriate data. Since you will use this vertex buffer to render to the scaled-down frame buffer,
you may prepare its contents beforehand and update it only when the aspect ratio changes.
With all this done, you can perform the screen capture as follows:
• Set up the vertex buffer configuration for the full-screen quad buffer (you only need position and uv coordinates).
While this approach is much more complex than the previously discussed one, the resulting image quality
increase makes it worthwhile.
You can see the performance results you may expect in a simple application in table 4. The naïve capture
performs synchronous retrieval of full-screen image and resizes it using stb_image_resize. The proper and
high-quality captures do things as described in this chapter.
Important
Zones are identified using static data structures embedded in program code. Therefore, you need to
consider the lifetime of code in your application, as discussed in section 3.1.1, to make sure that the
profiler can access this data at any time during the program lifetime.
If you can’t fulfill this requirement, you must use transient zones, described in section 3.4.4.
[40] A zone represents the lifetime of a special on-stack profiler variable. Typically it would exist for the duration of a whole scope of the
profiled function, but you also can measure time spent in scopes of a for-loop or an if-branch.
[41] https://en.cppreference.com/w/cpp/language/raii
[42] The last parameter is explained in section 3.4.3.
Zone stack
The ZoneScoped macros impose the creation and usage of an implicit zone stack. You must also
follow the rules of this stack when using the named macros, which give you some more leeway in doing
things. For example, you can only set the text for the zone which is on top of the stack, just as you
could with the ZoneText macro. It doesn’t matter that you can call the Text method of a non-top
zone which is accessible through a variable. Take a look at the following code:
{
    ZoneNamed(Zone1, true);
    // a
    {
        ZoneNamed(Zone2, true);
        // b
    }
    // c
}
It is valid to set the Zone1 text or name only in places a or c. After Zone2 is created at b, you can
no longer perform operations on Zone1 until Zone2 is destroyed.
enum SubSystems
{
    Sys_Physics     = 1 << 0,
    Sys_Rendering   = 1 << 1,
    Sys_NasalDemons = 1 << 2
};
...
...
void Function()
{
    ZoneScoped;
    ...
    for(int i=0; i<10; i++)
    {
        ZoneScoped;
        ...
    }
}
This doesn’t stop some compilers from dispensing fashion advice about variable shadowing (as both
ZoneScoped calls create a variable with the same name, with the inner scope one shadowing the one in the
outer scope). If you want to avoid these warnings, you will also need to use the ZoneNamed macros.
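A sketch of the same nesting written with the ZoneNamed macros, so that each zone variable has its own name and nothing is shadowed. SumWithNamedZones and the variable names outerZone and innerZone are hypothetical, and the fallback definition is an assumption that only lets the sketch build without the Tracy headers.

```cpp
#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: no-op stand-in so that the sketch builds without the Tracy headers.
#  define ZoneNamed(varname, active)
#endif

int SumWithNamedZones()
{
    ZoneNamed(outerZone, true);       // zone for the whole function
    int sum = 0;
    for (int i = 0; i < 10; i++)
    {
        ZoneNamed(innerZone, true);   // separately named zone for each iteration
        sum += i;
    }
    return sum;
}
```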
with
[43] https://en.cppreference.com/w/cpp/named_req/Mutex
Alternatively, you may use TracyLockableN(type, varname, description) to provide a custom lock
name at a global level, which will replace the automatically generated ’std::mutex m_lock’-like name. You
may also set a custom name for a specific instance of a lock, through the LockableName(varname, name,
size) macro.
The standard std::lock_guard and std::unique_lock wrappers should use the LockableBase(type)
macro for their template parameter (unless you’re using C++17, with improved template argument deduction).
For example:
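A minimal sketch, assuming a std::mutex member declared with the TracyLockable(type, varname) macro. The Counter class is hypothetical, and the fallback definitions are stand-ins that only let the sketch build without the Tracy headers.

```cpp
#include <mutex>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: stand-ins so that the sketch builds without the Tracy headers.
#  define TracyLockable(type, varname) type varname
#  define LockableBase(type) type
#  define LockMark(varname) (void)varname
#endif

// Hypothetical class protecting a counter with an instrumented mutex.
struct Counter
{
    TracyLockable(std::mutex, m_lock);
    int m_value = 0;

    void Increment()
    {
        // LockableBase yields the type expected by the standard lock wrappers.
        std::lock_guard<LockableBase(std::mutex)> lock(m_lock);
        LockMark(m_lock);   // optional: records where the lock is held
        m_value++;
    }
};
```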
To mark the location of a lock being held, use the LockMark(varname) macro after you have obtained the
lock. Note that the varname must be a lock variable (a reference is also valid). This step is optional.
Similarly, you can use TracySharedLockable, TracySharedLockableN and SharedLockableBase to mark
locks implementing the SharedMutex requirement [44]. Note that while there’s no support for timed mutices
in Tracy, both std::shared_mutex and std::shared_timed_mutex may be used [45].
Condition variables
The standard std::condition_variable is only able to accept std::mutex locks. To be able to use
Tracy lock wrapper, use std::condition_variable_any instead.
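A sketch of this combination is shown below, assuming a mutex declared with the TracyLockable(type, varname) macro. The Mailbox class is hypothetical, and the fallback definitions are stand-ins that only let the sketch build without the Tracy headers.

```cpp
#include <condition_variable>
#include <mutex>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: stand-ins so that the sketch builds without the Tracy headers.
#  define TracyLockable(type, varname) type varname
#  define LockableBase(type) type
#endif

// Hypothetical one-slot mailbox guarded by an instrumented mutex.
struct Mailbox
{
    TracyLockable(std::mutex, m_lock);
    std::condition_variable_any m_cv;   // accepts any lock type, unlike std::condition_variable
    bool m_ready = false;
    int m_payload = 0;

    void Post(int value)
    {
        {
            std::lock_guard<LockableBase(std::mutex)> lock(m_lock);
            m_payload = value;
            m_ready = true;
        }
        m_cv.notify_one();
    }

    int Wait()
    {
        std::unique_lock<LockableBase(std::mutex)> lock(m_lock);
        m_cv.wait(lock, [this] { return m_ready; });   // returns at once if already posted
        return m_payload;
    }
};
```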
Caveats
Due to the limits of internal bookkeeping in the profiler, you may use each lock in no more than 64
unique threads. If you have many short-lived temporary threads, consider using a thread pool to limit
the number of created threads.
in C++14.
Figure 6: An identical set of values on a smooth plot (left) and a staircase plot (right).
Each plot has its own color, which by default is derived from the plot name (each unique plot name
produces its own color, which does not change between profiling runs). If you want to provide your own
color instead, you may enter the color parameter. Note that you should set the color value to 0 if you do not
want to set your own color.
For reference, the following command sets the default parameters of the plot (that is, it’s a no-op):
TracyPlotConfig(name, tracy::PlotFormatType::Number, false, true, 0).
It is beneficial but not required to use a unique pointer for name string literal (see section 3.1.2 for more
details).
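As a sketch of how the configuration call fits together with plot updates, the example below assumes the TracyPlot(name, value) macro for reporting values and the Memory member of tracy::PlotFormatType. The names pl_QueueMemory, ConfigureQueuePlot, and ReportQueueMemory are hypothetical, and the fallback definitions only let the sketch build without the Tracy headers.

```cpp
#include <cstdint>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: no-op stand-ins so that the sketch builds without the Tracy headers.
#  define TracyPlot(name, value) ((void)(name), (void)(value))
#  define TracyPlotConfig(name, type, step, fill, color)
#endif

// One shared pointer for the plot name, used by every call below.
static const char* const pl_QueueMemory = "Queue memory";

void ConfigureQueuePlot()
{
    // Memory formatting, staircase plot (step = true), filled area, default color (0).
    TracyPlotConfig(pl_QueueMemory, tracy::PlotFormatType::Memory, true, true, 0);
}

std::int64_t ReportQueueMemory(std::int64_t bytes)
{
    TracyPlot(pl_QueueMemory, bytes);   // adds one data point to the plot
    return bytes;
}
```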
• Ability to rewind view of active allocations and memory map to any point of program execution.
To mark memory events, use the TracyAlloc(ptr, size) and TracyFree(ptr) macros. Typically you
would do that in overloads of operator new and operator delete, for example:
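A minimal sketch of such overloads, assuming allocation through std::malloc and std::free. The null-pointer check and the std::bad_alloc handling are additions made here for correctness of the sketch, and the fallback macro definitions only let it build without the Tracy headers.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

#if __has_include(<tracy/Tracy.hpp>)
#  include <tracy/Tracy.hpp>
#else
// Assumption: no-op stand-ins so that the sketch builds without the Tracy headers.
#  define TracyAlloc(ptr, size) ((void)(ptr), (void)(size))
#  define TracyFree(ptr) ((void)(ptr))
#endif

void* operator new(std::size_t count)
{
    void* ptr = std::malloc(count ? count : 1);
    if (!ptr) throw std::bad_alloc();
    TracyAlloc(ptr, count);   // report the allocation
    return ptr;
}

void operator delete(void* ptr) noexcept
{
    if (!ptr) return;         // nothing was allocated, so there is nothing to report
    TracyFree(ptr);           // report the free before the memory is released
    std::free(ptr);
}
```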
In some rare cases (e.g., destruction of TLS block), events may be reported after the profiler is no longer
available, which would lead to a crash. To work around this issue, you may use TracySecureAlloc and
TracySecureFree variants of the macros.
Important
Each tracked memory-free event must also have a corresponding memory allocation event. Tracy will
terminate the profiling session if this assumption is broken (see section 4.7). If you encounter this issue,
you may want to check for:
• Reporting the same memory address being allocated twice (without a free between two alloca-
tions).
• Untracked allocations made in external libraries that are freed in the application.
This requirement is relaxed in the on-demand mode (section 2.1.5) because the memory allocation
event might have happened before the server made the connection.
{
    vkBeginCommandBuffer(cmd, &beginInfo);
    TracyVkZone(ctx, cmd, "Render");
    vkEndCommandBuffer(cmd);
}
Add a nested scope encompassing the command buffer recording section to fix such issues.
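The effect of the nested scope can be illustrated without Vulkan. The sketch below uses stand-in types only: FakeCmdBuffer and FakeGpuZone are hypothetical and merely log the order of operations, under the assumption that the scoped zone object records into the command buffer both when it is created and when it goes out of scope.

```cpp
#include <string>
#include <vector>

// Stand-in for a command buffer: remembers the order of operations.
struct FakeCmdBuffer
{
    bool recording = false;
    std::vector<std::string> log;
    void Begin() { recording = true; log.push_back("begin"); }
    void End() { recording = false; log.push_back("end"); }
};

// Stand-in for a scoped GPU zone: writes to the command buffer on entry and exit.
struct FakeGpuZone
{
    FakeCmdBuffer& cmd;
    explicit FakeGpuZone(FakeCmdBuffer& c) : cmd(c)
    {
        cmd.log.push_back(cmd.recording ? "zone begin" : "zone begin (invalid)");
    }
    ~FakeGpuZone()
    {
        cmd.log.push_back(cmd.recording ? "zone end" : "zone end (invalid)");
    }
};

std::vector<std::string> RecordWithoutNestedScope()
{
    FakeCmdBuffer cmd;
    {
        cmd.Begin();
        FakeGpuZone zone(cmd);
        cmd.End();               // the zone is still alive here
    }                            // zone is destroyed after recording has ended
    return cmd.log;
}

std::vector<std::string> RecordWithNestedScope()
{
    FakeCmdBuffer cmd;
    {
        cmd.Begin();
        {
            FakeGpuZone zone(cmd);
        }                        // zone is destroyed while still recording
        cmd.End();
    }
    return cmd.log;
}
```

In the first variant the zone ends only after recording has stopped, while the nested scope in the second variant makes the zone end before the command buffer is closed.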
Caveat emptor
The profiling results you will get can be unreliable or plainly wrong. It all depends on the quality
of graphics drivers and how the underlying hardware implements timers. While Tracy employs
some heuristics to make things as reliable as possible, it must talk to the GPU through the commonly
unreliable API calls.
For example, on Linux, the Intel GPU driver will report 64-bit precision of time stamps. Unfortunately,
this is not true, as the driver will only provide timestamps with 36-bit precision, rolling over the
exceeding values. Tracy can detect such problems and employ workarounds. This is, sadly, not enough
to make the readings reliable, as this timer we can access through the API is not a real one. Deep
down, the driver has access to the actual timer, which it uses to provide the virtual values we can get.
Unfortunately, this hardware timer has a period which does not match the period of the API timer. As a
result, the virtual timer will sometimes overflow in the midst of a cycle, making the reported time values
jump forward. This is a problem that only the driver vendor can fix.
Another problem discovered on AMD GPUs under Linux causes the timestamp register to be
reset every time the GPU enters a low-power mode. This can happen virtually every frame if you are
rendering with vertical synchronization disabled. Needless to say, the timestamp data is not very useful
in this case. The solution to this problem is to navigate to the /sys/devices/pci*/*/*/ directory
corresponding to the GPU and set the power_dpm_force_performance_level value to manual and the
pp_power_profile_mode value to the number corresponding to the COMPUTE profile. Your mileage may
vary, however – on my system I only have one of these values available to set. Nevertheless, you will
find a similar solution suggested by the system vendor in a Direct3D 12 section later in the manual.
If you experience crippling problems while profiling the GPU, you might get better results with a
different driver, different operating system, or different hardware.
3.9.1 OpenGL
You will need to include the public/tracy/TracyOpenGL.hpp header file and declare each of your rendering
contexts using the TracyGpuContext macro (typically, you will only have one context). Tracy expects no
more than one context per thread and no context migration. To set a custom name for the context, use the
TracyGpuContextName(name, size) macro.
To mark a GPU zone use the TracyGpuZone(name) macro, where name is a string literal name of the zone.
Alternatively you may use TracyGpuZoneC(name, color) to specify zone color.
You also need to periodically collect the GPU events using the TracyGpuCollect macro. An excellent
place to do it is after the swap buffers function call.
Caveats
• OpenGL profiling is not supported on OSX, iOS [a].
• Nvidia drivers are unable to provide consistent timing results when two OpenGL contexts are
used simultaneously.
3.9.2 Vulkan
Similarly, for Vulkan support you should include the public/tracy/TracyVulkan.hpp header file. Tracing
Vulkan devices and queues is a bit more involved, and the Vulkan initialization macro TracyVkContext(physdev,
device, queue, cmdbuf) returns an instance of TracyVkCtx object, which tracks an associated Vulkan
queue. Cleanup is performed using the TracyVkDestroy(ctx) macro. You may create multiple Vulkan
contexts. To set a custom name for the context, use the TracyVkContextName(ctx, name, size) macro.
The physical device, logical device, queue, and command buffer must relate to each other. The queue
must support graphics or compute operations. The command buffer must be in the initial state and be able to
be reset. The profiler will rerecord and submit it to the queue multiple times, and it will be in the executable
state on exit from the initialization function.
To mark a GPU zone use the TracyVkZone(ctx, cmdbuf, name) macro, where name is a string literal
name of the zone. Alternatively you may use TracyVkZoneC(ctx, cmdbuf, name, color) to specify zone
color. The provided command buffer must be in the recording state, and it must be created within the queue
that is associated with ctx context.
You also need to periodically collect the GPU events using the TracyVkCollect(ctx, cmdbuf) macro [46].
The provided command buffer must be in the recording state and outside a render pass instance.
Calibrated context In order to maintain synchronization between CPU and GPU time domains, you will
need to enable the VK_EXT_calibrated_timestamps device extension and retrieve the following function
pointers: vkGetPhysicalDeviceCalibrateableTimeDomainsEXT and vkGetCalibratedTimestampsEXT.
To enable calibrated context, replace the macro TracyVkContext with TracyVkContextCalibrated and
pass the two functions as additional parameters, in the order specified above.
Using Vulkan 1.2 features Vulkan 1.2 and VK_EXT_host_query_reset provide mechanics to reset the
query pool without the need of a command buffer. By using TracyVkContextHostCalibrated you can make
use of this feature. It only requires a function pointer to vkResetQueryPool in addition to the ones required
for TracyVkContextCalibrated instead of the VkQueue and VkCommandBuffer handles.
However, using this feature requires the physical device to have calibrated device and host time domains. In
addition to VK_TIME_DOMAIN_DEVICE_EXT, vkGetPhysicalDeviceCalibrateableTimeDomainsEXT will have
to additionally return either VK_TIME_DOMAIN_CLOCK_MONOTONIC_RAW_EXT or VK_TIME_DOMAIN_QUERY_PERFORMANCE_COUNTE
for Unix and Windows, respectively. If this is not the case, you will need to use TracyVkContextCalibrated
or TracyVkContext macro instead.
[46] It is considerably faster than the OpenGL’s TracyGpuCollect.
Dynamically loading the Vulkan symbols Some applications dynamically link the Vulkan loader, and
manage a local symbol table, to remove the trampoline overhead of calling through the Vulkan loader itself.
When TRACY_VK_USE_SYMBOL_TABLE is defined, the signatures of TracyVkContext, TracyVkContextCalibrated,
and TracyVkContextHostCalibrated are adjusted to take in the VkInstance, PFN_vkGetInstanceProcAddr,
and PFN_vkGetDeviceProcAddr to enable constructing a local symbol table to be used to call through the
Vulkan API when tracing.
3.9.3 Direct3D 11
To enable Direct3D 11 support, include the public/tracy/TracyD3D11.hpp header file, and create a
TracyD3D11Ctx object with the TracyD3D11Context(device, devicecontext) macro. The object should
later be cleaned up with the TracyD3D11Destroy macro. Tracy does not support D3D11 command lists. To
set a custom name for the context, use the TracyGpuContextName(name, size) macro.
To mark a GPU zone, use the TracyD3D11Zone(name) macro, where name is a string literal name of the
zone. Alternatively you may use TracyD3D11ZoneC(name, color) to specify zone color.
You also need to periodically collect the GPU events using the TracyD3D11Collect macro. An excellent
place to do it is after the swap chain present function.
3.9.4 Direct3D 12
To enable Direct3D 12 support, include the public/tracy/TracyD3D12.hpp header file. Tracing Direct3D 12
queues is nearly on par with the Vulkan implementation, where a TracyD3D12Ctx is returned from a call to
TracyD3D12Context(device, queue), which should be later cleaned up with the TracyD3D12Destroy(ctx)
macro. Multiple contexts can be created, each with any queue type. To set a custom name for the context,
use the TracyD3D12ContextName(ctx, name, size) macro.
The queue must have been created through the specified device; however, a command list is not needed
for this stage.
Using GPU zones is the same as the Vulkan implementation, where the TracyD3D12Zone(ctx, cmdList,
name) macro is used, with name as a string literal. TracyD3D12ZoneC(ctx, cmdList, name, color) can be
used to create a custom-colored zone. The given command list must be in an open state.
The macro TracyD3D12NewFrame(ctx) is used to mark a new frame, and should appear before or after
recording command lists, similar to FrameMark. This macro is a key component that enables automatic query
data synchronization, so the user doesn’t have to worry about synchronizing GPU execution before invoking
a collection. Event data can then be collected and sent to the profiler using the TracyD3D12Collect(ctx)
macro.
Note that GPU profiling may be slightly inaccurate due to artifacts from dynamic frequency scaling.
To counter this, ID3D12Device::SetStablePowerState() can be used to enable accurate profiling, at the
expense of some performance. If the machine is not in developer mode, the operating system will remove
the device upon calling. Do not use this in the shipping code.
Direct3D 12 contexts are always calibrated.
3.9.5 OpenCL
OpenCL support is achieved by including the public/tracy/TracyOpenCL.hpp header file. Tracing OpenCL
requires the creation of a Tracy OpenCL context using the macro TracyCLContext(context, device), which
will return an instance of TracyCLCtx object that must be used when creating zones. The specified device
must be part of the context. Cleanup is performed using the TracyCLDestroy(ctx) macro. Although not
common, it is possible to create multiple OpenCL contexts for the same application. To set a custom name
for the context, use the TracyCLContextName(ctx, name, size) macro.
To mark an OpenCL zone one must make sure that a valid OpenCL cl_event object is available. The
event will be the object that Tracy will use to query profiling information from the OpenCL driver. For this to
work, you must create all OpenCL queues with the CL_QUEUE_PROFILING_ENABLE property.
OpenCL zones can be created with the TracyCLZone(ctx, name) macro, where name will usually be a descriptive
name for the operation represented by the cl_event. Within the scope of the zone, you must call
TracyCLSetEvent(event) for the event to be registered in Tracy.
Similar to Vulkan and OpenGL, you also need to periodically collect the OpenCL events using the
TracyCLCollect(ctx) macro. An excellent place to perform this operation is after a clFinish since this will
ensure that any previously queued OpenCL commands will have finished by this point.
3.10 Fibers
Fibers are lightweight threads, which are not under the operating system’s control and need to be manually
scheduled by the application. As far as Tracy is concerned, there are other cooperative multitasking primitives,
like coroutines, or green threads, which also fall under this umbrella.
To enable fiber support in the client code, you will need to add the TRACY_FIBERS define to your project.
You need to do this explicitly, as there is a small performance hit due to additional processing.
To properly instrument fibers, you will need to modify the fiber dispatch code in your program. You
will need to insert the TracyFiberEnter(fiber) macro every time a fiber starts or resumes execution47. You
will also need to insert the TracyFiberLeave macro when the execution control in a thread returns to the
non-fiber part of the code. Note that you can safely call TracyFiberEnter multiple times in succession,
without an intermediate TracyFiberLeave if one fiber is directly switching to another, without returning
control to the fiber dispatch worker.
Fibers are identified by unique const char* string names. Remember that you should observe the rules
laid out in section 3.1.2 while handling such strings.
No additional instrumentation is needed in other parts of the code. Zones, messages, and other such
events will be properly attributed to the currently running fiber in its own separate track.
A straightforward example, which is not actually using any OS fiber functionality, is presented below:
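const char* fiber = "job1";
TracyCZoneCtx zone;

int main()
{
    std::thread t1([]{
        TracyFiberEnter(fiber);
        TracyCZone(ctx, 1);
        zone = ctx;
        sleep(1);
        TracyFiberLeave;
    });
    t1.join();

    std::thread t2([]{
        TracyFiberEnter(fiber);
        sleep(1);
        TracyCZoneEnd(zone);
        TracyFiberLeave;
    });
    t2.join();
}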
47 You can also provide fiber grouping hints, the same way as for threads, with the TracyFiberEnterHint(fiber, groupHint)
macro.
As you can see, there are two threads, t1 and t2, which are simulating worker threads that a real fiber
library would use. A C API zone is created in thread t1 and is ended in thread t2. Without the fiber markup,
this would be an invalid operation, but with fibers, the zone is attributed to fiber job1, and not to thread t1
or t2.
Table 5: Median times of zone capture with call stack. x86, x64: i7 8700K; ARM: Banana Pi; ARM64: ODROID-C2. Selected
architectures are plotted on figure 7
You can force call stack capture in the non-S postfixed macros by adding the TRACY_CALLSTACK define, set
to the desired call stack capture depth. This setting doesn’t affect the explicit call stack macros.
The maximum call stack depth that the profiler can retrieve is 62 frames. This is a restriction at the level
of the operating system.
Tracy will automatically exclude certain uninteresting functions from the captured call stacks. So, for
example, the pass-through intrinsic wrapper functions won’t be reported.
Figure 7: Plot of call stack capture times (see table 5), with time (ns) plotted against call stack depth for x64 and x86. Notice that the capture time grows linearly with requested capture depth.
Important!
Collecting call stack data will also trigger retrieval of profiled program’s executable code by the profiler.
See section 3.15.7 for details.
How to disable
Tracy will prepare for call stack collection regardless of whether you use the functionality or not. In
some cases, this may be unwanted or otherwise troublesome for the user. To disable support for
collecting call stacks, define the TRACY_NO_CALLSTACK macro.
libunwind
On some platforms you can define TRACY_LIBUNWIND_BACKTRACE to use libunwind to perform callstack
captures as it might be a faster alternative than the default implementation. If you do, you must
compile/link your client against libunwind. See https://github.com/libunwind/libunwind for more
details.
• On MSVC, open the project properties and go to Linker → Debugging → Generate Debug Info, where you
should select the Generate Debug Information option.
• On gcc or clang remember to specify the debugging information -g parameter during compilation and
do not add the strip symbols -s parameter. Additionally, omitting frame pointers will severely reduce
the quality of stack traces, which can be fixed by adding the -fno-omit-frame-pointer parameter.
Link the executable with an additional option -rdynamic (or --export-dynamic, if you are passing
parameters directly to the linker).
• On OSX, you may need to run dsymutil to extract the debugging data out of the executable binary.
• On iOS you will have to add a New Run Script Phase to your Xcode project, which shall execute the
following shell script:
cp -rf ${TARGET_BUILD_DIR}/${WRAPPER_NAME}.dSYM/* ${TARGET_BUILD_DIR}/${UNLOCALIZED_RESOURCES_FOLDER_PATH}/${PRODUCT_NAME}.dSYM
You will also need to set up proper dependencies, by setting the following input file:
${TARGET_BUILD_DIR}/${WRAPPER_NAME}.dSYM, and the following output file:
${TARGET_BUILD_DIR}/${UNLOCALIZED_RESOURCES_FOLDER_PATH}/${PRODUCT_NAME}.dSYM.
Windows In MSVC you can retrieve such symbols by going to Tools → Options → Debugging → Symbols and
selecting appropriate Symbol file (.pdb) location servers. Note that additional symbols may significantly
increase application startup times.
Libraries built with vcpkg typically provide PDB symbol files, even for release builds. Using vcpkg to
obtain libraries has the extra benefit that everything is built using local source files, which allows Tracy to
provide a source view not only of your application but also the libraries you use.
Unix On Linux48 information needed for debugging traditionally has been provided by special packages
named debuginfo, dbgsym, or similar. You can use them to retrieve symbols, but keep in mind the following:
1. Your distribution has to provide such packages. Not every distribution does.
2. Debug packages are usually stored in a separate repository, which you must manually enable.
3. You need to install a separate package for each library you want to have symbols for.
A modern alternative to installing static debug packages is to use the debuginfod system, which performs
on-demand delivery of debugging information across the internet. See https://sourceware.org/elfutils/
Debuginfod.html for more details. Since this new method of symbol delivery is not yet universally supported,
you will have to manually enable it, both in your system and in Tracy.
First, make sure your distribution maintains a debuginfod server. Then, install the debuginfod library. You
also need to ensure you have appropriately configured which server to access, but distribution maintainers
usually provide this. Next, add the TRACY_DEBUGINFOD define to the program you want to profile and link it
with libdebuginfod. This will enable network delivery of symbols and source file contents. However, the
first run (including after a system update) may be slow to respond until the local debuginfod cache becomes
filled.
1. Add a TRACY_DBGHELP_LOCK define, with the value set to prefix of lock-handling functions (for example:
TRACY_DBGHELP_LOCK=DbgHelp).
2. Create a dbghelp lock (i.e., mutex) in your application.
3. Provide a set of Init, Lock and Unlock functions, including the provided prefix name, which will
operate on the lock. These functions must be defined using the C linkage. Notice that there’s no
cleanup function.
4. Remember to protect access to dbghelp in your code appropriately!
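The steps above can be sketched as follows, assuming the client is built with TRACY_DBGHELP_LOCK=DbgHelp, so that the expected function names are DbgHelpInit, DbgHelpLock and DbgHelpUnlock. The void(void) signatures, the use of std::mutex, and the WithDbgHelpLock helper are assumptions made for this illustration and are not part of Tracy.

```cpp
#include <cassert>
#include <mutex>

// Step 2: the lock itself. A function-local static avoids depending on the
// initialization order of global objects.
static std::mutex& DbgHelpMutex()
{
    static std::mutex mutex;
    return mutex;
}

// Step 3: Init/Lock/Unlock functions with the "DbgHelp" prefix and C linkage.
// The parameterless void signatures are an assumption of this sketch.
extern "C" void DbgHelpInit() { (void)DbgHelpMutex(); }
extern "C" void DbgHelpLock() { DbgHelpMutex().lock(); }
extern "C" void DbgHelpUnlock() { DbgHelpMutex().unlock(); }

// Step 4: hypothetical helper that wraps the application's own dbghelp calls,
// so that they never run concurrently with the profiler's calls.
template <typename Fn>
auto WithDbgHelpLock(Fn&& fn) -> decltype(fn())
{
    DbgHelpLock();
    struct Unlocker { ~Unlocker() { DbgHelpUnlock(); } } unlocker;
    return fn();
}
```

Application code would then route every dbghelp call through the helper, for example WithDbgHelpLock([&]{ return SymFromAddr(process, address, &displacement, symbol); }).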
At initialization time, Tracy will attempt to preload symbols for device drivers and process modules. As
this process can be slow when a lot of PDBs are involved, you can set the TRACY_NO_DBGHELP_INIT_LOAD
environment variable to "1" to disable this behavior and rely on on-demand symbol loading.
Important
Beware that update will use any matching symbol file to the path it resolved to (no symbol version
checking is done), so if the symbol file doesn’t match the code that was used when doing the callstack
capturing you will get incorrect results.
Also note that in the case of using offline symbol resolving, even after running the update tool to
resolve symbols, the symbols statistics are not updated and will still report the unresolved symbols.
3.13 C API
To profile code written in the C programming language, you will need to include the public/tracy/TracyC.h
header file, which exposes the C API.
At the moment, there’s no support for C API based markup of locks, GPU zones, or Lua.
49While technically this name doesn’t need to be constant, like in the ZoneScopedN macro, it should be, as it is used to group the zones.
This grouping is then used to display various statistics in the profiler. You may still set the per-call name using the tracy.ZoneName
method.
Depth Time
1 707 ns
2 699 ns
3 624 ns
4 727 ns
5 836 ns
10 1.77 µs
15 2.44 µs
20 2.51 µs
25 2.98 µs
30 3.6 µs
35 4.33 µs
40 5.17 µs
45 6.01 µs
50 6.99 µs
55 8.11 µs
60 9.17 µs
Table 6: Median times of Lua zone capture with call stack (x64, 13 native frames)
Important
Tracy is written in C++, so you will need to have a C++ compiler and link with C++ standard library,
even if your program is strictly pure C.
• TracyCFrameMark
• TracyCFrameMarkNamed(name)
• TracyCFrameMarkStart(name)
• TracyCFrameMarkEnd(name)
• TracyCZone(ctx, active)
(Figure: plot of time (µs) against call stack depth.)
Refer to sections 3.4 and 3.4.2 for description of macro variants and parameters. The ctx parameter
specifies the name of a data structure, which the macro will create on the stack to hold the internal zone data.
Unlike C++, there’s no automatic destruction mechanism in C, so you will need to mark where the zone
ends manually. To do so use the TracyCZoneEnd(ctx) macro.50
Zone text and name may be set by using the TracyCZoneText(ctx, txt, size), TracyCZoneValue(ctx,
value) and TracyCZoneName(ctx, txt, size) macros. Make sure you are following the zone stack rules,
as described in section 3.4.2!
In typical use cases the zone context data structure is hidden from your view, and you only need to specify its
name for the TracyCZone and TracyCZoneEnd macros. However, it is possible to use it in advanced scenarios,
for example, if you want to start a zone in one function, then end it in another one. To do so, you will
need to forward the data structure either through a function parameter or as a return value, or place it in a
thread-local stack structure. To accomplish this, you need to keep in mind the following rules:
• The created variable name is exactly what you pass as the ctx parameter.
• Contents of the data structure can be copied by assignment. Do not retrieve or use the structure’s
address – this is asking for trouble.
• You must use the data structure (or any of its copies) exactly once to end a zone.
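The rules above can be illustrated with a minimal sketch in which one function starts a zone and another one ends it. The function names BeginJobZone, EndJobZone and RunJob are hypothetical, and the tracy/TracyC.h include path assumes that the public directory is on the include path. The stand-in definitions in the #else branch exist only so that the sketch builds without Tracy; they are not Tracy's definitions.

```cpp
#include <cassert>

#if defined(TRACY_ENABLE) && __has_include(<tracy/TracyC.h>)
#  include <tracy/TracyC.h>
#else
// Stand-ins so that the sketch builds without Tracy; NOT Tracy's definitions.
struct TracyCZoneCtx { int active; };
#  define TracyCZone(ctx, active) TracyCZoneCtx ctx = { active }
#  define TracyCZoneEnd(ctx) ((void)(ctx))
#endif

// Starts a zone and hands its context to the caller by value.
TracyCZoneCtx BeginJobZone()
{
    TracyCZone(ctx, 1);   // creates a local variable named exactly "ctx"
    return ctx;           // copied by value; its address is never taken
}

// Ends the zone. Must be called exactly once per started zone.
void EndJobZone(TracyCZoneCtx zone)
{
    TracyCZoneEnd(zone);
}

// Hypothetical job that is profiled across the two functions above.
int RunJob()
{
    TracyCZoneCtx zone = BeginJobZone();
    int result = 6 * 7;   // placeholder for the profiled work
    EndJobZone(zone);
    return result;
}
```

Returning the context by value keeps the copy explicit, and the single EndJobZone call per BeginJobZone call mirrors the requirement that a zone is ended exactly once.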
Since all C API instrumentation has to be done by hand, it is possible to miss some code paths where a
zone should be started or ended. Tracy will perform additional validation of instrumentation correctness to
prevent bad profiling runs. Read section 4.7 for more information.
50 GCC and Clang provide __attribute__((cleanup)) which can be used to run a function when a variable goes out of scope.
However, the validation comes with a performance cost, which you may not want to pay. Therefore, if
you are entirely sure that the instrumentation is not broken in any way, you may use the TRACY_NO_VERIFY
macro, which will disable the validation code.
There is no explicit support for transient zones (section 3.4.4) in the C API macros. However, this functionality
can be implemented by following instructions outlined in section 3.13.11.
• TracyCLockAnnounce(lock_ctx)
• TracyCLockTerminate(lock_ctx)
• TracyCLockBeforeLock(lock_ctx)
• TracyCLockAfterLock(lock_ctx)
• TracyCLockAfterUnlock(lock_ctx)
• TracyCLockAfterTryLock(lock_ctx, acquired)
• TracyCLockMark(lock_ctx)
Additionally, a lock context has to be defined next to the lock that it will be marking.
To initialize the lock context, use TracyCLockAnnounce. This should be done when the lock you are marking
is initialized or created. When the lock is destroyed, use TracyCLockTerminate, which will free the lock context.
You can use the TracyCLockCustomName macro to name a lock.
You must markup both before and after acquiring a lock:
TracyCLockBeforeLock(tracy_lock_ctx);
WaitForSingleObject(lock, INFINITE);
TracyCLockAfterLock(tracy_lock_ctx);
If acquiring the lock may fail, you should instead use the TracyCLockAfterTryLock macro:
TracyCLockBeforeLock(tracy_lock_ctx);
int acquired = WaitForSingleObject(lock, 200) == WAIT_OBJECT_0;
TracyCLockAfterTryLock(tracy_lock_ctx, acquired);
ReleaseMutex(lock);
TracyCLockAfterUnlock(tracy_lock_ctx);
You can optionally mark the location where the lock is held by using the TracyCLockMark macro. This
should be done after acquiring the lock.
• TracyCAlloc(ptr, size)
• TracyCFree(ptr)
• TracyCSecureAlloc(ptr, size)
• TracyCSecureFree(ptr)
Correctly using this functionality can be pretty tricky. You also will need to handle all the memory
allocations made by external libraries (which typically allow usage of custom memory allocation functions)
and the allocations made by system functions. If you can’t track such an allocation, you will need to make
sure freeing is not reported51.
There is no explicit support for the realloc function. You will need to handle it by marking memory
allocations and frees, according to the system manual describing the behavior of this routine.
Memory pools (section 3.8.1) are supported through macros with N postfix.
For more information about memory profiling, refer to section 3.8.
• TracyCPlot(name, val)
• TracyCPlotF(name, val)
• TracyCPlotI(name, val)
• TracyCMessage(txt, size)
• TracyCMessageL(txt)
• TracyCMessageLC(txt, color)
• TracyCAppInfo(txt, size)
3.13.8 Fibers
Fibers are available in the C API through the TracyCFiberEnter and TracyCFiberLeave macros. To use
them, you should observe the requirements listed in section 3.10.
To query the connection status (section 3.18) using the C API you should use the TracyCIsConnected macro.
You can collect call stacks of zones and memory allocation events, as described in section 3.11, by using
macros with S postfix, such as: TracyCZoneS, TracyCZoneNS, TracyCZoneCS, TracyCZoneNCS, TracyCAllocS,
TracyCFreeS, and so on.
The Tracy C API exposes functions with the ___tracy prefix that you may use to write bindings to other
programming languages. Most of the functions available are a counterpart to macros described in
section 3.13. However, some functions do not have macro equivalents and are dedicated expressly for binding
implementation purposes. This includes the following:
• ___tracy_startup_profiler(void)
• ___tracy_shutdown_profiler(void)
Here line is the line number in the source file and function is the name of the function in which the
zone is created. sourceSz and functionSz are the size of the corresponding string arguments in bytes. You
may additionally specify an optional zone name, by providing it in the name variable, and specifying its size
in nameSz.
The ___tracy_alloc_srcloc and ___tracy_alloc_srcloc_name functions return a uint64_t source
location identifier corresponding to an allocated source location. As these functions do not require the provided
string data to be available after they return, the calling code is free to deallocate them at any time afterward.
This way, the string lifetime requirements described in section 3.1 are relaxed.
The uint64_t return value from allocation functions must be passed to one of the zone begin functions:
• ___tracy_emit_zone_begin_alloc(srcloc, active)
These functions return a TracyCZoneCtx context value, which must be handled, as described in
sections 3.13.3 and 3.13.3.1.
The variable representing an allocated source location is of an opaque type. After it is passed to one of the
zone begin functions, its value cannot be reused (the variable is consumed). You must allocate a new source
location for each zone begin event, even if the location data would be the same as in the previous instance.
Important
Since you are directly calling the profiler functions here, you will need to take care of manually disabling
the code if the TRACY_ENABLE macro is not defined.
3.14.1 Bindings
An example of how to use the Tracy-Client bindings is shown below:
import numpy as np

def main():
    assert tracy.program_name("MyApp")
    assert tracy.app_info("this is a python app")
    plot_id = tracy.plot_config("plot", tracy.PlotFormatType.Number)
    assert plot_id is not None
    mem_id = None
    index = 0
    while True:
        with tracy.ScopedZone(name="test", color=tracy.ColorType.Coral) as zone:
            index += 1
            tracy.frame_mark()
            if index % 2:
                tracy.alloc(44, index)
            else:
                tracy.free(44)
            if not index % 2:
                if mem_id is None:
                    mem_id = tracy.alloc(1337000000, index, name="named", depth=4)
                    assert mem_id is not None
                else:
                    tracy.alloc(1337000000, index, id=mem_id, depth=4)
            else:
                tracy.free(1337000000, mem_id, 4)
            inner.exit()
        work()
        sleep(0.1)
Please note the use of ids as a way to cope with the need for unique pointers for certain features of the Tracy
profiler, see section 3.1.2.
• EXTERNAL_PYBIND11 — Can be used to disable the download of pybind11 when Tracy is embedded in
another CMake project that already uses pybind11.
• BUFFER_SIZE — The size of the global pointer buffer (defaults to 128) for naming Tracy profiling entities
like frame marks, plots, and memory locations.
• NAME_LENGTH — The maximum length (defaults to 128) of a name stored in the global pointer buffer.
Be aware that the memory allocated by this buffer is global and is not freed, see section 3.1.2.
See below for example steps to build the Python bindings using CMake:
mkdir build
cd build
cmake -DTRACY_STATIC=OFF -DTRACY_CLIENT_PYTHON=ON ../
make -j$(nproc)
Once this has finished building, the Python package can be built as follows:
cd ../ python
python3 setup.py bdist_wheel
52 To make this easier, you can run MSVC with admin privileges, which will be inherited by your program when you start it from
Important
In this manual, the word core is typically used as a short term for logical CPU. Please do not confuse it
with physical processor cores.
1. Random preemptive multitasking events, which are expected and do not have any significance.
2. Expected waits, which may be caused by issuing sleep commands, waiting for a lock to become
available, performing I/O, and so on. Quantitative analysis of such events may (but probably won’t)
direct you to some problems in your code.
3. Unexpected waits, which should be immediately taken care of. After all, what’s the point of profiling
and optimizing your program if it is constantly waiting for something? An example of such an
unexpected wait may be some anti-virus service interfering with each of your file read operations. In
this case, you could have assumed that the system would buffer a large chunk of the data after the first
read to make it immediately available to the application in the following calls.
Platform differences
Wait stack captures happen at different times on the supported operating systems due to differences in
the implementation details. For example, on Windows, the stack capture will occur when the program
execution is resumed. However, on Linux, the capture will happen when the scheduler decides to
preempt execution.
While the call stack sampling is a generic software-implemented functionality of the operating system, there’s
another way of sampling program execution patterns. Modern processors host a wide array of different
hardware performance counters, which increase when some event in a CPU core happens. These could be as
simple as counting each clock cycle or as implementation-specific as counting ’retired instructions that are
delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles’.
Tracy can use these counters to present you the following three statistics, which may help guide you in
discovering why your code is not as fast as possible:
1. Instructions Per Cycle (IPC) – shows how many instructions were executing concurrently within a single
core cycle. Higher values are better. The maximum achievable value depends on the design of the CPU,
including things such as the number of execution units and their individual capabilities. Calculated as
$\frac{\#\text{instructions retired}}{\#\text{cycles}}$. You can disable it with the TRACY_NO_SAMPLE_RETIREMENT macro.
2. Branch miss rate – shows how frequently the CPU branch predictor makes a wrong choice. Lower values
are better. Calculated as $\frac{\#\text{branch misses}}{\#\text{branch instructions}}$. You can disable it with the TRACY_NO_SAMPLE_BRANCH macro.
3. Cache miss rate – shows how frequently the CPU has to retrieve data from memory. Lower values are
better. The specifics of which cache level is taken into account here vary from one implementation to
another. Calculated as $\frac{\#\text{cache misses}}{\#\text{cache references}}$. You can disable it with the TRACY_NO_SAMPLE_CACHE macro.
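As an illustration of the three formulas, the following sketch computes the statistics from a set of sampled event counts. The CounterSamples structure and the function names are hypothetical and do not correspond to any Tracy type; the zero-denominator handling is also an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for sampled event counts; not a Tracy type.
struct CounterSamples
{
    std::uint64_t cycles;
    std::uint64_t instructionsRetired;
    std::uint64_t branchInstructions;
    std::uint64_t branchMisses;
    std::uint64_t cacheReferences;
    std::uint64_t cacheMisses;
};

// Returns numerator / denominator, or 0 when the denominator has no samples.
static double Ratio(std::uint64_t numerator, std::uint64_t denominator)
{
    return denominator == 0 ? 0.0 : double(numerator) / double(denominator);
}

// IPC = #instructions retired / #cycles (higher is better).
double Ipc(const CounterSamples& s) { return Ratio(s.instructionsRetired, s.cycles); }

// Branch miss rate = #branch misses / #branch instructions (lower is better).
double BranchMissRate(const CounterSamples& s) { return Ratio(s.branchMisses, s.branchInstructions); }

// Cache miss rate = #cache misses / #cache references (lower is better).
double CacheMissRate(const CounterSamples& s) { return Ratio(s.cacheMisses, s.cacheReferences); }
```

Because each statistic combines two separately sampled counters, nothing in these formulas prevents a ratio above 1 when the numerator happens to be sampled more often than the denominator.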
Each performance counter has to be collected by a dedicated Performance Monitoring Unit (PMU).
However, the availability of PMUs is very limited, so you may not be able to capture all the statistics
mentioned above at the same time (as each requires capture of two different counters). In such a case, you
will need to manually select what needs to be sampled with the macros specified above.
If the provided measurements are not specific enough for your needs, you will need to use a profiler
better tailored to the hardware you are using, such as Intel VTune, or AMD µProf.
Another problem to consider here is the measurement skid. It is pretty hard to accurately pinpoint the
exact assembly instruction which has caused the counter to trigger. Due to this, the results you’ll get may
look a bit nonsensical at times. For example, a branch miss may be attributed to the multiply instruction.
Unfortunately, not much can be done with that, as this is exactly what the hardware is reporting. The amount
of skid you will encounter depends on the specific implementation of a processor, and each vendor has its
own solution to minimize it. Intel uses Precise Event Based Sampling (PEBS), which is rather good, but it
still can, for example, blend the branch statistics across the comparison instruction and the following jump
instruction. AMD employs its own Instruction Based Sampling (IBS), which tends to provide worse results
in comparison.
Do note that the statistics presented by Tracy are a combination of two randomly sampled counters, so
you should take them with a grain of salt. The random nature of sampling56 makes it entirely possible
to count more branch misses than branch instructions or some other similar silliness. You should always
cross-check this data with the count of sampled events to decide if you can reliably act upon the provided
values.
56 The hardware counters in practice can be triggered only once per million-or-so events happening.
Availability Currently, the hardware performance counter readings are only available on Linux, which
also includes the WSL2 layer on Windows57. Access to them is performed using the kernel-provided
infrastructure, so what you get may depend on how your kernel was configured. This also means that the
exact set of supported hardware is not known, as it depends on what has been implemented in Linux itself.
At this point, the x86 hardware is fully supported (including features such as PEBS or IBS), and there’s PMU
support on a selection of ARM designs. The performance counter data can be captured with no need for
privilege elevation.
Tracy will capture small chunks of the executable image during profiling to enable deep insight into program
execution. The retrieved code can be subsequently disassembled to be inspected in detail. The profiler will
perform this functionality only for functions no larger than 128 KB and only if symbol information is present.
The discovery of previously unseen executable code may result in reduced performance of real-time
capture. This is especially true when the profiling session had just started. However, such behavior is
expected and will go back to normal after several moments.
It would be best to be extra careful when working with non-public code, as parts of your program will be
embedded in the captured trace. You can disable the collection of program code by compiling the profiled
application with the TRACY_NO_CODE_TRANSFER define. You can also strip the code from a saved trace using
the update utility (section 4.5.4).
Important
For proper program code retrieval, you must not unload any module used by the application during
runtime. See section 3.1.1 for an explanation.
On Linux, Tracy will override the dlclose function call to prevent shared objects from being
unloaded. Note that in a well-behaved program this shouldn’t have any effect, as calling dlclose does
not guarantee that the shared object will be unloaded.
On Windows and Linux, Tracy will automatically capture hardware Vsync events, provided that the
application has access to the kernel data (privilege elevation may be needed, see section 3.15.1). These
events will be reported as ’[x] Vsync’ frame sets, where x is the identifier of a specific monitor. Note that
hardware vertical synchronization might not correspond to the one seen by your application due to desktop
composition, command queue buffering, and so on. Also, in some instances, when there is nothing to update
on the screen, the graphic driver may choose to stop issuing screen refresh. As a result, there may be periods
where no vertical synchronization events are reported.
Use the TRACY_NO_VSYNC_CAPTURE macro to disable capture of Vsync events.
57 You may need Windows 11 and the WSL preview from Microsoft Store for this to work.
The data parameter will have the same value as was specified in the macro. The idx argument is a
user-defined parameter index and val is the value set in the profiler user interface.
To specify individual parameters, use the TracyParameterSetup(idx, name, isBool, val) macro. The
idx value will be passed to the callback function for identification purposes (Tracy doesn’t care what it’s set
to). Name is the parameter label, displayed on the list of parameters. Finally, isBool determines if val should
be interpreted as a boolean value, or as an integer number.
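A callback that dispatches on the parameter index could look like the sketch below. The RenderSettings structure, the index constants, and the OnTraceParameter name are hypothetical. The callback shape (data, idx, val) with uint32_t and int32_t types, and the TracyParameterRegister macro named in the comment, are assumptions of this sketch and should be checked against the Tracy headers you build with.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical application settings controlled from the profiler UI.
struct RenderSettings
{
    bool wireframe = false;
    std::int32_t shadowQuality = 2;
};

// Hypothetical parameter indices; they are only passed back to the callback.
enum : std::uint32_t { ParamWireframe = 0, ParamShadowQuality = 1 };

// Assumed callback shape: user data pointer, parameter index, new value.
void OnTraceParameter(void* data, std::uint32_t idx, std::int32_t val)
{
    auto* settings = static_cast<RenderSettings*>(data);
    switch (idx)
    {
    case ParamWireframe:     settings->wireframe = (val != 0); break;  // boolean parameter
    case ParamShadowQuality: settings->shadowQuality = val;    break;  // integer parameter
    default: break;  // unknown index: ignore
    }
}

// With Tracy available, the parameters would be announced along these lines
// (assumed usage, shown as comments to keep the sketch self-contained):
//   TracyParameterRegister(OnTraceParameter, &settings);
//   TracyParameterSetup(ParamWireframe, "Wireframe", true, 0);
//   TracyParameterSetup(ParamShadowQuality, "Shadow quality", false, 2);
```

Keeping the indices in one enumeration makes it harder for the setup calls and the callback to drift apart.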
Important
Usage of trace parameters makes profiling runs dependent on user interaction with the profiler, and
thus it’s not recommended to be employed if a consistent profiling environment is desired. Furthermore,
interaction with the parameters is only possible in the graphical profiling application but not in the
command line capture utility.
char* Callback(void* data, const char* filename, size_t& size)
The data parameter will have the same value as was specified in the macro. The filename parameter
contains the file name of the queried source file. Finally, the size parameter is used only as an out-value and
does not contain any functional data.
The return value must be nullptr if the input file name is not accessible to the client application. If the
file can be accessed, then the data size must be stored in the size parameter, and the file contents must be
returned in a buffer allocated with the tracy::tracy_malloc_fast(size) function. Buffer contents do not
need to be null-terminated. If for some reason the already allocated buffer can no longer be used, it must be
freed with the tracy::tracy_free_fast(ptr) function.
Transfer of source files larger than some unspecified, but reasonably large58 threshold won’t be performed.
(Figure: profiler window with an address entry field, Connect and Open trace buttons, and a list of discovered clients, for example 127.0.0.1 | 21 s | Application.)
Both connecting to a client and opening a saved trace will present you with the main profiler view, which
you can use to analyze the data (see section 5).
Once connected to a client, Ctrl + Shift + Alt + R can be used to quickly discard any captured data and
reconnect to a client at the same address.
If frame image capture has been implemented (chapter 3.3.3), a thumbnail of the last received frame
image will be provided for reference.
Suppose the profiled application opted to provide trace parameters (see section 3.16) and the connection
is still active. In that case, this pop-up will also contain a trace parameters section, listing all the provided
options. A callback function will be executed on the client when you change any value here.
62 You should take this literally. If a live capture is in progress and a save is performed, some data may be missing from the capture
set of events.
The new file contains the same data as the old one but with an updated internal representation. Note that
the whole trace needs to be loaded to memory to perform an upgrade.
Trace files created using the lz4, lz4 hc and lz4 extreme modes are optimized for fast decompression and
can be further compressed using file compression utilities. For example, using 7-zip results in archives of the
following sizes: 77.2 MB, 54.3 MB, 52.4 MB.
For archival purposes, it is, however, much better to use the zstd compression modes, which are
faster, compress trace files more tightly, and are directly loadable by the profiler, without the intermediate
decompression step.
Figure 11: Plot of trace sizes for different compression modes (see table 7).
Figure 12: Logarithmic plot of trace compression times (s) for different compression modes (see table 7).
Figure 13: Plot of trace load times (ms) for the zstd, lz4, lz4 hc and lz4 extreme compression modes (see table 7).
Mode       4 streams   8 streams   16 streams   32 streams
lz4        100.30%     100.30%     100.61%      102.73%
lz4 hc     100.80%     101.20%     101.61%      102.41%
lz4 ext    100.40%     101.21%     101.62%      102.02%
zstd 1     100.90%     101.36%     101.81%      102.26%
zstd 3     100.51%     101.02%     101.53%      102.04%
zstd 6     100.55%     101.10%     101.65%      102.75%
zstd 9     101.27%     103.16%     105.06%      108.23%
zstd 18    103.08%     106.15%     109.23%      115.38%
zstd 22    107.08%     113.27%     122.12%      130.97%
Table 8: The increase in file size for different compression modes, as compared to a single stream.
Mode       4 streams   8 streams   16 streams   32 streams
lz4        2.04        2.52        2.11         3.24
lz4 hc     3.56        6.73        9.49         15.26
lz4 ext    3.38        6.53        9.57         17.03
zstd 1     2.24        3.68        3.40         3.37
zstd 3     3.23        4.13        4.07         4.50
zstd 6     3.52        6.00        6.53         6.95
zstd 9     3.10        4.26        5.12         5.40
zstd 18    3.22        5.41        8.49         14.51
zstd 22    3.99        7.47        11.10        18.20
Table 9: The speedup (x times faster) in saving time for different modes of compression, as compared to a single stream.
level 22) have significantly worse compression rates when the work is divided. This is a fairly nuanced topic,
and you are encouraged to do your own measurements, but for a rough guideline on the behavior, you can
refer to tables 8 and 9.
• l – locks.
• m – messages.
• p – plots.
• M – memory.
• i – frame images.
• c – context switches.
• s – sampling data.
• C – symbol code.
Flags can be concatenated. For example, specifying -s CSi will remove symbol code, source file cache,
and frame images from the destination trace file.
Timeline view
Figure 14: Main profiler window. Note that this manual has split the top line of buttons into two rows.
• Connection – Opens the connection information popup (see section 4.2.1). Only available when a live
capture is in progress.
• Close – This button unloads the current profiling trace and returns to the welcome menu, where
another trace can be loaded. In live captures, it is replaced by the Pause, Resume, and □ Stopped buttons.
• Pause – While a live capture is in progress, the profiler will display recent events, as either the last
three fully captured frames, or a certain time range. You can use this to see the current behavior of the
program. The pause button65 will stop the automatic updates of the timeline view (the capture will
still be progressing).
• Resume – This button resumes following the most recent events in a live capture. You can
select one of the following options: Newest three frames or Use current zoom level.
• □ Stopped – Inactive button used to indicate that the client application was terminated.
• Messages – Toggles the message log window (section 5.5), which displays custom messages sent by
the client, as described in section 3.7.
• Find zone – This button toggles the find zone window, which allows inspection of zone behavior
statistics (section 5.7).
• Statistics – Toggles the statistics window, which displays zones sorted by their total time cost
(section 5.6).
• Memory – Various memory profiling options may be accessed here (section 5.9).
65 Or perform any action on the timeline view, apart from changing the zoom level.
• Compare – Toggles the trace compare window, which allows you to see the performance difference
between two profiling runs (section 5.8).
• Tools – Allows access to optional data collected during capture. Some choices might be unavailable.
– Playback – If frame images were captured (section 3.3.3), you will have the option to open the frame
image playback window, described in chapter 5.19.
– CPU data – If context switch data was captured (section 3.15.3), this button lets you inspect
the processor load during the capture, as described in section 5.20.
– Annotations – If annotations have been made (section 5.3.1), you can open a list of all annotations,
described in chapter 5.22.
– Limits – Displays the time range limits window (section 5.3).
– Wait stacks – If sampling was performed, an option to display wait stacks may be available. See
chapter 3.15.5.1 for more details.
• Display scale – Enables run-time resizing of the displayed content. This may be useful in environments
with potentially reduced visibility, e.g. during a presentation. Note that this setting is independent of
the UI scaling coming from the system DPI settings.
The frame information block66 consists of four elements: the current frame set name along with the
number of captured frames (click on it with the left mouse button to go to a specified frame), the two
navigational buttons, which allow you to focus the timeline view on the previous or next frame, and
the frame set selection button, which is used to switch to another frame set67. For more information about
marking frames, see section 3.3.
The following three items show the view time range, the time span of the whole capture (clicking on it
with the middle mouse button will set the view range to the entire capture), and the memory usage of
the profiler.
• – At least one timeline item (e.g. a single thread, a single plot, a single lock, etc.) is hidden.
Each bar displayed on the graph represents a unique frame in the current frame set68. Time progresses
from left to right. The bar height indicates the time spent in the frame, complemented by the
color information, which depends on the target FPS value. You can set the desired FPS in the options menu
(see section 5.4).
• If the bar is blue, then the frame met the best time of twice the target FPS (represented by the green
target line).
• If the bar is green, then the frame met the good time of target FPS (represented by the yellow line).
• If the bar is yellow, then the frame met the bad time of half the FPS (represented by the red target line).
• If the bar is red, then the frame didn’t meet any time limits.
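The color rules above can be sketched as a small classification function. This is only an illustration of the thresholds described in the list, not the profiler's actual code: the names FrameColor and ClassifyFrame are hypothetical, and treating a frame time exactly equal to a limit as meeting that limit is an assumption.

```cpp
#include <cassert>

// Hypothetical names; this mirrors the color rules listed above and is
// not taken from the Tracy sources.
enum class FrameColor { Blue, Green, Yellow, Red };

// frameTimeMs: measured frame time in milliseconds.
// targetFps:   desired FPS, as set in the options menu.
inline FrameColor ClassifyFrame(double frameTimeMs, double targetFps)
{
    assert(targetFps > 0.0);
    const double targetMs = 1000.0 / targetFps;  // frame time at the target FPS

    // "Best" time: the frame fits in the budget of twice the target FPS.
    if (frameTimeMs <= targetMs * 0.5) return FrameColor::Blue;
    // "Good" time: the frame fits in the budget of the target FPS.
    if (frameTimeMs <= targetMs) return FrameColor::Green;
    // "Bad" time: the frame fits in the budget of half the target FPS.
    if (frameTimeMs <= targetMs * 2.0) return FrameColor::Yellow;
    // No time limit was met.
    return FrameColor::Red;
}
```

For a 60 FPS target, the three limits are roughly 8.3 ms, 16.7 ms, and 33.3 ms.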
The frames visible on the timeline are marked with a violet box drawn over them.
When a zone is displayed in the find zone window (section 5.7), the coloring of frames may be changed,
as described in section 5.7.2.
Moving the mouse cursor over the frames displayed on the graph will display a tooltip with information
about frame number, frame time, frame image (if available, see chapter 3.3.3), etc. Such tooltips are common
for many UI elements in the profiler and won’t be mentioned later in the manual.
You may focus the timeline view on the frames by clicking or dragging the left mouse button on the
graph. The graph may be scrolled left and right by dragging the right mouse button over the graph.
Finally, you may zoom the view in and out by using the mouse wheel. If the view is zoomed out, so that
multiple frames are merged into one column, the profiler will use the highest frame time to represent the
given column.
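The column merging described above can be sketched as follows. This is a minimal illustration, assuming a fixed number of frames per column. The helper name MergeFrameColumns is hypothetical and not part of Tracy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: collapses consecutive frame times into columns of
// `framesPerColumn` frames each, keeping the highest frame time per column.
inline std::vector<double> MergeFrameColumns(const std::vector<double>& frameTimes,
                                             std::size_t framesPerColumn)
{
    assert(framesPerColumn > 0);
    std::vector<double> columns;
    for (std::size_t i = 0; i < frameTimes.size(); i += framesPerColumn)
    {
        // The last column may hold fewer frames than the others.
        const std::size_t end = std::min(i + framesPerColumn, frameTimes.size());
        columns.push_back(*std::max_element(frameTimes.begin() + i,
                                            frameTimes.begin() + end));
    }
    return columns;
}
```

Using the maximum rather than the mean keeps slow frames visible when the view is zoomed out.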
Clicking the left mouse button on the graph while the Ctrl key is pressed will open the frame image
playback window (section 5.19) and set the playback to the selected frame. See section 3.3.3 for more
information about frame images.
Collapsed items
Due to extreme differences in time scales, you will almost constantly see events too
small to be displayed on the screen. Such events have a preset minimum size (so they can be seen) and are
marked with a zig-zag pattern to indicate that you need to zoom in to see more detail.
The zig-zag pattern can be seen applied to frame sets in figure 17 and to zones in figure 18.
+13.76 s 20 µs 40 µs 60 µs 80 µs 100 µs
The leftmost value on the scale represents when the timeline starts. The rest of the numbers label the
notches on the scale, with some numbers omitted if there’s no space to display them.
Hovering the mouse pointer over the time scale will display a tooltip with the exact timestamp at the
position of the mouse cursor.
In figure 17 we can see the fully described frames 312 and 347. The description consists of the frame
name, which is Frame for the default frame set (section 3.3) or the name you used for the secondary name set
(section 3.3.1), the frame number, and the frame time. Since frame 348 is too small to be fully labeled, only
the frame time is shown. On the other hand, frame 349 is even smaller, with no space for any text. Moreover,
frames 313 to 346 are too small to be displayed individually, so they are replaced with a zig-zag pattern, as
described in section 5.2.3.
You can also see that frame separators are projected down onto the rest of the timeline view. Note that only the
separators for the currently selected frame set are displayed. You can make a frame set active by clicking the
left mouse button on the frame set row you want to select (also see section 5.2.1).
Clicking the middle mouse button on a frame will zoom the view to the extent of the frame.
If a frame has an associated frame image (see chapter 3.3.3), you can hold the Ctrl key and click the left
mouse button on the frame to open the frame image playback window (see chapter 5.19) and set the playback
to the selected frame.
If the Draw frame targets option is enabled (see section 5.4), time regions in frames exceeding the set
target value will be marked with a red background.
Figure 18 (image not reproduced). Labels visible in the figure: Main thread, Update, Render, Physics, Physics lock, Streaming thread.
• Light blue label – GPU context. Multi-threaded Vulkan, OpenCL, and Direct3D 12 contexts are
additionally split into separate threads.
• White label – A CPU thread. It will be replaced by a bright red label in a thread that has crashed
(section 2.5). If automated sampling was performed, clicking the left mouse button on the ghost
zones button will switch zone display mode between ’instrumented’ and ’ghost.’
• Green label – Fiber, coroutine, or any other sort of cooperative multitasking ’green thread.’
Labels accompanied by a collapse symbol can be collapsed out of the view to reduce visual clutter. Hover
the mouse pointer over the label to display additional information. Click the middle mouse button on a
title to zoom the view to the extent of the label contents. Finally, click the right mouse button on a label to
display the context menu with available actions:
• Hide – Hides the label along with the content associated with it. To make the label visible again, you
must find it in the options menu (section 5.4).
Zones
In the example in figure 18, you can see that there are two threads: Main thread and Streaming
thread69. We can see that the Main thread has two root level zones visible: Update and Render. The Update
zone is split into further sub-zones, some of which are too small to be displayed at the current zoom level.
This is indicated by drawing a zig-zag pattern over the merged zones box (section 5.2.3), with the number of
collapsed zones printed in place of the zone name. We can also see that the Physics zone acquires the Physics
lock mutex for most of its run time.
Meanwhile, the Streaming thread is performing some Streaming jobs. The first Streaming job sent a message
(section 3.7). In addition to being listed in the message log, it is indicated by a triangle over the thread
separator. When multiple messages are in one place, the triangle outline shape changes to a filled triangle.
The GPU zones are displayed just like CPU zones, with an OpenGL/Vulkan/Direct3D/OpenCL context
in place of a thread name.
69 By clicking on a thread name, you can temporarily disable the display of the zones in this thread.
Hovering the mouse pointer over a zone will highlight all other zones that have the same source location
with a white outline. Clicking the left mouse button on a zone will open the zone information window
(section 5.13). Holding the Ctrl key and clicking the left mouse button on a zone will open the zone
statistics window (section 5.7). Clicking the middle mouse button on a zone will zoom the view to the
extent of the zone.
Ghost zones
You can enable the view of ghost zones (not pictured in figure 18, but similar to the standard
zones view) by clicking on the ghost zones icon next to the thread label, available if automated sampling
(see chapter 3.15.5) was performed. Ghost zones will also be displayed by default if no instrumented zones
are available for a given thread to help with pinpointing functions that should be instrumented.
Ghost zones represent true function calls in the program, periodically reported by the operating system.
Due to the limited sampling resolution, you need to take great care when looking at reported timing data.
While it may be apparent that some small function requires a relatively long time to execute, for example,
125 µs (8 kHz sampling rate), in reality, this time represents a period between taking two distinct samples,
not the actual function run time. Similarly, two (or more) separate function calls may be represented as a
single ghost zone because the profiler doesn’t have the information needed to know about the actual lifetime
of a sampled function.
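The arithmetic behind the 125 µs figure, and the uncertainty it implies, can be sketched as below. This is an illustrative calculation only. The names SamplingPeriodUs, DurationBoundsUs, and GhostZoneBounds are hypothetical, and the bounds assume a single uninterrupted function call that was observed in a run of consecutive samples.

```cpp
#include <cassert>

// Time between two consecutive samples, in microseconds, for a sampling
// frequency given in Hz.
inline double SamplingPeriodUs(double samplingHz)
{
    assert(samplingHz > 0.0);
    return 1e6 / samplingHz;
}

// Hypothetical helper type: possible real run time of a sampled function.
struct DurationBoundsUs
{
    double atLeast;   // inclusive lower bound
    double lessThan;  // exclusive upper bound
};

// Assumption: one uninterrupted call was seen in `consecutiveSamples`
// samples in a row and in neither the sample before nor the sample after.
inline DurationBoundsUs GhostZoneBounds(int consecutiveSamples, double samplingHz)
{
    assert(consecutiveSamples >= 1);
    const double period = SamplingPeriodUs(samplingHz);
    return { (consecutiveSamples - 1) * period, (consecutiveSamples + 1) * period };
}
```

At 8 kHz, a function caught by a single sample may have run for anything from almost no time up to just under 250 µs, which is why the reported timing should be read with care.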
Another common pitfall to watch for is the order of presented functions. It is not what you expect it to be!
Read chapter 5.14.1 for critical insight on how call stacks might seem nonsensical at first and why they aren’t.
The available information about ghost zones is quite limited, but it’s enough to give you a rough outlook
on the execution of your application. The timeline view alone is more than any other statistical profiler
can present. In addition, Tracy correctly handles inlined function calls, which are indicated by a darker
background of ghost zones. Lastly, zones representing kernel-mode functions are displayed with red function
names.
Clicking the left mouse button on a ghost zone will open the corresponding source file location, if possible
(see chapter 5.16 for conditions). There are three ways in which source locations can be assigned to a ghost
zone:
1. If the selected ghost zone is not an inline frame and its symbol data has been retrieved, the source
location points to the function entry location (first line of the function).
2. If the selected ghost zone is not an inline frame, but its symbol data is not available, the source location
will point to a semi-random location within the function body (i.e. to one of the sampled addresses in
the program, but not necessarily the one representing the selected time stamp, as multiple samples
with different addresses may be merged into one ghost zone).
3. If the selected ghost zone is an inline frame, the source location will point to a semi-random location
within the inlined function body (see details in the above point). It is impossible to go to such a
function’s entry location, as it doesn’t exist in the program binary. Inlined functions begin in the parent
function.
Call stack samples
The row of dots right below the Main thread label shows call stack sample points,
which may have been automatically captured (see chapter 3.15.5 for more detail). Hovering the mouse
pointer over each dot will display a short call stack summary, while clicking on the dot with the left mouse
button will open a more detailed call stack information window (see section 5.14).
Context switches
The thick line right below the samples represents context switch data (see section 3.15.3).
We can see that the main thread, as displayed, starts in a suspended state, represented by the dotted region.
Then it is woken up and starts execution of the Update zone. It is preempted amid the physics processing,
which explains why there is an empty space between child zones. Then it is resumed again and continues
execution into the Render zone, where it is preempted again, but for a shorter time. After rendering is done,
the thread sleeps again, presumably waiting for the vertical blanking to indicate the next frame. Similar
information is also available for the streaming thread.
Context switch regions use the following color key:
• Red – Thread is waiting to be resumed by the scheduler. There are many reasons why a thread may be
in the waiting state. Hovering the mouse pointer over the region will display more information. If
sampling was performed, the profiler might display a wait stack. See section 3.15.5.1 for additional
details.
• Blue – Thread is waiting to be resumed and is migrating to another CPU core. This might have visible
performance effects because low-level CPU caches are not shared between cores, which may result in
additional cache misses. To avoid this problem, you may pin a thread to a specific core by setting its
affinity.
• Bronze – Thread has been placed in the scheduler’s run queue and is about to be resumed.
Fiber work and yield states are presented in the same way as context switch regions.
CPU data
This label is only available if the profiler collected context switch data. It is split into two
parts: a graph of CPU load by various threads running in the system and a per-core thread execution display.
The CPU load graph shows how much CPU resources were used at any given time during program
execution. The green part of the graph represents threads belonging to the profiled application, and the gray
part of the graph shows all other programs running in the system. Hovering the mouse pointer over the
graph will display a list of threads running on the CPU at the given time.
Each line in the thread execution display represents a separate logical CPU thread. If CPU topology data
is available (see section 3.15.4), package and core assignment will be displayed in brackets, in addition to
the numerical processor identifier (i.e. [package:core] CPU thread). When a core is busy executing a thread,
a zone will be drawn at the appropriate time. Zones are colored according to the following key:
• Bright color – or orange if dynamic thread colors are disabled – Thread tracked by the profiler.
• Dark blue – Thread existing in the profiled application but not known to the profiler. This may include
internal profiler threads, helper threads created by external libraries, etc.
When the mouse pointer is hovered over either the CPU data zone or the thread timeline label, Tracy
will display a line connecting all zones associated with the selected thread. This can be used to quickly see
how the thread migrated across the CPU cores.
Clicking the left mouse button on a tracked thread will make it visible on the timeline if it was either
hidden or collapsed before.
Careful examination of the data presented on this graph may allow you to determine areas where the
profiled application was fighting for system resources with other programs (see section 2.2.1) or give you a
hint to add more instrumentation macros.
Locks
Mutual exclusion zones are displayed in each thread that tries to acquire them. There are three
color-coded kinds of lock event regions that may be displayed. Note that the contention regions are always
displayed over the uncontended ones when the timeline view is zoomed out.
• Green region70 – The lock is being held solely by one thread, and no other thread tries to access it. In the
case of shared locks, multiple threads hold the read lock, but no thread requires a write lock.
70 This region type is disabled by default and needs to be enabled in options (section 5.4).
• Yellow region – The lock is owned by this thread, and some other thread also wants to acquire the
lock.
• Red region – The thread wants to acquire the lock but is blocked by another thread (or by several
threads, in the case of a shared lock).
Hovering the mouse pointer over a lock timeline will highlight the lock in all threads to help read
the lock behavior. Hovering the mouse pointer over a lock event will display important information,
for example, a list of threads that are currently blocking or which are blocked by the lock. Clicking the
left mouse button on a lock event or a lock label will open the lock information window, as described in
section 5.18. Clicking the middle mouse button on a lock event will zoom the view to the extent of the
event.
Plots
The numerical data values (figure 19) are plotted right below the zones and locks. Note that the
minimum and maximum values currently displayed on the plot are visible on the screen, along with the y
range of the plot and the number of drawn data points. The discrete data points are indicated with little
rectangles. A filled rectangle indicates multiple data points.
When memory profiling (section 3.8) is enabled, Tracy will automatically generate a Memory usage
plot, which has extended capabilities. For example, hovering over a data point (memory allocation event)
will visually display the allocation duration. Clicking the left mouse button on the data point will open
the memory allocation information window, which will show the duration of the allocation as long as the
window is open.
Another plot that Tracy automatically provides is the CPU usage plot, which represents the total system
CPU usage percentage (it is not limited to the profiled application).
Hovering the mouse pointer over the timeline view will display a vertical line that you can use to line up
events in multiple threads visually. Dragging the left mouse button will display the time measurement of
the selected region.
The timeline view may be scrolled both vertically and horizontally by dragging the right mouse button.
Note that only the zones, locks, and plots scroll vertically, while the time scale and frame sets always stay on
the top.
You can zoom the timeline view in and out by using the mouse wheel. Pressing the Ctrl key will make
zooming more precise, while pressing the Shift key will make it faster. You can select a range to which you
want to zoom in by dragging the middle mouse button. Dragging the middle mouse button while the
Ctrl key is pressed will zoom out.
It is also possible to navigate the timeline using the keyboard. The A and D keys scroll the view to the
left and right, respectively. The W and S keys change the zoom level.
• Limit find zone time range – limits find zone results. See chapter 5.7 for more details.
• Limit statistics time range – limits statistics results. See chapter 5.6 for more details.
• Limit wait stacks time range – limits wait stacks results. Refer to chapter 5.17.
• Limit memory time range – limits memory results. Read more about this in chapter 5.9.
Alternatively, you may specify the time range by clicking the right mouse button on a zone or a frame.
The resulting time extent will match the selected item.
To reduce clutter, time range regions are only displayed if the windows they affect are open or if the time
range limits control window is open (section 5.23). You can access the time range limits window through the
{ Tools button on the control menu.
You can freely adjust each time range on the timeline by clicking the left mouse button on the range’s
edge and dragging the mouse.
Description
Please note that while the annotations persist between profiling sessions, they are not saved in the trace
but in the user data files, as described in section 8.2.
• Draw empty labels – By default, threads that don’t have anything to display at the current zoom level
are hidden. Enabling this option will show them anyway.
• Draw frame targets – If enabled, time regions in any frame from the currently selected frame set
that exceed the specified Target FPS value will be marked with a red background on the timeline view.
– Target FPS – Controls the option above, but also the frame bar colors in the frame time graph
(section 5.2.2). The color range thresholds are presented in a line directly below.
– Darken inactive thread – If enabled, inactive regions in threads will be dimmed out.
– Draw CPU usage graph – You can disable drawing of the CPU usage graph here.
• Draw CPU zones – Determines whether CPU zones are displayed.
– Draw ghost zones – Controls if ghost zones should be displayed in threads which don’t have any
instrumented zones available.
– Zone colors – Zones with no user-set color may be colored according to the following schemes:
∗ Disabled – A constant color (blue) will be used.
∗ Thread dynamic – Zones are colored according to the thread (identifier number) they belong to
and depth level.
∗ Source location dynamic – Zone color is determined by source location (function name) and
depth level.
Enabling the Ignore custom option will force usage of the selected zone coloring scheme, disregarding
any colors set by the user in profiled code.
– Zone name shortening – Controls display behavior of long zone names, which don’t fit inside a
zone box:
∗ Disabled – Shortening of zone names is not performed and names are always displayed in full
(e.g. bool ns::container<float>::add(const float&)).
∗ Minimal length – Always reduces zone name to minimal length, even if there is space available
for a longer form (e.g. add()).
∗ Only normalize – Only performs normalization of the zone name72, but does not remove
namespaces (e.g. ns::container<>::add()).
∗ As needed – Name shortening steps will be performed only if there is no space to display a
complete zone name, and only until the name fits available space, or shortening is no longer
possible (e.g. container<>::add()).
∗ As needed + normalize – Same as above, but zone name normalization will always be performed,
even if the entire zone name fits in the space available.
Function names in the remaining places across the UI will be normalized unless this option is set
to Disabled.
71 There is an assumption that drift is linear. Automated measurement calculates and removes change over time in delay-to-execution
• Draw locks – Controls the display of locks. If the Only contended option is selected, the profiler won’t
display the non-blocking regions of locks (see section 5.2.3.3). The Locks drop-down allows disabling the
display of locks on a per-lock basis. As a convenience, the list of locks is split into the single-threaded
and multi-threaded (contended and uncontended) categories. Clicking the right mouse button on a
lock label opens the lock information window (section 5.18).
• Draw plots – Allows disabling display of plots. Individual plots can be disabled in the Plots
drop-down. The vertical size of the plots can be adjusted using the Plot heights slider.
• Visible threads – Here you can select which threads are visible on the timeline. You can change the
display order of threads by dragging thread labels. Threads can be sorted alphabetically with the Sort
button.
• Visible frame sets – Frame set display can be enabled or disabled here. Note that disabled frame sets
are still available for selection in the frame set selection drop-down (section 5.2.1) but are marked with
a dimmed font.
Disabling the display of some events is especially recommended when the profiler performance drops
below acceptable levels for interactive usage.
Using these options in tandem lets you look at both the inlined function code and the place where it was
inserted. If the Smart location option is selected, the profiler will display the entry point position for non-inlined
functions and the sample location for inlined functions. Selecting the Address option will instead print the
symbol address.
The location data is complemented by the originating executable image name, contained in the Image
column.
The profiler may not find some function locations due to insufficient debugging data available on the
client-side. To filter out such entries, use the Hide unknown option.
The Time or Count column (depending on the Show time option selection) shows the number of samples
taken, either as a raw count or in an easier to understand time format. Note that the percentage value of
time is calculated relative to the wall-clock time. The percentage value of sample counts is relative to the
total number of collected samples. You can also make the percentages of inline functions relative to the base
symbol measurements by enabling the Base relative option.
The last column, Code size, displays the size of the symbol in the executable image of the program. Since
inlined routines are directly embedded into other functions, their symbol size will be based on the parent
symbol and displayed as ’less than’. In some cases, this data won’t be available. If the symbol code has been
retrieved74, the symbol size will be prepended with an icon, and clicking the right mouse button on the
location column entry will open the symbol view window (section 5.16.2).
Finally, the list can be filtered using the Filter symbols entry field, just like in the instrumentation
mode case. Additionally, you can also filter results by the originating image name of the symbol. You may
disable the display of kernel symbols with the Include kernel switch. The exclusive/inclusive time counting
mode can be switched using the Timing menu (non-reentrant timing is not available in the Sampling view).
Limiting the time range is also available but is restricted to self-time. If the Show all option is selected,
the list will include not only the call stack samples but also all other symbols collected during the profiling
process (this is enabled by default if no sampling was performed).
A simple CSV document containing the visible zones after filtering and limiting can be copied to the
clipboard with the button adjacent to the visible zones count. The document contains the following columns:
• src_line – Line in the source file where the zone was set
You start by entering a search query, which will be matched against known zone names (see section 3.4
for information on the grouping of zone names). If the search found some results, you will be presented
with a list of zones in the matched source locations drop-down. The selected zone’s graph is displayed in the
histogram drop-down, and the matching zones are also highlighted on the timeline view.
Clicking the right mouse button on the source file location will open the source file view window
(if applicable, see section 5.16). If symbol data is available, Tracy will try to match the instrumented zone
name to a captured symbol. If this succeeds and there are no duplicate matches, the source file view will be
accompanied by the disassembly of the code. Since this matching is not exact, in rare cases you may get the
wrong data here. To just display the source code, press and hold the Ctrl key while clicking the right
mouse button.
An example histogram is presented in figure 21. Here you can see that the majority of zone calls (by
count) are clustered in the 300 ns group, closely followed by the 10 µs cluster. There are some outliers at the
1 and 10 ms marks, which can be ignored on most occasions, as these are single occurrences.
Figure 21: Zone execution time histogram. Note that the extreme time labels and time range indicator (middle time value) are
displayed in a separate line.
Various statistics about the displayed data accompany the histogram, for example, the total time of the
displayed samples or the maximum number of counts in histogram bins. The following options control how the
data is presented:
• Log values – Switches between linear and logarithmic scale on the y axis of the graph, representing the
call counts75.
• Log time – Switches between linear and logarithmic scale on the x axis of the graph, representing the
time bins.
• Cumulate time – Changes how the histogram bin values are calculated. By default, the vertical bars on
the graph represent the call counts of zones that fit in the given time bin. If this option is enabled, the
bars represent the time spent in the zones. For example, on the graph presented in figure 21 the 10 µs
cluster is the dominating one, if we look at the time spent in the zone, even if the 300 ns cluster has a
greater number of call counts.
• Self time – Removes children time from the analyzed zones, which results in displaying only the time
spent in the zone itself (or in non-instrumented function calls). It cannot be selected when Running time
is active.
• Running time – Removes time when zone’s thread execution was suspended by the operating system
due to preemption by other threads, waiting for system resources, lock contention, etc. Available only
when the profiler performed context switch capture (section 3.15.3). It cannot be selected when Self
time is active.
75 Or time, if the cumulate time option is enabled.
• Minimum values in bin – Excludes display of bins that do not hold enough values at both ends of the time
range. Increasing this parameter will eliminate outliers, allowing us to concentrate on the interesting
part of the graph.
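To make the difference between counting calls and cumulating time concrete, here is a minimal sketch of a histogram with logarithmic time bins. It is an illustration under stated assumptions, not Tracy's implementation: the function name BuildHistogram and its parameters are hypothetical, zone times are plain nanosecond values, and the bin edges are spaced evenly in log10 of time.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a zone-time histogram.
// zoneTimesNs:  execution time of each zone call, in nanoseconds.
// minNs, maxNs: time range covered by the histogram (logarithmic x axis).
// cumulateTime: false -> bars hold call counts,
//               true  -> bars hold the total time spent in the bin.
inline std::vector<double> BuildHistogram(const std::vector<double>& zoneTimesNs,
                                          double minNs, double maxNs,
                                          std::size_t binCount, bool cumulateTime)
{
    assert(minNs > 0.0 && maxNs > minNs && binCount > 0);
    std::vector<double> bins(binCount, 0.0);
    const double logMin = std::log10(minNs);
    const double logSpan = std::log10(maxNs) - logMin;
    for (double t : zoneTimesNs)
    {
        if (t < minNs || t > maxNs) continue;  // outside the displayed range
        // Relative position of t within the logarithmic range, in [0, 1].
        const double pos = (std::log10(t) - logMin) / logSpan;
        // Clamp so that t == maxNs lands in the last bin.
        const std::size_t idx =
            std::min(binCount - 1, static_cast<std::size_t>(pos * binCount));
        bins[idx] += cumulateTime ? t : 1.0;
    }
    return bins;
}
```

With three calls of 300 ns and two calls of 30 µs, the short calls win by count (3 versus 2), but the long calls dominate once time is cumulated (60 µs versus 0.9 µs), which is the kind of shift the Cumulate time option reveals.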
You can drag the left mouse button over the histogram to select a time range that you want to look at
closely. This will display the data in the histogram info section, and it will also filter zones shown in the found
zones section. This is quite useful if you actually want to look at the outliers, i.e., where they originated,
what the program was doing at the moment, etc76. You can reset the selection range by pressing the
right mouse button on the histogram.
The found zones section displays the individual zones grouped according to the following criteria:
• Thread – In this mode you can see which threads were executing the zone.
• User text – Splits the zones according to the custom user text (see section 3.4).
• Zone name – Groups zones by the name set on a per-call basis (see section 3.4).
• Call stacks – Zones are grouped by the originating call stack (see section 3.11). Note that two call stacks
may sometimes appear identical, even if they are not, due to an easily overlooked difference in the
source line numbers.
• Parent – Groups zones according to the parent zone. This mode relies on the zone hierarchy and not on
the call stack information.
• No grouping – Disables zone grouping. It may be useful when you want to see zones in order as they
appear.
You may sort each group according to the order in which it appeared, the call count, the total time spent in
the group, or the mean time per call. Expanding the group view will display individual occurrences of the
zone, which can be sorted by application’s time, execution time, or zone’s name. Clicking the left mouse
button on a zone will open the zone information window (section 5.13). Clicking the middle mouse
button on a zone will zoom the timeline view to the zone’s extent.
Clicking the left mouse button on the group name will highlight the group time data on the histogram
(figure 22). This function provides a quick insight into the impact of the originating thread or input data on
the zone performance. Clicking on the Clear button will reset the group selection. If the grouping mode
is set to Parent, clicking the middle mouse button on the parent zone group will switch the find
zone view to display the selected zone.
[Figure 22: histogram; time axis from 100 ns to 10 ms]
The call stack grouping mode has a different way of listing groups. Here only one group is displayed at
any time due to the need to display the call stack frames. You can switch between call stack groups by using
the previous and next arrow buttons. You can select the group by clicking on the ✓ Select button. You can open the call stack
window (section 5.14) by pressing the Call stack button.
76 More often than not you will find out that the application was just starting, or access to a cold file was required and there’s not
Tracy displays a variety of statistical values regarding the selected function: mean (average value), median
(middle value), mode (most common value, quantized using histogram bins), and σ (standard deviation).
The mean and median zone times are also displayed on the histogram as red (mean) and blue (median)
vertical bars. Additional bars will indicate the mean group time (orange) and median group time (green).
You can disable the drawing of either set of markers by clicking on the check-box next to the color legend.
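The statistical values listed above can be reproduced from a list of zone times with a few lines of code. The sketch below is not taken from the profiler: the names ZoneStats and ComputeStats are hypothetical, the population form of the standard deviation is an assumption, and the mode is omitted because it depends on the histogram bin layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ZoneStats
{
    double mean;     // average value
    double median;   // middle value
    double sigma;    // standard deviation
};

// Hypothetical helper; expects a non-empty list of zone times.
ZoneStats ComputeStats( std::vector<double> times )
{
    assert( !times.empty() );
    std::sort( times.begin(), times.end() );
    const size_t n = times.size();

    double sum = 0;
    for( double t : times ) sum += t;
    const double mean = sum / n;

    // Middle value; for an even count, average the two middle values.
    const double median = ( n % 2 == 1 )
        ? times[n / 2]
        : ( times[n / 2 - 1] + times[n / 2] ) / 2;

    // Population standard deviation (an assumption, see the text above).
    double sq = 0;
    for( double t : times ) sq += ( t - mean ) * ( t - mean );
    const double sigma = std::sqrt( sq / n );

    return { mean, median, sigma };
}
```

For the times 1, 2, 3, 4, and 10, the single outlier pulls the mean up to 4 while the median stays at 3, which is why the two markers may be drawn at clearly different positions on the histogram.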
Hovering the mouse cursor over a zone on the timeline, which is currently selected in the find zone
window, will display a pulsing vertical bar on the histogram, highlighting the bin to which the hovered zone
has been assigned. In addition, it will also highlight the zone's entry on the zone list.
Keyboard shortcut
You may press Ctrl + F to open or focus the find zone window and set the keyboard input on the
search box.
Caveats
When using the execution times histogram, you must be aware of the hardware peculiarities. Read section 2.2.2
for more detail.
The profiler will highlight matching zones on the timeline display when the zone statistics are displayed in
the find zone menu. Highlight colors match the histogram display. A bright blue highlight indicates that a
zone is in the optional selection range, while the yellow highlight is used for the rest of the zones.
The frame time graph (section 5.2.2) behavior is altered when a zone is displayed in the find zone window
and the Show zone time in frames option is selected. An accumulated zone execution time is shown instead of
coloring the frame bars according to the frame time targets.
Each bar is drawn in gray color, with the white part accounting for the zone time. If the execution time
is greater than the frame time (this is possible if more than one thread was executing the same zone), the
overflow will be displayed using red color.
Enabling the Self time option affects the displayed values, but Running time does not.
Caveats
The profiler might not calculate the displayed data correctly, and it may not include some zones in the
reported times.
If the Limit range option is selected, the profiler will include only the zones within the specified time range
(chapter 5.3) in the data. The inclusion region will be marked with a green striped pattern. Note that a zone
must be entirely inside the region to be counted. You can access more options through the Limits button,
which will open the time range limits window, described in section 5.23.
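The "entirely inside" rule can be expressed as a simple containment test. The snippet below is a sketch under assumptions: the TimeRange type and the IsZoneCounted name are hypothetical, and treating the range boundaries as inclusive is a choice made for this example.

```cpp
#include <cstdint>

// Hypothetical type: a time span in nanoseconds.
struct TimeRange
{
    int64_t start;
    int64_t end;
};

// A zone is counted only if both its start and its end lie within the limit
// range. Zones that merely overlap the range are rejected.
bool IsZoneCounted( const TimeRange& zone, const TimeRange& limit )
{
    return zone.start >= limit.start && zone.end <= limit.end;
}
```

A zone that begins before the striped region or ends after it is therefore excluded, even if most of its execution time falls inside the region.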
Note that the traces are color and symbol-coded. The current trace is marked by a yellow symbol, and
the external one is marked by a red symbol.
When searching for source locations it’s not uncommon to match more than one zone (for example a
search for Draw may result in DrawCircle and DrawRectangle matches). Typically you wouldn’t want to
compare execution profiles of two unrelated functions. The link selection option prevents this by
ensuring that when you choose a source location in one trace, the corresponding one is also selected in the
second trace. Be aware that this may still result in a mismatch, for example, if you have overloaded functions.
In such a case, you will need to select the appropriate function in the other trace manually.
It may be difficult, if not impossible, to perform identical runs of a program. This means that the number
of collected zones may differ in both traces, influencing the displayed results. To fix this problem, enable the
Normalize values option, which will adjust the displayed results as if both traces had the same number of
recorded zones.
77When comparing frame times you are presented with a list of available frame sets, without the search box.
Trace descriptions
Set custom trace descriptions (see section 5.12) to easily differentiate the two loaded traces. If no trace
description is set, the name of the profiled program will be displayed along with the capture time.
To see what changes were made in the source code between the two compared traces, select the Source diff
compare mode. This will display a list of deleted, added, and changed files. By default, the difference is
calculated from the older trace to the newer one. You can reverse this by clicking on the Switch button.
Please note that changes will be registered only if the file has the same name and location in both traces.
Tracy does not resolve file renames or moves.
5.9.1 Allocations
The Allocations pane allows you to search for usage of the specified address during the whole lifetime of the
program. All recorded memory allocations that match the query will be displayed on a list.
78 Memory span describes the address space consumed by the program. It is calculated as a difference between the maximum and
minimum observed in-use memory address.
79While the allocation information window is opened, the address will be highlighted on the list.
80 The actual allocation is typically a couple functions deeper in the call stack.
instrumented program.
the memory events within the time range in the displayed results. See section 5.23 for more information.
Quick example
Let’s say we have a Unix-based operating system with program sources in the /home/user/program/src/
directory. We have also performed a capture of an application running under Windows, with sources
in the C:\Users\user\Desktop\program\src directory. The source locations don’t match, and the profiler
can’t access the source files on our disk. We can fix that by adding two substitution patterns:
• ^C:\\Users\\user\\Desktop → /home/user
• \\ → /
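The effect of these two patterns can be checked outside of the profiler. The following is a sketch only: it assumes the patterns follow std::regex (ECMAScript) syntax and are applied one after another to the result of the previous rule, and the names SubstitutePath and kExampleRules are hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Applies each (pattern, replacement) rule, in order, to the given path.
std::string SubstitutePath( std::string path,
    const std::vector<std::pair<std::string, std::string>>& rules )
{
    for( const auto& rule : rules )
    {
        path = std::regex_replace( path, std::regex( rule.first ), rule.second );
    }
    return path;
}

// The two rules from the example, written as raw string literals so that the
// backslashes reach the regex engine unchanged.
const std::vector<std::pair<std::string, std::string>> kExampleRules = {
    { R"(^C:\\Users\\user\\Desktop)", "/home/user" },
    { R"(\\)", "/" },
};
```

Under these assumptions, C:\Users\user\Desktop\program\src\main.cpp is first rewritten to /home/user\program\src\main.cpp, and the second rule then converts the remaining backslashes, giving /home/user/program/src/main.cpp.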
By default, all source file modification times need to be older than the capture time of the trace. This can
be disabled using the Enforce source file modification time older than trace capture time check box, e.g., when the
source files are under source control and the file modification time is not relevant.
In this window, you can view the information about the machine on which the profiled application was
running. This includes the operating system, used compiler, CPU name, total available RAM, etc. In addition,
if application information was provided (see section 3.7.1), it will also be displayed here.
If an application should crash during profiling (section 2.5), the profiler will display the crash information
in this window. It provides you information about the thread that has crashed, the crash reason, and the
crash call stack (section 5.14).
82 See section 5.7 for a description of the histogram. Note that there are subtle differences in the available functionality.
83 This does not affect source files cached during the profiling run.
• Basic source location information: function name, source file location, and the thread name.
• Timing information.
• If the profiler performed context switch capture (section 3.15.3) and a thread was suspended during
zone execution, a list of wait regions will be displayed, with complete information about the timing,
CPU migrations, and wait reasons. If CPU topology data is available (section 3.15.4), the profiler will
mark zone migrations across cores with ’C’ and migrations across packages with ’P’. In some cases,
context switch data might be incomplete84 , in which case a warning message will be displayed.
• Memory events list, both summarized and a list of individual allocation/free events (see section 5.9 for
more information on the memory events list).
• List of messages that the profiler logged in the zone’s scope. If the exclude children option is disabled,
messages emitted in child zones will also be included.
• Zone trace, taking into account the zone tree and call stack information (section 3.11), trying to
reconstruct a combined zone + call stack trace85 . Captured zones are displayed as standard text, while
non-instrumented functions are dimmed. Hovering the mouse pointer over a zone will highlight it on
the timeline view with a red outline. Clicking the left mouse button on a zone will switch the zone
info window to that zone. Clicking the middle mouse button on a zone will zoom the timeline view
to the zone’s extent. Clicking the right mouse button on a source file location will open the source
file view window (if applicable, see section 5.16).
• Child zones list, showing how the current zone’s execution time was used. Zones on this list can be
grouped according to their source location. Each group can be expanded to show individual entries.
All the controls from the zone trace are also available here.
• Time distribution in child zones, which expands the information provided in the child zones list by
processing all zone children (including multiple levels of grandchildren). This results in a statistical list
of zones that were really doing the work in the current zone’s time span. If a group of zones is selected
on this list, the find zone window (section 5.7) will open, with a time range limited to show only the
children of the current zone.
• Go to parent – Switches the zone information window to display the current zone’s parent zone (if
available).
• Statistics – Displays the zone’s general performance characteristics in the find zone window (section 5.7).
• Call stack – Views the current zone’s call stack in the call stack window (section 5.14). The button will
be highlighted if the call stack window shows the zone’s call stack. Only available if the zone has captured
call stack data (section 3.11).
84 For example, when capture is ongoing and context switch information has not yet been received.
85 Reconstruction is only possible if all zones have complete call stack capture data available. In the case where that’s not available, an
unknown frames entry will be present.
• Source – Displays the source file view window with the zone source code (only available if applicable, see
section 5.16). The button will be highlighted if the source file is displayed (but the focused source line
might be different).
• Go back – Returns to the previously viewed zone. The viewing history is lost when the zone
information window is closed or when the type of displayed zone changes (from CPU to GPU or vice
versa).
Clicking on the Copy to clipboard buttons will copy the appropriate data to the clipboard.
• Source code – displays the source file and line number associated with the frame.
• Entry point – displays the source code at the beginning of the function containing the selected frame, or the function call
place in case of inline frames.
• Return address – shows the return address, which you may use to pinpoint the exact instruction in the
disassembly.
• Symbol address – displays the start address of the function containing the frame address.
In some cases, it may not be possible to decode stack frame addresses correctly. Such frames will
be presented with a dimmed ’[ntdll.dll]’ name of the image containing the frame address, or simply
’[unknown]’ if the profiler cannot retrieve even this information. Additionally, ’[kernel]’ is used to indicate
unknown stack frames within the operating system’s internal routines.
If the displayed call stack is a sampled call stack (chapter 3.15.5), an additional button will be available,
Global entry statistics. Clicking it will open the sample entry call stacks window (chapter 5.15) for the
current call stack.
Clicking on the Copy to clipboard button will copy the call stack to the clipboard.
#include <memory>

struct Application { void Run() {} };  // stand-in so the example compiles

int main()
{
    auto app = std::make_unique<Application>();
    app->Run();
    app.reset();
}
86 Executable images are called modules by Microsoft.
87 Or ’’ icon in case of call stack tooltips.
Let’s say you are looking at the call stack of some function called within Application::Run. This is the
result you might get:
0. ...
1. ...
2. Application::Run
3. std::unique_ptr<Application>::reset
4. main
At first glance it may look like unique_ptr::reset was the call site of Application::Run, which
would make no sense, but this is not the case here. When you remember these are the function return points, it
becomes much clearer what is happening. As an optimization, Application::Run is returning directly
into unique_ptr::reset, skipping the return to main and an unnecessary reset function call.
Moreover, the linker may determine in some rare cases that two functions in your program are
identical88 . As a result, only one copy of the binary code will be provided in the executable for both functions
to share. While this optimization produces more compact programs, it also means that there’s no way to
tell the two functions apart in the resulting machine code. In effect, some call stacks may look
nonsensical until you perform a small investigation.
Important
To display source files, Tracy has to gain access to them somehow. Since having the source code is
not needed for the profiled application to run, this can be problematic in some cases. The source files
search order is as follows:
1. Discovery is performed on the server side. Found files are cached in the trace. This is appropriate
when the client and the server run on the same machine or if you’re deploying your application to the target
device and then run the profiler on the same workstation.
88 For example, if all they do is zero-initialize a region of memory, as some constructors do.
2. If not found, discovery is performed on the client side. Found files are cached in the trace. This is
appropriate when you are developing your code on another machine, for example, when you are working on a
dev-board through an SSH connection.
3. If not found, Tracy will try to open source files that you might have on your disk later on. The
profiler won’t store these files in the trace. You may provide custom file path substitution rules to
redirect this search to the right place (see section 5.12).
Note that the discovery process not only looks for a file on the disk but also checks its time stamp
and validates it against the executable image timestamp or, if it’s not available, the time of the performed
capture. This prevents the use of source files that are newer than the program you’re
profiling (i.e., files that were changed afterwards).
Nevertheless, the displayed source files might still not reflect the code that you profiled! It is up
to you to verify that you don’t have a modified version of the code with regard to the trace.
• Both – selects combined mode, in which source code and disassembly will be listed next to each other.
Some modes may be unavailable in some circumstances (missing or outdated source files, lack of machine
code). In case the Assembly mode is unavailable, this might be due to the capstone disassembly engine
failing to disassemble the machine instructions. See section 2.3 for more information.
Selecting the Raw code option will enable the display of raw machine code bytes for each line. Individual
bytes are displayed with interwoven colors to make reading easier.
If any instruction would jump to a predefined address, the symbolic name of the jump target will be
additionally displayed. If the destination location is within the currently displayed symbol, an -> arrow will
be prepended to the name. Hovering the mouse pointer over such a symbol name will highlight the target
location. Clicking on it with the left mouse button will focus the view on the destination instruction or
switch the view to the destination symbol.
Enabling the Jumps option will show jumps within the symbol code as a series of arrows from the
jump source to the jump target, and hovering the mouse pointer over a jump arrow will display a jump
information tooltip. It will also draw the jump range on the scroll bar as a green line. A horizontal green line
will mark the jump target location. Clicking on a jump arrow with the left mouse button will focus the
view on the target location. The right mouse button opens a jump context menu, which allows inspection
and navigation to the target location or any of the source locations. Jumps going out of the symbol89 will be
indicated by a smaller arrow pointing away from the code.
Portions of the executable used to show the symbol view are stored within the captured profile and don’t
rely on the available local disk files.
89 This includes jumps, procedure calls, and returns. For example, in x86 assembly the respective instruction mnemonics can be: jmp, call,
ret.
Exploring microarchitecture If the listed assembly code targets x86 or x64 instruction set architectures,
hovering the mouse pointer over an instruction will display a tooltip with microarchitectural data, based on
measurements made in [AR19]. This information is retrieved from instruction cycle tables and does not represent the
true behavior of the profiled code. Reading the cited article will give you a detailed definition of the presented
data, but here’s a quick (and inaccurate) explanation:
• Throughput – How many cycles are required to execute an instruction in a stream of the same independent
instructions. For example, if the CPU may execute two independent add instructions simultaneously
on different execution units, then the throughput (cycle cost per instruction) is 0.5.
• Latency – How many cycles it takes for an instruction to finish executing. This is reported as a min-max
range, as some output values may be available earlier than the rest.
• µops – How many microcode operations have to be dispatched for an instruction to retire. For example,
adding a value from memory to a register may consist of two microinstructions: first load the value
from memory, then add it to the register.
• Ports – Which ports (execution units) are required for dispatch of microinstructions. For example,
2*p0+1*p015 would mean that out of the three microinstructions implementing the assembly instruction,
two can only be executed on port 0, and one microinstruction can be executed on ports 0, 1, or 5. The
number of available ports and their capabilities vary between different processor architectures. Refer
to https://wikichip.org/ for more information.
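To make the port notation more concrete, the snippet below splits a string such as 2*p0+1*p015 into its groups. It is a sketch written for this explanation only: the PortGroup type and the ParsePorts function are hypothetical, and the parser handles just the count*pPORTS form shown in the example above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct PortGroup
{
    int uops;            // number of microinstructions in this group
    std::string ports;   // digits of the ports able to execute them
};

// Parses a port usage string made of '+'-separated "count*pPORTS" terms.
// Terms that do not match this form are skipped.
std::vector<PortGroup> ParsePorts( const std::string& desc )
{
    std::vector<PortGroup> groups;
    size_t pos = 0;
    while( pos < desc.size() )
    {
        size_t end = desc.find( '+', pos );
        if( end == std::string::npos ) end = desc.size();
        const std::string term = desc.substr( pos, end - pos );
        const size_t star = term.find( "*p" );
        if( star != std::string::npos && star > 0 )
        {
            groups.push_back( { std::stoi( term.substr( 0, star ) ), term.substr( star + 2 ) } );
        }
        pos = end + 1;
    }
    return groups;
}
```

For 2*p0+1*p015 this yields two groups: two microinstructions restricted to port 0, and one microinstruction that may run on port 0, 1, or 5, for a total of three.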
Selection of the CPU microarchitecture can be performed using the µarch drop-down. Each architecture
is accompanied by the name of an example CPU implementing it. If the current selection matches the
microarchitecture on which the profiled application was running, the µarch icon will be green90 . Otherwise, it
will be red91 . Clicking on the icon when it is red will reset the selected microarchitecture to the one the
profiled application was running on.
Clicking on the Save button lets you write the disassembly listing to a file. You can then manually
extract some critical loop kernel and pass it to a CPU simulator, such as LLVM Machine Code Analyzer
(llvm-mca)92 , to see how the code is executed and if there are any pipeline bubbles. Consult the llvm-mca
documentation for more details. Alternatively, you might click the right mouse button on a jump arrow
and save only the instructions within the jump range, using the Save jump range button.
Instruction dependencies Assembly instructions may read values stored in registers and may also
write values to registers. As a result, a dependency between two instructions is created when one produces
some result, which the other then consumes. Combining this dependency graph with information about
instruction latencies may give a deep understanding of the bottlenecks in code performance.
Clicking the left mouse button on any assembly instruction will mark it as a target for resolving
register dependencies between instructions. To cancel this selection, click on any assembly instruction with
the right mouse button.
The selected instruction will be highlighted in white, while its dependencies will be highlighted in red.
Additionally, a list of dependent registers will be shown next to each instruction which reads or writes to
them, with the following color code:
• Grey – Value in a register is either discarded (overwritten) or was already consumed by an earlier
instruction (i.e., it is readily available93 ). The profiler will not follow the dependency chain further.
90 Comparing sampled instruction counts with microarchitectural details only makes sense when this selection is properly matched.
91 You can use this to gain insight into how the code may behave on other processors.
92 https://llvm.org/docs/CommandGuide/llvm-mca.html
93 This is actually a bit of a simplification. Run a pipeline simulator, e.g., llvm-mca, for a better analysis.
Search for dependencies follows program control flow, so there may be multiple producers and consumers
for any single register. While the after and before guidelines mentioned above hold in the general case,
things may be more complicated when there’s a large number of conditional jumps in the code. Note that
dependencies further away than 64 instructions are not displayed.
For more straightforward navigation, dependencies are also marked on the left side of the scroll bar,
following the green, red and yellow conventions. The selected instruction is marked in blue.
94 You should remember that these are results of random sampling. Some function calls may be missing here.
Important
Be aware that the data is not entirely accurate, as it results from a random sampling of program
execution. Furthermore, undocumented implementation details of an out-of-order CPU architecture
will highly impact the measurement. Read chapter 2.2.2 to see the tip of the iceberg.
As described in chapter 3.15.6, on some platforms, Tracy can capture the internal statistics counted by the
CPU hardware. If this data has been collected, the Cost selection list will be available. It allows changing
what is taken into consideration for display by the cost statistics. You can select the following options:
• Sample count – this selects the instruction pointer statistics, collected by call stack sampling performed by
the operating system. This is the default data shown when hardware samples have not been captured.
• Cycles – an option very similar to the sample count, but the data is collected directly by the CPU hardware
counters. This may make the results more reliable.
• Cache impact – similar to branch impact, but it shows cache miss data instead. These values are calculated
as $\sqrt{\#\text{cache references} \times \#\text{cache misses}}$ and will highlight places with lots of cache accesses that also
miss.
• The rest of the available selections just show raw values gathered from the hardware counters. These
are: Retirements, Branches taken, Branch miss, Cache access and Cache miss.
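The cache impact formula from the list above is the geometric mean of the two raw counters. The sketch below shows it next to a plain miss rate for comparison. The function names CacheImpact and CacheMissRate, as well as the counter values in the usage note, are assumptions made for this example and are not taken from the profiler sources.

```cpp
#include <cmath>
#include <cstdint>

// Impact as described in the list above: the square root of the product of
// cache references and cache misses.
double CacheImpact( uint64_t cacheReferences, uint64_t cacheMisses )
{
    return std::sqrt( double( cacheReferences ) * double( cacheMisses ) );
}

// Plain miss rate (misses divided by references), shown for comparison.
double CacheMissRate( uint64_t cacheReferences, uint64_t cacheMisses )
{
    return cacheReferences == 0 ? 0.0 : double( cacheMisses ) / double( cacheReferences );
}
```

An instruction that missed 10 out of 10 accesses has a miss rate of 1.0 but an impact of only 10, while an instruction that missed 1000 out of 2000 accesses has a miss rate of 0.5 and an impact of roughly 1414. The impact value therefore ranks the second instruction as the more significant one.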
If the HW (hardware samples) switch is enabled, the profiler will supplement the cost percentages
column with three additional columns. The first added column displays the instructions per cycle (IPC)
value. The two remaining columns show branch and cache data, as described below. The displayed values
are color-coded, with green indicating good execution performance and red indicating that the code stalled
the CPU pipeline for one reason or another.
If the Impact switch is enabled, the branch and cache columns will show how much impact the branch
mispredictions and cache misses have. The way these statistics are calculated is described in the list above. In
the other case, the columns will show the raw branch and cache miss rate ratios, isolated to their respective
source and assembly lines and not relative to the whole symbol.
Isolated values
The percentage values shown when the Impact option is not selected do not take into account the relative
count of events. For example, you may see a 100% cache miss rate when some instruction missed 10
out of 10 cache accesses. While not ideal, this is not as important as an instruction with a seemingly better 50% cache miss
rate, which actually has missed 1000 out of 2000 accesses. Therefore, you should always
cross-check the presented information with the respective event counts. To help with this, Tracy will
dim statistically unimportant values.
• List – shows all unique wait stacks, sorted by the number of times they were observed.
• Bottom-up tree – displays wait stacks in the form of a collapsible tree, which starts at the bottom of the
call stack.
• Top-down tree – displays wait stacks in the form of a collapsible tree, which starts at the top of the call
stack.
Displayed data may be narrowed down to a specific time range or to include only selected threads.
• Remove – Removes the annotation. You must press the Ctrl key to enable this button.
C Text description
A new view-sized annotation can be added in this window by pressing the Add annotation button. This
effectively saves your current viewport for further reference.
• Set from annotation – Allows using the annotation region for limiting purposes.
• Copy from find zone – Copies the find zone time range limit.
• Copy from wait stacks – Copies the wait stacks time range limit.
Note that ranges displayed in the window have color hints that match the color of the striped regions on
the timeline.
• src_line – Line in the source file where the zone was set
• mean_ns – Mean zone time (equivalent to MPTC in the profiler GUI) in nanoseconds
You can customize the output with the following command line options:
• -e, --self – Use self time (equivalent to the “Self time” toggle in the profiler GUI)
• -u, --unwrap – Report each zone individually; this will discard the statistics columns and instead
report the timestamp and duration for each zone entry
• Fuchsia’s tracing format95 data through the import-fuchsia utility. This format has many commonalities
with the chrome:tracing format, but it uses a compact and efficient binary encoding that can help
lower tracing overhead. The file extension is .fxt or .fxt.zst.
To use this tool, assuming it’s compiled, run:
Compressed traces
Tracy can import traces compressed with the Zstandard algorithm (for example, using the zstd
command-line utility). Traces ending with the .zst extension are assumed to be compressed. This applies
to both chrome and fuchsia traces.
95 https://fuchsia.dev/fuchsia-src/reference/tracing/trace-format
Source locations
Chrome tracing format doesn’t document a way to provide source location data. The import-chrome
and import-fuchsia utilities will, however, recognize a custom loc tag in the root of zone begin
events. You should format this data in the usual filename:line style, for example: hello.c:42.
Providing the line number (including a colon) is optional but highly recommended.
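A program that writes its own Chrome-style trace could attach the loc tag as sketched below. This is an illustration under assumptions: MakeBeginEvent is a hypothetical helper, the event uses the usual Chrome tracing fields (name, ph set to "B" for a begin event, ts, pid, tid), and no JSON escaping is performed.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Builds one begin ("B") event with the custom loc tag placed at the root of
// the event object. Names and paths must not contain quotes or backslashes,
// because this sketch does not escape them.
std::string MakeBeginEvent( const std::string& name, uint64_t tsMicros,
                            uint32_t pid, uint32_t tid,
                            const std::string& file, uint32_t line )
{
    std::ostringstream out;
    out << "{\"name\":\"" << name << "\",\"ph\":\"B\",\"ts\":" << tsMicros
        << ",\"pid\":" << pid << ",\"tid\":" << tid
        << ",\"loc\":\"" << file << ':' << line << "\"}";
    return out.str();
}
```

Calling MakeBeginEvent( "Draw", 100, 1, 2, "hello.c", 42 ) produces an event whose loc value is hello.c:42, matching the filename:line style described above.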
Limitations
• Tracy is a single-process profiler. Should the imported trace contain PID entries, each PID+TID
pair will create a new pseudo-TID number, which the profiler will then decode into a PID+TID
pair in thread labels. If you want to preserve the original TID numbers, your traces should omit
PID entries.
• The imported data may be severely limited, either by not mapping directly to the data structures
used by Tracy or by following undocumented practices.
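The pseudo-TID scheme from the first limitation can be pictured as a two-way lookup table. The class below is a sketch and not the importer's code: the name PseudoTidMap, the sequential numbering starting at 1, and the use of std::map are all assumptions.

```cpp
#include <cstdint>
#include <map>
#include <utility>

class PseudoTidMap
{
public:
    // Returns the pseudo-TID for a PID+TID pair, creating one on first use.
    uint64_t Get( uint32_t pid, uint32_t tid )
    {
        const auto key = std::make_pair( pid, tid );
        auto it = m_forward.find( key );
        if( it != m_forward.end() ) return it->second;
        const uint64_t pseudo = m_forward.size() + 1;
        m_forward.emplace( key, pseudo );
        m_reverse.emplace( pseudo, key );
        return pseudo;
    }

    // Decodes a known pseudo-TID back into its PID+TID pair, as needed for
    // thread labels. Throws std::out_of_range for unknown values.
    std::pair<uint32_t, uint32_t> Decode( uint64_t pseudo ) const
    {
        return m_reverse.at( pseudo );
    }

private:
    std::map<std::pair<uint32_t, uint32_t>, uint64_t> m_forward;
    std::map<uint64_t, std::pair<uint32_t, uint32_t>> m_reverse;
};
```

Two threads that share the TID 1 but belong to processes 100 and 200 receive different pseudo-TIDs, so they never collapse into a single thread. The price is that the original TID numbers are no longer used directly.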
8 Configuration files
While the client part doesn’t read or write anything to the disk (except for accessing the /proc filesystem on
Linux), the server part has to keep some persistent state. The naming conventions and internal data format of
the files are not intended for profiler users, but you may want to back up the configuration
or move it to another machine.
On Windows, settings are stored in the %APPDATA%/tracy directory. All other platforms use the
$XDG_CONFIG_HOME/tracy directory, or $HOME/.config/tracy if the XDG_CONFIG_HOME environment variable
is not set.
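If you want to locate the configuration directory from a backup script or a helper tool, the lookup rules above can be sketched as follows. GetConfigDir is a hypothetical function written for this manual, not part of Tracy, and treating an empty XDG_CONFIG_HOME as unset is an assumption.

```cpp
#include <cstdlib>
#include <string>

// Resolves the settings directory following the rules described above.
// Returns an empty string if the required environment variable is missing.
std::string GetConfigDir()
{
#ifdef _WIN32
    const char* appData = std::getenv( "APPDATA" );
    return appData ? std::string( appData ) + "/tracy" : std::string();
#else
    // Assumption: an empty XDG_CONFIG_HOME is treated the same as an unset one.
    const char* xdg = std::getenv( "XDG_CONFIG_HOME" );
    if( xdg && *xdg ) return std::string( xdg ) + "/tracy";
    const char* home = std::getenv( "HOME" );
    return home ? std::string( home ) + "/.config/tracy" : std::string();
#endif
}
```

Copying the directory returned by this function is enough to back up the configuration or to move it to another machine.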
Appendices
A License
Tracy Profiler (https://github.com/wolfpld/tracy) is licensed under the
3-clause BSD license.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
– getopt_port – https://github.com/kimgr/getopt_port
– libbacktrace ⋆ – https://github.com/ianlancetaylor/libbacktrace
– Zstandard – https://github.com/facebook/zstd
– Diff Template Library – https://github.com/cubicdaiya/dtl
– concurrentqueue ⋆ – https://github.com/cameron314/concurrentqueue
– LZ4 ⋆ – https://github.com/lz4/lz4
– xxHash – https://github.com/Cyan4973/xxHash
• Public domain
– rpmalloc ⋆ – https://github.com/rampantpixels/rpmalloc
– gl3w – https://github.com/skaslev/gl3w
– stb_image – https://github.com/nothings/stb
– stb_image_resize – https://github.com/nothings/stb
• zlib license
• MIT license
References
[AR19] Andreas Abel and Jan Reineke. uops.info: Characterizing latency, throughput, and port usage of
instructions on Intel microarchitectures. In ASPLOS, ASPLOS ’19, pages 673–686, New York, NY,
USA, 2019. ACM.
[ISO12] ISO. ISO/IEC 14882:2011 Information technology — Programming languages — C++. International
Organization for Standardization, Geneva, Switzerland, February 2012.