Profiling counters and their implementation.
§Available counters
| Name (for `Counter::by_name()`) | Counter | OSes | CPUs |
|---|---|---|---|
| `wall-time` | `WallTime` | any | any |
| `instructions:u` | `Instructions` | Linux | `x86_64` |
| `instructions-minus-irqs:u` | `InstructionsMinusIrqs` | Linux | `x86_64`: AMD (since K8), Intel (since Sandy Bridge) |
| `instructions-minus-r0420:u` | `InstructionsMinusRaw0420` | Linux | `x86_64`: AMD (Zen) |
Note: the `:u` suffixes for hardware performance counters come from the Linux `perf` tool, and indicate that the counter is only active while userspace code executes (i.e. it’s paused while the kernel handles syscalls, interrupts, etc.).
§Limitations and caveats
Note: for more information, also see the GitHub PR which first implemented hardware performance counter support (#143).
The hardware performance counters (i.e. all counters other than `wall-time`) are limited to:
- Linux, for out-of-the-box performance counter reads from userspace
  - other OSes could work through custom kernel extensions/drivers, in the future
- `x86_64` CPUs, mostly due to lack of other available test hardware
  - new architectures would be easier to support (on Linux) than new OSes
  - easiest to add would be 32-bit `x86` (aka `i686`), which would reuse most of the `x86_64` CPU model detection logic
- specific (newer) CPU models, for certain non-standard counters
  - e.g. `instructions-minus-irqs:u` requires a “hardware interrupts” (aka “IRQs”) counter, which is implemented differently between vendors / models (if at all)
- single-threaded programs (counters only work on the thread they were created on)
  - for profiling `rustc`, this means only “check mode” (`--emit=metadata`) is supported currently (`-Z no-llvm-threads` could also work)
  - unclear what the best approach for handling multiple threads would be
    - changing the API (e.g. to require per-thread profiler handles) could result in a more efficient implementation, but would also be less ergonomic
    - profiling data from multithreaded programs would be harder to use due to noise from synchronization mechanisms, non-deterministic work-stealing, etc.
For ergonomic reasons, the public API doesn’t vary based on features or target. Instead, attempting to create any unsupported counter will return `Err`, just like it does for any issue detected at runtime (e.g. an incompatible CPU model).
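As an illustration of that contract (a sketch, not an example taken from the crate: the `measureme::counters::Counter` module path, the `&str` argument, and the `Debug`-printable error type are all assumptions):

```rust
// Hypothetical sketch: module path, argument type, and error type are
// assumptions rather than guarantees made by this documentation.
use measureme::counters::Counter;

fn probe(name: &str) {
    match Counter::by_name(name) {
        // The counter exists and is usable on this OS/CPU combination.
        Ok(_counter) => println!("`{name}` is supported here"),
        // Unsupported targets and runtime problems (e.g. an incompatible
        // CPU model) are both reported as `Err`, not as compile errors.
        Err(reason) => eprintln!("`{name}` is unavailable: {reason:?}"),
    }
}
```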
When counting instructions specifically, these factors will impact the profiling quality:
- high-level non-determinism (e.g. user interactions, networking)
  - the ideal use-case is a mostly-deterministic program, e.g. a compiler like `rustc`
    - if I/O can be isolated to separate profiling events, and doesn’t impact execution in a more subtle way (see below), the deterministic parts of the program can still be profiled with high accuracy
    - intentional uses of randomness may change execution paths, though for cryptographic operations specifically, “constant time” implementations are preferred / necessary (in order to limit an external observer’s ability to infer secrets), so they’re not as much of a problem
    - even otherwise-deterministic machine-local communication (to e.g. system services or drivers) can behave unpredictably (especially under load)
      - while we haven’t observed this in the wild yet, it’s possible for file reads/writes to be split up into multiple smaller chunks (and therefore take more userspace instructions to fully read/write)
- low-level non-determinism (e.g. ASLR, randomized `HashMap`s, timers)
  - ASLR (“Address Space Layout Randomization”) may be provided by the OS for security reasons, or accidentally caused through allocations that depend on random data (even as low-entropy as e.g. the base 10 length of a process ID)
    - on Linux ASLR can be disabled by running the process under `setarch -R`
    - this impacts `rustc` and LLVM, which rely on keying `HashMap`s by addresses (typically of interned data) as an optimization, and while non-deterministic outputs are considered bugs, the instructions executed can still vary a lot, even when the externally observable behavior is perfectly repeatable
    - `HashMap`s are involved in more than one way (see the sketch after this list):
      - both the executed instructions, and the shape of the allocations, depend on both the hasher state and the choice of keys (as the buckets are in a flat array indexed by some of the lower bits of the key hashes)
      - so every `HashMap` with keys being/containing addresses will amplify ASLR and ASLR-like effects, making the entire program more sensitive
      - the default hasher is randomized, and while `rustc` doesn’t use it, proc macros can (and will), and it’s harder to disable than Linux ASLR
  - most ways of measuring time will inherently never perfectly align with exact points in the program’s execution, making time behave like another low-entropy source of randomness
    - this also means timers will elapse at unpredictable points (which can further impact the rest of the execution)
      - this includes the common thread scheduler technique of preempting the currently executing thread with a periodic timer interrupt, so the exact interleaving of multiple threads will likely not be reproducible without special OS configuration, or tools that emulate a deterministic scheduler
    - `jemalloc` (the allocator used by `rustc`, at least in official releases) has a 10 second “purge timer”, which can introduce an ASLR-like effect, unless disabled with `MALLOC_CONF=dirty_decay_ms:0,muzzy_decay_ms:0`
- hardware flaws (whether in the design or implementation)
  - hardware interrupts (“IRQs”) and exceptions (like page faults) cause overcounting (1 instruction per interrupt, possibly the `iret` from the kernel handler back to the interrupted userspace program); see the worked example after this list
    - this is the reason why `instructions-minus-irqs:u` should be preferred to `instructions:u`, where the former is available
    - there are system-wide options (e.g. `CONFIG_NO_HZ_FULL`) for removing some interrupts from the cores used for profiling, but they’re not as complete of a solution, nor easy to set up in the first place
  - AMD Zen CPUs have a speculative execution feature (dubbed `SpecLockMap`), which can cause non-deterministic overcounting for instructions following an atomic instruction (such as found in heap allocators, or `measureme`)
    - this is automatically detected, with a `log` message pointing the user to https://github.com/mozilla/rr/wiki/Zen for guidance on how to disable `SpecLockMap` on their system (sadly requires root access)
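To make the `HashMap` point above concrete, here is a small, self-contained sketch (not code from `rustc`, LLVM, or `measureme`): hashing an address means that even a fixed-state hasher’s output, and therefore the bucket an entry lands in, varies from run to run whenever ASLR or allocation order changes the address.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    // A heap allocation standing in for interned data; its address changes
    // between runs under ASLR (or with any change in allocation order).
    let interned = Box::new(0u64);
    let addr = &*interned as *const u64 as usize;

    // `DefaultHasher::new()` uses fixed keys, so the only run-to-run variation
    // here comes from the address itself.
    let mut hasher = DefaultHasher::new();
    addr.hash(&mut hasher);
    let hash = hasher.finish();

    // A hash table with 16 buckets would pick a bucket from the low bits of
    // the hash, so the table's layout (and the instructions spent probing it)
    // shifts whenever the address does.
    println!("address = {addr:#x}, bucket (of 16) = {}", hash & 0xf);
}
```

Multiplied across the many address-keyed maps in `rustc` and LLVM, this is how ASLR-like effects become visible in instruction counts.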
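As a worked example of the IRQ overcounting above (the numbers are made up, purely for illustration): if a profiled region retires 5,000,000 userspace instructions and 20 timer interrupts happen to land inside it, `instructions:u` would read roughly 5,000,020, with the extra ~20 counts varying from run to run with interrupt timing, while `instructions-minus-irqs:u` would subtract the 20 interrupts and report about 5,000,000 each time.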
Even if some of the above caveats apply to a given profiling setup, as long as the counters function, they can still be used and compared with `wall-time`.
Chances are, they will still have less variance, as everything that impacts
instruction counts will also impact any time measurements.
Also keep in mind that instruction counts do not properly reflect all kinds of workloads, e.g. SIMD throughput and cache locality are unaccounted for.
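To illustrate that with a hypothetical (unbenchmarked) cache-locality example: the two loops below retire roughly the same number of instructions, but the strided traversal touches memory in a cache-hostile order and would typically be much slower in wall-time.

```rust
fn main() {
    const N: usize = 1 << 22; // ~4 million elements (~32 MiB of u64s)
    let data = vec![1u64; N];

    // Sequential, cache-friendly traversal.
    let mut seq_sum = 0u64;
    for i in 0..N {
        seq_sum = seq_sum.wrapping_add(data[i]);
    }

    // Strided traversal: roughly the same instruction count per element,
    // far worse locality (each access lands 4 KiB away from the previous one).
    let stride = 512; // 512 * 8 bytes = 4 KiB
    let mut strided_sum = 0u64;
    for start in 0..stride {
        let mut i = start;
        while i < N {
            strided_sum = strided_sum.wrapping_add(data[i]);
            i += stride;
        }
    }

    assert_eq!(seq_sum, strided_sum);
    println!("both sums: {seq_sum}");
}
```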
§Structs
- `Instructions`: “Instructions retired” hardware performance counter (userspace-only).
- `InstructionsMinusIrqs`: More accurate `Instructions` (subtracting hardware interrupt counts).
- `InstructionsMinusRaw0420`: (Experimental) Like `InstructionsMinusIrqs` (but using an undocumented `r0420:u` counter).
- `WallTime`: “Monotonic clock” with nanosecond precision (using `std::time::Instant`).
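For reference, the `wall-time` counter is described above as being based on `std::time::Instant`; the same kind of monotonic, nanosecond-precision measurement can be sketched with the standard library alone (illustrative code, not the crate’s implementation):

```rust
use std::time::Instant;

fn main() {
    let start = Instant::now();

    // Stand-in for the workload being profiled.
    let mut sum = 0u64;
    for i in 0..1_000_000u64 {
        sum = sum.wrapping_add(i);
    }

    // `Instant` is a monotonic clock; `as_nanos` gives nanosecond precision,
    // matching the description of `WallTime` above.
    println!("wall-time: {} ns (sum = {sum})", start.elapsed().as_nanos());
}
```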