Profiling counters and their implementation.
§Available counters
| Name (for `Counter::by_name()`) | Counter | OSes | CPUs |
|---|---|---|---|
| `wall-time` | `WallTime` | any | any |
| `instructions:u` | `Instructions` | Linux | `x86_64` |
| `instructions-minus-irqs:u` | `InstructionsMinusIrqs` | Linux | `x86_64`: AMD (since K8), Intel (since Sandy Bridge) |
| `instructions-minus-r0420:u` | `InstructionsMinusRaw0420` | Linux | `x86_64`: AMD (Zen) |
Note: the `:u` suffixes for hardware performance counters come from the Linux `perf` tool, and indicate that the counter is only active while userspace code executes (i.e. it’s paused while the kernel handles syscalls, interrupts, etc.).
§Limitations and caveats
Note: for more information, also see the GitHub PR which first implemented hardware performance counter support (#143).
The hardware performance counters (i.e. all counters other than `wall-time`) are limited to:
- Linux, for out-of-the-box performance counter reads from userspace
  - other OSes could work through custom kernel extensions/drivers, in the future
- `x86_64` CPUs, mostly due to lack of other available test hardware
  - new architectures would be easier to support (on Linux) than new OSes
  - easiest to add would be 32-bit `x86` (aka `i686`), which would reuse most of the `x86_64` CPU model detection logic
- specific (newer) CPU models, for certain non-standard counters
  - e.g. `instructions-minus-irqs:u` requires a “hardware interrupts” (aka “IRQs”) counter, which is implemented differently between vendors / models (if at all)
- single-threaded programs (counters only work on the thread they were created on)
  - for profiling `rustc`, this means only “check mode” (`--emit=metadata`) is supported currently (`-Z no-llvm-threads` could also work)
  - unclear what the best approach for handling multiple threads would be
    - changing the API (e.g. to require per-thread profiler handles) could result in a more efficient implementation, but would also be less ergonomic
    - profiling data from multithreaded programs would be harder to use due to noise from synchronization mechanisms, non-deterministic work-stealing, etc.
For ergonomic reasons, the public API doesn’t vary based on features or target. Instead, attempting to create any unsupported counter will return `Err`, just like it does for any issue detected at runtime (e.g. an incompatible CPU model).
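As an illustration of that contract (a sketch, not an example taken from the crate: the `measureme::counters::Counter` module path, the `&str` argument, and the `Debug`-printable error type are all assumptions):

```rust
// Hypothetical sketch: module path, argument type, and error type are
// assumptions rather than guarantees made by this documentation.
use measureme::counters::Counter;

fn probe(name: &str) {
    match Counter::by_name(name) {
        // The counter exists and is usable on this OS/CPU combination.
        Ok(_counter) => println!("`{name}` is supported here"),
        // Unsupported targets and runtime problems (e.g. an incompatible
        // CPU model) are both reported as `Err`, not as compile errors.
        Err(reason) => eprintln!("`{name}` is unavailable: {reason:?}"),
    }
}
```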
When counting instructions specifically, these factors will impact the profiling quality:
- high-level non-determinism (e.g. user interactions, networking)
  - the ideal use-case is a mostly-deterministic program, e.g. a compiler like `rustc`
    - if I/O can be isolated to separate profiling events, and doesn’t impact execution in a more subtle way (see below), the deterministic parts of the program can still be profiled with high accuracy
    - intentional uses of randomness may change execution paths, though for cryptographic operations specifically, “constant time” implementations are preferred / necessary (in order to limit an external observer’s ability to infer secrets), so they’re not as much of a problem
    - even otherwise-deterministic machine-local communication (to e.g. system services or drivers) can behave unpredictably (especially under load)
      - while we haven’t observed this in the wild yet, it’s possible for file reads/writes to be split up into multiple smaller chunks (and therefore take more userspace instructions to fully read/write)
- low-level non-determinism (e.g. ASLR, randomized `HashMap`s, timers)
  - ASLR (“Address Space Layout Randomization”) may be provided by the OS for security reasons, or accidentally caused through allocations that depend on random data (even as low-entropy as e.g. the base 10 length of a process ID)
    - on Linux ASLR can be disabled by running the process under `setarch -R`
    - this impacts `rustc` and LLVM, which rely on keying `HashMap`s by addresses (typically of interned data) as an optimization, and while non-deterministic outputs are considered bugs, the instructions executed can still vary a lot, even when the externally observable behavior is perfectly repeatable
    - `HashMap`s are involved in more than one way (see the sketch after this list):
      - both the executed instructions, and the shape of the allocations, depend on both the hasher state and the choice of keys (as the buckets are in a flat array indexed by some of the lower bits of the key hashes)
      - so every `HashMap` with keys being/containing addresses will amplify ASLR and ASLR-like effects, making the entire program more sensitive
      - the default hasher is randomized, and while `rustc` doesn’t use it, proc macros can (and will), and it’s harder to disable than Linux ASLR
  - most ways of measuring time will inherently never perfectly align with exact points in the program’s execution, making time behave like another low-entropy source of randomness
    - this also means timers will elapse at unpredictable points (which can further impact the rest of the execution)
      - this includes the common thread scheduler technique of preempting the currently executing thread with a periodic timer interrupt, so the exact interleaving of multiple threads will likely not be reproducible without special OS configuration, or tools that emulate a deterministic scheduler
    - `jemalloc` (the allocator used by `rustc`, at least in official releases) has a 10 second “purge timer”, which can introduce an ASLR-like effect, unless disabled with `MALLOC_CONF=dirty_decay_ms:0,muzzy_decay_ms:0`
- hardware flaws (whether in the design or implementation)
  - hardware interrupts (“IRQs”) and exceptions (like page faults) cause overcounting (1 instruction per interrupt, possibly the `iret` from the kernel handler back to the interrupted userspace program); see the worked example after this list
    - this is the reason why `instructions-minus-irqs:u` should be preferred to `instructions:u`, where the former is available
    - there are system-wide options (e.g. `CONFIG_NO_HZ_FULL`) for removing some interrupts from the cores used for profiling, but they’re not as complete of a solution, nor easy to set up in the first place
  - AMD Zen CPUs have a speculative execution feature (dubbed `SpecLockMap`), which can cause non-deterministic overcounting for instructions following an atomic instruction (such as found in heap allocators, or `measureme`)
    - this is automatically detected, with a `log` message pointing the user to https://github.com/mozilla/rr/wiki/Zen for guidance on how to disable `SpecLockMap` on their system (sadly requires root access)
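To make the `HashMap` point above concrete, here is a small, self-contained sketch (not code from `rustc`, LLVM, or `measureme`): hashing an address means that even a fixed-state hasher’s output, and therefore the bucket an entry lands in, varies from run to run whenever ASLR or allocation order changes the address.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    // A heap allocation standing in for interned data; its address changes
    // between runs under ASLR (or with any change in allocation order).
    let interned = Box::new(0u64);
    let addr = &*interned as *const u64 as usize;

    // `DefaultHasher::new()` uses fixed keys, so the only run-to-run variation
    // here comes from the address itself.
    let mut hasher = DefaultHasher::new();
    addr.hash(&mut hasher);
    let hash = hasher.finish();

    // A hash table with 16 buckets would pick a bucket from the low bits of
    // the hash, so the table's layout (and the instructions spent probing it)
    // shifts whenever the address does.
    println!("address = {addr:#x}, bucket (of 16) = {}", hash & 0xf);
}
```

Multiplied across the many address-keyed maps in `rustc` and LLVM, this is how ASLR-like effects become visible in instruction counts.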
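As a worked example of the IRQ overcounting above (the numbers are made up, purely for illustration): if a profiled region retires 5,000,000 userspace instructions and 20 timer interrupts happen to land inside it, `instructions:u` would read roughly 5,000,020, with the extra ~20 counts varying from run to run with interrupt timing, while `instructions-minus-irqs:u` would subtract the 20 interrupts and report about 5,000,000 each time.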
Even if some of the above caveats apply to a given profiling setup, as long as the counters function, they can still be used and compared with `wall-time`.
Chances are, they will still have less variance, as everything that impacts
instruction counts will also impact any time measurements.
Also keep in mind that instruction counts do not properly reflect all kinds of workloads, e.g. SIMD throughput and cache locality are unaccounted for.
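To illustrate that with a hypothetical (unbenchmarked) cache-locality example: the two loops below retire roughly the same number of instructions, but the strided traversal touches memory in a cache-hostile order and would typically be much slower in wall-time.

```rust
fn main() {
    const N: usize = 1 << 22; // ~4 million elements (~32 MiB of u64s)
    let data = vec![1u64; N];

    // Sequential, cache-friendly traversal.
    let mut seq_sum = 0u64;
    for i in 0..N {
        seq_sum = seq_sum.wrapping_add(data[i]);
    }

    // Strided traversal: roughly the same instruction count per element,
    // far worse locality (each access lands 4 KiB away from the previous one).
    let stride = 512; // 512 * 8 bytes = 4 KiB
    let mut strided_sum = 0u64;
    for start in 0..stride {
        let mut i = start;
        while i < N {
            strided_sum = strided_sum.wrapping_add(data[i]);
            i += stride;
        }
    }

    assert_eq!(seq_sum, strided_sum);
    println!("both sums: {seq_sum}");
}
```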
§Structs
- `Instructions`: “Instructions retired” hardware performance counter (userspace-only).
- `InstructionsMinusIrqs`: More accurate `Instructions` (subtracting hardware interrupt counts).
- `InstructionsMinusRaw0420`: (Experimental) Like `InstructionsMinusIrqs` (but using an undocumented `r0420:u` counter).
- `WallTime`: “Monotonic clock” with nanosecond precision (using `std::time::Instant`).
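For reference, the `wall-time` counter is described above as being based on `std::time::Instant`; the same kind of monotonic, nanosecond-precision measurement can be sketched with the standard library alone (illustrative code, not the crate’s implementation):

```rust
use std::time::Instant;

fn main() {
    let start = Instant::now();

    // Stand-in for the workload being profiled.
    let mut sum = 0u64;
    for i in 0..1_000_000u64 {
        sum = sum.wrapping_add(i);
    }

    // `Instant` is a monotonic clock; `as_nanos` gives nanosecond precision,
    // matching the description of `WallTime` above.
    println!("wall-time: {} ns (sum = {sum})", start.elapsed().as_nanos());
}
```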