1+ //! Profiling counters and their implementation.
2+ //!
3+ //! # Available counters
4+ //!
5+ //! Name (for [`Counter::by_name()`]) | Counter | OSes | CPUs
6+ //! --------------------------------- | ------- | ---- | ----
7+ //! `wall-time` | [`WallTime`] | any | any
8+ //! `instructions:u` | [`Instructions`] | Linux | `x86_64`
9+ //! `instructions-minus-irqs:u` | [`InstructionsMinusIrqs`] | Linux | `x86_64`<br>- AMD (since K8)<br>- Intel (since Sandy Bridge)
10+ //! `instructions-minus-r0420:u` | [`InstructionsMinusRaw0420`] | Linux | `x86_64`<br>- AMD (Zen)
11+ //!
12+ //! *Note: `:u` suffixes for hardware performance counters come from the Linux `perf`
13+ //! tool, and indicate that the counter is only active while userspace code executes
14+ //! (i.e. it's paused while the kernel handles syscalls, interrupts, etc.).*
15+ //!
16+ //! # Limitations and caveats
17+ //!
18+ //! *Note: for more information, also see the GitHub PR which first implemented hardware
19+ //! performance counter support ([#143](https://github.com/rust-lang/measureme/pull/143)).*
20+ //!
21+ //! The hardware performance counters (i.e. all counters other than `wall-time`) are limited to:
22+ //! * nightly Rust (gated on `features = ["nightly"]`), for `asm!`
23+ //! * Linux, for out-of-the-box performance counter reads from userspace
24+ //!   * other OSes could work through custom kernel extensions/drivers, in the future
25+ //! * `x86_64` CPUs, mostly due to lack of other available test hardware
26+ //!   * new architectures would be easier to support (on Linux) than new OSes
27+ //!   * easiest to add would be 32-bit `x86` (aka `i686`), which would reuse
28+ //!     most of the `x86_64` CPU model detection logic
29+ //! * specific (newer) CPU models, for certain non-standard counters
30+ //!   * e.g. `instructions-minus-irqs:u` requires a "hardware interrupts" (aka "IRQs")
31+ //!     counter, which is implemented differently between vendors / models (if at all)
32+ //! * single-threaded programs (counters only work on the thread they were created on)
33+ //!   * for profiling `rustc`, this means only "check mode" (`--emit=metadata`)
34+ //!     is supported currently (`-Z no-llvm-threads` could also work)
35+ //!   * unclear what the best approach for handling multiple threads would be
36+ //!   * changing the API (e.g. to require per-thread profiler handles) could result
37+ //!     in a more efficient implementation, but would also be less ergonomic
38+ //!   * profiling data from multithreaded programs would be harder to use due to
39+ //!     noise from synchronization mechanisms, non-deterministic work-stealing, etc.
40+ //!
41+ //! For ergonomic reasons, the public API doesn't vary based on `features` or target.
42+ //! Instead, attempting to create any unsupported counter will return `Err`, just
43+ //! like it does for any issue detected at runtime (e.g. incompatible CPU model).
44+ //!
45+ //! When counting instructions specifically, these factors will impact the profiling quality:
46+ //! * high-level non-determinism (e.g. user interactions, networking)
47+ //!   * the ideal use-case is a mostly-deterministic program, e.g. a compiler like `rustc`
48+ //!   * if I/O can be isolated to separate profiling events, and doesn't impact
49+ //!     execution in a more subtle way (see below), the deterministic parts of
50+ //!     the program can still be profiled with high accuracy
51+ //! * low-level non-determinism (e.g. ASLR, randomized `HashMap`s, thread scheduling)
52+ //!   * ASLR ("Address Space Layout Randomization") may be provided by the OS for
53+ //!     security reasons, or accidentally caused through allocations that depend on
54+ //!     random data (even as low-entropy as e.g. the base 10 length of a process ID)
55+ //!     * on Linux ASLR can be disabled by running the process under `setarch -R`
56+ //!     * this impacts `rustc` and LLVM, which rely on keying `HashMap`s by addresses
57+ //!       (typically of interned data) as an optimization, and while non-deterministic
58+ //!       outputs are considered bugs, the instructions executed can still vary a lot,
59+ //!       even when the externally observable behavior is perfectly repeatable
60+ //!   * `HashMap`s are involved in more than one way:
61+ //!     * both the executed instructions, and the shape of the allocations depend
62+ //!       on both the hasher state and choice of keys (as the buckets are in
63+ //!       a flat array indexed by some of the lower bits of the key hashes)
64+ //!     * so every `HashMap` with keys being/containing addresses will amplify
65+ //!       ASLR and ASLR-like effects, making the entire program more sensitive
66+ //!     * the default hasher is randomized, and while `rustc` doesn't use it,
67+ //!       proc macros can (and will), and it's harder to disable than Linux ASLR
68+ //!   * `jemalloc` (the allocator used by `rustc`, at least in official releases)
69+ //!     has a 10 second "purge timer", which can introduce an ASLR-like effect,
70+ //!     unless disabled with `MALLOC_CONF=dirty_decay_ms:0,muzzy_decay_ms:0`
71+ //! * hardware flaws (whether in the design or implementation)
72+ //!   * hardware interrupts ("IRQs") and exceptions (like page faults) cause
73+ //!     overcounting (1 instruction per interrupt, possibly the `iret` from the
74+ //!     kernel handler back to the interrupted userspace program)
75+ //!     * this is the reason why `instructions-minus-irqs:u` should be preferred
76+ //!       to `instructions:u`, where the former is available
77+ //!     * there are system-wide options (e.g. `CONFIG_NO_HZ_FULL`) for removing
78+ //!       some interrupts from the cores used for profiling, but they're not as
79+ //!       complete a solution, nor easy to set up in the first place
80+ //!   * AMD Zen CPUs have a speculative execution feature (dubbed `SpecLockMap`),
81+ //!     which can cause non-deterministic overcounting for instructions following
82+ //!     an atomic instruction (such as found in heap allocators, or `measureme`)
83+ //!     * this is automatically detected, with a `log` message pointing the user
84+ //!       to <https://github.com/mozilla/rr/wiki/Zen> for guidance on how to
85+ //!       disable `SpecLockMap` on their system (sadly requires root access)
86+ //!
87+ //! Even if some of the above caveats apply for some profiling setup, as long as
88+ //! the counters function, they can still be used, and compared with `wall-time`.
89+ //! Chances are, they will still have less variance, as everything that impacts
90+ //! instruction counts will also impact any time measurements.
91+ //!
92+ //! Also keep in mind that instruction counts do not properly reflect all kinds
93+ //! of workloads, e.g. SIMD throughput and cache locality are unaccounted for.
94+
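The module docs above point out that a randomized `HashMap` hasher changes both the executed instructions and the allocation shape from run to run. As a minimal sketch of the idea (not what `rustc` itself does), the following uses only `std` to build a map whose hasher always starts from the same state, so the bucket layout and iteration order repeat. `DeterministicMap`, `build_map` and `iteration_order` are hypothetical names introduced for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

/// Hypothetical alias: a `HashMap` whose hasher is always created via
/// `DefaultHasher::default()`, i.e. without per-process random keys.
type DeterministicMap<K, V> = HashMap<K, V, BuildHasherDefault<DefaultHasher>>;

/// Hypothetical helper: inserts the same keys in the same order every time.
fn build_map() -> DeterministicMap<u32, &'static str> {
    let mut map = DeterministicMap::default();
    for (key, value) in [(3, "c"), (1, "a"), (2, "b"), (40, "x"), (17, "y")] {
        map.insert(key, value);
    }
    map
}

/// Collects the keys in the order the map iterates over them.
fn iteration_order(map: &DeterministicMap<u32, &'static str>) -> Vec<u32> {
    map.keys().copied().collect()
}

fn main() {
    let first = iteration_order(&build_map());
    let second = iteration_order(&build_map());
    // Same hasher state and same insertion sequence give the same layout.
    assert_eq!(first, second);
    println!("{:?}", first);
}
```

With the default `RandomState` hasher the two orders are allowed to differ, which is the kind of low-level non-determinism the docs describe. Keys that are themselves addresses would still vary under ASLR, so a fixed hasher only removes one source of noise.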
1 95 use std::error::Error;
2 96 use std::time::Instant;
3 97
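The docs in this hunk say that creating an unsupported counter returns `Err` rather than changing the public API per target. A caller could therefore fall back to `wall-time` at runtime. The sketch below is a self-contained mock of that pattern and does not use the crate itself: `MockCounter`, `mock_by_name` and `counter_or_wall_time` are hypothetical names, and the mock assumes hardware counters are never available.

```rust
use std::error::Error;
use std::time::Instant;

/// Hypothetical stand-in for the crate's `Counter`; only `wall-time` is modelled.
#[derive(Debug)]
enum MockCounter {
    WallTime(Instant),
}

/// Hypothetical analogue of `Counter::by_name()`: unsupported or unknown
/// names yield `Err` instead of panicking or failing to compile.
fn mock_by_name(name: &str) -> Result<MockCounter, Box<dyn Error + Send + Sync>> {
    match name {
        "wall-time" => Ok(MockCounter::WallTime(Instant::now())),
        // Assumption: in this mock, hardware counters are never available.
        "instructions:u" | "instructions-minus-irqs:u" | "instructions-minus-r0420:u" => {
            Err(format!("counter `{}` is not supported in this sketch", name).into())
        }
        _ => Err(format!("unknown counter `{}`", name).into()),
    }
}

/// Tries the preferred counter first, then falls back to `wall-time`.
fn counter_or_wall_time(preferred: &str) -> MockCounter {
    mock_by_name(preferred).unwrap_or_else(|err| {
        eprintln!("falling back to wall-time: {}", err);
        mock_by_name("wall-time").expect("wall-time is always available")
    })
}

fn main() {
    let counter = counter_or_wall_time("instructions-minus-irqs:u");
    println!("{:?}", counter);
}
```

Handling the error at the call site keeps the choice of fallback with the user of the API, which matches the "same API everywhere, `Err` when unsupported" design described above.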
@@ -60,6 +154,9 @@ impl Counter {
60 154     }
61 155 }
62 156
157+ /// "Monotonic clock" with nanosecond precision (using [`std::time::Instant`]).
158+ ///
159+ /// Can be obtained with `Counter::by_name("wall-time")`.
63 160 pub struct WallTime {
64 161     start: Instant,
65 162 }
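The `impl WallTime` body is elided by the diff, so the following is only a guess at how a "monotonic clock with nanosecond precision" could be read from an `Instant`. `WallTimeSketch` and its methods are hypothetical and are not taken from this PR.

```rust
use std::time::Instant;

/// Hypothetical re-creation of the struct shown in the hunk above.
pub struct WallTimeSketch {
    start: Instant,
}

impl WallTimeSketch {
    /// Records the moment the counter was created.
    pub fn new() -> Self {
        WallTimeSketch { start: Instant::now() }
    }

    /// Nanoseconds elapsed since `new()`, truncated from `u128` to `u64`
    /// (an assumption; `u64` nanoseconds cover several centuries).
    pub fn since_start(&self) -> u64 {
        self.start.elapsed().as_nanos() as u64
    }
}

fn main() {
    let clock = WallTimeSketch::new();
    let earlier = clock.since_start();
    let later = clock.since_start();
    // `Instant` is monotonic, so a later reading is never smaller.
    assert!(later >= earlier);
    println!("{} ns, then {} ns", earlier, later);
}
```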
@@ -79,6 +176,9 @@ impl WallTime {
79 176     }
80 177 }
81 178
179+ /// "Instructions retired" hardware performance counter (userspace-only).
180+ ///
181+ /// Can be obtained with `Counter::by_name("instructions:u")`.
82 182 pub struct Instructions {
83 183     instructions: hw::Counter,
84 184     start: u64,
@@ -103,6 +203,9 @@ impl Instructions {
103 203     }
104 204 }
105 205
206+ /// More accurate [`Instructions`] (subtracting hardware interrupt counts).
207+ ///
208+ /// Can be obtained with `Counter::by_name("instructions-minus-irqs:u")`.
106 209 pub struct InstructionsMinusIrqs {
107 210     instructions: hw::Counter,
108 211     irqs: hw::Counter,
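The arithmetic behind `instructions-minus-irqs:u` can be illustrated without any hardware access. The sketch below assumes, as the module docs state, that each interrupt overcounts by one instruction, and uses made-up readings. `Readings`, `instructions_minus_irqs` and `delta` are hypothetical names, and the use of `wrapping_sub` is an assumption about how counter wraparound might be tolerated, not something shown in this diff.

```rust
/// Hypothetical pair of raw counter readings taken at the same moment.
#[derive(Clone, Copy)]
struct Readings {
    instructions: u64,
    irqs: u64,
}

/// Corrected count: one instruction is subtracted per observed interrupt.
fn instructions_minus_irqs(readings: Readings) -> u64 {
    readings.instructions.wrapping_sub(readings.irqs)
}

/// Corrected instructions executed between two readings.
fn delta(start: Readings, end: Readings) -> u64 {
    instructions_minus_irqs(end).wrapping_sub(instructions_minus_irqs(start))
}

fn main() {
    // Made-up numbers: 500 raw instructions and 2 interrupts in the interval.
    let start = Readings { instructions: 1_000, irqs: 3 };
    let end = Readings { instructions: 1_500, irqs: 5 };
    // (1500 - 5) - (1000 - 3) = 498
    assert_eq!(delta(start, end), 498);
    println!("corrected instructions: {}", delta(start, end));
}
```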
@@ -132,6 +235,10 @@ impl InstructionsMinusIrqs {
132 235     }
133 236 }
134 237
238+ /// (Experimental) Like [`InstructionsMinusIrqs`] (but using an undocumented `r0420:u` counter).
239+ ///
240+ /// Can be obtained with `Counter::by_name("instructions-minus-r0420:u")`.
241+ //
135 242 // HACK(eddyb) this is a variant of `instructions-minus-irqs:u`, where `r0420`
136 243 // is subtracted, instead of the usual "hardware interrupts" (aka IRQs).
137 244 // `r0420` is an undocumented counter on AMD Zen CPUs which appears to count